| Yann Collet | 2fa9904 | 2016-07-01 20:55:28 +0200 | [diff] [blame] | 1 | Zstandard Compression Format Description |
| 2 | ======================================== |
| 3 | |
| 4 | ### Notices |
| 5 | |
| 6 | Copyright (c) 2016 Yann Collet |
| 7 | |
| 8 | Permission is granted to copy and distribute this document |
| 9 | for any purpose and without charge, |
| 10 | including translations into other languages |
| 11 | and incorporation into compilations, |
| 12 | provided that the copyright notice and this notice are preserved, |
| 13 | and that any substantive changes or deletions from the original |
| 14 | are clearly marked. |
| 15 | Distribution of this document is unlimited. |
| 16 | |
| 17 | ### Version |
| 18 | |
| 19 | 0.1.0 (30/06/2016 - unfinished) |
| 20 | |
| 21 | |
| 22 | Introduction |
| 23 | ------------ |
| 24 | |
| 25 | The purpose of this document is to define a lossless compressed data format, |
| 26 | that is independent of CPU type, operating system, |
| 27 | file system and character set, suitable for |
| 28 | File compression, Pipe and streaming compression |
| 29 | using the [Zstandard algorithm](http://www.zstandard.org). |
| 30 | |
| 31 | The data can be produced or consumed, |
| 32 | even for an arbitrarily long sequentially presented input data stream, |
| 33 | using only an a priori bounded amount of intermediate storage, |
| 34 | and hence can be used in data communications. |
| 35 | The format uses the Zstandard compression method, |
| 36 | and optional [xxHash-64 checksum method](http://www.xxhash.org), |
| 37 | for detection of data corruption. |
| 38 | |
| 39 | The data format defined by this specification |
| 40 | does not attempt to allow random access to compressed data. |
| 41 | |
| 42 | This specification is intended for use by implementers of software |
| 43 | to compress data into Zstandard format and/or decompress data from Zstandard format. |
| 44 | The text of the specification assumes a basic background in programming |
| 45 | at the level of bits and other primitive data representations. |
| 46 | |
| 47 | Unless otherwise indicated below, |
| 48 | a compliant compressor must produce data sets |
| 49 | that conform to the specifications presented here. |
| 50 | It doesn’t need to support all options though. |
| 51 | |
| 52 | A compliant decompressor must be able to decompress |
| 53 | at least one working set of parameters |
| 54 | that conforms to the specifications presented here. |
| 55 | It may also ignore informative fields, such as checksum. |
| 56 | Whenever it does not support a parameter defined in the compressed stream, |
| 57 | it must produce a non-ambiguous error code and associated error message |
| 58 | explaining which parameter is unsupported. |
| 59 | |
| 60 | |
| 61 | Definitions |
| 62 | ----------- |
| 63 | A content compressed by Zstandard is transformed into a Zstandard __frame__. |
| 64 | Multiple frames can be appended into a single file or stream. |
| 65 | A frame is totally independent, has a defined beginning and end, |
| 66 | and a set of parameters which tells the decoder how to decompress it. |
| 67 | |
| 68 | A frame encapsulates one or multiple __blocks__. |
| 69 | Each block can be compressed or not, |
| 70 | and has a guaranteed maximum content size, which depends on frame parameters. |
| 71 | Unlike frames, each block depends on previous blocks for proper decoding. |
| 72 | However, each block can be decompressed without waiting for its successor, |
| 73 | allowing streaming operations. |
| 74 | |
| 75 | |
| 76 | General Structure of Zstandard Frame format |
| 77 | ------------------------------------------- |
| 78 | |
| 79 | | MagicNb | F. Header | Block | (More blocks) | EndMark | |
| 80 | |:-------:|:----------:| ----- | ------------- | ------- | |
| 81 | | 4 bytes | 2-14 bytes | | | 3 bytes | |
| 82 | |
| 83 | __Magic Number__ |
| 84 | |
| 85 | 4 Bytes, Little endian format. |
| 86 | Value : 0xFD2FB527 |
| 87 | |
| 88 | __Frame Header__ |
| 89 | |
| 90 | 2 to 14 Bytes, to be detailed in the next part. |
| 91 | |
| 92 | __Data Blocks__ |
| 93 | |
| 94 | To be detailed later on. |
| 95 | That’s where compressed data is stored. |
| 96 | |
| 97 | __EndMark__ |
| 98 | |
| 99 | The flow of blocks ends when the last block header brings an _end signal_ . |
| 100 | This last block header may optionally host a __Content Checksum__ . |
| 101 | |
| 102 | __Content Checksum__ |
| 103 | |
| 104 | Content Checksum verify that frame content has been regenrated correctly. |
| 105 | The content checksum is the result |
| 106 | of [xxh64() hash function](https://www.xxHash.com) |
| 107 | digesting the original (decoded) data as input, and a seed of zero. |
| 108 | Bits from 11 to 32 (included) are extracted to form a 22 bits checksum |
| 109 | stored into the last block header. |
| 110 | ``` |
| 111 | contentChecksum = (XXH64(content, size, 0) >> 11) & (1<<22)-1); |
| 112 | ``` |
| 113 | Content checksum is only present when its associated flag |
| 114 | is set in the frame descriptor. |
| 115 | Its usage is optional. |
| 116 | |
| 117 | __Frame Concatenation__ |
| 118 | |
| 119 | In some circumstances, it may be required to append multiple frames, |
| 120 | for example in order to add new data to an existing compressed file |
| 121 | without re-framing it. |
| 122 | |
| 123 | In such case, each frame brings its own set of descriptor flags. |
| 124 | Each frame is considered independent. |
| 125 | The only relation between frames is their sequential order. |
| 126 | |
| 127 | The ability to decode multiple concatenated frames |
| 128 | within a single stream or file is left outside of this specification. |
| 129 | As an example, the reference `zstd` command line utility is able |
| 130 | to decode all concatenated frames in their sequential order, |
| 131 | delivering the final decompressed result as if it was a single content. |
| 132 | |
| 133 | |
| 134 | Frame Header |
| 135 | ------------- |
| 136 | |
| 137 | | FHD | (WD) | (Content Size) | (dictID) | |
| 138 | | ------- | --------- |:--------------:| --------- | |
| 139 | | 1 byte | 0-1 byte | 0 - 8 bytes | 0-4 bytes | |
| 140 | |
| 141 | Frame header has a variable size, which uses a minimum of 2 bytes, |
| 142 | and up to 14 bytes depending on optional parameters. |
| 143 | |
| 144 | __FHD byte__ (Frame Header Descriptor) |
| 145 | |
| 146 | The first Header's byte is called the Frame Header Descriptor. |
| 147 | It tells which other fields are present. |
| 148 | Decoding this byte is enough to get the full size of the Frame Header. |
| 149 | |
| 150 | | BitNb | 7-6 | 5 | 4 | 3 | 2 | 1-0 | |
| 151 | | ------- | ------ | ------- | ------ | -------- | -------- | -------- | |
| 152 | |FieldName| FCSize | Segment | Unused | Reserved | Checksum | dictID | |
| 153 | |
| 154 | In this table, bit 7 is highest bit, while bit 0 is lowest. |
| 155 | |
| 156 | __Frame Content Size flag__ |
| 157 | |
| 158 | This is a 2-bits flag (`= FHD >> 6`), |
| 159 | specifying if decompressed data size is provided within the header. |
| 160 | |
| 161 | | Value | 0 | 1 | 2 | 3 | |
| 162 | | ------- | --- | --- | --- | --- | |
| 163 | |FieldSize| 0-1 | 2 | 4 | 8 | |
| 164 | |
| 165 | Value 0 has a double meaning : |
| 166 | it either means `0` (size not provided) _if_ the `WD` byte is present, |
| 167 | or it means `1` byte (size <= 255 bytes). |
| 168 | |
| 169 | __Single Segment__ |
| 170 | |
| 171 | If this flag is set, |
| 172 | data shall be regenerated within a single continuous memory segment. |
| 173 | In which case, `WD` byte __is not present__, |
| 174 | but `Frame Content Size` field necessarily is. |
| 175 | |
| 176 | As a consequence, the decoder must allocate a memory segment |
| 177 | of size `>= Frame Content Size`. |
| 178 | |
| 179 | In order to preserve the decoder from unreasonable memory requirement, |
| 180 | a decoder can refuse a compressed frame |
| 181 | which requests a memory size beyond decoder's authorized range. |
| 182 | |
| 183 | For broader compatibility, decoders are recommended to support |
| 184 | memory sizes of 8 MB at least. |
| 185 | However, this is merely a recommendation, |
| 186 | and each decoder is free to support higher or lower limits, |
| 187 | depending on local limitations. |
| 188 | |
| 189 | __Unused bit__ |
| 190 | |
| 191 | The value of this bit is unimportant |
| 192 | and not interpreted by a decoder compliant with this specification version. |
| 193 | It may be used in a future revision, |
| 194 | to signal a property which is not required to properly decode the frame. |
| 195 | |
| 196 | __Reserved bit__ |
| 197 | |
| 198 | This bit is reserved for some future feature. |
| 199 | Its value _must be zero_. |
| 200 | A decoder compliant with this specification version must ensure it is not set. |
| 201 | This bit may be used in a future revision, |
| 202 | to signal a feature that must be interpreted in order to decode the frame. |
| 203 | |
| 204 | __Content checksum flag__ |
| 205 | |
| 206 | If this flag is set, a content checksum will be present into the EndMark. |
| 207 | The checksum is a 22 bits value extracted from the XXH64() of data. |
| 208 | See __Content Checksum__ . |
| 209 | |
| 210 | __Dictionary ID flag__ |
| 211 | |
| 212 | This is a 2-bits flag (`= FHD & 3`), |
| 213 | telling if a dictionary ID is provided within the header |
| 214 | |
| 215 | | Value | 0 | 1 | 2 | 3 | |
| 216 | | ------- | --- | --- | --- | --- | |
| 217 | |FieldSize| 0 | 1 | 2 | 4 | |
| 218 | |
| 219 | __WD byte__ (Window Descriptor) |
| 220 | |
| 221 | Provides guarantees on maximum back-reference distance |
| 222 | that will be present within compressed data. |
| 223 | This information is useful for decoders to allocate enough memory. |
| 224 | |
| 225 | | BitNb | 7-3 | 0-2 | |
| 226 | | --------- | -------- | -------- | |
| 227 | | FieldName | Exponent | Mantissa | |
| 228 | |
| 229 | Maximum distance is given by the following formulae : |
| 230 | ``` |
| 231 | windowLog = 10 + Exponent; |
| 232 | windowBase = 1 << windowLog; |
| 233 | windowAdd = (windowBase / 8) * Mantissa; |
| 234 | windowSize = windowBase + windowAdd; |
| 235 | ``` |
| 236 | The minimum window size is 1 KB. |
| 237 | The maximum size is (15*(2^38))-1 bytes, which is almost 1.875 TB. |
| 238 | |
| 239 | To properly decode compressed data, |
| 240 | a decoder will need to allocate a buffer of at least `windowSize` bytes. |
| 241 | |
| 242 | Note that `WD` byte is optional. It's not present in `single segment` mode. |
| 243 | In which case, the maximum back-reference distance is the content size itself, |
| 244 | which can be any value from 1 to 2^64-1 bytes (16 EB). |
| 245 | |
| 246 | In order to preserve decoder from unreasonable memory requirements, |
| 247 | a decoder can refuse a compressed frame |
| 248 | which requests a memory size beyond decoder's authorized range. |
| 249 | |
| 250 | For better interoperability, |
| 251 | decoders are recommended to be compatible with window sizes of 8 MB. |
| 252 | Encoders are recommended to not request more than 8 MB. |
| 253 | It's merely a recommendation though, |
| 254 | decoders are free to support larger or lower limits, |
| 255 | depending on local limitations. |
| 256 | |
| 257 | __Frame Content Size__ |
| 258 | |
| 259 | This is the original (uncompressed) size. |
| 260 | This information is optional, and only present if associated flag is set. |
| 261 | Content size is provided using 1, 2, 4 or 8 Bytes. |
| 262 | Format is Little endian. |
| 263 | |
| 264 | | Field Size | Range | |
| 265 | | ---------- | ---------- | |
| 266 | | 0 | 0 | |
| 267 | | 1 | 0 - 255 | |
| 268 | | 2 | 256 - 65791| |
| 269 | | 4 | 0 - 2^32-1 | |
| 270 | | 8 | 0 - 2^64-1 | |
| 271 | |
| 272 | When field size is 1, 4 or 8 bytes, the value is read directly. |
| 273 | When field size is 2, _an offset of 256 is added_. |
| 274 | It's allowed to represent a small size (ex: `18`) using the 8-bytes variant. |
| 275 | A size of `0` means `content size is unknown`. |
| 276 | In which case, the `WD` byte will necessarily be present, |
| 277 | and becomes the only hint to determine memory allocation. |
| 278 | |
| 279 | In order to preserve decoder from unreasonable memory requirement, |
| 280 | a decoder can refuse a compressed frame |
| 281 | which requests a memory size beyond decoder's authorized range. |
| 282 | |
| 283 | __Dictionary ID__ |
| 284 | |
| 285 | This is a variable size field, which contains a single ID. |
| 286 | It checks if the correct dictionary is used for decoding. |
| 287 | Note that this field is optional. If it's not present, |
| 288 | it's up to the caller to make sure it uses the correct dictionary. |
| 289 | |
| 290 | Field size depends on __Dictionary ID flag__. |
| 291 | 1 byte can represent an ID 0-255. |
| 292 | 2 bytes can represent an ID 0-65535. |
| 293 | 4 bytes can represent an ID 0-(2^32-1). |
| 294 | |
| 295 | It's allowed to represent a small ID (for example `13`) |
| 296 | with a large 4-bytes dictionary ID, losing some efficiency in the process. |
| 297 | |
| 298 | |
| 299 | Data Blocks |
| 300 | ----------- |
| 301 | |
| 302 | | B. Header | data | |
| 303 | |:---------:| ------ | |
| 304 | | 3 bytes | | |
| 305 | |
| 306 | |
| 307 | __Block Header__ |
| 308 | |
| 309 | This field uses 3-bytes, format is __big-endian__. |
| 310 | |
| 311 | The 2 highest bits represent the `block type`, |
| 312 | while the remaining 22 bits represent the (compressed) block size. |
| 313 | |
| 314 | There are 4 block types : |
| 315 | |
| 316 | | Value | 0 | 1 | 2 | 3 | |
| 317 | | ---------- | ---------- | --- | --- | ------- | |
| 318 | | Block Type | Compressed | Raw | RLE | EndMark | |
| 319 | |
| 320 | - Compressed : this is a Zstandard compressed block, |
| 321 | detailed in a later part of this specification. |
| 322 | "block size" is the compressed size. |
| 323 | Decompressed size is unknown, |
| 324 | but its maximum possible value is guaranteed (see below) |
| 325 | - Raw : this is an uncompressed block. |
| 326 | "block size" is the number of bytes to read and copy. |
| 327 | - RLE : this is a single byte, repeated N times. |
| 328 | In which case, "block size" is the size to regenerate, |
| 329 | while the "compressed" block is just 1 byte (the byte to repeat). |
| 330 | - EndMark : this is not a block. Signal the end of the frame. |
| 331 | The rest of the field may be optionally filled by a checksum |
| 332 | (see frame checksum). |
| 333 | |
| 334 | Block sizes must respect a few rules : |
| 335 | - In compressed mode, compressed size if always strictly `< contentSize`. |
| 336 | - Block decompressed size is necessarily <= maximum back-reference distance . |
| 337 | - Block decompressed size is necessarily <= 128 KB |
| 338 | |
| 339 | |
| 340 | __Data__ |
| 341 | |
| 342 | Where the actual data to decode stands. |
| 343 | It might be compressed or not, depending on previous field indications. |
| 344 | A data block is not necessarily "full" : |
| 345 | since an arbitrary “flush” may happen anytime, |
| 346 | block content can be any size, up to Block Maximum Size. |
| 347 | Block Maximum Size is the smallest of : |
| 348 | - Max back-reference distance |
| 349 | - 128 KB |
| 350 | |
| 351 | |
| 352 | Skippable Frames |
| 353 | ---------------- |
| 354 | |
| 355 | | Magic Number | Frame Size | User Data | |
| 356 | |:------------:|:----------:| --------- | |
| 357 | | 4 bytes | 4 bytes | | |
| 358 | |
| 359 | Skippable frames allow the insertion of user-defined data |
| 360 | into a flow of concatenated frames. |
| 361 | Its design is pretty straightforward, |
| 362 | with the sole objective to allow the decoder to quickly skip |
| 363 | over user-defined data and continue decoding. |
| 364 | |
| 365 | Skippable frames defined in this specification are compatible with LZ4 ones. |
| 366 | |
| 367 | |
| 368 | __Magic Number__ : |
| 369 | |
| 370 | 4 Bytes, Little endian format. |
| 371 | Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. |
| 372 | All 16 values are valid to identify a skippable frame. |
| 373 | |
| 374 | __Frame Size__ : |
| 375 | |
| 376 | This is the size, in bytes, of the following User Data |
| 377 | (without including the magic number nor the size field itself). |
| 378 | 4 Bytes, Little endian format, unsigned 32-bits. |
| 379 | This means User Data can’t be bigger than (2^32-1) Bytes. |
| 380 | |
| 381 | __User Data__ : |
| 382 | |
| 383 | User Data can be anything. Data will just be skipped by the decoder. |
| 384 | |
| 385 | |
| 386 | Compressed block format |
| 387 | ----------------------- |
| 388 | This specification details the content of a _compressed block_. |
| 389 | A compressed block has a size, which must be known in order to decode it. |
| 390 | It also has a guaranteed maximum regenerated size, |
| 391 | in order to properly allocate destination buffer. |
| 392 | See "Frame format" for more details. |
| 393 | |
| 394 | A compressed block consists of 2 sections : |
| 395 | - Literals section |
| 396 | - Sequences section |
| 397 | |
| 398 | ### Compressed Literals |
| 399 | |
| 400 | Literals are compressed using order-0 huffman compression. |
| 401 | During sequence phase, literals will be entangled with match copy operations. |
| 402 | All literals are regrouped in the first part of the block. |
| 403 | They can be decoded first, and then copied during sequence operations, |
| 404 | or they can be decoded on the flow, as needed by sequences. |
| 405 | |
| 406 | | Header | (Tree Description) | Stream1 | (Stream2) | (Stream3) | (Stream4) | |
| 407 | | ------ | ------------------ | ------- | --------- | --------- | --------- | |
| 408 | |
| 409 | Literals can be compressed, or uncompressed. |
| 410 | When compressed, an optional tree description can be present, |
| 411 | followed by 1 or 4 streams. |
| 412 | |
| 413 | #### Block Literal Header |
| 414 | |
| 415 | Header is in charge of describing precisely how literals are packed. |
| 416 | It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes, |
| 417 | using big-endian convention. |
| 418 | |
| 419 | | BlockType | sizes format | (compressed size) | regenerated size | |
| 420 | | --------- | ------------ | ----------------- | ---------------- | |
| 421 | | 2 bits | 1 - 2 bits | 0 - 18 bits | 5 - 20 bits | |
| 422 | |
| 423 | __Block Type__ : |
| 424 | |
| 425 | This is a 2-bits field, describing 4 different block types : |
| 426 | |
| 427 | | Value | 0 | 1 | 2 | 3 | |
| 428 | | ---------- | ---------- | ------ | --- | ------- | |
| 429 | | Block Type | Compressed | Repeat | Raw | RLE | |
| 430 | |
| 431 | - Compressed : This is a standard huffman-compressed block, |
| 432 | starting with a huffman tree description. |
| 433 | See details below. |
| 434 | - Repeat Stats : This is a huffman-compressed block, |
| 435 | using huffman tree from previous huffman-compressed block. |
| 436 | Huffman tree description will be skipped. |
| 437 | Compressed stream is equivalent to "compressed" block type. |
| 438 | - Raw : Literals are stored uncompressed. |
| 439 | - RLE : Literals consist of a single byte value repeated N times. |
| 440 | |
| 441 | __Sizes format__ : |
| 442 | |
| 443 | Sizes format are divided into 2 families : |
| 444 | |
| 445 | - For compressed block, it requires to decode both the compressed size |
| 446 | and the decompressed size. It will also decode the number of streams. |
| 447 | - For Raw or RLE blocks, it's enough to decode the size to regenerate. |
| 448 | |
| 449 | For values spanning several bytes, convention is Big-endian. |
| 450 | |
| 451 | __Sizes format for Raw or RLE block__ : |
| 452 | |
| 453 | - Value : 0x : Regenerated size uses 5 bits (0-31). |
| 454 | Total literal header size is 1 byte. |
| 455 | `size = h[0] & 31;` |
| 456 | - Value : 10 : Regenerated size uses 12 bits (0-4095). |
| 457 | Total literal header size is 2 bytes. |
| 458 | `size = ((h[0] & 15) << 8) + h[1];` |
| 459 | - Value : 11 : Regenerated size uses 20 bits (0-1048575). |
| 460 | Total literal header size is 2 bytes. |
| 461 | `size = ((h[0] & 15) << 16) + (h[1]<<8) + h[2];` |
| 462 | |
| 463 | Note : it's allowed to represent a short value (ex : `13`) |
| 464 | using a long format, accepting the reduced compacity. |
| 465 | |
| 466 | __Sizes format for Compressed Block__ : |
| 467 | |
| 468 | Note : also applicable to "repeat-stats" blocks. |
| 469 | - Value : 00 : 4 streams |
| 470 | Compressed and regenerated sizes use 10 bits (0-1023) |
| 471 | Total literal header size is 3 bytes |
| 472 | - Value : 01 : _Single stream_ |
| 473 | Compressed and regenerated sizes use 10 bits (0-1023) |
| 474 | Total literal header size is 3 bytes |
| 475 | - Value : 10 : 4 streams |
| 476 | Compressed and regenerated sizes use 14 bits (0-16383) |
| 477 | Total literal header size is 4 bytes |
| 478 | - Value : 10 : 4 streams |
| 479 | Compressed and regenerated sizes use 18 bits (0-262143) |
| 480 | Total literal header size is 5 bytes |
| 481 | |
| 482 | |
| 483 | |
| 484 | Version changes |
| 485 | --------------- |
| 486 | 0.1 : initial release |