blob: 13c4ace18d5cd3c73552d5dc605fa15c5cf47479 [file] [log] [blame] [view]
Yann Collet5cc18822016-07-03 19:03:13 +02001Zstandard Compression Format
2============================
Yann Collet2fa99042016-07-01 20:55:28 +02003
4### Notices
5
6Copyright (c) 2016 Yann Collet
7
8Permission is granted to copy and distribute this document
9for any purpose and without charge,
10including translations into other languages
11and incorporation into compilations,
12provided that the copyright notice and this notice are preserved,
13and that any substantive changes or deletions from the original
14are clearly marked.
15Distribution of this document is unlimited.
16
17### Version
18
Yann Colletf6ff53c2016-07-15 17:03:38 +0200190.1.1 (15/07/16)
Yann Collet2fa99042016-07-01 20:55:28 +020020
21
22Introduction
23------------
24
25The purpose of this document is to define a lossless compressed data format,
26that is independent of CPU type, operating system,
27file system and character set, suitable for
Yann Collet9ca73362016-07-05 10:53:38 +020028file compression, pipe and streaming compression,
Yann Collet2fa99042016-07-01 20:55:28 +020029using the [Zstandard algorithm](http://www.zstandard.org).
30
31The data can be produced or consumed,
32even for an arbitrarily long sequentially presented input data stream,
33using only an a priori bounded amount of intermediate storage,
34and hence can be used in data communications.
35The format uses the Zstandard compression method,
36and optional [xxHash-64 checksum method](http://www.xxhash.org),
37for detection of data corruption.
38
39The data format defined by this specification
40does not attempt to allow random access to compressed data.
41
42This specification is intended for use by implementers of software
43to compress data into Zstandard format and/or decompress data from Zstandard format.
44The text of the specification assumes a basic background in programming
45at the level of bits and other primitive data representations.
46
47Unless otherwise indicated below,
48a compliant compressor must produce data sets
49that conform to the specifications presented here.
50It doesn’t need to support all options though.
51
52A compliant decompressor must be able to decompress
53at least one working set of parameters
54that conforms to the specifications presented here.
55It may also ignore informative fields, such as checksum.
56Whenever it does not support a parameter defined in the compressed stream,
57it must produce a non-ambiguous error code and associated error message
58explaining which parameter is unsupported.
59
60
61Definitions
62-----------
63A content compressed by Zstandard is transformed into a Zstandard __frame__.
64Multiple frames can be appended into a single file or stream.
65A frame is totally independent, has a defined beginning and end,
66and a set of parameters which tells the decoder how to decompress it.
67
68A frame encapsulates one or multiple __blocks__.
69Each block can be compressed or not,
70and has a guaranteed maximum content size, which depends on frame parameters.
71Unlike frames, each block depends on previous blocks for proper decoding.
72However, each block can be decompressed without waiting for its successor,
73allowing streaming operations.
74
75
76General Structure of Zstandard Frame format
77-------------------------------------------
78
Yann Colletcd25a912016-07-05 11:50:37 +020079| MagicNb | Frame Header | Block | (More blocks) | EndMark |
80|:-------:|:-------------:| ----- | ------------- | ------- |
81| 4 bytes | 2-14 bytes | | | 3 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +020082
83__Magic Number__
84
854 Bytes, Little endian format.
86Value : 0xFD2FB527
87
88__Frame Header__
89
Yann Colletcd25a912016-07-05 11:50:37 +0200902 to 14 Bytes, detailed in [next part](#frame-header).
Yann Collet2fa99042016-07-01 20:55:28 +020091
92__Data Blocks__
93
Yann Colletcd25a912016-07-05 11:50:37 +020094Detailed in [next chapter](#data-blocks).
Yann Collet2fa99042016-07-01 20:55:28 +020095That’s where compressed data is stored.
96
97__EndMark__
98
99The flow of blocks ends when the last block header brings an _end signal_ .
100This last block header may optionally host a __Content Checksum__ .
101
Yann Collet9ca73362016-07-05 10:53:38 +0200102##### __Content Checksum__
Yann Collet2fa99042016-07-01 20:55:28 +0200103
Yann Colletcd25a912016-07-05 11:50:37 +0200104Content Checksum verify that frame content has been regenerated correctly.
Yann Collet2fa99042016-07-01 20:55:28 +0200105The content checksum is the result
106of [xxh64() hash function](https://www.xxHash.com)
107digesting the original (decoded) data as input, and a seed of zero.
108Bits from 11 to 32 (included) are extracted to form a 22 bits checksum
Yann Colletcd25a912016-07-05 11:50:37 +0200109stored into the endmark body.
Yann Collet2fa99042016-07-01 20:55:28 +0200110```
Yann Collet9ca73362016-07-05 10:53:38 +0200111mask22bits = (1<<22)-1;
112contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
Yann Collet2fa99042016-07-01 20:55:28 +0200113```
114Content checksum is only present when its associated flag
115is set in the frame descriptor.
116Its usage is optional.
117
118__Frame Concatenation__
119
120In some circumstances, it may be required to append multiple frames,
121for example in order to add new data to an existing compressed file
122without re-framing it.
123
124In such case, each frame brings its own set of descriptor flags.
125Each frame is considered independent.
126The only relation between frames is their sequential order.
127
128The ability to decode multiple concatenated frames
129within a single stream or file is left outside of this specification.
130As an example, the reference `zstd` command line utility is able
131to decode all concatenated frames in their sequential order,
132delivering the final decompressed result as if it was a single content.
133
134
135Frame Header
136-------------
137
Yann Collet23f05cc2016-07-04 16:13:11 +0200138| FHD | (WD) | (dictID) | (Content Size) |
139| ------- | --------- | --------- |:--------------:|
140| 1 byte | 0-1 byte | 0-4 bytes | 0 - 8 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200141
142Frame header has a variable size, which uses a minimum of 2 bytes,
143and up to 14 bytes depending on optional parameters.
144
145__FHD byte__ (Frame Header Descriptor)
146
147The first Header's byte is called the Frame Header Descriptor.
148It tells which other fields are present.
Yann Collet23f05cc2016-07-04 16:13:11 +0200149Decoding this byte is enough to tell the size of Frame Header.
Yann Collet2fa99042016-07-01 20:55:28 +0200150
Yann Collet23f05cc2016-07-04 16:13:11 +0200151| BitNb | 7-6 | 5 | 4 | 3 | 2 | 1-0 |
152| ------- | ------ | ------- | ------ | -------- | -------- | ------ |
153|FieldName| FCSize | Segment | Unused | Reserved | Checksum | dictID |
Yann Collet2fa99042016-07-01 20:55:28 +0200154
155In this table, bit 7 is highest bit, while bit 0 is lowest.
156
157__Frame Content Size flag__
158
159This is a 2-bits flag (`= FHD >> 6`),
160specifying if decompressed data size is provided within the header.
161
162| Value | 0 | 1 | 2 | 3 |
163| ------- | --- | --- | --- | --- |
164|FieldSize| 0-1 | 2 | 4 | 8 |
165
Yann Collet23f05cc2016-07-04 16:13:11 +0200166Value 0 meaning depends on _single segment_ mode :
Yann Collet2fa99042016-07-01 20:55:28 +0200167it either means `0` (size not provided) _if_ the `WD` byte is present,
Yann Collet23f05cc2016-07-04 16:13:11 +0200168or `1` (frame content size <= 255 bytes) otherwise.
Yann Collet2fa99042016-07-01 20:55:28 +0200169
170__Single Segment__
171
172If this flag is set,
173data shall be regenerated within a single continuous memory segment.
Yann Collet23f05cc2016-07-04 16:13:11 +0200174
Yann Collet2fa99042016-07-01 20:55:28 +0200175In which case, `WD` byte __is not present__,
176but `Frame Content Size` field necessarily is.
Yann Collet2fa99042016-07-01 20:55:28 +0200177As a consequence, the decoder must allocate a memory segment
178of size `>= Frame Content Size`.
179
180In order to preserve the decoder from unreasonable memory requirement,
Yann Collet23f05cc2016-07-04 16:13:11 +0200181a decoder can reject a compressed frame
Yann Collet2fa99042016-07-01 20:55:28 +0200182which requests a memory size beyond decoder's authorized range.
183
184For broader compatibility, decoders are recommended to support
Yann Collet23f05cc2016-07-04 16:13:11 +0200185memory sizes of at least 8 MB.
186This is just a recommendation,
Yann Collet9ca73362016-07-05 10:53:38 +0200187each decoder is free to support higher or lower limits,
Yann Collet2fa99042016-07-01 20:55:28 +0200188depending on local limitations.
189
190__Unused bit__
191
Yann Colletf0bc6732016-07-13 17:30:21 +0200192The value of this bit should be set to zero.
193A decoder compliant with this specification version should not interpret it.
194It might be used in a future version,
195to signal a property which is not mandatory to properly decode the frame.
Yann Collet2fa99042016-07-01 20:55:28 +0200196
197__Reserved bit__
198
199This bit is reserved for some future feature.
200Its value _must be zero_.
201A decoder compliant with this specification version must ensure it is not set.
202This bit may be used in a future revision,
203to signal a feature that must be interpreted in order to decode the frame.
204
205__Content checksum flag__
206
207If this flag is set, a content checksum will be present into the EndMark.
Yann Collet9ca73362016-07-05 10:53:38 +0200208The checksum is a 22 bits value extracted from the XXH64() of data,
209and stored into endMark. See [__Content Checksum__](#content-checksum) .
Yann Collet2fa99042016-07-01 20:55:28 +0200210
211__Dictionary ID flag__
212
213This is a 2-bits flag (`= FHD & 3`),
Yann Collet9ca73362016-07-05 10:53:38 +0200214telling if a dictionary ID is provided within the header.
215It also specifies the size of this field.
Yann Collet2fa99042016-07-01 20:55:28 +0200216
217| Value | 0 | 1 | 2 | 3 |
218| ------- | --- | --- | --- | --- |
219|FieldSize| 0 | 1 | 2 | 4 |
220
221__WD byte__ (Window Descriptor)
222
223Provides guarantees on maximum back-reference distance
224that will be present within compressed data.
225This information is useful for decoders to allocate enough memory.
226
Yann Colletcd25a912016-07-05 11:50:37 +0200227`WD` byte is optional. It's not present in `single segment` mode.
228In which case, the maximum back-reference distance is the content size itself,
229which can be any value from 1 to 2^64-1 bytes (16 EB).
230
Yann Collet2fa99042016-07-01 20:55:28 +0200231| BitNb | 7-3 | 0-2 |
232| --------- | -------- | -------- |
233| FieldName | Exponent | Mantissa |
234
235Maximum distance is given by the following formulae :
236```
237windowLog = 10 + Exponent;
238windowBase = 1 << windowLog;
239windowAdd = (windowBase / 8) * Mantissa;
240windowSize = windowBase + windowAdd;
241```
242The minimum window size is 1 KB.
Yann Colletcd25a912016-07-05 11:50:37 +0200243The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
Yann Collet2fa99042016-07-01 20:55:28 +0200244
245To properly decode compressed data,
246a decoder will need to allocate a buffer of at least `windowSize` bytes.
247
Yann Collet2fa99042016-07-01 20:55:28 +0200248In order to preserve decoder from unreasonable memory requirements,
249a decoder can refuse a compressed frame
250which requests a memory size beyond decoder's authorized range.
251
Yann Colletcd25a912016-07-05 11:50:37 +0200252For improved interoperability,
Yann Collet2fa99042016-07-01 20:55:28 +0200253decoders are recommended to be compatible with window sizes of 8 MB.
254Encoders are recommended to not request more than 8 MB.
255It's merely a recommendation though,
256decoders are free to support larger or lower limits,
257depending on local limitations.
258
Yann Collet23f05cc2016-07-04 16:13:11 +0200259__Dictionary ID__
260
Yann Colletf6ff53c2016-07-15 17:03:38 +0200261This is a variable size field, which contains
262the ID of the dictionary required to properly decode the frame.
263Note that this field is optional. When it's not present,
Yann Collet23f05cc2016-07-04 16:13:11 +0200264it's up to the caller to make sure it uses the correct dictionary.
265
266Field size depends on __Dictionary ID flag__.
2671 byte can represent an ID 0-255.
2682 bytes can represent an ID 0-65535.
Yann Colletcd25a912016-07-05 11:50:37 +02002694 bytes can represent an ID 0-4294967295.
Yann Collet23f05cc2016-07-04 16:13:11 +0200270
271It's allowed to represent a small ID (for example `13`)
Yann Colletcd25a912016-07-05 11:50:37 +0200272with a large 4-bytes dictionary ID, losing some compacity in the process.
Yann Collet23f05cc2016-07-04 16:13:11 +0200273
Yann Colletf6ff53c2016-07-15 17:03:38 +0200274_Reserved ranges :_
275If the frame is going to be distributed in a private environment,
276any dictionary ID can be used.
277However, for public distribution of compressed frames using a dictionary,
278some ranges are reserved for future use :
279- low : 1 - 32767 : reserved
280- high : >= (2^31) : reserved
281
282
Yann Collet2fa99042016-07-01 20:55:28 +0200283__Frame Content Size__
284
285This is the original (uncompressed) size.
286This information is optional, and only present if associated flag is set.
287Content size is provided using 1, 2, 4 or 8 Bytes.
288Format is Little endian.
289
290| Field Size | Range |
291| ---------- | ---------- |
292| 0 | 0 |
293| 1 | 0 - 255 |
294| 2 | 256 - 65791|
295| 4 | 0 - 2^32-1 |
296| 8 | 0 - 2^64-1 |
297
298When field size is 1, 4 or 8 bytes, the value is read directly.
299When field size is 2, _an offset of 256 is added_.
Yann Collet9ca73362016-07-05 10:53:38 +0200300It's allowed to represent a small size (ex: `18`) using any compatible variant.
Yann Collet2fa99042016-07-01 20:55:28 +0200301A size of `0` means `content size is unknown`.
302In which case, the `WD` byte will necessarily be present,
Yann Collet9ca73362016-07-05 10:53:38 +0200303and becomes the only hint to guide memory allocation.
Yann Collet2fa99042016-07-01 20:55:28 +0200304
305In order to preserve decoder from unreasonable memory requirement,
306a decoder can refuse a compressed frame
307which requests a memory size beyond decoder's authorized range.
308
Yann Collet2fa99042016-07-01 20:55:28 +0200309
310Data Blocks
311-----------
312
313| B. Header | data |
314|:---------:| ------ |
315| 3 bytes | |
316
317
318__Block Header__
319
320This field uses 3-bytes, format is __big-endian__.
321
322The 2 highest bits represent the `block type`,
323while the remaining 22 bits represent the (compressed) block size.
324
325There are 4 block types :
326
327| Value | 0 | 1 | 2 | 3 |
328| ---------- | ---------- | --- | --- | ------- |
329| Block Type | Compressed | Raw | RLE | EndMark |
330
Yann Collet9ca73362016-07-05 10:53:38 +0200331- Compressed : this is a [Zstandard compressed block](#compressed-block-format),
332 detailed in another section of this specification.
Yann Collet2fa99042016-07-01 20:55:28 +0200333 "block size" is the compressed size.
334 Decompressed size is unknown,
335 but its maximum possible value is guaranteed (see below)
336- Raw : this is an uncompressed block.
337 "block size" is the number of bytes to read and copy.
338- RLE : this is a single byte, repeated N times.
339 In which case, "block size" is the size to regenerate,
340 while the "compressed" block is just 1 byte (the byte to repeat).
341- EndMark : this is not a block. Signal the end of the frame.
342 The rest of the field may be optionally filled by a checksum
Yann Colletcd25a912016-07-05 11:50:37 +0200343 (see [Content Checksum](#content-checksum)).
Yann Collet2fa99042016-07-01 20:55:28 +0200344
345Block sizes must respect a few rules :
Yann Collet9ca73362016-07-05 10:53:38 +0200346- In compressed mode, compressed size if always strictly `< decompressed size`.
347- Block decompressed size is always <= maximum back-reference distance .
348- Block decompressed size is always <= 128 KB
Yann Collet2fa99042016-07-01 20:55:28 +0200349
350
351__Data__
352
353Where the actual data to decode stands.
354It might be compressed or not, depending on previous field indications.
355A data block is not necessarily "full" :
356since an arbitrary “flush” may happen anytime,
Yann Colletcd25a912016-07-05 11:50:37 +0200357block decompressed content can be any size,
358up to Block Maximum Decompressed Size, which is the smallest of :
359- Maximum back-reference distance
Yann Collet2fa99042016-07-01 20:55:28 +0200360- 128 KB
361
362
363Skippable Frames
364----------------
365
366| Magic Number | Frame Size | User Data |
367|:------------:|:----------:| --------- |
368| 4 bytes | 4 bytes | |
369
370Skippable frames allow the insertion of user-defined data
371into a flow of concatenated frames.
372Its design is pretty straightforward,
373with the sole objective to allow the decoder to quickly skip
374over user-defined data and continue decoding.
375
Yann Colletcd25a912016-07-05 11:50:37 +0200376Skippable frames defined in this specification are compatible with [LZ4] ones.
377
378[LZ4]:http://www.lz4.org
Yann Collet2fa99042016-07-01 20:55:28 +0200379
Yann Collet2fa99042016-07-01 20:55:28 +0200380__Magic Number__ :
381
3824 Bytes, Little endian format.
383Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
384All 16 values are valid to identify a skippable frame.
385
386__Frame Size__ :
387
388This is the size, in bytes, of the following User Data
389(without including the magic number nor the size field itself).
3904 Bytes, Little endian format, unsigned 32-bits.
391This means User Data can’t be bigger than (2^32-1) Bytes.
392
393__User Data__ :
394
395User Data can be anything. Data will just be skipped by the decoder.
396
397
398Compressed block format
399-----------------------
400This specification details the content of a _compressed block_.
Yann Collet00d44ab2016-07-04 01:29:47 +0200401A compressed block has a size, which must be known.
Yann Collet2fa99042016-07-01 20:55:28 +0200402It also has a guaranteed maximum regenerated size,
403in order to properly allocate destination buffer.
Yann Collet9ca73362016-07-05 10:53:38 +0200404See [Data Blocks](#data-blocks) for more details.
Yann Collet2fa99042016-07-01 20:55:28 +0200405
406A compressed block consists of 2 sections :
Yann Colletcd25a912016-07-05 11:50:37 +0200407- [Literals section](#literals-section)
408- [Sequences section](#sequences-section)
Yann Collet2fa99042016-07-01 20:55:28 +0200409
Yann Collet23f05cc2016-07-04 16:13:11 +0200410### Prerequisites
411To decode a compressed block, the following elements are necessary :
Yann Collet698cb632016-07-03 18:49:35 +0200412- Previous decoded blocks, up to a distance of `windowSize`,
Yann Collet9ca73362016-07-05 10:53:38 +0200413 or all previous blocks in "single segment" mode.
Yann Collet698cb632016-07-03 18:49:35 +0200414- List of "recent offsets" from previous compressed block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200415- Decoding tables of previous compressed block for each symbol type
Yann Collet9ca73362016-07-05 10:53:38 +0200416 (literals, litLength, matchLength, offset).
Yann Collet698cb632016-07-03 18:49:35 +0200417
418
Yann Collet00d44ab2016-07-04 01:29:47 +0200419### Literals section
Yann Collet2fa99042016-07-01 20:55:28 +0200420
Yann Colletc40ba712016-07-08 15:39:02 +0200421Literals are compressed using Huffman prefix codes.
Yann Collet2fa99042016-07-01 20:55:28 +0200422During sequence phase, literals will be entangled with match copy operations.
423All literals are regrouped in the first part of the block.
424They can be decoded first, and then copied during sequence operations,
Yann Collet00d44ab2016-07-04 01:29:47 +0200425or they can be decoded on the flow, as needed by sequence commands.
Yann Collet2fa99042016-07-01 20:55:28 +0200426
427| Header | (Tree Description) | Stream1 | (Stream2) | (Stream3) | (Stream4) |
428| ------ | ------------------ | ------- | --------- | --------- | --------- |
429
430Literals can be compressed, or uncompressed.
431When compressed, an optional tree description can be present,
432followed by 1 or 4 streams.
433
Yann Collet00d44ab2016-07-04 01:29:47 +0200434#### Literals section header
Yann Collet2fa99042016-07-01 20:55:28 +0200435
Yann Collet00d44ab2016-07-04 01:29:47 +0200436Header is in charge of describing how literals are packed.
Yann Collet2fa99042016-07-01 20:55:28 +0200437It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
438using big-endian convention.
439
440| BlockType | sizes format | (compressed size) | regenerated size |
441| --------- | ------------ | ----------------- | ---------------- |
442| 2 bits | 1 - 2 bits | 0 - 18 bits | 5 - 20 bits |
443
444__Block Type__ :
445
446This is a 2-bits field, describing 4 different block types :
447
448| Value | 0 | 1 | 2 | 3 |
449| ---------- | ---------- | ------ | --- | ------- |
450| Block Type | Compressed | Repeat | Raw | RLE |
451
452- Compressed : This is a standard huffman-compressed block,
Yann Colletcd25a912016-07-05 11:50:37 +0200453 starting with a huffman tree description.
454 See details below.
Yann Collet2fa99042016-07-01 20:55:28 +0200455- Repeat Stats : This is a huffman-compressed block,
Yann Colletcd25a912016-07-05 11:50:37 +0200456 using huffman tree _from previous huffman-compressed literals block_.
457 Huffman tree description will be skipped.
Yann Collet2fa99042016-07-01 20:55:28 +0200458- Raw : Literals are stored uncompressed.
459- RLE : Literals consist of a single byte value repeated N times.
460
461__Sizes format__ :
462
463Sizes format are divided into 2 families :
464
465- For compressed block, it requires to decode both the compressed size
466 and the decompressed size. It will also decode the number of streams.
467- For Raw or RLE blocks, it's enough to decode the size to regenerate.
468
469For values spanning several bytes, convention is Big-endian.
470
Yann Collet9ca73362016-07-05 10:53:38 +0200471__Sizes format for Raw or RLE literals block__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200472
473- Value : 0x : Regenerated size uses 5 bits (0-31).
474 Total literal header size is 1 byte.
475 `size = h[0] & 31;`
476- Value : 10 : Regenerated size uses 12 bits (0-4095).
477 Total literal header size is 2 bytes.
478 `size = ((h[0] & 15) << 8) + h[1];`
479- Value : 11 : Regenerated size uses 20 bits (0-1048575).
Yann Colletd916c902016-07-04 00:42:58 +0200480 Total literal header size is 3 bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200481 `size = ((h[0] & 15) << 16) + (h[1]<<8) + h[2];`
482
483Note : it's allowed to represent a short value (ex : `13`)
484using a long format, accepting the reduced compacity.
485
Yann Collet9ca73362016-07-05 10:53:38 +0200486__Sizes format for Compressed literals block__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200487
488Note : also applicable to "repeat-stats" blocks.
Yann Colletcd25a912016-07-05 11:50:37 +0200489- Value : 00 : 4 streams.
490 Compressed and regenerated sizes use 10 bits (0-1023).
491 Total literal header size is 3 bytes.
492- Value : 01 : _Single stream_.
493 Compressed and regenerated sizes use 10 bits (0-1023).
494 Total literal header size is 3 bytes.
495- Value : 10 : 4 streams.
496 Compressed and regenerated sizes use 14 bits (0-16383).
497 Total literal header size is 4 bytes.
498- Value : 10 : 4 streams.
499 Compressed and regenerated sizes use 18 bits (0-262143).
500 Total literal header size is 5 bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200501
Yann Collet698cb632016-07-03 18:49:35 +0200502Compressed and regenerated size fields follow big endian convention.
503
504#### Huffman Tree description
505
Yann Colletcd25a912016-07-05 11:50:37 +0200506This section is only present when literals block type is `Compressed` (`0`).
Yann Collet00d44ab2016-07-04 01:29:47 +0200507
Yann Collet9ca73362016-07-05 10:53:38 +0200508Prefix coding represents symbols from an a priori known alphabet
Yann Colletb21e9cb2016-07-15 17:31:13 +0200509by bit sequences (codewords), one codeword for each symbol,
Yann Collet9ca73362016-07-05 10:53:38 +0200510in a manner such that different symbols may be represented
511by bit sequences of different lengths,
512but a parser can always parse an encoded string
513unambiguously symbol-by-symbol.
Yann Collet00d44ab2016-07-04 01:29:47 +0200514
Yann Collet9ca73362016-07-05 10:53:38 +0200515Given an alphabet with known symbol frequencies,
516the Huffman algorithm allows the construction of an optimal prefix code
517using the fewest bits of any possible prefix codes for that alphabet.
Yann Collet00d44ab2016-07-04 01:29:47 +0200518
Yann Collet9ca73362016-07-05 10:53:38 +0200519Prefix code must not exceed a maximum code length.
Yann Collet00d44ab2016-07-04 01:29:47 +0200520More bits improve accuracy but cost more header size,
Yann Collet9ca73362016-07-05 10:53:38 +0200521and require more memory for decoding operations.
Yann Collet00d44ab2016-07-04 01:29:47 +0200522
523The current format limits the maximum depth to 15 bits.
Yann Colletb21e9cb2016-07-15 17:31:13 +0200524The reference decoder goes further, by limiting it to 12 bits.
Yann Collet00d44ab2016-07-04 01:29:47 +0200525It is recommended to remain compatible with reference decoder.
526
Yann Collet698cb632016-07-03 18:49:35 +0200527
528##### Representation
529
Yann Collet00d44ab2016-07-04 01:29:47 +0200530All literal values from zero (included) to last present one (excluded)
Yann Collet698cb632016-07-03 18:49:35 +0200531are represented by `weight` values, from 0 to `maxBits`.
532Transformation from `weight` to `nbBits` follows this formulae :
533`nbBits = weight ? maxBits + 1 - weight : 0;` .
534The last symbol's weight is deduced from previously decoded ones,
535by completing to the nearest power of 2.
536This power of 2 gives `maxBits`, the depth of the current tree.
537
538__Example__ :
539Let's presume the following huffman tree must be described :
540
Yann Colletd916c902016-07-04 00:42:58 +0200541| literal | 0 | 1 | 2 | 3 | 4 | 5 |
542| ------- | --- | --- | --- | --- | --- | --- |
543| nbBits | 1 | 2 | 3 | 0 | 4 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200544
545The tree depth is 4, since its smallest element uses 4 bits.
546Value `5` will not be listed, nor will values above `5`.
547Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
548Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
549It gives the following serie of weights :
550
Yann Colletd916c902016-07-04 00:42:58 +0200551| weights | 4 | 3 | 2 | 0 | 1 |
552| ------- | --- | --- | --- | --- | --- |
553| literal | 0 | 1 | 2 | 3 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200554
555The decoder will do the inverse operation :
Yann Colletd916c902016-07-04 00:42:58 +0200556having collected weights of literals from `0` to `4`,
557it knows the last literal, `5`, is present with a non-zero weight.
Yann Colletcd25a912016-07-05 11:50:37 +0200558The weight of `5` can be deducted by joining to the nearest power of 2.
Yann Collet698cb632016-07-03 18:49:35 +0200559Sum of 2^(weight-1) (excluding 0) is :
Yann Colletd916c902016-07-04 00:42:58 +0200560`8 + 4 + 2 + 0 + 1 = 15`
Yann Collet698cb632016-07-03 18:49:35 +0200561Nearest power of 2 is 16.
562Therefore, `maxBits = 4` and `weight[5] = 1`.
Yann Collet698cb632016-07-03 18:49:35 +0200563
564##### Huffman Tree header
565
Yann Collet9ca73362016-07-05 10:53:38 +0200566This is a single byte value (0-255),
567which tells how to decode the list of weights.
Yann Collet698cb632016-07-03 18:49:35 +0200568
569- if headerByte >= 242 : this is one of 14 pre-defined weight distributions :
Yann Colletcd25a912016-07-05 11:50:37 +0200570
571| value |242|243|244|245|246|247|248|249|250|251|252|253|254|255|
Yann Collete0ce5b02016-07-06 01:50:44 +0200572| -------- |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yann Colletcd25a912016-07-05 11:50:37 +0200573| Nb of 1s | 1 | 2 | 3 | 4 | 7 | 8 | 15| 16| 31| 32| 63| 64|127|128|
574|Complement| 1 | 2 | 1 | 4 | 1 | 8 | 1 | 16| 1 | 32| 1 | 64| 1 |128|
575
Yann Collet26f68142016-07-08 10:42:59 +0200576_Note_ : complement is found by using "join to nearest power of 2" rule.
Yann Collet698cb632016-07-03 18:49:35 +0200577
578- if headerByte >= 128 : this is a direct representation,
579 where each weight is written directly as a 4 bits field (0-15).
Yann Colletcd25a912016-07-05 11:50:37 +0200580 The full representation occupies `((nbSymbols+1)/2)` bytes,
Yann Collet698cb632016-07-03 18:49:35 +0200581 meaning it uses a last full byte even if nbSymbols is odd.
Yann Collet26f68142016-07-08 10:42:59 +0200582 `nbSymbols = headerByte - 127;`.
583 Note that maximum nbSymbols is 241-127 = 114.
584 A larger serie must necessarily use FSE compression.
Yann Collet698cb632016-07-03 18:49:35 +0200585
586- if headerByte < 128 :
587 the serie of weights is compressed by FSE.
Yann Collet26f68142016-07-08 10:42:59 +0200588 The length of the FSE-compressed serie is `headerByte` (0-127).
Yann Collet698cb632016-07-03 18:49:35 +0200589
590##### FSE (Finite State Entropy) compression of huffman weights
591
Yann Collet26f68142016-07-08 10:42:59 +0200592The serie of weights is compressed using FSE compression.
Yann Collet698cb632016-07-03 18:49:35 +0200593It's a single bitstream with 2 interleaved states,
Yann Collet26f68142016-07-08 10:42:59 +0200594sharing a single distribution table.
Yann Collet698cb632016-07-03 18:49:35 +0200595
596To decode an FSE bitstream, it is necessary to know its compressed size.
597Compressed size is provided by `headerByte`.
Yann Collet26f68142016-07-08 10:42:59 +0200598It's also necessary to know its maximum decompressed size,
599which is `255`, since literal values span from `0` to `255`,
Yann Colletcd25a912016-07-05 11:50:37 +0200600and last symbol value is not represented.
Yann Collet698cb632016-07-03 18:49:35 +0200601
602An FSE bitstream starts by a header, describing probabilities distribution.
Yann Collet00d44ab2016-07-04 01:29:47 +0200603It will create a Decoding Table.
Yann Collet26f68142016-07-08 10:42:59 +0200604Table must be pre-allocated, which requires to support a maximum accuracy.
605For a list of huffman weights, recommended maximum is 7 bits.
Yann Collet698cb632016-07-03 18:49:35 +0200606
Yann Collet26f68142016-07-08 10:42:59 +0200607FSE header is [described in relevant chapter](#fse-distribution-table--condensed-format),
608and so is [FSE bitstream](#bitstream).
609The main difference is that Huffman header compression uses 2 states,
610which share the same FSE distribution table.
611Bitstream contains only FSE symbols, there are no interleaved "raw bitfields".
612The number of symbols to decode is discovered
613by tracking bitStream overflow condition.
614When both states have overflowed the bitstream, end is reached.
615
Yann Collet698cb632016-07-03 18:49:35 +0200616
617##### Conversion from weights to huffman prefix codes
618
Yann Colletd916c902016-07-04 00:42:58 +0200619All present symbols shall now have a `weight` value.
Yann Colletd916c902016-07-04 00:42:58 +0200620Symbols are sorted by weight.
Yann Colletb21e9cb2016-07-15 17:31:13 +0200621Symbols with a weight of zero are removed.
Yann Colletd916c902016-07-04 00:42:58 +0200622Within same weight, symbols keep natural order.
623Starting from lowest weight,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200624symbols are being allocated to a `range`.
625A `weight` directly represents a `range`,
626following the formulae : `range = weight ? 1 << (weight-1) : 0 ;`
627Similarly, it is possible to transform weights into nbBits :
Yann Colletd916c902016-07-04 00:42:58 +0200628`nbBits = nbBits ? maxBits + 1 - weight : 0;` .
629
630
631__Example__ :
Yann Colletb21e9cb2016-07-15 17:31:13 +0200632Let's presume the following list of weights has been decoded :
Yann Colletd916c902016-07-04 00:42:58 +0200633
634| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
635| ------- | --- | --- | --- | --- | --- | --- |
636| weight | 4 | 3 | 2 | 0 | 1 | 1 |
637
638Sorted by weight and then natural order,
639it gives the following distribution :
640
641| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
642| ------------ | --- | --- | --- | --- | --- | ---- |
643| weight | 0 | 1 | 1 | 2 | 3 | 4 |
644| range | 0 | 1 | 1 | 2 | 4 | 8 |
Yann Colletb21e9cb2016-07-15 17:31:13 +0200645| table entries| N/A | 0 | 1 | 2-3 | 4-7 | 8-15 |
Yann Colletd916c902016-07-04 00:42:58 +0200646| nb bits | 0 | 4 | 4 | 3 | 2 | 1 |
Yann Colletb21e9cb2016-07-15 17:31:13 +0200647| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
Yann Colletd916c902016-07-04 00:42:58 +0200648
649
Yann Colletd916c902016-07-04 00:42:58 +0200650#### Literals bitstreams
651
652##### Bitstreams sizes
653
654As seen in a previous paragraph,
655there are 2 flavors of huffman-compressed literals :
656single stream, and 4-streams.
657
6584-streams is useful for CPU with multiple execution units and OoO operations.
659Since each stream can be decoded independently,
660it's possible to decode them up to 4x faster than a single stream,
661presuming the CPU has enough parallelism available.
662
663For single stream, header provides both the compressed and regenerated size.
664For 4-streams though,
665header only provides compressed and regenerated size of all 4 streams combined.
Yann Colletd916c902016-07-04 00:42:58 +0200666In order to properly decode the 4 streams,
667it's necessary to know the compressed and regenerated size of each stream.
668
669Regenerated size is easiest :
670each stream has a size of `(totalSize+3)/4`,
Yann Colletcd25a912016-07-05 11:50:37 +0200671except the last one, which is up to 3 bytes smaller, to reach `totalSize`.
Yann Colletd916c902016-07-04 00:42:58 +0200672
673Compressed size must be provided explicitly : in the 4-streams variant,
Yann Collet00d44ab2016-07-04 01:29:47 +0200674bitstreams are preceded by 3 unsigned Little Endian 16-bits values.
675Each value represents the compressed size of one stream, in order.
Yann Colletd916c902016-07-04 00:42:58 +0200676The last stream size is deducted from total compressed size
677and from already known stream sizes :
678`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
679
Yann Collet00d44ab2016-07-04 01:29:47 +0200680##### Bitstreams read and decode
Yann Colletd916c902016-07-04 00:42:58 +0200681
682Each bitstream must be read _backward_,
683that is starting from the end down to the beginning.
684Therefore it's necessary to know the size of each bitstream.
685
686It's also necessary to know exactly which _bit_ is the latest.
687This is detected by a final bit flag :
688the highest bit of latest byte is a final-bit-flag.
689Consequently, a last byte of `0` is not possible.
690And the final-bit-flag itself is not part of the useful bitstream.
691Hence, the last byte contain between 0 and 7 useful bits.
692
693Starting from the end,
694it's possible to read the bitstream in a little-endian fashion,
695keeping track of already used bits.
696
Yann Collet00d44ab2016-07-04 01:29:47 +0200697Reading the last `maxBits` bits,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200698it's then possible to compare extracted value to decoding table,
Yann Colletd916c902016-07-04 00:42:58 +0200699determining the symbol to decode and number of bits to discard.
700
701The process continues up to reading the required number of symbols per stream.
702If a bitstream is not entirely and exactly consumed,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200703hence reaching exactly its beginning position with _all_ bits consumed,
Yann Colletd916c902016-07-04 00:42:58 +0200704the decoding process is considered faulty.
705
Yann Collet698cb632016-07-03 18:49:35 +0200706
Yann Collet00d44ab2016-07-04 01:29:47 +0200707### Sequences section
708
709A compressed block is a succession of _sequences_ .
710A sequence is a literal copy command, followed by a match copy command.
711A literal copy command specifies a length.
712It is the number of bytes to be copied (or extracted) from the literal section.
713A match copy command specifies an offset and a length.
714The offset gives the position to copy from,
Yann Colletb21e9cb2016-07-15 17:31:13 +0200715which can be within a previous block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200716
Yann Colletcd25a912016-07-05 11:50:37 +0200717There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
Yann Collet00d44ab2016-07-04 01:29:47 +0200718which are encoded together, interleaved in a single _bitstream_.
719
Yann Colletcd25a912016-07-05 11:50:37 +0200720Each symbol is a _code_ in its own context,
721which specifies a baseline and a number of bits to add.
Yann Collet00d44ab2016-07-04 01:29:47 +0200722_Codes_ are FSE compressed,
723and interleaved with raw additional bits in the same bitstream.
724
Yann Collet23f05cc2016-07-04 16:13:11 +0200725The Sequences section starts by a header,
726followed by optional Probability tables for each symbol type,
Yann Collet00d44ab2016-07-04 01:29:47 +0200727followed by the bitstream.
728
Yann Colletc40ba712016-07-08 15:39:02 +0200729| Header | (LitLengthTable) | (OffsetTable) | (MatchLengthTable) | bitStream |
730| ------ | ---------------- | ------------- | ------------------ | --------- |
731
Yann Collet23f05cc2016-07-04 16:13:11 +0200732To decode the Sequence section, it's required to know its size.
Yann Colletcd25a912016-07-05 11:50:37 +0200733This size is deducted from `blockSize - literalSectionSize`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200734
735
Yann Collet00d44ab2016-07-04 01:29:47 +0200736#### Sequences section header
737
Yann Collet23f05cc2016-07-04 16:13:11 +0200738Consists in 2 items :
739- Nb of Sequences
740- Flags providing Symbol compression types
741
742__Nb of Sequences__
743
744This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
745Let's call its first byte `byte0`.
746- `if (byte0 == 0)` : there are no sequences.
747 The sequence section stops there.
748 Regenerated content is defined entirely by literals section.
Yann Colletcd25a912016-07-05 11:50:37 +0200749- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
750- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
751- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
Yann Collet23f05cc2016-07-04 16:13:11 +0200752
753__Symbol compression modes__
754
755This is a single byte, defining the compression mode of each symbol type.
756
757| BitNb | 7-6 | 5-4 | 3-2 | 1-0 |
758| ------- | ------ | ------ | ------ | -------- |
759|FieldName| LLtype | OFType | MLType | Reserved |
760
761The last field, `Reserved`, must be all-zeroes.
762
763`LLtype`, `OFType` and `MLType` define the compression mode of
764Literal Lengths, Offsets and Match Lengths respectively.
765
766They follow the same enumeration :
767
768| Value | 0 | 1 | 2 | 3 |
769| ---------------- | ------ | --- | ------ | --- |
770| Compression Mode | predef | RLE | Repeat | FSE |
771
772- "predef" : uses a pre-defined distribution table.
773- "RLE" : it's a single code, repeated `nbSeqs` times.
774- "Repeat" : re-use distribution table from previous compressed block.
775- "FSE" : standard FSE compression.
Yann Colletcd25a912016-07-05 11:50:37 +0200776 A distribution table will be present.
777 It will be described in [next part](#distribution-tables).
Yann Collet23f05cc2016-07-04 16:13:11 +0200778
779#### Symbols decoding
780
781##### Literal Lengths codes
782
783Literal lengths codes are values ranging from `0` to `35` included.
784They define lengths from 0 to 131071 bytes.
785
786| Code | 0-15 |
787| ------ | ---- |
Yann Colletc40ba712016-07-08 15:39:02 +0200788| length | Code |
Yann Colletcd25a912016-07-05 11:50:37 +0200789| nbBits | 0 |
790
Yann Collet23f05cc2016-07-04 16:13:11 +0200791
792| Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
793| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
794| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
795| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
796
797| Code | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
798| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
799| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
800| nb Bits | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
801
802| Code | 32 | 33 | 34 | 35 |
803| -------- | ---- | ---- | ---- | ---- |
804| Baseline | 8192 |16384 |32768 |65536 |
805| nb Bits | 13 | 14 | 15 | 16 |
806
807__Default distribution__
808
Yann Colletcd25a912016-07-05 11:50:37 +0200809When "compression mode" is "predef"",
Yann Collet23f05cc2016-07-04 16:13:11 +0200810a pre-defined distribution is used for FSE compression.
811
Yann Colletc40ba712016-07-08 15:39:02 +0200812Below is its definition. It uses an accuracy of 6 bits (64 states).
Yann Collet23f05cc2016-07-04 16:13:11 +0200813```
814short literalLengths_defaultDistribution[36] =
815 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
816 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
817 -1,-1,-1,-1 };
818```
819
820##### Match Lengths codes
821
822Match lengths codes are values ranging from `0` to `52` included.
823They define lengths from 3 to 131074 bytes.
824
825| Code | 0-31 |
826| ------ | -------- |
Yann Collet23f05cc2016-07-04 16:13:11 +0200827| value | Code + 3 |
Yann Colletcd25a912016-07-05 11:50:37 +0200828| nbBits | 0 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200829
830| Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
831| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
832| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
833| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
834
835| Code | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
836| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
837| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
838| nb Bits | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
839
840| Code | 48 | 49 | 50 | 51 | 52 |
841| -------- | ---- | ---- | ---- | ---- | ---- |
842| Baseline | 4098 | 8194 |16486 |32770 |65538 |
843| nb Bits | 12 | 13 | 14 | 15 | 16 |
844
845__Default distribution__
846
Yann Colletc40ba712016-07-08 15:39:02 +0200847When "compression mode" is defined as "predef",
Yann Collet23f05cc2016-07-04 16:13:11 +0200848a pre-defined distribution is used for FSE compression.
849
850Here is its definition. It uses an accuracy of 6 bits (64 states).
851```
852short matchLengths_defaultDistribution[53] =
853 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
854 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
855 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
856 -1,-1,-1,-1,-1 };
857```
858
859##### Offset codes
860
861Offset codes are values ranging from `0` to `N`,
862with `N` being limited by maximum backreference distance.
863
Yann Colletcd25a912016-07-05 11:50:37 +0200864A decoder is free to limit its maximum `N` supported.
865Recommendation is to support at least up to `22`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200866For information, at the time of this writing.
867the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
868
869An offset code is also the nb of additional bits to read,
870and can be translated into an `OFValue` using the following formulae :
871
872```
873OFValue = (1 << offsetCode) + readNBits(offsetCode);
874if (OFValue > 3) offset = OFValue - 3;
875```
876
877OFValue from 1 to 3 are special : they define "repeat codes",
878which means one of the previous offsets will be repeated.
879They are sorted in recency order, with 1 meaning the most recent one.
Yann Colletcd25a912016-07-05 11:50:37 +0200880See [Repeat offsets](#repeat-offsets) paragraph.
Yann Collet23f05cc2016-07-04 16:13:11 +0200881
882__Default distribution__
883
Yann Colletcd25a912016-07-05 11:50:37 +0200884When "compression mode" is defined as "predef",
Yann Collet23f05cc2016-07-04 16:13:11 +0200885a pre-defined distribution is used for FSE compression.
886
887Here is its definition. It uses an accuracy of 5 bits (32 states),
Yann Colletcd25a912016-07-05 11:50:37 +0200888and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet23f05cc2016-07-04 16:13:11 +0200889
890If any sequence in the compressed block requires an offset larger than this,
891it's not possible to use the default distribution to represent it.
892
893```
894short offsetCodes_defaultDistribution[53] =
895 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
896 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
897```
898
899#### Distribution tables
900
901Following the header, up to 3 distribution tables can be described.
902They are, in order :
903- Literal lengthes
904- Offsets
905- Match Lengthes
906
907The content to decode depends on their respective compression mode :
908- Repeat mode : no content. Re-use distribution from previous compressed block.
909- Predef : no content. Use pre-defined distribution table.
910- RLE : 1 byte. This is the only code to use across the whole compressed block.
911- FSE : A distribution table is present.
912
913##### FSE distribution table : condensed format
914
915An FSE distribution table describes the probabilities of all symbols
916from `0` to the last present one (included)
Yann Collet9ca73362016-07-05 10:53:38 +0200917on a normalized scale of `1 << AccuracyLog` .
Yann Collet23f05cc2016-07-04 16:13:11 +0200918
919It's a bitstream which is read forward, in little-endian fashion.
920It's not necessary to know its exact size,
921since it will be discovered and reported by the decoding process.
922
923The bitstream starts by reporting on which scale it operates.
924`AccuracyLog = low4bits + 5;`
925In theory, it can define a scale from 5 to 20.
926In practice, decoders are allowed to limit the maximum supported `AccuracyLog`.
927Recommended maximum are `9` for literal and match lengthes, and `8` for offsets.
928The reference decoder uses these limits.
929
930Then follow each symbol value, from `0` to last present one.
931The nb of bits used by each field is variable.
932It depends on :
933
934- Remaining probabilities + 1 :
935 __example__ :
936 Presuming an AccuracyLog of 8,
937 and presuming 100 probabilities points have already been distributed,
Yann Colletcd25a912016-07-05 11:50:37 +0200938 the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
Yann Collet23f05cc2016-07-04 16:13:11 +0200939 Therefore, it must read `log2sup(156) == 8` bits.
940
941- Value decoded : small values use 1 less bit :
942 __example__ :
943 Presuming values from 0 to 156 (included) are possible,
944 255-156 = 99 values are remaining in an 8-bits field.
945 They are used this way :
946 first 99 values (hence from 0 to 98) use only 7 bits,
947 values from 99 to 156 use 8 bits.
948 This is achieved through this scheme :
949
950 | Value read | Value decoded | nb Bits used |
951 | ---------- | ------------- | ------------ |
952 | 0 - 98 | 0 - 98 | 7 |
953 | 99 - 127 | 99 - 127 | 8 |
954 | 128 - 226 | 0 - 98 | 7 |
955 | 227 - 255 | 128 - 156 | 8 |
956
957Symbols probabilities are read one by one, in order.
958
959Probability is obtained from Value decoded by following formulae :
960`Proba = value - 1;`
961
962It means value `0` becomes negative probability `-1`.
963`-1` is a special probability, which means `less than 1`.
Yann Colletc40ba712016-07-08 15:39:02 +0200964Its effect on distribution table is described in [next paragraph].
Yann Collet23f05cc2016-07-04 16:13:11 +0200965For the purpose of calculating cumulated distribution, it counts as one.
966
Yann Colletc40ba712016-07-08 15:39:02 +0200967[next paragraph]:#fse-decoding--from-normalized-distribution-to-decoding-tables
968
Yann Collet23f05cc2016-07-04 16:13:11 +0200969When a symbol has a probability of `zero`,
970it is followed by a 2-bits repeat flag.
971This repeat flag tells how many probabilities of zeroes follow the current one.
972It provides a number ranging from 0 to 3.
973If it is a 3, another 2-bits repeat flag follows, and so on.
974
Yann Collet9ca73362016-07-05 10:53:38 +0200975When last symbol reaches cumulated total of `1 << AccuracyLog`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200976decoding is complete.
977Then the decoder can tell how many bytes were used in this process,
978and how many symbols are present.
979
980The bitstream consumes a round number of bytes.
981Any remaining bit within the last byte is just unused.
982
Yann Collet9ca73362016-07-05 10:53:38 +0200983If the last symbol makes cumulated total go above `1 << AccuracyLog`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200984distribution is considered corrupted.
985
986##### FSE decoding : from normalized distribution to decoding tables
987
Yann Collet9ca73362016-07-05 10:53:38 +0200988The distribution of normalized probabilities is enough
989to create a unique decoding table.
990
991It follows the following build rule :
992
993The table has a size of `tableSize = 1 << AccuracyLog;`.
994Each cell describes the symbol decoded,
995and instructions to get the next state.
996
997Symbols are scanned in their natural order for `less than 1` probabilities.
998Symbols with this probability are being attributed a single cell,
999starting from the end of the table.
1000These symbols define a full state reset, reading `AccuracyLog` bits.
1001
1002All remaining symbols are sorted in their natural order.
1003Starting from symbol `0` and table position `0`,
1004each symbol gets attributed as many cells as its probability.
1005Cell allocation is spreaded, not linear :
1006each successor position follow this rule :
1007
Yann Colletcd25a912016-07-05 11:50:37 +02001008```
1009position += (tableSize>>1) + (tableSize>>3) + 3;
1010position &= tableSize-1;
1011```
Yann Collet9ca73362016-07-05 10:53:38 +02001012
1013A position is skipped if already occupied,
1014typically by a "less than 1" probability symbol.
1015
1016The result is a list of state values.
1017Each state will decode the current symbol.
1018
1019To get the Number of bits and baseline required for next state,
1020it's first necessary to sort all states in their natural order.
Yann Colletcd25a912016-07-05 11:50:37 +02001021The lower states will need 1 more bit than higher ones.
Yann Collet9ca73362016-07-05 10:53:38 +02001022
1023__Example__ :
1024Presuming a symbol has a probability of 5.
Yann Colletcd25a912016-07-05 11:50:37 +02001025It receives 5 state values. States are sorted in natural order.
Yann Collet9ca73362016-07-05 10:53:38 +02001026
1027Next power of 2 is 8.
1028Space of probabilities is divided into 8 equal parts.
1029Presuming the AccuracyLog is 7, it defines 128 states.
1030Divided by 8, each share is 16 large.
1031
1032In order to reach 8, 8-5=3 lowest states will count "double",
1033taking shares twice larger,
1034requiring one more bit in the process.
1035
1036Numbering starts from higher states using less bits.
1037
1038| state order | 0 | 1 | 2 | 3 | 4 |
1039| ----------- | ----- | ----- | ------ | ---- | ----- |
1040| width | 32 | 32 | 32 | 16 | 16 |
1041| nb Bits | 5 | 5 | 5 | 4 | 4 |
1042| range nb | 2 | 4 | 6 | 0 | 1 |
1043| baseline | 32 | 64 | 96 | 0 | 16 |
1044| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
1045
1046Next state is determined from current state
1047by reading the required number of bits, and adding the specified baseline.
Yann Collet23f05cc2016-07-04 16:13:11 +02001048
1049
1050#### Bitstream
Yann Collet00d44ab2016-07-04 01:29:47 +02001051
Yann Collet9ca73362016-07-05 10:53:38 +02001052All sequences are stored in a single bitstream, read _backward_.
1053It is therefore necessary to know the bitstream size,
1054which is deducted from compressed block size.
1055
Yann Colletc40ba712016-07-08 15:39:02 +02001056The last useful bit of the stream is followed by an end-bit-flag.
Yann Colletcd25a912016-07-05 11:50:37 +02001057Highest bit of last byte is this flag.
Yann Collet9ca73362016-07-05 10:53:38 +02001058It does not belong to the useful part of the bitstream.
1059Therefore, last byte has 0-7 useful bits.
1060Note that it also means that last byte cannot be `0`.
1061
1062##### Starting states
1063
1064The bitstream starts with initial state values,
1065each using the required number of bits in their respective _accuracy_,
1066decoded previously from their normalized distribution.
1067
1068It starts by `Literal Length State`,
1069followed by `Offset State`,
1070and finally `Match Length State`.
1071
1072Reminder : always keep in mind that all values are read _backward_.
1073
1074##### Decoding a sequence
1075
1076A state gives a code.
1077A code provides a baseline and number of bits to add.
1078See [Symbol Decoding] section for details on each symbol.
1079
1080Decoding starts by reading the nb of bits required to decode offset.
1081It then does the same for match length,
1082and then for literal length.
1083
Yann Colletc40ba712016-07-08 15:39:02 +02001084Offset / matchLength / litLength define a sequence.
1085It starts by inserting the number of literals defined by `litLength`,
1086then continue by copying `matchLength` bytes from `currentPos - offset`.
Yann Collet9ca73362016-07-05 10:53:38 +02001087
1088The next operation is to update states.
1089Using rules pre-calculated in the decoding tables,
1090`Literal Length State` is updated,
1091followed by `Match Length State`,
1092and then `Offset State`.
1093
1094This operation will be repeated `NbSeqs` times.
1095At the end, the bitstream shall be entirely consumed,
1096otherwise bitstream is considered corrupted.
1097
1098[Symbol Decoding]:#symbols-decoding
1099
1100##### Repeat offsets
1101
1102As seen in [Offset Codes], the first 3 values define a repeated offset.
1103They are sorted in recency order, with 1 meaning "most recent one".
1104
1105There is an exception though, when current sequence's literal length is `0`.
1106In which case, 1 would just make previous match longer.
1107Therefore, in such case, 1 means in fact 2, and 2 is impossible.
1108Meaning of 3 is unmodified.
1109
1110Repeat offsets start with the following values : 1, 4 and 8 (in order).
1111
1112Then each block receives its start value from previous compressed block.
1113Note that non-compressed blocks are skipped,
1114they do not contribute to offset history.
1115
1116[Offset Codes]: #offset-codes
1117
1118###### Offset updates rules
1119
1120When the new offset is a normal one,
1121offset history is simply translated by one position,
1122with the new offset taking first spot.
1123
1124- When repeat offset 1 (most recent) is used, history is unmodified.
1125- When repeat offset 2 is used, it's swapped with offset 1.
1126- When repeat offset 3 is used, it takes first spot,
1127 pushing the other ones by one position.
Yann Collet00d44ab2016-07-04 01:29:47 +02001128
Yann Collet2fa99042016-07-01 20:55:28 +02001129
Yann Colletbd106072016-07-08 19:16:57 +02001130Dictionary format
1131-----------------
1132
1133`zstd` is compatible with "pure content" dictionaries, free of any format restriction.
1134But dictionaries created by `zstd --train` follow a format, described here.
1135
1136__Pre-requisites__ : a dictionary has a known length,
1137 defined either by a buffer limit, or a file size.
1138
1139| Header | DictID | Stats | Content |
1140| ------ | ------ | ----- | ------- |
1141
1142__Header__ : 4 bytes ID, value 0xEC30A437, Little Endian format
1143
1144__Dict_ID__ : 4 bytes, stored in Little Endian format.
1145 DictID can be any value, except 0 (which means no DictID).
Yann Collet722e14b2016-07-08 19:22:16 +02001146 It's used by decoders to check if they use the correct dictionary.
Yann Colletf6ff53c2016-07-15 17:03:38 +02001147 _Reserved ranges :_
1148 If the frame is going to be distributed in a private environment,
1149 any dictionary ID can be used.
1150 However, for public distribution of compressed frames,
1151 some ranges are reserved for future use :
Yann Collet6cacd342016-07-15 17:58:13 +02001152
1153 - low range : 1 - 32767 : reserved
1154 - high range : >= (2^31) : reserved
Yann Colletbd106072016-07-08 19:16:57 +02001155
1156__Stats__ : Entropy tables, following the same format as a [compressed blocks].
1157 They are stored in following order :
1158 Huffman tables for literals, FSE table for offset,
Yann Collet722e14b2016-07-08 19:22:16 +02001159 FSE table for matchLenth, and FSE table for litLength.
1160 It's finally followed by 3 offset values, populating recent offsets,
1161 stored in order, 4-bytes little endian each, for a total of 12 bytes.
Yann Colletbd106072016-07-08 19:16:57 +02001162
1163__Content__ : Where the actual dictionary content is.
Yann Collet722e14b2016-07-08 19:22:16 +02001164 Content size depends on Dictionary size.
Yann Colletbd106072016-07-08 19:16:57 +02001165
1166[compressed blocks]: #compressed-block-format
1167
Yann Collet2fa99042016-07-01 20:55:28 +02001168
1169Version changes
1170---------------
Yann Collet6cacd342016-07-15 17:58:13 +02001171- 0.1.1 reserved dictID ranges
1172- 0.1.0 initial release