blob: 2fbe3fa4c018312c4b83f207a66a73439208349b [file] [log] [blame] [view]
Yann Collet5cc18822016-07-03 19:03:13 +02001Zstandard Compression Format
2============================
Yann Collet2fa99042016-07-01 20:55:28 +02003
4### Notices
5
6Copyright (c) 2016 Yann Collet
7
8Permission is granted to copy and distribute this document
9for any purpose and without charge,
10including translations into other languages
11and incorporation into compilations,
12provided that the copyright notice and this notice are preserved,
13and that any substantive changes or deletions from the original
14are clearly marked.
15Distribution of this document is unlimited.
16
17### Version
18
Yann Colletd916c902016-07-04 00:42:58 +0200190.0.2 (July 2016 - Work in progress - unfinished)
Yann Collet2fa99042016-07-01 20:55:28 +020020
21
22Introduction
23------------
24
25The purpose of this document is to define a lossless compressed data format,
26that is independent of CPU type, operating system,
27file system and character set, suitable for
Yann Collet9ca73362016-07-05 10:53:38 +020028file compression, pipe and streaming compression,
Yann Collet2fa99042016-07-01 20:55:28 +020029using the [Zstandard algorithm](http://www.zstandard.org).
30
31The data can be produced or consumed,
32even for an arbitrarily long sequentially presented input data stream,
33using only an a priori bounded amount of intermediate storage,
34and hence can be used in data communications.
35The format uses the Zstandard compression method,
36and optional [xxHash-64 checksum method](http://www.xxhash.org),
37for detection of data corruption.
38
39The data format defined by this specification
40does not attempt to allow random access to compressed data.
41
42This specification is intended for use by implementers of software
43to compress data into Zstandard format and/or decompress data from Zstandard format.
44The text of the specification assumes a basic background in programming
45at the level of bits and other primitive data representations.
46
47Unless otherwise indicated below,
48a compliant compressor must produce data sets
49that conform to the specifications presented here.
50It doesn’t need to support all options though.
51
52A compliant decompressor must be able to decompress
53at least one working set of parameters
54that conforms to the specifications presented here.
55It may also ignore informative fields, such as checksum.
56Whenever it does not support a parameter defined in the compressed stream,
57it must produce a non-ambiguous error code and associated error message
58explaining which parameter is unsupported.
59
60
61Definitions
62-----------
63A content compressed by Zstandard is transformed into a Zstandard __frame__.
64Multiple frames can be appended into a single file or stream.
65A frame is totally independent, has a defined beginning and end,
66and a set of parameters which tells the decoder how to decompress it.
67
68A frame encapsulates one or multiple __blocks__.
69Each block can be compressed or not,
70and has a guaranteed maximum content size, which depends on frame parameters.
71Unlike frames, each block depends on previous blocks for proper decoding.
72However, each block can be decompressed without waiting for its successor,
73allowing streaming operations.
74
75
76General Structure of Zstandard Frame format
77-------------------------------------------
78
Yann Colletcd25a912016-07-05 11:50:37 +020079| MagicNb | Frame Header | Block | (More blocks) | EndMark |
80|:-------:|:-------------:| ----- | ------------- | ------- |
81| 4 bytes | 2-14 bytes | | | 3 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +020082
83__Magic Number__
84
854 Bytes, Little endian format.
86Value : 0xFD2FB527
87
88__Frame Header__
89
Yann Colletcd25a912016-07-05 11:50:37 +0200902 to 14 Bytes, detailed in [next part](#frame-header).
Yann Collet2fa99042016-07-01 20:55:28 +020091
92__Data Blocks__
93
Yann Colletcd25a912016-07-05 11:50:37 +020094Detailed in [next chapter](#data-blocks).
Yann Collet2fa99042016-07-01 20:55:28 +020095That’s where compressed data is stored.
96
97__EndMark__
98
99The flow of blocks ends when the last block header brings an _end signal_ .
100This last block header may optionally host a __Content Checksum__ .
101
Yann Collet9ca73362016-07-05 10:53:38 +0200102##### __Content Checksum__
Yann Collet2fa99042016-07-01 20:55:28 +0200103
Yann Colletcd25a912016-07-05 11:50:37 +0200104Content Checksum verify that frame content has been regenerated correctly.
Yann Collet2fa99042016-07-01 20:55:28 +0200105The content checksum is the result
106of [xxh64() hash function](https://www.xxHash.com)
107digesting the original (decoded) data as input, and a seed of zero.
108Bits from 11 to 32 (included) are extracted to form a 22 bits checksum
Yann Colletcd25a912016-07-05 11:50:37 +0200109stored into the endmark body.
Yann Collet2fa99042016-07-01 20:55:28 +0200110```
Yann Collet9ca73362016-07-05 10:53:38 +0200111mask22bits = (1<<22)-1;
112contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
Yann Collet2fa99042016-07-01 20:55:28 +0200113```
114Content checksum is only present when its associated flag
115is set in the frame descriptor.
116Its usage is optional.
117
118__Frame Concatenation__
119
120In some circumstances, it may be required to append multiple frames,
121for example in order to add new data to an existing compressed file
122without re-framing it.
123
124In such case, each frame brings its own set of descriptor flags.
125Each frame is considered independent.
126The only relation between frames is their sequential order.
127
128The ability to decode multiple concatenated frames
129within a single stream or file is left outside of this specification.
130As an example, the reference `zstd` command line utility is able
131to decode all concatenated frames in their sequential order,
132delivering the final decompressed result as if it was a single content.
133
134
135Frame Header
136-------------
137
Yann Collet23f05cc2016-07-04 16:13:11 +0200138| FHD | (WD) | (dictID) | (Content Size) |
139| ------- | --------- | --------- |:--------------:|
140| 1 byte | 0-1 byte | 0-4 bytes | 0 - 8 bytes |
Yann Collet2fa99042016-07-01 20:55:28 +0200141
142Frame header has a variable size, which uses a minimum of 2 bytes,
143and up to 14 bytes depending on optional parameters.
144
145__FHD byte__ (Frame Header Descriptor)
146
147The first Header's byte is called the Frame Header Descriptor.
148It tells which other fields are present.
Yann Collet23f05cc2016-07-04 16:13:11 +0200149Decoding this byte is enough to tell the size of Frame Header.
Yann Collet2fa99042016-07-01 20:55:28 +0200150
Yann Collet23f05cc2016-07-04 16:13:11 +0200151| BitNb | 7-6 | 5 | 4 | 3 | 2 | 1-0 |
152| ------- | ------ | ------- | ------ | -------- | -------- | ------ |
153|FieldName| FCSize | Segment | Unused | Reserved | Checksum | dictID |
Yann Collet2fa99042016-07-01 20:55:28 +0200154
155In this table, bit 7 is highest bit, while bit 0 is lowest.
156
157__Frame Content Size flag__
158
159This is a 2-bits flag (`= FHD >> 6`),
160specifying if decompressed data size is provided within the header.
161
162| Value | 0 | 1 | 2 | 3 |
163| ------- | --- | --- | --- | --- |
164|FieldSize| 0-1 | 2 | 4 | 8 |
165
Yann Collet23f05cc2016-07-04 16:13:11 +0200166Value 0 meaning depends on _single segment_ mode :
Yann Collet2fa99042016-07-01 20:55:28 +0200167it either means `0` (size not provided) _if_ the `WD` byte is present,
Yann Collet23f05cc2016-07-04 16:13:11 +0200168or `1` (frame content size <= 255 bytes) otherwise.
Yann Collet2fa99042016-07-01 20:55:28 +0200169
170__Single Segment__
171
172If this flag is set,
173data shall be regenerated within a single continuous memory segment.
Yann Collet23f05cc2016-07-04 16:13:11 +0200174
Yann Collet2fa99042016-07-01 20:55:28 +0200175In which case, `WD` byte __is not present__,
176but `Frame Content Size` field necessarily is.
Yann Collet2fa99042016-07-01 20:55:28 +0200177As a consequence, the decoder must allocate a memory segment
178of size `>= Frame Content Size`.
179
180In order to preserve the decoder from unreasonable memory requirement,
Yann Collet23f05cc2016-07-04 16:13:11 +0200181a decoder can reject a compressed frame
Yann Collet2fa99042016-07-01 20:55:28 +0200182which requests a memory size beyond decoder's authorized range.
183
184For broader compatibility, decoders are recommended to support
Yann Collet23f05cc2016-07-04 16:13:11 +0200185memory sizes of at least 8 MB.
186This is just a recommendation,
Yann Collet9ca73362016-07-05 10:53:38 +0200187each decoder is free to support higher or lower limits,
Yann Collet2fa99042016-07-01 20:55:28 +0200188depending on local limitations.
189
190__Unused bit__
191
192The value of this bit is unimportant
193and not interpreted by a decoder compliant with this specification version.
194It may be used in a future revision,
195to signal a property which is not required to properly decode the frame.
196
197__Reserved bit__
198
199This bit is reserved for some future feature.
200Its value _must be zero_.
201A decoder compliant with this specification version must ensure it is not set.
202This bit may be used in a future revision,
203to signal a feature that must be interpreted in order to decode the frame.
204
205__Content checksum flag__
206
207If this flag is set, a content checksum will be present into the EndMark.
Yann Collet9ca73362016-07-05 10:53:38 +0200208The checksum is a 22 bits value extracted from the XXH64() of data,
209and stored into endMark. See [__Content Checksum__](#content-checksum) .
Yann Collet2fa99042016-07-01 20:55:28 +0200210
211__Dictionary ID flag__
212
213This is a 2-bits flag (`= FHD & 3`),
Yann Collet9ca73362016-07-05 10:53:38 +0200214telling if a dictionary ID is provided within the header.
215It also specifies the size of this field.
Yann Collet2fa99042016-07-01 20:55:28 +0200216
217| Value | 0 | 1 | 2 | 3 |
218| ------- | --- | --- | --- | --- |
219|FieldSize| 0 | 1 | 2 | 4 |
220
221__WD byte__ (Window Descriptor)
222
223Provides guarantees on maximum back-reference distance
224that will be present within compressed data.
225This information is useful for decoders to allocate enough memory.
226
Yann Colletcd25a912016-07-05 11:50:37 +0200227`WD` byte is optional. It's not present in `single segment` mode.
228In which case, the maximum back-reference distance is the content size itself,
229which can be any value from 1 to 2^64-1 bytes (16 EB).
230
Yann Collet2fa99042016-07-01 20:55:28 +0200231| BitNb | 7-3 | 0-2 |
232| --------- | -------- | -------- |
233| FieldName | Exponent | Mantissa |
234
235Maximum distance is given by the following formulae :
236```
237windowLog = 10 + Exponent;
238windowBase = 1 << windowLog;
239windowAdd = (windowBase / 8) * Mantissa;
240windowSize = windowBase + windowAdd;
241```
242The minimum window size is 1 KB.
Yann Colletcd25a912016-07-05 11:50:37 +0200243The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
Yann Collet2fa99042016-07-01 20:55:28 +0200244
245To properly decode compressed data,
246a decoder will need to allocate a buffer of at least `windowSize` bytes.
247
Yann Collet2fa99042016-07-01 20:55:28 +0200248In order to preserve decoder from unreasonable memory requirements,
249a decoder can refuse a compressed frame
250which requests a memory size beyond decoder's authorized range.
251
Yann Colletcd25a912016-07-05 11:50:37 +0200252For improved interoperability,
Yann Collet2fa99042016-07-01 20:55:28 +0200253decoders are recommended to be compatible with window sizes of 8 MB.
254Encoders are recommended to not request more than 8 MB.
255It's merely a recommendation though,
256decoders are free to support larger or lower limits,
257depending on local limitations.
258
Yann Collet23f05cc2016-07-04 16:13:11 +0200259__Dictionary ID__
260
261This is a variable size field, which contains an ID.
262It checks if the correct dictionary is used for decoding.
263Note that this field is optional. If it's not present,
264it's up to the caller to make sure it uses the correct dictionary.
265
266Field size depends on __Dictionary ID flag__.
2671 byte can represent an ID 0-255.
2682 bytes can represent an ID 0-65535.
Yann Colletcd25a912016-07-05 11:50:37 +02002694 bytes can represent an ID 0-4294967295.
Yann Collet23f05cc2016-07-04 16:13:11 +0200270
271It's allowed to represent a small ID (for example `13`)
Yann Colletcd25a912016-07-05 11:50:37 +0200272with a large 4-bytes dictionary ID, losing some compacity in the process.
Yann Collet23f05cc2016-07-04 16:13:11 +0200273
Yann Collet2fa99042016-07-01 20:55:28 +0200274__Frame Content Size__
275
276This is the original (uncompressed) size.
277This information is optional, and only present if associated flag is set.
278Content size is provided using 1, 2, 4 or 8 Bytes.
279Format is Little endian.
280
281| Field Size | Range |
282| ---------- | ---------- |
283| 0 | 0 |
284| 1 | 0 - 255 |
285| 2 | 256 - 65791|
286| 4 | 0 - 2^32-1 |
287| 8 | 0 - 2^64-1 |
288
289When field size is 1, 4 or 8 bytes, the value is read directly.
290When field size is 2, _an offset of 256 is added_.
Yann Collet9ca73362016-07-05 10:53:38 +0200291It's allowed to represent a small size (ex: `18`) using any compatible variant.
Yann Collet2fa99042016-07-01 20:55:28 +0200292A size of `0` means `content size is unknown`.
293In which case, the `WD` byte will necessarily be present,
Yann Collet9ca73362016-07-05 10:53:38 +0200294and becomes the only hint to guide memory allocation.
Yann Collet2fa99042016-07-01 20:55:28 +0200295
296In order to preserve decoder from unreasonable memory requirement,
297a decoder can refuse a compressed frame
298which requests a memory size beyond decoder's authorized range.
299
Yann Collet2fa99042016-07-01 20:55:28 +0200300
301Data Blocks
302-----------
303
304| B. Header | data |
305|:---------:| ------ |
306| 3 bytes | |
307
308
309__Block Header__
310
311This field uses 3-bytes, format is __big-endian__.
312
313The 2 highest bits represent the `block type`,
314while the remaining 22 bits represent the (compressed) block size.
315
316There are 4 block types :
317
318| Value | 0 | 1 | 2 | 3 |
319| ---------- | ---------- | --- | --- | ------- |
320| Block Type | Compressed | Raw | RLE | EndMark |
321
Yann Collet9ca73362016-07-05 10:53:38 +0200322- Compressed : this is a [Zstandard compressed block](#compressed-block-format),
323 detailed in another section of this specification.
Yann Collet2fa99042016-07-01 20:55:28 +0200324 "block size" is the compressed size.
325 Decompressed size is unknown,
326 but its maximum possible value is guaranteed (see below)
327- Raw : this is an uncompressed block.
328 "block size" is the number of bytes to read and copy.
329- RLE : this is a single byte, repeated N times.
330 In which case, "block size" is the size to regenerate,
331 while the "compressed" block is just 1 byte (the byte to repeat).
332- EndMark : this is not a block. Signal the end of the frame.
333 The rest of the field may be optionally filled by a checksum
Yann Colletcd25a912016-07-05 11:50:37 +0200334 (see [Content Checksum](#content-checksum)).
Yann Collet2fa99042016-07-01 20:55:28 +0200335
336Block sizes must respect a few rules :
Yann Collet9ca73362016-07-05 10:53:38 +0200337- In compressed mode, compressed size if always strictly `< decompressed size`.
338- Block decompressed size is always <= maximum back-reference distance .
339- Block decompressed size is always <= 128 KB
Yann Collet2fa99042016-07-01 20:55:28 +0200340
341
342__Data__
343
344Where the actual data to decode stands.
345It might be compressed or not, depending on previous field indications.
346A data block is not necessarily "full" :
347since an arbitrary “flush” may happen anytime,
Yann Colletcd25a912016-07-05 11:50:37 +0200348block decompressed content can be any size,
349up to Block Maximum Decompressed Size, which is the smallest of :
350- Maximum back-reference distance
Yann Collet2fa99042016-07-01 20:55:28 +0200351- 128 KB
352
353
354Skippable Frames
355----------------
356
357| Magic Number | Frame Size | User Data |
358|:------------:|:----------:| --------- |
359| 4 bytes | 4 bytes | |
360
361Skippable frames allow the insertion of user-defined data
362into a flow of concatenated frames.
363Its design is pretty straightforward,
364with the sole objective to allow the decoder to quickly skip
365over user-defined data and continue decoding.
366
Yann Colletcd25a912016-07-05 11:50:37 +0200367Skippable frames defined in this specification are compatible with [LZ4] ones.
368
369[LZ4]:http://www.lz4.org
Yann Collet2fa99042016-07-01 20:55:28 +0200370
Yann Collet2fa99042016-07-01 20:55:28 +0200371__Magic Number__ :
372
3734 Bytes, Little endian format.
374Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
375All 16 values are valid to identify a skippable frame.
376
377__Frame Size__ :
378
379This is the size, in bytes, of the following User Data
380(without including the magic number nor the size field itself).
3814 Bytes, Little endian format, unsigned 32-bits.
382This means User Data can’t be bigger than (2^32-1) Bytes.
383
384__User Data__ :
385
386User Data can be anything. Data will just be skipped by the decoder.
387
388
389Compressed block format
390-----------------------
391This specification details the content of a _compressed block_.
Yann Collet00d44ab2016-07-04 01:29:47 +0200392A compressed block has a size, which must be known.
Yann Collet2fa99042016-07-01 20:55:28 +0200393It also has a guaranteed maximum regenerated size,
394in order to properly allocate destination buffer.
Yann Collet9ca73362016-07-05 10:53:38 +0200395See [Data Blocks](#data-blocks) for more details.
Yann Collet2fa99042016-07-01 20:55:28 +0200396
397A compressed block consists of 2 sections :
Yann Colletcd25a912016-07-05 11:50:37 +0200398- [Literals section](#literals-section)
399- [Sequences section](#sequences-section)
Yann Collet2fa99042016-07-01 20:55:28 +0200400
Yann Collet23f05cc2016-07-04 16:13:11 +0200401### Prerequisites
402To decode a compressed block, the following elements are necessary :
Yann Collet698cb632016-07-03 18:49:35 +0200403- Previous decoded blocks, up to a distance of `windowSize`,
Yann Collet9ca73362016-07-05 10:53:38 +0200404 or all previous blocks in "single segment" mode.
Yann Collet698cb632016-07-03 18:49:35 +0200405- List of "recent offsets" from previous compressed block.
Yann Collet00d44ab2016-07-04 01:29:47 +0200406- Decoding tables of previous compressed block for each symbol type
Yann Collet9ca73362016-07-05 10:53:38 +0200407 (literals, litLength, matchLength, offset).
Yann Collet698cb632016-07-03 18:49:35 +0200408
409
Yann Collet00d44ab2016-07-04 01:29:47 +0200410### Literals section
Yann Collet2fa99042016-07-01 20:55:28 +0200411
Yann Collet9ca73362016-07-05 10:53:38 +0200412Literals are compressed using huffman compression.
Yann Collet2fa99042016-07-01 20:55:28 +0200413During sequence phase, literals will be entangled with match copy operations.
414All literals are regrouped in the first part of the block.
415They can be decoded first, and then copied during sequence operations,
Yann Collet00d44ab2016-07-04 01:29:47 +0200416or they can be decoded on the flow, as needed by sequence commands.
Yann Collet2fa99042016-07-01 20:55:28 +0200417
418| Header | (Tree Description) | Stream1 | (Stream2) | (Stream3) | (Stream4) |
419| ------ | ------------------ | ------- | --------- | --------- | --------- |
420
421Literals can be compressed, or uncompressed.
422When compressed, an optional tree description can be present,
423followed by 1 or 4 streams.
424
Yann Collet00d44ab2016-07-04 01:29:47 +0200425#### Literals section header
Yann Collet2fa99042016-07-01 20:55:28 +0200426
Yann Collet00d44ab2016-07-04 01:29:47 +0200427Header is in charge of describing how literals are packed.
Yann Collet2fa99042016-07-01 20:55:28 +0200428It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
429using big-endian convention.
430
431| BlockType | sizes format | (compressed size) | regenerated size |
432| --------- | ------------ | ----------------- | ---------------- |
433| 2 bits | 1 - 2 bits | 0 - 18 bits | 5 - 20 bits |
434
435__Block Type__ :
436
437This is a 2-bits field, describing 4 different block types :
438
439| Value | 0 | 1 | 2 | 3 |
440| ---------- | ---------- | ------ | --- | ------- |
441| Block Type | Compressed | Repeat | Raw | RLE |
442
443- Compressed : This is a standard huffman-compressed block,
Yann Colletcd25a912016-07-05 11:50:37 +0200444 starting with a huffman tree description.
445 See details below.
Yann Collet2fa99042016-07-01 20:55:28 +0200446- Repeat Stats : This is a huffman-compressed block,
Yann Colletcd25a912016-07-05 11:50:37 +0200447 using huffman tree _from previous huffman-compressed literals block_.
448 Huffman tree description will be skipped.
Yann Collet2fa99042016-07-01 20:55:28 +0200449- Raw : Literals are stored uncompressed.
450- RLE : Literals consist of a single byte value repeated N times.
451
452__Sizes format__ :
453
454Sizes format are divided into 2 families :
455
456- For compressed block, it requires to decode both the compressed size
457 and the decompressed size. It will also decode the number of streams.
458- For Raw or RLE blocks, it's enough to decode the size to regenerate.
459
460For values spanning several bytes, convention is Big-endian.
461
Yann Collet9ca73362016-07-05 10:53:38 +0200462__Sizes format for Raw or RLE literals block__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200463
464- Value : 0x : Regenerated size uses 5 bits (0-31).
465 Total literal header size is 1 byte.
466 `size = h[0] & 31;`
467- Value : 10 : Regenerated size uses 12 bits (0-4095).
468 Total literal header size is 2 bytes.
469 `size = ((h[0] & 15) << 8) + h[1];`
470- Value : 11 : Regenerated size uses 20 bits (0-1048575).
Yann Colletd916c902016-07-04 00:42:58 +0200471 Total literal header size is 3 bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200472 `size = ((h[0] & 15) << 16) + (h[1]<<8) + h[2];`
473
474Note : it's allowed to represent a short value (ex : `13`)
475using a long format, accepting the reduced compacity.
476
Yann Collet9ca73362016-07-05 10:53:38 +0200477__Sizes format for Compressed literals block__ :
Yann Collet2fa99042016-07-01 20:55:28 +0200478
479Note : also applicable to "repeat-stats" blocks.
Yann Colletcd25a912016-07-05 11:50:37 +0200480- Value : 00 : 4 streams.
481 Compressed and regenerated sizes use 10 bits (0-1023).
482 Total literal header size is 3 bytes.
483- Value : 01 : _Single stream_.
484 Compressed and regenerated sizes use 10 bits (0-1023).
485 Total literal header size is 3 bytes.
486- Value : 10 : 4 streams.
487 Compressed and regenerated sizes use 14 bits (0-16383).
488 Total literal header size is 4 bytes.
489- Value : 10 : 4 streams.
490 Compressed and regenerated sizes use 18 bits (0-262143).
491 Total literal header size is 5 bytes.
Yann Collet2fa99042016-07-01 20:55:28 +0200492
Yann Collet698cb632016-07-03 18:49:35 +0200493Compressed and regenerated size fields follow big endian convention.
494
495#### Huffman Tree description
496
Yann Colletcd25a912016-07-05 11:50:37 +0200497This section is only present when literals block type is `Compressed` (`0`).
Yann Collet00d44ab2016-07-04 01:29:47 +0200498
Yann Collet9ca73362016-07-05 10:53:38 +0200499Prefix coding represents symbols from an a priori known alphabet
500by bit sequences (codes), one code for each symbol,
501in a manner such that different symbols may be represented
502by bit sequences of different lengths,
503but a parser can always parse an encoded string
504unambiguously symbol-by-symbol.
Yann Collet00d44ab2016-07-04 01:29:47 +0200505
Yann Collet9ca73362016-07-05 10:53:38 +0200506Given an alphabet with known symbol frequencies,
507the Huffman algorithm allows the construction of an optimal prefix code
508using the fewest bits of any possible prefix codes for that alphabet.
509Such a code is called a Huffman code.
Yann Collet00d44ab2016-07-04 01:29:47 +0200510
Yann Collet9ca73362016-07-05 10:53:38 +0200511Prefix code must not exceed a maximum code length.
Yann Collet00d44ab2016-07-04 01:29:47 +0200512More bits improve accuracy but cost more header size,
Yann Collet9ca73362016-07-05 10:53:38 +0200513and require more memory for decoding operations.
Yann Collet00d44ab2016-07-04 01:29:47 +0200514
515The current format limits the maximum depth to 15 bits.
516The reference decoder goes further, by limiting it to 11 bits.
517It is recommended to remain compatible with reference decoder.
518
Yann Collet698cb632016-07-03 18:49:35 +0200519
520##### Representation
521
Yann Collet00d44ab2016-07-04 01:29:47 +0200522All literal values from zero (included) to last present one (excluded)
Yann Collet698cb632016-07-03 18:49:35 +0200523are represented by `weight` values, from 0 to `maxBits`.
524Transformation from `weight` to `nbBits` follows this formulae :
525`nbBits = weight ? maxBits + 1 - weight : 0;` .
526The last symbol's weight is deduced from previously decoded ones,
527by completing to the nearest power of 2.
528This power of 2 gives `maxBits`, the depth of the current tree.
529
530__Example__ :
531Let's presume the following huffman tree must be described :
532
Yann Colletd916c902016-07-04 00:42:58 +0200533| literal | 0 | 1 | 2 | 3 | 4 | 5 |
534| ------- | --- | --- | --- | --- | --- | --- |
535| nbBits | 1 | 2 | 3 | 0 | 4 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200536
537The tree depth is 4, since its smallest element uses 4 bits.
538Value `5` will not be listed, nor will values above `5`.
539Values from `0` to `4` will be listed using `weight` instead of `nbBits`.
540Weight formula is : `weight = nbBits ? maxBits + 1 - nbBits : 0;`
541It gives the following serie of weights :
542
Yann Colletd916c902016-07-04 00:42:58 +0200543| weights | 4 | 3 | 2 | 0 | 1 |
544| ------- | --- | --- | --- | --- | --- |
545| literal | 0 | 1 | 2 | 3 | 4 |
Yann Collet698cb632016-07-03 18:49:35 +0200546
547The decoder will do the inverse operation :
Yann Colletd916c902016-07-04 00:42:58 +0200548having collected weights of literals from `0` to `4`,
549it knows the last literal, `5`, is present with a non-zero weight.
Yann Colletcd25a912016-07-05 11:50:37 +0200550The weight of `5` can be deducted by joining to the nearest power of 2.
Yann Collet698cb632016-07-03 18:49:35 +0200551Sum of 2^(weight-1) (excluding 0) is :
Yann Colletd916c902016-07-04 00:42:58 +0200552`8 + 4 + 2 + 0 + 1 = 15`
Yann Collet698cb632016-07-03 18:49:35 +0200553Nearest power of 2 is 16.
554Therefore, `maxBits = 4` and `weight[5] = 1`.
Yann Collet698cb632016-07-03 18:49:35 +0200555
556##### Huffman Tree header
557
Yann Collet9ca73362016-07-05 10:53:38 +0200558This is a single byte value (0-255),
559which tells how to decode the list of weights.
Yann Collet698cb632016-07-03 18:49:35 +0200560
561- if headerByte >= 242 : this is one of 14 pre-defined weight distributions :
Yann Colletcd25a912016-07-05 11:50:37 +0200562
563| value |242|243|244|245|246|247|248|249|250|251|252|253|254|255|
Yann Collete0ce5b02016-07-06 01:50:44 +0200564| -------- |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Yann Colletcd25a912016-07-05 11:50:37 +0200565| Nb of 1s | 1 | 2 | 3 | 4 | 7 | 8 | 15| 16| 31| 32| 63| 64|127|128|
566|Complement| 1 | 2 | 1 | 4 | 1 | 8 | 1 | 16| 1 | 32| 1 | 64| 1 |128|
567
568_Note_ : complement is by using the "join to nearest power of 2" rule.
Yann Collet698cb632016-07-03 18:49:35 +0200569
570- if headerByte >= 128 : this is a direct representation,
571 where each weight is written directly as a 4 bits field (0-15).
Yann Colletcd25a912016-07-05 11:50:37 +0200572 The full representation occupies `((nbSymbols+1)/2)` bytes,
Yann Collet698cb632016-07-03 18:49:35 +0200573 meaning it uses a last full byte even if nbSymbols is odd.
574 `nbSymbols = headerByte - 127;`
575
576- if headerByte < 128 :
577 the serie of weights is compressed by FSE.
578 The length of the compressed serie is `headerByte` (0-127).
579
580##### FSE (Finite State Entropy) compression of huffman weights
581
582The serie of weights is compressed using standard FSE compression.
583It's a single bitstream with 2 interleaved states,
584using a single distribution table.
585
586To decode an FSE bitstream, it is necessary to know its compressed size.
587Compressed size is provided by `headerByte`.
588It's also necessary to know its maximum decompressed size.
589In this case, it's `255`, since literal values range from `0` to `255`,
Yann Colletcd25a912016-07-05 11:50:37 +0200590and last symbol value is not represented.
Yann Collet698cb632016-07-03 18:49:35 +0200591
592An FSE bitstream starts by a header, describing probabilities distribution.
Yann Collet00d44ab2016-07-04 01:29:47 +0200593It will create a Decoding Table.
Yann Collet698cb632016-07-03 18:49:35 +0200594It is necessary to know the maximum accuracy of distribution
595to properly allocate space for the Table.
Yann Colletcd25a912016-07-05 11:50:37 +0200596For a list of huffman weights, this maximum is 7 bits.
Yann Collet698cb632016-07-03 18:49:35 +0200597
598FSE header and bitstreams are described in a separated chapter.
599
600##### Conversion from weights to huffman prefix codes
601
Yann Colletd916c902016-07-04 00:42:58 +0200602All present symbols shall now have a `weight` value.
Yann Collet00d44ab2016-07-04 01:29:47 +0200603A `weight` directly represents a `range` of prefix codes,
Yann Colletd916c902016-07-04 00:42:58 +0200604following the formulae : `range = weight ? 1 << (weight-1) : 0 ;`
605Symbols are sorted by weight.
606Within same weight, symbols keep natural order.
607Starting from lowest weight,
608symbols are being allocated to a range of prefix codes.
609Symbols with a weight of zero are not present.
610
Yann Collet00d44ab2016-07-04 01:29:47 +0200611It is then possible to transform weights into nbBits :
Yann Colletd916c902016-07-04 00:42:58 +0200612`nbBits = nbBits ? maxBits + 1 - weight : 0;` .
613
614
615__Example__ :
616Let's presume the following huffman tree has been decoded :
617
618| Literal | 0 | 1 | 2 | 3 | 4 | 5 |
619| ------- | --- | --- | --- | --- | --- | --- |
620| weight | 4 | 3 | 2 | 0 | 1 | 1 |
621
622Sorted by weight and then natural order,
623it gives the following distribution :
624
625| Literal | 3 | 4 | 5 | 2 | 1 | 0 |
626| ------------ | --- | --- | --- | --- | --- | ---- |
627| weight | 0 | 1 | 1 | 2 | 3 | 4 |
628| range | 0 | 1 | 1 | 2 | 4 | 8 |
629| prefix codes | N/A | 0 | 1 | 2-3 | 4-7 | 8-15 |
630| nb bits | 0 | 4 | 4 | 3 | 2 | 1 |
631
632
Yann Colletd916c902016-07-04 00:42:58 +0200633#### Literals bitstreams
634
635##### Bitstreams sizes
636
637As seen in a previous paragraph,
638there are 2 flavors of huffman-compressed literals :
639single stream, and 4-streams.
640
6414-streams is useful for CPU with multiple execution units and OoO operations.
642Since each stream can be decoded independently,
643it's possible to decode them up to 4x faster than a single stream,
644presuming the CPU has enough parallelism available.
645
646For single stream, header provides both the compressed and regenerated size.
647For 4-streams though,
648header only provides compressed and regenerated size of all 4 streams combined.
Yann Colletd916c902016-07-04 00:42:58 +0200649In order to properly decode the 4 streams,
650it's necessary to know the compressed and regenerated size of each stream.
651
652Regenerated size is easiest :
653each stream has a size of `(totalSize+3)/4`,
Yann Colletcd25a912016-07-05 11:50:37 +0200654except the last one, which is up to 3 bytes smaller, to reach `totalSize`.
Yann Colletd916c902016-07-04 00:42:58 +0200655
656Compressed size must be provided explicitly : in the 4-streams variant,
Yann Collet00d44ab2016-07-04 01:29:47 +0200657bitstreams are preceded by 3 unsigned Little Endian 16-bits values.
658Each value represents the compressed size of one stream, in order.
Yann Colletd916c902016-07-04 00:42:58 +0200659The last stream size is deducted from total compressed size
660and from already known stream sizes :
661`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize;`
662
Yann Collet00d44ab2016-07-04 01:29:47 +0200663##### Bitstreams read and decode
Yann Colletd916c902016-07-04 00:42:58 +0200664
665Each bitstream must be read _backward_,
666that is starting from the end down to the beginning.
667Therefore it's necessary to know the size of each bitstream.
668
669It's also necessary to know exactly which _bit_ is the latest.
670This is detected by a final bit flag :
671the highest bit of latest byte is a final-bit-flag.
672Consequently, a last byte of `0` is not possible.
673And the final-bit-flag itself is not part of the useful bitstream.
674Hence, the last byte contain between 0 and 7 useful bits.
675
676Starting from the end,
677it's possible to read the bitstream in a little-endian fashion,
678keeping track of already used bits.
679
Yann Collet00d44ab2016-07-04 01:29:47 +0200680Reading the last `maxBits` bits,
Yann Colletd916c902016-07-04 00:42:58 +0200681it's then possible to compare extracted value to the prefix codes table,
682determining the symbol to decode and number of bits to discard.
683
684The process continues up to reading the required number of symbols per stream.
685If a bitstream is not entirely and exactly consumed,
686hence reaching exactly its beginning position with all bits consumed,
687the decoding process is considered faulty.
688
Yann Collet698cb632016-07-03 18:49:35 +0200689
Yann Collet00d44ab2016-07-04 01:29:47 +0200690### Sequences section
691
692A compressed block is a succession of _sequences_ .
693A sequence is a literal copy command, followed by a match copy command.
694A literal copy command specifies a length.
695It is the number of bytes to be copied (or extracted) from the literal section.
696A match copy command specifies an offset and a length.
697The offset gives the position to copy from,
698which can stand within a previous block.
699
Yann Colletcd25a912016-07-05 11:50:37 +0200700There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
Yann Collet00d44ab2016-07-04 01:29:47 +0200701which are encoded together, interleaved in a single _bitstream_.
702
Yann Colletcd25a912016-07-05 11:50:37 +0200703Each symbol is a _code_ in its own context,
704which specifies a baseline and a number of bits to add.
Yann Collet00d44ab2016-07-04 01:29:47 +0200705_Codes_ are FSE compressed,
706and interleaved with raw additional bits in the same bitstream.
707
Yann Collet23f05cc2016-07-04 16:13:11 +0200708The Sequences section starts by a header,
709followed by optional Probability tables for each symbol type,
Yann Collet00d44ab2016-07-04 01:29:47 +0200710followed by the bitstream.
711
Yann Collet23f05cc2016-07-04 16:13:11 +0200712To decode the Sequence section, it's required to know its size.
Yann Colletcd25a912016-07-05 11:50:37 +0200713This size is deducted from `blockSize - literalSectionSize`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200714
715
Yann Collet00d44ab2016-07-04 01:29:47 +0200716#### Sequences section header
717
Yann Collet23f05cc2016-07-04 16:13:11 +0200718Consists in 2 items :
719- Nb of Sequences
720- Flags providing Symbol compression types
721
722__Nb of Sequences__
723
724This is a variable size field, `nbSeqs`, using between 1 and 3 bytes.
725Let's call its first byte `byte0`.
726- `if (byte0 == 0)` : there are no sequences.
727 The sequence section stops there.
728 Regenerated content is defined entirely by literals section.
Yann Colletcd25a912016-07-05 11:50:37 +0200729- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
730- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
731- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
Yann Collet23f05cc2016-07-04 16:13:11 +0200732
733__Symbol compression modes__
734
735This is a single byte, defining the compression mode of each symbol type.
736
737| BitNb | 7-6 | 5-4 | 3-2 | 1-0 |
738| ------- | ------ | ------ | ------ | -------- |
739|FieldName| LLtype | OFType | MLType | Reserved |
740
741The last field, `Reserved`, must be all-zeroes.
742
743`LLtype`, `OFType` and `MLType` define the compression mode of
744Literal Lengths, Offsets and Match Lengths respectively.
745
746They follow the same enumeration :
747
748| Value | 0 | 1 | 2 | 3 |
749| ---------------- | ------ | --- | ------ | --- |
750| Compression Mode | predef | RLE | Repeat | FSE |
751
752- "predef" : uses a pre-defined distribution table.
753- "RLE" : it's a single code, repeated `nbSeqs` times.
754- "Repeat" : re-use distribution table from previous compressed block.
755- "FSE" : standard FSE compression.
Yann Colletcd25a912016-07-05 11:50:37 +0200756 A distribution table will be present.
757 It will be described in [next part](#distribution-tables).
Yann Collet23f05cc2016-07-04 16:13:11 +0200758
759#### Symbols decoding
760
761##### Literal Lengths codes
762
763Literal lengths codes are values ranging from `0` to `35` included.
764They define lengths from 0 to 131071 bytes.
765
766| Code | 0-15 |
767| ------ | ---- |
Yann Collet23f05cc2016-07-04 16:13:11 +0200768| value | Code |
Yann Colletcd25a912016-07-05 11:50:37 +0200769| nbBits | 0 |
770
Yann Collet23f05cc2016-07-04 16:13:11 +0200771
772| Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
773| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
774| Baseline | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 |
775| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
776
777| Code | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
778| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
779| Baseline | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
780| nb Bits | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
781
782| Code | 32 | 33 | 34 | 35 |
783| -------- | ---- | ---- | ---- | ---- |
784| Baseline | 8192 |16384 |32768 |65536 |
785| nb Bits | 13 | 14 | 15 | 16 |
786
787__Default distribution__
788
Yann Colletcd25a912016-07-05 11:50:37 +0200789When "compression mode" is "predef"",
Yann Collet23f05cc2016-07-04 16:13:11 +0200790a pre-defined distribution is used for FSE compression.
791
792Here is its definition. It uses an accuracy of 6 bits (64 states).
793```
794short literalLengths_defaultDistribution[36] =
795 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
796 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
797 -1,-1,-1,-1 };
798```
799
800##### Match Lengths codes
801
802Match lengths codes are values ranging from `0` to `52` included.
803They define lengths from 3 to 131074 bytes.
804
805| Code | 0-31 |
806| ------ | -------- |
Yann Collet23f05cc2016-07-04 16:13:11 +0200807| value | Code + 3 |
Yann Colletcd25a912016-07-05 11:50:37 +0200808| nbBits | 0 |
Yann Collet23f05cc2016-07-04 16:13:11 +0200809
810| Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
811| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
812| Baseline | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 |
813| nb Bits | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 |
814
815| Code | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
816| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
817| Baseline | 67 | 83 | 99 | 131 | 258 | 514 | 1026 | 2050 |
818| nb Bits | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 |
819
820| Code | 48 | 49 | 50 | 51 | 52 |
821| -------- | ---- | ---- | ---- | ---- | ---- |
822| Baseline | 4098 | 8194 |16486 |32770 |65538 |
823| nb Bits | 12 | 13 | 14 | 15 | 16 |
824
825__Default distribution__
826
827When "compression mode" is defined as "default distribution",
828a pre-defined distribution is used for FSE compression.
829
830Here is its definition. It uses an accuracy of 6 bits (64 states).
831```
832short matchLengths_defaultDistribution[53] =
833 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
834 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
835 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
836 -1,-1,-1,-1,-1 };
837```
838
839##### Offset codes
840
841Offset codes are values ranging from `0` to `N`,
842with `N` being limited by maximum backreference distance.
843
Yann Colletcd25a912016-07-05 11:50:37 +0200844A decoder is free to limit its maximum `N` supported.
845Recommendation is to support at least up to `22`.
Yann Collet23f05cc2016-07-04 16:13:11 +0200846For information, at the time of this writing.
847the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
848
849An offset code is also the nb of additional bits to read,
850and can be translated into an `OFValue` using the following formulae :
851
852```
853OFValue = (1 << offsetCode) + readNBits(offsetCode);
854if (OFValue > 3) offset = OFValue - 3;
855```
856
857OFValue from 1 to 3 are special : they define "repeat codes",
858which means one of the previous offsets will be repeated.
859They are sorted in recency order, with 1 meaning the most recent one.
Yann Colletcd25a912016-07-05 11:50:37 +0200860See [Repeat offsets](#repeat-offsets) paragraph.
Yann Collet23f05cc2016-07-04 16:13:11 +0200861
862__Default distribution__
863
Yann Colletcd25a912016-07-05 11:50:37 +0200864When "compression mode" is defined as "predef",
Yann Collet23f05cc2016-07-04 16:13:11 +0200865a pre-defined distribution is used for FSE compression.
866
867Here is its definition. It uses an accuracy of 5 bits (32 states),
Yann Colletcd25a912016-07-05 11:50:37 +0200868and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
Yann Collet23f05cc2016-07-04 16:13:11 +0200869
870If any sequence in the compressed block requires an offset larger than this,
871it's not possible to use the default distribution to represent it.
872
873```
874short offsetCodes_defaultDistribution[53] =
875 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
876 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
877```
878
879#### Distribution tables
880
881Following the header, up to 3 distribution tables can be described.
882They are, in order :
883- Literal lengthes
884- Offsets
885- Match Lengthes
886
887The content to decode depends on their respective compression mode :
888- Repeat mode : no content. Re-use distribution from previous compressed block.
889- Predef : no content. Use pre-defined distribution table.
890- RLE : 1 byte. This is the only code to use across the whole compressed block.
891- FSE : A distribution table is present.
892
893##### FSE distribution table : condensed format
894
895An FSE distribution table describes the probabilities of all symbols
896from `0` to the last present one (included)
Yann Collet9ca73362016-07-05 10:53:38 +0200897on a normalized scale of `1 << AccuracyLog` .
Yann Collet23f05cc2016-07-04 16:13:11 +0200898
899It's a bitstream which is read forward, in little-endian fashion.
900It's not necessary to know its exact size,
901since it will be discovered and reported by the decoding process.
902
903The bitstream starts by reporting on which scale it operates.
904`AccuracyLog = low4bits + 5;`
905In theory, it can define a scale from 5 to 20.
906In practice, decoders are allowed to limit the maximum supported `AccuracyLog`.
907Recommended maximum are `9` for literal and match lengthes, and `8` for offsets.
908The reference decoder uses these limits.
909
910Then follow each symbol value, from `0` to last present one.
911The nb of bits used by each field is variable.
912It depends on :
913
914- Remaining probabilities + 1 :
915 __example__ :
916 Presuming an AccuracyLog of 8,
917 and presuming 100 probabilities points have already been distributed,
Yann Colletcd25a912016-07-05 11:50:37 +0200918 the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
Yann Collet23f05cc2016-07-04 16:13:11 +0200919 Therefore, it must read `log2sup(156) == 8` bits.
920
921- Value decoded : small values use 1 less bit :
922 __example__ :
923 Presuming values from 0 to 156 (included) are possible,
924 255-156 = 99 values are remaining in an 8-bits field.
925 They are used this way :
926 first 99 values (hence from 0 to 98) use only 7 bits,
927 values from 99 to 156 use 8 bits.
928 This is achieved through this scheme :
929
930 | Value read | Value decoded | nb Bits used |
931 | ---------- | ------------- | ------------ |
932 | 0 - 98 | 0 - 98 | 7 |
933 | 99 - 127 | 99 - 127 | 8 |
934 | 128 - 226 | 0 - 98 | 7 |
935 | 227 - 255 | 128 - 156 | 8 |
936
937Symbols probabilities are read one by one, in order.
938
939Probability is obtained from Value decoded by following formulae :
940`Proba = value - 1;`
941
942It means value `0` becomes negative probability `-1`.
943`-1` is a special probability, which means `less than 1`.
Yann Colletcd25a912016-07-05 11:50:37 +0200944Its effect on distribution table is described in next paragraph.
Yann Collet23f05cc2016-07-04 16:13:11 +0200945For the purpose of calculating cumulated distribution, it counts as one.
946
947When a symbol has a probability of `zero`,
948it is followed by a 2-bits repeat flag.
949This repeat flag tells how many probabilities of zeroes follow the current one.
950It provides a number ranging from 0 to 3.
951If it is a 3, another 2-bits repeat flag follows, and so on.
952
Yann Collet9ca73362016-07-05 10:53:38 +0200953When last symbol reaches cumulated total of `1 << AccuracyLog`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200954decoding is complete.
955Then the decoder can tell how many bytes were used in this process,
956and how many symbols are present.
957
958The bitstream consumes a round number of bytes.
959Any remaining bit within the last byte is just unused.
960
Yann Collet9ca73362016-07-05 10:53:38 +0200961If the last symbol makes cumulated total go above `1 << AccuracyLog`,
Yann Collet23f05cc2016-07-04 16:13:11 +0200962distribution is considered corrupted.
963
964##### FSE decoding : from normalized distribution to decoding tables
965
Yann Collet9ca73362016-07-05 10:53:38 +0200966The distribution of normalized probabilities is enough
967to create a unique decoding table.
968
969It follows the following build rule :
970
971The table has a size of `tableSize = 1 << AccuracyLog;`.
972Each cell describes the symbol decoded,
973and instructions to get the next state.
974
975Symbols are scanned in their natural order for `less than 1` probabilities.
976Symbols with this probability are being attributed a single cell,
977starting from the end of the table.
978These symbols define a full state reset, reading `AccuracyLog` bits.
979
980All remaining symbols are sorted in their natural order.
981Starting from symbol `0` and table position `0`,
982each symbol gets attributed as many cells as its probability.
983Cell allocation is spreaded, not linear :
984each successor position follow this rule :
985
Yann Colletcd25a912016-07-05 11:50:37 +0200986```
987position += (tableSize>>1) + (tableSize>>3) + 3;
988position &= tableSize-1;
989```
Yann Collet9ca73362016-07-05 10:53:38 +0200990
991A position is skipped if already occupied,
992typically by a "less than 1" probability symbol.
993
994The result is a list of state values.
995Each state will decode the current symbol.
996
997To get the Number of bits and baseline required for next state,
998it's first necessary to sort all states in their natural order.
Yann Colletcd25a912016-07-05 11:50:37 +0200999The lower states will need 1 more bit than higher ones.
Yann Collet9ca73362016-07-05 10:53:38 +02001000
1001__Example__ :
1002Presuming a symbol has a probability of 5.
Yann Colletcd25a912016-07-05 11:50:37 +02001003It receives 5 state values. States are sorted in natural order.
Yann Collet9ca73362016-07-05 10:53:38 +02001004
1005Next power of 2 is 8.
1006Space of probabilities is divided into 8 equal parts.
1007Presuming the AccuracyLog is 7, it defines 128 states.
1008Divided by 8, each share is 16 large.
1009
1010In order to reach 8, 8-5=3 lowest states will count "double",
1011taking shares twice larger,
1012requiring one more bit in the process.
1013
1014Numbering starts from higher states using less bits.
1015
1016| state order | 0 | 1 | 2 | 3 | 4 |
1017| ----------- | ----- | ----- | ------ | ---- | ----- |
1018| width | 32 | 32 | 32 | 16 | 16 |
1019| nb Bits | 5 | 5 | 5 | 4 | 4 |
1020| range nb | 2 | 4 | 6 | 0 | 1 |
1021| baseline | 32 | 64 | 96 | 0 | 16 |
1022| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
1023
1024Next state is determined from current state
1025by reading the required number of bits, and adding the specified baseline.
Yann Collet23f05cc2016-07-04 16:13:11 +02001026
1027
1028#### Bitstream
Yann Collet00d44ab2016-07-04 01:29:47 +02001029
Yann Collet9ca73362016-07-05 10:53:38 +02001030All sequences are stored in a single bitstream, read _backward_.
1031It is therefore necessary to know the bitstream size,
1032which is deducted from compressed block size.
1033
Yann Colletcd25a912016-07-05 11:50:37 +02001034The bit of the stream is followed by a set-bit-flag.
1035Highest bit of last byte is this flag.
Yann Collet9ca73362016-07-05 10:53:38 +02001036It does not belong to the useful part of the bitstream.
1037Therefore, last byte has 0-7 useful bits.
1038Note that it also means that last byte cannot be `0`.
1039
1040##### Starting states
1041
1042The bitstream starts with initial state values,
1043each using the required number of bits in their respective _accuracy_,
1044decoded previously from their normalized distribution.
1045
1046It starts by `Literal Length State`,
1047followed by `Offset State`,
1048and finally `Match Length State`.
1049
1050Reminder : always keep in mind that all values are read _backward_.
1051
1052##### Decoding a sequence
1053
1054A state gives a code.
1055A code provides a baseline and number of bits to add.
1056See [Symbol Decoding] section for details on each symbol.
1057
1058Decoding starts by reading the nb of bits required to decode offset.
1059It then does the same for match length,
1060and then for literal length.
1061
1062Offset / matchLength / litLength define a sequence, which can be applied.
1063
1064The next operation is to update states.
1065Using rules pre-calculated in the decoding tables,
1066`Literal Length State` is updated,
1067followed by `Match Length State`,
1068and then `Offset State`.
1069
1070This operation will be repeated `NbSeqs` times.
1071At the end, the bitstream shall be entirely consumed,
1072otherwise bitstream is considered corrupted.
1073
1074[Symbol Decoding]:#symbols-decoding
1075
1076##### Repeat offsets
1077
1078As seen in [Offset Codes], the first 3 values define a repeated offset.
1079They are sorted in recency order, with 1 meaning "most recent one".
1080
1081There is an exception though, when current sequence's literal length is `0`.
1082In which case, 1 would just make previous match longer.
1083Therefore, in such case, 1 means in fact 2, and 2 is impossible.
1084Meaning of 3 is unmodified.
1085
1086Repeat offsets start with the following values : 1, 4 and 8 (in order).
1087
1088Then each block receives its start value from previous compressed block.
1089Note that non-compressed blocks are skipped,
1090they do not contribute to offset history.
1091
1092[Offset Codes]: #offset-codes
1093
1094###### Offset updates rules
1095
1096When the new offset is a normal one,
1097offset history is simply translated by one position,
1098with the new offset taking first spot.
1099
1100- When repeat offset 1 (most recent) is used, history is unmodified.
1101- When repeat offset 2 is used, it's swapped with offset 1.
1102- When repeat offset 3 is used, it takes first spot,
1103 pushing the other ones by one position.
Yann Collet00d44ab2016-07-04 01:29:47 +02001104
Yann Collet2fa99042016-07-01 20:55:28 +02001105
1106
1107Version changes
1108---------------