libfly
6.2.2
C++20 utility library for Linux, macOS, and Windows
|
#include <huffman_encoder.hpp>
Public Member Functions | |
HuffmanEncoder (const std::shared_ptr< CoderConfig > &config) noexcept | |
![]() | |
virtual | ~Encoder ()=default |
virtual bool | encode_string (const std::string &decoded, std::string &encoded) |
virtual bool | encode_file (const std::filesystem::path &decoded, const std::filesystem::path &encoded) |
Protected Member Functions | |
bool | encode_binary (std::istream &decoded, fly::BitStreamWriter &encoded) override |
![]() | |
bool | encode_internal (std::istream &decoded, std::ostream &encoded) final |
Implementation of the Encoder interface for Huffman coding. Forms length-limted, canonical Huffman codes to encode symbols.
|
explicitnoexcept |
Constructor.
config | Reference to coder configuration. |
|
overrideprotectedvirtual |
Huffman encode a stream.
If the input stream is large, in order to limit memory usage, the stream is encoded in chunks. Each chunk is treated as its own input stream, and the encoding sequence is repeated for each chunk.
The first bytes of the output stream are reserved as a header. Currently, the header contains: the incurred BitStream header, the version of the Huffman coder used to encode the stream, the maximum chunk length used to to split large streams (in kilobytes), and the maximum allowed Huffman code length:
| 8 bits | 8 bits | 16 bits | 8 bits | -------------------------------------------------------------------- | BitStream header | Version | Chunk length (KB) | Max code length |
The sequence to encode a stream is:
1. Create a Huffman tree from the input stream. 2. Generate standard Huffman codes from the Huffman tree. 3. Length-limit the standard Huffman codes. 4. Convert the length-limited codes to canonical Huffman codes. 5. Encode the canonical codes. 6. Encode the input stream using the canonical codes.
This sequence involves iterating over the entire input stream twice (to create the Huffman tree and to encode the stream).
The coder does not assume the Huffman codes are retained between calls. Thus, the codes are encoded before the input stream (step 5) so that they may be learned during decoding.
Length-limiting is performed on the generated Huffman codes to improve decoder performance. Worst-case, a Huffman code could have the same length as the maximum number of symbols. Limiting the length of Huffman codes awards a significant decoder performance improvement, while only incurring a small cost in compression ratio.
Canonical form is used for its property of generally being describable in fewer bits than standard form. When in canonical form, the Huffman codes are sorted by code length. With this sorting, the count of the number of symbols for each code length is computed: (N0, N1, N2, ..., Nn), where N<n> is the number of symbols of code length <n>. Call the length of this list NN.
The encoding of canonical Huffman codes then becomes:
NN,N0,N1,N2,...,Nn,S0,S1,S2,...,Sn
Where S<n> is all symbols of code length <n>.
Encoding the input stream (step 6) consists of reading each symbol from the input stream and outputting that symbol's canonical Huffman code.
decoded | Stream holding the contents to encode. |
encoded | Stream to store the encoded contents. |
Implements fly::coders::BinaryEncoder.