lbzip2 logo

Algorithms

Algorithms

The following page contains lists of publications describing many algorithms behind lbzip2, most of them are easily available online.

Basic concepts

These papers descriibe general concepts related to bzip2 compression. They provide enough information to write a working bz2 compressor and decompressor, using simple algorithms. Lecture is recommended if you want to understand how lbzip2 works internally.

  • A Block-sorting Lossless Data Compression Algorithm by Burrows, Michael and Wheeler, David John, available online

    This paper describes basics of BWT, MTF and entropy coding. It also provides details of implementation of quicksort-based mathod of calculating BWT, as well as gives some hints about possible improvements. Very valuable publication, a must-read for anyone wanting to understand lbzip2 internals.

  • On the implementation of minimum-redundancy prefix codes by Moffat, Alistair and Turpin, Andrew, available online

    This paper describes all aspects of entropy coding used in lbzip2: basic definitions, coding and decoding. Must-read if you are not familiar with cannonical prefix codes.

Algorithm details

These papers are foundation on some algorithms implemented in lbzip2. The code often closely follows description from these documents. In particular, variable names in source code often match naming in theese papers. Recommended if you are working on particular algorithm and you need to understand the code.

  • In-Place Calculation of Minimum-Redundancy Codes by Moffat, Alistair and Katajainen, Jyrki available online

    This paper describes a lightweight, linear variation of Huffman algorithm for finding optimal prefix codes in-situ. The prefix coding algorithm used in lbzip2 uses algorithm from this paper as first stage.

  • A Fast Algorithm for Optimal Length-Limited Huffman Codes by Larmore, Lawrence and Hirschberg, Daniel, available online

    This paper introduces Package Merge – algorithm for finding optimal length-limited prefix codes. Because of relatively higher time complexity, lbzip2 uses Package Merge only as fallback algorithm, which is executed only when Huffman algoritm is unable to find optimal solution.

  • Decoding prefix codes by Liddell, Mike and Moffat, Alistair available online

    Among other things, this document describes how prefix decoding could be implemented to minimize number of operations performed. Some of algorithms described there were used as a foundation of lbzip2 prefix decoder.

  • Fast lightweight suffix array construction and checking by Burkhardt, Stefan and K"arkk"ainen, Juha, available online

  • An Efficient Method for in Memory Construction of Suffix Arrays by Itoh, Hideo and Tanaka, Hozumi, available online

  • Space-efficient linear time construction of suffix arrays by Ko, Pang and Aluru, Srinivas, available online

  • Faster suffix sorting by Larsson, Jesper and Sadakane, Kunihiko, available online

  • MSufSort by Maniscalco, Michael, available online

  • Short description of improved two-stage suffix sorting algorithm by Mori, Yuta, available online

Older versions

Former versions of lbzip2 also used to benefit from:

  • An Incomplex Algorithm for Fast Suffix Array Construction by Schurmann, Klaus-Bernd and Stoye, Jens, available online

  • Engineering a Lightweight Suffix Array Construction Algorithm by Manzini, Giovanni and Ferragina, Paolo, available online

  • Engineering a Sort Function by Bentley, Jon Louis and McIlroy, Douglas Malcolm, available online

  • Fast Algorithms for Sorting and Searching Strings by Sedgewick, Robert and Bentley, Jon Louis, available~online