org.crosswire.common.compress
Class LZSS

java.lang.Object
  extended by org.crosswire.common.compress.AbstractCompressor
      extended by org.crosswire.common.compress.LZSS
All Implemented Interfaces:
Compressor

public class LZSS
extends AbstractCompressor

The LZSS compression is a port of code as implemented for STEP. The following information gives the history of this implementation.

Compression Info, 10-11-95
Jeff Wheeler

Source of Algorithm

The compression algorithms used here are based upon the algorithms developed and published by Haruhiko Okumura in a paper entitled "Data Compression Algorithms of LARC and LHarc." This paper discusses three compression algorithms: LZSS, LZARI, and LZHUF. LZSS is described as the "first" of these, providing moderate compression with good speed. LZARI is described as an improved LZSS, combining the LZSS algorithm with adaptive arithmetic compression; it is slower than LZSS but compresses better. LZHUF (the basis of the common LHA compression program) was included in the paper; however, a free usage license was not included.

The following are copies of the statements included at the beginning of each source code listing that was supplied in the working paper.

LZSS, dated 4/6/89, marked as "Use, distribute and modify this program freely."
LZARI, dated 4/7/89, marked as "Use, distribute and modify this program freely."
LZHUF, dated 11/20/88, written by Haruyasu Yoshizaki, translated by Haruhiko Okumura on 4/7/89. Not expressly marked as redistributable or modifiable.

Since both LZSS and LZARI are marked as "use, distribute and modify freely," we have felt at liberty to base our compression algorithm on either of these.

Selection of Algorithm

Working samples of three possible compression algorithms are supplied in Okumura's paper. Which should be used?

LZSS is the fastest at decompression, but does not generate as small a compressed file as the other methods. The other two methods provided, perhaps, a 15% improvement in compression. Or, put another way, on a 100K file, LZSS might compress it to 50K while the others might approach 40-45K. For STEP purposes, it was decided that decoding speed was of more importance than tighter compression. For these reasons, the first compression algorithm implemented is the LZSS algorithm.

About LZSS Encoding

(adapted from Haruhiko Okumura's paper)

This scheme was proposed by Ziv and Lempel [1]. A slightly modified version is described by Storer and Szymanski [2]. An implementation using a binary tree has been proposed by Bell [3].

The algorithm is quite simple.
  1. Keep a ring buffer which initially contains all space characters.
  2. Read several letters from the file to the buffer.
  3. Search the buffer for the longest string that matches the letters just read, and send its length and position in the buffer.

If the ring buffer is 4096 bytes, the position can be stored in 12 bits. If the length is represented in 4 bits, the pair is two bytes long. If the longest match is no more than two characters, then just one character is sent without encoding. The process starts again with the next character. An extra bit is sent each time to tell the decoder whether the next item is a character or a pair.
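
Step 3 and the choice between a literal character and a pair can be sketched with a brute-force search. This is an illustration only: the class itself searches with binary trees (see insertNode), the window here is a flat array rather than a ring, and the class LongestMatchSketch and its methods are hypothetical names. The value 2 for THRESHOLD is inferred from the text above, not quoted from the source.

```java
/** Illustrative only: a brute-force version of step 3, not the tree search used by LZSS. */
final class LongestMatchSketch {
    static final int MAX_STORE_LENGTH = 18; // longest match a pair can describe
    static final int THRESHOLD = 2;         // assumed: matches this short are sent as literals

    /**
     * Returns {position, length} of the longest prefix of lookahead
     * found in window[0 .. windowLen). Length is 0 when nothing matches.
     */
    static int[] longestMatch(byte[] window, int windowLen, byte[] lookahead) {
        int bestPos = 0;
        int bestLen = 0;
        int maxLen = Math.min(lookahead.length, MAX_STORE_LENGTH);
        for (int pos = 0; pos < windowLen; pos++) {
            int len = 0;
            while (len < maxLen
                    && pos + len < windowLen
                    && window[pos + len] == lookahead[len]) {
                len++;
            }
            if (len > bestLen) { // keep the first longest match found
                bestPos = pos;
                bestLen = len;
            }
        }
        return new int[] {bestPos, bestLen};
    }

    /** A two-byte pair only pays off when it replaces more than THRESHOLD characters. */
    static boolean worthEncoding(int matchLength) {
        return matchLength > THRESHOLD;
    }
}
```

With the window "the quick brown fox" and the lookahead "quick!", the search reports position 4 and length 5, which is long enough to send as a pair; a lookahead of "qx" matches only one character and would be sent as a literal.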

[1] J. Ziv and A. Lempel, IEEE Transactions IT-23, 337-343 (1977).
[2] J. A. Storer and T. G. Szymanski, J. ACM, 29, 928-951 (1982).
[3] T. C. Bell, IEEE Transactions COM-34, 1176-1182 (1986).

Regarding this port to Java and not the original code, the following license applies:

Author:
DM Smith
See Also:
for license details.
The copyright to this program is held by its authors.

Field Summary
private  short[] dad
          leftSon, rightSon, and dad are the Japanese way of referring to a tree structure.
private  short[] leftSon
           
private  short matchLength
          The number of characters in the ring buffer at matchPosition that match a given string.
private  short matchPosition
          The position in the ring buffer.
private static int MAX_STORE_LENGTH
          This is the maximum length of a character sequence that can be taken from the ring buffer.
private static short NOT_USED
          Used to mark nodes as not used.
private  ByteArrayOutputStream out
          The output stream containing the result.
private  short[] rightSon
           
private static short RING_SIZE
          This is the size of the ring buffer.
private static short RING_WRAP
          This is used to determine the next position in the ring buffer, from 0 to RING_SIZE - 1.
private  byte[] ringBuffer
          A text buffer.
private static int THRESHOLD
          It takes 2 bytes to store an offset and a length.
 
Fields inherited from class org.crosswire.common.compress.AbstractCompressor
input
 
Fields inherited from interface org.crosswire.common.compress.Compressor
BUF_SIZE
 
Constructor Summary
LZSS(InputStream input)
          Create an LZSS that is capable of transforming the input.
 
Method Summary
 ByteArrayOutputStream compress()
          Compresses the input and provides the result.
private  void deleteNode(short node)
          Remove a node from the tree.
private  void initTree()
          Initializes the tree nodes to "empty" states.
private  void insertNode(short pos)
          Inserts a string from the ring buffer into one of the trees.
 ByteArrayOutputStream uncompress()
          Uncompresses the input and provides the result.
 ByteArrayOutputStream uncompress(int expectedSize)
          Uncompresses the input and provides the result.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

RING_SIZE

private static final short RING_SIZE
This is the size of the ring buffer. It is set to 4K. It is important to note that a position within the ring buffer requires 12 bits.

See Also:
Constant Field Values

RING_WRAP

private static final short RING_WRAP
This is used to determine the next position in the ring buffer, from 0 to RING_SIZE - 1. The idiom s = (s + 1) & RING_WRAP; will ensure this. This only works if RING_SIZE is a power of 2. Note this is slightly faster than the equivalent: s = (s + 1) % RING_SIZE;
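
A minimal sketch of the wrap idiom follows. It assumes RING_WRAP equals RING_SIZE - 1 (0x0FFF), which the description implies but does not state; the class RingWrapSketch and its method names are hypothetical.

```java
/** Illustrative only: advancing a ring buffer index with a bit mask. */
final class RingWrapSketch {
    static final short RING_SIZE = 4096;          // must be a power of 2
    static final short RING_WRAP = RING_SIZE - 1; // assumed value: 0x0FFF

    /** Next position using the mask idiom. */
    static short next(short s) {
        return (short) ((s + 1) & RING_WRAP);
    }

    /** Equivalent next position using the remainder operator. */
    static short nextModulo(short s) {
        return (short) ((s + 1) % RING_SIZE);
    }
}
```

Both methods map 4095 back to 0 and agree for every position from 0 to RING_SIZE - 1. The mask works because RING_SIZE - 1 has all twelve low bits set, so the AND simply discards the carry into bit 12.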

See Also:
Constant Field Values

MAX_STORE_LENGTH

private static final int MAX_STORE_LENGTH
This is the maximum length of a character sequence that can be taken from the ring buffer. It is set to 18. Note that a length must be at least 3 before it is worthwhile to store a position/length pair, so the length can be encoded in only 4 bits. Put another way, there is no need to encode a length of 0-18; only lengths of 3-18 need to be encoded, and those 16 values fit in 4 bits.

Note that the 12 bits used to store the position and the 4 bits used to store the length equal a total of 16 bits, or 2 bytes.
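
The arithmetic above can be sketched as a two-byte packing routine. The bit layout used here (low 8 bits of the position in the first byte, the top 4 bits of the position in the high nibble of the second byte, and length - 3 in the low nibble) is an assumption for illustration and is not confirmed by this page; PairCodecSketch and its methods are hypothetical names.

```java
/** Illustrative only: one possible way to fit a position and a length into 16 bits. */
final class PairCodecSketch {
    static final int THRESHOLD = 2;         // assumed: shortest encoded match is THRESHOLD + 1
    static final int MAX_STORE_LENGTH = 18; // longest encoded match

    /** Packs a 12-bit position and a length of 3..18 into two bytes. */
    static byte[] pack(int position, int length) {
        if (position < 0 || position > 0x0FFF) {
            throw new IllegalArgumentException("position needs more than 12 bits");
        }
        if (length <= THRESHOLD || length > MAX_STORE_LENGTH) {
            throw new IllegalArgumentException("length must be 3..18");
        }
        int storedLength = length - (THRESHOLD + 1);            // 0..15 fits in 4 bits
        byte first = (byte) (position & 0xFF);                  // low 8 bits of position
        byte second = (byte) (((position >> 4) & 0xF0) | storedLength);
        return new byte[] {first, second};
    }

    /** Recovers the 12-bit position from a packed pair. */
    static int position(byte[] pair) {
        return (pair[0] & 0xFF) | ((pair[1] & 0xF0) << 4);
    }

    /** Recovers the match length (3..18) from a packed pair. */
    static int length(byte[] pair) {
        return (pair[1] & 0x0F) + THRESHOLD + 1;
    }
}
```

Under this layout, position 0xABC with length 18 packs to the bytes 0xBC and 0xAF, and unpacking returns the original values.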

See Also:
Constant Field Values

THRESHOLD

private static final int THRESHOLD
It takes 2 bytes to store an offset and a length. If a character sequence only requires 1 or 2 characters to store uncompressed, then it is better to store it uncompressed than as an offset into the ring buffer.

See Also:
Constant Field Values

NOT_USED

private static final short NOT_USED
Used to mark nodes as not used.

See Also:
Constant Field Values

ringBuffer

private byte[] ringBuffer
A text buffer. It contains "nodes" of uncompressed text that can be indexed by position. That is, a substring of the ring buffer can be indexed by a position and a length. When decoding, the compressed text may contain a position in the ring buffer and a count of the number of bytes from the ring buffer that are to be moved into the uncompressed buffer.

This ring buffer is not maintained as part of the compressed text. Instead, it is reconstructed dynamically. That is, it starts out empty and gets built as the text is decompressed.

The ring buffer contains RING_SIZE bytes, with an additional MAX_STORE_LENGTH - 1 bytes to facilitate string comparison.
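
How a decoder rebuilds the ring buffer while it consumes position/length pairs can be sketched as follows. This is a simplified illustration, not the class's uncompress method: the starting write position of 0, the initial fill with spaces, and the names RingCopySketch, literal, copy, and result are all assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Illustrative only: rebuilding the ring buffer during decoding. */
final class RingCopySketch {
    static final int RING_SIZE = 4096;
    static final int RING_WRAP = RING_SIZE - 1;

    private final byte[] ring = new byte[RING_SIZE];
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int writePos; // next ring slot to fill; starting at 0 is an assumption

    RingCopySketch() {
        Arrays.fill(ring, (byte) ' '); // the ring initially holds space characters
    }

    /** Emits one uncompressed byte and records it in the ring. */
    void literal(byte b) {
        out.write(b);
        ring[writePos] = b;
        writePos = (writePos + 1) & RING_WRAP;
    }

    /** Emits length bytes starting at position in the ring, one byte at a time. */
    void copy(int position, int length) {
        for (int i = 0; i < length; i++) {
            // Copying bytewise lets a match overlap the bytes it is producing.
            literal(ring[(position + i) & RING_WRAP]);
        }
    }

    byte[] result() {
        return out.toByteArray();
    }
}
```

Feeding the literals 'a', 'b', 'c' and then the pair (position 0, length 6) yields "abcabcabc": the last three bytes of the copy are read from ring slots that the same copy has just written.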


matchPosition

private short matchPosition
The position in the ring buffer. Used by insertNode.


matchLength

private short matchLength
The number of characters in the ring buffer at matchPosition that match a given string. Used by insertNode.


dad

private short[] dad
leftSon, rightSon, and dad are the Japanese way of referring to a tree structure. The dad is the parent and it has a right and left son (child).

For i = 0 to RING_SIZE-1, rightSon[i] and leftSon[i] will be the right and left children of node i.

For i = 0 to RING_SIZE-1, dad[i] is the parent of node i.

For i = 0 to 255, rightSon[RING_SIZE + i + 1] is the root of the tree for strings that begin with the character i. Note that this requires one byte characters.

These nodes store values of 0...(RING_SIZE-1). Memory requirements can be reduced by using 2-byte integers instead of full 4-byte integers (for 32-bit applications). Therefore, these are defined as "shorts."
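
The index layout described above can be sketched as three parallel arrays. The array sizes, the sentinel value chosen for NOT_USED, and the names TreeArraysSketch and rootIndex are assumptions for illustration; this page does not state them.

```java
import java.util.Arrays;

/** Illustrative only: parallel short arrays holding 256 binary search trees. */
final class TreeArraysSketch {
    static final short RING_SIZE = 4096;
    // Assumed sentinel: any value outside 0..RING_SIZE-1 can mark an empty slot.
    static final short NOT_USED = RING_SIZE;

    final short[] dad = new short[RING_SIZE + 1];
    final short[] leftSon = new short[RING_SIZE + 1];
    // Extra 256 slots hold one root per possible first byte.
    final short[] rightSon = new short[RING_SIZE + 257];

    /** Marks every parent link and every root as empty. */
    void initTree() {
        Arrays.fill(dad, 0, RING_SIZE, NOT_USED);
        Arrays.fill(rightSon, RING_SIZE + 1, RING_SIZE + 257, NOT_USED);
    }

    /** Index in rightSon of the root for strings that begin with firstChar. */
    static int rootIndex(byte firstChar) {
        return RING_SIZE + 1 + (firstChar & 0xFF); // mask to treat the byte as 0..255
    }
}
```

Storing the roots past the end of the node range means a root can be addressed like any other rightSon entry, so insertion does not need a separate root table.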


leftSon

private short[] leftSon

rightSon

private short[] rightSon

out

private ByteArrayOutputStream out
The output stream containing the result.

Constructor Detail

LZSS

public LZSS(InputStream input)
Create an LZSS that is capable of transforming the input.

Parameters:
input - to compress or uncompress.
Method Detail

compress

public ByteArrayOutputStream compress()
                               throws IOException
Description copied from interface: Compressor
Compresses the input and provides the result.

Returns:
the compressed result
Throws:
IOException

uncompress

public ByteArrayOutputStream uncompress()
                                 throws IOException
Description copied from interface: Compressor
Uncompresses the input and provides the result.

Returns:
the uncompressed result
Throws:
IOException

uncompress

public ByteArrayOutputStream uncompress(int expectedSize)
                                 throws IOException
Description copied from interface: Compressor
Uncompresses the input and provides the result.

Parameters:
expectedSize - the size of the result buffer
Returns:
the uncompressed result
Throws:
IOException

initTree

private void initTree()
Initializes the tree nodes to "empty" states.


insertNode

private void insertNode(short pos)
Inserts a string from the ring buffer into one of the trees. It loads the match position and length member variables for the longest match.

The string to be inserted is identified by the parameter pos. A full MAX_STORE_LENGTH bytes are inserted. So, ringBuffer[pos ... pos+MAX_STORE_LENGTH-1] are inserted.

If the matched length is exactly MAX_STORE_LENGTH, then an old node is removed in favor of the new one (because the old one will be deleted sooner).

Parameters:
pos - plays a dual role. It is used as both a position in the ring buffer and also as a tree node. ringBuffer[pos] defines a character that is used to identify a tree node.

deleteNode

private void deleteNode(short node)
Remove a node from the tree.

Parameters:
node - the node to remove

Copyright © 2003-2011