71 changes: 0 additions & 71 deletions README

This file was deleted.

139 changes: 139 additions & 0 deletions README.md
@@ -0,0 +1,139 @@
Smaz
=========================================

Compression for very small strings
----------------------------------

Smaz is a simple compression library suitable for compressing very short
strings. General-purpose compression libraries build the state needed for
compression dynamically, so that they can compress every kind of data. This
is a very good idea in general, but it fails for one specific problem: very
small strings, which such libraries cannot compress.

Smaz, by contrast, is not good at compressing general-purpose data, but it can
compress text by 40-50% in the average case (it works better with English), and
it can achieve a bit of compression for HTML and URLs as well. The important
point is that Smaz is able to compress even strings of two or three bytes!

For example the string "the" is compressed into a single byte.

To compare this with other libraries, consider that a library like zlib will
usually not be able to compress text shorter than 100 bytes.

Compression Examples
--------------------

* <code>'This is a small string'</code> compressed by 50%
* <code>'foobar'</code> compressed by 34%
* <code>'the end'</code> compressed by 58%
* <code>'not-a-g00d-Exampl333'</code> enlarged by 15%
* <code>'Smaz is a simple compression library'</code> compressed by 39%
* <code>'Nothing is more difficult, and therefore more precious, than to be able to decide'</code> compressed by 49%
* <code>'this is an example of what works very well with smaz'</code> compressed by 49%
* <code>'1000 numbers 2000 will 10 20 30 compress very little'</code> compressed by 10%

In general, lowercase English works very well. Strings containing a lot of
numbers compress poorly. Other languages compress pretty well too. The
following examples are Italian, which is not very similar to English but is
still compressible by Smaz:

* <code>'Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura'</code> compressed by 33%
* <code>'Mi illumino di immenso'</code> compressed by 37%
* <code>'L'autore di questa libreria vive in Sicilia'</code> compressed by 28%

It can compress URLs pretty well:

* <code>'http://google.com'</code> compressed by 59%
* <code>'http://programming.reddit.com'</code> compressed by 52%
* <code>'http://github.com/antirez/smaz/tree/master'</code> compressed by 46%
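
The percentages above can be read as the share of bytes saved relative to the input length. The following is a minimal sketch of that arithmetic, assuming integer division; the helper name `percent_saved` is hypothetical and is not part of the library.

```c
#include <assert.h>

/* Percentage saved when `inlen` input bytes become `outlen` output bytes,
 * using integer arithmetic (an assumption of this sketch). A negative
 * result means the output is larger than the input. */
int percent_saved(int inlen, int outlen) {
    assert(inlen > 0 && outlen >= 0);
    return 100 - (100 * outlen) / inlen;
}
```

Under these assumptions, a 22-byte input that shrinks to 11 bytes gives 50, and a 20-byte input that grows to 23 bytes gives -15, which would be reported as "enlarged by 15%".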

Usage
-----

**Compression:**

The compression function is:

```cpp
int smaz_compress(struct SmazBranch *trie, char *in, int inlen, char *out, int outlen);
```

This compresses the buffer 'in' of length 'inlen' and puts the compressed data into
'out', which can hold at most 'outlen' bytes. If the output buffer is too short to hold
the whole compressed string, outlen+1 is returned. Otherwise the length of the
compressed string (less than or equal to outlen) is returned.

The first parameter is the lookup trie used for compression. The default one can be generated with:

```cpp
struct SmazBranch *smaz_build_trie();
```
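
To illustrate why a trie suits this kind of lookup, here is a toy trie that finds the longest entry matching the start of a string, assuming that is what the compressor needs at each input position. It is an illustration only: `ToyNode` and the `toy_*` functions are hypothetical names, and the layout of the real `struct SmazBranch` may be quite different.

```c
#include <assert.h>
#include <stdlib.h>

/* One node per byte; children are indexed by the next byte value. */
typedef struct ToyNode {
    struct ToyNode *child[256];
    int code;                    /* codebook index ending here, or -1 */
} ToyNode;

ToyNode *toy_new(void) {
    ToyNode *n = malloc(sizeof *n);
    int i;
    if (n == NULL) return NULL;
    for (i = 0; i < 256; i++) n->child[i] = NULL;
    n->code = -1;
    return n;
}

/* Adds `word` with the given code; returns 0, or -1 if allocation fails. */
int toy_insert(ToyNode *root, const char *word, int code) {
    ToyNode *n = root;
    assert(root != NULL && word != NULL);
    for (; *word != '\0'; word++) {
        unsigned char c = (unsigned char)*word;
        if (n->child[c] == NULL) {
            n->child[c] = toy_new();
            if (n->child[c] == NULL) return -1;
        }
        n = n->child[c];
    }
    n->code = code;
    return 0;
}

/* Walks the trie along `in`. Returns the code of the longest entry that is
 * a prefix of `in` and stores its length in *matched, or returns -1 (with
 * *matched set to 0) if no entry matches. */
int toy_longest_match(const ToyNode *root, const char *in, int inlen, int *matched) {
    const ToyNode *n = root;
    int best = -1, i;
    *matched = 0;
    for (i = 0; i < inlen; i++) {
        n = n->child[(unsigned char)in[i]];
        if (n == NULL) break;
        if (n->code >= 0) { best = n->code; *matched = i + 1; }
    }
    return best;
}

/* Frees a trie built with toy_new/toy_insert. */
void toy_free(ToyNode *n) {
    int i;
    if (n == NULL) return;
    for (i = 0; i < 256; i++) toy_free(n->child[i]);
    free(n);
}
```

The lookup touches each input byte at most once, so its cost depends on the length of the match rather than on the number of codebook entries.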

Alternatively, you can provide a custom codebook with:

```cpp
struct SmazBranch *smaz_build_custom_trie(char *codebook[254]);
```

*Note:* If you are using a custom codebook, be sure not to have any entries exceeding
11 characters in length.
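
The length limit can be checked before building the trie. This is a minimal sketch: `codebook_first_bad_entry` and the two macros are hypothetical names, and treating a NULL entry as invalid is an assumption of the sketch rather than something the library documents.

```c
#include <assert.h>
#include <string.h>

#define CODEBOOK_SIZE 254   /* array size from the prototype above */
#define MAX_ENTRY_LEN 11    /* limit stated in the note above */

/* Returns the index of the first entry that is NULL or longer than
 * MAX_ENTRY_LEN, or -1 if every entry passes the check. */
int codebook_first_bad_entry(char *codebook[CODEBOOK_SIZE]) {
    int i;
    assert(codebook != NULL);
    for (i = 0; i < CODEBOOK_SIZE; i++) {
        if (codebook[i] == NULL || strlen(codebook[i]) > MAX_ENTRY_LEN)
            return i;
    }
    return -1;
}
```

A caller would run this check first and only pass the codebook to `smaz_build_custom_trie` when the result is -1.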

The original reference implementation of Smaz compression is included for testing
and benchmarking comparison purposes:

```cpp
int smaz_compress_ref(char *in, int inlen, char *out, int outlen);
```
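
The outlen+1 convention described above makes it possible to retry with a larger buffer. The sketch below is written against a function pointer with the same shape as `smaz_compress_ref`, so it compiles without the library. `compress_fn`, `compress_alloc`, and `copy_stub` are hypothetical names, and `copy_stub` is only a stand-in that copies its input while following the same return convention; it does not compress anything.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as smaz_compress_ref in this README. */
typedef int (*compress_fn)(char *in, int inlen, char *out, int outlen);

/* Stand-in (NOT Smaz): copies the input unchanged and returns outlen+1
 * when the output buffer is too short. */
int copy_stub(char *in, int inlen, char *out, int outlen) {
    if (inlen > outlen) return outlen + 1;
    memcpy(out, in, (size_t)inlen);
    return inlen;
}

/* Calls `fn`, doubling the output buffer until the result fits.
 * On success, stores a malloc'ed buffer in *out and returns the number of
 * bytes used; returns -1 if allocation fails or the size would overflow.
 * The caller frees *out. */
int compress_alloc(compress_fn fn, char *in, int inlen, char **out) {
    int cap = inlen / 2 + 1;                 /* optimistic first guess */
    assert(fn != NULL && in != NULL && out != NULL && inlen >= 0);
    for (;;) {
        char *buf = malloc((size_t)cap);
        int n;
        if (buf == NULL) return -1;
        n = fn(in, inlen, buf, cap);
        if (n <= cap) { *out = buf; return n; }   /* result fits */
        free(buf);                                /* got cap+1: too short */
        if (cap > INT_MAX / 2) return -1;
        cap *= 2;
    }
}
```

With the library linked, `smaz_compress_ref` or `smaz_decompress` could be passed in place of `copy_stub`. The trie-based `smaz_compress` takes an extra first argument and would need a small wrapper.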

**Decompression:**

To decompress with the default codebook:

```cpp
int smaz_decompress(char *in, int inlen, char *out, int outlen);
```

Or if you are using a custom codebook:

```cpp
int smaz_decompress_custom(char *cb[254], char *in, int inlen, char *out, int outlen);
```

These decompress the buffer 'in' of length 'inlen' and put the decompressed data into
'out', which can hold at most 'outlen' bytes. If the output buffer is too short to hold
the whole decompressed string, outlen+1 is returned. Otherwise the length of the
decompressed string (less than or equal to outlen) is returned. These functions do
not automatically put a null terminator at the end of the string if the original
string that was compressed did not include one.

smaz_test
---------

smaz_test.c contains some simple tests and comparative benchmarks between the reference
implementation and the trie implementation.

The provided makefile should take care of compilation. Running the tests will take up
about a gigabyte of RAM, as some tests pre-generate large numbers of strings.


Trie speed improvement
----------------------

These are just some rough numbers measured on my machine.

For very compressible data, the new implementation appears ~2.2x faster than the
reference implementation.

Basic English strings should see something around a ~2.6x speed improvement.

For random textual strings you can get somewhere around a 4.9x speed increase.


Credits
-------

Smaz was written by Salvatore Sanfilippo and is released under the 3-clause BSD license.
Check the COPYING file for more information.

Trie-based implementation by Richard Johnson, released under the same BSD license.

4 changes: 0 additions & 4 deletions TODO

This file was deleted.
