lrzip v0.23

Long Range ZIP or Lzma RZIP

This is a compression program optimised for large files. The larger the file
and the more memory you have, the better the compression advantage this will
provide, especially once the files are larger than 100MB. The advantage can
be chosen to be either size (much smaller than bzip2) or speed (much faster
than bzip2). Decompression is always much faster than bzip2.

Lrzip uses an extended version of rzip which does a first pass long distance
redundancy reduction. The lrzip modifications make it scale according to
memory size.
The data is then either:
1. Compressed by lzma (default) which gives excellent compression
at approximately half the speed of bzip2 compression
2. Compressed by lzo which on most machines compresses faster than disk
writing, making it as fast as (or even faster than) simply copying a large file
3. Left uncompressed but rzip prepared. This form substantially improves
any compression performed on the resulting file in both size and speed (due to
the nature of rzip preparation merging similar compressible blocks of data and
creating a smaller file).
4. Compressed by bzip2 as an rzip-like compression format.

The major disadvantages are:
1. It only works on single files. To get the best performance out of the
compression it is best to tarball all your files together.
2. It requires a lot of memory to get the best performance, and is not
really usable (for compression) with less than 256MB. Decompression requires
very little ram and works on small ram machines.
3. Does not work on stdin/stdout.

(See file Current-Benchmarks.txt for updated information)

Example on a 1GB ram P4 3GHz: 

A tarball of a fully compiled kernel tree:
		Size		Compression 	Decompression
base file:	646963200
gzip		218071923	1:27.27		0:45.39	
bzip2		192484690	4:41.62		1:41.20
bzip2 -1	215555795	3:24.08		1:21.45
bzip2 -9	192484690	4:53.18		1:31.40
lzma		112229937	11:48.07	0:56.38
lzma -9		97704505	27:18.77	?
lrzip		88560021	10:11.28	0:57.88
lrzip -l	191415649	0:30.19		0:50.69
lrzip -M	82708048	11:45.79	1:00.75
lrzip -n	389125460	0:31.02		0:58.9

Summary: 	Ratio		Value(Ratio/Time)
gzip		2.97		2.048
bzip2		3.36		0.717
bzip2 -1	3.00		0.883
bzip2 -9	3.36		0.688
lzma		5.76		0.488
lzma -9		6.62		0.242
lrzip		7.31		0.718
lrzip -l	3.38		6.760 *
lrzip -M	7.82 *		0.666
lrzip -n	1.66		3.222
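
The Summary columns can be approximately reproduced from the first table.
The sketch below assumes Ratio is base size divided by compressed size, and
that Value is Ratio divided by the compression time in minutes. That reading
is inferred from the numbers, not stated anywhere. The names BASE_SIZE,
RESULTS, minutes and summarise are made up for this illustration, and the
results differ slightly from the table because of rounding.

```python
# Sizes in bytes and compression times (m:ss.ss) copied from the table above.
BASE_SIZE = 646963200
RESULTS = {
    "gzip": (218071923, "1:27.27"),
    "bzip2": (192484690, "4:41.62"),
    "lzma": (112229937, "11:48.07"),
    "lrzip": (88560021, "10:11.28"),
    "lrzip -l": (191415649, "0:30.19"),
}


def minutes(stamp):
    """Convert an m:ss.ss time stamp into fractional minutes."""
    mins, secs = stamp.split(":")
    return int(mins) + float(secs) / 60


def summarise(size, stamp):
    """Return (ratio, value), assuming value = ratio / compression minutes."""
    ratio = BASE_SIZE / size
    return ratio, ratio / minutes(stamp)


for name, (size, stamp) in RESULTS.items():
    ratio, value = summarise(size, stamp)
    print(f"{name:10s} {ratio:5.2f} {value:6.3f}")
```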


Requires:
liblzo2-dev
libbz2-dev
libz-dev
libm

To build/install:
./configure
make
make install


FAQS.

Q. How do I make a static build?
A. make static

Q. I want the absolute maximum compression I can possibly get, what do I do?
A. Try the -M option. Note it will use all available ram so expect serious
swapping to occur. It may even fail to run if you do not have enough swap
space allocated. Why? Well the more ram lrzip uses the better the compression
it can achieve.

Q. Can I use your tool for even more compression than lzma offers?
A. Yes, the rzip preparation of files makes them more compressible by every
other compression technique I have tried. Using the -n option will generate
a .lrz file smaller than the original which should be more compressible, and
since it is smaller it will compress faster than it otherwise would have.

Q. How about 64bit?
A. As of v0.15 64 bit is working well, but the lzo library may give grief due
to naming differences.

Q. Other operating systems?
A. Patches are welcome. The configure/build system works only on linux at the
moment, but a darwin specific Makefile without configure is included that
should work.

Q. Can it be made to work on stdin/stdout?
A. The rzip design basically works in a way that makes this virtually
impossible.

Q. Really, why can't I use stdin/stdout?
A. Well the first compression stage (rzip) takes the largest chunk of the
file your ram can fit and completely reorders all the data in it. Then it
hands over the data in chunks to the compressor. Then it is written to disk.
So theoretically for stdin it could buffer all input till it filled the
chunk size and then start compressing. So adding stdin would not be too big
a stretch. On the other side though, with stdout, the data cannot be
fed to anything till it is completely decompressed and re-ordered into the
original chunk size. Theoretically we could decompress a whole chunk in ram,
reorder it and then start piping it to stdout. This would mean the
decompression ram requirements would almost be as big as the compression
requirements which makes it not portable to machines with less ram. Currently
lrzip uses extraordinarily little amounts of ram on decompression, and is
very fast. Adding stdout support would cancel both of those advantages. The
other option for supporting stdin/stdout is to do each chunk to a separate
file and then feed it. None of these are particularly desirable or practical.
Since stdout support is impractical, there is no point implementing just
stdin.

Q. I still want stdin/stdout?
A. I take patches.

Q. I have another compression format that is even better than lzma, can you
use that?
A. You can use it yourself on rzip prepared files (see above). Alternatively
if the source code is compatible with the GPL license it can be added to the
lrzip source code. Libraries with functions similar to compress() and
decompress() functions of zlib would make the process most painless. Please
tell me if you have such a library so I can include it :)

Q. What's this "Progress percentage pausing during lzma compression" message?
A. While I'm a big fan of progress percentage being visible, unfortunately
lzma compression can't currently be tracked when handing 100+MB chunks
over to the lzma library. Therefore you'll see progress percentage only until
each chunk is handed over to the lzma library. lzo, bzip2 and no compression
don't have this problem and show progress continuously.

Q. What's this "lzo testing for incompressible data" message?
A. The lzma compression is the slowest compression technique in lrzip, and
lzo is the fastest. To help speed up the process, lzo compression is
performed on the data first to test that the data is at all compressible. If
a small block of data is not compressible, it tests progressively larger
blocks until it has tested all the data (if it fails to compress at all). If
no compressible data is found, then lzma compression is not even attempted.
This can save a lot of time during the compression phase when there is
incompressible data. It also works around a known bug where incompressible
data gets the lzma compression library stuck in an endless loop. Theoretically
it may be possible that data is compressible by lzma and not at all by lzo,
but in practice such data achieves only minuscule amounts of compression
which are not worth pursuing. Most of the time it is clear one way or the
other that data is compressible or not.
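
As a rough illustration of the test described above, here is a minimal
sketch. It is not lrzip's actual code. It uses zlib at its fastest level as
a stand-in for lzo (lzo is not in the Python standard library), and the
function name looks_compressible and the 4096 byte starting block are
assumptions made up for this example.

```python
import os
import zlib


def looks_compressible(data, first_block=4096):
    """Try a small block first, then progressively larger ones.

    Returns True as soon as any tested block shrinks, and False only once
    the whole buffer has been tested without shrinking.
    """
    block = first_block
    while True:
        sample = data[:block]
        # zlib level 1 stands in for the fast lzo pass (assumption).
        if len(zlib.compress(sample, 1)) < len(sample):
            return True
        if block >= len(data):
            return False  # everything tested, nothing compressed
        block *= 2


# Repetitive data passes at once; random data fails after the full scan,
# in which case the slow compressor would be skipped.
print(looks_compressible(b"abc" * 10000))
print(looks_compressible(os.urandom(100000)))
```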

Q. I have truckloads of ram so I can compress files much better, but can my
generated file be decompressed on machines with less ram?
A. Yes. Ram requirements for decompression go up only by the -L compression
option with lzma and are never anywhere near as large as the compression
requirements.

Q. Any plans to turn this into a complete archiver?
A. Not really. The compression format relies on being fed large files, and
tar does a good job of this already. Maybe I should include a script with
lrzip that automates what tar+lrzip does.

Q. I've changed the compression level with -L in combination with -l and the
file size doesn't vary?
A. That's right, -l only has one compression level.

Q. Help? I'm a newbie and have no idea how to turn my directory into a
tarball!
A. Here is a walkthrough for a directory called myfiles
to compress:
	tar cf myfiles.tar myfiles
	lrzip myfiles.tar
this will create a file called myfiles.tar.lrz
to extract:
	lrzip -d myfiles.tar.lrz
	tar xf myfiles.tar
will create and extract everything into a directory called myfiles
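
The compression half of this walkthrough could also be automated. The
following is only a sketch of such a helper, not something shipped with
lrzip. The function names make_tarball and lrzip_directory are made up
here, and lrzip_directory assumes an lrzip binary is available in PATH.

```python
import subprocess
import tarfile
from pathlib import Path


def make_tarball(directory):
    """Equivalent of `tar cf myfiles.tar myfiles`; returns the tarball path."""
    directory = Path(directory)
    tar_path = directory.with_name(directory.name + ".tar")
    # Mode "w" writes a plain, uncompressed tar, leaving compression to lrzip.
    with tarfile.open(str(tar_path), "w") as tar:
        tar.add(str(directory), arcname=directory.name)
    return tar_path


def lrzip_directory(directory):
    """Tar the directory, then run `lrzip` on the tarball."""
    tar_path = make_tarball(directory)
    subprocess.run(["lrzip", str(tar_path)], check=True)
    return Path(str(tar_path) + ".lrz")
```

Calling lrzip_directory("myfiles") would then perform both steps and
return the path myfiles.tar.lrz.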

Q. Why are you including bzip2 compression?
A. To maintain a similar compression format to the original rzip (although the
other modes are more useful).

Q. What about multimedia?
A. Most multimedia is already in a heavily compressed "lossy" format which by
its very nature has very little redundancy. This means that there is not
much that can actually be compressed. If your video/audio/picture is in a
high bitrate, there will be more redundancy than a low bitrate one making it
more suitable to compression. None of the compression techniques in lrzip are
optimised for this sort of data. However, the nature of rzip preparation
means that you'll still get better compression than most normal compression
algorithms give you if you have very large files. ISO images of dvds for
example are best compressed directly instead of individual .VOB files.

Q. Is this multithreaded?
A. As of version 0.21, the answer is yes for lzma compression only thanks to a
multithreaded lzma library. However I have not found the gains to scale well
with number of cpus, but there are definite performance gains with more cpus.

Q. This uses heaps of memory, can I make it use less?
A. Well you can by setting -w to the lowest value (1) but the huge use of
memory is what makes the compression better than ordinary compression
programs so it defeats the point. You'll still derive benefit with -w 1 but
not as much.

Q. What CFLAGS should I use?
A. With a recent enough compiler (gcc>4) setting both CFLAGS and CXXFLAGS to
	-O3 -march=$archname -fomit-frame-pointer
and putting your architecture into $archname (like pentium4) causes noticeable
speed improvements with lzma without risk of breakage. Because of the c++
code used in lzma, -O3 actually does give demonstrable advantage over -O2
(unlike most c programs). Newest compilers take -march=native without needing
to specify the architecture.

Q. What compiler does this work with?
A. It has been tested on gcc, ekopath and the intel compiler successfully.
Whether the commercial compilers help or not, I could not tell you.

Q. What codebase are you basing this on?
A. rzip v2.1 and lzma sdk443, but it should be possible to stay in sync with
each of these in the future.

Q. Do we really need yet another compression format?
A. It's not really a new one at all; simply a reimplementation of a few very
good performing ones that will scale with memory and file size.

Q. How do you use lrzip yourself?
A. Two basic uses. I compress large files currently on my drive with the
-l option since it is so quick to get a space saving, and when archiving
data for permanent storage I compress it with the default options.

Q. I found a file that compressed better with plain lzma. How can that be?
A. When the file is more than 5 times the size of the compression window
you have available, the efficiency of rzip preparation drops off as a means
of getting better compression. Eventually when the file is large enough,
plain lzma compression will get better ratios. The lrzip compression will be
a lot faster though. Currently I have no way around this problem without
throwing more and more ram at the compression because trying to do this off
disk (whether directly on the file or from swap) will mean the file is read
a ridiculous number of times over and over again. It presents an interesting
problem for which there is no perfect solution but it certainly has us
thinking hard about how to tackle it.

Q. Can I use swapspace as ram for lrzip with a massive window?
A. No. To make lrzip work completely from disk would make the data be read
off disk an unrealistic number of times over again and again. For example, if
you have 1GB of ram and a 2GB file to compress, it might read the file a
billion times off disk. Most hard drives would fail in that time :) See the
previous question. Update: I have been informed that people have successfully
done this without destroying their hard drives. They had to be _very_ patient,
but it didn't take as long as I had predicted.

Q. Why do you nice it to +19 by default? Can I speed up the compression by
changing the nice value?
A. This is a common misconception about what nice values do. They only tell the
cpu process scheduler how to prioritise workloads, and if your application is
the _only_ thing running it will be no faster at nice -20 nor will it be any
slower at +19.

Q. What is the Threshold option, -T ## (1-10)?
A. It is for adjusting the sensitivity of the LZO test that is used when LZMA
compression is selected. When highly random or already-compressed data chunks
are evaluated for LZMA compression, sometimes LZO compression actually will
create a larger chunk than the original. If this data chunk is passed to the
LZMA compressor, it will take an extremely long time or hang until the
program is aborted.

The Threshold is used to determine a minimum compression amount relative to
the size of the data being evaluated. A value of 2 is the default. This
means that the compression threshold amount is >5% of the size of the
original data. If the threshold is not achieved, the LZMA compression will not
be done and the chunk will not be compressed. Values can be from 1 (little
or no compression expected, up to 5%) to 10 (maximum compression efficiency
expected). The following table can be used.

For LZO compressor test
T value		Compression %	Compression Ratio
  1		    0-5%	     1.00-1.05	very low compression expected
  2		    5-10%	     1.05-1.11	default value
  3		    10-20%	     1.11-1.25
  4		    20-30%	     1.25-1.43
  5		    30-40%	     1.43-1.66
  6		    40-50%	     1.66-2.00
  7		    50-60%	     2.00-2.50
  8		    60-70%	     2.50-3.33
  9		    70-80%	     3.33-5.00
  10		    80+%	     5.00+

Whenever the data chunk does not compress to the Threshold value, no LZMA
compression will be attempted. For example, if you select -T 5, LZMA
compression will be performed only if the projected compression ratio
exceeds 1.43. Otherwise, data will be written in rzip format. Setting
a very high T value will result in a lot of uncompressed data in the lrzip
file. However, a lot of time will be saved. Most people should never
need to touch this.
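
To make the relationship between T values, compression percentages and
ratios concrete, here is a small sketch. It reads the lower bound of each
percentage band off the table above and is not taken from the lrzip source.
The names MIN_SAVING_PERCENT, min_ratio and passes_threshold are made up
for this example, and the strict "greater than" comparison is an assumption
based on the ">5%" wording for the default.

```python
# Lower bound of the "Compression %" band for each T value (from the table).
MIN_SAVING_PERCENT = {1: 0, 2: 5, 3: 10, 4: 20, 5: 30,
                      6: 40, 7: 50, 8: 60, 9: 70, 10: 80}


def min_ratio(t_value):
    """Compression ratio matching the minimum saving: 1 / (1 - saving)."""
    saving = MIN_SAVING_PERCENT[t_value] / 100
    return 1 / (1 - saving)


def passes_threshold(original_size, lzo_size, t_value=2):
    """True if the LZO test saved more than the threshold percentage.

    When this is False, LZMA would be skipped for that chunk.
    """
    saved = original_size - lzo_size
    # Integer arithmetic avoids floating point trouble at the band edges.
    return saved * 100 > MIN_SAVING_PERCENT[t_value] * original_size


for t in sorted(MIN_SAVING_PERCENT):
    print(f"-T {t:2d}: saving > {MIN_SAVING_PERCENT[t]:2d}%, "
          f"ratio > {min_ratio(t):.2f}")
```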

Q. Compression and decompression progress on large archives slows down and
speeds up. There's also a jump in the percentage at the end?
A. Yes, that's the nature of the compression/decompression mechanism. The jump
is because the rzip preparation makes the amount of data that the compression
backend (lzma) needs to compress much smaller.

Q. I'm terrified that my compressed data may be corrupted and there is no test
function. How can I test the integrity of the data?
A. Use md5sum. Here is a walkthrough:

lrzip inputfile
lrzip -d -o test_outputfile inputfile.lrz
md5sum inputfile
	c5f74ca56f0b4ac8b61070d11d712145 inputfile
md5sum test_outputfile
	c5f74ca56f0b4ac8b61070d11d712145 test_outputfile

The values given are examples only. If they match, then the integrity can be
guaranteed.
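
The same comparison can be scripted. The sketch below uses Python's hashlib
to compare the original file with the decompressed copy. The function names
md5_of and same_contents are made up for this example, and the file names
in the usage note are the ones from the walkthrough above.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, reading it in 1MB pieces."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True if both files produce the same md5 digest."""
    return md5_of(path_a) == md5_of(path_b)
```

For example, same_contents("inputfile", "test_outputfile") should return
True after a successful compress and decompress round trip.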

Q. Tell me about patented compression algorithms, GPL, lawyers and copyright.
A. No


LIMITATIONS
There are still some serious limitations on window size and the possible
compression performance on machines with more than 4GB of ram due to many
32 bit restrictions. These exist on 64bit builds as well for the time being.


BUGS:
Probably lots.


Links:
rzip:
http://rzip.samba.org/
lzo:
http://www.oberhumer.com/opensource/lzo/
lzma:
http://www.7-zip.org/


Thanks to Andrew Tridgell for rzip. Thanks to Markus Oberhumer for lzo.
Thanks to Igor Pavlov for lzma. Thanks to Jean-loup Gailly and Mark Adler
for the zlib compression library. Thanks to Christian Leber for lzma
compat layer, Michael J Cohen for Darwin support, Lasse Collin for fix
to LZMALib.cpp and for Makefile.in suggestions, and everyone else who coded
along the way. Huge thanks to Peter Hyman for most of the 0.19 changes
onwards, and the update to the multithreaded lzma library and all sorts of
other features.


Con Kolivas <kernel@kolivas.org>
Fri, 21 Mar 2008
