Monday, January 28, 2013

Comparison of various compression algorithms

    Although there are already several pages comparing compression algorithms, they are based on multimedia data (which is already compressed), random data, or data that does not represent a real backup process, so most of them are not useful when compared with real scenarios such as effectively compressing an operating system image. Also, they often measure CPU user time rather than total elapsed (wall-clock) time, so the data is not really representative.
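
    As a note on methodology, all three measurements (elapsed time, CPU usage, and peak memory) can be captured in a single run with GNU time's verbose mode. A minimal sketch, assuming the input file is named stage5.tar (the actual file name is not stated here):

        # Run a compressor under GNU time; -v prints a detailed report to stderr.
        /usr/bin/time -v xz -9 -c stage5.tar > stage5.tar.xz
        # The relevant fields in the report are:
        #   Elapsed (wall clock) time (h:mm:ss or m:ss)
        #   Percent of CPU this job got
        #   Maximum resident set size (kbytes)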

    In my case, I am creating a new stage5 Gentoo backup from scratch, so I am writing down the results here (compression time, resulting file size, and so on) so that others may find them useful for comparing against a real situation like this one.

    This is also relevant for other uses, like creating a LiveCD/DVD or any other case where size matters.

    The starting point is a tar file created from a full Gentoo installation (with KDE, LibreOffice, Firefox, NetBeans, and other programs I use daily), on a Core 2 Duo T9550 @ 2.66 GHz with 4 GiB RAM and an ext4 partition, running the Gentoo ~amd64 arch.
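
    For completeness, a minimal sketch of how such a tar file can be created from the running system follows; the exact command is not given above, so the destination path and exclude list are assumptions:

        # Archive the root filesystem, preserving permissions and skipping
        # pseudo-filesystems and the backup destination itself.
        tar -cpf /mnt/backup/stage5.tar \
            --exclude=/proc --exclude=/sys --exclude=/dev \
            --exclude=/tmp --exclude=/mnt /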
 
    Here is the results table:

Algorithm        Compress Time    Resulting     CPU        Peak MEM       Compress Ratio
(Version)        (hh:mm:ss)       Size (GiB)    Usage (%)  Usage (KiB)    (compressed/original)
--------------   --------------   -----------   ---------  ------------   ---------------------
TAR (1.26)       00:17:51.93 ¹    6.4623          2            1780       1.0000
GZIP (1.5)       00:17:37.07      2.3228         93             896       0.3594
BZIP2 (1.0.6)    00:17:51.82      2.1137         89            7996       0.3270
XZ (5.0.4)       01:42:16.00      1.5962         99          690332       0.2470
7ZIP (9.20)      00:52:22.65      1.6015        142          697004       0.2478
RAR (4.20)       00:17:50.12      1.8819        133          102864       0.2912
All compression algorithms were tested with their maximum compression level options. CPU usage values above 100% mean the compressor used more than one core.
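
    The exact command lines are not recorded above; the following are plausible maximum-compression invocations based on each tool's documented options (file names are illustrative):

        gzip  -9 -c stage5.tar > stage5.tar.gz      # gzip, level 9 (max)
        bzip2 -9 -c stage5.tar > stage5.tar.bz2     # bzip2, level 9 (max)
        xz    -9 -c stage5.tar > stage5.tar.xz      # xz, level 9 (max)
        7z a -t7z -mx=9 stage5.tar.7z stage5.tar    # 7-Zip, ultra preset
        rar a -m5 stage5.tar.rar stage5.tar         # RAR, best compression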
 
    With this table, and depending on your actual needs and constraints (a slow machine, a slow network when backing up over the network, etc.), you can choose the algorithm that best fits.
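
    For example, as a rough worked estimate assuming a 100 Mbit/s link (about 11.9 MiB/s): the raw 6.46 GiB tar would transfer in roughly 9.3 minutes; gzip's 2.32 GiB in about 3.3 minutes, after roughly 18 minutes of compression; and xz's 1.60 GiB in about 2.3 minutes, but only after 1 hour 42 minutes of compression. For a one-off transfer on such a link gzip minimizes total time, while xz only pays off on much slower links or when the same archive is transferred many times.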

    I hope it is useful to someone.

¹ Although TAR applies no compression, its run time is still significant because its I/O is more intensive than the compressors': it has to walk lots of files and directories recursively (a typical Linux installation has many), whereas the compressors only read from the already-created tar file.