Testing compression speed and ratio
The compression speed and ratio of several compression programs are tested.
Hardware
These tests were originally run on a machine with an Intel Core2 Q9300 CPU and a 150 MB/s hard disk, running FreeBSD 11.2. I've updated the figures for an Intel Core i7-7700 CPU and a WDC WD4002FYYZ-01B7CB1 hard disk (600 MB/s transfers). Data for the brotli compression program has been added.
Test data
There are two test files:

- An mbox file of spam (42432 KiB of text)
- A tarball of a LaTeX project (71680 KiB of text and binary data)
Tests
gzip
The gzip program was conceived as an alternative to the compress program, which was under threat from corporations holding patents on the LZW algorithm used in compress.
The gzip program supports nine levels of compression. Level 1 is supposed to be the fastest, while level 9 is supposed to offer the best compression; level 6 is the default. All these levels are tested once:
```sh
foreach n (`seq 1 9`)
    /usr/bin/time gzip -c -${n} foo > foo_${n}.gz
end
```
Afterwards, the size of the compressed files is checked:
```sh
du foo_*.gz
```
From this, we calculate the compression speed (MB/s) and the size reduction (%). The former is defined as original_size/time; the latter as (1 - compressed_size/original_size) * 100%. This is converted into a table.
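To make the arithmetic concrete, here is a minimal sketch of that calculation in the shell. The sizes and the elapsed time below are illustrative values, not measurements taken from the tables.

```sh
# Illustrative numbers, not measurements: sizes in KiB, elapsed
# time in seconds as reported by /usr/bin/time.
set orig=42432
set comp=20482
set t=0.67
awk -v o=$orig -v c=$comp -v t=$t 'BEGIN {
    printf "speed: %.2f MB/s\n", (o / 1024) / t
    printf "reduction: %.2f %%\n", (1 - c / o) * 100
}'
```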
| Level | Text: Speed [MB/s] | Text: Reduction [%] | Tarball: Speed [MB/s] | Tarball: Reduction [%] |
|---|---|---|---|---|
| 1 | 61.85 | 51.73 | 47.95 | 6.12 |
| 2 | 59.20 | 52.34 | 47.95 | 6.16 |
| 3 | 53.12 | 52.94 | 47.62 | 6.21 |
| 4 | 49.92 | 54.07 | 47.30 | 6.34 |
| 5 | 45.54 | 54.68 | 46.67 | 6.38 |
| 6 | 40.62 | 54.98 | 45.75 | 6.38 |
| 7 | 39.84 | 55.05 | 44.87 | 6.38 |
| 8 | 32.89 | 55.13 | 41.67 | 6.43 |
| 9 | 28.78 | 55.13 | 33.98 | 6.43 |
As expected, the tarball doesn’t compress as well or as fast as the text file.
The default compression level is 6. A lot of alternative compression programs support the same options as gzip, which makes testing them simpler; a generic version of the test loop is sketched below.
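For illustration, the per-program loops used in the following tests can be generalized as follows; prog and suffix are placeholders to fill in, not part of the original tests.

```sh
# Generic test loop; set prog and suffix for the compressor under
# test, e.g. prog=bzip2 and suffix=bz2. These are placeholders.
set prog=gzip
set suffix=gz
foreach n (`seq 1 9`)
    /usr/bin/time $prog -c -${n} foo > foo_${n}.$suffix
end
```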
The gzip test was also done on an M2 SSD. The times were almost identical, so it seems that gzip is CPU bound, not I/O bound.
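One rough way to verify this, assuming the BSD /usr/bin/time output format: compare the user and sys times against the real (wall-clock) time. If CPU time accounts for nearly all of the elapsed time, the process is CPU bound; a large gap would point at I/O waits.

```sh
# If user + sys is close to real, gzip spent its time computing,
# not waiting for the disk.
/usr/bin/time gzip -c -6 foo > foo_6.gz
#        1.02 real         0.98 user         0.03 sys  (illustrative output)
```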
bzip2
The test is similar to that of gzip:
```sh
foreach n (`seq 1 9`)
    /usr/bin/time bzip2 -c -${n} foo > foo_${n}.bz2
end
```
The data is gathered and converted in the same way:
| Level | Text: Speed [MB/s] | Text: Reduction [%] | Tarball: Speed [MB/s] | Tarball: Reduction [%] |
|---|---|---|---|---|
| 1 | 12.79 | 55.13 | 10.26 | 5.80 |
| 2 | 12.44 | 55.73 | 10.62 | 6.12 |
| 3 | 11.77 | 55.96 | 10.51 | 6.21 |
| 4 | 11.42 | 56.33 | 10.40 | 6.38 |
| 5 | 11.42 | 56.26 | 10.36 | 6.43 |
| 6 | 10.99 | 56.64 | 10.20 | 6.56 |
| 7 | 10.96 | 56.56 | 10.09 | 6.70 |
| 8 | 10.79 | 56.71 | 9.94 | 6.70 |
| 9 | 10.79 | 56.86 | 10.06 | 6.79 |
The compression ratio for both the text and the tarball is slightly better with bzip2, but its compression speed is much lower than that of gzip.
lzma
Again the test is similar:
```sh
foreach n (`seq 1 9`)
    /usr/bin/time lzma -c -${n} foo > foo_${n}.lzma
end
```
The resulting data:
| Level | Text: Speed [MB/s] | Text: Reduction [%] | Tarball: Speed [MB/s] | Tarball: Reduction [%] |
|---|---|---|---|---|
| 1 | 11.87 | 57.01 | 5.86 | 7.10 |
| 2 | 7.37 | 57.69 | 3.33 | 7.14 |
| 3 | 5.77 | 58.37 | 2.80 | 7.19 |
| 4 | 4.74 | 60.56 | 2.96 | 7.32 |
| 5 | 4.03 | 62.14 | 2.65 | 7.59 |
| 6 | 3.63 | 62.37 | 2.62 | 7.59 |
| 7 | 3.59 | 63.27 | 2.58 | 9.38 |
| 8 | 3.37 | 63.35 | 2.81 | 20.85 |
| 9 | 3.32 | 63.35 | 2.63 | 22.37 |
LZMA at level 1 already compresses better than bzip2 at level 9. The size reduction for the tarball at compression level 8 or 9 is impressive; presumably the much larger dictionary used at those levels lets the compressor find matches across the whole tarball.
brotli
The test command is slightly different in this case:

```sh
foreach n (`seq 1 11`)
    /usr/bin/time brotli -c -q ${n} foo > foo_${n}.br
end
```
The brotli program supports 11 compression levels, with level 11 being the default.
| Level | Text: Speed [MB/s] | Text: Reduction [%] | Tarball: Speed [MB/s] | Tarball: Reduction [%] |
|---|---|---|---|---|
| 1 | 345.31 | 54.30 | 179.49 | 6.16 |
| 2 | 180.16 | 63.65 | 152.17 | 10.04 |
| 3 | 147.99 | 63.88 | 179.49 | 10.13 |
| 4 | 67.93 | 57.32 | 95.89 | 7.54 |
| 5 | 51.80 | 57.77 | 106.06 | 7.59 |
| 6 | 36.03 | 61.39 | 71.43 | 8.79 |
| 7 | 15.40 | 60.26 | 20.77 | 7.77 |
| 8 | 6.59 | 61.92 | 7.14 | 7.81 |
| 9 | 1.55 | 62.37 | 3.36 | 8.04 |
| 10 | 1.10 | 62.82 | 0.77 | 9.11 |
| 11 | 0.53 | 62.97 | 0.47 | 9.42 |
It is noteworthy that for the given data, level 3 gives the best size reduction for both datasets, and does so at blistering speed.
Conclusions
The corpus of text can be reduced in size by slightly more than half. The baseline, gzip at its default setting, reduces the size by 55%. The best compressors (lzma and brotli) reach about 63%. The difference between the worst and the best compression can come at a huge computational cost: lzma at its best setting is roughly 12 times slower than gzip at its default (3.32 versus 40.62 MB/s on the text file).
For the tarball, lzma somewhat surprisingly takes the crown with respect to size reduction.
The new contender brotli is a surprise. At compression level 2 or 3 it is significantly faster than gzip while reaching the highest size reduction for the text and the second-highest size reduction for the tarball.
For comments, please send me an e-mail.