reduce gzip archive size if files are similar

+10

−0

The following users marked this post as Works for me:

User	Comment	Date
Matthias Braun‭	(no comment)	May 5, 2022 at 17:23

That Wikipedia quote is correct, yet somewhat wrong or at the very least misleading.

A key term to understand is the window size (sometimes called the block size) of a data compression flow.

By necessity, every compression algorithm works on chunks of data. These chunks of data can be large or small, and they can be of a fixed size or a configurable size or a dynamically adjusted size, but when all is said and done, these chunks are treated as largely independent units for the purpose of compression. The size of these chunks is the window size.

In most cases where gzip is used to compress multiple files into a single compressed archive, it's coupled with a separate archiver, such as tar, hence .tar.gz which by convention is often shortened to .tgz on systems that don't allow the full extension. (The gzip file format apparently allows for multiple files within a single gzip compressed file, but I don't think I've ever seen that used in practice.) By carefully controlling how the archiving process arranges files in its output (which becomes the input to gzip) it surely is possible to place redundant data sufficiently close enough together that the gzip compressor will be more likely to see those within the same window during compression and therefore be able to take advantage of that redundancy. Once the compressor has picked up on the compression candidate, it will be in the compression dictionary and therefore usable also for later blocks.

Compare and contrast the manual page for bzip2, which is quite explicit that the -1 through -9 switches select block sizes of the corresponding number of hundreds of kilobytes, causing the compressor not really to work harder, but to work on more uncompressed data at once.

See also zstd, which has specific tuning switches for this, including --long:

--long[=#]: enables long distance matching with # windowLog, if # is not present it defaults to 27. This increases the window size (windowLog) and memory usage for both the compressor and decompressor. This setting is designed to improve the compression ratio for files with long matches at a large distance.

(Passing --long=27 corresponds to setting a block size of 2²⁷ bytes = 128 MiB.)

What all this comes down to is basically that the results you got indicate that the individual files are a significant fraction of the compression window/block size or larger. This causes the compressor to not pick up on the similarities, or to consider them too rare or too short to be meaningful compression candidates, in turn resulting in redundancy in the compressed output.

This can be demonstrated by compressing incompressible files that are much smaller than what your image files likely are (as well as significantly smaller than the window size):

$ dd if=/dev/random of=random.data bs=1024 count=4
4+0 records in
4+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.xxxxxxxxx s, 14.1 MB/s
$ for t in $(seq 1 100); do cp random.data random.${t}; done
$ tar czf random.tar.gz random.*
$ stat -c %s random.tar.gz
8564
$

showing that 101 identical files, each with 4 KiB of incompressible data, compressed into less than 9 KiB. If we do the same thing with 4 MiB files (so significantly larger than the window size):

$ dd if=/dev/random of=random.data bs=1024 count=4096
4096+0 records in
4096+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.xxxxxx s, 48.4 MB/s
$ for t in $(seq 1 100); do cp random.data random.${t}; done
$ tar czf random.tar.gz random.*
$ stat -c %s random.tar.gz
423707543
$

which is about 81 KiB larger than the combined size of the input (the 101 individual 4 MiB files): 423707543 = (101 * 4194304) + 82839

To make a compressed archive that takes advantage of the redundancy between files, you'll need to increase the block size significantly (so that the compressor picks up on the similarities), and/or carefully arrange the contents of the uncompressed archive (so that the compressor sees the similarities within a single block). The latter is probably more meaningful if you have some files that are highly similar, and others that are not; you can then, in principle, group the similarities together.

You could split the gzip compression into its own step, which you already have but didn't use to its full extent, which would allow you to pass additional parameters to gzip; something like

tar cf - *.png | gzip -c9 - > pngs.tar.gz

or you could use a more modern compression algorithm, since gzip is pretty old and pretty slow for the compression it gives. You could, for example, try using zstd instead of gzip

tar cf - *.png | zstd -z --long - > pngs.tar.zstd

but the latter, of course, won't meet your criteria of producing a .tar.gz (gzipped tar) file. There's a good chance that it'll give you a considerably smaller output file, though. To illustrate, using the 4 MiB random data files from the second example above, even using zstd's default -3 (it goes up to -19, or -22 with --ultra at the cost of much higher memory usage):

$ tar cf - random.* | zstd -z --long - > random.tar.zstd
$ stat -c %s random.tar.zstd 
4239476
$

which is only about 1% (or in this case 44 KiB) more than 4 MiB: 4239476 = 4194304 + 45172. To wit, the above is also significantly faster than the tar czf, at least on my system.

posted almost 2 years ago

CC BY-SA 4.0

2y ago

Canina‭

1059 reputation 5 20 121 48

Copy Link

Raw

Markdown

History

Communities

reduce gzip archive size if files are similar

0 comment threads

1 answer

0 comment threads