Communities

Writing
Writing
Codidact Meta
Codidact Meta
The Great Outdoors
The Great Outdoors
Photography & Video
Photography & Video
Scientific Speculation
Scientific Speculation
Cooking
Cooking
Electrical Engineering
Electrical Engineering
Judaism
Judaism
Languages & Linguistics
Languages & Linguistics
Software Development
Software Development
Mathematics
Mathematics
Christianity
Christianity
Code Golf
Code Golf
Music
Music
Physics
Physics
Linux Systems
Linux Systems
Power Users
Power Users
Tabletop RPGs
Tabletop RPGs
Notifications
Mark all as read
Q&A

reduce gzip archive size if files are similar

+5
−0

I'd like to use the similarities between files to produce a gzip archive that's as small as possible.

On gzip's Wikipedia entry it says that:

gzip is not to be confused with the ZIP archive format, which also uses DEFLATE. The ZIP format can hold collections of files without an external archiver, but is less compact than compressed tarballs holding the same data, because it compresses files individually and cannot take advantage of redundancy between files (solid compression).

I'd like use that "redundancy between files" and for testing have tried to gzip multiple identical PNG files, assuming that their size would not add up: tar -czf pngs.tar.gz *.png

But as it turns out, the resulting .tar.gz file is considerably larger than one of the PNGs, indicating that the algorithm didn't pick up on the fact that all files were identical.

Also, creating a non-compressed tar file with tar -cvf pngs.tar *.png and then using gzip on it did not reduce the size: gzip --keep pngs.tar.

Is there a way to create a tar.gz archive that uses the redundancies/similarities between its files?

Used tools and version:

  • gzip 1.12
  • GNU tar 1.34
  • bsdtar 3.6.1

I know I can use 7-Zip with something like bsdtar --auto-compress -cf pngs.tar.7z pngs.tar to create an archive that's smaller than one PNG but I'm curious if it's also possible with gzip.

Why does this post require moderator attention?
You might want to add some details to your flag.
Why should this post be closed?

0 comment threads

1 answer

+7
−0

That Wikipedia quote is correct, yet somewhat wrong or at the very least misleading.

A key term to understand is the window size (sometimes called the block size) of a data compression flow.

By necessity, every compression algorithm works on chunks of data. These chunks of data can be large or small, and they can be of a fixed size or a configurable size or a dynamically adjusted size, but when all is said and done, these chunks are treated as largely independent units for the purpose of compression. The size of these chunks is the window size.

In most cases where gzip is used to compress multiple files into a single compressed archive, it's coupled with a separate archiver, such as tar, hence .tar.gz which by convention is often shortened to .tgz on systems that don't allow the full extension. (The gzip file format apparently allows for multiple files within a single gzip compressed file, but I don't think I've ever seen that used in practice.) By carefully controlling how the archiving process arranges files in its output (which becomes the input to gzip) it surely is possible to place redundant data sufficiently close enough together that the gzip compressor will be more likely to see those within the same window during compression and therefore be able to take advantage of that redundancy. Once the compressor has picked up on the compression candidate, it will be in the compression dictionary and therefore usable also for later blocks.

Compare and contrast the manual page for bzip2, which is quite explicit that the -1 through -9 switches select block sizes of the corresponding number of hundreds of kilobytes, causing the compressor not really to work harder, but to work on more uncompressed data at once.

See also zstd, which has specific tuning switches for this, including --long:

--long[=#]: enables long distance matching with # windowLog, if # is not present it defaults to 27. This increases the window size (windowLog) and memory usage for both the compressor and decompressor. This setting is designed to improve the compression ratio for files with long matches at a large distance.

(Passing --long=27 corresponds to setting a block size of 227 bytes = 128 MiB.)

What all this comes down to is basically that the results you got indicate that the individual files are a significant fraction of the compression window/block size or larger. This causes the compressor to not pick up on the similarities, or to consider them too rare or too short to be meaningful compression candidates, in turn resulting in redundancy in the compressed output.

This can be demonstrated by compressing incompressible files that are much smaller than what your image files likely are (as well as significantly smaller than the window size):

$ dd if=/dev/random of=random.data bs=1024 count=4
4+0 records in
4+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.xxxxxxxxx s, 14.1 MB/s
$ for t in $(seq 1 100); do cp random.data random.${t}; done
$ tar czf random.tar.gz random.*
$ stat -c %s random.tar.gz
8564
$

showing that 101 identical files, each with 4 KiB of incompressible data, compressed into less than 9 KiB. If we do the same thing with 4 MiB files (so significantly larger than the window size):

$ dd if=/dev/random of=random.data bs=1024 count=4096
4096+0 records in
4096+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.xxxxxx s, 48.4 MB/s
$ for t in $(seq 1 100); do cp random.data random.${t}; done
$ tar czf random.tar.gz random.*
$ stat -c %s random.tar.gz
423707543
$

which is about 81 KiB larger than the combined size of the input (the 101 individual 4 MiB files): 423707543 = (101 * 4194304) + 82839

To make a compressed archive that takes advantage of the redundancy between files, you'll need to increase the block size significantly (so that the compressor picks up on the similarities), and/or carefully arrange the contents of the uncompressed archive (so that the compressor sees the similarities within a single block). The latter is probably more meaningful if you have some files that are highly similar, and others that are not; you can then, in principle, group the similarities together.

You could split the gzip compression into its own step, which you already have but didn't use to its full extent, which would allow you to pass additional parameters to gzip; something like

tar cf - *.png | gzip -c9 - > pngs.tar.gz

or you could use a more modern compression algorithm, since gzip is pretty old and pretty slow for the compression it gives. You could, for example, try using zstd instead of gzip

tar cf - *.png | zstd -z --long - > pngs.tar.zstd

but the latter, of course, won't meet your criteria of producing a .tar.gz (gzipped tar) file. There's a good chance that it'll give you a considerably smaller output file, though. To illustrate, using the 4 MiB random data files from the second example above, even using zstd's default -3 (it goes up to -19, or -22 with --ultra at the cost of much higher memory usage):

$ tar cf - random.* | zstd -z --long - > random.tar.zstd
$ stat -c %s random.tar.zstd 
4239476
$

which is only about 1% (or in this case 44 KiB) more than 4 MiB: 4239476 = 4194304 + 45172. To wit, the above is also significantly faster than the tar czf, at least on my system.

Why does this post require moderator attention?
You might want to add some details to your flag.

0 comment threads

Sign up to answer this question »

This community is part of the Codidact network. We have other communities too — take a look!

You can also join us in chat!

Want to advertise this community? Use our templates!

Like what we're doing? Support us! Donate