Review Suggested Edit

Approved.
This suggested edit was approved and applied to the post almost 2 years ago by Canina‭.

  • That Wikipedia quote is technically correct, yet somewhat misleading.
  • A key term to understand is the **window size** (sometimes called the block size) of a data compression flow.
  • By necessity, every compression algorithm works on chunks of data. These chunks of data can be large or small, and they can be of a fixed size or a configurable size or a dynamically adjusted size, but when all is said and done, these chunks are treated as largely independent units for the purpose of compression. The size of these chunks is the window size.
  • In most cases where gzip is used to compress multiple files into a single compressed archive, it's coupled with a separate archiver, such as tar; hence `.tar.gz`, which by convention is often shortened to `.tgz` on systems that don't allow the full extension. (The gzip file format apparently allows for multiple files within a single gzip compressed file, but I don't think I've ever seen that used in practice.) By carefully controlling how the archiving process arranges files in its output (which becomes the input to gzip), it is surely possible to place redundant data close enough together that the gzip compressor is more likely to see it within the same window during compression, and can therefore take advantage of that redundancy.
  • Compare and contrast the [manual page for bzip2](https://man.archlinux.org/man/bzip2.1), which is quite explicit that the `-1` through `-9` switches select block sizes of the corresponding number of hundreds of kilobytes, causing the compressor not really to work harder, but to work on more uncompressed data at once.
  • See also [`zstd`](https://man.archlinux.org/man/zstd.1.en), which has specific tuning switches for this, including `--long`:
  • > `--long[=#]`: enables long distance matching with `#` `windowLog`, if `#` is not present it defaults to 27. This increases the window size (`windowLog`) and memory usage for both the compressor and decompressor. This setting is designed to improve the compression ratio for files with long matches at a large distance.
  • (Passing `--long=27` corresponds to setting a window size of 2<sup>27</sup> bytes = 128 MiB.)
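
The arithmetic behind that figure can be checked in the shell. This is only a small sketch: the variable names are made up for illustration, and the 4 MiB file size is an arbitrary example.

```shell
# windowLog is a base-2 exponent: the window is 2^windowLog bytes.
window_log=27   # the --long default, per the man page quoted above
window_bytes=$(( 1 << window_log ))
window_mib=$(( window_bytes / 1024 / 1024 ))
echo "windowLog $window_log = $window_bytes bytes = $window_mib MiB"

# How many files of an arbitrary example size (4 MiB) fit in one window.
file_bytes=$(( 4 * 1024 * 1024 ))
files_per_window=$(( window_bytes / file_bytes ))
echo "4 MiB files per window: $files_per_window"
```

The more whole files fit inside one window, the better the chance that the compressor can match one file against an earlier, similar one.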
  • What all this comes down to is basically that **the results you got indicate that the individual files are a significant fraction of the compression window/block size or larger.** This causes the compressor to not pick up on the similarities, or to consider them too rare or too short to be meaningful compression candidates, in turn resulting in redundancy in the compressed output.
  • This can be demonstrated by compressing *incompressible* files that are much smaller than what your image files likely are:
  • $ dd if=/dev/random of=random.data bs=1024 count=4
  • 4+0 records in
  • 4+0 records out
  • 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.xxxxxxxxx s, 14.1 MB/s
  • $ for t in $(seq 1 100); do cp random.data random.${t}; done
  • $ tar czf random.tar.gz random.*
  • $ stat -c %s random.tar.gz
  • 8564
  • $
  • showing that 101 identical files, each with 4 KiB of incompressible data, compressed into less than 9 KiB. Yet, if we do the same thing with 4 MiB files (so significantly larger than gzip's 32 KiB window), the result is very different:
  • $ dd if=/dev/random of=random.data bs=1024 count=4096
  • 4096+0 records in
  • 4096+0 records out
  • 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.xxxxxx s, 48.4 MB/s
  • $ for t in $(seq 1 100); do cp random.data random.${t}; done
  • $ tar czf random.tar.gz random.*
  • $ stat -c %s random.tar.gz
  • 423707543
  • $
  • which is about 81 KiB larger than the combined size of the input (the 101 individual 4 MiB files).
  • To make a compressed archive that takes advantage of the redundancy between files, you'll need to increase the block size significantly (so that the compressor picks up on the similarities), and/or carefully arrange the contents of the uncompressed archive (so that the compressor sees the similarities within a single block). The latter is probably more meaningful if you have some files that are highly similar, and others that are not; you can then, in principle, group the similarities together.
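  • You could try splitting the gzip compression into its own step, allowing you to pass additional parameters to gzip (though the gzip tool does not let you enlarge its window, so `-9` only makes it search harder within that window); something like: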
  • tar cf - *.png | gzip -c9 - > pngs.tar.gz
  • or you could use a more modern compression algorithm, since gzip is pretty old and pretty slow for the compression it gives. You could, for example, try using zstd instead of gzip:
  • tar cf - *.png | zstd -z --long - > pngs.tar.zstd
  • but the latter, of course, won't meet your criterion of producing a .tar.gz (gzipped tar) file. There's a good chance that it'll give you a considerably smaller output file, though. To illustrate, here are the 4 MiB random data files from the second example above, compressed with nothing more than zstd's default level `-3` (it goes up to `-19`, or `-22` with `--ultra` at the cost of much higher memory usage):
  • $ tar cf - random.* | zstd -z --long - > random.tar.zstd
  • $ stat -c %s random.tar.zstd
  • 4239476
  • $
  • which is only about 1% more than 4 MiB. Incidentally, the above is also significantly faster than the `tar czf` run, at least on my system.
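
The earlier point about arranging the contents of the uncompressed archive can be sketched in the same way. This is a rough illustration rather than a recipe: the file names and the 20 KiB size are arbitrary choices, picked so that two identical copies stored back to back sit within gzip's 32 KiB window, while interleaved copies end up roughly 80 KiB apart.

```shell
#!/bin/sh
# Sketch: archive the same files in two different orders, then gzip each stream.
set -eu
work=$(mktemp -d)
cd "$work"

# Four distinct 20 KiB blobs of random (incompressible) data, two copies each.
for n in a b c d; do
    dd if=/dev/urandom of="$n.1" bs=1024 count=20 2>/dev/null
    cp "$n.1" "$n.2"
done

# Interleaved: the two copies of each blob are about 80 KiB apart in the tar stream.
interleaved=$(tar cf - a.1 b.1 c.1 d.1 a.2 b.2 c.2 d.2 | gzip -c9 | wc -c | tr -d ' ')
# Grouped: each copy directly follows its twin, about 20 KiB apart.
grouped=$(tar cf - a.1 a.2 b.1 b.2 c.1 c.2 d.1 d.2 | gzip -c9 | wc -c | tr -d ' ')

echo "interleaved: $interleaved bytes"
echo "grouped:     $grouped bytes"

cd /
rm -rf "$work"
```

tar stores files in the order they are given on the command line, so the grouped archive should come out noticeably smaller. With real files, the equivalent would be to list similar files next to each other when building the archive.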

Suggested almost 2 years ago by Matthias Braun‭