That Wikipedia quote is correct, yet somewhat misleading, or at the very least easy to misread.

A key term to understand is the **window size** (sometimes called the block size) of a data compression flow.

By necessity, every compression algorithm works on chunks of data. These chunks can be large or small, and they can be of a fixed size, a configurable size, or a dynamically adjusted size, but when all is said and done, they are treated as largely independent units for the purpose of compression. The size of these chunks is the window size.

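With gzip specifically, that window is the 32 KiB back-reference window of the DEFLATE format, which makes the effect easy to demonstrate in isolation. A minimal sketch (assuming GNU gzip and coreutils; the output sizes are approximate): two copies of the same incompressible data compress well when they sit within the window, but not when they are separated by more than 32 KiB of other data.

    $ dd if=/dev/urandom of=chunk bs=1024 count=16 2>/dev/null   # 16 KiB of incompressible data
    $ dd if=/dev/urandom of=pad bs=1024 count=64 2>/dev/null     # 64 KiB of unrelated data
    $ cat chunk chunk | gzip -9 | wc -c       # duplicate within the 32 KiB window: roughly 16 KiB
    $ cat chunk pad chunk | gzip -9 | wc -c   # duplicate more than 32 KiB back: roughly 96 KiB
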
In most cases where gzip is used to compress multiple files into a single compressed archive, it's coupled with a separate archiver, such as tar, hence `.tar.gz`, which by convention is often shortened to `.tgz` on systems that don't allow the full extension. (The gzip file format apparently allows for multiple files within a single gzip compressed file, but I don't think I've ever seen that used in practice.) By carefully controlling how the archiving process arranges files in its output (which becomes the input to gzip), it is certainly possible to place redundant data close enough together that the gzip compressor is more likely to see it within the same window during compression and can therefore take advantage of that redundancy. Once the compressor has picked up on such a match candidate, it is in the compression dictionary and therefore also usable for later blocks.

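For completeness, here is a quick sketch of that rarely seen multi-member gzip form (assuming GNU gzip). Decompression simply concatenates the members, which is a large part of why a separate archiver such as tar is used instead:

    $ printf 'one\n' > a.txt
    $ printf 'two\n' > b.txt
    $ gzip -c a.txt b.txt > both.gz   # both.gz now holds two gzip members, one per input file
    $ gzip -dc both.gz                # decompressing concatenates their contents
    one
    two
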
Compare and contrast the [manual page for bzip2](https://man.archlinux.org/man/bzip2.1), which is quite explicit that the `-1` through `-9` switches select block sizes of the corresponding number of hundreds of kilobytes, causing the compressor not really to work harder, but to work on more uncompressed data at once.

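For instance, selecting bzip2's largest block size would look like this (a sketch; the file names are only illustrative). Note that even that 900 kB block is far smaller than the 128 MiB window zstd uses with `--long` below, which limits how much cross-file redundancy bzip2 can exploit.

    $ tar cf - *.png | bzip2 -9 > pngs.tar.bz2   # -9 selects 900 kB blocks, bzip2's maximum
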
See also [`zstd`](https://man.archlinux.org/man/zstd.1.en), which has specific tuning switches for this, including `--long`:

> `--long[=#]`: enables long distance matching with `#` `windowLog`, if `#` is not present it defaults to 27. This increases the window size (`windowLog`) and memory usage for both the compressor and decompressor. This setting is designed to improve the compression ratio for files with long matches at a large distance.

(Passing `--long=27` corresponds to setting a block size of 2<sup>27</sup> bytes = 128 MiB.)

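If you need an even larger window, you can pass an explicit value (a sketch, again with illustrative file names). A window above the 128 MiB default also has to be allowed on the decompression side, for example with a matching `--long` or a `--memory` limit:

    $ tar cf - *.png | zstd -z --long=30 - > pngs.tar.zstd   # 2^30 bytes = 1 GiB window
    $ zstd -dc --long=30 pngs.tar.zstd | tar xf -            # decompressor must accept the larger window
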
What all this comes down to is basically that **the results you got indicate that the individual files are a significant fraction of the compression window/block size, or larger.** This causes the compressor not to pick up on the similarities, or to consider them too rare or too short to be meaningful compression candidates, in turn resulting in redundancy in the compressed output.

This can be demonstrated by compressing *incompressible* files that are much smaller than what your image files likely are (as well as significantly smaller than the window size):

    $ dd if=/dev/random of=random.data bs=1024 count=4
    4+0 records in
    4+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.xxxxxxxxx s, 14.1 MB/s
    $ for t in $(seq 1 100); do cp random.data random.${t}; done
    $ tar czf random.tar.gz random.*
    $ stat -c %s random.tar.gz
    8564
    $

showing that 101 identical files, each with 4 KiB of incompressible data, compressed into less than 9 KiB. If we do the same thing with 4 MiB files (so significantly larger than the window size):

    $ dd if=/dev/random of=random.data bs=1024 count=4096
    4096+0 records in
    4096+0 records out
    4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.xxxxxx s, 48.4 MB/s
    $ for t in $(seq 1 100); do cp random.data random.${t}; done
    $ tar czf random.tar.gz random.*
    $ stat -c %s random.tar.gz
    423707543
    $

which is about 81 KiB larger than the combined size of the input (the 101 individual 4 MiB files): 423707543 = (101 * 4194304) + 82839.

To make a compressed archive that takes advantage of the redundancy between files, you'll need to increase the block size significantly (so that the compressor picks up on the similarities), and/or carefully arrange the contents of the uncompressed archive (so that the compressor sees the similarities within a single block). The latter is probably more meaningful if you have some files that are highly similar, and others that are not; you can then, in principle, group the similarities together.

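As a sketch of the latter (assuming GNU tar; the sort key is just an illustrative heuristic), you can control the order of files within the archive by feeding tar an explicit file list via `-T`/`--files-from`, sorted so that similar files end up adjacent:

    $ ls -S *.png > filelist.txt                        # here sorted by size, as a crude similarity proxy
    $ tar cf - -T filelist.txt | gzip -9 > pngs.tar.gz  # tar stores the files in the listed order
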
You could split the gzip compression into its own step (which you already do, though without using it to its full extent), allowing you to pass additional parameters to gzip; something like

    tar cf - *.png | gzip -c9 - > pngs.tar.gz

or you could use a more modern compression algorithm, since gzip is pretty old and pretty slow for the compression it gives. You could, for example, try zstd instead of gzip:

    tar cf - *.png | zstd -z --long - > pngs.tar.zstd

but the latter, of course, won't meet your criterion of producing a `.tar.gz` (gzipped tar) file. There's a good chance that it'll give you a considerably smaller output file, though. To illustrate, here is the result for the 4 MiB random-data files from the second example above, even at zstd's default `-3` (it goes up to `-19`, or `-22` with `--ultra`, at the cost of much higher memory usage):


    $ tar cf - random.* | zstd -z --long - > random.tar.zstd
    $ stat -c %s random.tar.zstd
    4239476
    $

which is only about 1% (or, in this case, about 44 KiB) more than 4 MiB: 4239476 = 4194304 + 45172. What's more, the above is also significantly faster than the `tar czf`, at least on my system.

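To compare speed as well as size on your own files, you could time both pipelines (a sketch; in bash, the `time` keyword covers the entire pipeline, including the compressor):

    $ time tar czf pngs.tar.gz *.png
    $ time tar cf - *.png | zstd -z --long - > pngs.tar.zstd
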