public inbox for bzip2-devel@sourceware.org
* Q: Are bzip2 archives of identical inputs also guaranteed identical?
@ 2024-03-12  1:58 Jim DeLaHunt
From: Jim DeLaHunt @ 2024-03-12  1:58 UTC
  To: bzip2-devel

Hello, bzip2 supporters:

Many thanks for your work to develop BZip2 and make it freely available 
in the world. I am using it, and it works well for me.

I have a question for which I could not find an answer in the
documentation[1].

If I have two input files, F1 and F2, and I compress them with bzip2 on 
different machines at different times, maybe with different versions of 
bzip2, or with different implementations of the bzip2 algorithm, are the 
resulting archives F1.bz2 and F2.bz2 guaranteed to be bit-for-bit 
identical if and only if the inputs F1 and F2 are bit-for-bit
identical?  Or are they guaranteed to be different?  Or is this property 
undefined?

Reasons why they could be guaranteed identical:

  * The algorithm is 100% deterministic in the output it generates
  * The test suite of the implementation tests this property

Reasons why they could be guaranteed to be different:

  * The algorithm calls for putting a date stamp or nonce value in the
    output.
  * The algorithm calls for putting the name or version of the
    compression tool used into the output.
  * The compression algorithm is not deterministic.
  * The uncompression algorithm is not deterministic: the same archive
    could generate different uncompressed output depending on
    circumstances. (This would surprise me, but I suppose it is
    logically possible.)

Reasons why the property could be undefined:

  * No-one specified this property or tested it.
  * There are known cases where the property is true, and known cases
    where the property is not true, and you never can tell which case a
    user will find themselves in.
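
The possibilities listed above can at least be probed on a single machine.
What follows is a minimal sketch, assuming Python's standard bz2 module
(which wraps libbz2) as the compressor; the payload is made up for
illustration. It compresses the same bytes twice with the same settings,
then once more with a different block-size level. Since the level digit is
stored in the stream header, that last archive cannot match byte for byte.
A check like this says nothing about other versions or implementations.

```python
import bz2

# Hypothetical payload, used only for this illustration.
data = b"example payload for a determinism check\n" * 1000

# Same input, same settings, same library: compare two runs.
first = bz2.compress(data, compresslevel=9)
second = bz2.compress(data, compresslevel=9)
same_settings_match = first == second

# Same input, different block-size level: the 4-byte stream header
# ("BZh" followed by the level digit) already differs.
low = bz2.compress(data, compresslevel=1)
header_low, header_high = low[:4], first[:4]
levels_differ = low != first

print("same settings, identical archives:", same_settings_match)
print("headers:", header_low, header_high)
print("different levels, different archives:", levels_differ)
```

So even with one implementation, identical inputs yield identical archives
only if the compression settings also match.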

The motivation for this question:

I was cleaning up a file server. I had just finished compressing a very 
large file F1 to an archive F1.bz2, and had just irretrievably deleted 
F1. I came across another large file F2, with the same byte count as F1. 
I want to know if F2 is identical to F1. I could uncompress F1.bz2 to 
recreate F1, then diff F1 and F2. Or, if the archives are guaranteed to 
be the same if and only if inputs are the same, then I could compress F2 
to archive F2.bz2, and diff F1.bz2 with F2.bz2.
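
The first route (uncompress, then compare) does not require recreating F1
on disk. The sketch below assumes Python's standard bz2 module; the
function name bz2_matches_file and the helper read_exact are hypothetical
names chosen for this example. It streams the decompressed contents of
the archive and compares them chunk by chunk with the plain file, so it
does not depend on whether compression is deterministic.

```python
import bz2


def read_exact(stream, n):
    """Read up to n bytes, looping until n bytes arrive or EOF is hit."""
    parts = []
    while n > 0:
        piece = stream.read(n)
        if not piece:
            break
        parts.append(piece)
        n -= len(piece)
    return b"".join(parts)


def bz2_matches_file(archive_path, plain_path, chunk_size=1 << 20):
    """Return True if the archive decompresses to exactly the plain file."""
    with bz2.open(archive_path, "rb") as packed, open(plain_path, "rb") as plain:
        while True:
            a = read_exact(packed, chunk_size)
            b = read_exact(plain, chunk_size)
            if a != b:
                return False  # contents or lengths differ
            if not a:
                return True   # both streams ended together
```

Usage would look like bz2_matches_file("F1.bz2", "F2"), with the file
names taken from the example above.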

Best regards,
      —Jim DeLaHunt, Vancouver, Canada

[1] bzip2 documentation <https://sourceware.org/bzip2/manual/manual.html>

-- 
.   --Jim DeLaHunt,jdlh@jdlh.com      http://blog.jdlh.com/  (http://jdlh.com/)
       multilingual websites consultant, Vancouver, B.C., Canada
