From mboxrd@z Thu Jan  1 00:00:00 1970
From: Julian Seward
Reply-To: jseward@acm.org
To: Federico Mena Quintero, Mark Wielaard, bzip2-devel@sourceware.org
Subject: Re: bzip2 1.0.7 released
Date: Tue, 01 Jan 2019 00:00:00 -0000
In-Reply-To: <8c4d5cf2479253406dacdee122692cc77771afb9.camel@gnome.org>
References: <20190627205837.GD9273@wildebeest.org> <0a2331bc6d0c8500c2c45df1e3ebe01b49ad5831.camel@klomp.org> <8c4d5cf2479253406dacdee122692cc77771afb9.camel@gnome.org>

Mark, Federico,

Thank you both for the analysis.
Here's my Eur 0.02.  I see the following, in the 1.0.6 sources:

* bzlib_private.h:
  #define BZ_MAX_SELECTORS (2 + (900000 / BZ_G_SIZE))

* compress.c has
  AssertH( nSelectors < 32768 && nSelectors <= (2 + (900000 / BZ_G_SIZE)), 3003 );

* decompress.c has (well, wants to have)
  if (nSelectors < 1 || nSelectors > BZ_MAX_SELECTORS)

Mark observes this:

> So it looks like some implementations might add more selectors than
> necessary. For example lbzip2 seems to use a max of 18000 + 1 + 7. Which
> might explain why our 18002 = 2 + (900000 / 50) isn't enough, and why my
> random increase of 5 seemed to work for the given file.

Seems plausible, except that (18000 + 1 + 7) - (2 + (900000 / 50)) == 6,
not 5.  So we'd have to increase the number by 6, not 5.

IIUC (which I may not), bzip2-1.0.7 is internally self-consistent, in that
it won't create more than 18002 selectors, and it also won't accept more
than 18002 selectors.  But some lbzip2 implementations can create up to
18008 selectors (per Mark's comments), and that is what is causing the
problem here.
I would propose the following fix for discussion:

* split BZ_MAX_SELECTORS into two different values:

  - BZ_MAX_SELECTORS_ENC = (2 + (900000 / BZ_G_SIZE))  [= 18002],
    the max number of selectors that bzip2 can create ("encode")

  - BZ_MAX_SELECTORS_DEC = BZ_MAX_SELECTORS_ENC + 6  [= 18008],
    the max number of selectors that we'll accept during decoding

  and add a comment explaining that the difference is due to buggy
  lbzip2/whatever creating more than BZ_MAX_SELECTORS_ENC

* use BZ_MAX_SELECTORS_ENC to dimension the arrays in struct EState

* use BZ_MAX_SELECTORS_DEC to dimension the arrays in struct DState,
  and for the new decompress.c range check

* change the compress.c assertion
  AssertH( nSelectors < 32768 && nSelectors <= (2 + (900000 / BZ_G_SIZE)), 3003 );
  to actually mention BZ_MAX_SELECTORS_ENC directly, instead of
  (2 + (900000 / BZ_G_SIZE))  [which is embarrassingly lame]

It seems to me important to split BZ_MAX_SELECTORS into these two parts now,
so as to make it clear to everybody that we're accepting (decompressing) a
slightly larger set of inputs than we create (a la that old saying about
network protocol implementations), so as to tolerate other compressors.

That then leaves the questions:

* Why did it work before?  I suspect Federico's guess that "I think the len
  arrays are still uninitialized and will get filled until later" is correct,
  but I'd need to look more at that.

* How to verify any change?  For whatever fix we arrive at, I'd like to do
  some test runs over big trees of files (several tens of GB) with both
  compressor and decompressor compiled with asan (and/or ubsan), since I
  believe at least asan can detect overruns of the statically sized arrays
  within the EState and DState structs -- V can't.

Per communication with Federico (which didn't go to Mark, unfortunately),
I have a C driver program that compresses and decompresses every file in a
tree and checks it is the same as the original.
But I don't know what its copyright status is, so I'm not happy to
redistribute it (alas).  We should try to replace it.

In particular, the three test files in the tarball merely serve to verify
that the build didn't fail in some obvious way.  They are in no way a
comprehensive test set.

J