From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 92081 invoked by alias); 28 Jun 2019 11:10:20 -0000 Mailing-List: contact bzip2-devel-help@sourceware.org; run by ezmlm Precedence: bulk List-Post: List-Help: List-Subscribe: List-Id: Sender: bzip2-devel-owner@sourceware.org Received: (qmail 92070 invoked by uid 89); 28 Jun 2019 11:10:20 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Checked: by ClamAV 0.100.3 on sourceware.org X-Virus-Found: No X-Spam-SWARE-Status: No, score=-18.8 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,SPF_PASS autolearn=ham version=3.3.1 spammy= X-Spam-Status: No, score=-18.8 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,SPF_PASS autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on sourceware.org X-Spam-Level: X-HELO: gnu.wildebeest.org Message-ID: <9998ca428c4c7f895a543aa91941e58efb0d5291.camel@klomp.org> Subject: Re: bzip2 1.0.7 released From: Mark Wielaard To: jseward@acm.org, Federico Mena Quintero , bzip2-devel@sourceware.org Date: Tue, 01 Jan 2019 00:00:00 -0000 In-Reply-To: References: <20190627205837.GD9273@wildebeest.org> <0a2331bc6d0c8500c2c45df1e3ebe01b49ad5831.camel@klomp.org> <8c4d5cf2479253406dacdee122692cc77771afb9.camel@gnome.org> Content-Type: multipart/mixed; boundary="=-R3PNU+9qZQ0c7zNIqdNy" X-Mailer: Evolution 3.28.5 (3.28.5-2.el7) Mime-Version: 1.0 X-Spam-Flag: NO X-SW-Source: 2019-q2/txt/msg00030.txt.bz2 --=-R3PNU+9qZQ0c7zNIqdNy Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Content-length: 5279 Hi Julian, On Fri, 2019-06-28 at 08:08 +0200, Julian Seward wrote: > I see the following, in the 1.0.6 sources: >=20 > * bzlib_private.h:#define BZ_MAX_SELECTORS (2 + (900000 / BZ_G_SIZE)) >=20 > * compress.c has > AssertH( nSelectors < 32768 && > nSelectors <=3D (2 + (900000 / BZ_G_SIZE)), > 3003 ); >=20 > * decompress.c has (well, wants to have) > if (nSelectors < 1 || nSelectors > BZ_MAX_SELECTORS) >=20 > Mark observes this: >=20 > > So it looks like some implementations might add more selectors than > > necessary. For example lbzip2 seems to use a max of 18000 + 1 + 7. Wh= ich > > might explain why our 18002 =3D 2 + (900000 / 50) isn't enough, and wh= y my > > random increase of 5 seemed to work for the given file. >=20 > Seems plausible, except that (18000 + 1 + 7) - (2 + (900000 / 50)) =3D=3D= 6, > not 5. So we'd have to increase the number by 6, not 5. :} off-by-one... > IIUC (which I may not), bzip2-1.0.7 is internally self-consistent, in tha= t it > won't create more than 18002 selectors, and it also won't accept more than > 18002 selectors. But some lbzip2 implementations can create up to 18008 > selectors (per Mark's comments), and that is what is causing the problem = here. Yes. if I understand correctly, then 18002 is also the theoretical limit of the number of selectors. Being 2 (a zero and end of block marker) plus (900000 / BZ_G_SIZE), the maximum block size, because you cannot have more symbols than fit in a block, divided by BZ_G_SIZE (50), the minimum group of symbols per tree. But some implementations might "round up" that number. And I see some implementations do accept the theoretical limit of 32768 (being 2^15). But that seems impossible, and is just an upper limit because 18002 doesn't fit in 14 bits, so you have to use 15 bits to encode the number of selectors. > I would propose the following fix for discussion: >=20 > * split BZ_MAX_SELECTORS into two different values: >=20 > - BZ_MAX_SELECTORS_ENC =3D (2 + (900000 / BZ_G_SIZE)) [=3D 18002], the= max > number of selectors that bzip2 can create ("encode") >=20 > - BZ_MAX_SELECTORS_DEC =3D BZ_MAX_SELECTORS_ENC + 6 [=3D 18008], the m= ax number > of selectors that we'll accept during decoding, and add a comment > explaining that the difference is due to buggy lbzip2/whatever creat= ing > more than BZ_MAX_SELECTORS_ENC >=20 > * use BZ_MAX_SELECTORS_ENC to dimension the arrays in struct EState >=20 > * use BZ_MAX_SELECTORS_DEC to dimension the arrays in struct DState, and > for the new decompress.c range check >=20 > * change the compress.c assertion >=20 > AssertH( nSelectors < 32768 && > nSelectors <=3D (2 + (900000 / BZ_G_SIZE)), > 3003 ); >=20 > to actually mention BZ_MAX_SELECTORS_ENC directly, instead of > (2 + (900000 / BZ_G_SIZE)), [which is embarrassingly lame] >=20 > It seems to me to be important to now split BZ_MAX_SELECTORS into these t= wo > parts so as to make it clear to everybody that we're accepting (decompres= sing) > a slightly larger set of inputs than we create (a la that old saying about > network protocol implementations), so as to tolerate other compressors. That seems good. The attached patch does this and makes it possible to decode the problematic bz2 file. > That then leaves the questions: >=20 > * Why did it work before? I suspect Federico's guess that "I think the l= en > arrays are still uninitialized and will get filled until later" is cor= rect, > but I'd need to look more at that. Yes, I think Federico is correct, the len array comes directly after the selectorMtf array and is being filled in after the selectorMtf array overflow has been written. It is also large enough (BZ_N_GROUPS * BZ_MAX_ALPHA_SIZE =3D 6 * 258 =3D 1548 UChars) so would have caught a small overflow. The extra selectors are also never really used, since they were just padding beyond the theoretical max. So the file would decode correctly. > * How to verify any change? For whatever fix we arrive at, I'd like to do > some test runs over big trees of files (several tens of GB) with both > compressor and decompressor compiled with asan (and/or ubsan), since I > believe at least asan can detect overruns of the statically sized arra= ys > within the EState and DState structs -- V can't. I couldn't get gcc -fsanitize=3Daddress to find the overwrite in the subtle case. But the original report was for a fuzzed file. With a ridiculously large selector value it should be easy to show (even under valgrind). > Per communication with Federico (which didn't go to Mark, unfortunatel= y), I > have a C driver program that compresses and decompresses every file in= a > tree and checks it is the same as the original. But I don't know what= the > copyright status of it is and so I'm not happy to redistribute it (ala= s). > We should try to replace it. >=20 > In particular, the three test files in the tarball merely serve to ver= ify > that the build didn't fail in some obvious way. They are in no way a > comprehensive test set. Yes, having a more comprehensive test set would be great. Especially having file encoded with other bzip2 encoders. Thanks, Mark --=-R3PNU+9qZQ0c7zNIqdNy Content-Disposition: inline; filename*0=0001-Be-more-liberal-in-the-number-of-selectors-we-accept.pat; filename*1=ch Content-Type: text/x-patch; name="0001-Be-more-liberal-in-the-number-of-selectors-we-accept.patch"; charset="UTF-8" Content-Transfer-Encoding: base64 Content-length: 5251 RnJvbSA4Yzk5MzQwZTNjYjQ1MzhiOTFiYzUwOGViZmU3NzhmN2RlMjE3NGIx IE1vbiBTZXAgMTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBNYXJrIFdpZWxhYXJk IDxtYXJrQGtsb21wLm9yZz4KRGF0ZTogRnJpLCAyOCBKdW4gMjAxOSAxMjoz ODo0MSArMDIwMApTdWJqZWN0OiBbUEFUQ0hdIEJlIG1vcmUgbGliZXJhbCBp biB0aGUgbnVtYmVyIG9mIHNlbGVjdG9ycyB3ZSBhY2NlcHQgd2hlbgogZGVj b2RpbmcuCgpBcyBwcm9wb3NlZCBieSBKdWxpYW4gU2V3YXJ0OgoKKiBzcGxp dCBCWl9NQVhfU0VMRUNUT1JTIGludG8gdHdvIGRpZmZlcmVudCB2YWx1ZXM6 CgogIC0gQlpfTUFYX1NFTEVDVE9SU19FTkMgPSAoMiArICg5MDAwMDAgLyBC Wl9HX1NJWkUpKSBbPSAxODAwMl0sIHRoZSBtYXgKICAgIG51bWJlciBvZiBz ZWxlY3RvcnMgdGhhdCBiemlwMiBjYW4gY3JlYXRlICgiZW5jb2RlIikKCiAg LSBCWl9NQVhfU0VMRUNUT1JTX0RFQyA9IEJaX01BWF9TRUxFQ1RPUlNfRU5D ICsgNiBbPSAxODAwOF0sIHRoZSBtYXgKICAgIG51bWJlciBvZiBzZWxlY3Rv cnMgdGhhdCB3ZSdsbCBhY2NlcHQgZHVyaW5nIGRlY29kaW5nLCBhbmQgYWRk IGEKICAgIGNvbW1lbnQgZXhwbGFpbmluZyB0aGF0IHRoZSBkaWZmZXJlbmNl IGlzIGR1ZSB0byBidWdneSBsYnppcDIvd2hhdGV2ZXIKICAgIGNyZWF0aW5n IG1vcmUgdGhhbiBCWl9NQVhfU0VMRUNUT1JTX0VOQwoKKiB1c2UgQlpfTUFY X1NFTEVDVE9SU19FTkMgdG8gZGltZW5zaW9uIHRoZSBhcnJheXMgaW4gc3Ry dWN0IEVTdGF0ZQoKKiB1c2UgQlpfTUFYX1NFTEVDVE9SU19ERUMgdG8gZGlt ZW5zaW9uIHRoZSBhcnJheXMgaW4gc3RydWN0IERTdGF0ZSwKICBhbmQgZm9y IHRoZSBuZXcgZGVjb21wcmVzcy5jIHJhbmdlIGNoZWNrCgoqIGNoYW5nZSB0 aGUgY29tcHJlc3MuYyBhc3NlcnRpb24KCiAgQXNzZXJ0SCggblNlbGVjdG9y cyA8IDMyNzY4ICYmCiAgICAgICAgICAgblNlbGVjdG9ycyA8PSAoMiArICg5 MDAwMDAgLyBCWl9HX1NJWkUpKSwKICAgICAgICAgICAzMDAzICk7CgogIHRv IGFjdHVhbGx5IG1lbnRpb24gQlpfTUFYX1NFTEVDVE9SU19FTkMgZGlyZWN0 bHksIGluc3RlYWQgb2YKICAoMiArICg5MDAwMDAgLyBCWl9HX1NJWkUpKSwg W3doaWNoIGlzIGVtYmFycmFzc2luZ2x5IGxhbWVdCi0tLQogYnpsaWJfcHJp dmF0ZS5oIHwgMTUgKysrKysrKysrKy0tLS0tCiBjb21wcmVzcy5jICAgICAg fCAgMiArLQogZGVjb21wcmVzcy5jICAgIHwgIDIgKy0KIDMgZmlsZXMgY2hh bmdlZCwgMTIgaW5zZXJ0aW9ucygrKSwgNyBkZWxldGlvbnMoLSkKCmRpZmYg LS1naXQgYS9iemxpYl9wcml2YXRlLmggYi9iemxpYl9wcml2YXRlLmgKaW5k ZXggNzk3NTU1Mi4uNjIwOGJhMSAxMDA2NDQKLS0tIGEvYnpsaWJfcHJpdmF0 ZS5oCisrKyBiL2J6bGliX3ByaXZhdGUuaApAQCAtMTIyLDcgKzEyMiwxMiBA QCBleHRlcm4gdm9pZCBiel9pbnRlcm5hbF9lcnJvciAoIGludCBlcnJjb2Rl ICk7CiAjZGVmaW5lIEJaX0dfU0laRSAgIDUwCiAjZGVmaW5lIEJaX05fSVRF UlMgIDQKIAotI2RlZmluZSBCWl9NQVhfU0VMRUNUT1JTICgyICsgKDkwMDAw MCAvIEJaX0dfU0laRSkpCisvKiBUaGUgbWF4IG51bWJlciBvZiBzZWxlY3Rv cnMgdGhhdCBiemlwMiBjYW4gY3JlYXRlICgiZW5jb2RlIikgWz0gMTgwMDJd ICovCisjZGVmaW5lIEJaX01BWF9TRUxFQ1RPUlNfRU5DICgyICsgKDkwMDAw MCAvIEJaX0dfU0laRSkpCisvKiBUaGUgbWF4IG51bWJlciBvZiBzZWxlY3Rv cnMgdGhhdCBiemlwMiBhY2NlcHQgZHVyaW5nIGRlY29kaW5nIFs9IDE4MDA4 XQorICAgVGhpcyBpcyBsYXJnZXIgdGhhbiBCWl9NQVhfU0VMRUNUT1JTX0VO QyBiZWNhdXNlIHNvbWUgaW1wbGVtZW50YXRpb25zLAorICAgbWlnaHQgcm91 bmQgdXAgdGhlIG51bWJlciBvZiBzZWxlY3RvcnMgdG8gYSBmYWN0b3Igb2Yg OC4gKi8KKyNkZWZpbmUgQlpfTUFYX1NFTEVDVE9SU19ERUMgKEJaX01BWF9T RUxFQ1RPUlNfRU5DICsgNikKIAogCiAKQEAgLTI1Myw4ICsyNTgsOCBAQCB0 eXBlZGVmCiAgICAgICAvKiBzdHVmZiBmb3IgY29kaW5nIHRoZSBNVEYgdmFs dWVzICovCiAgICAgICBJbnQzMiAgICBuTVRGOwogICAgICAgSW50MzIgICAg bXRmRnJlcSAgICBbQlpfTUFYX0FMUEhBX1NJWkVdOwotICAgICAgVUNoYXIg ICAgc2VsZWN0b3IgICBbQlpfTUFYX1NFTEVDVE9SU107Ci0gICAgICBVQ2hh ciAgICBzZWxlY3Rvck10ZltCWl9NQVhfU0VMRUNUT1JTXTsKKyAgICAgIFVD aGFyICAgIHNlbGVjdG9yICAgW0JaX01BWF9TRUxFQ1RPUlNfRU5DXTsKKyAg ICAgIFVDaGFyICAgIHNlbGVjdG9yTXRmW0JaX01BWF9TRUxFQ1RPUlNfRU5D XTsKIAogICAgICAgVUNoYXIgICAgbGVuICAgICBbQlpfTl9HUk9VUFNdW0Ja X01BWF9BTFBIQV9TSVpFXTsKICAgICAgIEludDMyICAgIGNvZGUgICAgW0Ja X05fR1JPVVBTXVtCWl9NQVhfQUxQSEFfU0laRV07CkBAIC0zOTksOCArNDA0 LDggQEAgdHlwZWRlZgogICAgICAgLyogZm9yIGRlY29kaW5nIHRoZSBNVEYg dmFsdWVzICovCiAgICAgICBVQ2hhciAgICBtdGZhICAgW01URkFfU0laRV07 CiAgICAgICBJbnQzMiAgICBtdGZiYXNlWzI1NiAvIE1URkxfU0laRV07Ci0g ICAgICBVQ2hhciAgICBzZWxlY3RvciAgIFtCWl9NQVhfU0VMRUNUT1JTXTsK LSAgICAgIFVDaGFyICAgIHNlbGVjdG9yTXRmW0JaX01BWF9TRUxFQ1RPUlNd OworICAgICAgVUNoYXIgICAgc2VsZWN0b3IgICBbQlpfTUFYX1NFTEVDVE9S U19ERUNdOworICAgICAgVUNoYXIgICAgc2VsZWN0b3JNdGZbQlpfTUFYX1NF TEVDVE9SU19ERUNdOwogICAgICAgVUNoYXIgICAgbGVuICBbQlpfTl9HUk9V UFNdW0JaX01BWF9BTFBIQV9TSVpFXTsKIAogICAgICAgSW50MzIgICAgbGlt aXQgIFtCWl9OX0dST1VQU11bQlpfTUFYX0FMUEhBX1NJWkVdOwpkaWZmIC0t Z2l0IGEvY29tcHJlc3MuYyBiL2NvbXByZXNzLmMKaW5kZXggMjM3NjIwZC4u OWI2NjBmOCAxMDA2NDQKLS0tIGEvY29tcHJlc3MuYworKysgYi9jb21wcmVz cy5jCkBAIC00NTQsNyArNDU0LDcgQEAgdm9pZCBzZW5kTVRGVmFsdWVzICgg RVN0YXRlKiBzICkKIAogICAgQXNzZXJ0SCggbkdyb3VwcyA8IDgsIDMwMDIg KTsKICAgIEFzc2VydEgoIG5TZWxlY3RvcnMgPCAzMjc2OCAmJgotICAgICAg ICAgICAgblNlbGVjdG9ycyA8PSAoMiArICg5MDAwMDAgLyBCWl9HX1NJWkUp KSwKKyAgICAgICAgICAgIG5TZWxlY3RvcnMgPD0gQlpfTUFYX1NFTEVDVE9S U19FTkMsCiAgICAgICAgICAgICAzMDAzICk7CiAKIApkaWZmIC0tZ2l0IGEv ZGVjb21wcmVzcy5jIGIvZGVjb21wcmVzcy5jCmluZGV4IDIwY2U0OTMuLmQy NGEwNTIgMTAwNjQ0Ci0tLSBhL2RlY29tcHJlc3MuYworKysgYi9kZWNvbXBy ZXNzLmMKQEAgLTI4Nyw3ICsyODcsNyBAQCBJbnQzMiBCWjJfZGVjb21wcmVz cyAoIERTdGF0ZSogcyApCiAgICAgICBHRVRfQklUUyhCWl9YX1NFTEVDVE9S XzEsIG5Hcm91cHMsIDMpOwogICAgICAgaWYgKG5Hcm91cHMgPCAyIHx8IG5H cm91cHMgPiBCWl9OX0dST1VQUykgUkVUVVJOKEJaX0RBVEFfRVJST1IpOwog ICAgICAgR0VUX0JJVFMoQlpfWF9TRUxFQ1RPUl8yLCBuU2VsZWN0b3JzLCAx NSk7Ci0gICAgICBpZiAoblNlbGVjdG9ycyA8IDEgfHwgblNlbGVjdG9ycyA+ IEJaX01BWF9TRUxFQ1RPUlMpIFJFVFVSTihCWl9EQVRBX0VSUk9SKTsKKyAg ICAgIGlmIChuU2VsZWN0b3JzIDwgMSB8fCBuU2VsZWN0b3JzID4gQlpfTUFY X1NFTEVDVE9SU19ERUMpIFJFVFVSTihCWl9EQVRBX0VSUk9SKTsKICAgICAg IGZvciAoaSA9IDA7IGkgPCBuU2VsZWN0b3JzOyBpKyspIHsKICAgICAgICAg IGogPSAwOwogICAgICAgICAgd2hpbGUgKFRydWUpIHsKLS0gCjEuOC4zLjEK Cg== --=-R3PNU+9qZQ0c7zNIqdNy--