From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp116.iad3a.emailsrvr.com (smtp116.iad3a.emailsrvr.com [173.203.187.116]) by sourceware.org (Postfix) with ESMTPS id 8F2D2385B836 for ; Mon, 30 Mar 2020 17:40:24 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 8F2D2385B836 X-Auth-ID: tom@honermann.net Received: by smtp31.relay.iad3a.emailsrvr.com (Authenticated sender: tom-AT-honermann.net) with ESMTPSA id AB10424898; Mon, 30 Mar 2020 13:40:23 -0400 (EDT) X-Sender-Id: tom@honermann.net Received: from [192.168.1.13] (pool-74-110-208-227.rcmdva.fios.verizon.net [74.110.208.227]) (using TLSv1.2 with cipher DHE-RSA-AES128-SHA) by 0.0.0.0:587 (trex/5.7.12); Mon, 30 Mar 2020 13:40:23 -0400 Subject: Re: [PATCH] Reset converter state after second wchar_t output (Bug 25734) To: Carlos O'Donell , Andreas Schwab , libc-alpha , Rafal Luzynski References: From: Tom Honermann Message-ID: <691be68d-2969-1076-75a9-4185c5de5f69@honermann.net> Date: Mon, 30 Mar 2020 13:40:22 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Content-Language: en-US X-Classification-ID: 478182a5-73a6-4d06-9947-23d861bd0cd3-1-1 X-Spam-Status: No, score=-20.7 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_SHORT, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 30 Mar 2020 17:40:26 -0000 On 3/28/20 4:37 PM, Carlos O'Donell wrote: > On 3/27/20 5:18 PM, Carlos O'Donell wrote: >> Andreas, >> >> Does this look good to you? >> >> I admit it is odd that the converter returns 0-bytes consumed on the >> second conversion, it could be enhanced to correctly show that we're >> really consume 1-byte at a time, but that's another bug. >> > Tom, > > Thanks for your review! > > If you want you can give a "Reviewed-by:" Thanks, Carlos! Reviewed-by: Tom Honermann Tom. > > Andreas, > > I'd like to see a review from you given your experience with BIG5. > > v2 > - Fixed typo. > - Added mbsinit test (thanks Tom). > - Fixed logical flaw in test loop in do_test. > >> 8< --- 8< --- 8< --- > From 40353453e97c321b20e683939c919e6d997fffcb Mon Sep 17 00:00:00 2001 > From: Carlos O'Donell > Date: Fri, 27 Mar 2020 17:03:36 -0400 > Subject: [PATCH] Reset converter state after second wchar_t output (Bug 25734) > > An input BIG5-HKSCS character may be converted into at most 2 whcar_t > characters. After outputting the second whcar_t character (which was > saved in the converter state) we must reset the state. If we fail > to reset the state we will be stuck continually copying that > character to the output even if we have further input to consider. > > We add a new test case that covers the 4 BIG5-HKSCS characters > that may become 2 wchar_t characters. > --- > iconvdata/Makefile | 6 +- > iconvdata/big5hkscs.c | 3 + > iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c | 160 ++++++++++++++++++++++ > 3 files changed, 168 insertions(+), 1 deletion(-) > create mode 100644 iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c > > diff --git a/iconvdata/Makefile b/iconvdata/Makefile > index c83962f351..71ddef09db 100644 > --- a/iconvdata/Makefile > +++ b/iconvdata/Makefile > @@ -73,7 +73,7 @@ modules.so := $(addsuffix .so, $(modules)) > ifeq (yes,$(build-shared)) > tests = bug-iconv1 bug-iconv2 tst-loading tst-e2big tst-iconv4 bug-iconv4 \ > tst-iconv6 bug-iconv5 bug-iconv6 tst-iconv7 bug-iconv8 bug-iconv9 \ > - bug-iconv10 bug-iconv11 bug-iconv12 > + bug-iconv10 bug-iconv11 bug-iconv12 tst-iconv-big5-hkscs-to-2ucs4 > ifeq ($(have-thread-library),yes) > tests += bug-iconv3 > endif > @@ -340,3 +340,7 @@ tst-tables-clean: > > $(objpfx)gconv-modules: gconv-modules > cat $(sysdeps-gconv-modules) $^ > $@ > + > +# Test requires BIG5HKSCS. > +$(objpfx)tst-iconv-big5-hkscs-to-2ucs4.out: $(objpfx)gconv-modules \ > + $(addprefix $(objpfx),BIG5HKSCS.so) > diff --git a/iconvdata/big5hkscs.c b/iconvdata/big5hkscs.c > index 01fcfeba76..ef325119b1 100644 > --- a/iconvdata/big5hkscs.c > +++ b/iconvdata/big5hkscs.c > @@ -17895,6 +17895,9 @@ static struct > else \ > ++inptr; \ > } \ > + else \ > + /* Clear the queue and proceed to output the saved character. */ \ > + *statep = 0; \ > \ > put32 (outptr, ch); \ > outptr += 4; \ > diff --git a/iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c b/iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c > new file mode 100644 > index 0000000000..8389adebf2 > --- /dev/null > +++ b/iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c > @@ -0,0 +1,160 @@ > +/* Verify the BIG5HKSCS outputs that generate 2 wchar_t's (Bug 25734). > + Copyright (C) 2020 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + . */ > + > +#include > +#include > +#include > +#include > +#include > +#include > + > +/* A few BIG5-HKSCS characters map in two unicode code points. > + They are: > + /x88/x62 => > + /x88/x64 => > + /x88/xa3 => > + /x88/xa5 => > + Each of these is special cased in iconvdata/big5hkscs.c. > + This test ensures that we correctly reset the shift state after > + outputting any of these characters. We do this by converting > + each them followed by converting an ASCII character. If we fail > + to reset the shift state (bug 25734) then we'll see the last > + character in the queue output again. */ > + > +/* Each test has name, input bytes, and expected wide character > + output. */ > +struct testdata { > + const char *name; > + const char input[3]; > + wchar_t expected[3]; > +}; > + > +/* In BIG5-HKSCS (2008) there are 4 characters that generate multiple > + wide characters. */ > +struct testdata tests[4] = { > + /* X => X */ > + { "", "\x88\x62\x58", { 0x00CA, 0x0304, 0x0058 } }, > + /* X => X */ > + { "", "\x88\x64\x58", { 0x00CA, 0x030C, 0x0058 } }, > + /* X => X */ > + { "", "\x88\xa3\x58", { 0x00EA, 0x0304, 0x0058 } }, > + /* X => X */ > + { "", "\x88\xa5\x58", { 0x00EA, 0x030C, 0x0058 } } > +}; > + > +/* Each test is of the form: > + - Translate first code sequence (two bytes) > + - Translate second (zero bytes) > + - Translate the third (one byte). */ > +static int > +check_conversion (struct testdata test) > +{ > + int err = 0; > + wchar_t wc; > + mbstate_t st; > + size_t ret; > + const char *mbs = test.input; > + int consumed = 0; > + /* Input is always 3 bytes long. */ > + int inlen = 3; > + > + memset (&st, 0, sizeof (st)); > + /* First conversion: Consumes first 2 bytes. */ > + ret = mbrtowc (&wc, mbs, inlen - consumed, &st); > + if (ret != 2) > + { > + printf ("error: First conversion consumed only %zd bytes.\n", ret); > + err++; > + } > + /* Advance the two consumed bytes. */ > + mbs += ret; > + consumed += ret; > + if (wc != test.expected[0]) > + { > + printf ("error: Result of first conversion was wrong.\n"); > + err++; > + } > + /* Second conversion: Consumes 0 bytes. */ > + ret = mbrtowc (&wc, mbs, inlen - consumed, &st); > + if (ret != 0) > + { > + printf ("error: Second conversion consumed only %zd bytes.\n", ret); > + err++; > + } > + /* Advance the zero consumed bytes. */ > + mbs += ret; > + consumed += ret; > + if (wc != test.expected[1]) > + { > + printf ("error: Result of second conversion was wrong.\n"); > + err++; > + } > + /* After the second conversion the state of the converter should be > + in the initial state. It is in the initial state because the two > + input BIG5-HKSCS bytes have been consumed and the 2 wchar_t's have > + been output. */ > + if (mbsinit (&st) == 0) > + { > + printf ("error: Converter not in initial state.\n"); > + err++; > + } > + /* Third conversion: Consumes 1 byte (it's an ASCII character). */ > + ret = mbrtowc (&wc, mbs, inlen - consumed, &st); > + if (ret != 1) > + { > + printf ("error: Third conversion consumed only %zd bytes.\n", ret); > + err++; > + } > + /* Advance the one byte. */ > + mbs += ret; > + consumed += ret; > + if (wc != test.expected[2]) > + { > + printf ("error: Result of third conversion was wrong.\n"); > + err++; > + } > + /* Return 0 if we saw no errors. */ > + return err; > +} > + > +static int > +do_test (void) > +{ > + int err = 0; > + int ret; > + /* Testing BIG5-HKSCS. */ > + xsetlocale (LC_ALL, "zh_HK.BIG5-HKSCS"); > + > + /* Run all the special conversions. */ > + for (int i = 0; i < (sizeof (tests) / sizeof (struct testdata)); i++) > + { > + printf ("Running test for %s\n", tests[i].name); > + ret = check_conversion (tests[i]); > + if (ret > 0) > + printf ("Test %s failed.\n", tests[i].name); > + err += ret; > + } > + > + /* Fail if any conversion had an error. */ > + if (err > 0) > + FAIL_EXIT1 ("One or more conversions failed."); > + > + return 0; > +} > + > +#include