From: "tom at honermann dot net"
To: glibc-bugs@sourceware.org
Subject: [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
Date: Sun, 29 Mar 2020 04:28:58 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25744

            Bug ID: 25744
           Summary: mbrtowc with Big5-HKSCS returns 2 instead of 1 when
                    consuming the second byte of certain double byte
                    characters
           Product: glibc
           Version: 2.31
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: tom at honermann dot net
  Target Milestone: ---

The following test case demonstrates an issue with the Big5-HKSCS converter
that occurs when certain double byte characters are consumed one
byte at a time.

Quoting from the C18 standard for convenience:

7.29.6.3.2p4 states:
> The mbrtowc function returns the first of the following that applies
> (given the current conversion state):
>
> 0 if the next n or fewer bytes complete the multibyte character that
> corresponds to the null wide character (which is the value stored).
>
> between 1 and n inclusive if the next n or fewer bytes complete a valid
> multibyte character (which is the value stored); the value returned is
> the number of bytes that complete the multibyte character.
>
> (size_t)(-2) if the next n bytes contribute to an incomplete (but
> potentially valid) multibyte character, and all n bytes have been
> processed (no value is stored).
>
> (size_t)(-1) if an encoding error occurs, in which case the next n or
> fewer bytes do not contribute to a complete and valid multibyte
> character (no value is stored); the value of the macro EILSEQ is stored
> in errno, and the conversion state is unspecified.

The following test case converts the Big5-HKSCS double byte character 0x88
0x62 one byte at a time. This is one of the special double byte characters
that converts to two Unicode code points (U+00CA U+0304). When consuming the
second byte, mbrtowc() returns a value of 2, but 1 should be returned since
only one byte was consumed; the wording quoted above doesn't allow for the
return value to be larger than the value of 'n' that was passed in.

I don't know if this issue only occurs for double byte characters that map to
multiple Unicode code points.

$ cat t.c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main() {
  /* This test case demonstrates glibc's current (2.31) behavior when
     attempting to convert, one byte at a time, Big5-HKSCS input containing
     a double byte sequence that maps to multiple Unicode code points. The
     first call to mbrtowc() consumes the first byte and returns a value of
     -2 indicating an incomplete multibyte character as expected.
     However, the second call, with an input length of 1, consumes the second
     byte, recognizes completion of the previously incomplete character,
     writes the mapped Unicode code point, and then returns 2. The return
     value of 2 is surprising since only 1 byte was read. This seems to
     violate the C standard since the return value is greater than the input
     length. This issue does not occur for all double byte characters. */

  if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
    perror("setlocale");
    return 1;
  }

  const char *mbs;
  wchar_t wc;
  mbstate_t s;
  size_t result;

  mbs = "\x88\x62";
  memset(&s, 0, sizeof(s));

  /* This call to mbrtowc() consumes the first byte and returns -2 indicating
     that a potentially valid but incomplete character was read. This is
     expected behavior. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("1st mbrtowc call:\n");
  printf("  result: %zd (-2 expected)\n", result);
  assert(result == (size_t) -2);
  mbs += 1;

  /* This call consumes the second byte to complete the double byte character
     and writes the Unicode code point. The C standard requires a return
     value of 1 since only one byte was consumed by this call, but glibc
     returns 2 (presumably corresponding to the total number of bytes that
     contributed to the multibyte character). */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("2nd mbrtowc call:\n");
  printf("  result: %zd (1 expected, glibc returns 2)\n", result);
  printf("  wc: 0x%04X (0x00CA expected)\n", (unsigned)wc);
  mbs += 1;
  assert(result == (size_t) 2); /* Current glibc behavior */
  assert(wc == 0x00CA);

  /* This call writes the second Unicode code point, but does not consume
     any input. 0 is returned since no input is consumed. According to the
     C standard, a return of 0 is reserved for when a null character is
     written, but since the C standard doesn't acknowledge the existence of
     characters that can't be represented in a single wchar_t, we're already
     operating outside the scope of the standard. Returning 0 seems
     reasonable to me.
     Returning -3 as mbrtoc16() does would also be reasonable. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("3rd mbrtowc call:\n");
  printf("  result: %zd (0 expected)\n", result);
  printf("  wc: 0x%04X (0x0304 expected)\n", (unsigned)wc);
  mbs += 0;
  assert(result == (size_t) 0);
  assert(wc == 0x0304);

  /* Attempting to process any further input would run afoul of a separate
     issue tracked by https://sourceware.org/bugzilla/show_bug.cgi?id=25734. */
}

$ gcc t.c -o t
$ ./t
1st mbrtowc call:
  result: -2 (-2 expected)
2nd mbrtowc call:
  result: 2 (1 expected, glibc returns 2)
  wc: 0x00CA (0x00CA expected)
3rd mbrtowc call:
  result: 0 (0 expected)
  wc: 0x0304 (0x0304 expected)

As noted in the code comments and process output, the return value for the
second call to mbrtowc() is unexpected.
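
For reference, the return value rule quoted from 7.29.6.3.2p4 can be written
as a small check. This is only a sketch: classify_mbrtowc_result() and the
mbr_class enum are hypothetical names made up for this illustration, and they
are not part of glibc or the C standard library. The function takes the value
returned by mbrtowc() and the 'n' that was passed in, and reports which of
the four documented cases applies, or that the value falls outside all of
them.

```c
#include <stddef.h>

/* Hypothetical classification of an mbrtowc() return value. The first four
   enumerators mirror the cases listed in 7.29.6.3.2p4; MBR_OUT_OF_RANGE
   marks a value that none of those cases permits. */
enum mbr_class {
  MBR_NUL,          /* 0: the null wide character was stored */
  MBR_COMPLETE,     /* between 1 and n inclusive */
  MBR_INCOMPLETE,   /* (size_t)(-2) */
  MBR_ERROR,        /* (size_t)(-1) */
  MBR_OUT_OF_RANGE  /* larger than n, yet neither -1 nor -2 */
};

/* 'result' is the value returned by mbrtowc(); 'n' is the byte count that
   was passed to that same call. */
static enum mbr_class classify_mbrtowc_result(size_t result, size_t n)
{
  if (result == (size_t) -1)
    return MBR_ERROR;
  if (result == (size_t) -2)
    return MBR_INCOMPLETE;
  if (result == 0)
    return MBR_NUL;
  if (result <= n)
    return MBR_COMPLETE;
  return MBR_OUT_OF_RANGE;
}
```

With the values from the second call in the test case above (a result of 2
with n equal to 1), this check yields MBR_OUT_OF_RANGE; a result of 1 would
yield MBR_COMPLETE.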