From: "bruno at clisp dot org"
To: glibc-bugs@sourceware.org
Subject: [Bug locale/30611] New: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
Date: Mon, 03 Jul 2023 21:30:34 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=30611

            Bug ID: 30611
           Summary: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS
                    locale
           Product: glibc
           Version: 2.35
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: bruno at clisp dot org
  Target Milestone: ---

Created attachment 14953
  --> https://sourceware.org/bugzilla/attachment.cgi?id=14953&action=edit
test
case foo.c

The BIG5-HKSCS encoding is special: it is the only locale encoding
supported by glibc in which some multibyte sequences map not to one
Unicode character, but to a sequence of two Unicode characters.

The file glibc/iconvdata/BIG5HKSCS.precomposed contains these mappings.
One of them is 0x88 0x62, which corresponds to U+00CA U+0304.

The mbrtowc() function cannot cope with this situation, since it was
designed for the case where a multibyte sequence maps to *one* wchar_t
only. Some discussion about this occurred in
  https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html
and
  https://sourceware.org/bugzilla/show_bug.cgi?id=25744#c3

But the mbrtoc32() function, added in ISO C11, has a special return
value, (size_t)(-3), that was introduced to cope with this situation.

Here is a test case (foo.c is attached):

$ gcc -Wall foo.c
$ ./a.out

Actual output (with glibc 2.35, 2.36):

mbsinit() = 1
mbrtowc return value = 2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = -2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = 0, wc = 0x0304, mbsinit() = 1
mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = 0, c32 = 0x0304, mbsinit() = 1

For the second part, regarding mbrtoc32, the output should be:

mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -3, c32 = 0x0304, mbsinit() = 1
mbrtoc32 return value = 0, c32 = 0x0000, mbsinit() = 1

That is:
1) The first call to mbrtoc32 is operating correctly.
2) The second call to mbrtoc32 should produce the second Unicode
   character, return (size_t)(-3), and put the state back into the
   initial state.
3) The third call should eat the NUL byte and produce U+0000, as it
   always does.
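The attached foo.c is not reproduced in this message. For readers without the attachment, the following is only a sketch of a probe that makes the same kind of three-call sequence for mbrtoc32. The call pattern (whole sequence, then 0 bytes, then the NUL byte) is an assumption inferred from the output above, and the names probe_mbrtoc32, print_probe and struct probe_result are hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Results of the three mbrtoc32 calls made by probe_mbrtoc32(). */
struct probe_result {
    size_t   ret[3];   /* return value of each call */
    char32_t c32[3];   /* character stored by each call (0 if none) */
    int      init[3];  /* 1 if the state was initial after the call */
};

/* Hypothetical helper, not the attached foo.c.  `bytes` must hold `len`
   bytes followed by a NUL byte.  Returns 0 on success, or -1 if the
   requested locale cannot be selected (e.g. it is not installed). */
static int probe_mbrtoc32(const char *locale_name, const char *bytes,
                          size_t len, struct probe_result *res)
{
    assert(bytes != NULL && res != NULL);
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    mbstate_t state;
    memset(&state, 0, sizeof state);

    /* Call 1: the whole multibyte sequence.
       Call 2: zero additional bytes, to fetch a pending character.
       Call 3: the terminating NUL byte.
       (Assumed call pattern; see the note above.) */
    const char  *ptr[3] = { bytes, bytes + len, bytes + len };
    const size_t cnt[3] = { len, 0, 1 };

    for (int i = 0; i < 3; i++) {
        char32_t c = 0;
        res->ret[i]  = mbrtoc32(&c, ptr[i], cnt[i], &state);
        res->c32[i]  = c;
        res->init[i] = mbsinit(&state) != 0;
    }
    return 0;
}

/* Print the results in the style of the output quoted in this report.
   The cast to long shows (size_t)(-2) and (size_t)(-3) as -2 and -3 on
   common platforms; strictly, that conversion is implementation-defined. */
static void print_probe(const struct probe_result *res)
{
    for (int i = 0; i < 3; i++)
        printf("mbrtoc32 return value = %ld, c32 = 0x%04lX, mbsinit() = %d\n",
               (long) res->ret[i], (unsigned long) res->c32[i], res->init[i]);
}
```

On a system where the locale is installed, one would call probe_mbrtoc32("zh_HK.BIG5-HKSCS", "\x88\x62", 2, &res) followed by print_probe(&res), and compare the three return values with the sequence 2, -3, 0 requested above.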

BUT: It has been observed that this behaviour of mbrtoc32() introduces
complexity in application code: After mbrtoc32() has returned an integer
in the range 1..MB_LEN_MAX, the application needs to test !mbsinit() to
see whether there is some non-trivial conversion state and, if so, call
mbrtoc32() once again, with 0 additional input bytes.

When this problem occurred with the Vietnamese vi_VN.TCVN5712-1 locale,
in 2012, the decision was to remove that locale from the
glibc/localedata/SUPPORTED file. See bug #13691 and the discussion that
started at
https://sourceware.org/legacy-ml/libc-alpha/2012-05/msg00736.html .

So, there are two options:
(a) Change mbrtoc32 as described above.
(b) Remove zh_HK/BIG5-HKSCS.

I vote for (b), because
  - It minimizes complexity for applications: Programs don't need to
    handle (size_t)(-3) since it will never occur. (Other platform libcs
    don't return (size_t)(-3). I checked all of them.)
  - It keeps the code in libc simple: mbrtoc32 remains nearly identical
    to mbrtowc.
  - The vast majority of users have migrated to UTF-8 locales. Even the
    GB18030 locale, which at one time was pushed so hard by the Chinese
    government, is not even selectable or installed in current GNU/Linux
    distros (see
    https://lists.gnu.org/archive/html/bug-gnulib/2023-05/msg00105.html ).
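
To make the application-side cost of option (a) concrete, here is a minimal sketch of a decoding loop that follows the rule described earlier: after a return value in the range 1..MB_LEN_MAX, test !mbsinit() and call mbrtoc32() again with 0 additional input bytes. The function name decode_to_c32 is hypothetical, and the loop assumes the (size_t)(-3) behaviour that ISO C11 specifies, which glibc 2.35/2.36 do not currently provide for this locale.

```c
#include <assert.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: decode up to `n` bytes at `s` in the current
   locale into `out` (capacity `cap`).  Stops at a NUL character, at
   invalid or truncated input, or when `out` is full.  Returns the
   number of char32_t values stored. */
static size_t decode_to_c32(const char *s, size_t n,
                            char32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t count = 0;

    while (n > 0 && count < cap) {
        char32_t c = 0;
        size_t ret = mbrtoc32(&c, s, n, &state);

        if (ret == (size_t) -1 || ret == (size_t) -2 || ret == 0)
            break;              /* invalid, truncated, or NUL: stop */
        if (ret == (size_t) -3) {
            out[count++] = c;   /* pending character, no input consumed */
            continue;
        }

        out[count++] = c;       /* ordinary case: ret bytes consumed */
        s += ret;
        n -= ret;

        /* The extra step: a non-initial state may mean that a second
           character is pending.  Ask for it with 0 additional bytes. */
        while (!mbsinit(&state) && count < cap) {
            if (mbrtoc32(&c, s, 0, &state) != (size_t) -3)
                break;          /* state was non-initial for another reason */
            out[count++] = c;
        }
    }
    return count;
}
```

With option (b), the inner while loop and the (size_t)(-3) branch would be dead code, and the body reduces to the same shape as an mbrtowc() loop.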