public inbox for glibc-bugs@sourceware.org
From: "bruno at clisp dot org" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug locale/30611] New: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
Date: Mon, 03 Jul 2023 21:30:34 +0000
Message-ID: <bug-30611-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=30611

            Bug ID: 30611
           Summary: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
           Product: glibc
           Version: 2.35
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: bruno at clisp dot org
  Target Milestone: ---

Created attachment 14953
  --> https://sourceware.org/bugzilla/attachment.cgi?id=14953&action=edit
test case foo.c

The BIG5-HKSCS encoding is special because it is the only locale encoding supported by glibc in which some multibyte sequences map not to one Unicode character, but to a sequence of two Unicode characters. The file glibc/iconvdata/BIG5HKSCS.precomposed contains these mappings. One of them is 0x88 0x62, which corresponds to U+00CA U+0304.

The mbrtowc() function cannot cope with this situation, since it was designed for the case that a multibyte sequence maps to *one* wchar_t only. Some discussion about this occurred in
https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html
and
https://sourceware.org/bugzilla/show_bug.cgi?id=25744#c3

But the mbrtoc32() function, added in ISO C 11, has a special return code, (size_t)(-3), that was introduced to cope with this situation.
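For illustration, here is a minimal sketch of how an application loop could consume the (size_t)(-3) return code under the ISO C 11 semantics. It is not taken from the attached foo.c. The helper name decode_c32 and its error convention are hypothetical, and treating a return value of 0 as one consumed NUL byte is an assumption that holds for the usual locale encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: decode the n bytes at s into out[] (capacity cap).
   Returns the number of char32_t values stored, or (size_t)-1 on an
   invalid or truncated sequence or when out[] is too small.  */
size_t decode_c32(const char *s, size_t n, char32_t *out, size_t cap)
{
    mbstate_t st;
    size_t count = 0;

    assert(s != NULL && out != NULL);
    memset(&st, 0, sizeof st);          /* start in the initial state */

    /* Keep going while input remains OR the state still holds a pending
       character; in the latter case mbrtoc32 is called with 0 bytes.  */
    while (n > 0 || !mbsinit(&st)) {
        char32_t c;
        size_t r = mbrtoc32(&c, s, n, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete input */
        if (count == cap)
            return (size_t)-1;          /* no room left in out[] */
        out[count++] = c;

        if (r == (size_t)-3)
            continue;                   /* extra character, no input consumed */
        if (r == 0)
            r = 1;                      /* assumption: NUL occupies one byte */
        s += r;
        n -= r;
    }
    return count;
}
```

With an implementation that follows ISO C 11 here, decoding the two bytes 0x88 0x62 in the zh_HK.BIG5-HKSCS locale would be expected to store U+00CA and then U+0304. With the glibc behaviour reported in this bug, the follow-up call presumably returns (size_t)(-2) instead, so this helper would report an error.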
Here is a test case (foo.c is attached):

$ gcc -Wall foo.c
$ ./a.out

Actual output (with glibc 2.35, 2.36):

mbsinit() = 1
mbrtowc return value = 2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = -2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = 0, wc = 0x0304, mbsinit() = 1
mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = 0, c32 = 0x0304, mbsinit() = 1

For the second part, regarding mbrtoc32, the output should be:

mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -3, c32 = 0x0304, mbsinit() = 1
mbrtoc32 return value = 0, c32 = 0x0000, mbsinit() = 1

That is:
1) The first call to mbrtoc32 is operating correctly.
2) The second call to mbrtoc32 should produce the second Unicode character, return (size_t)(-3), and put the state back into the initial state.
3) The third call should eat the NUL byte and produce U+0000, like it always does.

BUT: It has been observed that this behaviour of mbrtoc32() introduces a complexity in application code: After mbrtoc32() has returned an integer in the range 1..MB_LEN_MAX, the application needs to call !mbsinit() to see whether there is some non-trivial conversion state, and if so, call mbrtoc32() once again, with 0 additional input bytes.

When this problem occurred with the Vietnamese vi_VN.TCVN5712-1 locale, in 2012, the decision was to remove that locale from the glibc/localedata/SUPPORTED file. See bug #13691 and the discussion that started at https://sourceware.org/legacy-ml/libc-alpha/2012-05/msg00736.html .

So, there are two options:
(a) Change mbrtoc32 as described above.
(b) Remove zh_HK/BIG5-HKSCS.

I vote for (b), because
- It minimizes complexity for applications: Programs don't need to handle (size_t)(-3) since it will never occur. (Other platform libcs don't return (size_t)(-3). I checked all of them.)
- It keeps the code in libc simple: mbrtoc32 remains nearly identical to mbrtowc.
- The vast majority of users have migrated to UTF-8 locales. Even the GB18030 locale, which at one time was heavily pushed by the Chinese government, is not even selectable or installed in current GNU/Linux distros (see https://lists.gnu.org/archive/html/bug-gnulib/2023-05/msg00105.html ).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Thread overview: 3+ messages
2023-07-03 21:30 bruno at clisp dot org [this message]
2023-07-03 21:31 ` [Bug locale/30611] " bruno at clisp dot org
2023-07-12  1:59 ` eggert at cs dot ucla.edu