public inbox for glibc-bugs@sourceware.org
From: "bruno at clisp dot org" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug locale/30611] New: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
Date: Mon, 03 Jul 2023 21:30:34 +0000
Message-ID: <bug-30611-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=30611

            Bug ID: 30611
           Summary: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS
                    locale
           Product: glibc
           Version: 2.35
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: bruno at clisp dot org
  Target Milestone: ---

Created attachment 14953
  --> https://sourceware.org/bugzilla/attachment.cgi?id=14953&action=edit
test case foo.c

The BIG5-HKSCS encoding is special: it is the only locale encoding supported
by glibc in which some multibyte sequences map not to one Unicode character
but to a sequence of two Unicode characters.

The file glibc/iconvdata/BIG5HKSCS.precomposed contains these mappings. One of
them is 0x88 0x62, which corresponds to U+00CA U+0304.
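
One way to look at this mapping outside the locale machinery is through
iconv(3). The following is only a sketch: the helper name
big5hkscs_to_codepoints is made up for illustration, and it assumes that a
"BIG5-HKSCS" converter and a "UTF-32LE" converter are installed (if not, it
reports -1). It is meant for short inputs only, since it uses a small fixed
buffer.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the bug report): convert n bytes of
   BIG5-HKSCS text to Unicode code points via iconv.  Returns the number
   of code points stored in out[], or -1 if the converter is unavailable
   or the conversion fails.  */
int big5hkscs_to_codepoints(const char *in, size_t n,
                            uint32_t *out, size_t cap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-32LE", "BIG5-HKSCS");
    if (cd == (iconv_t)(-1))
        return -1;                      /* converter not installed */

    char buf[64];                       /* room for 16 code points */
    char *inp = (char *)in;             /* iconv() takes a non-const pointer */
    char *outp = buf;
    size_t inleft = n, outleft = sizeof buf;

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (r != (size_t)(-1))
        r = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush pending output */
    iconv_close(cd);
    if (r == (size_t)(-1))
        return -1;

    size_t count = (size_t)(outp - buf) / 4;
    if (count > cap)
        count = cap;
    for (size_t i = 0; i < count; i++) {
        /* Decode little-endian UTF-32 by hand, independent of host order.  */
        const unsigned char *p = (const unsigned char *)buf + 4 * i;
        out[i] = (uint32_t)p[0] | (uint32_t)p[1] << 8
               | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }
    return (int)count;
}
```

Calling it with the two bytes 0x88 0x62 should, if the converter follows the
mapping quoted above, yield two code points for a single two-byte input.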

The mbrtowc() function cannot cope with this situation, since it was designed
for the case where a multibyte sequence maps to only *one* wchar_t. Some
discussion of this occurred in
https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html
and https://sourceware.org/bugzilla/show_bug.cgi?id=25744#c3

But the mbrtoc32() function, added in ISO C11, has a special return value,
(size_t)(-3), that was introduced to cope with this situation.
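
For illustration, a decoding loop that takes every mbrtoc32() return value
into account might look roughly like this. It is a minimal sketch, not code
from the attached test case: the helper name decode_c32 is hypothetical, it
decodes in whatever locale is currently set, and it assumes that a NUL
character occupies a single byte. It does not deal with a code point that is
still pending in the state once all input bytes have been consumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: decode the n bytes at s, in the current locale's
   encoding, into out[0..cap-1].  Returns the number of code points
   stored, or (size_t)(-1) on an invalid or truncated sequence.  */
size_t decode_c32(const char *s, size_t n, char32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* initial conversion state */
    size_t pos = 0, count = 0;

    while (pos < n && count < cap) {
        char32_t c;
        size_t r = mbrtoc32(&c, s + pos, n - pos, &st);

        if (r == (size_t)(-1) || r == (size_t)(-2))
            return (size_t)(-1);        /* invalid, or incomplete at the end */

        out[count++] = c;
        if (r == (size_t)(-3))
            continue;                   /* c came from the saved state;
                                           no input bytes were consumed */
        pos += (r == 0) ? 1 : r;        /* r == 0: a NUL byte was consumed
                                           (assumed to be one byte long) */
    }
    return count;
}
```

The point of the (size_t)(-3) branch is that the loop stores a second code
point without advancing the input position.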

Here is a test case (foo.c is attached):
$ gcc -Wall foo.c
$ ./a.out

Actual output (with glibc 2.35, 2.36):

mbsinit() = 1
mbrtowc return value = 2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = -2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = 0, wc = 0x0304, mbsinit() = 1
mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = 0, c32 = 0x0304, mbsinit() = 1

For the second part, regarding mbrtoc32, the output should be:

mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -3, c32 = 0x0304, mbsinit() = 1
mbrtoc32 return value = 0, c32 = 0x0000, mbsinit() = 1

That is:
1) The first call to mbrtoc32 is operating correctly.
2) The second call to mbrtoc32 should produce the second Unicode character,
return (size_t)(-3), and return the state to the initial state.
3) The third call should consume the NUL byte and produce U+0000, as it always
does.

BUT: It has been observed that this behaviour of mbrtoc32() introduces
complexity in application code: after mbrtoc32() has returned an integer in the
range 1..MB_LEN_MAX, the application needs to test whether mbsinit() returns 0,
that is, whether some non-trivial conversion state remains, and if so, call
mbrtoc32() once again, with 0 additional input bytes.
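
The extra step an application would need can be sketched as follows. This
assumes the (size_t)(-3) behaviour requested in this report, which glibc 2.35
and 2.36 do not show. The helper name fetch_pending_c32 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: to be called after mbrtoc32() has returned a value
   in the range 1..MB_LEN_MAX.  If the state still holds a pending code
   point, fetch it with zero additional input bytes.  Returns 1 and
   stores the code point in *out if one was delivered, 0 otherwise.  */
int fetch_pending_c32(mbstate_t *st, char32_t *out)
{
    assert(st != NULL && out != NULL);

    if (mbsinit(st))
        return 0;                       /* initial state: nothing pending */

    char32_t c;
    size_t r = mbrtoc32(&c, "", 0, st); /* 0 additional input bytes */
    if (r != (size_t)(-3))
        return 0;                       /* e.g. (size_t)(-2): the state held
                                           only a partial byte sequence */
    *out = c;
    return 1;
}
```

Every caller that decodes character by character would have to insert such a
check after each successful call, which is the complexity referred to above.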

When this problem occurred with the Vietnamese vi_VN.TCVN5712-1 locale, in
2012, the decision was to remove that locale from the
glibc/localedata/SUPPORTED file. See bug #13691 and the discussion that started
at https://sourceware.org/legacy-ml/libc-alpha/2012-05/msg00736.html .

So, there are two options:

(a) Change mbrtoc32 as described above.

(b) Remove zh_HK/BIG5-HKSCS, as was done for the Vietnamese locale.

I vote for (b), because
  - It minimizes complexity for applications: Programs don't need to handle
(size_t)(-3) since it will never occur. (Other platform libcs don't return
(size_t)(-3). I checked all of them.)
  - It keeps the code in libc simple: mbrtoc32 remains nearly identical to
mbrtowc.
  - The vast majority of users have migrated to UTF-8 locales. Even the GB18030
locale, which was at one time heavily promoted by the Chinese government, is
not even selectable or installed in current GNU/Linux distros
(see https://lists.gnu.org/archive/html/bug-gnulib/2023-05/msg00105.html ).
