public inbox for glibc-bugs@sourceware.org
* [Bug locale/30611] New: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
@ 2023-07-03 21:30 bruno at clisp dot org
  2023-07-03 21:31 ` [Bug locale/30611] " bruno at clisp dot org
  2023-07-12  1:59 ` eggert at cs dot ucla.edu
  0 siblings, 2 replies; 3+ messages in thread
From: bruno at clisp dot org @ 2023-07-03 21:30 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30611

            Bug ID: 30611
           Summary: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS
                    locale
           Product: glibc
           Version: 2.35
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: bruno at clisp dot org
  Target Milestone: ---

Created attachment 14953
  --> https://sourceware.org/bugzilla/attachment.cgi?id=14953&action=edit
test case foo.c

The BIG5-HKSCS encoding is special because it is the only locale encoding
supported by glibc in which some multibyte sequences map not to one Unicode
character, but to a sequence of two Unicode characters.

The file glibc/iconvdata/BIG5HKSCS.precomposed contains these mappings. One of
them is 0x88 0x62, which corresponds to U+00CA U+0304.

The mbrtowc() function cannot cope with this situation, since it was designed
for the case that a multibyte sequence maps to *one* wchar_t only. Some
discussion about this occurred in
https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html
and https://sourceware.org/bugzilla/show_bug.cgi?id=25744#c3

But the mbrtoc32() function, added in ISO C 11, has a special return code,
(size_t)(-3), that was introduced to cope with this situation.

Here is a test case (foo.c is attached):
$ gcc -Wall foo.c
$ ./a.out

Actual output (with glibc 2.35, 2.36):

mbsinit() = 1
mbrtowc return value = 2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = -2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = 0, wc = 0x0304, mbsinit() = 1
mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = 0, c32 = 0x0304, mbsinit() = 1

For the second part, regarding mbrtoc32, the output should be:

mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -3, c32 = 0x0304, mbsinit() = 1
mbrtoc32 return value = 0, c32 = 0x0000, mbsinit() = 1

That is:
1) The first call to mbrtoc32 is operating correctly.
2) The second call to mbrtoc32 should produce the second Unicode character,
return (size_t)(-3), and put the state back into initial state.
3) The third call should eat the NUL byte and produce U+0000, like it always
does.
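
The three steps above can be turned into a small probe. What follows is a
sketch, not the attached foo.c: the function name probe_hkscs_minus3 and its
return convention are invented here, and it assumes the locale is installed
under the name zh_HK.BIG5-HKSCS.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical probe (not the attached foo.c).
   Returns -1 if the locale is not installed, 1 if the three calls behave
   as proposed in this report, and 0 otherwise. */
int probe_hkscs_minus3(void)
{
    if (setlocale(LC_ALL, "zh_HK.BIG5-HKSCS") == NULL)
        return -1;

    static const char input[] = "\x88\x62"; /* 0x88 0x62, then the NUL byte */
    mbstate_t state;
    memset(&state, 0, sizeof state);
    assert(mbsinit(&state));                /* a zeroed state is initial */

    char32_t c32 = 0;
    int ok = 0;

    /* Call 1: both bytes consumed, U+00CA stored, state not initial. */
    size_t r1 = mbrtoc32(&c32, input, 2, &state);
    if (r1 == 2 && c32 == 0x00CA && !mbsinit(&state)) {
        /* Call 2: no new input, (size_t)-3, U+0304 stored, state initial. */
        size_t r2 = mbrtoc32(&c32, input + 2, 0, &state);
        if (r2 == (size_t)-3 && c32 == 0x0304 && mbsinit(&state)) {
            /* Call 3: the NUL byte is consumed and U+0000 is stored. */
            size_t r3 = mbrtoc32(&c32, input + 2, 1, &state);
            ok = (r3 == 0 && c32 == 0);
        }
    }

    setlocale(LC_ALL, "C");                 /* leave the process in the C locale */
    return ok;
}
```

With the glibc versions mentioned above, the second call returns (size_t)(-2),
so the probe would be expected to return 0 where the locale is installed.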

BUT: It has been observed that this behaviour of mbrtoc32() introduces a
complexity in application code: After mbrtoc32() has returned an integer in the
range 1..MB_LEN_MAX, the application needs to call !mbsinit() to see whether
there is some non-trivial conversion state, and if so, call mbrtoc32() once
again, with 0 additional input bytes.
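
To make the extra complexity concrete, here is a minimal sketch of a decoding
loop that tolerates (size_t)(-3). The helper name decode_c32 and its signature
are hypothetical. Instead of testing !mbsinit() explicitly, this variant simply
calls mbrtoc32() again even when no input bytes remain, and relies on the
return value to tell a pending second character from exhausted input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: decode at most n bytes of s in the current locale
   into out[0..cap-1]. Stops at a NUL, at an invalid or incomplete sequence,
   or when out is full. Returns the number of char32_t values stored. */
size_t decode_c32(const char *s, size_t n, char32_t *out, size_t cap)
{
    assert(s != NULL && (out != NULL || cap == 0));

    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t count = 0;

    while (count < cap) {
        char32_t c;
        size_t r = mbrtoc32(&c, s, n, &state);

        if (r == (size_t)-1 || r == (size_t)-2)
            break;              /* invalid, incomplete, or input exhausted */
        if (r == (size_t)-3) {
            out[count++] = c;   /* second character of a pair, no bytes consumed */
            continue;
        }
        if (r == 0)
            break;              /* NUL character: end of string */

        out[count++] = c;       /* ordinary case: r bytes consumed */
        s += r;
        n -= r;
    }
    return count;
}
```

The (size_t)(-3) branch is the part that callers of mbrtowc() never needed,
and it is dead code on every platform that never returns that value.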

When this problem occurred with the Vietnamese vi_VN.TCVN5712-1 locale, in
2012, the decision was to remove that locale from the
glibc/localedata/SUPPORTED file. See bug #13691 and the discussion that started
at https://sourceware.org/legacy-ml/libc-alpha/2012-05/msg00736.html .

So, there are two options:

(a) Change mbrtoc32 as described above.

(b) Remove zh_HK/BIG5-HKSCS.

I vote for (b), because
  - It minimizes complexity for applications: Programs don't need to handle
(size_t)(-3) since it will never occur. (Other platform libcs don't return
(size_t)(-3). I checked all of them.)
  - It keeps the code in libc simple: mbrtoc32 remains nearly identical to
mbrtowc.
  - The vast majority of users have migrated to UTF-8 locales. Even the GB18030
locale, which was once heavily promoted by the Chinese government, is neither
selectable nor installed in current GNU/Linux distros
(see https://lists.gnu.org/archive/html/bug-gnulib/2023-05/msg00105.html ).

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug locale/30611] mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
  2023-07-03 21:30 [Bug locale/30611] New: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale bruno at clisp dot org
@ 2023-07-03 21:31 ` bruno at clisp dot org
  2023-07-12  1:59 ` eggert at cs dot ucla.edu
  1 sibling, 0 replies; 3+ messages in thread
From: bruno at clisp dot org @ 2023-07-03 21:31 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30611

Bruno Haible <bruno at clisp dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
               Host|                            |x86_64-linux-gnu
              Build|                            |x86_64-linux-gnu
            Version|2.35                        |2.36
             Target|                            |x86_64-linux-gnu


* [Bug locale/30611] mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
  2023-07-03 21:30 [Bug locale/30611] New: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale bruno at clisp dot org
  2023-07-03 21:31 ` [Bug locale/30611] " bruno at clisp dot org
@ 2023-07-12  1:59 ` eggert at cs dot ucla.edu
  1 sibling, 0 replies; 3+ messages in thread
From: eggert at cs dot ucla.edu @ 2023-07-12  1:59 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=30611

eggert at cs dot ucla.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eggert at cs dot ucla.edu

--- Comment #1 from eggert at cs dot ucla.edu ---
(In reply to Bruno Haible from comment #0)

> https://lists.gnu.org/archive/html/bug-gnulib/2023-05/msg00105.html

That URL also suggests that BIG5-HKSCS is not choosable or useful in current
GNU/Linux distributions, which is another argument for option (b), i.e.,
remove zh_HK/BIG5-HKSCS.

As a practical matter, GNU diffutils will not support mbrtoc32's returning
((size_t) -3) as it's too painful to write and maintain the calling code, and
any rewrite would likely implicate performance as well as correctness. I expect
other applications are similar. So even if glibc supported the four unusual
character pairs in BIG5-HKSCS[1], typical applications would misbehave anyway.

So option (a) is not realistic, whereas option (b) is simple and is a
maintenance win.

[1]: https://lists.gnu.org/r/bug-gnulib/2023-07/msg00014.html

