public inbox for glibc-bugs@sourceware.org
* [Bug locale/30611] New: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
@ 2023-07-03 21:30 bruno at clisp dot org
2023-07-03 21:31 ` [Bug locale/30611] " bruno at clisp dot org
2023-07-12 1:59 ` eggert at cs dot ucla.edu
0 siblings, 2 replies; 3+ messages in thread
From: bruno at clisp dot org @ 2023-07-03 21:30 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30611
Bug ID: 30611
Summary: mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS
locale
Product: glibc
Version: 2.35
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: locale
Assignee: unassigned at sourceware dot org
Reporter: bruno at clisp dot org
Target Milestone: ---
Created attachment 14953
--> https://sourceware.org/bugzilla/attachment.cgi?id=14953&action=edit
test case foo.c
The BIG5-HKSCS encoding is peculiar: it is the only locale encoding supported
by glibc in which some multibyte sequences map not to one Unicode character,
but to a sequence of two Unicode characters.
The file glibc/iconvdata/BIG5HKSCS.precomposed contains these mappings. One of
them is 0x88 0x62, which corresponds to U+00CA U+0304.
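The two-character mapping can be observed outside the locale functions through
iconv(). The following is a minimal sketch, not part of the attached test case:
the helper name big5hkscs_to_utf32 is hypothetical, and it assumes that the
installed iconv knows the names "BIG5-HKSCS" and "UTF-32LE" (on some
distributions the glibc converter module is packaged separately). It returns
-1 when the converter is unavailable or the conversion fails.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert in_len bytes of BIG5-HKSCS text to UTF-32
   code points.  Returns the number of code points stored in out, or -1 if
   the converter is unavailable or the conversion fails.  At most 16 code
   points are produced, because of the fixed scratch buffer.  */
int big5hkscs_to_utf32(const char *in, size_t in_len,
                       uint32_t *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    unsigned char raw[64];              /* UTF-32LE bytes, 4 per code point */
    if (out_cap > sizeof raw / 4)
        out_cap = sizeof raw / 4;

    iconv_t cd = iconv_open("UTF-32LE", "BIG5-HKSCS");
    if (cd == (iconv_t)-1)
        return -1;                      /* converter not installed */

    char *inp = (char *)in;             /* iconv() takes a non-const pointer */
    size_t in_left = in_len;
    char *outp = (char *)raw;
    size_t out_left = out_cap * 4;

    size_t r = iconv(cd, &inp, &in_left, &outp, &out_left);
    if (r != (size_t)-1)
        /* Flush any character the converter may still hold in its state. */
        r = iconv(cd, NULL, NULL, &outp, &out_left);
    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;

    /* Assemble little-endian bytes by hand, independent of host byte order. */
    size_t n = (size_t)(outp - (char *)raw) / 4;
    for (size_t i = 0; i < n; i++)
        out[i] = (uint32_t)raw[4 * i]
               | (uint32_t)raw[4 * i + 1] << 8
               | (uint32_t)raw[4 * i + 2] << 16
               | (uint32_t)raw[4 * i + 3] << 24;
    return (int)n;
}
```

Calling big5hkscs_to_utf32("\x88\x62", 2, out, 4) is expected to store the two
code points named above when the converter is present.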
The mbrtowc() function cannot cope with this situation, since it was designed
for the case that a multibyte sequence maps to *one* wchar_t only. Some
discussion about this occurred in
https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html
and https://sourceware.org/bugzilla/show_bug.cgi?id=25744#c3
But the mbrtoc32() function, added in ISO C11, has a special return value,
(size_t)(-3), that was introduced to cope with this situation.
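For reference, the return values that ISO C11 defines for mbrtoc32() can be
summarized in a small classifier. This is only an illustrative sketch; the
enum and the function name classify_mbrtoc32 are hypothetical and not part of
any library.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* The special return codes are huge size_t values, so they can never be
   confused with a byte count in the range 1..MB_LEN_MAX.  */
static_assert((size_t)-3 > MB_LEN_MAX,
              "special return codes must not collide with byte counts");

enum c32_status {
    C32_NUL,         /* 0: the null character was stored */
    C32_CONSUMED,    /* 1..n: that many bytes completed a character */
    C32_PENDING,     /* (size_t)-3: a character left over from the previous
                        call was stored, no input bytes consumed */
    C32_INCOMPLETE,  /* (size_t)-2: the input ends inside a character */
    C32_INVALID      /* (size_t)-1: encoding error, errno is EILSEQ */
};

/* Hypothetical helper: map an mbrtoc32() return value to its meaning. */
enum c32_status classify_mbrtoc32(size_t r)
{
    if (r == (size_t)-1)
        return C32_INVALID;
    if (r == (size_t)-2)
        return C32_INCOMPLETE;
    if (r == (size_t)-3)
        return C32_PENDING;
    if (r == 0)
        return C32_NUL;
    return C32_CONSUMED;
}
```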
Here is a test case (foo.c is attached):
$ gcc -Wall foo.c
$ ./a.out
Actual output (with glibc 2.35, 2.36):
mbsinit() = 1
mbrtowc return value = 2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = -2, wc = 0x00CA, mbsinit() = 0
mbrtowc return value = 0, wc = 0x0304, mbsinit() = 1
mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = 0, c32 = 0x0304, mbsinit() = 1
For the second part, regarding mbrtoc32, the output should be:
mbsinit() = 1
mbrtoc32 return value = 2, c32 = 0x00CA, mbsinit() = 0
mbrtoc32 return value = -3, c32 = 0x0304, mbsinit() = 1
mbrtoc32 return value = 0, c32 = 0x0000, mbsinit() = 1
That is:
1) The first call to mbrtoc32 is operating correctly.
2) The second call to mbrtoc32 should produce the second Unicode character,
return (size_t)(-3), and put the state back into initial state.
3) The third call should eat the NUL byte and produce U+0000, like it always
does.
BUT: It has been observed that this behaviour of mbrtoc32() adds complexity to
application code: After mbrtoc32() has returned an integer in the range
1..MB_LEN_MAX, the application needs to check !mbsinit() to see whether there
is some non-trivial conversion state, and if so, call mbrtoc32() once again,
with 0 additional input bytes.
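To make that extra complexity concrete, here is a sketch of a decoding loop
that would cope with (size_t)(-3). It is an illustration under stated
assumptions, not code from any existing application: the name decode_buffer is
hypothetical, the input is a length-bounded buffer in the current locale's
encoding, and a NUL character is assumed to occupy one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: decode len bytes of buf into out (room for out_cap
   code points).  Returns the number of code points stored, or (size_t)-1 on
   invalid or truncated input or when out is too small.  */
size_t decode_buffer(const char *buf, size_t len,
                     char32_t *out, size_t out_cap)
{
    assert(buf != NULL && out != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* initial conversion state */
    size_t pos = 0, n_out = 0;

    while (pos < len) {
        char32_t c;
        size_t r = mbrtoc32(&c, buf + pos, len - pos, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated input */
        if (n_out == out_cap)
            return (size_t)-1;          /* output buffer full */
        out[n_out++] = c;
        if (r == (size_t)-3)
            continue;                   /* pending character, no bytes used */
        pos += (r == 0) ? 1 : r;        /* assumption: NUL is one byte */
    }

    /* The input is exhausted, but a second character may still be pending:
       check the state and call again with 0 additional input bytes.  */
    while (!mbsinit(&st)) {
        char32_t c;
        size_t r = mbrtoc32(&c, "", 0, &st);
        if (r != (size_t)-3)
            break;                      /* nothing pending after all */
        if (n_out == out_cap)
            return (size_t)-1;
        out[n_out++] = c;
    }
    return n_out;
}
```

With an mbrtoc32() that never returns (size_t)(-3), the continue branch and
the final loop are dead code; every caller would nevertheless have to carry
them if option (a) below were implemented.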
When this problem occurred with the Vietnamese vi_VN.TCVN5712-1 locale, in
2012, the decision was to remove that locale from the
glibc/localedata/SUPPORTED file. See bug #13691 and the discussion that started
at https://sourceware.org/legacy-ml/libc-alpha/2012-05/msg00736.html .
So, there are two options:
(a) Change mbrtoc32 as described above.
(b) Remove zh_HK/BIG5-HKSCS.
I vote for (b), because
- It minimizes complexity for applications: Programs don't need to handle
(size_t)(-3) since it will never occur. (Other platform libcs don't return
(size_t)(-3). I checked all of them.)
- It keeps the code in libc simple: mbrtoc32 remains nearly identical to
mbrtowc.
- The immense majority of users have migrated to UTF-8 locales. Even the
GB18030 locale, which was at one time heavily promoted by the Chinese
government, is not even choosable or installed in current GNU/Linux distros
(see https://lists.gnu.org/archive/html/bug-gnulib/2023-05/msg00105.html ).
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug locale/30611] mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
From: bruno at clisp dot org @ 2023-07-03 21:31 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30611
Bruno Haible <bruno at clisp dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Host| |x86_64-linux-gnu
Build| |x86_64-linux-gnu
Version|2.35 |2.36
Target| |x86_64-linux-gnu
* [Bug locale/30611] mbrtoc32 works incorrectly in the zh_HK.BIG5-HKSCS locale
From: eggert at cs dot ucla.edu @ 2023-07-12 1:59 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=30611
eggert at cs dot ucla.edu changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |eggert at cs dot ucla.edu
--- Comment #1 from eggert at cs dot ucla.edu ---
(In reply to Bruno Haible from comment #0)
> https://lists.gnu.org/archive/html/bug-gnulib/2023-05/msg00105.html
That URL also suggests that BIG5-HKSCS is not choosable or useful in current
GNU/Linux distributions, which is another argument for option (b), i.e.,
remove zh_HK/BIG5-HKSCS.
As a practical matter, GNU diffutils will not support mbrtoc32's returning
((size_t) -3) as it's too painful to write and maintain the calling code, and
any rewrite would likely implicate performance as well as correctness. I expect
other applications are similar. So even if glibc supported the four unusual
character pairs in BIG5-HKSCS[1], typical applications would misbehave anyway.
So option (a) is not realistic, whereas option (b) is simple and is a
maintenance win.
[1]: https://lists.gnu.org/r/bug-gnulib/2023-07/msg00014.html