public inbox for glibc-bugs@sourceware.org
* [Bug locale/25734] New: mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
From: tom at honermann dot net @ 2020-03-27 4:05 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25734
Bug ID: 25734
Summary: mbrtowc with Big5-HKSCS fails to reset conversion
state for conversions that produce two Unicode code
points
Product: glibc
Version: 2.31
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: locale
Assignee: unassigned at sourceware dot org
Reporter: tom at honermann dot net
Target Milestone: ---
mbrtowc() with Big5-HKSCS fails to reset conversion state for double byte
sequences (such as 0x88 0x62) that are converted to multiple Unicode code
points (U+00CA U+0304) after writing the second code point. This results in
subsequent calls to mbrtowc() repeatedly writing the second code point without
consuming new input.
$ cat t.c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
int main() {
  /* Attempt to translate Big5-HKSCS input containing a double byte sequence
     (0x88 0x62) that maps to two Unicode code points (U+00CA U+0304) followed
     by an ASCII character. This test demonstrates that the converter fails to
     reset conversion state after writing the second code point such that each
     subsequent call of mbrtowc() writes the second code point again. */
  if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
    perror("setlocale");
    return 1;
  }
  const char *mbs;
  wchar_t wc;
  mbstate_t s;
  size_t result;
  mbs = "\x88\x62\x58";
  memset(&s, 0, sizeof(s));
  /* Translate the first code unit sequence. This call to mbrtowc() consumes
     the first two bytes and writes the first Unicode code point. */
  result = mbrtowc(&wc, mbs, 3, &s);
  assert(result == (size_t) 2);
  mbs += 2;
  printf("1st wc: 0x%04X\n", (unsigned)wc);
  assert(wc == 0x00CA);
  /* This next call to mbrtowc() writes the second Unicode code point without
     consuming any input. Since output was written, but no input was consumed,
     0 is returned. This is a case where mbrtoc16() would return (size_t)-3,
     but mbrtowc() isn't specified to do so. This behavior is a bit confusing
     because the return of 0 is specified to indicate that a null character
     was written; but that isn't the case here. */
  result = mbrtowc(&wc, mbs, 1, &s);
  assert(result == (size_t) 0);
  printf("2nd wc: 0x%04X\n", (unsigned)wc);
  assert(wc == 0x0304);
  /* This next call to mbrtowc() should now consume and write the ASCII
     character. However, the failure of the converter to reset the conversion
     state results in the second Unicode code point being written again; again
     without consuming new input. */
  wc = 0; /* set wc to an arbitrary value. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("3rd wc: 0x%04X (should be 0x0058)\n", (unsigned)wc);
  printf("result: %d (should be 1)\n", (int)result);
  /* These two assertions should succeed, but currently fail. */
  assert(wc == 0x0058);
  assert(result == 1);
}
$ gcc t.c -o t
$ ./t
1st wc: 0x00CA
2nd wc: 0x0304
3rd wc: 0x0304 (should be 0x0058)
result: 0 (should be 1)
t: t.c:52: main: Assertion `wc == 0x0058' failed.
Aborted (core dumped)
As noted in the test and by the failed assertion, the third call to mbrtowc()
should have consumed 1 new byte (0x58) and written one code point (U+0058), but
it instead consumed no input and re-wrote the previously written code point.
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
From: carlos at redhat dot com @ 2020-03-27 16:37 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25734
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2020-03-27
Status|UNCONFIRMED |NEW
CC| |carlos at redhat dot com,
| |schwab@linux-m68k.org
Ever confirmed|0 |1
--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
Andreas Schwab probably has the most experience with Big5-HKSCS. I'm CC'ing him
on this issue in case he has any thoughts about it.
I can confirm that \x88\x62 should map to <U00CA><U0304>.
Double checked that against Big5-HKSCS 2016 (which we don't support yet):
https://www.ogcio.gov.hk/en/our_work/business/tech_promotion/ccli/terms/doc/e_hkscs_2016.pdf
I can confirm the test case behaves as you indicated.
In glibc/iconvdata/big5hkscs.c we certainly handle the combining characters:
    /* Check for special cases: combining characters. */  \
    if (idx == 195 + 0x22 /* 8862 */)                      \
      {                                                    \
        ch = 0xca;                                         \
        ch2 = 0x304;                                       \
      }                                                    \
    else if (idx == 195 + 0x24 /* 8864 */)                 \
      {                                                    \
        ch = 0xca;                                         \
        ch2 = 0x30c;                                       \
      }                                                    \
    else if (idx == 195 + 0x63 /* 88a3 */)                 \
      {                                                    \
        ch = 0xea;                                         \
        ch2 = 0x304;                                       \
      }                                                    \
    else if (idx == 195 + 0x65 /* 88a5 */)                 \
      {                                                    \
        ch = 0xea;                                         \
        ch2 = 0x30c;                                       \
      }                                                    \
    else                                                   \
      /* This is illegal. */                               \
      STANDARD_FROM_LOOP_ERR_HANDLER (1);                  \
I haven't debugged this in great detail though.
diff --git a/iconvdata/big5hkscs.c b/iconvdata/big5hkscs.c
index 01fcfeba76..70f84a5226 100644
--- a/iconvdata/big5hkscs.c
+++ b/iconvdata/big5hkscs.c
@@ -17898,6 +17898,7 @@ static struct
                                                           \
         put32 (outptr, ch);                               \
         outptr += 4;                                      \
+        *statep = 0;                                      \
       }
 #define LOOP_NEED_FLAGS
 #define EXTRA_LOOP_DECLS , int *statep
Once a pushed back (queued) character has been written out, we should clear
statep.
I expect we need a test case to cover the case you outline.
* [Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
From: carlos at redhat dot com @ 2020-03-27 16:55 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25734
--- Comment #2 from Carlos O'Donell <carlos at redhat dot com> ---
The quick fix appears to work.
GCONV_PATH=/home/carlos/build/glibc/iconvdata ./test
1st wc: 0x00CA
2nd wc: 0x0304
3rd wc: 0x0058 (should be 0x0058)
result: 1 (should be 1)
I need to dig into this a bit more.
* [Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
From: tom at honermann dot net @ 2020-03-27 18:25 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25734
--- Comment #3 from Tom Honermann <tom at honermann dot net> ---
Thanks for confirming, Carlos.
I had independently arrived at the same quick fix, but you beat me to posting
it.
I don't know glibc internals well. Something I've been surprised by is that
iconv doesn't seem to suffer this same issue. I haven't tried to debug to
determine why it is unaffected.
(In reply to Carlos O'Donell from comment #1)
> I expect we need a test case to cover the case you outline.
Agreed. The patch below is what I had added locally as a test. Feel free to
use this (I've already signed a glibc contributor form).
diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile
index f02167fa58..2e58ac36e8 100644
--- a/wcsmbs/Makefile
+++ b/wcsmbs/Makefile
@@ -58,7 +58,8 @@ include ../Rules
ifeq ($(run-built-tests),yes)
 LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 hr_HR.ISO-8859-2 \
- ja_JP.EUC-JP zh_TW.EUC-TW tr_TR.UTF-8 tr_TR.ISO-8859-9
+ ja_JP.EUC-JP zh_TW.EUC-TW tr_TR.UTF-8 tr_TR.ISO-8859-9 \
+ zh_HK.BIG5-HKSCS
include ../gen-locales.mk
$(objpfx)tst-btowc.out: $(gen-locales)
diff --git a/wcsmbs/tst-mbrtowc.c b/wcsmbs/tst-mbrtowc.c
index 38a431983b..b2b3ff3744 100644
--- a/wcsmbs/tst-mbrtowc.c
+++ b/wcsmbs/tst-mbrtowc.c
@@ -150,6 +150,43 @@ utf8_test (void)
   return error;
 }
+static int
+big5_hkscs_test (void)
+{
+  const char *locale = "zh_HK.BIG5-HKSCS";
+  const char *mbs;
+  wchar_t wc;
+  mbstate_t s;
+
+  if (!setlocale (LC_CTYPE, locale))
+    {
+      fprintf (stderr, "locale '%s' not available!\n", locale);
+      exit (1);
+    }
+
+  /* This test exercises a special case conversion of a double byte sequence
+     to a pair of Unicode code points. */
+  wc = 42; /* arbitrary number */
+  mbs = "\x88\x62"; /* 0x88 0x62 => U+00CA U+0304 */
+  memset (&s, 0, sizeof(s));
+  /* Translate the double byte sequence. This call consumes the first two
+     bytes and writes the first of the two code points. */
+  assert (mbrtowc (&wc, mbs, 3, &s) == (size_t) 2);
+  assert (wc == 0x00CA);
+  /* Attempt to translate the next sequence. mbrtowc isn't specified to
+     return (size_t) -3 like mbrtoc16 is in the case where there are multiple
+     wide chars to write. This call returns 0 despite not writing a null
+     character. 0 is returned because no new input is consumed; the call
+     writes the second code point. */
+  assert (mbrtowc (&wc, mbs + 2, 1, &s) == (size_t) 0);
+  assert (wc == 0x0304);
+  /* The following now consumes and converts the null character. */
+  assert (mbrtowc (&wc, mbs + 2, 1, &s) == (size_t) 0);
+  assert(wc == 0);
+
+  return 0;
+}
+
 static int
 do_test (void)
@@ -169,6 +206,8 @@ do_test (void)
   setlocale (LC_ALL, "ja_JP.EUC-JP");
   result |= check_ascii (setlocale (LC_ALL, NULL));
+  result |= big5_hkscs_test ();
+
   return result;
 }
* [Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
From: carlos at redhat dot com @ 2020-03-27 21:20 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25734
--- Comment #4 from Carlos O'Donell <carlos at redhat dot com> ---
Tom,
I split the test out specifically for BIG5HKSCS and included all 4 conversions
that create 2 wchar_ts.
Posted: https://sourceware.org/pipermail/libc-alpha/2020-March/112257.html
* [Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
From: carlos at redhat dot com @ 2020-04-16 2:40 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25734
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Target Milestone|--- |2.32
Resolution|--- |FIXED
--- Comment #5 from Carlos O'Donell <carlos at redhat dot com> ---
Fixed in 2.32
* [Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
From: tom at honermann dot net @ 2020-04-18 1:56 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25734
Tom Honermann <tom at honermann dot net> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|RESOLVED |VERIFIED
--- Comment #6 from Tom Honermann <tom at honermann dot net> ---
Verified fixed for 2.32 as of commit c580e6466d6da8262820cdbad19f32c5546226cf
(https://sourceware.org/git/?p=glibc.git;a=commit;h=c580e6466d6da8262820cdbad19f32c5546226cf)
* [Bug locale/25734] mbrtowc with Big5-HKSCS fails to reset conversion state for conversions that produce two Unicode code points
From: fw at deneb dot enyo.de @ 2020-04-18 10:20 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25734
Florian Weimer <fw at deneb dot enyo.de> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |fw at deneb dot enyo.de
Flags| |security-
Status|VERIFIED |RESOLVED
--- Comment #7 from Florian Weimer <fw at deneb dot enyo.de> ---
Please leave the status as RESOLVED. There is some automation that depends on
it.