public inbox for glibc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
@ 2020-03-29 4:28 tom at honermann dot net
2020-03-29 16:59 ` [Bug locale/25744] " carlos at redhat dot com
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: tom at honermann dot net @ 2020-03-29 4:28 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
Bug ID: 25744
Summary: mbrtowc with Big5-HKSCS returns 2 instead of 1 when
consuming the second byte of certain double byte
characters
Product: glibc
Version: 2.31
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: locale
Assignee: unassigned at sourceware dot org
Reporter: tom at honermann dot net
Target Milestone: ---
The following test case demonstrates an issue with the Big5-HKSCS converter
that occurs when certain double byte characters are consumed one byte at a
time.
Quoting from the C18 standard for convenience: 7.29.6.3.2p4 states:
> The mbrtowc function returns the first of the following that applies
> (given the current conversion state):
>
> 0 if the next n or fewer bytes complete the multibyte character that
> corresponds to the null wide character (which is the value stored).
>
> between 1 and n inclusive if the next n or fewer bytes complete a valid
> multibyte character (which is the value stored); the value returned is
> the number of bytes that complete the multibyte character.
>
> (size_t) (−2) if the next n bytes contribute to an incomplete (but
> potentially valid) multibyte character, and all n bytes have been
> processed (no value is stored).
>
> (size_t)(-1) if an encoding error occurs, in which case the next n or
> fewer bytes do not contribute to a complete and valid multibyte
> character (no value is stored); the value of the macro EILSEQ is stored
> in errno, and the conversion state is unspecified.
The following test case converts the Big5-HKSCS double byte character 0x88 0x62
one byte at a time. This is one of the special double byte characters that
converts to two Unicode code points (U+00CA U+0304). When consuming the second
byte, mbrtowc() returns a value of 2, but 1 should be returned since only one
byte was consumed; the wording quoted above doesn't allow for the return value
to be larger than the value of 'n' that was passed in.
I don't know if this issue only occurs for double byte characters that map to
multiple Unicode code points.
$ cat t.c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
int main() {
/* This test case demonstrates glibc's current (2.31) behavior when
attempting to convert, one byte at a time, Big5-HKSCS input
containing a double byte sequence that maps to multiple Unicode
code points.
The first call to mbrtowc() consumes the first byte and returns
a value of -2 indicating an incomplete multibyte character as
expected. However, the second call, with an input length of 1,
consumes the second byte, recognizes completion of the previously
incomplete character, writes the mapped Unicode code point, and
then returns 2. The return value of 2 is surprising since only
1 byte was read. This seems to violate the C standard since the
return value is greater than the input length.
This issue does not occur for all double byte characters. */
if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
perror("setlocale");
return 1;
}
const char *mbs;
wchar_t wc;
mbstate_t s;
size_t result;
mbs = "\x88\x62";
memset(&s, 0, sizeof(s));
/* This call to mbrtowc() consumes the first byte and returns -2 indicating
that a potentially valid but incomplete character was read. This is
expected behavior. */
result = mbrtowc(&wc, mbs, 1, &s);
printf("1st mbrtowc call:\n");
printf(" result: %zd (-2 expected)\n", result);
assert(result == (size_t) -2);
mbs += 1;
/* This call consumes the second byte to complete the double byte character
and writes the Unicode code point. The C standard requires a return
value of 1 since only one byte was consumed by this call, but glibs
returns 2 (presumably corresponding to the total number of bytes that
contributed to the multibyte character. */
result = mbrtowc(&wc, mbs, 1, &s);
printf("2nd mbrtowc call:\n");
printf(" result: %zd (1 expected, glibc returns 2)\n", result);
printf(" wc: 0x%04X (0x00CA expected)\n", (unsigned)wc);
mbs += 1;
assert(result == (size_t) 2); /* Current glibc behavior */
assert(wc == 0x00CA);
/* This call writes the second Unicode code point, but does not consume
any input. 0 is returned since no input is consumed. According to
the C standard, a return of 0 is reserved for when a null character is
written, but since the C standard doesn't acknowledge the existence of
characters that can't be represented in a single wchar_t, we're already
operating outside the scope of the standard. Returning 0 seems
reasonable to me. Returning -3 as mbrtoc16() does would also be
reasonable. */
result = mbrtowc(&wc, mbs, 1, &s);
printf("3rd mbrtowc call:\n");
printf(" result: %zd (0 expected)\n", result);
printf(" wc: 0x%04X (0x0304 expected)\n", (unsigned)wc);
mbs += 0;
assert(result == (size_t) 0);
assert(wc == 0x0304);
/* Attempting to process any further input would run afoul of a separate
issue tracked by https://sourceware.org/bugzilla/show_bug.cgi?id=25734. */
}
$ gcc t.c -o t
$ ./t
1st mbrtowc call:
result: -2 (-2 expected)
2nd mbrtowc call:
result: 2 (1 expected, glibc returns 2)
wc: 0x00CA (0x00CA expected)
3rd mbrtowc call:
result: 0 (0 expected)
wc: 0x0304 (0x0304 expected)
As noted in the code comments and process output, the return value for the
second call to mbrtowc() is unexpected.
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
@ 2020-03-29 16:59 ` carlos at redhat dot com
2020-03-30 12:44 ` schwab@linux-m68k.org
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: carlos at redhat dot com @ 2020-03-29 16:59 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
Carlos O'Donell <carlos at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
CC| |carlos at redhat dot com
Last reconfirmed| |2020-03-29
--- Comment #1 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Tom Honermann from comment #0)
> 2nd mbrtowc call:
> result: 2 (1 expected, glibc returns 2)
> wc: 0x00CA (0x00CA expected)
> As noted in the code comments and process output, the return value for the
> second call to mbrtowc() is unexpected.
My opinion is that this is an implementation defect in the converter which has
absolute control over how those return values are computed.
My worry though is that changing this now might break existing code that is out
there that is using BIG5HKSCS.
If that's actually the case then we could create a distinct gconv module with
the old behaviour and release that.
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
2020-03-29 16:59 ` [Bug locale/25744] " carlos at redhat dot com
@ 2020-03-30 12:44 ` schwab@linux-m68k.org
2020-03-30 15:13 ` carlos at redhat dot com
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: schwab@linux-m68k.org @ 2020-03-30 12:44 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
--- Comment #2 from Andreas Schwab <schwab@linux-m68k.org> ---
I don't think you can call mbrtowc with a different value for mbs after the
first call returned -2, without resetting the state inbetween.
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
2020-03-29 16:59 ` [Bug locale/25744] " carlos at redhat dot com
2020-03-30 12:44 ` schwab@linux-m68k.org
@ 2020-03-30 15:13 ` carlos at redhat dot com
2020-03-30 17:31 ` tom at honermann dot net
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: carlos at redhat dot com @ 2020-03-30 15:13 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
--- Comment #3 from Carlos O'Donell <carlos at redhat dot com> ---
https://sourceware.org/pipermail/libc-alpha/2020-March/112300.html
Options:
(a) Do not change the current converter. We return 2 consumed bytes in
the first conversion, and 0 on the second.
(b) Look ahead and split the conversion. We return 1 consumed in the
first conversion, and 1 on the second. This prevents us from returning
0 which may be interpreted as a L'\0'. This leads to false assumption
that the user could stop the conversion at this point and modify the
input.
(c) Design something new. Return (size_t) -3 indicating a One-to-Many
conversion is happening and that there is more output to be generated.
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
` (2 preceding siblings ...)
2020-03-30 15:13 ` carlos at redhat dot com
@ 2020-03-30 17:31 ` tom at honermann dot net
2020-03-30 17:36 ` schwab@linux-m68k.org
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: tom at honermann dot net @ 2020-03-30 17:31 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
--- Comment #4 from Tom Honermann <tom at honermann dot net> ---
(In reply to Andreas Schwab from comment #2)
> I don't think you can call mbrtowc with a different value for mbs after the
> first call returned -2, without resetting the state inbetween.
The standard is not clear here, but I think it is required to increment 'mbs'
by 'n' in such cases. Existing tests in glibc for other encodings demonstrate
this intent. For examples, see the tests for UTF-8 in wcsmbs/tst-mbrtowc.c.
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
` (3 preceding siblings ...)
2020-03-30 17:31 ` tom at honermann dot net
@ 2020-03-30 17:36 ` schwab@linux-m68k.org
2020-03-30 17:38 ` tom at honermann dot net
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: schwab@linux-m68k.org @ 2020-03-30 17:36 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
--- Comment #5 from Andreas Schwab <schwab@linux-m68k.org> ---
`n' is what?
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
` (4 preceding siblings ...)
2020-03-30 17:36 ` schwab@linux-m68k.org
@ 2020-03-30 17:38 ` tom at honermann dot net
2020-04-01 4:04 ` tom at honermann dot net
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: tom at honermann dot net @ 2020-03-30 17:38 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
--- Comment #6 from Tom Honermann <tom at honermann dot net> ---
(In reply to Andreas Schwab from comment #5)
> `n' is what?
The value for 'n' passed to mbrtowc() and that corresponds to the specification
for the -2 return value:
> (size_t) (−2) if the next n bytes contribute to an incomplete (but
> potentially valid) multibyte character, and all n bytes have been
> processed (no value is stored).
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
` (5 preceding siblings ...)
2020-03-30 17:38 ` tom at honermann dot net
@ 2020-04-01 4:04 ` tom at honermann dot net
2020-04-07 5:43 ` tom at honermann dot net
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: tom at honermann dot net @ 2020-04-01 4:04 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
--- Comment #7 from Tom Honermann <tom at honermann dot net> ---
I debugged and identified the root cause for this issue. The problem is that
the Big5-HKSCS converter is failing to preserve the lowest 3 bits of the
mbstate_t __count data member.
The relevant code is below.
In iconv/loop.c, lines 396-397 are responsible for unpacking previously cached
bytes stored in 'state' into an internal buffer (bytebuf). The number of
previously cached bytes is stored in the three lowest bits of 'state->__count'.
(The Big5-HKSCS converter does not define UNPACK_BYTES).
Lines 485-490 are responsible for the packing of processed bytes and the count
of them when insufficient bytes were available to process a complete character.
(The Big5-HKSCS converter does not define STORE_REST).
The line most relevant to this issue is line 477 where the number of bytes to
advance the input pointer provided by the caller is determined. The
calculation is performed by determining how much of the internal buffer
(bytebuf) was read by the converter, and then subtracting the number of bytes
that were populated from the cache (lines 396-397).
The Big5-HKSCS converter runs at line 446 in the expansion of BODY. Note that
this occurs between the lines that unpack cached bytes (396-397) and the line
that computes how far to advance the input pointer (477). (more after the code
break).
iconv/loop.c:
366 static inline int
367 __attribute ((always_inline))
368 SINGLE(LOOPFCT) (struct __gconv_step *step,
369 struct __gconv_step_data *step_data,
370 const unsigned char **inptrp, const unsigned char *inend,
371 unsigned char **outptrp, unsigned char *outend,
372 size_t *irreversible EXTRA_LOOP_DECLS)
373 {
...
391 # ifdef UNPACK_BYTES
...
393 # else
394 /* Add the bytes from the state to the input buffer. */
395 assert ((state->__count & 7) <= sizeof (state->__value));
396 for (inlen = 0; inlen < (size_t) (state->__count & 7); ++inlen)
397 bytebuf[inlen] = state->__value.__wchb[inlen];
398 # endif
...
441 inptr = bytebuf;
442 inend = &bytebuf[inlen];
...
446 BODY
...
477 *inptrp += inend - bytebuf - (state->__count & 7);
478 # ifdef STORE_REST
...
482 # else
483 /* We don't have enough input for another complete input
484 character. */
485 assert (inend - inptr > (state->__count & ~7));
486 assert (inend - inptr <= sizeof (state->__value));
487 state->__count = (state->__count & ~7) | (inend - inptr);
488 inlen = 0;
489 while (inptr < inend)
490 state->__value.__wchb[inlen++] = *inptr++;
491 # endif
492 }
493
494 return result;
495 }
The Big5-HKSCS converter also stores information in 'state->__count'. It does
not use the lower most 3 bits for its own state (where the number of cached
characters is stored; it appears to acknowledge that they are reserved). The
problem is that there are numerous places where the converter updates
'state->__count', but fails to preserve those bits (there are also cases where
it fails to properly ignore them). When these bits are not preserved, line 477
above miscalulates the number of bytes consumed from the input.
Cases where those bits are not properly handled are shown below.
iconvdata/big5hkscs.c:
17773 #define EMIT_SHIFT_TO_INIT \
17774 if (data->__statep->__count != 0)
\
.....
17783 data->__statep->__count = 0; .....
.....
17797 data->__statep->__count = 0;
\
.....
17803 }
.....
17812 #define BODY \
17813 {
\
.....
17883 *statep = ch2 << 3;
\
.....
17901 }
.....
17920 #define BODY \
17921 {
\
.....
17948 *statep = 0;
\
.....
17961 *statep = 0;
\
.....
17998 *statep = ((cp[0] << 8) | cp[1]) << 3;
\
.....
18019 }
A local update to preserve the lower 3 bits sufficed to resolve this issue.
I'm working on a patch.
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
` (6 preceding siblings ...)
2020-04-01 4:04 ` tom at honermann dot net
@ 2020-04-07 5:43 ` tom at honermann dot net
2022-07-06 14:20 ` cvs-commit at gcc dot gnu.org
2022-07-06 14:25 ` adhemerval.zanella at linaro dot org
9 siblings, 0 replies; 11+ messages in thread
From: tom at honermann dot net @ 2020-04-07 5:43 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
--- Comment #8 from Tom Honermann <tom at honermann dot net> ---
A patch for this issue has been submitted to the libc-alpha mailing list.
- https://sourceware.org/pipermail/libc-alpha/2020-April/112595.html
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
` (7 preceding siblings ...)
2020-04-07 5:43 ` tom at honermann dot net
@ 2022-07-06 14:20 ` cvs-commit at gcc dot gnu.org
2022-07-06 14:25 ` adhemerval.zanella at linaro dot org
9 siblings, 0 replies; 11+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-07-06 14:20 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
--- Comment #9 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Adhemerval Zanella
<azanella@sourceware.org>:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=598f790fb17bcfff7fedde5209933a82d7748328
commit 598f790fb17bcfff7fedde5209933a82d7748328
Author: Tom Honermann <tom@honermann.net>
Date: Thu Jun 30 08:52:13 2022 -0400
gconv: Correct Big5-HKSCS conversion to preserve all state bits. [BZ
#25744]
This patch corrects the Big5-HKSCS converter to preserve the lowest 3 bits
of
the mbstate_t __count data member when the converter encounters an
incomplete
multibyte character.
This fixes BZ #25744.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
` (8 preceding siblings ...)
2022-07-06 14:20 ` cvs-commit at gcc dot gnu.org
@ 2022-07-06 14:25 ` adhemerval.zanella at linaro dot org
9 siblings, 0 replies; 11+ messages in thread
From: adhemerval.zanella at linaro dot org @ 2022-07-06 14:25 UTC (permalink / raw)
To: glibc-bugs
https://sourceware.org/bugzilla/show_bug.cgi?id=25744
Adhemerval Zanella <adhemerval.zanella at linaro dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Target Milestone|--- |2.36
CC| |adhemerval.zanella at linaro dot o
| |rg
Resolution|--- |FIXED
--- Comment #10 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
Fixed on 2.36.
--
You are receiving this mail because:
You are on the CC list for the bug.
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2022-07-06 14:25 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-29 4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
2020-03-29 16:59 ` [Bug locale/25744] " carlos at redhat dot com
2020-03-30 12:44 ` schwab@linux-m68k.org
2020-03-30 15:13 ` carlos at redhat dot com
2020-03-30 17:31 ` tom at honermann dot net
2020-03-30 17:36 ` schwab@linux-m68k.org
2020-03-30 17:38 ` tom at honermann dot net
2020-04-01 4:04 ` tom at honermann dot net
2020-04-07 5:43 ` tom at honermann dot net
2022-07-06 14:20 ` cvs-commit at gcc dot gnu.org
2022-07-06 14:25 ` adhemerval.zanella at linaro dot org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).