public inbox for glibc-bugs@sourceware.org
* [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
@ 2020-03-29  4:28 tom at honermann dot net
  2020-03-29 16:59 ` [Bug locale/25744] " carlos at redhat dot com
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: tom at honermann dot net @ 2020-03-29  4:28 UTC
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=25744

            Bug ID: 25744
           Summary: mbrtowc with Big5-HKSCS returns 2 instead of 1 when
                    consuming the second byte of certain double byte
                    characters
           Product: glibc
           Version: 2.31
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: tom at honermann dot net
  Target Milestone: ---

The following test case demonstrates an issue with the Big5-HKSCS converter
that occurs when certain double byte characters are consumed one byte at a
time.

Quoting from the C18 standard for convenience: 7.29.6.3.2p4 states:

> The mbrtowc function returns the first of the following that applies
> (given the current conversion state):
>
> 0 if the next n or fewer bytes complete the multibyte character that
> corresponds to the null wide character (which is the value stored).
>
> between 1 and n inclusive if the next n or fewer bytes complete a valid
> multibyte character (which is the value stored); the value returned is
> the number of bytes that complete the multibyte character.
>
> (size_t)(-2) if the next n bytes contribute to an incomplete (but
> potentially valid) multibyte character, and all n bytes have been
> processed (no value is stored).
>
> (size_t)(-1) if an encoding error occurs, in which case the next n or
> fewer bytes do not contribute to a complete and valid multibyte
> character (no value is stored); the value of the macro EILSEQ is stored
> in errno, and the conversion state is unspecified.

The following test case converts the Big5-HKSCS double byte character 0x88 0x62
one byte at a time.  This is one of the special double byte characters that
converts to two Unicode code points (U+00CA U+0304).  When consuming the second
byte, mbrtowc() returns a value of 2, but 1 should be returned since only one
byte was consumed; the wording quoted above doesn't allow for the return value
to be larger than the value of 'n' that was passed in.
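
To make the constraint concrete, here is a minimal sketch of the check implied
by the quoted wording.  The helper name mbrtowc_result_is_conforming is
hypothetical (it is not part of glibc or the C library); it only encodes the
four permitted outcomes for a call that was given n bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: returns true if RESULT is
   one of the values the quoted wording permits for a call to mbrtowc()
   that was given N bytes of input. */
bool mbrtowc_result_is_conforming(size_t result, size_t n)
{
  /* (size_t)(-1): encoding error; (size_t)(-2): incomplete character. */
  if (result == (size_t) -1 || result == (size_t) -2)
    return true;
  /* Otherwise the result must be 0 (null wide character) or a byte
     count between 1 and n inclusive. */
  return result <= n;
}
```

With n == 1, a result of 1 passes this check and a result of 2 does not, which
is the case described in this report.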

I don't know whether this issue occurs only for double byte characters that
map to multiple Unicode code points.
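
One way to probe other sequences might look like the sketch below.  The helper
name feed_bytewise is hypothetical; it uses whatever locale is currently in
effect, feeds the input to mbrtowc() one byte at a time, and records each
return value so the values can be inspected afterwards.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only: feeds LEN bytes of BYTES to
   mbrtowc() one byte at a time in the current locale and stores each
   return value in RESULTS, which must have room for LEN entries.
   Returns the number of calls made; stops early on an encoding error. */
size_t feed_bytewise(const char *bytes, size_t len, size_t *results)
{
  mbstate_t state;
  wchar_t wc;
  size_t calls = 0;

  memset(&state, 0, sizeof state);
  for (size_t i = 0; i < len; i++) {
    results[calls] = mbrtowc(&wc, bytes + i, 1, &state);
    if (results[calls++] == (size_t) -1)
      break;  /* the conversion state is unspecified after an error */
  }
  return calls;
}
```

After a successful setlocale(LC_ALL, "zh_HK.BIG5-HKSCS"), passing a double
byte sequence to this helper and finding a recorded value greater than 1 would
indicate the same behavior as for 0x88 0x62, since every call is given exactly
one byte.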

$ cat t.c 
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main() {
  /* This test case demonstrates glibc's current (2.31) behavior when
     attempting to convert, one byte at a time, Big5-HKSCS input
     containing a double byte sequence that maps to multiple Unicode
     code points.

     The first call to mbrtowc() consumes the first byte and returns
     a value of -2 indicating an incomplete multibyte character as
     expected.  However, the second call, with an input length of 1,
     consumes the second byte, recognizes completion of the previously
     incomplete character, writes the mapped Unicode code point, and
     then returns 2.  The return value of 2 is surprising since only
     1 byte was read.  This seems to violate the C standard since the
     return value is greater than the input length.

     This issue does not occur for all double byte characters. */

  if (! setlocale(LC_ALL, "zh_HK.BIG5-HKSCS")) {
    perror("setlocale");
    return 1;
  }

  const char *mbs;
  wchar_t wc;
  mbstate_t s;
  size_t result;

  mbs = "\x88\x62";
  memset(&s, 0, sizeof(s));
  /* This call to mbrtowc() consumes the first byte and returns -2 indicating
     that a potentially valid but incomplete character was read.  This is
     expected behavior. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("1st mbrtowc call:\n");
  printf("  result: %zd (-2 expected)\n", result);
  assert(result == (size_t) -2);
  mbs += 1;
  /* This call consumes the second byte to complete the double byte character
     and writes the Unicode code point.  The C standard requires a return
     value of 1 since only one byte was consumed by this call, but glibc
     returns 2 (presumably corresponding to the total number of bytes that
     contributed to the multibyte character). */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("2nd mbrtowc call:\n");
  printf("  result: %zd (1 expected, glibc returns 2)\n", result);
  printf("  wc: 0x%04X (0x00CA expected)\n", (unsigned)wc);
  mbs += 1;
  assert(result == (size_t) 2); /* Current glibc behavior */
  assert(wc == 0x00CA);
  /* This call writes the second Unicode code point, but does not consume
     any input.  0 is returned since no input is consumed.  According to
     the C standard, a return of 0 is reserved for when a null character is
     written, but since the C standard doesn't acknowledge the existence of
     characters that can't be represented in a single wchar_t, we're already
     operating outside the scope of the standard.  Returning 0 seems
     reasonable to me.  Returning -3 as mbrtoc16() does would also be
     reasonable. */
  result = mbrtowc(&wc, mbs, 1, &s);
  printf("3rd mbrtowc call:\n");
  printf("  result: %zd (0 expected)\n", result);
  printf("  wc: 0x%04X (0x0304 expected)\n", (unsigned)wc);
  mbs += 0;
  assert(result == (size_t) 0);
  assert(wc == 0x0304);
  /* Attempting to process any further input would run afoul of a separate
     issue tracked by https://sourceware.org/bugzilla/show_bug.cgi?id=25734. */
}

$ gcc t.c -o t

$ ./t
1st mbrtowc call:
  result: -2 (-2 expected)
2nd mbrtowc call:
  result: 2 (1 expected, glibc returns 2)
  wc: 0x00CA (0x00CA expected)
3rd mbrtowc call:
  result: 0 (0 expected)
  wc: 0x0304 (0x0304 expected)

As noted in the code comments and process output, the return value for the
second call to mbrtowc() is unexpected.


Thread overview: 11+ messages
2020-03-29  4:28 [Bug locale/25744] New: mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters tom at honermann dot net
2020-03-29 16:59 ` [Bug locale/25744] " carlos at redhat dot com
2020-03-30 12:44 ` schwab@linux-m68k.org
2020-03-30 15:13 ` carlos at redhat dot com
2020-03-30 17:31 ` tom at honermann dot net
2020-03-30 17:36 ` schwab@linux-m68k.org
2020-03-30 17:38 ` tom at honermann dot net
2020-04-01  4:04 ` tom at honermann dot net
2020-04-07  5:43 ` tom at honermann dot net
2022-07-06 14:20 ` cvs-commit at gcc dot gnu.org
2022-07-06 14:25 ` adhemerval.zanella at linaro dot org