public inbox for glibc-bugs@sourceware.org
help / color / mirror / Atom feed
From: "tom at honermann dot net" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters
Date: Wed, 01 Apr 2020 04:04:24 +0000	[thread overview]
Message-ID: <bug-25744-131-O3fZfrK4vB@http.sourceware.org/bugzilla/> (raw)
In-Reply-To: <bug-25744-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=25744

--- Comment #7 from Tom Honermann <tom at honermann dot net> ---
I debugged and identified the root cause for this issue.  The problem is that
the Big5-HKSCS converter is failing to preserve the lowest 3 bits of the
mbstate_t __count data member.

The relevant code is below.

In iconv/loop.c, lines 396-397 are responsible for unpacking previously cached
bytes stored in 'state' into an internal buffer (bytebuf).  The number of
previously cached bytes is stored in the three lowest bits of 'state->__count'.
 (The Big5-HKSCS converter does not define UNPACK_BYTES).

Lines 485-490 are responsible for the packing of processed bytes and the count
of them when insufficient bytes were available to process a complete character.
 (The Big5-HKSCS converter does not define STORE_REST).

The line most relevant to this issue is line 477 where the number of bytes to
advance the input pointer provided by the caller is determined.  The
calculation is performed by determining how much of the internal buffer
(bytebuf) was read by the converter, and then subtracting the number of bytes
that were populated from the cache (lines 396-397).

The Big5-HKSCS converter runs at line 446 in the expansion of BODY.  Note that
this occurs between the lines that unpack cached bytes (396-397) and the line
that computes how far to advance the input pointer (477).  (more after the code
break).

iconv/loop.c:
366 static inline int
367 __attribute ((always_inline))
368 SINGLE(LOOPFCT) (struct __gconv_step *step,
369                  struct __gconv_step_data *step_data,
370                  const unsigned char **inptrp, const unsigned char *inend,
371                  unsigned char **outptrp, unsigned char *outend,
372                  size_t *irreversible EXTRA_LOOP_DECLS)
373 {
...
391 #  ifdef UNPACK_BYTES
...
393 #  else
394   /* Add the bytes from the state to the input buffer.  */
395   assert ((state->__count & 7) <= sizeof (state->__value));
396   for (inlen = 0; inlen < (size_t) (state->__count & 7); ++inlen)
397     bytebuf[inlen] = state->__value.__wchb[inlen];
398 #  endif
...
441   inptr = bytebuf;
442   inend = &bytebuf[inlen];
...
446       BODY
...
477       *inptrp += inend - bytebuf - (state->__count & 7);
478 #  ifdef STORE_REST
...
482 #  else
483       /* We don't have enough input for another complete input
484          character.  */
485       assert (inend - inptr > (state->__count & ~7));
486       assert (inend - inptr <= sizeof (state->__value));
487       state->__count = (state->__count & ~7) | (inend - inptr);
488       inlen = 0;
489       while (inptr < inend)
490         state->__value.__wchb[inlen++] = *inptr++;
491 #  endif
492     }
493 
494   return result;
495 }

The Big5-HKSCS converter also stores information in 'state->__count'.  It does
not use the lower most 3 bits for its own state (where the number of cached
characters is stored; it appears to acknowledge that they are reserved).  The
problem is that there are numerous places where the converter updates
'state->__count', but fails to preserve those bits (there are also cases where
it fails to properly ignore them).  When these bits are not preserved, line 477
above miscalulates the number of bytes consumed from the input.

Cases where those bits are not properly handled are shown below.

iconvdata/big5hkscs.c:
17773 #define EMIT_SHIFT_TO_INIT \
17774   if (data->__statep->__count != 0)                                      
    \
.....
17783               data->__statep->__count = 0;     .....
.....
17797               data->__statep->__count = 0;                               
    \
.....
17803     }
.....
17812 #define BODY \
17813   {                                                                      
    \
.....
17883                 *statep = ch2 << 3;                                      
    \
.....
17901   }
.....
17920 #define BODY \
17921   {                                                                      
    \
.....
17948         *statep = 0;                                                     
    \
.....
17961         *statep = 0;                                                     
    \
.....
17998                 *statep = ((cp[0] << 8) | cp[1]) << 3;                   
    \
.....
18019   }

A local update to preserve the lower 3 bits sufficed to resolve this issue. 
I'm working on a patch.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

  parent reply	other threads:[~2020-04-01  4:04 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-29  4:28 [Bug locale/25744] New: " tom at honermann dot net
2020-03-29 16:59 ` [Bug locale/25744] " carlos at redhat dot com
2020-03-30 12:44 ` schwab@linux-m68k.org
2020-03-30 15:13 ` carlos at redhat dot com
2020-03-30 17:31 ` tom at honermann dot net
2020-03-30 17:36 ` schwab@linux-m68k.org
2020-03-30 17:38 ` tom at honermann dot net
2020-04-01  4:04 ` tom at honermann dot net [this message]
2020-04-07  5:43 ` tom at honermann dot net
2022-07-06 14:20 ` cvs-commit at gcc dot gnu.org
2022-07-06 14:25 ` adhemerval.zanella at linaro dot org

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-25744-131-O3fZfrK4vB@http.sourceware.org/bugzilla/ \
    --to=sourceware-bugzilla@sourceware.org \
    --cc=glibc-bugs@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).