public inbox for glibc-bugs@sourceware.org help / color / mirror / Atom feed
From: "tom at honermann dot net" <sourceware-bugzilla@sourceware.org> To: glibc-bugs@sourceware.org Subject: [Bug locale/25744] mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters Date: Wed, 01 Apr 2020 04:04:24 +0000 [thread overview] Message-ID: <bug-25744-131-O3fZfrK4vB@http.sourceware.org/bugzilla/> (raw) In-Reply-To: <bug-25744-131@http.sourceware.org/bugzilla/> https://sourceware.org/bugzilla/show_bug.cgi?id=25744 --- Comment #7 from Tom Honermann <tom at honermann dot net> --- I debugged and identified the root cause for this issue. The problem is that the Big5-HKSCS converter is failing to preserve the lowest 3 bits of the mbstate_t __count data member. The relevant code is below. In iconv/loop.c, lines 396-397 are responsible for unpacking previously cached bytes stored in 'state' into an internal buffer (bytebuf). The number of previously cached bytes is stored in the three lowest bits of 'state->__count'. (The Big5-HKSCS converter does not define UNPACK_BYTES). Lines 485-490 are responsible for the packing of processed bytes and the count of them when insufficient bytes were available to process a complete character. (The Big5-HKSCS converter does not define STORE_REST). The line most relevant to this issue is line 477 where the number of bytes to advance the input pointer provided by the caller is determined. The calculation is performed by determining how much of the internal buffer (bytebuf) was read by the converter, and then subtracting the number of bytes that were populated from the cache (lines 396-397). The Big5-HKSCS converter runs at line 446 in the expansion of BODY. Note that this occurs between the lines that unpack cached bytes (396-397) and the line that computes how far to advance the input pointer (477). (more after the code break). iconv/loop.c: 366 static inline int 367 __attribute ((always_inline)) 368 SINGLE(LOOPFCT) (struct __gconv_step *step, 369 struct __gconv_step_data *step_data, 370 const unsigned char **inptrp, const unsigned char *inend, 371 unsigned char **outptrp, unsigned char *outend, 372 size_t *irreversible EXTRA_LOOP_DECLS) 373 { ... 391 # ifdef UNPACK_BYTES ... 393 # else 394 /* Add the bytes from the state to the input buffer. */ 395 assert ((state->__count & 7) <= sizeof (state->__value)); 396 for (inlen = 0; inlen < (size_t) (state->__count & 7); ++inlen) 397 bytebuf[inlen] = state->__value.__wchb[inlen]; 398 # endif ... 441 inptr = bytebuf; 442 inend = &bytebuf[inlen]; ... 446 BODY ... 477 *inptrp += inend - bytebuf - (state->__count & 7); 478 # ifdef STORE_REST ... 482 # else 483 /* We don't have enough input for another complete input 484 character. */ 485 assert (inend - inptr > (state->__count & ~7)); 486 assert (inend - inptr <= sizeof (state->__value)); 487 state->__count = (state->__count & ~7) | (inend - inptr); 488 inlen = 0; 489 while (inptr < inend) 490 state->__value.__wchb[inlen++] = *inptr++; 491 # endif 492 } 493 494 return result; 495 } The Big5-HKSCS converter also stores information in 'state->__count'. It does not use the lower most 3 bits for its own state (where the number of cached characters is stored; it appears to acknowledge that they are reserved). The problem is that there are numerous places where the converter updates 'state->__count', but fails to preserve those bits (there are also cases where it fails to properly ignore them). When these bits are not preserved, line 477 above miscalulates the number of bytes consumed from the input. Cases where those bits are not properly handled are shown below. iconvdata/big5hkscs.c: 17773 #define EMIT_SHIFT_TO_INIT \ 17774 if (data->__statep->__count != 0) \ ..... 17783 data->__statep->__count = 0; ..... ..... 17797 data->__statep->__count = 0; \ ..... 17803 } ..... 17812 #define BODY \ 17813 { \ ..... 17883 *statep = ch2 << 3; \ ..... 17901 } ..... 17920 #define BODY \ 17921 { \ ..... 17948 *statep = 0; \ ..... 17961 *statep = 0; \ ..... 17998 *statep = ((cp[0] << 8) | cp[1]) << 3; \ ..... 18019 } A local update to preserve the lower 3 bits sufficed to resolve this issue. I'm working on a patch. -- You are receiving this mail because: You are on the CC list for the bug.
next prev parent reply other threads:[~2020-04-01 4:04 UTC|newest] Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-03-29 4:28 [Bug locale/25744] New: " tom at honermann dot net 2020-03-29 16:59 ` [Bug locale/25744] " carlos at redhat dot com 2020-03-30 12:44 ` schwab@linux-m68k.org 2020-03-30 15:13 ` carlos at redhat dot com 2020-03-30 17:31 ` tom at honermann dot net 2020-03-30 17:36 ` schwab@linux-m68k.org 2020-03-30 17:38 ` tom at honermann dot net 2020-04-01 4:04 ` tom at honermann dot net [this message] 2020-04-07 5:43 ` tom at honermann dot net 2022-07-06 14:20 ` cvs-commit at gcc dot gnu.org 2022-07-06 14:25 ` adhemerval.zanella at linaro dot org
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=bug-25744-131-O3fZfrK4vB@http.sourceware.org/bugzilla/ \ --to=sourceware-bugzilla@sourceware.org \ --cc=glibc-bugs@sourceware.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).