public inbox for gcc-prs@sourceware.org
From: James Clark <jjc@jclark.com>
To: nobody@gcc.gnu.org
Cc: gcc-prs@gcc.gnu.org
Subject: Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion
Date: Sat, 22 Feb 2003 14:56:00 -0000
Message-ID: <20030222145601.21045.qmail@sources.redhat.com>

The following reply was made to PR libgcj/9802; it has been noted by GNATS.

From: James Clark <jjc@jclark.com>
To: Mark Wielaard <mark@klomp.org>
Cc: gcc-gnats@gcc.gnu.org, java-prs@gcc.gnu.org, gcc-bugs@gcc.gnu.org,
   nobody@gcc.gnu.org, gcc-prs@gcc.gnu.org
Subject: Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion
Date: Sat, 22 Feb 2003 21:45:31 +0700

 Mark Wielaard wrote:
 > Thanks for the bug report.
 > Your suggested fix seems obviously correct and I verified that making
 > sure that avail is always decremented makes String.getBytes("UTF-8")
 > work (read: not throw an ArrayIndexOutOfBoundsException).
 > 
 > But while creating a test case I noticed that for your example we return
 > two bytes: {0xf0, 0x90} but other implementations return four bytes
 > {0xf0, 0x90, 0x8c, 0x80}. I don't know enough of Unicode and UTF-8
 > encoding to know what is correct or why.
 
 Four bytes is correct.
 
 > If someone has a quick reference to the relevant definitions and/or a
 > testsuite for these kinds of things that would be highly appreciated.
 
 RFC 2279 <http://www.ietf.org/rfc/rfc2279.txt> describes UTF-8
 RFC 2781 <http://www.ietf.org/rfc/rfc2781.txt> describes UTF-16
 
 The code converts UTF-16 (represented as chars) to UTF-8 (represented
 as bytes).
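 
 As a worked example of that conversion, here is a minimal sketch of the
 arithmetic from the two RFCs. The class and method names are made up for
 illustration, and the input pair D800 DF00 is inferred from the four
 bytes quoted above rather than copied from the original report.

```java
public class SurrogateToUtf8 {
    // Combine a UTF-16 surrogate pair into one code point (see RFC 2781).
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Encode a code point in the range U+10000..U+10FFFF as four UTF-8 bytes.
    static byte[] encode4(int cp) {
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),           // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),  // 10xxxxxx
            (byte) (0x80 | ((cp >> 6) & 0x3F)),   // 10xxxxxx
            (byte) (0x80 | (cp & 0x3F))           // 10xxxxxx
        };
    }

    public static void main(String[] args) {
        int cp = toCodePoint('\uD800', '\uDF00');
        StringBuilder sb = new StringBuilder();
        for (byte b : encode4(cp)) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        // prints "10300 -> f0 90 8c 80"
        System.out.println(Integer.toHexString(cp) + " -> " + sb.toString().trim());
    }
}
```

 So one supplementary character takes two chars on input and four bytes
 on output.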
 
 The problem is that String::getBytes assumes that when converter->write
 returns some value N, the result of converting N chars has been
 completely written to the output buffer. In fact, Output_UTF8.write may
 still have trailing bytes of the last character's representation left
 to write. This is not limited to surrogates; any character that encodes
 to more than one byte could be split this way. There is probably a
 similar problem with other multibyte encodings (e.g. SJIS).
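 
 To make the failure mode concrete, here is a toy model of it. ToyConverter
 and naiveGetBytes are hypothetical and are not the libgcj classes; the
 one-byte-per-char buffer sizing is also an assumption, chosen only so that
 the sketch reproduces the two-byte result reported above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

public class PendingBytesDemo {
    // Hypothetical converter: a model of the behaviour described above,
    // not the real Output_UTF8.
    static class ToyConverter {
        private final ArrayDeque<Byte> pending = new ArrayDeque<>();
        int bytesWritten; // bytes stored in 'out' by the most recent write()

        // Returns the number of chars consumed. Bytes that did not fit in
        // 'out' stay in 'pending' until the next call.
        int write(char[] in, int off, int len, byte[] out) {
            bytesWritten = 0;
            int consumed = 0;
            while (true) {
                // Flush leftovers from the previous character first.
                while (!pending.isEmpty() && bytesWritten < out.length) {
                    out[bytesWritten++] = pending.poll();
                }
                if (!pending.isEmpty() || consumed == len) {
                    return consumed;
                }
                // Convert the next code point entirely into 'pending'.
                int cp = Character.codePointAt(in, off + consumed, off + len);
                consumed += Character.charCount(cp);
                String s = new String(Character.toChars(cp));
                for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                    pending.add(b);
                }
            }
        }
    }

    // Naive caller: treats "all chars consumed" as "all bytes written".
    static byte[] naiveGetBytes(char[] chars) {
        ToyConverter conv = new ToyConverter();
        byte[] buf = new byte[chars.length]; // assumed sizing: one byte per char
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        int done = 0;
        while (done < chars.length) {
            done += conv.write(chars, done, chars.length - done, buf);
            result.write(buf, 0, conv.bytesWritten);
        }
        return result.toByteArray(); // bytes still in conv.pending are lost
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : naiveGetBytes(new char[] {'\uD800', '\uDF00'})) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // prints "f0 90"
    }
}
```

 In this model the converter reports both chars as consumed after the first
 call, so the caller stops while 8c 80 are still pending.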
 
 It's not immediately obvious to me whether the right fix is to
 
 - extend the interface of UnicodeToBytes to indicate whether there are 
 still pending bytes to write and change String::getBytes to use this,
 
 - change Output_UTF8 not to convert a char unless there's room for its 
 complete representation in the output buffer, or
 
 - change String::getBytes to keep writing even after all characters have 
 been converted until no bytes are output.
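 
 For comparison, the second option resembles the contract of the standard
 java.nio.charset.CharsetEncoder, which reports OVERFLOW when the output
 buffer is too small and lets the caller drain it and call again. The
 sketch below uses that API purely as a stand-in; it is not libgcj's
 UnicodeToBytes, and the class and helper names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ChunkedEncode {
    // Encode 'text' as UTF-8 through a deliberately small output buffer.
    // chunkSize must be at least 4 (the longest UTF-8 sequence), otherwise
    // a supplementary character could never fit and the loop would stall.
    static byte[] encode(String text, int chunkSize) {
        if (chunkSize < 4) {
            throw new IllegalArgumentException("chunkSize must be >= 4");
        }
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(chunkSize);
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        while (true) {
            CoderResult r = enc.encode(in, out, true);
            drain(out, result);
            if (r.isUnderflow()) {
                break; // all input consumed
            }
            if (r.isError()) {
                throw new IllegalStateException(r.toString());
            }
            // Otherwise OVERFLOW: the buffer was full, so loop again.
        }
        while (enc.flush(out).isOverflow()) {
            drain(out, result);
        }
        drain(out, result);
        return result.toByteArray();
    }

    // Copy whatever the encoder produced and reset the buffer for reuse.
    private static void drain(ByteBuffer out, ByteArrayOutputStream result) {
        out.flip();
        result.write(out.array(), 0, out.limit());
        out.clear();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : encode("a\uD800\uDF00", 4)) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // prints "61 f0 90 8c 80"
    }
}
```

 The caller never has to guess whether bytes are pending, because the
 result code says so explicitly.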
 
 James
 
 http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=9802
 

