From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 21061 invoked by alias); 22 Feb 2003 14:56:01 -0000
Mailing-List: contact gcc-prs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive:
List-Post:
List-Help:
Sender: gcc-prs-owner@gcc.gnu.org
Received: (qmail 21046 invoked by uid 71); 22 Feb 2003 14:56:01 -0000
Date: Sat, 22 Feb 2003 14:56:00 -0000
Message-ID: <20030222145601.21045.qmail@sources.redhat.com>
To: nobody@gcc.gnu.org
Cc: gcc-prs@gcc.gnu.org,
From: James Clark
Subject: Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion
Reply-To: James Clark
X-SW-Source: 2003-02/txt/msg01120.txt.bz2
List-Id:

The following reply was made to PR libgcj/9802; it has been noted by GNATS.

From: James Clark
To: Mark Wielaard
Cc: gcc-gnats@gcc.gnu.org, java-prs@gcc.gnu.org, gcc-bugs@gcc.gnu.org,
    nobody@gcc.gnu.org, gcc-prs@gcc.gnu.org
Subject: Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion
Date: Sat, 22 Feb 2003 21:45:31 +0700

Mark Wielaard wrote:

> Thanks for the bug report.
> Your suggested fix seems obviously correct and I verified that making
> sure that avail is always decremented makes String.getBytes("UTF-8")
> work (read not throw an ArrayIndexOutOfBoundsException).
>
> But while creating a test case I noticed that for your example we return
> two bytes: {0xf0, 0x90} but other implementations return four bytes
> {0xf0, 0x90, 0x8c, 0x80}. I don't know enough of Unicode and UTF-8
> encoding to know what is correct or why.

Four bytes is correct.

> If someone has a quick reference to the relevant definitions and/or a
> testsuite for these kind of things that would be highly appreciated.

RFC 2279 describes UTF-8
RFC 2781 describes UTF-16

The code converts UTF-16 (represented as chars) to UTF-8 (represented as
bytes).
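
As a worked check of the four-byte answer, here is a minimal sketch of the arithmetic from the two RFCs. The bytes {0xf0, 0x90, 0x8c, 0x80} decode to U+10300, whose UTF-16 form is the surrogate pair D800 DF00; the class and method names are hypothetical and are not libgcj code.

```java
final class SurrogateToUtf8 {
    // Combine a UTF-16 surrogate pair into one code point (RFC 2781).
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes (RFC 2279).
    static byte[] encodeSupplementary(int cp) {
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),          // lead byte: 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)), // continuation: 10xxxxxx
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        byte[] out = encodeSupplementary(toCodePoint('\uD800', '\uDF00'));
        StringBuilder sb = new StringBuilder();
        for (byte b : out) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // prints "f0 90 8c 80"
    }
}
```

A result that stops at {0xf0, 0x90} is therefore only the first half of the encoding of a single character.
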
The problem is that String::getBytes assumes that when converter->write
returns some value N it means that the result of converting N chars has
been completely written to the output buffer, whereas in fact
Output_UTF8.write may have some trailing bytes in the representation of
the last character still to write.  This can happen not just with
surrogates.  There is probably a similar problem with other multibyte
encodings (e.g. SJIS).

It's not immediately obvious to me whether the right fix is to

- extend the interface of UnicodeToBytes to indicate whether there are
  still pending bytes to write and change String::getBytes to use this,

- change Output_UTF8 not to convert a char unless there's room for its
  complete representation in the output buffer, or

- change String::getBytes to keep writing even after all characters have
  been converted until no bytes are output.

James

http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=9802
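
For illustration only, the second option above (convert a char only when its complete representation fits) might look roughly like the sketch below. WholeCharUtf8, write and encode are hypothetical stand-ins, not the libgcj Output_UTF8 or String::getBytes code, and the sketch assumes well-formed UTF-16 input and an output buffer of at least four bytes.

```java
import java.io.ByteArrayOutputStream;

final class WholeCharUtf8 {
    // Hypothetical converter step: encodes chars from src[start..] into buf,
    // stopping before any character whose full UTF-8 form would not fit.
    // Returns {charsConsumed, bytesWritten}.
    static int[] write(char[] src, int start, byte[] buf) {
        int i = start;
        int n = 0;
        while (i < src.length) {
            int cp = src[i];
            int units = 1;
            if (Character.isHighSurrogate(src[i])) {
                if (i + 1 >= src.length || !Character.isLowSurrogate(src[i + 1])) {
                    throw new IllegalArgumentException("unpaired surrogate at " + i);
                }
                cp = Character.toCodePoint(src[i], src[i + 1]);
                units = 2;
            } else if (Character.isLowSurrogate(src[i])) {
                throw new IllegalArgumentException("unpaired surrogate at " + i);
            }
            int need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (n + need > buf.length) {
                break; // no room for the whole character: leave it for the next call
            }
            if (need == 1) {
                buf[n++] = (byte) cp;
            } else {
                int lead = need == 2 ? 0xC0 : need == 3 ? 0xE0 : 0xF0;
                buf[n++] = (byte) (lead | (cp >> (6 * (need - 1))));
                for (int k = need - 2; k >= 0; k--) {
                    buf[n++] = (byte) (0x80 | ((cp >> (6 * k)) & 0x3F));
                }
            }
            i += units;
        }
        return new int[] {i - start, n};
    }

    // Hypothetical caller: because write never leaves a character half
    // written, "N chars consumed" really does mean their bytes are in buf.
    static byte[] encode(String s, int bufSize) {
        if (bufSize < 4) {
            throw new IllegalArgumentException("buffer must hold one full character");
        }
        char[] src = s.toCharArray();
        byte[] buf = new byte[bufSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (pos < src.length) {
            int[] r = write(src, pos, buf);
            out.write(buf, 0, r[1]);
            pos += r[0];
        }
        return out.toByteArray();
    }
}
```

With this shape the caller's loop needs no notion of pending bytes, at the cost of requiring the buffer to be large enough for the longest single character in the encoding.
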