Date: Sat, 22 Feb 2003 13:46:00 -0000
Message-ID: <20030222134600.13361.qmail@sources.redhat.com>
From: Mark Wielaard
To: nobody@gcc.gnu.org
Cc: gcc-prs@gcc.gnu.org
Reply-To: Mark Wielaard
Subject: Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion

The following reply was made to PR libgcj/9802; it has been noted by GNATS.

From: Mark Wielaard
To: gcc-gnats@gcc.gnu.org, jjc@jclark.com, java-prs@gcc.gnu.org, gcc-bugs@gcc.gnu.org, nobody@gcc.gnu.org, gcc-prs@gcc.gnu.org
Date: 22 Feb 2003 14:38:56 +0100

 Thanks for the bug report.

 Your suggested fix seems obviously correct, and I verified that making
 sure avail is always decremented makes String.getBytes("UTF-8") work
 (that is, it no longer throws an ArrayIndexOutOfBoundsException).

 But while creating a test case I noticed that for your example we
 return two bytes, {0xf0, 0x90}, whereas other implementations return
 four bytes, {0xf0, 0x90, 0x8c, 0x80}. I don't know enough about Unicode
 and the UTF-8 encoding to know which is correct or why.

 If someone has a quick reference to the relevant definitions and/or a
 test suite for this kind of thing, that would be highly appreciated.

 http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=9802
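
 On the question of which output is right, a small sketch may help. In
 UTF-8, a code point in the range U+10000..U+10FFFF is always encoded as
 four bytes, and a UTF-16 surrogate pair stands for exactly one such code
 point. The four bytes {0xf0, 0x90, 0x8c, 0x80} decode to U+10300, so the
 sketch below assumes the input was the surrogate pair D800 DF00; that
 input is inferred from the bytes, not taken from the report. The class
 and method names are made up for illustration and are not libgcj code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of libgcj.
class Utf8SurrogateDemo {
    // Combine a UTF-16 high/low surrogate pair into a single code point.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Encode a supplementary code point (U+10000..U+10FFFF) as four UTF-8 bytes:
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    static byte[] encodeSupplementary(int cp) {
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        // Assumed input: the pair that corresponds to the four bytes quoted above.
        char high = (char) 0xD800;
        char low = (char) 0xDF00;

        byte[] byHand = encodeSupplementary(toCodePoint(high, low));
        byte[] byLibrary = new String(new char[] {high, low})
                .getBytes(StandardCharsets.UTF_8);

        // Compare the hand-rolled encoding with the runtime's own encoder.
        System.out.println(Arrays.equals(byHand, byLibrary));
    }
}
```

 Under that assumption, a two-byte result is a truncated sequence: the
 lead byte 0xf0 announces a four-byte sequence, so the two trailing
 continuation bytes are missing.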