libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion

* libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion
@ 2003-02-22  9:56 jjc
From: jjc @ 2003-02-22  9:56 UTC
  To: gcc-gnats


>Number:         9802
>Category:       libgcj
>Synopsis:       Bug in surrogate handling in Unicode to UTF-8 conversion
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    unassigned
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Feb 22 09:56:01 UTC 2003
>Closed-Date:
>Last-Modified:
>Originator:     jjc@jclark.com
>Release:        gcc version 3.3 20030217 (prerelease)
>Organization:
>Environment:
Red Hat Linux 8.0
>Description:
The following program

class Bug {
    static public char surrogate1(int c) {
        return (char)(((c - 0x10000) >> 10) | 0xD800);
    }
    static public char surrogate2(int c) {
        return (char)(((c - 0x10000) & 0x3FF) | 0xDC00);
    }

    static public void main(String[] args) throws java.io.UnsupportedEncodingException {
        int ch = 0x10300;
        char[] v = new char[2];
        v[0] = surrogate1(ch);
        v[1] = surrogate2(ch);
        String str = new String(v);
        str.getBytes("UTF-8");
    }
}

when compiled and executed throws an exception

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
   at gnu.gcj.convert.Output_UTF8.write(char[], int, int) (/home/jjc/gcc/lib/libgcj.so.4.0.0)
   at gnu.gcj.convert.UnicodeToBytes.write(java.lang.String, int, int, char[]) (/home/jjc/gcc/lib/libgcj.so.4.0.0)
   at java.lang.String.getBytes(java.lang.String) (/home/jjc/gcc/lib/libgcj.so.4.0.0)
   at Bug.main(java.lang.String[]) (Unknown Source)
 
>How-To-Repeat:

>Fix:
I haven't tested this, but I suspect the following should fix it:

*** gcc/libjava/gnu/gcj/convert/Output_UTF8.java~	2000-08-09 00:35:32.000000000 +0700
--- gcc/libjava/gnu/gcj/convert/Output_UTF8.java	2003-02-22 16:38:52.000000000 +0700
***************
*** 104,109 ****
--- 104,110 ----
  	      {
  		value = (hi_part - 0xD800) * 0x400 + (ch - 0xDC00) + 0x10000;
  		buf[count++] = (byte) (0xF0 | (value >> 18));
+ 		avail--;
  		bytes_todo = 3;
  		hi_part = 0;
  	      }


>Release-Note:
>Audit-Trail:
>Unformatted:



* Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion
@ 2003-02-22 14:56 James Clark
From: James Clark @ 2003-02-22 14:56 UTC
  To: nobody; +Cc: gcc-prs

The following reply was made to PR libgcj/9802; it has been noted by GNATS.

From: James Clark <jjc@jclark.com>
To: Mark Wielaard <mark@klomp.org>
Cc: gcc-gnats@gcc.gnu.org, java-prs@gcc.gnu.org, gcc-bugs@gcc.gnu.org,
   nobody@gcc.gnu.org, gcc-prs@gcc.gnu.org
Subject: Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8	conversion
Date: Sat, 22 Feb 2003 21:45:31 +0700

 Mark Wielaard wrote:
 > Thanks for the bug report.
 > Your suggested fix seems obviously correct and I verified that making
 > sure that avail is always decremented makes String.getBytes("UTF-8")
 > work (read: not throw an ArrayIndexOutOfBoundsException).
 > 
 > But while creating a test case I noticed that for your example we return
 > two bytes: {0xf0, 0x90} but other implementations return four bytes
 > {0xf0, 0x90, 0x8c, 0x80}. I don't know enough of Unicode and UTF-8
 > encoding to know what is correct or why.
 
 Four bytes is correct.
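
 As a cross-check of the four-byte result, here is a minimal sketch of the
 four-byte UTF-8 layout from RFC 2279 applied to the code point from the
 report (0x10300). The class and method names are made up for illustration
 and are not part of libgcj.

```java
class Utf8FourByteSketch {
    /**
     * Encodes a supplementary code point (U+10000..U+10FFFF) using the
     * four-byte UTF-8 form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
     * Hypothetical helper, written only to check the expected bytes.
     */
    static byte[] encodeSupplementary(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),          // top 3 bits
            (byte) (0x80 | ((cp >> 12) & 0x3F)), // next 6 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // next 6 bits
            (byte) (0x80 | (cp & 0x3F))          // low 6 bits
        };
    }

    public static void main(String[] args) {
        // Prints: 0xf0 0x90 0x8c 0x80
        for (byte b : encodeSupplementary(0x10300)) {
            System.out.printf("0x%02x ", b & 0xFF);
        }
        System.out.println();
    }
}
```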
 
 > If someone has a quick reference to the relevant definitions and/or a
 > test suite for this kind of thing, that would be highly appreciated.
 
 RFC 2279 <http://www.ietf.org/rfc/rfc2279.txt> describes UTF-8
 RFC 2781 <http://www.ietf.org/rfc/rfc2781.txt> describes UTF-16
 
 The code is doing a conversion of UTF-16 (represented as chars) to UTF-8 
 represented as bytes.
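
 The first half of that conversion, recombining a UTF-16 surrogate pair
 into one code point, can be sketched as follows. It uses the same
 arithmetic as the context line in the patch; the class and method names
 are invented for this example.

```java
class SurrogatePairSketch {
    /**
     * Combines a high surrogate (0xD800..0xDBFF) and a low surrogate
     * (0xDC00..0xDFFF) into the code point they represent (RFC 2781).
     * Hypothetical helper; assumes the pair is well formed.
     */
    static int toCodePoint(char hi, char lo) {
        return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        // The pair produced by surrogate1/surrogate2 for 0x10300.
        char hi = 0xD800;
        char lo = 0xDF00;
        // Prints: 10300
        System.out.println(Integer.toHexString(toCodePoint(hi, lo)));
    }
}
```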
 
 The problem is that String::getBytes assumes that when converter->write 
 returns some value N it means that the result of converting N chars has 
 been completely written to the output buffer, whereas in fact 
 Output_UTF8.write may have some trailing bytes in the representation of 
 the last character still to write.   This can happen not just with 
 surrogates.  There is probably a similar problem with other multibyte 
 encodings (e.g. SJIS).
 
 It's not immediately obvious to me whether the right fix is to
 
 - extend the interface of UnicodeToBytes to indicate whether there are 
 still pending bytes to write and change String::getBytes to use this,
 
 - change Output_UTF8 not to convert a char unless there's room for its 
 complete representation in the output buffer, or
 
 - change String::getBytes to keep writing even after all characters have 
 been converted until no bytes are output.
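
 To make the first option concrete, here is a rough sketch under stated
 assumptions. ToyUtf8Converter, havePendingBytes and the deliberately tiny
 two-byte buffer are all hypothetical; this is not the libgcj code or its
 interface, and it assumes well-formed UTF-16 input. It only shows a
 caller that keeps draining the converter while bytes are still pending.

```java
import java.io.ByteArrayOutputStream;

class PendingBytesSketch {
    /** Toy UTF-16 to UTF-8 converter with a fixed output buffer (hypothetical). */
    static class ToyUtf8Converter {
        final byte[] buf;
        int count;             // bytes currently stored in buf
        private int value;     // code point whose trailing bytes are owed
        private int bytesTodo; // trailing bytes still to write
        private char hiPart;   // high surrogate waiting for its partner

        ToyUtf8Converter(int size) { buf = new byte[size]; }

        /** True if the last character converted is only partly written. */
        boolean havePendingBytes() { return bytesTodo > 0; }

        /** Converts up to len chars and returns the number of chars consumed. */
        int write(char[] in, int off, int len) {
            int start = off;
            int end = off + len;
            while (count < buf.length) {
                if (bytesTodo > 0) {           // finish the previous character first
                    bytesTodo--;
                    buf[count++] = (byte) (0x80 | ((value >> (6 * bytesTodo)) & 0x3F));
                    continue;
                }
                if (off >= end) break;
                char ch = in[off++];
                if (hiPart != 0 && ch >= 0xDC00 && ch <= 0xDFFF) {
                    value = (hiPart - 0xD800) * 0x400 + (ch - 0xDC00) + 0x10000;
                    hiPart = 0;
                    buf[count++] = (byte) (0xF0 | (value >> 18));
                    bytesTodo = 3;
                } else if (ch >= 0xD800 && ch <= 0xDBFF) {
                    hiPart = ch;               // wait for the low surrogate
                } else if (ch < 0x80) {
                    buf[count++] = (byte) ch;
                } else if (ch < 0x800) {
                    value = ch;
                    buf[count++] = (byte) (0xC0 | (ch >> 6));
                    bytesTodo = 1;
                } else {
                    value = ch;
                    buf[count++] = (byte) (0xE0 | (ch >> 12));
                    bytesTodo = 2;
                }
            }
            return off - start;
        }
    }

    /** Caller loop: stop only when all chars are consumed AND nothing is pending. */
    static byte[] getBytes(String s, int bufSize) {
        ToyUtf8Converter conv = new ToyUtf8Converter(bufSize);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        char[] chars = s.toCharArray();
        int off = 0;
        while (off < chars.length || conv.havePendingBytes()) {
            off += conv.write(chars, off, chars.length - off);
            out.write(conv.buf, 0, conv.count); // drain the buffer
            conv.count = 0;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = new String(new char[] {0xD800, 0xDF00});
        // Prints: f0 90 8c 80
        for (byte b : getBytes(s, 2)) {
            System.out.printf("%02x ", b & 0xFF);
        }
        System.out.println();
    }
}
```

 With the two-byte buffer, the first call to write consumes both chars but
 leaves two trailing bytes owed. A caller that stopped once every char was
 consumed would return only f0 90 in this toy.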
 
 James
 
 http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=9802
 



* Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8 conversion
@ 2003-02-22 13:46 Mark Wielaard
From: Mark Wielaard @ 2003-02-22 13:46 UTC
  To: nobody; +Cc: gcc-prs

The following reply was made to PR libgcj/9802; it has been noted by GNATS.

From: Mark Wielaard <mark@klomp.org>
To: gcc-gnats@gcc.gnu.org, jjc@jclark.com, java-prs@gcc.gnu.org,  gcc-bugs@gcc.gnu.org, nobody@gcc.gnu.org, gcc-prs@gcc.gnu.org
Cc:  
Subject: Re: libgcj/9802: Bug in surrogate handling in Unicode to UTF-8
	conversion
Date: 22 Feb 2003 14:38:56 +0100

 Thanks for the bug report.
 Your suggested fix seems obviously correct and I verified that making
 sure that avail is always decremented makes String.getBytes("UTF-8")
 work (read: not throw an ArrayIndexOutOfBoundsException).

 But while creating a test case I noticed that for your example we return
 two bytes: {0xf0, 0x90} but other implementations return four bytes
 {0xf0, 0x90, 0x8c, 0x80}. I don't know enough of Unicode and UTF-8
 encoding to know what is correct or why.

 If someone has a quick reference to the relevant definitions and/or a
 test suite for this kind of thing, that would be highly appreciated.

 http://gcc.gnu.org/cgi-bin/gnatsweb.pl?cmd=view%20audit-trail&database=gcc&pr=9802
