public inbox for gcc-prs@sourceware.org
* java/1356: gcj mangles composed characters
@ 2000-12-20 12:25 doko
  0 siblings, 0 replies; only message in thread
From: doko @ 2000-12-20 12:25 UTC (permalink / raw)
  To: java-gnats


>Number:         1356
>Category:       java
>Synopsis:       gcj mangles composed characters
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    tromey
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Dec 20 12:19:17 PST 2000
>Closed-Date:    Thu Sep 14 11:36:15 PDT 2000
>Last-Modified:  Thu Sep 14 11:40:00 PDT 2000
>Originator:     Stephane Bortzmeyer <bortz@pasteur.fr>
>Release:        gcj/libgcj 2.95
>Organization:
>Environment:
Debian GNU/Linux unstable, ix86
>Description:
[Please see http://www.debian.org/Bugs/db/42/42895.html for the original report]

Java is supposed to use Unicode for its character strings. A program such
as this one:

public class Hello {
  
  public static void main ( String []arguments)
  {
    System.out.println ("Liberté, égalité, fraternité !");
  }
  
}

works fine (both with JDK-java and kaffe) when compiled with
JDK-javac or jikes, but prints strange stuff instead of my letters
when compiled with gcj (both with JDK-java and kaffe):

ishtar:~/tmp/Java> jikes Hello.java
ishtar:~/tmp/Java> kaffe Hello
Liberté, égalité, fraternité !

ishtar:~/tmp/Java> gcj -C Hello.java
ishtar:~/tmp/Java> kaffe Hello
Libertþ, þgalitþ, fraternitþ !

>How-To-Repeat:

>Fix:

>Release-Note:

>Audit-Trail:

Formerly PR gcj/33


From: Tom Tromey <tromey@cygnus.com>
To: Java Gnats Server <java-gnats@sourceware.cygnus.com>
Cc: Stephane Bortzmeyer <bortz@pasteur.fr> 
Subject: gcj/33
Date: 02 Sep 1999 20:54:20 -0600

 I looked at this problem.
 
 gcj assumes that the input file is itself Utf-8 encoded.  Is this the
 case in your example?  My guess is that the other compilers assume
 that the input file is encoded according to your locale's charset, and
 that your input file is Latin-1.  To gcj, a Latin-1 file looks like a
 file with encoding errors.
 
 As a workaround you can convert your program to Utf-8 using GNU recode
 (or iconv if you have it).
 
 In the long term I agree that gcj should read the file using the
 locale (possibly augmented with a new flag to indicate the encoding).
 For instance, we could do this quite easily using libunicode (Java
 hackers, contact me for details).
 
 Tom
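
The mismatch described in this message can be illustrated with a small Java sketch. It is only an illustration of decoding Latin-1 bytes as UTF-8, not a reproduction of gcj's lexer: the class and method names are hypothetical, and it relies on `java.nio.charset.StandardCharsets`, which is a much later API than the gcj discussed here. A modern JVM substitutes U+FFFD for the malformed bytes, so the exact garbage differs from the output in the report.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Decode raw bytes the way a reader that only understands UTF-8 would.
    static String readAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String source = "Libert\u00e9";

        // The same text stored in the two encodings under discussion.
        byte[] latin1 = source.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);

        // Latin-1 uses one byte for the accented letter, UTF-8 uses two.
        System.out.println("Latin-1 bytes: " + latin1.length);
        System.out.println("UTF-8 bytes:   " + utf8.length);

        // Only the UTF-8 bytes survive a UTF-8 decode unchanged.
        System.out.println("Latin-1 read as UTF-8 round-trips: "
                + readAsUtf8(latin1).equals(source));
        System.out.println("UTF-8 read as UTF-8 round-trips:   "
                + readAsUtf8(utf8).equals(source));
    }
}
```

The lone byte 0xE9 is a UTF-8 lead byte with no continuation bytes after it, which is why a UTF-8 reader treats the Latin-1 file as containing encoding errors.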

From: Stephane Bortzmeyer <bortzmeyer@pasteur.fr>
To: tromey@cygnus.com
Cc: Java Gnats Server <java-gnats@sourceware.cygnus.com>,
        bortzmeyer@pasteur.fr, 42895@bugs.debian.org
Subject: Re: gcj/33 
Date: Fri, 03 Sep 1999 09:58:37 +0200

 On Thursday 2 September 1999, at 20 h 54, the keyboard of Tom Tromey 
 <tromey@cygnus.com> wrote:
 
 > I looked at this problem.
 
 [BTW, where can I read the PR on gcj? I find nothing on 
 http://sourceware.cygnus.com .]
 
 > gcj assumes that the input file is itself Utf-8 encoded.  Is this the
 > case in your example? 
 
 No, they were in ISO-8859-1 (Latin 1).
 
 > My guess is that the other compilers assume
 > that the input file is encoded according to your locale's charset, and
 
 Hmmm, this is certainly specified in the Java Language Definition, no ? If so, 
 this is just a matter of finding who is right.
 
 > As a workaround you can convert your program to Utf-8 using GNU recode
 
 The workaround works:
 
 ishtar:~/tmp/Java> recode Latin1..UTF-8 Hello.java
 ishtar:~/tmp/Java> more Hello.java
 public class Hello {
   
   public static void main ( String []arguments)
   {
     System.out.println ("Liberté, égalité, fraternité !");
   }
   
 }
 ishtar:~/tmp/Java> gcj -C Hello.java
 ishtar:~/tmp/Java> kaffe Hello
 Liberté, égalité, fraternité !
 
 > In the long term I agree that gcj should read the file using the
 > locale (possibly augmented with a new flag to indicate the encoding).
 
 Well, the most important thing for me is that all Java compilers do the same and follow the Java Language Definition.
 
 Apart from that, practically speaking, I would say that no one can type UTF-8 on her keyboard, while many people can enter Latin-*.
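
For readers without GNU recode or iconv, the conversion used in the workaround above can be sketched in Java. This is a minimal sketch under assumptions: the class name `Latin1ToUtf8` and its methods are hypothetical, the input is assumed to really be ISO-8859-1, and the file is rewritten in place, so keep a backup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Latin1ToUtf8 {
    // Re-encode a byte array from ISO-8859-1 to UTF-8.
    // Every byte value is a valid Latin-1 character, so this never fails,
    // but it silently produces nonsense if the input was not Latin-1.
    static byte[] latin1ToUtf8(byte[] raw) {
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Convert each file named on the command line, in place.
        for (String name : args) {
            Path file = Paths.get(name);
            Files.write(file, latin1ToUtf8(Files.readAllBytes(file)));
        }
    }
}
```

Running it as `java Latin1ToUtf8 Hello.java` would play the same role as the `recode Latin1..UTF-8 Hello.java` step in the transcript.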
 
 

From: Jason Molenda <crash@dollarCOMPANYNAME.com>
To: Stephane Bortzmeyer <bortzmeyer@pasteur.fr>
Cc: java-gnats@sourceware.cygnus.com
Subject: Re: gcj/33
Date: Fri, 3 Sep 1999 01:57:51 -0700

 >  [BTW, where can I read the PR on gcj? I find nothing on 
 >  http://sourceware.cygnus.com .]
 
 It looks like there isn't a link to it on the java web page.  Look at
 
   http://sourceware.cygnus.com/cgi-bin/gnatsweb.pl?database=java&user=guest&password=guest&cmd=login
 
 and bring up PR # 33.
 
 Jason

From: Tom Tromey <tromey@cygnus.com>
To: Stephane Bortzmeyer <bortzmeyer@pasteur.fr>
Cc: tromey@cygnus.com, Java Gnats Server <java-gnats@sourceware.cygnus.com>,
        42895@bugs.debian.org
Subject: Re: gcj/33 
Date: Fri, 3 Sep 1999 10:02:05 -0700

 Stephane> [BTW, where can I read the PR on gcj? I find nothing on 
 Stephane> http://sourceware.cygnus.com .]
 
 Go to http://sourceware.cygnus.com/java and follow the link to the
 Gnats database.
 
 Stephane> Hmmm, this is certainly specified in the Java Language
 Stephane> Definition, no ? If so, this is just a matter of finding who
 Stephane> is right.
 
 I doubt this is specified, though I can't look right now (I don't know
 where my copy of the JLS is).
 
 Stephane> Apart from that, practically speaking, I would say that
 Stephane> no one can type UTF-8 on her keyboard, while many people can
 Stephane> enter Latin-*.
 
 This is a file encoding issue, not an input issue.
 
 Still, I agree -- I just doubt anybody has time to implement this
 right now.
 
 Tom

From: Alexandre Petit-Bianco <apbianco@cygnus.com>
To: java-gnats@sourceware.cygnus.com
Cc:  
Subject: Re: gcj/33
Date: Fri, 3 Sep 1999 10:48:38 -0700

 Stephane Bortzmeyer writes:
 
 >> My guess is that the other compilers assume that the input file is
 >> encoded according to your locale's charset, and
 >  Hmmm, this is certainly specified in the Java Language Definition,
 >  no ? If so, this is just a matter of finding who is right.
 
 The JLS says that Java programs must be written in Unicode, but it also
 defines Unicode escape sequences so that any Unicode character can be
 expressed using only ASCII characters
 ( http://java.sun.com/docs/books/jls/html/3.doc.html#95413 ). That
 in effect defines the minimum encoding one has to support (plain
 "printable" ASCII).
 
 I've never found any spec on how the locale should be consulted to
 interpret the input stream. We happen to try to read UTF-8 for
 character values greater than or equal to 128.
 
 >  > As a workaround you can convert your program to Utf-8 using GNU recode
 
 Or express `é' as the unicode escape sequence `\u00e9'.
 
 ./A
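
The escape-sequence workaround can be applied mechanically. The following sketch rewrites every non-ASCII character in a string as a Unicode escape, so that the resulting source is plain ASCII and reads the same under any of the encodings discussed in this thread. The class and method names are hypothetical, and the sketch works on an in-memory string rather than on a source file.

```java
public class UnicodeEscaper {
    // Replace each UTF-16 code unit above the ASCII range with a
    // backslash, the letter u, and four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The string from the original report, itself written with escapes
        // so that this file is pure ASCII.
        String original = "Libert\u00e9, \u00e9galit\u00e9, fraternit\u00e9 !";
        System.out.println(escapeNonAscii(original));
    }
}
```

Characters outside the Basic Multilingual Plane come out as two escapes, one per surrogate, which is the form Java source expects.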
 

From: Stephane Bortzmeyer <bortzmeyer@pasteur.fr>
To: Tom Tromey <tromey@cygnus.com>
Cc: Java Gnats Server <java-gnats@sourceware.cygnus.com>,
        bortzmeyer@pasteur.fr
Subject: Re: gcj/33 
Date: Mon, 06 Sep 1999 10:38:39 +0200

 On Friday 3 September 1999, at 10 h 2, the keyboard of Tom Tromey 
 <tromey@cygnus.com> wrote:
 
 > Go to http://sourceware.cygnus.com/java and follow the link to the
 > Gnats database.
 
 There is none.
 
 
 
State-Changed-From-To: open->analyzed
State-Changed-By: tromey
State-Changed-When: Mon Mar  6 13:41:29 2000
State-Changed-Why:
    I wrote a patch for this.  The patch isn't perfect
    (the encoding doesn't default to the current encoding
    from your locale; I don't know how to find that information)
    but it does seem to work.  It is pending approval:
    
    http://gcc.gnu.org/ml/gcc-patches/2000-03/msg00190.html

From: tromey@cygnus.com
To: bortz@pasteur.fr, 42895@bugs.debian.org, apbianco@cygnus.com,
  doko@debian.org, java-gnats@sourceware.cygnus.com
Cc:  
Subject: Re: gcj/33
Date: 6 Mar 2000 21:41:29 -0000

 Synopsis: gcj mangles composed characters
 
 State-Changed-From-To: open->analyzed
 State-Changed-By: tromey
 State-Changed-When: Mon Mar  6 13:41:29 2000
 State-Changed-Why:
     I wrote a patch for this.  The patch isn't perfect
     (the encoding doesn't default to the current encoding
     from your locale; I don't know how to find that information)
     but it does seem to work.  It is pending approval:
     
     http://gcc.gnu.org/ml/gcc-patches/2000-03/msg00190.html
 
 http://sourceware.cygnus.com/cgi-bin/gnatsweb.pl?cmd=view&pr=33&database=java
Responsible-Changed-From-To: apbianco->tromey
Responsible-Changed-By: tromey
Responsible-Changed-When: Sat Jun 24 09:36:38 2000
Responsible-Changed-Why:
    This is actually mine -- I have an unfinished patch
    to fix it.

From: tromey@cygnus.com
To: bortz@pasteur.fr, 42895@bugs.debian.org, apbianco@cygnus.com,
  doko@debian.org, java-gnats@sourceware.cygnus.com, tromey@cygnus.com
Cc:  
Subject: Re: gcj/33
Date: 24 Jun 2000 16:36:38 -0000

 Synopsis: gcj mangles composed characters
 
 Responsible-Changed-From-To: apbianco->tromey
 Responsible-Changed-By: tromey
 Responsible-Changed-When: Sat Jun 24 09:36:38 2000
 Responsible-Changed-Why:
     This is actually mine -- I have an unfinished patch
     to fix it.
 
 http://sourceware.cygnus.com/cgi-bin/gnatsweb.pl?cmd=view&pr=33&database=java
State-Changed-From-To: analyzed->feedback
State-Changed-By: tromey
State-Changed-When: Tue Sep 12 15:13:20 2000
State-Changed-Why:
    I'm finally checking in my patch to fix this problem.
    My patch changes gcj to use the current locale's encoding
    by default.  If it can't find the locale's encoding it
    assumes UTF-8.  The patch also adds a `--encoding' switch
    to gcj so that the default encoding can be changed from
    the command line.
    
    If you can try this, please do.
    If not, tell me and I will simply close the PR.
    
    The patch only works on systems with a working iconv.
    I don't currently intend to make it work elsewhere
    (though eventually I may by using libiconv).
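
The selection order described above (explicit switch, then the locale's encoding, then UTF-8) can be sketched as follows. This is not the patch itself, which lives in gcj's C sources and uses iconv; it is a Java illustration with hypothetical names, and the decision to reject an unknown explicit encoding outright is an assumption, not something stated in this PR.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceEncodingChooser {
    // flagValue:      value of an explicit encoding switch, or null if absent
    // localeEncoding: encoding name reported by the locale, or null if unknown
    static Charset choose(String flagValue, String localeEncoding) {
        if (flagValue != null) {
            // Assumption: a bad explicit name is an error, so let
            // Charset.forName throw IllegalArgumentException.
            return Charset.forName(flagValue);
        }
        if (localeEncoding != null && !localeEncoding.isEmpty()) {
            try {
                return Charset.forName(localeEncoding);
            } catch (IllegalArgumentException e) {
                // Locale names an encoding we cannot use: fall through.
            }
        }
        // Last resort, matching the fallback described in the PR.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        String flag = args.length > 0 ? args[0] : null;
        // Charset.defaultCharset() stands in for "the locale's encoding" here.
        String fromLocale = Charset.defaultCharset().name();
        System.out.println(choose(flag, fromLocale));
    }
}
```

Keeping the fallback in one place makes the default easy to reason about: with no switch and no usable locale information, the behaviour is the same as before the patch.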

From: tromey@cygnus.com
To: bortz@pasteur.fr, 42895@bugs.debian.org, doko@debian.org,
  java-gnats@sourceware.cygnus.com, tromey@cygnus.com
Cc:  
Subject: Re: gcj/33
Date: 12 Sep 2000 22:13:20 -0000

 Synopsis: gcj mangles composed characters
 
 State-Changed-From-To: analyzed->feedback
 State-Changed-By: tromey
 State-Changed-When: Tue Sep 12 15:13:20 2000
 State-Changed-Why:
     I'm finally checking in my patch to fix this problem.
     My patch changes gcj to use the current locale's encoding
     by default.  If it can't find the locale's encoding it
     assumes UTF-8.  The patch also adds a `--encoding' switch
     to gcj so that the default encoding can be changed from
     the command line.
     
     If you can try this, please do.
     If not, tell me and I will simply close the PR.
     
     The patch only works on systems with a working iconv.
     I don't currently intend to make it work elsewhere
     (though eventually I may by using libiconv).
 
 http://sources.redhat.com/cgi-bin/gnatsweb.pl?cmd=view&pr=33&database=java
State-Changed-From-To: feedback->closed
State-Changed-By: tromey
State-Changed-When: Thu Sep 14 11:36:15 2000
State-Changed-Why:
    Reporter can't verify but I know it is fixed.

From: tromey@cygnus.com
To: bortz@pasteur.fr, 42895@bugs.debian.org, doko@debian.org,
  java-gnats@sourceware.cygnus.com, tromey@cygnus.com
Cc:  
Subject: Re: gcj/33
Date: 14 Sep 2000 18:36:15 -0000

 Synopsis: gcj mangles composed characters
 
 State-Changed-From-To: feedback->closed
 State-Changed-By: tromey
 State-Changed-When: Thu Sep 14 11:36:15 2000
 State-Changed-Why:
     Reporter can't verify but I know it is fixed.
 
 http://sources.redhat.com/cgi-bin/gnatsweb.pl?cmd=view&pr=33&database=java
>Unformatted:


