The following reply was made to PR java/2313; it has been noted by GNATS.

From: Tom Tromey
To: Bryce McKinlay
Cc: diam@ensta.fr, gcc-gnats@gcc.gnu.org, java-patches@gcc.gnu.org
Subject: Re: java/2313: Java SimpleDateFormat crash with non US locales (french...)
Date: 19 Mar 2001 09:42:35 -0700

>>>>> "Bryce" == Bryce McKinlay writes:

Bryce> System.out.println ("Liberté, égalité, fraternité !");

Bryce> works fine in the default mode, but with "--encoding=UTF-8" it
Bryce> produces incorrect output.

That's because the input file isn't actually in UTF-8, but it also
doesn't contain an incorrect (by our rules -- see below) UTF-8
sequence that would let us see it as erroneous.

The `é' is 0xe9.  This is a valid start byte for a 3-byte UTF-8
sequence.  That is why the characters following it are also removed.

We ought to be noticing that the subsequent bytes in the sequence are
invalid.  That is what Unicode specifies, and there probably isn't a
good reason to allow incorrectly encoded characters.  However, the
code wasn't originally written this way and I never updated it to do
this.  I'll submit a PR.

Bryce> Unfortunately, I know very little about character
Bryce> encoding. Maybe Tom can suggest a fix or workaround. Perhaps
Bryce> it's possible to do something to convert the file to a UTF-8
Bryce> encoding before trying to compile it?

One fix would be to tell gcj the real encoding of the file:

    gcj --encoding=8859_1 ...

This works for me.  However, note that the encoding names are
system-dependent :-(.  Ideally we'd have a table of aliases mapping
the Java-specified names to the system-dependent ones.

Another fix would be to use the `iconv' or `recode' programs to
convert the file into UTF-8 before compiling.  This is a pain to do,
but might be the only recourse on systems with a losing (or no)
iconv() implementation.

Tom
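
To make the point about 0xe9 concrete, here is a minimal sketch of the
kind of strict check described in the message above.  It is not gcj's
code: the class `Utf8Check' and its methods are hypothetical names made
up for this illustration, and it uses the modern
java.nio.charset.StandardCharsets API.  It only checks that each lead
byte is followed by the right number of continuation bytes; it does not
reject every overlong or surrogate form.  The string literal uses
\u escapes so that the source file itself stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of gcj or libgcj.
class Utf8Check {
    // Length of the sequence announced by a lead byte, or -1 if the
    // byte cannot start a sequence (continuation byte or unused value).
    static int sequenceLength(int b) {
        b &= 0xFF;                              // treat the byte as unsigned
        if (b < 0x80) return 1;                 // 0xxxxxxx: ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2;   // 110xxxxx
        if (b >= 0xE0 && b <= 0xEF) return 3;   // 1110xxxx (0xe9 is here)
        if (b >= 0xF0 && b <= 0xF4) return 4;   // 11110xxx
        return -1;
    }

    // Continuation bytes have the bit pattern 10xxxxxx.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Offset of the first structurally malformed sequence, or -1 if the
    // whole array passes the lead-byte/continuation-byte check.
    static int firstMalformed(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int len = sequenceLength(data[i]);
            if (len < 0 || i + len > data.length) return i;
            for (int k = 1; k < len; k++) {
                if (!isContinuation(data[i + k])) return i;
            }
            i += len;
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Libert\u00e9, \u00e9galit\u00e9, fraternit\u00e9 !";
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(sequenceLength(0xE9));    // 3
        System.out.println(firstMalformed(latin1));  // 6: the first 0xe9
        System.out.println(firstMalformed(utf8));    // -1

        // Same idea as running iconv or recode on the file: decode with
        // the real encoding, then re-encode as UTF-8.
        byte[] converted = new String(latin1, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(firstMalformed(converted)); // -1
    }
}
```

With the Latin-1 bytes, the check stops at offset 6, because the 0xe9
there announces a 3-byte sequence but is followed by `,' and a space
rather than continuation bytes.  A decoder that skips this check
consumes those two bytes as part of the sequence instead.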