From mboxrd@z Thu Jan 1 00:00:00 1970
From: Per Bothner
To: Paul Eggert
Cc: gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Wed, 09 Dec 1998 23:03:00 -0000
Message-id: <199812100702.XAA26400@cygnus.com>
References: <199812100145.RAA07906@shade.twinsun.com>
X-SW-Source: 1998-12/msg00359.html

> However, if jc1 is meant only to be used in UTF-8 locales (which seems
> likely), then we needn't worry about this. We just tell people that
> they have to use an UTF-8 locale if they want to use jc1 with non-"C"
> names, because jc1 object files must use UTF-8. This would be an
> understandable restriction, and it means we could avoid having to do
> the translations in either the compiler or the assembler.

I'm not sure about jc1, but gcj (the preferred user-level driver)
is not meant to be used only in UTF-8 locales. Java only uses Unicode
*internally*, but we need to be able to read non-Unicode / non-UTF-8
*files*. And Java defines a mechanism where you can specify
an encoding to use when translating external byte streams to/from
internal Unicode streams. However, Java does not define the external
encoding of Java program files, but only that after processing \u and \U
escapes the input to the lexer is a stream of Unicode characters.

This is a somewhat hypothetical problem, as we have no experience with
the extent, if any, to which people need to be able to use non-ASCII
characters in their source files. But I assume they will want to do
that in their locale's text encoding - which need not be a "UTF-8"
locale. In that case, jc1 (or a pre-processor for jc1) has to translate
the locale's character set into Unicode.

It is reasonable that the default locale for source files (i.e. the one
assumed if you don't override things) should be UTF-8. The locale for
assembler files should probably also be UTF-8. I see no reason to
support anything else.

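To make the translation step above concrete, here is a minimal sketch of what a pre-processor for jc1 might do: decode the source bytes from a caller-specified encoding into Unicode, then re-encode them as UTF-8. It is written in Python purely for brevity. The function name `to_utf8` and its parameters are hypothetical, not part of jc1 or gcj; the only point taken from the text is that UTF-8 is the default when nothing is overridden.

```python
def to_utf8(source_bytes: bytes, source_encoding: str = "utf-8") -> bytes:
    """Hypothetical pre-processor step: re-encode source text as UTF-8.

    source_encoding defaults to UTF-8, i.e. the encoding assumed
    when the user does not override anything.
    """
    text = source_bytes.decode(source_encoding)  # external bytes -> Unicode
    return text.encode("utf-8")                  # Unicode -> UTF-8 bytes


# A source line containing a non-ASCII identifier, as it might be stored
# on disk in an ISO 8859-1 (Latin-1) locale.
latin1_source = "int caf\u00e9 = 1;".encode("latin-1")
utf8_source = to_utf8(latin1_source, "latin-1")
```

With a step like this in front of the compiler, everything downstream only ever sees UTF-8, whatever the user's locale was.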
What we might do is have gasp (the gas pre-processor) provide a hook
for converting from other character sets. But gas itself should
just assume UTF-8 - and generate ld symbols that are also UTF-8.
(A simple implementation is for gas to just recognize that bytes
with the high-order bit set should be treated as (part of) letters.)
Similarly, gdb and ld should assume that labels are UTF-8.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner
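The "simple implementation" mentioned in the parenthetical above can be sketched as a byte-level symbol scanner that accepts any byte with the high-order bit set as part of a name. This is an illustration in Python, not gas code. The names `is_symbol_byte` and `scan_symbol` are hypothetical, and the set of ASCII characters allowed in a symbol (letters, digits after the first position, `_`, `.`, `$`) is an assumption made for the example.

```python
def is_symbol_byte(b: int, first: bool) -> bool:
    """Return True if byte value b may appear in a symbol name."""
    if b >= 0x80:
        # High-order bit set: part of a multi-byte UTF-8 sequence,
        # so treat it as (part of) a letter.
        return True
    ch = chr(b)
    if ch.isalpha() or ch in "_.$":
        return True
    # ASCII digits are allowed anywhere except the first position.
    return (not first) and ch.isdigit()


def scan_symbol(line: bytes, start: int = 0) -> bytes:
    """Return the longest symbol name that begins at line[start]."""
    end = start
    while end < len(line) and is_symbol_byte(line[end], first=(end == start)):
        end += 1
    return line[start:end]


# A label containing a non-ASCII letter, stored as UTF-8 bytes.
label_line = "caf\u00e9_1".encode("utf-8") + b": nop"
symbol = scan_symbol(label_line)
```

The scanner never decodes the bytes, so the symbol is passed through unchanged. That also means it does not validate that the high-bit bytes form well-formed UTF-8.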