From mboxrd@z Thu Jan 1 00:00:00 1970
From: Per Bothner
To: Martin von Loewis
Cc: egcs@cygnus.com
Subject: Re: UTF8 in identifiers
Date: Fri, 01 May 1998 14:59:00 -0000
Message-id: <199805012159.OAA05092@cygnus.com>
References: <199805011927.VAA16755@mira.isdn.cs.tu-berlin.de>
X-SW-Source: 1998-05/msg00013.html

> After reading gxxint, I see that the mangling format already provides
> for Unicode in identifiers.

Well, it is not really implemented yet, except perhaps for some
bits and pieces.

> However, I wonder whether it might be better to choose a different
> solution: mangle Unicode characters as UTF-8.

Yes, that is a better solution, and I have been tempted towards it.
The problem, as you point out:

> There is one drawback, of course: UTF-8 is illegal in most assemblers.

Extending gas would not be that difficult. The problem is: can we
require gas? I don't think we are ready for that: we need to use Gcc
on targets to which Gas has not been ported yet.

> - It supports \U escapes as well. This is not an important issue, as
>   the current mangling can be extended to \U escapes, and because
>   those escapes will be rare in the next few years.

I'm not sure what you mean. You use \U escapes to specify Unicode
characters in *source code* (and possibly assembly code); they have
nothing to do with mangling.

> Before starting to work on it, I'd like to know what people think
> about this proposal.

Well, whether or not we use UTF8 for mangling, I still think we
should make gas UTF8-aware, even if we don't immediately make the
compiler take advantage of it. I would like to:

1) Make sure gas and bfd are 8-bit clean for identifier names.

2) Agree that source characters with the high bit set are
interpreted as UTF-8.

3) Change Gas to handle \uXXXX and \UXXXXXXXX escapes in names,
and to generate the corresponding UTF-8 sequence.

And one more small feature:

4) \ followed by a non-alphanumeric character, and not inside a
string literal, means that the following character is treated as
part of an identifier (i.e. as if it were a letter).

One other suggestion:

5) In a string literal, a \u or \U escape generates a UTF-8 sequence,
while an octal or hex escape generates a single byte with the
specified value. Thus "\u00FF" translates to { 0xC3, 0xBF }
while "\xFF" or "\377" translate to { 0xFF }. This is different
from what Java does (whose Strings are Unicode strings), but
seems to make sense for byte strings.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner
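
Suggestion 5 can be sketched in Python as follows. This is only an
illustration of the proposed rule, not gas code: the function name
decode_literal is hypothetical, hex escapes are assumed to take exactly
two digits, octal escapes one to three digits, and any other escape is
simply passed through.

```python
def decode_literal(body: str) -> bytes:
    """Translate the body of a string literal (quotes removed) into bytes.

    \\uXXXX and \\UXXXXXXXX become the UTF-8 encoding of the code point;
    \\xHH and octal escapes become one raw byte (assumed digit counts).
    """
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            # Ordinary character: emit its UTF-8 encoding.
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        kind = body[i + 1]
        if kind in "uU":
            # Unicode escape: 4 hex digits for \u, 8 for \U.
            width = 4 if kind == "u" else 8
            digits = body[i + 2 : i + 2 + width]
            if len(digits) != width:
                raise ValueError("truncated Unicode escape")
            out += chr(int(digits, 16)).encode("utf-8")
            i += 2 + width
        elif kind == "x":
            # Hex escape: exactly two digits assumed, one raw byte out.
            digits = body[i + 2 : i + 4]
            if len(digits) != 2:
                raise ValueError("truncated hex escape")
            out.append(int(digits, 16))
            i += 4
        elif kind in "01234567":
            # Octal escape: up to three digits, one raw byte out.
            j = i + 1
            while j < len(body) and j < i + 4 and body[j] in "01234567":
                j += 1
            out.append(int(body[i + 1 : j], 8) & 0xFF)
            i = j
        else:
            # Other escapes are outside this sketch: keep the character.
            out += kind.encode("utf-8")
            i += 2
    return bytes(out)
```

With this sketch, decode_literal(r"\u00FF") gives the two bytes 0xC3 0xBF,
while decode_literal(r"\xFF") and decode_literal(r"\377") each give the
single byte 0xFF, matching the example in suggestion 5.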