From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin von Loewis
To: eggert@twinsun.com
Cc: brolley@cygnus.com, gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Wed, 09 Dec 1998 23:18:00 -0000
Message-id: <199812100712.IAA00283@mira.isdn.cs.tu-berlin.de>
References: <19981204032449.3033.qmail@comton.airs.com>
 <199812060519.VAA07309@shade.twinsun.com>
 <366C0645.61C48A38@cygnus.com>
 <199812080057.QAA00491@shade.twinsun.com>
 <366D460E.4FB0ECD0@cygnus.com>
 <199812092143.NAA04890@shade.twinsun.com>
 <199812092227.XAA12100@mira.isdn.cs.tu-berlin.de>
 <199812100145.RAA07906@shade.twinsun.com>
X-SW-Source: 1998-12/msg00361.html

> I see the need for mangling, but I don't see why TREE_UNIVERSAL_CHAR
> is needed. When outputting a name, you don't need to have a separate
> flag specifying whether the identifier contains \u; you can just
> inspect the identifier string directly. This would be
> ASM_OUTPUT_LABELREF's job.

TREE_UNIVERSAL_CHAR is an optimization to avoid inspecting the
string. Note that it is defined for the C++ front end only.

The encoding of Unicode has to be done in the front end for C++: the
length of a class name depends on the encoding, and that length has
to go into the mangling. Also, if the mangling of gxxint.texi is
used, Fo\u1234 becomes U7Fo_1234, where the U indicates that the
underscore is an escape. The back end knows nothing about this
convention.

> Also, I assume that once the patch is generalized to non-UTF-8
> locales, it won't be just the \u and \U escapes that require
> mangling.

There is no need to generalise that. Defining object files to use
Unicode is the right thing :-)

> If the object-code standard is to use UTF-8 names, then I suppose
> the assembler can convert to UTF-8.

No. The gas people made it very clear that they consider character
sets somebody else's problem (i.e. ours).

> Sorry, I don't understand this point. If you're saying that C++
> mangles non-ASCII identifiers into ASCII labels, but C doesn't, then
> I don't see why that should be: there's no reason in principle that
> C couldn't or shouldn't use the same sort of mangling.

Sure there is. Look at the example above, and see how you can't
provide that service for C linkage, where names must be emitted
unmangled.

> I've run into shells that use the top bit for their own purposes.

What system?

> And, even if such shells are discounted, it's a bit odd to use UTF-8
> in configure.in without labeling the file. My Emacs (20.3)
> misidentified the file as being ISO Latin 1.

So what? This tests whether the assembler can process a certain
sequence of uninterpreted bytes (well, whether they are interpreted
at all is up to the assembler). The test is there to exercise a
feature, not to look nice in Emacs. Please tell me how I can perform
the same test with ASCII-only shell commands, and I will happily
convert.

> Really? Suppose I write the preprocessor line
>
> #if X == 1
>
> where X is some Japanese identifier, but I make the understandable
> mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an
> ASCII 1.

\uFF11 is not a letter in C++, so this is ill-formed and will be
rejected. The same holds for the Arabic-Indic digits. If you want to
write numbers in C++, use ASCII 0-9.

Regards,
Martin
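
P.S. For illustration, here is a minimal sketch of the escape-and-
prefix mangling discussed above, assuming the gxxint.texi convention
that each \uXXXX escape in an identifier is rewritten as _XXXX and
the encoded name is prefixed with U plus its length. The function
name mangle_universal and all details are illustrative, not g++'s
actual implementation.

#include <cstddef>
#include <iostream>
#include <string>

// Sketch of the mangling convention described in the message (an
// assumption of this example, not g++'s actual code): each "\uXXXX"
// escape becomes "_XXXX", and the encoded name is prefixed with 'U'
// plus its length, so a demangler knows the underscores are escapes
// rather than literal characters.
std::string mangle_universal(const std::string& id)
{
    std::string encoded;
    for (std::size_t i = 0; i < id.size(); ++i) {
        if (id[i] == '\\' && i + 1 < id.size()
            && (id[i + 1] == 'u' || id[i + 1] == 'U')) {
            std::size_t digits = (id[i + 1] == 'u') ? 4 : 8;
            encoded += '_';                       // escape marker
            encoded += id.substr(i + 2, digits);  // the hex digits
            i += 1 + digits;                      // skip "u"/"U" + digits
        } else {
            encoded += id[i];
        }
    }
    return "U" + std::to_string(encoded.size()) + encoded;
}

int main()
{
    // "Fo\u1234" mangles to "U7Fo_1234"; 7 is the length of "Fo_1234".
    std::cout << mangle_universal("Fo\\u1234") << '\n';
}

Running this on "Fo\u1234" prints U7Fo_1234, matching the example in
the message, and shows why the back end cannot do the job: the length
7 counts the already-encoded name, so the encoding must happen before
the mangled name is formed.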