* thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert @ 1998-12-09 13:44 UTC
To: Martin von Loewis, brolley; +Cc: gcc2, egcs

I took a look at martin's proposed patch for UTF-8 support in GCC, and have the following thoughts and suggestions.

* GCC must unify the identifiers \u00b5, \u00B5, \U000000b5, and \U000000B5; but GCC should not always unify these four identifiers to the identifier with character code B5, as this is incorrect in non-UTF-8 locales.  The latest EGCS and GCC2 code already contains support for non-UTF-8 locales, and this support is incompatible with the proposed patch.  To get started, perhaps the proposed patch could be modified to report an error if it encounters \u or \U in a non-UTF-8 locale, saying that this is not supported yet.

* GCC should represent non-ASCII identifiers using the locale's preferred multibyte encoding; e.g. it should use EUC-JIS if that's what the locale uses.  This is the best way to make GCC work well with other tools in that locale.  If the locale cannot represent a particular Unicode character, GCC should store it in a canonicalized escape form (e.g. the locale's encoding for \u with lowercase alpha digits if it fits in 16 bits, \U with lowercase alpha digits otherwise); this is along the lines of what draft C9x suggests.

  Proper support for \u in non-UTF-8 locales requires a locale-specific translation table from Unicode to the locale's encoding.  We'll also need a locale-specific table that specifies which characters are C letters and digits, but this can be derived from the other table automatically.

  One way to translate from Unicode to non-UTF-8 is to have GCC use the iconv function if available.  iconv will be supported by glibc 2.1; it's also been supported by Solaris 2.x for some time.
  GCC could supply its own substitute for iconv if that's needed by cross-compilers, but the native iconv is generally preferable.

* Given the above, I don't see the need for TREE_UNIVERSAL_CHAR.  The identifier should be stored using the locale's multibyte chars as suggested above (with canonical escapes if needed), and output as-is, just as identifiers are now.

* HAVE_GAS_UTF8 isn't needed, and to some extent doesn't fit with GCC's current philosophy that the user knows what he or she is doing.  People who use multibyte chars in identifiers will expect them to go through to the assembler; if the assembler doesn't support them, they'll understand the assembler's error message.  So GCC's behavior shouldn't depend on whether the assembler supports multibyte chars.

  There's precedent for this: GCC already doesn't care whether the assembler supports dollar signs in identifiers.  If the user writes a function named `a$b', and the assembler doesn't support that name, then the assembler will report the error.  That's preferable to having GCC second-guess the assembler.

  Also, the configure.in test for HAVE_GAS_UTF8 has UTF-8 in it.  This won't work with older shells that don't allow UTF-8.  It's simpler if we just remove HAVE_GAS_UTF8.

* I assume that cp/universal.c is supposed to support the constraints on identifiers required by ISO/IEC TR 10176?  If so, it should be commented that way.  The code needs to be fixed to have an is_universal_digit function, since letters and digits have distinct roles in identifiers.  You need to remove `,' before `}' in the code, for portability to older compilers.  The code currently dumps core if is_uni[h] == NULL.

* The universal-char code needs to be exported out to the main GCC level; it's not specific to C++.

* The C compiler and preprocessor also need to support \u and multibyte chars.  I'll take a look at doing this, taking inspiration from martin's proposed patch.
* GAS should be extended to support locales with encodings other than UTF-8; in particular, this means that GAS should support \u, if it doesn't already, as \u is needed for characters that can't be represented in the locale's multibyte encoding.

^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Martin von Loewis @ 1998-12-09 14:38 UTC
To: eggert; +Cc: brolley, gcc2, egcs

> * GCC must unify the identifiers \u00b5, \u00B5, \U000000b5, and
>   \U000000B5; but GCC should not always unify these four identifiers
>   to the identifier with character code B5, as this is incorrect in
>   non-UTF-8 locales.

No, it shouldn't.  Instead, it should convert B5 into Unicode, and use whatever it then gets (at least for identifiers; it also needs to check that B5 is a letter in the given locale).  One question, though: what exactly is a UTF-8 locale?  E.g. would the "C" locale qualify?

> * GCC should represent non-ASCII identifiers using the locale's
>   preferred multibyte encoding; e.g. it should use EUC-JIS if that's
>   what the locale uses.

I assume you are talking about error messages here?

> This is the best way to make GCC work well with other tools in that
> locale.  If the locale cannot represent a particular Unicode
> character, GCC should store it in a canonicalized escape form (e.g.
> the locale's encoding for \u with lowercase alpha digits if it fits
> in 16 bits, \U with lowercase alpha digits otherwise); this is along
> the lines of what draft C9x suggests.

Can you elaborate?  What is the locale's encoding for \u, if not "\u" literally?

> One way to translate from Unicode to non-UTF-8 is to have GCC use
> the iconv function if available.  iconv will be supported by glibc
> 2.1; it's also been supported by Solaris 2.x for some time.  GCC
> could supply its own substitute for iconv if that's needed by
> cross-compilers, but the native iconv is generally preferable.

I agree.
Still, I'd like to ask whether incorporation of gconv into egcs would be acceptable to the egcs maintainers; if so, I will work towards such an incorporation.  Having gconv available would simplify cross-compilation.

There might be a problem with the native iconv: what if it doesn't have conversion to Unicode, or what if we don't know what Unicode is called in a particular iconv implementation?  (We always know for glibc iconv, i.e. gconv.)

> * Given the above, I don't see the need for TREE_UNIVERSAL_CHAR.  The
>   identifier should be stored using the locale's multibyte chars as
>   suggested above (with canonical escapes if needed), and output
>   as-is, just as identifiers are now.

Well, it is needed for name mangling.  Name mangling (in object files) needs to be independent of the user's locale; otherwise you can't link libraries produced by somebody else.  Please note that jc1 already defines object files to use UTF-8.  It seems that jc1 integration is an objective for cc1plus, so we need to keep that fixed.

> * HAVE_GAS_UTF8 isn't needed and to some extent doesn't fit with
>   GCC's current philosophy that the user knows what he or she is
>   doing.

Please give other examples of that "current" philosophy :-)  g++ usually does not assume that users know what they are doing; instead, there is a big emphasis on binary compatibility and on protecting users from linking non-matching things.

> People who use multibyte chars in identifiers will expect them to go
> through to the assembler; if the assembler doesn't support them,
> they'll understand the assembler's error message.

This is certainly true for C, but it does not hold for C++.  In C++, you can produce output even if the assembler does not support non-ASCII in labels.  Again, this is what jc1 does, and it is the right thing.

> Also, the configure.in test for HAVE_GAS_UTF8 has UTF-8 in it.  This
> won't work with older shells that don't allow UTF-8.  It's simpler if
> we just remove HAVE_GAS_UTF8.
No shell supports UTF-8 as such.  What systems don't support 8-bit characters as arguments to echo?

> * I assume that cp/universal.c is supposed to support the constraints
>   on identifiers required by ISO/IEC TR 10176?

Yes.  This has to be rewritten to check a list of ranges, instead of trying to be smart.

> The code needs to be fixed to have an is_universal_digit function,
> since letters and digits have distinct roles in identifiers.

No.  C++ does not distinguish between non-ASCII digits and letters.

> * The universal-char code needs to be exported out to the main GCC
>   level; it's not specific to C++.

Yes.  I'd like to have a module that knows what characters the different standards accept (it would also have is_universal_digit if C9X requires that).  The compilers would then accept any characters that are accepted in any of the languages, except when being pedantic.

> * GAS should be extended to support locales with encodings other than
>   UTF-8; in particular, this means that GAS should support \u, if it
>   doesn't already, as \u is needed for characters that can't be
>   represented in the locale's multibyte encoding.

The current (CVS?) gas supports arbitrary 8-bit characters in strings and identifiers; the GAS maintainers think that encodings are not something they need to care about.  I agree (now).

Regards,
Martin
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Per Bothner @ 1998-12-09 14:56 UTC
To: Martin von Loewis; +Cc: gcc2, egcs

> Please note that jc1 already defines object files to use UTF-8.

I wouldn't put it that strongly.  The Java language allows non-ASCII Unicode characters in identifiers.  These have to be mangled in some standard way.  I would prefer to mangle Unicode characters using UTF-8, as that is the cleanest solution, but it may require use of gas and also requires an 8-bit-clean ld.  There are alternative mangling schemes which have the advantage of working with older assemblers; I don't have a good handle on how important that is.

--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Martin von Loewis @ 1998-12-09 22:57 UTC
To: bothner; +Cc: gcc2, egcs

> I would prefer to mangle Unicode characters using UTF-8, as that is
> the cleanest solution, but there are alternative mangling schemes
> which have the advantage of working with older assemblers.  I don't
> have a good handle on how important that is.

My mistake.  jc1 does not mandate UTF-8; what it does mandate is Unicode (in some form), right?

Paul is proposing that assembler files should be in the source character set; I think this is the wrong way.

Martin
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Per Bothner @ 1998-12-09 23:16 UTC
To: Martin von Loewis; +Cc: gcc2, egcs

> Paul is proposing that assembler files should be in the source
> character set; I think this is the wrong way.

Well, it seems clear that symbols in .o files have to be in a locale-independent encoding.  That to me seems to mandate UTF-8.  It is less clear what encoding we should use for assembler files, but given that the assembler translates to UTF-8, that the assembler is primarily used for compiler output files, and that assembly files are traditionally low-level and close to the .o files, that suggests to me that assembler files should also be in UTF-8 (at least for compiler-generated .s files).  Human-written .s files will probably be in the source locale, so we may need a pre-processor (possibly gasp) to convert to UTF-8.

--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert @ 1998-12-11 19:27 UTC
To: bothner; +Cc: martin, gcc2, egcs

   Date: Wed, 09 Dec 1998 23:15:43 -0800
   From: Per Bothner <bothner@cygnus.com>

   assembler files should also be in UTF-8 (at least for
   compiler-generated .s files).  Human-written .s files will probably
   be in the source locale, so we may need a pre-processor (possibly
   gasp) to convert to UTF-8.

This doesn't sound feasible to me.  It will be confusing to explain to people that assembly-language files do not all smell the same, and that you need to compile hand-generated files with one command and compiler-generated files with another.  Also, it doesn't match current GCC practice, which already puts EUC-JIS strings into assembler files quite happily.
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert @ 1998-12-09 17:46 UTC
To: martin; +Cc: brolley, gcc2, egcs

   Date: Wed, 9 Dec 1998 23:27:17 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > * GCC should represent non-ASCII identifiers using the locale's
   >   preferred multibyte encoding

   I assume you talk about error messages, here?

I'm talking about every place that GCC outputs identifiers.  This includes error messages, assembler output, and other auxiliary text output (e.g. -dM output).

   > If the locale cannot represent a particular Unicode character,
   > GCC should store it in a canonicalized escape form (e.g. the
   > locale's encoding for \u with lowercase alpha digits if it fits
   > in 16 bits, \U with lowercase alpha digits otherwise); this is
   > along the lines of what draft C9x suggests.

   Can you elaborate?  What is the locale's encoding for \u, if not
   "\u" literally?

If it were up to me, it would always be \u or \U as specified above.  We have to be a bit careful, though, as we can't just splice e.g. "\u1234" into a multibyte sequence; we may need to surround the "\u1234" with bytes that bring us to the initial shift state and back again.  The platform may have its own idea of the canonicalized escape form.  ASM_OUTPUT_LABELREF could arrange to convert to the platform's format.

   what exactly is a UTF-8 locale?

A locale that uses UTF-8 encoding for multibyte characters, and (obviously) that uses the Unicode character set.

   E.g. would the "C" locale qualify?

No; typically "C" uses plain ASCII with no multibyte chars, though it can and sometimes does use something else (e.g. ASCII subsets).

   > One way to translate from Unicode to non-UTF-8 is to have GCC use
   > the iconv function if available.
   There might be a problem with the native iconv: what if it doesn't
   have conversion to Unicode, or what if we don't know what Unicode
   is called in a particular iconv implementation?  (We always know
   for glibc iconv, i.e. gconv.)

Yes, but that's just one thing to configure.  And the default value, "UTF-8", will probably work on most hosts.

There's also the problem of getting the name of the encoding for the current locale.  The XPG4 way of doing this is `nl_langinfo (CODESET)'.  This is supported by glibc 2.1 and by recent Solaris versions.  Systems that don't support this will have to be supported ad hoc (if at all).

   I'd like to ask whether incorporation of gconv into egcs would be
   acceptable to the egcs maintainers; if this is the case, I will
   work towards such an incorporation.  Having gconv available would
   simplify cross-compilation.

Since egcs already maintains its own idea of S-JIS etc., I suspect that there'd be no objection to its also maintaining its own idea of what the translation tables should be, for hosts that don't already have them.

   > * Given the above, I don't see the need for TREE_UNIVERSAL_CHAR.
   >   The identifier should be stored using the locale's multibyte
   >   chars as suggested above (with canonical escapes if needed),
   >   and output as-is, just as identifiers are now.

   Well, it is needed for name mangling.  Name mangling (in object
   files) needs to be independent of the user's locale; otherwise you
   can't link libraries produced by somebody else.

I see the need for mangling, but I don't see why TREE_UNIVERSAL_CHAR is needed.  When outputting a name, you don't need a separate flag specifying whether the identifier contains \u; you can just inspect the identifier string directly.  This would be ASM_OUTPUT_LABELREF's job.

Also, I assume that once the patch is generalized to non-UTF-8 locales, it won't be just the \u and \U escapes that require mangling.
If the goal is to link libraries that were built in other locales, then we'll also need to mangle the non-UTF-8 multibyte chars.  Again, this doesn't sound like a high-level concept that needs to be in the parse tree -- it's just a low-level thing that can be done on output.

Perhaps it's just an efficiency thing?  If so, then this should be made a bit clearer, and the flag name changed to TREE_NAME_NEEDS_ASCIIFYING or something like that, with the other identifier names changed accordingly.

   jc1 already defines object files to use UTF-8.  It seems that jc1
   integration is an objective for cc1plus, so we need to keep that
   fixed.

If the compilation locale uses, say, Shift-JIS, then the assembly language text file should use Shift-JIS, as this is what the programmer's tools will expect.  If the object-code standard is to use UTF-8 names, then I suppose the assembler can convert to UTF-8.  (Object code isn't text, so the usual rules about text locales don't apply to it.)  But this would mean that the assembler would have to understand character conversion, which is an unwanted complication.

However, if jc1 is meant only to be used in UTF-8 locales (which seems likely), then we needn't worry about this.  We just tell people that they have to use a UTF-8 locale if they want to use jc1 with non-"C" names, because jc1 object files must use UTF-8.  This would be an understandable restriction, and it means we could avoid having to do the translations in either the compiler or the assembler.

   > People who use multibyte chars in identifiers will expect them to
   > go through to the assembler; if the assembler doesn't support
   > them, they'll understand the assembler's error message.

   This is certainly true for C, but it does not hold for C++.  In
   C++, you can produce output even if the assembler does not support
   non-ASCII in labels.

Sorry, I don't understand this point.
If you're saying that C++ mangles non-ASCII identifiers into ASCII labels, but C doesn't, then I don't see why that should be: there's no reason in principle that C couldn't or shouldn't use the same sort of mangling.  If the assembler requires some form of mangling from non-ASCII identifiers into ASCII labels, then shouldn't this be ASM_OUTPUT_LABELREF's job, or something like that?  I don't see why the issue is specific to C++; it sounds like it's general to all languages with non-ASCII identifiers.

   > Also, the configure.in test for HAVE_GAS_UTF8 has UTF-8 in it.
   > This won't work with older shells that don't allow UTF-8.  It's
   > simpler if we just remove HAVE_GAS_UTF8.

   What systems don't support 8-bit characters as arguments to echo?

I've run into shells that use the top bit for their own purposes.  And, even if such shells are discounted, it's a bit odd to use UTF-8 in configure.in without labeling the file.  My Emacs (20.3) misidentified the file as being ISO Latin 1.  It'd be better if you rewrote the configure.in test in ASCII, so we didn't have to worry about gotchas like this.  It should be fairly easy to do this with tr.

   C++ does not distinguish between non-ASCII digits and letters.

Really?  Suppose I write the preprocessor line

   #if X == 1

where X is some Japanese identifier, but I make the understandable mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an ASCII 1.  What you're saying is that the preprocessor is obliged to treat this line as if it were

   #if X == 0

because undeclared preprocessor identifiers default to zero?  This seems to me to be asking for trouble; it's a common mistake in Japanese text.  If C++ requires this, then I suggest that C++ by default should warn about identifiers beginning with digits.  (It also means yet another difference between the C and C++ preprocessors, sigh.)
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Tim Hollebeek @ 1998-12-09 18:01 UTC
To: Paul Eggert; +Cc: martin, brolley, gcc2, egcs

Paul Eggert writes ...
>
> Really?  Suppose I write the preprocessor line
>
>    #if X == 1
>
> where X is some Japanese identifier, but I make the understandable
> mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an
> ASCII 1.  What you're saying is that the preprocessor is obliged to
> treat this line as if it were
>
>    #if X == 0
>
> because undeclared preprocessor identifiers default to zero?

IMO, the problem here has nothing to do with Japanese, and everything to do with the fact that this rule is error prone in general.  I've been meaning to implement -Wpreprocessor-undeclared for a while now.

In fact, there is an instance of a typo in the SGI standard header files which has never been caught by any ANSI C compiler because of this "undeclared macro == 0" rule, despite the fact that the rule only exists to support source files which predate the existence of #ifdef.

gcc really should complain "undeclared identifier 'x' in preprocessor conditional expression has value 0" when -Wall is specified.  Only BASIC and perl programmers, and Fortran programmers who don't use "implicit none", should have to worry about creating new variables every time they make a typo.

---------------------------------------------------------------------------
Tim Hollebeek                           | "Everything above is a true
email: tim@wfn-shop.princeton.edu       | statement, for sufficiently
URL: http://wfn-shop.princeton.edu/~tim | false values of true."
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Craig Burley @ 1998-12-10 5:58 UTC
To: tim; +Cc: burley

> gcc really should complain "undeclared identifier 'x' in preprocessor
> conditional expression has value 0" when -Wall is specified.  Only
> BASIC and perl programmers, and Fortran programmers who don't use
> "implicit none", should have to worry about creating new variables
> every time they make a typo.

Long a pet peeve of mine as well, though I wonder if fixing this might cause gcc to warn about too many constructs like

   #if defined (FOO) && (FOO == 1)

and:

   #ifdef FOO
   #if FOO == 1

The warning could be made smart enough to avoid most, maybe all, such spurious warnings.  If not all, the documentation should be pretty clear about what to do, and what to not bother doing, to eliminate the warnings.

tq vm, (burley)
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Tim Hollebeek @ 1998-12-10 10:21 UTC
To: Craig Burley; +Cc: eggert, martin, brolley, gcc2, egcs, burley

Craig Burley writes ...
>
> Long a pet peeve of mine as well, though I wonder if fixing this
> might cause gcc to warn about too many constructs like
>
>    #if defined (FOO) && (FOO == 1)
>
> and:
>
>    #ifdef FOO
>    #if FOO == 1

Good points.  The second isn't a problem, though, since if FOO isn't defined, we're skipping when we see #if, and don't need to parse the expression.  In fact, if I remember correctly, ANSI forbids parsing of the expression (other than recognizing pp-tokens).

The first case is more important, and one I hadn't thought of.  However, the short-circuiting boolean operators are the only ones that may not use one of their operands, so it seems consistent with C to not evaluate (and hence not warn about) arguments that are short-circuited.  I believe this isn't ad hoc, and avoids all spurious warnings in a consistent manner.

---------------------------------------------------------------------------
Tim Hollebeek                           | "Everything above is a true
email: tim@wfn-shop.princeton.edu       | statement, for sufficiently
URL: http://wfn-shop.princeton.edu/~tim | false values of true."
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Craig Burley @ 1998-12-10 11:50 UTC
To: tim; +Cc: burley

>> #if defined (FOO) && (FOO == 1)
>>
>> and:
>>
>> #ifdef FOO
>> #if FOO == 1

> Good points.  The second isn't a problem, though, since if FOO isn't
> defined, we're skipping when we see #if, and don't need to parse the
> expression.  In fact, if I remember correctly, ANSI forbids parsing
> of the expression (other than recognizing pp-tokens).

That's good to hear, and I had hoped it would be the case, but thought I should mention it anyway.

> The first case is more important, and one I hadn't thought of.
> However, the boolean operators that short circuit are the only ones
> that don't use one operand, so it seems consistent with C to not
> evaluate (and hence not warn about) arguments that are short
> circuited.  I believe this isn't ad hoc, and avoids all spurious
> warnings in a consistent manner.

It sounds like you're saying the only binary operators are && and ||, which doesn't sound quite right.  && and || are the logical AND and OR operators, according to my 1998-12-07 draft copy of the ANSI C standard, and I believe the &, |, and other bitwise operators, plus the +, -, * and / integer operators, are supported for preprocessor directives as well.  But I'm not sure this changes what you're saying.

What might, though, is the implementation.  It might insist on expanding macros beyond a short-circuit, and, if it does, you can't just change it so that, when it replaces an undefined macro with 0, it optionally warns.

tq vm, (burley)
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Chip Salzenberg @ 1998-12-10 14:23 UTC
To: Craig Burley; +Cc: tim, eggert, martin, brolley, gcc2, egcs

<advocacy subject="perl">
Tim writes:
> Only BASIC and perl programmers, and Fortran programmers who don't
> use "implicit none", should have to worry about creating new
> variables every time they make a typo.

That should be: "Perl programmers who don't 'use strict'".  Thanks.
</advocacy>
--
Chip Salzenberg - a.k.a. - <chip@perlsupport.com>
"When do you work?"  "Whenever I'm not busy."
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Per Bothner @ 1998-12-09 23:03 UTC
To: Paul Eggert; +Cc: gcc2, egcs

> However, if jc1 is meant only to be used in UTF-8 locales (which
> seems likely), then we needn't worry about this.  We just tell people
> that they have to use a UTF-8 locale if they want to use jc1 with
> non-"C" names, because jc1 object files must use UTF-8.  This would
> be an understandable restriction, and it means we could avoid having
> to do the translations in either the compiler or the assembler.

I'm not sure about jc1, but gcj (the preferred user-level driver) is not meant to be used only in UTF-8 locales.  Java only uses Unicode *internally*, but we need to be able to read non-Unicode / non-UTF-8 *files*.  And Java defines a mechanism where you can specify an encoding to use when translating external byte streams to/from internal Unicode streams.  However, Java does not define the external encoding of Java program files, only that after processing \u and \U escapes the input to the lexer is a stream of Unicode characters.

This is a somewhat hypothetical problem, as we have no experience with to what extent, if any, people need to be able to use non-ASCII characters in their source files.  But I assume they will want to do that in their locale's text encoding - which need not be a "UTF-8" locale.  In that case, jc1 (or a pre-processor for jc1) has to translate the locale's character set into Unicode.

It is reasonable that the default locale for source files (i.e. the one assumed if you don't override things) should be UTF-8.  The locale for assembler files should probably also be UTF-8.
I see no reason to support anything else.  What we might do is have gasp (the gas pre-processor) provide a hook for converting from other character sets.  But gas itself should just assume UTF-8 - and generate ld symbols that are also UTF-8.  (A simple implementation is for gas to just recognize that bytes that have the high-order bit set should be treated as (part of) letters.)  Similarly, gdb and ld should assume that labels are UTF-8.

--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Ian Lance Taylor @ 1998-12-10 7:49 UTC
To: bothner; +Cc: eggert, gcc2, egcs

   Date: Wed, 09 Dec 1998 23:02:40 -0800
   From: Per Bothner <bothner@cygnus.com>

   The locale for assembler files should probably also be UTF-8.  I
   see no reason to support anything else.  What we might do is have
   gasp (the gas pre-processor) provide a hook for converting from
   other character sets.

As far as I'm concerned, gasp is dead.  It served a purpose for a time, which was to provide a richer set of assembly language operations.  However, gas itself now has all the interesting features which were once found only in gasp (basically, macros).  I don't want to see any plan that relies on using gasp.

   But gas itself should just assume UTF-8 - and generate ld symbols
   that are also UTF-8.  (A simple implementation is for gas to just
   recognize that bytes that have the high-order bit set should be
   treated as (part of) letters.)

This change has already been made in the development sources, and will be in the next release.

Ian
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-09 23:03 ` Per Bothner 1998-12-10 7:49 ` Ian Lance Taylor @ 1998-12-11 19:23 ` Paul Eggert 1998-12-12 2:21 ` Martin von Loewis 1 sibling, 1 reply; 81+ messages in thread From: Paul Eggert @ 1998-12-11 19:23 UTC (permalink / raw) To: bothner; +Cc: gcc2, egcs Date: Wed, 09 Dec 1998 23:02:40 -0800 From: Per Bothner <bothner@cygnus.com> This is a somewhat hypothetical problem, as we have no experience with the extent, if any, to which people need to be able to use non-ASCII characters in their source files. I have some experience; we sometimes use gcc that way here. But I assume they will want to do that in their locale's text encoding - which need not be a "UTF-8" locale. Yes; this is already widespread practice for C strings. In that case, jc1 (or a pre-processor for jc1) has to translate the locale's character set into Unicode. It could also be done by a postprocessor for jc1. The locale for assembler files should probably also be UTF-8. This disagrees with existing practice with C strings. I don't think it's wise to commit now to UTF-8 for all assembler files. Among other things, it'd mean you couldn't look at the files with Emacs (as the latest Emacs doesn't support UTF-8). I have misgivings about having GCC support multiple locales simultaneously. Multilingual applications are the province of fancy text editors like Emacs; simple translators like GCC shouldn't have to worry about handling multiple locales in the same program execution. I've dealt with programs like that, and they are a pain to configure and maintain. For GCC it's cleaner to add a separate pass to translate the assembler input, if this is needed. To some extent this is an ``after you, Alphonse'' situation. The gas people don't want to worry about translating codes, and I don't blame them. I don't want cpp to worry about it either. Or cc1. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-11 19:23 ` Paul Eggert @ 1998-12-12 2:21 ` Martin von Loewis 1998-12-13 6:23 ` Richard Stallman 1998-12-15 22:00 ` Paul Eggert 0 siblings, 2 replies; 81+ messages in thread From: Martin von Loewis @ 1998-12-12 2:21 UTC (permalink / raw) To: eggert; +Cc: bothner, gcc2, egcs > I have misgivings about having GCC support multiple locales > simultaneously. So how about this: gcc/g++ process strictly-conforming input that is already in the base character set (plus \u escapes), in the way that the standards mandate. Object files are then UTF-8 (or U escapes for C++). gcc/g++ also process input based on the current locale, and pass the input unmodified to the output. There is no interworking between the two (i.e., characters in the current locale are not at all related to \u escapes). This means that the compiler, in locale-aware mode, would not be strictly conforming, but so what? People could ask their editors to save files in C/C++ style encoding if they want portable source files, or use filters. If this sounds like a reasonable strategy, we only need to worry about how to combine the two algorithms, i.e. how we arrange processing of identifiers and strings both using the C wchar functions, and recognizing \u. What do you think? Martin ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-12 2:21 ` Martin von Loewis @ 1998-12-13 6:23 ` Richard Stallman 1998-12-13 12:27 ` Martin von Loewis 1998-12-15 22:00 ` Paul Eggert 1 sibling, 1 reply; 81+ messages in thread From: Richard Stallman @ 1998-12-13 6:23 UTC (permalink / raw) To: martin; +Cc: eggert, bothner, gcc2, egcs To make GCC depend on the current locale for correct compilation of a program is error-prone. If it is necessary for the handling of C code to depend on the locale, we should have a way to specify the locale in the source file itself--perhaps with a #-line. But it would be much better for GCC to be independent of the locale, as regards the behavior of the .o file at link time and at run time. It is no great loss if debugging symbol tables depend on the locale, but all other aspects of the generated .o file should be as close to locale-independent as we can make them, to reduce the possibility for things to go wrong. Wouldn't it work for GCC to treat all byte values above 127 as part of an identifier, and not worry about how they group into multibyte characters? Perhaps the current locale would have something to do with how they look when printed in an error message, but no more than that. I have not been following the discussion until now--no time to study all those messages carefully--so please forgive me if someone has already explained a reason this cannot work. But if it merely has some possible inconvenience for the user, the advantage of being locale-independent could easily outweigh that. It is a very large advantage. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-13 6:23 ` Richard Stallman @ 1998-12-13 12:27 ` Martin von Loewis 1998-12-14 2:22 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: Martin von Loewis @ 1998-12-13 12:27 UTC (permalink / raw) To: rms; +Cc: gcc2, egcs > Wouldn't it work for GCC to treat all byte values above 127 as part of > an identifier, and not worry about how they group into multibyte > characters? That would work fine. It might or might not be what the user expects. The only real drawback is standards compliance. C++, Java, and C9X all allow expressing Unicode in identifiers using \u escapes, like void h\u00D6llo(); The standards go on to say that the actual source input might be in a different character set, and the implementation defines how that relates to Unicode. A sensible implementation would use translation mechanisms. Of course, the easiest thing would be to assume that we always get non-ASCII in identifiers as Unicode escapes. The editor (e.g. Emacs) would then need to convert the internal encoding to Unicode escapes when saving. That would make the feature in the language truly useful. Regards, Martin ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-13 12:27 ` Martin von Loewis @ 1998-12-14 2:22 ` Richard Stallman 1998-12-15 10:47 ` Paul Eggert 0 siblings, 1 reply; 81+ messages in thread From: Richard Stallman @ 1998-12-14 2:22 UTC (permalink / raw) To: martin; +Cc: gcc2, egcs > Wouldn't it work for GCC to treat all byte values above 127 as part of > an identifier, and not worry about how they group into multibyte > characters? That would work fine. It might or might not be what the user expects. Passing along the multibyte sequence unchanged is the most natural thing to do; it is what cat does, for example. Why would any user be surprised by it? The only real drawback is standards compliance. C++, Java, and C9X all allow expressing Unicode in identifiers using \u escapes, like void h\u00D6llo(); You are right that the handling of \u would have to depend on the multibyte representation, and therefore on the locale. That would be unfortunate, but at least it would happen only when \u is used. It remains desirable to make GCC handle multibyte input in a way that is independent of the locale--is there any *specific* problem with that? Of course, the easiest thing would be to assume that we always get non-ASCII in identifiers as Unicode escapes. No, just the opposite. If the non-ASCII characters are represented in multibyte, then GCC can handle them properly in a locale-independent way. But \u cannot be handled in a locale-independent way. Therefore, Emacs should save these characters in multibyte representation (which, as it happens, is the more general feature, and what we are going to work on anyway). ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-14 2:22 ` Richard Stallman @ 1998-12-15 10:47 ` Paul Eggert 1998-12-17 18:10 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: Paul Eggert @ 1998-12-15 10:47 UTC (permalink / raw) To: rms; +Cc: martin, gcc2, egcs Date: Mon, 14 Dec 1998 03:22:51 -0700 (MST) From: Richard Stallman <rms@gnu.org> It remains desirable to make GCC handle multibyte input in a way that is independent of the locale--is there any *specific* problem with that? Yes. Some widely used multibyte encodings use ordinary ASCII bytes to encode multibyte characters. Examples include Shift-JIS, BIG5, and 7-bit ISO-2022. The ASCII bytes include printable bytes like "\", so this is an issue for both strings and identifiers. These encodings all use first bytes that cannot be confused with any of the single-byte chars in the basic C character set, so they can be supported by a C compiler. Passing along the multibyte sequence unchanged is the most natural thing to do Yes, I tend to think this is the right thing to do in both identifiers and strings. Otherwise, the assembly language output might not be a text file, as it might use different encodings in different regions with no easy way to distinguish between the regions. that the handling of \u would have to depend on the multibyte representation, and therefore on the locale. That would be unfortunate, but at least it would happen only when \u is used. If GCC is to support encodings like Shift-JIS, it also needs to have a locale-dependent way to determine the number of bytes in a multibyte character. It should copy these bytes straight through; but without a way of being able to count them, it can't copy them. ^ permalink raw reply [flat|nested] 81+ messages in thread
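Paul's objection can be made concrete with one byte pair. This is an illustrative sketch (function names invented, not GCC code): in Shift-JIS the katakana letter SO is encoded as 0x83 0x5C, and 0x5C is the ASCII backslash, so a scanner that looks only at individual bytes sees a spurious '\' inside the character. The lead-byte ranges below are the standard Shift-JIS ones.

```c
#include <assert.h>

/* First byte of a two-byte Shift-JIS character. */
int sjis_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Count backslash bytes: naively byte-by-byte, or Shift-JIS-aware.
   The aware scan skips the trail byte of each two-byte character,
   which is exactly the locale-dependent byte counting Paul says
   GCC would need. */
int count_backslashes(const unsigned char *s, int sjis_aware)
{
    int n = 0;
    while (*s) {
        if (sjis_aware && sjis_lead_byte(*s) && s[1]) {
            s += 2;             /* both bytes form one character */
            continue;
        }
        if (*s == '\\')
            n++;
        s++;
    }
    return n;
}
```

The naive scan reports a backslash in the two-byte katakana SO; the aware scan does not. That difference is why a locale-independent pass cannot safely copy Shift-JIS text.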
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-15 10:47 ` Paul Eggert @ 1998-12-17 18:10 ` Richard Stallman 1998-12-17 21:41 ` Paul Eggert 1998-12-17 23:55 ` Joern Rennecke 0 siblings, 2 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-17 18:10 UTC (permalink / raw) To: eggert; +Cc: martin, gcc2, egcs Yes. Some widely used multibyte encodings use ordinary ASCII bytes to encode multibyte characters. Examples include Shift-JIS, BIG5, and 7-bit ISO-2022. This makes the situation more difficult. I think the key to how to cope with it is to recognize that currently GCC does not handle these encodings at all. But it does handle (in strings and comments) multibyte encodings that use only non-ASCII bytes, and it does so reliably, regardless of the environment. I'll call these "clean multibyte encodings". The default mode in GCC should continue to handle clean multibyte encodings reliably, meaning that it should not try to understand the encodings, just treat sequences of non-ASCII bytes in the usual way. Perhaps it will handle them thus in identifiers, as well as in strings and comments. However, it would be ok to add an option which tells GCC to decode all multibyte encodings encountered, according to the specified locale. That mode would handle the unclean multibyte encodings. In addition to that, handling of \u in a non-wide string has to depend on the encoding. So there may be two things for which GCC needs to know the encoding, and there is certainly at least one. It is unreliable to get the encoding from the environment. So we should provide other ways to specify the encoding, and encourage people to use them. One way is with an option, --locale=LOCALE Another way is with a directive such as #locale LOCALE in the source code. I think GCC should issue a warning if the source actually depends on the choice of locale, and the locale has been obtained from the environment. 
The warning should encourage use of --locale or #locale to specify the locale. I think \u will need to be translated, though, if possible -- unless the assembler handles \u, which is not true for gas at least. Once we decide that either GCC or the assembler should translate \u into a locale-specific multibyte encoding, it may as well be done in GCC. GCC is used with many different assemblers. * We'll have to disable the checking for identifier spellings in multibyte chars, since we won't know which multibyte chars are letters and/or digits. Why would we want to check? Why NOT simply define all non-ASCII characters as being allowed in identifiers? No non-ASCII characters have any other meaning in C. * In general, assembly language files will not be text files. When GCC wants to put certain non-ASCII bytes into a string or identifier, that doesn't necessarily mean just outputting those bytes into the .s file. Non-ASCII bytes in strings can be output to the .s file using .byte, so that the bytes themselves don't appear in the file. This is a reliable way to produce the same sequence of bytes in core when the program runs. Non-ASCII bytes in identifiers need to be encoded in some way, since assemblers won't allow them in identifiers. I suggest using `.' followed by the hex code of the byte. That is allowed by most assemblers, and provides a unique representation. Of course, there should be a way to specify a different handling for any given system, in case the native compiler does something different. ^ permalink raw reply [flat|nested] 81+ messages in thread
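RMS's suggested assembler-name encoding (`.' followed by the hex code of each non-ASCII byte) can be sketched as follows. The function name and buffer convention are invented for this example; it is not taken from GCC.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the encoding RMS suggests: ASCII identifier bytes pass
   through to the .s file unchanged, and each non-ASCII byte becomes
   `.' followed by its two-digit hex code. */
void asm_encode_name(const char *id, char *out, size_t outsz)
{
    size_t pos = 0;
    for (; *id && pos + 4 < outsz; id++) {
        unsigned char c = (unsigned char)*id;
        if (c < 0x80)
            out[pos++] = (char)c;
        else
            pos += (size_t)sprintf(out + pos, ".%02x", c);
    }
    out[pos] = '\0';
}
```

The result is plain ASCII, so the .s file stays a text file regardless of the source encoding; the cost, as discussed below in the thread, is that debuggers must demangle the names.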
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-17 18:10 ` Richard Stallman @ 1998-12-17 21:41 ` Paul Eggert 1998-12-18 1:23 ` Martin von Loewis 1998-12-17 23:55 ` Joern Rennecke 1 sibling, 1 reply; 81+ messages in thread From: Paul Eggert @ 1998-12-17 21:41 UTC (permalink / raw) To: rms; +Cc: martin, gcc2, egcs Date: Thu, 17 Dec 1998 19:10:24 -0700 (MST) From: Richard Stallman <rms@gnu.org> One way is with an option, --locale=LOCALE Another way is with a directive such as `#locale LOCALE' in the source code. Like your other suggestions, this one sounds reasonable to me. It might help to use more specific names, e.g. `--locale-ctype=LOCALE' and `#locale LC_CTYPE LOCALE'. Locales are a more general notion than just their LC_CTYPE component, and we may find a use for specifying other (non-LC_CTYPE) parts of the locale later. Why NOT simply define all non-ASCII characters as being allowed in identifiers? No non-ASCII characters have any other meaning in C. OK, you can talk me into not checking, at least by default. It might be useful to have an optional check, for people who want to port their code to more restrictive compilers that check the restrictions of draft C9x's Annex I. This optional check would be locale-dependent. Non-ASCII bytes in identifiers need to be encoded in some way, since assemblers won't allow them in identifiers. I suggest using `.' followed by the hex code of the byte. It may make sense to use this notation (or something like it) even for assemblers like GAS that allow multibyte chars in identifiers. This would mean that assembly language files would be text, which is good. However, a drawback of this approach is that the names will be mangled. This will require that debuggers demangle the names. For C++ this is not such a big deal, as the names are already mangled, but for C it is a bit inconvenient. Here is another possibility. 
For identifier chars that can be expressed as multibyte chars in the locale's encoding, use those chars; otherwise, use `.uxxxx' or `.Uxxxxxxxx' where xxxx (or xxxxxxxx) are the Unicode position. E.g. if the original identifier was a\u1234b\U12345678c, and if \u1234 and \U12345678 cannot be represented as multibyte chars, then represent this identifier as a.u1234b.U12345678c in the assembler file. An advantage of this approach is that (for C, at least), it's upward compatible with martin's proposal. A locale that uses a UTF-8 charset and encoding will simply use UTF-8 identifiers, which is what he wants. This will be a natural way to do things in the UTF-8 world -- the assembler file will be much easier to read as UTF-8 text than it would be with .uxxxx or .xx.xx.xx escapes. (Similarly for other encodings like EUC-JIS.) I don't know how this would affect C++ mangling, though. handling of \u in a non-wide string has to depend on the encoding. A minor point: this is also true for \u in wide strings, as not all systems use Unicode for wide chars. ^ permalink raw reply [flat|nested] 81+ messages in thread
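The fallback notation in Paul's proposal can be sketched as below. The function name is invented, and the hard part of the proposal - deciding which characters the locale's multibyte encoding can represent - is deliberately left out; this only shows the escape form used for the unrepresentable ones.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the `.uxxxx' / `.Uxxxxxxxx' fallback: a universal
   character the locale cannot represent is appended to the
   assembler name in lowercase hex, using the short form when the
   code fits in 16 bits. */
void append_universal_char(unsigned long code, char *out)
{
    if (code <= 0xFFFFUL)
        sprintf(out, ".u%04lx", code);
    else
        sprintf(out, ".U%08lx", code);
}
```

So a\u1234b\U12345678c, with both escapes unrepresentable, comes out as a.u1234b.U12345678c, matching the example in the message above.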
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-17 21:41 ` Paul Eggert @ 1998-12-18 1:23 ` Martin von Loewis 0 siblings, 0 replies; 81+ messages in thread From: Martin von Loewis @ 1998-12-18 1:23 UTC (permalink / raw) To: eggert; +Cc: rms, gcc2, egcs > Here is another possibility. For identifier chars that can be > expressed as multibyte chars in the locale's encoding, use those > chars; otherwise, use `.uxxxx' or `.Uxxxxxxxx' where xxxx (or > xxxxxxxx) are the Unicode position. [...] > I don't know how this would affect C++ mangling, though. This won't work for C++. Consider class Foo{ static int u1234; }; This currently compiles into _3Foo.u1234. With your proposal, _3Foo.u1234.u1234 could either be Foo\u1234::u1234, or Foo::u1234\u1234. If people don't like converting Unicode identifiers to UTF-8 always, I drop that proposal with regrets. It would work on assemblers that support 8bit in identifiers, it would work for C and C++, and it would work independently of compile-time or runtime settings (identifiers are *not* affected by the user's locale whatsoever). Anyway, I drop that proposal. There is a proposed mangling for \u escapes in C++ in gxxint.texi. It works for all cases and for all assemblers, giving plain text in identifiers. It doesn't work for C, but after this discussion, I guess I don't care about that anymore. Somebody just tell me how it should work for C. Kind regrets, Martin ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-17 18:10 ` Richard Stallman 1998-12-17 21:41 ` Paul Eggert @ 1998-12-17 23:55 ` Joern Rennecke 1998-12-19 5:13 ` Richard Stallman 1 sibling, 1 reply; 81+ messages in thread From: Joern Rennecke @ 1998-12-17 23:55 UTC (permalink / raw) To: rms; +Cc: eggert, martin, gcc2, egcs > #locale LOCALE I think it should rather be #pragma locale LOCALE ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-17 23:55 ` Joern Rennecke @ 1998-12-19 5:13 ` Richard Stallman 1998-12-19 10:36 ` Paul Eggert 0 siblings, 1 reply; 81+ messages in thread From: Richard Stallman @ 1998-12-19 5:13 UTC (permalink / raw) To: amylaar; +Cc: eggert, martin, gcc2, egcs > #locale LOCALE I think it should rather be #pragma locale LOCALE No, definitely not. It is not a good idea to use #pragma for anything that affects the meaning of the program. We should invent a new command for this, so that it could later be adopted as a standard. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-19 5:13 ` Richard Stallman @ 1998-12-19 10:36 ` Paul Eggert 1998-12-20 20:29 ` Richard Stallman ` (2 more replies) 0 siblings, 3 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-19 10:36 UTC (permalink / raw) To: rms; +Cc: amylaar, martin, gcc2, egcs I thought of some drawbacks to #pragma LC_CTYPE "ja_JP.PCK" (or to #locale LC_CTYPE "ja_JP.PCK", for that matter). * If the program text is converted from one encoding to another, its #pragma will become incorrect. This will make it a hassle to convert program text automatically (e.g. from Shift-JIS to UTF-8). * Locale names aren't very portable. E.g. Solaris uses "ja" for EUC-JIS whereas Unixware uses "ja_JP.EUC". Perhaps it would be better for GCC to autodetect the character set and encoding, much as Emacs already does. GCC could even reuse the Emacs code. There would have to be a way to override the default (e.g. a command-line option), but autodetection might be good enough in practice so that overriding would be rarely needed. Date: Sat, 19 Dec 1998 06:12:48 -0700 (MST) From: Richard Stallman <rms@gnu.org> It is not a good idea to use #pragma for anything that affects the meaning of the program. But all the pragmas required by draft C9x affect the meaning of the program. I think draft C9x's intent is that #pragma not affect the meaning of the program ``much''. For reference, here are the draft C9x pragmas and what they do. #pragma STDC FP_CONTRACT ON allows a floating-point expression to be ``contracted'', i.e. evaluated as though it were an atomic operation, thereby omitting rounding errors. (E.g. PowerPC multiply-add.) #pragma STDC FENV_ACCESS ON lets floating-point code test flags or run under non-default modes. #pragma STDC CX_LIMITED_RANGE ON lets the implementation evaluate complex multiply, divide, and absolute value efficiently without worrying about incorrect behavior due to undue overflow and underflow. 
The default state of these pragmas is implementation-defined. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-19 10:36 ` Paul Eggert @ 1998-12-20 20:29 ` Richard Stallman 1998-12-21 1:52 ` Andreas Schwab 1998-12-20 20:29 ` Richard Stallman 1998-12-21 12:25 ` Samuel Figueroa 2 siblings, 1 reply; 81+ messages in thread From: Richard Stallman @ 1998-12-20 20:29 UTC (permalink / raw) To: eggert; +Cc: amylaar, martin, gcc2, egcs For reference, here are the draft C9x pragmas and what they do. These arithmetic parameters should not be done with #pragma, nor with any kind of #-command. That is because macro expansions cannot produce a #-command. I told the committee about this problem ten years ago, but it seems that the temptation of the #pragma idea is too strong for mere logic to overcome. We should design a cleaner syntax for these parameters, one that can be produced by macro expansion, and we should deprecate the use of pragmas for this purpose. Actually I did design one ten years ago or so. Maybe the committee still has records of what it was. If some of us are still on the committee, could those people please forward the suggestion to the committee, before it is too late? (I was a member but left when they insisted on my paying to be a member.) Locale specification is a different kind of operation. It does not affect the meaning of expressions; instead it says how to read lines of source code. That is why a #-command is ok for locale specification, even tho it is a bad interface for these floating point parameters. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-20 20:29 ` Richard Stallman @ 1998-12-21 1:52 ` Andreas Schwab 1998-12-22 1:09 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: Andreas Schwab @ 1998-12-21 1:52 UTC (permalink / raw) To: rms; +Cc: eggert, amylaar, martin, gcc2, egcs Richard Stallman <rms@gnu.org> writes: |> For reference, here are the draft C9x pragmas and what they do. |> |> These arithmetic parameters should not be done with #pragma, not with |> any kind of #-command. That is because macro expansions cannot |> produce a #-command. That's why C9x has the Pragma operator. 6.10.9 Pragma operator Semantics [#1] A unary operator expression of the form: _Pragma ( string-literal ) is processed as follows: The string literal is destringized by deleting the L prefix, if present, deleting the leading and trailing double-quotes, replacing each escape sequence \" by a double-quote, and replacing each escape sequence \\ by a single backslash. The resulting sequence of characters is processed through translation phase 3 to produce preprocessing tokens that are executed as if they were the pp-tokens in a pragma directive. The original four preprocessing tokens in the unary operator expression are removed. -- Andreas Schwab "And now for something schwab@issan.cs.uni-dortmund.de completely different" schwab@gnu.org ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 1:52 ` Andreas Schwab @ 1998-12-22 1:09 ` Richard Stallman 0 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-22 1:09 UTC (permalink / raw) To: schwab; +Cc: eggert, amylaar, martin, gcc2, egcs Amazing--they actually listened. Thanks. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-19 10:36 ` Paul Eggert 1998-12-20 20:29 ` Richard Stallman @ 1998-12-20 20:29 ` Richard Stallman 1998-12-21 7:00 ` Zack Weinberg 1998-12-21 18:11 ` Paul Eggert 1998-12-21 12:25 ` Samuel Figueroa 2 siblings, 2 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-20 20:29 UTC (permalink / raw) To: eggert; +Cc: amylaar, martin, gcc2, egcs * If the program text is converted from one encoding to another, its #pragma will become incorrect. Yes, the #locale will need to be changed in that case. If GCC is going to depend on the locale, you will have to specify the locale for your files. Regardless of how you specify it, with #locale or with --locale or with an envvar, in any case there is a risk you might forget to change the specification along with the file. This problem applies to ALL possible ways of specifying the locale. So it is not a reason to prefer one method of specification to another. However, use of #locale avoids the danger that you will simply forget to specify the right locale, even though the correct locale is the same as it always was. * Locale names aren't very portable. E.g. Solaris uses "ja" for EUC-JIS whereas Unixware uses "ja_JP.EUC". This is a real issue. I see three possible solutions. 1. Define our own system-independent names for (some) locales. 2. Allow specification of several locale names, and GCC will use the first one that is meaningful on the system in use. 3. Allow specification of several locale names, each associated with a host system type. Perhaps it would be better for GCC to autodetect the character set and encoding, much as Emacs already does. Autodetection is limited; in Emacs it requires (in some cases) preference information from the user. For example, Emacs cannot distinguish between Latin-1, Latin-2, Latin-3, Latin-4 and Latin-5. There is no way to distinguish them automatically, because they use the same set of valid bytes. 
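RMS's alternative 2 (try several spellings of the same locale and take the first the system accepts) can be sketched with the standard setlocale interface. The function name is invented; real code would also want to restore the previous locale.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Return the first locale name in the list that setlocale accepts
   for LC_CTYPE, or NULL if none is meaningful on this system. */
const char *first_valid_locale(const char *const *names, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return names[i];
    return NULL;
}
```

With a list like { "ja", "ja_JP.EUC", "eucJP" }, the same source directive could then work on Solaris and Unixware alike, at the cost of carrying every known spelling.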
Adding something to the file to specify this preference information is pretty much equivalent to alternative 1 above. But all the pragmas required by draft C9x affect the meaning of the program. The committee is making a foolish decision. Let's not follow suit. If some day they define a specific #pragma to do this job, then we should support it, but we should lead the way towards a cleaner approach. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-20 20:29 ` Richard Stallman @ 1998-12-21 7:00 ` Zack Weinberg 1998-12-21 18:58 ` Paul Eggert 1998-12-21 18:11 ` Paul Eggert 1 sibling, 1 reply; 81+ messages in thread From: Zack Weinberg @ 1998-12-21 7:00 UTC (permalink / raw) To: rms; +Cc: amylaar, martin, gcc2, egcs On Sun, 20 Dec 1998 21:29:41 -0700 (MST), Richard Stallman wrote: > > * Locale names aren't very portable. E.g. Solaris uses "ja" for > EUC-JIS whereas Unixware uses "ja_JP.EUC". > >This is a real issue. I see three possible solutions. > >1. Define our own system-independent names for (some) locales. >2. Allow specification of several locale names, and GCC will use >the first one that is meaningful on the system in use. >3. Allow specification of several locale names, each associated >with a host system type. GCC should only care about the character set, not the rest of the locale. Therefore, it makes sense to use the charset names from the iconv library (part of glibc 2.1, also in Solaris and probably elsewhere) which are the names standardized by the MIME RFCs. zw ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 7:00 ` Zack Weinberg @ 1998-12-21 18:58 ` Paul Eggert 1998-12-21 19:07 ` Zack Weinberg ` (2 more replies) 0 siblings, 3 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-21 18:58 UTC (permalink / raw) To: zack; +Cc: rms, amylaar, martin, gcc2, egcs Date: Mon, 21 Dec 1998 10:00:13 -0500 From: Zack Weinberg <zack@rabi.columbia.edu> GCC should only care about the character set, not the rest of the locale. Therefore, it makes sense to use the charset names from the iconv library (part of glibc 2.1, also in Solaris and probably elsewhere) which are the names standardized by the MIME RFCs. This is a good suggestion. I assume you're saying that GCC should use directives like `#charset "SJIS"' rather than directives like `#locale "ja"', since the other attributes of "ja" are not important for GCC. Unfortunately, this suggestion doesn't solve the problem of unportable directives in practice, because the charset+encoding names are not standardized well either. E.g. for Shift-JIS, Solaris 7 has the aliases "PCK" and "SJIS", glibc 2.0.108 has "SJIS", and MIME has "Shift_JIS", "MS_Kanji", and "csShiftJIS". It sounds like we might slide through with "SJIS" for Shift-JIS, even though it's not in the MIME standard; but for EUC-JIS, Solaris 7 has the name "eucJP", glibc 2.0.108 has "EUC-JP", and MIME has the aliases "EUC-JP", "csEUCPkdFmtJapanese", and "Extended_UNIX_Code_Packed_Format_for_Japanese"; so no single name will do in practice for EUC-JIS. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:58 ` Paul Eggert @ 1998-12-21 19:07 ` Zack Weinberg 1998-12-21 19:28 ` Ulrich Drepper 1998-12-23 0:36 ` Richard Stallman 2 siblings, 0 replies; 81+ messages in thread From: Zack Weinberg @ 1998-12-21 19:07 UTC (permalink / raw) To: Paul Eggert; +Cc: rms, amylaar, martin, gcc2, egcs On Mon, 21 Dec 1998 18:57:01 -0800 (PST), Paul Eggert wrote: > Date: Mon, 21 Dec 1998 10:00:13 -0500 > From: Zack Weinberg <zack@rabi.columbia.edu> > > GCC should only care about the character set, not the rest of the > locale. Therefore, it makes sense to use the charset names from the > iconv library (part of glibc 2.1, also in Solaris and probably > elsewhere) which are the names standardized by the MIME RFCs. > >This is a good suggestion. I assume you're saying that GCC should use >directives like `#charset "SJIS"' rather than directives like `#locale >"ja"', since the other attributes of "ja" are not important for GCC. Yes. You have a point about this being something that belongs in the environment. I don't have any experience in this field and can't say what makes the most sense. >Unfortunately, this suggestion doesn't solve the problem of unportable >directives in practice, because the charset+encoding names are not >standardized well either. The MIME RFCs try to standardize charset+encoding names. It's the closest to a proper standard there is, and all the iconv(3) implementations I know of (all two of them :) support all those names. I'm tempted to suggest that we use iconv to convert everything to UTF-8 (Java seems to need this, and consistency is good) but only when it comes with the system. When it isn't available, we don't even try to support extended charsets. Trying to support all the different incompatible encoding libraries out there would be a nightmare, and importing glibc's iconv is impractical - it's >4megs of code. zw ^ permalink raw reply [flat|nested] 81+ messages in thread
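The conversion Zack proposes can be sketched against the iconv interface (glibc 2.1, Solaris). The function name is invented, and the charset names passed to iconv_open are exactly the unportable part under discussion: "SHIFT_JIS" works with glibc, but other systems may spell it "PCK", "SJIS", or something else.

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/* Convert one Shift-JIS string to UTF-8 via iconv.
   Returns 0 on success, -1 if the conversion is unavailable
   or the input is malformed. */
int sjis_to_utf8(const char *in, char *out, size_t outsz)
{
    iconv_t cd = iconv_open("UTF-8", "SHIFT_JIS");
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft = outsz - 1;

    if (cd == (iconv_t)-1)
        return -1;              /* charset pair not supported here */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        iconv_close(cd);
        return -1;
    }
    *outp = '\0';
    iconv_close(cd);
    return 0;
}
```

For example, the Shift-JIS bytes 0x83 0x5C (katakana SO) should come out as the UTF-8 bytes 0xE3 0x82 0xBD, assuming the system's iconv knows this charset pair.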
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:58 ` Paul Eggert 1998-12-21 19:07 ` Zack Weinberg @ 1998-12-21 19:28 ` Ulrich Drepper 1998-12-23 0:36 ` Richard Stallman 2 siblings, 0 replies; 81+ messages in thread From: Ulrich Drepper @ 1998-12-21 19:28 UTC (permalink / raw) To: Paul Eggert; +Cc: zack, rms, amylaar, martin, gcc2, egcs Paul Eggert <eggert@twinsun.com> writes: > Unfortunately, this suggestion doesn't solve the problem of unportable > directives in practice, because the charset+encoding names are not > standardized well either. Take a look at the gconv-modules file in glibc. A similar file could be part of gcc. The user could extend it in whatever way s/he needs. -- ---------------. drepper at gnu.org ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Cygnus Solutions `--' drepper at cygnus.com `------------------------ ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:58 ` Paul Eggert 1998-12-21 19:07 ` Zack Weinberg 1998-12-21 19:28 ` Ulrich Drepper @ 1998-12-23 0:36 ` Richard Stallman 2 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-23 0:36 UTC (permalink / raw) To: eggert; +Cc: zack, amylaar, martin, gcc2, egcs It sounds like we might slide through with "SJIS" for Shift-JIS, even though it's not in the MIME standard; but for EUC-JIS, Solaris 7 has the name "eucJP", glibc 2.0.108 has "EUC-JP", and MIME has the aliases "EUC-JP", "csEUCPkdFmtJapanese", and "Extended_UNIX_Code_Packed_Format_for_Japanese"; so no single name will do in practice for EUC-JIS. At worst, we may have to pick one of them. Any one of them will do in practice--we just have to document which one to use. Or we could support all three of them, as equivalent aliases. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-20 20:29 ` Richard Stallman 1998-12-21 7:00 ` Zack Weinberg @ 1998-12-21 18:11 ` Paul Eggert 1998-12-21 18:46 ` Per Bothner ` (3 more replies) 1 sibling, 4 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-21 18:11 UTC (permalink / raw) To: rms; +Cc: amylaar, martin, gcc2, egcs Date: Sun, 20 Dec 1998 21:29:41 -0700 (MST) From: Richard Stallman <rms@gnu.org> Perhaps it would be better for GCC to autodetect the character set and encoding, much as Emacs already does. Autodetection is limited; in Emacs it requires (in some cases) preference information from the user. For example, Emacs cannot distinguish between Latin-1, Latin-2, Latin-3, Latin-4 and Latin-5. There is no way to distinguish them automatically, because they use the same set of valid bytes. There are two related issues here: 1. Can autodetection work well enough to support non-"C" characters in strings and identifiers? 2. Can autodetection work well enough to support \u escapes as well? Your example suggests that the answer to (2) is ``no''. But this already follows from the fact that a C program could be written entirely in the "C" character set with \u escapes, and autodetection can't possibly determine the multibyte encoding of such a program. I was thinking more about case (1), which I think will be more common in practice. I suspect that autodetection could work reasonably well for \u-free programs. For example, GCC needn't worry about the distinction between Latin-1 and Latin-2 if there are no \u escapes, since it needn't worry about whether the byte 0xb5 corresponds to MICRO SIGN or to some other character. all the pragmas required by draft C9x affect the program's meaning The committee is making a foolish decision. The committee is also requiring _Pragma("FOO") to have the same meaning as #pragma FOO. _Pragma("FOO") can be output by macros. Does this overcome your objection to pragmas? 
* If the program text is converted from one encoding to another, its #locale will become incorrect. If GCC is going to depend on the locale, you will have to specify the locale for your files. Regardless of how you specify it, with #locale or with --locale or with an envvar, in any case there is a risk you might forget to change the specification along with the file. I have a shorter and a longer answer to this. The shorter answer: GCC currently doesn't have directives like this: #character-set ASCII #character-set EBCDIC because they're not needed; people who compile in EBCDIC environments already know about these issues, set things up appropriately, and would find those directives to be a pain to maintain. GCC (and other compilers) have survived all these years without character-set directives, even though they solve roughly the same problems that #locale directives would solve. This suggests that GCC doesn't need #locale directives either. The longer answer: There is a risk to mis-specifying the locale, yes, but in practice my experience is that the risk is smaller if the locale is part of the environment. In our company, when we import files from other sources, we typically transliterate them to an encoding suitable for our preferred working locale. This is the only plausible way to do things; otherwise, few of our text-processing tools would work. Even Emacs supports only _some_ of the Japanese encodings that we import -- e.g. it doesn't support UTF-8 or DBCS. Most other tools support only one character set and encoding at a time, and it is set from the locale environment variables in the usual way. If #locale were part of the source, we'd have more work to do, since we'd also have to munge the #locale directives of imported sources. This would be doable, but it would be a hassle, particularly when trading patches with our correspondents who use different encodings. I can easily see where people would screw this up. 
In contrast, if the locale is part of the build environment, we needn't worry about munging anything. We must set up our build environment correctly, but that's OK -- we also must set up our PATH correctly, etc., and setting up the locale correctly is something that everyone versed in software internationalization and localization already knows how to do. use of #locale avoids the danger that you will simply forget to specify the right locale, No, it doesn't avoid the danger. You can specify the wrong locale just as easily, if not more easily, with #locale -- e.g. see the transliteration scenario in my longer answer above. * Locale names aren't very portable. E.g. Solaris uses "ja" for EUC-JIS whereas Unixware uses "ja_JP.EUC". This is a real issue. I see three possible solutions. 1. Define our own system-independent names for (some) locales. I'd rather not do this -- it will be a maintenance hassle. But if we must do it, we should steal code from glibc rather than reinvent the wheel (as is done in the current GCC2 and EGCS snapshots). 2. Allow specification of several locale names, and GCC will use the first one that is meaningful on the system in use. 3. Allow specification of several locale names, each associated with a host system type. These are also maintenance hassles, for several reasons. E.g. "ja" means different things on different hosts. I don't know which locale names map to which encodings on which hosts, and keeping track of this info will be tedious and quite error prone. ^ permalink raw reply [flat|nested] 81+ messages in thread
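Eggert's earlier point that Latin-1 and Latin-2 cannot be distinguished automatically can be checked mechanically. A minimal sketch (hypothetical helper name, glibc iconv assumed): the byte 0xB5 is valid in both charsets, so no autodetector can tell them apart, yet it names MICRO SIGN in one and LATIN SMALL LETTER L WITH CARON in the other.

```c
#include <iconv.h>
#include <string.h>

/* Convert the single byte 0xB5 from charset `from' to UTF-8 and
   return the resulting Unicode code point, or -1 on error. */
long decode_b5(const char *from)
{
    iconv_t cd = iconv_open("UTF-8", from);
    char in[1] = { (char) 0xB5 };
    char out[8];
    char *inp = in, *outp = out;
    size_t inleft = 1, outleft = sizeof out;
    long cp = -1;

    if (cd == (iconv_t) -1)
        return -1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t) -1) {
        /* Both results are 2-byte UTF-8 sequences; decode by hand. */
        unsigned char b0 = out[0], b1 = out[1];
        cp = ((b0 & 0x1FL) << 6) | (b1 & 0x3F);
    }
    iconv_close(cd);
    return cp;
}
```

decode_b5("ISO-8859-1") yields U+00B5 while decode_b5("ISO-8859-2") yields U+013E: same input bytes, different characters, so only out-of-band information (locale, directive, or option) can decide.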
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:11 ` Paul Eggert @ 1998-12-21 18:46 ` Per Bothner 1998-12-21 19:44 ` Paul Eggert 1998-12-21 20:16 ` Paul Eggert 1998-12-21 19:16 ` Per Bothner ` (2 subsequent siblings) 3 siblings, 2 replies; 81+ messages in thread From: Per Bothner @ 1998-12-21 18:46 UTC (permalink / raw) To: Paul Eggert; +Cc: rms, amylaar, martin, gcc2, egcs > 1. Can autodetection work well enough to support non-"C" characters in > strings and identifiers? I don't know if it can work for C. I do know that autodetection cannot work for Java. If a Java program is written in Latin-2, then any non-ASCII characters have to be converted into the corresponding Unicode at some stage in the translation process. I.e. either the compiler or the assembler (or a pre-processor) has to know the character set of the input file. Yes, we could have auto-detection for C but not Java, but that does seem rather clumsy. In any case: I think we want to support linking together source files written in different locales. E.g. libc should be written in UTF-8, but an application may be written in a local character set. If we want these to be able to link, either the linker has to be able to convert between character encodings (which I think we agree we don't want), or symbol names in .o files have to be in a common character set. The only plausible contender for such a common character set is UTF-8. Given that symbols have to be in a common character encoding, it follows that you cannot possibly do autodetection, at least not for identifiers. --Per Bothner Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:46 ` Per Bothner @ 1998-12-21 19:44 ` Paul Eggert 1998-12-21 20:30 ` Per Bothner 1998-12-21 20:16 ` Paul Eggert 1 sibling, 1 reply; 81+ messages in thread From: Paul Eggert @ 1998-12-21 19:44 UTC (permalink / raw) To: bothner; +Cc: rms, amylaar, martin, gcc2, egcs Date: Mon, 21 Dec 1998 18:45:09 -0800 From: Per Bothner <bothner@cygnus.com> Yes, we could have auto-detection for C but not Java, but that does seem rather clumsy. It would be nice to use the same method for all languages, yes. This is a good argument against autodetection. libc should be written in UTF-8, but an application may be written in a local character set. libc's identifiers use only the "C" subset of ASCII, and therefore libc will link to an application written in any locale, even if we use the native multibyte encoding for identifiers. Given that [.o] symbols have to be in a common character encoding, it follows that you cannot possibly do autodetection, at least not for identifiers. I don't see how this follows. The compiler could use autodetection to discover the input character set, and then translate the identifiers' characters to UTF-8 when outputting assembly language. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 19:44 ` Paul Eggert @ 1998-12-21 20:30 ` Per Bothner 1998-12-23 0:35 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: Per Bothner @ 1998-12-21 20:30 UTC (permalink / raw) To: Paul Eggert; +Cc: gcc2, egcs > libc's identifiers use only the "C" subset of ASCII, and therefore > libc will link to an application written in any locale, even if we use > the native multibyte encoding for identifiers. It was an example. In practice, libraries meant for other-than-internal use will probably stick to the C subset - but I don't want to depend on that. > I don't see how this follows. The compiler could use autodetection to > discover the input character set, and then translate the identifiers' > characters to UTF-8 when outputting assembly language. We've already established that the compiler cannot use autodetection to discover the input character set except in very specific environments. I guess I was responding to the idea of passing uninterpreted bytes through, and pointing out that that is a bad idea for at least external identifiers and for Java. --Per Bothner Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 20:30 ` Per Bothner @ 1998-12-23 0:35 ` Richard Stallman 0 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-23 0:35 UTC (permalink / raw) To: bothner; +Cc: eggert, gcc2, egcs I guess I was responding to the idea of passing uninterpreted bytes through, and pointing that that is a bad idea for at least external identifiers and for Java. For non-ASCII bytes in external identifiers, we can't simply pass them through, because many assemblers won't accept them as identifier characters. Some sort of encoding is needed in the .s file. I proposed one, but others might be better. Conversion to UTF-8 won't work, because the assembler probably won't accept that either. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:46 ` Per Bothner 1998-12-21 19:44 ` Paul Eggert @ 1998-12-21 20:16 ` Paul Eggert 1998-12-21 20:28 ` Zack Weinberg ` (2 more replies) 1 sibling, 3 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-21 20:16 UTC (permalink / raw) To: bothner; +Cc: rms, amylaar, martin, gcc2, egcs Date: Mon, 21 Dec 1998 18:45:09 -0800 From: Per Bothner <bothner@cygnus.com> we want to support linking together source files written in different locales... The only plauible contender for such a common character set is UTF-8. OK, how about this proposal? I've tried to formulate it to address everybody's concerns: (1) The input character set is determined by #pragma charset FOO (or _Pragma ("charset FOO")) directive, compile-time option, or environment variable (in that order). For the last alternative, the default is to use setlocale (LC_CTYPE, ""); nl_langinfo (CODESET) if these two functions are available. The default is UTF-8. (2) GCC uses the iconv function to translate from the input multibyte encoding to UTF-8 internally (for identifiers), and to determine character boundaries (in strings and comments). If the implementation doesn't have iconv, GCC normally supports only UTF-8; however, if the installer wants to build a compiler that knows about other encodings (e.g. for cross-compilation), we supply an easy way to use glibc's iconv. We can then remove the existing local_mblen function and friends, as they're no longer needed. (3) GCC transliterates each \u escape in a string to the string's charset, which is specified as described in (1) above. (4) After the translation in (3) (and after processing the other escapes like \n), GCC copies the contents of strings straight through to the assembler, if possible. As is currently the case, characters like \ and " that need escaping are escaped. However, a new feature is that if a string contains troublesome multibyte characters (e.g. 
the characters contain the bytes for ASCII \ or "), then those characters are output using octal escapes for each byte. Similarly, if there is a string of multibyte characters not in the initial shift state that contains a \ or " byte, the entire string is output using octal escapes. (5) GCC transliterates all identifiers to UTF-8 for the assembly language output. If the input character set is a superset of UTF-8 (e.g. ISO-2022-JP), then the extra information is lost. If the assembler doesn't support UTF-8 identifiers, GCC transliterates identifiers to some ASCII escape sequence representing the UTF-8 identifiers. (6) GCC transliterates all identifiers to the working charset for all other output (e.g. diagnostics). Here are some properties of this proposal: * If the input file uses UTF-8, then the assembly language output uses UTF-8 as well. * If the input file is a text file that does not use \u escapes, and does not use multibyte characters in identifiers, then the assembly language output is a text file that uses the same encoding. This should accommodate existing practice reasonably well. * The assembler needn't know about encodings. * You can link together source files written in different locales, since all the identifiers are transliterated to some encoding of Unicode. ^ permalink raw reply [flat|nested] 81+ messages in thread
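The octal-escaping rule of step (4) might look roughly like this. A hypothetical sketch, not actual GCC code: string bytes are copied through unchanged, and only '\', '"', and control bytes are rewritten as one octal escape per byte, so no charset knowledge is needed on the assembler side.

```c
#include <stdio.h>

/* Copy the n bytes of s into out as assembler string contents,
   octal-escaping backslash, double quote, and control bytes.
   Returns the length of the escaped text.  `out' must be large
   enough (worst case 4*n + 1 bytes). */
size_t escape_asm_bytes(const unsigned char *s, size_t n, char *out)
{
    char *p = out;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char c = s[i];
        if (c == '\\' || c == '"' || c < 0x20 || c == 0x7F)
            p += sprintf(p, "\\%03o", c);  /* one escape per byte */
        else
            *p++ = (char) c;               /* multibyte bytes pass through */
    }
    *p = '\0';
    return (size_t) (p - out);
}
```

For input a"b\c this produces a\042b\134c; a Shift-JIS or EUC string whose bytes avoid those values passes through byte-for-byte, matching the "text in, text out" property claimed above.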
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 20:16 ` Paul Eggert @ 1998-12-21 20:28 ` Zack Weinberg 1998-12-22 2:59 ` Paul Eggert 1998-12-21 21:03 ` Per Bothner 1998-12-25 0:05 ` Richard Stallman 2 siblings, 1 reply; 81+ messages in thread From: Zack Weinberg @ 1998-12-21 20:28 UTC (permalink / raw) To: Paul Eggert; +Cc: rms, amylaar, martin, gcc2, egcs I like Paul's proposal in general but I have two nits relating to the implementation. >(1) The input character set is determined by #pragma charset FOO > (or _Pragma ("charset FOO")) directive, compile-time option, or > environment variable (in that order). It is going to be extremely difficult to put _Pragma into the preprocessor as it's specified in the current standard. I'll talk about this in another message. >(2) GCC uses the iconv function to translate from the input multibyte > encoding to UTF-8 internally (for identifiers), and to determine > character boundaries (in strings and comments). To do this we'd need to translate the entire file to UTF-8 in order to know where identifiers begin and end, and then translate strings back. That can lose information - say strings are in ISO 2022-JP but nothing else is. zw ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 20:28 ` Zack Weinberg @ 1998-12-22 2:59 ` Paul Eggert 1998-12-23 17:16 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: Paul Eggert @ 1998-12-22 2:59 UTC (permalink / raw) To: zack; +Cc: rms, amylaar, martin, gcc2, egcs Date: Mon, 21 Dec 1998 23:28:09 -0500 From: Zack Weinberg <zack@rabi.columbia.edu> >(2) GCC uses the iconv function to translate from the input multibyte > encoding to UTF-8 internally (for identifiers), and to determine > character boundaries (in strings and comments). To do this we'd need to translate the entire file to UTF-8 in order to know where identifiers begin and end, and then translate strings back. We can avoid this problem by using iconv in ``byte-at-a-time'' mode. I.e. we can use iconv to discover the minimal nonempty sequence of input bytes S such that S is a multibyte character string, and such that the first char after S is a "C" char. If we find such an S in a string, we copy its value through unchanged; if we find it in an identifier, we translate it to UTF-8. This would mean we wouldn't have to translate the entire file to UTF-8. As an optimization, we don't need to call iconv at all in the common case where the input file uses only the "C" subset of ASCII. This is because such files cannot contain multibyte chars. We need to call iconv only if the input contains non-"C" bytes (e.g. bytes with the top bit on, or the ESC character). It might be nice if there was something faster than invoking iconv in byte-at-a-time mode, for files that contain lots of multibyte chars. E.g. it might be nice if we could use an efficient primitive that acts like iconv, except it stops translating when it finds a "C" char. If necessary to improve performance, we can add such a primitive to glibc, and use it if it's available. ^ permalink raw reply [flat|nested] 81+ messages in thread
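The byte-at-a-time idea can be sketched with plain iconv(3). This is a hypothetical illustration (helper name invented, glibc assumed): hand iconv one more input byte per attempt until it converts a complete character; EINVAL means "incomplete multibyte sequence, feed me more".

```c
#include <iconv.h>
#include <errno.h>
#include <string.h>

/* Return the byte length of the first multibyte character of s
   (n bytes, in `charset'), or 0 on failure. */
size_t first_char_len(const char *charset, const char *s, size_t n)
{
    iconv_t cd = iconv_open("UTF-8", charset);
    size_t result = 0;
    size_t len;

    if (cd == (iconv_t) -1)
        return 0;
    for (len = 1; len <= n && len <= 8; len++) {
        char in[8], out[16];
        char *inp = in, *outp = out;
        size_t inleft = len, outleft = sizeof out;
        size_t r;

        memcpy(in, s, len);
        r = iconv(cd, &inp, &inleft, &outp, &outleft);
        iconv(cd, NULL, NULL, NULL, NULL);  /* reset shift state */
        if (r != (size_t) -1) {
            result = len;  /* prefix converted cleanly: char complete */
            break;
        }
        if (errno != EINVAL)
            break;  /* EILSEQ etc.: invalid sequence, give up */
    }
    iconv_close(cd);
    return result;
}
```

With UTF-8 input, "abc" yields 1 and the two bytes C2 B5 (MICRO SIGN) yield 2. A faster "stop at the first C character" primitive, as suggested above, would replace this loop.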
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-22 2:59 ` Paul Eggert @ 1998-12-23 17:16 ` Richard Stallman 1998-12-23 18:11 ` Zack Weinberg 1998-12-23 19:21 ` Paul Eggert 0 siblings, 2 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-23 17:16 UTC (permalink / raw) To: eggert; +Cc: zack, amylaar, martin, gcc2, egcs The idea of translating everything into UTF-8 is not useful for C. It is pointless and mistaken to translate symbols to UTF-8. The assembler won't accept them in UTF-8, and users who use other encodings wouldn't want them in UTF-8 anyway. It is pointless and buggy to translate strings to UTF-8 and then translate them back. As Handa pointed out, it's impossible to translate them back. If that translation is required for correct handling of Java, then let's do it for Java. But not for C. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-23 17:16 ` Richard Stallman @ 1998-12-23 18:11 ` Zack Weinberg 1998-12-25 0:05 ` Richard Stallman 1998-12-23 19:21 ` Paul Eggert 1 sibling, 1 reply; 81+ messages in thread From: Zack Weinberg @ 1998-12-23 18:11 UTC (permalink / raw) To: rms; +Cc: zack, amylaar, martin, gcc2, egcs On Wed, 23 Dec 1998 18:16:42 -0700 (MST), Richard Stallman wrote: >The idea of translating everything into UTF-8 is not useful for C. > >It is pointless and mistaken to translate symbols to UTF-8. The >assembler won't accept them in UTF-8, and users who use other >encodings wouldn't want them in UTF-8 anyway. I think you may have missed a few things. gas has no problem with symbols in UTF-8 (I am told). ASCII <-> UTF-8 is a no-op, and gcc does not currently accept non-ASCII identifiers, so no existing code will be broken by the change. Converting all symbol names to UTF-8 is desirable for all languages for two reasons. First, Java requires this and we want to be able to link modules written in Java with modules written in any other language supported by gcc. Second, we want to be able to link modules written in encoding X with other modules in encoding Y. One-way translation of all identifiers to UTF-8 achieves this. >It is pointless and buggy to translate strings to UTF-8 and then >translate them back. As Handa pointed out, it's impossible to >translate them back. No argument here. zw ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-23 18:11 ` Zack Weinberg @ 1998-12-25 0:05 ` Richard Stallman 1998-12-28 5:55 ` Martin von Loewis 0 siblings, 1 reply; 81+ messages in thread From: Richard Stallman @ 1998-12-25 0:05 UTC (permalink / raw) To: zack; +Cc: zack, amylaar, martin, gcc2, egcs I think you may have missed a few things. gas has no problem with symbols in UTF-8 (I am told). GCC works with many assemblers. I doubt that they all support UTF-8, and it would be hard even to check them all. So I think we will have to mangle non-ASCII byte values somehow in the .s files, whether the encoding used is UTF-8 or not. Anyway, there are other reasons not to always use UTF-8. ascii <-> UTF-8 is a no-op, and gcc does not currently accept non-ASCII identifiers, so no existing code will be broken by the change. This is true, but does not eliminate the problems. Second, we want to be able to link modules written in encoding X with other modules in encoding Y. This is a useful feature. However, not needing to specify what encoding the file is in is also a useful feature. These two features are inherently incompatible, so perhaps we should give the user a choice, through an option. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-25 0:05 ` Richard Stallman @ 1998-12-28 5:55 ` Martin von Loewis 1998-12-30 5:19 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: Martin von Loewis @ 1998-12-28 5:55 UTC (permalink / raw) To: rms; +Cc: zack, zack, amylaar, gcc2, egcs > GCC works with many assemblers. I doubt that they all support UTF-8, > and it would be hard even to check them all. So I think we will have > to mangle non-ASCII byte values somehow in the .s files, whether the > encoding used is UTF-8 or not. Alternatively, we could reject code that uses non-ASCII identifiers if the assembler cannot reasonably represent them. That might mean that the full feature set is only available on GNU systems. I could live with that. Regards, Martin ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-28 5:55 ` Martin von Loewis @ 1998-12-30 5:19 ` Richard Stallman 0 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-30 5:19 UTC (permalink / raw) To: martin; +Cc: zack, zack, amylaar, gcc2, egcs Alternatively, we could reject code that uses non-ASCII identifiers if the assembler cannot reasonably represent them. That might mean that the full feature set is only available on GNU systems. I could live with that. It would be much better to mangle the names. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-23 17:16 ` Richard Stallman 1998-12-23 18:11 ` Zack Weinberg @ 1998-12-23 19:21 ` Paul Eggert 1998-12-25 0:05 ` Richard Stallman 1998-12-25 0:05 ` Richard Stallman 1 sibling, 2 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-23 19:21 UTC (permalink / raw) To: rms; +Cc: zack, amylaar, martin, gcc2, egcs Date: Wed, 23 Dec 1998 18:16:42 -0700 (MST) From: Richard Stallman <rms@gnu.org> It is pointless and buggy to translate strings to UTF-8 and then translate them back. I agree, and my proposal doesn't do that for C. String bytes are copied straight through. It is pointless and mistaken to translate symbols to UTF-8. The assembler won't accept them in UTF-8, and users who use other encodings wouldn't want them in UTF-8 anyway. For non-GNU platforms like Solaris, we'll have to follow the platform's convention in this area, so that GCC-compiled code can link to non-GCC-compiled code. Most likely we'll need a way to configure the method GCC uses to output non-ASCII identifiers in assembly language, as there probably won't be a universally accepted standard method. Possibly, some platforms will require symbols to be translated to a canonical form (allowing cross-locale linking) and other platforms will just use the symbol bytes as-is (disallowing cross-locale linking); GCC will just have to go with the flow. For GNU platforms, my understanding is that GAS allows arbitrary bytes in symbols, so it is plausible to use UTF-8 for the canonical symbol encoding. If we go this route, assembler files will be UTF-8. In general, GCC will have to use \x escapes in strings to represent the bytes of non-ASCII characters, so that string bytes are copied straight-through without loss of information -- but \x escapes will be required no matter what solution is employed, since we want the assembler to be locale-independent, so requiring \x escapes is not a major loss. 
Another possibility for GNU is to mangle symbols into some form of ASCII. To do this, we'll have to come up with a mangling method that is compatible with existing C++ mangling, and which doesn't usurp existing user identifier space. You proposed a method, but someone else found a problem with it (sorry, I don't recall the details). Even if we solve the mangling problem, though, the ASCII-only name-mangling method seems less useful than UTF-8 name mangling. Neither mangling method allows an arbitrary native encoding (e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, but at least the UTF-8 mangling method allows UTF-8 to be used uniformly. By the way, even if we don't care about linking from different locales, GCC must still translate symbols to a canonical form. For example, suppose `@' denotes the character MICRO SIGN (Unicode character 00b5). Then `@' (1 character) and `\u00b5' (6 characters) are different spellings of the same symbol, and GCC must unify the two spellings. This is true no matter how the symbol is represented in assembly language output. ^ permalink raw reply [flat|nested] 81+ messages in thread
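The unification requirement in the last paragraph can be sketched as a tiny canonicalizer. A hypothetical illustration (function names invented, UTF-8 source assumed): expand every \uXXXX escape to its UTF-8 bytes, so the raw MICRO SIGN, \u00b5, and \u00B5 all compare equal afterwards.

```c
#include <ctype.h>

static int hexval(int c)
{
    return isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
}

/* Append code point cp to p as UTF-8 (BMP only, for brevity);
   return the number of bytes written. */
static int utf8_put(unsigned long cp, char *p)
{
    if (cp < 0x80)  { p[0] = (char) cp; return 1; }
    if (cp < 0x800) { p[0] = (char) (0xC0 | (cp >> 6));
                      p[1] = (char) (0x80 | (cp & 0x3F)); return 2; }
    p[0] = (char) (0xE0 | (cp >> 12));
    p[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
    p[2] = (char) (0x80 | (cp & 0x3F));
    return 3;
}

/* Copy identifier src to dst, expanding \uXXXX escapes to UTF-8
   bytes so that all spellings of one identifier become identical. */
void canonicalize_id(const char *src, char *dst)
{
    while (*src) {
        if (src[0] == '\\' && src[1] == 'u') {
            unsigned long cp = 0;
            int i;
            for (i = 2; i < 6; i++)
                cp = cp * 16 + hexval(src[i]);
            dst += utf8_put(cp, dst);
            src += 6;
        } else
            *dst++ = *src++;
    }
    *dst = '\0';
}
```

After canonicalization the symbol can be hashed or emitted in whatever assembler-level mangling is chosen; the unification step itself is independent of that choice, which is Eggert's point.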
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-23 19:21 ` Paul Eggert @ 1998-12-25 0:05 ` Richard Stallman 1998-12-25 0:05 ` Richard Stallman 1 sibling, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-25 0:05 UTC (permalink / raw) To: eggert; +Cc: zack, amylaar, martin, gcc2, egcs Even if we solve the mangling problem, though, the ASCII-only name-mangling method seems less useful than UTF-8 name mangling. Neither mangling method allows an arbitrary native encoding (e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, ASCII-only name mangling ought to achieve that. Could you please explain why you think it will not? ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-23 19:21 ` Paul Eggert 1998-12-25 0:05 ` Richard Stallman @ 1998-12-25 0:05 ` Richard Stallman 1 sibling, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-25 0:05 UTC (permalink / raw) To: eggert; +Cc: zack, amylaar, martin, gcc2, egcs By the way, even if we don't care about linking from different locales, GCC must still translate symbols to a canonical form. For example, suppose `@' denotes the character MICRO SIGN (Unicode character 00b5). Then `@' (1 character) and `\u00b5' (6 characters) are different spellings of the same symbol, and GCC must unify the two spellings. This is true no matter how the symbol is represented in assembly language output. That's right: GCC will have to convert \u00b5 into whatever is the proper thing to output for @, and likewise for unicode characters that have a multibyte representation in the encoding system it is using. This is why GCC has to depend on the encoding, when \u is used. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 20:16 ` Paul Eggert 1998-12-21 20:28 ` Zack Weinberg @ 1998-12-21 21:03 ` Per Bothner 1998-12-22 2:35 ` Paul Eggert 1998-12-28 8:10 ` Martin von Loewis 1998-12-25 0:05 ` Richard Stallman 2 siblings, 2 replies; 81+ messages in thread From: Per Bothner @ 1998-12-21 21:03 UTC (permalink / raw) To: Paul Eggert; +Cc: gcc2, egcs > (2) GCC uses the iconv function to translate from the input multibyte > encoding to UTF-8 internally (for identifiers and strings in Java. > (3) GCC transliterates each \u escape in a string to the string's charset, > which is specified as described in (1) above. Hm. (1) above specifies the *file's* charset. It does not follow that the *string's* charset is the same. Certainly for Java, it would not be. What happens to: wchar_t x = '\u1234'; /* or: L'\u1234' */ are these different from: wchar_t x = (wchar_t) 0x1234; I assume your proposal is that the string charset at least by default should be the file charset except for Java where the string charset is Unicode. I don't know if that is reasonable; I guess so. > If the input character set is a superset of UTF-8 > (e.g. ISO-2022-JP), then the extra information is lost. I'm confused. I thought that Unicode was specifically designed so that distinct characters in existing Japanese character standards were mapped into distinct Unicode characters. Did I misunderstand, or is ISO-2022-JP not one of the "source" character sets the Unicode designers used? --Per Bothner Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 21:03 ` Per Bothner @ 1998-12-22 2:35 ` Paul Eggert 1998-12-28 8:10 ` Martin von Loewis 1 sibling, 0 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-22 2:35 UTC (permalink / raw) To: bothner; +Cc: gcc2, egcs Date: Mon, 21 Dec 1998 21:02:31 -0800 From: Per Bothner <bothner@cygnus.com> > (3) GCC transliterates each \u escape in a string to the string's charset, > which is specified as described in (1) above. Hm. (1) above specifies the *file's* charset. It does not follow that the *string's* charset is the same. Certainly for Java, it would not be. (1) also specifies the string's charset in C, because you can switch charsets in the middle of a file e.g. with _Pragma ("charset Shift_JIS") or whatever. What happens to: wchar_t x = '\u1234'; /* or: L'\u1234' */ are these different from: wchar_t x = (wchar_t) 0x1234; Yes, e.g. the string's charset might specify JIS for wide characters. I assume your proposal is that the string charset at least by default should be the file charset except for Java where the string charset is Unicode. Yes. > If the input character set is a superset of UTF-8 > (e.g. ISO-2022-JP), then the extra information is lost. I'm confused. I thought that Unicode was specifically designed so that distinct characters in existing Japanese character standards were mapped into distinct Unicode characters. Did I misunderstand, or is ISO-2022-JP not one of the "source" character sets the Unicode designers used? You understood correctly. To some extent, ISO-2022-JP and Unicode are competing standards. ISO-2022-JP distinguishes between (say) the Japanese and Chinese forms of the same character, whereas Unicode does not. Right now, my impression is that ISO-2022-JP is used more often in the Japanese world than Unicode is. This is certainly true for email. Microsoft is pushing Unicode mightily in the DOS and NT domains, though. 
There is little call for distinguishing Chinese from Japanese in identifiers. So it's OK if GCC supports only the Unicode ``subset'' of ISO-2022-JP in identifiers. If there are ISO-2022-JP partisans who are disturbed by this part of my proposal, then I have some reassurance for them. Rumor has it that ISO 10646 might be officially extended so that it will become a functional superset of ISO-2022-JP. (This is the ``plane-14'' language-tagging effort.) This will require more than 16 bits per character, so it won't be Unicode, and presumably Java char and string won't support it (unless Java is also extended); but C and C++ will support plane-14, because they already have \u escapes for 32-bit characters, and allow UTF-8 implementations (which also support 32-bit chars). If and when the plane-14 proposal becomes a standard, then C and C++ could distinguish between Chinese and Japanese in identifiers under my proposal. Isn't internationalization fun? ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 21:03 ` Per Bothner 1998-12-22 2:35 ` Paul Eggert @ 1998-12-28 8:10 ` Martin von Loewis 1998-12-28 11:00 ` Per Bothner 1 sibling, 1 reply; 81+ messages in thread From: Martin von Loewis @ 1998-12-28 8:10 UTC (permalink / raw) To: bothner; +Cc: eggert, gcc2, egcs > I'm confused. I thought that Unicode was specifically designed > so that dictinct characters in existing Japanese character > standards were mapped into distinct Unicode characters. Paul already answered that, I'd like to add from a different angle. ISO 2022 uses escapes sequences to switch between different character sets. ISO-2022-JP combines four different character sets in this way. Now, there are potential overlappings between the character sets. In such cases, Unicode typically unifies the overlappings, whereas ISO 2022 leaves them as-is. The argument is which is the right thing. For example, there are four encodings for "LATIN CAPITAL LETTER A": ESC ( B A (ASCII) ESC ( J A (JIS X 0201) ESC $ @ # A (JIS X 0208-1978) ESC $ B # A (JIS X 0208-1983) (*) Unicode has only one character here (U+0041). In other places, Unicode probably was wrong to unify (Han Unification). Not that I want to push a particular solution: Converted to Unicode, encoded in UTF-8, we would get the following for all four encodings: A Regards, Martin (*) Somebody correct me if my tables are wrong. The three-bytes escape-sequence can be omitted if previous characters are already in this encoding. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-28 8:10 ` Martin von Loewis @ 1998-12-28 11:00 ` Per Bothner 0 siblings, 0 replies; 81+ messages in thread From: Per Bothner @ 1998-12-28 11:00 UTC (permalink / raw) To: Martin von Loewis; +Cc: gcc2, egcs > In other places, Unicode > probably was wrong to unify (Han Unification). Unicode may have been wrong politically (but had no choice given a 16-bit limit). I still havan't seen any plausible argument explaining why Han Unification is wrong in theory or causes problems in practice. The problem with unification concerns high-quality multi-lingual documents - which are rare, and which should be using language or fonts attributes to dis-ambiguate. (I.e. if you need to explicitly distinguish Chinese and Japanese, should should be using styled text, not plain text.) --Per Bothner Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 20:16 ` Paul Eggert 1998-12-21 20:28 ` Zack Weinberg 1998-12-21 21:03 ` Per Bothner @ 1998-12-25 0:05 ` Richard Stallman 1998-12-26 0:36 ` Paul Eggert 2 siblings, 1 reply; 81+ messages in thread From: Richard Stallman @ 1998-12-25 0:05 UTC (permalink / raw) To: eggert; +Cc: bothner, amylaar, martin, gcc2, egcs OK, how about this proposal? I've tried to formulate it to address everybody's concerns: You've designed this based on the assmuption of converting everything to UTF-8. It's useful to offer that as one alternative, and your design seems reasonable as a way to do it. But there also needs to be a mode which does not convert. There are some demanding situations in which people would want to link together the results of compiling files written in different encodings, but that will be rare. It will be much more common for people to use one encoding. So the default mode should be not to convert, and in that case, GCC doesn't need to know what the encoding is (unless /u is used). I expect that all general-purpose libraries will limit themselves to ASCII for symbol names for many years to come. That is the wise choice for anyone writing a general-purpose library, and the GNU coding standards will call for that. (Perhaps the situations will be different in the future, if use of Unicode become universal, but that will take years.) If and when it happens, we can change the standards. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-25 0:05 ` Richard Stallman @ 1998-12-26 0:36 ` Paul Eggert 1998-12-27 17:24 ` Richard Stallman 0 siblings, 1 reply; 81+ messages in thread From: Paul Eggert @ 1998-12-26 0:36 UTC (permalink / raw) To: rms; +Cc: zack, bothner, amylaar, martin, gcc2, egcs Date: Fri, 25 Dec 1998 03:07:56 -0500 From: Richard Stallman <rms@gnu.org> Even if we solve the mangling problem, though, the ASCII-only name-mangling method seems less useful than UTF-8 name mangling. Neither mangling method allows an arbitrary native encoding (e.g. Shift-JIS or ISO-2022-JP) to be used uniformly, ASCII-only name mangling ought to achieve that. Could you please explain why you think it will not? Here's what I was thinking: * Unsafe native encodings can't be used in assembly-language strings. The simplest way to handle this is to do what GCC currently does: escape non-ASCII bytes in assembly-language strings using notation like `\377'. * Hence, if ASCII-only name mangling is also used, assembly language files will contain only ASCII, regardless of the input encoding. * This will work, but it's unfriendly for non-English writers, because it means that assembly language uses ASCII instead of the native encoding -- i.e. the native encoding isn't being used uniformly in both source and assembly language output. E.g. suppose we have the following code: const char message[] = "contents"; except that the words `message and `contents' are in Japanese. A Japanese reader would naturally desire to see something like the following assembly language output: message: .asciz "contents" except, of course, the words `message' and `contents' would be in Japanese. Unfortunately, though, with ASCII name mangling, and with string mangling as described above, the Japanese reader will see something like the following instead: .x8c.x32.x9c.x41.x91.x32.xac.x90: .asciz "\200 \x309!\x240@\x201\\\x300\"" which is painful to work with. 
If GCC outputs bytes with the top bit on in assembly language identifiers and strings, then at least safe encodings like UTF-8, ISO 8859, and EUC will yield the naturally desired assembly language output. (Shift-JIS and other unsafe encodings may still yield undesirable escapes in output, but this is no worse than the escapes they already get.) I believe this is what is partly motivating martin's proposed patch, and I'm sympathetic to this motivation. Date: Fri, 25 Dec 1998 03:09:25 -0500 From: Richard Stallman <rms@gnu.org> the default mode should be not to convert, and in that case, GCC doesn't need to know what the encoding is (unless /u is used). Even when not converting, GCC needs to know the input encoding if it's an unsafe one like Shift-JIS or ISO-2022-JP (``unsafe'' meaning ``some multibyte chars contain ASCII bytes'') -- otherwise GCC won't be able to parse comments, strings, and identifiers correctly. Much (if not most) east Asian text currently uses unsafe encodings, so this is not a minor point. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-26 0:36 ` Paul Eggert @ 1998-12-27 17:24 ` Richard Stallman 0 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-27 17:24 UTC (permalink / raw) To: eggert; +Cc: zack, bothner, amylaar, martin, gcc2, egcs ASCII-only name mangling ought to achieve that. Could you please explain why you think it will not? * This will work, but it's unfriendly for non-English writers, Making .s files "friendly" for non-ASCII scripts is a very low-priority goal. There are, or have been, some assemblers which required all strings to be expressed with .byte. So what? Above all, we have to do the right thing for compiler *users*; making .s files look nice has to come second. Even when not converting, GCC needs to know the input encoding if it's an unsafe one like Shift-JIS or ISO-2022-JP (``unsafe'' meaning ``some multibyte chars contain ASCII bytes'') Yes, that's true. However, we can arrange for GCC to do the right thing without knowing the actual encoding, when the real encoding is a safe one--as long as GCC does not think you have specified an unsafe encoding. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:11 ` Paul Eggert 1998-12-21 18:46 ` Per Bothner @ 1998-12-21 19:16 ` Per Bothner 1998-12-21 19:20 ` Per Bothner 1998-12-23 0:35 ` Richard Stallman 1998-12-22 3:09 ` Joern Rennecke 1998-12-23 0:36 ` Richard Stallman 3 siblings, 2 replies; 81+ messages in thread From: Per Bothner @ 1998-12-21 19:16 UTC (permalink / raw) To: gcc2, egcs I too am rather leary of using #pragma locale or any other in-band indicator or the character set. Paul mentions the problem of converting a set of text files from one encoding to another. Perhaps someone in Western Europe wants to examine a program with its documentation, but both were written in China. It makes sense to convert it to the local character set first. If the original program contains #pragma locale statements, these have to be translated also, but expecting a chracter-set translation tool to understand C syntax seems a bit much. If you *don't* do the translation, all your other tools (emacs, less, grep, etc) need to understand the #pragma locale statement, which again seems reasonable. Another problem is that switching character encoding in-band may be difficult. Many libraries do not support it. The Java FileReader class requires you to specify the encoding at *open* time. Of course there are various work-around. For example, you can try opening the file in UTF-8 mode, and if you see a #pragma locale statement, re-open it in the apprioriate mode. Still this is not something applications programmers shoudl have to deal with. The only general solution I think is for the *file system* and/or input library to do the translation. Perferably each file should specify its encoding out-of-bound, just like MIME does. As a back-up, the user should be able tospecify a default encoding (based on their lcoale), and perhaps over-ride it for individual files. 
Still, while #pragra locale does have its problems, and we must also support other ways for getting character encoding information, it might still be a useful *alternative* method for specifying the encoding. One useful data point is that the XML specification provides a command to specify the character encoding in use. See: http://www.w3.org/TR/PR-xml-971208#NT-EncodingDecl The XML spec also includes an appendix on auto-detection: http://www.w3.org/TR/PR-xml-971208#sec-guessing --Per Bothner Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 19:16 ` Per Bothner @ 1998-12-21 19:20 ` Per Bothner 1998-12-23 0:35 ` Richard Stallman 1 sibling, 0 replies; 81+ messages in thread From: Per Bothner @ 1998-12-21 19:20 UTC (permalink / raw) To: Per Bothner; +Cc: gcc2, egcs > If you *don't* [convert your source files, but rely of #pragma locale > or whatever], all your other tools (emacs, > less, grep, etc) need to understand the #pragma locale statement, > which again seems reasonable. s/reasonable/unreasonable/ Sigh. --Per Bothner Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 19:16 ` Per Bothner 1998-12-21 19:20 ` Per Bothner @ 1998-12-23 0:35 ` Richard Stallman 1 sibling, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-23 0:35 UTC (permalink / raw) To: bothner; +Cc: gcc2, egcs If you *don't* do the translation, all your other tools (emacs, less, grep, etc) need to understand the #pragma locale statement, On the contrary, most other programs have no need to understand it. Most of these "tools" don't pay attention to the character encoding. Only Emacs does--and it has its own way you can specify the encoding, if it guesses wrong. It is ok for Emacs to guess, since the results are shown to you straightaway; if it guessed wrong, you will see that on the screen. In many cases, you don't need to care. If you visit a file, change some text at the beginning and save it, and if some part later in the file (which you did not look at) contained some Latin-N characters, it makes no difference to you whether Emacs thought they were Latin-1 or Latin-2. All that matters is that they are saved the same as they were before. But if GCC gets this wrong, you will get errors or incorrect behavior later on, and it may take some time for you to even notice, let alone figure out the cause. Another problem is that switching character encoding in-band may be difficult. Many libraries do not support it. The Java FileReader class requires you to specify the encoding at *open* time. GCC is not written in Java and does not use this class, so this limitation is not a factor for us. Perferably each file should specify its encoding out-of-bound, just like MIME does. I would not object to this sort of system, if users were happy with it. It would avoid depending on the environment. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:11 ` Paul Eggert 1998-12-21 18:46 ` Per Bothner 1998-12-21 19:16 ` Per Bothner @ 1998-12-22 3:09 ` Joern Rennecke 1998-12-22 10:52 ` Paul Eggert 1998-12-23 0:36 ` Richard Stallman 3 siblings, 1 reply; 81+ messages in thread From: Joern Rennecke @ 1998-12-22 3:09 UTC (permalink / raw) To: Paul Eggert; +Cc: rms, martin, gcc2, egcs > In our company, when we import files from other sources, we typically > transliterate them to an encoding suitable for our preferred working > locale. This is the only plausible way to do things; otherwise, few > of our text-processing tools would work. Even Emacs supports only > _some_ of the Japanese encodings that we import -- e.g. it doesn't > support UTF-8 or DBCS. Most other tools support only one character > set and encoding at a time, and it is set from the locale environment > variables in the usual way. > > If #locale were part of the source, we'd have more work to do, since > we'd also have to munge the #locale directives of imported sources. > This would be doable, but it would be a hassle, particularly when > trading patches with our correspondents who use different encodings. > I can easily see where people would screw this up. Ok, how about not naming the locale, but describing it in a way so that it gets automatically adjusted when you transliterate the file? I.e. you pick a set of non-ASCII characters that are sufficient to identify the locale, and for each of them, state their name, followed by their encoding, followed by an ASCII delimiter that makes it possible to detect where the end of a multibyte encoding is. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-22 3:09 ` Joern Rennecke @ 1998-12-22 10:52 ` Paul Eggert 0 siblings, 0 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-22 10:52 UTC (permalink / raw) To: amylaar; +Cc: rms, martin, gcc2, egcs From: Joern Rennecke <amylaar@cygnus.co.uk> Date: Tue, 22 Dec 1998 11:08:52 +0000 (GMT) pick a set of non-ASCII characters that are sufficient to identify the locale, and for each of them, state their name, followed by their encoding, followed by an ASCII delimiter that makes it possible to detect where the end of a multibyte encoding is. I think that would be too brittle to work well in practice. The magic cookie would be long and would be hard to explain to users. For example, they couldn't just cut and paste the magic cookie's bytes out of a recipe file; instead, they'd have to transliterate it to their locale's character set, and they'd have to know what to do when their locale can't represent all the characters. Also, the set of characters would have to be large -- enough to distinguish all the ISO 8859 variants, among other things. Worse, the set would have to change with time as new character sets were added to GCC's set of supported charsets. I'd hate to see a new GCC release required because of the Euro! ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-21 18:11 ` Paul Eggert ` (2 preceding siblings ...) 1998-12-22 3:09 ` Joern Rennecke @ 1998-12-23 0:36 ` Richard Stallman 3 siblings, 0 replies; 81+ messages in thread From: Richard Stallman @ 1998-12-23 0:36 UTC (permalink / raw) To: eggert; +Cc: amylaar, martin, gcc2, egcs The committee is also requiring _Pragma("FOO") to have the same meaning as #pragma FOO. _Pragma("FOO") can be output by macros. Does this overcome your objection to pragmas? Yes. I am still not convinced that we should use #pragma for this particular job, but _Pragma certainly solves that problem for pragmas in general. GCC currently doesn't have directives like this: #character-set ASCII #character-set EBCDIC because they're not needed GCC does not have any way to specify ASCII vs EBCDIC at run time. The choice of ASCII vs EBCDIC is fixed, given your host platform. So it does not shed any light on the question at hand. No, it doesn't avoid the danger. You can specify the wrong locale just as easily, if not more easily, with #locale -- e.g. see the transliteration scenario in my longer answer above. ANY method of specifying the locale leaves a chance you get it wrong when you CHANGE the locale. But at that time, you will be on the alert for having made a mistake in operating on the file. Using an environment variable makes lossage possible any time, if you changed your environment for some reason--even when the same file worked correctly yesterday and you have not changed it in weeks. We cannot get rid of the former problem, so we have to accept it. But we can get rid of the latter problem, and we should. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-19 10:36 ` Paul Eggert 1998-12-20 20:29 ` Richard Stallman 1998-12-20 20:29 ` Richard Stallman @ 1998-12-21 12:25 ` Samuel Figueroa 2 siblings, 0 replies; 81+ messages in thread From: Samuel Figueroa @ 1998-12-21 12:25 UTC (permalink / raw) To: rms; +Cc: gcc2, egcs >These arithmetic parameters should not be done with #pragma, not with >any kind of #-command. That is because macro expansions cannot >produce a #-command. I told the committee about this problem ten >years ago, but it seems that the temptation of the #pragma idea >is too strong for mere logic to overcome. I have never been a member of the committee, but I attended a meeting in which your objections to using #pragma for these arithmetic parameters were mentioned. The attitude was that if no one present wished to champion a proposal or objection, it would not be considered at all. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-12 2:21 ` Martin von Loewis 1998-12-13 6:23 ` Richard Stallman @ 1998-12-15 22:00 ` Paul Eggert 1998-12-15 23:17 ` Martin von Loewis 1998-12-16 0:18 ` Per Bothner 1 sibling, 2 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-15 22:00 UTC (permalink / raw) To: martin; +Cc: bothner, gcc2, egcs Date: Sat, 12 Dec 1998 11:18:00 +0100 From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de> > I have misgivings about having GCC support multiple locales > simultaneously. gcc/g++ process strictly-conforming input that is already in the base character set (plus \u escapes, in a way that the standards mandate. Object files are then UTF-8, (or U escapes for C++). But this would mean that \u escapes wouldn't have their intended effect in non-UTF-8 locales. E.g. "\u00b5" would turn into a two-byte multibyte character string, which is incorrect for the common ISO 8859/1 encoding where it is represented by a single byte. gcc/g++ also process input based on the current locale Yes. But the current locale should affect the processing of \u escapes, as well as the recognition of multibyte characters. and pass the input unmodified to the output. This is largely correct for multibyte characters (though their bytes may need escaping to satisfy some assemblers). I think \u will need to be translated, though, if possible -- unless the assembler handles \u, which is not true for gas at least. There is no interworking between the two (i.e: characters in the current locale are not at all related to \u escapes) I'm not sure that this is a good idea, partly for the reasons described above. Tt would mean that \u escapes would turn into gibberish in the vast majority of locales in practical use today. This means that the compiler, in locale-aware mode, would not be strictly conforming, but so what? 
Actually, draft C9x allows the behavior that you propose, because it says that the relationship between multibyte chars and \u is implementation defined. I lobbied for this design freedom; earlier C9x drafts required closer conformance to Unicode (and my impression is that C++ still requires it). I was hoping that this freedom would let GCC (or at least cpp :-) function in a locale-invariant way. But if we go this route, we have several problems: * We won't handle \u the way that users will expect. * We're limited to locales whose multibyte encodings never use ASCII bytes -- and this rules out several popular encodings. * We'll have to disable the checking for identifier spellings in multibyte chars, since we won't know which multibyte chars are letters and/or digits. * In general, assembly language files will not be text files. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-15 22:00 ` Paul Eggert @ 1998-12-15 23:17 ` Martin von Loewis 1998-12-17 7:32 ` Paul Eggert 1998-12-16 0:18 ` Per Bothner 1 sibling, 1 reply; 81+ messages in thread From: Martin von Loewis @ 1998-12-15 23:17 UTC (permalink / raw) To: eggert; +Cc: bothner, gcc2, egcs [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain, Size: 3712 bytes --] > But this would mean that \u escapes wouldn't have their intended > effect in non-UTF-8 locales. E.g. "\u00b5" would turn into a two-byte > multibyte character string, which is incorrect for the common ISO > 8859/1 encoding where it is represented by a single byte. Are you talking about source character set or process character set here? There is no concept of locale character set in C++... If you have char hello[]="Hall\u00f6chen"; you would get "Hall\303\266chen" at run time. This is why the user shouldn't do that, instead she should write wchar hello[]=L"Hall\u00f6chen"; which would then give a wide string where hello[4] is 0xf6. How you use the string at run-time is up to the application; who says the application needs to consider the user's locale when processing that string? Come on. > Yes. But the current locale should affect the processing of \u > escapes, as well as the recognition of multibyte characters. No way. After the discussion with RMS, I agree that we should copy bytes *unmodified* into output. That is, if we have a Latin-1 ö in the input, copy it literally to the output. If we get a \u escape (which the standard says clearly identifies ISO 10646 characters) we should also copy it as-is to the output. Now, there are cases were we can't output the character unmodified. We use UTF-8 in these cases (identifiers, narrow strings) then, this is how we define the process character set. > may need escaping to satisfy some assemblers). 
I think \u will need > to be translated, though, if possible -- unless the assembler handles > \u, which is not true for gas at least. No. gcc shall *not* perform character set conversions, at least for the time being. > I'm not sure that this is a good idea, partly for the reasons > described above. Tt would mean that \u escapes would turn into > gibberish in the vast majority of locales in practical use today. Rubbish (sorry). You seem to know exactly how programmers use these things. Well, I tell you. In a Microsoft COM program, you want to write WCHAR DriverName[] = "\u1234\u5678"; The C standard says you should get Unicode. Microsoft says you should use Unicode in certain situations. This is what you do: You use Unicode, no matter what the system locale is on Windows NT. > * We won't handle \u the way that users will expect. I still don't see this problem. The user expects \u to be Unicode, we give her Unicode. Why would this not be what the user expects? > * We're limited to locales whose multibyte encodings never use ASCII > bytes -- and this rules out several popular encodings. This is indeed a problem. But then, maybe it is not. What's most important is that you can use these encodings in *strings*. The primary reason why the original C restricted identifiers to ASCII was the feeling that otherwise, it "would not work". The primary reason why the standards mandate Unicode and \u escapes is that it might work with this approach, but it still won't work with other encodings. We can either accept that as a fact of life, or come up with something smart. Converting Unicode escapes to an encoding that uses illegal ASCII in assembler doesn't sound too smart to me. > * We'll have to disable the checking for identifier spellings in > multibyte chars, since we won't know which multibyte chars are > letters and/or digits. Well, I think there is agreement that we should process the input unmodified. 
Whether we use the locale functions to define the set of programs we accept; well, maybe. I'd prefer command line options, but there is certainly a problem with that. > * In general, assembly language files will not be text files. Define "text file". Regards, Martin ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-15 23:17 ` Martin von Loewis @ 1998-12-17 7:32 ` Paul Eggert 1998-12-17 16:48 ` Martin von Loewis 1998-12-18 21:31 ` Richard Stallman 0 siblings, 2 replies; 81+ messages in thread From: Paul Eggert @ 1998-12-17 7:32 UTC (permalink / raw) To: martin; +Cc: bothner, gcc2, egcs Date: Wed, 16 Dec 1998 08:12:52 +0100 From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de> > But this would mean ... "\u00b5" would turn into a two-byte > multibyte character string, which is incorrect for the common ISO > 8859/1 encoding where it is represented by a single byte. Are you talking about source character set or process character set here? I'm talking about the execution character set. E.g. printf ("\u00b5") should output a single byte in the Solaris 7 "de" locale, which uses ISO 8859/1. There is no concept of locale character set in C++... There's no such official term in C either, but I think C++ and C use the same basic idea here, namely that a locale (in particular, the LC_CTYPE part of the locale) specifies the rules for how multibyte characters are converted to wide chars, which wide chars are considered to be upper case, etc., etc. These rules are defined by the locale's character set and encoding. char hello[]="Hall\u00f6chen"; you would get "Hall\303\266chen" at run time. That's certainly not true in draft C9x, for non-UTF-8 locales. In draft C9x, if you want "Hall\303\266chen" at run time, you can write "Hall\303\266chen" at compile time. I also suspect that it's not true for C++. It's hard for me to believe that C++ requires UTF-8 encoding for strings at run-time. wchar hello[]=L"Hall\u00f6chen"; This should give the equivalent wide string at run-time. If the implementation uses Unicode wide chars, this is equivalent to "Hall\x00f6chen"; otherwise, it's equivalent to whatever binary encoding they use. 
It's possible for the locale to use Unicode wide strings even though it uses a non-UTF-8 encoding for multibyte chars. (I believe glibc 2.1 does this, but I haven't checked.) But it's not required by the C standard, and some systems use other wide encodings (e.g. JIS). > Yes. But the current locale should affect the processing of \u > escapes, as well as the recognition of multibyte characters. After the discussion with RMS, I agree that we should copy bytes *unmodified* into output. But your example above with `char hello' doesn't copy the bytes unmodified! It translates the 6 chars "\u00f6" to 2 bytes in your locale's charset and encoding, which is the right thing to do; RMS (reluctantly, I think :-) agreed that \u requires locale-dependent translation. I agree that multibyte chars should be copied unmodified into the output. However, as I mentioned earlier, they require locale-specific processing to be *recognized*; otherwise they might be confused with ASCII chars. If we get a \u escape (which the standard says clearly identifies ISO 10646 characters) we should also copy it as-is to the output. Again, you seem to be contradicting your own example. Though draft C9x says that \u identifies ISO 10646 chars, it doesn't require that the implementation use UTF-8 narrow strings, nor does it require that the implementation use Unicode in wide strings. It can use some other encoding, e.g. Shift-JIS or ISO 8859/1 or even Ascii. I assume C++ is similar here. Java is a different animal here; it requires Unicode at run-time. But we're talking about C (and C++), which make no such requirement. WCHAR DriverName[] = "\u1234\u5678"; The C standard says you should get Unicode. All draft C9x says is that you should get the appropriate chars, and that the relationship between those chars and the Unicode chars is implementation-defined. Microsoft says you should use Unicode in certain situations. Absolutely. In a locale that uses Unicode, you should get Unicode. 
Converting Unicode escapes to an encoding that uses illegal ASCII in assembler doesn't sound too smart to me. Sorry, you've lost me. ``illegal ASCII'?? > * In general, assembly language files will not be text files. Define "text file". A file that (among other things) uses a single encoding for its characters. Such files can be processed by standard text tools like wc, iconv, and emacs. You're proposing that assembler files use UTF-8 in some cases, and the locale's multibyte encoding in other cases. Such files can't be processed by standard text tools. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: thoughts on martin's proposed patch for GCC and UTF-8 1998-12-17 7:32 ` Paul Eggert @ 1998-12-17 16:48 ` Martin von Loewis 1998-12-17 22:10 ` Paul Eggert 1998-12-18 21:31 ` Richard Stallman 1 sibling, 1 reply; 81+ messages in thread From: Martin von Loewis @ 1998-12-17 16:48 UTC (permalink / raw) To: eggert; +Cc: bothner, gcc2, egcs > I'm talking about the execution character set. E.g. printf ("\u00b5") > should output a single byte in the Solaris 7 "de" locale, which uses > ISO 8859/1. Is this what you want to happen, or what some standard mandates to happen? If so, what does the standard mandate for printf("\u1234"); > char hello[]="Hall\u00f6chen"; > > you would get "Hall\303\266chen" at run time. > > That's certainly not true in draft C9x, for non-UTF-8 locales. I didn't mean that C++ mandates that. It is implementation defined what happens, and I'm proposing that egcs defines it that way. > In draft C9x, if you want "Hall\303\266chen" at run time, > you can write "Hall\303\266chen" at compile time. You've confused input and output here. The question is not how to achieve a certain output, but how to process a certain input. > I also suspect that it's not true for C++. It's hard for me to > believe that C++ requires UTF-8 encoding for strings at run-time. It doesn't. It doesn't prohibit that, either. > But your example above with `char hello' doesn't copy the bytes > unmodified! It translates the 6 chars "\u00f6" to 2 bytes in your > locale's charset and encoding, which is the right thing to do No, it is not *my* locale, it is how gcc is (or could be) defined. The big difference is predictability. If gcc defines that translation into multibyte characters always means UTF-8 for \u escapes, people know what to expect. If the output *at run time* depends on the setting of environment variables *at compile time*, people will kill us. 
> > If we get a \u escape (which the standard says clearly identifies
> > ISO 10646 characters) we should also copy it as-is to the output.
>
> Again, you seem to be contradicting your own example.

Converting Unicode to UTF-8 is as close as you can get to 'as-is', if
you want to convert arbitrary Unicode to multibyte.

> Java is a different animal here; it requires Unicode at run-time.  But
> we're talking about C (and C++), which make no such requirement.

We also plan to combine C++ and Java.

> > WCHAR DriverName[] = "\u1234\u5678"; [...]
> > Microsoft says you should use Unicode in certain situations.
>
> Absolutely.  In a locale that uses Unicode, you should get Unicode.

Microsoft says you should get Unicode no matter what the locale is.

> > Converting Unicode escapes to an encoding that uses illegal
> > ASCII in assembler doesn't sound too smart to me.
>
> Sorry, you've lost me.  ``illegal ASCII''??

Well, ASCII sequences that are not legal in identifiers.

> You're proposing that assembler files use UTF-8 in some cases, and the
> locale's multibyte encoding in other cases.  Such files can't be
> processed by standard text tools.

I don't want to process assembler files with standard text tools; I
want the assembler to process them.

Martin
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert @ 1998-12-17 22:10 UTC
To: martin; +Cc: bothner, gcc2, egcs

   Date: Fri, 18 Dec 1998 01:44:20 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > E.g. printf ("\u00b5") should output a single byte in the Solaris 7
   > "de" locale, which uses ISO 8859/1.

   Is this what you want to happen, or what some standard mandates to
   happen?

The draft C9x standard mandates only that the implementation define the
relation between \u escapes and the locale's characters.  However, the
intent is that \u00b5 correspond to the ISO 10646-1 MICRO SIGN
character, and the ISO 8859/1 equivalent is the single byte with hex
code b5.

   what does the standard mandate for printf("\u1234");

Again, it's implementation-defined.  If the implementation's encoding
can't represent a Unicode character, the implementation must substitute
some other char.  E.g. printf("\u1234") might print a question mark in
a locale that is limited to ISO 8859/1 chars.

   If gcc defines that translation into multibyte characters always
   means UTF-8 for \u escapes, people know what to expect.

It's true that this would be reproducible behavior, and it would also
conform to the letter of the standard; but it's undesirable (e.g. it
mixes encodings on output) and doesn't conform to the standard's
intent.  It would make \u useless in non-UTF-8 locales.

   If the output *at run time* depends on the setting of environment
   variables *at compile time*, people will kill us.

I think you're right to be leery of environmental settings (as is RMS),
and I also think it wise to prefer explicit settings to environmental
ones.  But it's too strong to rule out the environment entirely.  The
runtime behavior already depends on the values of compile-time
environment variables (e.g. C_INCLUDE_PATH); having one more such
dependency won't kill us.

   > Java is a different animal here; it requires Unicode at run-time.
   > But we're talking about C (and C++), which make no such
   > requirement.

   We also plan to combine C++ and Java.

This means that the C++ side will most likely have to use UTF-8.
That's OK.  For UTF-8 locales I think we're pretty much in agreement.

   Microsoft says you should get Unicode no matter what the locale is.

GCC is used on many non-Microsoft platforms; it can't (and shouldn't
try to) impose Microsoft's rules on everybody else.

   I don't want to process assembler files by standard text tools

You may not need this capability, but other people do.  E.g. GCC's
maintainers need to look at the assembler output to debug GCC itself.
These needs make it desirable to have assembler files be text rather
than some encoding that's not human-readable.
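The substitution behavior Eggert describes — \u00b5 collapsing to the
single byte 0xb5 in an ISO 8859/1 locale, and a question mark standing
in for characters the locale cannot represent — can be sketched in a
few lines of C.  This is an illustration only; the helper name and the
hard-coded Latin-1 logic are assumptions, not GCC code:

```c
#include <stddef.h>

/* Hypothetical helper: translate a universal-character-name code point
   into the ISO 8859/1 execution character set, along the lines draft
   C9x intends.  Returns the number of bytes written, substituting '?'
   for characters the charset cannot represent. */
static size_t ucn_to_latin1(unsigned long code, unsigned char *out)
{
    if (code <= 0xFF) {          /* ISO 8859/1 covers U+0000..U+00FF */
        out[0] = (unsigned char) code;
        return 1;
    }
    out[0] = '?';                /* e.g. \u1234 has no Latin-1 form */
    return 1;
}
```

A real implementation would consult a locale-specific translation table
(or iconv, as suggested earlier in the thread) rather than hard-coding
the Latin-1 range.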
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Richard Stallman @ 1998-12-18 21:31 UTC
To: eggert; +Cc: martin, bothner, gcc2, egcs

    It translates the 6 chars "\u00f6" to 2 bytes in your locale's
    charset and encoding, which is the right thing to do; RMS
    (reluctantly, I think :-) agreed that \u requires locale-dependent
    translation.

We have to do locale-dependent translation for \u in a non-wide string,
because the character meaning of a \u escape is locale-independent,
while the proper multibyte representation of any given character in a
non-wide string is locale-dependent.

It might be appropriate to do locale-dependent translation for \u in a
wide string, in case the locale's wide character representation is not
Unicode.  But maybe it is ok to say, "you lose in that case."
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Per Bothner @ 1998-12-16 0:18 UTC
To: Paul Eggert; +Cc: gcc2, egcs

> But this would mean that \u escapes wouldn't have their intended
> effect in non-UTF-8 locales.  E.g. "\u00b5" would turn into a two-byte
> multibyte character string, which is incorrect for the common ISO
> 8859/1 encoding where it is represented by a single byte.

This is not the case for Java (nor, from what Martin says, for C).  For
Java, the locale specifies how bytes in a disk file are interpreted and
translated into Unicode.  For example, if the locale uses 8859/1
(ISO Latin-1, I believe) then the byte 0xb5 becomes the character
'\u00b5'; if the locale uses UTF-8 (which should be the default, I
think), the 0xb5 is part of a multi-byte encoding.

However, the interpretation of \u00b5 is *not* locale-dependent.  That
is, once the input processor sees the characters '\\', 'u', '0', '0',
'b', '5', they are interpreted as the Unicode character '\u00b5' in
*all* locales.

Whether the \u00b5 is passed through to the assembler for expansion or
not is an implementation detail.  (The Java "phases of translation"
specification requires \u-escapes to be expanded early, so having the
assembler handle it does not seem to be practical.)

In any case, it is quite clear that "\u00b5" becomes a run-time String
object containing one 16-bit character, whatever the locale or external
character encoding.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner
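Bothner's point — that the six source characters of a \u escape denote
one fixed code point in every locale — can be illustrated with a tiny
scanner.  The function name and the -1 error convention are invented
for this sketch:

```c
#include <ctype.h>

/* Illustrative scanner for a \uXXXX escape: once the six characters
   '\\', 'u', '0', '0', 'b', '5' have been seen, the result is the
   code point 0x00B5 regardless of the locale -- nothing here consults
   any locale-dependent table.  Returns -1 on a malformed escape. */
static long parse_u_escape(const char *s)   /* s points at "\uXXXX" */
{
    long code = 0;
    int i;

    if (s[0] != '\\' || s[1] != 'u')
        return -1;
    for (i = 2; i < 6; i++) {
        int c = (unsigned char) s[i];
        if (!isxdigit(c))
            return -1;
        code = code * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    return code;
}
```

The locale only matters later, when (and if) the resulting code point
is converted to the execution character set.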
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Martin von Loewis @ 1998-12-09 23:18 UTC
To: eggert; +Cc: brolley, gcc2, egcs

> I see the need for mangling, but I don't see why TREE_UNIVERSAL_CHAR
> is needed.  When outputting a name, you don't need to have a separate
> flag specifying whether the identifier contains \u; you can just
> inspect the identifier string directly.  This would be
> ASM_OUTPUT_LABELREF's job.

TREE_UNIVERSAL_CHAR is an optimization to avoid inspecting the string.
Note that it is defined for the C++ front end only.  The encoding of
Unicode has to be done in the front end for C++; the length of a class
name depends on the encoding, and it has to get into the mangling.
Also, if the mangling of gxxint.texi is used, Fo\u1234 becomes
U7Fo_1234, where the U indicates that the underscore is an escape.  The
back end can't know this concept.

> Also, I assume that once the patch is generalized to non-UTF-8
> locales, it won't be just the \u and \U escapes that require mangling.

There is no need to generalise that.  Defining object files to use
Unicode is the right thing :-)

> If the object-code standard is to use UTF-8 names, then I suppose the
> assembler can convert to UTF-8.

No.  The gas people made it very clear that they consider character
sets somebody else's problem (i.e. ours).

> Sorry, I don't understand this point.  If you're saying that C++
> mangles non-ASCII identifiers into ASCII labels, but C doesn't, then I
> don't see why that should be: there's no reason in principle that C
> couldn't or shouldn't use the same sort of mangling.

Sure there is.  Look at the example above, and see how you can't do
that service for C linkage.

> I've run into shells that use the top bit for their own purposes.

What system?

> And, even if such shells are discounted, it's a bit odd to use UTF-8
> in configure.in without labeling the file.  My Emacs (20.3)
> misidentified the file as being ISO Latin 1.

So what?  This tests whether the assembler can process a certain
sequence of uninterpreted bytes (well, whether they are interpreted is
up to the assembler).  The point is to test a feature, not to look nice
in Emacs.  Please tell me how I can perform the same test with
ASCII-only shell commands, and I happily convert.

> Really?  Suppose I write the preprocessor line
>
>   #if X == 1
>
> where X is some Japanese identifier, but I make the understandable
> mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an
> ASCII 1.

\uFF11 is not a letter in C++, so this is ill-formed and will be
rejected.  The same holds for the Arabic digits.  If you want to write
numbers in C++, use ASCII 0-9.

Regards,
Martin
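The distinction Martin draws — \uFF11 counts neither as a digit nor as
a letter for identifier purposes — amounts to a pair of predicate
functions over code points.  The sketch below is deliberately partial
(a real table would be generated from the ISO/IEC TR 10176 annex); the
function names and the few sample ranges are assumptions for
illustration:

```c
/* Partial, illustrative classification of identifier characters.
   Only ASCII 0-9 are digits, so FULLWIDTH DIGIT ONE (U+FF11) is
   neither a digit nor a letter and would be rejected in an
   identifier or a #if expression. */
static int is_identifier_digit(unsigned long code)
{
    return code >= '0' && code <= '9';
}

static int is_identifier_letter(unsigned long code)
{
    if ((code >= 'A' && code <= 'Z') || (code >= 'a' && code <= 'z')
        || code == '_')
        return 1;
    /* A couple of the many ranges TR 10176 allows, for illustration: */
    if (code >= 0x4E00 && code <= 0x9FA5)    /* CJK unified ideographs */
        return 1;
    if (code == 0xB5 || code == 0xF6)        /* MICRO SIGN, o-umlaut */
        return 1;
    return 0;
}
```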
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Ian Lance Taylor @ 1998-12-10 7:57 UTC
To: martin; +Cc: eggert, brolley, gcc2, egcs

   Date: Thu, 10 Dec 1998 08:12:20 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   > If the object-code standard is to use UTF-8 names, then I suppose
   > the assembler can convert to UTF-8.

   No.  The gas people made it very clear that they consider character
   sets somebody else's problem (i.e. ours).

That is too strong.  For hand-coded assembler, I can see that there may
be a need for gas to do some character set conversions.  Also, if it is
ever possible for an identifier name to include a byte value which gas
will consider to be an operator, then it is clearly necessary for gas
to permit quoting that byte value, and perhaps to do more general
character set conversions.

In general, though, if gcc needs to understand character set issues,
which appears to be the case, and if it can emit identifiers in a
manner which will not confuse gas, then I think it is reasonable for
gcc to emit identifiers as uninterpreted byte sequences, and for gas to
simply pass those identifiers straight through into the object file.

I can't claim to understand many of the issues here, though.

Several people have mentioned the linker as an issue.  To the best of
my knowledge, the linker will permit any byte value except 0 to appear
in an identifier.  I don't see why the linker has to change at all for
any character set issues.

Ian
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Martin von Loewis @ 1998-12-10 13:12 UTC
To: ian; +Cc: eggert, brolley, gcc2, egcs

> Also, if it is ever possible for an identifier name to include a
> byte value which gas will consider to be an operator, then it is
> clearly necessary for gas to permit quoting that byte value, and
> perhaps to do more general character set conversions.

Fortunately, UTF-8 uses only byte values above 128, plus ASCII; so you
don't get special characters in identifiers, since those are already
banned by C.

> Several people have mentioned the linker as an issue.  To the best of
> my knowledge, the linker will permit any byte value except 0 to appear
> in an identifier.  I don't see why the linker has to change at all for
> any character set issues.

I've tried the binutils linker, and it is happy with any byte sequence.
Of course, there still might be linkers that do care about characters
above 128.  Maybe we should perform some manual tests now, or even have
an autoconf test.  OTOH, people will complain when the linker
complains...

Regards,
Martin
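The property Martin relies on — UTF-8 confines itself to ASCII plus
bytes with the top bit set — falls straight out of the encoding.  A
minimal BMP-only encoder (an illustration, not GCC or gas code) makes
it visible:

```c
#include <stddef.h>

/* Minimal UTF-8 encoder for code points up to U+FFFF.  Every byte of
   a non-ASCII character has the top bit set (lead bytes 0xC0..0xEF,
   continuation bytes 0x80..0xBF), so no byte of an encoded identifier
   can collide with an ASCII operator or delimiter the assembler
   recognizes. */
static size_t encode_utf8(unsigned long code, unsigned char *out)
{
    if (code < 0x80) {                       /* plain ASCII */
        out[0] = (unsigned char) code;
        return 1;
    }
    if (code < 0x800) {                      /* two-byte sequence */
        out[0] = (unsigned char) (0xC0 | (code >> 6));
        out[1] = (unsigned char) (0x80 | (code & 0x3F));
        return 2;
    }
    out[0] = (unsigned char) (0xE0 | (code >> 12));      /* three bytes */
    out[1] = (unsigned char) (0x80 | ((code >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (code & 0x3F));
    return 3;
}
```

For example, MICRO SIGN (U+00B5) encodes as the two bytes 0xC2 0xB5 —
the \302\265 that appears in the mangling example later in the thread.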
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert @ 1998-12-11 19:32 UTC
To: ian; +Cc: martin, brolley, gcc2, egcs

   Date: Thu, 10 Dec 1998 10:57:10 -0500
   From: Ian Lance Taylor <ian@cygnus.com>

   I think it is reasonable for gcc to emit identifiers as
   uninterpreted byte sequences, and for gas to simply pass those
   identifiers straight through into the object file.

Yes, that should work.

   Several people have mentioned the linker as an issue.  To the best
   of my knowledge, the linker will permit any byte value except 0 to
   appear in an identifier.  I don't see why the linker has to change
   at all for any character set issues.

Perhaps people are thinking that the user might want to link files that
were compiled in different locales.  E.g. one user compiles with
C-language function names in Shift-JIS, whereas another user compiles
with them encoded in EUC-JIS.

These scenarios are fanciful now, because nobody compiles with
non-ASCII names.  I see no particular reason why the linker (or the
compiler or assembler) would have to support such scenarios.  Nobody is
doing this sort of thing now, and I think few if any users will require
this behavior in the future.
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Ken Raeburn @ 1998-12-11 19:34 UTC
To: Ian Lance Taylor; +Cc: martin, eggert, brolley, gcc2, egcs

Ian Lance Taylor <ian@cygnus.com> writes:

> Several people have mentioned the linker as an issue.  To the best of
> my knowledge, the linker will permit any byte value except 0 to appear
> in an identifier.  I don't see why the linker has to change at all for
> any character set issues.

Linker script processing with odd symbol, section, or file names?
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Ian Lance Taylor @ 1998-12-14 17:05 UTC
To: raeburn; +Cc: martin, eggert, brolley, gcc2, egcs

   From: Ken Raeburn <raeburn@cygnus.com>
   Date: 11 Dec 1998 22:35:51 -0500

   Ian Lance Taylor <ian@cygnus.com> writes:

   > Several people have mentioned the linker as an issue.  To the best
   > of my knowledge, the linker will permit any byte value except 0 to
   > appear in an identifier.  I don't see why the linker has to change
   > at all for any character set issues.

   Linker script processing with odd symbol, section, or file names?

You're right, I hadn't considered those.  That can be a problem for
somebody to solve some day.

Ian
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Paul Eggert @ 1998-12-11 19:28 UTC
To: martin; +Cc: brolley, gcc2, egcs

   Date: Thu, 10 Dec 1998 08:12:20 +0100
   From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>

   Also, if the mangling of gxxint.texi is used, Fo\u1234 becomes
   U7Fo_1234, where the U indicates that the underscore is an escape....

Sorry, I'm still lost.  If the identifier is the UTF-8 character MICRO
SIGN (code 00B5), do you generate the same UTF-8 character on output,
or do you mangle it as if the user had typed `\u00b5'?  If the latter,
then I don't understand why gas needs to be 8-bit clean; if the former,
then I don't understand your example with \u1234, as it seems to me
that it won't unify with the UTF-8 sequence that is equivalent to
\u1234.

   > I've run into shells that use the top bit for their own purposes.

   What system?

Older BSD systems.  The original Bourne shell used the top bit for its
own purposes.  A few years back, all major Unix suppliers went through
their shells and made them 8-bit clean, but a few bugs lurked for a
while and I wouldn't be surprised if some were still out there.

   Please tell me how I can perform the same test with ASCII-only shell
   commands, and I happily convert.

Something like this should do it:

	echo ab | tr 'ab' '\123\456'

Or you could write a little C program, compile it, and run it.

   > C++ does not distinguish between non-ASCII digits and letters.
   >
   > Really?  Suppose I write the preprocessor line
   >
   >   #if X == 1
   >
   > where X is some Japanese identifier, but I make the understandable
   > mistake of using a FULLWIDTH DIGIT ONE (code FF11) instead of an
   > ASCII 1.

   \uFF11 is not a letter in C++

OK, so then there's no problem: C++ _does_ distinguish between
non-ASCII digits and letters.
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Martin von Loewis @ 1998-12-12 1:06 UTC
To: eggert; +Cc: brolley, gcc2, egcs

> Sorry, I'm still lost.  If the identifier is the UTF-8 character MICRO
> SIGN (code 00B5), do you generate the same UTF-8 character on output,
> or do you mangle it as if the user had typed `\u00b5'?

Suppose I have a class

	class µ {
	  µ();		// This should read MICRO SIGN
	};

Then, the compiler tests at installation time whether the assembler on
the system is 8-bit clean.  If it is, the constructor is mangled as

	__2\302\265v

If the assembler does not support 8-bit symbols, it is mangled as

	__U5_00b5

This is what jc1 currently does.

> echo ab | tr 'ab' '\123\456'

Thanks, this looks good.

> OK, so then there's no problem: C++ _does_ distinguish between
> non-ASCII digits and letters.

Right.  It just doesn't distinguish between non-ASCII digits and
non-ASCII non-alphanumerics :-)  That's why no predicate function was
needed.

Regards,
Martin
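The ASCII fallback described here follows the gxxint.texi scheme
mentioned earlier in the thread: each \uXXXX escape in an identifier
becomes _xxxx, and the result is prefixed with `U' plus its length, so
Fo\u1234 mangles to U7Fo_1234 and \u00b5 alone to U5_00b5.  The sketch
below is purely illustrative; the function name and buffer handling are
assumptions, not the real front-end code:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative fallback mangling: replace each \uXXXX escape with
   _xxxx, then prefix 'U' and the length of the resulting name, so
   "Fo\u1234" mangles to "U7Fo_1234". */
static void mangle_with_ucn(const char *id, char *out, size_t outsz)
{
    char body[128];
    size_t n = 0;

    while (*id && n + 5 < sizeof body) {
        if (id[0] == '\\' && id[1] == 'u') {
            body[n++] = '_';
            memcpy(body + n, id + 2, 4);   /* copy the four hex digits */
            n += 4;
            id += 6;
        } else {
            body[n++] = *id++;
        }
    }
    body[n] = '\0';
    snprintf(out, outsz, "U%zu%s", n, body);
}
```

The length prefix matters because, as Martin notes, a class name's
mangled length depends on the encoding chosen, which is why this has to
happen in the front end.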
* Re: thoughts on martin's proposed patch for GCC and UTF-8
From: Jonathan Larmour @ 1998-12-10 11:31 UTC
To: tim; +Cc: gcc2, egcs

In article <199812100200.VAA06419.cygnus.egcs@wagner.Princeton.EDU> you write:

[ #if X == 1 is a problem if "1" is not ASCII 1 ]

> gcc really should complain "undeclared identifier 'x' in preprocessor
> conditional expression has value 0" when -Wall is specified.

It already does this check if you use -Wundef.  It's in the info page,
and not the man page, unfortunately.

`-Wundef'
     Warn if an undefined identifier is evaluated in an `#if'
     directive.

Jifl
-- 
Cygnus Solutions, 35 Cambridge Place, Cambridge, UK.  Tel: +44 (1223) 728762
"Women marry hoping their husbands will change, men || Home e-mail: jifl @
 marry hoping their wives never do.  Both are rare." || jifvik.demon.co.uk
Help fight spam! http://spam.abuse.net/  These opinions are all my own fault