From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paul Eggert
To: bothner@cygnus.com
Cc: gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Tue, 22 Dec 1998 02:35:00 -0000
Message-id: <199812221034.CAA08773@shade.twinsun.com>
References: <199812220502.VAA10296@cygnus.com>
X-SW-Source: 1998-12/msg00846.html

   Date: Mon, 21 Dec 1998 21:02:31 -0800
   From: Per Bothner

   > (3) GCC transliterates each \u escape in a string to the string's charset,
   > which is specified as described in (1) above.

   Hm.  (1) above specifies the *file's* charset.  It does not follow
   that the *string's* charset is the same.  Certainly for Java, it
   would not be.

(1) also specifies the string's charset in C, because you can switch
charsets in the middle of a file, e.g. with _Pragma ("charset Shift_JIS")
or whatever.

   What happens to:
       wchar_t x = '\u1234';  /* or: L'\u1234' */
   are these different from:
       wchar_t x = (wchar_t) 0x1234;

Yes, e.g. the string's charset might specify JIS for wide characters.

   I assume your proposal is that the string charset at least by default
   should be the file charset except for Java where the string charset
   is Unicode.

Yes.

   > If the input character set is a superset of UTF-8
   > (e.g. ISO-2022-JP), then the extra information is lost.

   I'm confused.  I thought that Unicode was specifically designed so
   that distinct characters in existing Japanese character standards
   were mapped into distinct Unicode characters.  Did I misunderstand,
   or is ISO-2022-JP not one of the "source" character sets the Unicode
   designers used?

You understood correctly.  To some extent, ISO-2022-JP and Unicode are
competing standards.  ISO-2022-JP distinguishes between (say) the
Japanese and Chinese forms of the same character, whereas Unicode does
not.  Right now, my impression is that ISO-2022-JP is used more often
in the Japanese world than Unicode is.  This is certainly true for
email.  Microsoft is pushing Unicode mightily in the DOS and NT
domains, though.

There is little call for distinguishing Chinese from Japanese in
identifiers.  So it's OK if GCC supports only the Unicode ``subset''
of ISO-2022-JP in identifiers.

If there are ISO-2022-JP partisans who are disturbed by this part of
my proposal, then I have some reassurance for them.  Rumor has it that
ISO 10646 might be officially extended so that it will become a
functional superset of ISO-2022-JP.  (This is the ``plane-14''
language-tagging effort.)  This will require more than 16 bits per
character, so it won't be Unicode, and presumably Java char and string
won't support it (unless Java is also extended); but C and C++ will
support plane-14, because they already have \u escapes for 32-bit
characters, and allow UTF-8 implementations (which also support 32-bit
chars).  If and when the plane-14 proposal becomes a standard, then C
and C++ could distinguish between Chinese and Japanese in identifiers
under my proposal.

Isn't internationalization fun?
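
To make the ``more than 16 bits'' point concrete, here is a minimal sketch in C of a UTF-8 encoder. The name utf8_encode is hypothetical and is not part of GCC or of any patch under discussion. It also assumes the range U+0000..U+10FFFF, which is enough to reach plane 14 but is narrower than the 32-bit range mentioned above. A 16-bit code such as U+1234 takes three bytes, while a plane-14 code such as U+E0001 does not fit in 16 bits and takes four.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: encode one code point
   as UTF-8.  Returns the number of bytes stored in OUT (1 to 4), or 0
   if CP is a surrogate or lies beyond U+10FFFF (an assumed limit).  */
static size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      /* One byte: 0xxxxxxx.  */
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      /* Two bytes: 110xxxxx 10xxxxxx.  */
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      /* Surrogates are not characters, so reject them.  */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
      /* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.  */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      /* Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
         Plane 14 (U+E0000..U+EFFFF) lands here.  */
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

Calling utf8_encode (0x1234, buf) stores the three bytes E1 88 B4, and utf8_encode (0xE0001, buf) stores the four bytes F3 A0 80 81.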