From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin von Loewis
To: bothner@cygnus.com
Cc: eggert@twinsun.com, gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Mon, 28 Dec 1998 08:10:00 -0000
Message-id: <199812281608.RAA00512@mira.isdn.cs.tu-berlin.de>
References: <199812220502.VAA10296@cygnus.com>
X-SW-Source: 1998-12/msg00939.html

> I'm confused. I thought that Unicode was specifically designed
> so that distinct characters in existing Japanese character
> standards were mapped into distinct Unicode characters.

Paul already answered that; I'd like to add to it from a different
angle.

ISO 2022 uses escape sequences to switch between different character
sets. ISO-2022-JP combines four different character sets in this way.
Now, the character sets can overlap. In such cases, Unicode typically
unifies the overlapping characters, whereas ISO 2022 leaves them as
they are. The argument is about which is the right thing to do.

For example, there are four encodings for "LATIN CAPITAL LETTER A":

ESC ( B A      (ASCII)
ESC ( J A      (JIS X 0201)
ESC $ @ # A    (JIS X 0208-1978)
ESC $ B # A    (JIS X 0208-1983)
(*)

Unicode has only one character here (U+0041). In other places, Unicode
probably was wrong to unify (Han Unification).

Not that I want to push a particular solution: Converted to Unicode,
encoded in UTF-8, we would get the following for all four encodings:

A

Regards,
Martin

(*) Somebody correct me if my tables are wrong. The three-byte
escape sequence can be omitted if the previous characters are already
in this encoding.
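
The table of four encodings above can be checked against one real set of conversion tables. What follows is only a minimal sketch, assuming CPython's built-in iso2022_jp codec as the table source; other converters may map differently. The names ENCODINGS, decode_all and fold are made up for illustration. One caveat, in the spirit of the footnote: with CPython's tables the two JIS X 0208 sequences decode to U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A) rather than U+0041, so the sketch also applies NFKC normalization, which folds the fullwidth form onto U+0041 before encoding to UTF-8.

```python
import unicodedata

ESC = b"\x1b"

# The four byte sequences from the table (escape sequence + character bytes).
# The dictionary and its name are illustrative, not part of any library.
ENCODINGS = {
    "ASCII": ESC + b"(B" + b"A",
    "JIS X 0201": ESC + b"(J" + b"A",
    "JIS X 0208-1978": ESC + b"$@" + b"#A",
    "JIS X 0208-1983": ESC + b"$B" + b"#A",
}


def decode_all(encodings=ENCODINGS):
    """Decode each sequence with CPython's iso2022_jp codec.

    ESC ( B is appended so that every text ends in ASCII, which is how
    ISO-2022-JP text is expected to end.
    """
    return {
        name: (raw + ESC + b"(B").decode("iso2022_jp")
        for name, raw in encodings.items()
    }


def fold(text):
    """Apply NFKC, which maps fullwidth forms onto their ASCII counterparts."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for name, text in decode_all().items():
        folded = fold(text)
        print(
            f"{name:16} decoded U+{ord(text):04X}, "
            f"NFKC U+{ord(folded):04X}, UTF-8 {folded.encode('utf-8')!r}"
        )
```

Whether the fullwidth form should count as "the same character" is exactly the unification question discussed above; the NFKC step here is one possible answer, not the only one.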