From mboxrd@z Thu Jan  1 00:00:00 1970
From: Martin von Loewis <martin@mira.isdn.cs.tu-berlin.de>
To: bothner@cygnus.com
Cc: eggert@twinsun.com, gcc2@gnu.org, egcs@cygnus.com
Subject: Re: thoughts on martin's proposed patch for GCC and UTF-8
Date: Mon, 28 Dec 1998 08:10:00 -0000
Message-id: <199812281608.RAA00512@mira.isdn.cs.tu-berlin.de>
References: <199812220502.VAA10296@cygnus.com>
X-SW-Source: 1998-12/msg00939.html

> I'm confused.  I thought that Unicode was specifically designed
> so that dictinct characters in existing Japanese character
> standards were mapped into distinct Unicode characters.

Paul already answered that; I'd like to add a different angle.

ISO 2022 uses escape sequences to switch between different character
sets. ISO-2022-JP combines four different character sets in this way.
Now, these character sets can overlap. In such cases, Unicode
typically unifies the overlapping characters, whereas ISO 2022 leaves
them as-is.

The argument is about which approach is the right one. For example,
there are four encodings for "LATIN CAPITAL LETTER A":
ESC ( B A         (ASCII)
ESC ( J A         (JIS X 0201)
ESC $ @ # A       (JIS X 0208-1978)
ESC $ B # A       (JIS X 0208-1983) (*)
Unicode has only one character here (U+0041). In other places, Unicode
was probably wrong to unify (Han Unification).

Not that I want to push a particular solution, but converted to
Unicode and encoded in UTF-8, all four encodings would give the same
result:
A
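
To check the four sequences against an actual conversion table, here
is a minimal sketch using Python's standard "iso2022_jp" codec. The
names ESC, SEQUENCES and decode_all are only illustrative. Which code
point the two JIS X 0208 forms come out as depends on the table the
converter uses, so the sketch prints what the codec returns and also
the NFKC compatibility folding, rather than assuming a result.

```python
import unicodedata

# The four ISO-2022-JP byte sequences listed above (ESC is byte 0x1B).
ESC = b"\x1b"
SEQUENCES = {
    "ASCII": ESC + b"(B" + b"A",
    "JIS X 0201": ESC + b"(J" + b"A",
    "JIS X 0208-1978": ESC + b"$@" + b"#A",
    "JIS X 0208-1983": ESC + b"$B" + b"#A",
}


def decode_all(sequences):
    """Decode each sequence; return {label: (text, NFKC-folded text)}."""
    results = {}
    for label, raw in sequences.items():
        # Switch back to ASCII at the end so each input is a complete message.
        text = (raw + ESC + b"(B").decode("iso2022_jp")
        results[label] = (text, unicodedata.normalize("NFKC", text))
    return results


if __name__ == "__main__":
    for label, (text, folded) in decode_all(SEQUENCES).items():
        codepoints = " ".join(f"U+{ord(c):04X}" for c in text)
        utf8_hex = text.encode("utf-8").hex()
        print(f"{label:16} {codepoints:8} utf-8={utf8_hex} nfkc={folded!r}")
```

If the codec's table assigns the two-byte forms a separate
compatibility character, the NFKC column shows whether that character
still folds to the plain letter.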

Regards,
Martin

(*) Somebody please correct me if my tables are wrong. The three-byte
escape sequence can be omitted if the previous characters are already
in this encoding.
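
The point about omitting the escape sequence can be tried out with a
small sketch, again assuming Python's standard "iso2022_jp" codec; the
variable names are only illustrative. It decodes two consecutive
JIS X 0208-1983 characters once with a single designation and once
with the designation repeated, and compares the results.

```python
ESC = b"\x1b"

# One designation (ESC $ B), then two two-byte characters, "#A" and "#B",
# followed by a switch back to ASCII (ESC ( B).
shared = ESC + b"$B" + b"#A" + b"#B" + ESC + b"(B"

# The same two characters with the designation repeated before each one.
repeated = ESC + b"$B" + b"#A" + ESC + b"$B" + b"#B" + ESC + b"(B"

shared_text = shared.decode("iso2022_jp")
repeated_text = repeated.decode("iso2022_jp")

# Print the number of decoded characters and whether both forms agree.
print(len(shared_text), shared_text == repeated_text)
```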