From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-help-return-33808-listarch-gcc-help=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 20155 invoked by alias); 21 Aug 2008 11:50:29 -0000
Received: (qmail 20147 invoked by uid 22791); 21 Aug 2008 11:50:29 -0000
X-Spam-Check-By: sourceware.org
Received: from exprod6og107.obsmtp.com (HELO exprod6og107.obsmtp.com) (64.18.1.208)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Thu, 21 Aug 2008 11:49:36 +0000
Received: from source ([192.150.8.22]) by exprod6ob107.postini.com ([64.18.5.12]) with SMTP; 	Thu, 21 Aug 2008 04:49:33 PDT
Received: from inner-relay-3.eur.adobe.com (inner-relay-3b [10.128.4.236]) 	by outbound-smtp-2.corp.adobe.com (8.12.10/8.12.10) with ESMTP id m7LBnVE0024492; 	Thu, 21 Aug 2008 04:49:31 -0700 (PDT)
Received: from fe1.corp.adobe.com (fe1.corp.adobe.com [10.8.192.70]) 	by inner-relay-3.eur.adobe.com (8.12.10/8.12.9) with ESMTP id m7LBnSqJ007562; 	Thu, 21 Aug 2008 04:49:29 -0700 (PDT)
Received: from namailgen.corp.adobe.com ([10.8.192.91]) by fe1.corp.adobe.com with Microsoft SMTPSVC(6.0.3790.1830); 	 Thu, 21 Aug 2008 04:49:28 -0700
Received: from 10.32.16.88 ([10.32.16.88]) by namailgen.corp.adobe.com ([10.8.192.91]) via Exchange Front-End Server namail.corp.adobe.com ([10.8.189.100]) with Microsoft Exchange Server HTTP-DAV ;  Thu, 21 Aug 2008 11:49:27 +0000
User-Agent: Microsoft-Entourage/12.11.0.080522
Date: Thu, 21 Aug 2008 12:15:00 -0000
Subject: Re: UTF-8, UTF-16 and UTF-32
From: John Love-Jensen <eljay@adobe.com>
To: Dallas Clarke <DClarke@unwired.com.au>, GCC-help <gcc-help@gcc.gnu.org>
Message-ID: <C4D2C073.3312C%eljay@adobe.com>
In-Reply-To: <000d01c90377$1e0c1670$3b9c65dc@testserver>
Mime-version: 1.0
Content-type: text/plain; 	charset="US-ASCII"
Content-transfer-encoding: 7bit
X-IsSubscribed: yes
Mailing-List: contact gcc-help-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-help.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-help/>
List-Post: <mailto:gcc-help@gcc.gnu.org>
List-Help: <mailto:gcc-help-help@gcc.gnu.org>
Sender: gcc-help-owner@gcc.gnu.org
X-SW-Source: 2008-08/txt/msg00211.txt.bz2

Hi Dallas,

> Thanks for your reply, but with Pictorial languages such as Cantonese and
> Mandarin, that have up to 60,000 character in the full set (one picture for
> each word), using locality page sheets with UTF-8 is limited.

UTF-8 does not use locality page sheets.  (Are you conflating UTF-8 and
Windows code pages?  As in the difference between the FooA() ANSI code page
(ACP) routines and the FooW() wide-character routines?)

UTF-8 encodes Unicode characters from U+0000 to U+10FFFF in a variable
number of octets, 1 to 4 octets (1-4 bytes).  UTF-8 supports the entire
gamut of Unicode characters.
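
To make the 1-to-4 octet rule concrete, here is a minimal sketch of a UTF-8
encoder for a single code point.  The function name encode_utf8 is my own
invention for illustration, not part of GCC or of any library, and the sketch
simply returns an empty vector for values that are not valid Unicode scalar
values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode scalar value as UTF-8 octets.
// Returns an empty vector for surrogates (U+D800..U+DFFF) and for
// anything above U+10FFFF, which UTF-8 does not encode.
std::vector<std::uint8_t> encode_utf8(std::uint32_t cp)
{
  std::vector<std::uint8_t> out;
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return out;

  if (cp < 0x80) {
    // 1 octet: 0xxxxxxx
    out.push_back(static_cast<std::uint8_t>(cp));
  } else if (cp < 0x800) {
    // 2 octets: 110xxxxx 10xxxxxx
    out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
    out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
  } else {
    // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
  }

  // Every valid scalar value takes between 1 and 4 octets.
  assert(out.size() >= 1 && out.size() <= 4);
  return out;
}
```

A CJK ideograph such as U+4E2D comes out as three octets, so the large Han
repertoire needs no code pages at all.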

UTF-16 encodes Unicode characters from U+0000 to U+10FFFF in a variable
number of 16-bit chunks, 1 or 2 of them (2 or 4 bytes).
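
The "1 or 2 chunks" rule can be sketched the same way.  Again, encode_utf16
is a hypothetical name of my own choosing, and the sketch ignores byte order
(it produces 16-bit values, not serialized bytes).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode scalar value as UTF-16 units.
// Returns an empty vector for surrogates and for values above U+10FFFF.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
  std::vector<std::uint16_t> out;
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return out;

  if (cp < 0x10000) {
    // Basic Multilingual Plane: one 16-bit unit, value unchanged.
    out.push_back(static_cast<std::uint16_t>(cp));
  } else {
    // Supplementary planes: subtract 0x10000 to get a 20-bit value,
    // then split it into a high and a low surrogate of 10 bits each.
    const std::uint32_t v = cp - 0x10000;
    out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
    out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
  }

  // Every valid scalar value takes exactly 1 or 2 units.
  assert(out.size() == 1 || out.size() == 2);
  return out;
}
```

Anything outside the Basic Multilingual Plane takes a surrogate pair, so
UTF-16 is a variable-width encoding just as UTF-8 is.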

UTF-32 encodes Unicode characters from U+0000 to U+10FFFF in a single
32-bit chunk (4 bytes), with 11 of the 32 bits being fallow.

> GCC and MS VC++ are now inconsistent with their wchar_t types and this
> difference will make it nearly impossible for us to continue supporting
> Linux, i.e. in a choice between Linux and Windows, I have to follow my
> customers.

GCC and MS VC++ are not inconsistent.  Both of those compilers comply with
the ABI of the platform that they target.

There is no requirement in any platform's ABI that I work with that char be
UTF-8 and wchar_t be UTF-16 or UTF-32.
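
As a rough illustration of why the two compilers differ: the width of
wchar_t comes from the target platform (16 bits on Windows, 32 bits on
Linux with glibc), and code can query it rather than assume it.  The helper
names below are hypothetical, made up for this sketch.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: number of bits in one wchar_t on the platform
// this translation unit is compiled for.
constexpr std::size_t wchar_bits()
{
  return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true when a single wchar_t is wide enough to hold
// any code point up to U+10FFFF, which needs 21 bits.
constexpr bool wchar_holds_any_code_point()
{
  return wchar_bits() >= 21;
}
```

Portable code that needs a fixed-width unit should use a fixed-width type
rather than depend on the answer.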

Perhaps what you need is to make your own character type (or, technically,
encoding unit type):

#include <stdint.h>  // for uint8_t, uint16_t, uint32_t

struct UTF8
{
  typedef uint8_t Type;
  Type mEncodingUnit;
};

struct UTF16
{
  typedef uint16_t Type;
  Type mEncodingUnit;
};

struct UTF32
{
  typedef uint32_t Type;
  Type mEncodingUnit;
};

Or use a Unicode savvy library like ICU <http://www.icu-project.org/>.

> I am not trying to deny UTF-32 or saying that GCC should not support it, I
> am saying that GCC should support all three Unicode formats because UTF-16
> is a format that I have to deal with in the real world. Why not support all
> three formats?

GCC does not support Unicode.

Some libraries (that are not part of GCC) support Unicode.

Perhaps parts of the OS support Unicode, in some transformation format, with
their LANG environment, or Windows' 65001, 65005, 65006, 1200, 1201 code
pages, or Mac OS X's kCFStringEncodingUnicode, kCFStringEncodingUTF8,
kCFStringEncodingUTF16, kCFStringEncodingUTF16BE, kCFStringEncodingUTF16LE,
kCFStringEncodingUTF32, kCFStringEncodingUTF32BE, kCFStringEncodingUTF32LE.

The only computer languages that I'm aware of that support Unicode are:
+ Python 2.3 (somewhat, as an opt-in transition feature)
+ Python 2.5 (somewhat)
+ Python 3.0 (very well)
+ D Programming Language (very well)
+ Java (very well)

My favorite computer languages do NOT support Unicode "out of the box" (by
"support" I mean both accepting Unicode source code and being able to target
Unicode applications):
+ C
+ C++
+ Lua

With add-on libraries and/or OS API support, discipline, and a bit of luck,
those languages can target Unicode applications.

I can't see Lua supporting Unicode "out of the box" without increasing its
tiny embedded scripting engine footprint by over an order of magnitude.

> As someone with has written a scripting language based on C++, I can tell
> you that changing the 'wchar_t' to something else would only take five
> minutes - it wouldn't break any thing.

It would break the OS ABI, which is defined by the OS, not by the compiler.

HTH,
--Eljay