public inbox for gcc-help@gcc.gnu.org
From: John Love-Jensen <eljay@adobe.com>
To: Dallas Clarke <DClarke@unwired.com.au>, GCC-help <gcc-help@gcc.gnu.org>
Subject: Re: UTF-8, UTF-16 and UTF-32
Date: Thu, 21 Aug 2008 12:15:00 -0000
Message-ID: <C4D2C073.3312C%eljay@adobe.com>
In-Reply-To: <000d01c90377$1e0c1670$3b9c65dc@testserver>

Hi Dallas,

> Thanks for your reply, but with Pictorial languages such as Cantonese and
> Mandarin, that have up to 60,000 character in the full set (one picture for
> each word), using locality page sheets with UTF-8 is limited.

UTF-8 does not use locality page sheets.  (Are you conflating UTF-8 and
Windows code pages?  À la the difference between the FooA() ACP routines and
the FooW() wide-character routines?)

UTF-8 encodes Unicode characters from U+0000 to U+10FFFF in a variable
number of octets, 1 to 4 octets (1-4 bytes).  UTF-8 supports the entire
gamut of Unicode characters.
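
A minimal sketch of the 1-to-4 octet scheme, to make the variable length
concrete.  The helper name encode_utf8 is made up for illustration, and
returning an empty string for invalid input is just one possible convention:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates (U+D800..U+DFFF) and for
// anything above U+10FFFF, since neither is a valid character to encode.
std::string encode_utf8(std::uint32_t cp)
{
  std::string out;
  if (cp <= 0x7F) {
    // 1 octet: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp <= 0x7FF) {
    // 2 octets: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0xFFFF) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate, reject
    // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0x10FFFF) {
    // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

Characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF) land in
the 3-octet branch; no code page is involved.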

UTF-16 encodes Unicode characters from U+0000 to U+10FFFF in a variable
number of 16-bit chunks, 1 or 2 of them (2 or 4 bytes).
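
The "1 or 2" case split can be sketched like this.  The name encode_utf16
is hypothetical, and returning an empty vector for invalid input is an
assumption of this sketch:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16
// encoding units.  Returns an empty vector for surrogate values and
// for anything above U+10FFFF.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
  std::vector<std::uint16_t> out;
  if (cp > 0x10FFFF) return out;                 // outside Unicode
  if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // lone surrogate
  if (cp <= 0xFFFF) {
    // Basic Multilingual Plane: one 16-bit unit, stored as-is.
    out.push_back(static_cast<std::uint16_t>(cp));
  } else {
    // Supplementary planes: subtract 0x10000, leaving 20 bits, and
    // split them into a high and a low surrogate (10 bits each).
    std::uint32_t v = cp - 0x10000;
    out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
    out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
  }
  return out;
}
```

So code that assumes one 16-bit unit per character is wrong for anything
above U+FFFF.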

UTF-32 encodes Unicode characters from U+0000 to U+10FFFF in a single
32-bit chunk (4 bytes), with 11 of the 32 bits being fallow.

> GCC and MS VC++ are now inconsistent with their wchar_t types and this
> difference will make it nearly impossible for us to continue supporting
> Linux, i.e. in a choice between Linux and Windows, I have to follow my
> customers.

GCC and MS VC++ are not inconsistent.  Both of those compilers comply with
the ABI of the platform that they target.

There is no requirement in any platform's ABI that I work with that char be
UTF-8 and wchar_t be UTF-16 or UTF-32.
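
One quick way to see the platform dependence is to ask the compiler itself.
A small sketch (the helper name wchar_bits is made up for illustration); on
Windows targets this typically yields 16, and on Linux targets typically 32:

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: width in bits of one wchar_t on the platform
// this translation unit is compiled for.  The value comes from the
// platform ABI, not from a choice the compiler makes on its own.
inline std::size_t wchar_bits()
{
  return sizeof(wchar_t) * CHAR_BIT;
}
```

Note that the width says nothing about which encoding, if any, the contents
of a wchar_t string are in.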

Perhaps what you need is to make your own character type (or, technically,
encoding unit type):

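#include <stdint.h>  // uint8_t, uint16_t, uint32_t

// One UTF-8 encoding unit (a character takes 1 to 4 of these).
struct UTF8
{
  typedef uint8_t Type;
  Type mEncodingUnit;
};

// One UTF-16 encoding unit (a character takes 1 or 2 of these).
struct UTF16
{
  typedef uint16_t Type;
  Type mEncodingUnit;
};

// One UTF-32 encoding unit (a character takes exactly 1 of these).
struct UTF32
{
  typedef uint32_t Type;
  Type mEncodingUnit;
};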

Or use a Unicode savvy library like ICU <http://www.icu-project.org/>.
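
For the roll-your-own route, the point of wrapping each encoding unit in its
own struct (rather than a bare typedef) is that the three become distinct
types, so overloads and templates can tell them apart regardless of what
char and wchar_t happen to be.  A rough sketch; encoding_name and
byte_length are hypothetical helpers, not part of any library:

```cpp
#include <stdint.h>
#include <cstddef>
#include <cstring>

struct UTF8  { typedef uint8_t  Type; Type mEncodingUnit; };
struct UTF16 { typedef uint16_t Type; Type mEncodingUnit; };
struct UTF32 { typedef uint32_t Type; Type mEncodingUnit; };

// Hypothetical overloads: resolved at compile time by the unit type.
inline const char* encoding_name(const UTF8*)  { return "UTF-8"; }
inline const char* encoding_name(const UTF16*) { return "UTF-16"; }
inline const char* encoding_name(const UTF32*) { return "UTF-32"; }

// Hypothetical helper: storage size in bytes of `count` encoding units
// (encoding units, not characters).
template <typename Unit>
std::size_t byte_length(const Unit*, std::size_t count)
{
  return count * sizeof(typename Unit::Type);
}
```

Passing a UTF16 buffer where a UTF8 buffer is expected then fails to compile
instead of silently producing garbage.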

> I am not trying to deny UTF-32 or saying that GCC should not support it, I
> am saying that GCC should support all three Unicode formats because UTF-16
> is a format that I have to deal with in the real world. Why not support all
> three formats?

GCC does not support Unicode.

Some libraries (that are not part of GCC) support Unicode.

Perhaps parts of the OS support Unicode, in some transformation format, with
their LANG environment, or Windows' 65001, 65005, 65006, 1200, 1201 code
pages, or Mac OS X's kCFStringEncodingUnicode, kCFStringEncodingUTF8,
kCFStringEncodingUTF16, kCFStringEncodingUTF16BE, kCFStringEncodingUTF16LE,
kCFStringEncodingUTF32, kCFStringEncodingUTF32BE, kCFStringEncodingUTF32LE.

The only computer languages that I'm aware of that support Unicode are:
+ Python 2.3 (somewhat, as an opt-in transition feature)
+ Python 2.5 (somewhat)
+ Python 3.0 (very well)
+ D Programming Language (very well)
+ Java (very well)

My favorite computer languages do NOT support Unicode "out of the box" (by
"support" I mean both Unicode source code and the ability to target Unicode
applications):
+ C
+ C++
+ Lua

With add-on libraries and/or OS API support, discipline, and a bit of luck,
those languages can target Unicode applications.

I can't see Lua supporting Unicode "out of the box" without increasing its
tiny embedded scripting engine footprint by over an order of magnitude.

> As someone with has written a scripting language based on C++, I can tell
> you that changing the 'wchar_t' to something else would only take five
> minutes - it wouldn't break any thing.

It would break the OS ABI, which is defined by the OS, not by the compiler.

HTH,
--Eljay
