public inbox for gcc-help@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: UTF-8, UTF-16 and UTF-32
       [not found] <002901c903df$08265510$3b9c65dc@testserver>
@ 2008-08-22 14:54 ` Eljay Love-Jensen
  2008-08-23  2:00   ` Dallas Clarke
  0 siblings, 1 reply; 35+ messages in thread
From: Eljay Love-Jensen @ 2008-08-22 14:54 UTC (permalink / raw)
  To: Dallas Clarke, GCC-help

Hi Dallas,

> Thank you for taking the time to reply to my earlier message. The problem
> is what I run into when processing 16-bit strings in my source code, and
> when processing other 16-bit text files: having to write all the 16-bit
> library functions myself, the L"" notation returning a 32-bit string - so
> I need to declare all strings like unsigned short string[] =
> {'H','e','l','l','o',' ','W','o','r','l','d',0}; - and finally the fact
> that both types are called wchar_t and I can't redefine the type.

Yes, that is what you have to do.  Although I'd use C99-ism uint16_t rather
than unsigned short.

As a convenience, you could write a MakeUtf16StringFromASCII routine, to
convert ASCII to Utf16String:

#if MAC
typedef UniChar Utf16EncodingUnit;
#elif WIN
#ifndef UNICODE
#error Need -DUNICODE
#endif
typedef WCHAR Utf16EncodingUnit;
#else
typedef uint16_t Utf16EncodingUnit;
#endif

typedef std::basic_string<Utf16EncodingUnit> Utf16String;

Utf16String s = MakeUtf16StringFromASCII("Hello world");

NOTE: Windows WCHAR may be wchar_t or may be unsigned short.  The TCHAR
flippy type is either WCHAR or char, depending on -DUNICODE or not.
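
A minimal sketch of that conversion routine might look like this (untested,
and assuming plain 7-bit ASCII input):

#include <cassert>

// Widen 7-bit ASCII to UTF-16: ASCII code points map one-to-one onto
// UTF-16 code units, so no real transcoding is needed.
Utf16String MakeUtf16StringFromASCII(const char* ascii)
{
    Utf16String result;
    for ( ; *ascii != '\0'; ++ascii)
    {
        assert((static_cast<unsigned char>(*ascii) & 0x80) == 0); // ASCII only
        result += static_cast<Utf16EncodingUnit>(
            static_cast<unsigned char>(*ascii));
    }
    return result;
}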

> I am not saying to drop support for UTF-32 in favour of UTF-16, but I am
> saying that with Microsoft's and Macintosh's decision to support 16-bit
> Unicode files, and the fact that Vista is not fully supporting multibyte
> characters, I am forced to use Unicode. If GCC chooses not to support 16-bit
> strings, or does so in such a way which doesn't enable portability between
> Windows and Linux, then I will in all likelihood be forced to stop porting
> to Linux.
> 
> The choice is:-
> a) Windows + Linux + Solaris; or
> b) Windows.
> 
> It is delusional to think that there is a third choice.

A C or C++ L"blah blah" string literal is not Unicode.

A wchar_t is not Unicode.

C and C++ do not support Unicode.  (They also don't not support it.  Rather,
it's unspecified.)  None of UTF-8, UTF-16, or UTF-32 is supported.

If you want Unicode string handling in your code, L"blah" is the wrong way
to go about it.  It's "wrong" in the sense that it does not do what you want
it to do.

(I really really *WISH* it did do what you want it to do, since I want that
facility too.  But, alas, it does not.  It's not portable as Unicode, since
it's not Unicode in any portable / guaranteed / reliable sense.)
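
A trivial way to see this for yourself (just a sketch; the exact numbers
depend on your compiler and platform):

#include <iostream>

int main()
{
    // Same source everywhere, different in-memory representation:
    // MSVC on Windows typically prints 2 and 12 here, GCC on Linux 4 and 24.
    std::cout << "sizeof(wchar_t)  = " << sizeof(wchar_t)  << "\n";
    std::cout << "sizeof(L\"Hello\") = " << sizeof(L"Hello") << "\n";
    return 0;
}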

> Adding support for 16-bit strings will not break any current ABI
> convention

Yes, it does violate ABI.  The platform ABI is not specified by GCC.

Sincerely,
--Eljay

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-22 14:54 ` UTF-8, UTF-16 and UTF-32 Eljay Love-Jensen
@ 2008-08-23  2:00   ` Dallas Clarke
  2008-08-23  2:24     ` me22
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Dallas Clarke @ 2008-08-23  2:00 UTC (permalink / raw)
  To: Eljay Love-Jensen, GCC-help

There is a solution that will please everyone, and your stance for not doing
it is that it breaks the ABI - but haven't we learnt anything from the
2/4-byte int type debacle of several decades ago? Why would you want to go
through that all over again.

You ask why only GCC should change: MSVC++ is using 2-byte wchar_t, Borland
C++ Builder has a policy of conforming to MSVC++ and most likely already
uses 2-byte wchar_t, and Sun Studio will most likely bend to the market
reality - which will leave GCC on its own.

My preferred Solution: -

Standardise: - sizeof(char) = 1; sizeof(wchar_t) = 2; and sizeof(long
wchar_t) = 4.

Implement all the string functions: - strcmp(); mbscmp(); wcscmp(); and
lcscmp().

In ASCII C++ source files: -
"String" returns type char
L"String" returns type wchar_t
LL"String" returns type long wchar_t

In UTF-8 C++ source files: -
"String" returns type unsigned char
L"String" returns type wchar_t
LL"String" returns type long wchar_t

In UTF-16 C++ source files: -
A"String" returns type unsigned char
"String" returns type wchar_t
LL"String" returns type long wchar_t

In UTF-32 C++ source files: -
A"String" returns type unsigned char
L"String" returns type wchar_t
"String" returns type long wchar_t

In this solution there is something for everyone: the Chinese can write
their source code in visible Mandarin in UTF-16 or UTF-32, not in
hexadecimal ASCII. The Europeans can save a few bytes by writing in UTF-8.
We can all process files in any of the Unicode text formats from any OS. No
one needs to implement dodgy string conversion routines that must allocate
memory and not release it. We can use constant strings in function
parameters - such as strcmp(string,"answer"), rather than allocating and
initialising vectors every time.



Why not support all three Unicode formats? If it breaks the ABI, then the
ABI needs to be broken. We are all responsible for our own actions, and in
letting someone else make bad decisions for us, we are just as liable as if
we had made the decision ourselves.



Dallas.

http://www.ekkySoftware.com/ 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-23  2:00   ` Dallas Clarke
@ 2008-08-23  2:24     ` me22
  2008-08-23  2:45       ` Dallas Clarke
  2008-08-23 11:33     ` Andrew Haley
  2008-08-23 21:41     ` Eljay Love-Jensen
  2 siblings, 1 reply; 35+ messages in thread
From: me22 @ 2008-08-23  2:24 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: Eljay Love-Jensen, GCC-help

On Fri, Aug 22, 2008 at 21:37, Dallas Clarke <DClarke@unwired.com.au> wrote:
>
> Standardise: - sizeof(char) = 1; sizeof(wchar_t) = 2; and sizeof(long
> wchar_t) = 4.
>

Do you mean "standardize char as UTF-8, wchar_t as UTF-16, and long
wchar_t as UTF-32"?  Because that's not what you said, even if (on
POSIX, but not necessarily C or C++) the sizes would be appropriate.

> Implement all the string functions: - strcmp(); mbscmp(); wcscmp(); and
> lcscmp().
>

How exactly do you plan on implementing strchr for UTF-16?
Specifically, what would its signature be?

~ Scott

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-23  2:24     ` me22
@ 2008-08-23  2:45       ` Dallas Clarke
  2008-08-23  3:06         ` me22
  0 siblings, 1 reply; 35+ messages in thread
From: Dallas Clarke @ 2008-08-23  2:45 UTC (permalink / raw)
  To: me22; +Cc: Eljay Love-Jensen, GCC-help

Hello Scott,

I guess that ASCII would be char, UTF-8 would be unsigned char, UTF-16 would 
be wchar_t and UTF-32 would be long wchar_t. But it is more appropriate just 
to have the three sizes of strings, i.e. 8-bits, 16-bits and 32 bits, and 
the ability to have const 16-bit strings.

wchar_t* strchr(wchar_t *string, wchar_t chr){
    while(*string != '\0' && *string != chr) ++string;
    if(*string == chr) return string;
    return NULL;
}

const wchar_t* strchr(const wchar_t *string, wchar_t chr){
    while(*string != '\0' && *string != chr) ++string;
    if(*string == chr) return string;
    return NULL;
}

Cheers,
Dallas.
http://www.ekkySoftware.com/

----- Original Message ----- 
From: "me22" <me22.ca@gmail.com>
To: "Dallas Clarke" <DClarke@unwired.com.au>
Cc: "Eljay Love-Jensen" <eljay@adobe.com>; "GCC-help" <gcc-help@gcc.gnu.org>
Sent: Saturday, August 23, 2008 12:12 PM
Subject: Re: UTF-8, UTF-16 and UTF-32


> On Fri, Aug 22, 2008 at 21:37, Dallas Clarke <DClarke@unwired.com.au> 
> wrote:
>>
>> Standardise: - sizeof(char) = 1; sizeof(wchar_t) = 2; and sizeof(long
>> wchar_t) = 4.
>>
>
> Do you mean "standardize char as UTF-8, wchar_t as UTF-16, and long
> wchar_t as UTF-32"?  Because that's not what you said, even if (on
> POSIX, but not necessarily C or C++) the sizes would be appropriate.
>
>> Implement all the string functions: - strcmp(); mbscmp(); wcscmp(); and
>> lcscmp().
>>
>
> How exactly do you plan on implementing strchr for UTF-16?
> Specifically, what would its signature be?
>
> ~ Scott
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-23  2:45       ` Dallas Clarke
@ 2008-08-23  3:06         ` me22
  2008-08-23  3:52           ` Dallas Clarke
  0 siblings, 1 reply; 35+ messages in thread
From: me22 @ 2008-08-23  3:06 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: GCC-help

On Fri, Aug 22, 2008 at 22:36, Dallas Clarke <DClarke@unwired.com.au> wrote:
> Hello Scott,
>
> wchar_t* strchr(wchar_t *string, wchar_t chr){
>   while(*string != '\0' && *string != chr) ++string;
>   if(*string == chr) return string;
>   return NULL;
> }
>

That doesn't work.  What if I want to look for 𝔅, U+1D505
MATHEMATICAL FRAKTUR CAPITAL B?  Its UTF-16 representation is 0xD835
0xDD05, which obviously doesn't fit in the single wchar_t parameter.
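
To make that concrete, a code-point-aware search would have to take a 32-bit
code point and decode surrogate pairs as it walks the string.  A rough,
untested sketch (not a proposal for an API, just an illustration; it assumes
well-formed UTF-16):

#include <stdint.h>
#include <stddef.h>

const uint16_t* utf16_strchr(const uint16_t* s, uint32_t cp)
{
    while (*s != 0) {
        uint32_t c = *s;
        size_t len = 1;
        if (c >= 0xD800 && c <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
            // combine a surrogate pair into one code point
            c = 0x10000 + ((c - 0xD800) << 10) + (s[1] - 0xDC00);
            len = 2;
        }
        if (c == cp)
            return s;
        s += len;
    }
    return NULL;
}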

~ Scott

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-23  3:06         ` me22
@ 2008-08-23  3:52           ` Dallas Clarke
  2008-08-23  4:31             ` Brian Dessent
  0 siblings, 1 reply; 35+ messages in thread
From: Dallas Clarke @ 2008-08-23  3:52 UTC (permalink / raw)
  To: me22; +Cc: GCC-help

Well Scott, that's probably why I don't want to have to implement the string 
library functions myself. With each different implementation, we will 
probably get a different outcome and it makes more sense that GCC does it in 
the standard C library.

All I want is to be able to create one common source code and have it 
compile in both Windows and Linux. Windows is now forcing me to use Unicode
since with Vista and Visual Studio 2008, several MFC classes are no longer
supported in a multibyte compile, meaning I have to compile in Unicode.
Supplying the graphical and OS components with Unicode, my database server
needs to store and process 16-bit Unicode. Without 16-bit strings in GCC,
this is now a nightmare and I have to make a decision whether it is
worth porting to Linux/Solaris.

Dallas.
http://www.ekkySoftware.com 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-23  3:52           ` Dallas Clarke
@ 2008-08-23  4:31             ` Brian Dessent
  0 siblings, 0 replies; 35+ messages in thread
From: Brian Dessent @ 2008-08-23  4:31 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: me22, GCC-help

Dallas Clarke wrote:

> Well Scott, that's probably why I don't want to have to implement the string
> library functions myself. With each different implementation, we will
> probably get a different outcome and it makes more sense that GCC does it in
> the standard C library.

But gcc does not implement a standard C library.  On Linux, that is
usually[1] glibc -- a totally separate project from gcc.  The glibc
maintainers are very steadfast that wchar_t is 32 bits wide and anything
narrower is broken and evil.  Hell will freeze over before you convince
them to change that, as they are noted for their strong personalities.

Do you see now why asking gcc to make some sort of change is a losing
battle?  And not even on its technical merits, but because gcc has no
control over the issue.

> All I want is to be able to create one common source code and have it
> compile in both Windows and Linux. Windows is now forcing me to use Unicode
> since with Vista and Visual Studio 2008, several MFC classes are no longer
> supported in a multibyte compile, meaning I have to compile in Unicode.
> Supplying the graphical and OS components with Unicode, my database server
> needs to store and process 16-bit Unicode. Without 16-bit strings in GCC,
> this is now a nightmare and I have to make a decision whether it is
> worth porting to Linux/Solaris.

Portability is hard.  This is the job of abstraction libraries like
boost, gtk, and so on.  It's not the job of the compiler.

Brian

[1] Though it could be uclibc, newlib, or something else.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-23  2:00   ` Dallas Clarke
  2008-08-23  2:24     ` me22
@ 2008-08-23 11:33     ` Andrew Haley
  2008-08-23 21:41     ` Eljay Love-Jensen
  2 siblings, 0 replies; 35+ messages in thread
From: Andrew Haley @ 2008-08-23 11:33 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: GCC-help

Dallas Clarke wrote:
> There is a solution that will please everyone, and your stance for not
> doing it is that it breaks the ABI - but haven't we learnt anything from
> the 2/4-byte int type debacle of several decades ago? Why would you want
> to go through that all over again.

If it breaks the ABI then it isn't a solution that will please everyone,
is it?

> You ask why only GCC should change: MSVC++ is using 2-byte wchar_t, Borland
> C++ Builder has a policy of conforming to MSVC++ and most likely already
> uses 2-byte wchar_t, and Sun Studio will most likely bend to the market
> reality - which will leave GCC on its own.

I think we've been over this already.  gcc doesn't decide this; it's
part of the system ABI.

> In this solution there is something for everyone: the Chinese can write
> their source code in visible Mandarin in UTF-16 or UTF-32, not in
> hexadecimal ASCII. The Europeans can save a few bytes by writing in
> UTF-8. We can all process files in any of the Unicode text formats from
> any OS. No one needs to implement dodgy string conversion routines that
> must allocate memory and not release it. We can use constant strings in
> function parameters - such as strcmp(string,"answer"), rather than
> allocating and initialising vectors every time.

I note that you have, several times, failed to answer some of the points
that people have made.  I'm going to try again.  The Chinese, Europeans,
and everyone else, can use UTF-8.  You have alleged several times that
UTF-8 is in some way deficient, but have never replied when challenged
as to why.

> Why not support all three Unicode formats?

We can support all of them.  I just don't think that the compiler itself
needs to change in order to do it.

> If it breaks the ABI, then
> the ABI needs to be broken. We are all responsible for our own actions,
> and in letting someone else make bad decisions for us, we are just as
> liable as if we had made the decision ourselves.

Sure, but you so far have failed to convince anyone that a bad decision
has been made.

Andrew.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-23  2:00   ` Dallas Clarke
  2008-08-23  2:24     ` me22
  2008-08-23 11:33     ` Andrew Haley
@ 2008-08-23 21:41     ` Eljay Love-Jensen
  2008-08-24  0:41       ` Dallas Clarke
  2 siblings, 1 reply; 35+ messages in thread
From: Eljay Love-Jensen @ 2008-08-23 21:41 UTC (permalink / raw)
  To: Dallas Clarke, GCC-help

Hi Dallas,

The changes you propose, in my opinion, are noble and worthy.

Unfortunately, neither C (ISO 9899) nor C++ (ISO 14882) can incorporate
those changes.  Making those changes as extensions to C or C++ would be a
variant language that is almost C and almost C++... which would indubitably
cause more issues in the long run.

Also, GCC cannot mandate those ABI changes, since GCC complies with the
platform required ABI, not vice versa.

I have a for-instance... GCC had several C++ extensions that I thought were
great, I used them a lot, and I tended to call my not-quite-C++ code "G++"
(informally).  Even though the extensions were cool, and sensical, and
useful, they were an enormous impediment to portability.  I no longer do
that.  [Those in the GCC community who have also done this are either
laughing or shuddering, or both.]

I had an opportunity to speak with Bjarne Stroustrup about all sorts of
issues with C++, as I saw them.  He stopped me short and said (paraphrased),
"If you don't like C++, you are free to write your own compiler.  I did."

Unicode and/or ISO 10646 were not on my radar at that time.  Had they been,
I probably would have brought that up too, since I am a Unicode fanboy.

What you are proposing is not C, and is not C++.  FSF does not control ISO
9899 nor ISO 14882.  GCC does not drive platform ABI.

HOWEVER, you are at liberty to write your own language.  I tried, and I
discovered that writing a good, fleshed-out, general purpose programming
language is very, very hard.  (I was using the GCC back-end, so all I needed
to do was write the front-end for my ultimate programming language.)

FORTUNATELY, there is a programming language that is much like C++, which
has the Unicode support you are looking for, and has a GCC front-end.  The
language is the D Programming Language <http://www.digitalmars.com/d/>.  It
is available now.  D 1.0 is supported by the gdc project
<http://dgcc.sourceforge.net/>, and has been used in commercial software.
Digital Mars, the progenitor of the D Programming Language, supplies its own
dmd compiler for Windows and Linux.

There's also Java, which has excellent Unicode support, and supports Unicode
source code as well as Unicode strings at runtime.

Alternatively, embrace ICU <http://www.icu-project.org/> for C (which works
in C++ too) to work with Unicode strings.  But that is not a solution that
works to support Unicode source code.
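
For a flavour of what that looks like, here is a small sketch using ICU's
C++ API (method names as I recall them; check the ICU documentation for
your version, and note the source file is assumed to be saved as UTF-8):

#include <unicode/unistr.h>   // icu::UnicodeString
#include <iostream>
#include <string>

int main()
{
    // ICU holds text as UTF-16 internally; convert at the boundaries.
    icu::UnicodeString u = icu::UnicodeString::fromUTF8("Grüße, 世界");

    std::cout << "UTF-16 code units: " << u.length()      << "\n";
    std::cout << "code points:       " << u.countChar32() << "\n";

    std::string utf8;
    u.toUTF8String(utf8);      // back to UTF-8 for output or storage
    std::cout << utf8 << "\n";
    return 0;
}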

> ...the Chinese can write their source code in visible Mandarin in UTF-16 or
> UTF-32...

I think you misunderstand what UTF-16 and UTF-32 are.

The visible Mandarin source code would be in Unicode.

UTF-8, UTF-16, and UTF-32 are encoding representations of Unicode.  You
don't "write in UTF-16" or "write in UTF-32".  The Mandarin can be encoded
using UTF-8 just fine, there is no prohibition against it.  And for a source
code file, any one of the three UTF-8/16/32 formats is as good as another.

To better understand this, get the D Programming Language.  You usually work
with Unicode characters, not UTF-8, UTF-16, UTF-32 encoding units.  You are
thinking at the wrong meta-level.

Sincerely,
--Eljay

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-23 21:41     ` Eljay Love-Jensen
@ 2008-08-24  0:41       ` Dallas Clarke
  2008-08-24  4:02         ` me22
                           ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Dallas Clarke @ 2008-08-24  0:41 UTC (permalink / raw)
  To: Eljay Love-Jensen, GCC-help

Hello Eljay & Andrew

> What you are proposing is not C, and is not C++.  FSF does not control ISO
> 9899 nor ISO 14882.  GCC does not drive platform ABI.

The heart of this issue is that GCC is not compatible with MS VC++, by 
defining wchar_t as 4-bytes and not providing any 16-bit Unicode support - 
it's just going to be too hard to continue porting to Linux.

At the end of the day if you want to live in a world where you only consider 
yourself - then you can live in that world by yourself. Like you said, if I 
don't like it I can use another language and GCC will become irrelevant, you 
can all go your own separate way.

Also I have written my own scripting language, designed to add functionality 
post installation, it's not that hard.

>Sure, but you so far have failed to convince anyone that a bad decision
>has been made.

I won't bother repeating myself; it's not my responsibility to cure your dogma,
it's just the end of me using GCC. I am sure that many other developers will
run into the same problem and choose the same solution.


It was fun,
Dallas.
http://www.ekkySoftware.com/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-24  0:41       ` Dallas Clarke
@ 2008-08-24  4:02         ` me22
  2008-08-24  5:53         ` corey taylor
  2008-08-25 23:15         ` Matthew Woehlke
  2 siblings, 0 replies; 35+ messages in thread
From: me22 @ 2008-08-24  4:02 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: Eljay Love-Jensen, GCC-help

On Sat, Aug 23, 2008 at 17:40, Dallas Clarke <DClarke@unwired.com.au> wrote:
> The heart of this issue is that GCC is not compatible with MS VC++, by
> defining wchar_t as 4-bytes and not providing any 16-bit Unicode support -
> it's just going to be too hard to continue porting to Linux.
>

Then why does GCC need to change, rather than MSVC++, which caused the
problem in the first place?  Why shouldn't MSVC++ be using UTF-32
instead of UTF-16?  strchr is a good example of one way in which
UTF-32 is a better fit, and I have yet to see any reason why UTF-16 is
better, other than "well, Windows is installed all over the place".

> At the end of the day if you want to live in a world where you only consider
> yourself - then you can live in that world by yourself. Like you said, if I
> don't like it I can use another language and GCC will become irrelevant, you
> can all go your own separate way.
>

I suspect you'll find that rather a high percentage of other languages
either use GCC indirectly, or require GCC to build themselves or their
runtimes.

> Also I have written my own scripting language, designed to add functionality
> post installation, it's not that hard.
>

Which, I expect, didn't even need to be Turing complete, let alone
deal with mountains of legacy code, dozens of platforms, or backwards
compatibility.  A bad interpreter is trivial, I agree.

~ Scott

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-24  0:41       ` Dallas Clarke
  2008-08-24  4:02         ` me22
@ 2008-08-24  5:53         ` corey taylor
  2008-08-24  6:02           ` Dallas Clarke
  2008-08-25 23:15         ` Matthew Woehlke
  2 siblings, 1 reply; 35+ messages in thread
From: corey taylor @ 2008-08-24  5:53 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: Eljay Love-Jensen, GCC-help

On Sat, Aug 23, 2008 at 5:40 PM, Dallas Clarke <DClarke@unwired.com.au> wrote:
> I won't bother repeating myself; it's not my responsibility to cure your dogma,
> it's just the end of me using GCC. I am sure that many other developers will
> run into the same problem and choose the same solution.
>

I think you're failing to convince most people due to the fact that
many of your arguments definitely require repeating and further
discussion.

You're obviously dealing with portability issues - both at the
compiler level and ABI level (and there are others I guess depending
on your needs).

If I understand your two key issues correctly, they are:

1.  You want source code that has unicode support.
2.  You want to be able to process unicode in c++ and runtime libraries.

They are different issues.  I think you understand that, but some of
your replies have both solutions lumped together.

I would like to respond to #2.

Your initial email was confusing and not the authoritative one you
might have thought.  You made arguments against UTF-8 and UTF-32 which
others here don't understand, and it seems the response was simply
restating wanting UTF-16 support.

Can you really not represent everything you want in UTF-8?  Seems
unlikely considering its meant to represent them.  Your comment on
UTF-32 was odd to say the least.  Sure, if we ever have a textual
representation that contains a quadrillion characters then we will
have to redesign how we encode it (exaggerated).  UTF-16 requires a
multi-word sequence to represent everything as well so it's nothing
special except for the fact that it is used as you said.

Now, as far as your problems above, you should look at encoding as a
design issue and not a compiler issue as far as c++ goes.  The
compiler implements wchar_t in a way that represents all of the
characters as required - as msvc and gcc are not developed in tandem,
they obviously came up with different requirements at different times.
 If you think of it at just the compiler level, you're setting your
design up for failure such that it won't be portable (not only to
different systems but to upgrades of existing compilers and language
specs).

What I mean by that is encoding should definitely be handled in a
layer above the system you are working on.  Even if both compilers
implemented the same wchar_t, it doesn't mean that every API you use
will use that wchar_t.  So, what you need to do is find a way to
represent your data and then map it to the system, api, etc that
you're using.  You never know what display or render you'll need to
use or what system you need to interface with.

I have a couple comments on your gcc modification solution.

1.  Modifying wchar_t to be 2 bytes and then making L create 2-byte
UTF-16 constants means that gcc users could no longer rely on constant
lengths like before.  And if it is just as easy as you indicated, it's
also an indication that it's probably something that should only be
touched carefully.  Any code relying on this gcc implementation would
be broken.

2.  Creating a new type long wchar_t as a solution to compatibility?
You're just asking for the same issue.

You mentioned needing to read stored data and presumably write it back.
I saw a mention of a text file and a mention of a database.

UTF-8 seems exceptionally up to the task for encoding your data!  How
can you know for certain that all of your input will be in the same
format?  How will having 2-byte wchar_t in GCC solve all of your
problems?  GCC only controls types not storage or implementation of
any library or OS.

I think you have to expect to write an encoder for data and to provide a
layer around your unique systems (data files, databases, constants,
OS).  Linux and Windows themselves certainly aren't going to be
completely compatible.  You could make implementations portable but
never fully compatible!

Just my thoughts after reading through this.  I think several people
here would be interested in discussing solutions with you that make
sense at all levels.

(A quick note about your issue #1, I think it would be very confusing
for source file encoding to be based on what a user typed.  It should
be constant or configured in a more visible way).

corey

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-24  5:53         ` corey taylor
@ 2008-08-24  6:02           ` Dallas Clarke
  2008-08-24 11:11             ` me22
  2008-08-24 19:11             ` Eljay Love-Jensen
  0 siblings, 2 replies; 35+ messages in thread
From: Dallas Clarke @ 2008-08-24  6:02 UTC (permalink / raw)
  To: corey taylor; +Cc: Eljay Love-Jensen, GCC-help

Hello Corey and Scott

At the risk of sounding repetitive, the problems are:-
1) using wchar_t for 4-byte string stuffs up function overloading - for 
example problems with shared libraries written with different 2-byte string 
type (i.e., short, unsigned short, struct UTF16{ typedef uint16_t Type; Type 
mEncodingUnit;};, etc)
2) casting error from declaring strings as unsigned short string[] = 
{'H','e','l','l','o',' ','W','o','r','l','d',0} or unsigned short *string = 
(unsigned short*)"H\0e\0l\0l\0o\0 \0W\0o\0r\0l\0d\0\0";
3) pointer arithmetic bugs and other portability issues consuming time, 
money and resources.
4) no standard library for string functions, creating different behaviours
from different implementations. And the standard C-Library people will not 
implement the string routines until there is a standard type for 16-bit 
strings offered by the compiler.

The full set of MS Common Controls no longer support the -D _MBCS, this 
means I must compile in with -D UNICODE and -D _UNICODE, this makes all the 
standard WINAPI to use 16-bit Unicode strings as well. Rather than 
constantly convert between UTF-8 and 16-bit Unicode I am moving totally to 
16-bit Unicode. Why is MS doing this - probably because they know you're not
supporting 16-bit Unicode and that will force people like me to drop plans 
to port to Linux/Solaris because it is just too hard.

Once again, there are no legacy issues because no one is currently using 
16-bit Unicode in GCC, it does not exist. Adding such support will not break 
anything. I am not arguing to stop support for 32-bit Unicode. Secondly 
object code does not use the label "wchar_t", meaning the change would force 
people to do a global search and replace "wchar_t" to "long wchar_t" before 
their next compile. Quite a simple change compared to what I must do to 
support 16-bit strings in GCC.

It would be nice to substitute "long wchar_t" for "wchar_t" as it would not 
only be consistent with MS VC++, but also definitions of double and long 
double, long and long long, and integers as 123456789012LL. Using S"String" 
or U"String" would be too confusing with signed and unsigned.

The issues of confusion between Unicode Text Format (UTF) 8, 16 and 32, are 
not only mine, but as pointed out earlier, they are constantly changing. The
16-bit string is a format I am forced to deal with and there is no support 
from GCC at all. I can't tell you if MS Unicode is the older style fixed 
16-bit or it is the newer multibyte type similar to the UTF-8 definition.

And in case you don't already know, MS VC++ compiles source code written in 
16-bit Unicode, allowing function name, variables and strings to be written 
in 16-bit Unicode. This means that more and more files/data is going to be 
16-bit Unicode. Developers like myself are going to have to deal with the 
format, whether we like it or not.

And to answer Scott's questions why must we follow the thousand pound 
gorilla that is Microsoft? For the same reason that rain falls down, because 
that is just how the world is. I have spent the last three weeks migrating
all our products to 16-bit Unicode, I probably still have another two weeks 
to go, then 2-4 weeks of testing after that. Do I like it? No, I just have 
to keep our products competitive in a market environment. And also how does 
Microsoft deal with issues of "mountains of legacy code"? Like with this 
move to Unicode, they just stop fully supporting UTF-8 and if you don't move 
you become a dinosaur. (They call it "deprecation", I think they meant 
depreciation because they lower it, not strangle it.)

So I have to ask - what are your arguments for not providing support for all 
three, 8-bit, 16-bit and 32-bit Unicode strings?

Regards,
Dallas.
http://www.ekkySoftware.com


P.S. I suggest that the strings default to the same type as the underlying
file format, otherwise it can be overridden by expressly stating:-
A"String", L"String", LL"String".


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-24  6:02           ` Dallas Clarke
@ 2008-08-24 11:11             ` me22
  2008-08-24 19:11             ` Eljay Love-Jensen
  1 sibling, 0 replies; 35+ messages in thread
From: me22 @ 2008-08-24 11:11 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: corey taylor, Eljay Love-Jensen, GCC-help

On Sun, Aug 24, 2008 at 01:53, Dallas Clarke <DClarke@unwired.com.au> wrote:
> At the risk of sounding repetitive, the problems are:-
> 1) using wchar_t for 4-byte string stuffs up function overloading - for
> example problems with shared libraries written with different 2-byte string
> type (i.e., short, unsigned short, struct UTF16{ typedef uint16_t Type; Type
> mEncodingUnit;};, etc)

Of course, people relying on sizeof(short)*CHAR_BIT being 16 aren't
technically portable anyways...

using wchar_t for 2-byte string stuffs up function overloading - for
example problems with shared libraries written with different 4-byte
string type (i.e., int, unsigned long, struct UTF16{ typedef uint32_t
Type; Type mEncodingUnit;};, etc)

> 2) casting error from declaring strings as unsigned short string[] =
> {'H','e','l','l','o',' ','W','o','r','l','d',0} or unsigned short *string =
> (unsigned short*)"H\0e\0l\0l\0o\0 \0W\0o\0r\0l\0d\0\0";

Why do you care how you're defining strings?  Anything that needs
localization belongs in an external "resource file" anyways, and
anything in the code can be done perfectly fine with wstring s =
L"Whatever you want here, including non-ASCII", and the compiler will
be fine with it, so long as your locale is consistent.

I'm really not convinced that this is a real problem, since there are
many C++ projects using wxWidgets that compile fine in "Unicode" mode
in both Windows and Linux, where writing a localizable string just
means saying _T("your text here") and everything works perfectly fine.

> 3) pointer arithmetic bugs and other portability issues consuming time,
> money and resources.

Anyone doing explicit pointer manipulation on strings in
application-level code deserves their problems.  And UTF-32 actually
has fewer possible places for pointer errors, since incrementing a
pointer actually moves to the next codepoint, something that neither
UTF-8 nor UTF-16 allow.
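
To illustrate: "advance to the next code point" is a plain increment for
UTF-32, but for UTF-16 and UTF-8 it has to inspect the data.  A rough
sketch (assuming well-formed input):

#include <stdint.h>

// UTF-32: every element is a whole code point.
const uint32_t* next_cp_utf32(const uint32_t* p) { return p + 1; }

// UTF-16: skip two code units when sitting on a high (lead) surrogate.
const uint16_t* next_cp_utf16(const uint16_t* p)
{
    return p + ((*p >= 0xD800 && *p <= 0xDBFF) ? 2 : 1);
}

// UTF-8: skip the continuation bytes (those of the form 10xxxxxx).
const uint8_t* next_cp_utf8(const uint8_t* p)
{
    ++p;
    while ((*p & 0xC0) == 0x80) ++p;
    return p;
}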

> 4) no standard library for string functions, creating different behaviours
> from different implementations. And the standard C-Library people will not
> implement the string routines until there is a standard type for 16-bit
> strings offered by the compiler.

The ISO standards for C and C++ may not provide it, but that's
certainly not something that GCC can change.  If this is a problem,
you should have submitted a proposal to the Standards Committees.

Regardless, there are plenty of mature, open, and free libraries for Unicode.

> The full set of MS Common Controls no longer support the -D _MBCS, this
> means I must compile in with -D UNICODE and -D _UNICODE, this makes all the
> standard WINAPI to use 16-bit Unicode strings as well. Rather than
> constantly convert between UTF-8 and 16-bit Unicode I am moving totally to
> 16-bit Unicode. Why is MS doing this - probably because they know you're not
> supporting 16-bit Unicode and that will force people like me to drop plans
> to port to Linux/Solaris because it is just too hard.

Well, "MS Common Controls" obviously aren't available on Linux either.
 Since you need to change basically your whole GUI to port it, you'd
be using a cross-platform library (like wxWidgets) that handles all
this for you.

> Once again, there are no legacy issues because no one is currently using
> 16-bit Unicode in GCC, it does not exist. Adding such support will not break
> anything. I am not arguing to stop support for 32-bit Unicode. Secondly
> object code does not use the label "wchar_t", meaning the change would force
> people to do a global search and replace "wchar_t" to "long wchar_t" before
> their next compile. Quite a simple change compared to what I must do to
> support 16-bit strings in GCC.

On the compilation side, sure, though I really don't think
search-and-replace works as well as you think it does in that
situation.  (Certainly s/int/long int/ isn't safe.)

But that would introduce a huge amount of issues.  It would mean that
changing the C library on a box to your proposed new one would break
every single binary there that used its wchar_t functions, for
example.

> It would be nice to substitute "long wchar_t" for "wchar_t" as it would not
> only be consistent with MS VC++, but also definitions of double and long
> double, long and long long, and integers as 123456789012LL. Using S"String"
> or U"String" would be too confusing with signed and unsigned.
>

BTW, are you following the standards process for C++0x?  Check
http://herbsutter.spaces.live.com/blog/cns!2D4327CC297151BB!214.entry
or the actual paper,
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html

u"string" and U"string" are actually good choices because they mirror
the \u and \U escape sequences in wide-character strings from C++98.
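
Under that proposal the new literal types look roughly like this (a sketch
that assumes a compiler with the proposed C++0x support, which GCC did not
have at the time of this thread):

#include <iostream>

int main()
{
    const char16_t* s16 = u"Hello \u00E9";       // UTF-16 code units
    const char32_t* s32 = U"Hello \U0001D505";   // UTF-32 code points

    std::cout << sizeof(*s16) << "\n";  // 2 on typical platforms
    std::cout << sizeof(*s32) << "\n";  // 4 on typical platforms
    return 0;
}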

> The issues of confusion between Unicode Text Format (UTF) 8, 16 and 32, are
> not only mine, but as pointed out earlier, they are constantly changing. The
> 16-bit string is a format I am forced to deal with and there is no support
> from GCC at all. I can't tell you if MS Unicode is the older style fixed
> 16-bit or it is the newer multibyte type similar to the UTF-8 definition.

I think you need to re-read the standards.  First of all, UTF is
"Unicode Transformation Format", not Text Format.  The standards are
quite specific about what UTF-8, UTF-16, and UTF-32 are.  There are
other encodings, UCS-2 for example, which are different.

> And in case you don't already know, MS VC++ compiles source code written in
> 16-bit Unicode, allowing function name, variables and strings to be written
> in 16-bit Unicode. This means that more and more files/data is going to be
> 16-bit Unicode. Developers like myself are going to have to deal with the
> format, whether we like it or not.

And GCC quite happily compiles code written in UTF-8, allowing
variable names and strings to be written in Unicode.  (Note that
there's no such thing as "16-bit Unicode".)  I suspect it'll compile
UTF-32 code as well, with similar results.

> So I have to ask - what are your arguments for not providing support for all
> three, 8-bit, 16-bit and 32-bit Unicode strings?
>

First, none of those things exist.  But I don't think I ever said that
providing support for UTF-8, UTF-16, and UTF-32 is such a terrible
thing, though I do think that it's somewhat pointless, since there are
already mature, capable libraries that do what you need, and the
cost/benefit quotient of providing it in the compiler is far too high.
 (And completely undesirable on many embedded platforms that GCC, C,
and C++ support.)

Why not change wchar_t to UTF-16?  Largely because while it might make
a vapourware project of yours easier, it creates the same problem you
are having now with Microsoft dropping UTF-8, except without a
feasible upgrade path.  Also, UTF-32 is more convenient to deal with,
as illustrated by the strchr example.

(Though technically Unicode fundamentally has to be dealt with
thinking about multi-element characters, because of irreducible
combining codepoints, hardly anything actually supports those, so
pretending that UTF-32 is one character per element is safe in effect.
 I don't think the fonts that come with Windows even include glyphs
for combining codepoints.  That said, SIL WorldPad and a few others
actually do things properly, so really one might as well just use
UTF-8 everywhere.)

> P.S. I suggest that the strings default to the same type as the underlying
> file format, otherwise it can be overridden by expressly stating:-
> A"String", L"String", LL"String".
>

Terrible idea, since it means that the legality of char const *p =
"hello world"; changes depending on the encoding of my file.  Encoding
is just that, and shouldn't change semantics.  (Just like whitespace.)
 An image should look the same whether saved in PNG or TGA, and a
program should do the same thing whether saved in UTF-16 or UTF-32.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-24  6:02           ` Dallas Clarke
  2008-08-24 11:11             ` me22
@ 2008-08-24 19:11             ` Eljay Love-Jensen
  2008-08-26 14:50               ` Marco Manfredini
  1 sibling, 1 reply; 35+ messages in thread
From: Eljay Love-Jensen @ 2008-08-24 19:11 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: GCC-help

Hi Dallas,

> Once again, there are no legacy issues because no one is currently using
> 16-bit Unicode in GCC, it does not exist.

I'm using UTF-16 Unicode in GCC.  I've done so for years.

I do not use wchar_t to specify UTF-16 Unicode, since that is not portable.

The same code runs on different platforms, the Windows platform being
compiled with MSVC++.

Although what you say is not without merit, in that C/C++ do not specify the
character set (let alone the encoding of the character set).

> So I have to ask - what are your arguments for not providing support for all
> three, 8-bit, 16-bit and 32-bit Unicode strings?

It is not part of ISO 9899 (for C), nor ISO 14882 (for C++).

There are languages which support UTF-8, UTF-16, and UTF-32 Unicode strings.
C and C++ are not those languages.

There are support libraries for Unicode (UTF-8, UTF-16, and UTF-32) for C
and C++.  They work on Linux and on Windows.  You are at liberty to use
those.

If you use Microsoft's extensions to C++, your code is no longer C++... it
is MS-C++.  Portability issues will be problematic, at least until Microsoft
comes out with MSVC++ for Linux and OS X and whatever other platform you are
interested in.

Maybe a future version of C and/or C++ will be more Unicode friendly.

Sincerely,
--Eljay

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-24  0:41       ` Dallas Clarke
  2008-08-24  4:02         ` me22
  2008-08-24  5:53         ` corey taylor
@ 2008-08-25 23:15         ` Matthew Woehlke
  2008-08-26  4:14           ` Dallas Clarke
  2 siblings, 1 reply; 35+ messages in thread
From: Matthew Woehlke @ 2008-08-25 23:15 UTC (permalink / raw)
  To: gcc-help, DClarke

Anti-FUD below. To add a purely technical suggestion, why not use 
wchar16_t everywhere? Since this type does not currently exist, you 
could attempt to make a case for adding support to glibc (though, good 
luck with that) without breaking the ABI, and then use that on 
non-conforming (read: Microsoft) platforms.

Or just use one of the various existing cross-platform libraries, like 
other people keep telling you to do.

Dallas Clarke wrote:
> Hello Eljay & Andrew
>> What you are proposing is not C, and is not C++.  FSF does not control 
>> ISO 9899 nor ISO 14882.  GCC does not drive platform ABI.
> 
> The heart of this issue is that GCC is not compatible with MS VC++,

gcc is not compatible with MSVC, no, nor do I expect there is any 
intention to make it so, nor do I believe it should be. Similarly, Sun C 
is not compatible with gcc, nor with MSVC.

If you refuse to write portable code, that's *your* problem, not ours.

> by 
> defining wchar_t as 4-bytes and not providing any 16-bit Unicode support 
> - it's just going to be too hard to continue porting to Linux.

Maybe we should copy other bad decisions by Microsoft, like 
sizeof(void*) != sizeof(long)?

> At the end of the day if you want to live in a world where you only 
> consider yourself - then you can live in that world by yourself. 

That's funny, that's exactly how I would describe the Microsoft world.

> Like 
> you said, if I don't like it I can use another language and GCC will 
> become irrelevant, you can all go your own separate way.

Here's to hoping that MSVC becomes irrelevant :-).

>> Sure, but you so far have failed to convince anyone that a bad decision
>> has been made.
> 
> I won't bother repeating myself; it's not my responsibility to cure your
> dogma,

Then I guess it shouldn't be our responsibility to cure yours.

> it's just the end of me using GCC.

You mean, it's just the end of you trying to support any non-Windows 
platform? That's too bad, since you're limiting yourself to an outdated 
platform that to all appearances seems to be in decline. (Even Microsoft 
is making noises of Windows going away...)

> I am sure that many other 
> developers will run into the same problem and choose the same solution.

I am sure many other developers have bothered to write code that is 
actually portable. In fact, the vast majority of software available for 
Linux, much of which can be compiled on multiple architectures as well 
as multiple OS's, proves that point. It's not gcc that's different from 
everyone else, it's Microsoft (and I can say that from personal 
experience). Sure, other platforms have their quirks, but for the most 
part, there are POSIX platforms including Linux, Solaris, and dozens of 
others, and then there is Windows.

> The full set of MS Common Controls no longer support the -D _MBCS, this 
> means I must compile in with -D UNICODE and -D _UNICODE, this makes all 
> the standard WINAPI to use 16-bit Unicode strings as well.

Wow. I knew Microsoft went out of their way to be incompatible with 
everyone else, but I hadn't heard this one before. Can't say I'm 
surprised though.

That's what you get for relying on the idiosyncrasies of a platform that 
is intentionally as incompatible as possible. You'd be much better off 
writing portable code in the first place. (This, incidentally, is a 
great way to learn just what a mess Microsoft's API's are, when you 
realize that you can write code that runs on any POSIX platform with 
minimal effort, but writing code that runs on Windows is a monumental pain.)

> Why is MS doing this - probably because they know 
> you're not supporting 16-bit Unicode and that will force people like me to
> drop plans to port to Linux/Solaris because it is just too hard.

I'd actually bet money that's exactly why they're doing it. Microsoft is 
well known for engineering incompatibility. Their business model is
based on lock-in.

Just in case I haven't repeated myself often enough, if you wrote for 
Linux first, you'd quickly discover that porting to Solaris/etc is 
trivial by comparison to porting to Microsoft.

> And to answer Scott's questions why must we follow the thousand pound 
> gorilla that is Microsoft? For the same reason that rain falls down, 
> because that is just how the world is.

Wrong. That's the way the world *was*. The world is changing (for the 
better)... :-)

What do you plan to do in a few years when Microsoft stops supporting 
everything that is not .NET?

-- 
Matthew
Person A: It's an ISO standard.
Person B: ...And that means what?
   --mal (http://theangryadmin.blogspot.com/2008/04/future.html)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-25 23:15         ` Matthew Woehlke
@ 2008-08-26  4:14           ` Dallas Clarke
  2008-08-26  6:03             ` Matthew Woehlke
  2008-08-26 18:29             ` Jim Cobban
  0 siblings, 2 replies; 35+ messages in thread
From: Dallas Clarke @ 2008-08-26  4:14 UTC (permalink / raw)
  To: gcc-help

Dear Matthew



This was never a debate between the merits of UTF-8 versus UTF-16 versus
UTF-32, nor was it a debate about the merits of Linux/Solaris versus
Windows; it was a discussion about the lack of standardized support for
UTF-16 in gcc, and with the move by Microsoft to deprecate UTF-8 in favour
of UTF-16, it would make life increasingly difficult for all gcc developers
to process UTF-16.



I can competently code in both a Linux/Solaris and Windows environment, I 
choose primarily Windows because I like writing software that interacts with 
people. Such social skills are surely lacking in other environments. And as 
much as you would like to believe that people like me are Windoze 
developers, it took me nearly 10 years to master the WinAPI plus MFC plus 
.NET architectures, but only about 2 months to master the Linux/Solaris 
APIs.



I like technology, I like change, if Microsoft deprecates the .NET 
architecture (which I think they have already done with .Net v3.5 XML / XSLT 
architecture), I will just spend the time upgrading to the new design - who 
wants to be a stick in the mud?



Yours sincerely,

Dallas.

http://www.ekkySoftware.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-26  4:14           ` Dallas Clarke
@ 2008-08-26  6:03             ` Matthew Woehlke
  2008-08-26 18:29             ` Jim Cobban
  1 sibling, 0 replies; 35+ messages in thread
From: Matthew Woehlke @ 2008-08-26  6:03 UTC (permalink / raw)
  To: gcc-help, DClarke

Dallas Clarke wrote:
> This was never a debate between the merits of UTF-8 versus UTF-16 versus
> UTF-32, nor was it a debate about the merits of Linux/Solaris versus
> Windows,

Sorry, but a line like:
> I won't bother repeating myself; it's not my responsibility to cure your dogma, it's just the end of me using GCC.

...comes across much too strongly as FLOSS-bashing. And when you follow 
it with something like:

> I choose primarily Windows because I like writing software that 
> interacts with people.

...the FUD detector just keeps ringing. (There's plenty of "software 
that interacts with people" on my Fedora system, thank you!)

But back on topic...

> it was a discussion about the lack of standardized support for
> UTF-16 in gcc, and with the move by Microsoft to deprecate UTF-8 in
> favour of UTF-16, it would make life increasingly difficult for all gcc
> developers to process UTF-16.

Ok, but clearly you've failed to convince us how this is a problem, or 
why we should bend over backwards to accommodate Microsoft's insistence 
on doing everything different.

> I can competently code in both a Linux/Solaris and Windows environment, 

Then what is the problem? Use a reasonable text-support library and be 
done with it.

> And as much as you would like to believe that people like 
> me are Windoze developers, it took me nearly 10 years to master the 
> WinAPI plus MFC plus .NET architectures, but only about 2 months to 
> master the Linux/Solaris APIs.

Then I guess we're doing something right. (I'd say, "for example, not 
moving the goalposts and making everyone start from scratch every 3 
years" but that would fail to account for the general quality 
difference. When you combine the two, well, the above statement speaks 
for itself.)

With that in mind, I would submit for your consideration that there may 
be a reason that "because that's how Microsoft does it" tends to be met 
with hostility when suggesting a change.

> I like technology, I like change, if Microsoft deprecates the .NET 
> architecture (which I think they have already done with .Net v3.5 XML / 
> XSLT architecture), I will just spend the time upgrading to the new 
> design - who wants to be a stick in the mud?

People that like compatibility? ;-) True, gcc (well, really glibc and 
g++) are rather schizophrenic in this respect, but at least I don't run 
into nearly the level of "writing for Microsoft's latest fad language" 
:-). One project I'm currently the main developer for is littered with 
comments from the early 1990's and to my knowledge, very little of that 
code (on POSIX platforms) has needed significant change in about 15 
years. Whereas the COM and ActiveX bits were developed later and have 
since been scrapped.

Personally, I don't like having to re-write entire code bases every five 
years because Microsoft has changed paradigms (again). My condolences 
that this seems to be the exact situation you are currently in.

-- 
Matthew
ENOWIT: .sig file for this machine not set up yet

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-24 19:11             ` Eljay Love-Jensen
@ 2008-08-26 14:50               ` Marco Manfredini
  0 siblings, 0 replies; 35+ messages in thread
From: Marco Manfredini @ 2008-08-26 14:50 UTC (permalink / raw)
  To: gcc-help

On Sunday 24 August 2008, Eljay Love-Jensen wrote:
> Maybe a future version of C and/or C++ will be more Unicode friendly.
It will!
-> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-26  4:14           ` Dallas Clarke
  2008-08-26  6:03             ` Matthew Woehlke
@ 2008-08-26 18:29             ` Jim Cobban
  2008-08-26 18:37               ` me22
  2008-08-26 18:54               ` Andrew Haley
  1 sibling, 2 replies; 35+ messages in thread
From: Jim Cobban @ 2008-08-26 18:29 UTC (permalink / raw)
  To: GCC-help



Dallas Clarke wrote:
> This was never a debate between the merits of UTF-8 versus UTF-16
> versus UTF-32, nor was it a debate about the merits of Linux/Solaris
> versus Windows; it was a discussion about the lack of standardized
> support for UTF-16 in gcc, and with the move by Microsoft to deprecate
> UTF-8 in favour of UTF-16, it would make life increasingly difficult
> for all gcc developers to process UTF-16.
The following is my understanding of how the industry got to the 
situation where this is an issue.

Back in the 1980s Xerox was the first company to seriously examine 
multi-lingual text handling, at their famous Palo Alto Research Centre, 
in the implementation of the Star platform.  Xerox PARC quickly focussed 
in on the use of a 16 bit representation for each character, which they 
believed would be capable of representing all of the symbols used in 
living languages.  Xerox called this representation Unicode.  Microsoft 
was one of the earliest adopters of this representation, since it 
naturally wanted to sell licenses for M$ Word to every human being on 
the planet.  Since the primary purpose of Visual C++ was to facilitate 
the implementation of Windows and M$ Office it incorporated support, 
using the wchar_t type, for strings of Unicode characters.  Later, after 
M$ had frozen their implementation, the ISO standards committee decided 
that in order to support processing of strings representing non-living 
languages (Akkadian cuneiform, Egyptian hieroglyphics, Mayan 
hieroglyphics, Klingon, Elvish, archaic Chinese, etc.) more than 16 bits 
were needed, so the adopted ISO 10646 standard requires a 32 bit word to 
hold every conceivable character.

The definition of a wchar_t string or std::wstring, even if a wchar_t is 
16 bits in size, is not the same thing as UTF-16.  A wchar_t string or 
std::wstring, as defined by by the C, C++, and POSIX standards, contains 
ONE wchar_t value for each displayed glyph.  Alternatively the value of 
strlen() for a wchar_t string is the same as the number of glyphs in the 
displayed representation of the string.

In these standards the size of a wchar_t is not explicitly defined 
except that it must be large enough to represent every text 
"character".  It is critical to understand that a wchar_t string, as 
defined by these standards, is not the same thing as a UTF-16 string, 
even if a wchar_t is 16 bits in size.  UTF-16 may use up to THREE 16-bit 
words to represent a single glyph, although I believe that almost all 
symbols actually used by living languages can be represented in a single 
word in UTF-16.  I have not worked with Visual C++ recently precisely 
because it accepts a non-portable language.  The last time I used it the 
M$ library was standards compliant, with the understanding that its 
definition of wchar_t as a 16-bit word meant the library could not 
support some languages.  If the implementation of the wchar_t strings in 
the Visual C++ library has been changed to implement UTF-16 internally, 
then in my opinion it is not compliant with the POSIX, C, and C++ standards.

Furthermore UTF-8 and UTF-16 should have nothing to do with the 
internals of the representation of strings inside a C++ program.  It is 
obviously convenient that  a wchar_t *  or std::wstring should contain 
one "word" for each external glyph, which is not true for either UTF-8 
or UTF-16.  UTF-8 and UTF-16 are standards for the external 
representation of text for transmission between applications, and in 
particular for writing files used to carry international text.  For 
example UTF-8 is clearly a desirable format for the representation of 
C/C++ programs themselves, because so many of the characters used in the 
language are limited to the ASCII code set, which requires only 8 bits 
to represent in UTF-8.  However once such a file is read into an 
application its contents should be represented internally using wchar_t 
* or std::wstring with fixed length words.  Full compliance with ISO 
10646 requires that internal representation to use at least 32 bit words 
although a practical implementation can get away with 16-bit words.
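
As a rough sketch of that boundary conversion (deliberately minimal: no
overlong-sequence or code-point range checking), reading UTF-8 into a
fixed-width internal form might look like:

#include <stdint.h>
#include <string>
#include <vector>
#include <stdexcept>

std::vector<uint32_t> decode_utf8(const std::string& in)
{
    std::vector<uint32_t> out;
    for (std::string::size_type i = 0; i < in.size(); ) {
        unsigned char b = in[i];
        uint32_t cp;
        unsigned len;
        if      (b < 0x80) { cp = b;        len = 1; }  // ASCII
        else if (b < 0xC0) { throw std::runtime_error("stray continuation byte"); }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        if (i + len > in.size())
            throw std::runtime_error("truncated sequence");
        for (unsigned k = 1; k < len; ++k) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);   // one fixed-width element per code point
        i += len;
    }
    return out;
}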

-- 
Jim Cobban   jcobban@magma.ca
34 Palomino Dr.
Kanata, ON, CANADA
K2M 1M1
+1-613-592-9438

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-26 18:29             ` Jim Cobban
@ 2008-08-26 18:37               ` me22
  2008-08-26 19:20                 ` me22
                                   ` (2 more replies)
  2008-08-26 18:54               ` Andrew Haley
  1 sibling, 3 replies; 35+ messages in thread
From: me22 @ 2008-08-26 18:37 UTC (permalink / raw)
  To: Jim Cobban; +Cc: GCC-help

On Tue, Aug 26, 2008 at 14:08, Jim Cobban <jcobban@magma.ca> wrote:
> The definition of a wchar_t string or std::wstring, even if a wchar_t is 16
> bits in size, is not the same thing as UTF-16.  A wchar_t string or
> std::wstring, as defined by by the C, C++, and POSIX standards, contains ONE
> wchar_t value for each displayed glyph.  Alternatively the value of strlen()
> for a wchar_t string is the same as the number of glyphs in the displayed
> representation of the string.
>

One wchar_t value for each codepoint -- glyphs can be formed from
multiple codepoints.  (Combining characters and ligatures, for
example.)

> In these standards the size of a wchar_t is not explicitly defined except
> that it must be large enough to represent every text "character".  It is
> critical to understand that a wchar_t string, as defined by these standards,
> is not the same thing as a UTF-16 string, even if a wchar_t is 16 bits in
> size.  UTF-16 may use up to THREE 16-bit words to represent a single glyph,
> although I believe that almost all symbols actually used by living languages
> can be represented in a single word in UTF-16.  I have not worked with
> Visual C++ recently precisely because it accepts a non-portable language.
>  The last time I used it the M$ library was standards compliant, with the
> understanding that its definition of wchar_t as a 16-bit word meant the
> library could not support some languages.  If the implementation of the
> wchar_t strings in the Visual C++ library has been changed to implement
> UTF-16 internally, then in my opinion it is not compliant with the POSIX, C,
> and C++ standards.
>

The outdated encoding that only supports codepoints 0x0000 through
0xFFFF is called UCS-2.  ( See http://en.wikipedia.org/wiki/UTF-16 )

~ Scott

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-26 18:29             ` Jim Cobban
  2008-08-26 18:37               ` me22
@ 2008-08-26 18:54               ` Andrew Haley
  2008-08-26 21:19                 ` me22
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Haley @ 2008-08-26 18:54 UTC (permalink / raw)
  To: Jim Cobban; +Cc: GCC-help

Jim Cobban wrote:

> Furthermore UTF-8 and UTF-16 should have nothing to do with the
> internals of the representation of strings inside a C++
> program.  It is obviously convenient that a wchar_t * or
> std::wstring should contain one "word" for each external glyph,
> which is not true for either UTF-8 or UTF-16.  UTF-8 and UTF-16
> are standards for the external representation of text for
> transmission between applications, and in particular for
> writing files used to carry international text.  For example
> UTF-8 is clearly a desirable format for the representation of
> C/C++ programs themselves, because so many of the characters
> used in the language are limited to the ASCII code set, which
> requires only 8 bits to represent in UTF-8.

Just in case anyone thinks that UTF-16 might be a good format for saving
data in files or for data to be sent over a network, here's a gem from
Microsoft:

'The example in the documentation didn't specify Little Endian, so
the Unicode string that the code generates is Big Endian.  The
SQL Server Driver for PHP expected Big Endian, so the data
written to SQL Server is not what was expected.  However, because
the code to retrieve the data converts the string from Big Endian
back to UTF-8, the resulting string in the example matches the
original string.

'If you change the Unicode charset in the example from "UTF-16"
to "UCS-2LE" or "UTF-16LE" in both calls to iconv, you'll still
see the original and resulting strings match but now you'll also
see that the code sends the expected data to the database.'

http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=3644735&SiteID=1
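
The ambiguity is easy to demonstrate with iconv(3); a minimal sketch (error 
handling omitted, glibc-style prototype assumed):

#include <iconv.h>
#include <cstdio>
#include <cstring>

int main()
{
    // "UTF-16" leaves the byte order (and a possible BOM) up to the
    // implementation; "UTF-16LE"/"UTF-16BE" pin it down.  Mixing the
    // two is exactly what bit the example quoted above.
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");   // tocode, fromcode
    char in[] = "hello";
    char out[64];
    char *inp = in;
    char *outp = out;
    size_t inleft = std::strlen(in);
    size_t outleft = sizeof out;
    iconv(cd, &inp, &inleft, &outp, &outleft);
    std::printf("%lu bytes of UTF-16LE produced\n",
                (unsigned long) (sizeof out - outleft));   // 10 on glibc
    iconv_close(cd);
    return 0;
}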

Andrew.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-26 18:54               ` Andrew Haley
@ 2008-08-26 21:19                 ` me22
  0 siblings, 2 replies; 35+ messages in thread
From: me22 @ 2008-08-26 21:19 UTC (permalink / raw)
  To: Andrew Haley; +Cc: GCC-help

On Tue, Aug 26, 2008 at 14:32, Andrew Haley <aph@redhat.com> wrote:
>
> Just in case anyone thinks that UTF-16 might be a good format for saving
> data in files or for data to be sent over a network, here's a gem from
> Microsoft:
>
> 'The example in the documentation didn't specify Little Endian, so
> the Unicode string that the code generates is Big Endian.  The
> SQL Server Driver for PHP expected Big Endian, so the data
> written to SQL Server is not what was expected.  However, because
> the code to retrieve the data converts the string from Big Endian
> back to UTF-8, the resulting string in the example matches the
> original string.
>
> 'If you change the Unicode charset in the example from "UTF-16"
> to "UCS-2LE" or "UTF-16LE" in both calls to iconv, you'll still
> see the original and resulting strings match but now you'll also
> see that the code sends the expected data to the database.'
>
> http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=3644735&SiteID=1
>

Absolutely.  UTF-8 is the only one without possible byte ordering
issues, so it (or UTF-7, if needed) is the only reasonable option for
interchange, since for text, size isn't that high anyways, and with
compression it's not bad at all.  (All the bytes in the UTF-8
representation of a codepoint are the same, for a language, except the
last and maybe second last, so even just a naive huffman can pretty
much eliminate the cost in size over UTF-16, since UTF-16 also has
those prelude bytes for specific languages.)

And really, since at a glyph level even UTF-32 is a variable-width
encoding, you have to think about it anyways, so I don't see why it's
worth not just using UTF-8 everywhere.  (For example, suppose you have
an s codepoint followed by a combining accent codepoint.  Pressing
"backspace" with the cursor after it should, probably, erase both
codepoints.  At the same time, if it's an ffi ligature, then probably
backspace should replace it with an ff ligature.  So since you can't
just do --size on your string anyways...)
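
Stepping back over one codepoint in UTF-8 is at least mechanical; a rough 
sketch (prev_codepoint is just an illustrative name, and deciding how many 
codepoints make up the glyph being erased is the separate, harder problem):

#include <string>

// Return the index where the codepoint ending at 'pos' begins, by
// skipping backwards over UTF-8 continuation bytes (10xxxxxx).
std::string::size_type prev_codepoint(const std::string &s,
                                      std::string::size_type pos)
{
    while (pos > 0 &&
           (static_cast<unsigned char>(s[pos - 1]) & 0xC0) == 0x80)
        --pos;
    return pos > 0 ? pos - 1 : 0;
}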

~ Scott

P.S.  Are there any architectures around using middle-endian UTF-32? ;)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-21  5:16 Dallas Clarke
  2008-08-21  9:30 ` me22
  2008-08-21 10:18 ` Andrew Haley
@ 2008-08-21 14:38 ` John Gateley
  2 siblings, 0 replies; 35+ messages in thread
From: John Gateley @ 2008-08-21 14:38 UTC (permalink / raw)
  To: gcc-help

On Thu, 21 Aug 2008 14:43:19 +1000
"Dallas Clarke" <DClarke@unwired.com.au> wrote:

> I have had to spend the last several days totally writing from scratch the 
> UTF-16 string functions, and realise that with a bit of common sense everything 
> can work out okay. Hopefully, with quick action to move wchar_t to 2 bytes 
> and create another type for 4-byte strings, we can see a lot of problems 
> solved. Maybe have UTF-16 strings with L"My String" and UTF-32 with LL"My 
> String" notations.
> 

Try -fshort-wchar; this will make wchar_t 2 bytes long.
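
A minimal way to check (the file name is just an example; note that objects 
built with -fshort-wchar are not link-compatible with libraries, including 
glibc's wide-character functions, that were built with a 4-byte wchar_t):

// g++ -fshort-wchar wchar_size.cpp
#include <cstdio>

int main()
{
    // With -fshort-wchar, wchar_t and L"..." literals are 16-bit;
    // without it they are 32-bit on GNU/Linux.
    std::printf("sizeof(wchar_t) = %lu\n",
                (unsigned long) sizeof(wchar_t));
    return 0;
}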

I also suggest writing your own string class that is portable, instead
of relying on Windows/Linux libraries. That will avoid incompatibilities
while at the same time giving you a much more powerful tool.

That said, I strongly agree with your sentiment that Windows and Linux
could be closer to each other, and the sentiment "it's the standard,
it's not going to change" pushes me to the ceiling too.

j
-- 
John Gateley <gateley@jriver.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
       [not found]   ` <004501c90350$87491330$0100a8c0@testserver>
@ 2008-08-21 12:49     ` me22
  0 siblings, 0 replies; 35+ messages in thread
From: me22 @ 2008-08-21 12:49 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: GCC-help

On Thu, Aug 21, 2008 at 01:41, Dallas Clarke <DClarke@unwired.com.au> wrote:
> The problem is that Vista is no longer fully supporting UTF-8, presumably
> because there are people in China would wish to use computers in their own
> native language - shock horror.
>

A quick peek at gnome-character-map shows that I can represent 㙡,
U+3661 CJK UNIFIED IDEOGRAPH-3661, in UTF-8 as 0xE3 0x99 0xA1.  In
fact, it gives UTF-8 encodings for all the codepoints through U+E01EF.

Are you sure you're talking about UTF-8 and not about 8-bit character
encoding tables à la ISO 8859?

~ Scott

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-21 11:50   ` Dallas Clarke
@ 2008-08-21 12:15     ` John Love-Jensen
  0 siblings, 0 replies; 35+ messages in thread
From: John Love-Jensen @ 2008-08-21 12:15 UTC (permalink / raw)
  To: Dallas Clarke, GCC-help

Hi Dallas,

> Thanks for your reply, but with pictorial languages such as Cantonese and
> Mandarin, which have up to 60,000 characters in the full set (one picture for
> each word), using locality page sheets with UTF-8 is limited.

UTF-8 does not use locality page sheets.  (Are you conflating UTF-8 and
Windows Code Pages?  À la the difference between the FooA() ACP routines, and
the FooW() Wide character routines?)

UTF-8 encodes Unicode characters from U+00000 to U+10FFFF in a variable
number of octets, 1 to 4 octets (1-4 bytes).  UTF-8 supports the entire
gamut of Unicode characters.

UTF-16 encodes Unicode characters from U+00000 to U+10FFFF in a variable
number of 16-bit chunks, 1 or 2 of them (2 or 4 bytes).

UTF-32 encodes Unicode characters from U+00000 to U+10FFFF in a single
32-bit chunk (4 bytes), with 11 of the 32 bits being fallow.
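
A rough sketch of that 1-to-4-octet UTF-8 encoding (encode_utf8 is just an 
illustrative name; surrogate and range validation omitted):

#include <stddef.h>
#include <stdint.h>

// Encode one codepoint (U+0000..U+10FFFF) as 1-4 UTF-8 octets.
// Returns the number of octets written to out[].
size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      // 0xxxxxxx
        out[0] = (unsigned char) cp;
        return 1;
    } else if (cp < 0x800) {              // 110xxxxx 10xxxxxx
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {            // 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    } else {                              // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
}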

> GCC and MS VC++ are now inconsistent with their wchar_t types and this
> difference will make it nearly impossible for us to continue supporting
> Linux, i.e. in a choice between Linux and Windows, I have to follow my
> customers.

GCC and MS VC++ are not inconsistent.  Both of those compilers comply with
the ABI of the platform that they target.

There is no requirement in any platform ABI that I work with that char be
UTF-8 and wchar_t be UTF-16 or UTF-32.

Perhaps what you need is to make your own character type (or, technically,
encoding unit type):

#include <stdint.h>  // for uint8_t, uint16_t, uint32_t

struct UTF8
{
  typedef uint8_t Type;   // one UTF-8 encoding unit (an octet)
  Type mEncodingUnit;
};

struct UTF16
{
  typedef uint16_t Type;  // one UTF-16 encoding unit
  Type mEncodingUnit;
};

struct UTF32
{
  typedef uint32_t Type;  // one UTF-32 encoding unit
  Type mEncodingUnit;
};

Or use a Unicode savvy library like ICU <http://www.icu-project.org/>.
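
With ICU that might look roughly like this (a sketch, untested; link against 
ICU's common library, e.g. -licuuc):

#include <unicode/unistr.h>
#include <cstdio>

int main()
{
    // "A" followed by U+10400 (a non-BMP character) as UTF-8 input;
    // UnicodeString stores UTF-16 internally.
    icu::UnicodeString s =
        icu::UnicodeString::fromUTF8("A\xF0\x90\x90\x80");
    std::printf("UTF-16 code units: %d, codepoints: %d\n",
                (int) s.length(), (int) s.countChar32());   // 3, 2
    return 0;
}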

> I am not trying to deny UTF-32 or saying that GCC should not support it, I
> am saying that GCC should support all three Unicode formats because UTF-16
> is a format that I have to deal with in the real world. Why not support all
> three formats?

GCC does not support Unicode.

Some libraries (that are not part of GCC) support Unicode.

Perhaps parts of the OS support Unicode, in some transformation format, with
their LANG environment, or Windows' 65001, 65005, 65006, 1200, 1201 code
pages, or Mac OS X's kCFStringEncodingUnicode, kCFStringEncodingUTF8,
kCFStringEncodingUTF16, kCFStringEncodingUTF16BE, kCFStringEncodingUTF16LE,
kCFStringEncodingUTF32, kCFStringEncodingUTF32BE, kCFStringEncodingUTF32LE.

The only computer languages that I'm aware of that support Unicode are:
+ Python 2.3 (somewhat, as an opt-in transition feature)
+ Python 2.5 (somewhat)
+ Python 3.0 (very well)
+ D Programming Language (very well)
+ Java (very well)

My favorite computer languages do NOT support Unicode "out of the box" (by
"support" I mean both Unicode source code and the ability to target Unicode
applications):
+ C
+ C++
+ Lua

With add-on libraries and/or OS API support, discipline, and a bit of luck,
those languages can target Unicode applications.

I can't see Lua supporting Unicode "out of the box" without increasing its
tiny embedded scripting engine footprint by over an order of magnitude.

> As someone who has written a scripting language based on C++, I can tell
> you that changing the 'wchar_t' to something else would only take five
> minutes - it wouldn't break anything.

It would break the OS ABI, which is defined by the OS, not by the compiler.

HTH,
--Eljay

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-21 10:18 ` Andrew Haley
@ 2008-08-21 11:50   ` Dallas Clarke
  2008-08-21 12:15     ` John Love-Jensen
  0 siblings, 1 reply; 35+ messages in thread
From: Dallas Clarke @ 2008-08-21 11:50 UTC (permalink / raw)
  To: gcc-help

Hello Andrew,

Thanks for your reply, but with pictorial languages such as Cantonese and
Mandarin, which have up to 60,000 characters in the full set (one picture for
each word), using locality page sheets with UTF-8 is limited.

GCC and MS VC++ are now inconsistent with their wchar_t types and this
difference will make it nearly impossible for us to continue supporting
Linux, i.e. in a choice between Linux and Windows, I have to follow my
customers.

I am not trying to deny UTF-32 or saying that GCC should not support it, I
am saying that GCC should support all three Unicode formats because UTF-16
is a format that I have to deal with in the real world. Why not support all
three formats?

As someone who has written a scripting language based on C++, I can tell
you that changing the 'wchar_t' to something else would only take five
minutes - it wouldn't break anything.

Dallas.


----- Original Message ----- 
From: "Andrew Haley" <aph@redhat.com>
To: "Dallas Clarke" <DClarke@unwired.com.au>
Cc: <gcc-help@gcc.gnu.org>
Sent: Thursday, August 21, 2008 7:28 PM
Subject: Re: UTF-8, UTF-16 and UTF-32


> Dallas Clarke wrote:
>
>> Now I have had the time to pull myself off the ceiling, I realise the
>> problem is that Unix/GCC is supporting both UTF-8 and UTF-32, while
>> Windows is supporting UTF-8 and UTF-16. And the solution is for both
>> Unix and Windows to support all three Unicode formats.
>>
>> I have had to spend the last several days totally writing from scratch
>> the UTF-16 string functions, and realise that with a bit of common sense
>> everything can work out okay. Hopefully, with quick action to move wchar_t
>> to 2 bytes and create another type for 4-byte strings, we can see a lot of
>> problems solved. Maybe have UTF-16 strings with L"My String" and UTF-32
>> with LL"My String" notations.
>
> Changing wchar_t would break the ABI.  It isn't going to happen.
>
>> I hope your steering committee can see that there will be lots of UTF-16
>> text files out there, with a lot of code required to be written to
>> process those files and while UTF-8 will not support many non-Latin-based
>> languages, UTF-32 will not support many non-human-based languages
>> - i.e. no signal system is fault free.
>
> I don't think that such a change can be decreed by the GCC SC.
>
> I don't understand your claim that "UTF-8 will not support many
> non-Latin-based languages".  UTF-8 <http://tools.ietf.org/html/rfc3629>
> supports
> everything from U+0000 to U+10FFFF.  While programs use a variety of
> internal representations of characters, successful transmission of data
> between machines requires a common interchange format, and UTF-8 is that
> format.
>
> Andrew.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-21  5:16 Dallas Clarke
  2008-08-21  9:30 ` me22
@ 2008-08-21 10:18 ` Andrew Haley
  2008-08-21 11:50   ` Dallas Clarke
  2008-08-21 14:38 ` John Gateley
  2 siblings, 1 reply; 35+ messages in thread
From: Andrew Haley @ 2008-08-21 10:18 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: gcc-help

Dallas Clarke wrote:

> Now I have had the time to pull myself off the ceiling, I realise the
> problem is that Unix/GCC is supporting both UTF-8 and UTF-32, while
> Windows is supporting UTF-8 and UTF-16. And the solution is for both
> Unix and Windows to support all three Unicode formats.
> 
> I have had to spend the last several days totally writing from scratch
> the UTF-16 string functions, and realise that with a bit of common sense
> everything can work out okay. Hopefully, with quick action to move wchar_t
> to 2 bytes and create another type for 4-byte strings, we can see a lot of
> problems solved. Maybe have UTF-16 strings with L"My String" and UTF-32
> with LL"My String" notations.

Changing wchar_t would break the ABI.  It isn't going to happen.

> I hope your steering committee can see that there will be lots of UTF-16
> text files out there, with a lot of code required to be written to
> process those files and while UTF-8 will not support many non-Latin-based
> languages, UTF-32 will not support many non-human-based languages
> - i.e. no signal system is fault free.

I don't think that such a change can be decreed by the GCC SC.

I don't understand your claim that "UTF-8 will not support many
non-Latin-based languages".  UTF-8 <http://tools.ietf.org/html/rfc3629> supports
everything from U+0000 to U+10FFFF.  While programs use a variety of
internal representations of characters, successful transmission of data
between machines requires a common interchange format, and UTF-8 is that
format.

Andrew.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: UTF-8, UTF-16 and UTF-32
  2008-08-21  5:16 Dallas Clarke
@ 2008-08-21  9:30 ` me22
       [not found]   ` <004501c90350$87491330$0100a8c0@testserver>
  2008-08-21 10:18 ` Andrew Haley
  2008-08-21 14:38 ` John Gateley
  2 siblings, 1 reply; 35+ messages in thread
From: me22 @ 2008-08-21  9:30 UTC (permalink / raw)
  To: Dallas Clarke; +Cc: gcc-help

On Thu, Aug 21, 2008 at 00:43, Dallas Clarke <DClarke@unwired.com.au> wrote:
>
> Now I have had the time to pull myself off the ceiling, I realise the
> problem is that Unix/GCC is supporting both UTF-8 and UTF-32, while Windows
> is supporting UTF-8 and UTF-16. And the solution is for both Unix and
> Windows to support all three Unicode formats.
>

Why is the solution to change Windows and GCC, rather than just use
the UTF-8 that's apparently already in both?  With combining
codepoints, even UTF-32 is effectively a variable-length encoding (at
the glyph level), so...

> I hope your steering committee can see that there will be lots of UTF-16
> text files out there, with a lot of code required to be written to process
> those files and while UTF-8 will not support many non-Latin-based
> languages, UTF-32 will not support many non-human-based languages - i.e. no
> signal system is fault free.
>

Huh?  It sounds like the latter part of that claims that UTF-16
supports more languages than UTF-8 and UTF-32, which is clearly wrong.

Though I've never seen the point in UTF-16 anyways.  It can't be
transported by things assuming 8-bit-clean ASCII anyways, and once
compressed (as any significant amount would be) isn't usefully smaller
than just using a fixed-length codepoint encoding.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* UTF-8, UTF-16 and UTF-32
@ 2008-08-21  5:16 Dallas Clarke
  2008-08-21  9:30 ` me22
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Dallas Clarke @ 2008-08-21  5:16 UTC (permalink / raw)
  To: gcc-help


Hello GCC,

Now I have had the time to pull myself off the ceiling, I realise the 
problem is that Unix/GCC is supporting both UTF-8 and UTF-32, while Windows 
is supporting UTF-8 and UTF-16. And the solution is for both Unix and 
Windows to support all three Unicode formats.

I have had to spend the last several days totally writing from scratch the 
UTF-16 string functions, and realise that with a bit of common sense everything 
can work out okay. Hopefully, with quick action to move wchar_t to 2 bytes 
and create another type for 4-byte strings, we can see a lot of problems 
solved. Maybe have UTF-16 strings with L"My String" and UTF-32 with LL"My 
String" notations.

I hope your steering committee can see that there will be lots of UTF-16 
text files out there, with a lot of code required to be written to process 
those files and while UTF-8 will not support many non-Latin-based 
languages, UTF-32 will not support many non-human-based languages - i.e. no 
signal system is fault free.

Thanks,
Dallas
http://www.ekkySoftware.com/ 

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2008-08-26 20:29 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <002901c903df$08265510$3b9c65dc@testserver>
2008-08-22 14:54 ` UTF-8, UTF-16 and UTF-32 Eljay Love-Jensen
2008-08-23  2:00   ` Dallas Clarke
2008-08-23  2:24     ` me22
2008-08-23  2:45       ` Dallas Clarke
2008-08-23  3:06         ` me22
2008-08-23  3:52           ` Dallas Clarke
2008-08-23  4:31             ` Brian Dessent
2008-08-23 11:33     ` Andrew Haley
2008-08-23 21:41     ` Eljay Love-Jensen
2008-08-24  0:41       ` Dallas Clarke
2008-08-24  4:02         ` me22
2008-08-24  5:53         ` corey taylor
2008-08-24  6:02           ` Dallas Clarke
2008-08-24 11:11             ` me22
2008-08-24 19:11             ` Eljay Love-Jensen
2008-08-26 14:50               ` Marco Manfredini
2008-08-25 23:15         ` Matthew Woehlke
2008-08-26  4:14           ` Dallas Clarke
2008-08-26  6:03             ` Matthew Woehlke
2008-08-26 18:29             ` Jim Cobban
2008-08-26 18:37               ` me22
2008-08-26 18:54               ` Andrew Haley
2008-08-26 21:19                 ` me22
2008-08-21  5:16 Dallas Clarke
2008-08-21  9:30 ` me22
     [not found]   ` <004501c90350$87491330$0100a8c0@testserver>
2008-08-21 12:49     ` me22
2008-08-21 10:18 ` Andrew Haley
2008-08-21 11:50   ` Dallas Clarke
2008-08-21 12:15     ` John Love-Jensen
2008-08-21 14:38 ` John Gateley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).