Date: Sun, 24 Aug 2008 11:11:00 -0000
From: me22
To: "Dallas Clarke"
Cc: "corey taylor", "Eljay Love-Jensen", GCC-help
Subject: Re: UTF-8, UTF-16 and UTF-32

On Sun, Aug 24, 2008 at 01:53, Dallas Clarke wrote:
> At the risk of sounding repetitive, the problems are:
> 1) using wchar_t for 4-byte strings stuffs up function overloading -
> for example, problems with shared libraries written with a different
> 2-byte string type (i.e., short, unsigned short,
> struct UTF16 { typedef uint16_t Type; Type mEncodingUnit; };, etc.)

Of course, people relying on sizeof(short)*CHAR_BIT being 16 aren't
technically portable anyways... And the complaint cuts both ways:

  using wchar_t for 2-byte strings stuffs up function overloading -
  for example, problems with shared libraries written with a different
  4-byte string type (i.e., int, unsigned long,
  struct UTF32 { typedef uint32_t Type; Type mEncodingUnit; };, etc.)

> 2) casting errors from declaring strings as unsigned short string[] =
> {'H','e','l','l','o',' ','W','o','r','l','d',0} or unsigned short
> *string = (unsigned short*)"H\0e\0l\0l\0o\0 \0W\0o\0r\0l\0d\0\0";

Why do you care how you're defining strings? Anything that needs
localization belongs in an external "resource file" anyways, and
anything in the code can be done perfectly well with
wstring s = L"Whatever you want here, including non-ASCII", and the
compiler will be fine with it, so long as your locale is consistent.

I'm really not convinced that this is a real problem, since there are
many C++ projects using wxWidgets that compile fine in "Unicode" mode
on both Windows and Linux, where writing a localizable string just
means saying _T("your text here") and everything works perfectly well.

> 3) pointer arithmetic bugs and other portability issues consuming
> time, money and resources.

Anyone doing explicit pointer manipulation on strings in
application-level code deserves their problems. And UTF-32 actually
has fewer possible places for pointer errors, since incrementing a
pointer actually moves to the next codepoint, something that neither
UTF-8 nor UTF-16 allows.
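Here's a minimal sketch of that pointer-arithmetic point (mine, not
something from the earlier thread), assuming a platform where wchar_t
is a 32-bit unit holding UTF-32, as with glibc:

  #include <cstdio>

  int main()
  {
      // "héllo" -- 'é' is U+00E9, encoded as two bytes (0xC3 0xA9)
      // in UTF-8, but as a single element in UTF-32.
      const unsigned char utf8[]  = { 'h', 0xC3, 0xA9, 'l', 'l', 'o', 0 };
      const wchar_t       utf32[] = L"h\u00E9llo";

      // Incrementing a UTF-32 pointer always lands on the next
      // codepoint:
      const wchar_t *w = utf32 + 1;  // points at U+00E9
      ++w;                           // points at 'l'

      // Incrementing a UTF-8 pointer can land mid-sequence:
      const unsigned char *p = utf8 + 1;  // 0xC3, lead byte of U+00E9
      ++p;                                // 0xA9, a continuation byte --
                                          // not a character boundary

      std::printf("*w = U+%04X ('l'), *p = 0x%02X (not a codepoint)\n",
                  (unsigned)*w, (unsigned)*p);
      return 0;
  }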
> 4) no standard library for string functions, creating different
> behaviours from different implementations. And the standard C-Library
> people will not implement the string routines until there is a
> standard type for 16-bit strings offered by the compiler.

The ISO standards for C and C++ may not provide it, but that's
certainly not something that GCC can change. If this is a problem, you
should have submitted a proposal to the Standards Committees.
Regardless, there are plenty of mature, open, and free libraries for
Unicode (ICU, for example).

> The full set of MS Common Controls no longer supports -D _MBCS; this
> means I must compile with -D UNICODE and -D _UNICODE, which makes all
> the standard WINAPI use 16-bit Unicode strings as well. Rather than
> constantly convert between UTF-8 and 16-bit Unicode, I am moving
> totally to 16-bit Unicode. Why is MS doing this? Probably because
> they know you're not supporting 16-bit Unicode, and that will force
> people like me to drop plans to port to Linux/Solaris because it is
> just too hard.

Well, "MS Common Controls" obviously aren't available on Linux either.
Since you'd need to change basically your whole GUI to port it anyway,
you'd be using a cross-platform library (like wxWidgets) that handles
all this for you.

> Once again, there are no legacy issues because no one is currently
> using 16-bit Unicode in GCC; it does not exist. Adding such support
> will not break anything. I am not arguing to stop support for 32-bit
> Unicode. Secondly, object code does not use the label "wchar_t",
> meaning the change would only force people to do a global search and
> replace of "wchar_t" with "long wchar_t" before their next compile.
> Quite a simple change compared to what I must do to support 16-bit
> strings in GCC.

On the compilation side, sure, though I really don't think
search-and-replace works as well as you think it does in that
situation. (Certainly s/int/long int/ isn't safe.)

But that would introduce a huge number of issues at the binary level.
Object code may not use the label "wchar_t", but the ABI still bakes
in its size: changing the C library on a box to your proposed new one
would break every single binary there that used its wchar_t functions,
for example.

> It would be nice to substitute "long wchar_t" for "wchar_t", as it
> would not only be consistent with MS VC++, but also with the
> definitions of double and long double, long and long long, and
> integer literals such as 123456789012LL. Using S"String" or
> U"String" would be too confusing with signed and unsigned.

BTW, are you following the standards process for C++0x? See
http://herbsutter.spaces.live.com/blog/cns!2D4327CC297151BB!214.entry
or the actual paper,
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html

u"string" and U"string" are actually good choices because they mirror
the \u and \U escape sequences in wide-character strings from C++98.
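For the curious, here's a sketch of what the N2249 types look like in
use. This is the proposal's syntax, not anything GCC accepted at the
time (the feature was eventually adopted, so a C++11 compiler will
take it):

  #include <cstdio>

  int main()
  {
      // u"" literal: U+1D11E is outside the BMP, so it needs a
      // surrogate pair -- "x\U0001D11E" is 3 char16_t elements plus
      // the terminator...
      const char16_t s16[] = u"x\U0001D11E";
      // ...while the same string is 2 char32_t elements plus the
      // terminator in a U"" literal:
      const char32_t s32[] = U"x\U0001D11E";

      std::printf("char16_t elements: %u, char32_t elements: %u\n",
                  (unsigned)(sizeof s16 / sizeof *s16 - 1),   // prints 3
                  (unsigned)(sizeof s32 / sizeof *s32 - 1));  // prints 2
      return 0;
  }

Note how the split mirrors the 4-hex-digit \u versus 8-hex-digit \U
escapes that C++98 already has for wide strings.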
> The issues of confusion between Unicode Text Format (UTF) 8, 16 and
> 32 are not only mine, but as pointed out earlier, they are constantly
> changing. The 16-bit string is a format I am forced to deal with and
> there is no support from GCC at all. I can't tell you if MS Unicode
> is the older style fixed 16-bit or the newer multibyte type similar
> to the UTF-8 definition.

I think you need to re-read the standards. First of all, UTF is
"Unicode Transformation Format", not Text Format. The standards are
quite specific about what UTF-8, UTF-16, and UTF-32 are. There are
other encodings which are different -- UCS-2, for example, which is
exactly the older fixed-width 16-bit form you mention, the one that
UTF-16 (with its surrogate pairs) superseded.

> And in case you don't already know, MS VC++ compiles source code
> written in 16-bit Unicode, allowing function names, variables and
> strings to be written in 16-bit Unicode. This means that more and
> more files/data are going to be 16-bit Unicode. Developers like
> myself are going to have to deal with the format, whether we like it
> or not.

And GCC quite happily compiles code written in UTF-8, allowing
variable names and strings to be written in Unicode. (Note that
there's no such thing as "16-bit Unicode".) I suspect it'll compile
UTF-32 code as well, with similar results.

> So I have to ask - what are your arguments for not providing support
> for all three, 8-bit, 16-bit and 32-bit Unicode strings?

First, none of those things exist. But I don't think I ever said that
providing support for UTF-8, UTF-16, and UTF-32 is such a terrible
thing, though I do think that it's somewhat pointless, since there are
already mature, capable libraries that do what you need, and the
cost/benefit ratio of providing it in the compiler is far too high.
(And it's completely undesirable on many of the embedded platforms
that GCC, C, and C++ support.)

Why not change wchar_t to UTF-16? Largely because while it might make
a vapourware project of yours easier, it creates the same problem
you're having now with Microsoft dropping UTF-8 support, except
without a feasible upgrade path. Also, UTF-32 is more convenient to
deal with, as illustrated by the strchr example. (Though technically
Unicode fundamentally has to be handled in terms of multi-element
characters because of irreducible combining codepoints, hardly
anything actually supports those, so pretending that UTF-32 is one
character per element is safe in practice -- see the sketch at the end
of this message. I don't think the fonts that come with Windows even
include glyphs for combining codepoints. That said, SIL WorldPad and a
few others actually do things properly, so really one might as well
just use UTF-8 everywhere.)

> P.S. I suggest that the strings default to the same type as the
> underlying file format; otherwise it can be overridden by expressly
> stating: A"String", L"String", LL"String".

Terrible idea, since it means that the legality of
char const *p = "hello world"; changes depending on the encoding of my
file. Encoding is just that, and shouldn't change semantics. (Just
like whitespace.) An image should look the same whether it's saved as
PNG or TGA, and a program should do the same thing whether it's saved
as UTF-16 or UTF-32.
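And, as promised, a minimal sketch of that combining-codepoint caveat
(mine, again assuming a 32-bit wchar_t as on glibc): even in UTF-32,
one user-perceived character can span multiple elements.

  #include <cstdio>
  #include <cwchar>

  int main()
  {
      const wchar_t precomposed[] = L"\u00E9";   // é as one codepoint
      const wchar_t combining[]   = L"e\u0301";  // 'e' followed by U+0301
                                                 // COMBINING ACUTE ACCENT

      // Both display as the same character, but the element counts
      // differ, so "one character per element" is only an
      // approximation even for UTF-32:
      std::printf("precomposed: %u element(s), combining: %u element(s)\n",
                  (unsigned)std::wcslen(precomposed),   // prints 1
                  (unsigned)std::wcslen(combining));    // prints 2
      return 0;
  }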