From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 21 Aug 2008 12:15:00 -0000
Subject: Re: UTF-8, UTF-16 and UTF-32
From: John Love-Jensen
To: Dallas Clarke, GCC-help
In-Reply-To: <000d01c90377$1e0c1670$3b9c65dc@testserver>
Content-type: text/plain; charset="US-ASCII"
Mailing-List: contact gcc-help-help@gcc.gnu.org; run by ezmlm
X-SW-Source: 2008-08/txt/msg00211.txt.bz2

Hi Dallas,

> Thanks for your reply, but with Pictorial languages such as Cantonese and
> Mandarin, that have up to 60,000 character in the full set (one picture for
> each
> word), using locality page sheets with UTF-8 is limited.

UTF-8 does not use locality page sheets.  (Are you conflating UTF-8 and
Windows Code Pages?  A la the difference between the FooA() ACP routines and
the FooW() wide character routines?)

UTF-8 encodes Unicode characters from U+0000 to U+10FFFF in a variable
number of octets, 1 to 4 octets (1-4 bytes).  UTF-8 supports the entire
gamut of Unicode characters.

UTF-16 encodes Unicode characters from U+0000 to U+10FFFF in a variable
number of 16-bit chunks, 1 or 2 of them (2 or 4 bytes).

UTF-32 encodes Unicode characters from U+0000 to U+10FFFF in a single
32-bit chunk (4 bytes), with 11 of the 32 bits being fallow.

> GCC and MS VC++ are now inconsistent with their wchar_t types and this
> difference will make it nearly impossible for us to continue supporting
> Linux, i.e. in a choice between Linux and Windows, I have to follow my
> customers.

GCC and MS VC++ are not inconsistent.  Both of those compilers comply with
the ABI of the platform that they target.

There is no requirement in any platform's ABI that I work with that char be
UTF-8 and wchar_t be UTF-16 or UTF-32.

Perhaps what you need is to make your own character type (or, technically,
encoding unit type):

    #include <stdint.h>  // uint8_t, uint16_t, uint32_t

    // One encoding unit (not one character) per struct.
    struct UTF8
    {
      typedef uint8_t Type;
      Type mEncodingUnit;
    };

    struct UTF16
    {
      typedef uint16_t Type;
      Type mEncodingUnit;
    };

    struct UTF32
    {
      typedef uint32_t Type;
      Type mEncodingUnit;
    };

Or use a Unicode-savvy library like ICU.

> I am not trying to deny UTF-32 or saying that GCC should not support it, I
> am saying that GCC should support all three Unicode formats because UTF-16
> is a format that I have to deal with in the real world. Why not support all
> three formats?

GCC does not support Unicode.  Some libraries (that are not part of GCC)
support Unicode.
Perhaps parts of the OS support Unicode, in some transformation format, with
their LANG environment, or Windows's 65001, 65005, 65006, 1200, 1201 code
pages, or Mac OS X's kCFStringEncodingUnicode, kCFStringEncodingUTF8,
kCFStringEncodingUTF16, kCFStringEncodingUTF16BE, kCFStringEncodingUTF16LE,
kCFStringEncodingUTF32, kCFStringEncodingUTF32BE, kCFStringEncodingUTF32LE.

The only computer languages that I'm aware of that support Unicode are:
+ Python 2.3 (somewhat, as an opt-in transition feature)
+ Python 2.5 (somewhat)
+ Python 3.0 (very well)
+ D Programming Language (very well)
+ Java (very well)

My favorite computer languages do NOT support Unicode "out of the box" (by
"support" I mean both Unicode source code and the ability to target Unicode
applications):
+ C
+ C++
+ Lua

With add-on libraries and/or OS API support, discipline, and a bit of luck,
those languages can target Unicode applications.

I can't see Lua supporting Unicode "out of the box" without increasing its
tiny embedded scripting engine footprint by over an order of magnitude.

> As someone with has written a scripting language based on C++, I can tell
> you that changing the 'wchar_t' to something else would only take five
> minutes - it wouldn't break any thing.

It would break the OS ABI, which is defined by the OS, not by the compiler.

HTH,
--Eljay
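
P.S. To make the 1-to-4-octet and 1-or-2-chunk counts concrete, here is a
minimal sketch of encoding a single code point by hand.  It assumes a C++11
compiler, and the names is_scalar_value, encode_utf8 and encode_utf16 are
hypothetical helpers made up for illustration; they are not part of GCC, ICU,
or any OS API.  Note that it never touches wchar_t, so it should behave the
same whether the platform ABI makes wchar_t 16 bits or 32 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only (not GCC, ICU, or OS API).

// True for U+0000..U+10FFFF, excluding the surrogate range U+D800..U+DFFF.
inline bool is_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Encode one code point as 1 to 4 UTF-8 octets.
inline std::string encode_utf8(std::uint32_t cp) {
    assert(is_scalar_value(cp));
    std::string out;
    if (cp < 0x80) {                       // 1 octet:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 octets: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 octets: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as 1 or 2 UTF-16 code units (host byte order).
inline std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    assert(is_scalar_value(cp));
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {                    // 1 unit: the code point itself
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {                               // 2 units: high + low surrogate
        std::uint32_t v = cp - 0x10000;    // 20 bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
    }
    return out;
}
```

For example, encode_utf8(0x4E2D) yields the three octets E4 B8 AD, and
encode_utf16(0x1F600) yields the surrogate pair D83D DE00.  For real work a
library such as ICU is still the better choice; this only illustrates why the
encoding unit type need not be tied to char or wchar_t.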