From: Andrew Haley
To: Dallas Clarke
CC: gcc-help@gcc.gnu.org
Subject: Re: UTF-8, UTF-16 and UTF-32
Date: Thu, 21 Aug 2008 10:18:00 -0000
Message-ID: <48AD3552.4040206@redhat.com>
In-Reply-To: <001e01c90348$71a299a0$0100a8c0@testserver>

Dallas Clarke wrote:
> Now I have had the time to pull myself off the ceiling, I realise the
> problem is that Unix/GCC is supporting both UTF-8 and UTF-32, while
> Windows is supporting UTF-8 and UTF-16. And the solution is for both
> Unix and Windows to support all three Unicode formats.
>
> I have had to spend the last several days totally writing from scratch
> the UTF-16 string functions, and realise that with a bit of common sense
> every thing can work out okay. Hopefully quick action to move wchar_t to
> 2 bytes and create another type for 4 byte strings, we can see a lot of
> problems solved.
> Maybe have UTF-16 strings with L"My String" and UTF-32
> with LL"My String" notations.

Changing wchar_t would break the ABI. It isn't going to happen.

> I hope your steering committee can see that there will be lots of UTF-16
> text files out there, with a lot of code required to be written to
> process those files and while UTF-8 will not support many none Latin
> based languages, UTF-32 will not support many none Human base languages
> - i.e. no signal system is fault free.

I don't think that such a change can be decreed by the GCC SC.

I don't understand your claim that "UTF-8 will not support many none
Latin based languages". UTF-8 supports everything from U+0000 to
U+10FFFF.

While programs use a variety of internal representations of characters,
successful transmission of data between machines requires a common
interchange format, and UTF-8 is that format.

Andrew.
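
To make the U+0000 to U+10FFFF point concrete, here is a minimal C sketch
of a UTF-8 encoder. The name utf8_encode is hypothetical (it is not a GCC
or glibc function), and the sketch assumes the caller supplies a 4-byte
buffer. It rejects the surrogate range U+D800 to U+DFFF, which is reserved
for UTF-16 and does not denote characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns the byte count, or 0 if
   cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)   /* UTF-16 surrogates are not encodable */
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode code space */
}
```

For example, utf8_encode(0x20AC, buf) returns 3 and fills buf with the
bytes E2 82 AC, and the last code point, U+10FFFF, encodes as F4 8F BF BF.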