Message-ID: <2e393d080808232101nc339585xa13d8f26082161e0@mail.gmail.com>
Date: Sun, 24 Aug 2008 05:53:00 -0000
From: "corey taylor"
To: "Dallas Clarke"
Cc: "Eljay Love-Jensen", GCC-help
Subject: Re: UTF-8, UTF-16 and UTF-32
In-Reply-To: <000c01c90568$d436d230$3b9c65dc@testserver>
References: <000c01c90568$d436d230$3b9c65dc@testserver>

On Sat, Aug 23, 2008 at 5:40 PM, Dallas Clarke wrote:
> I wont bother repeating myself, it not my responsibility to cure your dogma,
> it just the end of me using GCC. I am sure that many other developers will
> run into the same problem and choose the same solution.
>

I think you're failing to convince most people because many of your
arguments definitely require repeating and further discussion.

You're obviously dealing with portability issues - both at the compiler
level and the ABI level (and there are others, I guess, depending on
your needs).

If I understand your two key issues correctly, they are:

1. You want source code that has Unicode support.
2. You want to be able to process Unicode in C++ and the runtime
   libraries.

They are different issues. I think you understand that, but some of
your replies have both solutions lumped together.

I would like to respond to #2.

Your initial email was confusing, and not as authoritative as you might
have thought. You made arguments against UTF-8 and UTF-32 that others
here didn't follow, and your response seems to have been simply to
restate that you want UTF-16 support.

Can you really not represent everything you want in UTF-8? That seems
unlikely, considering it's designed to represent every Unicode
character.

Your comment on UTF-32 was odd, to say the least. Sure, if we ever have
a textual representation that contains a quadrillion characters then we
will have to redesign how we encode it (exaggerated). UTF-16 also needs
multi-unit sequences (surrogate pairs) to represent everything, so it's
nothing special beyond the fact that it is already in use, as you said.

Now, as far as your problems above, you should look at encoding as a
design issue and not a compiler issue as far as C++ goes. The compiler
implements wchar_t in a way that represents all of the characters as
required - as msvc and gcc are not developed in tandem, they obviously
came up with different requirements at different times.

If you think of it at just the compiler level, you're setting your
design up for failure: it won't be portable, not only to different
systems but also to upgrades of existing compilers and language specs.

What I mean by that is that encoding should definitely be handled in a
layer above the system you are working on. Even if both compilers
implemented the same wchar_t, it doesn't mean that every API you use
will use that wchar_t.

So, what you need to do is find a way to represent your data and then
map it to the system, API, etc. that you're using. You never know what
display or renderer you'll need to use or what system you'll need to
interface with.

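A minimal sketch of that kind of mapping layer, assuming the data is
stored as UTF-8 and converted to the platform's wchar_t only at the
boundary. The names decode_utf8_at and utf8_to_wide are hypothetical,
not part of any library, and validation is deliberately thin (overlong
forms and encoded surrogates are not rejected). It emits one unit per
code point where wchar_t is at least 4 bytes wide, and UTF-16 surrogate
pairs where it is narrower:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at s[i] and
// advance i past it. Only lead bytes, continuation bytes, and truncation
// are checked; overlong forms and encoded surrogates are NOT rejected.
char32_t decode_utf8_at(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;  // number of continuation bytes expected
    char32_t cp = 0;
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (i + extra >= s.size())
        throw std::runtime_error("truncated UTF-8 sequence");

    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Hypothetical boundary function: convert stored UTF-8 into whatever
// wchar_t means on this platform, decided by sizeof(wchar_t).
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const char32_t cp = decode_utf8_at(utf8, i);
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            // Fits in a single unit on this platform.
            out.push_back(static_cast<wchar_t>(cp));
        } else {
            // Narrow wchar_t: split into a UTF-16 surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

Application code written against utf8_to_wide does not need to know
whether wchar_t is 2 or 4 bytes; only this one function does. The same
pattern would apply to any other API boundary (database, file format,
OS call) that wants a different encoding.
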
I have a couple of comments on your gcc modification solution.

1. Modifying wchar_t to be 2 bytes and then making the L prefix create
   2-byte UTF-16 constants means that gcc users could no longer rely
   on constant lengths like before. And if it is just as easy as you
   indicated, that's also an indication that it's probably something
   that should only be touched carefully. Any code relying on the
   current gcc implementation would be broken.

2. Creating a new type, long wchar_t, as a solution to compatibility?
   You're just asking for the same issue.

You mentioned needing to read stored data and presumably write it back.
I saw a mention of a text file and a mention of a database. UTF-8 seems
exceptionally well suited to encoding your data!

How can you know for certain that all of your input will be in the same
format? How will having a 2-byte wchar_t in GCC solve all of your
problems? GCC only controls types, not the storage or implementation of
any library or OS.

I think you have to expect to write an encoder for your data and to
provide a layer around your unique systems (data files, databases,
constants, OS). Linux and Windows themselves certainly aren't going to
be completely compatible. You could make implementations portable but
never fully compatible!

Just my thoughts after reading through this. I think several people
here would be interested in discussing solutions with you that make
sense at all levels.

(A quick note about your issue #1: I think it would be very confusing
for source file encoding to be based on what a user typed. It should be
constant or configured in a more visible way.)

corey