From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-249837-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 15327 invoked by alias); 12 Apr 2008 04:40:53 -0000
Received: (qmail 15232 invoked by alias); 12 Apr 2008 04:40:08 -0000
Date: Sat, 12 Apr 2008 04:40:00 -0000
Message-ID: <20080412044008.15231.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References: <bug-35908-6@http.gcc.gnu.org/bugzilla/>
Subject: [Bug c/35908] Dubious charset conversions
In-Reply-To: <bug-35908-6@http.gcc.gnu.org/bugzilla/>
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "neil at daikokuya dot co dot uk" <gcc-bugzilla@gcc.gnu.org>
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2008-04/txt/msg00868.txt.bz2


------- Comment #2 from neil at daikokuya dot co dot uk  2008-04-12 04:40 -------
Subject: Re:  Dubious charset conversions

joseph at codesourcery dot com wrote:-

> > GCC accepts the following with -ansi -pedantic -Wall without diagnostics
> > 
> > #include <stdlib.h>
> > wchar_t z[] = L"a" "\xff";
> > 
> > GCC claims a default execution charset of UTF-8; presumably the default
> > execution wide character set is UTF-32.  But "\xff" is a two-character narrow
> > execution character set string literal, with characters \xff \0, which is
> > invalid UTF-8 and so cannot be converted in a meaningful way to the execution
> > character set (whatever it is).
> > 
> > I would expect the above code to be rejected, or at least diagnosed.
> 
> Accepting it as equivalent to L"a\xff" (generating a wide character L'a' 
> followed by one with value 0xff) seems in accordance with the principles 
> of N951, the relevant ones of which are implemented in GCC.
> 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n951.htm
> http://gcc.gnu.org/ml/gcc-patches/2003-07/msg00532.html

Ah, I'd forgotten about that.  That document does make much more
sense, thanks.  However, I think there are at least two things wrong
in "Principle 7"; I've mailed Clive about those.  [For example, the
single-byte requirement cannot be fulfilled when converting from a
Latin source charset to a UTF-8 target, and UCNs are escape sequences
that typically cannot be encoded as a single byte.]
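
To make the single-byte point concrete, here is a minimal sketch,
assuming "Latin" means ISO 8859-1.  The helper name latin1_to_utf8 is
hypothetical and is not GCC code; it only shows that any source byte
of 0x80 or above needs two bytes in a UTF-8 target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of GCC): convert one ISO 8859-1 byte
   to UTF-8.  Returns the number of bytes written to out (1 or 2).  */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    assert(out != NULL);
    if (c < 0x80) {
        /* ASCII range: identical single byte in both charsets.  */
        out[0] = c;
        return 1;
    }
    /* 0x80..0xFF map to U+0080..U+00FF, which need the two-byte form
       110xxxxx 10xxxxxx.  */
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}
```

For instance, the Latin-1 byte 0xE9 comes out as the two bytes 0xC3
0xA9, so it cannot stay a single byte after conversion.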

GCC should perhaps consider not creating invalid UTF-8 (i.e. no 5- or
6-byte forms, and no encodings of \ufffe, \uffff, etc.).
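
As a rough sketch of what such a check could look like (this is not
GCC's actual code, and utf8_encode_strict is a hypothetical name), an
encoder can simply refuse the problematic values.  Exactly what "etc."
should cover is an assumption here: the sketch rejects anything above
U+10FFFF, the surrogates, and the U+xxFFFE / U+xxFFFF noncharacters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of GCC): encode code point cp as
   UTF-8.  Returns the number of bytes written (1 to 4), or 0 if the
   code point is refused.  */
size_t utf8_encode_strict(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    /* Beyond the Unicode range; this also rules out every value that
       would need a 5- or 6-byte form.  */
    if (cp > 0x10FFFF)
        return 0;
    /* Surrogates (assumption: treated as invalid here as well).  */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    /* Noncharacters ending in FFFE or FFFF, in any plane.  */
    if ((cp & 0xFFFE) == 0xFFFE)
        return 0;

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

A caller would treat a return value of 0 as "diagnose or reject"
rather than emitting bytes.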

Please feel free to close this report.

Neil.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35908