From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 15327 invoked by alias); 12 Apr 2008 04:40:53 -0000
Received: (qmail 15232 invoked by alias); 12 Apr 2008 04:40:08 -0000
Date: Sat, 12 Apr 2008 04:40:00 -0000
Message-ID: <20080412044008.15231.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References:
Subject: [Bug c/35908] Dubious charset conversions
In-Reply-To:
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "neil at daikokuya dot co dot uk"
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2008-04/txt/msg00868.txt.bz2

------- Comment #2 from neil at daikokuya dot co dot uk 2008-04-12 04:40 -------
Subject: Re: Dubious charset conversions

joseph at codesourcery dot com wrote:-

> > GCC accepts the following with -ansi -pedantic -Wall without diagnostics
> >
> > #include
> > wchar_t z[] = L"a" "\xff";
> >
> > GCC claims a default execution charset of UTF-8; presumably the default
> > execution wide character set is UTF-32. But "\xff" is a two-character narrow
> > execution character set string literal, with characters \xff \0, which is
> > invalid UTF-8 and so cannot be converted in a meaningful way to the execution
> > character set (whatever it is).
> >
> > I would expect the above code to be rejected, or at least diagnosed.
>
> Accepting it as equivalent to L"a\xff" (generating a wide character L'a'
> followed by one with value 0xff) seems in accordance with the principles
> of N951, the relevant ones of which are implemented in GCC.
>
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n951.htm
> http://gcc.gnu.org/ml/gcc-patches/2003-07/msg00532.html

Ah, I'd forgotten about that. That document does make much more sense,
thanks. However I think there are at least two things wrong in
"Principle 7"; I've mailed Clive about those.
[The single-byte requirement cannot be fulfilled for a Latin source
charset with a UTF-8 target, for example, and UCNs are escape sequences
that typically cannot be encoded as a single byte.]

GCC should perhaps consider not creating invalid UTF-8 (i.e. no 5- or
6-byte forms, or encodings of \ufffe, \uffff etc.).

Please feel free to close this report.

Neil.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35908
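
To make the "no invalid UTF-8" suggestion concrete, here is a minimal
sketch of a strict encoder. It is not GCC's code: the names
is_encodable and utf8_encode_strict are hypothetical, and reading the
"etc." above as "surrogates plus the U+xxFFFE / U+xxFFFF value in every
plane" is an assumption. Capping at U+10FFFF rules out the 5- and
6-byte forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy check (assumption, not GCC's behaviour):
   accept only code points that a strict encoder should emit. */
static int is_encodable(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;   /* beyond Unicode; would need a 5- or 6-byte form */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;   /* UTF-16 surrogates are not scalar values */
    if ((cp & 0xFFFE) == 0xFFFE)
        return 0;   /* U+xxFFFE and U+xxFFFF in every plane */
    return 1;
}

/* Encode cp as UTF-8 into out; returns the byte count (1..4),
   or 0 if the code point is rejected by is_encodable(). */
static size_t utf8_encode_strict(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (!is_encodable(cp))
        return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With this sketch, U+00FF comes out as the two bytes C3 BF, which also
illustrates the single-byte point above: a character that is one byte
in a Latin-1 source needs two bytes in a UTF-8 target.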