[Bug c/35908] New: Dubious charset conversions

public inbox for gcc-bugs@sourceware.org

* [Bug c/35908]  New: Dubious charset conversions
@ 2008-04-11 15:21 neil at gcc dot gnu dot org
  2008-04-11 16:59 ` [Bug c/35908] " joseph at codesourcery dot com
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: neil at gcc dot gnu dot org @ 2008-04-11 15:21 UTC (permalink / raw)
  To: gcc-bugs

GCC accepts the following with -ansi -pedantic -Wall without diagnostics:

#include <stdlib.h>
wchar_t z[] = L"a" "\xff";

GCC claims a default execution charset of UTF-8; presumably the default
execution wide character set is UTF-32.  But "\xff" is a two-character narrow
execution character set string literal, with characters \xff \0, which is
invalid UTF-8 and so cannot be converted in a meaningful way to the execution
character set (whatever it is).

I would expect the above code to be rejected, or at least diagnosed.
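
To illustrate why the report calls "\xff" invalid UTF-8: every UTF-8 sequence
begins with a lead byte whose high bits announce the sequence length, and 0xFF
matches none of the permitted patterns. The following is a minimal sketch
under the modern (RFC 3629) definition of UTF-8; the helper name
utf8_lead_length is hypothetical and is not part of GCC or any library.

```c
#include <assert.h>

/* Hypothetical helper: return the sequence length announced by a UTF-8
   lead byte, or 0 if the byte cannot start a well-formed sequence. */
static int utf8_lead_length(unsigned char b)
{
    if (b <= 0x7F)
        return 1;               /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;               /* 110xxxxx; C0 and C1 would be overlong */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;               /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;               /* 11110xxx, capped at U+10FFFF */
    return 0;                   /* continuation bytes, C0, C1, F5..FF */
}
```

Calling utf8_lead_length(0xFF) returns 0, which is the sense in which a narrow
string holding only \xff is not UTF-8.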

-- 
           Summary: Dubious charset conversions
           Product: gcc
           Version: 4.1.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: neil at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35908


* [Bug c/35908] Dubious charset conversions
From: joseph at codesourcery dot com @ 2008-04-11 16:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from joseph at codesourcery dot com  2008-04-11 16:58 -------
Subject: Re: New: Dubious charset conversions

On Fri, 11 Apr 2008, neil at gcc dot gnu dot org wrote:

> GCC accepts the following with -ansi -pedantic -Wall without diagnostics
> 
> #include <stdlib.h>
> wchar_t z[] = L"a" "\xff";
> 
> GCC claims a default execution charset of UTF-8; presumably the default
> execution wide character set is UTF-32.  But "\xff" is a two-character narrow
> execution character set string literal, with characters \xff \0, which is
> invalid UTF-8 and so cannot be converted in a meaningful way to the execution
> character set (whatever it is).
> 
> I would expect the above code to be rejected, or at least diagnosed.

Accepting it as equivalent to L"a\xff" (generating a wide character L'a' 
followed by one with value 0xff) seems in accordance with the principles 
of N951, the relevant ones of which are implemented in GCC.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n951.htm
http://gcc.gnu.org/ml/gcc-patches/2003-07/msg00532.html
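
The equivalence described above can be checked directly. This is a small
sketch, assuming a compiler that accepts mixed wide/narrow concatenation as
GCC does; the array names and the helper literals_match are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The declaration from the report: a wide literal next to a narrow one. */
static const wchar_t concatenated[] = L"a" "\xff";

/* The single wide literal it is said to be equivalent to. */
static const wchar_t single_literal[] = L"a\xff";

/* Returns 1 if both arrays have the same size and the same contents. */
static int literals_match(void)
{
    return sizeof concatenated == sizeof single_literal
        && memcmp(concatenated, single_literal, sizeof single_literal) == 0;
}
```

If the interpretation above holds, literals_match() returns 1 and
concatenated[1] has the value 0xff, with no conversion from UTF-8 attempted
on the \xff escape.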



* [Bug c/35908] Dubious charset conversions
From: neil at daikokuya dot co dot uk @ 2008-04-12  4:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from neil at daikokuya dot co dot uk  2008-04-12 04:40 -------
Subject: Re: Dubious charset conversions

joseph at codesourcery dot com wrote:-

> > GCC accepts the following with -ansi -pedantic -Wall without diagnostics
> > 
> > #include <stdlib.h>
> > wchar_t z[] = L"a" "\xff";
> > 
> > GCC claims a default execution charset of UTF-8; presumably the default
> > execution wide character set is UTF-32.  But "\xff" is a two-character narrow
> > execution character set string literal, with characters \xff \0, which is
> > invalid UTF-8 and so cannot be converted in a meaningful way to the execution
> > character set (whatever it is).
> > 
> > I would expect the above code to be rejected, or at least diagnosed.
> 
> Accepting it as equivalent to L"a\xff" (generating a wide character L'a' 
> followed by one with value 0xff) seems in accordance with the principles 
> of N951, the relevant ones of which are implemented in GCC.
> 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n951.htm
> http://gcc.gnu.org/ml/gcc-patches/2003-07/msg00532.html

Ah, I'd forgotten about that.  That document does make much more
sense, thanks.  However, I think there are at least two things wrong
in "Principle 7"; I've mailed Clive about those.  [The single-byte
requirement cannot be fulfilled for a Latin source charset with a UTF-8
target, for example, and UCNs are escape sequences that typically
cannot be encoded as a single byte.]
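
As a worked illustration of the single-byte point: assuming "Latin source
charset" means ISO 8859-1 (the message does not say), the character 0xE9
maps to U+00E9, which needs two bytes in UTF-8. The encoder below is a
sketch; the name utf8_encode is hypothetical, not a GCC or libc function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[4].
   Returns the number of bytes written, or 0 for surrogates and for
   values above U+10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
    if (cp <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                   /* beyond the Unicode range */
}
```

With this sketch, utf8_encode(0xE9, buf) writes the two bytes 0xC3 0xA9, so
one source byte cannot become one target byte in that configuration.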

GCC should perhaps consider not creating invalid UTF-8 (i.e. no 5- or
6-byte forms, or encodings of \ufffe, \uffff, etc.).
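
A check along the lines of this suggestion might look like the sketch below.
It is not GCC's code: the function name utf8_may_emit is hypothetical, the
U+10FFFF cap is an assumption taken from RFC 3629, and only the two code
points named in the message are rejected because the "etc." is not spelled
out.

```c
#include <assert.h>

/* Hypothetical policy check: return 1 if a code point may be written out
   as UTF-8 under the suggestion in the message, 0 otherwise. */
static int utf8_may_emit(unsigned long cp)
{
    /* Assumed cap (RFC 3629). 5- and 6-byte forms only arise for values
       of 0x200000 and above, so this rules them out as well. */
    if (cp > 0x10FFFF)
        return 0;
    /* The two code points named in the message. */
    if (cp == 0xFFFE || cp == 0xFFFF)
        return 0;
    return 1;
}
```

A converter could call such a check before encoding each character and issue
a diagnostic when it returns 0.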

Please feel free to close this report.

Neil.



* [Bug c/35908] Dubious charset conversions
From: jsm28 at gcc dot gnu dot org @ 2009-03-30  1:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from jsm28 at gcc dot gnu dot org  2009-03-30 01:03 -------
As discussed, L"a\xff" is the correct interpretation, and the string literal
changes in C++0x and C1x make that clear.


-- 

jsm28 at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |INVALID


