public inbox for gcc-bugs@sourceware.org
* [Bug c/41698]  New: "\uFFFF" converts incorrectly to two-byte character
@ 2009-10-13 21:00 chasonr at newsguy dot com
  2009-10-13 21:05 ` [Bug c/41698] " chasonr at newsguy dot com
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: chasonr at newsguy dot com @ 2009-10-13 21:00 UTC (permalink / raw)
  To: gcc-bugs

GCC 4.4.1 mishandles the code point U+FFFF when generating two-byte (UTF-16)
characters.  It mistakes this code point for a supplementary one and emits
an invalid surrogate pair, U+D7FF U+DFFF, instead of the single unit U+FFFF.
This bug is present as far back as GCC 3.4.6.

Here is a test program that demonstrates the bug, and could function as a
regression test.  This program uses char16_t, but GCC 3.4.5 as shipped with
MinGW also shows this bug when wchar_t is used.

--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--
/* gcc-utf16-test.c -- demonstrate a bug in GCC 4.4.1, that causes the code
   point U+FFFF to convert incorrectly to UTF-16.
   Compile on GCC 4.4.1 with -std=gnu99. */

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    static const __CHAR16_TYPE__ teststr1[] = u"\uFFFF";
    static const __CHAR16_TYPE__ teststr2[] = u"\U00010000";
    size_t i;

    printf("The string \"\\uFFFF\" converts as:");
    for (i = 0; teststr1[i] != 0; i++)
        printf(" U+%04X", (unsigned) teststr1[i]); /* %X expects unsigned int */
    printf("\n");
    if (teststr1[0] != 0xFFFF || teststr1[1] != 0)
    {
        printf("This conversion is INCORRECT.  It should be U+FFFF.\n");
        return EXIT_FAILURE;
    }

    printf("The string \"\\U00010000\" converts as:");
    for (i = 0; teststr2[i] != 0; i++)
        printf(" U+%04X", (unsigned) teststr2[i]); /* %X expects unsigned int */
    printf("\n");
    if (teststr2[0] != 0xD800 || teststr2[1] != 0xDC00 || teststr2[2] != 0)
    {
        printf("This conversion is INCORRECT.  It should be U+D800 U+DC00.\n");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--

The problem is an off-by-one error in the function one_utf8_to_utf16 in
libcpp/charset.c: the comparison that selects the single-unit encoding
excludes U+FFFF itself.  The following patch against the GCC 4.4.1 source
corrects the bug:

--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--
--- gcc-4.4.1/libcpp/charset.c.old      2009-04-09 19:23:07.000000000 -0400
+++ gcc-4.4.1/libcpp/charset.c  2009-10-12 04:06:25.000000000 -0400
@@ -354,7 +354,7 @@
       return EILSEQ;
     }

-  if (s < 0xFFFF)
+  if (s <= 0xFFFF)
     {
       if (*outbytesleftp < 2)
        {
--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--
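
To make the arithmetic behind the bad output concrete, here is a minimal
sketch of a UTF-16 encoder with a switchable boundary test.  It is not the
libcpp code: the function name encode_utf16 and its `inclusive` parameter are
hypothetical, invented for illustration, and the sketch does not reject
surrogate code points.  It only assumes that the code point is held in an
unsigned 32-bit value, so that 0xFFFF - 0x10000 wraps around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the libcpp function.  Encodes one code point
   as UTF-16 and returns the number of 16-bit units written (1 or 2).
   inclusive != 0 uses the patched test (cp <= 0xFFFF);
   inclusive == 0 reproduces the original test (cp < 0xFFFF).  */
size_t
encode_utf16(uint32_t cp, int inclusive, uint16_t out[2])
{
    uint32_t v;

    assert(cp <= 0x10FFFF);

    if (inclusive ? cp <= 0xFFFF : cp < 0xFFFF)
    {
        /* Basic Multilingual Plane: one unit, value unchanged. */
        out[0] = (uint16_t) cp;
        return 1;
    }

    /* Supplementary plane: split the 20-bit offset into two halves.
       With the original test, cp == 0xFFFF lands here and the
       subtraction wraps to 0xFFFFFFFF.  */
    v = cp - 0x10000;
    out[0] = (uint16_t) (v / 0x400 + 0xD800);
    out[1] = (uint16_t) (v % 0x400 + 0xDC00);
    return 2;
}
```

With the original comparison, U+FFFF falls into the second branch: the
wrapped offset 0xFFFFFFFF gives 0x3FFFFF + 0xD800 = 0x40D7FF, which truncates
to 0xD7FF, and 0x3FF + 0xDC00 = 0xDFFF, matching the pair reported above.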


-- 
           Summary: "\uFFFF" converts incorrectly to two-byte character
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: chasonr at newsguy dot com
 GCC build triplet: x86_64-unknown-linux
  GCC host triplet: x86_64-unknown-linux
GCC target triplet: x86_64-unknown-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41698



* [Bug c/41698] "\uFFFF" converts incorrectly to two-byte character
  2009-10-13 21:00 [Bug c/41698] New: "\uFFFF" converts incorrectly to two-byte character chasonr at newsguy dot com
@ 2009-10-13 21:05 ` chasonr at newsguy dot com
  2009-10-13 21:08 ` chasonr at newsguy dot com
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: chasonr at newsguy dot com @ 2009-10-13 21:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from chasonr at newsguy dot com  2009-10-13 21:04 -------
Created an attachment (id=18796)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18796&action=view)
Test case for this bug

This test uses the built-in __CHAR16_TYPE__ so that it demonstrates the bug
even when wchar_t is four bytes wide; as a result, it will only compile on
GCC 4.4.  For earlier compilers, change __CHAR16_TYPE__ to wchar_t and the
test strings to L"\uFFFF" and L"\U00010000".  When using wchar_t, the bug
appears only when wchar_t is two bytes wide.
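
The wchar_t variant described above might look roughly like the following
sketch.  It is not the attached test case: the function name
check_wide_literals and its return codes are made up for illustration, and it
assumes wchar_t is either two or four bytes wide and holds UTF-16 or UTF-32
code units respectively.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check, not the attached test case.  Returns 0 when both
   wide literals contain the expected code units for this platform's
   wchar_t width, 1 if "\uFFFF" is wrong, 2 if "\U00010000" is wrong.  */
int
check_wide_literals(void)
{
    static const wchar_t teststr1[] = L"\uFFFF";
    static const wchar_t teststr2[] = L"\U00010000";

    /* Assumption: only two-byte and four-byte wchar_t are handled. */
    assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4);

    /* U+FFFF must be a single unit whatever the width of wchar_t. */
    if (teststr1[0] != 0xFFFF || teststr1[1] != 0)
        return 1;

    if (sizeof(wchar_t) == 2)
    {
        /* Two-byte wchar_t: U+10000 needs a surrogate pair. */
        if (teststr2[0] != 0xD800 || teststr2[1] != 0xDC00
            || teststr2[2] != 0)
            return 2;
    }
    else
    {
        /* Four-byte wchar_t: U+10000 is stored directly. */
        if (teststr2[0] != 0x10000 || teststr2[1] != 0)
            return 2;
    }

    return 0;
}
```

On a target with four-byte wchar_t this cannot show the bug, so a return
value of 0 there says nothing about whether the compiler is affected.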


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41698



* [Bug c/41698] "\uFFFF" converts incorrectly to two-byte character
  2009-10-13 21:00 [Bug c/41698] New: "\uFFFF" converts incorrectly to two-byte character chasonr at newsguy dot com
  2009-10-13 21:05 ` [Bug c/41698] " chasonr at newsguy dot com
@ 2009-10-13 21:08 ` chasonr at newsguy dot com
  2009-10-14 10:34 ` [Bug preprocessor/41698] " rguenth at gcc dot gnu dot org
  2009-11-22 19:28 ` jsm28 at gcc dot gnu dot org
  3 siblings, 0 replies; 7+ messages in thread
From: chasonr at newsguy dot com @ 2009-10-13 21:08 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from chasonr at newsguy dot com  2009-10-13 21:08 -------
Created an attachment (id=18797)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18797&action=view)
Proposed patch for this bug

This patch is against the GCC 4.4.1 distribution.  The function altered is
one_utf8_to_utf16 in libcpp/charset.c.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41698



* [Bug preprocessor/41698] "\uFFFF" converts incorrectly to two-byte character
  2009-10-13 21:00 [Bug c/41698] New: "\uFFFF" converts incorrectly to two-byte character chasonr at newsguy dot com
  2009-10-13 21:05 ` [Bug c/41698] " chasonr at newsguy dot com
  2009-10-13 21:08 ` chasonr at newsguy dot com
@ 2009-10-14 10:34 ` rguenth at gcc dot gnu dot org
  2009-11-22 19:28 ` jsm28 at gcc dot gnu dot org
  3 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-10-14 10:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from rguenth at gcc dot gnu dot org  2009-10-14 10:34 -------
Please post patches to gcc-patches@gcc.gnu.org.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41698



* [Bug preprocessor/41698] "\uFFFF" converts incorrectly to two-byte character
  2009-10-13 21:00 [Bug c/41698] New: "\uFFFF" converts incorrectly to two-byte character chasonr at newsguy dot com
                   ` (2 preceding siblings ...)
  2009-10-14 10:34 ` [Bug preprocessor/41698] " rguenth at gcc dot gnu dot org
@ 2009-11-22 19:28 ` jsm28 at gcc dot gnu dot org
  3 siblings, 0 replies; 7+ messages in thread
From: jsm28 at gcc dot gnu dot org @ 2009-11-22 19:28 UTC (permalink / raw)
  To: gcc-bugs



-- 

jsm28 at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2009-11-22 19:28:08
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41698



* [Bug preprocessor/41698] "\uFFFF" converts incorrectly to two-byte character
       [not found] <bug-41698-4@http.gcc.gnu.org/bugzilla/>
  2014-11-29  1:56 ` jsm28 at gcc dot gnu.org
@ 2014-11-29  1:57 ` jsm28 at gcc dot gnu.org
  1 sibling, 0 replies; 7+ messages in thread
From: jsm28 at gcc dot gnu.org @ 2014-11-29  1:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41698

Joseph S. Myers <jsm28 at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |5.0

--- Comment #5 from Joseph S. Myers <jsm28 at gcc dot gnu.org> ---
Fixed for GCC 5.



* [Bug preprocessor/41698] "\uFFFF" converts incorrectly to two-byte character
       [not found] <bug-41698-4@http.gcc.gnu.org/bugzilla/>
@ 2014-11-29  1:56 ` jsm28 at gcc dot gnu.org
  2014-11-29  1:57 ` jsm28 at gcc dot gnu.org
  1 sibling, 0 replies; 7+ messages in thread
From: jsm28 at gcc dot gnu.org @ 2014-11-29  1:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41698

--- Comment #4 from Joseph S. Myers <jsm28 at gcc dot gnu.org> ---
Author: jsm28
Date: Sat Nov 29 01:56:06 2014
New Revision: 218179

URL: https://gcc.gnu.org/viewcvs?rev=218179&root=gcc&view=rev
Log:
Fix off-by-one bug in utf16 conversion (PR preprocessor/41698).

libcpp:
2014-11-29  John Schmerge  <jbschmerge@gmail.com>

    PR preprocessor/41698
    * charset.c (one_utf8_to_utf16): Do not produce surrogate pairs
    for 0xffff.

gcc/testsuite:
2014-11-29  Joseph Myers  <joseph@codesourcery.com>

    PR preprocessor/41698
    * gcc/testsuite/g++.dg/cpp/utf16-pr41698-1.C: New test.

Added:
    trunk/gcc/testsuite/g++.dg/cpp/utf16-pr41698-1.C
Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/libcpp/ChangeLog
    trunk/libcpp/charset.c


