public inbox for gcc-patches@gcc.gnu.org
* Fwd: g++ off-by-one bug in utf16 conversion
       [not found] <CAL1ZCA-JS0w+qeQW2BsJq8u6geTtZWjwMQYQKBiE5wSaX3tKag@mail.gmail.com>
@ 2014-10-26 22:52 ` John Schmerge
  2014-11-29 10:49   ` Joseph Myers
  0 siblings, 1 reply; 2+ messages in thread
From: John Schmerge @ 2014-10-26 22:52 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1368 bytes --]

I believe I sent this yesterday to the incorrect list...


---------- Forwarded message ----------
From: John Schmerge <jbschmerge@gmail.com>
Date: Sun, Oct 26, 2014 at 1:58 AM
Subject: g++ off-by-one bug in utf16 conversion
To: gcc-bugs@gcc.gnu.org


Hey guys,

I came across this bug earlier today while implementing some
unit tests for UTF-8/UTF-16 conversions. The following C++
program gives the wrong result:

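#include <iostream>

int main() {
  char16_t s[] = u"\uffff";
  // Cast so the code units print as numbers rather than characters.
  std::cout << std::hex << unsigned(s[0]) << " " << unsigned(s[1]) << std::endl;
}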

it prints:
  d7ff dfff
whereas it should print:
  ffff 0
For those unfamiliar with UTF-16: all Unicode code points less than
or equal to 0xffff are stored as a single 16-bit value with no
conversion. Code points greater than 0xffff are converted to a
pair of 16-bit values, where the 1st is in the range
0xd800-0xdbff and the 2nd is in the range 0xdc00-0xdfff.
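
As an illustration of the rule just described, here is a minimal sketch of a UTF-16 encoder for a single code point. The helper name `encode_utf16` is hypothetical and is not taken from libcpp; it assumes the input is a valid code point (at most 0x10FFFF and not itself in the surrogate range) and does no validation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of libcpp: encode one Unicode code point
// as UTF-16 code units. Assumes cp is valid (<= 0x10FFFF, not a surrogate).
std::vector<char16_t> encode_utf16(std::uint32_t cp) {
  if (cp <= 0xFFFF) {
    // Everything up to and including U+FFFF is a single unit, unchanged.
    return {static_cast<char16_t>(cp)};
  }
  // Above U+FFFF: subtract 0x10000 to get a 20-bit value, then split it
  // into a high (top 10 bits) and low (bottom 10 bits) surrogate.
  std::uint32_t v = cp - 0x10000;
  char16_t hi = static_cast<char16_t>(0xD800 + (v >> 10));
  char16_t lo = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
  return {hi, lo};
}
```

Under these assumptions, `encode_utf16(0xFFFF)` yields one unit and `encode_utf16(0x10000)` yields the first surrogate pair.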

Clearly this is an off-by-one issue. I traced it down to the
use of a less-than operator where a less-than-or-equal operator
is needed in libcpp/charset.c.
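
To see how a strict comparison could turn 0xffff into the pair reported above, the sketch below models the two branches. It is an assumption about the arithmetic, not a copy of the libcpp source: the names `surrogate_branch` and `encode` are hypothetical, and the model assumes 32-bit unsigned arithmetic whose results are truncated to 16 bits.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical model of the surrogate branch (an assumption, not libcpp):
// 32-bit unsigned arithmetic, keeping only the low 16 bits of each result.
std::pair<char16_t, char16_t> surrogate_branch(std::uint32_t s) {
  std::uint32_t v = s - 0x10000;  // wraps to 0xFFFFFFFF when s == 0xFFFF
  char16_t hi = static_cast<char16_t>(v / 0x400 + 0xD800);
  char16_t lo = static_cast<char16_t>(v % 0x400 + 0xDC00);
  return {hi, lo};
}

// inclusive == false mimics the strict `s < 0xFFFF` test;
// inclusive == true mimics `s <= 0xFFFF`.
// In the single-unit case the second element is 0, standing in for the
// string terminator that s[1] holds in the test program.
std::pair<char16_t, char16_t> encode(std::uint32_t s, bool inclusive) {
  bool single = inclusive ? (s <= 0xFFFF) : (s < 0xFFFF);
  if (single) return {static_cast<char16_t>(s), u'\0'};
  return surrogate_branch(s);
}
```

In this model, `encode(0xFFFF, false)` falls into the surrogate branch and the subtraction wraps around, giving d7ff dfff, which is consistent with the observed output; `encode(0xFFFF, true)` gives ffff 0.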

I have verified this is a bug with versions 4.4.7 (rhel 6.5),
4.8.2 (linaro/ubuntu/mint) and g++ (GCC) 5.0.0 20141025...
I am a bit surprised that this has gone so many years unnoticed
or at least unresolved.

Attached is a patch against the gcc 4.8.2 release from the gcc website.
It changes $gcc-root/libcpp/charset.c and fixes the issue in my tests.

Thanks,
John

[-- Attachment #2: gcc-utf16.patch --]
[-- Type: text/x-patch, Size: 250 bytes --]

--- libcpp/charset.c.old	2014-10-26 01:23:50.103796842 -0400
+++ libcpp/charset.c	2014-10-26 01:24:10.583796875 -0400
@@ -353,7 +353,7 @@
       return EILSEQ;
     }
 
-  if (s < 0xFFFF)
+  if (s <= 0xFFFF)
     {
       if (*outbytesleftp < 2)
 	{


* Re: Fwd: g++ off-by-one bug in utf16 conversion
  2014-10-26 22:52 ` Fwd: g++ off-by-one bug in utf16 conversion John Schmerge
@ 2014-11-29 10:49   ` Joseph Myers
  0 siblings, 0 replies; 2+ messages in thread
From: Joseph Myers @ 2014-11-29 10:49 UTC (permalink / raw)
  To: John Schmerge; +Cc: gcc-patches

Thanks, I've added a testcase and committed this patch.  Bootstrapped
with no regressions on x86_64-unknown-linux-gnu.

libcpp:
2014-11-29  John Schmerge  <jbschmerge@gmail.com>

	PR preprocessor/41698
	* charset.c (one_utf8_to_utf16): Do not produce surrogate pairs
	for 0xffff.

gcc/testsuite:
2014-11-29  Joseph Myers  <joseph@codesourcery.com>

	PR preprocessor/41698
	* gcc/testsuite/g++.dg/cpp/utf16-pr41698-1.C: New test.

Index: gcc/testsuite/g++.dg/cpp/utf16-pr41698-1.C
===================================================================
--- gcc/testsuite/g++.dg/cpp/utf16-pr41698-1.C	(revision 0)
+++ gcc/testsuite/g++.dg/cpp/utf16-pr41698-1.C	(working copy)
@@ -0,0 +1,15 @@
+// PR 41698: off-by-one error in UTF-16 encoding.
+
+// { dg-do run { target c++11 } }
+
+extern "C" void abort (void);
+extern "C" void exit (int);
+
+int
+main ()
+{
+  char16_t s[] = u"\uffff";
+  if (sizeof s != 2 * sizeof (char16_t) || s[0] != 0xffff || s[1] != 0)
+    abort ();
+  exit (0);
+}
Index: libcpp/charset.c
===================================================================
--- libcpp/charset.c	(revision 218163)
+++ libcpp/charset.c	(working copy)
@@ -353,7 +353,7 @@ one_utf8_to_utf16 (iconv_t bigend, const uchar **i
       return EILSEQ;
     }
 
-  if (s < 0xFFFF)
+  if (s <= 0xFFFF)
     {
       if (*outbytesleftp < 2)
 	{

-- 
Joseph S. Myers
joseph@codesourcery.com
