public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
@ 2017-05-05 17:19 Jonathan Wakely
  2017-05-08  7:53 ` Stephan Bergmann via gcc-patches
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Jonathan Wakely @ 2017-05-05 17:19 UTC (permalink / raw)
  To: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1737 bytes --]

As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
hinted at by http://wg21.link/lwg1200) there's a problem with
char_traits<char16_t>::eof() because it returns int_type(-1) which is
the same value as u'\uFFFF', a valid UTF-16 code point.

i.e. because all values of int_type are also valid values of char_type
we cannot meet the requirement that:

"The member eof() shall return an implementation-defined constant
that cannot appear as a valid UTF-16 code unit."

I've reported this as a defect, suggesting that the wording above
needs to change.

One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
always returns the same value, whether it succeeds or not. On success
it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
which is the same value. I think that can be solved with the attached
change, which preserves the invariant in [char.traits.require] that
eof() returns:

"a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
all values c."

This can be true if we ensure that to_int_type never returns the eof()
value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
doing something like this.

It means that when writing u'\uFFFF' to a streambuf we write that
character successfully, but return u'\uFFFD' instead; and when reading
u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
asymmetrical, as we can write that character but not read it back.  It
might be better to refuse to write u'\uFFFF' and write it as the
replacement character instead, but I think I prefer to write the right
character when possible. It also doesn't require any extra changes.

All tests pass with this, does anybody see any problems with this
approach?



[-- Attachment #2: patch.txt --]
[-- Type: text/x-patch, Size: 7405 bytes --]

commit 8ab705e4920e933d3b0e90fd004b93d89aab8619
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Fri May 5 16:57:07 2017 +0100

    PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
    
    	PR libstdc++/80624
    	* doc/xml/manual/status_cxx2011.xml: Document to_int_type behaviour.
    	* include/bits/char_traits.h (char_traits<char16_t>::to_int_type):
    	Transform eof value to U+FFFD.
    	* testsuite/21_strings/char_traits/requirements/char16_t/eof.cc: New.
    	* testsuite/27_io/basic_streambuf/sgetc/char16_t/80624.cc: New.
    	* testsuite/27_io/basic_streambuf/sputc/char16_t/80624.cc: New.

diff --git a/libstdc++-v3/doc/xml/manual/status_cxx2011.xml b/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
index 705f2ee..0fa4bc0 100644
--- a/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
+++ b/libstdc++-v3/doc/xml/manual/status_cxx2011.xml
@@ -2630,6 +2630,10 @@ particular release.
       <classname>u32streampos</classname> are both synonyms for
       <classname>fpos&lt;mbstate_t&gt;</classname>.
       The function <function>eof</function> returns <code>int_type(-1)</code>.
+      <function>char_traits&lt;char16_t&gt;::to_int_type</function> will
+      transform the "noncharacter" U+FFFF to U+FFFD (REPLACEMENT CHARACTER).
+      This is done to ensure that <function>to_int_type</function> never
+      returns the same value as <function>eof</function>, which is U+FFFF.
    </para>
 
    <para>
diff --git a/libstdc++-v3/include/bits/char_traits.h b/libstdc++-v3/include/bits/char_traits.h
index 75db5b8..f19120b 100644
--- a/libstdc++-v3/include/bits/char_traits.h
+++ b/libstdc++-v3/include/bits/char_traits.h
@@ -507,7 +507,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
       static constexpr int_type
       to_int_type(const char_type& __c) noexcept
-      { return int_type(__c); }
+      { return __c == eof() ? int_type(0xfffd) : int_type(__c); }
 
       static constexpr bool
       eq_int_type(const int_type& __c1, const int_type& __c2) noexcept
diff --git a/libstdc++-v3/testsuite/21_strings/char_traits/requirements/char16_t/eof.cc b/libstdc++-v3/testsuite/21_strings/char_traits/requirements/char16_t/eof.cc
new file mode 100644
index 0000000..05def7f
--- /dev/null
+++ b/libstdc++-v3/testsuite/21_strings/char_traits/requirements/char16_t/eof.cc
@@ -0,0 +1,31 @@
+// Copyright (C) 2017 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-do compile { target c++11 } }
+
+#include <string>
+
+
+constexpr bool not_equal_to_eof(char16_t c)
+{
+  using T = std::char_traits<char16_t>;
+  return T::eq_int_type(T::eof(), T::to_int_type(c)) == false;
+}
+
+// Last two code points of the BMP are noncharacters:
+static_assert(not_equal_to_eof(u'\uFFFE'), "U+FFFE compares unequal to eof");
+static_assert(not_equal_to_eof(u'\uFFFF'), "U+FFFF compares unequal to eof");
diff --git a/libstdc++-v3/testsuite/27_io/basic_streambuf/sgetc/char16_t/80624.cc b/libstdc++-v3/testsuite/27_io/basic_streambuf/sgetc/char16_t/80624.cc
new file mode 100644
index 0000000..c08c227
--- /dev/null
+++ b/libstdc++-v3/testsuite/27_io/basic_streambuf/sgetc/char16_t/80624.cc
@@ -0,0 +1,56 @@
+// Copyright (C) 2017 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-do run { target c++11 } }
+
+#include <streambuf>
+#include <testsuite_hooks.h>
+
+struct streambuf : std::basic_streambuf<char16_t>
+{
+  basic_streambuf* setbuf(char_type* s, std::streamsize n) override
+  {
+    setp(s, s + n);
+    setg(s, s, s + n);
+    return this;
+  }
+};
+
+void
+test01()
+{
+  using traits = streambuf::traits_type;
+
+  char16_t buf[2] = { streambuf::char_type(-1), streambuf::char_type(-2) };
+  streambuf sb;
+  sb.pubsetbuf(buf, 2);
+
+  streambuf::int_type res;
+
+  res = sb.sbumpc();
+  VERIFY( traits::eq_int_type(res, traits::eof()) == false );
+  res = sb.sbumpc();
+  VERIFY( traits::eq_int_type(res, traits::eof()) == false );
+  res = sb.sbumpc();
+  VERIFY( traits::eq_int_type(res, traits::eof()) == true );
+}
+
+int
+main()
+{
+  test01();
+}
diff --git a/libstdc++-v3/testsuite/27_io/basic_streambuf/sputc/char16_t/80624.cc b/libstdc++-v3/testsuite/27_io/basic_streambuf/sputc/char16_t/80624.cc
new file mode 100644
index 0000000..a2a1917
--- /dev/null
+++ b/libstdc++-v3/testsuite/27_io/basic_streambuf/sputc/char16_t/80624.cc
@@ -0,0 +1,59 @@
+// Copyright (C) 2017 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library.  This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+
+// You should have received a copy of the GNU General Public License along
+// with this library; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// { dg-do run { target c++11 } }
+
+#include <streambuf>
+#include <testsuite_hooks.h>
+
+struct streambuf : std::basic_streambuf<char16_t>
+{
+  basic_streambuf* setbuf(char_type* s, std::streamsize n) override
+  {
+    setp(s, s + n);
+    setg(s, s, s + n);
+    return this;
+  }
+};
+
+void
+test01()
+{
+  using traits = streambuf::traits_type;
+
+  char16_t buf[2] = { };
+  streambuf sb;
+  sb.pubsetbuf(buf, 2);
+
+  streambuf::int_type res;
+
+  res = sb.sputc(streambuf::char_type(-1));
+  VERIFY( traits::eq_int_type(res, traits::eof()) == false );
+  res = sb.sputc(streambuf::char_type(-2));
+  VERIFY( traits::eq_int_type(res, traits::eof()) == false );
+  res = sb.sputc(streambuf::char_type(1));
+  VERIFY( traits::eq_int_type(res, traits::eof()) == true );
+
+  VERIFY( buf[0] == streambuf::char_type(-1) );
+  VERIFY( buf[1] == streambuf::char_type(-2) );
+}
+
+int
+main()
+{
+  test01();
+}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
  2017-05-05 17:19 [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof() Jonathan Wakely
@ 2017-05-08  7:53 ` Stephan Bergmann via gcc-patches
  2017-05-08 10:24   ` Jonathan Wakely via gcc-patches
  2017-05-08 10:19 ` Florian Weimer via gcc-patches
  2017-06-02 18:35 ` Jonathan Wakely
  2 siblings, 1 reply; 8+ messages in thread
From: Stephan Bergmann via gcc-patches @ 2017-05-08  7:53 UTC (permalink / raw)
  To: libstdc++, gcc-patches

On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
> As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
> hinted at by http://wg21.link/lwg1200) there's a problem with
> char_traits<char16_t>::eof() because it returns int_type(-1) which is
> the same value as u'\uFFFF', a valid UTF-16 code point.
> 
> i.e. because all values of int_type are also valid values of char_type
> we cannot meet the requirement that:
> 
> "The member eof() shall return an implementation-defined constant
> that cannot appear as a valid UTF-16 code unit."
> 
> I've reported this as a defect, suggesting that the wording above
> needs to change.
> 
> One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
> always returns the same value, whether it succeeds or not. On success
> it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
> which is the same value. I think that can be solved with the attached
> change, which preserves the invariant in [char.traits.require] that
> eof() returns:
> 
> "a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
> all values c."
> 
> This can be true if we ensure that to_int_type never returns the eof()
> value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
> doing something like this.
> 
> It means that when writing u'\uFFFF' to a streambuf we write that
> character successfully, but return u'\uFFFD' instead; and when reading
> u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
> asymmetrical, as we can write that character but not read it back.  It
> might be better to refuse to write u'\uFFFF' and write it as the
> replacement character instead, but I think I prefer to write the right
> character when possible. It also doesn't require any extra changes.
> 
> All tests pass with this, does anybody see any problems with this
> approach?

Sounds scary to me.  As an application programmer, I'd expect to be able 
to use chart16_t based streams to read and write arbitrary sequences of 
Unicode code points (encoded as sequences of UTF-16 code units).  (Think 
of an application temporarily storing internal strings to a disk file.)

Also, I'd be surprised to find this asymmetric behavior only for U+FFFF 
and not for other noncharacters, and only for char16_t and not for char32_t.

To me, the definition of char16_t's int_type and eof() sounds like a bug 
that needs fixing, not working around?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
  2017-05-05 17:19 [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof() Jonathan Wakely
  2017-05-08  7:53 ` Stephan Bergmann via gcc-patches
@ 2017-05-08 10:19 ` Florian Weimer via gcc-patches
  2017-05-08 10:46   ` Jonathan Wakely via gcc-patches
  2017-06-02 18:35 ` Jonathan Wakely
  2 siblings, 1 reply; 8+ messages in thread
From: Florian Weimer via gcc-patches @ 2017-05-08 10:19 UTC (permalink / raw)
  To: Jonathan Wakely, libstdc++, gcc-patches

On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
> As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
> hinted at by http://wg21.link/lwg1200) there's a problem with
> char_traits<char16_t>::eof() because it returns int_type(-1) which is
> the same value as u'\uFFFF', a valid UTF-16 code point.

I think the real bug is that char_traits<char16_t>::int_type is just 
plain wrong.  It has to be a signed integer, and capable of representing 
values in the range 0 .. 65535.  char_traits<char32_t> has a similar 
problem.  char_traits<wchar_t> should be fine on glibc because WEOF is 
reserved, something that is probably not the case for char32_t.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
  2017-05-08  7:53 ` Stephan Bergmann via gcc-patches
@ 2017-05-08 10:24   ` Jonathan Wakely via gcc-patches
  0 siblings, 0 replies; 8+ messages in thread
From: Jonathan Wakely via gcc-patches @ 2017-05-08 10:24 UTC (permalink / raw)
  To: Stephan Bergmann; +Cc: libstdc++, gcc-patches

On 08/05/17 09:52 +0200, Stephan Bergmann via libstdc++ wrote:
>On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
>>As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
>>hinted at by http://wg21.link/lwg1200) there's a problem with
>>char_traits<char16_t>::eof() because it returns int_type(-1) which is
>>the same value as u'\uFFFF', a valid UTF-16 code point.
>>
>>i.e. because all values of int_type are also valid values of char_type
>>we cannot meet the requirement that:
>>
>>"The member eof() shall return an implementation-defined constant
>>that cannot appear as a valid UTF-16 code unit."
>>
>>I've reported this as a defect, suggesting that the wording above
>>needs to change.
>>
>>One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
>>always returns the same value, whether it succeeds or not. On success
>>it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
>>which is the same value. I think that can be solved with the attached
>>change, which preserves the invariant in [char.traits.require] that
>>eof() returns:
>>
>>"a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
>>all values c."
>>
>>This can be true if we ensure that to_int_type never returns the eof()
>>value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
>>doing something like this.
>>
>>It means that when writing u'\uFFFF' to a streambuf we write that
>>character successfully, but return u'\uFFFD' instead; and when reading
>>u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
>>asymmetrical, as we can write that character but not read it back.  It
>>might be better to refuse to write u'\uFFFF' and write it as the
>>replacement character instead, but I think I prefer to write the right
>>character when possible. It also doesn't require any extra changes.
>>
>>All tests pass with this, does anybody see any problems with this
>>approach?
>
>Sounds scary to me.  As an application programmer, I'd expect to be 
>able to use chart16_t based streams to read and write arbitrary 
>sequences of Unicode code points (encoded as sequences of UTF-16 code 
>units).  (Think of an application temporarily storing internal strings 
>to a disk file.)
>
>Also, I'd be surprised to find this asymmetric behavior only for 
>U+FFFF and not for other noncharacters, and only for char16_t and not 
>for char32_t.
>
>To me, the definition of char16_t's int_type and eof() sounds like a 
>bug that needs fixing, not working around?

Fixing that would require changing the standard and breaking the ABI
of all existing implementations. I've opened a defect report against
that standard, but a change that requires an ABI break isn't likely to
be popular.

Changing the semantics of to_int_type for U+FFFF is far less likely to
affect any ABIs (it's a constexpr function so it's possible somebody
is using the value of to_int_type(char_type(-1)) as a template
argument, but it seems unlikely. It's a much smaller change, "allowed"
by http://www.unicode.org/faq/private_use.html#nonchar10 and it only
affects a noncharacter that is not intended for interchange anyway.

I'm not claiming it's ideal, but it fixes a bug today.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
  2017-05-08 10:19 ` Florian Weimer via gcc-patches
@ 2017-05-08 10:46   ` Jonathan Wakely via gcc-patches
  2017-05-08 10:57     ` Florian Weimer via gcc-patches
  0 siblings, 1 reply; 8+ messages in thread
From: Jonathan Wakely via gcc-patches @ 2017-05-08 10:46 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libstdc++, gcc-patches

On 08/05/17 11:53 +0200, Florian Weimer via libstdc++ wrote:
>On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
>>As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
>>hinted at by http://wg21.link/lwg1200) there's a problem with
>>char_traits<char16_t>::eof() because it returns int_type(-1) which is
>>the same value as u'\uFFFF', a valid UTF-16 code point.
>
>I think the real bug is that char_traits<char16_t>::int_type is just 
>plain wrong.  It has to be a signed integer,

Why does it have to be signed?

>and capable of 
>representing values in the range 0 .. 65535.  char_traits<char32_t> 
>has a similar problem.  char_traits<wchar_t> should be fine on glibc 
>because WEOF is reserved, something that is probably not the case for 
>char32_t.

I think there are 32-bit values which are not valid UTF-32 code
points, including char32_t(-1) which we use for EOF.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
  2017-05-08 10:46   ` Jonathan Wakely via gcc-patches
@ 2017-05-08 10:57     ` Florian Weimer via gcc-patches
  2017-05-08 11:18       ` Jonathan Wakely via gcc-patches
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Weimer via gcc-patches @ 2017-05-08 10:57 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

On 05/08/2017 12:24 PM, Jonathan Wakely wrote:
> On 08/05/17 11:53 +0200, Florian Weimer via libstdc++ wrote:
>> On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
>>> As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
>>> hinted at by http://wg21.link/lwg1200) there's a problem with
>>> char_traits<char16_t>::eof() because it returns int_type(-1) which is
>>> the same value as u'\uFFFF', a valid UTF-16 code point.
>>
>> I think the real bug is that char_traits<char16_t>::int_type is just 
>> plain wrong.  It has to be a signed integer,
> 
> Why does it have to be signed?

Hmm.  Maybe it's not strictly required.  int_type(-1) as a distinct 
value is likely sufficient.

>> and capable of representing values in the range 0 .. 65535.  
>> char_traits<char32_t> has a similar problem.  char_traits<wchar_t> 
>> should be fine on glibc because WEOF is reserved, something that is 
>> probably not the case for char32_t.
> 
> I think there are 32-bit values which are not valid UTF-32 code
> points, including char32_t(-1) which we use for EOF.

I'm not sure if char32_t is restricted to UTF-32 codepoints (the 
standard does not say, I think).  But even UCS-4 is 31-bit only, so 
maybe the problem does not arise there.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
  2017-05-08 10:57     ` Florian Weimer via gcc-patches
@ 2017-05-08 11:18       ` Jonathan Wakely via gcc-patches
  0 siblings, 0 replies; 8+ messages in thread
From: Jonathan Wakely via gcc-patches @ 2017-05-08 11:18 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libstdc++, gcc-patches

On 08/05/17 12:52 +0200, Florian Weimer via libstdc++ wrote:
>On 05/08/2017 12:24 PM, Jonathan Wakely wrote:
>>On 08/05/17 11:53 +0200, Florian Weimer via libstdc++ wrote:
>>>On 05/05/2017 07:05 PM, Jonathan Wakely wrote:
>>>>As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
>>>>hinted at by http://wg21.link/lwg1200) there's a problem with
>>>>char_traits<char16_t>::eof() because it returns int_type(-1) which is
>>>>the same value as u'\uFFFF', a valid UTF-16 code point.
>>>
>>>I think the real bug is that char_traits<char16_t>::int_type is 
>>>just plain wrong.  It has to be a signed integer,
>>
>>Why does it have to be signed?
>
>Hmm.  Maybe it's not strictly required.  int_type(-1) as a distinct 
>value is likely sufficient.

Agreed.

>>>and capable of representing values in the range 0 .. 65535.  
>>>char_traits<char32_t> has a similar problem.  char_traits<wchar_t> 
>>>should be fine on glibc because WEOF is reserved, something that 
>>>is probably not the case for char32_t.
>>
>>I think there are 32-bit values which are not valid UTF-32 code
>>points, including char32_t(-1) which we use for EOF.
>
>I'm not sure if char32_t is restricted to UTF-32 codepoints (the 
>standard does not say, I think).  But even UCS-4 is 31-bit only, so 
>maybe the problem does not arise there.

It's not really clear what the encoding of char32_t is (see
http://talesofcpp.fusionfenix.com/post-10/episode-seven-one-char-to-rule-them-all
for a good analysis) but whether it's UCS-4 or UTF-32, U+FFFFFFFF is
not in the universal character set, so we can use 0xFFFFFFFF for
char_traits<char32_t>::eof().

So I think only char_traits<char16_t> has this problem.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
  2017-05-05 17:19 [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof() Jonathan Wakely
  2017-05-08  7:53 ` Stephan Bergmann via gcc-patches
  2017-05-08 10:19 ` Florian Weimer via gcc-patches
@ 2017-06-02 18:35 ` Jonathan Wakely
  2 siblings, 0 replies; 8+ messages in thread
From: Jonathan Wakely @ 2017-06-02 18:35 UTC (permalink / raw)
  To: libstdc++, gcc-patches

On 05/05/17 18:05 +0100, Jonathan Wakely wrote:
>As discussed at http://stackoverflow.com/q/43769773/981959 (and kinda
>hinted at by http://wg21.link/lwg1200) there's a problem with
>char_traits<char16_t>::eof() because it returns int_type(-1) which is
>the same value as u'\uFFFF', a valid UTF-16 code point.
>
>i.e. because all values of int_type are also valid values of char_type
>we cannot meet the requirement that:
>
>"The member eof() shall return an implementation-defined constant
>that cannot appear as a valid UTF-16 code unit."
>
>I've reported this as a defect, suggesting that the wording above
>needs to change.
>
>One consequence is that basic_streambuf<char16_t>::sputc(u'\uFFFF')
>always returns the same value, whether it succeeds or not. On success
>it returns to_int_type(u'\uFFFF') and on failure it returns eof(),
>which is the same value. I think that can be solved with the attached
>change, which preserves the invariant in [char.traits.require] that
>eof() returns:
>
>"a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for
>all values c."
>
>This can be true if we ensure that to_int_type never returns the eof()
>value. http://www.unicode.org/faq/private_use.html#nonchar10 suggests
>doing something like this.
>
>It means that when writing u'\uFFFF' to a streambuf we write that
>character successfully, but return u'\uFFFD' instead; and when reading
>u'\uFFFF' from a streambuf we return u'\uFFFD' instead. This is
>asymmetrical, as we can write that character but not read it back.  It
>might be better to refuse to write u'\uFFFF' and write it as the
>replacement character instead, but I think I prefer to write the right
>character when possible. It also doesn't require any extra changes.
>
>All tests pass with this, does anybody see any problems with this
>approach?
>
>

>commit 8ab705e4920e933d3b0e90fd004b93d89aab8619
>Author: Jonathan Wakely <jwakely@redhat.com>
>Date:   Fri May 5 16:57:07 2017 +0100
>
>    PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof()
>    
>    	PR libstdc++/80624
>    	* doc/xml/manual/status_cxx2011.xml: Document to_int_type behaviour.
>    	* include/bits/char_traits.h (char_traits<char16_t>::to_int_type):
>    	Transform eof value to U+FFFD.
>    	* testsuite/21_strings/char_traits/requirements/char16_t/eof.cc: New.
>    	* testsuite/27_io/basic_streambuf/sgetc/char16_t/80624.cc: New.
>    	* testsuite/27_io/basic_streambuf/sputc/char16_t/80624.cc: New.

I've committed this now. I'll work with WG21 to resolve
https://wg21.link/lwg2959 and if a better solution is found, we can do
that instead. Until then getting some implementation and usage
experience of this solution seems valuable.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-06-02 18:35 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-05 17:19 [PATCH] PR libstdc++/80624 satisfy invariant for char_traits<char16_t>::eof() Jonathan Wakely
2017-05-08  7:53 ` Stephan Bergmann via gcc-patches
2017-05-08 10:24   ` Jonathan Wakely via gcc-patches
2017-05-08 10:19 ` Florian Weimer via gcc-patches
2017-05-08 10:46   ` Jonathan Wakely via gcc-patches
2017-05-08 10:57     ` Florian Weimer via gcc-patches
2017-05-08 11:18       ` Jonathan Wakely via gcc-patches
2017-06-02 18:35 ` Jonathan Wakely

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).