public inbox for libstdc++@gcc.gnu.org
* ostream::operator<<() and sputn()
@ 2021-07-14 21:26 Lewis Hyatt
  2021-07-14 21:30 ` Jonathan Wakely
  0 siblings, 1 reply; 8+ messages in thread
From: Lewis Hyatt @ 2021-07-14 21:26 UTC (permalink / raw)
  To: libstdc++

[-- Attachment #1: Type: text/plain, Size: 3341 bytes --]

Hello-

I noticed that libstdc++'s implementation of ostream::operator<<() prefers
to call sputn() on the underlying streambuf for all char, char*, and string
output operations, including single characters, rather than manipulate the
buffer directly. I am curious why it works this way; it seems suboptimal to
me, because sputn() is mandated to call the virtual function xsputn() on
every call, while e.g. sputc() simply manipulates the buffer and only needs
a virtual call when the buffer is full. I always thought that the buffer
abstraction, and the resulting avoidance of virtual calls for the majority
of operations, was the main point of streambuf's design, and that sputn()
was meant for cases where the output would be large enough to overflow the
buffer anyway, when it may be possible to skip the buffer and flush
directly instead.

It seems to me that for most typical use cases, xsputn() is still going to
want to use the buffer if the output fits into it; libstdc++ does this in
basic_filebuf, for example. So then it would seem to be beneficial to try
the buffer prior to making the virtual function call, instead of after --
especially because the typical char instantiation of __ostream_insert that
makes this call for operator<<() is hidden inside the .so, and is not
inlined or eligible for devirtualization optimizations.

FWIW, here is a small test case.

---------
#include <ostream>
#include <iostream>
#include <fstream>
#include <sstream>
#include <chrono>
#include <random>
using namespace std;

int main() {
    constexpr size_t N = 500000000;
    string s(N, 'x');

    ofstream of{"/dev/null"};
    ostringstream os;
    ostream* streams[] = {&of, &os};
    mt19937 rng{random_device{}()};

    const auto timed_run = [&](const char* label, auto&& callback) {
        const auto t1 = chrono::steady_clock::now();
        for(char c: s) callback(*streams[rng() % 2], c);
        const auto t2 = chrono::steady_clock::now();
        cout << label << " took: "
             << chrono::duration<double>(t2-t1).count()
             << " seconds" << endl;
    };

    timed_run("insert with put()", [](ostream& o, char c) {o.put(c);});
    timed_run("insert with op<< ", [](ostream& o, char c) {o << c;});
}
---------

This is what I get with the current trunk:
---------
insert with put() took: 6.12152 seconds
insert with op<<  took: 13.4437 seconds
---------

And this is what I get with the attached patch:
---------
insert with put() took: 6.08313 seconds
insert with op<<  took: 8.24565 seconds
---------

So the overhead of calling operator<< vs calling put() was reduced by more
than 3X.

The prototype patch calls an internal alternative to sputn(), which tries
the buffer prior to calling xsputn(). All the tests still pass with this
patch in place, but I think it would not be suitable as-is, because it
affects the overload for std::string_view, and I think the standard may
actually mandate that xsputn() be called in that case. (It says the string
should be output as if sputn() were called, which I think makes the call
to xsputn() an observable side effect.) Still, it would work fine outside
the string case AFAIK. I thought it was an interesting result worth
sharing, and am curious what others think, or whether there is a reason I
am missing why the current way is better. Thanks!

-Lewis


[-- Attachment #2: buffered_sputn.txt --]
[-- Type: text/plain, Size: 2939 bytes --]

diff --git a/libstdc++-v3/include/bits/ostream.tcc b/libstdc++-v3/include/bits/ostream.tcc
index 06b2217d216..614b347736b 100644
--- a/libstdc++-v3/include/bits/ostream.tcc
+++ b/libstdc++-v3/include/bits/ostream.tcc
@@ -195,7 +195,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  ios_base::iostate __err = ios_base::goodbit;
 	  __try
 	    {
-	      if (this->rdbuf()->sputn(__s, __n) != __n)
+	      if (this->rdbuf()->__buffered_sputn(__s, __n) != __n)
 		__err = ios_base::badbit;
 	    }
 	  __catch(__cxxabiv1::__forced_unwind&)
diff --git a/libstdc++-v3/include/bits/ostream_insert.h b/libstdc++-v3/include/bits/ostream_insert.h
index 72eea8b1d05..650ea1afd9e 100644
--- a/libstdc++-v3/include/bits/ostream_insert.h
+++ b/libstdc++-v3/include/bits/ostream_insert.h
@@ -48,7 +48,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       typedef basic_ostream<_CharT, _Traits>       __ostream_type;      
       typedef typename __ostream_type::ios_base    __ios_base;
 
-      const streamsize __put = __out.rdbuf()->sputn(__s, __n);
+      const streamsize __put = __out.rdbuf()->__buffered_sputn(__s, __n);
       if (__put != __n)
 	__out.setstate(__ios_base::badbit);
     }
diff --git a/libstdc++-v3/include/bits/streambuf.tcc b/libstdc++-v3/include/bits/streambuf.tcc
index 22464c4401c..7e803e4fc08 100644
--- a/libstdc++-v3/include/bits/streambuf.tcc
+++ b/libstdc++-v3/include/bits/streambuf.tcc
@@ -108,6 +108,22 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       return __ret;
     }
 
+  template<typename _CharT, typename _Traits>
+    streamsize
+    basic_streambuf<_CharT, _Traits>::
+    __buffered_sputn(const char_type* __s, streamsize __n)
+  {
+    const streamsize __buf_len = this->epptr() - this->pptr();
+    if (__buf_len >= __n)
+      {
+	traits_type::copy(this->pptr(), __s, __n);
+	this->__safe_pbump(__n);
+	return __n;
+      }
+    else
+      return this->xsputn(__s, __n);
+  }
+
   // Conceivably, this could be used to implement buffer-to-buffer
   // copies, if this was ever desired in an un-ambiguous way by the
   // standard.
diff --git a/libstdc++-v3/include/std/streambuf b/libstdc++-v3/include/std/streambuf
index e213104c4f4..a1a2277adc7 100644
--- a/libstdc++-v3/include/std/streambuf
+++ b/libstdc++-v3/include/std/streambuf
@@ -455,6 +455,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       sputn(const char_type* __s, streamsize __n)
       { return this->xsputn(__s, __n); }
 
+      /**
+       *  @brief  Buffered, non-virtual counterpart to sputn().
+       *  @param  __s  A buffer read area.
+       *  @param  __n  A count.
+       *
+       *  Internal output function.
+       *
+       *  Has the same effect as sputn(__s, __n), but writes directly
+       *  into the existing put area when the data fits, rather than
+       *  making a virtual call.
+      */
+      streamsize
+      __buffered_sputn(const char_type* __s, streamsize __n);
+
     protected:
       /**
        *  @brief  Base constructor.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: ostream::operator<<() and sputn()
  2021-07-14 21:26 ostream::operator<<() and sputn() Lewis Hyatt
@ 2021-07-14 21:30 ` Jonathan Wakely
  2021-07-14 21:45   ` Lewis Hyatt
  2021-07-14 21:54   ` Dietmar Kühl
  0 siblings, 2 replies; 8+ messages in thread
From: Jonathan Wakely @ 2021-07-14 21:30 UTC (permalink / raw)
  To: Lewis Hyatt; +Cc: libstdc++

On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
<libstdc++@gcc.gnu.org> wrote:
>
> [...]
>
> The prototype patch calls an internal alternate to sputn(), which tries the
> buffer prior to calling xsputn().

This won't work if a user provides an explicit specialization of
basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
will still try to call your new function, but it won't be present in
the user's specialization, so will fail to compile. The basic_ostream
primary template can only use the standard API of basic_streambuf. The
std::basic_ostream<char> specialization can use non-standard members
of std::basic_streambuf<char> because we know users can't specialize
that.


* Re: ostream::operator<<() and sputn()
  2021-07-14 21:30 ` Jonathan Wakely
@ 2021-07-14 21:45   ` Lewis Hyatt
  2021-07-15 17:11     ` François Dumont
  2021-07-14 21:54   ` Dietmar Kühl
  1 sibling, 1 reply; 8+ messages in thread
From: Lewis Hyatt @ 2021-07-14 21:45 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++

On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
>
> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
> <libstdc++@gcc.gnu.org> wrote:
> >
> > [...]
> >
> > The prototype patch calls an internal alternate to sputn(), which tries the
> > buffer prior to calling xsputn().
>
> This won't work if a user provides an explicit specialization of
> basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
> will still try to call your new function, but it won't be present in
> the user's specialization, so will fail to compile. The basic_ostream
> primary template can only use the standard API of basic_streambuf. The
> std::basic_ostream<char> specialization can use non-standard members
> of std::basic_streambuf<char> because we know users can't specialize
> that.

Thanks, that makes sense; this was more of a quick proof of concept. I
guess a real version could work around this, or it could be implemented
purely in terms of sputc() too. I am curious whether you think the
overall idea is worthwhile, though? Partly I am trying to understand it
better; it was a bit surprising to me that the standard says that
sputn() *must* call xsputn(). Calling it only when a call to overflow()
would otherwise be necessary feels like it makes more sense to me...


-Lewis


* Re: ostream::operator<<() and sputn()
  2021-07-14 21:30 ` Jonathan Wakely
  2021-07-14 21:45   ` Lewis Hyatt
@ 2021-07-14 21:54   ` Dietmar Kühl
  1 sibling, 0 replies; 8+ messages in thread
From: Dietmar Kühl @ 2021-07-14 21:54 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: Lewis Hyatt, libstdc++

Hello,

Of course, you could detect whether the user did specialize basic_streambuf: the default implementation can use a special member name, let's say __not_specialized, which the user cannot portably write and which therefore cannot appear in the user's specialization. Detecting that member could opt into an enhanced code path. I'm not recommending such an implementation, though.

When I implemented my own version of IOStreams I used a few special approaches, notably a segmented-iterator optimization for std::[io]streambuf_iterator with somewhat more direct access to the streambuf's buffer. For actually buffered streams these things can yield a substantial performance boost. There is some penalty for non-buffered (well, single-character-buffered) stream buffers.

Making changes to the streams while maintaining ABI compatibility is probably not quite worth the trouble, though.

Thanks,
  Dietmar

> On 14 Jul 2021, at 22:31, Jonathan Wakely via Libstdc++ <libstdc++@gcc.gnu.org> wrote:
> 
> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
> <libstdc++@gcc.gnu.org> wrote:
>>
>> [...]
>>
>> The prototype patch calls an internal alternate to sputn(), which tries the
>> buffer prior to calling xsputn().
> 
> This won't work if a user provides an explicit specialization of
> basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
> will still try to call your new function, but it won't be present in
> the user's specialization, so will fail to compile. The basic_ostream
> primary template can only use the standard API of basic_streambuf. The
> std::basic_ostream<char> specialization can use non-standard members
> of std::basic_streambuf<char> because we know users can't specialize
> that.



* Re: ostream::operator<<() and sputn()
  2021-07-14 21:45   ` Lewis Hyatt
@ 2021-07-15 17:11     ` François Dumont
  2022-01-10 16:07       ` Jonathan Wakely
  0 siblings, 1 reply; 8+ messages in thread
From: François Dumont @ 2021-07-15 17:11 UTC (permalink / raw)
  To: Lewis Hyatt, Jonathan Wakely; +Cc: libstdc++

On 14/07/21 11:45 pm, Lewis Hyatt via Libstdc++ wrote:
> On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely<jwakely.gcc@gmail.com>  wrote:
>> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
>> <libstdc++@gcc.gnu.org>  wrote:
>>> [...]
>>
>> [...]
> Thanks, makes sense, this was more just a quick proof of concept. I
> guess a real version could work around this, well it could be
> implemented purely in terms of sputc() too. Am curious if you think
> the overall idea is worthwhile though? Partly I am trying to
> understand it better, like it was a bit surprising to me that the
> standard says that sputn() *must* call xsputn(). Feels like calling
> it, only if a call to overflow() would otherwise be necessary, makes
> more sense to me...
>
>
> -Lewis

I think that the issue you spotted can be summarized by the
implementation of operator<< in <ostream>:

   template<typename _CharT, typename _Traits>
     inline basic_ostream<_CharT, _Traits>&
     operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
     { return __ostream_insert(__out, &__c, 1); }

Outputting a single _CharT is treated like outputting a C string.

If you add the plumbing to have a __ostream_insert(__out, __c) overload,
then buffering should take place normally, as it will end up in a call to
sputc().

Whether it is worthwhile or not, I would say that if you need it and
eventually implement it, do not hesitate to submit it here!

François




* Re: ostream::operator<<() and sputn()
  2021-07-15 17:11     ` François Dumont
@ 2022-01-10 16:07       ` Jonathan Wakely
  2022-01-10 23:18         ` Lewis Hyatt
  0 siblings, 1 reply; 8+ messages in thread
From: Jonathan Wakely @ 2022-01-10 16:07 UTC (permalink / raw)
  To: François Dumont; +Cc: Lewis Hyatt, libstdc++, Dietmar Kuehl

On Thu, 15 Jul 2021 at 18:11, François Dumont wrote:
>
> On 14/07/21 11:45 pm, Lewis Hyatt via Libstdc++ wrote:
> >
> > [...]
>
> I think that the issue you spotted can be summarize by the
> implementation of operator<< in <ostream>:
>
>    template<typename _CharT, typename _Traits>
>      inline basic_ostream<_CharT, _Traits>&
>      operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
>      { return __ostream_insert(__out, &__c, 1); }
>
> To output a single _CharT is treated as to output a C string.
>
> If you add the plumbing to have a __ostream_insert(__out, __c) then
> buffering should take place normally as it will end-up into a call to sputc.

Indeed. This improves the performance of Lewis's testcase so there is
almost no overhead compared to using ostream::put(c) directly:

--- a/libstdc++-v3/include/std/ostream
+++ b/libstdc++-v3/include/std/ostream
@@ -505,7 +505,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  template<typename _CharT, typename _Traits>
    inline basic_ostream<_CharT, _Traits>&
    operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
-    { return __ostream_insert(__out, &__c, 1); }
+    {
+      if (__out.width() != 0)
+       return __ostream_insert(__out, &__c, 1);
+      __out.put(__c);
+      return __out;
+    }

  template<typename _CharT, typename _Traits>
    inline basic_ostream<_CharT, _Traits>&
@@ -516,7 +521,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  template<typename _Traits>
    inline basic_ostream<char, _Traits>&
    operator<<(basic_ostream<char, _Traits>& __out, char __c)
-    { return __ostream_insert(__out, &__c, 1); }
+    {
+      if (__out.width() != 0)
+       return __ostream_insert(__out, &__c, 1);
+      __out.put(__c);
+      return __out;
+    }

  // Signed and unsigned
  template<typename _Traits>

I think the compiler will optimize this more aggressively than if the
new code is inside __ostream_write or __ostream_insert.

We could also use put(__c) when __out.width() == 1 but then we'd need
to also call width(0) if put(__c) succeeds, and that extra logic
defeats some of the optimization. Using width()==1 is rather pointless
when inserting single chars (as it is no different to width(0)) so I'm
not too worried about not optimizing that case.

> Whether it is worthwhile or not, I would say that if you need it and
> eventually implement it, do not hesitate to submit it here!

I think this is simple enough that it's worth doing. It improves the
default case where there's no padding, and shouldn't add much overhead
when there is padding.

If it turns out that an observable call to xsputn is not required for
string views and strings, we can consider something like Lewis's patch
as well (which would help when writing many strings
where most of them fit in the buffer).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: ostream::operator<<() and sputn()
  2022-01-10 16:07       ` Jonathan Wakely
@ 2022-01-10 23:18         ` Lewis Hyatt
  2022-01-11  1:09           ` Jonathan Wakely
  0 siblings, 1 reply; 8+ messages in thread
From: Lewis Hyatt @ 2022-01-10 23:18 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: François Dumont, libstdc++, Dietmar Kuehl

On Mon, Jan 10, 2022 at 11:07 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
>
> On Thu, 15 Jul 2021 at 18:11, François Dumont wrote:
> >
> > On 14/07/21 11:45 pm, Lewis Hyatt via Libstdc++ wrote:
> > > On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely<jwakely.gcc@gmail.com>  wrote:
> > >> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
> > >> <libstdc++@gcc.gnu.org>  wrote:
> > >>> Hello-
> > >>>
> > >>> I noticed that libstdc++'s implementation of ostream::operator<<() prefers
> > >>> to call sputn() on the underlying streambuf for all char, char*, and string
> > >>> output operations, including single characters, rather than manipulate the
> > >>> buffer directly. I am curious why it works this way, it feels perhaps
> > >>> suboptimal to me because sputn() is mandated to call the virtual function
> > >>> xsputn() on every call, while e.g. sputc() simply manipulates the buffer and
> > >>> only needs a virtual call when the buffer is full. I always thought that the
> > >>> buffer abstraction and the resulting avoidance of virtual calls for the
> > >>> majority of operations was the main point of streambuf's design, and that
> > >>> sputn() was meant for cases when the output would be large enough to
> > >>> overflow the buffer anyway, if it may be possible to skip the buffer and
> > >>> flush directly instead?
> > >>>
> > >>> It seems to me that for most typical use cases, xsputn() is still going to
> > >>> want to use the buffer if the output fits into it; libstdc++ does this in
> > >>> basic_filebuf, for example. So then it would seem to be beneficial to try
> > >>> the buffer prior to making the virtual function call, instead of after --
> > >>> especially because the typical char instantiation of __ostream_insert that
> > >>> makes this call for operator<<() is hidden inside the .so, and is not
> > >>> inlined or eligible for devirtualization optimizations.
> > >>>
> > >>> FWIW, here is a small test case.
> > >>>
> > >>> ---------
> > >>> #include <ostream>
> > >>> #include <iostream>
> > >>> #include <fstream>
> > >>> #include <sstream>
> > >>> #include <chrono>
> > >>> #include <random>
> > >>> using namespace std;
> > >>>
> > >>> int main() {
> > >>>      constexpr size_t N = 500000000;
> > >>>      string s(N, 'x');
> > >>>
> > >>>      ofstream of{"/dev/null"};
> > >>>      ostringstream os;
> > >>>      ostream* streams[] = {&of, &os};
> > >>>      mt19937 rng{random_device{}()};
> > >>>
> > >>>      const auto timed_run = [&](const char* label, auto&& callback) {
> > >>>          const auto t1 = chrono::steady_clock::now();
> > >>>          for(char c: s) callback(*streams[rng() % 2], c);
> > >>>          const auto t2 = chrono::steady_clock::now();
> > >>>          cout << label << " took: "
> > >>>               << chrono::duration<double>(t2-t1).count()
> > >>>               << " seconds" << endl;
> > >>>      };
> > >>>
> > >>>      timed_run("insert with put()", [](ostream& o, char c) {o.put(c);});
> > >>>      timed_run("insert with op<< ", [](ostream& o, char c) {o << c;});
> > >>> }
> > >>> ---------
> > >>>
> > >>> This is what I get with the current trunk:
> > >>> ---------
> > >>> insert with put() took: 6.12152 seconds
> > >>> insert with op<<  took: 13.4437 seconds
> > >>> ---------
> > >>>
> > >>> And this is what I get with the attached patch:
> > >>> ---------
> > >>> insert with put() took: 6.08313 seconds
> > >>> insert with op<<  took: 8.24565 seconds
> > >>> ---------
> > >>>
> > >>> So the overhead of calling operator<< vs calling put() was reduced by more
> > >>> than 3X.
> > >>>
> > >>> The prototype patch calls an internal alternate to sputn(), which tries the
> > >>> buffer prior to calling xsputn().
> > >> This won't work if a user provides an explicit specialization of
> > >> basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
> > >> will still try to call your new function, but it won't be present in
> > >> the user's specialization, so will fail to compile. The basic_ostream
> > >> primary template can only use the standard API of basic_streambuf. The
> > >> std::basic_ostream<char> specialization can use non-standard members
> > >> of std::basic_streambuf<char> because we know users can't specialize
> > >> that.
> > > Thanks, makes sense, this was more just a quick proof of concept. I
> > > guess a real version could work around this, well it could be
> > > implemented purely in terms of sputc() too. Am curious if you think
> > > the overall idea is worthwhile though? Partly I am trying to
> > > understand it better, like it was a bit surprising to me that the
> > > standard says that sputn() *must* call xsputn(). Feels like calling
> > > it, only if a call to overflow() would otherwise be necessary, makes
> > > more sense to me...
> > >
> > >
> > > -Lewis
> > > .
> >
> > I think that the issue you spotted can be summarized by the
> > implementation of operator<< in <ostream>:
> >
> >    template<typename _CharT, typename _Traits>
> >      inline basic_ostream<_CharT, _Traits>&
> >      operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
> >      { return __ostream_insert(__out, &__c, 1); }
> >
> > Outputting a single _CharT is treated like outputting a C string.
> >
> > If you add the plumbing to have a __ostream_insert(__out, __c) then
> > buffering should take place normally, as it will end up in a call to sputc().
>
> Indeed. This improves the performance of Lewis's testcase so there is
> almost no overhead compared to using ostream::put(c) directly:
>
> --- a/libstdc++-v3/include/std/ostream
> +++ b/libstdc++-v3/include/std/ostream
> @@ -505,7 +505,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>   template<typename _CharT, typename _Traits>
>     inline basic_ostream<_CharT, _Traits>&
>     operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
> -    { return __ostream_insert(__out, &__c, 1); }
> +    {
> +      if (__out.width() != 0)
> +       return __ostream_insert(__out, &__c, 1);
> +      __out.put(__c);
> +      return __out;
> +    }
>
>   template<typename _CharT, typename _Traits>
>     inline basic_ostream<_CharT, _Traits>&
> @@ -516,7 +521,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>   template<typename _Traits>
>     inline basic_ostream<char, _Traits>&
>     operator<<(basic_ostream<char, _Traits>& __out, char __c)
> -    { return __ostream_insert(__out, &__c, 1); }
> +    {
> +      if (__out.width() != 0)
> +       return __ostream_insert(__out, &__c, 1);
> +      __out.put(__c);
> +      return __out;
> +    }
>
>   // Signed and unsigned
>   template<typename _Traits>
>
> I think the compiler will optimize this more aggressively than if the
> new code is inside __ostream_write or __ostream_insert.
>
> We could also use put(__c) when __out.width() == 1 but then we'd need
> to also call width(0) if put(__c) succeeds, and that extra logic
> defeats some of the optimization. Using width()==1 is rather pointless
> when inserting single chars (as it is no different to width(0)) so I'm
> not too worried about not optimizing that case.
>
> > Whether it is worthwhile or not, I would say that if you need it and
> > eventually implement it, do not hesitate to submit it here!
>
> I think this is simple enough that it's worth doing. It improves the
> default case where there's no padding, and shouldn't add much overhead
> when there is padding.
>
> If it turns out that an observable call to xsputn is not required for
> string views and strings, we can consider something like Lewis's patch
> as well (which would help when writing many strings
> where most of them fit in the buffer).

Thanks a lot for circling back to this. I feel like your change is a
pure improvement with little downside, so it would be great to have
it. Let me know please if I should do anything more there.

To me, the whole situation with sputn() is just unfortunate, because I
think it should not have been specified to always make the virtual
call to xsputn(), but rather only make it if needed. Then everything
would work as well as possible. Like after this change to the
single-char overload, a similar unexpected performance issue will
affect something like:  cout << "a"  compared to cout << 'a', where
now the former will make a virtual call every time and the latter
won't. libstdc++ could handle the char* overload case as well, but
then it still would not be permitted to handle a string or
string_view, so at some level the standard specification is the
constraint... Seems unlikely it will change though, so I think it
means that avoiding sputn() wherever permitted will always be an
avenue towards improving performance.

-Lewis

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: ostream::operator<<() and sputn()
  2022-01-10 23:18         ` Lewis Hyatt
@ 2022-01-11  1:09           ` Jonathan Wakely
  0 siblings, 0 replies; 8+ messages in thread
From: Jonathan Wakely @ 2022-01-11  1:09 UTC (permalink / raw)
  To: Lewis Hyatt; +Cc: François Dumont, libstdc++, Dietmar Kuehl

On Mon, 10 Jan 2022 at 23:18, Lewis Hyatt wrote:
> Thanks a lot for circling back to this. I feel like your change is a
> pure improvement with little downside, so it would be great to have
> it. Let me know please if I should do anything more there.

Leave it with me.

> To me, the whole situation with sputn() is just unfortunate, because I
> think it should not have been specified to always make the virtual
> call to xsputn(), but rather only make it if needed. Then everything
> would work as well as possible. Like after this change to the
> single-char overload, a similar unexpected performance issue will
> affect something like:  cout << "a"  compared to cout << 'a', where
> now the former will make a virtual call every time and the latter
> won't. libstdc++ could handle the char* overload case as well, but
> then it still would not be permitted to handle a string or
> string_view, so at some level the standard specification is the
> constraint... Seems unlikely it will change though, so I think it
> means that avoiding sputn() wherever permitted will always be an
> avenue towards improving performance.

I think the standard is overspecified, and changing it makes sense. It will
take some time though.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-01-11  1:09 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-14 21:26 ostream::operator<<() and sputn() Lewis Hyatt
2021-07-14 21:30 ` Jonathan Wakely
2021-07-14 21:45   ` Lewis Hyatt
2021-07-15 17:11     ` François Dumont
2022-01-10 16:07       ` Jonathan Wakely
2022-01-10 23:18         ` Lewis Hyatt
2022-01-11  1:09           ` Jonathan Wakely
2021-07-14 21:54   ` Dietmar Kühl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).