* ostream::operator<<() and sputn()
@ 2021-07-14 21:26 Lewis Hyatt
  2021-07-14 21:30 ` Jonathan Wakely
  0 siblings, 1 reply; 8+ messages in thread

From: Lewis Hyatt @ 2021-07-14 21:26 UTC (permalink / raw)
To: libstdc++

[-- Attachment #1: Type: text/plain, Size: 3341 bytes --]

Hello-

I noticed that libstdc++'s implementation of ostream::operator<<() prefers
to call sputn() on the underlying streambuf for all char, char*, and string
output operations, including single characters, rather than manipulate the
buffer directly. I am curious why it works this way; it seems suboptimal to
me, because sputn() is mandated to call the virtual function xsputn() on
every call, while e.g. sputc() simply manipulates the buffer and only needs
a virtual call when the buffer is full. I always thought that the buffer
abstraction and the resulting avoidance of virtual calls for the majority
of operations was the main point of streambuf's design, and that sputn()
was meant for cases where the output would be large enough to overflow the
buffer anyway, so that it may be possible to skip the buffer and flush
directly instead.

It seems to me that for most typical use cases, xsputn() is still going to
want to use the buffer if the output fits into it; libstdc++ does this in
basic_filebuf, for example. So it would seem beneficial to try the buffer
prior to making the virtual function call, instead of after -- especially
because the typical char instantiation of __ostream_insert that makes this
call for operator<<() is hidden inside the .so, and is not inlined or
eligible for devirtualization optimizations.

FWIW, here is a small test case.
---------
#include <ostream>
#include <iostream>
#include <fstream>
#include <sstream>
#include <chrono>
#include <random>
using namespace std;

int main() {
    constexpr size_t N = 500000000;
    string s(N, 'x');

    ofstream of{"/dev/null"};
    ostringstream os;
    ostream* streams[] = {&of, &os};
    mt19937 rng{random_device{}()};

    const auto timed_run = [&](const char* label, auto&& callback) {
        const auto t1 = chrono::steady_clock::now();
        for(char c: s) callback(*streams[rng() % 2], c);
        const auto t2 = chrono::steady_clock::now();
        cout << label << " took: "
             << chrono::duration<double>(t2-t1).count()
             << " seconds" << endl;
    };

    timed_run("insert with put()", [](ostream& o, char c) {o.put(c);});
    timed_run("insert with op<< ", [](ostream& o, char c) {o << c;});
}
---------

This is what I get with the current trunk:
---------
insert with put() took: 6.12152 seconds
insert with op<< took: 13.4437 seconds
---------

And this is what I get with the attached patch:
---------
insert with put() took: 6.08313 seconds
insert with op<< took: 8.24565 seconds
---------

So the overhead of calling operator<< vs calling put() was reduced by more
than 3X.

The prototype patch calls an internal alternative to sputn(), which tries
the buffer prior to calling xsputn(). All the tests still pass with this
patch in place, but I think it would not be suitable as-is, because it
affects the overload for std::string_view, and I think the standard may
actually mandate that xsputn() be called always in this case. (It says the
string should be output as if sputn() were called, which I think makes
xsputn() an observable side effect.) Still, it would work fine outside the
string case AFAIK. I thought it was maybe an interesting result worth
sharing, and am curious what others think here, or whether there is a
reason I am missing why the current way is better. Thanks!
-Lewis

[-- Attachment #2: buffered_sputn.txt --]
[-- Type: text/plain, Size: 2939 bytes --]

diff --git a/libstdc++-v3/include/bits/ostream.tcc b/libstdc++-v3/include/bits/ostream.tcc
index 06b2217d216..614b347736b 100644
--- a/libstdc++-v3/include/bits/ostream.tcc
+++ b/libstdc++-v3/include/bits/ostream.tcc
@@ -195,7 +195,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       ios_base::iostate __err = ios_base::goodbit;
       __try
 	{
-	  if (this->rdbuf()->sputn(__s, __n) != __n)
+	  if (this->rdbuf()->__buffered_sputn(__s, __n) != __n)
 	    __err = ios_base::badbit;
 	}
       __catch(__cxxabiv1::__forced_unwind&)
diff --git a/libstdc++-v3/include/bits/ostream_insert.h b/libstdc++-v3/include/bits/ostream_insert.h
index 72eea8b1d05..650ea1afd9e 100644
--- a/libstdc++-v3/include/bits/ostream_insert.h
+++ b/libstdc++-v3/include/bits/ostream_insert.h
@@ -48,7 +48,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       typedef basic_ostream<_CharT, _Traits>       __ostream_type;
       typedef typename __ostream_type::ios_base    __ios_base;

-      const streamsize __put = __out.rdbuf()->sputn(__s, __n);
+      const streamsize __put = __out.rdbuf()->__buffered_sputn(__s, __n);
       if (__put != __n)
 	__out.setstate(__ios_base::badbit);
     }
diff --git a/libstdc++-v3/include/bits/streambuf.tcc b/libstdc++-v3/include/bits/streambuf.tcc
index 22464c4401c..7e803e4fc08 100644
--- a/libstdc++-v3/include/bits/streambuf.tcc
+++ b/libstdc++-v3/include/bits/streambuf.tcc
@@ -108,6 +108,22 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       return __ret;
     }

+  template<typename _CharT, typename _Traits>
+    streamsize
+    basic_streambuf<_CharT, _Traits>::
+    __buffered_sputn(const char_type* __s, streamsize __n)
+    {
+      const streamsize __buf_len = this->epptr() - this->pptr();
+      if (__buf_len >= __n)
+	{
+	  traits_type::copy(this->pptr(), __s, __n);
+	  this->__safe_pbump(__n);
+	  return __n;
+	}
+      else
+	return this->xsputn(__s, __n);
+    }
+
   // Conceivably, this could be used to implement buffer-to-buffer
   // copies, if this was ever desired in an un-ambiguous way by the
   // standard.
diff --git a/libstdc++-v3/include/std/streambuf b/libstdc++-v3/include/std/streambuf
index e213104c4f4..a1a2277adc7 100644
--- a/libstdc++-v3/include/std/streambuf
+++ b/libstdc++-v3/include/std/streambuf
@@ -455,6 +455,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       sputn(const char_type* __s, streamsize __n)
       { return this->xsputn(__s, __n); }

+      /**
+       *  @brief  Multiple character insertion, trying the buffer first.
+       *  @param  __s  A buffer read area.
+       *  @param  __n  A count.
+       *
+       *  Internal output function.
+       *
+       *  Has the same effect as sputn(__s, __n), but attempts to use the
+       *  existing buffer space rather than make a virtual call, if possible.
+       */
+      streamsize
+      __buffered_sputn(const char_type* __s, streamsize __n);
+
     protected:
       /**
       *  @brief  Base constructor.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: ostream::operator<<() and sputn()
  2021-07-14 21:26 ostream::operator<<() and sputn() Lewis Hyatt
@ 2021-07-14 21:30 ` Jonathan Wakely
  2021-07-14 21:45   ` Lewis Hyatt
  2021-07-14 21:54   ` Dietmar Kühl
  0 siblings, 2 replies; 8+ messages in thread

From: Jonathan Wakely @ 2021-07-14 21:30 UTC (permalink / raw)
To: Lewis Hyatt; +Cc: libstdc++

On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
<libstdc++@gcc.gnu.org> wrote:
>
> Hello-
>
> I noticed that libstdc++'s implementation of ostream::operator<<() prefers
> to call sputn() on the underlying streambuf for all char, char*, and string
> output operations, including single characters, rather than manipulate the
> buffer directly. I am curious why it works this way, it feels perhaps
> suboptimal to me because sputn() is mandated to call the virtual function
> xsputn() on every call, while e.g. sputc() simply manipulates the buffer and
> only needs a virtual call when the buffer is full. I always thought that the
> buffer abstraction and the resulting avoidance of virtual calls for the
> majority of operations was the main point of streambuf's design, and that
> sputn() was meant for cases when the output would be large enough to
> overflow the buffer anyway, if it may be possible to skip the buffer and
> flush directly instead?
>
> It seems to me that for most typical use cases, xsputn() is still going to
> want to use the buffer if the output fits into it; libstdc++ does this in
> basic_filebuf, for example. So then it would seem to be beneficial to try
> the buffer prior to making the virtual function call, instead of after --
> especially because the typical char instantiation of __ostream_insert that
> makes this call for operator<<() is hidden inside the .so, and is not
> inlined or eligible for devirtualization optimizations.
>
> FWIW, here is a small test case.
>
> ---------
> #include <ostream>
> #include <iostream>
> #include <fstream>
> #include <sstream>
> #include <chrono>
> #include <random>
> using namespace std;
>
> int main() {
>     constexpr size_t N = 500000000;
>     string s(N, 'x');
>
>     ofstream of{"/dev/null"};
>     ostringstream os;
>     ostream* streams[] = {&of, &os};
>     mt19937 rng{random_device{}()};
>
>     const auto timed_run = [&](const char* label, auto&& callback) {
>         const auto t1 = chrono::steady_clock::now();
>         for(char c: s) callback(*streams[rng() % 2], c);
>         const auto t2 = chrono::steady_clock::now();
>         cout << label << " took: "
>              << chrono::duration<double>(t2-t1).count()
>              << " seconds" << endl;
>     };
>
>     timed_run("insert with put()", [](ostream& o, char c) {o.put(c);});
>     timed_run("insert with op<< ", [](ostream& o, char c) {o << c;});
> }
> ---------
>
> This is what I get with the current trunk:
> ---------
> insert with put() took: 6.12152 seconds
> insert with op<< took: 13.4437 seconds
> ---------
>
> And this is what I get with the attached patch:
> ---------
> insert with put() took: 6.08313 seconds
> insert with op<< took: 8.24565 seconds
> ---------
>
> So the overhead of calling operator<< vs calling put() was reduced by more
> than 3X.
>
> The prototype patch calls an internal alternate to sputn(), which tries the
> buffer prior to calling xsputn().

This won't work if a user provides an explicit specialization of
basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
will still try to call your new function, but it won't be present in
the user's specialization, so will fail to compile. The basic_ostream
primary template can only use the standard API of basic_streambuf. The
std::basic_ostream<char> specialization can use non-standard members
of std::basic_streambuf<char> because we know users can't specialize
that.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: ostream::operator<<() and sputn()
  2021-07-14 21:30 ` Jonathan Wakely
@ 2021-07-14 21:45   ` Lewis Hyatt
  2021-07-15 17:11     ` François Dumont
  2021-07-14 21:54   ` Dietmar Kühl
  1 sibling, 1 reply; 8+ messages in thread

From: Lewis Hyatt @ 2021-07-14 21:45 UTC (permalink / raw)
To: Jonathan Wakely; +Cc: libstdc++

On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
>
> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
> <libstdc++@gcc.gnu.org> wrote:
> >
> > Hello-
> >
> > I noticed that libstdc++'s implementation of ostream::operator<<() prefers
> > to call sputn() on the underlying streambuf for all char, char*, and string
> > output operations, including single characters, rather than manipulate the
> > buffer directly. I am curious why it works this way, it feels perhaps
> > suboptimal to me because sputn() is mandated to call the virtual function
> > xsputn() on every call, while e.g. sputc() simply manipulates the buffer and
> > only needs a virtual call when the buffer is full. I always thought that the
> > buffer abstraction and the resulting avoidance of virtual calls for the
> > majority of operations was the main point of streambuf's design, and that
> > sputn() was meant for cases when the output would be large enough to
> > overflow the buffer anyway, if it may be possible to skip the buffer and
> > flush directly instead?
> >
> > It seems to me that for most typical use cases, xsputn() is still going to
> > want to use the buffer if the output fits into it; libstdc++ does this in
> > basic_filebuf, for example. So then it would seem to be beneficial to try
> > the buffer prior to making the virtual function call, instead of after --
> > especially because the typical char instantiation of __ostream_insert that
> > makes this call for operator<<() is hidden inside the .so, and is not
> > inlined or eligible for devirtualization optimizations.
> >
> > FWIW, here is a small test case.
> >
> > ---------
> > #include <ostream>
> > #include <iostream>
> > #include <fstream>
> > #include <sstream>
> > #include <chrono>
> > #include <random>
> > using namespace std;
> >
> > int main() {
> >     constexpr size_t N = 500000000;
> >     string s(N, 'x');
> >
> >     ofstream of{"/dev/null"};
> >     ostringstream os;
> >     ostream* streams[] = {&of, &os};
> >     mt19937 rng{random_device{}()};
> >
> >     const auto timed_run = [&](const char* label, auto&& callback) {
> >         const auto t1 = chrono::steady_clock::now();
> >         for(char c: s) callback(*streams[rng() % 2], c);
> >         const auto t2 = chrono::steady_clock::now();
> >         cout << label << " took: "
> >              << chrono::duration<double>(t2-t1).count()
> >              << " seconds" << endl;
> >     };
> >
> >     timed_run("insert with put()", [](ostream& o, char c) {o.put(c);});
> >     timed_run("insert with op<< ", [](ostream& o, char c) {o << c;});
> > }
> > ---------
> >
> > This is what I get with the current trunk:
> > ---------
> > insert with put() took: 6.12152 seconds
> > insert with op<< took: 13.4437 seconds
> > ---------
> >
> > And this is what I get with the attached patch:
> > ---------
> > insert with put() took: 6.08313 seconds
> > insert with op<< took: 8.24565 seconds
> > ---------
> >
> > So the overhead of calling operator<< vs calling put() was reduced by more
> > than 3X.
> >
> > The prototype patch calls an internal alternate to sputn(), which tries the
> > buffer prior to calling xsputn().
>
> This won't work if a user provides an explicit specialization of
> basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
> will still try to call your new function, but it won't be present in
> the user's specialization, so will fail to compile. The basic_ostream
> primary template can only use the standard API of basic_streambuf. The
> std::basic_ostream<char> specialization can use non-standard members
> of std::basic_streambuf<char> because we know users can't specialize
> that.

Thanks, makes sense, this was more just a quick proof of concept. I
guess a real version could work around this; it could be implemented
purely in terms of sputc() too. Am curious if you think the overall
idea is worthwhile though? Partly I am trying to understand it better:
it was a bit surprising to me that the standard says that sputn()
*must* call xsputn(). Calling it only if a call to overflow() would
otherwise be necessary makes more sense to me...

-Lewis

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: ostream::operator<<() and sputn()
  2021-07-14 21:45 ` Lewis Hyatt
@ 2021-07-15 17:11   ` François Dumont
  2022-01-10 16:07     ` Jonathan Wakely
  0 siblings, 1 reply; 8+ messages in thread

From: François Dumont @ 2021-07-15 17:11 UTC (permalink / raw)
To: Lewis Hyatt, Jonathan Wakely; +Cc: libstdc++

On 14/07/21 11:45 pm, Lewis Hyatt via Libstdc++ wrote:
> On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely<jwakely.gcc@gmail.com> wrote:
>> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
>> <libstdc++@gcc.gnu.org> wrote:
>>> Hello-
>>>
>>> I noticed that libstdc++'s implementation of ostream::operator<<() prefers
>>> to call sputn() on the underlying streambuf for all char, char*, and string
>>> output operations, including single characters, rather than manipulate the
>>> buffer directly. I am curious why it works this way, it feels perhaps
>>> suboptimal to me because sputn() is mandated to call the virtual function
>>> xsputn() on every call, while e.g. sputc() simply manipulates the buffer and
>>> only needs a virtual call when the buffer is full. I always thought that the
>>> buffer abstraction and the resulting avoidance of virtual calls for the
>>> majority of operations was the main point of streambuf's design, and that
>>> sputn() was meant for cases when the output would be large enough to
>>> overflow the buffer anyway, if it may be possible to skip the buffer and
>>> flush directly instead?
>>>
>>> It seems to me that for most typical use cases, xsputn() is still going to
>>> want to use the buffer if the output fits into it; libstdc++ does this in
>>> basic_filebuf, for example. So then it would seem to be beneficial to try
>>> the buffer prior to making the virtual function call, instead of after --
>>> especially because the typical char instantiation of __ostream_insert that
>>> makes this call for operator<<() is hidden inside the .so, and is not
>>> inlined or eligible for devirtualization optimizations.
>>>
>>> FWIW, here is a small test case.
>>>
>>> ---------
>>> #include <ostream>
>>> #include <iostream>
>>> #include <fstream>
>>> #include <sstream>
>>> #include <chrono>
>>> #include <random>
>>> using namespace std;
>>>
>>> int main() {
>>>     constexpr size_t N = 500000000;
>>>     string s(N, 'x');
>>>
>>>     ofstream of{"/dev/null"};
>>>     ostringstream os;
>>>     ostream* streams[] = {&of, &os};
>>>     mt19937 rng{random_device{}()};
>>>
>>>     const auto timed_run = [&](const char* label, auto&& callback) {
>>>         const auto t1 = chrono::steady_clock::now();
>>>         for(char c: s) callback(*streams[rng() % 2], c);
>>>         const auto t2 = chrono::steady_clock::now();
>>>         cout << label << " took: "
>>>              << chrono::duration<double>(t2-t1).count()
>>>              << " seconds" << endl;
>>>     };
>>>
>>>     timed_run("insert with put()", [](ostream& o, char c) {o.put(c);});
>>>     timed_run("insert with op<< ", [](ostream& o, char c) {o << c;});
>>> }
>>> ---------
>>>
>>> This is what I get with the current trunk:
>>> ---------
>>> insert with put() took: 6.12152 seconds
>>> insert with op<< took: 13.4437 seconds
>>> ---------
>>>
>>> And this is what I get with the attached patch:
>>> ---------
>>> insert with put() took: 6.08313 seconds
>>> insert with op<< took: 8.24565 seconds
>>> ---------
>>>
>>> So the overhead of calling operator<< vs calling put() was reduced by more
>>> than 3X.
>>>
>>> The prototype patch calls an internal alternate to sputn(), which tries the
>>> buffer prior to calling xsputn().
>> This won't work if a user provides an explicit specialization of
>> basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
>> will still try to call your new function, but it won't be present in
>> the user's specialization, so will fail to compile. The basic_ostream
>> primary template can only use the standard API of basic_streambuf. The
>> std::basic_ostream<char> specialization can use non-standard members
>> of std::basic_streambuf<char> because we know users can't specialize
>> that.
> Thanks, makes sense, this was more just a quick proof of concept. I
> guess a real version could work around this, well it could be
> implemented purely in terms of sputc() too. Am curious if you think
> the overall idea is worthwhile though? Partly I am trying to
> understand it better, like it was a bit surprising to me that the
> standard says that sputn() *must* call xsputn(). Feels like calling
> it, only if a call to overflow() would otherwise be necessary, makes
> more sense to me...
>
> -Lewis

I think that the issue you spotted can be summarized by the
implementation of operator<< in <ostream>:

    template<typename _CharT, typename _Traits>
      inline basic_ostream<_CharT, _Traits>&
      operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
      { return __ostream_insert(__out, &__c, 1); }

Outputting a single _CharT is treated like outputting a C string.

If you add the plumbing to have a __ostream_insert(__out, __c), then
buffering should take place normally, as it will end up in a call to
sputc.

Whether it is worthwhile or not, I would say that if you need it and
eventually implement it, then do not hesitate to submit it here!

François

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: ostream::operator<<() and sputn()
  2021-07-15 17:11 ` François Dumont
@ 2022-01-10 16:07   ` Jonathan Wakely
  2022-01-10 23:18     ` Lewis Hyatt
  0 siblings, 1 reply; 8+ messages in thread

From: Jonathan Wakely @ 2022-01-10 16:07 UTC (permalink / raw)
To: François Dumont; +Cc: Lewis Hyatt, libstdc++, Dietmar Kuehl

On Thu, 15 Jul 2021 at 18:11, François Dumont wrote:
>
> On 14/07/21 11:45 pm, Lewis Hyatt via Libstdc++ wrote:
> > On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely<jwakely.gcc@gmail.com> wrote:
> >> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
> >> <libstdc++@gcc.gnu.org> wrote:
> >>> Hello-
> >>>
> >>> I noticed that libstdc++'s implementation of ostream::operator<<() prefers
> >>> to call sputn() on the underlying streambuf for all char, char*, and string
> >>> output operations, including single characters, rather than manipulate the
> >>> buffer directly. I am curious why it works this way, it feels perhaps
> >>> suboptimal to me because sputn() is mandated to call the virtual function
> >>> xsputn() on every call, while e.g. sputc() simply manipulates the buffer and
> >>> only needs a virtual call when the buffer is full. I always thought that the
> >>> buffer abstraction and the resulting avoidance of virtual calls for the
> >>> majority of operations was the main point of streambuf's design, and that
> >>> sputn() was meant for cases when the output would be large enough to
> >>> overflow the buffer anyway, if it may be possible to skip the buffer and
> >>> flush directly instead?
> >>>
> >>> It seems to me that for most typical use cases, xsputn() is still going to
> >>> want to use the buffer if the output fits into it; libstdc++ does this in
> >>> basic_filebuf, for example. So then it would seem to be beneficial to try
> >>> the buffer prior to making the virtual function call, instead of after --
> >>> especially because the typical char instantiation of __ostream_insert that
> >>> makes this call for operator<<() is hidden inside the .so, and is not
> >>> inlined or eligible for devirtualization optimizations.
> >>>
> >>> FWIW, here is a small test case.
> >>>
> >>> ---------
> >>> #include <ostream>
> >>> #include <iostream>
> >>> #include <fstream>
> >>> #include <sstream>
> >>> #include <chrono>
> >>> #include <random>
> >>> using namespace std;
> >>>
> >>> int main() {
> >>>     constexpr size_t N = 500000000;
> >>>     string s(N, 'x');
> >>>
> >>>     ofstream of{"/dev/null"};
> >>>     ostringstream os;
> >>>     ostream* streams[] = {&of, &os};
> >>>     mt19937 rng{random_device{}()};
> >>>
> >>>     const auto timed_run = [&](const char* label, auto&& callback) {
> >>>         const auto t1 = chrono::steady_clock::now();
> >>>         for(char c: s) callback(*streams[rng() % 2], c);
> >>>         const auto t2 = chrono::steady_clock::now();
> >>>         cout << label << " took: "
> >>>              << chrono::duration<double>(t2-t1).count()
> >>>              << " seconds" << endl;
> >>>     };
> >>>
> >>>     timed_run("insert with put()", [](ostream& o, char c) {o.put(c);});
> >>>     timed_run("insert with op<< ", [](ostream& o, char c) {o << c;});
> >>> }
> >>> ---------
> >>>
> >>> This is what I get with the current trunk:
> >>> ---------
> >>> insert with put() took: 6.12152 seconds
> >>> insert with op<< took: 13.4437 seconds
> >>> ---------
> >>>
> >>> And this is what I get with the attached patch:
> >>> ---------
> >>> insert with put() took: 6.08313 seconds
> >>> insert with op<< took: 8.24565 seconds
> >>> ---------
> >>>
> >>> So the overhead of calling operator<< vs calling put() was reduced by more
> >>> than 3X.
> >>>
> >>> The prototype patch calls an internal alternate to sputn(), which tries the
> >>> buffer prior to calling xsputn().
> >> This won't work if a user provides an explicit specialization of
> >> basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
> >> will still try to call your new function, but it won't be present in
> >> the user's specialization, so will fail to compile. The basic_ostream
> >> primary template can only use the standard API of basic_streambuf. The
> >> std::basic_ostream<char> specialization can use non-standard members
> >> of std::basic_streambuf<char> because we know users can't specialize
> >> that.
> > Thanks, makes sense, this was more just a quick proof of concept. I
> > guess a real version could work around this, well it could be
> > implemented purely in terms of sputc() too. Am curious if you think
> > the overall idea is worthwhile though? Partly I am trying to
> > understand it better, like it was a bit surprising to me that the
> > standard says that sputn() *must* call xsputn(). Feels like calling
> > it, only if a call to overflow() would otherwise be necessary, makes
> > more sense to me...
> >
> > -Lewis
>
> I think that the issue you spotted can be summarize by the
> implementation of operator<< in <ostream>:
>
>      template<typename _CharT, typename _Traits>
>        inline basic_ostream<_CharT, _Traits>&
>        operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
>        { return __ostream_insert(__out, &__c, 1); }
>
> To output a single _CharT is treated as to output a C string.
>
> If you add the plumbing to have a __ostream_insert(__out, __c) then
> buffering should take place normally as it will end-up into a call to sputc.

Indeed. This improves the performance of Lewis's testcase so there is
almost no overhead compared to using ostream::put(c) directly:

--- a/libstdc++-v3/include/std/ostream
+++ b/libstdc++-v3/include/std/ostream
@@ -505,7 +505,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template<typename _CharT, typename _Traits>
     inline basic_ostream<_CharT, _Traits>&
     operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
-    { return __ostream_insert(__out, &__c, 1); }
+    {
+      if (__out.width() != 0)
+	return __ostream_insert(__out, &__c, 1);
+      __out.put(__c);
+      return __out;
+    }

   template<typename _CharT, typename _Traits>
     inline basic_ostream<_CharT, _Traits>&
@@ -516,7 +521,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template<typename _Traits>
     inline basic_ostream<char, _Traits>&
     operator<<(basic_ostream<char, _Traits>& __out, char __c)
-    { return __ostream_insert(__out, &__c, 1); }
+    {
+      if (__out.width() != 0)
+	return __ostream_insert(__out, &__c, 1);
+      __out.put(__c);
+      return __out;
+    }

   // Signed and unsigned
   template<typename _Traits>

I think the compiler will optimize this more aggressively than if the
new code is inside __ostream_write or __ostream_insert.

We could also use put(__c) when __out.width() == 1, but then we'd need
to also call width(0) if put(__c) succeeds, and that extra logic
defeats some of the optimization. Using width()==1 is rather pointless
when inserting single chars (as it is no different to width(0)), so I'm
not too worried about not optimizing that case.

> Either it is worthwhile or not, I would say that if you need it and
> eventually implement it then do not hesitate to submit it here !

I think this is simple enough that it's worth doing. It improves the
default case where there's no padding, and shouldn't add much overhead
when there is padding.

If it turns out that an observable call to xsputn is not required for
string views and strings, we can consider something like Lewis's patch
as well (which would help when writing many strings where most of them
fit in the buffer).

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: ostream::operator<<() and sputn() 2022-01-10 16:07 ` Jonathan Wakely @ 2022-01-10 23:18 ` Lewis Hyatt 2022-01-11 1:09 ` Jonathan Wakely 0 siblings, 1 reply; 8+ messages in thread From: Lewis Hyatt @ 2022-01-10 23:18 UTC (permalink / raw) To: Jonathan Wakely; +Cc: François Dumont, libstdc++, Dietmar Kuehl On Mon, Jan 10, 2022 at 11:07 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote: > > On Thu, 15 Jul 2021 at 18:11, François Dumont wrote: > > > > On 14/07/21 11:45 pm, Lewis Hyatt via Libstdc++ wrote: > > > On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely<jwakely.gcc@gmail.com> wrote: > > >> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++ > > >> <libstdc++@gcc.gnu.org> wrote: > > >>> Hello- > > >>> > > >>> I noticed that libstdc++'s implementation of ostream::operator<<() prefers > > >>> to call sputn() on the underlying streambuf for all char, char*, and string > > >>> output operations, including single characters, rather than manipulate the > > >>> buffer directly. I am curious why it works this way, it feels perhaps > > >>> suboptimal to me because sputn() is mandated to call the virtual function > > >>> xsputn() on every call, while e.g. sputc() simply manipulates the buffer and > > >>> only needs a virtual call when the buffer is full. I always thought that the > > >>> buffer abstraction and the resulting avoidance of virtual calls for the > > >>> majority of operations was the main point of streambuf's design, and that > > >>> sputn() was meant for cases when the output would be large enough to > > >>> overflow the buffer anyway, if it may be possible to skip the buffer and > > >>> flush directly instead? > > >>> > > >>> It seems to me that for most typical use cases, xsputn() is still going to > > >>> want to use the buffer if the output fits into it; libstdc++ does this in > > >>> basic_filebuf, for example. 
So then it would seem to be beneficial to try > > >>> the buffer prior to making the virtual function call, instead of after -- > > >>> especially because the typical char instantiation of __ostream_insert that > > >>> makes this call for operator<<() is hidden inside the .so, and is not > > >>> inlined or eligible for devirtualization optimizations. > > >>> > > >>> FWIW, here is a small test case. > > >>> > > >>> --------- > > >>> #include <ostream> > > >>> #include <iostream> > > >>> #include <fstream> > > >>> #include <sstream> > > >>> #include <chrono> > > >>> #include <random> > > >>> using namespace std; > > >>> > > >>> int main() { > > >>> constexpr size_t N = 500000000; > > >>> string s(N, 'x'); > > >>> > > >>> ofstream of{"/dev/null"}; > > >>> ostringstream os; > > >>> ostream* streams[] = {&of, &os}; > > >>> mt19937 rng{random_device{}()}; > > >>> > > >>> const auto timed_run = [&](const char* label, auto&& callback) { > > >>> const auto t1 = chrono::steady_clock::now(); > > >>> for(char c: s) callback(*streams[rng() % 2], c); > > >>> const auto t2 = chrono::steady_clock::now(); > > >>> cout << label << " took: " > > >>> << chrono::duration<double>(t2-t1).count() > > >>> << " seconds" << endl; > > >>> }; > > >>> > > >>> timed_run("insert with put()", [](ostream& o, char c) {o.put(c);}); > > >>> timed_run("insert with op<< ", [](ostream& o, char c) {o << c;}); > > >>> } > > >>> --------- > > >>> > > >>> This is what I get with the current trunk: > > >>> --------- > > >>> insert with put() took: 6.12152 seconds > > >>> insert with op<< took: 13.4437 seconds > > >>> --------- > > >>> > > >>> And this is what I get with the attached patch: > > >>> --------- > > >>> insert with put() took: 6.08313 seconds > > >>> insert with op<< took: 8.24565 seconds > > >>> --------- > > >>> > > >>> So the overhead of calling operator<< vs calling put() was reduced by more > > >>> than 3X. 
> > >>> > > >>> The prototype patch calls an internal alternate to sputn(), which tries the > > >>> buffer prior to calling xsputn(). > > >> This won't work if a user provides an explicit specialization of > > >> basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits> > > >> will still try to call your new function, but it won't be present in > > >> the user's specialization, so will fail to compile. The basic_ostream > > >> primary template can only use the standard API of basic_streambuf. The > > >> std::basic_ostream<char> specialization can use non-standard members > > >> of std::basic_streambuf<char> because we know users can't specialize > > >> that. > > > Thanks, makes sense, this was more just a quick proof of concept. I > > > guess a real version could work around this, well it could be > > > implemented purely in terms of sputc() too. Am curious if you think > > > the overall idea is worthwhile though? Partly I am trying to > > > understand it better, like it was a bit surprising to me that the > > > standard says that sputn() *must* call xsputn(). Feels like calling > > > it, only if a call to overflow() would otherwise be necessary, makes > > > more sense to me... > > > > > > > > > -Lewis > > > . > > > > I think that the issue you spotted can be summarize by the > > implementation of operator<< in <ostream>: > > > > template<typename _CharT, typename _Traits> > > inline basic_ostream<_CharT, _Traits>& > > operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c) > > { return __ostream_insert(__out, &__c, 1); } > > > > To output a single _CharT is treated as to output a C string. > > > > If you add the plumbing to have a __ostream_insert(__out, __c) then > > buffering should take place normally as it will end-up into a call to sputc. > > Indeed. 
> This improves the performance of Lewis's testcase so there is
> almost no overhead compared to using ostream::put(c) directly:
>
> --- a/libstdc++-v3/include/std/ostream
> +++ b/libstdc++-v3/include/std/ostream
> @@ -505,7 +505,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>    template<typename _CharT, typename _Traits>
>      inline basic_ostream<_CharT, _Traits>&
>      operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
> -    { return __ostream_insert(__out, &__c, 1); }
> +    {
> +      if (__out.width() != 0)
> +        return __ostream_insert(__out, &__c, 1);
> +      __out.put(__c);
> +      return __out;
> +    }
>
>    template<typename _CharT, typename _Traits>
>      inline basic_ostream<_CharT, _Traits>&
> @@ -516,7 +521,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>    template<typename _Traits>
>      inline basic_ostream<char, _Traits>&
>      operator<<(basic_ostream<char, _Traits>& __out, char __c)
> -    { return __ostream_insert(__out, &__c, 1); }
> +    {
> +      if (__out.width() != 0)
> +        return __ostream_insert(__out, &__c, 1);
> +      __out.put(__c);
> +      return __out;
> +    }
>
>    // Signed and unsigned
>    template<typename _Traits>
>
> I think the compiler will optimize this more aggressively than if the
> new code is inside __ostream_write or __ostream_insert.
>
> We could also use put(__c) when __out.width() == 1 but then we'd need
> to also call width(0) if put(__c) succeeds, and that extra logic
> defeats some of the optimization. Using width()==1 is rather pointless
> when inserting single chars (as it is no different to width(0)) so I'm
> not too worried about not optimizing that case.
>
> > Either it is worthwhile or not, I would say that if you need it and
> > eventually implement it then do not hesitate to submit it here!
>
> I think this is simple enough that it's worth doing. It improves the
> default case where there's no padding, and shouldn't add much overhead
> when there is padding.
>
> If it turns out that an observable call to xsputn is not
> required for string views and strings, we can consider something like
> Lewis's patch as well (which would help when writing many strings
> where most of them fit in the buffer).

Thanks a lot for circling back to this. I feel like your change is a
pure improvement with little downside, so it would be great to have it.
Please let me know if I should do anything more there.

To me, the whole situation with sputn() is just unfortunate, because I
think it should not have been specified to always make the virtual call
to xsputn(), but rather to make it only when needed. Then everything
would work as well as possible. Even after this change to the
single-char overload, a similar unexpected performance gap will affect
something like cout << "a" compared to cout << 'a': the former will
make a virtual call every time and the latter won't. libstdc++ could
handle the char* overload case as well, but it still would not be
permitted to handle a string or string_view, so at some level the
standard's specification is the constraint. It seems unlikely to
change, though, so I think avoiding sputn() wherever permitted will
always be an avenue for improving performance.

-Lewis

^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: ostream::operator<<() and sputn()
  2022-01-10 23:18 ` Lewis Hyatt
@ 2022-01-11  1:09   ` Jonathan Wakely
  0 siblings, 0 replies; 8+ messages in thread
From: Jonathan Wakely @ 2022-01-11  1:09 UTC (permalink / raw)
To: Lewis Hyatt; +Cc: François Dumont, libstdc++, Dietmar Kuehl

On Mon, 10 Jan 2022 at 23:18, Lewis Hyatt wrote:
> Thanks a lot for circling back to this. I feel like your change is a
> pure improvement with little downside, so it would be great to have
> it. Let me know please if I should do anything more there.

Leave it with me.

> To me, the whole situation with sputn() is just unfortunate, because I
> think it should not have been specified to always make the virtual
> call to xsputn(), but rather only make it if needed. [...]

I think the standard is overspecified, and changing it makes sense. It
will take some time though.
* Re: ostream::operator<<() and sputn()
  2021-07-14 21:30 ` Jonathan Wakely
  2021-07-14 21:45   ` Lewis Hyatt
@ 2021-07-14 21:54   ` Dietmar Kühl
  1 sibling, 0 replies; 8+ messages in thread
From: Dietmar Kühl @ 2021-07-14 21:54 UTC (permalink / raw)
To: Jonathan Wakely; +Cc: Lewis Hyatt, libstdc++

Hello,

Of course, you could detect whether the user did specialize
basic_streambuf: the default implementation can use a special member
name, let's say, __not_specialized, which the user cannot write in a
portable program and which, hence, cannot appear in the user's
specialization. Detecting that could opt into an enhanced code path.
I'm not recommending such an implementation, though.

When I implemented my own version of IOStreams I used a few special
approaches, notably a segmented-iterator optimization for
std::[io]streambuf_iterator with somewhat more direct access to the
streambuf's buffer. For actually buffered streams these things can
yield a substantial performance boost, though there is some penalty for
non-buffered (well, single-character-buffered) stream buffers. Making
changes to the streams while maintaining ABI compatibility is probably
not quite worth the trouble.

Thanks,
Dietmar

> On 14 Jul 2021, at 22:31, Jonathan Wakely via Libstdc++ <libstdc++@gcc.gnu.org> wrote:
>
> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
> <libstdc++@gcc.gnu.org> wrote:
>>
>> Hello-
>>
>> I noticed that libstdc++'s implementation of ostream::operator<<() prefers
>> to call sputn() on the underlying streambuf for all char, char*, and string
>> output operations, including single characters, rather than manipulate the
>> buffer directly. I am curious why it works this way, it feels perhaps
>> suboptimal to me because sputn() is mandated to call the virtual function
>> xsputn() on every call, while e.g. sputc() simply manipulates the buffer and
>> only needs a virtual call when the buffer is full.
>> [... rest of the original message and Jonathan's reply quoted in full,
>> snipped: both appear verbatim earlier in the thread ...]
end of thread, other threads:[~2022-01-11  1:09 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-14 21:26 ostream::operator<<() and sputn() Lewis Hyatt
2021-07-14 21:30 ` Jonathan Wakely
2021-07-14 21:45   ` Lewis Hyatt
2021-07-15 17:11     ` François Dumont
2022-01-10 16:07       ` Jonathan Wakely
2022-01-10 23:18         ` Lewis Hyatt
2022-01-11  1:09           ` Jonathan Wakely
2021-07-14 21:54 ` Dietmar Kühl