Hello! I noticed that libstdc++'s implementation of ostream::operator<<() prefers to call sputn() on the underlying streambuf for all char, char*, and string output operations, including single characters, rather than manipulating the buffer directly. I am curious why it works this way; it seems suboptimal to me, because sputn() is mandated to call the virtual function xsputn() on every call, while e.g. sputc() simply manipulates the buffer and only needs a virtual call when the buffer is full. I always thought that the buffer abstraction, and the resulting avoidance of virtual calls for the majority of operations, was the main point of streambuf's design, and that sputn() was meant for cases where the output might be large enough to overflow the buffer anyway, in which case it may be possible to skip the buffer and write directly instead.

It seems to me that for most typical use cases, xsputn() is still going to want to use the buffer if the output fits into it; libstdc++ does this in basic_filebuf, for example. So it would seem beneficial to try the buffer before making the virtual function call, instead of after -- especially because the typical char instantiation of __ostream_insert that makes this call for operator<<() is hidden inside the .so, and is not inlined or eligible for devirtualization optimizations.

FWIW, here is a small test case:

---------
#include <iostream>
#include <fstream>
#include <sstream>
#include <chrono>
#include <random>
#include <string>

using namespace std;

int main()
{
  constexpr size_t N = 500000000;
  string s(N, 'x');

  ofstream of{"/dev/null"};
  ostringstream os;
  ostream* streams[] = {&of, &os};

  mt19937 rng{random_device{}()};

  const auto timed_run = [&](const char* label, auto&& callback)
  {
    const auto t1 = chrono::steady_clock::now();
    for (char c : s)
      callback(*streams[rng() % 2], c);
    const auto t2 = chrono::steady_clock::now();
    cout << label << " took: "
         << chrono::duration<double>(t2 - t1).count()
         << " seconds" << endl;
  };

  timed_run("insert with put()", [](ostream& o, char c) { o.put(c); });
  timed_run("insert with op<< ", [](ostream& o, char c) { o << c; });
}
---------

This is what I get with the current trunk:

---------
insert with put() took: 6.12152 seconds
insert with op<< took: 13.4437 seconds
---------

And this is what I get with the attached patch:

---------
insert with put() took: 6.08313 seconds
insert with op<< took: 8.24565 seconds
---------

So the overhead of calling operator<< vs calling put() was reduced by more than 3x (from roughly 7.3 extra seconds down to roughly 2.2).

The prototype patch calls an internal alternative to sputn(), which tries the buffer before calling xsputn(). All the tests still pass with this patch in place, but I think it would not be suitable as-is, because it also affects the overload for std::string_view, and I think the standard may actually mandate that xsputn() be called in that case. (It says the string should be output as if sputn() were called, which I believe makes the call to xsputn() an observable side effect.) Still, it should work fine outside the string case AFAIK. I thought it was an interesting result worth sharing, and I am curious what others think here, or whether there is a reason I am missing why the current way is better.

Thanks!
-Lewis
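
P.S. To make the idea concrete, here is a rough sketch of the shape of the fast path I mean. This is not the attached patch: the real change would live inside basic_streambuf itself, so the member name and the derived-class framing here are invented purely for illustration.

---------
#include <streambuf>

// Hypothetical sketch only: a sputn() alternative that tries the put
// area before making the virtual call. Written as a member of a
// derived class because pptr()/epptr()/pbump() are protected.
struct sketch_streambuf : std::streambuf
{
  std::streamsize sputn_buffered(const char* s, std::streamsize n)
  {
    // If the output fits in the remaining put area, copy it in
    // directly and avoid virtual dispatch, mirroring what sputc()
    // does for a single character when the buffer is not full.
    if (n <= epptr() - pptr())
      {
        traits_type::copy(pptr(), s, n);
        pbump(static_cast<int>(n));
        return n;
      }
    // Otherwise fall back to the usual virtual xsputn() path.
    return xsputn(s, n);
  }
};
---------

For a single character this reduces to the same buffer check that put()/sputc() already perform, which is where the ~3x difference in the numbers above comes from.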