From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf1-x135.google.com (mail-lf1-x135.google.com [IPv6:2a00:1450:4864:20::135]) by sourceware.org (Postfix) with ESMTPS id 3D0193858C3A for ; Mon, 10 Jan 2022 23:18:54 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3D0193858C3A Received: by mail-lf1-x135.google.com with SMTP id x7so49990802lfu.8 for ; Mon, 10 Jan 2022 15:18:54 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=LqjehdydkEyXWvUZ93NNMMkIYTJJzxUyO1307O0p3Bs=; b=biT5Xc3v1RuazeSYYUfh3Ic/MEnaVfeRzPyyIRszhW3IZGxPVJnJvvKUkLupHjcdXO c30LbLQWNwo3d0TSJcqaj2wWPIOaDffqMhOitwTxp2p4kToMhBlQoQ3TYCgWvr6eu5BU 1vo9JCLnOnNvN8P+OuWJyZg2Ms0C/jPXudNjU8suJO1i+XqToV2XA/imzdIYxJhzgfCk ep266TvlWOhvJcw1eWFNv1FTl1QfEgQcrYVPRF7XMZDk19eFe+fm7YKrW0jOKwmhQVYz JThg7OqVSSbiMqrfYU8y/poXT3emGT3bk/yZZRdl/UzGMgWryXh72X0sav8OukHiSWyH elzA== X-Gm-Message-State: AOAM532opjM5s0TC+S50epAZoY1sGb+kK6nceuw6jkakDBUhv5pla+Pb W6l40eAXb1gYJ8lEFLeNq2i2GytOqYATVbfG8xo= X-Google-Smtp-Source: ABdhPJxpL6lsDJdlLZFRGRsqgdA4w3KReOWGlAdA16lp0OM7iKGFNm5XJEtOtWiNxMLm89z3wy2DdAt/alh80ZLBWHk= X-Received: by 2002:a2e:9acf:: with SMTP id p15mr1168947ljj.477.1641856732776; Mon, 10 Jan 2022 15:18:52 -0800 (PST) MIME-Version: 1.0 References: <20210714212609.GA78610@ldh-imac.local> <1ee79d3c-a373-cb2a-f975-e62d182f1882@gmail.com> In-Reply-To: From: Lewis Hyatt Date: Mon, 10 Jan 2022 18:18:40 -0500 Message-ID: Subject: Re: ostream::operator<<() and sputn() To: Jonathan Wakely Cc: =?UTF-8?Q?Fran=C3=A7ois_Dumont?= , "libstdc++" , Dietmar Kuehl Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-3030.0 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libstdc++@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libstdc++ mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Jan 2022 23:18:56 -0000 On Mon, Jan 10, 2022 at 11:07 AM Jonathan Wakely wr= ote: > > On Thu, 15 Jul 2021 at 18:11, Fran=C3=A7ois Dumont wrote: > > > > On 14/07/21 11:45 pm, Lewis Hyatt via Libstdc++ wrote: > > > On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely wrote: > > >> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++ > > >> wrote: > > >>> Hello- > > >>> > > >>> I noticed that libstdc++'s implementation of ostream::operator<<() = prefers > > >>> to call sputn() on the underlying streambuf for all char, char*, an= d string > > >>> output operations, including single characters, rather than manipul= ate the > > >>> buffer directly. I am curious why it works this way, it feels perha= ps > > >>> suboptimal to me because sputn() is mandated to call the virtual fu= nction > > >>> xsputn() on every call, while e.g. sputc() simply manipulates the b= uffer and > > >>> only needs a virtual call when the buffer is full. I always thought= that the > > >>> buffer abstraction and the resulting avoidance of virtual calls for= the > > >>> majority of operations was the main point of streambuf's design, an= d that > > >>> sputn() was meant for cases when the output would be large enough t= o > > >>> overflow the buffer anyway, if it may be possible to skip the buffe= r and > > >>> flush directly instead? > > >>> > > >>> It seems to me that for most typical use cases, xsputn() is still g= oing to > > >>> want to use the buffer if the output fits into it; libstdc++ does t= his in > > >>> basic_filebuf, for example. So then it would seem to be beneficial = to try > > >>> the buffer prior to making the virtual function call, instead of af= ter -- > > >>> especially because the typical char instantiation of __ostream_inse= rt that > > >>> makes this call for operator<<() is hidden inside the .so, and is n= ot > > >>> inlined or eligible for devirtualization optimizations. > > >>> > > >>> FWIW, here is a small test case. > > >>> > > >>> --------- > > >>> #include > > >>> #include > > >>> #include > > >>> #include > > >>> #include > > >>> #include > > >>> using namespace std; > > >>> > > >>> int main() { > > >>> constexpr size_t N =3D 500000000; > > >>> string s(N, 'x'); > > >>> > > >>> ofstream of{"/dev/null"}; > > >>> ostringstream os; > > >>> ostream* streams[] =3D {&of, &os}; > > >>> mt19937 rng{random_device{}()}; > > >>> > > >>> const auto timed_run =3D [&](const char* label, auto&& callbac= k) { > > >>> const auto t1 =3D chrono::steady_clock::now(); > > >>> for(char c: s) callback(*streams[rng() % 2], c); > > >>> const auto t2 =3D chrono::steady_clock::now(); > > >>> cout << label << " took: " > > >>> << chrono::duration(t2-t1).count() > > >>> << " seconds" << endl; > > >>> }; > > >>> > > >>> timed_run("insert with put()", [](ostream& o, char c) {o.put(c= );}); > > >>> timed_run("insert with op<< ", [](ostream& o, char c) {o << c;= }); > > >>> } > > >>> --------- > > >>> > > >>> This is what I get with the current trunk: > > >>> --------- > > >>> insert with put() took: 6.12152 seconds > > >>> insert with op<< took: 13.4437 seconds > > >>> --------- > > >>> > > >>> And this is what I get with the attached patch: > > >>> --------- > > >>> insert with put() took: 6.08313 seconds > > >>> insert with op<< took: 8.24565 seconds > > >>> --------- > > >>> > > >>> So the overhead of calling operator<< vs calling put() was reduced = by more > > >>> than 3X. > > >>> > > >>> The prototype patch calls an internal alternate to sputn(), which t= ries the > > >>> buffer prior to calling xsputn(). > > >> This won't work if a user provides an explicit specialization of > > >> basic_streambuf. std::basic_ostream > > >> will still try to call your new function, but it won't be present in > > >> the user's specialization, so will fail to compile. The basic_ostrea= m > > >> primary template can only use the standard API of basic_streambuf. T= he > > >> std::basic_ostream specialization can use non-standard members > > >> of std::basic_streambuf because we know users can't specialize > > >> that. > > > Thanks, makes sense, this was more just a quick proof of concept. I > > > guess a real version could work around this, well it could be > > > implemented purely in terms of sputc() too. Am curious if you think > > > the overall idea is worthwhile though? Partly I am trying to > > > understand it better, like it was a bit surprising to me that the > > > standard says that sputn() *must* call xsputn(). Feels like calling > > > it, only if a call to overflow() would otherwise be necessary, makes > > > more sense to me... > > > > > > > > > -Lewis > > > . > > > > I think that the issue you spotted can be summarize by the > > implementation of operator<< in : > > > > template > > inline basic_ostream<_CharT, _Traits>& > > operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c) > > { return __ostream_insert(__out, &__c, 1); } > > > > To output a single _CharT is treated as to output a C string. > > > > If you add the plumbing to have a __ostream_insert(__out, __c) then > > buffering should take place normally as it will end-up into a call to s= putc. > > Indeed. This improves the performance of Lewis's testcase so there is > almost no overhead compared to using ostream::put(c) directly: > > --- a/libstdc++-v3/include/std/ostream > +++ b/libstdc++-v3/include/std/ostream > @@ -505,7 +505,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > template > inline basic_ostream<_CharT, _Traits>& > operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c) > - { return __ostream_insert(__out, &__c, 1); } > + { > + if (__out.width() !=3D 0) > + return __ostream_insert(__out, &__c, 1); > + __out.put(__c); > + return __out; > + } > > template > inline basic_ostream<_CharT, _Traits>& > @@ -516,7 +521,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > template > inline basic_ostream& > operator<<(basic_ostream& __out, char __c) > - { return __ostream_insert(__out, &__c, 1); } > + { > + if (__out.width() !=3D 0) > + return __ostream_insert(__out, &__c, 1); > + __out.put(__c); > + return __out; > + } > > // Signed and unsigned > template > > I think the compiler will optimize this more aggressively than if the > new code is inside __ostream_write or __ostream_insert. > > We could also use put(__c) when __out.width() =3D=3D 1 but then we'd need > to also call width(0) if put(__c) succeeds, and that extra logic > defeats some of the optimization. Using width()=3D=3D1 is rather pointles= s > when inserting single chars (as it is no different to width(0)) so I'm > not too worried about not optimizing that case. > > > Either it is worthwhile or not, I would say that if you need it and > > eventually implement it then do not hesitate to submit it here ! > > I think this is simple enough that it's worth doing. It improves the > default case where there's no padding, and shouldn't add much overhead > when there is padding. > > If it turns out that requiring an observable call to xsputn is not > required for string views and strings, we can consider something like > Lewis's patch as well (which would help when writing many strings > where most of them fit in the buffer). Thanks a lot for circling back to this. I feel like your change is a pure improvement with little downside, so it would be great to have it. Let me know please if I should do anything more there. To me, the whole situation with sputn() is just unfortunate, because I think it should not have been specified to always make the virtual call to xsputn(), but rather only make it if needed. Then everything would work as well as possible. Like after this change to the single-char overload, a similar unexpected performance issue will affect something like: cout << "a" compared to cout << 'a', where now the former will make a virtual call every time and the latter won't. libstdc++ could handle the char* overload case as well, but then it still would not be permitted to handle a string or string_view, so at some level the standard specification is the constraint... Seems unlikely it will change though, so I think it means that avoiding sputn() wherever permitted will always be an avenue towards improving performance. -Lewis -Lewis