From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jwakely.gcc@gmail.com>
Received: from mail-wm1-x32f.google.com (mail-wm1-x32f.google.com
 [IPv6:2a00:1450:4864:20::32f])
 by sourceware.org (Postfix) with ESMTPS id 828343858427
 for <libstdc++@gcc.gnu.org>; Mon, 10 Jan 2022 16:07:15 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 828343858427
Received: by mail-wm1-x32f.google.com with SMTP id v123so9106379wme.2
 for <libstdc++@gcc.gnu.org>; Mon, 10 Jan 2022 08:07:15 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=twu360J+THW6JF0qOWPWhBukJRb3GG5ntD7CQgIMSHU=;
 b=aZvBsxSQ5ck4EDAAsUEJgU6+jwLmBs651aU5tl+RzlxPiHfzcrzbhvI7MgFPmJWQID
 Im2Lca3KfUzqwpb5zR2l0mVgRzJzZQuyr2kgJgQh/qtDFNkzPvvS31jbrQGQyPyqP0a2
 98uVJXXok8RSqS6tpGAWDC6hY0qeCMqzu048aLpi53kObTVvcmVGfNesjDzOef5zaXl7
 AmRLjCeop1oKi4Hf6rNdyG8N6TWls9M0Yqx0k0su0Q1aMlFZkTvtRqs+4sjUfJNdNXp8
 ZGno4CXQszTCr4GdmKsdjfSHaigm+EANrhHqzLbi4Xd68tnfqXxOFseqQkEoUj/rre1L
 jSIg==
X-Gm-Message-State: AOAM532wb6xAe3DbrwkjXx3RvIqOQCX6qFKwZ2RVamQrL4ENsPfSltvH
 aPE1p3rHw8WojsHJVLUUTLMYyU9vNMmzJ/tBMLM=
X-Google-Smtp-Source: ABdhPJxqwzfR40nWFwFfb9r9DHb+Zh6aBevzcIgH5MkB99FAI6pvOfMKukCTCZp5+ZYldcFTUvw+bPtR1FiAR8498jc=
X-Received: by 2002:a05:600c:384c:: with SMTP id
 s12mr148231wmr.108.1641830834567; 
 Mon, 10 Jan 2022 08:07:14 -0800 (PST)
MIME-Version: 1.0
References: <20210714212609.GA78610@ldh-imac.local>
 <CAH6eHdQ1X8DYrEmjSqxTqWWkd5Ev2a12+PhM7TEUi-fCmA3i3A@mail.gmail.com>
 <CAA_5UQ4D5zk8WFjLFtasWicbUhTCKUDUiyDKHzACpvyT7Lfgpw@mail.gmail.com>
 <1ee79d3c-a373-cb2a-f975-e62d182f1882@gmail.com>
In-Reply-To: <1ee79d3c-a373-cb2a-f975-e62d182f1882@gmail.com>
From: Jonathan Wakely <jwakely.gcc@gmail.com>
Date: Mon, 10 Jan 2022 16:07:02 +0000
Message-ID: <CAH6eHdSejdzFPVP+v_J8XWgd6AY9eoJ+x4Y4c5mC_bUcO=MJjg@mail.gmail.com>
Subject: Re: ostream::operator<<() and sputn()
To: =?UTF-8?Q?Fran=C3=A7ois_Dumont?= <frs.dumont@gmail.com>
Cc: Lewis Hyatt <lhyatt@gmail.com>, "libstdc++" <libstdc++@gcc.gnu.org>, 
 Dietmar Kuehl <dietmar_kuehl@yahoo.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-1.1 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libstdc++@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libstdc++ mailing list <libstdc++.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/libstdc++>,
 <mailto:libstdc++-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/libstdc++/>
List-Post: <mailto:libstdc++@gcc.gnu.org>
List-Help: <mailto:libstdc++-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/libstdc++>,
 <mailto:libstdc++-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 10 Jan 2022 16:07:19 -0000

On Thu, 15 Jul 2021 at 18:11, François Dumont wrote:
>
> On 14/07/21 11:45 pm, Lewis Hyatt via Libstdc++ wrote:
> > On Wed, Jul 14, 2021 at 5:31 PM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
> >> On Wed, 14 Jul 2021 at 22:26, Lewis Hyatt via Libstdc++
> >> <libstdc++@gcc.gnu.org>  wrote:
> >>> Hello-
> >>>
> >>> I noticed that libstdc++'s implementation of ostream::operator<<() prefers
> >>> to call sputn() on the underlying streambuf for all char, char*, and string
> >>> output operations, including single characters, rather than manipulate the
> >>> buffer directly. I am curious why it works this way, it feels perhaps
> >>> suboptimal to me because sputn() is mandated to call the virtual function
> >>> xsputn() on every call, while e.g. sputc() simply manipulates the buffer and
> >>> only needs a virtual call when the buffer is full. I always thought that the
> >>> buffer abstraction and the resulting avoidance of virtual calls for the
> >>> majority of operations was the main point of streambuf's design, and that
> >>> sputn() was meant for cases when the output would be large enough to
> >>> overflow the buffer anyway, if it may be possible to skip the buffer and
> >>> flush directly instead?
> >>>
> >>> It seems to me that for most typical use cases, xsputn() is still going to
> >>> want to use the buffer if the output fits into it; libstdc++ does this in
> >>> basic_filebuf, for example. So then it would seem to be beneficial to try
> >>> the buffer prior to making the virtual function call, instead of after --
> >>> especially because the typical char instantiation of __ostream_insert that
> >>> makes this call for operator<<() is hidden inside the .so, and is not
> >>> inlined or eligible for devirtualization optimizations.
> >>>
> >>> FWIW, here is a small test case.
> >>>
> >>> ---------
> >>> #include <ostream>
> >>> #include <iostream>
> >>> #include <fstream>
> >>> #include <sstream>
> >>> #include <chrono>
> >>> #include <random>
> >>> using namespace std;
> >>>
> >>> int main() {
> >>>      constexpr size_t N = 500000000;
> >>>      string s(N, 'x');
> >>>
> >>>      ofstream of{"/dev/null"};
> >>>      ostringstream os;
> >>>      ostream* streams[] = {&of, &os};
> >>>      mt19937 rng{random_device{}()};
> >>>
> >>>      const auto timed_run = [&](const char* label, auto&& callback) {
> >>>          const auto t1 = chrono::steady_clock::now();
> >>>          for(char c: s) callback(*streams[rng() % 2], c);
> >>>          const auto t2 = chrono::steady_clock::now();
> >>>          cout << label << " took: "
> >>>               << chrono::duration<double>(t2-t1).count()
> >>>               << " seconds" << endl;
> >>>      };
> >>>
> >>>      timed_run("insert with put()", [](ostream& o, char c) {o.put(c);});
> >>>      timed_run("insert with op<< ", [](ostream& o, char c) {o << c;});
> >>> }
> >>> ---------
> >>>
> >>> This is what I get with the current trunk:
> >>> ---------
> >>> insert with put() took: 6.12152 seconds
> >>> insert with op<<  took: 13.4437 seconds
> >>> ---------
> >>>
> >>> And this is what I get with the attached patch:
> >>> ---------
> >>> insert with put() took: 6.08313 seconds
> >>> insert with op<<  took: 8.24565 seconds
> >>> ---------
> >>>
> >>> So the overhead of calling operator<< vs calling put() was reduced by more
> >>> than 3X.
> >>>
> >>> The prototype patch calls an internal alternate to sputn(), which tries the
> >>> buffer prior to calling xsputn().
> >> This won't work if a user provides an explicit specialization of
> >> basic_streambuf<char, MyTraits>. std::basic_ostream<char, MyTraits>
> >> will still try to call your new function, but it won't be present in
> >> the user's specialization, so will fail to compile. The basic_ostream
> >> primary template can only use the standard API of basic_streambuf. The
> >> std::basic_ostream<char> specialization can use non-standard members
> >> of std::basic_streambuf<char> because we know users can't specialize
> >> that.
> > Thanks, makes sense, this was more just a quick proof of concept. I
> > guess a real version could work around this, well it could be
> > implemented purely in terms of sputc() too. Am curious if you think
> > the overall idea is worthwhile though? Partly I am trying to
> > understand it better, like it was a bit surprising to me that the
> > standard says that sputn() *must* call xsputn(). Feels like calling
> > it, only if a call to overflow() would otherwise be necessary, makes
> > more sense to me...
> >
> >
> > -Lewis
> > .
>
> I think that the issue you spotted can be summarized by the
> implementation of operator<< in <ostream>:
>
>    template<typename _CharT, typename _Traits>
>      inline basic_ostream<_CharT, _Traits>&
>      operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
>      { return __ostream_insert(__out, &__c, 1); }
>
> Outputting a single _CharT is treated the same as outputting a C string.
>
> If you add the plumbing to have a __ostream_insert(__out, __c) then
> buffering should take place normally, as it will end up in a call to sputc.

Indeed. This improves the performance of Lewis's testcase so there is
almost no overhead compared to using ostream::put(c) directly:

--- a/libstdc++-v3/include/std/ostream
+++ b/libstdc++-v3/include/std/ostream
@@ -505,7 +505,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  template<typename _CharT, typename _Traits>
    inline basic_ostream<_CharT, _Traits>&
    operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
-    { return __ostream_insert(__out, &__c, 1); }
+    {
+      if (__out.width() != 0)
+       return __ostream_insert(__out, &__c, 1);
+      __out.put(__c);
+      return __out;
+    }

  template<typename _CharT, typename _Traits>
    inline basic_ostream<_CharT, _Traits>&
@@ -516,7 +521,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  template<typename _Traits>
    inline basic_ostream<char, _Traits>&
    operator<<(basic_ostream<char, _Traits>& __out, char __c)
-    { return __ostream_insert(__out, &__c, 1); }
+    {
+      if (__out.width() != 0)
+       return __ostream_insert(__out, &__c, 1);
+      __out.put(__c);
+      return __out;
+    }

  // Signed and unsigned
  template<typename _Traits>

I think the compiler will optimize this more aggressively than if the
new code is inside __ostream_write or __ostream_insert.

We could also use put(__c) when __out.width() == 1 but then we'd need
to also call width(0) if put(__c) succeeds, and that extra logic
defeats some of the optimization. Using width()==1 is rather pointless
when inserting single chars (as it is no different to width(0)) so I'm
not too worried about not optimizing that case.
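
To illustrate why the width check matters, here is a small sketch (not
library code; the helper names formatted_insert and unformatted_put are
made up for this example). It assumes only standard behaviour: formatted
insertion of a char pads to width() and then resets it to 0, while put()
is unformatted, so it ignores width() and leaves it unchanged.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: formatted insertion with operator<<.
// Pads the character to width w, then the stream resets width() to 0.
std::string formatted_insert(std::streamsize w, char c,
                             std::streamsize& width_after)
{
    std::ostringstream os;
    os.width(w);
    os << c;
    width_after = os.width();
    return os.str();
}

// Hypothetical helper: unformatted output with put().
// Ignores width() entirely and does not reset it.
std::string unformatted_put(std::streamsize w, char c,
                            std::streamsize& width_after)
{
    std::ostringstream os;
    os.width(w);
    os.put(c);
    width_after = os.width();
    return os.str();
}
```

With a width of 0 or 1 both helpers produce the same single character,
which is why only a non-zero width needs the slower path; the one
remaining difference at width 1 is that put() leaves width() set.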

> Either it is worthwhile or not, I would say that if you need it and
> eventually implement it then do not hesitate to submit it here !

I think this is simple enough that it's worth doing. It improves the
default case where there's no padding, and shouldn't add much overhead
when there is padding.

If it turns out that an observable call to xsputn is not
required for string views and strings, we can consider something like
Lewis's patch as well (which would help when writing many strings
where most of them fit in the buffer).
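
For anyone who wants to see what "observable" means here, a minimal
sketch follows. It is not libstdc++ code: counting_buf, call_counts,
write_with_sputc and write_with_sputn are hypothetical names, and the
16-byte put area is an arbitrary choice. It counts how often the two
virtual functions are entered when the same characters are written one
at a time through sputc() and through sputn().

```cpp
#include <cassert>
#include <cstddef>
#include <streambuf>
#include <string>

// Hypothetical streambuf for illustration: a small put area plus
// counters for the two virtual functions under discussion.
class counting_buf : public std::streambuf
{
public:
    counting_buf() { setp(buf_, buf_ + sizeof buf_); }

    std::size_t xsputn_calls = 0;
    std::size_t overflow_calls = 0;
    std::string sink;  // receives data flushed from the put area

protected:
    std::streamsize xsputn(const char* s, std::streamsize n) override
    {
        ++xsputn_calls;                       // entered on every sputn()
        return std::streambuf::xsputn(s, n);  // default: go via the put area
    }

    int_type overflow(int_type c) override
    {
        ++overflow_calls;                     // entered only when the area is full
        sink.append(pbase(), static_cast<std::size_t>(pptr() - pbase()));
        setp(buf_, buf_ + sizeof buf_);       // empty the put area
        if (!traits_type::eq_int_type(c, traits_type::eof()))
        {
            *pptr() = traits_type::to_char_type(c);
            pbump(1);
        }
        return traits_type::not_eof(c);
    }

private:
    char buf_[16];
};

struct call_counts { std::size_t xsputn; std::size_t overflow; };

// Write n characters one at a time with sputc().
call_counts write_with_sputc(std::size_t n)
{
    counting_buf b;
    for (std::size_t i = 0; i < n; ++i)
        b.sputc('x');
    return {b.xsputn_calls, b.overflow_calls};
}

// Write n characters one at a time with sputn(&c, 1).
call_counts write_with_sputn(std::size_t n)
{
    counting_buf b;
    const char c = 'x';
    for (std::size_t i = 0; i < n; ++i)
        b.sputn(&c, 1);
    return {b.xsputn_calls, b.overflow_calls};
}
```

Writing 64 characters with sputc() enters overflow() only when the
16-byte area is full and never enters xsputn(), whereas sputn() enters
xsputn() once per call. A user-provided streambuf can observe that
difference, which is the constraint on changing the string case.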