public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH v2] libstdc++: basic_filebuf: don't flush more often than necessary.
@ 2022-10-06 19:02 Charles-Francois Natali
  2022-10-21 10:08 ` [PING 3] " Charles-François Natali
  2022-11-07 17:00 ` Jonathan Wakely
  0 siblings, 2 replies; 4+ messages in thread
From: Charles-Francois Natali @ 2022-10-06 19:02 UTC (permalink / raw)
  To: libstdc++, gcc-patches; +Cc: Charles-Francois Natali

`basic_filebuf::xsputn` would bypass the buffer when passed a chunk of
size 1024 and above, seemingly as an optimisation.

This can have a significant performance impact if the overhead of a
`write` syscall is non-negligible, e.g. on a slow disk, on network
filesystems, or simply during IO contention because instead of flushing
every `BUFSIZ` (by default), we can flush every 1024 char.
The impact is even greater with custom larger buffers, e.g. for network
filesystems, because the code could issue `write` for example 1000X more
often than necessary with respect to the buffer size.
It also introduces a significant discontinuity in performance when
writing chunks of size 1024 and above.

See this reproducer which writes down a fixed number of chunks to a file
open with `O_SYNC` - to replicate high-latency `write` - for varying
size of chunks:

```
$ cat test_fstream_flush.cpp

int
main(int argc, char* argv[])
{
  assert(argc == 3);

  const auto* path = argv[1];
  const auto chunk_size = std::stoul(argv[2]);

  const auto fd =
    open(path, O_CREAT | O_TRUNC | O_WRONLY | O_SYNC | O_CLOEXEC, 0666);
  assert(fd >= 0);

  auto filebuf = __gnu_cxx::stdio_filebuf<char>(fd, std::ios_base::out);
  auto stream = std::ostream(&filebuf);

  const auto chunk = std::vector<char>(chunk_size);

  for (auto i = 0; i < 1'000; ++i) {
    stream.write(chunk.data(), chunk.size());
  }

  return 0;
}
```

```
$ g++ -o /tmp/test_fstream_flush test_fstream_flush.cpp -std=c++17
$ for i in $(seq 1021 1025); do echo -e "\n$i"; time /tmp/test_fstream_flush /tmp/foo $i; done

1021

real    0m0.997s
user    0m0.000s
sys     0m0.038s

1022

real    0m0.939s
user    0m0.005s
sys     0m0.032s

1023

real    0m0.954s
user    0m0.005s
sys     0m0.034s

1024

real    0m7.102s
user    0m0.040s
sys     0m0.192s

1025

real    0m7.204s
user    0m0.025s
sys     0m0.209s
```

See the huge drop in performance at the 1024-boundary.

An `strace` confirms that from size 1024 we effectively defeat
buffering:
1023-sized writes
```
$ strace -P /tmp/foo -e openat,write,writev /tmp/test_fstream_flush /tmp/foo 1023 2>&1 | head -n5
openat(AT_FDCWD, "/tmp/foo", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC, 0666) = 3
writev(3, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8184}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1023}], 2) = 9207
writev(3, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8184}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1023}], 2) = 9207
writev(3, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8184}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1023}], 2) = 9207
writev(3, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8184}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1023}], 2) = 9207
```

vs 1024-sized writes
```
$ strace -P /tmp/foo -e openat,write,writev /tmp/test_fstream_flush /tmp/foo 1024 2>&1 | head -n5
openat(AT_FDCWD, "/tmp/foo", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC, 0666) = 3
writev(3, [{iov_base=NULL, iov_len=0}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}], 2) = 1024
writev(3, [{iov_base="", iov_len=0}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}], 2) = 1024
writev(3, [{iov_base="", iov_len=0}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}], 2) = 1024
writev(3, [{iov_base="", iov_len=0}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}], 2) = 1024
```

Instead, it makes sense to only bypass the buffer if the amount of data
to be written is larger than the buffer capacity.

Closes https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63746

Signed-off-by: Charles-Francois Natali <cf.natali@gmail.com>
---
 libstdc++-v3/include/bits/fstream.tcc         |  9 ++---
 .../27_io/basic_filebuf/sputn/char/63746.cc   | 38 +++++++++++++++++++
 2 files changed, 41 insertions(+), 6 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc

diff --git a/libstdc++-v3/include/bits/fstream.tcc b/libstdc++-v3/include/bits/fstream.tcc
index 7ccc887b8..2e9369628 100644
--- a/libstdc++-v3/include/bits/fstream.tcc
+++ b/libstdc++-v3/include/bits/fstream.tcc
@@ -757,23 +757,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     {
       streamsize __ret = 0;
       // Optimization in the always_noconv() case, to be generalized in the
-      // future: when __n is sufficiently large we write directly instead of
-      // using the buffer.
+      // future: when __n is larger than the available capacity we write
+      // directly instead of using the buffer.
       const bool __testout = (_M_mode & ios_base::out
 			      || _M_mode & ios_base::app);
       if (__check_facet(_M_codecvt).always_noconv()
 	  && __testout && !_M_reading)
 	{
-	  // Measurement would reveal the best choice.
-	  const streamsize __chunk = 1ul << 10;
 	  streamsize __bufavail = this->epptr() - this->pptr();
 
 	  // Don't mistake 'uncommitted' mode buffered with unbuffered.
 	  if (!_M_writing && _M_buf_size > 1)
 	    __bufavail = _M_buf_size - 1;
 
-	  const streamsize __limit = std::min(__chunk, __bufavail);
-	  if (__n >= __limit)
+	  if (__n >= __bufavail)
 	    {
 	      const streamsize __buffill = this->pptr() - this->pbase();
 	      const char* __buf = reinterpret_cast<const char*>(this->pbase());
diff --git a/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc b/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
new file mode 100644
index 000000000..baab93407
--- /dev/null
+++ b/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
@@ -0,0 +1,38 @@
+// { dg-require-fileio "" }
+
+#include <fstream>
+#include <testsuite_hooks.h>
+
+class testbuf : public std::filebuf {
+public:
+  char_type* pub_pprt() const
+  {
+    return this->pptr();
+  }
+
+  char_type* pub_pbase() const
+  {
+    return this->pbase();
+  }
+};
+
+void test01()
+{
+  using namespace std;
+
+  // Leave capacity to avoid flush.
+  const streamsize chunk_size = BUFSIZ - 1 - 1;
+  const char data[chunk_size] = {};
+
+  testbuf a_f;
+  VERIFY( a_f.open("tmp_63746_sputn", ios_base::out) );
+  VERIFY( chunk_size == a_f.sputn(data, chunk_size) );
+  VERIFY( (a_f.pub_pprt() - a_f.pub_pbase()) == chunk_size );
+  VERIFY( a_f.close() );
+}
+
+int main()
+{
+  test01();
+  return 0;
+}
-- 
2.30.2


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PING 3] [PATCH v2] libstdc++: basic_filebuf: don't flush more often than necessary.
  2022-10-06 19:02 [PATCH v2] libstdc++: basic_filebuf: don't flush more often than necessary Charles-Francois Natali
@ 2022-10-21 10:08 ` Charles-François Natali
  2022-11-07 17:00 ` Jonathan Wakely
  1 sibling, 0 replies; 4+ messages in thread
From: Charles-François Natali @ 2022-10-21 10:08 UTC (permalink / raw)
  To: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 7337 bytes --]

On Thu, Oct 6, 2022, 20:03 Charles-Francois Natali <cf.natali@gmail.com>
wrote:

> `basic_filebuf::xsputn` would bypass the buffer when passed a chunk of
> size 1024 and above, seemingly as an optimisation.
>
> This can have a significant performance impact if the overhead of a
> `write` syscall is non-negligible, e.g. on a slow disk, on network
> filesystems, or simply during IO contention because instead of flushing
> every `BUFSIZ` (by default), we can flush every 1024 char.
> The impact is even greater with custom larger buffers, e.g. for network
> filesystems, because the code could issue `write` for example 1000X more
> often than necessary with respect to the buffer size.
> It also introduces a significant discontinuity in performance when
> writing chunks of size 1024 and above.
>
> See this reproducer which writes down a fixed number of chunks to a file
> open with `O_SYNC` - to replicate high-latency `write` - for varying
> size of chunks:
>
> ```
> $ cat test_fstream_flush.cpp
>
> int
> main(int argc, char* argv[])
> {
>   assert(argc == 3);
>
>   const auto* path = argv[1];
>   const auto chunk_size = std::stoul(argv[2]);
>
>   const auto fd =
>     open(path, O_CREAT | O_TRUNC | O_WRONLY | O_SYNC | O_CLOEXEC, 0666);
>   assert(fd >= 0);
>
>   auto filebuf = __gnu_cxx::stdio_filebuf<char>(fd, std::ios_base::out);
>   auto stream = std::ostream(&filebuf);
>
>   const auto chunk = std::vector<char>(chunk_size);
>
>   for (auto i = 0; i < 1'000; ++i) {
>     stream.write(chunk.data(), chunk.size());
>   }
>
>   return 0;
> }
> ```
>
> ```
> $ g++ -o /tmp/test_fstream_flush test_fstream_flush.cpp -std=c++17
> $ for i in $(seq 1021 1025); do echo -e "\n$i"; time
> /tmp/test_fstream_flush /tmp/foo $i; done
>
> 1021
>
> real    0m0.997s
> user    0m0.000s
> sys     0m0.038s
>
> 1022
>
> real    0m0.939s
> user    0m0.005s
> sys     0m0.032s
>
> 1023
>
> real    0m0.954s
> user    0m0.005s
> sys     0m0.034s
>
> 1024
>
> real    0m7.102s
> user    0m0.040s
> sys     0m0.192s
>
> 1025
>
> real    0m7.204s
> user    0m0.025s
> sys     0m0.209s
> ```
>
> See the huge drop in performance at the 1024-boundary.
>
> An `strace` confirms that from size 1024 we effectively defeat
> buffering:
> 1023-sized writes
> ```
> $ strace -P /tmp/foo -e openat,write,writev /tmp/test_fstream_flush
> /tmp/foo 1023 2>&1 | head -n5
> openat(AT_FDCWD, "/tmp/foo", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC,
> 0666) = 3
> writev(3,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8184},
> {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=1023}], 2) = 9207
> writev(3,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8184},
> {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=1023}], 2) = 9207
> writev(3,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8184},
> {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=1023}], 2) = 9207
> writev(3,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8184},
> {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=1023}], 2) = 9207
> ```
>
> vs 1024-sized writes
> ```
> $ strace -P /tmp/foo -e openat,write,writev /tmp/test_fstream_flush
> /tmp/foo 1024 2>&1 | head -n5
> openat(AT_FDCWD, "/tmp/foo", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC,
> 0666) = 3
> writev(3, [{iov_base=NULL, iov_len=0},
> {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=1024}], 2) = 1024
> writev(3, [{iov_base="", iov_len=0},
> {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=1024}], 2) = 1024
> writev(3, [{iov_base="", iov_len=0},
> {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=1024}], 2) = 1024
> writev(3, [{iov_base="", iov_len=0},
> {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=1024}], 2) = 1024
> ```
>
> Instead, it makes sense to only bypass the buffer if the amount of data
> to be written is larger than the buffer capacity.
>
> Closes https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63746
>
> Signed-off-by: Charles-Francois Natali <cf.natali@gmail.com>
> ---
>  libstdc++-v3/include/bits/fstream.tcc         |  9 ++---
>  .../27_io/basic_filebuf/sputn/char/63746.cc   | 38 +++++++++++++++++++
>  2 files changed, 41 insertions(+), 6 deletions(-)
>  create mode 100644
> libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
>
> diff --git a/libstdc++-v3/include/bits/fstream.tcc
> b/libstdc++-v3/include/bits/fstream.tcc
> index 7ccc887b8..2e9369628 100644
> --- a/libstdc++-v3/include/bits/fstream.tcc
> +++ b/libstdc++-v3/include/bits/fstream.tcc
> @@ -757,23 +757,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>      {
>        streamsize __ret = 0;
>        // Optimization in the always_noconv() case, to be generalized in
> the
> -      // future: when __n is sufficiently large we write directly instead
> of
> -      // using the buffer.
> +      // future: when __n is larger than the available capacity we write
> +      // directly instead of using the buffer.
>        const bool __testout = (_M_mode & ios_base::out
>                               || _M_mode & ios_base::app);
>        if (__check_facet(_M_codecvt).always_noconv()
>           && __testout && !_M_reading)
>         {
> -         // Measurement would reveal the best choice.
> -         const streamsize __chunk = 1ul << 10;
>           streamsize __bufavail = this->epptr() - this->pptr();
>
>           // Don't mistake 'uncommitted' mode buffered with unbuffered.
>           if (!_M_writing && _M_buf_size > 1)
>             __bufavail = _M_buf_size - 1;
>
> -         const streamsize __limit = std::min(__chunk, __bufavail);
> -         if (__n >= __limit)
> +         if (__n >= __bufavail)
>             {
>               const streamsize __buffill = this->pptr() - this->pbase();
>               const char* __buf = reinterpret_cast<const
> char*>(this->pbase());
> diff --git
> a/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
> b/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
> new file mode 100644
> index 000000000..baab93407
> --- /dev/null
> +++ b/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
> @@ -0,0 +1,38 @@
> +// { dg-require-fileio "" }
> +
> +#include <fstream>
> +#include <testsuite_hooks.h>
> +
> +class testbuf : public std::filebuf {
> +public:
> +  char_type* pub_pprt() const
> +  {
> +    return this->pptr();
> +  }
> +
> +  char_type* pub_pbase() const
> +  {
> +    return this->pbase();
> +  }
> +};
> +
> +void test01()
> +{
> +  using namespace std;
> +
> +  // Leave capacity to avoid flush.
> +  const streamsize chunk_size = BUFSIZ - 1 - 1;
> +  const char data[chunk_size] = {};
> +
> +  testbuf a_f;
> +  VERIFY( a_f.open("tmp_63746_sputn", ios_base::out) );
> +  VERIFY( chunk_size == a_f.sputn(data, chunk_size) );
> +  VERIFY( (a_f.pub_pprt() - a_f.pub_pbase()) == chunk_size );
> +  VERIFY( a_f.close() );
> +}
> +
> +int main()
> +{
> +  test01();
> +  return 0;
> +}
> --
> 2.30.2
>
>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] libstdc++: basic_filebuf: don't flush more often than necessary.
  2022-10-06 19:02 [PATCH v2] libstdc++: basic_filebuf: don't flush more often than necessary Charles-Francois Natali
  2022-10-21 10:08 ` [PING 3] " Charles-François Natali
@ 2022-11-07 17:00 ` Jonathan Wakely
  2022-11-08 17:45   ` Jonathan Wakely
  1 sibling, 1 reply; 4+ messages in thread
From: Jonathan Wakely @ 2022-11-07 17:00 UTC (permalink / raw)
  To: Charles-Francois Natali; +Cc: libstdc++, gcc-patches

On Thu, 6 Oct 2022 at 20:03, Charles-Francois Natali via Libstdc++
<libstdc++@gcc.gnu.org> wrote:
>
> `basic_filebuf::xsputn` would bypass the buffer when passed a chunk of
> size 1024 and above, seemingly as an optimisation.
>
> This can have a significant performance impact if the overhead of a
> `write` syscall is non-negligible, e.g. on a slow disk, on network
> filesystems, or simply during IO contention because instead of flushing
> every `BUFSIZ` (by default), we can flush every 1024 char.
> The impact is even greater with custom larger buffers, e.g. for network
> filesystems, because the code could issue `write` for example 1000X more
> often than necessary with respect to the buffer size.
> It also introduces a significant discontinuity in performance when
> writing chunks of size 1024 and above.
>
> See this reproducer which writes down a fixed number of chunks to a file
> open with `O_SYNC` - to replicate high-latency `write` - for varying
> size of chunks:
>
> ```
> $ cat test_fstream_flush.cpp
>
> int
> main(int argc, char* argv[])
> {
>   assert(argc == 3);
>
>   const auto* path = argv[1];
>   const auto chunk_size = std::stoul(argv[2]);
>
>   const auto fd =
>     open(path, O_CREAT | O_TRUNC | O_WRONLY | O_SYNC | O_CLOEXEC, 0666);
>   assert(fd >= 0);
>
>   auto filebuf = __gnu_cxx::stdio_filebuf<char>(fd, std::ios_base::out);
>   auto stream = std::ostream(&filebuf);
>
>   const auto chunk = std::vector<char>(chunk_size);
>
>   for (auto i = 0; i < 1'000; ++i) {
>     stream.write(chunk.data(), chunk.size());
>   }
>
>   return 0;
> }
> ```
>
> ```
> $ g++ -o /tmp/test_fstream_flush test_fstream_flush.cpp -std=c++17
> $ for i in $(seq 1021 1025); do echo -e "\n$i"; time /tmp/test_fstream_flush /tmp/foo $i; done
>
> 1021
>
> real    0m0.997s
> user    0m0.000s
> sys     0m0.038s
>
> 1022
>
> real    0m0.939s
> user    0m0.005s
> sys     0m0.032s
>
> 1023
>
> real    0m0.954s
> user    0m0.005s
> sys     0m0.034s
>
> 1024
>
> real    0m7.102s
> user    0m0.040s
> sys     0m0.192s
>
> 1025
>
> real    0m7.204s
> user    0m0.025s
> sys     0m0.209s
> ```
>
> See the huge drop in performance at the 1024-boundary.

I've finally found time to properly look at this, sorry for the delay.

I thought I was unable to reproduce these numbers, then I realised I'd
already installed a build with the patch, so was measuring the patched
performance for both my "before" and "after" tests. Oops!

My concern is that the patch doesn't only affect files on remote
filesystems. I assume the original 1024-byte chunking behaviour is
there for a reason, because for large writes the performance might be
better if we just write directly instead of buffering and then writing
again. Assuming we have a fast disk, writing straight to disk avoids
copying in and out of the buffer. But if we have a slow disk, it's
better to buffer and reduce the total number of disk writes. I'm
concerned that the patch optimizes the slow disk case potentially at a
cost for the fast disk case.

I wonder whether it would make sense to check whether the buffer size
has been manually changed, i.e. epptr() - pbase() != _M_buf_size. If
the buffer has been explicitly set by the user, then we should assume
they really want it to be used and so don't bypass it for writes >=
1024.

In the absence of a better idea, I think I'm going to commit the patch
as-is. I don't see it causing any measurable slowdown for very large
writes on fast disks, and it's certainly a huge improvement for slow
disks.


>
> An `strace` confirms that from size 1024 we effectively defeat
> buffering:
> 1023-sized writes
> ```
> $ strace -P /tmp/foo -e openat,write,writev /tmp/test_fstream_flush /tmp/foo 1023 2>&1 | head -n5
> openat(AT_FDCWD, "/tmp/foo", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC, 0666) = 3
> writev(3, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8184}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1023}], 2) = 9207
> writev(3, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8184}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1023}], 2) = 9207
> writev(3, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8184}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1023}], 2) = 9207
> writev(3, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8184}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1023}], 2) = 9207
> ```
>
> vs 1024-sized writes
> ```
> $ strace -P /tmp/foo -e openat,write,writev /tmp/test_fstream_flush /tmp/foo 1024 2>&1 | head -n5
> openat(AT_FDCWD, "/tmp/foo", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_CLOEXEC, 0666) = 3
> writev(3, [{iov_base=NULL, iov_len=0}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}], 2) = 1024
> writev(3, [{iov_base="", iov_len=0}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}], 2) = 1024
> writev(3, [{iov_base="", iov_len=0}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}], 2) = 1024
> writev(3, [{iov_base="", iov_len=0}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}], 2) = 1024
> ```
>
> Instead, it makes sense to only bypass the buffer if the amount of data
> to be written is larger than the buffer capacity.
>
> Closes https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63746
>
> Signed-off-by: Charles-Francois Natali <cf.natali@gmail.com>
> ---
>  libstdc++-v3/include/bits/fstream.tcc         |  9 ++---
>  .../27_io/basic_filebuf/sputn/char/63746.cc   | 38 +++++++++++++++++++
>  2 files changed, 41 insertions(+), 6 deletions(-)
>  create mode 100644 libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
>
> diff --git a/libstdc++-v3/include/bits/fstream.tcc b/libstdc++-v3/include/bits/fstream.tcc
> index 7ccc887b8..2e9369628 100644
> --- a/libstdc++-v3/include/bits/fstream.tcc
> +++ b/libstdc++-v3/include/bits/fstream.tcc
> @@ -757,23 +757,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>      {
>        streamsize __ret = 0;
>        // Optimization in the always_noconv() case, to be generalized in the
> -      // future: when __n is sufficiently large we write directly instead of
> -      // using the buffer.
> +      // future: when __n is larger than the available capacity we write
> +      // directly instead of using the buffer.
>        const bool __testout = (_M_mode & ios_base::out
>                               || _M_mode & ios_base::app);
>        if (__check_facet(_M_codecvt).always_noconv()
>           && __testout && !_M_reading)
>         {
> -         // Measurement would reveal the best choice.
> -         const streamsize __chunk = 1ul << 10;
>           streamsize __bufavail = this->epptr() - this->pptr();
>
>           // Don't mistake 'uncommitted' mode buffered with unbuffered.
>           if (!_M_writing && _M_buf_size > 1)
>             __bufavail = _M_buf_size - 1;
>
> -         const streamsize __limit = std::min(__chunk, __bufavail);
> -         if (__n >= __limit)
> +         if (__n >= __bufavail)
>             {
>               const streamsize __buffill = this->pptr() - this->pbase();
>               const char* __buf = reinterpret_cast<const char*>(this->pbase());
> diff --git a/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc b/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
> new file mode 100644
> index 000000000..baab93407
> --- /dev/null
> +++ b/libstdc++-v3/testsuite/27_io/basic_filebuf/sputn/char/63746.cc
> @@ -0,0 +1,38 @@
> +// { dg-require-fileio "" }
> +
> +#include <fstream>
> +#include <testsuite_hooks.h>
> +
> +class testbuf : public std::filebuf {
> +public:
> +  char_type* pub_pprt() const
> +  {
> +    return this->pptr();
> +  }
> +
> +  char_type* pub_pbase() const
> +  {
> +    return this->pbase();
> +  }
> +};
> +
> +void test01()
> +{
> +  using namespace std;
> +
> +  // Leave capacity to avoid flush.
> +  const streamsize chunk_size = BUFSIZ - 1 - 1;
> +  const char data[chunk_size] = {};
> +
> +  testbuf a_f;
> +  VERIFY( a_f.open("tmp_63746_sputn", ios_base::out) );
> +  VERIFY( chunk_size == a_f.sputn(data, chunk_size) );
> +  VERIFY( (a_f.pub_pprt() - a_f.pub_pbase()) == chunk_size );
> +  VERIFY( a_f.close() );
> +}
> +
> +int main()
> +{
> +  test01();
> +  return 0;
> +}
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] libstdc++: basic_filebuf: don't flush more often than necessary.
  2022-11-07 17:00 ` Jonathan Wakely
@ 2022-11-08 17:45   ` Jonathan Wakely
  0 siblings, 0 replies; 4+ messages in thread
From: Jonathan Wakely @ 2022-11-08 17:45 UTC (permalink / raw)
  To: Charles-Francois Natali; +Cc: libstdc++, gcc-patches

On Mon, 7 Nov 2022 at 17:00, Jonathan Wakely wrote:
>
> On Thu, 6 Oct 2022 at 20:03, Charles-Francois Natali via Libstdc++
> <libstdc++@gcc.gnu.org> wrote:
> >
> > `basic_filebuf::xsputn` would bypass the buffer when passed a chunk of
> > size 1024 and above, seemingly as an optimisation.
> >
> > This can have a significant performance impact if the overhead of a
> > `write` syscall is non-negligible, e.g. on a slow disk, on network
> > filesystems, or simply during IO contention because instead of flushing
> > every `BUFSIZ` (by default), we can flush every 1024 char.
> > The impact is even greater with custom larger buffers, e.g. for network
> > filesystems, because the code could issue `write` for example 1000X more
> > often than necessary with respect to the buffer size.
> > It also introduces a significant discontinuity in performance when
> > writing chunks of size 1024 and above.
> >
> > See this reproducer which writes down a fixed number of chunks to a file
> > open with `O_SYNC` - to replicate high-latency `write` - for varying
> > size of chunks:
> >
> > ```
> > $ cat test_fstream_flush.cpp
> >
> > int
> > main(int argc, char* argv[])
> > {
> >   assert(argc == 3);
> >
> >   const auto* path = argv[1];
> >   const auto chunk_size = std::stoul(argv[2]);
> >
> >   const auto fd =
> >     open(path, O_CREAT | O_TRUNC | O_WRONLY | O_SYNC | O_CLOEXEC, 0666);
> >   assert(fd >= 0);
> >
> >   auto filebuf = __gnu_cxx::stdio_filebuf<char>(fd, std::ios_base::out);
> >   auto stream = std::ostream(&filebuf);
> >
> >   const auto chunk = std::vector<char>(chunk_size);
> >
> >   for (auto i = 0; i < 1'000; ++i) {
> >     stream.write(chunk.data(), chunk.size());
> >   }
> >
> >   return 0;
> > }
> > ```
> >
> > ```
> > $ g++ -o /tmp/test_fstream_flush test_fstream_flush.cpp -std=c++17
> > $ for i in $(seq 1021 1025); do echo -e "\n$i"; time /tmp/test_fstream_flush /tmp/foo $i; done
> >
> > 1021
> >
> > real    0m0.997s
> > user    0m0.000s
> > sys     0m0.038s
> >
> > 1022
> >
> > real    0m0.939s
> > user    0m0.005s
> > sys     0m0.032s
> >
> > 1023
> >
> > real    0m0.954s
> > user    0m0.005s
> > sys     0m0.034s
> >
> > 1024
> >
> > real    0m7.102s
> > user    0m0.040s
> > sys     0m0.192s
> >
> > 1025
> >
> > real    0m7.204s
> > user    0m0.025s
> > sys     0m0.209s
> > ```
> >
> > See the huge drop in performance at the 1024-boundary.
>
> I've finally found time to properly look at this, sorry for the delay.
>
> I thought I was unable to reproduce these numbers, then I realised I'd
> already installed a build with the patch, so was measuring the patched
> performance for both my "before" and "after" tests. Oops!
>
> My concern is that the patch doesn't only affect files on remote
> filesystems. I assume the original 1024-byte chunking behaviour is
> there for a reason, because for large writes the performance might be
> better if we just write directly instead of buffering and then writing
> again. Assuming we have a fast disk, writing straight to disk avoids
> copying in and out of the buffer. But if we have a slow disk, it's
> better to buffer and reduce the total number of disk writes. I'm
> concerned that the patch optimizes the slow disk case potentially at a
> cost for the fast disk case.
>
> I wonder whether it would make sense to check whether the buffer size
> has been manually changed, i.e. epptr() - pbase() != _M_buf_size. If
> the buffer has been explicitly set by the user, then we should assume
> they really want it to be used and so don't bypass it for writes >=
> 1024.
>
> In the absence of a better idea, I think I'm going to commit the patch
> as-is. I don't see it causing any measurable slowdown for very large
> writes on fast disks, and it's certainly a huge improvement for slow
> disks.

The patch has been pushed to trunk now, thanks for the contribution.

I removed the testcase and results from the commit message as they
don't need to be in the git log. I added a link to your email into
bugzilla though, so we can still find it easily.


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-11-08 17:46 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-06 19:02 [PATCH v2] libstdc++: basic_filebuf: don't flush more often than necessary Charles-Francois Natali
2022-10-21 10:08 ` [PING 3] " Charles-François Natali
2022-11-07 17:00 ` Jonathan Wakely
2022-11-08 17:45   ` Jonathan Wakely

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).