public inbox for libc-alpha@sourceware.org
* [PATCH v2] Single threaded stdio optimization
@ 2017-06-21 16:37 Szabolcs Nagy
  2017-06-21 16:52 ` Joseph Myers
  2017-07-07 12:38 ` Florian Weimer
  0 siblings, 2 replies; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-21 16:37 UTC (permalink / raw)
  To: GNU C Library; +Cc: nd, triegel

[-- Attachment #1: Type: text/plain, Size: 2136 bytes --]

Locking overhead can be significant in some stdio operations
that are common in single threaded applications.  I'd like
to address it in this release as it causes performance problems
for aarch64 users.  I'd prefer this high-level approach to be
reviewed; if that does not work then the aarch64-specific
low-level approach will be taken.

This patch adds the _IO_FLAGS2_NEED_LOCK flag to indicate if
an _IO_FILE object needs to be locked and some of the stdio
functions just jump to their _unlocked variant when not.  The
flag is set on all _IO_FILE objects when the first thread is
created.  A new libc abi symbol, _IO_enable_locks, does this.
(The abilist files will have to be updated in a separate patch).

The optimization can be applied to more stdio functions;
currently it is only applied to single flag checks and single
non-wide-char standard operations.  The flag should probably
never be set for files with _IO_USER_LOCK, but that's just a
further optimization, not a correctness requirement.

The optimization is not valid if user code may run during
a stdio operation (interposed malloc, printf hooks, etc.)
because that user code may create threads.  Handling those
cases would require setting the flag whenever such callbacks
are registered, which is more involved, and those cases are
less important to optimize, hence _IO_flockfile is not modified.
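
To illustrate the intended behaviour, here is a hypothetical
example (not part of the patch): getc takes the unlocked fast
path until the first thread is created, after which the flag is
set on all streams and stays set:

  /* Hypothetical illustration only.  Before the first pthread_create
     the fast path is taken; pthread_create calls _IO_enable_locks,
     which sets _IO_FLAGS2_NEED_LOCK on every stream, and the flag is
     never cleared.  */
  #include <pthread.h>
  #include <stdio.h>

  static void *idle (void *arg) { return arg; }

  int
  main (void)
  {
    FILE *fp = fopen ("/dev/null", "r");
    if (fp == NULL)
      return 1;
    getc (fp);                  /* Unlocked fast path: no threads yet.  */

    pthread_t t;
    pthread_create (&t, NULL, idle, NULL);
    pthread_join (t, NULL);

    getc (fp);                  /* Locked path from now on.  */
    fclose (fp);
    return 0;
  }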

2017-06-21  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
	* libio/libioP.h (_IO_enable_locks): Declare.
	* libio/Versions (_IO_enable_locks): New symbol.
	* libio/genops.c (_IO_enable_locks): Define.
	(_IO_old_init): Initialize flags2.
	* libio/feof.c (_IO_feof): Avoid locking when not needed.
	* libio/ferror.c (_IO_ferror): Likewise.
	* libio/fputc.c (fputc): Likewise.
	* libio/putc.c (_IO_putc): Likewise.
	* libio/getc.c (_IO_getc): Likewise.
	* libio/getchar.c (getchar): Likewise.
	* libio/ioungetc.c (_IO_ungetc): Likewise.
	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.

[-- Attachment #2: stdio.diff --]
[-- Type: text/x-patch, Size: 6656 bytes --]

diff --git a/libio/Versions b/libio/Versions
index 2a1d6e6c85..300f539243 100644
--- a/libio/Versions
+++ b/libio/Versions
@@ -152,6 +152,9 @@ libc {
     # f*
     fmemopen;
   }
+  GLIBC_2.26 {
+    _IO_enable_locks;
+  }
   GLIBC_PRIVATE {
     # Used by NPTL and librt
     __libc_fatal;
diff --git a/libio/feof.c b/libio/feof.c
index 9712a81e78..8890a5f51f 100644
--- a/libio/feof.c
+++ b/libio/feof.c
@@ -32,6 +32,8 @@ _IO_feof (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_feof_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_feof_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/ferror.c b/libio/ferror.c
index 01e3bd8e2b..d10fcd9fff 100644
--- a/libio/ferror.c
+++ b/libio/ferror.c
@@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_ferror_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_ferror_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/fputc.c b/libio/fputc.c
index a7cd682fe2..b72305c06f 100644
--- a/libio/fputc.c
+++ b/libio/fputc.c
@@ -32,6 +32,8 @@ fputc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/libio/genops.c b/libio/genops.c
index a466cfa337..132f1f1a7d 100644
--- a/libio/genops.c
+++ b/libio/genops.c
@@ -570,11 +570,28 @@ _IO_init (_IO_FILE *fp, int flags)
   _IO_init_internal (fp, flags);
 }
 
+static int stdio_needs_locking;
+
+void
+_IO_enable_locks (void)
+{
+  _IO_ITER i;
+
+  if (stdio_needs_locking)
+    return;
+  stdio_needs_locking = 1;
+  for (i = _IO_iter_begin (); i != _IO_iter_end (); i = _IO_iter_next (i))
+    _IO_iter_file (i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
+}
+libc_hidden_def (_IO_enable_locks)
+
 void
 _IO_old_init (_IO_FILE *fp, int flags)
 {
   fp->_flags = _IO_MAGIC|flags;
   fp->_flags2 = 0;
+  if (stdio_needs_locking)
+    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   fp->_IO_buf_base = NULL;
   fp->_IO_buf_end = NULL;
   fp->_IO_read_base = NULL;
diff --git a/libio/getc.c b/libio/getc.c
index b58fd62308..fd66ef93cf 100644
--- a/libio/getc.c
+++ b/libio/getc.c
@@ -34,6 +34,8 @@ _IO_getc (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_getc_unlocked (fp);
   _IO_acquire_lock (fp);
   result = _IO_getc_unlocked (fp);
   _IO_release_lock (fp);
diff --git a/libio/getchar.c b/libio/getchar.c
index 5b41595d17..d79932114e 100644
--- a/libio/getchar.c
+++ b/libio/getchar.c
@@ -33,6 +33,8 @@ int
 getchar (void)
 {
   int result;
+  if (!_IO_need_lock (_IO_stdin))
+    return _IO_getc_unlocked (_IO_stdin);
   _IO_acquire_lock (_IO_stdin);
   result = _IO_getc_unlocked (_IO_stdin);
   _IO_release_lock (_IO_stdin);
diff --git a/libio/iofopncook.c b/libio/iofopncook.c
index a08dfdaa42..982f464a68 100644
--- a/libio/iofopncook.c
+++ b/libio/iofopncook.c
@@ -172,6 +172,8 @@ _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
   _IO_mask_flags (&cfile->__fp.file, read_write,
 		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
 
+  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
+
   /* We use a negative number different from -1 for _fileno to mark that
      this special stream is not associated with a real file, but still has
      to be treated as such.  */
diff --git a/libio/ioungetc.c b/libio/ioungetc.c
index 951064fa12..917cad8abb 100644
--- a/libio/ioungetc.c
+++ b/libio/ioungetc.c
@@ -33,6 +33,8 @@ _IO_ungetc (int c, _IO_FILE *fp)
   CHECK_FILE (fp, EOF);
   if (c == EOF)
     return EOF;
+  if (!_IO_need_lock (fp))
+    return _IO_sputbackc (fp, (unsigned char) c);
   _IO_acquire_lock (fp);
   result = _IO_sputbackc (fp, (unsigned char) c);
   _IO_release_lock (fp);
diff --git a/libio/libio.h b/libio/libio.h
index 518ffd8e44..14bcb92332 100644
--- a/libio/libio.h
+++ b/libio/libio.h
@@ -119,6 +119,7 @@
 # define _IO_FLAGS2_SCANF_STD 16
 # define _IO_FLAGS2_NOCLOSE 32
 # define _IO_FLAGS2_CLOEXEC 64
+# define _IO_FLAGS2_NEED_LOCK 128
 #endif
 
 /* These are "formatting flags" matching the iostream fmtflags enum values. */
@@ -451,6 +452,9 @@ extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
 #define _IO_cleanup_region_end(_Doit) /**/
 #endif
 
+#define _IO_need_lock(_fp) \
+  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
+
 extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
 			_IO_va_list, int *__restrict);
 extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
diff --git a/libio/libioP.h b/libio/libioP.h
index eb93418c8d..163dfb1386 100644
--- a/libio/libioP.h
+++ b/libio/libioP.h
@@ -444,6 +444,8 @@ extern void _IO_list_unlock (void) __THROW;
 libc_hidden_proto (_IO_list_unlock)
 extern void _IO_list_resetlock (void) __THROW;
 libc_hidden_proto (_IO_list_resetlock)
+extern void _IO_enable_locks (void) __THROW;
+libc_hidden_proto (_IO_enable_locks)
 
 /* Default jumptable functions. */
 
@@ -750,7 +752,7 @@ extern _IO_off64_t _IO_seekpos_unlocked (_IO_FILE *, _IO_off64_t, int)
 
 #if _G_HAVE_MMAP
 
-# ifdef _LIBC
+# if IS_IN (libc)
 /* When using this code in the GNU libc we must not pollute the name space.  */
 #  define mmap __mmap
 #  define munmap __munmap
diff --git a/libio/putc.c b/libio/putc.c
index b591c5538b..6e1fdeef3a 100644
--- a/libio/putc.c
+++ b/libio/putc.c
@@ -25,6 +25,8 @@ _IO_putc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index c7d1b8f413..0b3fa942f2 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -32,6 +32,7 @@
 #include <exit-thread.h>
 #include <default-sched.h>
 #include <futex-internal.h>
+#include "libioP.h"
 
 #include <shlib-compat.h>
 
@@ -756,6 +757,9 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
         collect_default_sched (pd);
     }
 
+  if (__glibc_unlikely (__nptl_nthreads == 1))
+    _IO_enable_locks ();
+
   /* Pass the descriptor to the caller.  */
   *newthread = (pthread_t) pd;
 
diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
index 7fe8e99161..a8e6c28ed9 100644
--- a/sysdeps/pthread/flockfile.c
+++ b/sysdeps/pthread/flockfile.c
@@ -25,6 +25,7 @@
 void
 __flockfile (FILE *stream)
 {
+  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   _IO_lock_lock (*stream->_lock);
 }
 strong_alias (__flockfile, _IO_flockfile)


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-21 16:37 [PATCH v2] Single threaded stdio optimization Szabolcs Nagy
@ 2017-06-21 16:52 ` Joseph Myers
  2017-06-21 16:53   ` Szabolcs Nagy
  2017-06-22  9:50   ` Szabolcs Nagy
  2017-07-07 12:38 ` Florian Weimer
  1 sibling, 2 replies; 31+ messages in thread
From: Joseph Myers @ 2017-06-21 16:52 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: GNU C Library, nd, triegel

On Wed, 21 Jun 2017, Szabolcs Nagy wrote:

> created.  A new libc abi symbol, _IO_enable_locks, does this.
> (The abilist files will have to be updated in a separate patch).

I don't see any declarations or uses of this symbol in public headers, so 
why does it need to be at a public symbol version instead of 
GLIBC_PRIVATE?

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-21 16:52 ` Joseph Myers
@ 2017-06-21 16:53   ` Szabolcs Nagy
  2017-06-22  9:50   ` Szabolcs Nagy
  1 sibling, 0 replies; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-21 16:53 UTC (permalink / raw)
  To: Joseph Myers; +Cc: nd, GNU C Library, triegel

On 21/06/17 17:52, Joseph Myers wrote:
> On Wed, 21 Jun 2017, Szabolcs Nagy wrote:
> 
>> created.  A new libc abi symbol, _IO_enable_locks, does this.
>> (The abilist files will have to be updated in a separate patch).
> 
> I don't see any declarations or uses of this symbol in public headers, so 
> why does it need to be at a public symbol version instead of 
> GLIBC_PRIVATE?
> 

oh i didn't know GLIBC_PRIVATE is the way to do libc internal symbols.

thanks.


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-21 16:52 ` Joseph Myers
  2017-06-21 16:53   ` Szabolcs Nagy
@ 2017-06-22  9:50   ` Szabolcs Nagy
  2017-06-29 11:41     ` Siddhesh Poyarekar
  1 sibling, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-22  9:50 UTC (permalink / raw)
  To: Joseph Myers; +Cc: nd, GNU C Library, triegel

[-- Attachment #1: Type: text/plain, Size: 2884 bytes --]

On 21/06/17 17:52, Joseph Myers wrote:
> On Wed, 21 Jun 2017, Szabolcs Nagy wrote:
> 
>> created.  A new libc abi symbol, _IO_enable_locks, does this.
>> (The abilist files will have to be updated in a separate patch).
> 
> I don't see any declarations or uses of this symbol in public headers, so 
> why does it need to be at a public symbol version instead of 
> GLIBC_PRIVATE?
> 

PATCH v3: use GLIBC_PRIVATE for _IO_enable_locks.
(should i use __libc_* naming convention for private symbols?)

Locking overhead can be significant in some stdio operations
that are common in single threaded applications.

I'd like to address it in this release if possible as it
causes performance problems for aarch64 users.  I'd prefer this
high-level approach to be reviewed; if that does not work then
the aarch64-specific low-level approach will be taken.

This patch adds the _IO_FLAGS2_NEED_LOCK flag to indicate if
an _IO_FILE object needs to be locked and some of the stdio
functions just jump to their _unlocked variant when not.  The
flag is set on all _IO_FILE objects when the first thread is
created.  A new GLIBC_PRIVATE libc symbol, _IO_enable_locks,
was added to do this from libpthread.

The optimization can be applied to more stdio functions;
currently it is only applied to single flag checks and single
non-wide-char standard operations.  The flag should probably
never be set for files with _IO_USER_LOCK, but that's just a
further optimization, not a correctness requirement.

The optimization is valid in a single thread because stdio
operations are not async-signal-safe (so lock state is not
observable from a signal handler) and stdio locks are recursive
(so lock state is not observable via deadlock).  The optimization
is not valid if a thread may be created while a stdio lock is
taken, so it has to be disabled if any user code may run during
a stdio operation (interposed malloc, printf hooks, etc.).
This makes the optimization more complicated for some stdio
operations (e.g. printf), but those are bigger and thus less
important to optimize, so this patch does not try to do that.
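
For example (a hypothetical sketch, not taken from the patch), a
cookie stream's read callback is user code that may create a thread
while the stream lock is held, which is why the patch keeps locking
enabled for fopencookie streams:

  /* Hypothetical sketch of user code creating a thread under a
     stream lock; names and details are made up for illustration.  */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <stdio.h>
  #include <sys/types.h>

  static void *worker (void *arg) { return arg; }

  static ssize_t
  cookie_read (void *cookie, char *buf, size_t size)
  {
    pthread_t t;
    pthread_create (&t, NULL, worker, NULL);  /* Created under the lock.  */
    pthread_join (t, NULL);
    return 0;                                 /* EOF.  */
  }

  int
  main (void)
  {
    cookie_io_functions_t io = { .read = cookie_read };
    FILE *fp = fopencookie (NULL, "r", io);
    getc (fp);     /* Runs the user callback, so the lock must be taken.  */
    fclose (fp);
    return 0;
  }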

2017-06-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
	* libio/libioP.h (_IO_enable_locks): Declare.
	* libio/Versions (_IO_enable_locks): New symbol.
	* libio/genops.c (_IO_enable_locks): Define.
	(_IO_old_init): Initialize flags2.
	* libio/feof.c (_IO_feof): Avoid locking when not needed.
	* libio/ferror.c (_IO_ferror): Likewise.
	* libio/fputc.c (fputc): Likewise.
	* libio/putc.c (_IO_putc): Likewise.
	* libio/getc.c (_IO_getc): Likewise.
	* libio/getchar.c (getchar): Likewise.
	* libio/ioungetc.c (_IO_ungetc): Likewise.
	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.

[-- Attachment #2: stdio.diff --]
[-- Type: text/x-patch, Size: 6613 bytes --]

diff --git a/libio/Versions b/libio/Versions
index 2a1d6e6c85..6c1624e8cd 100644
--- a/libio/Versions
+++ b/libio/Versions
@@ -155,5 +155,6 @@ libc {
   GLIBC_PRIVATE {
     # Used by NPTL and librt
     __libc_fatal;
+    _IO_enable_locks;
   }
 }
diff --git a/libio/feof.c b/libio/feof.c
index 9712a81e78..8890a5f51f 100644
--- a/libio/feof.c
+++ b/libio/feof.c
@@ -32,6 +32,8 @@ _IO_feof (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_feof_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_feof_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/ferror.c b/libio/ferror.c
index 01e3bd8e2b..d10fcd9fff 100644
--- a/libio/ferror.c
+++ b/libio/ferror.c
@@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_ferror_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_ferror_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/fputc.c b/libio/fputc.c
index a7cd682fe2..b72305c06f 100644
--- a/libio/fputc.c
+++ b/libio/fputc.c
@@ -32,6 +32,8 @@ fputc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/libio/genops.c b/libio/genops.c
index a466cfa337..132f1f1a7d 100644
--- a/libio/genops.c
+++ b/libio/genops.c
@@ -570,11 +570,28 @@ _IO_init (_IO_FILE *fp, int flags)
   _IO_init_internal (fp, flags);
 }
 
+static int stdio_needs_locking;
+
+void
+_IO_enable_locks (void)
+{
+  _IO_ITER i;
+
+  if (stdio_needs_locking)
+    return;
+  stdio_needs_locking = 1;
+  for (i = _IO_iter_begin (); i != _IO_iter_end (); i = _IO_iter_next (i))
+    _IO_iter_file (i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
+}
+libc_hidden_def (_IO_enable_locks)
+
 void
 _IO_old_init (_IO_FILE *fp, int flags)
 {
   fp->_flags = _IO_MAGIC|flags;
   fp->_flags2 = 0;
+  if (stdio_needs_locking)
+    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   fp->_IO_buf_base = NULL;
   fp->_IO_buf_end = NULL;
   fp->_IO_read_base = NULL;
diff --git a/libio/getc.c b/libio/getc.c
index b58fd62308..fd66ef93cf 100644
--- a/libio/getc.c
+++ b/libio/getc.c
@@ -34,6 +34,8 @@ _IO_getc (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_getc_unlocked (fp);
   _IO_acquire_lock (fp);
   result = _IO_getc_unlocked (fp);
   _IO_release_lock (fp);
diff --git a/libio/getchar.c b/libio/getchar.c
index 5b41595d17..d79932114e 100644
--- a/libio/getchar.c
+++ b/libio/getchar.c
@@ -33,6 +33,8 @@ int
 getchar (void)
 {
   int result;
+  if (!_IO_need_lock (_IO_stdin))
+    return _IO_getc_unlocked (_IO_stdin);
   _IO_acquire_lock (_IO_stdin);
   result = _IO_getc_unlocked (_IO_stdin);
   _IO_release_lock (_IO_stdin);
diff --git a/libio/iofopncook.c b/libio/iofopncook.c
index a08dfdaa42..982f464a68 100644
--- a/libio/iofopncook.c
+++ b/libio/iofopncook.c
@@ -172,6 +172,8 @@ _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
   _IO_mask_flags (&cfile->__fp.file, read_write,
 		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
 
+  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
+
   /* We use a negative number different from -1 for _fileno to mark that
      this special stream is not associated with a real file, but still has
      to be treated as such.  */
diff --git a/libio/ioungetc.c b/libio/ioungetc.c
index 951064fa12..917cad8abb 100644
--- a/libio/ioungetc.c
+++ b/libio/ioungetc.c
@@ -33,6 +33,8 @@ _IO_ungetc (int c, _IO_FILE *fp)
   CHECK_FILE (fp, EOF);
   if (c == EOF)
     return EOF;
+  if (!_IO_need_lock (fp))
+    return _IO_sputbackc (fp, (unsigned char) c);
   _IO_acquire_lock (fp);
   result = _IO_sputbackc (fp, (unsigned char) c);
   _IO_release_lock (fp);
diff --git a/libio/libio.h b/libio/libio.h
index 518ffd8e44..14bcb92332 100644
--- a/libio/libio.h
+++ b/libio/libio.h
@@ -119,6 +119,7 @@
 # define _IO_FLAGS2_SCANF_STD 16
 # define _IO_FLAGS2_NOCLOSE 32
 # define _IO_FLAGS2_CLOEXEC 64
+# define _IO_FLAGS2_NEED_LOCK 128
 #endif
 
 /* These are "formatting flags" matching the iostream fmtflags enum values. */
@@ -451,6 +452,9 @@ extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
 #define _IO_cleanup_region_end(_Doit) /**/
 #endif
 
+#define _IO_need_lock(_fp) \
+  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
+
 extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
 			_IO_va_list, int *__restrict);
 extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
diff --git a/libio/libioP.h b/libio/libioP.h
index eb93418c8d..163dfb1386 100644
--- a/libio/libioP.h
+++ b/libio/libioP.h
@@ -444,6 +444,8 @@ extern void _IO_list_unlock (void) __THROW;
 libc_hidden_proto (_IO_list_unlock)
 extern void _IO_list_resetlock (void) __THROW;
 libc_hidden_proto (_IO_list_resetlock)
+extern void _IO_enable_locks (void) __THROW;
+libc_hidden_proto (_IO_enable_locks)
 
 /* Default jumptable functions. */
 
@@ -750,7 +752,7 @@ extern _IO_off64_t _IO_seekpos_unlocked (_IO_FILE *, _IO_off64_t, int)
 
 #if _G_HAVE_MMAP
 
-# ifdef _LIBC
+# if IS_IN (libc)
 /* When using this code in the GNU libc we must not pollute the name space.  */
 #  define mmap __mmap
 #  define munmap __munmap
diff --git a/libio/putc.c b/libio/putc.c
index b591c5538b..6e1fdeef3a 100644
--- a/libio/putc.c
+++ b/libio/putc.c
@@ -25,6 +25,8 @@ _IO_putc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index c7d1b8f413..0b3fa942f2 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -32,6 +32,7 @@
 #include <exit-thread.h>
 #include <default-sched.h>
 #include <futex-internal.h>
+#include "libioP.h"
 
 #include <shlib-compat.h>
 
@@ -756,6 +757,9 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
         collect_default_sched (pd);
     }
 
+  if (__glibc_unlikely (__nptl_nthreads == 1))
+    _IO_enable_locks ();
+
   /* Pass the descriptor to the caller.  */
   *newthread = (pthread_t) pd;
 
diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
index 7fe8e99161..a8e6c28ed9 100644
--- a/sysdeps/pthread/flockfile.c
+++ b/sysdeps/pthread/flockfile.c
@@ -25,6 +25,7 @@
 void
 __flockfile (FILE *stream)
 {
+  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   _IO_lock_lock (*stream->_lock);
 }
 strong_alias (__flockfile, _IO_flockfile)


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-22  9:50   ` Szabolcs Nagy
@ 2017-06-29 11:41     ` Siddhesh Poyarekar
  2017-06-29 12:01       ` Siddhesh Poyarekar
  2017-06-29 13:29       ` Szabolcs Nagy
  0 siblings, 2 replies; 31+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-29 11:41 UTC (permalink / raw)
  To: Szabolcs Nagy, Joseph Myers; +Cc: nd, GNU C Library, triegel

On Thursday 22 June 2017 03:20 PM, Szabolcs Nagy wrote:
> On 21/06/17 17:52, Joseph Myers wrote:
>> On Wed, 21 Jun 2017, Szabolcs Nagy wrote:
>>
>>> created.  A new libc abi symbol, _IO_enable_locks, does this.
>>> (The abilist files will have to be updated in a separate patch).
>>
>> I don't see any declarations or uses of this symbol in public headers, so 
>> why does it need to be at a public symbol version instead of 
>> GLIBC_PRIVATE?
>>
> 
> PATCH v3: use GLIBC_PRIVATE for _IO_enable_locks.
> (should i use __libc_* naming convention for private symbols?)
> 
> Locking overhead can be significant in some stdio operations
> that are common in single threaded applications.
> 
> I'd like to address it in this release if possible as it
> causes performance problems to aarch64 users.  I prefer this
> high-level approach to be reviewed, if that does not work then
> the aarch64 specific low-level approach will be taken.
> 
> This patch adds the _IO_FLAGS2_NEED_LOCK flag to indicate if
> an _IO_FILE object needs to be locked and some of the stdio
> functions just jump to their _unlocked variant when not.  The
> flag is set on all _IO_FILE objects when the first thread is
> created.  A new GLIBC_PRIVATE libc symbol, _IO_enable_locks,
> was added to do this from libpthread.
>
> The optimization can be applied to more stdio functions,
> currently it is only applied to single flag check or single
> non-wide-char standard operations.  The flag should probably
> be never set for files with _IO_USER_LOCK, but that's just a
> further optimization, not a correctness requirement.
> 
> The optimization is valid in a single thread because stdio
> operations are non-as-safe (so lock state is not observable
> from a signal handler) and stdio locks are recursive (so lock
> state is not observable via deadlock).  The optimization is not
> valid if a thread may be created while an stdio lock is taken
> and thus it should be disabled if any user code may run during
> an stdio operation (interposed malloc, printf hooks, etc).
> This makes the optimization more complicated for some stdio
> operations (e.g. printf), but those are bigger and thus less
> important to optimize so this patch does not try to do that.
> 
> 2017-06-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>
> 
> 	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
> 	* libio/libioP.h (_IO_enable_locks): Declare.
> 	* libio/Versions (_IO_enable_locks): New symbol.
> 	* libio/genops.c (_IO_enable_locks): Define.
> 	(_IO_old_init): Initialize flags2.
> 	* libio/feof.c.c (_IO_feof): Avoid locking when not needed.
> 	* libio/ferror.c (_IO_ferror): Likewise.
> 	* libio/fputc.c (fputc): Likewise.
> 	* libio/putc.c (_IO_putc): Likewise.
> 	* libio/getc.c (_IO_getc): Likewise.
> 	* libio/getchar.c (getchar): Likewise.
> 	* libio/ioungetc.c (_IO_ungetc): Likewise.
> 	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
> 	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
> 	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.
> 
> 
> stdio.diff
> 
> 
> diff --git a/libio/Versions b/libio/Versions
> index 2a1d6e6c85..6c1624e8cd 100644
> --- a/libio/Versions
> +++ b/libio/Versions
> @@ -155,5 +155,6 @@ libc {
>    GLIBC_PRIVATE {
>      # Used by NPTL and librt
>      __libc_fatal;
> +    _IO_enable_locks;
>    }
>  }
> diff --git a/libio/feof.c b/libio/feof.c
> index 9712a81e78..8890a5f51f 100644
> --- a/libio/feof.c
> +++ b/libio/feof.c
> @@ -32,6 +32,8 @@ _IO_feof (_IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_feof_unlocked (fp);
>    _IO_flockfile (fp);
>    result = _IO_feof_unlocked (fp);
>    _IO_funlockfile (fp);
> diff --git a/libio/ferror.c b/libio/ferror.c
> index 01e3bd8e2b..d10fcd9fff 100644
> --- a/libio/ferror.c
> +++ b/libio/ferror.c
> @@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_ferror_unlocked (fp);
>    _IO_flockfile (fp);
>    result = _IO_ferror_unlocked (fp);

The patch looks OK except for the duplication (and a missing comment
below), which looks a bit clumsy.  How about something like this instead:

  bool need_lock = _IO_need_lock (fp);

  if (need_lock)
    _IO_flockfile (fp);
  result = _IO_ferror_unlocked (fp);
  if (need_lock)
    _IO_funlockfile (fp);

  return result;

You could probably make some kind of a macro out of this, I haven't
looked that hard.
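
For concreteness, such a macro might look roughly like the sketch
below (the name is made up and I have not tested it; it assumes
plain flockfile/funlockfile style locking):

  #define _IO_MAYBE_LOCKED(fp, expr)             \
    ({                                           \
      bool __need_lock = _IO_need_lock (fp);     \
      if (__need_lock)                           \
        _IO_flockfile (fp);                      \
      __typeof (expr) __result = (expr);         \
      if (__need_lock)                           \
        _IO_funlockfile (fp);                    \
      __result;                                  \
    })

  /* e.g. in _IO_ferror:
     return _IO_MAYBE_LOCKED (fp, _IO_ferror_unlocked (fp));  */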

>    _IO_funlockfile (fp);
> diff --git a/libio/fputc.c b/libio/fputc.c
> index a7cd682fe2..b72305c06f 100644
> --- a/libio/fputc.c
> +++ b/libio/fputc.c
> @@ -32,6 +32,8 @@ fputc (int c, _IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_putc_unlocked (c, fp);
>    _IO_acquire_lock (fp);
>    result = _IO_putc_unlocked (c, fp);
>    _IO_release_lock (fp);
> diff --git a/libio/genops.c b/libio/genops.c
> index a466cfa337..132f1f1a7d 100644
> --- a/libio/genops.c
> +++ b/libio/genops.c
> @@ -570,11 +570,28 @@ _IO_init (_IO_FILE *fp, int flags)
>    _IO_init_internal (fp, flags);
>  }
>  

Add a fat comment here describing in detail what you're doing and why.

> +static int stdio_needs_locking;
> +
> +void
> +_IO_enable_locks (void)
> +{
> +  _IO_ITER i;
> +
> +  if (stdio_needs_locking)
> +    return;
> +  stdio_needs_locking = 1;
> +  for (i = _IO_iter_begin (); i != _IO_iter_end (); i = _IO_iter_next (i))
> +    _IO_iter_file (i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
> +}
> +libc_hidden_def (_IO_enable_locks)
> +
>  void
>  _IO_old_init (_IO_FILE *fp, int flags)
>  {
>    fp->_flags = _IO_MAGIC|flags;
>    fp->_flags2 = 0;
> +  if (stdio_needs_locking)
> +    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
>    fp->_IO_buf_base = NULL;
>    fp->_IO_buf_end = NULL;
>    fp->_IO_read_base = NULL;
> diff --git a/libio/getc.c b/libio/getc.c
> index b58fd62308..fd66ef93cf 100644
> --- a/libio/getc.c
> +++ b/libio/getc.c
> @@ -34,6 +34,8 @@ _IO_getc (FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_getc_unlocked (fp);
>    _IO_acquire_lock (fp);
>    result = _IO_getc_unlocked (fp);
>    _IO_release_lock (fp);
> diff --git a/libio/getchar.c b/libio/getchar.c
> index 5b41595d17..d79932114e 100644
> --- a/libio/getchar.c
> +++ b/libio/getchar.c
> @@ -33,6 +33,8 @@ int
>  getchar (void)
>  {
>    int result;
> +  if (!_IO_need_lock (_IO_stdin))
> +    return _IO_getc_unlocked (_IO_stdin);
>    _IO_acquire_lock (_IO_stdin);
>    result = _IO_getc_unlocked (_IO_stdin);
>    _IO_release_lock (_IO_stdin);
> diff --git a/libio/iofopncook.c b/libio/iofopncook.c
> index a08dfdaa42..982f464a68 100644
> --- a/libio/iofopncook.c
> +++ b/libio/iofopncook.c
> @@ -172,6 +172,8 @@ _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
>    _IO_mask_flags (&cfile->__fp.file, read_write,
>  		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
>  
> +  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
> +
>    /* We use a negative number different from -1 for _fileno to mark that
>       this special stream is not associated with a real file, but still has
>       to be treated as such.  */
> diff --git a/libio/ioungetc.c b/libio/ioungetc.c
> index 951064fa12..917cad8abb 100644
> --- a/libio/ioungetc.c
> +++ b/libio/ioungetc.c
> @@ -33,6 +33,8 @@ _IO_ungetc (int c, _IO_FILE *fp)
>    CHECK_FILE (fp, EOF);
>    if (c == EOF)
>      return EOF;
> +  if (!_IO_need_lock (fp))
> +    return _IO_sputbackc (fp, (unsigned char) c);
>    _IO_acquire_lock (fp);
>    result = _IO_sputbackc (fp, (unsigned char) c);
>    _IO_release_lock (fp);
> diff --git a/libio/libio.h b/libio/libio.h
> index 518ffd8e44..14bcb92332 100644
> --- a/libio/libio.h
> +++ b/libio/libio.h
> @@ -119,6 +119,7 @@
>  # define _IO_FLAGS2_SCANF_STD 16
>  # define _IO_FLAGS2_NOCLOSE 32
>  # define _IO_FLAGS2_CLOEXEC 64
> +# define _IO_FLAGS2_NEED_LOCK 128
>  #endif
>  
>  /* These are "formatting flags" matching the iostream fmtflags enum values. */
> @@ -451,6 +452,9 @@ extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
>  #define _IO_cleanup_region_end(_Doit) /**/
>  #endif
>  
> +#define _IO_need_lock(_fp) \
> +  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
> +
>  extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
>  			_IO_va_list, int *__restrict);
>  extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
> diff --git a/libio/libioP.h b/libio/libioP.h
> index eb93418c8d..163dfb1386 100644
> --- a/libio/libioP.h
> +++ b/libio/libioP.h
> @@ -444,6 +444,8 @@ extern void _IO_list_unlock (void) __THROW;
>  libc_hidden_proto (_IO_list_unlock)
>  extern void _IO_list_resetlock (void) __THROW;
>  libc_hidden_proto (_IO_list_resetlock)
> +extern void _IO_enable_locks (void) __THROW;
> +libc_hidden_proto (_IO_enable_locks)
>  
>  /* Default jumptable functions. */
>  
> @@ -750,7 +752,7 @@ extern _IO_off64_t _IO_seekpos_unlocked (_IO_FILE *, _IO_off64_t, int)
>  
>  #if _G_HAVE_MMAP
>  
> -# ifdef _LIBC
> +# if IS_IN (libc)
>  /* When using this code in the GNU libc we must not pollute the name space.  */
>  #  define mmap __mmap
>  #  define munmap __munmap
> diff --git a/libio/putc.c b/libio/putc.c
> index b591c5538b..6e1fdeef3a 100644
> --- a/libio/putc.c
> +++ b/libio/putc.c
> @@ -25,6 +25,8 @@ _IO_putc (int c, _IO_FILE *fp)
>  {
>    int result;
>    CHECK_FILE (fp, EOF);
> +  if (!_IO_need_lock (fp))
> +    return _IO_putc_unlocked (c, fp);
>    _IO_acquire_lock (fp);
>    result = _IO_putc_unlocked (c, fp);
>    _IO_release_lock (fp);
> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
> index c7d1b8f413..0b3fa942f2 100644
> --- a/nptl/pthread_create.c
> +++ b/nptl/pthread_create.c
> @@ -32,6 +32,7 @@
>  #include <exit-thread.h>
>  #include <default-sched.h>
>  #include <futex-internal.h>
> +#include "libioP.h"
>  
>  #include <shlib-compat.h>
>  
> @@ -756,6 +757,9 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
>          collect_default_sched (pd);
>      }
>  
> +  if (__glibc_unlikely (__nptl_nthreads == 1))
> +    _IO_enable_locks ();
> +
>    /* Pass the descriptor to the caller.  */
>    *newthread = (pthread_t) pd;
>  
> diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
> index 7fe8e99161..a8e6c28ed9 100644
> --- a/sysdeps/pthread/flockfile.c
> +++ b/sysdeps/pthread/flockfile.c
> @@ -25,6 +25,7 @@
>  void
>  __flockfile (FILE *stream)
>  {
> +  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
>    _IO_lock_lock (*stream->_lock);
>  }
>  strong_alias (__flockfile, _IO_flockfile)
> 


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-29 11:41     ` Siddhesh Poyarekar
@ 2017-06-29 12:01       ` Siddhesh Poyarekar
  2017-06-29 12:12         ` Carlos O'Donell
  2017-06-29 13:29       ` Szabolcs Nagy
  1 sibling, 1 reply; 31+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-29 12:01 UTC (permalink / raw)
  To: Szabolcs Nagy, Joseph Myers; +Cc: nd, GNU C Library, triegel

On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
> The patch looks OK except for the duplication (and a missing comment
> below), which looks a bit clumsy.  How about something like this instead:
> 
>   bool need_lock = _IO_need_lock (fp);
> 
>   if (need_lock)
>     _IO_flockfile (fp);
>   result = _IO_ferror_unlocked (fp);
>   if (need_lock)
>     _IO_funlockfile (fp);
> 
>   return result;
> 
> You could probably make some kind of a macro out of this, I haven't
> looked that hard.

I forgot that Torvald had commented (off-list, the thread broke somehow)
that it would be important to try and measure how much this makes
the multi-threaded case worse.

Siddhesh


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-29 12:01       ` Siddhesh Poyarekar
@ 2017-06-29 12:12         ` Carlos O'Donell
  2017-06-29 13:44           ` Szabolcs Nagy
  2017-06-30 12:16           ` Szabolcs Nagy
  0 siblings, 2 replies; 31+ messages in thread
From: Carlos O'Donell @ 2017-06-29 12:12 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Szabolcs Nagy, Joseph Myers
  Cc: nd, GNU C Library, triegel

On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>> The patch looks OK except for the duplication (and a missing comment
>> below), which looks a bit clumsy.  How about something like this instead:
>>
>>   bool need_lock = _IO_need_lock (fp);
>>
>>   if (need_lock)
>>     _IO_flockfile (fp);
>>   result = _IO_ferror_unlocked (fp);
>>   if (need_lock)
>>     _IO_funlockfile (fp);
>>
>>   return result;
>>
>> You could probably make some kind of a macro out of this, I haven't
>> looked that hard.
> 
> I forgot that Torvald had commented (off-list, the thread broke somehow)
> that it would be important to try and measure how much worse this makes
> the multi-threaded case worse.

+1

If we are going to optimize the single threaded case we need to know what
impact this has on the multi-threaded case.

-- 
Cheers,
Carlos.


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-29 11:41     ` Siddhesh Poyarekar
  2017-06-29 12:01       ` Siddhesh Poyarekar
@ 2017-06-29 13:29       ` Szabolcs Nagy
  2017-06-29 14:06         ` Siddhesh Poyarekar
  1 sibling, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-29 13:29 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Joseph Myers; +Cc: nd, GNU C Library, triegel

On 29/06/17 12:41, Siddhesh Poyarekar wrote:
>> @@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
>>  {
>>    int result;
>>    CHECK_FILE (fp, EOF);
>> +  if (!_IO_need_lock (fp))
>> +    return _IO_ferror_unlocked (fp);
>>    _IO_flockfile (fp);
>>    result = _IO_ferror_unlocked (fp);
> 
> The patch looks OK except for the duplication (and a missing comment
> below), which looks a bit clumsy.  How about something like this instead:
> 
>   bool need_lock = _IO_need_lock (fp);
> 
>   if (need_lock)
>     _IO_flockfile (fp);
>   result = _IO_ferror_unlocked (fp);
>   if (need_lock)
>     _IO_funlockfile (fp);
> 
>   return result;
> 
> You could probably make some kind of a macro out of this, I haven't
> looked that hard.
> 

this does not do what we want: a single check
plus a tail call to the unlocked version.

with your code gcc is more likely to generate
two checks, register spills and an additional
return.

while my code does not force gcc to do the
right thing, at least it shows the intent.
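
for concreteness, the two shapes i'm comparing (sketch only,
actual codegen depends on the compiler):

  /* patch: one check, the unlocked call can be a tail call */
  if (!_IO_need_lock (fp))
    return _IO_getc_unlocked (fp);
  _IO_acquire_lock (fp);
  result = _IO_getc_unlocked (fp);
  _IO_release_lock (fp);
  return result;

  /* suggested form: need_lock is tested twice and result has to be
     kept live across the unlock, so the call cannot be a tail call */
  bool need_lock = _IO_need_lock (fp);
  if (need_lock)
    _IO_flockfile (fp);
  result = _IO_getc_unlocked (fp);
  if (need_lock)
    _IO_funlockfile (fp);
  return result;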


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-29 12:12         ` Carlos O'Donell
@ 2017-06-29 13:44           ` Szabolcs Nagy
  2017-06-30 12:16           ` Szabolcs Nagy
  1 sibling, 0 replies; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-29 13:44 UTC (permalink / raw)
  To: Carlos O'Donell, Siddhesh Poyarekar, Joseph Myers
  Cc: nd, GNU C Library, triegel

On 29/06/17 13:12, Carlos O'Donell wrote:
> On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
>> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>>> The patch looks OK except for the duplication (and a missing comment
>>> below), which looks a bit clumsy.  How about something like this instead:
>>>
>>>   bool need_lock = _IO_need_lock (fp);
>>>
>>>   if (need_lock)
>>>     _IO_flockfile (fp);
>>>   result = _IO_ferror_unlocked (fp);
>>>   if (need_lock)
>>>     _IO_funlockfile (fp);
>>>
>>>   return result;
>>>
>>> You could probably make some kind of a macro out of this, I haven't
>>> looked that hard.
>>
>> I forgot that Torvald had commented (off-list, the thread broke somehow)
>> that it would be important to try and measure how much worse this makes
>> the multi-threaded case worse.
> 
> +1
> 
> If we are going to optimize the single threaded case we need to know what
> impact this has on the multi-threaded case.
> 

note that this impacts the multi-threaded case less
than the lowlevellock approach that is currently
implemented: that adds two checks and loads
__libc_multiple_threads twice, while my code does a
single check of a flag in fp, which is in a cache line
that is most likely already accessed by the rest
of the io code.

i cannot produce numbers immediately: the last
time i measured this, adding a dummy thread via an
ld-preloaded lib had more effect on the timing
of the same binary than adding a branch in the
stdio code (i'm not sure why the additional thread
affects timing so much with the current code, it
might be a cpu issue, e.g. cache aliasing caused
by slightly different layout of the loaded libs,
but it also shows that the effect of the patch
is small)


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-29 13:29       ` Szabolcs Nagy
@ 2017-06-29 14:06         ` Siddhesh Poyarekar
  2017-06-30 12:38           ` Szabolcs Nagy
  0 siblings, 1 reply; 31+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-29 14:06 UTC (permalink / raw)
  To: Szabolcs Nagy, Joseph Myers; +Cc: nd, GNU C Library, triegel

On Thursday 29 June 2017 06:59 PM, Szabolcs Nagy wrote:
> this does not do what we want:
> single check + tailcall the unlocked version.
> 
> with your code gcc is more likely to do two
> checks and register spills and an additional
> return.
> 
> while my code does not force gcc to do the
> right thing at least it shows the intent.

Can you verify this both ways?  If they don't show any noticeable
difference then I prefer the way I suggested, but that is really just
for aesthetics.  Maybe also provide a __glibc_unlikely hint to aid the
compiler.

Siddhesh


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-29 12:12         ` Carlos O'Donell
  2017-06-29 13:44           ` Szabolcs Nagy
@ 2017-06-30 12:16           ` Szabolcs Nagy
  2017-06-30 13:16             ` Carlos O'Donell
  2017-06-30 15:18             ` [PATCH v2] " Torvald Riegel
  1 sibling, 2 replies; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-30 12:16 UTC (permalink / raw)
  To: Carlos O'Donell, Siddhesh Poyarekar, Joseph Myers
  Cc: nd, GNU C Library, triegel

On 29/06/17 13:12, Carlos O'Donell wrote:
> On 06/29/2017 08:01 AM, Siddhesh Poyarekar wrote:
>> On Thursday 29 June 2017 05:11 PM, Siddhesh Poyarekar wrote:
>>> The patch looks OK except for the duplication (and a missing comment
>>> below), which looks a bit clumsy.  How about something like this instead:
>>>
>>>   bool need_lock = _IO_need_lock (fp);
>>>
>>>   if (need_lock)
>>>     _IO_flockfile (fp);
>>>   result = _IO_ferror_unlocked (fp);
>>>   if (need_lock)
>>>     _IO_funlockfile (fp);
>>>
>>>   return result;
>>>
>>> You could probably make some kind of a macro out of this, I haven't
>>> looked that hard.
>>
>> I forgot that Torvald had commented (off-list, the thread broke somehow)
>> that it would be important to try and measure how much worse this makes
>> the multi-threaded case worse.
> 
> +1
> 
> If we are going to optimize the single threaded case we need to know what
> impact this has on the multi-threaded case.
> 

$orig == current
$stdio == my patch
$stdio_mt == my patch but 'needs lock' flag is set so multithread path is taken

on two particular aarch64 cpus with a particular loop count:

cpu1
time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
8.08user 0.04system 0:08.12elapsed 100%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
1.07user 0.04system 0:01.11elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
7.87user 0.00system 0:07.88elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps

cpu2
time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
8.11user 0.04system 0:08.16elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
2.29user 0.06system 0:02.35elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
8.12user 0.03system 0:08.16elapsed 99%CPU (0avgtext+0avgdata 1472maxresident)k
0inputs+0outputs (0major+40minor)pagefaults 0swaps

on a particular x86_64 cpu with particular loop count:

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
0inputs+0outputs (0major+153minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps
time $stdio_mt/lib64/ld-2.25.90.so --library-path $stdio_mt/lib64 ./getchar
6.76user 0.08system 0:06.87elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
0inputs+0outputs (0major+155minor)pagefaults 0swaps

in summary: on aarch64 i see no regression (in some cases stdio_mt
becomes faster, which can happen since the layout of the code changed);
on this particular x86 cpu stdio_mt has a close to 15% regression.

i don't believe the big regression on x86 is valid: it could
be that the benchmark just hit some cpu internal limit
or the code got aligned differently; in fact if i static link
the exact same code then on the same cpu i get

time ./getchar_static-orig
6.60user 0.05system 0:06.66elapsed 99%CPU (0avgtext+0avgdata 912maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps
time ./getchar_static-stdio
2.24user 0.08system 0:02.33elapsed 99%CPU (0avgtext+0avgdata 896maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps
time ./getchar_static-stdio_mt
6.50user 0.06system 0:06.57elapsed 99%CPU (0avgtext+0avgdata 896maxresident)k
0inputs+0outputs (0major+81minor)pagefaults 0swaps

i.e. now the version with the added branch is faster! (both
measurements are repeatable)

i didn't dig into the root cause of the regression (or into
why static linking is slower); i would not be too
worried about it since the common case for hot stdio
loops is in single threaded processes, where even on x86
the patch gives a >2x speedup.
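
(for reference, the ./getchar benchmark is essentially a tight
getchar loop; this is a hypothetical reconstruction, the real
loop count and input setup differ:)

  /* hypothetical reconstruction of the getchar microbenchmark,
     not the exact source used for the numbers above.  */
  #include <stdio.h>

  int
  main (void)
  {
    unsigned long i, n = 100000000;          /* assumed loop count */
    unsigned long sum = 0;
    for (i = 0; i < n; i++)
      sum += (unsigned char) getchar ();     /* stdin from /dev/zero */
    return sum != 0;
  }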


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-29 14:06         ` Siddhesh Poyarekar
@ 2017-06-30 12:38           ` Szabolcs Nagy
  2017-06-30 15:17             ` Siddhesh Poyarekar
  0 siblings, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-30 12:38 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Joseph Myers; +Cc: nd, GNU C Library, triegel

On 29/06/17 15:06, Siddhesh Poyarekar wrote:
> On Thursday 29 June 2017 06:59 PM, Szabolcs Nagy wrote:
>> this does not do what we want:
>> single check + tailcall the unlocked version.
>>
>> with your code gcc is more likely to do two
>> checks and register spills and an additional
>> return.
>>
>> while my code does not force gcc to do the
>> right thing at least it shows the intent.
> 
> Can you verify this both ways?  If they don't show any noticeable
> difference then I prefer the way I suggested, but that is really just
> for aesthetics.  Maybe also provide a __glibc_unlikely hint to aid the
> compiler.

your code layout does not work for most
io apis as they use _IO_acquire_lock/unlock
which create a new scope for cancel cleanup
handling.

it only works with non-cancellation point
apis (flag checks: feof, ferror) that just
flockfile/funlockfile.

i can change those two functions where it
works but i dont think it will be any more
maintainable.


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 12:16           ` Szabolcs Nagy
@ 2017-06-30 13:16             ` Carlos O'Donell
  2017-06-30 13:22               ` Adhemerval Zanella
  2017-06-30 14:39               ` Szabolcs Nagy
  2017-06-30 15:18             ` [PATCH v2] " Torvald Riegel
  1 sibling, 2 replies; 31+ messages in thread
From: Carlos O'Donell @ 2017-06-30 13:16 UTC (permalink / raw)
  To: Szabolcs Nagy, Siddhesh Poyarekar, Joseph Myers
  Cc: nd, GNU C Library, triegel

On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
> i didn't dig into the root cause of the regression (or
> why is static linking slower?), i would not be too
> worried about it since the common case for hot stdio
> loops is in single thread processes where even on x86
> the patch gives >2x speedup.

Regardless of the cause, the 15% regression on x86 MT performance
is serious, and I see no reason to push this into glibc 2.26. 
We can add it any time in 2.27, or the distros can pick it up with
a backport.

I would like to see a better characterization of the regression before
accepting this patch.

While I agree that the common case for hot stdio loops is non-MT, there
are still MT cases, and 15% is a large double-digit loss.

Have you looked at the assembly differences? What is the compiler
doing differently?

When a user asks "Why is my MT stdio 15% slower?" we owe them an
answer that is clear and concise.

-- 
Cheers,
Carlos.


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 13:16             ` Carlos O'Donell
@ 2017-06-30 13:22               ` Adhemerval Zanella
  2017-06-30 14:39               ` Szabolcs Nagy
  1 sibling, 0 replies; 31+ messages in thread
From: Adhemerval Zanella @ 2017-06-30 13:22 UTC (permalink / raw)
  To: libc-alpha



On 30/06/2017 10:16, Carlos O'Donell wrote:
> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>> i didn't dig into the root cause of the regression (or
>> why is static linking slower?), i would not be too
>> worried about it since the common case for hot stdio
>> loops is in single thread processes where even on x86
>> the patch gives >2x speedup.
> 
> Regardless of the cause, the 15% regression on x86 MT performance
> is serious, and I see no reason to push this into glibc 2.26. 
> We can add it any time in 2.27, or the distros can pick it up with
> a backport.
> 
> I would like to see a better characterization of the regression before
> accepting this patch.
> 
> While I agree that common case for hot stdio loops is non-MT, there
> are still MT cases, and 15% is a large double-digit loss.
> 
> Have you looked at the assembly differences? What is the compiler
> doing differently?
> 
> When our a user asks "Why is my MT stdio 15% slower?" We owe them an
> answer that is clear and concise.
> 

Just a wild guess, but some overhead on x86_64 might be due to the
internal locks now doing an unnecessary check for the single-thread
case (the __lll_lock_asm_start [1] snippet).  Did you try using the
simple variant, without the check for __libc_multiple_threads?

[1] ./sysdeps/unix/sysv/linux/x86_64/lowlevellock.h


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 13:16             ` Carlos O'Donell
  2017-06-30 13:22               ` Adhemerval Zanella
@ 2017-06-30 14:39               ` Szabolcs Nagy
  2017-06-30 16:07                 ` Carlos O'Donell
  1 sibling, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-30 14:39 UTC (permalink / raw)
  To: Carlos O'Donell, Siddhesh Poyarekar, Joseph Myers
  Cc: nd, GNU C Library, triegel

On 30/06/17 14:16, Carlos O'Donell wrote:
> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>> i didn't dig into the root cause of the regression (or
>> why is static linking slower?), i would not be too
>> worried about it since the common case for hot stdio
>> loops is in single thread processes where even on x86
>> the patch gives >2x speedup.
> 
> Regardless of the cause, the 15% regression on x86 MT performance
> is serious, and I see no reason to push this into glibc 2.26. 
> We can add it any time in 2.27, or the distros can pick it up with
> a backport.
> 
> I would like to see a better characterization of the regression before
> accepting this patch.
> 
> While I agree that common case for hot stdio loops is non-MT, there
> are still MT cases, and 15% is a large double-digit loss.
> 
> Have you looked at the assembly differences? What is the compiler
> doing differently?
> 
> When our a user asks "Why is my MT stdio 15% slower?" We owe them an
> answer that is clear and concise.
> 

sorry, the x86 measurement was bogus: only the high
level code thought the process was multithreaded, while
the lowlevellock code thought it was single threaded,
so no atomic ops were executed in the stdio_mt case.

with atomics the orig performance is significantly
slower, so the regression relative to that is small in %.

if i create a dummy thread (to measure true mt
behaviour, same loop count):

time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
0inputs+0outputs (0major+180minor)pagefaults 0swaps
time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
0inputs+0outputs (0major+179minor)pagefaults 0swaps

the relative diff is 2% now, but notice that the
abs diff went down too (which points to a uarch issue
in the previous measurement).

perf stat indicates that there are 15 vs 16 branches
in the loop (so my patch indeed adds one branch,
but there are plenty of branches already); the
instruction count goes from 43 to 45 per loop
iteration (flag check + branch).

in my previous measurements, how +1 branch can
decrease the performance by >10% when there are
already >10 branches (and several other insns)
is something the x86 uarchitects could explain.

in summary the patch trades 2% mt performance for
2x non-mt performance on this x86 cpu.
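
(the dummy thread is just something like the sketch below, e.g. in
an ld-preloaded lib, kept alive so the process stays multi-threaded
for the whole run; hypothetical, not the exact code i used:)

  /* hypothetical sketch of the dummy thread forcing the mt path */
  #include <pthread.h>
  #include <unistd.h>

  static void *
  dummy (void *arg)
  {
    pause ();          /* sleep forever, just keep a second thread alive */
    return arg;
  }

  __attribute__ ((constructor))
  static void
  start_dummy (void)
  {
    pthread_t t;
    pthread_create (&t, NULL, dummy, NULL);
  }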


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 12:38           ` Szabolcs Nagy
@ 2017-06-30 15:17             ` Siddhesh Poyarekar
  0 siblings, 0 replies; 31+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-30 15:17 UTC (permalink / raw)
  To: Szabolcs Nagy, Joseph Myers; +Cc: nd, GNU C Library, triegel

On Friday 30 June 2017 06:08 PM, Szabolcs Nagy wrote:
> your code layout does not work for most
> io apis as they use _IO_acquire_lock/unlock
> which create a new scope for cancel cleanup
> handling.
> 
> it only works with non-cancellation point
> apis (flag checks: feof, ferror) that just
> flockfile/funlockfile.
> 
> i can change those two functions where it
> works but i dont think it will be any more
> maintainable.

Right, that won't work.  There should be a cleaner way to do both, but I
can't think of one right now, so I'll let that go.

So the only open points right now are the comment I asked you to add
describing what/why you're doing this for stdio and the performance
question on x86.  So Carlos, etc. who are monitoring this will tell you
when they're satisfied with your updated results.

Siddhesh


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 12:16           ` Szabolcs Nagy
  2017-06-30 13:16             ` Carlos O'Donell
@ 2017-06-30 15:18             ` Torvald Riegel
  2017-06-30 15:24               ` Siddhesh Poyarekar
                                 ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Torvald Riegel @ 2017-06-30 15:18 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Carlos O'Donell, Siddhesh Poyarekar, Joseph Myers, nd, GNU C Library

On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
> on a particular x86_64 cpu with particular loop count:
> 
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
> 0inputs+0outputs (0major+155minor)pagefaults 0swaps

What's interesting here is that your high-level optimization is faster
than doing the single-thread check in the low-level lock (x86 has it
already in the low-level lock).


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 15:18             ` [PATCH v2] " Torvald Riegel
@ 2017-06-30 15:24               ` Siddhesh Poyarekar
  2017-06-30 15:35                 ` Torvald Riegel
  2017-06-30 15:32               ` Ramana Radhakrishnan
  2017-06-30 15:51               ` Szabolcs Nagy
  2 siblings, 1 reply; 31+ messages in thread
From: Siddhesh Poyarekar @ 2017-06-30 15:24 UTC (permalink / raw)
  To: Torvald Riegel, Szabolcs Nagy
  Cc: Carlos O'Donell, Joseph Myers, nd, GNU C Library

On Friday 30 June 2017 08:48 PM, Torvald Riegel wrote:
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).

Cache locality might be a factor - the fp struct may already be in cache
as opposed to the TLS block from which one gets the value of
multiple_threads.

Siddhesh


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 15:18             ` [PATCH v2] " Torvald Riegel
  2017-06-30 15:24               ` Siddhesh Poyarekar
@ 2017-06-30 15:32               ` Ramana Radhakrishnan
  2017-06-30 15:51               ` Szabolcs Nagy
  2 siblings, 0 replies; 31+ messages in thread
From: Ramana Radhakrishnan @ 2017-06-30 15:32 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Szabolcs Nagy, Carlos O'Donell, Siddhesh Poyarekar,
	Joseph Myers, nd, GNU C Library

On Fri, Jun 30, 2017 at 4:18 PM, Torvald Riegel <triegel@redhat.com> wrote:
> On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
>> on a particular x86_64 cpu with particular loop count:
>>
>> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
>> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
>> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
>> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
>> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
>> 0inputs+0outputs (0major+155minor)pagefaults 0swaps
>
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).

Isn't that the expected behaviour here?

Ramana



>


* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 15:24               ` Siddhesh Poyarekar
@ 2017-06-30 15:35                 ` Torvald Riegel
  0 siblings, 0 replies; 31+ messages in thread
From: Torvald Riegel @ 2017-06-30 15:35 UTC (permalink / raw)
  To: Siddhesh Poyarekar
  Cc: Szabolcs Nagy, Carlos O'Donell, Joseph Myers, nd, GNU C Library

On Fri, 2017-06-30 at 20:54 +0530, Siddhesh Poyarekar wrote:
> On Friday 30 June 2017 08:48 PM, Torvald Riegel wrote:
> > What's interesting here is that your high-level optimization is faster
> > than doing the single-thread check in the low-level lock (x86 has it
> > already in the low-level lock).
> 
> Cache locality might be a factor - the fp struct may already be in cache
> as opposed to the TLS block from which one gets the value of
> multiple_threads.

And the check can happen earlier in the code, and at a higher level.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 15:18             ` [PATCH v2] " Torvald Riegel
  2017-06-30 15:24               ` Siddhesh Poyarekar
  2017-06-30 15:32               ` Ramana Radhakrishnan
@ 2017-06-30 15:51               ` Szabolcs Nagy
  2 siblings, 0 replies; 31+ messages in thread
From: Szabolcs Nagy @ 2017-06-30 15:51 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: nd, Carlos O'Donell, Siddhesh Poyarekar, Joseph Myers, GNU C Library

On 30/06/17 16:18, Torvald Riegel wrote:
> On Fri, 2017-06-30 at 13:15 +0100, Szabolcs Nagy wrote:
>> on a particular x86_64 cpu with particular loop count:
>>
>> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar
>> 5.89user 0.07system 0:05.98elapsed 99%CPU (0avgtext+0avgdata 2000maxresident)k
>> 0inputs+0outputs (0major+153minor)pagefaults 0swaps
>> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar
>> 2.66user 0.06system 0:02.73elapsed 99%CPU (0avgtext+0avgdata 2032maxresident)k
>> 0inputs+0outputs (0major+155minor)pagefaults 0swaps
> 
> What's interesting here is that your high-level optimization is faster
> than doing the single-thread check in the low-level lock (x86 has it
> already in the low-level lock).
> 

musl has implemented stdio this way since 2011, which
is the reason for the 2x difference in getc/putc (i386):
http://www.etalabs.net/compare_libcs.html
we knew this already, but pushing a high-level optimization
in glibc is harder than doing a target-specific lowlevellock
optimization.

musl does not have to worry about fopencookie, libc-internal
malloc interposition, printf hooks, magic IO flags or
target-specific hacks, so the optimization is more obviously
correct there, but now that the patch only omits locks in
specific functions the correctness should be easier to verify.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-30 14:39               ` Szabolcs Nagy
@ 2017-06-30 16:07                 ` Carlos O'Donell
  2017-07-03 12:16                   ` [PATCH v4] " Szabolcs Nagy
  0 siblings, 1 reply; 31+ messages in thread
From: Carlos O'Donell @ 2017-06-30 16:07 UTC (permalink / raw)
  To: Szabolcs Nagy, Siddhesh Poyarekar, Joseph Myers
  Cc: nd, GNU C Library, triegel

On 06/30/2017 10:39 AM, Szabolcs Nagy wrote:
> On 30/06/17 14:16, Carlos O'Donell wrote:
>> On 06/30/2017 08:15 AM, Szabolcs Nagy wrote:
>>> i didn't dig into the root cause of the regression (or
>>> why is static linking slower?), i would not be too
>>> worried about it since the common case for hot stdio
>>> loops is in single thread processes where even on x86
>>> the patch gives >2x speedup.
>>
>> Regardless of the cause, the 15% regression on x86 MT performance
>> is serious, and I see no reason to push this into glibc 2.26. 
>> We can add it any time in 2.27, or the distros can pick it up with
>> a backport.
>>
>> I would like to see a better characterization of the regression before
>> accepting this patch.
>>
>> While I agree that common case for hot stdio loops is non-MT, there
>> are still MT cases, and 15% is a large double-digit loss.
>>
>> Have you looked at the assembly differences? What is the compiler
>> doing differently?
>>
>> When our a user asks "Why is my MT stdio 15% slower?" We owe them an
>> answer that is clear and concise.
>>
> 
> sorry, the x86 measurement was bogus: only the high-level
> code thought it was multithreaded, while the lowlevellock
> code thought it was single-threaded, so there were no
> atomic ops executed in the stdio_mt case.

OK.

> with atomics the original performance is significantly
> slower, so the regression relative to that is small in
> percentage terms.
> 
> if i create a dummy thread (to measure true mt
> behaviour, same loop count):
> 
> time $orig/lib64/ld-2.25.90.so --library-path $orig/lib64 ./getchar_mt
> 20.31user 0.11system 0:20.47elapsed 99%CPU (0avgtext+0avgdata 2416maxresident)k
> 0inputs+0outputs (0major+180minor)pagefaults 0swaps
> time $stdio/lib64/ld-2.25.90.so --library-path $stdio/lib64 ./getchar_mt
> 20.72user 0.03system 0:20.79elapsed 99%CPU (0avgtext+0avgdata 2400maxresident)k
> 0inputs+0outputs (0major+179minor)pagefaults 0swaps
> 
> the relative diff is 2% now, but notice that the
> absolute diff went down too (which points to a uarch
> issue in the previous measurement).

OK. This is much better.

> perf stat indicates that there are 15 vs 16 branches
> in the loop (so my patch indeed adds one branch, but
> there are plenty of branches already); the instruction
> count goes from 43 to 45 per loop iteration
> (flag check + branch).
> 
> as for my previous measurements, how +1 branch can
> decrease the performance by >10% when there are
> already >10 branches (and several other insns)
> is something the x86 uarchitects could explain.
> 
> in summary the patch trades 2% mt performance for
> a 2x non-mt speedup on this x86 cpu.
 
Excellent, this is exactly the analysis I was looking for, and this kind
of result is something that can make sense to our users.

I'm OK with the patch for 2.26.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v4] Single threaded stdio optimization
  2017-06-30 16:07                 ` Carlos O'Donell
@ 2017-07-03 12:16                   ` Szabolcs Nagy
  0 siblings, 0 replies; 31+ messages in thread
From: Szabolcs Nagy @ 2017-07-03 12:16 UTC (permalink / raw)
  To: Carlos O'Donell, Siddhesh Poyarekar, Joseph Myers
  Cc: nd, GNU C Library, triegel

[-- Attachment #1: Type: text/plain, Size: 2731 bytes --]

On 30/06/17 17:07, Carlos O'Donell wrote:
> 
> I'm OK with the patch for 2.26.
> 

v4: i made two modifications:

i dropped the hunk

 #if _G_HAVE_MMAP

-# ifdef _LIBC
+# if IS_IN (libc)
 /* When using this code in the GNU libc we must not pollute the name space.  */
 #  define mmap __mmap
 #  define munmap __munmap

because
https://sourceware.org/ml/libc-alpha/2017-06/msg00854.html
got committed, which made __mmap work in pthread_create;

and i added comments to the _IO_enable_locks definition.

if there are no complaints about these changes i'll commit
the patch tomorrow as:

Locking overhead can be significant in some stdio operations
that are common in single threaded applications.

This patch adds the _IO_FLAGS2_NEED_LOCK flag to indicate if
an _IO_FILE object needs to be locked and some of the stdio
functions just jump to their _unlocked variant when not.  The
flag is set on all _IO_FILE objects when the first thread is
created.  A new GLIBC_PRIVATE libc symbol, _IO_enable_locks,
was added to do this from libpthread.

The optimization can be applied to more stdio functions;
currently it is only applied to single flag checks and single
non-wide-char standard operations.  The flag should probably
never be set for files with _IO_USER_LOCK, but that's just a
further optimization, not a correctness requirement.

The optimization is valid in a single thread because stdio
operations are not async-signal-safe (so lock state is not
observable from a signal handler) and stdio locks are recursive
(so lock state is not observable via deadlock).  The optimization
is not valid if a thread may be created while an stdio lock is
held, and thus it should be disabled if any user code may run
during an stdio operation (interposed malloc, printf hooks, etc).
This makes the optimization more complicated for some stdio
operations (e.g. printf), but those are bigger and thus less
important to optimize, so this patch does not try to do that.
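
(to make the hazard concrete, here is a small self-contained example
of the pattern the flockfile change guards against; this is only an
illustration, not part of the patch: a thread is created while the
stdio lock is held, so the new thread must not take the lock-free
fast path on that FILE, which is why __flockfile sets
_IO_FLAGS2_NEED_LOCK)

/* Illustration only: a FILE is locked with flockfile while the
   process is still single-threaded, then the first thread is
   created.  The new thread's putc calls must still acquire the
   stream lock, otherwise they would race with the lock owner.
   Compile with -pthread.  */
#include <pthread.h>
#include <stdio.h>

static void *
writer (void *arg)
{
  FILE *fp = arg;
  for (int i = 0; i < 1000; i++)
    putc ('B', fp);               /* must honor the stream lock */
  return NULL;
}

int
main (void)
{
  FILE *fp = stdout;
  flockfile (fp);                 /* lock taken before any thread exists */

  pthread_t t;
  pthread_create (&t, NULL, writer, fp);  /* first thread created here */

  for (int i = 0; i < 1000; i++)
    putc_unlocked ('A', fp);      /* owner may use the _unlocked calls */

  funlockfile (fp);
  pthread_join (t, NULL);
  return 0;
}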

2017-07-03  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	* libio/libio.h (_IO_FLAGS2_NEED_LOCK, _IO_need_lock): Define.
	* libio/libioP.h (_IO_enable_locks): Declare.
	* libio/Versions (_IO_enable_locks): New symbol.
	* libio/genops.c (_IO_enable_locks): Define.
	(_IO_old_init): Initialize flags2.
	* libio/feof.c (_IO_feof): Avoid locking when not needed.
	* libio/ferror.c (_IO_ferror): Likewise.
	* libio/fputc.c (fputc): Likewise.
	* libio/putc.c (_IO_putc): Likewise.
	* libio/getc.c (_IO_getc): Likewise.
	* libio/getchar.c (getchar): Likewise.
	* libio/ioungetc.c (_IO_ungetc): Likewise.
	* nptl/pthread_create.c (__pthread_create_2_1): Enable stdio locks.
	* libio/iofopncook.c (_IO_fopencookie): Enable locking for the file.
	* sysdeps/pthread/flockfile.c (__flockfile): Likewise.


[-- Attachment #2: stdio.diff --]
[-- Type: text/x-patch, Size: 6965 bytes --]

diff --git a/libio/Versions b/libio/Versions
index 2a1d6e6c85..77123347e3 100644
--- a/libio/Versions
+++ b/libio/Versions
@@ -155,5 +155,8 @@ libc {
   GLIBC_PRIVATE {
     # Used by NPTL and librt
     __libc_fatal;
+
+    # Used by NPTL
+    _IO_enable_locks;
   }
 }
diff --git a/libio/feof.c b/libio/feof.c
index 9712a81e78..8890a5f51f 100644
--- a/libio/feof.c
+++ b/libio/feof.c
@@ -32,6 +32,8 @@ _IO_feof (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_feof_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_feof_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/ferror.c b/libio/ferror.c
index 01e3bd8e2b..d10fcd9fff 100644
--- a/libio/ferror.c
+++ b/libio/ferror.c
@@ -32,6 +32,8 @@ _IO_ferror (_IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_ferror_unlocked (fp);
   _IO_flockfile (fp);
   result = _IO_ferror_unlocked (fp);
   _IO_funlockfile (fp);
diff --git a/libio/fputc.c b/libio/fputc.c
index a7cd682fe2..b72305c06f 100644
--- a/libio/fputc.c
+++ b/libio/fputc.c
@@ -32,6 +32,8 @@ fputc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/libio/genops.c b/libio/genops.c
index a466cfa337..6ad7346cae 100644
--- a/libio/genops.c
+++ b/libio/genops.c
@@ -570,11 +570,39 @@ _IO_init (_IO_FILE *fp, int flags)
   _IO_init_internal (fp, flags);
 }
 
+static int stdio_needs_locking;
+
+/* In a single-threaded process most stdio locks can be omitted.  After
+   _IO_enable_locks is called, locks are not optimized away any more.
+   It must be first called while the process is still single-threaded.
+
+   This lock optimization can be disabled on a per-file basis by setting
+   _IO_FLAGS2_NEED_LOCK, because a file can have user-defined callbacks
+   or can be locked with flockfile and then a thread may be created
+   between a lock and unlock, so omitting the lock is not valid.
+
+   Here we have to make sure that the flag is set on all existing files
+   and files created later.  */
+void
+_IO_enable_locks (void)
+{
+  _IO_ITER i;
+
+  if (stdio_needs_locking)
+    return;
+  stdio_needs_locking = 1;
+  for (i = _IO_iter_begin (); i != _IO_iter_end (); i = _IO_iter_next (i))
+    _IO_iter_file (i)->_flags2 |= _IO_FLAGS2_NEED_LOCK;
+}
+libc_hidden_def (_IO_enable_locks)
+
 void
 _IO_old_init (_IO_FILE *fp, int flags)
 {
   fp->_flags = _IO_MAGIC|flags;
   fp->_flags2 = 0;
+  if (stdio_needs_locking)
+    fp->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   fp->_IO_buf_base = NULL;
   fp->_IO_buf_end = NULL;
   fp->_IO_read_base = NULL;
diff --git a/libio/getc.c b/libio/getc.c
index b58fd62308..fd66ef93cf 100644
--- a/libio/getc.c
+++ b/libio/getc.c
@@ -34,6 +34,8 @@ _IO_getc (FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_getc_unlocked (fp);
   _IO_acquire_lock (fp);
   result = _IO_getc_unlocked (fp);
   _IO_release_lock (fp);
diff --git a/libio/getchar.c b/libio/getchar.c
index 5b41595d17..d79932114e 100644
--- a/libio/getchar.c
+++ b/libio/getchar.c
@@ -33,6 +33,8 @@ int
 getchar (void)
 {
   int result;
+  if (!_IO_need_lock (_IO_stdin))
+    return _IO_getc_unlocked (_IO_stdin);
   _IO_acquire_lock (_IO_stdin);
   result = _IO_getc_unlocked (_IO_stdin);
   _IO_release_lock (_IO_stdin);
diff --git a/libio/iofopncook.c b/libio/iofopncook.c
index a08dfdaa42..982f464a68 100644
--- a/libio/iofopncook.c
+++ b/libio/iofopncook.c
@@ -172,6 +172,8 @@ _IO_cookie_init (struct _IO_cookie_file *cfile, int read_write,
   _IO_mask_flags (&cfile->__fp.file, read_write,
 		  _IO_NO_READS+_IO_NO_WRITES+_IO_IS_APPENDING);
 
+  cfile->__fp.file._flags2 |= _IO_FLAGS2_NEED_LOCK;
+
   /* We use a negative number different from -1 for _fileno to mark that
      this special stream is not associated with a real file, but still has
      to be treated as such.  */
diff --git a/libio/ioungetc.c b/libio/ioungetc.c
index 951064fa12..917cad8abb 100644
--- a/libio/ioungetc.c
+++ b/libio/ioungetc.c
@@ -33,6 +33,8 @@ _IO_ungetc (int c, _IO_FILE *fp)
   CHECK_FILE (fp, EOF);
   if (c == EOF)
     return EOF;
+  if (!_IO_need_lock (fp))
+    return _IO_sputbackc (fp, (unsigned char) c);
   _IO_acquire_lock (fp);
   result = _IO_sputbackc (fp, (unsigned char) c);
   _IO_release_lock (fp);
diff --git a/libio/libio.h b/libio/libio.h
index 518ffd8e44..14bcb92332 100644
--- a/libio/libio.h
+++ b/libio/libio.h
@@ -119,6 +119,7 @@
 # define _IO_FLAGS2_SCANF_STD 16
 # define _IO_FLAGS2_NOCLOSE 32
 # define _IO_FLAGS2_CLOEXEC 64
+# define _IO_FLAGS2_NEED_LOCK 128
 #endif
 
 /* These are "formatting flags" matching the iostream fmtflags enum values. */
@@ -451,6 +452,9 @@ extern int _IO_ftrylockfile (_IO_FILE *) __THROW;
 #define _IO_cleanup_region_end(_Doit) /**/
 #endif
 
+#define _IO_need_lock(_fp) \
+  (((_fp)->_flags2 & _IO_FLAGS2_NEED_LOCK) != 0)
+
 extern int _IO_vfscanf (_IO_FILE * __restrict, const char * __restrict,
 			_IO_va_list, int *__restrict);
 extern int _IO_vfprintf (_IO_FILE *__restrict, const char *__restrict,
diff --git a/libio/libioP.h b/libio/libioP.h
index eb93418c8d..1832b44cc7 100644
--- a/libio/libioP.h
+++ b/libio/libioP.h
@@ -444,6 +444,8 @@ extern void _IO_list_unlock (void) __THROW;
 libc_hidden_proto (_IO_list_unlock)
 extern void _IO_list_resetlock (void) __THROW;
 libc_hidden_proto (_IO_list_resetlock)
+extern void _IO_enable_locks (void) __THROW;
+libc_hidden_proto (_IO_enable_locks)
 
 /* Default jumptable functions. */
 
diff --git a/libio/putc.c b/libio/putc.c
index b591c5538b..6e1fdeef3a 100644
--- a/libio/putc.c
+++ b/libio/putc.c
@@ -25,6 +25,8 @@ _IO_putc (int c, _IO_FILE *fp)
 {
   int result;
   CHECK_FILE (fp, EOF);
+  if (!_IO_need_lock (fp))
+    return _IO_putc_unlocked (c, fp);
   _IO_acquire_lock (fp);
   result = _IO_putc_unlocked (c, fp);
   _IO_release_lock (fp);
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index 7a970ffc5b..2f8ada34d6 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -32,6 +32,7 @@
 #include <exit-thread.h>
 #include <default-sched.h>
 #include <futex-internal.h>
+#include "libioP.h"
 
 #include <shlib-compat.h>
 
@@ -756,6 +757,9 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
         collect_default_sched (pd);
     }
 
+  if (__glibc_unlikely (__nptl_nthreads == 1))
+    _IO_enable_locks ();
+
   /* Pass the descriptor to the caller.  */
   *newthread = (pthread_t) pd;
 
diff --git a/sysdeps/pthread/flockfile.c b/sysdeps/pthread/flockfile.c
index 7fe8e99161..a8e6c28ed9 100644
--- a/sysdeps/pthread/flockfile.c
+++ b/sysdeps/pthread/flockfile.c
@@ -25,6 +25,7 @@
 void
 __flockfile (FILE *stream)
 {
+  stream->_flags2 |= _IO_FLAGS2_NEED_LOCK;
   _IO_lock_lock (*stream->_lock);
 }
 strong_alias (__flockfile, _IO_flockfile)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-06-21 16:37 [PATCH v2] Single threaded stdio optimization Szabolcs Nagy
  2017-06-21 16:52 ` Joseph Myers
@ 2017-07-07 12:38 ` Florian Weimer
  2017-07-07 13:38   ` Szabolcs Nagy
  1 sibling, 1 reply; 31+ messages in thread
From: Florian Weimer @ 2017-07-07 12:38 UTC (permalink / raw)
  To: Szabolcs Nagy, GNU C Library; +Cc: nd, triegel

[-- Attachment #1: Type: text/plain, Size: 423 bytes --]

This patch breaks the attached test case.  Not all stream objects are
linked into the global list, so the locking upgrade doesn't happen for
some of them.

The global list management is quite expensive because the list is
singly linked, so we can't just add all stream objects when not yet
running multi-threaded.

I fear that this patch may have to be backed out again, until we can
address these issues.

Thanks,
Florian

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: memstream-thread.c --]
[-- Type: text/x-csrc; name="memstream-thread.c", Size: 1824 bytes --]

#include <err.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

enum
  {
    thread_count = 2,
    byte_count = 1000000,
  };

struct closure
{
  FILE *fp;
  char b;
};

static void *
thread_func (void *closure)
{
  struct closure *args = closure;

  for (int i = 0; i < byte_count; ++i)
    fputc (args->b, args->fp);

  return NULL;
}

int
main (void)
{
  char *buffer = NULL;
  size_t buffer_length = 0;
  FILE *fp = open_memstream (&buffer, &buffer_length);
  if (fp == NULL)
    err (1, "open_memstream");

  pthread_t threads[thread_count];
  struct closure args[thread_count];

  for (int i = 0; i < thread_count; ++i)
    {
      args[i].fp = fp;
      args[i].b = 'A' + i;
      int ret = pthread_create (threads + i, NULL, thread_func, args + i);
      if (ret != 0)
        {
          errno = ret;
          err (1, "pthread_create");
        }
    }

  for (int i = 0; i < thread_count; ++i)
    {
      int ret = pthread_join (threads[i], NULL);
      if (ret != 0)
        {
          errno = ret;
          err (1, "pthread_join");
        }
    }

  fclose (fp);

  if (buffer_length != thread_count * byte_count)
    errx (1, "unexpected number of written bytes: %zu (should be %d)",
          buffer_length, thread_count * byte_count);

  size_t counts[thread_count] = { 0, };
  for (size_t i = 0; i < buffer_length; ++i)
    {
      if (buffer[i] < 'A' || buffer[i] >= 'A' + thread_count)
        errx (1, "written byte at %zu out of range: %d", i, buffer[i] & 0xFF);
      ++counts[buffer[i] - 'A'];
    }
  for (int i = 0; i < thread_count; ++i)
    if (counts[i] != byte_count)
      errx (1, "incorrect write count for thread %d: %zu (should be %d)",
            i, counts[i], byte_count);

  return 0;
}

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-07-07 12:38 ` Florian Weimer
@ 2017-07-07 13:38   ` Szabolcs Nagy
  2017-07-07 15:02     ` Szabolcs Nagy
  0 siblings, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-07-07 13:38 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library; +Cc: nd, triegel

On 07/07/17 13:38, Florian Weimer wrote:
> This patch breaks the attached test case.  Not all stream objects are
> linked into the global list, so the locking upgrade doesn't happen for
> some of them.
> 

i thought all files need to be flushed on exit
or on fflush(0); if glibc does not do that, that's
observably non-conforming.

> The global list management is quite expensive because the list is
> single-linked, so we can't just add all stream objects when not yet
> running multi-threaded.
> 
> I fear that this patch may have to be backed out again, until we can
> address these issues.
> 

i can back it out, or try to create all the
problematic files with the optimization-disabling
flag set (in case that's simple... i'll look into
it in a minute).

> Thanks,
> Florian
> 
> 
> memstream-thread.c
> 
> 
> #include <err.h>
> #include <errno.h>
> #include <pthread.h>
> #include <stdio.h>
> 
> enum
>   {
>     thread_count = 2,
>     byte_count = 1000000,
>   };
> 
> struct closure
> {
>   FILE *fp;
>   char b;
> };
> 
> static void *
> thread_func (void *closure)
> {
>   struct closure *args = closure;
> 
>   for (int i = 0; i < byte_count; ++i)
>     fputc (args->b, args->fp);
> 
>   return NULL;
> }
> 
> int
> main (void)
> {
>   char *buffer = NULL;
>   size_t buffer_length = 0;
>   FILE *fp = open_memstream (&buffer, &buffer_length);
>   if (fp == NULL)
>     err (1, "open_memstream");
> 
>   pthread_t threads[thread_count];
>   struct closure args[thread_count];
> 
>   for (int i = 0; i < thread_count; ++i)
>     {
>       args[i].fp = fp;
>       args[i].b = 'A' + i;
>       int ret = pthread_create (threads + i, NULL, thread_func, args + i);
>       if (ret != 0)
>         {
>           errno = ret;
>           err (1, "pthread_create");
>         }
>     }
> 
>   for (int i = 0; i < thread_count; ++i)
>     {
>       int ret = pthread_join (threads[i], NULL);
>       if (ret != 0)
>         {
>           errno = ret;
>           err (1, "pthread_join");
>         }
>     }
> 
>   fclose (fp);
> 
>   if (buffer_length != thread_count * byte_count)
>     errx (1, "unexpected number of written bytes: %zu (should be %d)",
>           buffer_length, thread_count * byte_count);
> 
>   size_t counts[thread_count] = { 0, };
>   for (size_t i = 0; i < buffer_length; ++i)
>     {
>       if (buffer[i] < 'A' || buffer[i] >= 'A' + thread_count)
>         errx (1, "written byte at %zu out of range: %d", i, buffer[i] & 0xFF);
>       ++counts[buffer[i] - 'A'];
>     }
>   for (int i = 0; i < thread_count; ++i)
>     if (counts[i] != byte_count)
>       errx (1, "incorrect write count for thread %d: %zu (should be %d)",
>             i, counts[i], byte_count);
> 
>   return 0;
> }
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-07-07 13:38   ` Szabolcs Nagy
@ 2017-07-07 15:02     ` Szabolcs Nagy
  2017-07-07 17:31       ` Szabolcs Nagy
  0 siblings, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-07-07 15:02 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library; +Cc: nd, triegel

On 07/07/17 14:38, Szabolcs Nagy wrote:
> On 07/07/17 13:38, Florian Weimer wrote:
>> This patch breaks the attached test case.  Not all stream objects are
>> linked into the global list, so the locking upgrade doesn't happen for
>> some of them.
>>
> 
> i thought all files need to be flushed on exit
> or on fflush(0), if glibc does not do that that's
> observably non-conforming.
> 
>> The global list management is quite expensive because the list is
>> single-linked, so we can't just add all stream objects when not yet
>> running multi-threaded.
>>
>> I fear that this patch may have to be backed out again, until we can
>> address these issues.
>>
> 
> i can back it out, or try to create all the
> problematic files with the optimization-disabling
> flag set (in case that's simple.. will look into
> it in a minute).
> 

it seems both changing the optimization flag and adding
these streams to the list are easy.

i think glibc should just fix the open_memstream bug,
https://sourceware.org/bugzilla/show_bug.cgi?id=21735
i'll work on a patch.

(i don't expect a large number of open/close memstream
operations, so it should not have a huge impact)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-07-07 15:02     ` Szabolcs Nagy
@ 2017-07-07 17:31       ` Szabolcs Nagy
  2017-07-07 20:47         ` Adhemerval Zanella
  0 siblings, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-07-07 17:31 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library; +Cc: nd, triegel

On 07/07/17 16:02, Szabolcs Nagy wrote:
> On 07/07/17 14:38, Szabolcs Nagy wrote:
>> On 07/07/17 13:38, Florian Weimer wrote:
>>> This patch breaks the attached test case.  Not all stream objects are
>>> linked into the global list, so the locking upgrade doesn't happen for
>>> some of them.
>>>
>>
>> i thought all files need to be flushed on exit
>> or on fflush(0), if glibc does not do that that's
>> observably non-conforming.
>>
>>> The global list management is quite expensive because the list is
>>> single-linked, so we can't just add all stream objects when not yet
>>> running multi-threaded.
>>>
>>> I fear that this patch may have to be backed out again, until we can
>>> address these issues.
>>>
>>
>> i can back it out, or try to create all the
>> problematic files with the optimization-disabling
>> flag set (in case that's simple.. will look into
>> it in a minute).
>>
> 
> it seems both changing the optimization flag or adding
> these streams to the list are easy.
> 
> i think glibc should just fix the open_memstream bug,
> https://sourceware.org/bugzilla/show_bug.cgi?id=21735
> i'll work on a patch.
> 
> (i don't expect a large number of open/close memstream
> operations, so it should not have a huge impact)
> 

sorry, i cannot work on this until monday;
i hope it's ok to leave multithreaded open_memstream
broken for a few days.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-07-07 17:31       ` Szabolcs Nagy
@ 2017-07-07 20:47         ` Adhemerval Zanella
  2017-07-07 21:31           ` Szabolcs Nagy
  0 siblings, 1 reply; 31+ messages in thread
From: Adhemerval Zanella @ 2017-07-07 20:47 UTC (permalink / raw)
  To: libc-alpha



On 07/07/2017 14:30, Szabolcs Nagy wrote:
> On 07/07/17 16:02, Szabolcs Nagy wrote:
>> On 07/07/17 14:38, Szabolcs Nagy wrote:
>>> On 07/07/17 13:38, Florian Weimer wrote:
>>>> This patch breaks the attached test case.  Not all stream objects are
>>>> linked into the global list, so the locking upgrade doesn't happen for
>>>> some of them.
>>>>
>>>
>>> i thought all files need to be flushed on exit
>>> or on fflush(0), if glibc does not do that that's
>>> observably non-conforming.
>>>
>>>> The global list management is quite expensive because the list is
>>>> single-linked, so we can't just add all stream objects when not yet
>>>> running multi-threaded.
>>>>
>>>> I fear that this patch may have to be backed out again, until we can
>>>> address these issues.
>>>>
>>>
>>> i can back it out, or try to create all the
>>> problematic files with the optimization-disabling
>>> flag set (in case that's simple.. will look into
>>> it in a minute).
>>>
>>
>> it seems both changing the optimization flag or adding
>> these streams to the list are easy.
>>
>> i think glibc should just fix the open_memstream bug,
>> https://sourceware.org/bugzilla/show_bug.cgi?id=21735
>> i'll work on a patch.
>>
>> (i don't expect a large number of open/close memstream
>> operations, so it should not have a huge impact)
>>
> 
> sorry, i cannot work on this until monday,
> i hope it's ok to leave multithread open_memstream
> broken for a few days.

What about the patch below, which adds memstream FILE objects to the
fflush list along with a size update on each write overflow?  I do not
like the cast, but currently 'struct _IO_FILE_plus' and
'struct _IO_streambuf' are essentially the same type, so it should be
safe.  I think for 2.27 we can think about cleaning up these
definitions.

diff --git a/libio/memstream.c b/libio/memstream.c
index f83d4a5..3be5b58 100644
--- a/libio/memstream.c
+++ b/libio/memstream.c
@@ -31,13 +31,14 @@ struct _IO_FILE_memstream
 
 static int _IO_mem_sync (_IO_FILE* fp) __THROW;
 static void _IO_mem_finish (_IO_FILE* fp, int) __THROW;
+static int _IO_mem_overflow (_IO_FILE *fp, int c) __THROW;
 
 
 static const struct _IO_jump_t _IO_mem_jumps libio_vtable =
 {
   JUMP_INIT_DUMMY,
   JUMP_INIT (finish, _IO_mem_finish),
-  JUMP_INIT (overflow, _IO_str_overflow),
+  JUMP_INIT (overflow, _IO_mem_overflow),
   JUMP_INIT (underflow, _IO_str_underflow),
   JUMP_INIT (uflow, _IO_default_uflow),
   JUMP_INIT (pbackfail, _IO_str_pbackfail),
@@ -87,6 +88,7 @@ __open_memstream (char **bufloc, _IO_size_t *sizeloc)
       return NULL;
     }
   _IO_init_internal (&new_f->fp._sf._sbf._f, 0);
+  _IO_link_in ((struct _IO_FILE_plus *) &new_f->fp._sf._sbf);
   _IO_JUMPS_FILE_plus (&new_f->fp._sf._sbf) = &_IO_mem_jumps;
   _IO_str_init_static_internal (&new_f->fp._sf, buf, _IO_BUFSIZ, buf);
   new_f->fp._sf._sbf._f._flags &= ~_IO_USER_BUF;
@@ -137,3 +139,14 @@ _IO_mem_finish (_IO_FILE *fp, int dummy)
 
   _IO_str_finish (fp, 0);
 }
+
+static int
+_IO_mem_overflow (_IO_FILE *fp, int c)
+{
+  int ret = _IO_str_overflow (fp, c);
+
+  struct _IO_FILE_memstream *mp = (struct _IO_FILE_memstream *) fp;
+  *mp->sizeloc = fp->_IO_write_ptr - fp->_IO_write_base;
+
+  return ret;
+}

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-07-07 20:47         ` Adhemerval Zanella
@ 2017-07-07 21:31           ` Szabolcs Nagy
  2017-07-10 14:28             ` Szabolcs Nagy
  0 siblings, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-07-07 21:31 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

* Adhemerval Zanella <adhemerval.zanella@linaro.org> [2017-07-07 17:46:56 -0300]:
> On 07/07/2017 14:30, Szabolcs Nagy wrote:
> > On 07/07/17 16:02, Szabolcs Nagy wrote:
> >> On 07/07/17 14:38, Szabolcs Nagy wrote:
> >>> On 07/07/17 13:38, Florian Weimer wrote:
> >>>> This patch breaks the attached test case.  Not all stream objects are
> >>>> linked into the global list, so the locking upgrade doesn't happen for
> >>>> some of them.
> >>>>
> >>>
> >>> i thought all files need to be flushed on exit
> >>> or on fflush(0), if glibc does not do that that's
> >>> observably non-conforming.
> >>>
> >>>> The global list management is quite expensive because the list is
> >>>> single-linked, so we can't just add all stream objects when not yet
> >>>> running multi-threaded.
> >>>>
> >>>> I fear that this patch may have to be backed out again, until we can
> >>>> address these issues.
> >>>>
> >>>
> >>> i can back it out, or try to create all the
> >>> problematic files with the optimization-disabling
> >>> flag set (in case that's simple.. will look into
> >>> it in a minute).
> >>>
> >>
> >> it seems both changing the optimization flag or adding
> >> these streams to the list are easy.
> >>
> >> i think glibc should just fix the open_memstream bug,
> >> https://sourceware.org/bugzilla/show_bug.cgi?id=21735
> >> i'll work on a patch.
> >>
> >> (i don't expect a large number of open/close memstream
> >> operations, so it should not have a huge impact)
> >>
> > 
> > sorry, i cannot work on this until monday,
> > i hope it's ok to leave multithread open_memstream
> > broken for a few days.
> 
> What about patch below to add memstream FILE objects to the fflush
> list along with a size update on write operation update?  I do not like
> the cast, but currently 'struct _IO_FILE_plus' and 'struct _IO_streambuf'
> are essentially the same type so it should be safe. I think for 2.27 we
> can think about cleaning up these definitions.
> 
> diff --git a/libio/memstream.c b/libio/memstream.c
> index f83d4a5..3be5b58 100644
> --- a/libio/memstream.c
> +++ b/libio/memstream.c
> @@ -31,13 +31,14 @@ struct _IO_FILE_memstream
>  
>  static int _IO_mem_sync (_IO_FILE* fp) __THROW;
>  static void _IO_mem_finish (_IO_FILE* fp, int) __THROW;
> +static int _IO_mem_overflow (_IO_FILE *fp, int c) __THROW;
>  
>  
>  static const struct _IO_jump_t _IO_mem_jumps libio_vtable =
>  {
>    JUMP_INIT_DUMMY,
>    JUMP_INIT (finish, _IO_mem_finish),
> -  JUMP_INIT (overflow, _IO_str_overflow),
> +  JUMP_INIT (overflow, _IO_mem_overflow),
>    JUMP_INIT (underflow, _IO_str_underflow),
>    JUMP_INIT (uflow, _IO_default_uflow),
>    JUMP_INIT (pbackfail, _IO_str_pbackfail),
> @@ -87,6 +88,7 @@ __open_memstream (char **bufloc, _IO_size_t *sizeloc)
>        return NULL;
>      }
>    _IO_init_internal (&new_f->fp._sf._sbf._f, 0);
> +  _IO_link_in ((struct _IO_FILE_plus *) &new_f->fp._sf._sbf);

i think if you link_in, then you have to unlink somewhere,
but i haven't tracked down the ways in which the io callbacks
handle that; in principle it should happen at close time.

>    _IO_JUMPS_FILE_plus (&new_f->fp._sf._sbf) = &_IO_mem_jumps;
>    _IO_str_init_static_internal (&new_f->fp._sf, buf, _IO_BUFSIZ, buf);
>    new_f->fp._sf._sbf._f._flags &= ~_IO_USER_BUF;
> @@ -137,3 +139,14 @@ _IO_mem_finish (_IO_FILE *fp, int dummy)
>  
>    _IO_str_finish (fp, 0);
>  }
> +
> +static int
> +_IO_mem_overflow (_IO_FILE *fp, int c)
> +{
> +  int ret = _IO_str_overflow (fp, c);
> +
> +  struct _IO_FILE_memstream *mp = (struct _IO_FILE_memstream *) fp;
> +  *mp->sizeloc = fp->_IO_write_ptr - fp->_IO_write_base;
> +
> +  return ret;
> +}

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-07-07 21:31           ` Szabolcs Nagy
@ 2017-07-10 14:28             ` Szabolcs Nagy
  2017-07-10 15:06               ` Adhemerval Zanella
  0 siblings, 1 reply; 31+ messages in thread
From: Szabolcs Nagy @ 2017-07-10 14:28 UTC (permalink / raw)
  To: Szabolcs Nagy, Adhemerval Zanella; +Cc: nd, libc-alpha

On 07/07/17 22:31, Szabolcs Nagy wrote:
> * Adhemerval Zanella <adhemerval.zanella@linaro.org> [2017-07-07 17:46:56 -0300]:
>> On 07/07/2017 14:30, Szabolcs Nagy wrote:
>>> On 07/07/17 16:02, Szabolcs Nagy wrote:
>>>> On 07/07/17 14:38, Szabolcs Nagy wrote:
>>>>> On 07/07/17 13:38, Florian Weimer wrote:
>>>>>> This patch breaks the attached test case.  Not all stream objects are
>>>>>> linked into the global list, so the locking upgrade doesn't happen for
>>>>>> some of them.
>>>>>>
>>>>>
>>>>> i thought all files need to be flushed on exit
>>>>> or on fflush(0), if glibc does not do that that's
>>>>> observably non-conforming.
>>>>>
>>>>>> The global list management is quite expensive because the list is
>>>>>> single-linked, so we can't just add all stream objects when not yet
>>>>>> running multi-threaded.
>>>>>>
>>>>>> I fear that this patch may have to be backed out again, until we can
>>>>>> address these issues.
>>>>>>
>>>>>
>>>>> i can back it out, or try to create all the
>>>>> problematic files with the optimization-disabling
>>>>> flag set (in case that's simple.. will look into
>>>>> it in a minute).
>>>>>
>>>>
>>>> it seems both changing the optimization flag or adding
>>>> these streams to the list are easy.
>>>>
>>>> i think glibc should just fix the open_memstream bug,
>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=21735
>>>> i'll work on a patch.
>>>>
>>>> (i don't expect a large number of open/close memstream
>>>> operations, so it should not have a huge impact)
>>>>
>>>
>>> sorry, i cannot work on this until monday,
>>> i hope it's ok to leave multithread open_memstream
>>> broken for a few days.
>>
>> What about patch below to add memstream FILE objects to the fflush
>> list along with a size update on write operation update?  I do not like
>> the cast, but currently 'struct _IO_FILE_plus' and 'struct _IO_streambuf'
>> are essentially the same type so it should be safe. I think for 2.27 we
>> can think about cleaning up these definitions.
>>
>> diff --git a/libio/memstream.c b/libio/memstream.c
>> index f83d4a5..3be5b58 100644
>> --- a/libio/memstream.c
>> +++ b/libio/memstream.c
>> @@ -31,13 +31,14 @@ struct _IO_FILE_memstream
>>  
>>  static int _IO_mem_sync (_IO_FILE* fp) __THROW;
>>  static void _IO_mem_finish (_IO_FILE* fp, int) __THROW;
>> +static int _IO_mem_overflow (_IO_FILE *fp, int c) __THROW;
>>  
>>  
>>  static const struct _IO_jump_t _IO_mem_jumps libio_vtable =
>>  {
>>    JUMP_INIT_DUMMY,
>>    JUMP_INIT (finish, _IO_mem_finish),
>> -  JUMP_INIT (overflow, _IO_str_overflow),
>> +  JUMP_INIT (overflow, _IO_mem_overflow),
>>    JUMP_INIT (underflow, _IO_str_underflow),
>>    JUMP_INIT (uflow, _IO_default_uflow),
>>    JUMP_INIT (pbackfail, _IO_str_pbackfail),
>> @@ -87,6 +88,7 @@ __open_memstream (char **bufloc, _IO_size_t *sizeloc)
>>        return NULL;
>>      }
>>    _IO_init_internal (&new_f->fp._sf._sbf._f, 0);
>> +  _IO_link_in ((struct _IO_FILE_plus *) &new_f->fp._sf._sbf);
> 
> i think if you link_in, then you have to unlink somewhere,
> but i havent tracked down the ways in which the io callbacks
> handle that, in principle it should be at close time.
> 

after further investigation i found that both fclose and
_IO_str_finish call _IO_un_link if the file is linked
(which is suboptimal, but my worry was not justified),

and that

_IO_flush_all calls the overflow callback
_IO_fflush calls the sync callback
_IO_fclose calls the finish callback

so all three callbacks need to set *sizeloc
(which is suboptimal, but we should not redesign libio now).

so your patch seems to be the right fix for BZ 21735.
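
(for illustration, a small standalone program showing the user-visible
contract those callbacks serve; after fflush and fclose, and likewise
at exit via the flush-all/overflow path, the (*bufloc, *sizeloc) pair
must reflect everything written:)

/* Illustration of the open_memstream contract: the user's buffer
   pointer and size variable must be up to date after fflush and
   after fclose (and, via flush-all, at exit).  */
#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  char *buf = NULL;
  size_t size = 0;
  FILE *fp = open_memstream (&buf, &size);
  if (fp == NULL)
    return 1;

  fputs ("hello", fp);
  fflush (fp);                    /* sync callback: size becomes 5 */
  printf ("after fflush: size=%zu\n", size);

  fclose (fp);                    /* finish callback: size stays 5 */
  printf ("after fclose: size=%zu buf=%s\n", size, buf);

  free (buf);
  return 0;
}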


>>    _IO_JUMPS_FILE_plus (&new_f->fp._sf._sbf) = &_IO_mem_jumps;
>>    _IO_str_init_static_internal (&new_f->fp._sf, buf, _IO_BUFSIZ, buf);
>>    new_f->fp._sf._sbf._f._flags &= ~_IO_USER_BUF;
>> @@ -137,3 +139,14 @@ _IO_mem_finish (_IO_FILE *fp, int dummy)
>>  
>>    _IO_str_finish (fp, 0);
>>  }
>> +
>> +static int
>> +_IO_mem_overflow (_IO_FILE *fp, int c)
>> +{
>> +  int ret = _IO_str_overflow (fp, c);
>> +
>> +  struct _IO_FILE_memstream *mp = (struct _IO_FILE_memstream *) fp;
>> +  *mp->sizeloc = fp->_IO_write_ptr - fp->_IO_write_base;
>> +
>> +  return ret;
>> +}

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2] Single threaded stdio optimization
  2017-07-10 14:28             ` Szabolcs Nagy
@ 2017-07-10 15:06               ` Adhemerval Zanella
  0 siblings, 0 replies; 31+ messages in thread
From: Adhemerval Zanella @ 2017-07-10 15:06 UTC (permalink / raw)
  To: Szabolcs Nagy, Szabolcs Nagy; +Cc: nd, libc-alpha



On 10/07/2017 11:28, Szabolcs Nagy wrote:
> On 07/07/17 22:31, Szabolcs Nagy wrote:
>> * Adhemerval Zanella <adhemerval.zanella@linaro.org> [2017-07-07 17:46:56 -0300]:
>>> On 07/07/2017 14:30, Szabolcs Nagy wrote:
>>>> On 07/07/17 16:02, Szabolcs Nagy wrote:
>>>>> On 07/07/17 14:38, Szabolcs Nagy wrote:
>>>>>> On 07/07/17 13:38, Florian Weimer wrote:
>>>>>>> This patch breaks the attached test case.  Not all stream objects are
>>>>>>> linked into the global list, so the locking upgrade doesn't happen for
>>>>>>> some of them.
>>>>>>>
>>>>>>
>>>>>> i thought all files need to be flushed on exit
>>>>>> or on fflush(0), if glibc does not do that that's
>>>>>> observably non-conforming.
>>>>>>
>>>>>>> The global list management is quite expensive because the list is
>>>>>>> single-linked, so we can't just add all stream objects when not yet
>>>>>>> running multi-threaded.
>>>>>>>
>>>>>>> I fear that this patch may have to be backed out again, until we can
>>>>>>> address these issues.
>>>>>>>
>>>>>>
>>>>>> i can back it out, or try to create all the
>>>>>> problematic files with the optimization-disabling
>>>>>> flag set (in case that's simple.. will look into
>>>>>> it in a minute).
>>>>>>
>>>>>
>>>>> it seems both changing the optimization flag or adding
>>>>> these streams to the list are easy.
>>>>>
>>>>> i think glibc should just fix the open_memstream bug,
>>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=21735
>>>>> i'll work on a patch.
>>>>>
>>>>> (i don't expect a large number of open/close memstream
>>>>> operations, so it should not have a huge impact)
>>>>>
>>>>
>>>> sorry, i cannot work on this until monday,
>>>> i hope it's ok to leave multithread open_memstream
>>>> broken for a few days.
>>>
>>> What about patch below to add memstream FILE objects to the fflush
>>> list along with a size update on write operation update?  I do not like
>>> the cast, but currently 'struct _IO_FILE_plus' and 'struct _IO_streambuf'
>>> are essentially the same type so it should be safe. I think for 2.27 we
>>> can think about cleaning up these definitions.
>>>
>>> diff --git a/libio/memstream.c b/libio/memstream.c
>>> index f83d4a5..3be5b58 100644
>>> --- a/libio/memstream.c
>>> +++ b/libio/memstream.c
>>> @@ -31,13 +31,14 @@ struct _IO_FILE_memstream
>>>  
>>>  static int _IO_mem_sync (_IO_FILE* fp) __THROW;
>>>  static void _IO_mem_finish (_IO_FILE* fp, int) __THROW;
>>> +static int _IO_mem_overflow (_IO_FILE *fp, int c) __THROW;
>>>  
>>>  
>>>  static const struct _IO_jump_t _IO_mem_jumps libio_vtable =
>>>  {
>>>    JUMP_INIT_DUMMY,
>>>    JUMP_INIT (finish, _IO_mem_finish),
>>> -  JUMP_INIT (overflow, _IO_str_overflow),
>>> +  JUMP_INIT (overflow, _IO_mem_overflow),
>>>    JUMP_INIT (underflow, _IO_str_underflow),
>>>    JUMP_INIT (uflow, _IO_default_uflow),
>>>    JUMP_INIT (pbackfail, _IO_str_pbackfail),
>>> @@ -87,6 +88,7 @@ __open_memstream (char **bufloc, _IO_size_t *sizeloc)
>>>        return NULL;
>>>      }
>>>    _IO_init_internal (&new_f->fp._sf._sbf._f, 0);
>>> +  _IO_link_in ((struct _IO_FILE_plus *) &new_f->fp._sf._sbf);
>>
>> i think if you link_in, then you have to unlink somewhere,
>> but i havent tracked down the ways in which the io callbacks
>> handle that, in principle it should be at close time.
>>
> 
> after further investigation i found that actually both
> fclose and _IO_str_finish call _IO_un_link if the file
> is linked.
> (which is suboptimal, but my worry was not justified)
> 
> and that
> 
> _IO_flush_all calls the overflow callback
> _IO_fflush calls the sync callback
> _IO_fclose calls the finish callback
> 
> so all three callbacks need to set *sizeloc.
> (which is suboptimal, but we should not redesign libio now)
> 
> so your patch seems to be the right fix for BZ 21735 .

Yes, I was about to reply with the same observations.  I am finishing the 
patch itself with a proper testcase.

> 
> 
>>>    _IO_JUMPS_FILE_plus (&new_f->fp._sf._sbf) = &_IO_mem_jumps;
>>>    _IO_str_init_static_internal (&new_f->fp._sf, buf, _IO_BUFSIZ, buf);
>>>    new_f->fp._sf._sbf._f._flags &= ~_IO_USER_BUF;
>>> @@ -137,3 +139,14 @@ _IO_mem_finish (_IO_FILE *fp, int dummy)
>>>  
>>>    _IO_str_finish (fp, 0);
>>>  }
>>> +
>>> +static int
>>> +_IO_mem_overflow (_IO_FILE *fp, int c)
>>> +{
>>> +  int ret = _IO_str_overflow (fp, c);
>>> +
>>> +  struct _IO_FILE_memstream *mp = (struct _IO_FILE_memstream *) fp;
>>> +  *mp->sizeloc = fp->_IO_write_ptr - fp->_IO_write_base;
>>> +
>>> +  return ret;
>>> +}
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-07-10 15:06 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-21 16:37 [PATCH v2] Single threaded stdio optimization Szabolcs Nagy
2017-06-21 16:52 ` Joseph Myers
2017-06-21 16:53   ` Szabolcs Nagy
2017-06-22  9:50   ` Szabolcs Nagy
2017-06-29 11:41     ` Siddhesh Poyarekar
2017-06-29 12:01       ` Siddhesh Poyarekar
2017-06-29 12:12         ` Carlos O'Donell
2017-06-29 13:44           ` Szabolcs Nagy
2017-06-30 12:16           ` Szabolcs Nagy
2017-06-30 13:16             ` Carlos O'Donell
2017-06-30 13:22               ` Adhemerval Zanella
2017-06-30 14:39               ` Szabolcs Nagy
2017-06-30 16:07                 ` Carlos O'Donell
2017-07-03 12:16                   ` [PATCH v4] " Szabolcs Nagy
2017-06-30 15:18             ` [PATCH v2] " Torvald Riegel
2017-06-30 15:24               ` Siddhesh Poyarekar
2017-06-30 15:35                 ` Torvald Riegel
2017-06-30 15:32               ` Ramana Radhakrishnan
2017-06-30 15:51               ` Szabolcs Nagy
2017-06-29 13:29       ` Szabolcs Nagy
2017-06-29 14:06         ` Siddhesh Poyarekar
2017-06-30 12:38           ` Szabolcs Nagy
2017-06-30 15:17             ` Siddhesh Poyarekar
2017-07-07 12:38 ` Florian Weimer
2017-07-07 13:38   ` Szabolcs Nagy
2017-07-07 15:02     ` Szabolcs Nagy
2017-07-07 17:31       ` Szabolcs Nagy
2017-07-07 20:47         ` Adhemerval Zanella
2017-07-07 21:31           ` Szabolcs Nagy
2017-07-10 14:28             ` Szabolcs Nagy
2017-07-10 15:06               ` Adhemerval Zanella

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).