public inbox for libc-alpha@sourceware.org
* [PATCH v2 0/4] Simplify internal single-threaded usage
@ 2022-06-10 16:35 Adhemerval Zanella
  2022-06-10 16:35 ` [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded Adhemerval Zanella
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-10 16:35 UTC (permalink / raw)
  To: libc-alpha, Wilco Dijkstra

glibc currently has three different internal ways to check whether a
process is single-threaded: the exported global variable
__libc_single_threaded, the internal-only __libc_multiple_threads, and
the multiple_threads field used by some architectures and allocated in
the TCB.  Each port can also define SINGLE_THREAD_BY_GLOBAL to select
either __libc_multiple_threads or multiple_threads.

__libc_single_threaded and __libc_multiple_threads have essentially the
same semantics: both are global variables whose value is not reset
if/when the process becomes multi-threaded.  The issue with using
__libc_single_threaded internally is that, since it may be accessed
through a copy relocation, both copies must be kept updated.  This is
fixed in the first patch.
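
As a concrete illustration of the contract being preserved here (a
minimal test sketch, not part of the series; only the public
__libc_single_threaded variable and pthread_create are assumed):

  #include <sys/single_threaded.h>
  #include <pthread.h>
  #include <stdio.h>

  static void *fn (void *p) { return p; }

  int
  main (void)
  {
    /* In a non-PIE binary this reference may go through a copy
       relocation, so libc has to update the executable's copy too.  */
    printf ("before: %d\n", (int) __libc_single_threaded);
    pthread_t t;
    pthread_create (&t, NULL, fn, NULL);
    printf ("after: %d\n", (int) __libc_single_threaded); /* must be 0 */
    pthread_join (t, NULL);
    return 0;
  }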

The second patch replaces __libc_multiple_threads with
__libc_single_threaded, while also fixing a bug where architectures
that define SINGLE_THREAD_BY_GLOBAL do not enable the optimization.

The third patch replaces multiple_threads with __libc_single_threaded,
to simplify a possible single-thread lock optimization.  On most
architectures, accessing an internal global variable should be as fast
as going through the TCB (only on legacy ABIs that require an extra
code sequence to materialize the address of a global, such as i686 and
sparc, would the TCB be faster, and even there the cost is amortized
when SINGLE_THREAD_P is used within large code blocks).
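
For reference, a rough sketch of why the global can be slower on such
ABIs (the assembly sequences are illustrative, not taken from the
patches):

  /* Illustrative only, i686 PIC flavour:

     TCB field check:
	cmpl $0, %gs:MULTIPLE_THREADS_OFFSET

     Hidden global check, which first has to materialize the GOT:
	call __x86.get_pc_thunk.bx
	addl $_GLOBAL_OFFSET_TABLE_, %ebx
	cmpb $0, __libc_single_threaded@GOTOFF(%ebx)

     The GOT setup is paid once per function, hence the amortization
     over larger code blocks.  */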

x86 is the only architecture that optimizes the lock access directly by
reimplementing the atomic operations.  In this case, the affected
implementations are rewritten to use the SINGLE_THREAD_P macro, while
some unused macros are simply removed (for instance atomic_add_zero).
The idea is to phase out this specific atomic implementation in favor
of compiler builtins and to make the single-thread optimization
arch-neutral.
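
A hedged sketch of that arch-neutral direction (the helper name is
illustrative; only SINGLE_THREAD_P and the GCC __atomic builtins are
assumed):

  static inline int
  catomic_fetch_add_sketch (int *mem, int value)
  {
    if (SINGLE_THREAD_P)
      {
        /* No atomicity needed in a single-threaded process.  */
        int old = *mem;
        *mem = old + value;
        return old;
      }
    return __atomic_fetch_add (mem, value, __ATOMIC_ACQUIRE);
  }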

The last patch removes the single-thread.h header and moves the
definition to the internal sys/single_threaded.h, so there is now only
one place to add such optimizations.
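
After the series, the internal check reduces to a single definition; a
condensed sketch of the resulting internal header (based on the diffs
below, with the shape of the final patch assumed):

  /* include/sys/single_threaded.h */
  #include <misc/sys/single_threaded.h>

  #ifndef _ISOMAC
  libc_hidden_proto (__libc_single_threaded);

  # define SINGLE_THREAD_P (__libc_single_threaded != 0)
  # define RTLD_SINGLE_THREAD_P SINGLE_THREAD_P
  #endif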

v2:
* Add RTLD_DEFAULT support for __libc_dlsym and use it instead of
  ___dlsym.
* Simplify the x86 atomic macros.

Adhemerval Zanella (4):
  misc: Optimize internal usage of __libc_single_threaded
  Replace __libc_multiple_threads with __libc_single_threaded
  Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  Remove single-thread.h

 elf/dl-libc.c                               |  20 +-
 elf/libc_early_init.c                       |   9 +
 include/sys/single_threaded.h               |  20 +-
 misc/single_threaded.c                      |   2 +
 misc/tst-atomic.c                           |   1 +
 nptl/Makefile                               |   1 -
 nptl/allocatestack.c                        |  12 -
 nptl/descr.h                                |  17 +-
 nptl/libc_multiple_threads.c                |  28 --
 nptl/pthread_cancel.c                       |   9 +-
 nptl/pthread_create.c                       |  11 +-
 sysdeps/generic/single-thread.h             |  25 -
 sysdeps/i386/htl/tcb-offsets.sym            |   1 -
 sysdeps/i386/nptl/tcb-offsets.sym           |   1 -
 sysdeps/i386/nptl/tls.h                     |   4 +-
 sysdeps/ia64/nptl/tcb-offsets.sym           |   1 -
 sysdeps/ia64/nptl/tls.h                     |   2 -
 sysdeps/mach/hurd/i386/tls.h                |   4 +-
 sysdeps/mach/hurd/sysdep-cancel.h           |   5 -
 sysdeps/nios2/nptl/tcb-offsets.sym          |   1 -
 sysdeps/or1k/nptl/tls.h                     |   2 -
 sysdeps/powerpc/nptl/tcb-offsets.sym        |   3 -
 sysdeps/powerpc/nptl/tls.h                  |   3 -
 sysdeps/s390/nptl/tcb-offsets.sym           |   1 -
 sysdeps/s390/nptl/tls.h                     |   6 +-
 sysdeps/sh/nptl/tcb-offsets.sym             |   1 -
 sysdeps/sh/nptl/tls.h                       |   2 -
 sysdeps/sparc/nptl/tcb-offsets.sym          |   1 -
 sysdeps/sparc/nptl/tls.h                    |   2 +-
 sysdeps/unix/sysdep.h                       |   2 +-
 sysdeps/unix/sysv/linux/aarch64/sysdep.h    |   2 -
 sysdeps/unix/sysv/linux/alpha/sysdep.h      |   2 -
 sysdeps/unix/sysv/linux/arc/sysdep.h        |   2 -
 sysdeps/unix/sysv/linux/arm/sysdep.h        |   2 -
 sysdeps/unix/sysv/linux/hppa/sysdep.h       |   2 -
 sysdeps/unix/sysv/linux/microblaze/sysdep.h |   2 -
 sysdeps/unix/sysv/linux/s390/sysdep.h       |   3 -
 sysdeps/unix/sysv/linux/single-thread.h     |  44 --
 sysdeps/unix/sysv/linux/x86_64/sysdep.h     |   2 -
 sysdeps/x86/atomic-machine.h                | 484 ++++++--------------
 sysdeps/x86_64/nptl/tcb-offsets.sym         |   1 -
 41 files changed, 199 insertions(+), 544 deletions(-)
 delete mode 100644 nptl/libc_multiple_threads.c
 delete mode 100644 sysdeps/generic/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/single-thread.h

-- 
2.34.1



* [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded
  2022-06-10 16:35 [PATCH v2 0/4] Simplify internal single-threaded usage Adhemerval Zanella
@ 2022-06-10 16:35 ` Adhemerval Zanella
  2022-06-16  7:15   ` Fangrui Song
  2022-06-20  8:37   ` Florian Weimer
  2022-06-10 16:35 ` [PATCH v2 2/4] Replace __libc_multiple_threads with __libc_single_threaded Adhemerval Zanella
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-10 16:35 UTC (permalink / raw)
  To: libc-alpha, Wilco Dijkstra

Add an internal hidden alias to avoid the GOT indirection.  On some
architectures, __libc_single_threaded may be accessed through copy
relocations, and in that case both copies must be kept updated.

To obtain the correct address of __libc_single_threaded, __libc_dlsym
is extended to support RTLD_DEFAULT, which searches through all scopes
instead of only the module's local scope.
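
Usage sketch (mirroring the libc_early_init.c hunk below; RTLD_DEFAULT
resolves the public copy, which may live in the main executable):

  char *ext = __libc_dlsym (RTLD_DEFAULT, "__libc_single_threaded");
  if (ext != NULL)
    *ext = __libc_single_threaded;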

Checked on x86_64-linux-gnu and i686-linux-gnu.
---
 elf/dl-libc.c                 | 20 ++++++++++++++++++--
 elf/libc_early_init.c         |  9 +++++++++
 include/sys/single_threaded.h | 11 +++++++++++
 misc/single_threaded.c        |  2 ++
 nptl/pthread_create.c         |  6 +++++-
 5 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/elf/dl-libc.c b/elf/dl-libc.c
index 266e068da6..e64f4b9910 100644
--- a/elf/dl-libc.c
+++ b/elf/dl-libc.c
@@ -16,6 +16,7 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#include <assert.h>
 #include <dlfcn.h>
 #include <stdlib.h>
 #include <ldsodefs.h>
@@ -72,6 +73,7 @@ struct do_dlsym_args
   /* Arguments to do_dlsym.  */
   struct link_map *map;
   const char *name;
+  const void *caller_dlsym;
 
   /* Return values of do_dlsym.  */
   lookup_t loadbase;
@@ -102,8 +104,21 @@ do_dlsym (void *ptr)
 {
   struct do_dlsym_args *args = (struct do_dlsym_args *) ptr;
   args->ref = NULL;
-  args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, args->map, &args->ref,
-					     args->map->l_local_scope, NULL, 0,
+  struct link_map *match = args->map;
+  struct r_scope_elem **scope;
+  if (args->map == RTLD_DEFAULT)
+    {
+      ElfW(Addr) caller = (ElfW(Addr)) args->caller_dlsym;
+      match = _dl_find_dso_for_object (caller);
+      /* Only used internally, so the caller should always be recognized.  */
+      assert (match != NULL);
+      scope = match->l_scope;
+    }
+  else
+    scope = args->map->l_local_scope;
+
+  args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, match, &args->ref,
+					     scope, NULL, 0,
 					     DL_LOOKUP_RETURN_NEWEST, NULL);
 }
 
@@ -182,6 +197,7 @@ __libc_dlsym (void *map, const char *name)
   struct do_dlsym_args args;
   args.map = map;
   args.name = name;
+  args.caller_dlsym = RETURN_ADDRESS (0);
 
 #ifdef SHARED
   if (GLRO (dl_dlfcn_hook) != NULL)
diff --git a/elf/libc_early_init.c b/elf/libc_early_init.c
index 3c4a19cf6b..7cc2997122 100644
--- a/elf/libc_early_init.c
+++ b/elf/libc_early_init.c
@@ -16,7 +16,9 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#include <assert.h>
 #include <ctype.h>
+#include <dlfcn.h>
 #include <elision-conf.h>
 #include <libc-early-init.h>
 #include <libc-internal.h>
@@ -38,6 +40,13 @@ __libc_early_init (_Bool initial)
   __libc_single_threaded = initial;
 
 #ifdef SHARED
+  /* __libc_single_threaded can be accessed through copy relocations, so it
+     is also necessary to update the external copy.  */
+  __libc_external_single_threaded = __libc_dlsym (RTLD_DEFAULT,
+						  "__libc_single_threaded");
+  assert (__libc_external_single_threaded != NULL);
+  *__libc_external_single_threaded = initial;
+
   __libc_initial = initial;
 #endif
 
diff --git a/include/sys/single_threaded.h b/include/sys/single_threaded.h
index 18f6972482..258b01e0b2 100644
--- a/include/sys/single_threaded.h
+++ b/include/sys/single_threaded.h
@@ -1 +1,12 @@
 #include <misc/sys/single_threaded.h>
+
+#ifndef _ISOMAC
+
+libc_hidden_proto (__libc_single_threaded);
+
+# ifdef SHARED
+extern __typeof (__libc_single_threaded) *__libc_external_single_threaded
+  attribute_hidden;
+# endif
+
+#endif
diff --git a/misc/single_threaded.c b/misc/single_threaded.c
index 96ada9137b..201d86a273 100644
--- a/misc/single_threaded.c
+++ b/misc/single_threaded.c
@@ -22,6 +22,8 @@
    __libc_early_init (as false for inner libcs).  */
 #ifdef SHARED
 char __libc_single_threaded;
+__typeof (__libc_single_threaded) *__libc_external_single_threaded;
 #else
 char __libc_single_threaded = 1;
 #endif
+libc_hidden_data_def (__libc_single_threaded)
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index e7a099acb7..5633d01c62 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -627,7 +627,11 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
   if (__libc_single_threaded)
     {
       late_init ();
-      __libc_single_threaded = 0;
+      __libc_single_threaded =
+#ifdef SHARED
+        *__libc_external_single_threaded =
+#endif
+	0;
     }
 
   const struct pthread_attr *iattr = (struct pthread_attr *) attr;
-- 
2.34.1



* [PATCH v2 2/4] Replace __libc_multiple_threads with __libc_single_threaded
  2022-06-10 16:35 [PATCH v2 0/4] Simplify internal single-threaded usage Adhemerval Zanella
  2022-06-10 16:35 ` [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded Adhemerval Zanella
@ 2022-06-10 16:35 ` Adhemerval Zanella
  2022-06-10 16:35 ` [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB Adhemerval Zanella
  2022-06-10 16:35 ` [PATCH v2 4/4] Remove single-thread.h Adhemerval Zanella
  3 siblings, 0 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-10 16:35 UTC (permalink / raw)
  To: libc-alpha, Wilco Dijkstra

It also fixes the SINGLE_THREAD_P macro for SINGLE_THREAD_BY_GLOBAL:
the single-thread.h header was included in the wrong order, since the
define needs to come before the inclusion of sysdeps/unix/sysdep.h.
The define is now moved to a per-arch single-thread.h header.
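
Illustration of the fix, using the new per-arch header (shown in full
in the diff below):

  /* sysdeps/unix/sysv/linux/<arch>/single-thread.h */
  #define SINGLE_THREAD_BY_GLOBAL
  #include_next <single-thread.h>

This way the define is guaranteed to be visible before the generic
header tests it, regardless of where the arch sysdep.h ends up in the
inclusion order.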
---
 nptl/Makefile                                 |  1 -
 nptl/allocatestack.c                          |  6 ----
 nptl/libc_multiple_threads.c                  | 28 -------------------
 nptl/pthread_cancel.c                         |  2 +-
 .../unix/sysv/linux/aarch64/single-thread.h   |  2 ++
 sysdeps/unix/sysv/linux/aarch64/sysdep.h      |  2 --
 sysdeps/unix/sysv/linux/alpha/sysdep.h        |  2 --
 sysdeps/unix/sysv/linux/arc/single-thread.h   |  2 ++
 sysdeps/unix/sysv/linux/arc/sysdep.h          |  2 --
 sysdeps/unix/sysv/linux/arm/single-thread.h   |  2 ++
 sysdeps/unix/sysv/linux/arm/sysdep.h          |  2 --
 sysdeps/unix/sysv/linux/hppa/single-thread.h  |  2 ++
 sysdeps/unix/sysv/linux/hppa/sysdep.h         |  2 --
 .../sysv/linux/microblaze/single-thread.h     |  2 ++
 sysdeps/unix/sysv/linux/microblaze/sysdep.h   |  2 --
 sysdeps/unix/sysv/linux/s390/single-thread.h  |  2 ++
 sysdeps/unix/sysv/linux/s390/sysdep.h         |  3 --
 sysdeps/unix/sysv/linux/single-thread.h       | 11 ++++----
 .../unix/sysv/linux/x86_64/single-thread.h    |  2 ++
 sysdeps/unix/sysv/linux/x86_64/sysdep.h       |  2 --
 20 files changed, 20 insertions(+), 59 deletions(-)
 delete mode 100644 nptl/libc_multiple_threads.c
 create mode 100644 sysdeps/unix/sysv/linux/aarch64/single-thread.h
 create mode 100644 sysdeps/unix/sysv/linux/arc/single-thread.h
 create mode 100644 sysdeps/unix/sysv/linux/arm/single-thread.h
 create mode 100644 sysdeps/unix/sysv/linux/hppa/single-thread.h
 create mode 100644 sysdeps/unix/sysv/linux/microblaze/single-thread.h
 create mode 100644 sysdeps/unix/sysv/linux/s390/single-thread.h
 create mode 100644 sysdeps/unix/sysv/linux/x86_64/single-thread.h

diff --git a/nptl/Makefile b/nptl/Makefile
index b585663974..3d2ce8af8a 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -50,7 +50,6 @@ routines = \
   events \
   futex-internal \
   libc-cleanup \
-  libc_multiple_threads \
   lowlevellock \
   nptl-stack \
   nptl_deallocate_tsd \
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 01a282f3f6..98f5f6dd85 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -292,9 +292,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 
       /* This is at least the second thread.  */
       pd->header.multiple_threads = 1;
-#ifndef TLS_MULTIPLE_THREADS_IN_TCB
-      __libc_multiple_threads = 1;
-#endif
 
 #ifdef NEED_DL_SYSINFO
       SETUP_THREAD_SYSINFO (pd);
@@ -413,9 +410,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 
 	  /* This is at least the second thread.  */
 	  pd->header.multiple_threads = 1;
-#ifndef TLS_MULTIPLE_THREADS_IN_TCB
-	  __libc_multiple_threads = 1;
-#endif
 
 #ifdef NEED_DL_SYSINFO
 	  SETUP_THREAD_SYSINFO (pd);
diff --git a/nptl/libc_multiple_threads.c b/nptl/libc_multiple_threads.c
deleted file mode 100644
index 0c2dc33d0d..0000000000
--- a/nptl/libc_multiple_threads.c
+++ /dev/null
@@ -1,28 +0,0 @@
-/* Copyright (C) 2002-2022 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <https://www.gnu.org/licenses/>.  */
-
-#include <pthreadP.h>
-
-#if IS_IN (libc)
-# ifndef TLS_MULTIPLE_THREADS_IN_TCB
-/* Variable set to a nonzero value either if more than one thread runs or ran,
-   or if a single-threaded process is trying to cancel itself.  See
-   nptl/descr.h for more context on the single-threaded process case.  */
-int __libc_multiple_threads;
-libc_hidden_data_def (__libc_multiple_threads)
-# endif
-#endif
diff --git a/nptl/pthread_cancel.c b/nptl/pthread_cancel.c
index e67b2df5cc..e1735279f2 100644
--- a/nptl/pthread_cancel.c
+++ b/nptl/pthread_cancel.c
@@ -161,7 +161,7 @@ __pthread_cancel (pthread_t th)
 	   points get executed.  */
 	THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
 #ifndef TLS_MULTIPLE_THREADS_IN_TCB
-      __libc_multiple_threads = 1;
+	__libc_single_threaded = 0;
 #endif
     }
   while (!atomic_compare_exchange_weak_acquire (&pd->cancelhandling, &oldval,
diff --git a/sysdeps/unix/sysv/linux/aarch64/single-thread.h b/sysdeps/unix/sysv/linux/aarch64/single-thread.h
new file mode 100644
index 0000000000..a5d3a2aaf4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/aarch64/single-thread.h
@@ -0,0 +1,2 @@
+#define SINGLE_THREAD_BY_GLOBAL
+#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/aarch64/sysdep.h b/sysdeps/unix/sysv/linux/aarch64/sysdep.h
index 3b230dccf1..f1853e012f 100644
--- a/sysdeps/unix/sysv/linux/aarch64/sysdep.h
+++ b/sysdeps/unix/sysv/linux/aarch64/sysdep.h
@@ -164,8 +164,6 @@
 # define HAVE_CLOCK_GETTIME64_VSYSCALL	"__kernel_clock_gettime"
 # define HAVE_GETTIMEOFDAY_VSYSCALL	"__kernel_gettimeofday"
 
-# define SINGLE_THREAD_BY_GLOBAL		1
-
 # undef INTERNAL_SYSCALL_RAW
 # define INTERNAL_SYSCALL_RAW(name, nr, args...)		\
   ({ long _sys_result;						\
diff --git a/sysdeps/unix/sysv/linux/alpha/sysdep.h b/sysdeps/unix/sysv/linux/alpha/sysdep.h
index 3051a744b4..77ec2b5400 100644
--- a/sysdeps/unix/sysv/linux/alpha/sysdep.h
+++ b/sysdeps/unix/sysv/linux/alpha/sysdep.h
@@ -32,8 +32,6 @@
 #undef SYS_ify
 #define SYS_ify(syscall_name)	__NR_##syscall_name
 
-#define SINGLE_THREAD_BY_GLOBAL 1
-
 #ifdef __ASSEMBLER__
 #include <asm/pal.h>
 #include <alpha/regdef.h>
diff --git a/sysdeps/unix/sysv/linux/arc/single-thread.h b/sysdeps/unix/sysv/linux/arc/single-thread.h
new file mode 100644
index 0000000000..a5d3a2aaf4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/arc/single-thread.h
@@ -0,0 +1,2 @@
+#define SINGLE_THREAD_BY_GLOBAL
+#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/arc/sysdep.h b/sysdeps/unix/sysv/linux/arc/sysdep.h
index 29b0e0161c..d0c1a78381 100644
--- a/sysdeps/unix/sysv/linux/arc/sysdep.h
+++ b/sysdeps/unix/sysv/linux/arc/sysdep.h
@@ -132,8 +132,6 @@ L (call_syscall_err):			ASM_LINE_SEP	\
 
 #else  /* !__ASSEMBLER__ */
 
-# define SINGLE_THREAD_BY_GLOBAL		1
-
 # if IS_IN (libc)
 extern long int __syscall_error (long int);
 hidden_proto (__syscall_error)
diff --git a/sysdeps/unix/sysv/linux/arm/single-thread.h b/sysdeps/unix/sysv/linux/arm/single-thread.h
new file mode 100644
index 0000000000..a5d3a2aaf4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/arm/single-thread.h
@@ -0,0 +1,2 @@
+#define SINGLE_THREAD_BY_GLOBAL
+#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/arm/sysdep.h b/sysdeps/unix/sysv/linux/arm/sysdep.h
index 7bdd218063..1f270b961e 100644
--- a/sysdeps/unix/sysv/linux/arm/sysdep.h
+++ b/sysdeps/unix/sysv/linux/arm/sysdep.h
@@ -408,8 +408,6 @@ __local_syscall_error:						\
 #define INTERNAL_SYSCALL_NCS(number, nr, args...)              \
   INTERNAL_SYSCALL_RAW (number, nr, args)
 
-#define SINGLE_THREAD_BY_GLOBAL	1
-
 #endif	/* __ASSEMBLER__ */
 
 #endif /* linux/arm/sysdep.h */
diff --git a/sysdeps/unix/sysv/linux/hppa/single-thread.h b/sysdeps/unix/sysv/linux/hppa/single-thread.h
new file mode 100644
index 0000000000..a5d3a2aaf4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/hppa/single-thread.h
@@ -0,0 +1,2 @@
+#define SINGLE_THREAD_BY_GLOBAL
+#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/hppa/sysdep.h b/sysdeps/unix/sysv/linux/hppa/sysdep.h
index 42f7705852..2f339a4bd6 100644
--- a/sysdeps/unix/sysv/linux/hppa/sysdep.h
+++ b/sysdeps/unix/sysv/linux/hppa/sysdep.h
@@ -474,6 +474,4 @@ L(pre_end):					ASM_LINE_SEP	\
 #define PTR_MANGLE(var) (void) (var)
 #define PTR_DEMANGLE(var) (void) (var)
 
-#define SINGLE_THREAD_BY_GLOBAL	1
-
 #endif /* _LINUX_HPPA_SYSDEP_H */
diff --git a/sysdeps/unix/sysv/linux/microblaze/single-thread.h b/sysdeps/unix/sysv/linux/microblaze/single-thread.h
new file mode 100644
index 0000000000..a5d3a2aaf4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/microblaze/single-thread.h
@@ -0,0 +1,2 @@
+#define SINGLE_THREAD_BY_GLOBAL
+#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/microblaze/sysdep.h b/sysdeps/unix/sysv/linux/microblaze/sysdep.h
index dfd6312506..fda78f6467 100644
--- a/sysdeps/unix/sysv/linux/microblaze/sysdep.h
+++ b/sysdeps/unix/sysv/linux/microblaze/sysdep.h
@@ -308,8 +308,6 @@ SYSCALL_ERROR_LABEL_DCL:                            \
 # define PTR_MANGLE(var) (void) (var)
 # define PTR_DEMANGLE(var) (void) (var)
 
-# define SINGLE_THREAD_BY_GLOBAL	1
-
 #undef HAVE_INTERNAL_BRK_ADDR_SYMBOL
 #define HAVE_INTERNAL_BRK_ADDR_SYMBOL 1
 
diff --git a/sysdeps/unix/sysv/linux/s390/single-thread.h b/sysdeps/unix/sysv/linux/s390/single-thread.h
new file mode 100644
index 0000000000..a5d3a2aaf4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/s390/single-thread.h
@@ -0,0 +1,2 @@
+#define SINGLE_THREAD_BY_GLOBAL
+#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/s390/sysdep.h b/sysdeps/unix/sysv/linux/s390/sysdep.h
index 78c7e8c7e2..2d0a26779c 100644
--- a/sysdeps/unix/sysv/linux/s390/sysdep.h
+++ b/sysdeps/unix/sysv/linux/s390/sysdep.h
@@ -93,9 +93,6 @@
 #define ASMFMT_5 , "0" (gpr2), "d" (gpr3), "d" (gpr4), "d" (gpr5), "d" (gpr6)
 #define ASMFMT_6 , "0" (gpr2), "d" (gpr3), "d" (gpr4), "d" (gpr5), "d" (gpr6), "d" (gpr7)
 
-#define SINGLE_THREAD_BY_GLOBAL		1
-
-
 #define VDSO_NAME  "LINUX_2.6.29"
 #define VDSO_HASH  123718585
 
diff --git a/sysdeps/unix/sysv/linux/single-thread.h b/sysdeps/unix/sysv/linux/single-thread.h
index 4529a906d2..208edccce6 100644
--- a/sysdeps/unix/sysv/linux/single-thread.h
+++ b/sysdeps/unix/sysv/linux/single-thread.h
@@ -19,6 +19,10 @@
 #ifndef _SINGLE_THREAD_H
 #define _SINGLE_THREAD_H
 
+#ifndef __ASSEMBLER__
+# include <sys/single_threaded.h>
+#endif
+
 /* The default way to check if the process is single thread is by using the
    pthread_t 'multiple_threads' field.  However, for some architectures it is
    faster to either use an extra field on TCB or global variables (the TCB
@@ -27,16 +31,11 @@
    The ABI might define SINGLE_THREAD_BY_GLOBAL to enable the single thread
    check to use global variables instead of the pthread_t field.  */
 
-#ifndef __ASSEMBLER__
-extern int __libc_multiple_threads;
-libc_hidden_proto (__libc_multiple_threads)
-#endif
-
 #if !defined SINGLE_THREAD_BY_GLOBAL || IS_IN (rtld)
 # define SINGLE_THREAD_P \
   (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
 #else
-# define SINGLE_THREAD_P (__libc_multiple_threads == 0)
+# define SINGLE_THREAD_P (__libc_single_threaded != 0)
 #endif
 
 #define RTLD_SINGLE_THREAD_P SINGLE_THREAD_P
diff --git a/sysdeps/unix/sysv/linux/x86_64/single-thread.h b/sysdeps/unix/sysv/linux/x86_64/single-thread.h
new file mode 100644
index 0000000000..a5d3a2aaf4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86_64/single-thread.h
@@ -0,0 +1,2 @@
+#define SINGLE_THREAD_BY_GLOBAL
+#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/x86_64/sysdep.h b/sysdeps/unix/sysv/linux/x86_64/sysdep.h
index e1ce3b62eb..740abefcfd 100644
--- a/sysdeps/unix/sysv/linux/x86_64/sysdep.h
+++ b/sysdeps/unix/sysv/linux/x86_64/sysdep.h
@@ -379,8 +379,6 @@
 
 # define HAVE_CLONE3_WRAPPER			1
 
-# define SINGLE_THREAD_BY_GLOBAL		1
-
 #endif	/* __ASSEMBLER__ */
 
 
-- 
2.34.1



* [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-10 16:35 [PATCH v2 0/4] Simplify internal single-threaded usage Adhemerval Zanella
  2022-06-10 16:35 ` [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded Adhemerval Zanella
  2022-06-10 16:35 ` [PATCH v2 2/4] Replace __libc_multiple_threads with __libc_single_threaded Adhemerval Zanella
@ 2022-06-10 16:35 ` Adhemerval Zanella
  2022-06-10 19:49   ` H.J. Lu
                     ` (3 more replies)
  2022-06-10 16:35 ` [PATCH v2 4/4] Remove single-thread.h Adhemerval Zanella
  3 siblings, 4 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-10 16:35 UTC (permalink / raw)
  To: libc-alpha, Wilco Dijkstra

Instead, use __libc_single_threaded on all architectures.  The TCB
field is renamed to avoid changing the struct layout.

The x86 atomics need some adjustments, since they have the
single-thread optimization built into the inline assembly.  They now
use the SINGLE_THREAD_P macro, and the unused atomic optimizations are
removed.
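
All the rewritten helpers follow the same shape, branching on
SINGLE_THREAD_P to emit the same instruction with or without the lock
prefix (condensed from the atomic-machine.h hunks below):

  #define catomic_increment(mem)             \
    ({                                       \
      if (SINGLE_THREAD_P)                   \
        __single_op ("", (mem), inc);        \
      else                                   \
        atomic_increment (mem);              \
    })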

Checked on x86_64-linux-gnu and i686-linux-gnu.
---
 misc/tst-atomic.c                       |   1 +
 nptl/allocatestack.c                    |   6 -
 nptl/descr.h                            |  17 +-
 nptl/pthread_cancel.c                   |   7 +-
 nptl/pthread_create.c                   |   5 -
 sysdeps/i386/htl/tcb-offsets.sym        |   1 -
 sysdeps/i386/nptl/tcb-offsets.sym       |   1 -
 sysdeps/i386/nptl/tls.h                 |   4 +-
 sysdeps/ia64/nptl/tcb-offsets.sym       |   1 -
 sysdeps/ia64/nptl/tls.h                 |   2 -
 sysdeps/mach/hurd/i386/tls.h            |   4 +-
 sysdeps/nios2/nptl/tcb-offsets.sym      |   1 -
 sysdeps/or1k/nptl/tls.h                 |   2 -
 sysdeps/powerpc/nptl/tcb-offsets.sym    |   3 -
 sysdeps/powerpc/nptl/tls.h              |   3 -
 sysdeps/s390/nptl/tcb-offsets.sym       |   1 -
 sysdeps/s390/nptl/tls.h                 |   6 +-
 sysdeps/sh/nptl/tcb-offsets.sym         |   1 -
 sysdeps/sh/nptl/tls.h                   |   2 -
 sysdeps/sparc/nptl/tcb-offsets.sym      |   1 -
 sysdeps/sparc/nptl/tls.h                |   2 +-
 sysdeps/unix/sysv/linux/single-thread.h |  15 +-
 sysdeps/x86/atomic-machine.h            | 484 +++++++-----------------
 sysdeps/x86_64/nptl/tcb-offsets.sym     |   1 -
 24 files changed, 145 insertions(+), 426 deletions(-)

diff --git a/misc/tst-atomic.c b/misc/tst-atomic.c
index 6d681a7bfd..ddbc618e25 100644
--- a/misc/tst-atomic.c
+++ b/misc/tst-atomic.c
@@ -18,6 +18,7 @@
 
 #include <stdio.h>
 #include <atomic.h>
+#include <support/xthread.h>
 
 #ifndef atomic_t
 # define atomic_t int
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 98f5f6dd85..3e0d01cb52 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -290,9 +290,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 	 stack cache nor will the memory (except the TLS memory) be freed.  */
       pd->user_stack = true;
 
-      /* This is at least the second thread.  */
-      pd->header.multiple_threads = 1;
-
 #ifdef NEED_DL_SYSINFO
       SETUP_THREAD_SYSINFO (pd);
 #endif
@@ -408,9 +405,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 	     descriptor.  */
 	  pd->specific[0] = pd->specific_1stblock;
 
-	  /* This is at least the second thread.  */
-	  pd->header.multiple_threads = 1;
-
 #ifdef NEED_DL_SYSINFO
 	  SETUP_THREAD_SYSINFO (pd);
 #endif
diff --git a/nptl/descr.h b/nptl/descr.h
index bb46b5958e..77b25d8267 100644
--- a/nptl/descr.h
+++ b/nptl/descr.h
@@ -137,22 +137,7 @@ struct pthread
 #else
     struct
     {
-      /* multiple_threads is enabled either when the process has spawned at
-	 least one thread or when a single-threaded process cancels itself.
-	 This enables additional code to introduce locking before doing some
-	 compare_and_exchange operations and also enable cancellation points.
-	 The concepts of multiple threads and cancellation points ideally
-	 should be separate, since it is not necessary for multiple threads to
-	 have been created for cancellation points to be enabled, as is the
-	 case is when single-threaded process cancels itself.
-
-	 Since enabling multiple_threads enables additional code in
-	 cancellation points and compare_and_exchange operations, there is a
-	 potential for an unneeded performance hit when it is enabled in a
-	 single-threaded, self-canceling process.  This is OK though, since a
-	 single-threaded process will enable async cancellation only when it
-	 looks to cancel itself and is hence going to end anyway.  */
-      int multiple_threads;
+      int unused_multiple_threads;
       int gscope_flag;
     } header;
 #endif
diff --git a/nptl/pthread_cancel.c b/nptl/pthread_cancel.c
index e1735279f2..6d26a15d0e 100644
--- a/nptl/pthread_cancel.c
+++ b/nptl/pthread_cancel.c
@@ -157,12 +157,9 @@ __pthread_cancel (pthread_t th)
 
 	/* A single-threaded process should be able to kill itself, since
 	   there is nothing in the POSIX specification that says that it
-	   cannot.  So we set multiple_threads to true so that cancellation
-	   points get executed.  */
-	THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
-#ifndef TLS_MULTIPLE_THREADS_IN_TCB
+	   cannot.  So we clear __libc_single_threaded so that
+	   cancellation points get executed.  */
 	__libc_single_threaded = 0;
-#endif
     }
   while (!atomic_compare_exchange_weak_acquire (&pd->cancelhandling, &oldval,
 						newval));
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index 5633d01c62..d43865352f 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -882,11 +882,6 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
 	   other reason that create_thread chose.  Now let it run
 	   free.  */
 	lll_unlock (pd->lock, LLL_PRIVATE);
-
-      /* We now have for sure more than one thread.  The main thread might
-	 not yet have the flag set.  No need to set the global variable
-	 again if this is what we use.  */
-      THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
     }
 
  out:
diff --git a/sysdeps/i386/htl/tcb-offsets.sym b/sysdeps/i386/htl/tcb-offsets.sym
index 7b7c719369..f3f7df6c06 100644
--- a/sysdeps/i386/htl/tcb-offsets.sym
+++ b/sysdeps/i386/htl/tcb-offsets.sym
@@ -2,7 +2,6 @@
 #include <tls.h>
 #include <kernel-features.h>
 
-MULTIPLE_THREADS_OFFSET offsetof (tcbhead_t, multiple_threads)
 SYSINFO_OFFSET          offsetof (tcbhead_t, sysinfo)
 POINTER_GUARD           offsetof (tcbhead_t, pointer_guard)
 SIGSTATE_OFFSET         offsetof (tcbhead_t, _hurd_sigstate)
diff --git a/sysdeps/i386/nptl/tcb-offsets.sym b/sysdeps/i386/nptl/tcb-offsets.sym
index 2ec9e787c1..1efd1469d8 100644
--- a/sysdeps/i386/nptl/tcb-offsets.sym
+++ b/sysdeps/i386/nptl/tcb-offsets.sym
@@ -6,7 +6,6 @@ RESULT			offsetof (struct pthread, result)
 TID			offsetof (struct pthread, tid)
 CANCELHANDLING		offsetof (struct pthread, cancelhandling)
 CLEANUP_JMP_BUF		offsetof (struct pthread, cleanup_jmp_buf)
-MULTIPLE_THREADS_OFFSET	offsetof (tcbhead_t, multiple_threads)
 SYSINFO_OFFSET		offsetof (tcbhead_t, sysinfo)
 CLEANUP			offsetof (struct pthread, cleanup)
 CLEANUP_PREV		offsetof (struct _pthread_cleanup_buffer, __prev)
diff --git a/sysdeps/i386/nptl/tls.h b/sysdeps/i386/nptl/tls.h
index 91090bf287..48940a9f44 100644
--- a/sysdeps/i386/nptl/tls.h
+++ b/sysdeps/i386/nptl/tls.h
@@ -36,7 +36,7 @@ typedef struct
 			   thread descriptor used by libpthread.  */
   dtv_t *dtv;
   void *self;		/* Pointer to the thread descriptor.  */
-  int multiple_threads;
+  int unused_multiple_threads;
   uintptr_t sysinfo;
   uintptr_t stack_guard;
   uintptr_t pointer_guard;
@@ -57,8 +57,6 @@ typedef struct
 _Static_assert (offsetof (tcbhead_t, __private_ss) == 0x30,
 		"offset of __private_ss != 0x30");
 
-# define TLS_MULTIPLE_THREADS_IN_TCB 1
-
 #else /* __ASSEMBLER__ */
 # include <tcb-offsets.h>
 #endif
diff --git a/sysdeps/ia64/nptl/tcb-offsets.sym b/sysdeps/ia64/nptl/tcb-offsets.sym
index b01f712be2..ab2cb180f9 100644
--- a/sysdeps/ia64/nptl/tcb-offsets.sym
+++ b/sysdeps/ia64/nptl/tcb-offsets.sym
@@ -2,5 +2,4 @@
 #include <tls.h>
 
 TID			offsetof (struct pthread, tid) - TLS_PRE_TCB_SIZE
-MULTIPLE_THREADS_OFFSET offsetof (struct pthread, header.multiple_threads) - TLS_PRE_TCB_SIZE
 SYSINFO_OFFSET		offsetof (tcbhead_t, __private)
diff --git a/sysdeps/ia64/nptl/tls.h b/sysdeps/ia64/nptl/tls.h
index 8ccedb73e6..008e080fc4 100644
--- a/sysdeps/ia64/nptl/tls.h
+++ b/sysdeps/ia64/nptl/tls.h
@@ -36,8 +36,6 @@ typedef struct
 
 register struct pthread *__thread_self __asm__("r13");
 
-# define TLS_MULTIPLE_THREADS_IN_TCB 1
-
 #else /* __ASSEMBLER__ */
 # include <tcb-offsets.h>
 #endif
diff --git a/sysdeps/mach/hurd/i386/tls.h b/sysdeps/mach/hurd/i386/tls.h
index 264ed9a9c5..d33e91c922 100644
--- a/sysdeps/mach/hurd/i386/tls.h
+++ b/sysdeps/mach/hurd/i386/tls.h
@@ -33,7 +33,7 @@ typedef struct
   void *tcb;			/* Points to this structure.  */
   dtv_t *dtv;			/* Vector of pointers to TLS data.  */
   thread_t self;		/* This thread's control port.  */
-  int multiple_threads;
+  int unused_multiple_threads;
   uintptr_t sysinfo;
   uintptr_t stack_guard;
   uintptr_t pointer_guard;
@@ -117,8 +117,6 @@ _hurd_tls_init (tcbhead_t *tcb)
   /* This field is used by TLS accesses to get our "thread pointer"
      from the TLS point of view.  */
   tcb->tcb = tcb;
-  /* We always at least start the sigthread anyway.  */
-  tcb->multiple_threads = 1;
 
   /* Get the first available selector.  */
   int sel = -1;
diff --git a/sysdeps/nios2/nptl/tcb-offsets.sym b/sysdeps/nios2/nptl/tcb-offsets.sym
index 3cd8d984ac..93a695ac7f 100644
--- a/sysdeps/nios2/nptl/tcb-offsets.sym
+++ b/sysdeps/nios2/nptl/tcb-offsets.sym
@@ -8,6 +8,5 @@
 # define __thread_self          ((void *) 0)
 # define thread_offsetof(mem)   ((ptrdiff_t) THREAD_SELF + offsetof (struct pthread, mem))
 
-MULTIPLE_THREADS_OFFSET		thread_offsetof (header.multiple_threads)
 TID_OFFSET			thread_offsetof (tid)
 POINTER_GUARD			(offsetof (tcbhead_t, pointer_guard) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
diff --git a/sysdeps/or1k/nptl/tls.h b/sysdeps/or1k/nptl/tls.h
index c6ffe62c3f..3bb07beef8 100644
--- a/sysdeps/or1k/nptl/tls.h
+++ b/sysdeps/or1k/nptl/tls.h
@@ -35,8 +35,6 @@ typedef struct
 
 register tcbhead_t *__thread_self __asm__("r10");
 
-# define TLS_MULTIPLE_THREADS_IN_TCB 1
-
 /* Get system call information.  */
 # include <sysdep.h>
 
diff --git a/sysdeps/powerpc/nptl/tcb-offsets.sym b/sysdeps/powerpc/nptl/tcb-offsets.sym
index 4c01615ad0..a0ee95f94d 100644
--- a/sysdeps/powerpc/nptl/tcb-offsets.sym
+++ b/sysdeps/powerpc/nptl/tcb-offsets.sym
@@ -10,9 +10,6 @@
 # define thread_offsetof(mem)	((ptrdiff_t) THREAD_SELF + offsetof (struct pthread, mem))
 
 
-#if TLS_MULTIPLE_THREADS_IN_TCB
-MULTIPLE_THREADS_OFFSET		thread_offsetof (header.multiple_threads)
-#endif
 TID				thread_offsetof (tid)
 POINTER_GUARD			(offsetof (tcbhead_t, pointer_guard) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
 TAR_SAVE			(offsetof (tcbhead_t, tar_save) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
diff --git a/sysdeps/powerpc/nptl/tls.h b/sysdeps/powerpc/nptl/tls.h
index 22b0075235..fd5ee51981 100644
--- a/sysdeps/powerpc/nptl/tls.h
+++ b/sysdeps/powerpc/nptl/tls.h
@@ -52,9 +52,6 @@
 # define TLS_DTV_AT_TP	1
 # define TLS_TCB_AT_TP	0
 
-/* We use the multiple_threads field in the pthread struct */
-#define TLS_MULTIPLE_THREADS_IN_TCB	1
-
 /* Get the thread descriptor definition.  */
 # include <nptl/descr.h>
 
diff --git a/sysdeps/s390/nptl/tcb-offsets.sym b/sysdeps/s390/nptl/tcb-offsets.sym
index 9c1c01f353..bc7b267463 100644
--- a/sysdeps/s390/nptl/tcb-offsets.sym
+++ b/sysdeps/s390/nptl/tcb-offsets.sym
@@ -1,6 +1,5 @@
 #include <sysdep.h>
 #include <tls.h>
 
-MULTIPLE_THREADS_OFFSET		offsetof (tcbhead_t, multiple_threads)
 STACK_GUARD			offsetof (tcbhead_t, stack_guard)
 TID				offsetof (struct pthread, tid)
diff --git a/sysdeps/s390/nptl/tls.h b/sysdeps/s390/nptl/tls.h
index ff210ffeb2..d69ed539f7 100644
--- a/sysdeps/s390/nptl/tls.h
+++ b/sysdeps/s390/nptl/tls.h
@@ -35,7 +35,7 @@ typedef struct
 			   thread descriptor used by libpthread.  */
   dtv_t *dtv;
   void *self;		/* Pointer to the thread descriptor.  */
-  int multiple_threads;
+  int unused_multiple_threads;
   uintptr_t sysinfo;
   uintptr_t stack_guard;
   int gscope_flag;
@@ -44,10 +44,6 @@ typedef struct
   void *__private_ss;
 } tcbhead_t;
 
-# ifndef __s390x__
-#  define TLS_MULTIPLE_THREADS_IN_TCB 1
-# endif
-
 #else /* __ASSEMBLER__ */
 # include <tcb-offsets.h>
 #endif
diff --git a/sysdeps/sh/nptl/tcb-offsets.sym b/sysdeps/sh/nptl/tcb-offsets.sym
index 234207779d..4e452d9c6c 100644
--- a/sysdeps/sh/nptl/tcb-offsets.sym
+++ b/sysdeps/sh/nptl/tcb-offsets.sym
@@ -6,7 +6,6 @@ RESULT			offsetof (struct pthread, result)
 TID			offsetof (struct pthread, tid)
 CANCELHANDLING		offsetof (struct pthread, cancelhandling)
 CLEANUP_JMP_BUF		offsetof (struct pthread, cleanup_jmp_buf)
-MULTIPLE_THREADS_OFFSET	offsetof (struct pthread, header.multiple_threads)
 TLS_PRE_TCB_SIZE	sizeof (struct pthread)
 MUTEX_FUTEX		offsetof (pthread_mutex_t, __data.__lock)
 POINTER_GUARD		offsetof (tcbhead_t, pointer_guard)
diff --git a/sysdeps/sh/nptl/tls.h b/sysdeps/sh/nptl/tls.h
index 76591ab6ef..8778cb4ac0 100644
--- a/sysdeps/sh/nptl/tls.h
+++ b/sysdeps/sh/nptl/tls.h
@@ -36,8 +36,6 @@ typedef struct
   uintptr_t pointer_guard;
 } tcbhead_t;
 
-# define TLS_MULTIPLE_THREADS_IN_TCB 1
-
 #else /* __ASSEMBLER__ */
 # include <tcb-offsets.h>
 #endif /* __ASSEMBLER__ */
diff --git a/sysdeps/sparc/nptl/tcb-offsets.sym b/sysdeps/sparc/nptl/tcb-offsets.sym
index f75d02065e..e4a7e4720f 100644
--- a/sysdeps/sparc/nptl/tcb-offsets.sym
+++ b/sysdeps/sparc/nptl/tcb-offsets.sym
@@ -1,6 +1,5 @@
 #include <sysdep.h>
 #include <tls.h>
 
-MULTIPLE_THREADS_OFFSET		offsetof (tcbhead_t, multiple_threads)
 POINTER_GUARD			offsetof (tcbhead_t, pointer_guard)
 TID				offsetof (struct pthread, tid)
diff --git a/sysdeps/sparc/nptl/tls.h b/sysdeps/sparc/nptl/tls.h
index d1e2bb4ad1..b78cf0d6b4 100644
--- a/sysdeps/sparc/nptl/tls.h
+++ b/sysdeps/sparc/nptl/tls.h
@@ -35,7 +35,7 @@ typedef struct
 			   thread descriptor used by libpthread.  */
   dtv_t *dtv;
   void *self;
-  int multiple_threads;
+  int unused_multiple_threads;
 #if __WORDSIZE == 64
   int gscope_flag;
 #endif
diff --git a/sysdeps/unix/sysv/linux/single-thread.h b/sysdeps/unix/sysv/linux/single-thread.h
index 208edccce6..dd80e82c82 100644
--- a/sysdeps/unix/sysv/linux/single-thread.h
+++ b/sysdeps/unix/sysv/linux/single-thread.h
@@ -23,20 +23,7 @@
 # include <sys/single_threaded.h>
 #endif
 
-/* The default way to check if the process is single thread is by using the
-   pthread_t 'multiple_threads' field.  However, for some architectures it is
-   faster to either use an extra field on TCB or global variables (the TCB
-   field is also used on x86 for some single-thread atomic optimizations).
-
-   The ABI might define SINGLE_THREAD_BY_GLOBAL to enable the single thread
-   check to use global variables instead of the pthread_t field.  */
-
-#if !defined SINGLE_THREAD_BY_GLOBAL || IS_IN (rtld)
-# define SINGLE_THREAD_P \
-  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
-#else
-# define SINGLE_THREAD_P (__libc_single_threaded != 0)
-#endif
+#define SINGLE_THREAD_P (__libc_single_threaded != 0)
 
 #define RTLD_SINGLE_THREAD_P SINGLE_THREAD_P
 
diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
index f24f1c71ed..23e087e7e0 100644
--- a/sysdeps/x86/atomic-machine.h
+++ b/sysdeps/x86/atomic-machine.h
@@ -51,292 +51,145 @@
 #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
   (! __sync_bool_compare_and_swap (mem, oldval, newval))
 
+#define __cmpxchg_op(lock, mem, newval, oldval)				      \
+  ({ __typeof (*mem) __ret;						      \
+     if (sizeof (*mem) == 1)						      \
+       asm volatile (lock "cmpxchgb %2, %1"				      \
+		     : "=a" (ret), "+m" (*mem)				      \
+		     : BR_CONSTRAINT (newval), "0" (oldval)	  	      \
+		     : "memory");					      \
+     else if (sizeof (*mem) == 2)					      \
+       asm volatile (lock "cmpxchgw %2, %1"				      \
+		     : "=a" (ret), "+m" (*mem)				      \
+		     : BR_CONSTRAINT (newval), "0" (oldval)	  	      \
+		     : "memory");					      \
+     else if (sizeof (*mem) == 4)					      \
+       asm volatile (lock "cmpxchgl %2, %1"				      \
+		     : "=a" (ret), "+m" (*mem)				      \
+		     : BR_CONSTRAINT (newval), "0" (oldval)	  	      \
+		     : "memory");					      \
+     else if (__HAVE_64B_ATOMICS)					      \
+       asm volatile (lock "cmpxchgq %2, %1"				      \
+                    : "=a" (ret), "+m" (*mem)				      \
+                    : "q" ((int64_t) cast_to_integer (newval)),		      \
+                      "0" ((int64_t) cast_to_integer (oldval))		      \
+                    : "memory");					      \
+     else								      \
+       __atomic_link_error ();						      \
+     __ret; })
 
-#define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
+#define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval)	      \
   ({ __typeof (*mem) ret;						      \
-     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"			      \
-		       "je 0f\n\t"					      \
-		       "lock\n"						      \
-		       "0:\tcmpxchgb %b2, %1"				      \
-		       : "=a" (ret), "=m" (*mem)			      \
-		       : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
-			 "i" (offsetof (tcbhead_t, multiple_threads)));	      \
+     if (SINGLE_THREAD_P)						      \
+       ret = __cmpxchg_op ("", (mem), (newval), (oldval));		      \
+     else								      \
+       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));	      \
      ret; })
 
-#define __arch_c_compare_and_exchange_val_16_acq(mem, newval, oldval) \
+#define __arch_c_compare_and_exchange_val_16_acq(mem, newval, oldval)	      \
   ({ __typeof (*mem) ret;						      \
-     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"			      \
-		       "je 0f\n\t"					      \
-		       "lock\n"						      \
-		       "0:\tcmpxchgw %w2, %1"				      \
-		       : "=a" (ret), "=m" (*mem)			      \
-		       : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
-			 "i" (offsetof (tcbhead_t, multiple_threads)));	      \
+     if (SINGLE_THREAD_P)						      \
+       ret = __cmpxchg_op ("", (mem), (newval), (oldval));		      \
+     else								      \
+       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));	      \
      ret; })
 
-#define __arch_c_compare_and_exchange_val_32_acq(mem, newval, oldval) \
+#define __arch_c_compare_and_exchange_val_32_acq(mem, newval, oldval)	      \
   ({ __typeof (*mem) ret;						      \
-     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"			      \
-		       "je 0f\n\t"					      \
-		       "lock\n"						      \
-		       "0:\tcmpxchgl %2, %1"				      \
-		       : "=a" (ret), "=m" (*mem)			      \
-		       : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
-			 "i" (offsetof (tcbhead_t, multiple_threads)));       \
+     if (SINGLE_THREAD_P)						      \
+       ret = __cmpxchg_op ("", (mem), (newval), (oldval));		      \
+     else								      \
+       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));	      \
      ret; })
 
-#ifdef __x86_64__
-# define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval) \
+#define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval)	      \
   ({ __typeof (*mem) ret;						      \
-     __asm __volatile ("cmpl $0, %%fs:%P5\n\t"				      \
-		       "je 0f\n\t"					      \
-		       "lock\n"						      \
-		       "0:\tcmpxchgq %q2, %1"				      \
-		       : "=a" (ret), "=m" (*mem)			      \
-		       : "q" ((int64_t) cast_to_integer (newval)),	      \
-			 "m" (*mem),					      \
-			 "0" ((int64_t) cast_to_integer (oldval)),	      \
-			 "i" (offsetof (tcbhead_t, multiple_threads)));	      \
-     ret; })
-# define do_exchange_and_add_val_64_acq(pfx, mem, value) 0
-# define do_add_val_64_acq(pfx, mem, value) do { } while (0)
-#else
-/* XXX We do not really need 64-bit compare-and-exchange.  At least
-   not in the moment.  Using it would mean causing portability
-   problems since not many other 32-bit architectures have support for
-   such an operation.  So don't define any code for now.  If it is
-   really going to be used the code below can be used on Intel Pentium
-   and later, but NOT on i486.  */
-# define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  ({ __typeof (*mem) ret = *(mem);					      \
-     __atomic_link_error ();						      \
-     ret = (newval);							      \
-     ret = (oldval);							      \
-     ret; })
-
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval)	      \
-  ({ __typeof (*mem) ret = *(mem);					      \
-     __atomic_link_error ();						      \
-     ret = (newval);							      \
-     ret = (oldval);							      \
-     ret; })
-
-# define do_exchange_and_add_val_64_acq(pfx, mem, value) \
-  ({ __typeof (value) __addval = (value);				      \
-     __typeof (*mem) __result;						      \
-     __typeof (mem) __memp = (mem);					      \
-     __typeof (*mem) __tmpval;						      \
-     __result = *__memp;						      \
-     do									      \
-       __tmpval = __result;						      \
-     while ((__result = pfx##_compare_and_exchange_val_64_acq		      \
-	     (__memp, __result + __addval, __result)) == __tmpval);	      \
-     __result; })
-
-# define do_add_val_64_acq(pfx, mem, value) \
-  {									      \
-    __typeof (value) __addval = (value);				      \
-    __typeof (mem) __memp = (mem);					      \
-    __typeof (*mem) __oldval = *__memp;					      \
-    __typeof (*mem) __tmpval;						      \
-    do									      \
-      __tmpval = __oldval;						      \
-    while ((__oldval = pfx##_compare_and_exchange_val_64_acq		      \
-	    (__memp, __oldval + __addval, __oldval)) == __tmpval);	      \
-  }
-#endif
-
-
-/* Note that we need no lock prefix.  */
-#define atomic_exchange_acq(mem, newvalue) \
-  ({ __typeof (*mem) result;						      \
-     if (sizeof (*mem) == 1)						      \
-       __asm __volatile ("xchgb %b0, %1"				      \
-			 : "=q" (result), "=m" (*mem)			      \
-			 : "0" (newvalue), "m" (*mem));			      \
-     else if (sizeof (*mem) == 2)					      \
-       __asm __volatile ("xchgw %w0, %1"				      \
-			 : "=r" (result), "=m" (*mem)			      \
-			 : "0" (newvalue), "m" (*mem));			      \
-     else if (sizeof (*mem) == 4)					      \
-       __asm __volatile ("xchgl %0, %1"					      \
-			 : "=r" (result), "=m" (*mem)			      \
-			 : "0" (newvalue), "m" (*mem));			      \
-     else if (__HAVE_64B_ATOMICS)					      \
-       __asm __volatile ("xchgq %q0, %1"				      \
-			 : "=r" (result), "=m" (*mem)			      \
-			 : "0" ((int64_t) cast_to_integer (newvalue)),        \
-			   "m" (*mem));					      \
-     else								      \
-       {								      \
-	 result = 0;							      \
-	 __atomic_link_error ();					      \
-       }								      \
-     result; })
-
-
-#define __arch_exchange_and_add_body(lock, pfx, mem, value) \
-  ({ __typeof (*mem) __result;						      \
-     __typeof (value) __addval = (value);				      \
-     if (sizeof (*mem) == 1)						      \
-       __asm __volatile (lock "xaddb %b0, %1"				      \
-			 : "=q" (__result), "=m" (*mem)			      \
-			 : "0" (__addval), "m" (*mem),			      \
-			   "i" (offsetof (tcbhead_t, multiple_threads)));     \
-     else if (sizeof (*mem) == 2)					      \
-       __asm __volatile (lock "xaddw %w0, %1"				      \
-			 : "=r" (__result), "=m" (*mem)			      \
-			 : "0" (__addval), "m" (*mem),			      \
-			   "i" (offsetof (tcbhead_t, multiple_threads)));     \
-     else if (sizeof (*mem) == 4)					      \
-       __asm __volatile (lock "xaddl %0, %1"				      \
-			 : "=r" (__result), "=m" (*mem)			      \
-			 : "0" (__addval), "m" (*mem),			      \
-			   "i" (offsetof (tcbhead_t, multiple_threads)));     \
-     else if (__HAVE_64B_ATOMICS)					      \
-       __asm __volatile (lock "xaddq %q0, %1"				      \
-			 : "=r" (__result), "=m" (*mem)			      \
-			 : "0" ((int64_t) cast_to_integer (__addval)),     \
-			   "m" (*mem),					      \
-			   "i" (offsetof (tcbhead_t, multiple_threads)));     \
+     if (SINGLE_THREAD_P)						      \
+       __cmpxchg_op ("", (mem), (newval), (oldval));			      \
      else								      \
-       __result = do_exchange_and_add_val_64_acq (pfx, (mem), __addval);      \
-     __result; })
-
-#define atomic_exchange_and_add(mem, value) \
-  __sync_fetch_and_add (mem, value)
-
-#define __arch_exchange_and_add_cprefix \
-  "cmpl $0, %%" SEG_REG ":%P4\n\tje 0f\n\tlock\n0:\t"
-
-#define catomic_exchange_and_add(mem, value) \
-  __arch_exchange_and_add_body (__arch_exchange_and_add_cprefix, __arch_c,    \
-				mem, value)
-
-
-#define __arch_add_body(lock, pfx, apfx, mem, value) \
-  do {									      \
-    if (__builtin_constant_p (value) && (value) == 1)			      \
-      pfx##_increment (mem);						      \
-    else if (__builtin_constant_p (value) && (value) == -1)		      \
-      pfx##_decrement (mem);						      \
-    else if (sizeof (*mem) == 1)					      \
-      __asm __volatile (lock "addb %b1, %0"				      \
-			: "=m" (*mem)					      \
-			: IBR_CONSTRAINT (value), "m" (*mem),		      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 2)					      \
-      __asm __volatile (lock "addw %w1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" (value), "m" (*mem),			      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 4)					      \
-      __asm __volatile (lock "addl %1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" (value), "m" (*mem),			      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (__HAVE_64B_ATOMICS)					      \
-      __asm __volatile (lock "addq %q1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" ((int64_t) cast_to_integer (value)),	      \
-			  "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else								      \
-      do_add_val_64_acq (apfx, (mem), (value));				      \
-  } while (0)
-
-# define atomic_add(mem, value) \
-  __arch_add_body (LOCK_PREFIX, atomic, __arch, mem, value)
-
-#define __arch_add_cprefix \
-  "cmpl $0, %%" SEG_REG ":%P3\n\tje 0f\n\tlock\n0:\t"
-
-#define catomic_add(mem, value) \
-  __arch_add_body (__arch_add_cprefix, atomic, __arch_c, mem, value)
+       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));	      \
+     ret; })
 
 
-#define atomic_add_negative(mem, value) \
-  ({ unsigned char __result;						      \
+#define __xchg_op(lock, mem, arg, op)					      \
+  ({ __typeof (*mem) __ret = (arg);					      \
      if (sizeof (*mem) == 1)						      \
-       __asm __volatile (LOCK_PREFIX "addb %b2, %0; sets %1"		      \
-			 : "=m" (*mem), "=qm" (__result)		      \
-			 : IBR_CONSTRAINT (value), "m" (*mem));		      \
+       __asm __volatile (lock #op "b %b0, %1"				      \
+			 : "=q" (__ret), "=m" (*mem)			      \
+			 : "0" (arg), "m" (*mem)			      \
+			 : "memory", "cc");				      \
      else if (sizeof (*mem) == 2)					      \
-       __asm __volatile (LOCK_PREFIX "addw %w2, %0; sets %1"		      \
-			 : "=m" (*mem), "=qm" (__result)		      \
-			 : "ir" (value), "m" (*mem));			      \
+       __asm __volatile (lock #op "w %w0, %1"				      \
+			 : "=r" (__ret), "=m" (*mem)			      \
+			 : "0" (arg), "m" (*mem)			      \
+			 : "memory", "cc");				      \
      else if (sizeof (*mem) == 4)					      \
-       __asm __volatile (LOCK_PREFIX "addl %2, %0; sets %1"		      \
-			 : "=m" (*mem), "=qm" (__result)		      \
-			 : "ir" (value), "m" (*mem));			      \
+       __asm __volatile (lock #op "l %0, %1"				      \
+			 : "=r" (__ret), "=m" (*mem)			      \
+			 : "0" (arg), "m" (*mem)			      \
+			 : "memory", "cc");				      \
      else if (__HAVE_64B_ATOMICS)					      \
-       __asm __volatile (LOCK_PREFIX "addq %q2, %0; sets %1"		      \
-			 : "=m" (*mem), "=qm" (__result)		      \
-			 : "ir" ((int64_t) cast_to_integer (value)),	      \
-			   "m" (*mem));					      \
+       __asm __volatile (lock #op "q %q0, %1"				      \
+			 : "=r" (__ret), "=m" (*mem)			      \
+			 : "0" ((int64_t) cast_to_integer (arg)),	      \
+			   "m" (*mem)					      \
+			 : "memory", "cc");				      \
      else								      \
        __atomic_link_error ();						      \
-     __result; })
-
+     __ret; })
 
-#define atomic_add_zero(mem, value) \
-  ({ unsigned char __result;						      \
+#define __single_op(lock, mem, op)					      \
+  ({									      \
      if (sizeof (*mem) == 1)						      \
-       __asm __volatile (LOCK_PREFIX "addb %b2, %0; setz %1"		      \
-			 : "=m" (*mem), "=qm" (__result)		      \
-			 : IBR_CONSTRAINT (value), "m" (*mem));		      \
+       __asm __volatile (lock #op "b %b0"				      \
+			 : "=m" (*mem)					      \
+			 : "m" (*mem)					      \
+			 : "memory", "cc");				      \
      else if (sizeof (*mem) == 2)					      \
-       __asm __volatile (LOCK_PREFIX "addw %w2, %0; setz %1"		      \
-			 : "=m" (*mem), "=qm" (__result)		      \
-			 : "ir" (value), "m" (*mem));			      \
+       __asm __volatile (lock #op "w %b0"				      \
+			 : "=m" (*mem)					      \
+			 : "m" (*mem)					      \
+			 : "memory", "cc");				      \
      else if (sizeof (*mem) == 4)					      \
-       __asm __volatile (LOCK_PREFIX "addl %2, %0; setz %1"		      \
-			 : "=m" (*mem), "=qm" (__result)		      \
-			 : "ir" (value), "m" (*mem));			      \
+       __asm __volatile (lock #op "l %b0"				      \
+			 : "=m" (*mem)					      \
+			 : "m" (*mem)					      \
+			 : "memory", "cc");				      \
      else if (__HAVE_64B_ATOMICS)					      \
-       __asm __volatile (LOCK_PREFIX "addq %q2, %0; setz %1"		      \
-			 : "=m" (*mem), "=qm" (__result)		      \
-			 : "ir" ((int64_t) cast_to_integer (value)),	      \
-			   "m" (*mem));					      \
+       __asm __volatile (lock #op "q %b0"				      \
+			 : "=m" (*mem)					      \
+			 : "m" (*mem)					      \
+			 : "memory", "cc");				      \
      else								      \
-       __atomic_link_error ();					      \
-     __result; })
+       __atomic_link_error ();						      \
+  })
 
+/* Note that we need no lock prefix.  */
+#define atomic_exchange_acq(mem, newvalue)				      \
+  __xchg_op ("", (mem), (newvalue), xchg)
 
-#define __arch_increment_body(lock, pfx, mem) \
-  do {									      \
-    if (sizeof (*mem) == 1)						      \
-      __asm __volatile (lock "incb %b0"					      \
-			: "=m" (*mem)					      \
-			: "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 2)					      \
-      __asm __volatile (lock "incw %w0"					      \
-			: "=m" (*mem)					      \
-			: "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 4)					      \
-      __asm __volatile (lock "incl %0"					      \
-			: "=m" (*mem)					      \
-			: "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (__HAVE_64B_ATOMICS)					      \
-      __asm __volatile (lock "incq %q0"					      \
-			: "=m" (*mem)					      \
-			: "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else								      \
-      do_add_val_64_acq (pfx, mem, 1);					      \
-  } while (0)
+#define atomic_add(mem, value) \
+  __xchg_op (LOCK_PREFIX, (mem), (value), add)
 
-#define atomic_increment(mem) __arch_increment_body (LOCK_PREFIX, __arch, mem)
+#define catomic_add(mem, value)						      \
+  ({									      \
+    if (SINGLE_THREAD_P)						      \
+      __xchg_op ("", (mem), (value), add);				      \
+    else								      \
+      atomic_add (mem, value);						      \
+  })
 
-#define __arch_increment_cprefix \
-  "cmpl $0, %%" SEG_REG ":%P2\n\tje 0f\n\tlock\n0:\t"
 
-#define catomic_increment(mem) \
-  __arch_increment_body (__arch_increment_cprefix, __arch_c, mem)
+#define atomic_increment(mem) \
+  __single_op (LOCK_PREFIX, (mem), inc)
 
+#define catomic_increment(mem)						      \
+  ({									      \
+    if (SINGLE_THREAD_P)						      \
+      __single_op ("", (mem), inc);					      \
+    else								      \
+      atomic_increment (mem);						      \
+  })
 
 #define atomic_increment_and_test(mem) \
   ({ unsigned char __result;						      \
@@ -357,43 +210,20 @@
 			 : "=m" (*mem), "=qm" (__result)		      \
 			 : "m" (*mem));					      \
      else								      \
-       __atomic_link_error ();					      \
+       __atomic_link_error ();						      \
      __result; })
 
 
-#define __arch_decrement_body(lock, pfx, mem) \
-  do {									      \
-    if (sizeof (*mem) == 1)						      \
-      __asm __volatile (lock "decb %b0"					      \
-			: "=m" (*mem)					      \
-			: "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 2)					      \
-      __asm __volatile (lock "decw %w0"					      \
-			: "=m" (*mem)					      \
-			: "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 4)					      \
-      __asm __volatile (lock "decl %0"					      \
-			: "=m" (*mem)					      \
-			: "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (__HAVE_64B_ATOMICS)					      \
-      __asm __volatile (lock "decq %q0"					      \
-			: "=m" (*mem)					      \
-			: "m" (*mem),					      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else								      \
-      do_add_val_64_acq (pfx, mem, -1);					      \
-  } while (0)
-
-#define atomic_decrement(mem) __arch_decrement_body (LOCK_PREFIX, __arch, mem)
+#define atomic_decrement(mem)						      \
+  __single_op (LOCK_PREFIX, (mem), dec)
 
-#define __arch_decrement_cprefix \
-  "cmpl $0, %%" SEG_REG ":%P2\n\tje 0f\n\tlock\n0:\t"
-
-#define catomic_decrement(mem) \
-  __arch_decrement_body (__arch_decrement_cprefix, __arch_c, mem)
+#define catomic_decrement(mem)						      \
+  ({									      \
+    if (SINGLE_THREAD_P)						      \
+      __single_op ("", (mem), dec);					      \
+    else								      \
+      atomic_decrement (mem);						      \
+  })
 
 
 #define atomic_decrement_and_test(mem) \
@@ -463,73 +293,31 @@
 			 : "=q" (__result), "=m" (*mem)			      \
 			 : "m" (*mem), "ir" (bit));			      \
      else							      	      \
-       __atomic_link_error ();					      \
+       __atomic_link_error ();						      \
      __result; })
 
 
-#define __arch_and_body(lock, mem, mask) \
-  do {									      \
-    if (sizeof (*mem) == 1)						      \
-      __asm __volatile (lock "andb %b1, %0"				      \
-			: "=m" (*mem)					      \
-			: IBR_CONSTRAINT (mask), "m" (*mem),		      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 2)					      \
-      __asm __volatile (lock "andw %w1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" (mask), "m" (*mem),			      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 4)					      \
-      __asm __volatile (lock "andl %1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" (mask), "m" (*mem),			      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (__HAVE_64B_ATOMICS)					      \
-      __asm __volatile (lock "andq %q1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" (mask), "m" (*mem),			      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else								      \
-      __atomic_link_error ();						      \
-  } while (0)
-
-#define __arch_cprefix \
-  "cmpl $0, %%" SEG_REG ":%P3\n\tje 0f\n\tlock\n0:\t"
-
-#define atomic_and(mem, mask) __arch_and_body (LOCK_PREFIX, mem, mask)
-
-#define catomic_and(mem, mask) __arch_and_body (__arch_cprefix, mem, mask)
+#define atomic_and(mem, mask)						      \
+  __xchg_op (LOCK_PREFIX, (mem), (mask), and)
 
+#define catomic_and(mem, mask) \
+  ({									      \
+    if (SINGLE_THREAD_P)						      \
+      __xchg_op ("", (mem), (mask), and);				      \
+    else								      \
+      atomic_and (mem, mask);						      \
+  })
 
-#define __arch_or_body(lock, mem, mask) \
-  do {									      \
-    if (sizeof (*mem) == 1)						      \
-      __asm __volatile (lock "orb %b1, %0"				      \
-			: "=m" (*mem)					      \
-			: IBR_CONSTRAINT (mask), "m" (*mem),		      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 2)					      \
-      __asm __volatile (lock "orw %w1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" (mask), "m" (*mem),			      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (sizeof (*mem) == 4)					      \
-      __asm __volatile (lock "orl %1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" (mask), "m" (*mem),			      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else if (__HAVE_64B_ATOMICS)					      \
-      __asm __volatile (lock "orq %q1, %0"				      \
-			: "=m" (*mem)					      \
-			: "ir" (mask), "m" (*mem),			      \
-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
-    else								      \
-      __atomic_link_error ();						      \
-  } while (0)
-
-#define atomic_or(mem, mask) __arch_or_body (LOCK_PREFIX, mem, mask)
+#define atomic_or(mem, mask)						      \
+  __xchg_op (LOCK_PREFIX, (mem), (mask), or)
 
-#define catomic_or(mem, mask) __arch_or_body (__arch_cprefix, mem, mask)
+#define catomic_or(mem, mask) \
+  ({									      \
+    if (SINGLE_THREAD_P)						      \
+      __xchg_op ("", (mem), (mask), or);				      \
+    else								      \
+      atomic_or (mem, mask);						      \
+  })
 
 /* We don't use mfence because it is supposedly slower due to having to
    provide stronger guarantees (e.g., regarding self-modifying code).  */
diff --git a/sysdeps/x86_64/nptl/tcb-offsets.sym b/sysdeps/x86_64/nptl/tcb-offsets.sym
index 2bbd563a6c..8ec55a7ea8 100644
--- a/sysdeps/x86_64/nptl/tcb-offsets.sym
+++ b/sysdeps/x86_64/nptl/tcb-offsets.sym
@@ -9,7 +9,6 @@ CLEANUP_JMP_BUF		offsetof (struct pthread, cleanup_jmp_buf)
 CLEANUP			offsetof (struct pthread, cleanup)
 CLEANUP_PREV		offsetof (struct _pthread_cleanup_buffer, __prev)
 MUTEX_FUTEX		offsetof (pthread_mutex_t, __data.__lock)
-MULTIPLE_THREADS_OFFSET	offsetof (tcbhead_t, multiple_threads)
 POINTER_GUARD		offsetof (tcbhead_t, pointer_guard)
 FEATURE_1_OFFSET	offsetof (tcbhead_t, feature_1)
 SSP_BASE_OFFSET		offsetof (tcbhead_t, ssp_base)
-- 
2.34.1
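
As a rough sketch of the pattern the rewritten catomic_* macros
implement (illustrative only, not code from the patch; the counter and
function names are hypothetical, and __libc_single_threaded assumes
glibc >= 2.32):

#include <sys/single_threaded.h>

static int counter;

static void
bump_counter (void)
{
  /* A lone thread cannot race with itself, so a plain
     read-modify-write suffices and the lock prefix can be skipped.  */
  if (__libc_single_threaded)
    counter++;
  /* Once another thread may exist, fall back to a real atomic
     operation (here via the compiler builtin).  */
  else
    __atomic_fetch_add (&counter, 1, __ATOMIC_ACQ_REL);
}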


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 4/4] Remove single-thread.h
  2022-06-10 16:35 [PATCH v2 0/4] Simplify internal single-threaded usage Adhemerval Zanella
                   ` (2 preceding siblings ...)
  2022-06-10 16:35 ` [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB Adhemerval Zanella
@ 2022-06-10 16:35 ` Adhemerval Zanella
  3 siblings, 0 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-10 16:35 UTC (permalink / raw)
  To: libc-alpha, Wilco Dijkstra

And move the SINGLE_THREAD_P macro to sys/single_threaded.h.
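
For illustration, a minimal sketch of what a caller of the consolidated
header can look like (the helper and its lock are hypothetical, not
part of this patch):

#include <sys/single_threaded.h>  /* Sole provider of SINGLE_THREAD_P now.  */
#include <libc-lock.h>

__libc_lock_define_initialized (static, hypothetical_lock);

static void
run_locked (void (*work) (void))
{
  if (SINGLE_THREAD_P)
    /* No other thread can contend, so the lock can be elided.  */
    work ();
  else
    {
      __libc_lock_lock (hypothetical_lock);
      work ();
      __libc_lock_unlock (hypothetical_lock);
    }
}
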
---
 include/sys/single_threaded.h                 | 15 +++++++---
 sysdeps/generic/single-thread.h               | 25 ----------------
 sysdeps/mach/hurd/sysdep-cancel.h             |  5 ----
 sysdeps/unix/sysdep.h                         |  2 +-
 .../unix/sysv/linux/aarch64/single-thread.h   |  2 --
 sysdeps/unix/sysv/linux/arc/single-thread.h   |  2 --
 sysdeps/unix/sysv/linux/arm/single-thread.h   |  2 --
 sysdeps/unix/sysv/linux/hppa/single-thread.h  |  2 --
 .../sysv/linux/microblaze/single-thread.h     |  2 --
 sysdeps/unix/sysv/linux/s390/single-thread.h  |  2 --
 sysdeps/unix/sysv/linux/single-thread.h       | 30 -------------------
 .../unix/sysv/linux/x86_64/single-thread.h    |  2 --
 12 files changed, 12 insertions(+), 79 deletions(-)
 delete mode 100644 sysdeps/generic/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/aarch64/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/arc/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/arm/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/hppa/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/microblaze/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/s390/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/single-thread.h
 delete mode 100644 sysdeps/unix/sysv/linux/x86_64/single-thread.h

diff --git a/include/sys/single_threaded.h b/include/sys/single_threaded.h
index 258b01e0b2..c08bd52ab8 100644
--- a/include/sys/single_threaded.h
+++ b/include/sys/single_threaded.h
@@ -1,12 +1,19 @@
-#include <misc/sys/single_threaded.h>
+#ifndef __ASSEMBLER__
+# include <misc/sys/single_threaded.h>
 
-#ifndef _ISOMAC
+# ifndef _ISOMAC
 
 libc_hidden_proto (__libc_single_threaded);
 
-# ifdef SHARED
+#  ifdef SHARED
 extern __typeof (__libc_single_threaded) *__libc_external_single_threaded
   attribute_hidden;
+#  endif
+
+#  define SINGLE_THREAD_P (__libc_single_threaded != 0)
+
+#  define RTLD_SINGLE_THREAD_P SINGLE_THREAD_P
+
 # endif
 
-#endif
+#endif /* __ASSEMBLER__ */
diff --git a/sysdeps/generic/single-thread.h b/sysdeps/generic/single-thread.h
deleted file mode 100644
index 7f8222b38a..0000000000
--- a/sysdeps/generic/single-thread.h
+++ /dev/null
@@ -1,25 +0,0 @@
-/* Single thread optimization, generic version.
-   Copyright (C) 2019-2022 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <https://www.gnu.org/licenses/>.  */
-
-#ifndef _SINGLE_THREAD_H
-#define _SINGLE_THREAD_H
-
-#define SINGLE_THREAD_P (0)
-#define RTLD_SINGLE_THREAD_P (0)
-
-#endif /* _SINGLE_THREAD_H  */
diff --git a/sysdeps/mach/hurd/sysdep-cancel.h b/sysdeps/mach/hurd/sysdep-cancel.h
index 669c17151a..9311367ab9 100644
--- a/sysdeps/mach/hurd/sysdep-cancel.h
+++ b/sysdeps/mach/hurd/sysdep-cancel.h
@@ -6,11 +6,6 @@ void __pthread_disable_asynccancel (int oldtype);
 #pragma weak __pthread_enable_asynccancel
 #pragma weak __pthread_disable_asynccancel
 
-/* Always multi-thread (since there's at least the sig handler), but no
-   handling enabled.  */
-#define SINGLE_THREAD_P (0)
-#define RTLD_SINGLE_THREAD_P (0)
-
 #define LIBC_CANCEL_ASYNC() ({ \
 	int __cancel_oldtype = 0; \
 	if (__pthread_enable_asynccancel) \
diff --git a/sysdeps/unix/sysdep.h b/sysdeps/unix/sysdep.h
index a1d9df4c73..a8abecb92b 100644
--- a/sysdeps/unix/sysdep.h
+++ b/sysdeps/unix/sysdep.h
@@ -16,7 +16,7 @@
    <https://www.gnu.org/licenses/>.  */
 
 #include <sysdeps/generic/sysdep.h>
-#include <single-thread.h>
+#include <sys/single_threaded.h>
 #include <sys/syscall.h>
 #define	HAVE_SYSCALLS
 
diff --git a/sysdeps/unix/sysv/linux/aarch64/single-thread.h b/sysdeps/unix/sysv/linux/aarch64/single-thread.h
deleted file mode 100644
index a5d3a2aaf4..0000000000
--- a/sysdeps/unix/sysv/linux/aarch64/single-thread.h
+++ /dev/null
@@ -1,2 +0,0 @@
-#define SINGLE_THREAD_BY_GLOBAL
-#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/arc/single-thread.h b/sysdeps/unix/sysv/linux/arc/single-thread.h
deleted file mode 100644
index a5d3a2aaf4..0000000000
--- a/sysdeps/unix/sysv/linux/arc/single-thread.h
+++ /dev/null
@@ -1,2 +0,0 @@
-#define SINGLE_THREAD_BY_GLOBAL
-#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/arm/single-thread.h b/sysdeps/unix/sysv/linux/arm/single-thread.h
deleted file mode 100644
index a5d3a2aaf4..0000000000
--- a/sysdeps/unix/sysv/linux/arm/single-thread.h
+++ /dev/null
@@ -1,2 +0,0 @@
-#define SINGLE_THREAD_BY_GLOBAL
-#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/hppa/single-thread.h b/sysdeps/unix/sysv/linux/hppa/single-thread.h
deleted file mode 100644
index a5d3a2aaf4..0000000000
--- a/sysdeps/unix/sysv/linux/hppa/single-thread.h
+++ /dev/null
@@ -1,2 +0,0 @@
-#define SINGLE_THREAD_BY_GLOBAL
-#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/microblaze/single-thread.h b/sysdeps/unix/sysv/linux/microblaze/single-thread.h
deleted file mode 100644
index a5d3a2aaf4..0000000000
--- a/sysdeps/unix/sysv/linux/microblaze/single-thread.h
+++ /dev/null
@@ -1,2 +0,0 @@
-#define SINGLE_THREAD_BY_GLOBAL
-#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/s390/single-thread.h b/sysdeps/unix/sysv/linux/s390/single-thread.h
deleted file mode 100644
index a5d3a2aaf4..0000000000
--- a/sysdeps/unix/sysv/linux/s390/single-thread.h
+++ /dev/null
@@ -1,2 +0,0 @@
-#define SINGLE_THREAD_BY_GLOBAL
-#include_next <single-thread.h>
diff --git a/sysdeps/unix/sysv/linux/single-thread.h b/sysdeps/unix/sysv/linux/single-thread.h
deleted file mode 100644
index dd80e82c82..0000000000
--- a/sysdeps/unix/sysv/linux/single-thread.h
+++ /dev/null
@@ -1,30 +0,0 @@
-/* Single thread optimization, Linux version.
-   Copyright (C) 2019-2022 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <https://www.gnu.org/licenses/>.  */
-
-#ifndef _SINGLE_THREAD_H
-#define _SINGLE_THREAD_H
-
-#ifndef __ASSEMBLER__
-# include <sys/single_threaded.h>
-#endif
-
-#define SINGLE_THREAD_P (__libc_single_threaded != 0)
-
-#define RTLD_SINGLE_THREAD_P SINGLE_THREAD_P
-
-#endif /* _SINGLE_THREAD_H  */
diff --git a/sysdeps/unix/sysv/linux/x86_64/single-thread.h b/sysdeps/unix/sysv/linux/x86_64/single-thread.h
deleted file mode 100644
index a5d3a2aaf4..0000000000
--- a/sysdeps/unix/sysv/linux/x86_64/single-thread.h
+++ /dev/null
@@ -1,2 +0,0 @@
-#define SINGLE_THREAD_BY_GLOBAL
-#include_next <single-thread.h>
-- 
2.34.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-10 16:35 ` [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB Adhemerval Zanella
@ 2022-06-10 19:49   ` H.J. Lu
  2022-06-10 21:00   ` Noah Goldstein
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2022-06-10 19:49 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library, Wilco Dijkstra

On Fri, Jun 10, 2022 at 9:40 AM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Instead use __libc_single_threaded on all architectures.  The TCB
> field is renamed to avoid changing the struct layout.
>
> The x86 atomics need some adjustments since they have a single-thread
> optimization built into the inline assembly.  They now use
> SINGLE_THREAD_P, and the atomic optimizations that are no longer used
> are removed.
>
> Checked on x86_64-linux-gnu and i686-linux-gnu.
> ---
>  misc/tst-atomic.c                       |   1 +
>  nptl/allocatestack.c                    |   6 -
>  nptl/descr.h                            |  17 +-
>  nptl/pthread_cancel.c                   |   7 +-
>  nptl/pthread_create.c                   |   5 -
>  sysdeps/i386/htl/tcb-offsets.sym        |   1 -
>  sysdeps/i386/nptl/tcb-offsets.sym       |   1 -
>  sysdeps/i386/nptl/tls.h                 |   4 +-
>  sysdeps/ia64/nptl/tcb-offsets.sym       |   1 -
>  sysdeps/ia64/nptl/tls.h                 |   2 -
>  sysdeps/mach/hurd/i386/tls.h            |   4 +-
>  sysdeps/nios2/nptl/tcb-offsets.sym      |   1 -
>  sysdeps/or1k/nptl/tls.h                 |   2 -
>  sysdeps/powerpc/nptl/tcb-offsets.sym    |   3 -
>  sysdeps/powerpc/nptl/tls.h              |   3 -
>  sysdeps/s390/nptl/tcb-offsets.sym       |   1 -
>  sysdeps/s390/nptl/tls.h                 |   6 +-
>  sysdeps/sh/nptl/tcb-offsets.sym         |   1 -
>  sysdeps/sh/nptl/tls.h                   |   2 -
>  sysdeps/sparc/nptl/tcb-offsets.sym      |   1 -
>  sysdeps/sparc/nptl/tls.h                |   2 +-
>  sysdeps/unix/sysv/linux/single-thread.h |  15 +-
>  sysdeps/x86/atomic-machine.h            | 484 +++++++-----------------
>  sysdeps/x86_64/nptl/tcb-offsets.sym     |   1 -
>  24 files changed, 145 insertions(+), 426 deletions(-)
>
> diff --git a/misc/tst-atomic.c b/misc/tst-atomic.c
> index 6d681a7bfd..ddbc618e25 100644
> --- a/misc/tst-atomic.c
> +++ b/misc/tst-atomic.c
> @@ -18,6 +18,7 @@
>
>  #include <stdio.h>
>  #include <atomic.h>
> +#include <support/xthread.h>
>
>  #ifndef atomic_t
>  # define atomic_t int
> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
> index 98f5f6dd85..3e0d01cb52 100644
> --- a/nptl/allocatestack.c
> +++ b/nptl/allocatestack.c
> @@ -290,9 +290,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>          stack cache nor will the memory (except the TLS memory) be freed.  */
>        pd->user_stack = true;
>
> -      /* This is at least the second thread.  */
> -      pd->header.multiple_threads = 1;
> -
>  #ifdef NEED_DL_SYSINFO
>        SETUP_THREAD_SYSINFO (pd);
>  #endif
> @@ -408,9 +405,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>              descriptor.  */
>           pd->specific[0] = pd->specific_1stblock;
>
> -         /* This is at least the second thread.  */
> -         pd->header.multiple_threads = 1;
> -
>  #ifdef NEED_DL_SYSINFO
>           SETUP_THREAD_SYSINFO (pd);
>  #endif
> diff --git a/nptl/descr.h b/nptl/descr.h
> index bb46b5958e..77b25d8267 100644
> --- a/nptl/descr.h
> +++ b/nptl/descr.h
> @@ -137,22 +137,7 @@ struct pthread
>  #else
>      struct
>      {
> -      /* multiple_threads is enabled either when the process has spawned at
> -        least one thread or when a single-threaded process cancels itself.
> -        This enables additional code to introduce locking before doing some
> -        compare_and_exchange operations and also enable cancellation points.
> -        The concepts of multiple threads and cancellation points ideally
> -        should be separate, since it is not necessary for multiple threads to
> -        have been created for cancellation points to be enabled, as is the
> -        case is when single-threaded process cancels itself.
> -
> -        Since enabling multiple_threads enables additional code in
> -        cancellation points and compare_and_exchange operations, there is a
> -        potential for an unneeded performance hit when it is enabled in a
> -        single-threaded, self-canceling process.  This is OK though, since a
> -        single-threaded process will enable async cancellation only when it
> -        looks to cancel itself and is hence going to end anyway.  */
> -      int multiple_threads;
> +      int unused_multiple_threads;
>        int gscope_flag;
>      } header;
>  #endif
> diff --git a/nptl/pthread_cancel.c b/nptl/pthread_cancel.c
> index e1735279f2..6d26a15d0e 100644
> --- a/nptl/pthread_cancel.c
> +++ b/nptl/pthread_cancel.c
> @@ -157,12 +157,9 @@ __pthread_cancel (pthread_t th)
>
>         /* A single-threaded process should be able to kill itself, since
>            there is nothing in the POSIX specification that says that it
> -          cannot.  So we set multiple_threads to true so that cancellation
> -          points get executed.  */
> -       THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
> -#ifndef TLS_MULTIPLE_THREADS_IN_TCB
> +          cannot.  So we set __libc_single_threaded to false so that
> +          cancellation points get executed.  */
>         __libc_single_threaded = 0;
> -#endif
>      }
>    while (!atomic_compare_exchange_weak_acquire (&pd->cancelhandling, &oldval,
>                                                 newval));
> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
> index 5633d01c62..d43865352f 100644
> --- a/nptl/pthread_create.c
> +++ b/nptl/pthread_create.c
> @@ -882,11 +882,6 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
>            other reason that create_thread chose.  Now let it run
>            free.  */
>         lll_unlock (pd->lock, LLL_PRIVATE);
> -
> -      /* We now have for sure more than one thread.  The main thread might
> -        not yet have the flag set.  No need to set the global variable
> -        again if this is what we use.  */
> -      THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
>      }
>
>   out:
> diff --git a/sysdeps/i386/htl/tcb-offsets.sym b/sysdeps/i386/htl/tcb-offsets.sym
> index 7b7c719369..f3f7df6c06 100644
> --- a/sysdeps/i386/htl/tcb-offsets.sym
> +++ b/sysdeps/i386/htl/tcb-offsets.sym
> @@ -2,7 +2,6 @@
>  #include <tls.h>
>  #include <kernel-features.h>
>
> -MULTIPLE_THREADS_OFFSET offsetof (tcbhead_t, multiple_threads)
>  SYSINFO_OFFSET          offsetof (tcbhead_t, sysinfo)
>  POINTER_GUARD           offsetof (tcbhead_t, pointer_guard)
>  SIGSTATE_OFFSET         offsetof (tcbhead_t, _hurd_sigstate)
> diff --git a/sysdeps/i386/nptl/tcb-offsets.sym b/sysdeps/i386/nptl/tcb-offsets.sym
> index 2ec9e787c1..1efd1469d8 100644
> --- a/sysdeps/i386/nptl/tcb-offsets.sym
> +++ b/sysdeps/i386/nptl/tcb-offsets.sym
> @@ -6,7 +6,6 @@ RESULT                  offsetof (struct pthread, result)
>  TID                    offsetof (struct pthread, tid)
>  CANCELHANDLING         offsetof (struct pthread, cancelhandling)
>  CLEANUP_JMP_BUF                offsetof (struct pthread, cleanup_jmp_buf)
> -MULTIPLE_THREADS_OFFSET        offsetof (tcbhead_t, multiple_threads)
>  SYSINFO_OFFSET         offsetof (tcbhead_t, sysinfo)
>  CLEANUP                        offsetof (struct pthread, cleanup)
>  CLEANUP_PREV           offsetof (struct _pthread_cleanup_buffer, __prev)
> diff --git a/sysdeps/i386/nptl/tls.h b/sysdeps/i386/nptl/tls.h
> index 91090bf287..48940a9f44 100644
> --- a/sysdeps/i386/nptl/tls.h
> +++ b/sysdeps/i386/nptl/tls.h
> @@ -36,7 +36,7 @@ typedef struct
>                            thread descriptor used by libpthread.  */
>    dtv_t *dtv;
>    void *self;          /* Pointer to the thread descriptor.  */
> -  int multiple_threads;
> +  int unused_multiple_threads;
>    uintptr_t sysinfo;
>    uintptr_t stack_guard;
>    uintptr_t pointer_guard;
> @@ -57,8 +57,6 @@ typedef struct
>  _Static_assert (offsetof (tcbhead_t, __private_ss) == 0x30,
>                 "offset of __private_ss != 0x30");
>
> -# define TLS_MULTIPLE_THREADS_IN_TCB 1
> -
>  #else /* __ASSEMBLER__ */
>  # include <tcb-offsets.h>
>  #endif
> diff --git a/sysdeps/ia64/nptl/tcb-offsets.sym b/sysdeps/ia64/nptl/tcb-offsets.sym
> index b01f712be2..ab2cb180f9 100644
> --- a/sysdeps/ia64/nptl/tcb-offsets.sym
> +++ b/sysdeps/ia64/nptl/tcb-offsets.sym
> @@ -2,5 +2,4 @@
>  #include <tls.h>
>
>  TID                    offsetof (struct pthread, tid) - TLS_PRE_TCB_SIZE
> -MULTIPLE_THREADS_OFFSET offsetof (struct pthread, header.multiple_threads) - TLS_PRE_TCB_SIZE
>  SYSINFO_OFFSET         offsetof (tcbhead_t, __private)
> diff --git a/sysdeps/ia64/nptl/tls.h b/sysdeps/ia64/nptl/tls.h
> index 8ccedb73e6..008e080fc4 100644
> --- a/sysdeps/ia64/nptl/tls.h
> +++ b/sysdeps/ia64/nptl/tls.h
> @@ -36,8 +36,6 @@ typedef struct
>
>  register struct pthread *__thread_self __asm__("r13");
>
> -# define TLS_MULTIPLE_THREADS_IN_TCB 1
> -
>  #else /* __ASSEMBLER__ */
>  # include <tcb-offsets.h>
>  #endif
> diff --git a/sysdeps/mach/hurd/i386/tls.h b/sysdeps/mach/hurd/i386/tls.h
> index 264ed9a9c5..d33e91c922 100644
> --- a/sysdeps/mach/hurd/i386/tls.h
> +++ b/sysdeps/mach/hurd/i386/tls.h
> @@ -33,7 +33,7 @@ typedef struct
>    void *tcb;                   /* Points to this structure.  */
>    dtv_t *dtv;                  /* Vector of pointers to TLS data.  */
>    thread_t self;               /* This thread's control port.  */
> -  int multiple_threads;
> +  int unused_multiple_threads;
>    uintptr_t sysinfo;
>    uintptr_t stack_guard;
>    uintptr_t pointer_guard;
> @@ -117,8 +117,6 @@ _hurd_tls_init (tcbhead_t *tcb)
>    /* This field is used by TLS accesses to get our "thread pointer"
>       from the TLS point of view.  */
>    tcb->tcb = tcb;
> -  /* We always at least start the sigthread anyway.  */
> -  tcb->multiple_threads = 1;
>
>    /* Get the first available selector.  */
>    int sel = -1;
> diff --git a/sysdeps/nios2/nptl/tcb-offsets.sym b/sysdeps/nios2/nptl/tcb-offsets.sym
> index 3cd8d984ac..93a695ac7f 100644
> --- a/sysdeps/nios2/nptl/tcb-offsets.sym
> +++ b/sysdeps/nios2/nptl/tcb-offsets.sym
> @@ -8,6 +8,5 @@
>  # define __thread_self          ((void *) 0)
>  # define thread_offsetof(mem)   ((ptrdiff_t) THREAD_SELF + offsetof (struct pthread, mem))
>
> -MULTIPLE_THREADS_OFFSET                thread_offsetof (header.multiple_threads)
>  TID_OFFSET                     thread_offsetof (tid)
>  POINTER_GUARD                  (offsetof (tcbhead_t, pointer_guard) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
> diff --git a/sysdeps/or1k/nptl/tls.h b/sysdeps/or1k/nptl/tls.h
> index c6ffe62c3f..3bb07beef8 100644
> --- a/sysdeps/or1k/nptl/tls.h
> +++ b/sysdeps/or1k/nptl/tls.h
> @@ -35,8 +35,6 @@ typedef struct
>
>  register tcbhead_t *__thread_self __asm__("r10");
>
> -# define TLS_MULTIPLE_THREADS_IN_TCB 1
> -
>  /* Get system call information.  */
>  # include <sysdep.h>
>
> diff --git a/sysdeps/powerpc/nptl/tcb-offsets.sym b/sysdeps/powerpc/nptl/tcb-offsets.sym
> index 4c01615ad0..a0ee95f94d 100644
> --- a/sysdeps/powerpc/nptl/tcb-offsets.sym
> +++ b/sysdeps/powerpc/nptl/tcb-offsets.sym
> @@ -10,9 +10,6 @@
>  # define thread_offsetof(mem)  ((ptrdiff_t) THREAD_SELF + offsetof (struct pthread, mem))
>
>
> -#if TLS_MULTIPLE_THREADS_IN_TCB
> -MULTIPLE_THREADS_OFFSET                thread_offsetof (header.multiple_threads)
> -#endif
>  TID                            thread_offsetof (tid)
>  POINTER_GUARD                  (offsetof (tcbhead_t, pointer_guard) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
>  TAR_SAVE                       (offsetof (tcbhead_t, tar_save) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
> diff --git a/sysdeps/powerpc/nptl/tls.h b/sysdeps/powerpc/nptl/tls.h
> index 22b0075235..fd5ee51981 100644
> --- a/sysdeps/powerpc/nptl/tls.h
> +++ b/sysdeps/powerpc/nptl/tls.h
> @@ -52,9 +52,6 @@
>  # define TLS_DTV_AT_TP 1
>  # define TLS_TCB_AT_TP 0
>
> -/* We use the multiple_threads field in the pthread struct */
> -#define TLS_MULTIPLE_THREADS_IN_TCB    1
> -
>  /* Get the thread descriptor definition.  */
>  # include <nptl/descr.h>
>
> diff --git a/sysdeps/s390/nptl/tcb-offsets.sym b/sysdeps/s390/nptl/tcb-offsets.sym
> index 9c1c01f353..bc7b267463 100644
> --- a/sysdeps/s390/nptl/tcb-offsets.sym
> +++ b/sysdeps/s390/nptl/tcb-offsets.sym
> @@ -1,6 +1,5 @@
>  #include <sysdep.h>
>  #include <tls.h>
>
> -MULTIPLE_THREADS_OFFSET                offsetof (tcbhead_t, multiple_threads)
>  STACK_GUARD                    offsetof (tcbhead_t, stack_guard)
>  TID                            offsetof (struct pthread, tid)
> diff --git a/sysdeps/s390/nptl/tls.h b/sysdeps/s390/nptl/tls.h
> index ff210ffeb2..d69ed539f7 100644
> --- a/sysdeps/s390/nptl/tls.h
> +++ b/sysdeps/s390/nptl/tls.h
> @@ -35,7 +35,7 @@ typedef struct
>                            thread descriptor used by libpthread.  */
>    dtv_t *dtv;
>    void *self;          /* Pointer to the thread descriptor.  */
> -  int multiple_threads;
> +  int unused_multiple_threads;
>    uintptr_t sysinfo;
>    uintptr_t stack_guard;
>    int gscope_flag;
> @@ -44,10 +44,6 @@ typedef struct
>    void *__private_ss;
>  } tcbhead_t;
>
> -# ifndef __s390x__
> -#  define TLS_MULTIPLE_THREADS_IN_TCB 1
> -# endif
> -
>  #else /* __ASSEMBLER__ */
>  # include <tcb-offsets.h>
>  #endif
> diff --git a/sysdeps/sh/nptl/tcb-offsets.sym b/sysdeps/sh/nptl/tcb-offsets.sym
> index 234207779d..4e452d9c6c 100644
> --- a/sysdeps/sh/nptl/tcb-offsets.sym
> +++ b/sysdeps/sh/nptl/tcb-offsets.sym
> @@ -6,7 +6,6 @@ RESULT                  offsetof (struct pthread, result)
>  TID                    offsetof (struct pthread, tid)
>  CANCELHANDLING         offsetof (struct pthread, cancelhandling)
>  CLEANUP_JMP_BUF                offsetof (struct pthread, cleanup_jmp_buf)
> -MULTIPLE_THREADS_OFFSET        offsetof (struct pthread, header.multiple_threads)
>  TLS_PRE_TCB_SIZE       sizeof (struct pthread)
>  MUTEX_FUTEX            offsetof (pthread_mutex_t, __data.__lock)
>  POINTER_GUARD          offsetof (tcbhead_t, pointer_guard)
> diff --git a/sysdeps/sh/nptl/tls.h b/sysdeps/sh/nptl/tls.h
> index 76591ab6ef..8778cb4ac0 100644
> --- a/sysdeps/sh/nptl/tls.h
> +++ b/sysdeps/sh/nptl/tls.h
> @@ -36,8 +36,6 @@ typedef struct
>    uintptr_t pointer_guard;
>  } tcbhead_t;
>
> -# define TLS_MULTIPLE_THREADS_IN_TCB 1
> -
>  #else /* __ASSEMBLER__ */
>  # include <tcb-offsets.h>
>  #endif /* __ASSEMBLER__ */
> diff --git a/sysdeps/sparc/nptl/tcb-offsets.sym b/sysdeps/sparc/nptl/tcb-offsets.sym
> index f75d02065e..e4a7e4720f 100644
> --- a/sysdeps/sparc/nptl/tcb-offsets.sym
> +++ b/sysdeps/sparc/nptl/tcb-offsets.sym
> @@ -1,6 +1,5 @@
>  #include <sysdep.h>
>  #include <tls.h>
>
> -MULTIPLE_THREADS_OFFSET                offsetof (tcbhead_t, multiple_threads)
>  POINTER_GUARD                  offsetof (tcbhead_t, pointer_guard)
>  TID                            offsetof (struct pthread, tid)
> diff --git a/sysdeps/sparc/nptl/tls.h b/sysdeps/sparc/nptl/tls.h
> index d1e2bb4ad1..b78cf0d6b4 100644
> --- a/sysdeps/sparc/nptl/tls.h
> +++ b/sysdeps/sparc/nptl/tls.h
> @@ -35,7 +35,7 @@ typedef struct
>                            thread descriptor used by libpthread.  */
>    dtv_t *dtv;
>    void *self;
> -  int multiple_threads;
> +  int unused_multiple_threads;
>  #if __WORDSIZE == 64
>    int gscope_flag;
>  #endif
> diff --git a/sysdeps/unix/sysv/linux/single-thread.h b/sysdeps/unix/sysv/linux/single-thread.h
> index 208edccce6..dd80e82c82 100644
> --- a/sysdeps/unix/sysv/linux/single-thread.h
> +++ b/sysdeps/unix/sysv/linux/single-thread.h
> @@ -23,20 +23,7 @@
>  # include <sys/single_threaded.h>
>  #endif
>
> -/* The default way to check if the process is single thread is by using the
> -   pthread_t 'multiple_threads' field.  However, for some architectures it is
> -   faster to either use an extra field on TCB or global variables (the TCB
> -   field is also used on x86 for some single-thread atomic optimizations).
> -
> -   The ABI might define SINGLE_THREAD_BY_GLOBAL to enable the single thread
> -   check to use global variables instead of the pthread_t field.  */
> -
> -#if !defined SINGLE_THREAD_BY_GLOBAL || IS_IN (rtld)
> -# define SINGLE_THREAD_P \
> -  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
> -#else
> -# define SINGLE_THREAD_P (__libc_single_threaded != 0)
> -#endif
> +#define SINGLE_THREAD_P (__libc_single_threaded != 0)
>
>  #define RTLD_SINGLE_THREAD_P SINGLE_THREAD_P
>
> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index f24f1c71ed..23e087e7e0 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -51,292 +51,145 @@
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
>    (! __sync_bool_compare_and_swap (mem, oldval, newval))
>
> +#define __cmpxchg_op(lock, mem, newval, oldval)                                      \
> +  ({ __typeof (*mem) __ret;                                                  \
> +     if (sizeof (*mem) == 1)                                                 \
> +       asm volatile (lock "cmpxchgb %2, %1"                                  \
> +                    : "=a" (__ret), "+m" (*mem)                             \
> +                    : BR_CONSTRAINT (newval), "0" (oldval)                   \
> +                    : "memory");                                             \
> +     else if (sizeof (*mem) == 2)                                            \
> +       asm volatile (lock "cmpxchgw %2, %1"                                  \
> +                    : "=a" (__ret), "+m" (*mem)                             \
> +                    : BR_CONSTRAINT (newval), "0" (oldval)                   \
> +                    : "memory");                                             \
> +     else if (sizeof (*mem) == 4)                                            \
> +       asm volatile (lock "cmpxchgl %2, %1"                                  \
> +                    : "=a" (__ret), "+m" (*mem)                             \
> +                    : BR_CONSTRAINT (newval), "0" (oldval)                   \
> +                    : "memory");                                             \
> +     else if (__HAVE_64B_ATOMICS)                                            \
> +       asm volatile (lock "cmpxchgq %2, %1"                                  \
> +                    : "=a" (__ret), "+m" (*mem)                             \
> +                    : "q" ((int64_t) cast_to_integer (newval)),                      \
> +                      "0" ((int64_t) cast_to_integer (oldval))               \
> +                    : "memory");                                             \
> +     else                                                                    \
> +       __atomic_link_error ();                                               \
> +     __ret; })
>
> -#define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> +#define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval)         \
>    ({ __typeof (*mem) ret;                                                    \
> -     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"                              \
> -                      "je 0f\n\t"                                            \
> -                      "lock\n"                                               \
> -                      "0:\tcmpxchgb %b2, %1"                                 \
> -                      : "=a" (ret), "=m" (*mem)                              \
> -                      : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
> -                        "i" (offsetof (tcbhead_t, multiple_threads)));       \
> +     if (SINGLE_THREAD_P)                                                    \
> +       __cmpxchg_op ("", (mem), (newval), (oldval));                         \
> +     else                                                                    \
> +       __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));                \
>       ret; })
>
> -#define __arch_c_compare_and_exchange_val_16_acq(mem, newval, oldval) \
> +#define __arch_c_compare_and_exchange_val_16_acq(mem, newval, oldval)        \
>    ({ __typeof (*mem) ret;                                                    \
> -     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"                              \
> -                      "je 0f\n\t"                                            \
> -                      "lock\n"                                               \
> -                      "0:\tcmpxchgw %w2, %1"                                 \
> -                      : "=a" (ret), "=m" (*mem)                              \
> -                      : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
> -                        "i" (offsetof (tcbhead_t, multiple_threads)));       \
> +     if (SINGLE_THREAD_P)                                                    \
> +       __cmpxchg_op ("", (mem), (newval), (oldval));                         \
> +     else                                                                    \
> +       __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));                \
>       ret; })
>
> -#define __arch_c_compare_and_exchange_val_32_acq(mem, newval, oldval) \
> +#define __arch_c_compare_and_exchange_val_32_acq(mem, newval, oldval)        \
>    ({ __typeof (*mem) ret;                                                    \
> -     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"                              \
> -                      "je 0f\n\t"                                            \
> -                      "lock\n"                                               \
> -                      "0:\tcmpxchgl %2, %1"                                  \
> -                      : "=a" (ret), "=m" (*mem)                              \
> -                      : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
> -                        "i" (offsetof (tcbhead_t, multiple_threads)));       \
> +     if (SINGLE_THREAD_P)                                                    \
> +       __cmpxchg_op ("", (mem), (newval), (oldval));                         \
> +     else                                                                    \
> +       __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));                \
>       ret; })
>
> -#ifdef __x86_64__
> -# define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval) \
> +#define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval)        \
>    ({ __typeof (*mem) ret;                                                    \
> -     __asm __volatile ("cmpl $0, %%fs:%P5\n\t"                               \
> -                      "je 0f\n\t"                                            \
> -                      "lock\n"                                               \
> -                      "0:\tcmpxchgq %q2, %1"                                 \
> -                      : "=a" (ret), "=m" (*mem)                              \
> -                      : "q" ((int64_t) cast_to_integer (newval)),            \
> -                        "m" (*mem),                                          \
> -                        "0" ((int64_t) cast_to_integer (oldval)),            \
> -                        "i" (offsetof (tcbhead_t, multiple_threads)));       \
> -     ret; })
> -# define do_exchange_and_add_val_64_acq(pfx, mem, value) 0
> -# define do_add_val_64_acq(pfx, mem, value) do { } while (0)
> -#else
> -/* XXX We do not really need 64-bit compare-and-exchange.  At least
> -   not in the moment.  Using it would mean causing portability
> -   problems since not many other 32-bit architectures have support for
> -   such an operation.  So don't define any code for now.  If it is
> -   really going to be used the code below can be used on Intel Pentium
> -   and later, but NOT on i486.  */
> -# define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval) \
> -  ({ __typeof (*mem) ret = *(mem);                                           \
> -     __atomic_link_error ();                                                 \
> -     ret = (newval);                                                         \
> -     ret = (oldval);                                                         \
> -     ret; })
> -
> -# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval)         \
> -  ({ __typeof (*mem) ret = *(mem);                                           \
> -     __atomic_link_error ();                                                 \
> -     ret = (newval);                                                         \
> -     ret = (oldval);                                                         \
> -     ret; })
> -
> -# define do_exchange_and_add_val_64_acq(pfx, mem, value) \
> -  ({ __typeof (value) __addval = (value);                                    \
> -     __typeof (*mem) __result;                                               \
> -     __typeof (mem) __memp = (mem);                                          \
> -     __typeof (*mem) __tmpval;                                               \
> -     __result = *__memp;                                                     \
> -     do                                                                              \
> -       __tmpval = __result;                                                  \
> -     while ((__result = pfx##_compare_and_exchange_val_64_acq                \
> -            (__memp, __result + __addval, __result)) == __tmpval);           \
> -     __result; })
> -
> -# define do_add_val_64_acq(pfx, mem, value) \
> -  {                                                                          \
> -    __typeof (value) __addval = (value);                                     \
> -    __typeof (mem) __memp = (mem);                                           \
> -    __typeof (*mem) __oldval = *__memp;                                              \
> -    __typeof (*mem) __tmpval;                                                \
> -    do                                                                       \
> -      __tmpval = __oldval;                                                   \
> -    while ((__oldval = pfx##_compare_and_exchange_val_64_acq                 \
> -           (__memp, __oldval + __addval, __oldval)) == __tmpval);            \
> -  }
> -#endif
> -
> -
> -/* Note that we need no lock prefix.  */
> -#define atomic_exchange_acq(mem, newvalue) \
> -  ({ __typeof (*mem) result;                                                 \
> -     if (sizeof (*mem) == 1)                                                 \
> -       __asm __volatile ("xchgb %b0, %1"                                     \
> -                        : "=q" (result), "=m" (*mem)                         \
> -                        : "0" (newvalue), "m" (*mem));                       \
> -     else if (sizeof (*mem) == 2)                                            \
> -       __asm __volatile ("xchgw %w0, %1"                                     \
> -                        : "=r" (result), "=m" (*mem)                         \
> -                        : "0" (newvalue), "m" (*mem));                       \
> -     else if (sizeof (*mem) == 4)                                            \
> -       __asm __volatile ("xchgl %0, %1"                                              \
> -                        : "=r" (result), "=m" (*mem)                         \
> -                        : "0" (newvalue), "m" (*mem));                       \
> -     else if (__HAVE_64B_ATOMICS)                                            \
> -       __asm __volatile ("xchgq %q0, %1"                                     \
> -                        : "=r" (result), "=m" (*mem)                         \
> -                        : "0" ((int64_t) cast_to_integer (newvalue)),        \
> -                          "m" (*mem));                                       \
> -     else                                                                    \
> -       {                                                                     \
> -        result = 0;                                                          \
> -        __atomic_link_error ();                                              \
> -       }                                                                     \
> -     result; })
> -
> -
> -#define __arch_exchange_and_add_body(lock, pfx, mem, value) \
> -  ({ __typeof (*mem) __result;                                               \
> -     __typeof (value) __addval = (value);                                    \
> -     if (sizeof (*mem) == 1)                                                 \
> -       __asm __volatile (lock "xaddb %b0, %1"                                \
> -                        : "=q" (__result), "=m" (*mem)                       \
> -                        : "0" (__addval), "m" (*mem),                        \
> -                          "i" (offsetof (tcbhead_t, multiple_threads)));     \
> -     else if (sizeof (*mem) == 2)                                            \
> -       __asm __volatile (lock "xaddw %w0, %1"                                \
> -                        : "=r" (__result), "=m" (*mem)                       \
> -                        : "0" (__addval), "m" (*mem),                        \
> -                          "i" (offsetof (tcbhead_t, multiple_threads)));     \
> -     else if (sizeof (*mem) == 4)                                            \
> -       __asm __volatile (lock "xaddl %0, %1"                                 \
> -                        : "=r" (__result), "=m" (*mem)                       \
> -                        : "0" (__addval), "m" (*mem),                        \
> -                          "i" (offsetof (tcbhead_t, multiple_threads)));     \
> -     else if (__HAVE_64B_ATOMICS)                                            \
> -       __asm __volatile (lock "xaddq %q0, %1"                                \
> -                        : "=r" (__result), "=m" (*mem)                       \
> -                        : "0" ((int64_t) cast_to_integer (__addval)),     \
> -                          "m" (*mem),                                        \
> -                          "i" (offsetof (tcbhead_t, multiple_threads)));     \
> +     if (SINGLE_THREAD_P)                                                    \
> +       __cmpxchg_op ("", (mem), (newval), (oldval));                         \
>       else                                                                    \
> -       __result = do_exchange_and_add_val_64_acq (pfx, (mem), __addval);      \
> -     __result; })
> -
> -#define atomic_exchange_and_add(mem, value) \
> -  __sync_fetch_and_add (mem, value)
> -
> -#define __arch_exchange_and_add_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P4\n\tje 0f\n\tlock\n0:\t"
> -
> -#define catomic_exchange_and_add(mem, value) \
> -  __arch_exchange_and_add_body (__arch_exchange_and_add_cprefix, __arch_c,    \
> -                               mem, value)
> -
> -
> -#define __arch_add_body(lock, pfx, apfx, mem, value) \
> -  do {                                                                       \
> -    if (__builtin_constant_p (value) && (value) == 1)                        \
> -      pfx##_increment (mem);                                                 \
> -    else if (__builtin_constant_p (value) && (value) == -1)                  \
> -      pfx##_decrement (mem);                                                 \
> -    else if (sizeof (*mem) == 1)                                             \
> -      __asm __volatile (lock "addb %b1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : IBR_CONSTRAINT (value), "m" (*mem),                 \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "addw %w1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (value), "m" (*mem),                           \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "addl %1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (value), "m" (*mem),                           \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "addq %q1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" ((int64_t) cast_to_integer (value)),           \
> -                         "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      do_add_val_64_acq (apfx, (mem), (value));                                      \
> -  } while (0)
> -
> -# define atomic_add(mem, value) \
> -  __arch_add_body (LOCK_PREFIX, atomic, __arch, mem, value)
> -
> -#define __arch_add_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P3\n\tje 0f\n\tlock\n0:\t"
> -
> -#define catomic_add(mem, value) \
> -  __arch_add_body (__arch_add_cprefix, atomic, __arch_c, mem, value)
> +       __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));                \
> +     ret; })
>
>
> -#define atomic_add_negative(mem, value) \
> -  ({ unsigned char __result;                                                 \
> +#define __xchg_op(lock, mem, arg, op)                                        \
> +  ({ __typeof (*mem) __ret = (arg);                                          \
>       if (sizeof (*mem) == 1)                                                 \
> -       __asm __volatile (LOCK_PREFIX "addb %b2, %0; sets %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : IBR_CONSTRAINT (value), "m" (*mem));               \
> +       __asm __volatile (lock #op "b %b0, %1"                                \
> +                        : "=q" (__ret), "=m" (*mem)                          \
> +                        : "0" (arg), "m" (*mem)                              \
> +                        : "memory", "cc");                                   \
>       else if (sizeof (*mem) == 2)                                            \
> -       __asm __volatile (LOCK_PREFIX "addw %w2, %0; sets %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" (value), "m" (*mem));                         \
> +       __asm __volatile (lock #op "w %w0, %1"                                \
> +                        : "=r" (__ret), "=m" (*mem)                          \
> +                        : "0" (arg), "m" (*mem)                              \
> +                        : "memory", "cc");                                   \
>       else if (sizeof (*mem) == 4)                                            \
> -       __asm __volatile (LOCK_PREFIX "addl %2, %0; sets %1"                  \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" (value), "m" (*mem));                         \
> +       __asm __volatile (lock #op "l %0, %1"                                 \
> +                        : "=r" (__ret), "=m" (*mem)                          \
> +                        : "0" (arg), "m" (*mem)                              \
> +                        : "memory", "cc");                                   \
>       else if (__HAVE_64B_ATOMICS)                                            \
> -       __asm __volatile (LOCK_PREFIX "addq %q2, %0; sets %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" ((int64_t) cast_to_integer (value)),          \
> -                          "m" (*mem));                                       \
> +       __asm __volatile (lock #op "q %q0, %1"                                \
> +                        : "=r" (__ret), "=m" (*mem)                          \
> +                        : "0" ((int64_t) cast_to_integer (arg)),             \
> +                          "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else                                                                    \
>         __atomic_link_error ();                                               \
> -     __result; })
> -
> +     __ret; })
>
> -#define atomic_add_zero(mem, value) \
> -  ({ unsigned char __result;                                                 \
> +#define __single_op(lock, mem, op)                                           \
> +  ({                                                                         \
>       if (sizeof (*mem) == 1)                                                 \
> -       __asm __volatile (LOCK_PREFIX "addb %b2, %0; setz %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : IBR_CONSTRAINT (value), "m" (*mem));               \
> +       __asm __volatile (lock #op "b %b0"                                    \
> +                        : "=m" (*mem)                                        \
> +                        : "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else if (sizeof (*mem) == 2)                                            \
> -       __asm __volatile (LOCK_PREFIX "addw %w2, %0; setz %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" (value), "m" (*mem));                         \
> +       __asm __volatile (lock #op "w %w0"                                    \
> +                        : "=m" (*mem)                                        \
> +                        : "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else if (sizeof (*mem) == 4)                                            \
> -       __asm __volatile (LOCK_PREFIX "addl %2, %0; setz %1"                  \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" (value), "m" (*mem));                         \
> +       __asm __volatile (lock #op "l %0"                                     \
> +                        : "=m" (*mem)                                        \
> +                        : "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else if (__HAVE_64B_ATOMICS)                                            \
> -       __asm __volatile (LOCK_PREFIX "addq %q2, %0; setz %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" ((int64_t) cast_to_integer (value)),          \
> -                          "m" (*mem));                                       \
> +       __asm __volatile (lock #op "q %q0"                                    \
> +                        : "=m" (*mem)                                        \
> +                        : "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else                                                                    \
> -       __atomic_link_error ();                                       \
> -     __result; })
> +       __atomic_link_error ();                                               \
> +  })
>
> +/* Note that we need no lock prefix.  */
> +#define atomic_exchange_acq(mem, newvalue)                                   \
> +  __xchg_op ("", (mem), (newvalue), xchg)
>
> -#define __arch_increment_body(lock, pfx, mem) \
> -  do {                                                                       \
> -    if (sizeof (*mem) == 1)                                                  \
> -      __asm __volatile (lock "incb %b0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "incw %w0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "incl %0"                                       \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "incq %q0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      do_add_val_64_acq (pfx, mem, 1);                                       \
> -  } while (0)
> +#define atomic_add(mem, value) \
> +  __xchg_op (LOCK_PREFIX, (mem), (value), add)
>
> -#define atomic_increment(mem) __arch_increment_body (LOCK_PREFIX, __arch, mem)
> +#define catomic_add(mem, value)                                                      \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __xchg_op ("", (mem), (value), add);                                   \
> +    else                                                                    \
> +      atomic_add (mem, value);                                              \
> +  })
>
> -#define __arch_increment_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P2\n\tje 0f\n\tlock\n0:\t"
>
> -#define catomic_increment(mem) \
> -  __arch_increment_body (__arch_increment_cprefix, __arch_c, mem)
> +#define atomic_increment(mem) \
> +  __single_op (LOCK_PREFIX, (mem), inc)
>
> +#define catomic_increment(mem)                                               \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __single_op ("", (mem), inc);                                          \
> +    else                                                                     \
> +      atomic_increment (mem);                                                \
> +  })
>
>  #define atomic_increment_and_test(mem) \
>    ({ unsigned char __result;                                                 \
> @@ -357,43 +210,20 @@
>                          : "=m" (*mem), "=qm" (__result)                      \
>                          : "m" (*mem));                                       \
>       else                                                                    \
> -       __atomic_link_error ();                                       \
> +       __atomic_link_error ();                                               \
>       __result; })
>
>
> -#define __arch_decrement_body(lock, pfx, mem) \
> -  do {                                                                       \
> -    if (sizeof (*mem) == 1)                                                  \
> -      __asm __volatile (lock "decb %b0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "decw %w0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "decl %0"                                       \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "decq %q0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      do_add_val_64_acq (pfx, mem, -1);                                              \
> -  } while (0)
> -
> -#define atomic_decrement(mem) __arch_decrement_body (LOCK_PREFIX, __arch, mem)
> +#define atomic_decrement(mem)                                                \
> +  __single_op (LOCK_PREFIX, (mem), dec)
>
> -#define __arch_decrement_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P2\n\tje 0f\n\tlock\n0:\t"
> -
> -#define catomic_decrement(mem) \
> -  __arch_decrement_body (__arch_decrement_cprefix, __arch_c, mem)
> +#define catomic_decrement(mem)                                               \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __single_op ("", (mem), dec);                                          \
> +    else                                                                     \
> +      atomic_decrement (mem);                                                \
> +  })
>
>
>  #define atomic_decrement_and_test(mem) \
> @@ -463,73 +293,31 @@
>                          : "=q" (__result), "=m" (*mem)                       \
>                          : "m" (*mem), "ir" (bit));                           \
>       else                                                                    \
> -       __atomic_link_error ();                                       \
> +       __atomic_link_error ();                                               \
>       __result; })
>
>
> -#define __arch_and_body(lock, mem, mask) \
> -  do {                                                                       \
> -    if (sizeof (*mem) == 1)                                                  \
> -      __asm __volatile (lock "andb %b1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : IBR_CONSTRAINT (mask), "m" (*mem),                  \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "andw %w1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "andl %1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "andq %q1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      __atomic_link_error ();                                                \
> -  } while (0)
> -
> -#define __arch_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P3\n\tje 0f\n\tlock\n0:\t"
> -
> -#define atomic_and(mem, mask) __arch_and_body (LOCK_PREFIX, mem, mask)
> -
> -#define catomic_and(mem, mask) __arch_and_body (__arch_cprefix, mem, mask)
> +#define atomic_and(mem, mask)                                                \
> +  __xchg_op (LOCK_PREFIX, (mem), (mask), and)
>
> +#define catomic_and(mem, mask)                                               \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __xchg_op ("", (mem), (mask), and);                                    \
> +    else                                                                     \
> +      atomic_and (mem, mask);                                                \
> +  })
>
> -#define __arch_or_body(lock, mem, mask) \
> -  do {                                                                       \
> -    if (sizeof (*mem) == 1)                                                  \
> -      __asm __volatile (lock "orb %b1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : IBR_CONSTRAINT (mask), "m" (*mem),                  \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "orw %w1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "orl %1, %0"                                    \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "orq %q1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      __atomic_link_error ();                                                \
> -  } while (0)
> -
> -#define atomic_or(mem, mask) __arch_or_body (LOCK_PREFIX, mem, mask)
> +#define atomic_or(mem, mask)                                                 \
> +  __xchg_op (LOCK_PREFIX, (mem), (mask), or)
>
> -#define catomic_or(mem, mask) __arch_or_body (__arch_cprefix, mem, mask)
> +#define catomic_or(mem, mask)                                                \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __xchg_op ("", (mem), (mask), or);                                     \
> +    else                                                                     \
> +      atomic_or (mem, mask);                                                 \
> +  })

I believe the motivation for this approach on x86 was that PIC access
used to be quite expensive.  Although x86-64 has PC-relative addressing,
does SINGLE_THREAD_P read the local copy of __libc_single_threaded
directly, or the global one via the GOT?
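
As a sketch (not part of the patch, and assuming the usual hidden-alias
convention for data objects exported from libc.so), the difference is
visible in a one-line probe compiled into libc:

  #include <sys/single_threaded.h>

  int
  probe (void)
  {
    return __libc_single_threaded;
  }

With a local hidden alias this is a single PC-relative load:

  movzbl  __libc_single_threaded(%rip), %eax

while going through the GOT costs an extra indirection:

  movq    __libc_single_threaded@GOTPCREL(%rip), %rax
  movzbl  (%rax), %eax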

>  /* We don't use mfence because it is supposedly slower due to having to
>     provide stronger guarantees (e.g., regarding self-modifying code).  */
> diff --git a/sysdeps/x86_64/nptl/tcb-offsets.sym b/sysdeps/x86_64/nptl/tcb-offsets.sym
> index 2bbd563a6c..8ec55a7ea8 100644
> --- a/sysdeps/x86_64/nptl/tcb-offsets.sym
> +++ b/sysdeps/x86_64/nptl/tcb-offsets.sym
> @@ -9,7 +9,6 @@ CLEANUP_JMP_BUF         offsetof (struct pthread, cleanup_jmp_buf)
>  CLEANUP                        offsetof (struct pthread, cleanup)
>  CLEANUP_PREV           offsetof (struct _pthread_cleanup_buffer, __prev)
>  MUTEX_FUTEX            offsetof (pthread_mutex_t, __data.__lock)
> -MULTIPLE_THREADS_OFFSET        offsetof (tcbhead_t, multiple_threads)
>  POINTER_GUARD          offsetof (tcbhead_t, pointer_guard)
>  FEATURE_1_OFFSET       offsetof (tcbhead_t, feature_1)
>  SSP_BASE_OFFSET                offsetof (tcbhead_t, ssp_base)
> --
> 2.34.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-10 16:35 ` [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB Adhemerval Zanella
  2022-06-10 19:49   ` H.J. Lu
@ 2022-06-10 21:00   ` Noah Goldstein
  2022-06-11 13:59     ` Wilco Dijkstra
  2022-06-13 21:31   ` Wilco Dijkstra
  2022-06-16  7:35   ` Fangrui Song
  3 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2022-06-10 21:00 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library, Wilco Dijkstra

On Fri, Jun 10, 2022 at 9:39 AM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Instead use __libc_single_threaded on all architectures.  The TCB
> field is renamed to avoid changing the struct layout.
>
> The x86 atomics need some adjustments since they had the single-thread
> optimization built into the inline assembly.  They now use
> SINGLE_THREAD_P, and the atomic macros that are no longer used are
> removed.
>
> Checked on x86_64-linux-gnu and i686-linux-gnu.
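
To make sure I follow: after this patch the conditional atomics all
reduce to an ordinary C branch on the global.  A minimal stand-alone
sketch of the resulting shape (illustrative only, int-sized case, names
are mine):

  #include <sys/single_threaded.h>

  static inline void
  catomic_add_int (int *mem, int value)
  {
    if (__libc_single_threaded)
      /* Single-threaded: plain RMW, no lock prefix needed.  */
      __asm__ __volatile__ ("addl %1, %0"
                            : "+m" (*mem) : "ir" (value) : "memory", "cc");
    else
      /* Multi-threaded: the usual locked form.  */
      __asm__ __volatile__ ("lock; addl %1, %0"
                            : "+m" (*mem) : "ir" (value) : "memory", "cc");
  }

so the %fs-relative test that used to live inside the asm template
becomes a load the compiler can schedule and predict like any other.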
> ---
>  misc/tst-atomic.c                       |   1 +
>  nptl/allocatestack.c                    |   6 -
>  nptl/descr.h                            |  17 +-
>  nptl/pthread_cancel.c                   |   7 +-
>  nptl/pthread_create.c                   |   5 -
>  sysdeps/i386/htl/tcb-offsets.sym        |   1 -
>  sysdeps/i386/nptl/tcb-offsets.sym       |   1 -
>  sysdeps/i386/nptl/tls.h                 |   4 +-
>  sysdeps/ia64/nptl/tcb-offsets.sym       |   1 -
>  sysdeps/ia64/nptl/tls.h                 |   2 -
>  sysdeps/mach/hurd/i386/tls.h            |   4 +-
>  sysdeps/nios2/nptl/tcb-offsets.sym      |   1 -
>  sysdeps/or1k/nptl/tls.h                 |   2 -
>  sysdeps/powerpc/nptl/tcb-offsets.sym    |   3 -
>  sysdeps/powerpc/nptl/tls.h              |   3 -
>  sysdeps/s390/nptl/tcb-offsets.sym       |   1 -
>  sysdeps/s390/nptl/tls.h                 |   6 +-
>  sysdeps/sh/nptl/tcb-offsets.sym         |   1 -
>  sysdeps/sh/nptl/tls.h                   |   2 -
>  sysdeps/sparc/nptl/tcb-offsets.sym      |   1 -
>  sysdeps/sparc/nptl/tls.h                |   2 +-
>  sysdeps/unix/sysv/linux/single-thread.h |  15 +-
>  sysdeps/x86/atomic-machine.h            | 484 +++++++-----------------
>  sysdeps/x86_64/nptl/tcb-offsets.sym     |   1 -
>  24 files changed, 145 insertions(+), 426 deletions(-)
>
> diff --git a/misc/tst-atomic.c b/misc/tst-atomic.c
> index 6d681a7bfd..ddbc618e25 100644
> --- a/misc/tst-atomic.c
> +++ b/misc/tst-atomic.c
> @@ -18,6 +18,7 @@
>
>  #include <stdio.h>
>  #include <atomic.h>
> +#include <support/xthread.h>
>
>  #ifndef atomic_t
>  # define atomic_t int
> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
> index 98f5f6dd85..3e0d01cb52 100644
> --- a/nptl/allocatestack.c
> +++ b/nptl/allocatestack.c
> @@ -290,9 +290,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>          stack cache nor will the memory (except the TLS memory) be freed.  */
>        pd->user_stack = true;
>
> -      /* This is at least the second thread.  */
> -      pd->header.multiple_threads = 1;
> -
>  #ifdef NEED_DL_SYSINFO
>        SETUP_THREAD_SYSINFO (pd);
>  #endif
> @@ -408,9 +405,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>              descriptor.  */
>           pd->specific[0] = pd->specific_1stblock;
>
> -         /* This is at least the second thread.  */
> -         pd->header.multiple_threads = 1;
> -
>  #ifdef NEED_DL_SYSINFO
>           SETUP_THREAD_SYSINFO (pd);
>  #endif
> diff --git a/nptl/descr.h b/nptl/descr.h
> index bb46b5958e..77b25d8267 100644
> --- a/nptl/descr.h
> +++ b/nptl/descr.h
> @@ -137,22 +137,7 @@ struct pthread
>  #else
>      struct
>      {
> -      /* multiple_threads is enabled either when the process has spawned at
> -        least one thread or when a single-threaded process cancels itself.
> -        This enables additional code to introduce locking before doing some
> -        compare_and_exchange operations and also enable cancellation points.
> -        The concepts of multiple threads and cancellation points ideally
> -        should be separate, since it is not necessary for multiple threads to
> -        have been created for cancellation points to be enabled, as is the
> -        case is when single-threaded process cancels itself.
> -
> -        Since enabling multiple_threads enables additional code in
> -        cancellation points and compare_and_exchange operations, there is a
> -        potential for an unneeded performance hit when it is enabled in a
> -        single-threaded, self-canceling process.  This is OK though, since a
> -        single-threaded process will enable async cancellation only when it
> -        looks to cancel itself and is hence going to end anyway.  */
> -      int multiple_threads;
> +      int unused_multiple_threads;
>        int gscope_flag;
>      } header;
>  #endif
> diff --git a/nptl/pthread_cancel.c b/nptl/pthread_cancel.c
> index e1735279f2..6d26a15d0e 100644
> --- a/nptl/pthread_cancel.c
> +++ b/nptl/pthread_cancel.c
> @@ -157,12 +157,9 @@ __pthread_cancel (pthread_t th)
>
>         /* A single-threaded process should be able to kill itself, since
>            there is nothing in the POSIX specification that says that it
> -          cannot.  So we set multiple_threads to true so that cancellation
> -          points get executed.  */
> -       THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
> -#ifndef TLS_MULTIPLE_THREADS_IN_TCB
> +          cannot.  So we clear __libc_single_threaded so that
> +          cancellation points get executed.  */
>         __libc_single_threaded = 0;
> -#endif
>      }
>    while (!atomic_compare_exchange_weak_acquire (&pd->cancelhandling, &oldval,
>                                                 newval));
> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
> index 5633d01c62..d43865352f 100644
> --- a/nptl/pthread_create.c
> +++ b/nptl/pthread_create.c
> @@ -882,11 +882,6 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
>            other reason that create_thread chose.  Now let it run
>            free.  */
>         lll_unlock (pd->lock, LLL_PRIVATE);
> -
> -      /* We now have for sure more than one thread.  The main thread might
> -        not yet have the flag set.  No need to set the global variable
> -        again if this is what we use.  */
> -      THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
>      }
>
>   out:
> diff --git a/sysdeps/i386/htl/tcb-offsets.sym b/sysdeps/i386/htl/tcb-offsets.sym
> index 7b7c719369..f3f7df6c06 100644
> --- a/sysdeps/i386/htl/tcb-offsets.sym
> +++ b/sysdeps/i386/htl/tcb-offsets.sym
> @@ -2,7 +2,6 @@
>  #include <tls.h>
>  #include <kernel-features.h>
>
> -MULTIPLE_THREADS_OFFSET offsetof (tcbhead_t, multiple_threads)
>  SYSINFO_OFFSET          offsetof (tcbhead_t, sysinfo)
>  POINTER_GUARD           offsetof (tcbhead_t, pointer_guard)
>  SIGSTATE_OFFSET         offsetof (tcbhead_t, _hurd_sigstate)
> diff --git a/sysdeps/i386/nptl/tcb-offsets.sym b/sysdeps/i386/nptl/tcb-offsets.sym
> index 2ec9e787c1..1efd1469d8 100644
> --- a/sysdeps/i386/nptl/tcb-offsets.sym
> +++ b/sysdeps/i386/nptl/tcb-offsets.sym
> @@ -6,7 +6,6 @@ RESULT                  offsetof (struct pthread, result)
>  TID                    offsetof (struct pthread, tid)
>  CANCELHANDLING         offsetof (struct pthread, cancelhandling)
>  CLEANUP_JMP_BUF                offsetof (struct pthread, cleanup_jmp_buf)
> -MULTIPLE_THREADS_OFFSET        offsetof (tcbhead_t, multiple_threads)
>  SYSINFO_OFFSET         offsetof (tcbhead_t, sysinfo)
>  CLEANUP                        offsetof (struct pthread, cleanup)
>  CLEANUP_PREV           offsetof (struct _pthread_cleanup_buffer, __prev)
> diff --git a/sysdeps/i386/nptl/tls.h b/sysdeps/i386/nptl/tls.h
> index 91090bf287..48940a9f44 100644
> --- a/sysdeps/i386/nptl/tls.h
> +++ b/sysdeps/i386/nptl/tls.h
> @@ -36,7 +36,7 @@ typedef struct
>                            thread descriptor used by libpthread.  */
>    dtv_t *dtv;
>    void *self;          /* Pointer to the thread descriptor.  */
> -  int multiple_threads;
> +  int unused_multiple_threads;
>    uintptr_t sysinfo;
>    uintptr_t stack_guard;
>    uintptr_t pointer_guard;
> @@ -57,8 +57,6 @@ typedef struct
>  _Static_assert (offsetof (tcbhead_t, __private_ss) == 0x30,
>                 "offset of __private_ss != 0x30");
>
> -# define TLS_MULTIPLE_THREADS_IN_TCB 1
> -
>  #else /* __ASSEMBLER__ */
>  # include <tcb-offsets.h>
>  #endif
> diff --git a/sysdeps/ia64/nptl/tcb-offsets.sym b/sysdeps/ia64/nptl/tcb-offsets.sym
> index b01f712be2..ab2cb180f9 100644
> --- a/sysdeps/ia64/nptl/tcb-offsets.sym
> +++ b/sysdeps/ia64/nptl/tcb-offsets.sym
> @@ -2,5 +2,4 @@
>  #include <tls.h>
>
>  TID                    offsetof (struct pthread, tid) - TLS_PRE_TCB_SIZE
> -MULTIPLE_THREADS_OFFSET offsetof (struct pthread, header.multiple_threads) - TLS_PRE_TCB_SIZE
>  SYSINFO_OFFSET         offsetof (tcbhead_t, __private)
> diff --git a/sysdeps/ia64/nptl/tls.h b/sysdeps/ia64/nptl/tls.h
> index 8ccedb73e6..008e080fc4 100644
> --- a/sysdeps/ia64/nptl/tls.h
> +++ b/sysdeps/ia64/nptl/tls.h
> @@ -36,8 +36,6 @@ typedef struct
>
>  register struct pthread *__thread_self __asm__("r13");
>
> -# define TLS_MULTIPLE_THREADS_IN_TCB 1
> -
>  #else /* __ASSEMBLER__ */
>  # include <tcb-offsets.h>
>  #endif
> diff --git a/sysdeps/mach/hurd/i386/tls.h b/sysdeps/mach/hurd/i386/tls.h
> index 264ed9a9c5..d33e91c922 100644
> --- a/sysdeps/mach/hurd/i386/tls.h
> +++ b/sysdeps/mach/hurd/i386/tls.h
> @@ -33,7 +33,7 @@ typedef struct
>    void *tcb;                   /* Points to this structure.  */
>    dtv_t *dtv;                  /* Vector of pointers to TLS data.  */
>    thread_t self;               /* This thread's control port.  */
> -  int multiple_threads;
> +  int unused_multiple_threads;
>    uintptr_t sysinfo;
>    uintptr_t stack_guard;
>    uintptr_t pointer_guard;
> @@ -117,8 +117,6 @@ _hurd_tls_init (tcbhead_t *tcb)
>    /* This field is used by TLS accesses to get our "thread pointer"
>       from the TLS point of view.  */
>    tcb->tcb = tcb;
> -  /* We always at least start the sigthread anyway.  */
> -  tcb->multiple_threads = 1;
>
>    /* Get the first available selector.  */
>    int sel = -1;
> diff --git a/sysdeps/nios2/nptl/tcb-offsets.sym b/sysdeps/nios2/nptl/tcb-offsets.sym
> index 3cd8d984ac..93a695ac7f 100644
> --- a/sysdeps/nios2/nptl/tcb-offsets.sym
> +++ b/sysdeps/nios2/nptl/tcb-offsets.sym
> @@ -8,6 +8,5 @@
>  # define __thread_self          ((void *) 0)
>  # define thread_offsetof(mem)   ((ptrdiff_t) THREAD_SELF + offsetof (struct pthread, mem))
>
> -MULTIPLE_THREADS_OFFSET                thread_offsetof (header.multiple_threads)
>  TID_OFFSET                     thread_offsetof (tid)
>  POINTER_GUARD                  (offsetof (tcbhead_t, pointer_guard) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
> diff --git a/sysdeps/or1k/nptl/tls.h b/sysdeps/or1k/nptl/tls.h
> index c6ffe62c3f..3bb07beef8 100644
> --- a/sysdeps/or1k/nptl/tls.h
> +++ b/sysdeps/or1k/nptl/tls.h
> @@ -35,8 +35,6 @@ typedef struct
>
>  register tcbhead_t *__thread_self __asm__("r10");
>
> -# define TLS_MULTIPLE_THREADS_IN_TCB 1
> -
>  /* Get system call information.  */
>  # include <sysdep.h>
>
> diff --git a/sysdeps/powerpc/nptl/tcb-offsets.sym b/sysdeps/powerpc/nptl/tcb-offsets.sym
> index 4c01615ad0..a0ee95f94d 100644
> --- a/sysdeps/powerpc/nptl/tcb-offsets.sym
> +++ b/sysdeps/powerpc/nptl/tcb-offsets.sym
> @@ -10,9 +10,6 @@
>  # define thread_offsetof(mem)  ((ptrdiff_t) THREAD_SELF + offsetof (struct pthread, mem))
>
>
> -#if TLS_MULTIPLE_THREADS_IN_TCB
> -MULTIPLE_THREADS_OFFSET                thread_offsetof (header.multiple_threads)
> -#endif
>  TID                            thread_offsetof (tid)
>  POINTER_GUARD                  (offsetof (tcbhead_t, pointer_guard) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
>  TAR_SAVE                       (offsetof (tcbhead_t, tar_save) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
> diff --git a/sysdeps/powerpc/nptl/tls.h b/sysdeps/powerpc/nptl/tls.h
> index 22b0075235..fd5ee51981 100644
> --- a/sysdeps/powerpc/nptl/tls.h
> +++ b/sysdeps/powerpc/nptl/tls.h
> @@ -52,9 +52,6 @@
>  # define TLS_DTV_AT_TP 1
>  # define TLS_TCB_AT_TP 0
>
> -/* We use the multiple_threads field in the pthread struct */
> -#define TLS_MULTIPLE_THREADS_IN_TCB    1
> -
>  /* Get the thread descriptor definition.  */
>  # include <nptl/descr.h>
>
> diff --git a/sysdeps/s390/nptl/tcb-offsets.sym b/sysdeps/s390/nptl/tcb-offsets.sym
> index 9c1c01f353..bc7b267463 100644
> --- a/sysdeps/s390/nptl/tcb-offsets.sym
> +++ b/sysdeps/s390/nptl/tcb-offsets.sym
> @@ -1,6 +1,5 @@
>  #include <sysdep.h>
>  #include <tls.h>
>
> -MULTIPLE_THREADS_OFFSET                offsetof (tcbhead_t, multiple_threads)
>  STACK_GUARD                    offsetof (tcbhead_t, stack_guard)
>  TID                            offsetof (struct pthread, tid)
> diff --git a/sysdeps/s390/nptl/tls.h b/sysdeps/s390/nptl/tls.h
> index ff210ffeb2..d69ed539f7 100644
> --- a/sysdeps/s390/nptl/tls.h
> +++ b/sysdeps/s390/nptl/tls.h
> @@ -35,7 +35,7 @@ typedef struct
>                            thread descriptor used by libpthread.  */
>    dtv_t *dtv;
>    void *self;          /* Pointer to the thread descriptor.  */
> -  int multiple_threads;
> +  int unused_multiple_threads;
>    uintptr_t sysinfo;
>    uintptr_t stack_guard;
>    int gscope_flag;
> @@ -44,10 +44,6 @@ typedef struct
>    void *__private_ss;
>  } tcbhead_t;
>
> -# ifndef __s390x__
> -#  define TLS_MULTIPLE_THREADS_IN_TCB 1
> -# endif
> -
>  #else /* __ASSEMBLER__ */
>  # include <tcb-offsets.h>
>  #endif
> diff --git a/sysdeps/sh/nptl/tcb-offsets.sym b/sysdeps/sh/nptl/tcb-offsets.sym
> index 234207779d..4e452d9c6c 100644
> --- a/sysdeps/sh/nptl/tcb-offsets.sym
> +++ b/sysdeps/sh/nptl/tcb-offsets.sym
> @@ -6,7 +6,6 @@ RESULT                  offsetof (struct pthread, result)
>  TID                    offsetof (struct pthread, tid)
>  CANCELHANDLING         offsetof (struct pthread, cancelhandling)
>  CLEANUP_JMP_BUF                offsetof (struct pthread, cleanup_jmp_buf)
> -MULTIPLE_THREADS_OFFSET        offsetof (struct pthread, header.multiple_threads)
>  TLS_PRE_TCB_SIZE       sizeof (struct pthread)
>  MUTEX_FUTEX            offsetof (pthread_mutex_t, __data.__lock)
>  POINTER_GUARD          offsetof (tcbhead_t, pointer_guard)
> diff --git a/sysdeps/sh/nptl/tls.h b/sysdeps/sh/nptl/tls.h
> index 76591ab6ef..8778cb4ac0 100644
> --- a/sysdeps/sh/nptl/tls.h
> +++ b/sysdeps/sh/nptl/tls.h
> @@ -36,8 +36,6 @@ typedef struct
>    uintptr_t pointer_guard;
>  } tcbhead_t;
>
> -# define TLS_MULTIPLE_THREADS_IN_TCB 1
> -
>  #else /* __ASSEMBLER__ */
>  # include <tcb-offsets.h>
>  #endif /* __ASSEMBLER__ */
> diff --git a/sysdeps/sparc/nptl/tcb-offsets.sym b/sysdeps/sparc/nptl/tcb-offsets.sym
> index f75d02065e..e4a7e4720f 100644
> --- a/sysdeps/sparc/nptl/tcb-offsets.sym
> +++ b/sysdeps/sparc/nptl/tcb-offsets.sym
> @@ -1,6 +1,5 @@
>  #include <sysdep.h>
>  #include <tls.h>
>
> -MULTIPLE_THREADS_OFFSET                offsetof (tcbhead_t, multiple_threads)
>  POINTER_GUARD                  offsetof (tcbhead_t, pointer_guard)
>  TID                            offsetof (struct pthread, tid)
> diff --git a/sysdeps/sparc/nptl/tls.h b/sysdeps/sparc/nptl/tls.h
> index d1e2bb4ad1..b78cf0d6b4 100644
> --- a/sysdeps/sparc/nptl/tls.h
> +++ b/sysdeps/sparc/nptl/tls.h
> @@ -35,7 +35,7 @@ typedef struct
>                            thread descriptor used by libpthread.  */
>    dtv_t *dtv;
>    void *self;
> -  int multiple_threads;
> +  int unused_multiple_threads;
>  #if __WORDSIZE == 64
>    int gscope_flag;
>  #endif
> diff --git a/sysdeps/unix/sysv/linux/single-thread.h b/sysdeps/unix/sysv/linux/single-thread.h
> index 208edccce6..dd80e82c82 100644
> --- a/sysdeps/unix/sysv/linux/single-thread.h
> +++ b/sysdeps/unix/sysv/linux/single-thread.h
> @@ -23,20 +23,7 @@
>  # include <sys/single_threaded.h>
>  #endif
>
> -/* The default way to check if the process is single thread is by using the
> -   pthread_t 'multiple_threads' field.  However, for some architectures it is
> -   faster to either use an extra field on TCB or global variables (the TCB
> -   field is also used on x86 for some single-thread atomic optimizations).
> -
> -   The ABI might define SINGLE_THREAD_BY_GLOBAL to enable the single thread
> -   check to use global variables instead of the pthread_t field.  */
> -
> -#if !defined SINGLE_THREAD_BY_GLOBAL || IS_IN (rtld)
> -# define SINGLE_THREAD_P \
> -  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
> -#else
> -# define SINGLE_THREAD_P (__libc_single_threaded != 0)
> -#endif
> +#define SINGLE_THREAD_P (__libc_single_threaded != 0)
>
>  #define RTLD_SINGLE_THREAD_P SINGLE_THREAD_P
>
> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index f24f1c71ed..23e087e7e0 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -51,292 +51,145 @@
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
>    (! __sync_bool_compare_and_swap (mem, oldval, newval))
>
> +#define __cmpxchg_op(lock, mem, newval, oldval)                                      \
> +  ({ __typeof (*mem) __ret;                                                  \
> +     if (sizeof (*mem) == 1)                                                 \
> +       asm volatile (lock "cmpxchgb %2, %1"                                  \
> +                    : "=a" (ret), "+m" (*mem)                                \
> +                    : BR_CONSTRAINT (newval), "0" (oldval)                   \
> +                    : "memory");                                             \

Is the full "memory" clobber needed? Shouldn't the "+m"(*mem) be enough?
> +     else if (sizeof (*mem) == 2)                                            \
> +       asm volatile (lock "cmpxchgw %2, %1"                                  \
> +                    : "=a" (ret), "+m" (*mem)                                \
> +                    : BR_CONSTRAINT (newval), "0" (oldval)                   \
> +                    : "memory");                                             \
> +     else if (sizeof (*mem) == 4)                                            \
> +       asm volatile (lock "cmpxchgl %2, %1"                                  \
> +                    : "=a" (ret), "+m" (*mem)                                \
> +                    : BR_CONSTRAINT (newval), "0" (oldval)                   \
> +                    : "memory");                                             \
> +     else if (__HAVE_64B_ATOMICS)                                            \
> +       asm volatile (lock "cmpxchgq %2, %1"                                  \
> +                    : "=a" (ret), "+m" (*mem)                                \
> +                    : "q" ((int64_t) cast_to_integer (newval)),                      \
> +                      "0" ((int64_t) cast_to_integer (oldval))               \
> +                    : "memory");                                             \
> +     else                                                                    \
> +       __atomic_link_error ();                                               \
> +     __ret; })
>
> -#define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> +#define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval)         \
>    ({ __typeof (*mem) ret;                                                    \
> -     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"                              \
> -                      "je 0f\n\t"                                            \
> -                      "lock\n"                                               \
> -                      "0:\tcmpxchgb %b2, %1"                                 \
> -                      : "=a" (ret), "=m" (*mem)                              \
> -                      : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
> -                        "i" (offsetof (tcbhead_t, multiple_threads)));       \
> +     if (SINGLE_THREAD_P)                                                    \
> +       ret = __cmpxchg_op ("", (mem), (newval), (oldval));                   \
> +     else                                                                    \
> +       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));          \
>       ret; })
>
> -#define __arch_c_compare_and_exchange_val_16_acq(mem, newval, oldval) \
> +#define __arch_c_compare_and_exchange_val_16_acq(mem, newval, oldval)        \
>    ({ __typeof (*mem) ret;                                                    \
> -     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"                              \
> -                      "je 0f\n\t"                                            \
> -                      "lock\n"                                               \
> -                      "0:\tcmpxchgw %w2, %1"                                 \
> -                      : "=a" (ret), "=m" (*mem)                              \
> -                      : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
> -                        "i" (offsetof (tcbhead_t, multiple_threads)));       \
> +     if (SINGLE_THREAD_P)                                                    \
> +       ret = __cmpxchg_op ("", (mem), (newval), (oldval));                   \
> +     else                                                                    \
> +       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));          \
>       ret; })
>
> -#define __arch_c_compare_and_exchange_val_32_acq(mem, newval, oldval) \
> +#define __arch_c_compare_and_exchange_val_32_acq(mem, newval, oldval)        \
>    ({ __typeof (*mem) ret;                                                    \
> -     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"                              \
> -                      "je 0f\n\t"                                            \
> -                      "lock\n"                                               \
> -                      "0:\tcmpxchgl %2, %1"                                  \
> -                      : "=a" (ret), "=m" (*mem)                              \
> -                      : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
> -                        "i" (offsetof (tcbhead_t, multiple_threads)));       \
> +     if (SINGLE_THREAD_P)                                                    \
> +       ret = __cmpxchg_op ("", (mem), (newval), (oldval));                   \
> +     else                                                                    \
> +       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));          \
>       ret; })
>
> -#ifdef __x86_64__
> -# define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval) \
> +#define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval)        \
>    ({ __typeof (*mem) ret;                                                    \
> -     __asm __volatile ("cmpl $0, %%fs:%P5\n\t"                               \
> -                      "je 0f\n\t"                                            \
> -                      "lock\n"                                               \
> -                      "0:\tcmpxchgq %q2, %1"                                 \
> -                      : "=a" (ret), "=m" (*mem)                              \
> -                      : "q" ((int64_t) cast_to_integer (newval)),            \
> -                        "m" (*mem),                                          \
> -                        "0" ((int64_t) cast_to_integer (oldval)),            \
> -                        "i" (offsetof (tcbhead_t, multiple_threads)));       \
> -     ret; })
> -# define do_exchange_and_add_val_64_acq(pfx, mem, value) 0
> -# define do_add_val_64_acq(pfx, mem, value) do { } while (0)
> -#else
> -/* XXX We do not really need 64-bit compare-and-exchange.  At least
> -   not in the moment.  Using it would mean causing portability
> -   problems since not many other 32-bit architectures have support for
> -   such an operation.  So don't define any code for now.  If it is
> -   really going to be used the code below can be used on Intel Pentium
> -   and later, but NOT on i486.  */
> -# define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval) \
> -  ({ __typeof (*mem) ret = *(mem);                                           \
> -     __atomic_link_error ();                                                 \
> -     ret = (newval);                                                         \
> -     ret = (oldval);                                                         \
> -     ret; })
> -
> -# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval)         \
> -  ({ __typeof (*mem) ret = *(mem);                                           \
> -     __atomic_link_error ();                                                 \
> -     ret = (newval);                                                         \
> -     ret = (oldval);                                                         \
> -     ret; })
> -
> -# define do_exchange_and_add_val_64_acq(pfx, mem, value) \
> -  ({ __typeof (value) __addval = (value);                                    \
> -     __typeof (*mem) __result;                                               \
> -     __typeof (mem) __memp = (mem);                                          \
> -     __typeof (*mem) __tmpval;                                               \
> -     __result = *__memp;                                                     \
> -     do                                                                              \
> -       __tmpval = __result;                                                  \
> -     while ((__result = pfx##_compare_and_exchange_val_64_acq                \
> -            (__memp, __result + __addval, __result)) == __tmpval);           \
> -     __result; })
> -
> -# define do_add_val_64_acq(pfx, mem, value) \
> -  {                                                                          \
> -    __typeof (value) __addval = (value);                                     \
> -    __typeof (mem) __memp = (mem);                                           \
> -    __typeof (*mem) __oldval = *__memp;                                              \
> -    __typeof (*mem) __tmpval;                                                \
> -    do                                                                       \
> -      __tmpval = __oldval;                                                   \
> -    while ((__oldval = pfx##_compare_and_exchange_val_64_acq                 \
> -           (__memp, __oldval + __addval, __oldval)) == __tmpval);            \
> -  }
> -#endif
> -
> -
> -/* Note that we need no lock prefix.  */
> -#define atomic_exchange_acq(mem, newvalue) \
> -  ({ __typeof (*mem) result;                                                 \
> -     if (sizeof (*mem) == 1)                                                 \
> -       __asm __volatile ("xchgb %b0, %1"                                     \
> -                        : "=q" (result), "=m" (*mem)                         \
> -                        : "0" (newvalue), "m" (*mem));                       \
> -     else if (sizeof (*mem) == 2)                                            \
> -       __asm __volatile ("xchgw %w0, %1"                                     \
> -                        : "=r" (result), "=m" (*mem)                         \
> -                        : "0" (newvalue), "m" (*mem));                       \
> -     else if (sizeof (*mem) == 4)                                            \
> -       __asm __volatile ("xchgl %0, %1"                                              \
> -                        : "=r" (result), "=m" (*mem)                         \
> -                        : "0" (newvalue), "m" (*mem));                       \
> -     else if (__HAVE_64B_ATOMICS)                                            \
> -       __asm __volatile ("xchgq %q0, %1"                                     \
> -                        : "=r" (result), "=m" (*mem)                         \
> -                        : "0" ((int64_t) cast_to_integer (newvalue)),        \
> -                          "m" (*mem));                                       \
> -     else                                                                    \
> -       {                                                                     \
> -        result = 0;                                                          \
> -        __atomic_link_error ();                                              \
> -       }                                                                     \
> -     result; })
> -
> -
> -#define __arch_exchange_and_add_body(lock, pfx, mem, value) \
> -  ({ __typeof (*mem) __result;                                               \
> -     __typeof (value) __addval = (value);                                    \
> -     if (sizeof (*mem) == 1)                                                 \
> -       __asm __volatile (lock "xaddb %b0, %1"                                \
> -                        : "=q" (__result), "=m" (*mem)                       \
> -                        : "0" (__addval), "m" (*mem),                        \
> -                          "i" (offsetof (tcbhead_t, multiple_threads)));     \
> -     else if (sizeof (*mem) == 2)                                            \
> -       __asm __volatile (lock "xaddw %w0, %1"                                \
> -                        : "=r" (__result), "=m" (*mem)                       \
> -                        : "0" (__addval), "m" (*mem),                        \
> -                          "i" (offsetof (tcbhead_t, multiple_threads)));     \
> -     else if (sizeof (*mem) == 4)                                            \
> -       __asm __volatile (lock "xaddl %0, %1"                                 \
> -                        : "=r" (__result), "=m" (*mem)                       \
> -                        : "0" (__addval), "m" (*mem),                        \
> -                          "i" (offsetof (tcbhead_t, multiple_threads)));     \
> -     else if (__HAVE_64B_ATOMICS)                                            \
> -       __asm __volatile (lock "xaddq %q0, %1"                                \
> -                        : "=r" (__result), "=m" (*mem)                       \
> -                        : "0" ((int64_t) cast_to_integer (__addval)),     \
> -                          "m" (*mem),                                        \
> -                          "i" (offsetof (tcbhead_t, multiple_threads)));     \
> +     if (SINGLE_THREAD_P)                                                    \
> +       ret = __cmpxchg_op ("", (mem), (newval), (oldval));                   \
>       else                                                                    \
> -       __result = do_exchange_and_add_val_64_acq (pfx, (mem), __addval);      \
> -     __result; })
> -
> -#define atomic_exchange_and_add(mem, value) \
> -  __sync_fetch_and_add (mem, value)
> -
> -#define __arch_exchange_and_add_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P4\n\tje 0f\n\tlock\n0:\t"
> -
> -#define catomic_exchange_and_add(mem, value) \
> -  __arch_exchange_and_add_body (__arch_exchange_and_add_cprefix, __arch_c,    \
> -                               mem, value)
> -
> -
> -#define __arch_add_body(lock, pfx, apfx, mem, value) \
> -  do {                                                                       \
> -    if (__builtin_constant_p (value) && (value) == 1)                        \
> -      pfx##_increment (mem);                                                 \
> -    else if (__builtin_constant_p (value) && (value) == -1)                  \
> -      pfx##_decrement (mem);                                                 \
> -    else if (sizeof (*mem) == 1)                                             \
> -      __asm __volatile (lock "addb %b1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : IBR_CONSTRAINT (value), "m" (*mem),                 \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "addw %w1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (value), "m" (*mem),                           \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "addl %1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (value), "m" (*mem),                           \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "addq %q1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" ((int64_t) cast_to_integer (value)),           \
> -                         "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      do_add_val_64_acq (apfx, (mem), (value));                                      \
> -  } while (0)
> -
> -# define atomic_add(mem, value) \
> -  __arch_add_body (LOCK_PREFIX, atomic, __arch, mem, value)
> -
> -#define __arch_add_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P3\n\tje 0f\n\tlock\n0:\t"
> -
> -#define catomic_add(mem, value) \
> -  __arch_add_body (__arch_add_cprefix, atomic, __arch_c, mem, value)
> +       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));          \
> +     ret; })
>
>
> -#define atomic_add_negative(mem, value) \
> -  ({ unsigned char __result;                                                 \
> +#define __xchg_op(lock, mem, arg, op)                                        \
> +  ({ __typeof (*mem) __ret = (arg);                                          \
>       if (sizeof (*mem) == 1)                                                 \
> -       __asm __volatile (LOCK_PREFIX "addb %b2, %0; sets %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : IBR_CONSTRAINT (value), "m" (*mem));               \
> +       __asm __volatile (lock #op "b %b0, %1"                                \
> +                        : "=q" (__ret), "=m" (*mem)                          \
> +                        : "0" (arg), "m" (*mem)                              \
> +                        : "memory", "cc");                                   \
>       else if (sizeof (*mem) == 2)                                            \
> -       __asm __volatile (LOCK_PREFIX "addw %w2, %0; sets %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" (value), "m" (*mem));                         \
> +       __asm __volatile (lock #op "w %w0, %1"                                \
> +                        : "=r" (__ret), "=m" (*mem)                          \
> +                        : "0" (arg), "m" (*mem)                              \
> +                        : "memory", "cc");                                   \
>       else if (sizeof (*mem) == 4)                                            \
> -       __asm __volatile (LOCK_PREFIX "addl %2, %0; sets %1"                  \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" (value), "m" (*mem));                         \
> +       __asm __volatile (lock #op "l %0, %1"                                 \
> +                        : "=r" (__ret), "=m" (*mem)                          \
> +                        : "0" (arg), "m" (*mem)                              \
> +                        : "memory", "cc");                                   \
>       else if (__HAVE_64B_ATOMICS)                                            \
> -       __asm __volatile (LOCK_PREFIX "addq %q2, %0; sets %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" ((int64_t) cast_to_integer (value)),          \
> -                          "m" (*mem));                                       \
> +       __asm __volatile (lock #op "q %q0, %1"                                \
> +                        : "=r" (__ret), "=m" (*mem)                          \
> +                        : "0" ((int64_t) cast_to_integer (arg)),             \
> +                          "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else                                                                    \
>         __atomic_link_error ();                                               \
> -     __result; })
> -
> +     __ret; })
>
> -#define atomic_add_zero(mem, value) \
> -  ({ unsigned char __result;                                                 \
> +#define __single_op(lock, mem, op)                                           \
> +  ({                                                                         \
>       if (sizeof (*mem) == 1)                                                 \
> -       __asm __volatile (LOCK_PREFIX "addb %b2, %0; setz %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : IBR_CONSTRAINT (value), "m" (*mem));               \
> +       __asm __volatile (lock #op "b %b0"                                    \
> +                        : "=m" (*mem)                                        \
> +                        : "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else if (sizeof (*mem) == 2)                                            \
> -       __asm __volatile (LOCK_PREFIX "addw %w2, %0; setz %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" (value), "m" (*mem));                         \
> +       __asm __volatile (lock #op "w %b0"                                    \
> +                        : "=m" (*mem)                                        \
> +                        : "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else if (sizeof (*mem) == 4)                                            \
> -       __asm __volatile (LOCK_PREFIX "addl %2, %0; setz %1"                  \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" (value), "m" (*mem));                         \
> +       __asm __volatile (lock #op "l %b0"                                    \
> +                        : "=m" (*mem)                                        \
> +                        : "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else if (__HAVE_64B_ATOMICS)                                            \
> -       __asm __volatile (LOCK_PREFIX "addq %q2, %0; setz %1"                 \
> -                        : "=m" (*mem), "=qm" (__result)                      \
> -                        : "ir" ((int64_t) cast_to_integer (value)),          \
> -                          "m" (*mem));                                       \
> +       __asm __volatile (lock #op "q %b0"                                    \
> +                        : "=m" (*mem)                                        \
> +                        : "m" (*mem)                                         \
> +                        : "memory", "cc");                                   \
>       else                                                                    \
> -       __atomic_link_error ();                                       \
> -     __result; })
> +       __atomic_link_error ();                                               \
> +  })
>
> +/* Note that we need no lock prefix.  */
> +#define atomic_exchange_acq(mem, newvalue)                                   \
> +  __xchg_op ("", (mem), (newvalue), xchg)
>
> -#define __arch_increment_body(lock, pfx, mem) \
> -  do {                                                                       \
> -    if (sizeof (*mem) == 1)                                                  \
> -      __asm __volatile (lock "incb %b0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "incw %w0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "incl %0"                                       \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "incq %q0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      do_add_val_64_acq (pfx, mem, 1);                                       \
> -  } while (0)
> +#define atomic_add(mem, value) \
> +  __xchg_op (LOCK_PREFIX, (mem), (value), add)
>
> -#define atomic_increment(mem) __arch_increment_body (LOCK_PREFIX, __arch, mem)
> +#define catomic_add(mem, value)                                              \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __xchg_op ("", (mem), (value), add);                                   \
> +    else                                                                     \
> +      atomic_add (mem, value);                                               \
> +  })
>
> -#define __arch_increment_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P2\n\tje 0f\n\tlock\n0:\t"
>
> -#define catomic_increment(mem) \
> -  __arch_increment_body (__arch_increment_cprefix, __arch_c, mem)
> +#define atomic_increment(mem) \
> +  __single_op (LOCK_PREFIX, (mem), inc)
>
> +#define catomic_increment(mem)                                               \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __single_op ("", (mem), inc);                                          \
> +    else                                                                     \
> +      atomic_increment (mem);                                                \
> +  })
>
>  #define atomic_increment_and_test(mem) \
>    ({ unsigned char __result;                                                 \
> @@ -357,43 +210,20 @@
>                          : "=m" (*mem), "=qm" (__result)                      \
>                          : "m" (*mem));                                       \
>       else                                                                    \
> -       __atomic_link_error ();                                       \
> +       __atomic_link_error ();                                               \
>       __result; })
>
>
> -#define __arch_decrement_body(lock, pfx, mem) \
> -  do {                                                                       \
> -    if (sizeof (*mem) == 1)                                                  \
> -      __asm __volatile (lock "decb %b0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "decw %w0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "decl %0"                                       \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "decq %q0"                                              \
> -                       : "=m" (*mem)                                         \
> -                       : "m" (*mem),                                         \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      do_add_val_64_acq (pfx, mem, -1);                                              \
> -  } while (0)
> -
> -#define atomic_decrement(mem) __arch_decrement_body (LOCK_PREFIX, __arch, mem)
> +#define atomic_decrement(mem)                                                \
> +  __single_op (LOCK_PREFIX, (mem), dec)
>
> -#define __arch_decrement_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P2\n\tje 0f\n\tlock\n0:\t"
> -
> -#define catomic_decrement(mem) \
> -  __arch_decrement_body (__arch_decrement_cprefix, __arch_c, mem)
> +#define catomic_decrement(mem)                                               \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __single_op ("", (mem), dec);                                          \
> +   else                                                                              \
> +     atomic_decrement (mem);                                                 \
> +  })
>
>
>  #define atomic_decrement_and_test(mem) \
> @@ -463,73 +293,31 @@
>                          : "=q" (__result), "=m" (*mem)                       \
>                          : "m" (*mem), "ir" (bit));                           \
>       else                                                                    \
> -       __atomic_link_error ();                                       \
> +       __atomic_link_error ();                                               \
>       __result; })
>
>
> -#define __arch_and_body(lock, mem, mask) \
> -  do {                                                                       \
> -    if (sizeof (*mem) == 1)                                                  \
> -      __asm __volatile (lock "andb %b1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : IBR_CONSTRAINT (mask), "m" (*mem),                  \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "andw %w1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "andl %1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "andq %q1, %0"                                  \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      __atomic_link_error ();                                                \
> -  } while (0)
> -
> -#define __arch_cprefix \
> -  "cmpl $0, %%" SEG_REG ":%P3\n\tje 0f\n\tlock\n0:\t"
> -
> -#define atomic_and(mem, mask) __arch_and_body (LOCK_PREFIX, mem, mask)
> -
> -#define catomic_and(mem, mask) __arch_and_body (__arch_cprefix, mem, mask)
> +#define atomic_and(mem, mask)                                                \
> +  __xchg_op (LOCK_PREFIX, (mem), (mask), and)
>
> +#define catomic_and(mem, mask) \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __xchg_op ("", (mem), (mask), and);                                    \
> +   else                                                                              \
> +      atomic_and (mem, mask);                                                \
> +  })
>
> -#define __arch_or_body(lock, mem, mask) \
> -  do {                                                                       \
> -    if (sizeof (*mem) == 1)                                                  \
> -      __asm __volatile (lock "orb %b1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : IBR_CONSTRAINT (mask), "m" (*mem),                  \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 2)                                             \
> -      __asm __volatile (lock "orw %w1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (sizeof (*mem) == 4)                                             \
> -      __asm __volatile (lock "orl %1, %0"                                    \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else if (__HAVE_64B_ATOMICS)                                             \
> -      __asm __volatile (lock "orq %q1, %0"                                   \
> -                       : "=m" (*mem)                                         \
> -                       : "ir" (mask), "m" (*mem),                            \
> -                         "i" (offsetof (tcbhead_t, multiple_threads)));      \
> -    else                                                                     \
> -      __atomic_link_error ();                                                \
> -  } while (0)
> -
> -#define atomic_or(mem, mask) __arch_or_body (LOCK_PREFIX, mem, mask)
> +#define atomic_or(mem, mask)                                                 \
> +  __xchg_op (LOCK_PREFIX, (mem), (mask), or)
>
> -#define catomic_or(mem, mask) __arch_or_body (__arch_cprefix, mem, mask)
> +#define catomic_or(mem, mask) \
> +  ({                                                                         \
> +    if (SINGLE_THREAD_P)                                                     \
> +      __xchg_op ("", (mem), (mask), or);                                     \
> +   else                                                                              \
> +      atomic_or (mem, mask);                                                 \
> +  })
>
>  /* We don't use mfence because it is supposedly slower due to having to
>     provide stronger guarantees (e.g., regarding self-modifying code).  */
> diff --git a/sysdeps/x86_64/nptl/tcb-offsets.sym b/sysdeps/x86_64/nptl/tcb-offsets.sym
> index 2bbd563a6c..8ec55a7ea8 100644
> --- a/sysdeps/x86_64/nptl/tcb-offsets.sym
> +++ b/sysdeps/x86_64/nptl/tcb-offsets.sym
> @@ -9,7 +9,6 @@ CLEANUP_JMP_BUF         offsetof (struct pthread, cleanup_jmp_buf)
>  CLEANUP                        offsetof (struct pthread, cleanup)
>  CLEANUP_PREV           offsetof (struct _pthread_cleanup_buffer, __prev)
>  MUTEX_FUTEX            offsetof (pthread_mutex_t, __data.__lock)
> -MULTIPLE_THREADS_OFFSET        offsetof (tcbhead_t, multiple_threads)
>  POINTER_GUARD          offsetof (tcbhead_t, pointer_guard)
>  FEATURE_1_OFFSET       offsetof (tcbhead_t, feature_1)
>  SSP_BASE_OFFSET                offsetof (tcbhead_t, ssp_base)
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-10 21:00   ` Noah Goldstein
@ 2022-06-11 13:59     ` Wilco Dijkstra
  2022-06-15 21:07       ` Adhemerval Zanella
  0 siblings, 1 reply; 21+ messages in thread
From: Wilco Dijkstra @ 2022-06-11 13:59 UTC (permalink / raw)
  To: Noah Goldstein, Adhemerval Zanella; +Cc: GNU C Library

Hi Noah,

> +#define __cmpxchg_op(lock, mem, newval, oldval)                                      \
> +  ({ __typeof (*mem) __ret;                                                  \
> +     if (sizeof (*mem) == 1)                                                 \
> +       asm volatile (lock "cmpxchgb %2, %1"                                  \
> +                    : "=a" (__ret), "+m" (*mem)                              \
> +                    : BR_CONSTRAINT (newval), "0" (oldval)                   \
> +                    : "memory");                                             \

> Is the full "memory" clobber needed? Shouldn't the "+m"(*mem) be enough?

For use in acquire/release atomics it is required, since code hoisting and other
optimizations must be prevented.  So the old implementation was buggy, and this
is why we need to remove these target-specific hacks.
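
As a minimal sketch (not glibc code) of what goes wrong without it:
nothing tells the compiler that the asm orders surrounding memory
accesses, so a preceding store can legally be moved past it:

  static int payload;
  static int ready;

  void publish (int v)
  {
    int one = 1;
    payload = v;
    /* Without the "memory" clobber GCC only knows that `one' and
       `ready' are touched, so the `payload = v' store may be sunk
       past the exchange; with it, the store must complete first.  */
    __asm__ __volatile__ ("xchgl %0, %1"
                          : "+r" (one), "+m" (ready)
                          :
                          : "memory");
  }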

Also, it is ridiculous to write hundreds of lines of hacky inline assembler for basic
macros like atomic_bit_test_set when there is only a single use in all of GLIBC, which
can trivially be replaced with the compiler builtin __atomic_fetch_or.
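
For reference, that single use maps directly onto the builtin; a sketch
(the acquire ordering here is an assumption, the real replacement must
match the original semantics):

  static inline int
  bit_test_set (unsigned int *mem, unsigned int bit)
  {
    unsigned int mask = 1u << bit;
    return (__atomic_fetch_or (mem, mask, __ATOMIC_ACQUIRE) & mask) != 0;
  }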

Any single-threaded optimizations should be done on a much higher level and only where
there is a clear performance gain. So we should get rid of all the atomic-machine headers.
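
Something along these lines, done once in generic code rather than in
every atomic-machine.h (a sketch; refcount_inc is a made-up example,
not an existing glibc function):

  #include <sys/single_threaded.h>

  static void
  refcount_inc (int *ctr)
  {
    if (__libc_single_threaded)
      ++*ctr;	/* No atomic needed in the single-threaded case.  */
    else
      __atomic_fetch_add (ctr, 1, __ATOMIC_RELAXED);
  }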

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-10 16:35 ` [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB Adhemerval Zanella
  2022-06-10 19:49   ` H.J. Lu
  2022-06-10 21:00   ` Noah Goldstein
@ 2022-06-13 21:31   ` Wilco Dijkstra
  2022-06-15 21:10     ` Adhemerval Zanella
  2022-06-16  7:35   ` Fangrui Song
  3 siblings, 1 reply; 21+ messages in thread
From: Wilco Dijkstra @ 2022-06-13 21:31 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

Hi Adhemerval,

> The x86 atomics need some adjustments since they have the single-thread
> optimization built into the inline assembly.  They now use
> SINGLE_THREAD_P, and some atomic optimizations are removed (since they
> are not used).

I'd suggest removing all single-thread optimizations from target atomics.
Many aren't used at all (e.g. catomic_or/catomic_and), and the rest have few
uses, none of which are performance critical.  The uses in malloc.c are in fact
counterproductive since they are only reached if there are multiple threads!
As a result I think you might not need any inline assembler at all.

I have a patch that removes all of the catomic definitions across GLIBC.
If we find code that could benefit, then we can optimize it at a higher level
and in a generic way, similar to how we optimized malloc or
https://sourceware.org/pipermail/libc-alpha/2022-June/139566.html.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-11 13:59     ` Wilco Dijkstra
@ 2022-06-15 21:07       ` Adhemerval Zanella
  2022-06-16 12:48         ` Wilco Dijkstra
  0 siblings, 1 reply; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-15 21:07 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Noah Goldstein, GNU C Library



> On 11 Jun 2022, at 06:59, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> 
> Hi Noah,
> 
>> +#define __cmpxchg_op(lock, mem, newval, oldval)                                      \
>> +  ({ __typeof (*mem) __ret;                                                  \
>> +     if (sizeof (*mem) == 1)                                                 \
>> +       asm volatile (lock "cmpxchgb %2, %1"                                  \
>> +                    : "=a" (__ret), "+m" (*mem)                            \
>> +                    : BR_CONSTRAINT (newval), "0" (oldval)                   \
>> +                    : "memory");                                             \
> 
>> Is the full "memory" clobber needed? Shouldn't the "+m"(*mem) be enough?
> 
> For use in acquire/release atomics, it is required since code hoisting and other
> optimizations must be prevented. So the old implementation was buggy, and this
> is why we need to remove these target specific hacks.

Yes, I noted this while checking the Linux kernel implementation.  Although I am
not sure it really matters, since we already have a volatile asm that should
prevent code hoisting.
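
To illustrate the distinction with a sketch: volatile only keeps the
asm itself from being deleted or reordered against other volatile
operations; it does not order ordinary memory accesses around it.

  extern int shared;

  int wait_for_flag (void)
  {
    int v;
    do
      {
        /* volatile keeps the pause in place, but without a "memory"
           clobber GCC may still load `shared' once, before the
           loop.  */
        __asm__ __volatile__ ("pause");
        v = shared;
      }
    while (v == 0);
    return v;
  }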

> 
> Also it is ridiculous to write hundreds of lines of hacky inline assembler for basic macros
> like atomic_bit_test_set when there is only a single use in all of GLIBC which can trivially
> be replaced with the compiler builtin __atomic_fetch_or.

Indeed, AFAIK it was done back when gcc did not provide atomic builtins across
different architectures, so each one needed to implement them.  For newer code we
have started to require the new C11-like atomic macros, and I have started to send
some fixes to move away from the old atomics (for instance the pthread_cancel revert).
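
The C11-like internal macros are thin wrappers over the builtins,
roughly (simplified from include/atomic.h):

  #define atomic_load_relaxed(mem) \
    __atomic_load_n ((mem), __ATOMIC_RELAXED)
  #define atomic_fetch_add_relaxed(mem, operand) \
    __atomic_fetch_add ((mem), (operand), __ATOMIC_RELAXED)
  #define atomic_compare_exchange_weak_acquire(mem, expected, desired) \
    __atomic_compare_exchange_n ((mem), (expected), (desired), 1, \
                                 __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)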

> 
> Any single-threaded optimizations should be done on a much higher level and only where
> there is a clear performance gain. So we should get rid of all the atomic-machine headers.

Completely agree.  I have started to clean this up by first moving some architectures
to use compiler builtins [1].  I will check which architectures still don't use the
compiler builtins and see if we can move them.

[1] https://patchwork.sourceware.org/project/glibc/patch/20210929191430.884057-1-adhemerval.zanella@linaro.org/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-13 21:31   ` Wilco Dijkstra
@ 2022-06-15 21:10     ` Adhemerval Zanella
  0 siblings, 0 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-15 21:10 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: libc-alpha



> On 13 Jun 2022, at 14:31, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> 
> Hi Adhemerval,
> 
>> The x86 atomics need some adjustments since they have the single-thread
>> optimization built into the inline assembly.  They now use
>> SINGLE_THREAD_P, and some atomic optimizations are removed (since they
>> are not used).
> 
> I'd suggest to remove all single-thread optimizations from target atomics.
> Many aren't used at all (eg. catomic_or/catomic_and), and the rest has few
> uses which are not performance critical. The uses in malloc.c are in fact
> counterproductive since they are only used if there are multiple threads!
> As a result I think you might not need any inline assembler at all.

That is my plan; I haven't done it in this patchset because I first want to
consolidate the single-thread optimization to use only one internal scheme.

Once it is in, my plan is to move all architectures to use compiler builtins
if possible (I recall that the minimum gcc version prevented us from doing so
on some architectures, but I think we might be able to do it now), then
remove all the unused atomic macros, and make all internal ones use the
C11-like internal macros.
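
For most ports that should shrink atomic-machine.h to little more than
the feature knobs, e.g. (a sketch for a hypothetical port with native
64-bit atomics):

  #define USE_ATOMIC_COMPILER_BUILTINS 1
  #define __HAVE_64B_ATOMICS 1
  #define ATOMIC_EXCHANGE_USES_CAS 0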

> 
> I have a patch that removes all of the catomic definitions across GLIBC.
> If we find code that could benefit then we can optimize it at a higher level
> and in a generic way similar to how we optimized malloc or 
> https://sourceware.org/pipermail/libc-alpha/2022-June/139566.html.

Yes, it is on my backlog to review it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded
  2022-06-10 16:35 ` [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded Adhemerval Zanella
@ 2022-06-16  7:15   ` Fangrui Song
  2022-06-16 22:06     ` Adhemerval Zanella
  2022-06-20  8:37   ` Florian Weimer
  1 sibling, 1 reply; 21+ messages in thread
From: Fangrui Song @ 2022-06-16  7:15 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Wilco Dijkstra

On 2022-06-10, Adhemerval Zanella via Libc-alpha wrote:
>By adding an internal hidden_def alias to avoid the GOT indirection.
>On some architectures, __libc_single_threaded may be accessed through
>copy relocations, and thus both copies must be updated.
>
>To obtain the correct address of __libc_single_threaded,
>__libc_dlsym is extended to support RTLD_DEFAULT.  It searches
>through all scopes instead of the default local one.
>
>Checked on x86_64-linux-gnu and i686-linux-gnu.
>---
> elf/dl-libc.c                 | 20 ++++++++++++++++++--
> elf/libc_early_init.c         |  9 +++++++++
> include/sys/single_threaded.h | 11 +++++++++++
> misc/single_threaded.c        |  2 ++
> nptl/pthread_create.c         |  6 +++++-
> 5 files changed, 45 insertions(+), 3 deletions(-)
>
>diff --git a/elf/dl-libc.c b/elf/dl-libc.c
>index 266e068da6..e64f4b9910 100644
>--- a/elf/dl-libc.c
>+++ b/elf/dl-libc.c
>@@ -16,6 +16,7 @@
>    License along with the GNU C Library; if not, see
>    <https://www.gnu.org/licenses/>.  */
>
>+#include <assert.h>
> #include <dlfcn.h>
> #include <stdlib.h>
> #include <ldsodefs.h>
>@@ -72,6 +73,7 @@ struct do_dlsym_args
>   /* Arguments to do_dlsym.  */
>   struct link_map *map;
>   const char *name;
>+  const void *caller_dlsym;
>
>   /* Return values of do_dlsym.  */
>   lookup_t loadbase;
>@@ -102,8 +104,21 @@ do_dlsym (void *ptr)
> {
>   struct do_dlsym_args *args = (struct do_dlsym_args *) ptr;
>   args->ref = NULL;
>-  args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, args->map, &args->ref,
>-					     args->map->l_local_scope, NULL, 0,
>+  struct link_map *match = args->map;
>+  struct r_scope_elem **scope;
>+  if (args->map == RTLD_DEFAULT)
>+    {
>+      ElfW(Addr) caller = (ElfW(Addr)) args->caller_dlsym;
>+      match = _dl_find_dso_for_object (caller);
>+      /* It is only used internally, so the caller should always be
>+	 recognized.  */
>+      assert (match != NULL);
>+      scope = match->l_scope;
>+    }
>+  else
>+    scope = args->map->l_local_scope;
>+
>+  args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, match, &args->ref,
>+					     scope, NULL, 0,
> 					     DL_LOOKUP_RETURN_NEWEST, NULL);
> }
>
>@@ -182,6 +197,7 @@ __libc_dlsym (void *map, const char *name)
>   struct do_dlsym_args args;
>   args.map = map;
>   args.name = name;
>+  args.caller_dlsym = RETURN_ADDRESS (0);
>
> #ifdef SHARED
>   if (GLRO (dl_dlfcn_hook) != NULL)
>diff --git a/elf/libc_early_init.c b/elf/libc_early_init.c
>index 3c4a19cf6b..7cc2997122 100644
>--- a/elf/libc_early_init.c
>+++ b/elf/libc_early_init.c
>@@ -16,7 +16,9 @@
>    License along with the GNU C Library; if not, see
>    <https://www.gnu.org/licenses/>.  */
>
>+#include <assert.h>
> #include <ctype.h>
>+#include <dlfcn.h>
> #include <elision-conf.h>
> #include <libc-early-init.h>
> #include <libc-internal.h>
>@@ -38,6 +40,13 @@ __libc_early_init (_Bool initial)
>   __libc_single_threaded = initial;
>
> #ifdef SHARED
>+  /* __libc_single_threaded can be accessed through copy relocations, so
>+     the external copy must be updated as well.  */
>+  __libc_external_single_threaded = __libc_dlsym (RTLD_DEFAULT,
>+						  "__libc_single_threaded");
>+  assert (__libc_external_single_threaded != NULL);
>+  *__libc_external_single_threaded = initial;
>+
>   __libc_initial = initial;
> #endif

I think this whole scheme can be greatly simplified.

Under the hood,

extern __typeof (__libc_single_threaded) __EI___libc_single_threaded
  __asm__ ("" "__libc_single_threaded");
extern __typeof (__libc_single_threaded) __EI___libc_single_threaded
  __attribute__ ((alias ("" "__GI___libc_single_threaded")))
  __attribute__ ((__copy__ (__libc_single_threaded)));

* __libc_single_threaded is the STV_HIDDEN symbol with the asm name "__GI___libc_single_threaded"
* __EI___libc_single_threaded is the STV_DEFAULT symbol with the asm name "__libc_single_threaded". It aliases __libc_single_threaded.

We can just access __EI___libc_single_threaded, which will lead to a GOT
indirection (R_*_GLOB_DAT).  This can avoid the __libc_dlsym call.

I can see that accessing the __EI declaration is currently inconvenient
because include/libc-symbols.h does not seem to provide a convenient
macro, but that can be added.
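
In other words, something like this sketch (set_single_threaded is a
made-up helper; the __EI declaration would come from a new
libc-symbols.h macro):

  extern char __EI___libc_single_threaded
    __asm__ ("__libc_single_threaded");

  static inline void
  set_single_threaded (char value)
  {
    /* This store goes through the GOT (R_*_GLOB_DAT), which the
       dynamic linker may have pointed at the executable's
       copy-relocated instance.  */
    __EI___libc_single_threaded = value;
  }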

>diff --git a/include/sys/single_threaded.h b/include/sys/single_threaded.h
>index 18f6972482..258b01e0b2 100644
>--- a/include/sys/single_threaded.h
>+++ b/include/sys/single_threaded.h
>@@ -1 +1,12 @@
> #include <misc/sys/single_threaded.h>
>+
>+#ifndef _ISOMAC
>+
>+libc_hidden_proto (__libc_single_threaded);
>+
>+# ifdef SHARED
>+extern __typeof (__libc_single_threaded) *__libc_external_single_threaded
>+  attribute_hidden;
>+# endif
>+
>+#endif
>diff --git a/misc/single_threaded.c b/misc/single_threaded.c
>index 96ada9137b..201d86a273 100644
>--- a/misc/single_threaded.c
>+++ b/misc/single_threaded.c
>@@ -22,6 +22,8 @@
>    __libc_early_init (as false for inner libcs).  */
> #ifdef SHARED
> char __libc_single_threaded;
>+__typeof (__libc_single_threaded) *__libc_external_single_threaded;
> #else
> char __libc_single_threaded = 1;
> #endif
>+libc_hidden_data_def (__libc_single_threaded)
>diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
>index e7a099acb7..5633d01c62 100644
>--- a/nptl/pthread_create.c
>+++ b/nptl/pthread_create.c
>@@ -627,7 +627,11 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
>   if (__libc_single_threaded)
>     {
>       late_init ();
>-      __libc_single_threaded = 0;
>+      __libc_single_threaded =
>+#ifdef SHARED
>+        *__libc_external_single_threaded =
>+#endif
>+	0;
>     }
>
>   const struct pthread_attr *iattr = (struct pthread_attr *) attr;
>-- 
>2.34.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-10 16:35 ` [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB Adhemerval Zanella
                     ` (2 preceding siblings ...)
  2022-06-13 21:31   ` Wilco Dijkstra
@ 2022-06-16  7:35   ` Fangrui Song
  3 siblings, 0 replies; 21+ messages in thread
From: Fangrui Song @ 2022-06-16  7:35 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha, Wilco Dijkstra

On 2022-06-10, Adhemerval Zanella via Libc-alpha wrote:
>Instead use __libc_single_threaded on all architectures.  The TCB
>field is renamed to avoid changing the struct layout.
>
>The x86 atomics need some adjustments since they have the single-thread
>optimization built into the inline assembly.  They now use
>SINGLE_THREAD_P, and some atomic optimizations are removed (since they
>are not used).
>
>Checked on x86_64-linux-gnu and i686-linux-gnu.
>---
> misc/tst-atomic.c                       |   1 +
> nptl/allocatestack.c                    |   6 -
> nptl/descr.h                            |  17 +-
> nptl/pthread_cancel.c                   |   7 +-
> nptl/pthread_create.c                   |   5 -
> sysdeps/i386/htl/tcb-offsets.sym        |   1 -
> sysdeps/i386/nptl/tcb-offsets.sym       |   1 -
> sysdeps/i386/nptl/tls.h                 |   4 +-
> sysdeps/ia64/nptl/tcb-offsets.sym       |   1 -
> sysdeps/ia64/nptl/tls.h                 |   2 -
> sysdeps/mach/hurd/i386/tls.h            |   4 +-
> sysdeps/nios2/nptl/tcb-offsets.sym      |   1 -
> sysdeps/or1k/nptl/tls.h                 |   2 -
> sysdeps/powerpc/nptl/tcb-offsets.sym    |   3 -
> sysdeps/powerpc/nptl/tls.h              |   3 -
> sysdeps/s390/nptl/tcb-offsets.sym       |   1 -
> sysdeps/s390/nptl/tls.h                 |   6 +-
> sysdeps/sh/nptl/tcb-offsets.sym         |   1 -
> sysdeps/sh/nptl/tls.h                   |   2 -
> sysdeps/sparc/nptl/tcb-offsets.sym      |   1 -
> sysdeps/sparc/nptl/tls.h                |   2 +-
> sysdeps/unix/sysv/linux/single-thread.h |  15 +-
> sysdeps/x86/atomic-machine.h            | 484 +++++++-----------------
> sysdeps/x86_64/nptl/tcb-offsets.sym     |   1 -
> 24 files changed, 145 insertions(+), 426 deletions(-)
>
>diff --git a/misc/tst-atomic.c b/misc/tst-atomic.c
>index 6d681a7bfd..ddbc618e25 100644
>--- a/misc/tst-atomic.c
>+++ b/misc/tst-atomic.c
>@@ -18,6 +18,7 @@
>
> #include <stdio.h>
> #include <atomic.h>
>+#include <support/xthread.h>
>
> #ifndef atomic_t
> # define atomic_t int
>diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
>index 98f5f6dd85..3e0d01cb52 100644
>--- a/nptl/allocatestack.c
>+++ b/nptl/allocatestack.c
>@@ -290,9 +290,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
> 	 stack cache nor will the memory (except the TLS memory) be freed.  */
>       pd->user_stack = true;
>
>-      /* This is at least the second thread.  */
>-      pd->header.multiple_threads = 1;
>-
> #ifdef NEED_DL_SYSINFO
>       SETUP_THREAD_SYSINFO (pd);
> #endif
>@@ -408,9 +405,6 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
> 	     descriptor.  */
> 	  pd->specific[0] = pd->specific_1stblock;
>
>-	  /* This is at least the second thread.  */
>-	  pd->header.multiple_threads = 1;
>-
> #ifdef NEED_DL_SYSINFO
> 	  SETUP_THREAD_SYSINFO (pd);
> #endif
>diff --git a/nptl/descr.h b/nptl/descr.h
>index bb46b5958e..77b25d8267 100644
>--- a/nptl/descr.h
>+++ b/nptl/descr.h
>@@ -137,22 +137,7 @@ struct pthread
> #else
>     struct
>     {
>-      /* multiple_threads is enabled either when the process has spawned at
>-	 least one thread or when a single-threaded process cancels itself.
>-	 This enables additional code to introduce locking before doing some
>-	 compare_and_exchange operations and also enable cancellation points.
>-	 The concepts of multiple threads and cancellation points ideally
>-	 should be separate, since it is not necessary for multiple threads to
>-	 have been created for cancellation points to be enabled, as is the
>-	 case is when single-threaded process cancels itself.
>-
>-	 Since enabling multiple_threads enables additional code in
>-	 cancellation points and compare_and_exchange operations, there is a
>-	 potential for an unneeded performance hit when it is enabled in a
>-	 single-threaded, self-canceling process.  This is OK though, since a
>-	 single-threaded process will enable async cancellation only when it
>-	 looks to cancel itself and is hence going to end anyway.  */
>-      int multiple_threads;
>+      int unused_multiple_threads;

For an unused member variable: I see that sometimes a name like
__glibc_unused1 is used.  Is unused_ preferred now?

>       int gscope_flag;
>     } header;
> #endif
>diff --git a/nptl/pthread_cancel.c b/nptl/pthread_cancel.c
>index e1735279f2..6d26a15d0e 100644
>--- a/nptl/pthread_cancel.c
>+++ b/nptl/pthread_cancel.c
>@@ -157,12 +157,9 @@ __pthread_cancel (pthread_t th)
>
> 	/* A single-threaded process should be able to kill itself, since
> 	   there is nothing in the POSIX specification that says that it
>-	   cannot.  So we set multiple_threads to true so that cancellation
>-	   points get executed.  */
>-	THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
>-#ifndef TLS_MULTIPLE_THREADS_IN_TCB
>+	   cannot.  So we set __libc_single_threaded to false so that
>+	   cancellation points get executed.  */
> 	__libc_single_threaded = 0;
>-#endif
>     }
>   while (!atomic_compare_exchange_weak_acquire (&pd->cancelhandling, &oldval,
> 						newval));
>diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
>index 5633d01c62..d43865352f 100644
>--- a/nptl/pthread_create.c
>+++ b/nptl/pthread_create.c
>@@ -882,11 +882,6 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
> 	   other reason that create_thread chose.  Now let it run
> 	   free.  */
> 	lll_unlock (pd->lock, LLL_PRIVATE);
>-
>-      /* We now have for sure more than one thread.  The main thread might
>-	 not yet have the flag set.  No need to set the global variable
>-	 again if this is what we use.  */
>-      THREAD_SETMEM (THREAD_SELF, header.multiple_threads, 1);
>     }
>
>  out:
>diff --git a/sysdeps/i386/htl/tcb-offsets.sym b/sysdeps/i386/htl/tcb-offsets.sym
>index 7b7c719369..f3f7df6c06 100644
>--- a/sysdeps/i386/htl/tcb-offsets.sym
>+++ b/sysdeps/i386/htl/tcb-offsets.sym
>@@ -2,7 +2,6 @@
> #include <tls.h>
> #include <kernel-features.h>
>
>-MULTIPLE_THREADS_OFFSET offsetof (tcbhead_t, multiple_threads)
> SYSINFO_OFFSET          offsetof (tcbhead_t, sysinfo)
> POINTER_GUARD           offsetof (tcbhead_t, pointer_guard)
> SIGSTATE_OFFSET         offsetof (tcbhead_t, _hurd_sigstate)
>diff --git a/sysdeps/i386/nptl/tcb-offsets.sym b/sysdeps/i386/nptl/tcb-offsets.sym
>index 2ec9e787c1..1efd1469d8 100644
>--- a/sysdeps/i386/nptl/tcb-offsets.sym
>+++ b/sysdeps/i386/nptl/tcb-offsets.sym
>@@ -6,7 +6,6 @@ RESULT			offsetof (struct pthread, result)
> TID			offsetof (struct pthread, tid)
> CANCELHANDLING		offsetof (struct pthread, cancelhandling)
> CLEANUP_JMP_BUF		offsetof (struct pthread, cleanup_jmp_buf)
>-MULTIPLE_THREADS_OFFSET	offsetof (tcbhead_t, multiple_threads)
> SYSINFO_OFFSET		offsetof (tcbhead_t, sysinfo)
> CLEANUP			offsetof (struct pthread, cleanup)
> CLEANUP_PREV		offsetof (struct _pthread_cleanup_buffer, __prev)
>diff --git a/sysdeps/i386/nptl/tls.h b/sysdeps/i386/nptl/tls.h
>index 91090bf287..48940a9f44 100644
>--- a/sysdeps/i386/nptl/tls.h
>+++ b/sysdeps/i386/nptl/tls.h
>@@ -36,7 +36,7 @@ typedef struct
> 			   thread descriptor used by libpthread.  */
>   dtv_t *dtv;
>   void *self;		/* Pointer to the thread descriptor.  */
>-  int multiple_threads;
>+  int unused_multiple_threads;
>   uintptr_t sysinfo;
>   uintptr_t stack_guard;
>   uintptr_t pointer_guard;
>@@ -57,8 +57,6 @@ typedef struct
> _Static_assert (offsetof (tcbhead_t, __private_ss) == 0x30,
> 		"offset of __private_ss != 0x30");
>
>-# define TLS_MULTIPLE_THREADS_IN_TCB 1
>-
> #else /* __ASSEMBLER__ */
> # include <tcb-offsets.h>
> #endif
>diff --git a/sysdeps/ia64/nptl/tcb-offsets.sym b/sysdeps/ia64/nptl/tcb-offsets.sym
>index b01f712be2..ab2cb180f9 100644
>--- a/sysdeps/ia64/nptl/tcb-offsets.sym
>+++ b/sysdeps/ia64/nptl/tcb-offsets.sym
>@@ -2,5 +2,4 @@
> #include <tls.h>
>
> TID			offsetof (struct pthread, tid) - TLS_PRE_TCB_SIZE
>-MULTIPLE_THREADS_OFFSET offsetof (struct pthread, header.multiple_threads) - TLS_PRE_TCB_SIZE
> SYSINFO_OFFSET		offsetof (tcbhead_t, __private)
>diff --git a/sysdeps/ia64/nptl/tls.h b/sysdeps/ia64/nptl/tls.h
>index 8ccedb73e6..008e080fc4 100644
>--- a/sysdeps/ia64/nptl/tls.h
>+++ b/sysdeps/ia64/nptl/tls.h
>@@ -36,8 +36,6 @@ typedef struct
>
> register struct pthread *__thread_self __asm__("r13");
>
>-# define TLS_MULTIPLE_THREADS_IN_TCB 1
>-
> #else /* __ASSEMBLER__ */
> # include <tcb-offsets.h>
> #endif
>diff --git a/sysdeps/mach/hurd/i386/tls.h b/sysdeps/mach/hurd/i386/tls.h
>index 264ed9a9c5..d33e91c922 100644
>--- a/sysdeps/mach/hurd/i386/tls.h
>+++ b/sysdeps/mach/hurd/i386/tls.h
>@@ -33,7 +33,7 @@ typedef struct
>   void *tcb;			/* Points to this structure.  */
>   dtv_t *dtv;			/* Vector of pointers to TLS data.  */
>   thread_t self;		/* This thread's control port.  */
>-  int multiple_threads;
>+  int unused_multiple_threads;
>   uintptr_t sysinfo;
>   uintptr_t stack_guard;
>   uintptr_t pointer_guard;
>@@ -117,8 +117,6 @@ _hurd_tls_init (tcbhead_t *tcb)
>   /* This field is used by TLS accesses to get our "thread pointer"
>      from the TLS point of view.  */
>   tcb->tcb = tcb;
>-  /* We always at least start the sigthread anyway.  */
>-  tcb->multiple_threads = 1;
>
>   /* Get the first available selector.  */
>   int sel = -1;
>diff --git a/sysdeps/nios2/nptl/tcb-offsets.sym b/sysdeps/nios2/nptl/tcb-offsets.sym
>index 3cd8d984ac..93a695ac7f 100644
>--- a/sysdeps/nios2/nptl/tcb-offsets.sym
>+++ b/sysdeps/nios2/nptl/tcb-offsets.sym
>@@ -8,6 +8,5 @@
> # define __thread_self          ((void *) 0)
> # define thread_offsetof(mem)   ((ptrdiff_t) THREAD_SELF + offsetof (struct pthread, mem))
>
>-MULTIPLE_THREADS_OFFSET		thread_offsetof (header.multiple_threads)
> TID_OFFSET			thread_offsetof (tid)
> POINTER_GUARD			(offsetof (tcbhead_t, pointer_guard) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
>diff --git a/sysdeps/or1k/nptl/tls.h b/sysdeps/or1k/nptl/tls.h
>index c6ffe62c3f..3bb07beef8 100644
>--- a/sysdeps/or1k/nptl/tls.h
>+++ b/sysdeps/or1k/nptl/tls.h
>@@ -35,8 +35,6 @@ typedef struct
>
> register tcbhead_t *__thread_self __asm__("r10");
>
>-# define TLS_MULTIPLE_THREADS_IN_TCB 1
>-
> /* Get system call information.  */
> # include <sysdep.h>
>
>diff --git a/sysdeps/powerpc/nptl/tcb-offsets.sym b/sysdeps/powerpc/nptl/tcb-offsets.sym
>index 4c01615ad0..a0ee95f94d 100644
>--- a/sysdeps/powerpc/nptl/tcb-offsets.sym
>+++ b/sysdeps/powerpc/nptl/tcb-offsets.sym
>@@ -10,9 +10,6 @@
> # define thread_offsetof(mem)	((ptrdiff_t) THREAD_SELF + offsetof (struct pthread, mem))
>
>
>-#if TLS_MULTIPLE_THREADS_IN_TCB
>-MULTIPLE_THREADS_OFFSET		thread_offsetof (header.multiple_threads)
>-#endif
> TID				thread_offsetof (tid)
> POINTER_GUARD			(offsetof (tcbhead_t, pointer_guard) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
> TAR_SAVE			(offsetof (tcbhead_t, tar_save) - TLS_TCB_OFFSET - sizeof (tcbhead_t))
>diff --git a/sysdeps/powerpc/nptl/tls.h b/sysdeps/powerpc/nptl/tls.h
>index 22b0075235..fd5ee51981 100644
>--- a/sysdeps/powerpc/nptl/tls.h
>+++ b/sysdeps/powerpc/nptl/tls.h
>@@ -52,9 +52,6 @@
> # define TLS_DTV_AT_TP	1
> # define TLS_TCB_AT_TP	0
>
>-/* We use the multiple_threads field in the pthread struct */
>-#define TLS_MULTIPLE_THREADS_IN_TCB	1
>-
> /* Get the thread descriptor definition.  */
> # include <nptl/descr.h>
>
>diff --git a/sysdeps/s390/nptl/tcb-offsets.sym b/sysdeps/s390/nptl/tcb-offsets.sym
>index 9c1c01f353..bc7b267463 100644
>--- a/sysdeps/s390/nptl/tcb-offsets.sym
>+++ b/sysdeps/s390/nptl/tcb-offsets.sym
>@@ -1,6 +1,5 @@
> #include <sysdep.h>
> #include <tls.h>
>
>-MULTIPLE_THREADS_OFFSET		offsetof (tcbhead_t, multiple_threads)
> STACK_GUARD			offsetof (tcbhead_t, stack_guard)
> TID				offsetof (struct pthread, tid)
>diff --git a/sysdeps/s390/nptl/tls.h b/sysdeps/s390/nptl/tls.h
>index ff210ffeb2..d69ed539f7 100644
>--- a/sysdeps/s390/nptl/tls.h
>+++ b/sysdeps/s390/nptl/tls.h
>@@ -35,7 +35,7 @@ typedef struct
> 			   thread descriptor used by libpthread.  */
>   dtv_t *dtv;
>   void *self;		/* Pointer to the thread descriptor.  */
>-  int multiple_threads;
>+  int unused_multiple_threads;
>   uintptr_t sysinfo;
>   uintptr_t stack_guard;
>   int gscope_flag;
>@@ -44,10 +44,6 @@ typedef struct
>   void *__private_ss;
> } tcbhead_t;
>
>-# ifndef __s390x__
>-#  define TLS_MULTIPLE_THREADS_IN_TCB 1
>-# endif
>-
> #else /* __ASSEMBLER__ */
> # include <tcb-offsets.h>
> #endif
>diff --git a/sysdeps/sh/nptl/tcb-offsets.sym b/sysdeps/sh/nptl/tcb-offsets.sym
>index 234207779d..4e452d9c6c 100644
>--- a/sysdeps/sh/nptl/tcb-offsets.sym
>+++ b/sysdeps/sh/nptl/tcb-offsets.sym
>@@ -6,7 +6,6 @@ RESULT			offsetof (struct pthread, result)
> TID			offsetof (struct pthread, tid)
> CANCELHANDLING		offsetof (struct pthread, cancelhandling)
> CLEANUP_JMP_BUF		offsetof (struct pthread, cleanup_jmp_buf)
>-MULTIPLE_THREADS_OFFSET	offsetof (struct pthread, header.multiple_threads)
> TLS_PRE_TCB_SIZE	sizeof (struct pthread)
> MUTEX_FUTEX		offsetof (pthread_mutex_t, __data.__lock)
> POINTER_GUARD		offsetof (tcbhead_t, pointer_guard)
>diff --git a/sysdeps/sh/nptl/tls.h b/sysdeps/sh/nptl/tls.h
>index 76591ab6ef..8778cb4ac0 100644
>--- a/sysdeps/sh/nptl/tls.h
>+++ b/sysdeps/sh/nptl/tls.h
>@@ -36,8 +36,6 @@ typedef struct
>   uintptr_t pointer_guard;
> } tcbhead_t;
>
>-# define TLS_MULTIPLE_THREADS_IN_TCB 1
>-
> #else /* __ASSEMBLER__ */
> # include <tcb-offsets.h>
> #endif /* __ASSEMBLER__ */
>diff --git a/sysdeps/sparc/nptl/tcb-offsets.sym b/sysdeps/sparc/nptl/tcb-offsets.sym
>index f75d02065e..e4a7e4720f 100644
>--- a/sysdeps/sparc/nptl/tcb-offsets.sym
>+++ b/sysdeps/sparc/nptl/tcb-offsets.sym
>@@ -1,6 +1,5 @@
> #include <sysdep.h>
> #include <tls.h>
>
>-MULTIPLE_THREADS_OFFSET		offsetof (tcbhead_t, multiple_threads)
> POINTER_GUARD			offsetof (tcbhead_t, pointer_guard)
> TID				offsetof (struct pthread, tid)
>diff --git a/sysdeps/sparc/nptl/tls.h b/sysdeps/sparc/nptl/tls.h
>index d1e2bb4ad1..b78cf0d6b4 100644
>--- a/sysdeps/sparc/nptl/tls.h
>+++ b/sysdeps/sparc/nptl/tls.h
>@@ -35,7 +35,7 @@ typedef struct
> 			   thread descriptor used by libpthread.  */
>   dtv_t *dtv;
>   void *self;
>-  int multiple_threads;
>+  int unused_multiple_threads;
> #if __WORDSIZE == 64
>   int gscope_flag;
> #endif
>diff --git a/sysdeps/unix/sysv/linux/single-thread.h b/sysdeps/unix/sysv/linux/single-thread.h
>index 208edccce6..dd80e82c82 100644
>--- a/sysdeps/unix/sysv/linux/single-thread.h
>+++ b/sysdeps/unix/sysv/linux/single-thread.h
>@@ -23,20 +23,7 @@
> # include <sys/single_threaded.h>
> #endif
>
>-/* The default way to check if the process is single thread is by using the
>-   pthread_t 'multiple_threads' field.  However, for some architectures it is
>-   faster to either use an extra field on TCB or global variables (the TCB
>-   field is also used on x86 for some single-thread atomic optimizations).
>-
>-   The ABI might define SINGLE_THREAD_BY_GLOBAL to enable the single thread
>-   check to use global variables instead of the pthread_t field.  */
>-
>-#if !defined SINGLE_THREAD_BY_GLOBAL || IS_IN (rtld)
>-# define SINGLE_THREAD_P \
>-  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
>-#else
>-# define SINGLE_THREAD_P (__libc_single_threaded != 0)
>-#endif
>+#define SINGLE_THREAD_P (__libc_single_threaded != 0)
>
> #define RTLD_SINGLE_THREAD_P SINGLE_THREAD_P
>
>diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
>index f24f1c71ed..23e087e7e0 100644
>--- a/sysdeps/x86/atomic-machine.h
>+++ b/sysdeps/x86/atomic-machine.h
>@@ -51,292 +51,145 @@
> #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
>   (! __sync_bool_compare_and_swap (mem, oldval, newval))
>
>+#define __cmpxchg_op(lock, mem, newval, oldval)			      \
>+  ({ __typeof (*mem) __ret;						      \
>+     if (sizeof (*mem) == 1)						      \
>+       asm volatile (lock "cmpxchgb %2, %1"				      \
>+		     : "=a" (__ret), "+m" (*mem)			      \
>+		     : BR_CONSTRAINT (newval), "0" (oldval)		      \
>+		     : "memory");					      \
>+     else if (sizeof (*mem) == 2)					      \
>+       asm volatile (lock "cmpxchgw %2, %1"				      \
>+		     : "=a" (__ret), "+m" (*mem)			      \
>+		     : BR_CONSTRAINT (newval), "0" (oldval)		      \
>+		     : "memory");					      \
>+     else if (sizeof (*mem) == 4)					      \
>+       asm volatile (lock "cmpxchgl %2, %1"				      \
>+		     : "=a" (__ret), "+m" (*mem)			      \
>+		     : BR_CONSTRAINT (newval), "0" (oldval)		      \
>+		     : "memory");					      \
>+     else if (__HAVE_64B_ATOMICS)					      \
>+       asm volatile (lock "cmpxchgq %2, %1"				      \
>+		     : "=a" (__ret), "+m" (*mem)			      \
>+		     : "q" ((int64_t) cast_to_integer (newval)),	      \
>+		       "0" ((int64_t) cast_to_integer (oldval))		      \
>+		     : "memory");					      \
>+     else								      \
>+       __atomic_link_error ();						      \
>+     __ret; })
>
>-#define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
>+#define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval)	      \
>   ({ __typeof (*mem) ret;						      \
>-     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"			      \
>-		       "je 0f\n\t"					      \
>-		       "lock\n"						      \
>-		       "0:\tcmpxchgb %b2, %1"				      \
>-		       : "=a" (ret), "=m" (*mem)			      \
>-		       : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
>-			 "i" (offsetof (tcbhead_t, multiple_threads)));	      \
>+     if (SINGLE_THREAD_P)						      \
>+       ret = __cmpxchg_op ("", (mem), (newval), (oldval));		      \
>+     else								      \
>+       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));	      \
>      ret; })
>
>-#define __arch_c_compare_and_exchange_val_16_acq(mem, newval, oldval) \
>+#define __arch_c_compare_and_exchange_val_16_acq(mem, newval, oldval)	      \
>   ({ __typeof (*mem) ret;						      \
>-     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"			      \
>-		       "je 0f\n\t"					      \
>-		       "lock\n"						      \
>-		       "0:\tcmpxchgw %w2, %1"				      \
>-		       : "=a" (ret), "=m" (*mem)			      \
>-		       : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
>-			 "i" (offsetof (tcbhead_t, multiple_threads)));	      \
>+     if (SINGLE_THREAD_P)						      \
>+       ret = __cmpxchg_op ("", (mem), (newval), (oldval));		      \
>+     else								      \
>+       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));	      \
>      ret; })
>
>-#define __arch_c_compare_and_exchange_val_32_acq(mem, newval, oldval) \
>+#define __arch_c_compare_and_exchange_val_32_acq(mem, newval, oldval)	      \
>   ({ __typeof (*mem) ret;						      \
>-     __asm __volatile ("cmpl $0, %%" SEG_REG ":%P5\n\t"			      \
>-		       "je 0f\n\t"					      \
>-		       "lock\n"						      \
>-		       "0:\tcmpxchgl %2, %1"				      \
>-		       : "=a" (ret), "=m" (*mem)			      \
>-		       : BR_CONSTRAINT (newval), "m" (*mem), "0" (oldval),    \
>-			 "i" (offsetof (tcbhead_t, multiple_threads)));       \
>+     if (SINGLE_THREAD_P)						      \
>+       ret = __cmpxchg_op ("", (mem), (newval), (oldval));		      \
>+     else								      \
>+       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));	      \
>      ret; })
>
>-#ifdef __x86_64__
>-# define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval) \
>+#define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval)	      \
>   ({ __typeof (*mem) ret;						      \
>-     __asm __volatile ("cmpl $0, %%fs:%P5\n\t"				      \
>-		       "je 0f\n\t"					      \
>-		       "lock\n"						      \
>-		       "0:\tcmpxchgq %q2, %1"				      \
>-		       : "=a" (ret), "=m" (*mem)			      \
>-		       : "q" ((int64_t) cast_to_integer (newval)),	      \
>-			 "m" (*mem),					      \
>-			 "0" ((int64_t) cast_to_integer (oldval)),	      \
>-			 "i" (offsetof (tcbhead_t, multiple_threads)));	      \
>-     ret; })
>-# define do_exchange_and_add_val_64_acq(pfx, mem, value) 0
>-# define do_add_val_64_acq(pfx, mem, value) do { } while (0)
>-#else
>-/* XXX We do not really need 64-bit compare-and-exchange.  At least
>-   not in the moment.  Using it would mean causing portability
>-   problems since not many other 32-bit architectures have support for
>-   such an operation.  So don't define any code for now.  If it is
>-   really going to be used the code below can be used on Intel Pentium
>-   and later, but NOT on i486.  */
>-# define __arch_c_compare_and_exchange_val_64_acq(mem, newval, oldval) \
>-  ({ __typeof (*mem) ret = *(mem);					      \
>-     __atomic_link_error ();						      \
>-     ret = (newval);							      \
>-     ret = (oldval);							      \
>-     ret; })
>-
>-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval)	      \
>-  ({ __typeof (*mem) ret = *(mem);					      \
>-     __atomic_link_error ();						      \
>-     ret = (newval);							      \
>-     ret = (oldval);							      \
>-     ret; })
>-
>-# define do_exchange_and_add_val_64_acq(pfx, mem, value) \
>-  ({ __typeof (value) __addval = (value);				      \
>-     __typeof (*mem) __result;						      \
>-     __typeof (mem) __memp = (mem);					      \
>-     __typeof (*mem) __tmpval;						      \
>-     __result = *__memp;						      \
>-     do									      \
>-       __tmpval = __result;						      \
>-     while ((__result = pfx##_compare_and_exchange_val_64_acq		      \
>-	     (__memp, __result + __addval, __result)) == __tmpval);	      \
>-     __result; })
>-
>-# define do_add_val_64_acq(pfx, mem, value) \
>-  {									      \
>-    __typeof (value) __addval = (value);				      \
>-    __typeof (mem) __memp = (mem);					      \
>-    __typeof (*mem) __oldval = *__memp;					      \
>-    __typeof (*mem) __tmpval;						      \
>-    do									      \
>-      __tmpval = __oldval;						      \
>-    while ((__oldval = pfx##_compare_and_exchange_val_64_acq		      \
>-	    (__memp, __oldval + __addval, __oldval)) == __tmpval);	      \
>-  }
>-#endif
>-
>-
>-/* Note that we need no lock prefix.  */
>-#define atomic_exchange_acq(mem, newvalue) \
>-  ({ __typeof (*mem) result;						      \
>-     if (sizeof (*mem) == 1)						      \
>-       __asm __volatile ("xchgb %b0, %1"				      \
>-			 : "=q" (result), "=m" (*mem)			      \
>-			 : "0" (newvalue), "m" (*mem));			      \
>-     else if (sizeof (*mem) == 2)					      \
>-       __asm __volatile ("xchgw %w0, %1"				      \
>-			 : "=r" (result), "=m" (*mem)			      \
>-			 : "0" (newvalue), "m" (*mem));			      \
>-     else if (sizeof (*mem) == 4)					      \
>-       __asm __volatile ("xchgl %0, %1"					      \
>-			 : "=r" (result), "=m" (*mem)			      \
>-			 : "0" (newvalue), "m" (*mem));			      \
>-     else if (__HAVE_64B_ATOMICS)					      \
>-       __asm __volatile ("xchgq %q0, %1"				      \
>-			 : "=r" (result), "=m" (*mem)			      \
>-			 : "0" ((int64_t) cast_to_integer (newvalue)),        \
>-			   "m" (*mem));					      \
>-     else								      \
>-       {								      \
>-	 result = 0;							      \
>-	 __atomic_link_error ();					      \
>-       }								      \
>-     result; })
>-
>-
>-#define __arch_exchange_and_add_body(lock, pfx, mem, value) \
>-  ({ __typeof (*mem) __result;						      \
>-     __typeof (value) __addval = (value);				      \
>-     if (sizeof (*mem) == 1)						      \
>-       __asm __volatile (lock "xaddb %b0, %1"				      \
>-			 : "=q" (__result), "=m" (*mem)			      \
>-			 : "0" (__addval), "m" (*mem),			      \
>-			   "i" (offsetof (tcbhead_t, multiple_threads)));     \
>-     else if (sizeof (*mem) == 2)					      \
>-       __asm __volatile (lock "xaddw %w0, %1"				      \
>-			 : "=r" (__result), "=m" (*mem)			      \
>-			 : "0" (__addval), "m" (*mem),			      \
>-			   "i" (offsetof (tcbhead_t, multiple_threads)));     \
>-     else if (sizeof (*mem) == 4)					      \
>-       __asm __volatile (lock "xaddl %0, %1"				      \
>-			 : "=r" (__result), "=m" (*mem)			      \
>-			 : "0" (__addval), "m" (*mem),			      \
>-			   "i" (offsetof (tcbhead_t, multiple_threads)));     \
>-     else if (__HAVE_64B_ATOMICS)					      \
>-       __asm __volatile (lock "xaddq %q0, %1"				      \
>-			 : "=r" (__result), "=m" (*mem)			      \
>-			 : "0" ((int64_t) cast_to_integer (__addval)),     \
>-			   "m" (*mem),					      \
>-			   "i" (offsetof (tcbhead_t, multiple_threads)));     \
>+     if (SINGLE_THREAD_P)						      \
>+       __cmpxchg_op ("", (mem), (newval), (oldval));			      \
>      else								      \
>-       __result = do_exchange_and_add_val_64_acq (pfx, (mem), __addval);      \
>-     __result; })
>-
>-#define atomic_exchange_and_add(mem, value) \
>-  __sync_fetch_and_add (mem, value)
>-
>-#define __arch_exchange_and_add_cprefix \
>-  "cmpl $0, %%" SEG_REG ":%P4\n\tje 0f\n\tlock\n0:\t"
>-
>-#define catomic_exchange_and_add(mem, value) \
>-  __arch_exchange_and_add_body (__arch_exchange_and_add_cprefix, __arch_c,    \
>-				mem, value)
>-
>-
>-#define __arch_add_body(lock, pfx, apfx, mem, value) \
>-  do {									      \
>-    if (__builtin_constant_p (value) && (value) == 1)			      \
>-      pfx##_increment (mem);						      \
>-    else if (__builtin_constant_p (value) && (value) == -1)		      \
>-      pfx##_decrement (mem);						      \
>-    else if (sizeof (*mem) == 1)					      \
>-      __asm __volatile (lock "addb %b1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: IBR_CONSTRAINT (value), "m" (*mem),		      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 2)					      \
>-      __asm __volatile (lock "addw %w1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" (value), "m" (*mem),			      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 4)					      \
>-      __asm __volatile (lock "addl %1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" (value), "m" (*mem),			      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (__HAVE_64B_ATOMICS)					      \
>-      __asm __volatile (lock "addq %q1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" ((int64_t) cast_to_integer (value)),	      \
>-			  "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else								      \
>-      do_add_val_64_acq (apfx, (mem), (value));				      \
>-  } while (0)
>-
>-# define atomic_add(mem, value) \
>-  __arch_add_body (LOCK_PREFIX, atomic, __arch, mem, value)
>-
>-#define __arch_add_cprefix \
>-  "cmpl $0, %%" SEG_REG ":%P3\n\tje 0f\n\tlock\n0:\t"
>-
>-#define catomic_add(mem, value) \
>-  __arch_add_body (__arch_add_cprefix, atomic, __arch_c, mem, value)
>+       ret = __cmpxchg_op (LOCK_PREFIX, (mem), (newval), (oldval));	      \
>+     ret; })
>
>
>-#define atomic_add_negative(mem, value) \
>-  ({ unsigned char __result;						      \
>+#define __xchg_op(lock, mem, arg, op)					      \
>+  ({ __typeof (*mem) __ret = (arg);					      \
>      if (sizeof (*mem) == 1)						      \
>-       __asm __volatile (LOCK_PREFIX "addb %b2, %0; sets %1"		      \
>-			 : "=m" (*mem), "=qm" (__result)		      \
>-			 : IBR_CONSTRAINT (value), "m" (*mem));		      \
>+       __asm __volatile (lock #op "b %b0, %1"				      \
>+			 : "=q" (__ret), "=m" (*mem)			      \
>+			 : "0" (arg), "m" (*mem)			      \
>+			 : "memory", "cc");				      \
>      else if (sizeof (*mem) == 2)					      \
>-       __asm __volatile (LOCK_PREFIX "addw %w2, %0; sets %1"		      \
>-			 : "=m" (*mem), "=qm" (__result)		      \
>-			 : "ir" (value), "m" (*mem));			      \
>+       __asm __volatile (lock #op "w %w0, %1"				      \
>+			 : "=r" (__ret), "=m" (*mem)			      \
>+			 : "0" (arg), "m" (*mem)			      \
>+			 : "memory", "cc");				      \
>      else if (sizeof (*mem) == 4)					      \
>-       __asm __volatile (LOCK_PREFIX "addl %2, %0; sets %1"		      \
>-			 : "=m" (*mem), "=qm" (__result)		      \
>-			 : "ir" (value), "m" (*mem));			      \
>+       __asm __volatile (lock #op "l %0, %1"				      \
>+			 : "=r" (__ret), "=m" (*mem)			      \
>+			 : "0" (arg), "m" (*mem)			      \
>+			 : "memory", "cc");				      \
>      else if (__HAVE_64B_ATOMICS)					      \
>-       __asm __volatile (LOCK_PREFIX "addq %q2, %0; sets %1"		      \
>-			 : "=m" (*mem), "=qm" (__result)		      \
>-			 : "ir" ((int64_t) cast_to_integer (value)),	      \
>-			   "m" (*mem));					      \
>+       __asm __volatile (lock #op "q %q0, %1"				      \
>+			 : "=r" (__ret), "=m" (*mem)			      \
>+			 : "0" ((int64_t) cast_to_integer (arg)),	      \
>+			   "m" (*mem)					      \
>+			 : "memory", "cc");				      \
>      else								      \
>        __atomic_link_error ();						      \
>-     __result; })
>-
>+     __ret; })
>
>-#define atomic_add_zero(mem, value) \
>-  ({ unsigned char __result;						      \
>+#define __single_op(lock, mem, op)					      \
>+  ({									      \
>      if (sizeof (*mem) == 1)						      \
>-       __asm __volatile (LOCK_PREFIX "addb %b2, %0; setz %1"		      \
>-			 : "=m" (*mem), "=qm" (__result)		      \
>-			 : IBR_CONSTRAINT (value), "m" (*mem));		      \
>+       __asm __volatile (lock #op "b %b0"				      \
>+			 : "=m" (*mem)					      \
>+			 : "m" (*mem)					      \
>+			 : "memory", "cc");				      \
>      else if (sizeof (*mem) == 2)					      \
>-       __asm __volatile (LOCK_PREFIX "addw %w2, %0; setz %1"		      \
>-			 : "=m" (*mem), "=qm" (__result)		      \
>-			 : "ir" (value), "m" (*mem));			      \
>+       __asm __volatile (lock #op "w %b0"				      \
>+			 : "=m" (*mem)					      \
>+			 : "m" (*mem)					      \
>+			 : "memory", "cc");				      \
>      else if (sizeof (*mem) == 4)					      \
>-       __asm __volatile (LOCK_PREFIX "addl %2, %0; setz %1"		      \
>-			 : "=m" (*mem), "=qm" (__result)		      \
>-			 : "ir" (value), "m" (*mem));			      \
>+       __asm __volatile (lock #op "l %b0"				      \
>+			 : "=m" (*mem)					      \
>+			 : "m" (*mem)					      \
>+			 : "memory", "cc");				      \
>      else if (__HAVE_64B_ATOMICS)					      \
>-       __asm __volatile (LOCK_PREFIX "addq %q2, %0; setz %1"		      \
>-			 : "=m" (*mem), "=qm" (__result)		      \
>-			 : "ir" ((int64_t) cast_to_integer (value)),	      \
>-			   "m" (*mem));					      \
>+       __asm __volatile (lock #op "q %b0"				      \
>+			 : "=m" (*mem)					      \
>+			 : "m" (*mem)					      \
>+			 : "memory", "cc");				      \
>      else								      \
>-       __atomic_link_error ();					      \
>-     __result; })
>+       __atomic_link_error ();						      \
>+  })
>
>+/* Note that we need no lock prefix.  */
>+#define atomic_exchange_acq(mem, newvalue)				      \
>+  __xchg_op ("", (mem), (newvalue), xchg)
>
>-#define __arch_increment_body(lock, pfx, mem) \
>-  do {									      \
>-    if (sizeof (*mem) == 1)						      \
>-      __asm __volatile (lock "incb %b0"					      \
>-			: "=m" (*mem)					      \
>-			: "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 2)					      \
>-      __asm __volatile (lock "incw %w0"					      \
>-			: "=m" (*mem)					      \
>-			: "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 4)					      \
>-      __asm __volatile (lock "incl %0"					      \
>-			: "=m" (*mem)					      \
>-			: "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (__HAVE_64B_ATOMICS)					      \
>-      __asm __volatile (lock "incq %q0"					      \
>-			: "=m" (*mem)					      \
>-			: "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else								      \
>-      do_add_val_64_acq (pfx, mem, 1);					      \
>-  } while (0)
>+#define atomic_add(mem, value) \
>+  __xchg_op (LOCK_PREFIX, (mem), (value), add)
>
>-#define atomic_increment(mem) __arch_increment_body (LOCK_PREFIX, __arch, mem)
>+#define catomic_add(mem, value)						      \
>+  ({									      \
>+    if (SINGLE_THREAD_P)						      \
>+      __xchg_op ("", (mem), (value), add);				      \
>+   else									      \
>+     atomic_add (mem, value);						      \
>+  })
>
>-#define __arch_increment_cprefix \
>-  "cmpl $0, %%" SEG_REG ":%P2\n\tje 0f\n\tlock\n0:\t"
>
>-#define catomic_increment(mem) \
>-  __arch_increment_body (__arch_increment_cprefix, __arch_c, mem)
>+#define atomic_increment(mem) \
>+  __single_op (LOCK_PREFIX, (mem), inc)
>
>+#define catomic_increment(mem)						      \
>+  ({									      \
>+    if (SINGLE_THREAD_P)						      \
>+      __single_op ("", (mem), inc);					      \
>+   else									      \
>+     atomic_increment (mem);						      \
>+  })
>
> #define atomic_increment_and_test(mem) \
>   ({ unsigned char __result;						      \
>@@ -357,43 +210,20 @@
> 			 : "=m" (*mem), "=qm" (__result)		      \
> 			 : "m" (*mem));					      \
>      else								      \
>-       __atomic_link_error ();					      \
>+       __atomic_link_error ();						      \
>      __result; })
>
>
>-#define __arch_decrement_body(lock, pfx, mem) \
>-  do {									      \
>-    if (sizeof (*mem) == 1)						      \
>-      __asm __volatile (lock "decb %b0"					      \
>-			: "=m" (*mem)					      \
>-			: "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 2)					      \
>-      __asm __volatile (lock "decw %w0"					      \
>-			: "=m" (*mem)					      \
>-			: "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 4)					      \
>-      __asm __volatile (lock "decl %0"					      \
>-			: "=m" (*mem)					      \
>-			: "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (__HAVE_64B_ATOMICS)					      \
>-      __asm __volatile (lock "decq %q0"					      \
>-			: "=m" (*mem)					      \
>-			: "m" (*mem),					      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else								      \
>-      do_add_val_64_acq (pfx, mem, -1);					      \
>-  } while (0)
>-
>-#define atomic_decrement(mem) __arch_decrement_body (LOCK_PREFIX, __arch, mem)
>+#define atomic_decrement(mem)						      \
>+  __single_op (LOCK_PREFIX, (mem), dec)
>
>-#define __arch_decrement_cprefix \
>-  "cmpl $0, %%" SEG_REG ":%P2\n\tje 0f\n\tlock\n0:\t"
>-
>-#define catomic_decrement(mem) \
>-  __arch_decrement_body (__arch_decrement_cprefix, __arch_c, mem)
>+#define catomic_decrement(mem)						      \
>+  ({									      \
>+    if (SINGLE_THREAD_P)						      \
>+      __single_op ("", (mem), dec);					      \
>+    else								      \
>+      atomic_decrement (mem);						      \
>+  })
>
>
> #define atomic_decrement_and_test(mem) \
>@@ -463,73 +293,31 @@
> 			 : "=q" (__result), "=m" (*mem)			      \
> 			 : "m" (*mem), "ir" (bit));			      \
>      else							      	      \
>-       __atomic_link_error ();					      \
>+       __atomic_link_error ();						      \
>      __result; })
>
>
>-#define __arch_and_body(lock, mem, mask) \
>-  do {									      \
>-    if (sizeof (*mem) == 1)						      \
>-      __asm __volatile (lock "andb %b1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: IBR_CONSTRAINT (mask), "m" (*mem),		      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 2)					      \
>-      __asm __volatile (lock "andw %w1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" (mask), "m" (*mem),			      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 4)					      \
>-      __asm __volatile (lock "andl %1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" (mask), "m" (*mem),			      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (__HAVE_64B_ATOMICS)					      \
>-      __asm __volatile (lock "andq %q1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" (mask), "m" (*mem),			      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else								      \
>-      __atomic_link_error ();						      \
>-  } while (0)
>-
>-#define __arch_cprefix \
>-  "cmpl $0, %%" SEG_REG ":%P3\n\tje 0f\n\tlock\n0:\t"
>-
>-#define atomic_and(mem, mask) __arch_and_body (LOCK_PREFIX, mem, mask)
>-
>-#define catomic_and(mem, mask) __arch_and_body (__arch_cprefix, mem, mask)
>+#define atomic_and(mem, mask)						      \
>+  __xchg_op (LOCK_PREFIX, (mem), (mask), and)
>
>+#define catomic_and(mem, mask) \
>+  ({									      \
>+    if (SINGLE_THREAD_P)						      \
>+      __xchg_op ("", (mem), (mask), and);				      \
>+    else								      \
>+      atomic_and (mem, mask);						      \
>+  })
>
>-#define __arch_or_body(lock, mem, mask) \
>-  do {									      \
>-    if (sizeof (*mem) == 1)						      \
>-      __asm __volatile (lock "orb %b1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: IBR_CONSTRAINT (mask), "m" (*mem),		      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 2)					      \
>-      __asm __volatile (lock "orw %w1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" (mask), "m" (*mem),			      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (sizeof (*mem) == 4)					      \
>-      __asm __volatile (lock "orl %1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" (mask), "m" (*mem),			      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else if (__HAVE_64B_ATOMICS)					      \
>-      __asm __volatile (lock "orq %q1, %0"				      \
>-			: "=m" (*mem)					      \
>-			: "ir" (mask), "m" (*mem),			      \
>-			  "i" (offsetof (tcbhead_t, multiple_threads)));      \
>-    else								      \
>-      __atomic_link_error ();						      \
>-  } while (0)
>-
>-#define atomic_or(mem, mask) __arch_or_body (LOCK_PREFIX, mem, mask)
>+#define atomic_or(mem, mask)						      \
>+  __xchg_op (LOCK_PREFIX, (mem), (mask), or)
>
>-#define catomic_or(mem, mask) __arch_or_body (__arch_cprefix, mem, mask)
>+#define catomic_or(mem, mask) \
>+  ({									      \
>+    if (SINGLE_THREAD_P)						      \
>+      __xchg_op ("", (mem), (mask), or);				      \
>+    else								      \
>+      atomic_or (mem, mask);						      \
>+  })
>
> /* We don't use mfence because it is supposedly slower due to having to
>    provide stronger guarantees (e.g., regarding self-modifying code).  */
>diff --git a/sysdeps/x86_64/nptl/tcb-offsets.sym b/sysdeps/x86_64/nptl/tcb-offsets.sym
>index 2bbd563a6c..8ec55a7ea8 100644
>--- a/sysdeps/x86_64/nptl/tcb-offsets.sym
>+++ b/sysdeps/x86_64/nptl/tcb-offsets.sym
>@@ -9,7 +9,6 @@ CLEANUP_JMP_BUF		offsetof (struct pthread, cleanup_jmp_buf)
> CLEANUP			offsetof (struct pthread, cleanup)
> CLEANUP_PREV		offsetof (struct _pthread_cleanup_buffer, __prev)
> MUTEX_FUTEX		offsetof (pthread_mutex_t, __data.__lock)
>-MULTIPLE_THREADS_OFFSET	offsetof (tcbhead_t, multiple_threads)
> POINTER_GUARD		offsetof (tcbhead_t, pointer_guard)
> FEATURE_1_OFFSET	offsetof (tcbhead_t, feature_1)
> SSP_BASE_OFFSET		offsetof (tcbhead_t, ssp_base)
>-- 
>2.34.1
>
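
As a usage sketch (not part of the patch), the rewritten catomic_* macros are
meant to be drop-in for callers such as reference counting, eliding the lock
prefix while the process is known to be single-threaded; the names below are
hypothetical:

static unsigned int refcount;

static void
take_ref (void)
{
  /* Uses a plain inc while SINGLE_THREAD_P, a lock-prefixed inc otherwise.  */
  catomic_increment (&refcount);
}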

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-15 21:07       ` Adhemerval Zanella
@ 2022-06-16 12:48         ` Wilco Dijkstra
  2022-06-16 17:23           ` Adhemerval Zanella
  0 siblings, 1 reply; 21+ messages in thread
From: Wilco Dijkstra @ 2022-06-16 12:48 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Noah Goldstein, GNU C Library

Hi Adhemerval,

>> For use in acquire/release atomics, it is required since code hoisting and other
>> optimizations must be prevented. So the old implementation was buggy, and this
>> is why we need to remove these target specific hacks.
>
> Yes, I noted this while checking out the Linux kernel implementation. Although
> I am not sure if it really matters, since we already have a volatile asm that
> should prevent code hoisting.

Yes it really does matter - volatile asm does not block any optimizations across it.
This example shows how it fails:

int x, y;
int g(void)
{
  y = 3;
  //__atomic_fetch_add (&x, 1, __ATOMIC_ACQUIRE);
  asm volatile ("lock add %1, 1" : "+m" (x) ::  );
  return x + y;
}

The value of y propagates across the acquire without reloading it:

        mov     DWORD PTR y[rip], 3
        lock add DWORD PTR x[rip], 1
        mov     eax, DWORD PTR x[rip]
        add     eax, 3      // bug - no reload of y!!!
        ret

With the atomic builtin or a "memory" clobber we get the correct code:

        mov     DWORD PTR y[rip], 3
        lock add        DWORD PTR x[rip], 1
        mov     eax, DWORD PTR y[rip]
        add     eax, DWORD PTR x[rip]
        ret
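
For reference, a minimal sketch of the fixed statement, assuming the Intel asm
dialect used in the listings above; the "memory" clobber is what forces y to
be reloaded:

int x2, y2;
int g2 (void)
{
  y2 = 3;
  /* The "memory" clobber acts as a compiler barrier, like the builtin.  */
  asm volatile ("lock add %0, 1" : "+m" (x2) : : "memory");
  return x2 + y2;
}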

>> Any single-threaded optimizations should be done on a much higher level and only where
>> there is a clear performance gain. So we should get rid of all the atomic-machine headers.
>
> Completely agree, I have started to clean this up by first moving some architectures
> to use compiler builtins [1]. I will check which architectures still don’t use compiler
> builtins and see if we can move them.

Yes, I think it should be possible to move everything to use USE_ATOMIC_COMPILER_BUILTINS.
However, targets that already use it still have a significant number of atomic macros.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB
  2022-06-16 12:48         ` Wilco Dijkstra
@ 2022-06-16 17:23           ` Adhemerval Zanella
  0 siblings, 0 replies; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-16 17:23 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Noah Goldstein, GNU C Library



> On 16 Jun 2022, at 05:48, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> 
> Hi Adhemerval,
> 
>>> For use in acquire/release atomics, it is required since code hoisting and other
>>> optimizations must be prevented. So the old implementation was buggy, and this
>>> is why we need to remove these target specific hacks.
>> 
>> Yes, I noted this while checking out the Linux kernel implementation. Although
>> I am not sure if it really matters, since we already have a volatile asm that
>> should prevent code hoisting.
> 
> Yes it really does matter - volatile asm does not block any optimizations across it.
> This example shows how it fails:
> 
> int x, y;
> int g(void)
> {
>   y = 3;
>   //__atomic_fetch_add (&x, 1, __ATOMIC_ACQUIRE);
>   asm volatile ("lock add %1, 1" : "+m" (x) ::  );
>   return x + y;
> }
> 
> The value of y propagates across the acquire without reloading it:
> 
>        mov     DWORD PTR y[rip], 3
>        lock add DWORD PTR x[rip], 1
>        mov     eax, DWORD PTR x[rip]
>        add     eax, 3      // bug - no reload of y!!!
>        ret
> 
> With the atomic builtin or a "memory" clobber we get the correct code:
> 
>        mov     DWORD PTR y[rip], 3
>        lock add        DWORD PTR x[rip], 1
>        mov     eax, DWORD PTR y[rip]
>        add     eax, DWORD PTR x[rip]
>        ret

Interesting, and a good argument to move away from reimplementing atomic
operations now that we have proper compiler support (especially now that we
don’t support tricky ABIs like sparcv7).

> 
>>> Any single-threaded optimizations should be done on a much higher level and only where
>>> there is a clear performance gain. So we should get rid of all the atomic-machine headers.
>> 
>> Completely agree, I have started to clean this up by first moving some architectures
>> to use compiler builtins [1]. I will check which architectures still don’t use compiler
>> builtins and see if we can move them.
> 
> Yes, I think it should be possible to move everything to use USE_ATOMIC_COMPILER_BUILTINS.
> However, targets that already use it still have a significant number of atomic macros.

The next step will be to consolidate all the atomic macros in the generic atomic.h.
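
A minimal sketch of what such a consolidated generic macro could look like,
assuming USE_ATOMIC_COMPILER_BUILTINS (illustrative only, not the committed
code):

/* Map the glibc macro directly to the compiler builtin.  */
#define atomic_fetch_add_acquire(mem, operand) \
  __atomic_fetch_add ((mem), (operand), __ATOMIC_ACQUIRE)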


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded
  2022-06-16  7:15   ` Fangrui Song
@ 2022-06-16 22:06     ` Adhemerval Zanella
  2022-06-16 22:30       ` Fangrui Song
  0 siblings, 1 reply; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-16 22:06 UTC (permalink / raw)
  To: Fangrui Song; +Cc: GNU C Library, Wilco Dijkstra



> On 16 Jun 2022, at 00:15, Fangrui Song <maskray@google.com> wrote:
> 
> On 2022-06-10, Adhemerval Zanella via Libc-alpha wrote:
>> By adding an internal hidden_def alias to avoid the GOT indirection.
>> On some architectures, __libc_single_threaded may be accessed through
>> copy relocations, and thus both copies must be updated.
>> 
>> To obtain the correct address of __libc_single_threaded,
>> __libc_dlsym is extended to support RTLD_DEFAULT. It searches
>> through all scopes instead of only the default local one.
>> 
>> Checked on x86_64-linux-gnu and i686-linux-gnu.
>> ---
>> elf/dl-libc.c | 20 ++++++++++++++++++--
>> elf/libc_early_init.c | 9 +++++++++
>> include/sys/single_threaded.h | 11 +++++++++++
>> misc/single_threaded.c | 2 ++
>> nptl/pthread_create.c | 6 +++++-
>> 5 files changed, 45 insertions(+), 3 deletions(-)
>> 
>> diff --git a/elf/dl-libc.c b/elf/dl-libc.c
>> index 266e068da6..e64f4b9910 100644
>> --- a/elf/dl-libc.c
>> +++ b/elf/dl-libc.c
>> @@ -16,6 +16,7 @@
>> License along with the GNU C Library; if not, see
>> <https://www.gnu.org/licenses/>. */
>> 
>> +#include <assert.h>
>> #include <dlfcn.h>
>> #include <stdlib.h>
>> #include <ldsodefs.h>
>> @@ -72,6 +73,7 @@ struct do_dlsym_args
>> /* Arguments to do_dlsym. */
>> struct link_map *map;
>> const char *name;
>> + const void *caller_dlsym;
>> 
>> /* Return values of do_dlsym. */
>> lookup_t loadbase;
>> @@ -102,8 +104,21 @@ do_dlsym (void *ptr)
>> {
>> struct do_dlsym_args *args = (struct do_dlsym_args *) ptr;
>> args->ref = NULL;
>> - args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, args->map, &args->ref,
>> -					 args->map->l_local_scope, NULL, 0,
>> + struct link_map *match = args->map;
>> + struct r_scope_elem **scope;
>> + if (args->map == RTLD_DEFAULT)
>> + {
>> + ElfW(Addr) caller = (ElfW(Addr)) args->caller_dlsym;
>> + match = _dl_find_dso_for_object (caller);
>> + /* It is only used internally, so the caller should always be recognized. */
>> + assert (match != NULL);
>> + scope = match->l_scope;
>> + }
>> + else
>> + scope = args->map->l_local_scope;
>> +
>> + args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, match, &args->ref,
>> +					 scope, NULL, 0,
>> 					 DL_LOOKUP_RETURN_NEWEST, NULL);
>> }
>> 
>> @@ -182,6 +197,7 @@ __libc_dlsym (void *map, const char *name)
>> struct do_dlsym_args args;
>> args.map = map;
>> args.name = name;
>> + args.caller_dlsym = RETURN_ADDRESS (0);
>> 
>> #ifdef SHARED
>> if (GLRO (dl_dlfcn_hook) != NULL)
>> diff --git a/elf/libc_early_init.c b/elf/libc_early_init.c
>> index 3c4a19cf6b..7cc2997122 100644
>> --- a/elf/libc_early_init.c
>> +++ b/elf/libc_early_init.c
>> @@ -16,7 +16,9 @@
>> License along with the GNU C Library; if not, see
>> <https://www.gnu.org/licenses/>. */
>> 
>> +#include <assert.h>
>> #include <ctype.h>
>> +#include <dlfcn.h>
>> #include <elision-conf.h>
>> #include <libc-early-init.h>
>> #include <libc-internal.h>
>> @@ -38,6 +40,13 @@ __libc_early_init (_Bool initial)
>> __libc_single_threaded = initial;
>> 
>> #ifdef SHARED
>> + /* __libc_single_threaded can be accessed through copy relocations, so
>> + the external copy must be updated as well. */
>> + __libc_external_single_threaded = __libc_dlsym (RTLD_DEFAULT,
>> +						 "__libc_single_threaded");
>> + assert (__libc_external_single_threaded != NULL);
>> + *__libc_external_single_threaded = initial;
>> +
>> __libc_initial = initial;
>> #endif
> 
> I think this whole scheme can be greatly simplified.
> 
> Under the hood,
> 
> extern __typeof (__libc_single_threaded) __EI___libc_single_threaded __asm__("" "__libc_single_threaded"); extern __typeof (__libc_single_threaded) __EI___libc_single_threaded __attribute__((alias ("" "__GI___libc_single_threaded"))) __attribute__ ((__copy__ (__libc_single_threaded))); 
> * __libc_single_threaded is the STV_HIDDEN symbol with the asm name "__GI___libc_single_threaded"
> * __EI___libc_single_threaded is the STV_DEFAULT symbol with the asm name "__libc_single_threaded". It aliases __libc_single_threaded.

In fact libc_hidden_proto for SHARED will result in:

extern __typeof (__libc_single_threaded) __libc_single_threaded __asm__ ("" "__GI___libc_single_threaded") __attribute__ ((visibility ("hidden")));

So using __libc_single_threaded will be the simplest way.

> 
> We can just access __EI___libc_single_threaded which will lead to a GOT
> indirection (R_*_GLOB_DAT). This can avoid a __libc_dlsym call.

The idea is exactly to avoid the GOT indirection.
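
A minimal sketch of the difference, with hypothetical declarations (the real
ones come from the libc_hidden_proto/libc_hidden_data_def expansion):

/* Exported symbol: in -fPIC code the access goes through the GOT
   (a GLOB_DAT relocation).  */
extern char __libc_single_threaded;

/* Hidden alias: the access compiles to a direct PC-relative load.  */
extern char __GI___libc_single_threaded
  __attribute__ ((visibility ("hidden")));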

> 
> I can see that accessing the __EI declaration is currently inconvenient
> because include/libc-symbols.h does not seem to provide a convenient
> macro, but that can be added.
> 
>> diff --git a/include/sys/single_threaded.h b/include/sys/single_threaded.h
>> index 18f6972482..258b01e0b2 100644
>> --- a/include/sys/single_threaded.h
>> +++ b/include/sys/single_threaded.h
>> @@ -1 +1,12 @@
>> #include <misc/sys/single_threaded.h>
>> +
>> +#ifndef _ISOMAC
>> +
>> +libc_hidden_proto (__libc_single_threaded);
>> +
>> +# ifdef SHARED
>> +extern __typeof (__libc_single_threaded) *__libc_external_single_threaded
>> + attribute_hidden;
>> +# endif
>> +
>> +#endif
>> diff --git a/misc/single_threaded.c b/misc/single_threaded.c
>> index 96ada9137b..201d86a273 100644
>> --- a/misc/single_threaded.c
>> +++ b/misc/single_threaded.c
>> @@ -22,6 +22,8 @@
>> __libc_early_init (as false for inner libcs). */
>> #ifdef SHARED
>> char __libc_single_threaded;
>> +__typeof (__libc_single_threaded) *__libc_external_single_threaded;
>> #else
>> char __libc_single_threaded = 1;
>> #endif
>> +libc_hidden_data_def (__libc_single_threaded)
>> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
>> index e7a099acb7..5633d01c62 100644
>> --- a/nptl/pthread_create.c
>> +++ b/nptl/pthread_create.c
>> @@ -627,7 +627,11 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
>> if (__libc_single_threaded)
>> {
>> late_init ();
>> - __libc_single_threaded = 0;
>> + __libc_single_threaded =
>> +#ifdef SHARED
>> + *__libc_external_single_threaded =
>> +#endif
>> +	0;
>> }
>> 
>> const struct pthread_attr *iattr = (struct pthread_attr *) attr;
>> -- 
>> 2.34.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded
  2022-06-16 22:06     ` Adhemerval Zanella
@ 2022-06-16 22:30       ` Fangrui Song
  2022-06-17  0:35         ` Adhemerval Zanella
  0 siblings, 1 reply; 21+ messages in thread
From: Fangrui Song @ 2022-06-16 22:30 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library, Wilco Dijkstra

On 2022-06-16, Adhemerval Zanella wrote:
>
>
>> On 16 Jun 2022, at 00:15, Fangrui Song <maskray@google.com> wrote:
>>
>> On 2022-06-10, Adhemerval Zanella via Libc-alpha wrote:
>>> By adding an internal hidden_def alias to avoid the GOT indirection.
>>> On some architectures, __libc_single_threaded may be accessed through
>>> copy relocations, and thus both copies must be updated.
>>>
>>> To obtain the correct address of __libc_single_threaded,
>>> __libc_dlsym is extended to support RTLD_DEFAULT. It searches
>>> through all scopes instead of only the default local one.
>>>
>>> Checked on x86_64-linux-gnu and i686-linux-gnu.
>>> ---
>>> elf/dl-libc.c | 20 ++++++++++++++++++--
>>> elf/libc_early_init.c | 9 +++++++++
>>> include/sys/single_threaded.h | 11 +++++++++++
>>> misc/single_threaded.c | 2 ++
>>> nptl/pthread_create.c | 6 +++++-
>>> 5 files changed, 45 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/elf/dl-libc.c b/elf/dl-libc.c
>>> index 266e068da6..e64f4b9910 100644
>>> --- a/elf/dl-libc.c
>>> +++ b/elf/dl-libc.c
>>> @@ -16,6 +16,7 @@
>>> License along with the GNU C Library; if not, see
>>> <https://www.gnu.org/licenses/>. */
>>>
>>> +#include <assert.h>
>>> #include <dlfcn.h>
>>> #include <stdlib.h>
>>> #include <ldsodefs.h>
>>> @@ -72,6 +73,7 @@ struct do_dlsym_args
>>> /* Arguments to do_dlsym. */
>>> struct link_map *map;
>>> const char *name;
>>> + const void *caller_dlsym;
>>>
>>> /* Return values of do_dlsym. */
>>> lookup_t loadbase;
>>> @@ -102,8 +104,21 @@ do_dlsym (void *ptr)
>>> {
>>> struct do_dlsym_args *args = (struct do_dlsym_args *) ptr;
>>> args->ref = NULL;
>>> - args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, args->map, &args->ref,
>>> -					 args->map->l_local_scope, NULL, 0,
>>> + struct link_map *match = args->map;
>>> + struct r_scope_elem **scope;
>>> + if (args->map == RTLD_DEFAULT)
>>> + {
>>> + ElfW(Addr) caller = (ElfW(Addr)) args->caller_dlsym;
>>> + match = _dl_find_dso_for_object (caller);
>>> + /* It is only used internally, so the caller should always be recognized. */
>>> + assert (match != NULL);
>>> + scope = match->l_scope;
>>> + }
>>> + else
>>> + scope = args->map->l_local_scope;
>>> +
>>> + args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, match, &args->ref,
>>> +					 scope, NULL, 0,
>>> 					 DL_LOOKUP_RETURN_NEWEST, NULL);
>>> }
>>>
>>> @@ -182,6 +197,7 @@ __libc_dlsym (void *map, const char *name)
>>> struct do_dlsym_args args;
>>> args.map = map;
>>> args.name = name;
>>> + args.caller_dlsym = RETURN_ADDRESS (0);
>>>
>>> #ifdef SHARED
>>> if (GLRO (dl_dlfcn_hook) != NULL)
>>> diff --git a/elf/libc_early_init.c b/elf/libc_early_init.c
>>> index 3c4a19cf6b..7cc2997122 100644
>>> --- a/elf/libc_early_init.c
>>> +++ b/elf/libc_early_init.c
>>> @@ -16,7 +16,9 @@
>>> License along with the GNU C Library; if not, see
>>> <https://www.gnu.org/licenses/>. */
>>>
>>> +#include <assert.h>
>>> #include <ctype.h>
>>> +#include <dlfcn.h>
>>> #include <elision-conf.h>
>>> #include <libc-early-init.h>
>>> #include <libc-internal.h>
>>> @@ -38,6 +40,13 @@ __libc_early_init (_Bool initial)
>>> __libc_single_threaded = initial;
>>>
>>> #ifdef SHARED
>>> + /* __libc_single_threaded can be accessed through copy relocations, so
>>> + the external copy must be updated as well. */
>>> + __libc_external_single_threaded = __libc_dlsym (RTLD_DEFAULT,
>>> +						 "__libc_single_threaded");
>>> + assert (__libc_external_single_threaded != NULL);
>>> + *__libc_external_single_threaded = initial;
>>> +
>>> __libc_initial = initial;
>>> #endif
>>
>> I think this whole scheme can be greatly simplified.
>>
>> Under the hood,
>>
>> extern __typeof (__libc_single_threaded) __EI___libc_single_threaded __asm__("" "__libc_single_threaded"); extern __typeof (__libc_single_threaded) __EI___libc_single_threaded __attribute__((alias ("" "__GI___libc_single_threaded"))) __attribute__ ((__copy__ (__libc_single_threaded)));
>> * __libc_single_threaded is the STV_HIDDEN symbol with the asm name "__GI___libc_single_threaded"
>> * __EI___libc_single_threaded is the STV_DEFAULT symbol with the asm name "__libc_single_threaded". It aliases __libc_single_threaded.
>
>In fact libc_hidden_proto for SHARED will result in:
>
>extern __typeof (__libc_single_threaded) __libc_single_threaded __asm__ ("" "__GI___libc_single_threaded") __attribute__ ((visibility ("hidden")));
>
>So using __libc_single_threaded will be the simplest way.

The code tries to read the initial value of the copy relocated __libc_single_threaded.
Accessing the STV_DEFAULT symbol does the trick and is simpler than __libc_dlsym.
(Accessing the STV_HIDDEN symbol just accesses libc.so's own copy.)

Note: The copy relocated __libc_single_threaded likely has a value of zero and is read-only from the executable.
I assume that this is a conservative and safe value.
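
A minimal example of how such a copy relocation arises, assuming a non-PIE
link (e.g. gcc -fno-pie -no-pie) against libc.so:

#include <sys/single_threaded.h>

int
main (void)
{
  /* This direct reference is typically resolved with a copy relocation
     against libc.so's __libc_single_threaded.  */
  return __libc_single_threaded ? 0 : 1;
}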

>>
>> We can just access __EI___libc_single_threaded which will lead to a GOT
>> indirection (R_*_GLOB_DAT). This can avoid a __libc_dlsym call.
>
>The idea is exactly to avoid the GOT indirection.
>
>>
>> I can see that accessing the __EI declaration is currently inconvenient
>> because include/libc-symbols.h does not seem to provide a convenient
>> macro, but that can be added.
>>
>>> diff --git a/include/sys/single_threaded.h b/include/sys/single_threaded.h
>>> index 18f6972482..258b01e0b2 100644
>>> --- a/include/sys/single_threaded.h
>>> +++ b/include/sys/single_threaded.h
>>> @@ -1 +1,12 @@
>>> #include <misc/sys/single_threaded.h>
>>> +
>>> +#ifndef _ISOMAC
>>> +
>>> +libc_hidden_proto (__libc_single_threaded);
>>> +
>>> +# ifdef SHARED
>>> +extern __typeof (__libc_single_threaded) *__libc_external_single_threaded
>>> + attribute_hidden;
>>> +# endif
>>> +
>>> +#endif
>>> diff --git a/misc/single_threaded.c b/misc/single_threaded.c
>>> index 96ada9137b..201d86a273 100644
>>> --- a/misc/single_threaded.c
>>> +++ b/misc/single_threaded.c
>>> @@ -22,6 +22,8 @@
>>> __libc_early_init (as false for inner libcs). */
>>> #ifdef SHARED
>>> char __libc_single_threaded;
>>> +__typeof (__libc_single_threaded) *__libc_external_single_threaded;
>>> #else
>>> char __libc_single_threaded = 1;
>>> #endif
>>> +libc_hidden_data_def (__libc_single_threaded)
>>> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
>>> index e7a099acb7..5633d01c62 100644
>>> --- a/nptl/pthread_create.c
>>> +++ b/nptl/pthread_create.c
>>> @@ -627,7 +627,11 @@ __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
>>> if (__libc_single_threaded)
>>> {
>>> late_init ();
>>> - __libc_single_threaded = 0;
>>> + __libc_single_threaded =
>>> +#ifdef SHARED
>>> + *__libc_external_single_threaded =
>>> +#endif
>>> +	0;
>>> }
>>>
>>> const struct pthread_attr *iattr = (struct pthread_attr *) attr;
>>> --
>>> 2.34.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded
  2022-06-16 22:30       ` Fangrui Song
@ 2022-06-17  0:35         ` Adhemerval Zanella
  2022-06-17 20:35           ` Fangrui Song
  0 siblings, 1 reply; 21+ messages in thread
From: Adhemerval Zanella @ 2022-06-17  0:35 UTC (permalink / raw)
  To: Fangrui Song; +Cc: GNU C Library, Wilco Dijkstra



> On 16 Jun 2022, at 15:30, Fangrui Song <maskray@google.com> wrote:
> 
> On 2022-06-16, Adhemerval Zanella wrote:
>> 
>> 
>>> On 16 Jun 2022, at 00:15, Fangrui Song <maskray@google.com> wrote:
>>> 
>>> On 2022-06-10, Adhemerval Zanella via Libc-alpha wrote:
>>>> By adding an internal hidden_def alias to avoid the GOT indirection.
>>>> On some architectures, __libc_single_threaded may be accessed through
>>>> copy relocations, and thus both copies must be updated.
>>>>
>>>> To obtain the correct address of __libc_single_threaded,
>>>> __libc_dlsym is extended to support RTLD_DEFAULT. It searches
>>>> through all scopes instead of only the default local one.
>>>> 
>>>> Checked on x86_64-linux-gnu and i686-linux-gnu.
>>>> ---
>>>> elf/dl-libc.c | 20 ++++++++++++++++++--
>>>> elf/libc_early_init.c | 9 +++++++++
>>>> include/sys/single_threaded.h | 11 +++++++++++
>>>> misc/single_threaded.c | 2 ++
>>>> nptl/pthread_create.c | 6 +++++-
>>>> 5 files changed, 45 insertions(+), 3 deletions(-)
>>>> 
>>>> diff --git a/elf/dl-libc.c b/elf/dl-libc.c
>>>> index 266e068da6..e64f4b9910 100644
>>>> --- a/elf/dl-libc.c
>>>> +++ b/elf/dl-libc.c
>>>> @@ -16,6 +16,7 @@
>>>> License along with the GNU C Library; if not, see
>>>> <https://www.gnu.org/licenses/>. */
>>>> 
>>>> +#include <assert.h>
>>>> #include <dlfcn.h>
>>>> #include <stdlib.h>
>>>> #include <ldsodefs.h>
>>>> @@ -72,6 +73,7 @@ struct do_dlsym_args
>>>> /* Arguments to do_dlsym. */
>>>> struct link_map *map;
>>>> const char *name;
>>>> + const void *caller_dlsym;
>>>> 
>>>> /* Return values of do_dlsym. */
>>>> lookup_t loadbase;
>>>> @@ -102,8 +104,21 @@ do_dlsym (void *ptr)
>>>> {
>>>> struct do_dlsym_args *args = (struct do_dlsym_args *) ptr;
>>>> args->ref = NULL;
>>>> - args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, args->map, &args->ref,
>>>> -					 args->map->l_local_scope, NULL, 0,
>>>> + struct link_map *match = args->map;
>>>> + struct r_scope_elem **scope;
>>>> + if (args->map == RTLD_DEFAULT)
>>>> + {
>>>> + ElfW(Addr) caller = (ElfW(Addr)) args->caller_dlsym;
>>>> + match = _dl_find_dso_for_object (caller);
>>>> + /* It is only used internally, so the caller should always be recognized. */
>>>> + assert (match != NULL);
>>>> + scope = match->l_scope;
>>>> + }
>>>> + else
>>>> + scope = args->map->l_local_scope;
>>>> +
>>>> + args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, match, &args->ref,
>>>> +					 scope, NULL, 0,
>>>> 					 DL_LOOKUP_RETURN_NEWEST, NULL);
>>>> }
>>>> 
>>>> @@ -182,6 +197,7 @@ __libc_dlsym (void *map, const char *name)
>>>> struct do_dlsym_args args;
>>>> args.map = map;
>>>> args.name = name;
>>>> + args.caller_dlsym = RETURN_ADDRESS (0);
>>>> 
>>>> #ifdef SHARED
>>>> if (GLRO (dl_dlfcn_hook) != NULL)
>>>> diff --git a/elf/libc_early_init.c b/elf/libc_early_init.c
>>>> index 3c4a19cf6b..7cc2997122 100644
>>>> --- a/elf/libc_early_init.c
>>>> +++ b/elf/libc_early_init.c
>>>> @@ -16,7 +16,9 @@
>>>> License along with the GNU C Library; if not, see
>>>> <https://www.gnu.org/licenses/>. */
>>>> 
>>>> +#include <assert.h>
>>>> #include <ctype.h>
>>>> +#include <dlfcn.h>
>>>> #include <elision-conf.h>
>>>> #include <libc-early-init.h>
>>>> #include <libc-internal.h>
>>>> @@ -38,6 +40,13 @@ __libc_early_init (_Bool initial)
>>>> __libc_single_threaded = initial;
>>>> 
>>>> #ifdef SHARED
>>>> + /* __libc_single_threaded can be accessed through copy relocations, so
>>>> + the external copy must be updated as well. */
>>>> + __libc_external_single_threaded = __libc_dlsym (RTLD_DEFAULT,
>>>> +						 "__libc_single_threaded");
>>>> + assert (__libc_external_single_threaded != NULL);
>>>> + *__libc_external_single_threaded = initial;
>>>> +
>>>> __libc_initial = initial;
>>>> #endif
>>> 
>>> I think this whole scheme can be greatly simplified.
>>> 
>>> Under the hood,
>>> 
>>> extern __typeof (__libc_single_threaded) __EI___libc_single_threaded __asm__("" "__libc_single_threaded"); extern __typeof (__libc_single_threaded) __EI___libc_single_threaded __attribute__((alias ("" "__GI___libc_single_threaded"))) __attribute__ ((__copy__ (__libc_single_threaded)));
>>> * __libc_single_threaded is the STV_HIDDEN symbol with the asm name "__GI___libc_single_threaded"
>>> * __EI___libc_single_threaded is the STV_DEFAULT symbol with the asm name "__libc_single_threaded". It aliases __libc_single_threaded.
>> 
>> In fact libc_hidden_proto for SHARED will result in:
>> 
>> extern __typeof (__libc_single_threaded) __libc_single_threaded __asm__ ("" "__GI___libc_single_threaded") __attribute__ ((visibility ("hidden")));
>> 
>> So using __libc_single_threaded will be the simplest way.
> 
> The code tries to read the initial value of the copy relocated __libc_single_threaded.
> Accessing the STV_DEFAULT symbol does the trick and is simpler than __libc_dlsym.
> (Accessing the STV_HIDDEN symbol just accesses libc.so's own copy.)

The glibc will still continue to use the __libc_single_threaded value internally;
__libc_dlsym is only used to initialize __libc_external_single_threaded, which
updates the external copy so that the update is visible on architectures that
use copy relocations.

This is actually not needed on some architectures, aarch64 for instance, but
without it x86 does not see __libc_single_threaded being updated.

> 
> Note: The copy relocated __libc_single_threaded likely has a value of zero and is read-only from the executable.
> I assume that this is a conservative and safe value.
> 

It cannot be a read-only value because it has to be updated by pthread_create.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded
  2022-06-17  0:35         ` Adhemerval Zanella
@ 2022-06-17 20:35           ` Fangrui Song
  2022-06-18 13:20             ` Wilco Dijkstra
  0 siblings, 1 reply; 21+ messages in thread
From: Fangrui Song @ 2022-06-17 20:35 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: GNU C Library, Wilco Dijkstra

On 2022-06-16, Adhemerval Zanella wrote:
>
>
>> On 16 Jun 2022, at 15:30, Fangrui Song <maskray@google.com> wrote:
>>
>> On 2022-06-16, Adhemerval Zanella wrote:
>>>
>>>
>>>> On 16 Jun 2022, at 00:15, Fangrui Song <maskray@google.com> wrote:
>>>>
>>>> On 2022-06-10, Adhemerval Zanella via Libc-alpha wrote:
>>>>> By adding an internal hidden_def alias to avoid the GOT indirection.
>>>>> On some architectures, __libc_single_threaded may be accessed through
>>>>> copy relocations, and thus both copies must be updated.
>>>>>
>>>>> To obtain the correct address of __libc_single_threaded,
>>>>> __libc_dlsym is extended to support RTLD_DEFAULT. It searches
>>>>> through all scopes instead of only the default local one.
>>>>>
>>>>> Checked on x86_64-linux-gnu and i686-linux-gnu.
>>>>> ---
>>>>> elf/dl-libc.c | 20 ++++++++++++++++++--
>>>>> elf/libc_early_init.c | 9 +++++++++
>>>>> include/sys/single_threaded.h | 11 +++++++++++
>>>>> misc/single_threaded.c | 2 ++
>>>>> nptl/pthread_create.c | 6 +++++-
>>>>> 5 files changed, 45 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/elf/dl-libc.c b/elf/dl-libc.c
>>>>> index 266e068da6..e64f4b9910 100644
>>>>> --- a/elf/dl-libc.c
>>>>> +++ b/elf/dl-libc.c
>>>>> @@ -16,6 +16,7 @@
>>>>> License along with the GNU C Library; if not, see
>>>>> <https://www.gnu.org/licenses/>. */
>>>>>
>>>>> +#include <assert.h>
>>>>> #include <dlfcn.h>
>>>>> #include <stdlib.h>
>>>>> #include <ldsodefs.h>
>>>>> @@ -72,6 +73,7 @@ struct do_dlsym_args
>>>>> /* Arguments to do_dlsym. */
>>>>> struct link_map *map;
>>>>> const char *name;
>>>>> + const void *caller_dlsym;
>>>>>
>>>>> /* Return values of do_dlsym. */
>>>>> lookup_t loadbase;
>>>>> @@ -102,8 +104,21 @@ do_dlsym (void *ptr)
>>>>> {
>>>>> struct do_dlsym_args *args = (struct do_dlsym_args *) ptr;
>>>>> args->ref = NULL;
>>>>> - args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, args->map, &args->ref,
>>>>> -					 args->map->l_local_scope, NULL, 0,
>>>>> + struct link_map *match = args->map;
>>>>> + struct r_scope_elem **scope;
>>>>> + if (args->map == RTLD_DEFAULT)
>>>>> + {
>>>>> + ElfW(Addr) caller = (ElfW(Addr)) args->caller_dlsym;
>>>>> + match = _dl_find_dso_for_object (caller);
>>>>> + /* It is only used internally, so the caller should always be recognized. */
>>>>> + assert (match != NULL);
>>>>> + scope = match->l_scope;
>>>>> + }
>>>>> + else
>>>>> + scope = args->map->l_local_scope;
>>>>> +
>>>>> + args->loadbase = GLRO(dl_lookup_symbol_x) (args->name, match, &args->ref,
>>>>> +					 scope, NULL, 0,
>>>>> 					 DL_LOOKUP_RETURN_NEWEST, NULL);
>>>>> }
>>>>>
>>>>> @@ -182,6 +197,7 @@ __libc_dlsym (void *map, const char *name)
>>>>> struct do_dlsym_args args;
>>>>> args.map = map;
>>>>> args.name = name;
>>>>> + args.caller_dlsym = RETURN_ADDRESS (0);
>>>>>
>>>>> #ifdef SHARED
>>>>> if (GLRO (dl_dlfcn_hook) != NULL)
>>>>> diff --git a/elf/libc_early_init.c b/elf/libc_early_init.c
>>>>> index 3c4a19cf6b..7cc2997122 100644
>>>>> --- a/elf/libc_early_init.c
>>>>> +++ b/elf/libc_early_init.c
>>>>> @@ -16,7 +16,9 @@
>>>>> License along with the GNU C Library; if not, see
>>>>> <https://www.gnu.org/licenses/>. */
>>>>>
>>>>> +#include <assert.h>
>>>>> #include <ctype.h>
>>>>> +#include <dlfcn.h>
>>>>> #include <elision-conf.h>
>>>>> #include <libc-early-init.h>
>>>>> #include <libc-internal.h>
>>>>> @@ -38,6 +40,13 @@ __libc_early_init (_Bool initial)
>>>>> __libc_single_threaded = initial;
>>>>>
>>>>> #ifdef SHARED
>>>>> + /* __libc_single_threaded can be accessed through copy relocations, so
>>>>> + the external copy must be updated as well. */
>>>>> + __libc_external_single_threaded = __libc_dlsym (RTLD_DEFAULT,
>>>>> +						 "__libc_single_threaded");
>>>>> + assert (__libc_external_single_threaded != NULL);
>>>>> + *__libc_external_single_threaded = initial;
>>>>> +
>>>>> __libc_initial = initial;
>>>>> #endif
>>>>
>>>> I think this whole scheme can be greatly simplified.
>>>>
>>>> Under the hood,
>>>>
>>>> extern __typeof (__libc_single_threaded) __EI___libc_single_threaded __asm__("" "__libc_single_threaded"); extern __typeof (__libc_single_threaded) __EI___libc_single_threaded __attribute__((alias ("" "__GI___libc_single_threaded"))) __attribute__ ((__copy__ (__libc_single_threaded)));
>>>> * __libc_single_threaded is the STV_HIDDEN symbol with the asm name "__GI___libc_single_threaded"
>>>> * __EI___libc_single_threaded is the STV_DEFAULT symbol with the asm name "__libc_single_threaded". It aliases __libc_single_threaded.
>>>
>>> In fact libc_hidden_proto for SHARED will result in:
>>>
>>> extern __typeof (__libc_single_threaded) __libc_single_threaded __asm__ ("" "__GI___libc_single_threaded") __attribute__ ((visibility ("hidden")));
>>>
>>> So using __libc_single_threaded will be the simplest way.
>>
>> The code tries to read the initial value of the copy relocated __libc_single_threaded.
>> Accessing the STV_DEFAULT symbol does the trick and is simpler than __libc_dlsym.
>> (Accessing the STV_HIDDEN symbol just accesses libc.so's own copy.)
>
>The glibc will still continue to use the __libc_single_threaded value internally;
>__libc_dlsym is only used to initialize __libc_external_single_threaded, which
>updates the external copy so that the update is visible on architectures that
>use copy relocations.

Yes, I know.
My point is that to access the possibly external definition (in the
presence of a copy relocation), we can access the STV_DEFAULT symbol,
instead of using __libc_dlsym.

The compiler generates a GOT-generating relocation in -fPIC mode, and the
linker will create a GLOB_DAT dynamic relocation because the symbol is
preemptible/interposable (see
https://maskray.me/blog/2020-11-15-explain-gnu-linker-options#bsymbolic-and---dynamic-list
for how a symbol is considered preemptible). ld.so resolves the GLOB_DAT to
the copy relocated definition in the executable (if it exists).
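
As a rough illustration of that path:

extern char v;

char
get (void)
{
  return v;
}

built with gcc -O2 -fPIC on x86-64 yields approximately:

        movq    v@GOTPCREL(%rip), %rax   /* address loaded via the GOT */
        movzbl  (%rax), %eax
        ret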

>This is actually not needed on some architectures, aarch64 for instance, but
>without it x86 does not see __libc_single_threaded being updated.

__libc_single_threaded is in a public header, sys/single_threaded.h, and is
used by libstdc++ ext/atomicity.h.
A user program compiled with -fno-pic will typically access __libc_single_threaded
with an absolute or PC-relative relocation. The linker will resolve this relocation
with a copy relocation if __libc_single_threaded is defined in libc.so.6.

aarch64 -fno-pic uses PC-relative relocations. I think some variants of
mips avoid the -fno-pic code generation that may lead to copy relocations.

clang -fno-direct-access-external-data and gcc -mno-direct-extern-access
avoid absolute/PC-relative relocations in -fno-pic/-fpie mode, but the
options are by no means popular.

>>
>> Note: The copy relocated __libc_single_threaded likely has a value of zero and is read-only from the executable.
>> I assume that this is a conservative and safe value.
>>
>
>It cannot be a read-only value because it has to be updated by pthread_create.
>
I mean that the executable does not write to __libc_single_threaded.
Its write to __libc_single_threaded will not be visible to libc.so.6.
If libc.so.6 decides to write the STV_HIDDEN __libc_single_threaded,
it also needs to write the STV_DEFAULT __libc_single_threaded to update
the possibly copy relocated definition.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded
  2022-06-17 20:35           ` Fangrui Song
@ 2022-06-18 13:20             ` Wilco Dijkstra
  0 siblings, 0 replies; 21+ messages in thread
From: Wilco Dijkstra @ 2022-06-18 13:20 UTC (permalink / raw)
  To: Fangrui Song, Adhemerval Zanella; +Cc: GNU C Library

Hi,

So long story short, what happens is that the magic macros effectively emit both global
and hidden versions of the same symbol, which can be accessed without using dlsym.

>>>> I think this whole scheme can be greatly simplified.

Yes, I would suggest keeping it simple and just explicitly adding a separate hidden
symbol for use inside GLIBC (e.g. __libc_single_thread_internal).
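
A minimal sketch of that suggestion, with hypothetical names (not a committed
design):

/* Exported variable, possibly copy-relocated into the executable.  */
char __libc_single_threaded;

/* Separate hidden symbol for internal use; both must be updated together
   when the process becomes multi-threaded.  */
char __libc_single_thread_internal __attribute__ ((visibility ("hidden")));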

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded
  2022-06-10 16:35 ` [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded Adhemerval Zanella
  2022-06-16  7:15   ` Fangrui Song
@ 2022-06-20  8:37   ` Florian Weimer
  1 sibling, 0 replies; 21+ messages in thread
From: Florian Weimer @ 2022-06-20  8:37 UTC (permalink / raw)
  To: Adhemerval Zanella via Libc-alpha; +Cc: Wilco Dijkstra, Adhemerval Zanella

* Adhemerval Zanella via Libc-alpha:

> diff --git a/elf/libc_early_init.c b/elf/libc_early_init.c
> index 3c4a19cf6b..7cc2997122 100644
> --- a/elf/libc_early_init.c
> +++ b/elf/libc_early_init.c
> @@ -16,7 +16,9 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>  
> +#include <assert.h>
>  #include <ctype.h>
> +#include <dlfcn.h>
>  #include <elision-conf.h>
>  #include <libc-early-init.h>
>  #include <libc-internal.h>
> @@ -38,6 +40,13 @@ __libc_early_init (_Bool initial)
>    __libc_single_threaded = initial;
>  
>  #ifdef SHARED
> +  /* __libc_single_threaded can be accessed through copy relocations, so it
> +     requires to update the external copy.  */
> +  __libc_external_single_threaded = __libc_dlsym (RTLD_DEFAULT,
> +						  "__libc_single_threaded");
> +  assert (__libc_external_single_threaded != NULL);
> +  *__libc_external_single_threaded = initial;
> +
>    __libc_initial = initial;
>  #endif

I wrote this before, but:

There is no need to cache the pointer because we need it at most once.
We can just do the lookup in pthread_create.

RTLD_DEFAULT is wrong here because we need the scope of the main
program, where the copy relocations are, so:

  __libc_dlsym (GL (dl_ns)[LM_ID_BASE]._ns_loaded, "__libc_single_threaded");

and no __libc_dlsym changes are needed.
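
A minimal sketch of that lookup as it could appear in pthread_create
(hypothetical, not the committed code):

  char *external = __libc_dlsym (GL (dl_ns)[LM_ID_BASE]._ns_loaded,
                                 "__libc_single_threaded");
  if (external != NULL)
    *external = 0;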

Thanks,
Florian


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-06-20  8:37 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-10 16:35 [PATCH v2 0/4] Simplify internal single-threaded usage Adhemerval Zanella
2022-06-10 16:35 ` [PATCH v2 1/4] misc: Optimize internal usage of __libc_single_threaded Adhemerval Zanella
2022-06-16  7:15   ` Fangrui Song
2022-06-16 22:06     ` Adhemerval Zanella
2022-06-16 22:30       ` Fangrui Song
2022-06-17  0:35         ` Adhemerval Zanella
2022-06-17 20:35           ` Fangrui Song
2022-06-18 13:20             ` Wilco Dijkstra
2022-06-20  8:37   ` Florian Weimer
2022-06-10 16:35 ` [PATCH v2 2/4] Replace __libc_multiple_threads with __libc_single_threaded Adhemerval Zanella
2022-06-10 16:35 ` [PATCH v2 3/4] Remove usage of TLS_MULTIPLE_THREADS_IN_TCB Adhemerval Zanella
2022-06-10 19:49   ` H.J. Lu
2022-06-10 21:00   ` Noah Goldstein
2022-06-11 13:59     ` Wilco Dijkstra
2022-06-15 21:07       ` Adhemerval Zanella
2022-06-16 12:48         ` Wilco Dijkstra
2022-06-16 17:23           ` Adhemerval Zanella
2022-06-13 21:31   ` Wilco Dijkstra
2022-06-15 21:10     ` Adhemerval Zanella
2022-06-16  7:35   ` Fangrui Song
2022-06-10 16:35 ` [PATCH v2 4/4] Remove single-thread.h Adhemerval Zanella
