public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: "H.J. Lu" <hjl.tools@gmail.com>
To: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: libc-alpha@sourceware.org, Florian Weimer <fweimer@redhat.com>,
	 Fangrui Song <maskray@google.com>,
	"Carlos O'Donell" <carlos@redhat.com>
Subject: Re: [PATCH v4 2/2] x86: Update _dl_tlsdesc_dynamic to preserve caller-saved registers
Date: Thu, 15 Feb 2024 15:15:10 -0800	[thread overview]
Message-ID: <CAMe9rOqichkHy=AhGZE4m=4QC6LpVQYgB3V9V5nSHn9oiXi3hg@mail.gmail.com> (raw)
In-Reply-To: <ce327884-642d-4ff4-86df-10ff08804a58@linaro.org>

On Thu, Feb 15, 2024 at 3:05 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 13/02/24 01:15, H.J. Lu wrote:
> > Compiler generates the following instruction sequence for GNU2 dynamic
> > TLS access:
> >
> >       leaq    tls_var@TLSDESC(%rip), %rax
> >       call    *tls_var@TLSCALL(%rax)
> >
> > or
> >
> >       leal    tls_var@TLSDESC(%ebx), %eax
> >       call    *tls_var@TLSCALL(%eax)
> >
> > CALL instruction is transparent to compiler which assumes all registers,
> > except for EFLAGS and RAX/EAX, are unchanged after CALL.  When
> > _dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
> > path.  __tls_get_addr is a normal function which doesn't preserve any
> > caller-saved registers.  _dl_tlsdesc_dynamic saved and restored integer
> > caller-saved registers, but didn't preserve any other caller-saved
> > registers.  Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
> > XSAVE and XSAVEC to save and restore all caller-saved registers.  This
> > fixes BZ #31372.
> >
> > Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
> > to optimize elf_machine_runtime_setup.
> > ---
> >  elf/Makefile                                 |  19 ++
> >  elf/malloc-for-test.c                        |  32 ++++
> >  elf/malloc-for-test.map                      |   6 +
> >  elf/tst-gnu2-tls2.c                          |  97 ++++++++++
> >  elf/tst-gnu2-tls2.h                          |  26 +++
> >  elf/tst-gnu2-tls2mod0.c                      |  28 +++
> >  elf/tst-gnu2-tls2mod1.c                      |  28 +++
> >  elf/tst-gnu2-tls2mod2.c                      |  28 +++
> >  sysdeps/i386/dl-machine.h                    |   2 +-
> >  sysdeps/i386/dl-tlsdesc-dynamic.h            | 187 +++++++++++++++++++
> >  sysdeps/i386/dl-tlsdesc.S                    | 115 +++++-------
> >  sysdeps/i386/tst-gnu2-tls2.c                 |   5 +
> >  sysdeps/x86/Makefile                         |   7 +-
> >  sysdeps/x86/cpu-features.c                   |  56 +++++-
> >  sysdeps/x86/dl-procinfo.c                    |  16 ++
> >  sysdeps/{x86_64 => x86}/features-offsets.sym |   2 +
> >  sysdeps/x86/malloc-for-test.c                |  33 ++++
> >  sysdeps/x86/sysdep.h                         |   6 +
> >  sysdeps/x86_64/Makefile                      |   2 +-
> >  sysdeps/x86_64/dl-machine.h                  |  19 +-
> >  sysdeps/x86_64/dl-procinfo.c                 |  16 ++
> >  sysdeps/x86_64/dl-tlsdesc-dynamic.h          | 166 ++++++++++++++++
> >  sysdeps/x86_64/dl-tlsdesc.S                  | 108 ++++-------
> >  sysdeps/x86_64/dl-trampoline-save.h          |  34 ++++
> >  sysdeps/x86_64/dl-trampoline-state.h         |  51 +++++
> >  sysdeps/x86_64/dl-trampoline.S               |  20 +-
> >  sysdeps/x86_64/dl-trampoline.h               |  34 +---
> >  27 files changed, 930 insertions(+), 213 deletions(-)
> >  create mode 100644 elf/malloc-for-test.c
> >  create mode 100644 elf/malloc-for-test.map
> >  create mode 100644 elf/tst-gnu2-tls2.c
> >  create mode 100644 elf/tst-gnu2-tls2.h
> >  create mode 100644 elf/tst-gnu2-tls2mod0.c
> >  create mode 100644 elf/tst-gnu2-tls2mod1.c
> >  create mode 100644 elf/tst-gnu2-tls2mod2.c
> >  create mode 100644 sysdeps/i386/dl-tlsdesc-dynamic.h
> >  create mode 100644 sysdeps/i386/tst-gnu2-tls2.c
> >  rename sysdeps/{x86_64 => x86}/features-offsets.sym (89%)
> >  create mode 100644 sysdeps/x86/malloc-for-test.c
> >  create mode 100644 sysdeps/x86_64/dl-tlsdesc-dynamic.h
> >  create mode 100644 sysdeps/x86_64/dl-trampoline-save.h
> >  create mode 100644 sysdeps/x86_64/dl-trampoline-state.h
> >
> > diff --git a/elf/Makefile b/elf/Makefile
> > index 5d78b659ce..e0665d2007 100644
> > --- a/elf/Makefile
> > +++ b/elf/Makefile
> > @@ -424,6 +424,7 @@ tests += \
> >    tst-glibc-hwcaps-prepend \
> >    tst-global1 \
> >    tst-global2 \
> > +  tst-gnu2-tls2 \
> >    tst-initfinilazyfail \
> >    tst-initorder \
> >    tst-initorder2 \
> > @@ -699,6 +700,7 @@ modules-names += \
> >    libtracemod5-1 \
> >    ltglobmod1 \
> >    ltglobmod2 \
> > +  malloc-for-test \
> >    neededobj1 \
> >    neededobj2 \
> >    neededobj3 \
> > @@ -846,6 +848,9 @@ modules-names += \
> >    tst-filterobj-flt \
> >    tst-finilazyfailmod \
> >    tst-globalmod2 \
> > +  tst-gnu2-tls2mod0 \
> > +  tst-gnu2-tls2mod1 \
> > +  tst-gnu2-tls2mod2 \
> >    tst-initlazyfailmod \
> >    tst-initorder2a \
> >    tst-initorder2b \
> > @@ -3044,8 +3049,22 @@ $(objpfx)tst-tlsgap.out: \
> >    $(objpfx)tst-tlsgap-mod0.so \
> >    $(objpfx)tst-tlsgap-mod1.so \
> >    $(objpfx)tst-tlsgap-mod2.so
> > +
> > +$(objpfx)tst-gnu2-tls2: \
> > +  $(shared-thread-library) \
> > +  $(objpfx)malloc-for-test.so
> > +$(objpfx)tst-gnu2-tls2.out: \
> > +  $(objpfx)tst-gnu2-tls2mod0.so \
> > +  $(objpfx)tst-gnu2-tls2mod1.so \
> > +  $(objpfx)tst-gnu2-tls2mod2.so
> > +
> > +LDFLAGS-malloc-for-test.so += -Wl,--version-script=malloc-for-test.map
> > +
> >  ifeq (yes,$(have-mtls-dialect-gnu2))
> >  CFLAGS-tst-tlsgap-mod0.c += -mtls-dialect=gnu2
> >  CFLAGS-tst-tlsgap-mod1.c += -mtls-dialect=gnu2
> >  CFLAGS-tst-tlsgap-mod2.c += -mtls-dialect=gnu2
> > +CFLAGS-tst-gnu2-tls2mod0.c += -mtls-dialect=gnu2
> > +CFLAGS-tst-gnu2-tls2mod1.c += -mtls-dialect=gnu2
> > +CFLAGS-tst-gnu2-tls2mod2.c += -mtls-dialect=gnu2
> >  endif
> > diff --git a/elf/malloc-for-test.c b/elf/malloc-for-test.c
> > new file mode 100644
> > index 0000000000..1bec69eda7
> > --- /dev/null
> > +++ b/elf/malloc-for-test.c
> > @@ -0,0 +1,32 @@
> > +/* A malloc for intercept test.
> > +   Copyright (C) 2024 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include <stdlib.h>
> > +
> > +extern void * __libc_malloc (size_t);
> > +
> > +#ifndef PREPARE_MALLOC
> > +# define PREPARE_MALLOC()
> > +#endif
> > +
> > +void *
> > +malloc (size_t n)
> > +{
> > +  PREPARE_MALLOC ();
>
> It is not clear to me how exactly this adds proper coverage without
> actually set the affected registers *before* the TLS variable access
> and later check its value has not changed. In fact, on x86_64 it is
> passing on my system even without the test actually work as expected
> (see below).

It depends on the compiler version.  Newer GCC will generate vector
load/store on x86-64 to copy a structure.

> And I think we also need to add arch-specific rules to build the test
> with only the base ABI, and add extra macros to clobber and check
> the expected registers that _dl_tlsdesc_dynamic should save/restore.
>
> It would be slightly more trick on ABIs that already have a large set
> or register (like x86_64-v1 and armv8-a).
>
> > +  return __libc_malloc (n);
> > +}
> > diff --git a/elf/malloc-for-test.map b/elf/malloc-for-test.map
> > new file mode 100644
> > index 0000000000..8437cf4346
> > --- /dev/null
> > +++ b/elf/malloc-for-test.map
> > @@ -0,0 +1,6 @@
> > +GLIBC_2.0 {
>
> You need to use the correct version to override the malloc:
>
> $ gdb --args tst-gnu2-tls2 --direct
> [...]
> (gdb) b apply_tls
> (gdb) r
> Thread 2 "tst-gnu2-tls2" hit Breakpoint 1, apply_tls (p=0x7ffff7bfee80) at tst-gnu2-tls2mod1.c:25
> 25 {
> (gdb) b malloc
> Breakpoint 2 at 0x7ffff7ca8ad0: malloc. (3 locations)
> (gdb) c
> Continuing.
>
> Thread 2 "test-gnu2-tls2" hit Breakpoint 2, __GI___libc_malloc (bytes=3200) at malloc.c:3294
> 3294 {
> (gdb) bt
> #0 __GI___libc_malloc (bytes=3200) at malloc.c:3294
> #1 0x00007ffff7fda3de in malloc (size=<optimized out>) at ../include/rtld-malloc.h:56
> #2 allocate_dtv_entry (size=<optimized out>, alignment=16) at ../elf/dl-tls.c:679
> #3 allocate_and_init (map=0x7ffff0000bd0) at ../elf/dl-tls.c:704
> #4 tls_get_addr_tail (ti=0x7ffff0001240, dtv=0x55555555e340, the_map=0x7ffff0000bd0) at ../elf/dl-tls.c:904
> #5 0x00007ffff7fdda2e in _dl_tlsdesc_dynamic_xsavec () at ../sysdeps/x86_64/dl-tlsdesc-dynamic.h:135
> #6 0x00007ffff7fb0155 in apply_tls (p=0xc80) at tst-gnu2-tls2mod1.c:27
> #7 0x0000555555556965 in access_mod (i=1, sym=0x555555559022 "apply_tls") at tst-gnu2-tls2.c:58
> #8 start (arg=0x0) at tst-gnu2-tls2.c:73
> #9 0x00007ffff7c96a82 in start_thread (arg=<optimized out>) at pthread_create.c:447
> #10 0x00007ffff7d1b13c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
>
> By using GLIBC_2.2.5 instead of GLIBC_2.0, now I see:
>
> (gdb) bt
> #0 malloc (n=3200) at ../elf/malloc-for-test.c:29
> #1 0x00007ffff7fda3de in malloc (size=<optimized out>) at ../include/rtld-malloc.h:56
> #2 allocate_dtv_entry (size=<optimized out>, alignment=16) at ../elf/dl-tls.c:679
> #3 allocate_and_init (map=0x7ffff0000bd0) at ../elf/dl-tls.c:704
> #4 tls_get_addr_tail (ti=0x7ffff0001240, dtv=0x55555555e340, the_map=0x7ffff0000bd0) at ../elf/dl-tls.c:904
> #5 0x00007ffff7fdda2e in _dl_tlsdesc_dynamic_xsavec () at ../sysdeps/x86_64/dl-tlsdesc-dynamic.h:135
> #6 0x00007ffff7fb0155 in apply_tls (p=0xc80) at tst-gnu2-tls2mod1.c:27
> #7 0x0000555555556965 in access_mod (i=1, sym=0x555555559022 "apply_tls") at tst-gnu2-tls2.c:58
> #8 start (arg=0x0) at tst-gnu2-tls2.c:73
> #9 0x00007ffff7c96a82 in start_thread (arg=<optimized out>) at pthread_create.c:447
> #10 0x00007ffff7d1b13c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
>
> So you will need either to get the correct version or parameterize
> the map file for each ABI (which is just a handful that actually
> support TLS descriptors).

An arch may need a different version map.

> Which made me realize that preloading malloc won't interpose ld
> implementation since 3a0ecccb599a6b1ad4b149dc569c0080e92d057b unless
> the malloc replacement also exports the malloc with the proper ABI
> version.

True.

> I think it is an unexpected change from BZ#25486, but at the same
> time I don't think this is a bad one. We will need to add all the
> machinery required to save/restore the caller-saved register for
> each ABI that supports TLS descriptors because even glibc malloc
> might call internal functions that might use such ABI extension
> (for instance mem* and str* functions).
>
> All this made me realize that the TLS descriptor slow path
> is *far* from maintainable, as we discussed on the weekly call.
> And I think we *should* move away from it. Some issues:
>
>  * To properly support _dl_tlsdesc_dynamic on ABI with vector
>    extensions, it would either need to pessimize code generation
>    for TLS access (so the compiler would add all the required
>    instructions to save/restore  the caller-saved registers) or
>    move the complexity to libc.
>
>  * The latter would make the libc to require either a quite complex
>    _dl_tlsdesc_dynamic, which would either need to probe hardware
>    support to provide the multiple code paths or add the support
>    through iFUNC.

True.

>  * ARM also has the issue and I think it has not seen this issue
>    because gnu2 is not the default TLS ABI and gcc likely won't change
>    in nearby future. And to properly fix it, it would require to add
>    something like what you are doing for x86 to support the multiple
>    vector extensions (VFP, VFP3, NEON).

True.

>  * Loongsong is finishing its TLSDESC ABI support on gcc/binutils, and
>    most likely would require quite similar support to proper support
>    LSX, LASX.
>
>  * I think RISC-V would also have a similar issue for its vector ABI.
>
> So I think we really should reevaluate the BZ#16133 fix that we reverted
> on 2.20 [1] [2]. if I recall correctly (I need to go through again my
> notes about this issue), two main issues triggered the revert:
>
>  1. It broke LSAN;
>  2. Lazy allocation is an explicit feature [3].
>
> For 1. I think it should be doable to fix on sanitizer, either by adding
> more hacks to get the correct TLS size or by providing a proper ABI.
>
> However for 2. I think it is past time that we accept that lazy allocation
> was a nice idea, but it adds a *lot* of maintainability burden that
> it is not paying off.
>
> [1] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=1f33d36a8a9e78c81bed59b47f260723f56bb7e6
> [2] https://sourceware.org/legacy-ml/libc-alpha/2013-09/msg00721.html
> [3] https://sourceware.org/legacy-ml/libc-alpha/2014-01/msg00287.html
>

We need a short-team fix before the slow path is removed.


-- 
H.J.

  reply	other threads:[~2024-02-15 23:15 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-13  4:14 [PATCH v4 0/2] " H.J. Lu
2024-02-13  4:15 ` [PATCH v4 1/2] x86-64: Save APX registers in ld.so trampoline H.J. Lu
2024-02-13  4:15 ` [PATCH v4 2/2] x86: Update _dl_tlsdesc_dynamic to preserve caller-saved registers H.J. Lu
2024-02-14 22:44   ` Noah Goldstein
2024-02-14 23:21     ` H.J. Lu
2024-02-14 23:57       ` Noah Goldstein
2024-02-15  0:23         ` H.J. Lu
2024-02-15 23:05   ` Adhemerval Zanella Netto
2024-02-15 23:15     ` H.J. Lu [this message]
2024-02-16  6:23     ` Florian Weimer
2024-02-16 11:59       ` H.J. Lu
2024-02-16 12:18         ` Florian Weimer
2024-02-16 12:20           ` H.J. Lu
2024-02-16 12:37             ` H.J. Lu
2024-02-16 12:47               ` Adhemerval Zanella Netto
2024-02-16 12:58                 ` H.J. Lu
2024-02-16 13:24                   ` Adhemerval Zanella Netto
2024-02-16 14:25                     ` H.J. Lu
2024-02-16 13:06               ` Florian Weimer
2024-02-16 13:24                 ` H.J. Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAMe9rOqichkHy=AhGZE4m=4QC6LpVQYgB3V9V5nSHn9oiXi3hg@mail.gmail.com' \
    --to=hjl.tools@gmail.com \
    --cc=adhemerval.zanella@linaro.org \
    --cc=carlos@redhat.com \
    --cc=fweimer@redhat.com \
    --cc=libc-alpha@sourceware.org \
    --cc=maskray@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).