public inbox for libc-alpha@sourceware.org
From: Ben Woodard <woodard@redhat.com>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: Florian Weimer <fweimer@redhat.com>,
	John Mellor-Crummey <johnmc@rice.edu>,
	libc-alpha@sourceware.org
Subject: Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
Date: Fri, 30 Jul 2021 17:59:48 -0700	[thread overview]
Message-ID: <C7DAE37C-9575-41AE-BA86-87432C39D32E@redhat.com> (raw)
In-Reply-To: <81b1ad4e-cda4-4ed9-61e6-6dc8884800d5@linaro.org>



> On Jul 30, 2021, at 2:09 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> 
> 
> 
> On 30/07/2021 15:59, Ben Woodard wrote:
>> 
>> 
>>> On Jul 30, 2021, at 7:58 AM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>> 
>>> 
>>> 
>>> On 23/06/2021 14:42, Ben Woodard wrote:
>>>> 
>>>> 
>>>>> On Jun 17, 2021, at 4:06 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 17/06/2021 17:09, Florian Weimer wrote:
>>>>>> * Adhemerval Zanella:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>>>>>> STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>>>> 
>>>> To me this sounds partly like a toolchain issue. The aarch64 PCS does define the ABI for SVE calls; I haven’t checked what GCC/binutils does in quite a while. It seems like STO_AARCH64_VARIANT_PCS was an expedient for SVE when it first came out, and the semantics it defines, where all the registers are caller-preserved, make it very difficult to implement around.
>>>> 
>>>> For LAV_CURRENT=2 what I planned to do was:
>>> 
>>> Now that I have finished the various audit issues that John Mellor-Crummey 
>>> has brought up, I think I have a better idea of how to address it.
>> 
>> Well, the tier1 ones at least. Thank you. I was working on modifying John’s reproducers and making them official glibc tests, so that when your patch series lands in 2.35 there will be tests that ensure no regressions show up.
>> 
>> I might not have seen it yet? Did you have a new version of 2/6, or a V3 version of the patch set which handles the lazy-binding issue that you pointed out in https://sourceware.org/pipermail/libc-alpha/2021-July/129510.html ?
> 
> I sent it [1].  I changed your aarch64 patch slightly, fixed a small issue, and
> added a couple of tests.  It should fix all tier1 and most of the tier2 issues,
> modulo the aarch64 SVE one.
> 
> [1] https://patchwork.sourceware.org/project/glibc/list/?series=2577

Yea!! I’ll test it with my stuff.

> 
>> 
>>> 
>>>> 
>>>> diff --git a/sysdeps/aarch64/bits/link.h b/sysdeps/aarch64/bits/link.h
>>>> index ca76087ee1..390b12a826 100644
>>>> --- a/sysdeps/aarch64/bits/link.h
>>>> +++ b/sysdeps/aarch64/bits/link.h
>>>> @@ -20,13 +20,24 @@
>>>> # error "Never include <bits/link.h> directly; use <link.h> instead."
>>>> #endif
>>>> 
>>>> +typedef struct La_sve_regs {
>>>> +  uint16_t    *lr_preg[3];
>>>> +  __uint128_t *lr_zreg[8];
>>>> +} La_sve_regs;
>>>> +
>>>> /* Registers for entry into PLT on AArch64.  */
>>>> typedef struct La_aarch64_regs
>>>> {
>>>>  uint64_t    lr_xreg[9];
>>>> -  __uint128_t lr_vreg[8];
>>>>  uint64_t    lr_sp;
>>>>  uint64_t    lr_lr;
>>>> +  char    lr_sve; /* 0 - no SVE,
>>>> +                        1-16 length of the SVE registers in vq (128 bits) */
>>>> +  union {
>>>> +    /* when sve!=0 accessing the lr_vreg is undefined */
>>>> +    __uint128_t lr_vreg[8];
>>>> +    La_sve_regs lr_zreg;
>>>> +  };
>>>> } La_aarch64_regs;
>>>> 
>>>> /* Return values for calls from PLT on AArch64.  */
>>>> @@ -34,9 +45,14 @@ typedef struct La_aarch64_retval
>>>> {
>>>>  /* Up to eight integer registers can be used for a return value.  */
>>>>  uint64_t    lrv_xreg[8];
>>>> -  /* Up to eight V registers can be used for a return value.  */
>>>> -  __uint128_t lrv_vreg[8];
>>>> -
>>>> +  char        lrv_sve; /* 0 - no SVE,
>>>> +                         1-16 length of the SVE registers in vq (128 bits) */
>>>> +  union{
>>>> +    /* Up to eight V registers can be used for a return value.
>>>> +       When sve!=0 accessing the lr_vreg is undefined */
>>>> +    __uint128_t lrv_vreg[8];
>>>> +    La_sve_regs lrv_zreg;
>>>> +  };
>>>> } La_aarch64_retval;
>>>> __BEGIN_DECLS
>>> 
>>> My idea is to do something similar:
>>> ---
>>> typedef struct La_sve_regs
>>> {
>>> uint16_t    *lr_preg;
>>> long double *lr_zreg;
>>> } La_sve_regs;
>>> 
>>> typedef struct La_aarch64_regs
>>> {
>>> [...]
>>> uint8_t  lr_sve;           /* 0 - no SVE
>>>                               1-16 length of the SVE registers in vq (128 bits)  */
>>> La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
>>> };
>> 
>> Take this with a grain of salt you are the expert not me. These are three minor issues that I don’t really like about this approach:
>> 1) I think having lr_sve be an uint8_t will mess up the alignment of the V registers that are immediately after it. It probably needs to be at the end of the structure or it needs to be padded out for alignment.
> 
> Not sure if there will be an alignment constraint here, since SVE instructions afaik
> can load/store unaligned memory and this is not performance-critical code.
> In any case, that's why I added 'long double' for the 'lr_zreg', since it should
> at least give 16-byte alignment.
> 

I wasn’t really worried about the Z registers here; I was thinking about the V registers. I tried fixing this myself a few times, and one of my attempts looked like yours, with a char lr_sve, and in that case I ran into alignment problems. The difference between what you are doing and what I tried is that I moved what are now the D registers (which is broken: you need to save the full 128-bit V register; this was copied over from ARMv7 and wasn’t updated for 64-bit ARMv8) down in the structure to after lr_sve, and that is what caused the alignment problem in the assembly portion of the code.

typedef struct La_sve_regs {
  uint16_t    *lr_preg[3];
  __uint128_t *lr_zreg[8];
} La_sve_regs;

/* Registers for entry into PLT on AArch64.  */
typedef struct La_aarch64_regs
{
    uint64_t    lr_xreg[9];
    uint64_t    lr_sp;
    uint64_t    lr_lr;
=>  char        lr_sve;  /* 0 - no SVE,
                            1-16 length of the SVE registers in vq (128 bits) */

=>  /* Note that this is *BEFORE* the V registers.  */
    union {
      /* When lr_sve != 0, accessing lr_vreg is undefined.  */
      __uint128_t lr_vreg[8];
      La_sve_regs lr_zreg;
    };
} La_aarch64_regs;

>> 2) It appears that you have to traverse two pointers to actually get at the registers. In my opinion, it would be better if everything were just offsets from the beginning of the structure, so that the actual memory for the Z and P registers immediately follows the La_aarch64_regs structure.
> 
> It is a trade-off to minimize the struct space for non-SVE code.  But
> I don't have a strong preference, we can use this instead:
> 
> typedef struct La_aarch64_regs
> {
>  [...]
>  uint8_t  lr_sve;           /* 0 - no SVE
>                                1-16 length of the SVE registers in vq (128 bits)  */
>  uint16_t    *lr_preg[8];   /* NULL - no SVE.  */
>  long double *lr_zreg[4];   /* NULL - no SVE.  */
> };
> 

You are correct, this isn’t a performance-critical code path. If you are watching the pltenter/exit hooks, things are going to suck anyway. I was trying to be too clever. I was thinking: who’s going to allocate the memory for this structure, how do we make it thread safe, how are we going to deal with all 100 threads coming through here at once. So I was thinking: let’s keep the whole register structure on the stack.

With your idea my best guess is that the lr_sve_regs would be allocated immediately after the La_aarch64_regs structure on the stack and so in memory it would look like:

  uint64_t    lr_xreg[9];
  __uint128_t lr_vreg[8];
  uint64_t    lr_sp;
  uint64_t    lr_lr;
  uint8_t     lr_sve;
  La_sve_regs *lr_sve_regs; -> SVE_REGS
SVE_REGS:
  uint16_t    *lr_preg; -> SVE_P_REGS
  long double *lr_zreg; -> SVE_P_REGS + 3*lr_sve*sizeof(uint16_t)
SVE_P_REGS:
  uint16_t    lr_preg[3*lr_sve];
  long double lr_zreg[8*lr_sve];

If my guess about where the memory is going to be allocated is correct, then you have a bunch of pointer arithmetic for things which are going to be right there anyway at static offsets.

Really, it doesn’t matter; we can do it any way you like. We just need it to work. Your way makes sense to a normal C programmer. I was trying to come up with something more clever that would take up less space and wouldn’t have all the pointer following, but I couldn’t get it working.

>> 3) Don’t the V and the Z registers overlap in the machine architecture? Don’t we want to overlap them in some way so that you don’t have unused V-register space in the structures when it is an SVE call?
>> 
>> What I would prefer to see in memory is:
>> 
>>  uint64_t lr_xreg[8];
>>  uint64_t lr_sp;
>>  uint64_t lr_lr;
>>  uint64_t  lr_sve;  // increased in size for alignment
>> 
>> then when lr_sve == 0
>>  long double lr_vreg[8];
>> or
>>  long double lr_zreg[8][lr_sve]
>>  uint16_t preg[4][lr_sve]
>> 
>> or something like that where the V or Z & P registers are immediately adjacent in memory to the rest of the normal registers. You don’t have to traverse pointers to get to them.
> 
> We can, but my idea is to make non-SVE calls have no performance or
> space implications.  Do we really need to optimize to avoid the pointer
> traversal?  What kind of analysis does the audit module usually do?
> 
> I am asking because I foresee that any pointer traversal will only be
> a fraction of the time of the PLT hook runtime itself.

Yep that is correct.

> 
>> 
>> Could we have two la_aarch64_gnu_pltenter/exit functions? One which passes the normal La_aarch64_regs and one that passes la_aarch64_sveregs?
> 
> It might be an option, but it would add another complexity layer:
> 
>  - We would need to add arch-specific hooks on the audit internal 'auditstate'
>    interface.
> 
>  - We would need to add arch-specific loader code to handle the SVE la_pltenter
>    la_pltexit
> 
>  - It would require either a different implementation for _dl_runtime_resolve
>    to call the correct pltexit hook, or adding a way to decide which hook to
>    call at runtime.  x86_64 has opted for the former and sets the expected
>    trampoline based on cpuid() (it would be hwcap2 for aarch64).
> 
> I am not sure which would be better; maybe adding SVE-specific hooks might
> indeed be a better approach (although it would require more code).
> 
>> 
>> or would it be better to have two types, the normal version and then when lr_sve!=0 then you cast it to: La_aarch64_sve_regs which has the Z and the P regs instead.
>> 
>> 
>>> 
>>> typedef struct La_aarch64_retval
>>> {
>>> [...]
>>> uint8_t  lr_sve;           /* 0 - no SVE
>>>                               1-16 length of the SVE registers in vq (128 bits)  */
>>> La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
>>> }
>>> ---
>>> 
>>> The _dl_runtime_resolve will be responsible for allocating on the stack the
>>> required space for the La_sve_regs and setting up the La_aarch64_regs and
>>> La_aarch64_retval internal pointers.  It has the advantage of allocating only
>>> the required space, and if we can distinguish whether the symbol uses SVE we
>>> can avoid the performance issue for symbols that do not.  The downside is
>>> that it could require a large stack space.
>>> 
>>>> 
>>>> However, that would require toolchain support and another hint in st_other, say STO_AARCH64_VARIANT_SVE, which separates SVE calls from other uses of STO_AARCH64_VARIANT_PCS. Then the runtime linker could populate lrv_sve, alongside the lrv_vreg information, with the size of the vector registers from the processor’s registers.
>>> 
>>> I think it should be feasible to assume for now that STO_AARCH64_VARIANT_PCS
>>> means SVE, which means that we can use it in _dl_runtime_resolve to skip
>>> the save/restore if the symbol follows the default standard procedure
>>> call.
>> 
>> Isn’t STO_AARCH64_VARIANT_PCS used for other kinds of calls as well, ones that require all the registers to be preserved?
>> I know that it would require some toolchain changes, but shouldn’t we define something like STO_AARCH64_VARIANT_SVE to differentiate the two?
> 
> Ideally yes, but it would require both compiler and binutils support.  I am not
> sure if ARM is willing to move forward, and even then it would take time.
> 
> I suggested assuming STO_AARCH64_VARIANT_PCS means SVE because, at least for
> the profiling PLT stub, I think we can check if hwcap supports SVE and then
> assume the symbol is indeed an SVE call.  It would work for the *currently*
> defined ABI and ISA, but if ARM starts to use it for something else, especially
> for newer architectures that also support SVE, it indeed might break.

Please double-check with someone more informed than I am; I’m almost certain someone told me that it is used somewhere else for something else.

-ben

> 
>> 
>>> 
>>>> 
>>>> There are at least two problems with that approach. 
>>>> 1) who allocates the lr_zreg pointers in the la_sve_regs and how long should they be? Do they always have to be allocated to be the max size 2048 bits?
>>>> 2) I hadn’t worked out how to handle functions that return things in the SVE regs. Do we need two new flags in st_other? STO_AARCH64_VARIANT_SVE and STO_AARCH64_VARIANT_SVERET?
>>> 
>>> I don't think we need the two extra flags to handle SVE calls, but I am
>>> also assuming that STO_AARCH64_VARIANT_PCS will be used solely for SVE.
>> 
>> Can you double-check with your ARM contacts regarding that? I’m fairly sure that someone in RH, ARM, or Linaro told me that VARIANT_PCS was also used in unusual situations other than SVE.
>> 
>>> And there is the extra problem that using two extra flags is not backward
>>> compatible.
>>> 
>>> I presume that if ARM wants to push for another procedure call variant
>>> on linux-gnu I would expect another flag.
>>> 
>>>> 
>>>> Then there was the question in the future could there be: big.LITTLE processors where some big processors had SVE registers of one length while the LITTLE processors had different ones? 
>>> 
>>> Although I find this kind of setup unlikely, my expectation is that it
>>> would be transparent to userland (either the kernel will emulate the required
>>> instructions, or it will bind processes with STO_AARCH64_VARIANT_PCS to cores
>>> with SVE).
>> 
>> Can you explain how that would work?
>> 
>> You launch a binary that has SVE from any processor. Somewhere along the way this means one of the exec syscalls handles it in the kernel. The kernel notices that it is an ELF file and hands it off to the ELF interpreter ld.so. Nowhere along the way would the kernel do a deep enough inspection of the binary to notice that it has SVE instructions in it and so how would it know to bind it to a processor which has HW SVE?
>> 
>> I also don’t see any places where ld.so can call back into the kernel and say “uh move me over to a processor that has these capabilities”. 
>> 
>> What I can imagine is that a LITTLE processor traps into the kernel when executing an SVE instruction, just like any illegal instruction. However, the handler looks at the instruction being executed, notices it is an SVE instruction, and instead of delivering a SIGILL to the process, it sets a flag in the task structure saying that this process uses SVE and migrates it to one of the big processors that can execute SVE.
> 
> Indeed, the STO_AARCH64_VARIANT_PCS would not help in this case.
> But you summarized what I think should happen: if a core executes
> an SVE instruction and traps, the kernel would either move the process to an
> SVE core with the pc reset, emulate the instruction, or throw a SIGILL.
> In any case, I would expect it to be transparent to userland.


Thread overview: 23+ messages
2021-06-16 17:55 John Mellor-Crummey
2021-06-17 19:42 ` Adhemerval Zanella
2021-06-17 20:09   ` Florian Weimer
2021-06-17 23:06     ` Adhemerval Zanella
2021-06-23 17:42       ` Ben Woodard
2021-07-30 14:58         ` Adhemerval Zanella
2021-07-30 18:59           ` Ben Woodard
2021-07-30 21:09             ` Adhemerval Zanella
2021-07-31  0:59               ` Ben Woodard [this message]
2021-08-04 18:11                 ` Adhemerval Zanella
2021-08-05 10:32                   ` Szabolcs Nagy
2021-08-05 19:36                     ` Ben Woodard
2021-08-06  9:04                       ` Szabolcs Nagy
2021-06-21 19:42     ` John Mellor-Crummey
2021-06-22  8:15       ` Florian Weimer
2021-06-22 15:04         ` John Mellor-Crummey
2021-06-22 15:36           ` Florian Weimer
2021-06-22 16:17             ` John Mellor-Crummey
2021-06-22 16:33               ` Adhemerval Zanella
2021-06-23  6:32                 ` Florian Weimer
2021-06-23 20:06                   ` Mark Krentel
2021-06-18 17:48   ` John Mellor-Crummey
2021-06-18 18:27     ` Adhemerval Zanella
