public inbox for binutils@sourceware.org
* x86-64: new CET-enabled PLT format proposal
@ 2022-02-27  3:18 Rui Ueyama
  2022-02-27 15:06 ` H.J. Lu
  0 siblings, 1 reply; 14+ messages in thread
From: Rui Ueyama @ 2022-02-27  3:18 UTC (permalink / raw)
  To: binutils

Hello,

I'd like to propose an alternative instruction sequence for the Intel
CET-enabled PLT section. Compared to the existing one, the new scheme is
simpler, more compact (16 bytes vs. 32 bytes for each PLT entry) and does not
require a separate second PLT section (.plt.sec).

Here is the proposed code sequence:

  PLT0:

  f3 0f 1e fa        // endbr64
  41 53              // push %r11
  ff 35 00 00 00 00  // push GOT[1]
  ff 25 00 00 00 00  // jmp *GOT[2]
  0f 1f 40 00        // nop
  0f 1f 40 00        // nop
  0f 1f 40 00        // nop
  66 90              // nop

  PLTn:

  f3 0f 1e fa        // endbr64
  41 bb 00 00 00 00  // mov $namen_reloc_index, %r11d
  ff 25 00 00 00 00  // jmp *GOT[namen_index]

GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
PLT entry is called for the first time, control passes to PLT0, which calls
the resolver function.
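
As a rough illustration of how a linker might fill in these templates, here is
a minimal Python sketch. It is not mold's actual code: the function names and
the example addresses are hypothetical, and it assumes 8-byte GOT slots with
GOT[1] and GOT[2] at offsets 8 and 16 from the GOT base. Only the byte
templates are taken from the listing above.

```python
import struct

# Byte templates copied from the listing above; displacement and
# immediate fields are left as zero and patched below.
PLT0 = bytes.fromhex(
    "f30f1efa"        # endbr64
    "4153"            # push %r11
    "ff3500000000"    # push GOT[1]
    "ff2500000000"    # jmp *GOT[2]
    "0f1f4000" "0f1f4000" "0f1f4000" "6690"  # nop padding up to 32 bytes
)
PLTN = bytes.fromhex(
    "f30f1efa"        # endbr64
    "41bb00000000"    # mov $reloc_index, %r11d
    "ff2500000000"    # jmp *GOT[n]
)


def rel32(target, next_insn):
    """RIP-relative displacement: target minus the next instruction's address."""
    return struct.pack("<i", target - next_insn)


def make_plt0(plt_addr, got_addr):
    """Hypothetical helper: patch the two GOT references in PLT0."""
    buf = bytearray(PLT0)
    buf[8:12] = rel32(got_addr + 8, plt_addr + 12)    # push GOT[1]
    buf[14:18] = rel32(got_addr + 16, plt_addr + 18)  # jmp *GOT[2]
    return bytes(buf)


def make_pltn(entry_addr, got_slot_addr, reloc_index):
    """Hypothetical helper: patch the relocation index and the GOT slot."""
    buf = bytearray(PLTN)
    buf[6:10] = struct.pack("<I", reloc_index)          # imm32 of the mov
    buf[12:16] = rel32(got_slot_addr, entry_addr + 16)  # jmp *GOT[n]
    return bytes(buf)
```

For example, `make_pltn(0x1020, 0x4018, 3)` would produce a 16-byte entry
whose mov loads 3 into %r11d and whose jmp goes through the slot at 0x4018.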

It uses %r11 as a scratch register. The x86-64 psABI explicitly allows PLT
entries to clobber this register (*1), and the resolver function
(_dl_runtime_resolve) already clobbers it.

(*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
preserved, nor is it used to pass arguments. Making this register available as
scratch register means that code in the PLT need not spill any registers when
computing the address to which control needs to be transferred."

FYI, this is the current CET-enabled PLT:

  PLT0:

  ff 35 00 00 00 00    // push GOT[1]
  f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[2]
  0f 1f 00             // nop

  PLTn in .plt:

  f3 0f 1e fa          // endbr64
  68 00 00 00 00       // push $namen_reloc_index
  f2 e9 e1 ff ff ff    // bnd jmpq PLT0
  90                   // nop

  PLTn in .plt.sec:

  f3 0f 1e fa          // endbr64
  f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
  0f 1f 44 00 00       // nop

In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
the existing format, PLT0 is 16 bytes and each entry takes 32 bytes (16 bytes
in .plt plus 16 bytes in .plt.sec). Usually, we have many PLT entries but only
one header, so in practice, the proposed format is almost 50% smaller than the
existing one.
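
The arithmetic behind that estimate can be sketched in a few lines of Python.
The function names are made up for illustration; the byte counts are the ones
from the two listings above.

```python
def proposed_size(n):
    """Total PLT bytes in the proposed format: 32-byte PLT0 + 16 per entry."""
    return 32 + 16 * n


def existing_size(n):
    """Total PLT bytes in the existing format.

    .plt holds a 16-byte PLT0 plus 16 bytes per entry, and .plt.sec
    holds another 16 bytes per entry.
    """
    return (16 + 16 * n) + 16 * n


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        p, e = proposed_size(n), existing_size(n)
        print(f"{n:5d} entries: proposed {p:6d} B, "
              f"existing {e:6d} B, saving {1 - p / e:.1%}")
```

With a single entry the two formats come out the same size (48 bytes each);
the saving approaches 50% as the number of entries grows.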

The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
has been deprecated.

I have already implemented the proposed scheme in my linker
(https://github.com/rui314/mold), and it appears to work fine.

Any thoughts?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: x86-64: new CET-enabled PLT format proposal
  2022-02-27  3:18 x86-64: new CET-enabled PLT format proposal Rui Ueyama
@ 2022-02-27 15:06 ` H.J. Lu
  2022-02-28  3:46   ` Rui Ueyama
  2022-03-01 10:35   ` Florian Weimer
  0 siblings, 2 replies; 14+ messages in thread
From: H.J. Lu @ 2022-02-27 15:06 UTC (permalink / raw)
  To: Rui Ueyama, Andi Kleen, x86-64-abi; +Cc: Binutils

On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
<binutils@sourceware.org> wrote:
>
> Hello,
>
> I'd like to propose an alternative instruction sequence for the Intel
> CET-enabled PLT section. Compared to the existing one, the new scheme is
> simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not
> require a separate second PLT section (.plt.sec).
>
> Here is the proposed code sequence:
>
>   PLT0:
>
>   f3 0f 1e fa        // endbr64
>   41 53              // push %r11
>   ff 35 00 00 00 00  // push GOT[1]
>   ff 25 00 00 00 00  // jmp *GOT[2]
>   0f 1f 40 00        // nop
>   0f 1f 40 00        // nop
>   0f 1f 40 00        // nop
>   66 90              // nop
>
>   PLTn:
>
>   f3 0f 1e fa        // endbr64
>   41 bb 00 00 00 00  // mov $namen_reloc_index %r11d
>   ff 25 00 00 00 00  // jmp *GOT[namen_index]

All PLT calls will have an extra MOV.

> GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
> PLT entry is called for the first time, the control is passed to PLT0 to call
> the resolver function.
>
> It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries
> to clobber this register (*1), and the resolve function (__dl_runtime_resolve)
> already clobbers it.
>
> (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
> preserved, nor is it used to pass arguments. Making this register available as
> scratch register means that code in the PLT need not spill any registers when
> computing the address to which control needs to be transferred."
>
> FYI, this is the current CET-enabled PLT:
>
>   PLT0:
>
>   ff 35 00 00 00 00    // push GOT[0]
>   f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1]
>   0f 1f 00             // nop
>
>   PLTn in .plt:
>
>   f3 0f 1e fa          // endbr64
>   68 00 00 00 00       // push $namen_reloc_index
>   f2 e9 e1 ff ff ff    // bnd jmpq PLT0
>   90                   // nop
>
>   PLTn in .plt.sec:
>
>   f3 0f 1e fa          // endbr64
>   f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
>   0f 1f 44 00 00       // nop
>
> In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
> the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
> have many PLT sections while we have only one header, so in practice, the
> proposed format is almost 50% smaller than the existing one.

Does it have any impact on performance?   .plt.sec can be placed
in a different page from .plt.

> The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
> has been deprecated.
>
> I already implemented the proposed scheme to my linker
> (https://github.com/rui314/mold) and it looks like it's working fine.
>
> Any thoughts?

I'd like to see visible performance improvements or new features in
a new PLT layout.

I cced x86-64 psABI mailing list.


-- 
H.J.


* Re: x86-64: new CET-enabled PLT format proposal
  2022-02-27 15:06 ` H.J. Lu
@ 2022-02-28  3:46   ` Rui Ueyama
  2022-03-01  0:04     ` H.J. Lu
  2022-03-01 10:35   ` Florian Weimer
  1 sibling, 1 reply; 14+ messages in thread
From: Rui Ueyama @ 2022-02-28  3:46 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Andi Kleen, x86-64-abi, Binutils

On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
> <binutils@sourceware.org> wrote:
> >
> > Hello,
> >
> > I'd like to propose an alternative instruction sequence for the Intel
> > CET-enabled PLT section. Compared to the existing one, the new scheme is
> > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not
> > require a separate second PLT section (.plt.sec).
> >
> > Here is the proposed code sequence:
> >
> >   PLT0:
> >
> >   f3 0f 1e fa        // endbr64
> >   41 53              // push %r11
> >   ff 35 00 00 00 00  // push GOT[1]
> >   ff 25 00 00 00 00  // jmp *GOT[2]
> >   0f 1f 40 00        // nop
> >   0f 1f 40 00        // nop
> >   0f 1f 40 00        // nop
> >   66 90              // nop
> >
> >   PLTn:
> >
> >   f3 0f 1e fa        // endbr64
> >   41 bb 00 00 00 00  // mov $namen_reloc_index %r11d
> >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
>
> All PLT calls will have an extra MOV.

One extra load-immediate mov instruction is executed per function call
through a PLT entry. The cost is so small that I couldn't see any
difference in real-world apps.

> > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
> > PLT entry is called for the first time, the control is passed to PLT0 to call
> > the resolver function.
> >
> > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries
> > to clobber this register (*1), and the resolve function (__dl_runtime_resolve)
> > already clobbers it.
> >
> > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
> > preserved, nor is it used to pass arguments. Making this register available as
> > scratch register means that code in the PLT need not spill any registers when
> > computing the address to which control needs to be transferred."
> >
> > FYI, this is the current CET-enabled PLT:
> >
> >   PLT0:
> >
> >   ff 35 00 00 00 00    // push GOT[0]
> >   f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1]
> >   0f 1f 00             // nop
> >
> >   PLTn in .plt:
> >
> >   f3 0f 1e fa          // endbr64
> >   68 00 00 00 00       // push $namen_reloc_index
> >   f2 e9 e1 ff ff ff    // bnd jmpq PLT0
> >   90                   // nop
> >
> >   PLTn in .plt.sec:
> >
> >   f3 0f 1e fa          // endbr64
> >   f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
> >   0f 1f 44 00 00       // nop
> >
> > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
> > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
> > have many PLT sections while we have only one header, so in practice, the
> > proposed format is almost 50% smaller than the existing one.
>
> Does it have any impact on performance?   .plt.sec can be placed
> in a different page from .plt.
>
> > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
> > has been deprecated.
> >
> > I already implemented the proposed scheme to my linker
> > (https://github.com/rui314/mold) and it looks like it's working fine.
> >
> > Any thoughts?
>
> I'd like to see visible performance improvements or new features in
> a new PLT layout.

I didn't see any visible performance improvement with real-world apps.
I might be able to craft a microbenchmark to hammer PLT entries really
hard in some pattern to see some difference, but I think that doesn't
make much sense. The size reduction is for real though.

> I cced x86-64 psABI mailing list.
>
>
> --
> H.J.


* Re: x86-64: new CET-enabled PLT format proposal
  2022-02-28  3:46   ` Rui Ueyama
@ 2022-03-01  0:04     ` H.J. Lu
  2022-03-01  0:30       ` Rui Ueyama
  2022-03-01  9:16       ` Joao Moreira
  0 siblings, 2 replies; 14+ messages in thread
From: H.J. Lu @ 2022-03-01  0:04 UTC (permalink / raw)
  To: Rui Ueyama, Moreira, Joao; +Cc: Andi Kleen, x86-64-abi, Binutils

On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote:
>
> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
> > <binutils@sourceware.org> wrote:
> > >
> > > Hello,
> > >
> > > I'd like to propose an alternative instruction sequence for the Intel
> > > CET-enabled PLT section. Compared to the existing one, the new scheme is
> > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not
> > > require a separate second PLT section (.plt.sec).
> > >
> > > Here is the proposed code sequence:
> > >
> > >   PLT0:
> > >
> > >   f3 0f 1e fa        // endbr64
> > >   41 53              // push %r11
> > >   ff 35 00 00 00 00  // push GOT[1]
> > >   ff 25 00 00 00 00  // jmp *GOT[2]
> > >   0f 1f 40 00        // nop
> > >   0f 1f 40 00        // nop
> > >   0f 1f 40 00        // nop
> > >   66 90              // nop
> > >
> > >   PLTn:
> > >
> > >   f3 0f 1e fa        // endbr64
> > >   41 bb 00 00 00 00  // mov $namen_reloc_index %r11d
> > >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
> >
> > All PLT calls will have an extra MOV.
>
> One extra load-immediate mov instruction is executed per a function
> call through a PLT entry. It's so tiny that I couldn't see any
> difference in real-world apps.
>
> > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
> > > PLT entry is called for the first time, the control is passed to PLT0 to call
> > > the resolver function.
> > >
> > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries
> > > to clobber this register (*1), and the resolve function (__dl_runtime_resolve)
> > > already clobbers it.
> > >
> > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
> > > preserved, nor is it used to pass arguments. Making this register available as
> > > scratch register means that code in the PLT need not spill any registers when
> > > computing the address to which control needs to be transferred."
> > >
> > > FYI, this is the current CET-enabled PLT:
> > >
> > >   PLT0:
> > >
> > >   ff 35 00 00 00 00    // push GOT[0]
> > >   f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1]
> > >   0f 1f 00             // nop
> > >
> > >   PLTn in .plt:
> > >
> > >   f3 0f 1e fa          // endbr64
> > >   68 00 00 00 00       // push $namen_reloc_index
> > >   f2 e9 e1 ff ff ff    // bnd jmpq PLT0
> > >   90                   // nop
> > >
> > >   PLTn in .plt.sec:
> > >
> > >   f3 0f 1e fa          // endbr64
> > >   f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
> > >   0f 1f 44 00 00       // nop
> > >
> > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
> > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
> > > have many PLT sections while we have only one header, so in practice, the
> > > proposed format is almost 50% smaller than the existing one.
> >
> > Does it have any impact on performance?   .plt.sec can be placed
> > in a different page from .plt.
> >
> > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
> > > has been deprecated.
> > >
> > > I already implemented the proposed scheme to my linker
> > > (https://github.com/rui314/mold) and it looks like it's working fine.
> > >
> > > Any thoughts?
> >
> > I'd like to see visible performance improvements or new features in
> > a new PLT layout.
>
> I didn't see any visible performance improvement with real-world apps.
> I might be able to craft a microbenchmark to hammer PLT entries really
> hard in some pattern to see some difference, but I think that doesn't
> make much sense. The size reduction is for real though.

I am aware that there are 2 other proposals to use R11 in PLT/function
call.   But they are introducing new features.  I don't think we should
use R11 in PLT without any real performance improvements.

> > I cced x86-64 psABI mailing list.
> >
> >
> > --
> > H.J.



-- 
H.J.


* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01  0:04     ` H.J. Lu
@ 2022-03-01  0:30       ` Rui Ueyama
  2022-03-01  2:22         ` Fangrui Song
  2022-03-01  9:16       ` Joao Moreira
  1 sibling, 1 reply; 14+ messages in thread
From: Rui Ueyama @ 2022-03-01  0:30 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Moreira, Joao, Andi Kleen, x86-64-abi, Binutils

I think size reduction matters to some users even if you do not care
about it that much. But I'm not trying too hard to push GNU binutils
to adopt it. I just wanted to let you guys know that we invented a
compact (and we believe better) instruction sequence for the
CET-enabled PLT and we are already using it.


* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01  0:30       ` Rui Ueyama
@ 2022-03-01  2:22         ` Fangrui Song
  0 siblings, 0 replies; 14+ messages in thread
From: Fangrui Song @ 2022-03-01  2:22 UTC (permalink / raw)
  To: Rui Ueyama; +Cc: H.J. Lu, x86-64-abi, Andi Kleen, Binutils, Moreira, Joao

On 2022-03-01, Rui Ueyama via Binutils wrote:
>I think size reduction matters to some users even if you do not care
>about that that much. But I'm not trying too hard to push GNU binutils
>to adopt it. I just wanted to let you guys know that we invented a
>compact (and we believe better) instruction sequence for the
>CET-enabled PLT and we are already using it.
>
>On Tue, Mar 1, 2022 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote:
>> >
>> > On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> > >
>> > > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
>> > > <binutils@sourceware.org> wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I'd like to propose an alternative instruction sequence for the Intel
>> > > > CET-enabled PLT section. Compared to the existing one, the new scheme is
>> > > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not
>> > > > require a separate second PLT section (.plt.sec).
>> > > >
>> > > > Here is the proposed code sequence:
>> > > >
>> > > >   PLT0:
>> > > >
>> > > >   f3 0f 1e fa        // endbr64
>> > > >   41 53              // push %r11
>> > > >   ff 35 00 00 00 00  // push GOT[1]
>> > > >   ff 25 00 00 00 00  // jmp *GOT[2]
>> > > >   0f 1f 40 00        // nop
>> > > >   0f 1f 40 00        // nop
>> > > >   0f 1f 40 00        // nop
>> > > >   66 90              // nop
>> > > >
>> > > >   PLTn:
>> > > >
>> > > >   f3 0f 1e fa        // endbr64
>> > > >   41 bb 00 00 00 00  // mov $namen_reloc_index %r11d
>> > > >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
>> > >
>> > > All PLT calls will have an extra MOV.
>> >
>> > One extra load-immediate mov instruction is executed per a function
>> > call through a PLT entry. It's so tiny that I couldn't see any
>> > difference in real-world apps.
>> >
>> > > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
>> > > > PLT entry is called for the first time, the control is passed to PLT0 to call
>> > > > the resolver function.
>> > > >
>> > > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries
>> > > > to clobber this register (*1), and the resolve function (__dl_runtime_resolve)
>> > > > already clobbers it.
>> > > >
>> > > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
>> > > > preserved, nor is it used to pass arguments. Making this register available as
>> > > > scratch register means that code in the PLT need not spill any registers when
>> > > > computing the address to which control needs to be transferred."
>> > > >
>> > > > FYI, this is the current CET-enabled PLT:
>> > > >
>> > > >   PLT0:
>> > > >
>> > > >   ff 35 00 00 00 00    // push GOT[0]
>> > > >   f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1]
>> > > >   0f 1f 00             // nop
>> > > >
>> > > >   PLTn in .plt:
>> > > >
>> > > >   f3 0f 1e fa          // endbr64
>> > > >   68 00 00 00 00       // push $namen_reloc_index
>> > > >   f2 e9 e1 ff ff ff    // bnd jmpq PLT0
>> > > >   90                   // nop
>> > > >
>> > > >   PLTn in .plt.sec:
>> > > >
>> > > >   f3 0f 1e fa          // endbr64
>> > > >   f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
>> > > >   0f 1f 44 00 00       // nop
>> > > >
>> > > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
>> > > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
>> > > > have many PLT sections while we have only one header, so in practice, the
>> > > > proposed format is almost 50% smaller than the existing one.
>> > >
>> > > Does it have any impact on performance?   .plt.sec can be placed
>> > > in a different page from .plt.
>> > >
>> > > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
>> > > > has been deprecated.
>> > > >
>> > > > I already implemented the proposed scheme to my linker
>> > > > (https://github.com/rui314/mold) and it looks like it's working fine.
>> > > >
>> > > > Any thoughts?
>> > >
>> > > I'd like to see visible performance improvements or new features in
>> > > a new PLT layout.
>> >
>> > I didn't see any visible performance improvement with real-world apps.
>> > I might be able to craft a microbenchmark to hammer PLT entries really
>> > hard in some pattern to see some difference, but I think that doesn't
>> > make much sense. The size reduction is for real though.
>>
>> I am aware that there are 2 other proposals to use R11 in PLT/function
>> call.   But they are introducing new features.  I don't think we should
>> use R11 in PLT without any real performance improvements.

I like the proposal.  Its merits are a simpler implementation and smaller
code size, plus two less obvious ones: (a) linker script users won't need
to mention .plt.sec, and (b) tools can identify PLTs with a more unified
approach, as on other architectures.

>> > > I cced x86-64 psABI mailing list.
>> > >
>> > >
>> > > --
>> > > H.J.
>>
>>
>>
>> --
>> H.J.


* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01  0:04     ` H.J. Lu
  2022-03-01  0:30       ` Rui Ueyama
@ 2022-03-01  9:16       ` Joao Moreira
  2022-03-01  9:25         ` Rui Ueyama
  1 sibling, 1 reply; 14+ messages in thread
From: Joao Moreira @ 2022-03-01  9:16 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Rui Ueyama, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i

On 2022-02-28 16:04, H.J. Lu wrote:
> On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote:
>> 
>> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> >
>> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
>> > <binutils@sourceware.org> wrote:
>> > >
>> > > Hello,
>> > >
>> > > I'd like to propose an alternative instruction sequence for the Intel
>> > > CET-enabled PLT section. Compared to the existing one, the new scheme is
>> > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not
>> > > require a separate second PLT section (.plt.sec).
>> > >
>> > > Here is the proposed code sequence:
>> > >
>> > >   PLT0:
>> > >
>> > >   f3 0f 1e fa        // endbr64
>> > >   41 53              // push %r11
>> > >   ff 35 00 00 00 00  // push GOT[1]
>> > >   ff 25 00 00 00 00  // jmp *GOT[2]
>> > >   0f 1f 40 00        // nop
>> > >   0f 1f 40 00        // nop
>> > >   0f 1f 40 00        // nop
>> > >   66 90              // nop
>> > >
>> > >   PLTn:
>> > >
>> > >   f3 0f 1e fa        // endbr64
>> > >   41 bb 00 00 00 00  // mov $namen_reloc_index %r11d
>> > >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
>> >
>> > All PLT calls will have an extra MOV.
>> 
>> One extra load-immediate mov instruction is executed per a function
>> call through a PLT entry. It's so tiny that I couldn't see any
>> difference in real-world apps.

(also replying to Fangrui, whose e-mail, for whatever reason, did not 
come to this mailbox).

I can see the benefits of having single 16-byte PLT entries. Yet, the 
R11 clobbering on every PLT transition is not amusing... If we want PLT 
entries to have only 16 bytes and not have a .plt.sec section, maybe we 
could try:

<plt_header>
pop %r11
sub $plt_header, %r11
shr $0x4, %r11
push %r11
jmp _dl_runtime_resolve_shstk_thunk

<foo>:
endbr // 4b
jmp GOT[foo] // 6b
call plt_header // 5b

Here, the PLT entry is 16 bytes long, and its call pushes an address within 
the entry onto the stack. The plt_header pops that address and recovers 
the index by subtracting the PLT base address and dividing by 16. The 
final step, to make this shstk compatible, is jumping to a special 
implementation of _dl_runtime_resolve (shstk_thunk), which contains the 
following snippet (similar to glibc's __longjmp):

testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET
jz 1f
mov $1, %r11
incsspq %r11
1:
jmp _dl_runtime_resolve

I don't think the above test fits alongside the other instructions in 
the plt_header if we want it to be 32 bytes at most, hence the suggestion 
to put it in a _dl_runtime_resolve thunk. Another possibility is to 
resolve the relocation to the special thunk only if shstk is enabled and 
otherwise resolve it directly to _dl_runtime_resolve, avoiding the extra 
overhead when shstk is absent.

I think this solves both the size and the dummy mov overheads. The logic 
is a bit more convoluted, but perhaps we can work on making it simpler. 
FWIW, I have neither tested nor implemented any of this.

Ah, also, pardon any asm mistakes/obvious details that I may have missed 
:)


* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01  9:16       ` Joao Moreira
@ 2022-03-01  9:25         ` Rui Ueyama
  2022-03-01  9:27           ` Joao Moreira
  0 siblings, 1 reply; 14+ messages in thread
From: Rui Ueyama @ 2022-03-01  9:25 UTC (permalink / raw)
  To: Joao Moreira; +Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i

On Tue, Mar 1, 2022 at 6:17 PM Joao Moreira <joao@overdrivepizza.com> wrote:
>
> On 2022-02-28 16:04, H.J. Lu wrote:
> > On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote:
> >>
> >> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> >
> >> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
> >> > <binutils@sourceware.org> wrote:
> >> > >
> >> > > Hello,
> >> > >
> >> > > I'd like to propose an alternative instruction sequence for the Intel
> >> > > CET-enabled PLT section. Compared to the existing one, the new scheme is
> >> > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not
> >> > > require a separate second PLT section (.plt.sec).
> >> > >
> >> > > Here is the proposed code sequence:
> >> > >
> >> > >   PLT0:
> >> > >
> >> > >   f3 0f 1e fa        // endbr64
> >> > >   41 53              // push %r11
> >> > >   ff 35 00 00 00 00  // push GOT[1]
> >> > >   ff 25 00 00 00 00  // jmp *GOT[2]
> >> > >   0f 1f 40 00        // nop
> >> > >   0f 1f 40 00        // nop
> >> > >   0f 1f 40 00        // nop
> >> > >   66 90              // nop
> >> > >
> >> > >   PLTn:
> >> > >
> >> > >   f3 0f 1e fa        // endbr64
> >> > >   41 bb 00 00 00 00  // mov $namen_reloc_index %r11d
> >> > >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
> >> >
> >> > All PLT calls will have an extra MOV.
> >>
> >> One extra load-immediate mov instruction is executed per a function
> >> call through a PLT entry. It's so tiny that I couldn't see any
> >> difference in real-world apps.
>
> (also replying to Fangrui, whose e-mail, for whatever reason, did not
> come to this mailbox).
>
> I can see the benefits of having 16 byte/single plt entries. Yet, the
> R11 clobbering on every PLT transition is not amusing... If we want PLT
> entries to have only 16 bytes and not have a sec.plt section, maybe we
> could try:
>
> <plt_header>
> pop %r11
> sub %r11d, plt_header
> shr $0x5, %r11
> push %r11
> jmp _dl_runtime_resolve_shstk_thunk
>
> <foo>:
> endbr // 4b
> jmp GOT[foo] // 6b
> call plt_header // 5b

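This is what I tried first, but I then realized that I needed to insert
another `endbr` between `jmp` and `call`. With CET enabled, the indirect
`jmp GOT[foo]` can only land on an `endbr`, so before the symbol is
resolved it can't jump directly to the following `call`. With the extra
4-byte `endbr`, the entry grows to 19 bytes and no longer fits in 16.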
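
To make the landing-pad problem concrete, here is a small Python sketch. The
entry encoding is my own reconstruction of the quoted idea (with zeroed
displacements and one padding byte), and the helper name is hypothetical; it
only checks whether a given offset starts with the endbr64 bytes.

```python
ENDBR64 = bytes.fromhex("f30f1efa")

# 16-byte entry reconstructed from the quoted idea, displacements zeroed:
ENTRY = bytes.fromhex(
    "f30f1efa"        # endbr64
    "ff2500000000"    # jmp *GOT[foo]
    "e800000000"      # call plt_header
    "90"              # padding
)


def is_valid_ibt_target(code, offset):
    """An indirect branch under IBT must land on an endbr64 instruction."""
    return code[offset:offset + 4] == ENDBR64


# Offset of the `call`, where an unresolved GOT[foo] would have to point.
LAZY_TARGET = 10

if __name__ == "__main__":
    print("entry start ok:", is_valid_ibt_target(ENTRY, 0))
    print("lazy target ok:", is_valid_ibt_target(ENTRY, LAZY_TARGET))
```

The start of the entry passes the check, while the `call` at offset 10 does
not, which is why the unresolved GOT slot has nowhere legal to point within
the entry.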

> Here, the plt entry has 16 bytes and it pushes the PLT entry address to
> the stack by calling it. The address is then popped in the plt_header
> and worked to retrieve the index by subbing the plt offset from the
> address and then dividing it by 16. Then, the final step to make it
> shstk compatible is jmping to a special implementation of
> _dl_runtime_resolve (shstk_thnk) which will have the following snippet
> (similarly to glibc's __longjmp):
>
> testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET
> jz 1
> mov $1, %r11
> incsspq %r11
> 1:
> jmp _dl_runtime_resolve
>
> I don't think the above test fits alongside the other instructions in
> the plt_header if we want it to be 32 bytes at most, hence the suggestion
> to have it as a _dl_runtime_resolve thunk. Another possibility is to
> resolve the relocation to the special thunk only if shstk is in place
> and otherwise resolve it directly to _dl_runtime_resolve, avoiding the
> thunk's overhead in the absence of shstk.
>
> I think this solves both the size and the dummy mov overheads. The logic
> is a bit more convoluted, but perhaps we can work on making it simpler.
> FWIW, I did not test nor implement anything.
>
> Ah, also, pardon any asm mistakes/obvious details that I may have missed
> :)
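
The index recovery described in the quoted proposal can be sketched in Python as plain address arithmetic. This is only an illustration: the 32-byte header size is an assumption based on the "32 bytes at most" remark, and all names below are hypothetical.

```python
PLT_ENTRY_SIZE = 16
PLT_HEADER_SIZE = 32          # assumption: header occupies two 16-byte slots
CALL_END_OFFSET = 4 + 6 + 5   # endbr + jmp + call; the return address is entry + 15


def return_address(plt_base: int, index: int) -> int:
    """Address pushed by `call plt_header` in the index-th PLT entry."""
    entry = plt_base + PLT_HEADER_SIZE + index * PLT_ENTRY_SIZE
    return entry + CALL_END_OFFSET


def recover_slot(plt_base: int, ret_addr: int) -> int:
    """Subtract the PLT base and divide by 16 (shift right by 4)."""
    return (ret_addr - plt_base) >> 4


def recover_index(plt_base: int, ret_addr: int) -> int:
    """Correct the raw slot number for the slots taken by the header."""
    return recover_slot(plt_base, ret_addr) - PLT_HEADER_SIZE // PLT_ENTRY_SIZE


base = 0x1000
for i in (0, 1, 5):
    ret = return_address(base, i)
    print(i, hex(ret), recover_slot(base, ret), recover_index(base, ret))
```

Under these assumptions the raw quotient is off by the number of slots the header occupies, so the header (or the resolver) would have to subtract that as well. Shifting by 5 instead of 4 would map entries 0 and 1 to the same value.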

* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01  9:25         ` Rui Ueyama
@ 2022-03-01  9:27           ` Joao Moreira
  2022-03-01  9:32             ` Rui Ueyama
  0 siblings, 1 reply; 14+ messages in thread
From: Joao Moreira @ 2022-03-01  9:27 UTC (permalink / raw)
  To: Rui Ueyama; +Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i

> This is what I tried first but I then realized that I needed to insert
> another `endbr` between `jmp` and `call`. `jmp GOT[foo]` can jump only
> to `endbr` if CET is enabled, so it can't directly jump to the
> following `call`.
> 
Ugh, there we go... dead. Thanks for not letting me waste a ton of time 
:)

* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01  9:27           ` Joao Moreira
@ 2022-03-01  9:32             ` Rui Ueyama
  2022-03-01  9:45               ` Joao Moreira
  0 siblings, 1 reply; 14+ messages in thread
From: Rui Ueyama @ 2022-03-01  9:32 UTC (permalink / raw)
  To: Joao Moreira; +Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i

On Tue, Mar 1, 2022 at 6:27 PM Joao Moreira <joao@overdrivepizza.com> wrote:
>
> > This is what I tried first but I then realized that I needed to insert
> > another `endbr` between `jmp` and `call`. `jmp GOT[foo]` can jump only
> > to `endbr` if CET is enabled, so it can't directly jump to the
> > following `call`.
> >
> Ugh, there we go... dead. Thanks for not letting me waste a ton of time
> :)

I actually wasted my time by implementing it only to find that it
wouldn't work. :) If you are interested, this is my commit to my linker.
https://github.com/rui314/mold/commit/4ec0bbf04841e514aca2000f3d780d14efcaefc9

* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01  9:32             ` Rui Ueyama
@ 2022-03-01  9:45               ` Joao Moreira
  2022-03-01  9:48                 ` Rui Ueyama
  0 siblings, 1 reply; 14+ messages in thread
From: Joao Moreira @ 2022-03-01  9:45 UTC (permalink / raw)
  To: Rui Ueyama; +Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i

On 2022-03-01 01:32, Rui Ueyama wrote:
> On Tue, Mar 1, 2022 at 6:27 PM Joao Moreira <joao@overdrivepizza.com> 
> wrote:
>> 
>> > This is what I tried first but I then realized that I needed to insert
>> > another `endbr` between `jmp` and `call`. `jmp GOT[foo]` can jump only
>> > to `endbr` if CET is enabled, so it can't directly jump to the
>> > following `call`.
>> >
>> Ugh, there we go... dead. Thanks for not letting me waste a ton of 
>> time
>> :)
> 
> I actually wasted my time by implementing it only to find that it
> wouldn't work. :) If you are interested, this is my commit to my 
> linker.
> https://github.com/rui314/mold/commit/4ec0bbf04841e514aca2000f3d780d14efcaefc9

I'm glad I posted it here before trying to go and implement :)

Regarding the projects mentioned by HJ, I assume one of them is this (in 
case you are curious):

https://static.sched.com/hosted_files/lssna2021/8f/LSS_FINEIBT_JOAOMOREIRA.pdf

In FineIBT we use R11 to pass hashes around through direct calls to 
enable fine-grain CFI on top of IBT.

* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01  9:45               ` Joao Moreira
@ 2022-03-01  9:48                 ` Rui Ueyama
  0 siblings, 0 replies; 14+ messages in thread
From: Rui Ueyama @ 2022-03-01  9:48 UTC (permalink / raw)
  To: Joao Moreira; +Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i

Thank you for sharing the slide!

As to our usage of r11, always clobbering r11 doesn't look pretty
indeed. But I couldn't observe any performance difference by doing
this. I think I can explain why. It's because there's no data
dependency on r11 on function entry. r11 is not expected to be
preserved across a function call, and it's not used for passing an
argument. So no one would read the value that we write to r11 in
PLT[n] (except the code in PLT0). So it cannot cause a pipeline stall
and is thus very cheap, if not free.
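
For reference, here is a rough Python sketch of how a linker might emit one PLT[n] entry in the proposed format. The function name and the example addresses are made up for illustration. The opcode bytes are the ones from the proposal, and the `mov` is the only instruction in the entry that touches r11.

```python
import struct

PLT_ENTRY_SIZE = 16


def encode_plt_entry(plt_entry_addr: int, got_slot_addr: int, reloc_index: int) -> bytes:
    """Encode endbr64; mov $reloc_index, %r11d; jmp *got_slot(%rip)."""
    endbr64 = bytes.fromhex("f30f1efa")
    # mov imm32 -> %r11d. Nothing on the already-resolved path reads this value.
    mov_r11d = bytes.fromhex("41bb") + struct.pack("<I", reloc_index)
    # The RIP-relative displacement is measured from the end of the jmp,
    # which is also the end of the 16-byte entry.
    disp = got_slot_addr - (plt_entry_addr + PLT_ENTRY_SIZE)
    jmp_got = bytes.fromhex("ff25") + struct.pack("<i", disp)
    return endbr64 + mov_r11d + jmp_got


# Hypothetical addresses, chosen only to exercise the encoder.
entry = encode_plt_entry(plt_entry_addr=0x1020, got_slot_addr=0x4018, reloc_index=3)
print(entry.hex(" "))
```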

On Tue, Mar 1, 2022 at 6:45 PM Joao Moreira <joao@overdrivepizza.com> wrote:
>
> On 2022-03-01 01:32, Rui Ueyama wrote:
> > On Tue, Mar 1, 2022 at 6:27 PM Joao Moreira <joao@overdrivepizza.com>
> > wrote:
> >>
> >> > This is what I tried first but I then realized that I needed to insert
> >> > another `endbr` between `jmp` and `call`. `jmp GOT[foo]` can jump only
> >> > to `endbr` if CET is enabled, so it can't directly jump to the
> >> > following `call`.
> >> >
> >> Ugh, there we go... dead. Thanks for not letting me waste a ton of
> >> time
> >> :)
> >
> > I actually wasted my time by implementing it only to find that it
> > wouldn't work. :) If you are interested, this is my commit to my
> > linker.
> > https://github.com/rui314/mold/commit/4ec0bbf04841e514aca2000f3d780d14efcaefc9
>
> I'm glad I posted it here before trying to go and implement :)
>
> Regarding the projects mentioned by HJ, I assume one of them is this (in
> case you are curious):
>
> https://static.sched.com/hosted_files/lssna2021/8f/LSS_FINEIBT_JOAOMOREIRA.pdf
>
> In FineIBT we use R11 to pass hashes around through direct calls to
> enable fine-grain CFI on top of IBT.

* Re: x86-64: new CET-enabled PLT format proposal
  2022-02-27 15:06 ` H.J. Lu
  2022-02-28  3:46   ` Rui Ueyama
@ 2022-03-01 10:35   ` Florian Weimer
  2022-03-01 22:16     ` Fangrui Song
  1 sibling, 1 reply; 14+ messages in thread
From: Florian Weimer @ 2022-03-01 10:35 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Rui Ueyama, Andi Kleen, x86-64-abi, Binutils

I do wonder if time is better spent on making symbol binding faster in
general, and eliminating the semantic difference between BIND_NOW and lazy
binding (like musl has done, albeit in an IFUNC-less context).

An example of the current performance issues:

  ld.so has poor performance characteristics when loading large
  quantities of .so files
  <https://sourceware.org/bugzilla/show_bug.cgi?id=27695>

I'm not suggesting we bring back prelink.  There must be other
approaches to make binding go faster.

Thanks,
Florian


* Re: x86-64: new CET-enabled PLT format proposal
  2022-03-01 10:35   ` Florian Weimer
@ 2022-03-01 22:16     ` Fangrui Song
  0 siblings, 0 replies; 14+ messages in thread
From: Fangrui Song @ 2022-03-01 22:16 UTC (permalink / raw)
  To: Florian Weimer; +Cc: H.J. Lu, Andi Kleen, Binutils, x86-64-abi

On 2022-03-01, Florian Weimer via Binutils wrote:
>I do wonder if time is better spent on making symbol binding faster in
>general, and eliminating the semantic difference between BIND_NOW and lazy
>binding (like musl has done, albeit in an IFUNC-less context).
>
>An example of the current performance issues:
>
>  ld.so has poor performance characteristics when loading large
>  quantities of .so files
>  <https://sourceware.org/bugzilla/show_bug.cgi?id=27695>
>
>I'm not suggesting we bring back prelink.  There must be other
>approaches to make binding go faster.
>
>Thanks,
>Florian
>

Improving symbol binding performance will definitely help and be
appreciated by companies deploying large dynamically linked executables.
They may run into a situation with O(1000) direct/indirect DT_NEEDED
shared objects.  I remember that this can take more than one minute.

In
https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/google/grte/v5-2.27/master ,
Google uses a fastload patch
https://sourceware.org/git/?p=glibc.git;a=commit;h=af63681769182a8e29568088d6c9cd3c916b22f9
(I haven't tried reading it).

For more traditional desktop/server applications,
I think we should shift to a direct binding model (Solaris direct binding,
Mac OS X two-level namespace)
https://maskray.me/blog/2021-05-16-elf-interposition-and-bsymbolic#the-last-alliance-of-elf-and-men
Definitions in shared objects don't need the costly symbol lookup.
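
As a toy illustration of why the lookup cost differs, the Python sketch below counts how many objects are probed per symbol under a flat search in load order versus a lookup that already knows which object to consult. It is a deliberately simplified model with made-up names. Real dynamic loaders use per-object hash tables and other shortcuts, so only the scaling with the number of objects is meant to carry over.

```python
def flat_lookup(objects, symbol):
    """Search every loaded object in load order; return (definer, probes)."""
    probes = 0
    for name, symbols in objects:
        probes += 1
        if symbol in symbols:
            return name, probes
    return None, probes


def direct_lookup(objects_by_name, bound_object, symbol):
    """Consult only the object recorded at link time; return (definer, probes)."""
    found = symbol in objects_by_name[bound_object]
    return (bound_object if found else None), 1


# 1000 hypothetical shared objects, each defining one symbol.
objects = [(f"lib{i}.so", {f"sym{i}"}) for i in range(1000)]
by_name = dict(objects)

print(flat_lookup(objects, "sym999"))
print(direct_lookup(by_name, "lib999.so", "sym999"))
```

In this model the flat search probes all 1000 objects to bind a symbol defined in the last one, while the direct lookup probes exactly one.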

Thread overview: 14+ messages
2022-02-27  3:18 x86-64: new CET-enabled PLT format proposal Rui Ueyama
2022-02-27 15:06 ` H.J. Lu
2022-02-28  3:46   ` Rui Ueyama
2022-03-01  0:04     ` H.J. Lu
2022-03-01  0:30       ` Rui Ueyama
2022-03-01  2:22         ` Fangrui Song
2022-03-01  9:16       ` Joao Moreira
2022-03-01  9:25         ` Rui Ueyama
2022-03-01  9:27           ` Joao Moreira
2022-03-01  9:32             ` Rui Ueyama
2022-03-01  9:45               ` Joao Moreira
2022-03-01  9:48                 ` Rui Ueyama
2022-03-01 10:35   ` Florian Weimer
2022-03-01 22:16     ` Fangrui Song
