* x86-64: new CET-enabled PLT format proposal
@ 2022-02-27  3:18 Rui Ueyama
  2022-02-27 15:06 ` H.J. Lu
  0 siblings, 1 reply; 14+ messages in thread

From: Rui Ueyama @ 2022-02-27 3:18 UTC (permalink / raw)
To: binutils

Hello,

I'd like to propose an alternative instruction sequence for the Intel
CET-enabled PLT section. Compared to the existing one, the new scheme is
simple, compact (16 bytes per PLT entry vs. 32 bytes today), and does not
require a separate second PLT section (.plt.sec).

Here is the proposed code sequence:

PLT0:

  f3 0f 1e fa        // endbr64
  41 53              // push %r11
  ff 35 00 00 00 00  // push GOT[1]
  ff 25 00 00 00 00  // jmp *GOT[2]
  0f 1f 40 00        // nop
  0f 1f 40 00        // nop
  0f 1f 40 00        // nop
  66 90              // nop

PLTn:

  f3 0f 1e fa        // endbr64
  41 bb 00 00 00 00  // mov $namen_reloc_index, %r11d
  ff 25 00 00 00 00  // jmp *GOT[namen_index]

GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
PLT entry is called for the first time, control is passed to PLT0 to call
the resolver function.

The scheme uses %r11 as a scratch register. The x86-64 psABI explicitly
allows PLT entries to clobber this register (*1), and the resolver function
(_dl_runtime_resolve) already clobbers it.

(*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to
be preserved, nor is it used to pass arguments. Making this register
available as scratch register means that code in the PLT need not spill any
registers when computing the address to which control needs to be
transferred."

FYI, this is the current CET-enabled PLT:

PLT0:

  ff 35 00 00 00 00     // push GOT[0]
  f2 ff 25 e3 2f 00 00  // bnd jmp *GOT[1]
  0f 1f 00              // nop

PLTn in .plt:

  f3 0f 1e fa           // endbr64
  68 00 00 00 00        // push $namen_reloc_index
  f2 e9 e1 ff ff ff     // bnd jmpq PLT0
  90                    // nop

PLTn in .plt.sec:

  f3 0f 1e fa           // endbr64
  f2 ff 25 ad 2f 00 00  // bnd jmpq *GOT[namen_index]
  0f 1f 44 00 00        // nop

In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes.
In the existing format, PLT0 is 16 bytes and each entry is 32 bytes (16 in
.plt plus 16 in .plt.sec). Usually, we have many PLT entries while we have
only one PLT header, so in practice, the proposed format is almost 50%
smaller than the existing one.

The proposed PLT does not use jump instructions with the BND prefix, as
Intel MPX has been deprecated.

I already implemented the proposed scheme in my linker
(https://github.com/rui314/mold), and it looks like it's working fine.

Any thoughts?

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-02-27 3:18 x86-64: new CET-enabled PLT format proposal Rui Ueyama @ 2022-02-27 15:06 ` H.J. Lu 2022-02-28 3:46 ` Rui Ueyama 2022-03-01 10:35 ` Florian Weimer 0 siblings, 2 replies; 14+ messages in thread From: H.J. Lu @ 2022-02-27 15:06 UTC (permalink / raw) To: Rui Ueyama, Andi Kleen, x86-64-abi; +Cc: Binutils On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils <binutils@sourceware.org> wrote: > > Hello, > > I'd like to propose an alternative instruction sequence for the Intel > CET-enabled PLT section. Compared to the existing one, the new scheme is > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not > require a separate second PLT section (.plt.sec). > > Here is the proposed code sequence: > > PLT0: > > f3 0f 1e fa // endbr64 > 41 53 // push %r11 > ff 35 00 00 00 00 // push GOT[1] > ff 25 00 00 00 00 // jmp *GOT[2] > 0f 1f 40 00 // nop > 0f 1f 40 00 // nop > 0f 1f 40 00 // nop > 66 90 // nop > > PLTn: > > f3 0f 1e fa // endbr64 > 41 bb 00 00 00 00 // mov $namen_reloc_index %r11d > ff 25 00 00 00 00 // jmp *GOT[namen_index] All PLT calls will have an extra MOV. > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a > PLT entry is called for the first time, the control is passed to PLT0 to call > the resolver function. > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries > to clobber this register (*1), and the resolve function (__dl_runtime_resolve) > already clobbers it. > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be > preserved, nor is it used to pass arguments. Making this register available as > scratch register means that code in the PLT need not spill any registers when > computing the address to which control needs to be transferred." 
> > FYI, this is the current CET-enabled PLT: > > PLT0: > > ff 35 00 00 00 00 // push GOT[0] > f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1] > 0f 1f 00 // nop > > PLTn in .plt: > > f3 0f 1e fa // endbr64 > 68 00 00 00 00 // push $namen_reloc_index > f2 e9 e1 ff ff ff // bnd jmpq PLT0 > 90 // nop > > PLTn in .plt.sec: > > f3 0f 1e fa // endbr64 > f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index] > 0f 1f 44 00 00 // nop > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we > have many PLT sections while we have only one header, so in practice, the > proposed format is almost 50% smaller than the existing one. Does it have any impact on performance? .plt.sec can be placed in a different page from .plt. > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX > has been deprecated. > > I already implemented the proposed scheme to my linker > (https://github.com/rui314/mold) and it looks like it's working fine. > > Any thoughts? I'd like to see visible performance improvements or new features in a new PLT layout. I cced x86-64 psABI mailing list. -- H.J. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-02-27 15:06 ` H.J. Lu @ 2022-02-28 3:46 ` Rui Ueyama 2022-03-01 0:04 ` H.J. Lu 2022-03-01 10:35 ` Florian Weimer 1 sibling, 1 reply; 14+ messages in thread From: Rui Ueyama @ 2022-02-28 3:46 UTC (permalink / raw) To: H.J. Lu; +Cc: Andi Kleen, x86-64-abi, Binutils On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils > <binutils@sourceware.org> wrote: > > > > Hello, > > > > I'd like to propose an alternative instruction sequence for the Intel > > CET-enabled PLT section. Compared to the existing one, the new scheme is > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not > > require a separate second PLT section (.plt.sec). > > > > Here is the proposed code sequence: > > > > PLT0: > > > > f3 0f 1e fa // endbr64 > > 41 53 // push %r11 > > ff 35 00 00 00 00 // push GOT[1] > > ff 25 00 00 00 00 // jmp *GOT[2] > > 0f 1f 40 00 // nop > > 0f 1f 40 00 // nop > > 0f 1f 40 00 // nop > > 66 90 // nop > > > > PLTn: > > > > f3 0f 1e fa // endbr64 > > 41 bb 00 00 00 00 // mov $namen_reloc_index %r11d > > ff 25 00 00 00 00 // jmp *GOT[namen_index] > > All PLT calls will have an extra MOV. One extra load-immediate mov instruction is executed per function call through a PLT entry. It's so tiny that I couldn't see any difference in real-world apps. > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a > > PLT entry is called for the first time, the control is passed to PLT0 to call > > the resolver function. > > > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries > > to clobber this register (*1), and the resolve function (__dl_runtime_resolve) > > already clobbers it. > > > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be > > preserved, nor is it used to pass arguments. 
Making this register available as > > scratch register means that code in the PLT need not spill any registers when > > computing the address to which control needs to be transferred." > > > > FYI, this is the current CET-enabled PLT: > > > > PLT0: > > > > ff 35 00 00 00 00 // push GOT[0] > > f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1] > > 0f 1f 00 // nop > > > > PLTn in .plt: > > > > f3 0f 1e fa // endbr64 > > 68 00 00 00 00 // push $namen_reloc_index > > f2 e9 e1 ff ff ff // bnd jmpq PLT0 > > 90 // nop > > > > PLTn in .plt.sec: > > > > f3 0f 1e fa // endbr64 > > f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index] > > 0f 1f 44 00 00 // nop > > > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we > > have many PLT sections while we have only one header, so in practice, the > > proposed format is almost 50% smaller than the existing one. > > Does it have any impact on performance? .plt.sec can be placed > in a different page from .plt. > > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX > > has been deprecated. > > > > I already implemented the proposed scheme to my linker > > (https://github.com/rui314/mold) and it looks like it's working fine. > > > > Any thoughts? > > I'd like to see visible performance improvements or new features in > a new PLT layout. I didn't see any visible performance improvement with real-world apps. I might be able to craft a microbenchmark to hammer PLT entries really hard in some pattern to see some difference, but I think that doesn't make much sense. The size reduction is for real though. > I cced x86-64 psABI mailing list. > > > -- > H.J. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-02-28 3:46 ` Rui Ueyama @ 2022-03-01 0:04 ` H.J. Lu 2022-03-01 0:30 ` Rui Ueyama 2022-03-01 9:16 ` Joao Moreira 0 siblings, 2 replies; 14+ messages in thread From: H.J. Lu @ 2022-03-01 0:04 UTC (permalink / raw) To: Rui Ueyama, Moreira, Joao; +Cc: Andi Kleen, x86-64-abi, Binutils On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote: > > On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils > > <binutils@sourceware.org> wrote: > > > > > > Hello, > > > > > > I'd like to propose an alternative instruction sequence for the Intel > > > CET-enabled PLT section. Compared to the existing one, the new scheme is > > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not > > > require a separate second PLT section (.plt.sec). > > > > > > Here is the proposed code sequence: > > > > > > PLT0: > > > > > > f3 0f 1e fa // endbr64 > > > 41 53 // push %r11 > > > ff 35 00 00 00 00 // push GOT[1] > > > ff 25 00 00 00 00 // jmp *GOT[2] > > > 0f 1f 40 00 // nop > > > 0f 1f 40 00 // nop > > > 0f 1f 40 00 // nop > > > 66 90 // nop > > > > > > PLTn: > > > > > > f3 0f 1e fa // endbr64 > > > 41 bb 00 00 00 00 // mov $namen_reloc_index %r11d > > > ff 25 00 00 00 00 // jmp *GOT[namen_index] > > > > All PLT calls will have an extra MOV. > > One extra load-immediate mov instruction is executed per a function > call through a PLT entry. It's so tiny that I couldn't see any > difference in real-world apps. > > > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a > > > PLT entry is called for the first time, the control is passed to PLT0 to call > > > the resolver function. > > > > > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries > > > to clobber this register (*1), and the resolve function (__dl_runtime_resolve) > > > already clobbers it. 
> > > > > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be > > > preserved, nor is it used to pass arguments. Making this register available as > > > scratch register means that code in the PLT need not spill any registers when > > > computing the address to which control needs to be transferred." > > > > > > FYI, this is the current CET-enabled PLT: > > > > > > PLT0: > > > > > > ff 35 00 00 00 00 // push GOT[0] > > > f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1] > > > 0f 1f 00 // nop > > > > > > PLTn in .plt: > > > > > > f3 0f 1e fa // endbr64 > > > 68 00 00 00 00 // push $namen_reloc_index > > > f2 e9 e1 ff ff ff // bnd jmpq PLT0 > > > 90 // nop > > > > > > PLTn in .plt.sec: > > > > > > f3 0f 1e fa // endbr64 > > > f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index] > > > 0f 1f 44 00 00 // nop > > > > > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In > > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we > > > have many PLT sections while we have only one header, so in practice, the > > > proposed format is almost 50% smaller than the existing one. > > > > Does it have any impact on performance? .plt.sec can be placed > > in a different page from .plt. > > > > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX > > > has been deprecated. > > > > > > I already implemented the proposed scheme to my linker > > > (https://github.com/rui314/mold) and it looks like it's working fine. > > > > > > Any thoughts? > > > > I'd like to see visible performance improvements or new features in > > a new PLT layout. > > I didn't see any visible performance improvement with real-world apps. > I might be able to craft a microbenchmark to hammer PLT entries really > hard in some pattern to see some difference, but I think that doesn't > make much sense. The size reduction is for real though. I am aware that there are 2 other proposals to use R11 in PLT/function call. 
But they are introducing new features. I don't think we should use R11 in PLT without any real performance improvements. > > I cced x86-64 psABI mailing list. > > > > > > -- > > H.J. -- H.J. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-03-01 0:04 ` H.J. Lu @ 2022-03-01 0:30 ` Rui Ueyama 2022-03-01 2:22 ` Fangrui Song 2022-03-01 9:16 ` Joao Moreira 1 sibling, 1 reply; 14+ messages in thread From: Rui Ueyama @ 2022-03-01 0:30 UTC (permalink / raw) To: H.J. Lu; +Cc: Moreira, Joao, Andi Kleen, x86-64-abi, Binutils I think size reduction matters to some users even if you do not care about it that much. But I'm not trying too hard to push GNU binutils to adopt it. I just wanted to let you guys know that we invented a compact (and we believe better) instruction sequence for the CET-enabled PLT and we are already using it. On Tue, Mar 1, 2022 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote: > > > > On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > > > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils > > > <binutils@sourceware.org> wrote: > > > > > > > > Hello, > > > > > > > > I'd like to propose an alternative instruction sequence for the Intel > > > > CET-enabled PLT section. Compared to the existing one, the new scheme is > > > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not > > > > require a separate second PLT section (.plt.sec). > > > > > > > > Here is the proposed code sequence: > > > > > > > > PLT0: > > > > > > > > f3 0f 1e fa // endbr64 > > > > 41 53 // push %r11 > > > > ff 35 00 00 00 00 // push GOT[1] > > > > ff 25 00 00 00 00 // jmp *GOT[2] > > > > 0f 1f 40 00 // nop > > > > 0f 1f 40 00 // nop > > > > 0f 1f 40 00 // nop > > > > 66 90 // nop > > > > > > > > PLTn: > > > > > > > > f3 0f 1e fa // endbr64 > > > > 41 bb 00 00 00 00 // mov $namen_reloc_index %r11d > > > > ff 25 00 00 00 00 // jmp *GOT[namen_index] > > > > > > All PLT calls will have an extra MOV. > > > > One extra load-immediate mov instruction is executed per function > > call through a PLT entry. 
It's so tiny that I couldn't see any > > difference in real-world apps. > > > > > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a > > > > PLT entry is called for the first time, the control is passed to PLT0 to call > > > > the resolver function. > > > > > > > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries > > > > to clobber this register (*1), and the resolve function (__dl_runtime_resolve) > > > > already clobbers it. > > > > > > > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be > > > > preserved, nor is it used to pass arguments. Making this register available as > > > > scratch register means that code in the PLT need not spill any registers when > > > > computing the address to which control needs to be transferred." > > > > > > > > FYI, this is the current CET-enabled PLT: > > > > > > > > PLT0: > > > > > > > > ff 35 00 00 00 00 // push GOT[0] > > > > f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1] > > > > 0f 1f 00 // nop > > > > > > > > PLTn in .plt: > > > > > > > > f3 0f 1e fa // endbr64 > > > > 68 00 00 00 00 // push $namen_reloc_index > > > > f2 e9 e1 ff ff ff // bnd jmpq PLT0 > > > > 90 // nop > > > > > > > > PLTn in .plt.sec: > > > > > > > > f3 0f 1e fa // endbr64 > > > > f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index] > > > > 0f 1f 44 00 00 // nop > > > > > > > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In > > > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we > > > > have many PLT sections while we have only one header, so in practice, the > > > > proposed format is almost 50% smaller than the existing one. > > > > > > Does it have any impact on performance? .plt.sec can be placed > > > in a different page from .plt. > > > > > > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX > > > > has been deprecated. 
> > > > > > > > I already implemented the proposed scheme to my linker > > > > (https://github.com/rui314/mold) and it looks like it's working fine. > > > > > > > > Any thoughts? > > > > > > I'd like to see visible performance improvements or new features in > > > a new PLT layout. > > > > I didn't see any visible performance improvement with real-world apps. > > I might be able to craft a microbenchmark to hammer PLT entries really > > hard in some pattern to see some difference, but I think that doesn't > > make much sense. The size reduction is for real though. > > I am aware that there are 2 other proposals to use R11 in PLT/function > call. But they are introducing new features. I don't think we should > use R11 in PLT without any real performance improvements. > > > > I cced x86-64 psABI mailing list. > > > > > > > > > -- > > > H.J. > > > > -- > H.J. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-03-01 0:30 ` Rui Ueyama @ 2022-03-01 2:22 ` Fangrui Song 0 siblings, 0 replies; 14+ messages in thread From: Fangrui Song @ 2022-03-01 2:22 UTC (permalink / raw) To: Rui Ueyama; +Cc: H.J. Lu, x86-64-abi, Andi Kleen, Binutils, Moreira, Joao On 2022-03-01, Rui Ueyama via Binutils wrote: >I think size reduction matters to some users even if you do not care >about that that much. But I'm not trying too hard to push GNU binutils >to adopt it. I just wanted to let you guys know that we invented a >compact (and we believe better) instruction sequence for the >CET-enabled PLT and we are already using it. > >On Tue, Mar 1, 2022 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote: >> > >> > On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote: >> > > >> > > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils >> > > <binutils@sourceware.org> wrote: >> > > > >> > > > Hello, >> > > > >> > > > I'd like to propose an alternative instruction sequence for the Intel >> > > > CET-enabled PLT section. Compared to the existing one, the new scheme is >> > > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not >> > > > require a separate second PLT section (.plt.sec). >> > > > >> > > > Here is the proposed code sequence: >> > > > >> > > > PLT0: >> > > > >> > > > f3 0f 1e fa // endbr64 >> > > > 41 53 // push %r11 >> > > > ff 35 00 00 00 00 // push GOT[1] >> > > > ff 25 00 00 00 00 // jmp *GOT[2] >> > > > 0f 1f 40 00 // nop >> > > > 0f 1f 40 00 // nop >> > > > 0f 1f 40 00 // nop >> > > > 66 90 // nop >> > > > >> > > > PLTn: >> > > > >> > > > f3 0f 1e fa // endbr64 >> > > > 41 bb 00 00 00 00 // mov $namen_reloc_index %r11d >> > > > ff 25 00 00 00 00 // jmp *GOT[namen_index] >> > > >> > > All PLT calls will have an extra MOV. 
>> > >> > One extra load-immediate mov instruction is executed per a function >> > call through a PLT entry. It's so tiny that I couldn't see any >> > difference in real-world apps. >> > >> > > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a >> > > > PLT entry is called for the first time, the control is passed to PLT0 to call >> > > > the resolver function. >> > > > >> > > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries >> > > > to clobber this register (*1), and the resolve function (__dl_runtime_resolve) >> > > > already clobbers it. >> > > > >> > > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be >> > > > preserved, nor is it used to pass arguments. Making this register available as >> > > > scratch register means that code in the PLT need not spill any registers when >> > > > computing the address to which control needs to be transferred." >> > > > >> > > > FYI, this is the current CET-enabled PLT: >> > > > >> > > > PLT0: >> > > > >> > > > ff 35 00 00 00 00 // push GOT[0] >> > > > f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1] >> > > > 0f 1f 00 // nop >> > > > >> > > > PLTn in .plt: >> > > > >> > > > f3 0f 1e fa // endbr64 >> > > > 68 00 00 00 00 // push $namen_reloc_index >> > > > f2 e9 e1 ff ff ff // bnd jmpq PLT0 >> > > > 90 // nop >> > > > >> > > > PLTn in .plt.sec: >> > > > >> > > > f3 0f 1e fa // endbr64 >> > > > f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index] >> > > > 0f 1f 44 00 00 // nop >> > > > >> > > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In >> > > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we >> > > > have many PLT sections while we have only one header, so in practice, the >> > > > proposed format is almost 50% smaller than the existing one. >> > > >> > > Does it have any impact on performance? .plt.sec can be placed >> > > in a different page from .plt. 
>> > > >> > > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX >> > > > has been deprecated. >> > > > >> > > > I already implemented the proposed scheme to my linker >> > > > (https://github.com/rui314/mold) and it looks like it's working fine. >> > > > >> > > > Any thoughts? >> > > >> > > I'd like to see visible performance improvements or new features in >> > > a new PLT layout. >> > >> > I didn't see any visible performance improvement with real-world apps. >> > I might be able to craft a microbenchmark to hammer PLT entries really >> > hard in some pattern to see some difference, but I think that doesn't >> > make much sense. The size reduction is for real though. >> >> I am aware that there are 2 other proposals to use R11 in PLT/function >> call. But they are introducing new features. I don't think we should >> use R11 in PLT without any real performance improvements. I like the proposal. There are merits of simplified implementation, code size reduction, and less obvious ones: (a) linker script users won't need to mention .plt.sec (b) tools can use a more unified approach identifying PLTs like other architectures. >> > > I cced x86-64 psABI mailing list. >> > > >> > > >> > > -- >> > > H.J. >> >> >> >> -- >> H.J. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-03-01 0:04 ` H.J. Lu 2022-03-01 0:30 ` Rui Ueyama @ 2022-03-01 9:16 ` Joao Moreira 2022-03-01 9:25 ` Rui Ueyama 1 sibling, 1 reply; 14+ messages in thread From: Joao Moreira @ 2022-03-01 9:16 UTC (permalink / raw) To: H.J. Lu; +Cc: Rui Ueyama, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i On 2022-02-28 16:04, H.J. Lu wrote: > On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote: >> >> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote: >> > >> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils >> > <binutils@sourceware.org> wrote: >> > > >> > > Hello, >> > > >> > > I'd like to propose an alternative instruction sequence for the Intel >> > > CET-enabled PLT section. Compared to the existing one, the new scheme is >> > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not >> > > require a separate second PLT section (.plt.sec). >> > > >> > > Here is the proposed code sequence: >> > > >> > > PLT0: >> > > >> > > f3 0f 1e fa // endbr64 >> > > 41 53 // push %r11 >> > > ff 35 00 00 00 00 // push GOT[1] >> > > ff 25 00 00 00 00 // jmp *GOT[2] >> > > 0f 1f 40 00 // nop >> > > 0f 1f 40 00 // nop >> > > 0f 1f 40 00 // nop >> > > 66 90 // nop >> > > >> > > PLTn: >> > > >> > > f3 0f 1e fa // endbr64 >> > > 41 bb 00 00 00 00 // mov $namen_reloc_index %r11d >> > > ff 25 00 00 00 00 // jmp *GOT[namen_index] >> > >> > All PLT calls will have an extra MOV. >> >> One extra load-immediate mov instruction is executed per a function >> call through a PLT entry. It's so tiny that I couldn't see any >> difference in real-world apps. (also replying to Fangrui, whose e-mail, for whatever reason, did not come to this mailbox). I can see the benefits of having 16 byte/single plt entries. Yet, the R11 clobbering on every PLT transition is not amusing... 
If we want PLT entries to have only 16 bytes and not have a .plt.sec
section, maybe we could try:

<plt_header>:
  pop %r11
  sub $plt_header, %r11d
  shr $0x4, %r11
  push %r11
  jmp _dl_runtime_resolve_shstk_thunk

<foo>:
  endbr64            // 4b
  jmp *GOT[foo]      // 6b
  call plt_header    // 5b

Here, the PLT entry has 16 bytes, and it pushes the PLT entry address onto
the stack by calling it. The address is then popped in the plt_header and
turned into the relocation index by subtracting the PLT offset from it and
then dividing the result by 16 (the entry size). Then, the final step to
make it shstk-compatible is jumping to a special implementation of
_dl_runtime_resolve (the shstk thunk), which will have the following
snippet (similar to glibc's __longjmp):

  testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET
  jz 1f
  mov $1, %r11
  incsspq %r11
1:
  jmp _dl_runtime_resolve

I don't think the above test fits along with the other instructions in the
plt_header if we want it to be 32 bytes at most, hence the suggestion of a
_dl_runtime_resolve thunk. Another possibility is to resolve the relocation
to the special thunk only if shstk is in place; if not, resolve it directly
to _dl_runtime_resolve to avoid resolving overheads in the absence of
shstk.

I think this solves both the size and the dummy mov overheads. The logic is
a bit more convoluted, but perhaps we can work on making it simpler. FWIW,
I did not test or implement anything. Ah, also, pardon any asm
mistakes/obvious details that I may have missed :)

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-03-01 9:16 ` Joao Moreira @ 2022-03-01 9:25 ` Rui Ueyama 2022-03-01 9:27 ` Joao Moreira 0 siblings, 1 reply; 14+ messages in thread From: Rui Ueyama @ 2022-03-01 9:25 UTC (permalink / raw) To: Joao Moreira; +Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i On Tue, Mar 1, 2022 at 6:17 PM Joao Moreira <joao@overdrivepizza.com> wrote: > > On 2022-02-28 16:04, H.J. Lu wrote: > > On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote: > >> > >> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote: > >> > > >> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils > >> > <binutils@sourceware.org> wrote: > >> > > > >> > > Hello, > >> > > > >> > > I'd like to propose an alternative instruction sequence for the Intel > >> > > CET-enabled PLT section. Compared to the existing one, the new scheme is > >> > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not > >> > > require a separate second PLT section (.plt.sec). > >> > > > >> > > Here is the proposed code sequence: > >> > > > >> > > PLT0: > >> > > > >> > > f3 0f 1e fa // endbr64 > >> > > 41 53 // push %r11 > >> > > ff 35 00 00 00 00 // push GOT[1] > >> > > ff 25 00 00 00 00 // jmp *GOT[2] > >> > > 0f 1f 40 00 // nop > >> > > 0f 1f 40 00 // nop > >> > > 0f 1f 40 00 // nop > >> > > 66 90 // nop > >> > > > >> > > PLTn: > >> > > > >> > > f3 0f 1e fa // endbr64 > >> > > 41 bb 00 00 00 00 // mov $namen_reloc_index %r11d > >> > > ff 25 00 00 00 00 // jmp *GOT[namen_index] > >> > > >> > All PLT calls will have an extra MOV. > >> > >> One extra load-immediate mov instruction is executed per a function > >> call through a PLT entry. It's so tiny that I couldn't see any > >> difference in real-world apps. > > (also replying to Fangrui, whose e-mail, for whatever reason, did not > come to this mailbox). > > I can see the benefits of having 16 byte/single plt entries. 
Yet, the > R11 clobbering on every PLT transition is not amusing... If we want PLT > entries to have only 16 bytes and not have a sec.plt section, maybe we > could try: > > <plt_header> > pop %r11 > sub %r11d, plt_header > shr $0x5, %r11 > push %r11 > jmp _dl_runtime_resolve_shstk_thunk > > <foo>: > endbr // 4b > jmp GOT[foo] // 6b > call plt_header // 5b This is what I tried first but I then realized that I needed to insert another `endbr` between `jmp` and `call`. `jmp GOT[foo]` can jump only to `endbr` if CET is enabled, so it can't directly jump to the following `call`. > Here, the plt entry has 16 bytes and it pushes the PLT entry address to > the stack by calling it. The address is then popped in the plt_header > and worked to retrieve the index by subbing the plt offset from the > address and then dividing it by 16. Then, the final step to make it > shstk compatible is jmping to a special implementation of > _dl_runtime_resolve (shstk_thnk) which will have the following snippet > (similarly to glibc's __longjmp): > > testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET > jz 1 > mov $1, %r11 > incsspq %r11 > 1: > jmp _dl_runtime_resolve > > I don't think the above test fits along with the other instructions in > the plt_header if we want it 32b at most, thus the suggestion for having > it as a __dl_runtime_resolve thunk. Another possibility is to also > resolve the relocation to the special thunk only if shstk is in place, > if not, resolve it directly to _dl_runtime_resolve to prevent resolving > overheads in the absence of shstk. > > I think this solves both the size and the dummy mov overheads. The logic > is a bit more convoluted, but perhaps we can work on making it simpler. > Fwiiw, I did not test nor implement anything. > > Ah, also, pardon any asm mistakes/obvious details that I may have missed > :) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-03-01 9:25 ` Rui Ueyama @ 2022-03-01 9:27 ` Joao Moreira 2022-03-01 9:32 ` Rui Ueyama 0 siblings, 1 reply; 14+ messages in thread From: Joao Moreira @ 2022-03-01 9:27 UTC (permalink / raw) To: Rui Ueyama; +Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i > This is what I tried first but I then realized that I needed to insert > another `endbr` between `jmp` and `call`. `jmp GOT[foo]` can jump only > to `endbr` if CET is enabled, so it can't directly jump to the > following `call`. > Ugh, there we go... dead. Thanks for not letting me waste a ton of time :) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal 2022-03-01 9:27 ` Joao Moreira @ 2022-03-01 9:32 ` Rui Ueyama 2022-03-01 9:45 ` Joao Moreira 0 siblings, 1 reply; 14+ messages in thread From: Rui Ueyama @ 2022-03-01 9:32 UTC (permalink / raw) To: Joao Moreira; +Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i On Tue, Mar 1, 2022 at 6:27 PM Joao Moreira <joao@overdrivepizza.com> wrote: > > > This is what I tried first but I then realized that I needed to insert > > another `endbr` between `jmp` and `call`. `jmp GOT[foo]` can jump only > > to `endbr` if CET is enabled, so it can't directly jump to the > > following `call`. > > > Ugh, there we go... dead. Thanks for not letting me waste a ton of time > :) I actually wasted my time by implementing it only to find that it wouldn't work. :) If you are interested, this is my commit to my linker. https://github.com/rui314/mold/commit/4ec0bbf04841e514aca2000f3d780d14efcaefc9 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: x86-64: new CET-enabled PLT format proposal
Date: 2022-03-01  9:45 UTC (in reply to Rui Ueyama, 2022-03-01  9:32)
From: Joao Moreira
To: Rui Ueyama
Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i

On 2022-03-01 01:32, Rui Ueyama wrote:
> I actually wasted my time by implementing it only to find that it
> wouldn't work. :) If you are interested, this is my commit to my
> linker.
> https://github.com/rui314/mold/commit/4ec0bbf04841e514aca2000f3d780d14efcaefc9

I'm glad I posted it here before trying to go and implement it :)

Regarding the projects mentioned by HJ, I assume one of them is this (in
case you are curious):

https://static.sched.com/hosted_files/lssna2021/8f/LSS_FINEIBT_JOAOMOREIRA.pdf

In FineIBT we use R11 to pass hashes around through direct calls to
enable fine-grained CFI on top of IBT.
* Re: x86-64: new CET-enabled PLT format proposal
Date: 2022-03-01  9:48 UTC (in reply to Joao Moreira, 2022-03-01  9:45)
From: Rui Ueyama
To: Joao Moreira
Cc: H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i

Thank you for sharing the slides!

As to our usage of r11: always clobbering r11 doesn't look pretty,
indeed. But I couldn't observe any performance difference from doing
this, and I think I can explain why. There is no data dependency on r11
at function entry: r11 is not expected to be preserved across a function
call, and it's not used for passing arguments, so nothing reads the
value that we write to r11 in PLT[n] (except the code in PLT0). It
therefore cannot cause a pipeline stall, which makes it very cheap, if
not free.

On Tue, Mar 1, 2022 at 6:45 PM Joao Moreira <joao@overdrivepizza.com> wrote:
>
> I'm glad I posted it here before trying to go and implement :)
>
> Regarding the projects mentioned by HJ, I assume one of them is this
> (in case you are curious):
>
> https://static.sched.com/hosted_files/lssna2021/8f/LSS_FINEIBT_JOAOMOREIRA.pdf
>
> In FineIBT we use R11 to pass hashes around through direct calls to
> enable fine-grain CFI on top of IBT.
* Re: x86-64: new CET-enabled PLT format proposal
Date: 2022-03-01 10:35 UTC (in reply to H.J. Lu, 2022-02-27 15:06)
From: Florian Weimer
To: H.J. Lu
Cc: Rui Ueyama, Andi Kleen, x86-64-abi, Binutils

I do wonder if time is better spent on making symbol binding faster in
general, and eliminating the semantic difference between BIND_NOW and
lazy binding (as musl has done, albeit in an IFUNC-less context).

An example of the current performance issues:

  ld.so has poor performance characteristics when loading large
  quantities of .so files
  <https://sourceware.org/bugzilla/show_bug.cgi?id=27695>

I'm not suggesting we bring back prelink. There must be other
approaches to making binding faster.

Thanks,
Florian
* Re: x86-64: new CET-enabled PLT format proposal
Date: 2022-03-01 22:16 UTC (in reply to Florian Weimer, 2022-03-01 10:35)
From: Fangrui Song
To: Florian Weimer
Cc: H.J. Lu, Andi Kleen, Binutils, x86-64-abi

On 2022-03-01, Florian Weimer via Binutils wrote:
> I do wonder if time is better spent on making symbol binding faster in
> general, and eliminate the semantic difference between BIND_NOW and
> lazy binding (like musl has done, albeit in an IFUNC-less context).
>
> An example of the current performance issues:
>
>   ld.so has poor performance characteristics when loading large
>   quantities of .so files
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27695>
>
> I'm not suggesting we bring back prelink. There must be other
> approaches to make binding go faster.

Improving symbol binding performance will definitely help and be
appreciated by companies deploying large dynamically linked executables.
They may run into a situation with O(1000) direct/indirect DT_NEEDED
shared objects; I remember that this can take more than one minute.

In
https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/google/grte/v5-2.27/master
Google uses a fastload patch
https://sourceware.org/git/?p=glibc.git;a=commit;h=af63681769182a8e29568088d6c9cd3c916b22f9
(I haven't tried reading it).

For more traditional desktop/server applications, I think we should
shift to a direct binding model (Solaris direct binding, Mac OS X
two-level namespaces):
https://maskray.me/blog/2021-05-16-elf-interposition-and-bsymbolic#the-last-alliance-of-elf-and-men

With direct binding, definitions in shared objects don't need the costly
symbol lookup.
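The cost difference being discussed can be sketched with a toy model: the default ELF lookup probes every object in the search scope in order (each probe being a hash-table lookup in a real loader), while Solaris-style direct binding records the defining object alongside the reference. The library names, symbol names, and counts below are invented for illustration.

```python
# Toy model: 1000 shared objects, each defining 10 unique symbols.
NOBJS = 1000
scope = [(f"lib{i}.so", {f"sym_{i}_{j}" for j in range(10)})
         for i in range(NOBJS)]
by_name = dict(scope)  # the loader can reach any loaded object by name

def default_lookup(sym):
    """Default ELF binding: scan the whole search scope in order.
    Returns (defining object, number of objects probed)."""
    for probes, (obj, syms) in enumerate(scope, start=1):
        if sym in syms:
            return obj, probes
    raise KeyError(sym)

def direct_lookup(obj, sym):
    """Direct binding: the reference already names its definer,
    so resolution is a single probe into that one object."""
    return (obj, 1) if sym in by_name[obj] else None

# A symbol defined by the last object in scope is the worst case:
print(default_lookup("sym_999_0"))              # ('lib999.so', 1000)
print(direct_lookup("lib999.so", "sym_999_0"))  # ('lib999.so', 1)
```

With R undefined references per object, startup cost scales roughly as O(NOBJS * R * probes-per-lookup), which is why O(1000) DT_NEEDED objects hurt under the default model and why direct binding (or the fastload-style indexing above it) helps.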
end of thread [~2022-03-01 22:16 UTC | newest]

Thread overview: 14+ messages
2022-02-27  3:18 x86-64: new CET-enabled PLT format proposal  Rui Ueyama
2022-02-27 15:06 ` H.J. Lu
2022-02-28  3:46 ` Rui Ueyama
2022-03-01  0:04 ` H.J. Lu
2022-03-01  0:30 ` Rui Ueyama
2022-03-01  2:22 ` Fangrui Song
2022-03-01  9:16 ` Joao Moreira
2022-03-01  9:25 ` Rui Ueyama
2022-03-01  9:27 ` Joao Moreira
2022-03-01  9:32 ` Rui Ueyama
2022-03-01  9:45 ` Joao Moreira
2022-03-01  9:48 ` Rui Ueyama
2022-03-01 10:35 ` Florian Weimer
2022-03-01 22:16 ` Fangrui Song