* Powerpc Linux 'scv' system call ABI proposal take 2 @ 2020-04-15 21:45 Nicholas Piggin 2020-04-15 22:55 ` [musl] " Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-15 21:45 UTC (permalink / raw) To: linuxppc-dev; +Cc: libc-alpha, libc-dev, musl, Segher Boessenkool I would like to enable Linux support for the powerpc 'scv' instruction, as a faster system call instruction. This requires two things to be defined: Firstly a way to advertise to userspace that kernel supports scv, and a way to allocate and advertise support for individual scv vectors. Secondly, a calling convention ABI for this new instruction. Thanks to those who commented last time, since then I have removed my answered questions and unpopular alternatives but you can find them here https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html Let me try one more with a wider cc list, and then we'll get something merged. Any questions or counter-opinions are welcome. System Call Vectored (scv) ABI ============================== The scv instruction is introduced with POWER9 / ISA3, it comes with an rfscv counter-part. The benefit of these instructions is performance (trading slower SRR0/1 with faster LR/CTR registers, and entering the kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR updates. The scv instruction has 128 interrupt entry points (not enough to cover the Linux system call space). The proposal is to assign scv numbers very conservatively and allocate them as individual HWCAP features as we add support for more. The zero vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. Advertisement Linux has not enabled FSCR[SCV] yet, so the instruction will cause a SIGILL in current environments. Linux has defined a HWCAP2 bit PPC_FEATURE2_SCV for SCV support, but does not set it. When scv instruction support and the scv 0 vector for system calls are added, PPC_FEATURE2_SCV will indicate support for these. 
Other vectors should not be used without future HWCAP bits indicating support, which is how we will allocate them. (Should unallocated ones generate SIGILL, or return -ENOSYS in r3?) Calling convention The proposal is for scv 0 to provide the standard Linux system call ABI with the following differences from sc convention[1]: - LR is to be volatile across scv calls. This is necessary because the scv instruction clobbers LR. From previous discussion, this should be possible to deal with in GCC clobbers and CFI. - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the kernel system call exit to avoid restoring the CR register (although we probably still would anyway to avoid information leak). - Error handling: I think the consensus has been to move to using negative return value in r3 rather than CR0[SO]=1 to indicate error, which matches most other architectures and is closer to a function call. The number of scratch registers (r9-r12) at kernel entry seems sufficient that we don't have any costly spilling, patch is here[2]. [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-15 21:45 Powerpc Linux 'scv' system call ABI proposal take 2 Nicholas Piggin @ 2020-04-15 22:55 ` Rich Felker 2020-04-16 0:16 ` Nicholas Piggin ` (2 more replies) 0 siblings, 3 replies; 62+ messages in thread From: Rich Felker @ 2020-04-15 22:55 UTC (permalink / raw) To: Nicholas Piggin Cc: linuxppc-dev, libc-alpha, libc-dev, musl, Segher Boessenkool On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > I would like to enable Linux support for the powerpc 'scv' instruction, > as a faster system call instruction. > > This requires two things to be defined: Firstly a way to advertise to > userspace that kernel supports scv, and a way to allocate and advertise > support for individual scv vectors. Secondly, a calling convention ABI > for this new instruction. > > Thanks to those who commented last time, since then I have removed my > answered questions and unpopular alternatives but you can find them > here > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html > > Let me try one more with a wider cc list, and then we'll get something > merged. Any questions or counter-opinions are welcome. > > System Call Vectored (scv) ABI > ============================== > > The scv instruction is introduced with POWER9 / ISA3, it comes with an > rfscv counter-part. The benefit of these instructions is performance > (trading slower SRR0/1 with faster LR/CTR registers, and entering the > kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR > updates. The scv instruction has 128 interrupt entry points (not enough > to cover the Linux system call space). > > The proposal is to assign scv numbers very conservatively and allocate > them as individual HWCAP features as we add support for more. The zero > vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. 
> > Advertisement > > Linux has not enabled FSCR[SCV] yet, so the instruction will cause a > SIGILL in current environments. Linux has defined a HWCAP2 bit > PPC_FEATURE2_SCV for SCV support, but does not set it. > > When scv instruction support and the scv 0 vector for system calls are > added, PPC_FEATURE2_SCV will indicate support for these. Other vectors > should not be used without future HWCAP bits indicating support, which is > how we will allocate them. (Should unallocated ones generate SIGILL, or > return -ENOSYS in r3?) > > Calling convention > > The proposal is for scv 0 to provide the standard Linux system call ABI > with the following differences from sc convention[1]: > > - LR is to be volatile across scv calls. This is necessary because the > scv instruction clobbers LR. From previous discussion, this should be > possible to deal with in GCC clobbers and CFI. > > - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the > kernel system call exit to avoid restoring the CR register (although > we probably still would anyway to avoid information leak). > > - Error handling: I think the consensus has been to move to using negative > return value in r3 rather than CR0[SO]=1 to indicate error, which matches > most other architectures and is closer to a function call. > > The number of scratch registers (r9-r12) at kernel entry seems > sufficient that we don't have any costly spilling, patch is here[2]. > > [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst > [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html My preference would be that it work just like the i386 AT_SYSINFO where you just replace "int $128" with "call *%%gs:16" and the kernel provides a stub in the vdso that performs either scv or the old mechanism with the same calling convention. 
Then if the kernel doesn't provide it (because the kernel is too old) libc would have to provide its own stub that uses the legacy method and matches the calling convention of the one the kernel is expected to provide. Note that any libc that actually makes use of the new functionality is not going to be able to make clobbers conditional on support for it; branching around different clobbers is going to defeat any gains vs always just treating anything clobbered by either method as clobbered. Likewise, it's not useful to have different error return mechanisms because the caller just has to branch to support both (or the kernel-provided stub just has to emulate one for it; that could work if you really want to change the bad existing convention). Thoughts? Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-15 22:55 ` [musl] " Rich Felker @ 2020-04-16 0:16 ` Nicholas Piggin 2020-04-16 0:48 ` Rich Felker ` (2 more replies) 2020-04-16 4:48 ` Florian Weimer 2020-04-16 14:16 ` Adhemerval Zanella 2 siblings, 3 replies; 62+ messages in thread From: Nicholas Piggin @ 2020-04-16 0:16 UTC (permalink / raw) To: Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> I would like to enable Linux support for the powerpc 'scv' instruction, >> as a faster system call instruction. >> >> This requires two things to be defined: Firstly a way to advertise to >> userspace that kernel supports scv, and a way to allocate and advertise >> support for individual scv vectors. Secondly, a calling convention ABI >> for this new instruction. >> >> Thanks to those who commented last time, since then I have removed my >> answered questions and unpopular alternatives but you can find them >> here >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> Let me try one more with a wider cc list, and then we'll get something >> merged. Any questions or counter-opinions are welcome. >> >> System Call Vectored (scv) ABI >> ============================== >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> rfscv counter-part. The benefit of these instructions is performance >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> updates. The scv instruction has 128 interrupt entry points (not enough >> to cover the Linux system call space). >> >> The proposal is to assign scv numbers very conservatively and allocate >> them as individual HWCAP features as we add support for more. 
The zero >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. >> >> Advertisement >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> SIGILL in current environments. Linux has defined a HWCAP2 bit >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> When scv instruction support and the scv 0 vector for system calls are >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors >> should not be used without future HWCAP bits indicating support, which is >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> return -ENOSYS in r3?) >> >> Calling convention >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> with the following differences from sc convention[1]: >> >> - LR is to be volatile across scv calls. This is necessary because the >> scv instruction clobbers LR. From previous discussion, this should be >> possible to deal with in GCC clobbers and CFI. >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> kernel system call exit to avoid restoring the CR register (although >> we probably still would anyway to avoid information leak). >> >> - Error handling: I think the consensus has been to move to using negative >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches >> most other architectures and is closer to a function call. >> >> The number of scratch registers (r9-r12) at kernel entry seems >> sufficient that we don't have any costly spilling, patch is here[2]. 
>> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html > > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. Then if the kernel doesn't > provide it (because the kernel is too old) libc would have to provide > its own stub that uses the legacy method and matches the calling > convention of the one the kernel is expected to provide. I'm not sure if that's necessary. That's done on x86-32 because they select different sequences to use based on the CPU running and if the host kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP bits and select the right sequence in libc as well I suppose. > Note that any libc that actually makes use of the new functionality is > not going to be able to make clobbers conditional on support for it; > branching around different clobbers is going to defeat any gains vs > always just treating anything clobbered by either method as clobbered. Well it would have to test HWCAP and patch in or branch to two completely different sequences including register save/restores yes. You could have the same asm and matching clobbers to put the sequence inline and then you could patch the one sc/scv instruction I suppose. A bit of logic to select between them doesn't defeat gains though, it's about 90 cycle improvement which is a handful of branch mispredicts so it really is an improvement. Eventually userspace will stop supporting the old variant too. > Likewise, it's not useful to have different error return mechanisms > because the caller just has to branch to support both (or the > kernel-provided stub just has to emulate one for it; that could work > if you really want to change the bad existing convention). 
> > Thoughts? The existing convention has to change somewhat because of the clobbers, so I thought we could change the error return at the same time. I'm open to not changing it and using CR0[SO], but others liked the idea. Pro: it matches sc and vsyscall. Con: it's different from other common archs. Performnce-wise it would really be a wash -- cost of conditional branch is not the cmp but the mispredict. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 0:16 ` Nicholas Piggin @ 2020-04-16 0:48 ` Rich Felker 2020-04-16 2:24 ` Nicholas Piggin 2020-04-16 9:58 ` Szabolcs Nagy 2020-04-16 15:21 ` Jeffrey Walton 2 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-16 0:48 UTC (permalink / raw) To: Nicholas Piggin Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > >> I would like to enable Linux support for the powerpc 'scv' instruction, > >> as a faster system call instruction. > >> > >> This requires two things to be defined: Firstly a way to advertise to > >> userspace that kernel supports scv, and a way to allocate and advertise > >> support for individual scv vectors. Secondly, a calling convention ABI > >> for this new instruction. > >> > >> Thanks to those who commented last time, since then I have removed my > >> answered questions and unpopular alternatives but you can find them > >> here > >> > >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html > >> > >> Let me try one more with a wider cc list, and then we'll get something > >> merged. Any questions or counter-opinions are welcome. > >> > >> System Call Vectored (scv) ABI > >> ============================== > >> > >> The scv instruction is introduced with POWER9 / ISA3, it comes with an > >> rfscv counter-part. The benefit of these instructions is performance > >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the > >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR > >> updates. The scv instruction has 128 interrupt entry points (not enough > >> to cover the Linux system call space). 
> >> > >> The proposal is to assign scv numbers very conservatively and allocate > >> them as individual HWCAP features as we add support for more. The zero > >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. > >> > >> Advertisement > >> > >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a > >> SIGILL in current environments. Linux has defined a HWCAP2 bit > >> PPC_FEATURE2_SCV for SCV support, but does not set it. > >> > >> When scv instruction support and the scv 0 vector for system calls are > >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors > >> should not be used without future HWCAP bits indicating support, which is > >> how we will allocate them. (Should unallocated ones generate SIGILL, or > >> return -ENOSYS in r3?) > >> > >> Calling convention > >> > >> The proposal is for scv 0 to provide the standard Linux system call ABI > >> with the following differences from sc convention[1]: > >> > >> - LR is to be volatile across scv calls. This is necessary because the > >> scv instruction clobbers LR. From previous discussion, this should be > >> possible to deal with in GCC clobbers and CFI. > >> > >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the > >> kernel system call exit to avoid restoring the CR register (although > >> we probably still would anyway to avoid information leak). > >> > >> - Error handling: I think the consensus has been to move to using negative > >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches > >> most other architectures and is closer to a function call. > >> > >> The number of scratch registers (r9-r12) at kernel entry seems > >> sufficient that we don't have any costly spilling, patch is here[2]. 
> >> > >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst > >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840..html > > > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. Then if the kernel doesn't > > provide it (because the kernel is too old) libc would have to provide > > its own stub that uses the legacy method and matches the calling > > convention of the one the kernel is expected to provide. > > I'm not sure if that's necessary. That's done on x86-32 because they > select different sequences to use based on the CPU running and if the host > kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP > bits and select the right sequence in libc as well I suppose. It's not just a HWCAP. It's a contract between the kernel and userspace to support a particular calling convention that's not exposed except as the public entry point the kernel exports via AT_SYSINFO. > > Note that any libc that actually makes use of the new functionality is > > not going to be able to make clobbers conditional on support for it; > > branching around different clobbers is going to defeat any gains vs > > always just treating anything clobbered by either method as clobbered. > > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline and then you could patch the one sc/scv instruction I suppose. > > A bit of logic to select between them doesn't defeat gains though, > it's about 90 cycle improvement which is a handful of branch mispredicts > so it really is an improvement. Eventually userspace will stop > supporting the old variant too. 
Oh, I didn't mean it would neutralize the benefit of svc. Rather, I meant it would be worse to do: if (hwcap & X) { __asm__(... with some clobbers); } else { __asm__(... with different clobbers); } instead of just __asm__("indirect call" ... with common clobbers); where the indirect call is to an address ideally provided like on i386, or otherwise initialized to one of two or more code addresses in libc based on hwcap bits. > > Likewise, it's not useful to have different error return mechanisms > > because the caller just has to branch to support both (or the > > kernel-provided stub just has to emulate one for it; that could work > > if you really want to change the bad existing convention). > > > > Thoughts? > > The existing convention has to change somewhat because of the clobbers, > so I thought we could change the error return at the same time. I'm > open to not changing it and using CR0[SO], but others liked the idea. > Pro: it matches sc and vsyscall. Con: it's different from other common > archs. Performnce-wise it would really be a wash -- cost of conditional > branch is not the cmp but the mispredict. If you do the branch on hwcap at each syscall, then you significantly increase code size of every syscall point, likely turning a bunch of trivial functions that didn't need stack frames into ones that do. You also potentially make them need a TOC pointer. Making them all just do an indirect call unconditionally (with pointer in TLS like i386?) is a lot more efficient in code size and at least as good for performance. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 0:48 ` Rich Felker @ 2020-04-16 2:24 ` Nicholas Piggin 2020-04-16 2:35 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-16 2:24 UTC (permalink / raw) To: Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool Excerpts from Rich Felker's message of April 16, 2020 10:48 am: > On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 16, 2020 8:55 am: >> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> >> I would like to enable Linux support for the powerpc 'scv' instruction, >> >> as a faster system call instruction. >> >> >> >> This requires two things to be defined: Firstly a way to advertise to >> >> userspace that kernel supports scv, and a way to allocate and advertise >> >> support for individual scv vectors. Secondly, a calling convention ABI >> >> for this new instruction. >> >> >> >> Thanks to those who commented last time, since then I have removed my >> >> answered questions and unpopular alternatives but you can find them >> >> here >> >> >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> >> >> Let me try one more with a wider cc list, and then we'll get something >> >> merged. Any questions or counter-opinions are welcome. >> >> >> >> System Call Vectored (scv) ABI >> >> ============================== >> >> >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> >> rfscv counter-part. The benefit of these instructions is performance >> >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> >> updates. The scv instruction has 128 interrupt entry points (not enough >> >> to cover the Linux system call space). 
>> >> >> >> The proposal is to assign scv numbers very conservatively and allocate >> >> them as individual HWCAP features as we add support for more. The zero >> >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. >> >> >> >> Advertisement >> >> >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> >> SIGILL in current environments. Linux has defined a HWCAP2 bit >> >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> >> >> When scv instruction support and the scv 0 vector for system calls are >> >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors >> >> should not be used without future HWCAP bits indicating support, which is >> >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> >> return -ENOSYS in r3?) >> >> >> >> Calling convention >> >> >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> >> with the following differences from sc convention[1]: >> >> >> >> - LR is to be volatile across scv calls. This is necessary because the >> >> scv instruction clobbers LR. From previous discussion, this should be >> >> possible to deal with in GCC clobbers and CFI. >> >> >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> >> kernel system call exit to avoid restoring the CR register (although >> >> we probably still would anyway to avoid information leak). >> >> >> >> - Error handling: I think the consensus has been to move to using negative >> >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches >> >> most other architectures and is closer to a function call. >> >> >> >> The number of scratch registers (r9-r12) at kernel entry seems >> >> sufficient that we don't have any costly spilling, patch is here[2]. 
>> >> >> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840..html >> > >> > My preference would be that it work just like the i386 AT_SYSINFO >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> > provides a stub in the vdso that performs either scv or the old >> > mechanism with the same calling convention. Then if the kernel doesn't >> > provide it (because the kernel is too old) libc would have to provide >> > its own stub that uses the legacy method and matches the calling >> > convention of the one the kernel is expected to provide. >> >> I'm not sure if that's necessary. That's done on x86-32 because they >> select different sequences to use based on the CPU running and if the host >> kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP >> bits and select the right sequence in libc as well I suppose. > > It's not just a HWCAP. It's a contract between the kernel and > userspace to support a particular calling convention that's not > exposed except as the public entry point the kernel exports via > AT_SYSINFO. Right. >> > Note that any libc that actually makes use of the new functionality is >> > not going to be able to make clobbers conditional on support for it; >> > branching around different clobbers is going to defeat any gains vs >> > always just treating anything clobbered by either method as clobbered. >> >> Well it would have to test HWCAP and patch in or branch to two >> completely different sequences including register save/restores yes. >> You could have the same asm and matching clobbers to put the sequence >> inline and then you could patch the one sc/scv instruction I suppose. >> >> A bit of logic to select between them doesn't defeat gains though, >> it's about 90 cycle improvement which is a handful of branch mispredicts >> so it really is an improvement. 
Eventually userspace will stop >> supporting the old variant too. > > Oh, I didn't mean it would neutralize the benefit of svc. Rather, I > meant it would be worse to do: > > if (hwcap & X) { > __asm__(... with some clobbers); > } else { > __asm__(... with different clobbers); > } > > instead of just > > __asm__("indirect call" ... with common clobbers); Ah okay. Well that's debatable but if you didn't have an indirect call, rather a runtime-patched sequence, then yes saving the LR clobber or whatever wouldn't be worth a branch. > where the indirect call is to an address ideally provided like on > i386, or otherwise initialized to one of two or more code addresses in > libc based on hwcap bits. Right, I'm just skeptical we need the indirect call or need to provide it in the vdso. The "clever" reason to add it on x86-32 was because of the bugs and different combinations needed, that doesn't really apply to scv 0 and was not necessarily a great choice. > >> > Likewise, it's not useful to have different error return mechanisms >> > because the caller just has to branch to support both (or the >> > kernel-provided stub just has to emulate one for it; that could work >> > if you really want to change the bad existing convention). >> > >> > Thoughts? >> >> The existing convention has to change somewhat because of the clobbers, >> so I thought we could change the error return at the same time. I'm >> open to not changing it and using CR0[SO], but others liked the idea. >> Pro: it matches sc and vsyscall. Con: it's different from other common >> archs. Performnce-wise it would really be a wash -- cost of conditional >> branch is not the cmp but the mispredict. > > If you do the branch on hwcap at each syscall, then you significantly > increase code size of every syscall point, likely turning a bunch of > trivial functions that didn't need stack frames into ones that do. You > also potentially make them need a TOC pointer. 
Making them all just do > an indirect call unconditionally (with pointer in TLS like i386?) is a > lot more efficient in code size and at least as good for performance. I disagree. Doing the long vdso indirect call *necessarily* requires touching a new icache line, and even a new TLB entry. Indirect branches also tend to be more costly and/or less accurate to predict than direct even without spectre (generally fewer indirect predictor entries than direct, far branches in particular require a lot of bits for target). And with spectre we're flushing the indirect predictors on context switch or even disabling indirect prediction or flushing across privilege domains in the same context. And finally, the HWCAP test can eventually go away in future. A vdso call can not. If you really want to select with an indirect branch rather than direct conditional, you can do that all within the library. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 2:24 ` Nicholas Piggin @ 2020-04-16 2:35 ` Rich Felker 2020-04-16 2:53 ` Nicholas Piggin 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-16 2:35 UTC (permalink / raw) To: Nicholas Piggin Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote: > >> > Likewise, it's not useful to have different error return mechanisms > >> > because the caller just has to branch to support both (or the > >> > kernel-provided stub just has to emulate one for it; that could work > >> > if you really want to change the bad existing convention). > >> > > >> > Thoughts? > >> > >> The existing convention has to change somewhat because of the clobbers, > >> so I thought we could change the error return at the same time. I'm > >> open to not changing it and using CR0[SO], but others liked the idea. > >> Pro: it matches sc and vsyscall. Con: it's different from other common > >> archs. Performnce-wise it would really be a wash -- cost of conditional > >> branch is not the cmp but the mispredict. > > > > If you do the branch on hwcap at each syscall, then you significantly > > increase code size of every syscall point, likely turning a bunch of > > trivial functions that didn't need stack frames into ones that do. You > > also potentially make them need a TOC pointer. Making them all just do > > an indirect call unconditionally (with pointer in TLS like i386?) is a > > lot more efficient in code size and at least as good for performance. > > I disagree. Doing the long vdso indirect call *necessarily* requires > touching a new icache line, and even a new TLB entry. Indirect branches The increase in number of icache lines from the branch at every syscall point is far greater than the use of a single extra icache line shared by all syscalls. 
Not to mention the dcache line to access __hwcap or whatever, and the icache lines to setup access TOC-relative access to it. (Of course you could put a copy of its value in TLS at a fixed offset, which would somewhat mitigate both.) > And finally, the HWCAP test can eventually go away in future. A vdso > call can not. We support nearly arbitrarily old kernels (with limited functionality) and hardware (with full functionality) and don't intend for that to change, ever. But indeed glibc might want too eventually drop the check. > If you really want to select with an indirect branch rather than > direct conditional, you can do that all within the library. OK. It's a little bit more work if that's not the interface the kernel will give us, but it's no big deal. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 2:35 ` Rich Felker @ 2020-04-16 2:53 ` Nicholas Piggin 2020-04-16 3:03 ` Rich Felker 2020-04-16 20:18 ` Florian Weimer 0 siblings, 2 replies; 62+ messages in thread From: Nicholas Piggin @ 2020-04-16 2:53 UTC (permalink / raw) To: Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool Excerpts from Rich Felker's message of April 16, 2020 12:35 pm: > On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote: >> >> > Likewise, it's not useful to have different error return mechanisms >> >> > because the caller just has to branch to support both (or the >> >> > kernel-provided stub just has to emulate one for it; that could work >> >> > if you really want to change the bad existing convention). >> >> > >> >> > Thoughts? >> >> >> >> The existing convention has to change somewhat because of the clobbers, >> >> so I thought we could change the error return at the same time. I'm >> >> open to not changing it and using CR0[SO], but others liked the idea. >> >> Pro: it matches sc and vsyscall. Con: it's different from other common >> >> archs. Performnce-wise it would really be a wash -- cost of conditional >> >> branch is not the cmp but the mispredict. >> > >> > If you do the branch on hwcap at each syscall, then you significantly >> > increase code size of every syscall point, likely turning a bunch of >> > trivial functions that didn't need stack frames into ones that do. You >> > also potentially make them need a TOC pointer. Making them all just do >> > an indirect call unconditionally (with pointer in TLS like i386?) is a >> > lot more efficient in code size and at least as good for performance. >> >> I disagree. Doing the long vdso indirect call *necessarily* requires >> touching a new icache line, and even a new TLB entry. 
Indirect branches > > The increase in number of icache lines from the branch at every > syscall point is far greater than the use of a single extra icache > line shared by all syscalls. That's true, I was thinking of a single function that does the test and calls syscalls, which might be the fair comparison. > Not to mention the dcache line to access > __hwcap or whatever, and the icache lines to setup access TOC-relative > access to it. (Of course you could put a copy of its value in TLS at a > fixed offset, which would somewhat mitigate both.) > >> And finally, the HWCAP test can eventually go away in future. A vdso >> call can not. > > We support nearly arbitrarily old kernels (with limited functionality) > and hardware (with full functionality) and don't intend for that to > change, ever. But indeed glibc might want too eventually drop the > check. Ah, cool. Any build-time flexibility there? We may or may not be getting a new ABI that will use instructions not supported by old processors. https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html Current ABI continues to work of course and be the default for some time, but building for new one would give some opportunity to drop such support for old procs, at least for glibc. > >> If you really want to select with an indirect branch rather than >> direct conditional, you can do that all within the library. > > OK. It's a little bit more work if that's not the interface the kernel > will give us, but it's no big deal. Okay. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 2:53 ` Nicholas Piggin @ 2020-04-16 3:03 ` Rich Felker 2020-04-16 3:41 ` Nicholas Piggin 2020-04-16 20:18 ` Florian Weimer 1 sibling, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-16 3:03 UTC (permalink / raw) To: Nicholas Piggin Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote: > > Not to mention the dcache line to access > > __hwcap or whatever, and the icache lines to setup access TOC-relative > > access to it. (Of course you could put a copy of its value in TLS at a > > fixed offset, which would somewhat mitigate both.) > > > >> And finally, the HWCAP test can eventually go away in future. A vdso > >> call can not. > > > > We support nearly arbitrarily old kernels (with limited functionality) > > and hardware (with full functionality) and don't intend for that to > > change, ever. But indeed glibc might want too eventually drop the > > check. > > Ah, cool. Any build-time flexibility there? > > We may or may not be getting a new ABI that will use instructions not > supported by old processors. > > https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html > > Current ABI continues to work of course and be the default for some > time, but building for new one would give some opportunity to drop > such support for old procs, at least for glibc. What does "new ABI" entail to you? In the terminology I use with musl, "new ABI" and "new ISA level" are different things. You can compile (explicit -march or compiler default) binaries that won't run on older cpus due to use of new insns etc., but we consider it the same ABI if you can link code for an older/baseline ISA level with the newer-ISA-level object files, i.e. if the interface surface for linkage remains compatible. 
We also try to avoid gratuitous proliferation of different ABIs unless there's a strong underlying need (like addition of softfloat ABIs for archs that usually have FPU, or vice versa). In principle the same could be done for kernels except it's a bigger silent gotcha (possible ENOSYS in places where it shouldn't be able to happen rather than a trapping SIGILL or similar) and there's rarely any serious performance or size benefit to dropping support for older kernels. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 3:03 ` Rich Felker @ 2020-04-16 3:41 ` Nicholas Piggin 0 siblings, 0 replies; 62+ messages in thread From: Nicholas Piggin @ 2020-04-16 3:41 UTC (permalink / raw) To: Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool Excerpts from Rich Felker's message of April 16, 2020 1:03 pm: > On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote: >> > Not to mention the dcache line to access >> > __hwcap or whatever, and the icache lines to setup access TOC-relative >> > access to it. (Of course you could put a copy of its value in TLS at a >> > fixed offset, which would somewhat mitigate both.) >> > >> >> And finally, the HWCAP test can eventually go away in future. A vdso >> >> call can not. >> > >> > We support nearly arbitrarily old kernels (with limited functionality) >> > and hardware (with full functionality) and don't intend for that to >> > change, ever. But indeed glibc might want too eventually drop the >> > check. >> >> Ah, cool. Any build-time flexibility there? >> >> We may or may not be getting a new ABI that will use instructions not >> supported by old processors. >> >> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html >> >> Current ABI continues to work of course and be the default for some >> time, but building for new one would give some opportunity to drop >> such support for old procs, at least for glibc. > > What does "new ABI" entail to you? In the terminology I use with musl, > "new ABI" and "new ISA level" are different things. You can compile > (explicit -march or compiler default) binaries that won't run on older > cpus due to use of new insns etc., but we consider it the same ABI if > you can link code for an older/baseline ISA level with the > newer-ISA-level object files, i.e. if the interface surface for > linkage remains compatible. 
We also try to avoid gratuitous > proliferation of different ABIs unless there's a strong underlying > need (like addition of softfloat ABIs for archs that usually have FPU, > or vice versa). Yeah it will be a new ABI type that also requires a new ISA level. As far as I know (and I'm not on the toolchain side) there will be some call compatibility between the two, so it may be fine to continue with existing ABI for libc. But it's just something that comes to mind as a build-time cutover where we might be able to assume particular features. > In principle the same could be done for kernels except it's a bigger > silent gotcha (possible ENOSYS in places where it shouldn't be able to > happen rather than a trapping SIGILL or similar) and there's rarely > any serious performance or size benefit to dropping support for older > kernels. Right, I don't think it'd be a huge problem whatever way we go, compared with the cost of the system call. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 2:53 ` Nicholas Piggin 2020-04-16 3:03 ` Rich Felker @ 2020-04-16 20:18 ` Florian Weimer 1 sibling, 0 replies; 62+ messages in thread From: Florian Weimer @ 2020-04-16 20:18 UTC (permalink / raw) To: Nicholas Piggin via Libc-alpha Cc: Rich Felker, Nicholas Piggin, libc-dev, linuxppc-dev, musl * Nicholas Piggin via Libc-alpha: > We may or may not be getting a new ABI that will use instructions not > supported by old processors. > > https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html > > Current ABI continues to work of course and be the default for some > time, but building for new one would give some opportunity to drop > such support for old procs, at least for glibc. If I recall correctly, during last year's GNU Tools Cauldron, I think it was pretty clear that this was only to be used for intra-DSO ABIs, not cross-DSO optimization. Relocatable object files have an ABI, too, of course, so that's why there's ABI documentation needed. For cross-DSO optimization, the link editor would look at the DSO being linked in, check if it uses the -mfuture ABI, and apply some shortcuts. But at that point, if the DSO is swapped back to a version built without -mfuture, it no longer works with those newly linked binaries against the -mfuture version. Such a thing is a clear ABI bump, and based on what I remember from Cauldron, that is not the plan here. (I don't have any insider knowledge—I just don't want people to read this and think: gosh, yet another POWER ABI bump. But the PCREL stuff *is* exciting!) ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 0:16 ` Nicholas Piggin 2020-04-16 0:48 ` Rich Felker @ 2020-04-16 9:58 ` Szabolcs Nagy 2020-04-20 0:27 ` Nicholas Piggin 2020-04-16 15:21 ` Jeffrey Walton 2 siblings, 1 reply; 62+ messages in thread From: Szabolcs Nagy @ 2020-04-16 9:58 UTC (permalink / raw) To: Nicholas Piggin via Libc-alpha; +Cc: Rich Felker, libc-dev, linuxppc-dev, musl * Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]: > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline and then you could patch the one sc/scv instruction I suppose. how would that 'patch' work? there are many reasons why you don't want libc to write its .text ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 9:58 ` Szabolcs Nagy @ 2020-04-20 0:27 ` Nicholas Piggin 2020-04-20 1:29 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-20 0:27 UTC (permalink / raw) To: Nicholas Piggin via Libc-alpha, Szabolcs Nagy Cc: Rich Felker, libc-dev, linuxppc-dev, musl Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm: > * Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]: >> Well it would have to test HWCAP and patch in or branch to two >> completely different sequences including register save/restores yes. >> You could have the same asm and matching clobbers to put the sequence >> inline and then you could patch the one sc/scv instruction I suppose. > > how would that 'patch' work? > > there are many reasons why you don't > want libc to write its .text I guess I don't know what I'm talking about when it comes to libraries. Shame if there is no good way to load-time patch libc. It's orthogonal to the scv selection though -- if you don't patch you have to use a conditional or indirect branch however you implement it. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 0:27 ` Nicholas Piggin @ 2020-04-20 1:29 ` Rich Felker 2020-04-20 2:08 ` Nicholas Piggin 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-20 1:29 UTC (permalink / raw) To: Nicholas Piggin Cc: Nicholas Piggin via Libc-alpha, Szabolcs Nagy, libc-dev, linuxppc-dev, musl On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote: > Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm: > > * Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]: > >> Well it would have to test HWCAP and patch in or branch to two > >> completely different sequences including register save/restores yes. > >> You could have the same asm and matching clobbers to put the sequence > >> inline and then you could patch the one sc/scv instruction I suppose. > > > > how would that 'patch' work? > > > > there are many reasons why you don't > > want libc to write its .text > > I guess I don't know what I'm talking about when it comes to libraries. > Shame if there is no good way to load-time patch libc. It's orthogonal > to the scv selection though -- if you don't patch you have to > conditional or indirect branch however you implement it. Patched pages cannot be shared. The whole design of PIC and shared libraries is that the code("text")/rodata is immutable and shared and that only a minimal amount of data, packed tightly together (the GOT) has to exist per-instance. Also, allowing patching of executable pages is generally frowned upon these days because W^X is a desirable hardening property. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 1:29 ` Rich Felker @ 2020-04-20 2:08 ` Nicholas Piggin 2020-04-20 21:17 ` Szabolcs Nagy 0 siblings, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-20 2:08 UTC (permalink / raw) To: Rich Felker Cc: Nicholas Piggin via Libc-alpha, libc-dev, linuxppc-dev, musl, Szabolcs Nagy Excerpts from Rich Felker's message of April 20, 2020 11:29 am: > On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote: >> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm: >> > * Nicholas Piggin via Libc-alpha <libc-alpha@sourceware.org> [2020-04-16 10:16:54 +1000]: >> >> Well it would have to test HWCAP and patch in or branch to two >> >> completely different sequences including register save/restores yes. >> >> You could have the same asm and matching clobbers to put the sequence >> >> inline and then you could patch the one sc/scv instruction I suppose. >> > >> > how would that 'patch' work? >> > >> > there are many reasons why you don't >> > want libc to write its .text >> >> I guess I don't know what I'm talking about when it comes to libraries. >> Shame if there is no good way to load-time patch libc. It's orthogonal >> to the scv selection though -- if you don't patch you have to >> conditional or indirect branch however you implement it. > > Patched pages cannot be shared. The whole design of PIC and shared > libraries is that the code("text")/rodata is immutable and shared and > that only a minimal amount of data, packed tightly together (the GOT) > has to exist per-instance. Yeah the pages which were patched couldn't be shared across exec, which is a significant downside, unless you could group all patch sites into their own section and similarly pack it together (which has issues of being out of line). > > Also, allowing patching of executable pages is generally frowned upon > these days because W^X is a desirable hardening property. 
Right, it would want to be write-protected after being patched. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 2:08 ` Nicholas Piggin @ 2020-04-20 21:17 ` Szabolcs Nagy 2020-04-21 9:57 ` Florian Weimer 0 siblings, 1 reply; 62+ messages in thread From: Szabolcs Nagy @ 2020-04-20 21:17 UTC (permalink / raw) To: Nicholas Piggin Cc: Rich Felker, Nicholas Piggin via Libc-alpha, libc-dev, linuxppc-dev, musl * Nicholas Piggin <npiggin@gmail.com> [2020-04-20 12:08:36 +1000]: > Excerpts from Rich Felker's message of April 20, 2020 11:29 am: > > Also, allowing patching of executable pages is generally frowned upon > > these days because W^X is a desirable hardening property. > > Right, it would want be write-protected after being patched. "frowned upon" means that users may have to update their security policy setting in pax, selinux, apparmor, seccomp bpf filters and who knows what else that may monitor and flag W&X mprotect. libc update can break systems if the new libc does W&X. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 21:17 ` Szabolcs Nagy @ 2020-04-21 9:57 ` Florian Weimer 0 siblings, 0 replies; 62+ messages in thread From: Florian Weimer @ 2020-04-21 9:57 UTC (permalink / raw) To: Nicholas Piggin Cc: Rich Felker, Nicholas Piggin via Libc-alpha, libc-dev, linuxppc-dev, musl * Szabolcs Nagy: > * Nicholas Piggin <npiggin@gmail.com> [2020-04-20 12:08:36 +1000]: >> Excerpts from Rich Felker's message of April 20, 2020 11:29 am: >> > Also, allowing patching of executable pages is generally frowned upon >> > these days because W^X is a desirable hardening property. >> >> Right, it would want be write-protected after being patched. > > "frowned upon" means that users may have to update > their security policy setting in pax, selinux, apparmor, > seccomp bpf filters and who knows what else that may > monitor and flag W&X mprotect. > > libc update can break systems if the new libc does W&X. It's possible to map over pre-compiled alternative implementations, though. Basically, we would do the patching at build time and store the results in the file. It works best if the variance is concentrated on a few pages, and there are very few alternatives. For example, having two syscall APIs and supporting threading and no-threading versions would need four code versions in total, which is likely excessive. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 0:16 ` Nicholas Piggin 2020-04-16 0:48 ` Rich Felker 2020-04-16 9:58 ` Szabolcs Nagy @ 2020-04-16 15:21 ` Jeffrey Walton 2020-04-16 15:40 ` Rich Felker 2 siblings, 1 reply; 62+ messages in thread From: Jeffrey Walton @ 2020-04-16 15:21 UTC (permalink / raw) To: musl; +Cc: libc-alpha, libc-dev, linuxppc-dev On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > >> I would like to enable Linux support for the powerpc 'scv' instruction, > >> as a faster system call instruction. > >> > >> This requires two things to be defined: Firstly a way to advertise to > >> userspace that kernel supports scv, and a way to allocate and advertise > >> support for individual scv vectors. Secondly, a calling convention ABI > >> for this new instruction. > >> ... > > Note that any libc that actually makes use of the new functionality is > > not going to be able to make clobbers conditional on support for it; > > branching around different clobbers is going to defeat any gains vs > > always just treating anything clobbered by either method as clobbered. > > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline and then you could patch the one sc/scv instruction I suppose. Could GCC function multiversioning work here? https://gcc.gnu.org/wiki/FunctionMultiVersioning It seems like selecting a runtime version of a function is the sort of thing you are trying to do. Jeff ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 15:21 ` Jeffrey Walton @ 2020-04-16 15:40 ` Rich Felker 0 siblings, 0 replies; 62+ messages in thread From: Rich Felker @ 2020-04-16 15:40 UTC (permalink / raw) To: Jeffrey Walton; +Cc: musl, libc-alpha, libc-dev, linuxppc-dev On Thu, Apr 16, 2020 at 11:21:56AM -0400, Jeffrey Walton wrote: > On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > > > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > > >> I would like to enable Linux support for the powerpc 'scv' instruction, > > >> as a faster system call instruction. > > >> > > >> This requires two things to be defined: Firstly a way to advertise to > > >> userspace that kernel supports scv, and a way to allocate and advertise > > >> support for individual scv vectors. Secondly, a calling convention ABI > > >> for this new instruction. > > >> ... > > > Note that any libc that actually makes use of the new functionality is > > > not going to be able to make clobbers conditional on support for it; > > > branching around different clobbers is going to defeat any gains vs > > > always just treating anything clobbered by either method as clobbered. > > > > Well it would have to test HWCAP and patch in or branch to two > > completely different sequences including register save/restores yes. > > You could have the same asm and matching clobbers to put the sequence > > inline and then you could patch the one sc/scv instruction I suppose. > > Could GCC function multiversioning work here? > https://gcc.gnu.org/wiki/FunctionMultiVersioning > > It seems like selecting a runtime version of a function is the sort of > thing you are trying to do. On glibc it potentially could. This is ifunc-based functionality though and musl explicitly does not (and will not) support ifunc because of lots of fundamental problems it entails. 
But even on glibc the underlying mechanisms for ifunc are just the same as a normal indirect call and there's no real reason to prefer implementing it with ifunc/multiversioning vs directly. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-15 22:55 ` [musl] " Rich Felker 2020-04-16 0:16 ` Nicholas Piggin @ 2020-04-16 4:48 ` Florian Weimer 2020-04-16 15:35 ` Rich Felker 2020-04-16 14:16 ` Adhemerval Zanella 2 siblings, 1 reply; 62+ messages in thread From: Florian Weimer @ 2020-04-16 4:48 UTC (permalink / raw) To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev * Rich Felker: > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. The i386 mechanism has received some criticism because it provides an effective means to redirect execution flow to anyone who can write to the TCB. I am not sure if it makes sense to copy it. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 4:48 ` Florian Weimer @ 2020-04-16 15:35 ` Rich Felker 2020-04-16 16:42 ` Florian Weimer 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-16 15:35 UTC (permalink / raw) To: Florian Weimer; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: > * Rich Felker: > > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. > > The i386 mechanism has received some criticism because it provides an > effective means to redirect execution flow to anyone who can write to > the TCB. I am not sure if it makes sense to copy it. Indeed that's a good point. Do you have ideas for making it equally efficient without use of a function pointer in the TCB? Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 15:35 ` Rich Felker @ 2020-04-16 16:42 ` Florian Weimer 2020-04-16 16:52 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Florian Weimer @ 2020-04-16 16:42 UTC (permalink / raw) To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev * Rich Felker: > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: >> * Rich Felker: >> >> > My preference would be that it work just like the i386 AT_SYSINFO >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> > provides a stub in the vdso that performs either scv or the old >> > mechanism with the same calling convention. >> >> The i386 mechanism has received some criticism because it provides an >> effective means to redirect execution flow to anyone who can write to >> the TCB. I am not sure if it makes sense to copy it. > > Indeed that's a good point. Do you have ideas for making it equally > efficient without use of a function pointer in the TCB? We could add a shared non-writable mapping at a 64K offset from the thread pointer and store the function pointer or the code there. Then it would be safe. However, since this is apparently tied to POWER9 and we already have a POWER9 multilib, and assuming that we are going to backport the kernel change, I would tweak the selection criterion for that multilib to include the new HWCAP2 flag. If a user runs this glibc on a kernel which does not have support, they will get the baseline (POWER8) multilib, which still works. This way, outside the dynamic loader, no run-time dispatch is needed at all. I guess this is not at all the answer you were looking for. 8-) If a single binary is needed, I would perhaps follow what Arm did for -moutline-atomics: lay out the code so that it's easy to execute for the non-POWER9 case, assuming that POWER9 machines will be better at predicting things than their predecessors. 
Or you could also put the function pointer into a RELRO segment. Then there's overlap with the __libc_single_threaded discussion, where people objected to this kind of optimization (although I did not propose to change the TCB ABI, that would be required for __libc_single_threaded because it's an external interface). ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 16:42 ` Florian Weimer @ 2020-04-16 16:52 ` Rich Felker 2020-04-16 18:12 ` Florian Weimer 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-16 16:52 UTC (permalink / raw) To: Florian Weimer; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote: > * Rich Felker: > > > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: > >> * Rich Felker: > >> > >> > My preference would be that it work just like the i386 AT_SYSINFO > >> > where you just replace "int $128" with "call *%%gs:16" and the kernel > >> > provides a stub in the vdso that performs either scv or the old > >> > mechanism with the same calling convention. > >> > >> The i386 mechanism has received some criticism because it provides an > >> effective means to redirect execution flow to anyone who can write to > >> the TCB. I am not sure if it makes sense to copy it. > > > > Indeed that's a good point. Do you have ideas for making it equally > > efficient without use of a function pointer in the TCB? > > We could add a shared non-writable mapping at a 64K offset from the > thread pointer and store the function pointer or the code there. Then > it would be safe. > > However, since this is apparently tied to POWER9 and we already have a > POWER9 multilib, and assuming that we are going to backport the kernel > change, I would tweak the selection criterion for that multilib to > include the new HWCAP2 flag. If a user runs this glibc on a kernel > which does not have support, they will get set baseline (POWER8) > multilib, which still works. This way, outside the dynamic loader, no > run-time dispatch is needed at all. I guess this is not at all the > answer you were looking for. 8-) How does this work with -static? 
:-) > If a single binary is needed, I would perhaps follow what Arm did for > -moutline-atomics: lay out the code so that its easy to execute for > the non-POWER9 case, assuming that POWER9 machines will be better at > predicting things than their predecessors. > > Or you could also put the function pointer into a RELRO segment. Then > there's overlap with the __libc_single_threaded discussion, where > people objected to this kind of optimization (although I did not > propose to change the TCB ABI, that would be required for > __libc_single_threaded because it's an external interface). Of course you can use a normal global, but now every call point needs to set up a TOC pointer (= two entry points and more icache lines for otherwise trivial functions). I think my choice would be just making the inline syscall be a single call insn to an asm source file that out-of-lines the loading of TOC pointer and call through it or branch based on hwcap so that it's not repeated all over the place. Alternatively, it would perhaps work to just put hwcap in the TCB and branch on it rather than making an indirect call to a function pointer in the TCB, so that the worst you could do by clobbering it is execute the wrong syscall insn and thereby get SIGILL. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 16:52 ` Rich Felker @ 2020-04-16 18:12 ` Florian Weimer 2020-04-16 23:02 ` Segher Boessenkool 0 siblings, 1 reply; 62+ messages in thread From: Florian Weimer @ 2020-04-16 18:12 UTC (permalink / raw) To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev * Rich Felker: > On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote: >> * Rich Felker: >> >> > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: >> >> * Rich Felker: >> >> >> >> > My preference would be that it work just like the i386 AT_SYSINFO >> >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> >> > provides a stub in the vdso that performs either scv or the old >> >> > mechanism with the same calling convention. >> >> >> >> The i386 mechanism has received some criticism because it provides an >> >> effective means to redirect execution flow to anyone who can write to >> >> the TCB. I am not sure if it makes sense to copy it. >> > >> > Indeed that's a good point. Do you have ideas for making it equally >> > efficient without use of a function pointer in the TCB? >> >> We could add a shared non-writable mapping at a 64K offset from the >> thread pointer and store the function pointer or the code there. Then >> it would be safe. >> >> However, since this is apparently tied to POWER9 and we already have a >> POWER9 multilib, and assuming that we are going to backport the kernel >> change, I would tweak the selection criterion for that multilib to >> include the new HWCAP2 flag. If a user runs this glibc on a kernel >> which does not have support, they will get set baseline (POWER8) >> multilib, which still works. This way, outside the dynamic loader, no >> run-time dispatch is needed at all. I guess this is not at all the >> answer you were looking for. 8-) > > How does this work with -static? :-) -static is not supported. 
8-) (If you use the unsupported static libraries, you get POWER8 code.) (Just to be clear, in case someone doesn't get the joke: This is about a potential approach for a heavily constrained, vertically integrated environment. It does not reflect general glibc recommendations.) >> If a single binary is needed, I would perhaps follow what Arm did for >> -moutline-atomics: lay out the code so that its easy to execute for >> the non-POWER9 case, assuming that POWER9 machines will be better at >> predicting things than their predecessors. >> >> Or you could also put the function pointer into a RELRO segment. Then >> there's overlap with the __libc_single_threaded discussion, where >> people objected to this kind of optimization (although I did not >> propose to change the TCB ABI, that would be required for >> __libc_single_threaded because it's an external interface). > > Of course you can use a normal global, but now every call point needs > to setup a TOC pointer (= two entry points and more icache lines for > otherwise trivial functions). > > I think my choice would be just making the inline syscall be a single > call insn to an asm source file that out-of-lines the loading of TOC > pointer and call through it or branch based on hwcap so that it's not > repeated all over the place. I don't know how problematic control flow out of an inline asm is on POWER. But this is basically the -moutline-atomics approach. > Alternatively, it would perhaps work to just put hwcap in the TCB and > branch on it rather than making an indirect call to a function pointer > in the TCB, so that the worst you could do by clobbering it is execute > the wrong syscall insn and thereby get SIGILL. The HWCAP is already in the TCB. I expect this is what generic glibc builds are going to use (perhaps with a bit of tweaking favorable to POWER8 implementations, but we'll see). ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 18:12 ` Florian Weimer @ 2020-04-16 23:02 ` Segher Boessenkool 2020-04-17 0:34 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Segher Boessenkool @ 2020-04-16 23:02 UTC (permalink / raw) To: Florian Weimer Cc: Rich Felker, musl, libc-alpha, linuxppc-dev, Nicholas Piggin, libc-dev On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > I think my choice would be just making the inline syscall be a single > > call insn to an asm source file that out-of-lines the loading of TOC > > pointer and call through it or branch based on hwcap so that it's not > > repeated all over the place. > > I don't know how problematic control flow out of an inline asm is on > POWER. But this is basically the -moutline-atomics approach. Control flow out of inline asm (other than with "asm goto") is not allowed at all, just like on any other target (and will not work in practice, either -- just like on any other target). But the suggestion was to use actual assembler code, not inline asm? Segher ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 23:02 ` Segher Boessenkool @ 2020-04-17 0:34 ` Rich Felker 2020-04-17 1:48 ` Segher Boessenkool 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-17 0:34 UTC (permalink / raw) To: Segher Boessenkool Cc: Florian Weimer, musl, libc-alpha, linuxppc-dev, Nicholas Piggin, libc-dev On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > > I think my choice would be just making the inline syscall be a single > > > call insn to an asm source file that out-of-lines the loading of TOC > > > pointer and call through it or branch based on hwcap so that it's not > > > repeated all over the place. > > > > I don't know how problematic control flow out of an inline asm is on > > POWER. But this is basically the -moutline-atomics approach. > > Control flow out of inline asm (other than with "asm goto") is not > allowed at all, just like on any other target (and will not work in > practice, either -- just like on any other target). But the suggestion > was to use actual assembler code, not inline asm? Calling it control flow out of inline asm is something of a misnomer. The enclosing state is not discarded or altered; the asm statement exits normally, reaching the next instruction in the enclosing block/function as soon as the call from the asm statement returns, with all register/clobber constraints satisfied. Control flow out of inline asm would be more like longjmp, and it can be valid -- for instance, you can implement coroutines this way (assuming you switch stack correctly) or do longjmp this way (jumping to the location saved by setjmp). But it's not what'd be happening here. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-17 0:34 ` Rich Felker @ 2020-04-17 1:48 ` Segher Boessenkool 2020-04-17 8:34 ` Florian Weimer 0 siblings, 1 reply; 62+ messages in thread From: Segher Boessenkool @ 2020-04-17 1:48 UTC (permalink / raw) To: Rich Felker Cc: Florian Weimer, musl, libc-alpha, linuxppc-dev, Nicholas Piggin, libc-dev On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote: > On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: > > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > > > I think my choice would be just making the inline syscall be a single > > > > call insn to an asm source file that out-of-lines the loading of TOC > > > > pointer and call through it or branch based on hwcap so that it's not > > > > repeated all over the place. > > > > > > I don't know how problematic control flow out of an inline asm is on > > > POWER. But this is basically the -moutline-atomics approach. > > > > Control flow out of inline asm (other than with "asm goto") is not > > allowed at all, just like on any other target (and will not work in > > practice, either -- just like on any other target). But the suggestion > > was to use actual assembler code, not inline asm? > > Calling it control flow out of inline asm is something of a misnomer. > The enclosing state is not discarded or altered; the asm statement > exits normally, reaching the next instruction in the enclosing > block/function as soon as the call from the asm statement returns, > with all register/clobber constraints satisfied. Ah. That should always Just Work, then -- our ABIs guarantee you can. > Control flow out of inline asm would be more like longjmp, and it can > be valid -- for instance, you can implement coroutines this way > (assuming you switch stack correctly) or do longjmp this way (jumping > to the location saved by setjmp). But it's not what'd be happening > here. 
Yeah, you cannot do that in C, not without making assumptions about what machine code the compiler generates. GCC explicitly disallows it, too: 'asm' statements may not perform jumps into other 'asm' statements, only to the listed GOTOLABELS. GCC's optimizers do not know about other jumps; therefore they cannot take account of them when deciding how to optimize. Segher ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-17 1:48 ` Segher Boessenkool @ 2020-04-17 8:34 ` Florian Weimer 0 siblings, 0 replies; 62+ messages in thread From: Florian Weimer @ 2020-04-17 8:34 UTC (permalink / raw) To: Segher Boessenkool Cc: Rich Felker, musl, libc-alpha, linuxppc-dev, Nicholas Piggin, libc-dev * Segher Boessenkool: > On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote: >> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: >> > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: >> > > > I think my choice would be just making the inline syscall be a single >> > > > call insn to an asm source file that out-of-lines the loading of TOC >> > > > pointer and call through it or branch based on hwcap so that it's not >> > > > repeated all over the place. >> > > >> > > I don't know how problematic control flow out of an inline asm is on >> > > POWER. But this is basically the -moutline-atomics approach. >> > >> > Control flow out of inline asm (other than with "asm goto") is not >> > allowed at all, just like on any other target (and will not work in >> > practice, either -- just like on any other target). But the suggestion >> > was to use actual assembler code, not inline asm? >> >> Calling it control flow out of inline asm is something of a misnomer. >> The enclosing state is not discarded or altered; the asm statement >> exits normally, reaching the next instruction in the enclosing >> block/function as soon as the call from the asm statement returns, >> with all register/clobber constraints satisfied. > > Ah. That should always Just Work, then -- our ABIs guarantee you can. After thinking about it, I agree: GCC will handle spilling of the link register. Branch-and-link instructions do not clobber the protected zone, so no stack adjustment is needed (which would be problematic to reflect in the unwind information). 
Of course, the target function has to be written in assembler because it must not use a regular stack frame. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-15 22:55 ` [musl] " Rich Felker 2020-04-16 0:16 ` Nicholas Piggin 2020-04-16 4:48 ` Florian Weimer @ 2020-04-16 14:16 ` Adhemerval Zanella 2020-04-16 15:37 ` Rich Felker 2 siblings, 1 reply; 62+ messages in thread From: Adhemerval Zanella @ 2020-04-16 14:16 UTC (permalink / raw) To: Rich Felker, Nicholas Piggin; +Cc: libc-alpha, musl, linuxppc-dev, libc-dev On 15/04/2020 19:55, Rich Felker wrote: > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> I would like to enable Linux support for the powerpc 'scv' instruction, >> as a faster system call instruction. >> >> This requires two things to be defined: Firstly a way to advertise to >> userspace that kernel supports scv, and a way to allocate and advertise >> support for individual scv vectors. Secondly, a calling convention ABI >> for this new instruction. >> >> Thanks to those who commented last time, since then I have removed my >> answered questions and unpopular alternatives but you can find them >> here >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> Let me try one more with a wider cc list, and then we'll get something >> merged. Any questions or counter-opinions are welcome. >> >> System Call Vectored (scv) ABI >> ============================== >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> rfscv counter-part. The benefit of these instructions is performance >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> updates. The scv instruction has 128 interrupt entry points (not enough >> to cover the Linux system call space). >> >> The proposal is to assign scv numbers very conservatively and allocate >> them as individual HWCAP features as we add support for more. The zero >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. 
>> >> Advertisement >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> SIGILL in current environments. Linux has defined a HWCAP2 bit >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> When scv instruction support and the scv 0 vector for system calls are >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors >> should not be used without future HWCAP bits indicating support, which is >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> return -ENOSYS in r3?) >> >> Calling convention >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> with the following differences from sc convention[1]: >> >> - LR is to be volatile across scv calls. This is necessary because the >> scv instruction clobbers LR. From previous discussion, this should be >> possible to deal with in GCC clobbers and CFI. >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> kernel system call exit to avoid restoring the CR register (although >> we probably still would anyway to avoid information leak). >> >> - Error handling: I think the consensus has been to move to using negative >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches >> most other architectures and is closer to a function call. >> >> The number of scratch registers (r9-r12) at kernel entry seems >> sufficient that we don't have any costly spilling, patch is here[2]. >> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html > > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. 
Then if the kernel doesn't
> provide it (because the kernel is too old) libc would have to provide
> its own stub that uses the legacy method and matches the calling
> convention of the one the kernel is expected to provide.

What about pthread cancellation and the requirement of checking the
cancellable syscall anchors in asynchronous cancellation? My plan is
still to use the musl strategy on glibc (BZ#12683), and for i686 it
requires always using the old int $128 for programs that use
cancellation (the static case) or just threads (dynamic mode, which
should be more common on glibc).

Using the i686 strategy of a vDSO bridge symbol would require always
falling back to 'sc' to keep the same cancellation strategy (and thus
defeat this optimization in such cases).

> Note that any libc that actually makes use of the new functionality is
> not going to be able to make clobbers conditional on support for it;
> branching around different clobbers is going to defeat any gains vs
> always just treating anything clobbered by either method as clobbered.
> Likewise, it's not useful to have different error return mechanisms
> because the caller just has to branch to support both (or the
> kernel-provided stub just has to emulate one for it; that could work
> if you really want to change the bad existing convention).
>
> Thoughts?
>
> Rich

^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 14:16 ` Adhemerval Zanella @ 2020-04-16 15:37 ` Rich Felker 2020-04-16 17:50 ` Adhemerval Zanella 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-16 15:37 UTC (permalink / raw) To: Adhemerval Zanella Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. Then if the kernel doesn't > > provide it (because the kernel is too old) libc would have to provide > > its own stub that uses the legacy method and matches the calling > > convention of the one the kernel is expected to provide. > > What about pthread cancellation and the requirement of checking the > cancellable syscall anchors in asynchronous cancellation? My plan is > still to use musl strategy on glibc (BZ#12683) and for i686 it > requires to always use old int$128 for program that uses cancellation > (static case) or just threads (dynamic mode, which should be more > common on glibc). > > Using the i686 strategy of a vDSO bridge symbol would require to always > fallback to 'sc' to still use the same cancellation strategy (and > thus defeating this optimization in such cases). Yes, I assumed it would be the same, ignoring the new syscall mechanism for cancellable syscalls. While there are some exceptions, cancellable syscalls are generally not hot paths but things that are expected to block and to have significant amounts of work to do in kernelspace, so saving a few tens of cycles is rather pointless. 
It's possible to do a branch/multiple versions of the syscall asm for cancellation but would require extending the cancellation handler to support checking against multiple independent address ranges or using some alternate markup of them. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 15:37 ` Rich Felker @ 2020-04-16 17:50 ` Adhemerval Zanella 2020-04-16 17:59 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Adhemerval Zanella @ 2020-04-16 17:50 UTC (permalink / raw) To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On 16/04/2020 12:37, Rich Felker wrote: > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>> My preference would be that it work just like the i386 AT_SYSINFO >>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>> provides a stub in the vdso that performs either scv or the old >>> mechanism with the same calling convention. Then if the kernel doesn't >>> provide it (because the kernel is too old) libc would have to provide >>> its own stub that uses the legacy method and matches the calling >>> convention of the one the kernel is expected to provide. >> >> What about pthread cancellation and the requirement of checking the >> cancellable syscall anchors in asynchronous cancellation? My plan is >> still to use musl strategy on glibc (BZ#12683) and for i686 it >> requires to always use old int$128 for program that uses cancellation >> (static case) or just threads (dynamic mode, which should be more >> common on glibc). >> >> Using the i686 strategy of a vDSO bridge symbol would require to always >> fallback to 'sc' to still use the same cancellation strategy (and >> thus defeating this optimization in such cases). > > Yes, I assumed it would be the same, ignoring the new syscall > mechanism for cancellable syscalls. While there are some exceptions, > cancellable syscalls are generally not hot paths but things that are > expected to block and to have significant amounts of work to do in > kernelspace, so saving a few tens of cycles is rather pointless. 
>
> It's possible to do a branch/multiple versions of the syscall asm for
> cancellation but would require extending the cancellation handler to
> support checking against multiple independent address ranges or using
> some alternate markup of them.

The main issue is that, at least for glibc, dynamic linking is way more
common than static linking, and once the program becomes multithreaded
the fallback will always be used.

And besides the cancellation performance issue, a new bridge vDSO
mechanism will still require setting up some extra bridge for the case
of an older kernel. In the scheme you suggested:

__asm__("indirect call" ... with common clobbers);

The indirect call will be either the vDSO bridge or a libc-provided stub
that falls back to 'sc' for !PPC_FEATURE2_SCV. I am not sure this is
really a gain against:

if (hwcap & PPC_FEATURE2_SCV) {
  __asm__(... with some clobbers);
} else {
  __asm__(... with different clobbers);
}

Especially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a TCB
member (as we do on glibc) and if we could make the asm clever enough
not to require different clobbers (although I am not sure if that would
be possible).

^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 17:50 ` Adhemerval Zanella @ 2020-04-16 17:59 ` Rich Felker 2020-04-16 18:18 ` Adhemerval Zanella 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-16 17:59 UTC (permalink / raw) To: Adhemerval Zanella Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: > > > On 16/04/2020 12:37, Rich Felker wrote: > > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > >>> My preference would be that it work just like the i386 AT_SYSINFO > >>> where you just replace "int $128" with "call *%%gs:16" and the kernel > >>> provides a stub in the vdso that performs either scv or the old > >>> mechanism with the same calling convention. Then if the kernel doesn't > >>> provide it (because the kernel is too old) libc would have to provide > >>> its own stub that uses the legacy method and matches the calling > >>> convention of the one the kernel is expected to provide. > >> > >> What about pthread cancellation and the requirement of checking the > >> cancellable syscall anchors in asynchronous cancellation? My plan is > >> still to use musl strategy on glibc (BZ#12683) and for i686 it > >> requires to always use old int$128 for program that uses cancellation > >> (static case) or just threads (dynamic mode, which should be more > >> common on glibc). > >> > >> Using the i686 strategy of a vDSO bridge symbol would require to always > >> fallback to 'sc' to still use the same cancellation strategy (and > >> thus defeating this optimization in such cases). > > > > Yes, I assumed it would be the same, ignoring the new syscall > > mechanism for cancellable syscalls. 
While there are some exceptions, > > cancellable syscalls are generally not hot paths but things that are > > expected to block and to have significant amounts of work to do in > > kernelspace, so saving a few tens of cycles is rather pointless. > > > > It's possible to do a branch/multiple versions of the syscall asm for > > cancellation but would require extending the cancellation handler to > > support checking against multiple independent address ranges or using > > some alternate markup of them. > > The main issue is at least for glibc dynamic linking is way more common > than static linking and once the program become multithread the fallback > will be always used. I'm not relying on static linking optimizing out the cancellable version. I'm talking about how cancellable syscalls are pretty much all "heavy" operations to begin with where a few tens of cycles are in the realm of "measurement noise" relative to the dominating time costs. > And besides the cancellation performance issue, a new bridge vDSO mechanism > will still require to setup some extra bridge for the case of the older > kernel. In the scheme you suggested: > > __asm__("indirect call" ... with common clobbers); > > The indirect call will be either the vDSO bridge or an libc provided that > fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain > against: > > if (hwcap & PPC_FEATURE2_SCV) { > __asm__(... with some clobbers); > } else { > __asm__(... with different clobbers); > } If the indirect call can be made roughly as efficiently as the sc sequence now (which already have some cost due to handling the nasty error return convention, making the indirect call likely just as small or smaller), it's O(1) additional code size (and thus icache usage) rather than O(n) where n is number of syscall points. Of course it would work just as well (for avoiding O(n) growth) to have a direct call to out-of-line branch like you suggested. 
> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a > TCB member (as we do on glibc) and if we could make the asm clever > enough to not require different clobbers (although not sure if > it would be possible). The easy way not to require different clobbers is just using the union of the clobbers, no? Does the proposed new method clobber any call-saved registers that would make it painful (requiring new call frames to save them in)? Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 17:59 ` Rich Felker @ 2020-04-16 18:18 ` Adhemerval Zanella 2020-04-16 18:31 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Adhemerval Zanella @ 2020-04-16 18:18 UTC (permalink / raw) To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On 16/04/2020 14:59, Rich Felker wrote: > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 12:37, Rich Felker wrote: >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>>>> My preference would be that it work just like the i386 AT_SYSINFO >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>>>> provides a stub in the vdso that performs either scv or the old >>>>> mechanism with the same calling convention. Then if the kernel doesn't >>>>> provide it (because the kernel is too old) libc would have to provide >>>>> its own stub that uses the legacy method and matches the calling >>>>> convention of the one the kernel is expected to provide. >>>> >>>> What about pthread cancellation and the requirement of checking the >>>> cancellable syscall anchors in asynchronous cancellation? My plan is >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it >>>> requires to always use old int$128 for program that uses cancellation >>>> (static case) or just threads (dynamic mode, which should be more >>>> common on glibc). >>>> >>>> Using the i686 strategy of a vDSO bridge symbol would require to always >>>> fallback to 'sc' to still use the same cancellation strategy (and >>>> thus defeating this optimization in such cases). >>> >>> Yes, I assumed it would be the same, ignoring the new syscall >>> mechanism for cancellable syscalls. 
While there are some exceptions,
>>> cancellable syscalls are generally not hot paths but things that are
>>> expected to block and to have significant amounts of work to do in
>>> kernelspace, so saving a few tens of cycles is rather pointless.
>>
>> The main issue is that, at least for glibc, dynamic linking is way more
>> common than static linking, and once the program becomes multithreaded
>> the fallback will always be used.
>
> I'm not relying on static linking optimizing out the cancellable
> version. I'm talking about how cancellable syscalls are pretty much
> all "heavy" operations to begin with where a few tens of cycles are in
> the realm of "measurement noise" relative to the dominating time
> costs.

Yes, I am aware, but at the same time I am not sure how it plays out in
the real world. For instance, some workloads might issue kernel query
syscalls, such as recv, where buffer copying might not be the dominant
factor. So I see that if the idea is optimizing the syscall mechanism,
we should try to leverage it as a whole in libc.

>
>> And besides the cancellation performance issue, a new bridge vDSO
>> mechanism will still require setting up some extra bridge for the case
>> of an older kernel. In the scheme you suggested:
>>
>> __asm__("indirect call" ... with common clobbers);
>>
>> The indirect call will be either the vDSO bridge or a libc-provided stub
>> that falls back to 'sc' for !PPC_FEATURE2_SCV. I am not sure this is
>> really a gain against:
>>
>> if (hwcap & PPC_FEATURE2_SCV) {
>>   __asm__(... with some clobbers);
>> } else {
>>   __asm__(... with different clobbers);
>> }
>
> If the indirect call can be made roughly as efficiently as the sc
> sequence now (which already have some cost due to handling the nasty
> error return convention, making the indirect call likely just as small
> or smaller), it's O(1) additional code size (and thus icache usage)
> rather than O(n) where n is number of syscall points.
>
> Of course it would work just as well (for avoiding O(n) growth) to
> have a direct call to out-of-line branch like you suggested.

Yes, but does it really matter to optimize this specific usage case
for size? glibc, for instance, tries to leverage the syscall mechanism
by adding some complex pre-processor asm directives. It optimizes the
syscall code size in most cases. For instance, kill in the static case
generates on x86_64:

0000000000000000 <__kill>:
   0:	b8 3e 00 00 00       	mov    $0x3e,%eax
   5:	0f 05                	syscall
   7:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   d:	0f 83 00 00 00 00    	jae    13 <__kill+0x13>
  13:	c3                   	retq

While on musl:

0000000000000000 <kill>:
   0:	48 83 ec 08          	sub    $0x8,%rsp
   4:	48 63 ff             	movslq %edi,%rdi
   7:	48 63 f6             	movslq %esi,%rsi
   a:	b8 3e 00 00 00       	mov    $0x3e,%eax
   f:	0f 05                	syscall
  11:	48 89 c7             	mov    %rax,%rdi
  14:	e8 00 00 00 00       	callq  19 <kill+0x19>
  19:	5a                   	pop    %rdx
  1a:	c3                   	retq

But I hardly think it pays off the required code complexity. Same for
providing an O(1) bridge: it will require additional complexity to
write and set up correctly.

>> Especially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a
>> TCB member (as we do on glibc) and if we could make the asm clever
>> enough not to require different clobbers (although I am not sure if
>> that would be possible).
>
> The easy way not to require different clobbers is just using the union
> of the clobbers, no? Does the proposed new method clobber any
> call-saved registers that would make it painful (requiring new call
> frames to save them in)?

As far as I can tell, it should be ok.

^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 18:18 ` Adhemerval Zanella @ 2020-04-16 18:31 ` Rich Felker 2020-04-16 18:44 ` Rich Felker ` (2 more replies) 0 siblings, 3 replies; 62+ messages in thread From: Rich Felker @ 2020-04-16 18:31 UTC (permalink / raw) To: Adhemerval Zanella Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: > > > On 16/04/2020 14:59, Rich Felker wrote: > > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 16/04/2020 12:37, Rich Felker wrote: > >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > >>>>> My preference would be that it work just like the i386 AT_SYSINFO > >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel > >>>>> provides a stub in the vdso that performs either scv or the old > >>>>> mechanism with the same calling convention. Then if the kernel doesn't > >>>>> provide it (because the kernel is too old) libc would have to provide > >>>>> its own stub that uses the legacy method and matches the calling > >>>>> convention of the one the kernel is expected to provide. > >>>> > >>>> What about pthread cancellation and the requirement of checking the > >>>> cancellable syscall anchors in asynchronous cancellation? My plan is > >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it > >>>> requires to always use old int$128 for program that uses cancellation > >>>> (static case) or just threads (dynamic mode, which should be more > >>>> common on glibc). > >>>> > >>>> Using the i686 strategy of a vDSO bridge symbol would require to always > >>>> fallback to 'sc' to still use the same cancellation strategy (and > >>>> thus defeating this optimization in such cases). > >>> > >>> Yes, I assumed it would be the same, ignoring the new syscall > >>> mechanism for cancellable syscalls. 
While there are some exceptions, > >>> cancellable syscalls are generally not hot paths but things that are > >>> expected to block and to have significant amounts of work to do in > >>> kernelspace, so saving a few tens of cycles is rather pointless. > >>> > >>> It's possible to do a branch/multiple versions of the syscall asm for > >>> cancellation but would require extending the cancellation handler to > >>> support checking against multiple independent address ranges or using > >>> some alternate markup of them. > >> > >> The main issue is at least for glibc dynamic linking is way more common > >> than static linking and once the program become multithread the fallback > >> will be always used. > > > > I'm not relying on static linking optimizing out the cancellable > > version. I'm talking about how cancellable syscalls are pretty much > > all "heavy" operations to begin with where a few tens of cycles are in > > the realm of "measurement noise" relative to the dominating time > > costs. > > Yes I am aware, but at same time I am not sure how it plays on real world. > For instance, some workloads might issue kernel query syscalls, such as > recv, where buffer copying might not be dominant factor. So I see that if > the idea is optimizing syscall mechanism, we should try to leverage it > as whole in libc. Have you timed a minimal recv? I'm not assuming buffer copying is the dominant factor. I'm assuming the overhead of all the kernel layers involved is dominant. > >> And besides the cancellation performance issue, a new bridge vDSO mechanism > >> will still require to setup some extra bridge for the case of the older > >> kernel. In the scheme you suggested: > >> > >> __asm__("indirect call" ... with common clobbers); > >> > >> The indirect call will be either the vDSO bridge or an libc provided that > >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain > >> against: > >> > >> if (hwcap & PPC_FEATURE2_SCV) { > >> __asm__(... 
with some clobbers); > >> } else { > >> __asm__(... with different clobbers); > >> } > > > > If the indirect call can be made roughly as efficiently as the sc > > sequence now (which already have some cost due to handling the nasty > > error return convention, making the indirect call likely just as small > > or smaller), it's O(1) additional code size (and thus icache usage) > > rather than O(n) where n is number of syscall points. > > > > Of course it would work just as well (for avoiding O(n) growth) to > > have a direct call to out-of-line branch like you suggested. > > Yes, but does it really matter to optimize this specific usage case > for size? glibc, for instance, tries to leverage the syscall mechanism > by adding some complex pre-processor asm directives. It optimizes > the syscall code size in most cases. For instance, kill in static case > generates on x86_64: > > 0000000000000000 <__kill>: > 0: b8 3e 00 00 00 mov $0x3e,%eax > 5: 0f 05 syscall > 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> > 13: c3 retq > > While on musl: > > 0000000000000000 <kill>: > 0: 48 83 ec 08 sub $0x8,%rsp > 4: 48 63 ff movslq %edi,%rdi > 7: 48 63 f6 movslq %esi,%rsi > a: b8 3e 00 00 00 mov $0x3e,%eax > f: 0f 05 syscall > 11: 48 89 c7 mov %rax,%rdi > 14: e8 00 00 00 00 callq 19 <kill+0x19> > 19: 5a pop %rdx > 1a: c3 retq Wow that's some extraordinarily bad codegen going on by gcc... The sign-extension is semantically needed and I don't see a good way around it (glibc's asm is kinda a hack taking advantage of kernel not looking at high bits, I think), but the gratuitous stack adjustment and refusal to generate a tail call isn't. I'll see if we can track down what's going on and get it fixed. > But I hardly think it pays off the required code complexity. Some > for providing a O(1) bridge: this will require additional complexity > to write it and setup correctly. 
In some sense I agree, but inline instructions are a lot more expensive on ppc (being 32-bit each), and it might take out-of-lining anyway to get rid of stack frame setups if that ends up being a problem. > >> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a > >> TCB member (as we do on glibc) and if we could make the asm clever > >> enough to not require different clobbers (although not sure if > >> it would be possible). > > > > The easy way not to require different clobbers is just using the union > > of the clobbers, no? Does the proposed new method clobber any > > call-saved registers that would make it painful (requiring new call > > frames to save them in)? > > As far I can tell, it should be ok. Note that because lr is clobbered we need at least one normally call-clobbered register that's not syscall clobbered to save lr in. Otherwise stack frame setup is required to spill it. (And I'm not even sure if gcc does things right to avoid it by using a register -- we should check that I guess...) Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
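The two dispatch schemes being weighed above can be sketched in portable C. Everything here is illustrative: `do_scv`/`do_sc` stand in for the real ppc64 inline-asm sequences, and `have_scv` for a cached `PPC_FEATURE2_SCV` test; this is a sketch of the trade-off, not anyone's implementation.

```c
#include <assert.h>

/* Stand-ins for the two inline-asm sequences (hypothetical; a real
 * wrapper would be ppc64 asm issuing scv 0 or sc). */
static long do_scv(long nr, long a0) { return nr + a0; }
static long do_sc(long nr, long a0)  { return nr + a0; }

/* Would really be: getauxval(AT_HWCAP2) & PPC_FEATURE2_SCV, cached
 * (e.g. in a TCB member as glibc does). */
static int have_scv = 0;

/* Scheme 1: O(n) code size -- every syscall point carries the branch
 * and both asm sequences inline. */
static long syscall_inline(long nr, long a0)
{
    return have_scv ? do_scv(nr, a0) : do_sc(nr, a0);
}

/* Scheme 2: O(1) code size -- all syscall points share one indirect
 * call through a pointer selected once at startup (the AT_SYSINFO /
 * vDSO-bridge style). */
static long (*syscall_ptr)(long nr, long a0) = do_sc;

static void syscall_init(void)
{
    if (have_scv)
        syscall_ptr = do_scv;
}
```

The inline form repeats the branch at every call point; the pointer form (or a direct call to an out-of-line stub) pays one extra jump but keeps the per-site icache footprint constant.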
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 18:31 ` Rich Felker @ 2020-04-16 18:44 ` Rich Felker 2020-04-16 18:52 ` Adhemerval Zanella 2020-04-20 1:10 ` Nicholas Piggin 2 siblings, 0 replies; 62+ messages in thread From: Rich Felker @ 2020-04-16 18:44 UTC (permalink / raw) To: Adhemerval Zanella Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On Thu, Apr 16, 2020 at 02:31:51PM -0400, Rich Felker wrote: > > While on musl: > > > > 0000000000000000 <kill>: > > 0: 48 83 ec 08 sub $0x8,%rsp > > 4: 48 63 ff movslq %edi,%rdi > > 7: 48 63 f6 movslq %esi,%rsi > > a: b8 3e 00 00 00 mov $0x3e,%eax > > f: 0f 05 syscall > > 11: 48 89 c7 mov %rax,%rdi > > 14: e8 00 00 00 00 callq 19 <kill+0x19> > > 19: 5a pop %rdx > > 1a: c3 retq > > Wow that's some extraordinarily bad codegen going on by gcc... The > sign-extension is semantically needed and I don't see a good way > around it (glibc's asm is kinda a hack taking advantage of kernel not > looking at high bits, I think), but the gratuitous stack adjustment > and refusal to generate a tail call isn't. I'll see if we can track > down what's going on and get it fixed. It seems to be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14441 which I've updated with a comment about the above. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 18:31 ` Rich Felker 2020-04-16 18:44 ` Rich Felker @ 2020-04-16 18:52 ` Adhemerval Zanella 2020-04-20 0:46 ` Nicholas Piggin 2020-04-20 1:10 ` Nicholas Piggin 2 siblings, 1 reply; 62+ messages in thread From: Adhemerval Zanella @ 2020-04-16 18:52 UTC (permalink / raw) To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, musl, linuxppc-dev, libc-dev On 16/04/2020 15:31, Rich Felker wrote: > On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 14:59, Rich Felker wrote: >>> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >>>> >>>> >>>> On 16/04/2020 12:37, Rich Felker wrote: >>>>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>>>>>> My preference would be that it work just like the i386 AT_SYSINFO >>>>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>>>>>> provides a stub in the vdso that performs either scv or the old >>>>>>> mechanism with the same calling convention. Then if the kernel doesn't >>>>>>> provide it (because the kernel is too old) libc would have to provide >>>>>>> its own stub that uses the legacy method and matches the calling >>>>>>> convention of the one the kernel is expected to provide. >>>>>> >>>>>> What about pthread cancellation and the requirement of checking the >>>>>> cancellable syscall anchors in asynchronous cancellation? My plan is >>>>>> still to use musl strategy on glibc (BZ#12683) and for i686 it >>>>>> requires to always use old int$128 for program that uses cancellation >>>>>> (static case) or just threads (dynamic mode, which should be more >>>>>> common on glibc). >>>>>> >>>>>> Using the i686 strategy of a vDSO bridge symbol would require to always >>>>>> fallback to 'sc' to still use the same cancellation strategy (and >>>>>> thus defeating this optimization in such cases). 
>>>>> >>>>> Yes, I assumed it would be the same, ignoring the new syscall >>>>> mechanism for cancellable syscalls. While there are some exceptions, >>>>> cancellable syscalls are generally not hot paths but things that are >>>>> expected to block and to have significant amounts of work to do in >>>>> kernelspace, so saving a few tens of cycles is rather pointless. >>>>> >>>>> It's possible to do a branch/multiple versions of the syscall asm for >>>>> cancellation but would require extending the cancellation handler to >>>>> support checking against multiple independent address ranges or using >>>>> some alternate markup of them. >>>> >>>> The main issue is at least for glibc dynamic linking is way more common >>>> than static linking and once the program become multithread the fallback >>>> will be always used. >>> >>> I'm not relying on static linking optimizing out the cancellable >>> version. I'm talking about how cancellable syscalls are pretty much >>> all "heavy" operations to begin with where a few tens of cycles are in >>> the realm of "measurement noise" relative to the dominating time >>> costs. >> >> Yes I am aware, but at same time I am not sure how it plays on real world. >> For instance, some workloads might issue kernel query syscalls, such as >> recv, where buffer copying might not be dominant factor. So I see that if >> the idea is optimizing syscall mechanism, we should try to leverage it >> as whole in libc. > > Have you timed a minimal recv? I'm not assuming buffer copying is the > dominant factor. I'm assuming the overhead of all the kernel layers > involved is dominant. Not really, but reading the advantages of using 'scv' over 'sc' also does not outline the real expected gain. Taking into consideration that this should be a micro-optimization (focused on the syscall entry path), I think we should use it wherever possible. 
> >>>> And besides the cancellation performance issue, a new bridge vDSO mechanism >>>> will still require to setup some extra bridge for the case of the older >>>> kernel. In the scheme you suggested: >>>> >>>> __asm__("indirect call" ... with common clobbers); >>>> >>>> The indirect call will be either the vDSO bridge or an libc provided that >>>> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain >>>> against: >>>> >>>> if (hwcap & PPC_FEATURE2_SCV) { >>>> __asm__(... with some clobbers); >>>> } else { >>>> __asm__(... with different clobbers); >>>> } >>> >>> If the indirect call can be made roughly as efficiently as the sc >>> sequence now (which already have some cost due to handling the nasty >>> error return convention, making the indirect call likely just as small >>> or smaller), it's O(1) additional code size (and thus icache usage) >>> rather than O(n) where n is number of syscall points. >>> >>> Of course it would work just as well (for avoiding O(n) growth) to >>> have a direct call to out-of-line branch like you suggested. >> >> Yes, but does it really matter to optimize this specific usage case >> for size? glibc, for instance, tries to leverage the syscall mechanism >> by adding some complex pre-processor asm directives. It optimizes >> the syscall code size in most cases. For instance, kill in static case >> generates on x86_64: >> >> 0000000000000000 <__kill>: >> 0: b8 3e 00 00 00 mov $0x3e,%eax >> 5: 0f 05 syscall >> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax >> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> >> 13: c3 retq >> >> While on musl: >> >> 0000000000000000 <kill>: >> 0: 48 83 ec 08 sub $0x8,%rsp >> 4: 48 63 ff movslq %edi,%rdi >> 7: 48 63 f6 movslq %esi,%rsi >> a: b8 3e 00 00 00 mov $0x3e,%eax >> f: 0f 05 syscall >> 11: 48 89 c7 mov %rax,%rdi >> 14: e8 00 00 00 00 callq 19 <kill+0x19> >> 19: 5a pop %rdx >> 1a: c3 retq > > Wow that's some extraordinarily bad codegen going on by gcc... 
The > sign-extension is semantically needed and I don't see a good way > around it (glibc's asm is kinda a hack taking advantage of kernel not > looking at high bits, I think), but the gratuitous stack adjustment > and refusal to generate a tail call isn't. I'll see if we can track > down what's going on and get it fixed. Wrt glibc, that is most likely the case, and it has bitten us on the x32 port recently (where some types were being passed incorrectly). In any case, my long term plan is to also get rid of this nasty assembly pre-processor on syscall passing. > >> But I hardly think it pays off the required code complexity. Some >> for providing a O(1) bridge: this will require additional complexity >> to write it and setup correctly. > > In some sense I agree, but inline instructions are a lot more > expensive on ppc (being 32-bit each), and it might take out-of-lining > anyway to get rid of stack frame setups if that ends up being a > problem. Indeed, I haven't started to prototype what would be required to make this change on glibc. Maybe an out-of-line helper might make sense. > >>>> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a >>>> TCB member (as we do on glibc) and if we could make the asm clever >>>> enough to not require different clobbers (although not sure if >>>> it would be possible). >>> >>> The easy way not to require different clobbers is just using the union >>> of the clobbers, no? Does the proposed new method clobber any >>> call-saved registers that would make it painful (requiring new call >>> frames to save them in)? >> >> As far I can tell, it should be ok. > > Note that because lr is clobbered we need at least once normally > call-clobbered register that's not syscall clobbered to save lr in. > Otherwise stack frame setup is required to spill it. (And I'm not even > sure if gcc does things right to avoid it by using a register -- we > should check that I guess...) If I recall correctly, Florian has found some issue with lr clobbering. 
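The "nasty error return convention" Rich refers to is the sc path's CR0[SO]-plus-positive-errno signalling, which libcs normalize into the Linux negative-errno form. A minimal sketch of that normalization step, roughly what musl's `__syscall_ret` does (a sketch of the logic, not musl's exact source):

```c
#include <assert.h>
#include <errno.h>

/* The raw syscall result register holds either a value or -errno;
 * unsigned values above -4096UL denote errors.  On ppc sc, a couple of
 * asm instructions first fold CR0[SO]+positive errno into this form;
 * with the proposed scv ABI the kernel would return it directly. */
static long syscall_ret(unsigned long r)
{
    if (r > -4096UL) {   /* r in [-4095, -1] when viewed as signed */
        errno = -r;
        return -1;
    }
    return r;
}
```

This is the per-call-site cost that makes an indirect call "likely just as small or smaller" than the current sc sequence.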
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 18:52 ` Adhemerval Zanella @ 2020-04-20 0:46 ` Nicholas Piggin 0 siblings, 0 replies; 62+ messages in thread From: Nicholas Piggin @ 2020-04-20 0:46 UTC (permalink / raw) To: Adhemerval Zanella, Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl Excerpts from Adhemerval Zanella's message of April 17, 2020 4:52 am: > > > On 16/04/2020 15:31, Rich Felker wrote: >> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >>> >>> >>> On 16/04/2020 14:59, Rich Felker wrote: >>>> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >>>>> >>>>> >>>>> On 16/04/2020 12:37, Rich Felker wrote: >>>>>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>>>>>>> My preference would be that it work just like the i386 AT_SYSINFO >>>>>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>>>>>>> provides a stub in the vdso that performs either scv or the old >>>>>>>> mechanism with the same calling convention. Then if the kernel doesn't >>>>>>>> provide it (because the kernel is too old) libc would have to provide >>>>>>>> its own stub that uses the legacy method and matches the calling >>>>>>>> convention of the one the kernel is expected to provide. >>>>>>> >>>>>>> What about pthread cancellation and the requirement of checking the >>>>>>> cancellable syscall anchors in asynchronous cancellation? My plan is >>>>>>> still to use musl strategy on glibc (BZ#12683) and for i686 it >>>>>>> requires to always use old int$128 for program that uses cancellation >>>>>>> (static case) or just threads (dynamic mode, which should be more >>>>>>> common on glibc). >>>>>>> >>>>>>> Using the i686 strategy of a vDSO bridge symbol would require to always >>>>>>> fallback to 'sc' to still use the same cancellation strategy (and >>>>>>> thus defeating this optimization in such cases). 
>>>>>> >>>>>> Yes, I assumed it would be the same, ignoring the new syscall >>>>>> mechanism for cancellable syscalls. While there are some exceptions, >>>>>> cancellable syscalls are generally not hot paths but things that are >>>>>> expected to block and to have significant amounts of work to do in >>>>>> kernelspace, so saving a few tens of cycles is rather pointless. >>>>>> >>>>>> It's possible to do a branch/multiple versions of the syscall asm for >>>>>> cancellation but would require extending the cancellation handler to >>>>>> support checking against multiple independent address ranges or using >>>>>> some alternate markup of them. >>>>> >>>>> The main issue is at least for glibc dynamic linking is way more common >>>>> than static linking and once the program become multithread the fallback >>>>> will be always used. >>>> >>>> I'm not relying on static linking optimizing out the cancellable >>>> version. I'm talking about how cancellable syscalls are pretty much >>>> all "heavy" operations to begin with where a few tens of cycles are in >>>> the realm of "measurement noise" relative to the dominating time >>>> costs. >>> >>> Yes I am aware, but at same time I am not sure how it plays on real world. >>> For instance, some workloads might issue kernel query syscalls, such as >>> recv, where buffer copying might not be dominant factor. So I see that if >>> the idea is optimizing syscall mechanism, we should try to leverage it >>> as whole in libc. >> >> Have you timed a minimal recv? I'm not assuming buffer copying is the >> dominant factor. I'm assuming the overhead of all the kernel layers >> involved is dominant. > > Not really, but reading the advantages of using 'scv' over 'sc' also does > not outline the real expect gain. Taking in consideration this should > be a micro-optimization (focused on entry syscall patch), I think we should > use where it possible. 
It's around 90 cycles of improvement. Depending on config options and speculative mitigations in place, that may be roughly 5-20% of a gettid syscall, which itself probably bears little relationship to what a recv syscall doing real work would do; it's easy to swamp it with other work. But it's a pretty big win in terms of how much we try to optimise this path. Thanks, Nick
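Taken at face value, those figures imply a rough total cost for the syscall itself. A trivial arithmetic check (the 90-cycle figure is from the message above; the rest is derived):

```c
#include <assert.h>

/* If a flat saving is a given fraction of a syscall's round-trip cost,
 * the implied total: 90 cycles at 20% -> ~450-cycle gettid, at 5% ->
 * ~1800 cycles, matching the "depends on mitigations" spread. */
static double implied_total_cycles(double saving, double fraction)
{
    return saving / fraction;
}
```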
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-16 18:31 ` Rich Felker 2020-04-16 18:44 ` Rich Felker 2020-04-16 18:52 ` Adhemerval Zanella @ 2020-04-20 1:10 ` Nicholas Piggin 2020-04-20 1:34 ` Rich Felker 2020-04-21 12:28 ` David Laight 2 siblings, 2 replies; 62+ messages in thread From: Nicholas Piggin @ 2020-04-20 1:10 UTC (permalink / raw) To: Adhemerval Zanella, Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 14:59, Rich Felker wrote: >> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >> >> >> >> >> >> On 16/04/2020 12:37, Rich Felker wrote: >> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >> >>>>> My preference would be that it work just like the i386 AT_SYSINFO >> >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel >> >>>>> provides a stub in the vdso that performs either scv or the old >> >>>>> mechanism with the same calling convention. Then if the kernel doesn't >> >>>>> provide it (because the kernel is too old) libc would have to provide >> >>>>> its own stub that uses the legacy method and matches the calling >> >>>>> convention of the one the kernel is expected to provide. >> >>>> >> >>>> What about pthread cancellation and the requirement of checking the >> >>>> cancellable syscall anchors in asynchronous cancellation? My plan is >> >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it >> >>>> requires to always use old int$128 for program that uses cancellation >> >>>> (static case) or just threads (dynamic mode, which should be more >> >>>> common on glibc). >> >>>> >> >>>> Using the i686 strategy of a vDSO bridge symbol would require to always >> >>>> fallback to 'sc' to still use the same cancellation strategy (and >> >>>> thus defeating this optimization in such cases). 
>> >>> >> >>> Yes, I assumed it would be the same, ignoring the new syscall >> >>> mechanism for cancellable syscalls. While there are some exceptions, >> >>> cancellable syscalls are generally not hot paths but things that are >> >>> expected to block and to have significant amounts of work to do in >> >>> kernelspace, so saving a few tens of cycles is rather pointless. >> >>> >> >>> It's possible to do a branch/multiple versions of the syscall asm for >> >>> cancellation but would require extending the cancellation handler to >> >>> support checking against multiple independent address ranges or using >> >>> some alternate markup of them. >> >> >> >> The main issue is at least for glibc dynamic linking is way more common >> >> than static linking and once the program become multithread the fallback >> >> will be always used. >> > >> > I'm not relying on static linking optimizing out the cancellable >> > version. I'm talking about how cancellable syscalls are pretty much >> > all "heavy" operations to begin with where a few tens of cycles are in >> > the realm of "measurement noise" relative to the dominating time >> > costs. >> >> Yes I am aware, but at same time I am not sure how it plays on real world. >> For instance, some workloads might issue kernel query syscalls, such as >> recv, where buffer copying might not be dominant factor. So I see that if >> the idea is optimizing syscall mechanism, we should try to leverage it >> as whole in libc. > > Have you timed a minimal recv? I'm not assuming buffer copying is the > dominant factor. I'm assuming the overhead of all the kernel layers > involved is dominant. > >> >> And besides the cancellation performance issue, a new bridge vDSO mechanism >> >> will still require to setup some extra bridge for the case of the older >> >> kernel. In the scheme you suggested: >> >> >> >> __asm__("indirect call" ... 
with common clobbers); >> >> >> >> The indirect call will be either the vDSO bridge or an libc provided that >> >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain >> >> against: >> >> >> >> if (hwcap & PPC_FEATURE2_SCV) { >> >> __asm__(... with some clobbers); >> >> } else { >> >> __asm__(... with different clobbers); >> >> } >> > >> > If the indirect call can be made roughly as efficiently as the sc >> > sequence now (which already have some cost due to handling the nasty >> > error return convention, making the indirect call likely just as small >> > or smaller), it's O(1) additional code size (and thus icache usage) >> > rather than O(n) where n is number of syscall points. >> > >> > Of course it would work just as well (for avoiding O(n) growth) to >> > have a direct call to out-of-line branch like you suggested. >> >> Yes, but does it really matter to optimize this specific usage case >> for size? glibc, for instance, tries to leverage the syscall mechanism >> by adding some complex pre-processor asm directives. It optimizes >> the syscall code size in most cases. For instance, kill in static case >> generates on x86_64: >> >> 0000000000000000 <__kill>: >> 0: b8 3e 00 00 00 mov $0x3e,%eax >> 5: 0f 05 syscall >> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax >> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> >> 13: c3 retq >> >> While on musl: >> >> 0000000000000000 <kill>: >> 0: 48 83 ec 08 sub $0x8,%rsp >> 4: 48 63 ff movslq %edi,%rdi >> 7: 48 63 f6 movslq %esi,%rsi >> a: b8 3e 00 00 00 mov $0x3e,%eax >> f: 0f 05 syscall >> 11: 48 89 c7 mov %rax,%rdi >> 14: e8 00 00 00 00 callq 19 <kill+0x19> >> 19: 5a pop %rdx >> 1a: c3 retq > > Wow that's some extraordinarily bad codegen going on by gcc... 
The > sign-extension is semantically needed and I don't see a good way > around it (glibc's asm is kinda a hack taking advantage of kernel not > looking at high bits, I think), but the gratuitous stack adjustment > and refusal to generate a tail call isn't. I'll see if we can track > down what's going on and get it fixed. > >> But I hardly think it pays off the required code complexity. Some >> for providing a O(1) bridge: this will require additional complexity >> to write it and setup correctly. > > In some sense I agree, but inline instructions are a lot more > expensive on ppc (being 32-bit each), and it might take out-of-lining > anyway to get rid of stack frame setups if that ends up being a > problem. > >> >> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a >> >> TCB member (as we do on glibc) and if we could make the asm clever >> >> enough to not require different clobbers (although not sure if >> >> it would be possible). >> > >> > The easy way not to require different clobbers is just using the union >> > of the clobbers, no? Does the proposed new method clobber any >> > call-saved registers that would make it painful (requiring new call >> > frames to save them in)? >> >> As far I can tell, it should be ok. > > Note that because lr is clobbered we need at least once normally > call-clobbered register that's not syscall clobbered to save lr in. > Otherwise stack frame setup is required to spill it. The kernel would like to use r9-r12 for itself. We could do with fewer registers, but we have some delay establishing the stack (depends on a load which depends on a mfspr), and entry code tends to be quite store heavy whereas on the caller side you have r1 set up (modulo stack updates), and the system call is a long delay during which time the store queue has significant time to drain. My feeling is it would be better for kernel to have these scratch registers. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 1:10 ` Nicholas Piggin @ 2020-04-20 1:34 ` Rich Felker 2020-04-20 2:32 ` Nicholas Piggin 2020-04-21 12:28 ` David Laight 1 sibling, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-20 1:34 UTC (permalink / raw) To: Nicholas Piggin Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > > Note that because lr is clobbered we need at least once normally > > call-clobbered register that's not syscall clobbered to save lr in. > > Otherwise stack frame setup is required to spill it. > > The kernel would like to use r9-r12 for itself. We could do with fewer > registers, but we have some delay establishing the stack (depends on a > load which depends on a mfspr), and entry code tends to be quite store > heavy whereas on the caller side you have r1 set up (modulo stack > updates), and the system call is a long delay during which time the > store queue has significant time to drain. > > My feeling is it would be better for kernel to have these scratch > registers. If your new kernel syscall mechanism requires the caller to make a whole stack frame it otherwise doesn't need and spill registers to it, it becomes a lot less attractive. Some of those 90 cycles saved are immediately lost on the userspace side, plus you either waste icache at the call point or require the syscall to go through a userspace-side helper function that performs the spill and restore. The right way to do this is to have the kernel preserve enough registers that userspace can avoid having any spills. It doesn't have to preserve everything, probably just enough to save lr. (BTW are syscall arg registers still preserved? If not, this is a major cost on the userspace side, since any call point that has to loop-and-retry (e.g. 
futex) now needs to make its own place to store the original values.) Rich
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 1:34 ` Rich Felker @ 2020-04-20 2:32 ` Nicholas Piggin 2020-04-20 4:09 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-20 2:32 UTC (permalink / raw) To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl Excerpts from Rich Felker's message of April 20, 2020 11:34 am: > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: >> > Note that because lr is clobbered we need at least once normally >> > call-clobbered register that's not syscall clobbered to save lr in. >> > Otherwise stack frame setup is required to spill it. >> >> The kernel would like to use r9-r12 for itself. We could do with fewer >> registers, but we have some delay establishing the stack (depends on a >> load which depends on a mfspr), and entry code tends to be quite store >> heavy whereas on the caller side you have r1 set up (modulo stack >> updates), and the system call is a long delay during which time the >> store queue has significant time to drain. >> >> My feeling is it would be better for kernel to have these scratch >> registers. > > If your new kernel syscall mechanism requires the caller to make a > whole stack frame it otherwise doesn't need and spill registers to it, > it becomes a lot less attractive. Some of those 90 cycles saved are > immediately lost on the userspace side, plus you either waste icache > at the call point or require the syscall to go through a > userspace-side helper function that performs the spill and restore. You would be surprised how few cycles that takes on a high end CPU. Some might be a couple of %. I am one for counting cycles mind you, I'm not being flippant about it. If we can come up with something faster I'd be up for it. 
> > The right way to do this is to have the kernel preserve enough > registers that userspace can avoid having any spills. It doesn't have > to preserve everything, probably just enough to save lr. (BTW are Again, the problem is the kernel doesn't have its dependencies immediately ready to spill, and spilling (may be) more costly immediately after the call because we're doing a lot of stores. I could try measure this. Unfortunately our pipeline simulator tool doesn't model system calls properly so it's hard to see what's happening across the user/kernel horizon, I might check if that can be improved or I can hack it by putting some isync in there or something. > syscall arg registers still preserved? If not, this is a major cost on > the userspace side, since any call point that has to loop-and-retry > (e.g. futex) now needs to make its own place to store the original > values.) Powerpc system calls never did. We could have scv preserve them, but you'd still need to restore r3. We could make an ABI which does not clobber r3 but puts the return value in r9, say. I'd like to see what the user side code looks like to take advantage of such a thing though. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
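To see what the user-side code looks like for a loop-and-retry caller when argument registers are clobbered, here is a portable sketch; `raw_futex_wait` is a hypothetical stand-in for the asm syscall wrapper (the stub here just fails with EINTR twice to drive the loop):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the asm syscall stub: returns -EINTR on
 * the first two calls, then 0, to exercise the retry path. */
static int stub_calls;
static long raw_futex_wait(int *uaddr, int val)
{
    (void)uaddr; (void)val;
    return ++stub_calls <= 2 ? -EINTR : 0;
}

/* With the sc/scv convention clobbering all argument registers, uaddr
 * and val must be kept in call-saved registers (or spilled to a stack
 * frame) so each retry can re-materialize them; an ABI that preserved
 * the arg registers would let the compiler reuse them in place. */
static int futex_wait_retry(int *uaddr, int val)
{
    long r;
    do {
        r = raw_futex_wait(uaddr, val);  /* args reloaded every pass */
    } while (r == -EINTR);
    return (int)r;
}
```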
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 2:32 ` Nicholas Piggin @ 2020-04-20 4:09 ` Rich Felker 2020-04-20 4:31 ` Nicholas Piggin 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-20 4:09 UTC (permalink / raw) To: Nicholas Piggin Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 20, 2020 11:34 am: > > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: > >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > >> > Note that because lr is clobbered we need at least once normally > >> > call-clobbered register that's not syscall clobbered to save lr in. > >> > Otherwise stack frame setup is required to spill it. > >> > >> The kernel would like to use r9-r12 for itself. We could do with fewer > >> registers, but we have some delay establishing the stack (depends on a > >> load which depends on a mfspr), and entry code tends to be quite store > >> heavy whereas on the caller side you have r1 set up (modulo stack > >> updates), and the system call is a long delay during which time the > >> store queue has significant time to drain. > >> > >> My feeling is it would be better for kernel to have these scratch > >> registers. > > > > If your new kernel syscall mechanism requires the caller to make a > > whole stack frame it otherwise doesn't need and spill registers to it, > > it becomes a lot less attractive. Some of those 90 cycles saved are > > immediately lost on the userspace side, plus you either waste icache > > at the call point or require the syscall to go through a > > userspace-side helper function that performs the spill and restore. > > You would be surprised how few cycles that takes on a high end CPU. Some > might be a couple of %. I am one for counting cycles mind you, I'm not > being flippant about it. 
If we can come up with something faster I'd be > up for it. If the cycle count is trivial then just do it on the kernel side. > > The right way to do this is to have the kernel preserve enough > > registers that userspace can avoid having any spills. It doesn't have > > to preserve everything, probably just enough to save lr. (BTW are > > Again, the problem is the kernel doesn't have its dependencies > immediately ready to spill, and spilling (may be) more costly > immediately after the call because we're doing a lot of stores. > > I could try measure this. Unfortunately our pipeline simulator tool > doesn't model system calls properly so it's hard to see what's happening > across the user/kernel horizon, I might check if that can be improved > or I can hack it by putting some isync in there or something. I think it's unlikely to make any real difference to the total number of cycles spent which side it happens on, but putting it on the kernel side makes it easier to avoid wasting size/icache at each syscall site. > > syscall arg registers still preserved? If not, this is a major cost on > > the userspace side, since any call point that has to loop-and-retry > > (e.g. futex) now needs to make its own place to store the original > > values.) > > Powerpc system calls never did. We could have scv preserve them, but > you'd still need to restore r3. We could make an ABI which does not > clobber r3 but puts the return value in r9, say. I'd like to see what > the user side code looks like to take advantage of such a thing though. Oh wow, I hadn't realized that, but indeed the code we have now is allowing for the kernel to clobber them all. So at least this isn't getting any worse I guess. I think it was a very poor choice of behavior though and a disadvantage vs what other archs do (some of them preserve all registers; others preserve only normally call-saved ones plus the syscall arg ones and possibly a few other specials). 
Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 4:09 ` Rich Felker @ 2020-04-20 4:31 ` Nicholas Piggin 2020-04-20 17:27 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-20 4:31 UTC (permalink / raw) To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl Excerpts from Rich Felker's message of April 20, 2020 2:09 pm: > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am: >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: >> >> > Note that because lr is clobbered we need at least once normally >> >> > call-clobbered register that's not syscall clobbered to save lr in. >> >> > Otherwise stack frame setup is required to spill it. >> >> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer >> >> registers, but we have some delay establishing the stack (depends on a >> >> load which depends on a mfspr), and entry code tends to be quite store >> >> heavy whereas on the caller side you have r1 set up (modulo stack >> >> updates), and the system call is a long delay during which time the >> >> store queue has significant time to drain. >> >> >> >> My feeling is it would be better for kernel to have these scratch >> >> registers. >> > >> > If your new kernel syscall mechanism requires the caller to make a >> > whole stack frame it otherwise doesn't need and spill registers to it, >> > it becomes a lot less attractive. Some of those 90 cycles saved are >> > immediately lost on the userspace side, plus you either waste icache >> > at the call point or require the syscall to go through a >> > userspace-side helper function that performs the spill and restore. >> >> You would be surprised how few cycles that takes on a high end CPU. Some >> might be a couple of %. 
I am one for counting cycles mind you, I'm not >> being flippant about it. If we can come up with something faster I'd be >> up for it. > > If the cycle count is trivial then just do it on the kernel side. The cycle count for userspace is trivial, because you have r1 ready. Kernel does not have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to save into. Which is also wasted work for userspace. Now that I think about it, no stack frame is even required! lr is saved into the caller's stack when it's clobbered with an asm, just as when it's used for a function call. >> > The right way to do this is to have the kernel preserve enough >> > registers that userspace can avoid having any spills. It doesn't have >> > to preserve everything, probably just enough to save lr. (BTW are >> >> Again, the problem is the kernel doesn't have its dependencies >> immediately ready to spill, and spilling (may be) more costly >> immediately after the call because we're doing a lot of stores. >> >> I could try measure this. Unfortunately our pipeline simulator tool >> doesn't model system calls properly so it's hard to see what's happening >> across the user/kernel horizon, I might check if that can be improved >> or I can hack it by putting some isync in there or something. > > I think it's unlikely to make any real difference to the total number > of cycles spent which side it happens on, but putting it on the kernel > side makes it easier to avoid wasting size/icache at each syscall > site. > >> > syscall arg registers still preserved? If not, this is a major cost on >> > the userspace side, since any call point that has to loop-and-retry >> > (e.g. futex) now needs to make its own place to store the original >> > values.) >> >> Powerpc system calls never did. We could have scv preserve them, but >> you'd still need to restore r3. We could make an ABI which does not >> clobber r3 but puts the return value in r9, say.
I'd like to see what >> the user side code looks like to take advantage of such a thing though. > > Oh wow, I hadn't realized that, but indeed the code we have now is > allowing for the kernel to clobber them all. So at least this isn't > getting any worse I guess. I think it was a very poor choice of > behavior though and a disadvantage vs what other archs do (some of > them preserve all registers; others preserve only normally call-saved > ones plus the syscall arg ones and possibly a few other specials). Well, we could change it. Does the generated code improve significantly if we take those clobbers away? Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 4:31 ` Nicholas Piggin @ 2020-04-20 17:27 ` Rich Felker 2020-04-22 6:18 ` Nicholas Piggin 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-20 17:27 UTC (permalink / raw) To: Nicholas Piggin Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 20, 2020 2:09 pm: > > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: > >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am: > >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: > >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > >> >> > Note that because lr is clobbered we need at least once normally > >> >> > call-clobbered register that's not syscall clobbered to save lr in. > >> >> > Otherwise stack frame setup is required to spill it. > >> >> > >> >> The kernel would like to use r9-r12 for itself. We could do with fewer > >> >> registers, but we have some delay establishing the stack (depends on a > >> >> load which depends on a mfspr), and entry code tends to be quite store > >> >> heavy whereas on the caller side you have r1 set up (modulo stack > >> >> updates), and the system call is a long delay during which time the > >> >> store queue has significant time to drain. > >> >> > >> >> My feeling is it would be better for kernel to have these scratch > >> >> registers. > >> > > >> > If your new kernel syscall mechanism requires the caller to make a > >> > whole stack frame it otherwise doesn't need and spill registers to it, > >> > it becomes a lot less attractive. Some of those 90 cycles saved are > >> > immediately lost on the userspace side, plus you either waste icache > >> > at the call point or require the syscall to go through a > >> > userspace-side helper function that performs the spill and restore. 
> >> > >> You would be surprised how few cycles that takes on a high end CPU. Some > >> might be a couple of %. I am one for counting cycles mind you, I'm not > >> being flippant about it. If we can come up with something faster I'd be > >> up for it. > > > > If the cycle count is trivial then just do it on the kernel side. > > The cycle count for user is, because you have r1 ready. Kernel does not > have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to > save into. > > Which is also wasted work for a userspace. > > Now that I think about it, no stack frame is even required! lr is saved > into the caller's stack when its clobbered with an asm, just as when > it's used for a function call. No. If there is a non-clobbered register, lr can be moved to the non-clobbered register rather than saved to the stack. However it looks like (1) gcc doesn't take advantage of that possibility, but (2) the caller already arranged for there to be space on the stack to save lr, so the cost is only one store and one load, not any stack adjustment or other frame setup. So it's probably not a really big deal. However, just adding "lr" clobber to existing syscall in musl increased the size of a simple syscall function (getuid) from 20 bytes to 36 bytes. > >> > syscall arg registers still preserved? If not, this is a major cost on > >> > the userspace side, since any call point that has to loop-and-retry > >> > (e.g. futex) now needs to make its own place to store the original > >> > values.) > >> > >> Powerpc system calls never did. We could have scv preserve them, but > >> you'd still need to restore r3. We could make an ABI which does not > >> clobber r3 but puts the return value in r9, say. I'd like to see what > >> the user side code looks like to take advantage of such a thing though. > > > > Oh wow, I hadn't realized that, but indeed the code we have now is > > allowing for the kernel to clobber them all. So at least this isn't > > getting any worse I guess. 
I think it was a very poor choice of > > behavior though and a disadvantage vs what other archs do (some of > > them preserve all registers; others preserve only normally call-saved > > ones plus the syscall arg ones and possibly a few other specials). > > Well, we could change it. Does the generated code improve significantly > we take those clobbers away? I'd have to experiment a bit more to see. It's not going to help at all in functions which are pure syscall wrappers that just do the syscall and return, since the arg regs are dead after the syscall anyway (the caller must assume they were clobbered). But where syscalls are inlined and used in a loop, like a futex wait, it might make a nontrivial difference. Unfortunately even if you did change it for the new scv mechanism, it would be hard to take advantage of the change while also supporting sc, unless we used a helper function that just did scv directly, but saved/restored all the arg regs when using the legacy sc mechanism. Just inlining the hwcap conditional and clobbering more regs in one code path than in the other likely would not help; gcc won't shrink-wrap the clobbered/non-clobbered paths separately, and even if it did, when this were inlined somewhere like a futex loop, it'd end up having to lift the conditional out of the loop to be very advantageous, then making the code much larger by producing two copies of the loop. So I think just behaving similarly to the old sc method is probably the best option we have... Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 17:27 ` Rich Felker @ 2020-04-22 6:18 ` Nicholas Piggin 2020-04-22 6:29 ` Nicholas Piggin 2020-04-23 2:36 ` Rich Felker 0 siblings, 2 replies; 62+ messages in thread From: Nicholas Piggin @ 2020-04-22 6:18 UTC (permalink / raw) To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl Excerpts from Rich Felker's message of April 21, 2020 3:27 am: > On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm: >> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: >> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am: >> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: >> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: >> >> >> > Note that because lr is clobbered we need at least once normally >> >> >> > call-clobbered register that's not syscall clobbered to save lr in. >> >> >> > Otherwise stack frame setup is required to spill it. >> >> >> >> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer >> >> >> registers, but we have some delay establishing the stack (depends on a >> >> >> load which depends on a mfspr), and entry code tends to be quite store >> >> >> heavy whereas on the caller side you have r1 set up (modulo stack >> >> >> updates), and the system call is a long delay during which time the >> >> >> store queue has significant time to drain. >> >> >> >> >> >> My feeling is it would be better for kernel to have these scratch >> >> >> registers. >> >> > >> >> > If your new kernel syscall mechanism requires the caller to make a >> >> > whole stack frame it otherwise doesn't need and spill registers to it, >> >> > it becomes a lot less attractive. 
Some of those 90 cycles saved are >> >> > immediately lost on the userspace side, plus you either waste icache >> >> > at the call point or require the syscall to go through a >> >> > userspace-side helper function that performs the spill and restore. >> >> >> >> You would be surprised how few cycles that takes on a high end CPU. Some >> >> might be a couple of %. I am one for counting cycles mind you, I'm not >> >> being flippant about it. If we can come up with something faster I'd be >> >> up for it. >> > >> > If the cycle count is trivial then just do it on the kernel side. >> >> The cycle count for user is, because you have r1 ready. Kernel does not >> have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to >> save into. >> >> Which is also wasted work for a userspace. >> >> Now that I think about it, no stack frame is even required! lr is saved >> into the caller's stack when its clobbered with an asm, just as when >> it's used for a function call. > > No. If there is a non-clobbered register, lr can be moved to the > non-clobbered register rather than saved to the stack. However it > looks like (1) gcc doesn't take advantage of that possibility, but (2) > the caller already arranged for there to be space on the stack to save > lr, so the cost is only one store and one load, not any stack > adjustment or other frame setup. So it's probably not a really big > deal. However, just adding "lr" clobber to existing syscall in musl > increased the size of a simple syscall function (getuid) from 20 bytes > to 36 bytes. Yeah I had a bit of a play around with musl (which is very nice code I must say). The powerpc64 syscall asm is missing ctr clobber by the way. Fortunately adding it doesn't change code generation for me, but it should be fixed. glibc had the same bug at one point I think (probably due to syscall ABI documentation not existing -- something now lives in linux/Documentation/powerpc/syscall64-abi.rst). 
Yes lr needs to be saved, I didn't see any new requirement for stack frames, and it was often already saved, but it does hurt the small wrapper functions. I did look at entirely replacing sc with scv though, just as an experiment. One day you might make sc optional! Text size improves by about 3kB with the proposed ABI. Mostly seems to be the bns+ ; neg sequence. __syscall1/2/3 get out-of-lined by the compiler in a lot of cases. Linux's bloat-o-meter says:

add/remove: 0/5 grow/shrink: 24/260 up/down: 220/-3428 (-3208)
Function                  old    new  delta
fcntl                     400    424    +24
popen                     600    620    +20
times                      32     40     +8
[...]
alloc_rev                 816    784    -32
alloc_fwd                 812    780    -32
__syscall1.constprop       32      -    -32
__fdopen                  504    472    -32
__expand_heap             628    592    -36
__syscall2                 40      -    -40
__syscall3                 44      -    -44
fchmodat                  372    324    -48
__wake.constprop          408    360    -48
child                    1116   1064    -52
checker                   220    156    -64
__bin_chunk              1576   1512    -64
malloc                   1940   1860    -80
__syscall3.constprop       96      -    -96
__syscall1                108      -   -108
Total: Before=613379, After=610171, chg -0.52%

Now if we go a step further we could preserve r0,r4-r8. That gives the kernel r9-r12 as scratch while leaving userspace with some spare volatile GPRs except in the uncommon syscall6 case.

static inline long __syscall0(long n)
{
	register long r0 __asm__("r0") = n;
	register long r3 __asm__("r3");
	__asm__ __volatile__("scv 0"
			     : "=r"(r3)
			     : "r"(r0)
			     : "memory", "cr0", "cr1", "cr5", "cr6", "cr7",
			       "lr", "ctr", "r9", "r10", "r11", "r12");
	return r3;
}

That saves another ~400 bytes, reducing some of the register shuffling for futex loops etc:

[...]
__pthread_cond_timedwait  964    944    -20
__expand_heap             592    572    -20
socketpair                292    268    -24
__wake.constprop          360    336    -24
malloc                   1860   1828    -32
__bin_chunk              1512   1472    -40
fcntl                     424    376    -48
Total: Before=610171, After=609723, chg -0.07%

As you say, the compiler doesn't do a good job of saving lr in a spare GPR unfortunately. Saving it ourselves to eliminate the lr clobber is no good because it's almost always already saved.
At least having non-clobbered volatile GPRs could let a future smarter compiler take advantage. If we go further and try to preserve r3 as well by putting the return value in r9 or r0, we go backwards about 300 bytes. It's good for the lock loops and complex functions, but hurts a lot of simpler functions that have to add 'mr r3,r9' etc. Most of the time there are saved non-volatile GPRs around anyway though, so not sure which way to go on this. Text size savings can't be ignored and it's pretty easy for the kernel to do (we already save r3-r8 and zero them on exit, so we could load them instead from a cache line that should be hot). So I may be inclined to go this way, even if we won't see benefit now. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-22 6:18 ` Nicholas Piggin @ 2020-04-22 6:29 ` Nicholas Piggin 2020-04-23 2:36 ` Rich Felker 1 sibling, 0 replies; 62+ messages in thread From: Nicholas Piggin @ 2020-04-22 6:29 UTC (permalink / raw) To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl Excerpts from Nicholas Piggin's message of April 22, 2020 4:18 pm: > If we go further and try to preserve r3 as well by putting the return > value in r9 or r0, we go backwards about 300 bytes. It's good for the > lock loops and complex functions, but hurts a lot of simpler functions > that have to add 'mr r3,r9' etc. > > Most of the time there are saved non-volatile GPRs around anyway though, > so not sure which way to go on this. Text size savings can't be ignored > and it's pretty easy for the kernel to do (we already save r3-r8 and > zero them on exit, so we could load them instead from cache line that's > should be hot). > > So I may be inclined to go this way, even if we won't see benefit now. By, "this way" I don't mean r9 or r0 return value (which is larger code), but r3 return value with r0,r4-r8 preserved. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-22 6:18 ` Nicholas Piggin 2020-04-22 6:29 ` Nicholas Piggin @ 2020-04-23 2:36 ` Rich Felker 2020-04-23 12:13 ` Adhemerval Zanella 2020-04-25 3:30 ` Nicholas Piggin 1 sibling, 2 replies; 62+ messages in thread From: Rich Felker @ 2020-04-23 2:36 UTC (permalink / raw) To: Nicholas Piggin Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > Yeah I had a bit of a play around with musl (which is very nice code I > must say). The powerpc64 syscall asm is missing ctr clobber by the way. > Fortunately adding it doesn't change code generation for me, but it > should be fixed. glibc had the same bug at one point I think (probably > due to syscall ABI documentation not existing -- something now lives in > linux/Documentation/powerpc/syscall64-abi.rst). Do you know anywhere I can read about the ctr issue, possibly the relevant glibc bug report? I'm not particularly familiar with ppc register file (at least I have to refamiliarize myself every time I work on this stuff) so it'd be nice to understand what's potentially-wrong now. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-23 2:36 ` Rich Felker @ 2020-04-23 12:13 ` Adhemerval Zanella 2020-04-23 16:18 ` Rich Felker 2020-04-25 3:30 ` Nicholas Piggin 1 sibling, 1 reply; 62+ messages in thread From: Adhemerval Zanella @ 2020-04-23 12:13 UTC (permalink / raw) To: Rich Felker, Nicholas Piggin; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl On 22/04/2020 23:36, Rich Felker wrote: > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >> Yeah I had a bit of a play around with musl (which is very nice code I >> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >> Fortunately adding it doesn't change code generation for me, but it >> should be fixed. glibc had the same bug at one point I think (probably >> due to syscall ABI documentation not existing -- something now lives in >> linux/Documentation/powerpc/syscall64-abi.rst). > > Do you know anywhere I can read about the ctr issue, possibly the > relevant glibc bug report? I'm not particularly familiar with ppc > register file (at least I have to refamiliarize myself every time I > work on this stuff) so it'd be nice to understand what's > potentially-wrong now. My understanding is the ctr issue only happens for vDSO calls, where the call falls back to a syscall on error (invalid argument, etc., assuming that if the vDSO does not fall back to a syscall it always succeeds). This makes the vDSO call on powerpc have the same ABI constraints as a syscall, where it clobbers CR0. On glibc we handle it by simulating a function call and analysing the CR0 result:

__asm__ __volatile__ ("mtctr %0\n\t"
                      "bctrl\n\t"
                      "mfcr %0\n\t"
                      "0:"
                      : "+r" (r0), "+r" (r3), "+r" (r4), "+r" (r5),
                        "+r" (r6), "+r" (r7), "+r" (r8)
                      : : "r9", "r10", "r11", "r12", "cr0", "ctr", "lr",
                          "memory");
__asm__ __volatile__ ("" : "=r" (rval) : "r" (r3));

On musl you don't have this issue because it does not enable vDSO support on powerpc.
And if it eventually does so via the VDSO_* macros, the only issue I see is when the vDSO falls back to the syscall and it also fails (the return code won't be negated, since musl uses a default C function pointer call which does not model the CR0 kernel ABI). So I think the extra ctr constraint on glibc powerpc syscall code is not really required. I think I have some patches to optimize this a bit based on previous discussions. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-23 12:13 ` Adhemerval Zanella @ 2020-04-23 16:18 ` Rich Felker 2020-04-23 16:35 ` Adhemerval Zanella 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-23 16:18 UTC (permalink / raw) To: Adhemerval Zanella Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: > > > On 22/04/2020 23:36, Rich Felker wrote: > > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > >> Yeah I had a bit of a play around with musl (which is very nice code I > >> must say). The powerpc64 syscall asm is missing ctr clobber by the way. > >> Fortunately adding it doesn't change code generation for me, but it > >> should be fixed. glibc had the same bug at one point I think (probably > >> due to syscall ABI documentation not existing -- something now lives in > >> linux/Documentation/powerpc/syscall64-abi.rst). > > > > Do you know anywhere I can read about the ctr issue, possibly the > > relevant glibc bug report? I'm not particularly familiar with ppc > > register file (at least I have to refamiliarize myself every time I > > work on this stuff) so it'd be nice to understand what's > > potentially-wrong now. > > My understanding is the ctr issue only happens for vDSO calls where it > fallback to a syscall in case an error (invalid argument, etc. and > assuming if vDSO does not fallback to a syscall it always succeed). > This makes the vDSO call on powerpc to have same same ABI constraint > as a syscall, where it clobbers CR0. I think you mean "vsyscall", the old thing glibc used where there are in-userspace implementations of some syscalls with call interfaces roughly equivalent to a syscall. musl has never used this. It only uses the actual exported functions from the vdso which have normal external function call ABI. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-23 16:18 ` Rich Felker @ 2020-04-23 16:35 ` Adhemerval Zanella 2020-04-23 16:43 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Adhemerval Zanella @ 2020-04-23 16:35 UTC (permalink / raw) To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl On 23/04/2020 13:18, Rich Felker wrote: > On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: >> >> >> On 22/04/2020 23:36, Rich Felker wrote: >>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >>>> Yeah I had a bit of a play around with musl (which is very nice code I >>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >>>> Fortunately adding it doesn't change code generation for me, but it >>>> should be fixed. glibc had the same bug at one point I think (probably >>>> due to syscall ABI documentation not existing -- something now lives in >>>> linux/Documentation/powerpc/syscall64-abi.rst). >>> >>> Do you know anywhere I can read about the ctr issue, possibly the >>> relevant glibc bug report? I'm not particularly familiar with ppc >>> register file (at least I have to refamiliarize myself every time I >>> work on this stuff) so it'd be nice to understand what's >>> potentially-wrong now. >> >> My understanding is the ctr issue only happens for vDSO calls where it >> fallback to a syscall in case an error (invalid argument, etc. and >> assuming if vDSO does not fallback to a syscall it always succeed). >> This makes the vDSO call on powerpc to have same same ABI constraint >> as a syscall, where it clobbers CR0. > > I think you mean "vsyscall", the old thing glibc used where there are > in-userspace implementations of some syscalls with call interfaces > roughly equivalent to a syscall. musl has never used this. It only > uses the actual exported functions from the vdso which have normal > external function call ABI. 
I wasn't thinking of vsyscall in fact, which afaik is an x86 thing. The issue is indeed when calling the powerpc-provided functions in the vDSO, which musl might want to do eventually. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-23 16:35 ` Adhemerval Zanella @ 2020-04-23 16:43 ` Rich Felker 2020-04-23 17:15 ` Adhemerval Zanella 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-23 16:43 UTC (permalink / raw) To: Adhemerval Zanella Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: > > > On 23/04/2020 13:18, Rich Felker wrote: > > On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 22/04/2020 23:36, Rich Felker wrote: > >>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > >>>> Yeah I had a bit of a play around with musl (which is very nice code I > >>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. > >>>> Fortunately adding it doesn't change code generation for me, but it > >>>> should be fixed. glibc had the same bug at one point I think (probably > >>>> due to syscall ABI documentation not existing -- something now lives in > >>>> linux/Documentation/powerpc/syscall64-abi.rst). > >>> > >>> Do you know anywhere I can read about the ctr issue, possibly the > >>> relevant glibc bug report? I'm not particularly familiar with ppc > >>> register file (at least I have to refamiliarize myself every time I > >>> work on this stuff) so it'd be nice to understand what's > >>> potentially-wrong now. > >> > >> My understanding is the ctr issue only happens for vDSO calls where it > >> fallback to a syscall in case an error (invalid argument, etc. and > >> assuming if vDSO does not fallback to a syscall it always succeed). > >> This makes the vDSO call on powerpc to have same same ABI constraint > >> as a syscall, where it clobbers CR0. > > > > I think you mean "vsyscall", the old thing glibc used where there are > > in-userspace implementations of some syscalls with call interfaces > > roughly equivalent to a syscall. musl has never used this. 
It only > > uses the actual exported functions from the vdso which have normal > > external function call ABI. > > I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. > The issue is indeed when calling the powerpc provided functions in > vDSO, which musl might want to do eventually. AIUI (at least this is true for all other archs) the functions have normal external function call ABI and calling them has nothing to do with syscall mechanisms. It looks like we're not using them right now and I'm not sure why. It could be that there are ABI mismatch issues (are 32-bit ones compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or just that nobody proposed adding them. Also as of 5.4 32-bit ppc lacked time64 versions of them; not sure if this is fixed yet. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-23 16:43 ` Rich Felker @ 2020-04-23 17:15 ` Adhemerval Zanella 2020-04-23 17:42 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Adhemerval Zanella @ 2020-04-23 17:15 UTC (permalink / raw) To: Rich Felker; +Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl On 23/04/2020 13:43, Rich Felker wrote: > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: >> >> >> On 23/04/2020 13:18, Rich Felker wrote: >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: >>>> >>>> >>>> On 22/04/2020 23:36, Rich Felker wrote: >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >>>>>> Fortunately adding it doesn't change code generation for me, but it >>>>>> should be fixed. glibc had the same bug at one point I think (probably >>>>>> due to syscall ABI documentation not existing -- something now lives in >>>>>> linux/Documentation/powerpc/syscall64-abi.rst). >>>>> >>>>> Do you know anywhere I can read about the ctr issue, possibly the >>>>> relevant glibc bug report? I'm not particularly familiar with ppc >>>>> register file (at least I have to refamiliarize myself every time I >>>>> work on this stuff) so it'd be nice to understand what's >>>>> potentially-wrong now. >>>> >>>> My understanding is the ctr issue only happens for vDSO calls where it >>>> fallback to a syscall in case an error (invalid argument, etc. and >>>> assuming if vDSO does not fallback to a syscall it always succeed). >>>> This makes the vDSO call on powerpc to have same same ABI constraint >>>> as a syscall, where it clobbers CR0. 
>>> I think you mean "vsyscall", the old thing glibc used where there are >>> in-userspace implementations of some syscalls with call interfaces >>> roughly equivalent to a syscall. musl has never used this. It only >>> uses the actual exported functions from the vdso which have normal >>> external function call ABI. >> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. >> The issue is indeed when calling the powerpc provided functions in >> vDSO, which musl might want to do eventually. > > AIUI (at least this is true for all other archs) the functions have > normal external function call ABI and calling them has nothing to do > with syscall mechanisms. My point is powerpc specifically does not follow it, since it issues a syscall in fallback and its semantics follow kernel syscalls (error signalled in cr0, r3 being always a positive value):

--
V_FUNCTION_BEGIN(__kernel_clock_gettime)
  .cfi_startproc
[...]
	/*
	 * syscall fallback
	 */
99:
	li	r0,__NR_clock_gettime
  .cfi_restore lr
	sc
	blr
  .cfi_endproc
V_FUNCTION_END(__kernel_clock_gettime)

> > It looks like we're not using them right now and I'm not sure why. It > could be that there are ABI mismatch issues (are 32-bit ones > compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or > just that nobody proposed adding them. Also as of 5.4 32-bit ppc > lacked time64 versions of them; not sure if this is fixed yet. For 64-bit it also has an issue where the vDSO does not provide an OPD for ELFv1, which has bitten glibc while trying to implement an ifunc optimization. I don't recall any issue for ELFv2. For 32-bit I am not sure secure-plt will change anything, at least not on powerpc where we use the same strategy as 64-bit and use a mtctr/bctr directly.
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-23 17:15 ` Adhemerval Zanella @ 2020-04-23 17:42 ` Rich Felker 2020-04-25 3:40 ` Nicholas Piggin 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-23 17:42 UTC (permalink / raw) To: Adhemerval Zanella Cc: Nicholas Piggin, libc-alpha, libc-dev, linuxppc-dev, musl On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote: > > > On 23/04/2020 13:43, Rich Felker wrote: > > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 23/04/2020 13:18, Rich Felker wrote: > >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: > >>>> > >>>> > >>>> On 22/04/2020 23:36, Rich Felker wrote: > >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I > >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. > >>>>>> Fortunately adding it doesn't change code generation for me, but it > >>>>>> should be fixed. glibc had the same bug at one point I think (probably > >>>>>> due to syscall ABI documentation not existing -- something now lives in > >>>>>> linux/Documentation/powerpc/syscall64-abi.rst). > >>>>> > >>>>> Do you know anywhere I can read about the ctr issue, possibly the > >>>>> relevant glibc bug report? I'm not particularly familiar with ppc > >>>>> register file (at least I have to refamiliarize myself every time I > >>>>> work on this stuff) so it'd be nice to understand what's > >>>>> potentially-wrong now. > >>>> > >>>> My understanding is the ctr issue only happens for vDSO calls where it > >>>> fallback to a syscall in case an error (invalid argument, etc. and > >>>> assuming if vDSO does not fallback to a syscall it always succeed). > >>>> This makes the vDSO call on powerpc to have same same ABI constraint > >>>> as a syscall, where it clobbers CR0. 
> >>> > >>> I think you mean "vsyscall", the old thing glibc used where there are > >>> in-userspace implementations of some syscalls with call interfaces > >>> roughly equivalent to a syscall. musl has never used this. It only > >>> uses the actual exported functions from the vdso which have normal > >>> external function call ABI. > >> > >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. > >> The issue is indeed when calling the powerpc provided functions in > >> vDSO, which musl might want to do eventually. > > > > AIUI (at least this is true for all other archs) the functions have > > normal external function call ABI and calling them has nothing to do > > with syscall mechanisms. > > My point is powerpc specifically does not follow it, since it issues a > syscall in fallback and its semantic follow kernel syscalls (error > signalled in cr0, r3 being always a positive value): Oh, then I think we'll just ignore these unless the kernel can make ones with a reasonable ABI. It's not worth having ppc-specific code for this... It would be really nice if ones that actually behave like functions could be added though. > -- > V_FUNCTION_BEGIN(__kernel_clock_gettime) > .cfi_startproc > [...] > /* > * syscall fallback > */ > 99: > li r0,__NR_clock_gettime > .cfi_restore lr > sc > blr > .cfi_endproc > V_FUNCTION_END(__kernel_clock_gettime) > > > > > > It looks like we're not using them right now and I'm not sure why. It > > could be that there are ABI mismatch issues (are 32-bit ones > > compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or > > just that nobody proposed adding them. Also as of 5.4 32-bit ppc > > lacked time64 versions of them; not sure if this is fixed yet. > > For 64-bit it also have an issue where vDSO does not provide an OPD > for ELFv1, which has bitten glibc while trying to implement an ifunc > optimization. I don't recall any issue for ELFv2. 
> > For 32-bit I am not sure secure-plt will change anything, at least not > on powerpc where we use the same strategy for 64-bit and use a > mtctr/bctr directly. Indeed, I don't think there's a secure-plt distinction unless you're making outgoing calls to possibly-cross-DSO functions. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-23 17:42 ` Rich Felker @ 2020-04-25 3:40 ` Nicholas Piggin 2020-04-25 4:52 ` Rich Felker 0 siblings, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-25 3:40 UTC (permalink / raw) To: Adhemerval Zanella, Rich Felker; +Cc: libc-alpha, libc-dev, linuxppc-dev, musl Excerpts from Rich Felker's message of April 24, 2020 3:42 am: > On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote: >> >> >> On 23/04/2020 13:43, Rich Felker wrote: >> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: >> >> >> >> >> >> On 23/04/2020 13:18, Rich Felker wrote: >> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: >> >>>> >> >>>> >> >>>> On 22/04/2020 23:36, Rich Felker wrote: >> >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >> >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I >> >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >> >>>>>> Fortunately adding it doesn't change code generation for me, but it >> >>>>>> should be fixed. glibc had the same bug at one point I think (probably >> >>>>>> due to syscall ABI documentation not existing -- something now lives in >> >>>>>> linux/Documentation/powerpc/syscall64-abi.rst). >> >>>>> >> >>>>> Do you know anywhere I can read about the ctr issue, possibly the >> >>>>> relevant glibc bug report? I'm not particularly familiar with ppc >> >>>>> register file (at least I have to refamiliarize myself every time I >> >>>>> work on this stuff) so it'd be nice to understand what's >> >>>>> potentially-wrong now. >> >>>> >> >>>> My understanding is the ctr issue only happens for vDSO calls where it >> >>>> fallback to a syscall in case an error (invalid argument, etc. and >> >>>> assuming if vDSO does not fallback to a syscall it always succeed). 
> >>>> This makes the vDSO call on powerpc to have same same ABI constraint >> >>>> as a syscall, where it clobbers CR0. >> >>> >> >>> I think you mean "vsyscall", the old thing glibc used where there are >> >>> in-userspace implementations of some syscalls with call interfaces >> >>> roughly equivalent to a syscall. musl has never used this. It only >> >>> uses the actual exported functions from the vdso which have normal >> >>> external function call ABI. >> >> >> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. >> >> The issue is indeed when calling the powerpc provided functions in >> >> vDSO, which musl might want to do eventually. >> > >> > AIUI (at least this is true for all other archs) the functions have >> > normal external function call ABI and calling them has nothing to do >> > with syscall mechanisms. >> >> My point is powerpc specifically does not follow it, since it issues a >> syscall in fallback and its semantic follow kernel syscalls (error >> signalled in cr0, r3 being always a positive value): > > Oh, then I think we'll just ignore these unless the kernel can make > ones with a reasonable ABI. It's not worth having ppc-specific code > for this... It would be really nice if ones that actually behave like > functions could be added though. Yeah, this is an annoyance for me: after making the scv ABI return -ve in r3 for errors and other things that more closely follow function calls, we still have the vdso functions using the old style. Maybe we should add function-call-style vdso entry points too. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-25 3:40 ` Nicholas Piggin @ 2020-04-25 4:52 ` Rich Felker 0 siblings, 0 replies; 62+ messages in thread From: Rich Felker @ 2020-04-25 4:52 UTC (permalink / raw) To: Nicholas Piggin Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl On Sat, Apr 25, 2020 at 01:40:24PM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 24, 2020 3:42 am: > > On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 23/04/2020 13:43, Rich Felker wrote: > >> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: > >> >> > >> >> > >> >> On 23/04/2020 13:18, Rich Felker wrote: > >> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: > >> >>>> > >> >>>> > >> >>>> On 22/04/2020 23:36, Rich Felker wrote: > >> >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > >> >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I > >> >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. > >> >>>>>> Fortunately adding it doesn't change code generation for me, but it > >> >>>>>> should be fixed. glibc had the same bug at one point I think (probably > >> >>>>>> due to syscall ABI documentation not existing -- something now lives in > >> >>>>>> linux/Documentation/powerpc/syscall64-abi.rst). > >> >>>>> > >> >>>>> Do you know anywhere I can read about the ctr issue, possibly the > >> >>>>> relevant glibc bug report? I'm not particularly familiar with ppc > >> >>>>> register file (at least I have to refamiliarize myself every time I > >> >>>>> work on this stuff) so it'd be nice to understand what's > >> >>>>> potentially-wrong now. > >> >>>> > >> >>>> My understanding is the ctr issue only happens for vDSO calls where it > >> >>>> fallback to a syscall in case an error (invalid argument, etc. 
and > >> >>>> assuming if vDSO does not fallback to a syscall it always succeed). > >> >>>> This makes the vDSO call on powerpc to have same same ABI constraint > >> >>>> as a syscall, where it clobbers CR0. > >> >>> > >> >>> I think you mean "vsyscall", the old thing glibc used where there are > >> >>> in-userspace implementations of some syscalls with call interfaces > >> >>> roughly equivalent to a syscall. musl has never used this. It only > >> >>> uses the actual exported functions from the vdso which have normal > >> >>> external function call ABI. > >> >> > >> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. > >> >> The issue is indeed when calling the powerpc provided functions in > >> >> vDSO, which musl might want to do eventually. > >> > > >> > AIUI (at least this is true for all other archs) the functions have > >> > normal external function call ABI and calling them has nothing to do > >> > with syscall mechanisms. > >> > >> My point is powerpc specifically does not follow it, since it issues a > >> syscall in fallback and its semantic follow kernel syscalls (error > >> signalled in cr0, r3 being always a positive value): > > > > Oh, then I think we'll just ignore these unless the kernel can make > > ones with a reasonable ABI. It's not worth having ppc-specific code > > for this... It would be really nice if ones that actually behave like > > functions could be added though. > > Yeah this is an annoyance for me after making the scv ABI return -ve in > r3 for error and other things that more closely follow function calls, > we still have the vdso functions using the old style. > > Maybe we should add function call style vdso too. Please do. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-23 2:36 ` Rich Felker 2020-04-23 12:13 ` Adhemerval Zanella @ 2020-04-25 3:30 ` Nicholas Piggin 1 sibling, 0 replies; 62+ messages in thread From: Nicholas Piggin @ 2020-04-25 3:30 UTC (permalink / raw) To: Rich Felker; +Cc: Adhemerval Zanella, libc-alpha, libc-dev, linuxppc-dev, musl Excerpts from Rich Felker's message of April 23, 2020 12:36 pm: > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >> Yeah I had a bit of a play around with musl (which is very nice code I >> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >> Fortunately adding it doesn't change code generation for me, but it >> should be fixed. glibc had the same bug at one point I think (probably >> due to syscall ABI documentation not existing -- something now lives in >> linux/Documentation/powerpc/syscall64-abi.rst). > > Do you know anywhere I can read about the ctr issue, possibly the > relevant glibc bug report? I'm not particularly familiar with ppc > register file (at least I have to refamiliarize myself every time I > work on this stuff) so it'd be nice to understand what's > potentially-wrong now. Ah, I was misremembering: glibc was (and still is) actually missing cr clobbers from its "vsyscall", probably because it copied syscall, which only clobbers cr0, whereas a vsyscall clobbers cr0-cr1 and cr5-cr7 like a normal function call. musl is missing the ctr register clobber from syscalls. powerpc has 32 GPRs (r0-r31), condition registers cr0-cr7, and the lr and ctr branch registers (lr is generally used for function returns, ctr for other indirect branches). ctr is volatile (caller-saved) across C function calls, and across sc system calls on Linux. Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-20 1:10 ` Nicholas Piggin 2020-04-20 1:34 ` Rich Felker @ 2020-04-21 12:28 ` David Laight 2020-04-21 14:39 ` Rich Felker 1 sibling, 1 reply; 62+ messages in thread From: David Laight @ 2020-04-21 12:28 UTC (permalink / raw) To: 'Nicholas Piggin', Adhemerval Zanella, Rich Felker Cc: libc-dev, libc-alpha, linuxppc-dev, musl From: Nicholas Piggin > Sent: 20 April 2020 02:10 ... > >> Yes, but does it really matter to optimize this specific usage case > >> for size? glibc, for instance, tries to leverage the syscall mechanism > >> by adding some complex pre-processor asm directives. It optimizes > >> the syscall code size in most cases. For instance, kill in static case > >> generates on x86_64: > >> > >> 0000000000000000 <__kill>: > >> 0: b8 3e 00 00 00 mov $0x3e,%eax > >> 5: 0f 05 syscall > >> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > >> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> Hmmm... that cmp + jae is unnecessary here. It is also a 32bit offset jump. I also suspect it gets predicted very badly. > >> 13: c3 retq > >> > >> While on musl: > >> > >> 0000000000000000 <kill>: > >> 0: 48 83 ec 08 sub $0x8,%rsp > >> 4: 48 63 ff movslq %edi,%rdi > >> 7: 48 63 f6 movslq %esi,%rsi > >> a: b8 3e 00 00 00 mov $0x3e,%eax > >> f: 0f 05 syscall > >> 11: 48 89 c7 mov %rax,%rdi > >> 14: e8 00 00 00 00 callq 19 <kill+0x19> > >> 19: 5a pop %rdx > >> 1a: c3 retq > > > > Wow that's some extraordinarily bad codegen going on by gcc... The > > sign-extension is semantically needed and I don't see a good way > > around it (glibc's asm is kinda a hack taking advantage of kernel not > > looking at high bits, I think), but the gratuitous stack adjustment > > and refusal to generate a tail call isn't. I'll see if we can track > > down what's going on and get it fixed. A suitable cast might get rid of the sign extension. Possibly just (unsigned int). 
David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-21 12:28 ` David Laight @ 2020-04-21 14:39 ` Rich Felker 2020-04-21 15:00 ` Adhemerval Zanella 0 siblings, 1 reply; 62+ messages in thread From: Rich Felker @ 2020-04-21 14:39 UTC (permalink / raw) To: David Laight Cc: 'Nicholas Piggin', Adhemerval Zanella, libc-dev, libc-alpha, linuxppc-dev, musl On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote: > From: Nicholas Piggin > > Sent: 20 April 2020 02:10 > ... > > >> Yes, but does it really matter to optimize this specific usage case > > >> for size? glibc, for instance, tries to leverage the syscall mechanism > > >> by adding some complex pre-processor asm directives. It optimizes > > >> the syscall code size in most cases. For instance, kill in static case > > >> generates on x86_64: > > >> > > >> 0000000000000000 <__kill>: > > >> 0: b8 3e 00 00 00 mov $0x3e,%eax > > >> 5: 0f 05 syscall > > >> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > > >> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> > > Hmmm... that cmp + jae is unnecessary here. It's not.. Rather the objdump was just mistakenly done without -r so it looks like a nop jump rather than a conditional tail call to the function that sets errno. > It is also a 32bit offset jump. > I also suspect it gets predicted very badly. I doubt that. This is a very standard idiom and the size of the offset (which is necessarily 32-bit because it has a relocation on it) is orthogonal to the condition on the jump. FWIW a syscall like kill takes global kernel-side locks to be able to address a target process by pid, and the rate of meaningful calls you can make to it is very low (since it's bounded by time for target process to act on the signal). Trying to optimize it for speed is pointless, and even size isn't important locally (although in aggregate, lots of wasted small size can add up to more pages = more TLB entries = ...). 
> > >> 13: c3 retq > > >> > > >> While on musl: > > >> > > >> 0000000000000000 <kill>: > > >> 0: 48 83 ec 08 sub $0x8,%rsp > > >> 4: 48 63 ff movslq %edi,%rdi > > >> 7: 48 63 f6 movslq %esi,%rsi > > >> a: b8 3e 00 00 00 mov $0x3e,%eax > > >> f: 0f 05 syscall > > >> 11: 48 89 c7 mov %rax,%rdi > > >> 14: e8 00 00 00 00 callq 19 <kill+0x19> > > >> 19: 5a pop %rdx > > >> 1a: c3 retq > > > > > > Wow that's some extraordinarily bad codegen going on by gcc... The > > > sign-extension is semantically needed and I don't see a good way > > > around it (glibc's asm is kinda a hack taking advantage of kernel not > > > looking at high bits, I think), but the gratuitous stack adjustment > > > and refusal to generate a tail call isn't. I'll see if we can track > > > down what's going on and get it fixed. > > A suitable cast might get rid of the sign extension. > Possibly just (unsigned int). No, it won't. The problem is that there is no representation of the fact that the kernel is only going to inspect the low 32 bits (by declaring the kernel-side function as taking an int argument). The external kill function receives arguments by the ABI, where the upper bits of int args can contain junk, and the asm register constraints for syscalls use longs (or rather an abstract syscall-arg type). It wouldn't even work to have macro magic detect that the expressions passed are ints and use hacks to avoid that, since it's perfectly valid to pass an int to a syscall that expects a long argument (e.g. offset to mmap), in which case it needs to be sign-extended. The only way to avoid this is encoding somewhere the syscall-specific knowledge of what arg size the kernel function expects. That's way too much redundant effort and too error-prone for the incredibly miniscule size benefit you'd get out of it. Rich ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-21 14:39 ` Rich Felker @ 2020-04-21 15:00 ` Adhemerval Zanella 2020-04-21 15:31 ` David Laight 2020-04-22 6:54 ` Nicholas Piggin 0 siblings, 2 replies; 62+ messages in thread From: Adhemerval Zanella @ 2020-04-21 15:00 UTC (permalink / raw) To: Rich Felker, David Laight Cc: 'Nicholas Piggin', libc-dev, libc-alpha, linuxppc-dev, musl On 21/04/2020 11:39, Rich Felker wrote: > On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote: >> From: Nicholas Piggin >>> Sent: 20 April 2020 02:10 >> ... >>>>> Yes, but does it really matter to optimize this specific usage case >>>>> for size? glibc, for instance, tries to leverage the syscall mechanism >>>>> by adding some complex pre-processor asm directives. It optimizes >>>>> the syscall code size in most cases. For instance, kill in static case >>>>> generates on x86_64: >>>>> >>>>> 0000000000000000 <__kill>: >>>>> 0: b8 3e 00 00 00 mov $0x3e,%eax >>>>> 5: 0f 05 syscall >>>>> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax >>>>> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> >> >> Hmmm... that cmp + jae is unnecessary here. > > It's not.. Rather the objdump was just mistakenly done without -r so > it looks like a nop jump rather than a conditional tail call to the > function that sets errno. 
>
Indeed, the output with -r is:

0000000000000000 <__kill>:
   0:	b8 3e 00 00 00       	mov    $0x3e,%eax
   5:	0f 05                	syscall
   7:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   d:	0f 83 00 00 00 00    	jae    13 <__kill+0x13>
			f: R_X86_64_PLT32	__syscall_error-0x4
  13:	c3                   	retq

And for x86_64 __syscall_error is defined as:

0000000000000000 <__syscall_error>:
   0:	48 f7 d8             	neg    %rax

0000000000000003 <__syscall_error_1>:
   3:	64 89 04 25 00 00 00 	mov    %eax,%fs:0x0
   a:	00
			7: R_X86_64_TPOFF32	errno
   b:	48 83 c8 ff          	or     $0xffffffffffffffff,%rax
   f:	c3                   	retq

Different from musl, each architecture defines its own error-handling mechanism (some embed the errno setting in the syscall stub itself, others branch to a __syscall_error-like function, as on x86_64). This is most likely due to glibc's long history. One of my long-term plans is to simplify this: get rid of the assembly pre-processor, implement all syscalls in C code, and set up the error-handling mechanism in a platform-neutral way using a tail call (most likely as you do on musl). >> It is also a 32bit offset jump. >> I also suspect it gets predicted very badly. > > I doubt that. This is a very standard idiom and the size of the offset > (which is necessarily 32-bit because it has a relocation on it) is > orthogonal to the condition on the jump. > > FWIW a syscall like kill takes global kernel-side locks to be able to > address a target process by pid, and the rate of meaningful calls you > can make to it is very low (since it's bounded by time for target > process to act on the signal). Trying to optimize it for speed is > pointless, and even size isn't important locally (although in > aggregate, lots of wasted small size can add up to more pages = more > TLB entries = ...). I agree, and I would prefer to focus on code simplicity, with a platform-neutral way to handle errors that lets the compiler optimize it, rather than mess with assembly macros to squeeze out this kind of micro-optimization.
> >>>>> 13: c3 retq >>>>> >>>>> While on musl: >>>>> >>>>> 0000000000000000 <kill>: >>>>> 0: 48 83 ec 08 sub $0x8,%rsp >>>>> 4: 48 63 ff movslq %edi,%rdi >>>>> 7: 48 63 f6 movslq %esi,%rsi >>>>> a: b8 3e 00 00 00 mov $0x3e,%eax >>>>> f: 0f 05 syscall >>>>> 11: 48 89 c7 mov %rax,%rdi >>>>> 14: e8 00 00 00 00 callq 19 <kill+0x19> >>>>> 19: 5a pop %rdx >>>>> 1a: c3 retq >>>> >>>> Wow that's some extraordinarily bad codegen going on by gcc... The >>>> sign-extension is semantically needed and I don't see a good way >>>> around it (glibc's asm is kinda a hack taking advantage of kernel not >>>> looking at high bits, I think), but the gratuitous stack adjustment >>>> and refusal to generate a tail call isn't. I'll see if we can track >>>> down what's going on and get it fixed. >> >> A suitable cast might get rid of the sign extension. >> Possibly just (unsigned int). > > No, it won't. The problem is that there is no representation of the > fact that the kernel is only going to inspect the low 32 bits (by > declaring the kernel-side function as taking an int argument). The > external kill function receives arguments by the ABI, where the upper > bits of int args can contain junk, and the asm register constraints > for syscalls use longs (or rather an abstract syscall-arg type). It > wouldn't even work to have macro magic detect that the expressions > passed are ints and use hacks to avoid that, since it's perfectly > valid to pass an int to a syscall that expects a long argument (e.g. > offset to mmap), in which case it needs to be sign-extended. > > The only way to avoid this is encoding somewhere the syscall-specific > knowledge of what arg size the kernel function expects. That's way too > much redundant effort and too error-prone for the incredibly miniscule > size benefit you'd get out of it. > > Rich > ^ permalink raw reply [flat|nested] 62+ messages in thread
* RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-21 15:00 ` Adhemerval Zanella @ 2020-04-21 15:31 ` David Laight 2020-04-22 6:54 ` Nicholas Piggin 1 sibling, 0 replies; 62+ messages in thread From: David Laight @ 2020-04-21 15:31 UTC (permalink / raw) To: 'Adhemerval Zanella', Rich Felker Cc: 'Nicholas Piggin', libc-dev, libc-alpha, linuxppc-dev, musl From: Adhemerval Zanella > Sent: 21 April 2020 16:01 > > On 21/04/2020 11:39, Rich Felker wrote: > > On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote: > >> From: Nicholas Piggin > >>> Sent: 20 April 2020 02:10 > >> ... > >>>>> Yes, but does it really matter to optimize this specific usage case > >>>>> for size? glibc, for instance, tries to leverage the syscall mechanism > >>>>> by adding some complex pre-processor asm directives. It optimizes > >>>>> the syscall code size in most cases. For instance, kill in static case > >>>>> generates on x86_64: > >>>>> > >>>>> 0000000000000000 <__kill>: > >>>>> 0: b8 3e 00 00 00 mov $0x3e,%eax > >>>>> 5: 0f 05 syscall > >>>>> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > >>>>> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> > >> > >> Hmmm... that cmp + jae is unnecessary here. > > > > It's not.. Rather the objdump was just mistakenly done without -r so > > it looks like a nop jump rather than a conditional tail call to the > > function that sets errno. > > > > Indeed, the output with -r is: > > 0000000000000000 <__kill>: > 0: b8 3e 00 00 00 mov $0x3e,%eax > 5: 0f 05 syscall > 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> > f: R_X86_64_PLT32 __syscall_error-0x4 > 13: c3 retq Yes, I probably should have remembered it looked like that :-) ... > >> I also suspect it gets predicted very badly. > > > > I doubt that. This is a very standard idiom and the size of the offset > > (which is necessarily 32-bit because it has a relocation on it) is > > orthogonal to the condition on the jump. 
Yes, it only gets mispredicted as badly as any other conditional jump. I believe modern intel x86 will randomly predict it taken (regardless of the direction) and then hit a TLB fault on text.unlikely :-) > > FWIW a syscall like kill takes global kernel-side locks to be able to > address a target process by pid, and the rate of meaningful calls you > can make to it is very low (since it's bounded by time for target > process to act on the signal). Trying to optimize it for speed is > pointless, and even size isn't important locally (although in > aggregate, lots of wasted small size can add up to more pages = more > TLB entries = ...). > > I agree and I would prefer to focus on code simplicity to have a > platform neutral way to handle error and let the compiler optimize > it than messy with assembly macros to squeeze this kind of > micro-optimizations. syscall entry does get micro-optimised. Real speed-ups can probably be found by optimising other places. I have a patch I need to resubmit that should improve the reading of iov[] from user space. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-21 15:00 ` Adhemerval Zanella 2020-04-21 15:31 ` David Laight @ 2020-04-22 6:54 ` Nicholas Piggin 2020-04-22 7:15 ` [musl] " Florian Weimer 1 sibling, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-22 6:54 UTC (permalink / raw) To: Adhemerval Zanella, Rich Felker, David Laight Cc: libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool Let me try to summarise what we have. - vdso-style call is ruled out as unnecessary, with possible security concerns. The caller can internally use an indirect branch to select a variant if it wants to use that mechanism to select. - LR clobber seems to be handled okay by gcc. It can increase the size of small leaf wrapper functions, but they can use the caller stack frame for this (and even the red zone for saving other things if necessary), but not by a huge amount. - -ve error return seems to be favoured by everyone. Experimentally, it's better for musl (but musl could probably improve cr0[SO] error handling a bit with 'asm goto'). - Preserving syscall args and volatiles up to r8 is a small but noticeable help for cases that inline the call rather than always calling wrappers. This is unlikely to be helpful unless 'sc' support is compiled out, but I'll consider doing it for the long term. Next step is to trace and test on real hardware. - One thing that nobody has really asked about is error handling for unsupported scv vectors, so I would like to just go over it: Today, the scv facility is disabled by the kernel (FSCR[SCV] is cleared), which makes any `scv` instruction take a facility unavailable interrupt, which ends up printing a kernel message about the SCV facility being unavailable, and SIGILLs the process with ILL_ILLOPC. Enabling 'scv 0' will enable vectors 1-127 as well, so the kernel has to handle those somehow.
What we are saying is that we will allocate HWCAP bits in the future if we implement more scv vectors, so userspace is not *supposed* to rely on this, but the kernel has to choose some behaviour for invalid vectors. My proposal was to do the same SIGILL (with no kernel facility message), so it appears to behave the same way to userspace as it does now. There is also the ILL_ILLOPN code that could be used for an invalid operand, but powerpc does not use it much, and operands coded statically into the instruction (e.g., an invalid mfspr) generate ILL_ILLOPC, so we could consider the entire instruction to be the opcode and the input register values to be the operands. Now, I don't know why a process would want to distinguish between FSCR[SCV]=0 and the case where it is enabled but the kernel doesn't implement the vector, but maybe it does? Another option would be to use a different signal; I don't see that any are more suitable. Or we could return without a signal but with -ENOSYS or something in r3. This doesn't seem so good, because an invalid scv vector is not a system call, and a failure ABI would constrain any future implementation just a little bit. Any objections to SIGILL ILL_ILLOPC? Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-22 6:54 ` Nicholas Piggin @ 2020-04-22 7:15 ` Florian Weimer 2020-04-22 7:31 ` Nicholas Piggin 0 siblings, 1 reply; 62+ messages in thread From: Florian Weimer @ 2020-04-22 7:15 UTC (permalink / raw) To: Nicholas Piggin Cc: Adhemerval Zanella, Rich Felker, David Laight, musl, libc-alpha, libc-dev, linuxppc-dev, Segher Boessenkool * Nicholas Piggin: > Another option would be to use a different signal. I don't see that any > are more suitable. SIGSYS comes to my mind. But I don't know how exclusively it is associated with seccomp these days. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2 2020-04-22 7:15 ` [musl] " Florian Weimer @ 2020-04-22 7:31 ` Nicholas Piggin 2020-04-22 8:11 ` Florian Weimer 0 siblings, 1 reply; 62+ messages in thread From: Nicholas Piggin @ 2020-04-22 7:31 UTC (permalink / raw) To: Florian Weimer Cc: Adhemerval Zanella, Rich Felker, David Laight, libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool Excerpts from Florian Weimer's message of April 22, 2020 5:15 pm: > * Nicholas Piggin: > >> Another option would be to use a different signal. I don't see that any >> are more suitable. > > SIGSYS comes to my mind. But I don't know how exclusively it is > associated with seccomp these days. SIGSYS is entirely seccomp now. There appears to be a single obscure MIPS user of it in Linux that's not seccomp, but it would be entirely new for powerpc (or any of the common platforms: arm, x86, etc.). So I would be disinclined to use SIGSYS unless there are no better signal types and we don't want to use SIGILL for some good reason -- is there a good reason to add complexity for userspace by differentiating these two situations? Thanks, Nick ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2
  2020-04-22  7:31 ` Nicholas Piggin
@ 2020-04-22  8:11   ` Florian Weimer
  0 siblings, 0 replies; 62+ messages in thread
From: Florian Weimer @ 2020-04-22 8:11 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Adhemerval Zanella, Rich Felker, David Laight, libc-alpha, libc-dev, linuxppc-dev, musl, Segher Boessenkool

* Nicholas Piggin:

> So I would be disinclined to use SIGSYS unless there are no other better
> signal types, and we don't want to use SIGILL for some good reason -- is
> there a good reason to add complexity for userspace by differentiating
> these two situations?

No, SIGILL seems fine to me. scv 0 and scv 1 could well be considered
different instructions eventually (with different mnemonics).

^ permalink raw reply	[flat|nested] 62+ messages in thread
end of thread, other threads:[~2020-04-25 4:52 UTC | newest]

Thread overview: 62+ messages
2020-04-15 21:45 Powerpc Linux 'scv' system call ABI proposal take 2 Nicholas Piggin
2020-04-15 22:55 ` [musl] " Rich Felker
2020-04-16  0:16 ` Nicholas Piggin
2020-04-16  0:48 ` Rich Felker
2020-04-16  2:24 ` Nicholas Piggin
2020-04-16  2:35 ` Rich Felker
2020-04-16  2:53 ` Nicholas Piggin
2020-04-16  3:03 ` Rich Felker
2020-04-16  3:41 ` Nicholas Piggin
2020-04-16 20:18 ` Florian Weimer
2020-04-16  9:58 ` Szabolcs Nagy
2020-04-20  0:27 ` Nicholas Piggin
2020-04-20  1:29 ` Rich Felker
2020-04-20  2:08 ` Nicholas Piggin
2020-04-20 21:17 ` Szabolcs Nagy
2020-04-21  9:57 ` Florian Weimer
2020-04-16 15:21 ` Jeffrey Walton
2020-04-16 15:40 ` Rich Felker
2020-04-16  4:48 ` Florian Weimer
2020-04-16 15:35 ` Rich Felker
2020-04-16 16:42 ` Florian Weimer
2020-04-16 16:52 ` Rich Felker
2020-04-16 18:12 ` Florian Weimer
2020-04-16 23:02 ` Segher Boessenkool
2020-04-17  0:34 ` Rich Felker
2020-04-17  1:48 ` Segher Boessenkool
2020-04-17  8:34 ` Florian Weimer
2020-04-16 14:16 ` Adhemerval Zanella
2020-04-16 15:37 ` Rich Felker
2020-04-16 17:50 ` Adhemerval Zanella
2020-04-16 17:59 ` Rich Felker
2020-04-16 18:18 ` Adhemerval Zanella
2020-04-16 18:31 ` Rich Felker
2020-04-16 18:44 ` Rich Felker
2020-04-16 18:52 ` Adhemerval Zanella
2020-04-20  0:46 ` Nicholas Piggin
2020-04-20  1:10 ` Nicholas Piggin
2020-04-20  1:34 ` Rich Felker
2020-04-20  2:32 ` Nicholas Piggin
2020-04-20  4:09 ` Rich Felker
2020-04-20  4:31 ` Nicholas Piggin
2020-04-20 17:27 ` Rich Felker
2020-04-22  6:18 ` Nicholas Piggin
2020-04-22  6:29 ` Nicholas Piggin
2020-04-23  2:36 ` Rich Felker
2020-04-23 12:13 ` Adhemerval Zanella
2020-04-23 16:18 ` Rich Felker
2020-04-23 16:35 ` Adhemerval Zanella
2020-04-23 16:43 ` Rich Felker
2020-04-23 17:15 ` Adhemerval Zanella
2020-04-23 17:42 ` Rich Felker
2020-04-25  3:40 ` Nicholas Piggin
2020-04-25  4:52 ` Rich Felker
2020-04-25  3:30 ` Nicholas Piggin
2020-04-21 12:28 ` David Laight
2020-04-21 14:39 ` Rich Felker
2020-04-21 15:00 ` Adhemerval Zanella
2020-04-21 15:31 ` David Laight
2020-04-22  6:54 ` Nicholas Piggin
2020-04-22  7:15 ` [musl] " Florian Weimer
2020-04-22  7:31 ` Nicholas Piggin
2020-04-22  8:11 ` Florian Weimer
This is a public inbox.