* Document use of IFUNC support outside of libc.
From: Carlos O'Donell @ 2016-03-03 21:10 UTC
To: Szabolcs Nagy, GNU C Library

Szabolcs,

I attempted to distill some of your notes here:
https://sourceware.org/glibc/wiki/GNU_IFUNC

That way I can point users at this.

In gperftools tcmalloc added an IFUNC use [1] which
violates some of the requirements under -Wl,-z,now,
so I have a need to document this support and discuss
with tcmalloc developers what we might do. Right now
they call way too much code for this to work.

Cheers,
Carlos.

[1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7
* Re: Document use of IFUNC support outside of libc.
From: Szabolcs Nagy @ 2016-03-04 17:54 UTC
To: Carlos O'Donell, GNU C Library; +Cc: nd

On 03/03/16 21:10, Carlos O'Donell wrote:
> I attempted to distill some of your notes here:
> https://sourceware.org/glibc/wiki/GNU_IFUNC
>

thanks, i was meaning to write something about it on the wiki,
but it is a bit hard to separate the bugs from the features.

i identified some issues:

* the first point about bind now is not entirely correct,
lazy binding does not change that much.

the reloc processing order at load time is:

1) DT_REL(A) relocs
2) DT_REL(A) relocs that call ifunc resolvers
3) DT_JMPREL relocs (may call ifunc resolvers or delay them)
4) DT_JMPREL relocs that call ifunc resolvers

(for example 1) can be data access through GOT, 2) is ifunc
resolved function address access through GOT, 3) is extern
function call, 4) is ifunc resolved function call that binds
locally e.g. static function with _IRELATIVE reloc.)

the only difference between lazy binding and bind now is at
step 3): run time vs load time ifunc resolution.

of course the ordering in 3) can break resolvers with bind
now that work with lazy binding, but the real problem is 2):
a resolver called there must only depend on relocs in 1).

it is still possible to call extern functions from an ifunc
resolver, but only if it is forced to use relocs in 1) (e.g.
call through a volatile funcptr or -fno-plt). i'm not sure
if glibc wants to document this to work, because the user
needs to know about relocations (which is compiler/linker
internals).
the nasty part is that the compiler is free to
add extern calls (into libc or compiler runtime) which can
break the resolver so it cannot be written in c or c++ in
principle :(

the dynamic linker could do the reloc ordering a bit better
(so e.g. 2) happens after 3) in case of lazy binding), but
i'm not sure how much that would help if potentially all
functions may be ifunc resolved in a module.

* an omission from that wiki page is static linking:
ifunc resolvers run very early then (so memcpy etc work
during libc initialization), and that breaks stack-protection
etc instrumentation: the thread pointer is not yet set up.

the vdso is not yet set up either and the vsyscall mechanism
uses ifunc now, so vdso does not work with static linking at
all (!) clock_gettime goes through a syscall (i think this is
a bug that can result in surprising perf regression for users
who expect speedup from static linking so i opened BZ 19767).

i suspect there might be other limitations on resolvers
because ptr mangling is not set up either..

probably static linking can be fixed by having two sets of
ifunc resolvers: one that only the libc uses and runs early
and another set that runs after some c runtime init is done
similar to the dynamic linked case.

i actually would like to use vdso from ifunc resolvers
to do the ifunc dispatch based on information that is only
available in the kernel and cannot be easily communicated
through other means (e.g. sysfs stuff).

* yet another issue is that the ifunc resolver type
signature is different on different targets.

(and if the user defined resolver takes no argument, but the
dynamic linker calls it with arguments that is not strictly
correct in c even if it happens to work for most call abis:
there were hardening proposals based on type signature checks
for indirect calls which the dynamic linker would violate).

> That way I can point users at this.
>
> In gperftools tcmalloc added an IFUNC use [1] which
> violates some of the requirements under -Wl,-z,now,
> so I have a need to document this support and discuss
> with tcmalloc developers what we might do. Right now
> they call way too much code for this to work.
>
> Cheers,
> Carlos.
>
> [1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7
>

+static bool sized_delete_enabled(void) {
+  if (tcmalloc_sized_delete_enabled != 0) {
+    return !!tcmalloc_sized_delete_enabled();
+  }

i think this call happens to work because the func address
check for the weak ref forces the reloc to happen at step 1).

+  const char *flag = TCMallocGetenvSafe("TCMALLOC_ENABLE_SIZED_DELETE");
+  return tcmalloc::commandlineflags::StringToBool(flag, false);

i think this will crash if the address of delete is used
(so ifunc resolver runs at step 2 while PLTGOT entries are
uninitialized) independently of binding lazy vs now.
with binding now it may crash without taking the address
of delete.

i'll try to update the wiki, but will wait for some
feedback here for a while.
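To make the constraint from this message concrete (a resolver that runs
at step 2 may rely only on relocs from step 1), here is a minimal sketch
in C, assuming GCC or Clang on an ELF target with glibc. All names
(count_bytes, count_bytes_index, count_bytes_pointer,
resolve_count_bytes) are hypothetical and come from neither tcmalloc nor
glibc. The resolver refers only to file-local (static) functions and a
compile-time constant, so it should not need any PLT call; the selection
criterion is only a placeholder for a real feature check.

```c
#include <stddef.h>

/* Hypothetical implementation 1: index-based loop. Declared static so
   that the reference from the resolver binds locally. */
static size_t count_bytes_index(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Hypothetical implementation 2: pointer-walking loop. */
static size_t count_bytes_pointer(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Resolver: returns the address of the chosen implementation.
   It makes no extern calls and reads no global data. The pointer-size
   test is an assumption standing in for a real hardware check. */
static size_t (*resolve_count_bytes(void))(const char *)
{
    return sizeof(void *) >= 8 ? count_bytes_pointer : count_bytes_index;
}

/* The public symbol is bound to whatever the resolver returns. */
size_t count_bytes(const char *s)
    __attribute__((ifunc("resolve_count_bytes")));
```

Callers simply call count_bytes("ifunc") and get 5 whichever
implementation was selected. Anything beyond this, such as reading the
environment as the tcmalloc resolver does, brings back the ordering
problems described above.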
* Re: Document use of IFUNC support outside of libc.
From: Carlos O'Donell @ 2016-03-04 21:49 UTC
To: Szabolcs Nagy, GNU C Library; +Cc: nd

On 03/04/2016 12:54 PM, Szabolcs Nagy wrote:
> On 03/03/16 21:10, Carlos O'Donell wrote:
>> I attempted to distill some of your notes here:
>> https://sourceware.org/glibc/wiki/GNU_IFUNC
>>
>
> thanks, i was meaning to write something about it on the wiki,
> but it is a bit hard to separate the bugs from the features.

I think we should make this work sensibly for a sensible set
of use cases. In particular we are probably going to have to
explicitly state what is and is not supported, and what functions
you can and can't call. I'm happy for IFUNC to exist for user
code if we impose limits like: only access local variables,
only call local functions, only use POD data types, only call
the following glibc functions, etc. etc.

> i identified some issues:
>
> * the first point about bind now is not entirely correct,
> lazy binding does not change that much.

Clarified. I agree the ordering doesn't change, all I wanted to
do was provide some background about *why* on certain machines
this fails.

> the reloc processing order at load time is:
>
> 1) DT_REL(A) relocs
> 2) DT_REL(A) relocs that call ifunc resolvers
> 3) DT_JMPREL relocs (may call ifunc resolvers or delay them)
> 4) DT_JMPREL relocs that call ifunc resolvers

This is the ordering per elf_dynamic_do_Rel right? Where
we force IRELATIVE to be resolved after in every given
group (but not across the groups e.g. 1) 3) 2) 4)).

> (for example 1) can be data access through GOT, 2) is ifunc
> resolved function address access through GOT, 3) is extern
> function call, 4) is ifunc resolved function call that binds
> locally e.g. static function with _IRELATIVE reloc.)
>
> the only difference between lazy binding and bind now is at
> step 3): run time vs load time ifunc resolution.

Agreed.

> of course the ordering in 3) can break resolvers with bind
> now that work with lazy binding, but the real problem is 2):
> a resolver called there must only depend on relocs in 1).

I was thinking about this.

Would it be possible on ARM and PPC64 whose R_*_IRELATIVE
relocs are in DT_REL* to reorder the processing in the dynamic
loader? Resolve DT_JMPREL first then DT_REL*.

That would give those machines feature parity with x86_64
without needing to rewrite the relocations in binutils to
handle this case?

> it is still possible to call extern functions from an ifunc
> resolver, but only if it is forced to use relocs in 1) (e.g.
> call through a volatile funcptr or -fno-plt). i'm not sure
> if glibc wants to document this to work, because the user
> needs to know about relocations (which is compiler/linker
> internals). the nasty part is that the compiler is free to
> add extern calls (into libc or compiler runtime) which can
> break the resolver so it cannot be written in c or c++ in
> principle :(

Correct.

On x86 with multiversioning the compiler emits multiple clones
of a function with different optimizations and selects based
on cpuid results. To get the cpuid results the ifunc resolver
emitted by the compiler calls into libgcc. As it is
implemented this multiversioning only works on x86 because of
the relocation ordering.

> the dynamic linker could do the reloc ordering a bit better
> (so e.g. 2) happens after 3) in case of lazy binding), but
> i'm not sure how much that would help if potentially all
> functions may be ifunc resolved in a module.

Could you expand on this a bit more? What would be the problem
in having the dynamic loader do relocation processing in this
order: 1) 3) 2) 4).
> * an omission from that wiki page is static linking:
> ifunc resolvers run very early then (so memcpy etc work
> during libc initialization), and that breaks stack-protection
> etc instrumentation: the thread pointer is not yet set up.

I mentioned that?

"The resolver must not be compiled with -fstack-protector-all
or any similar protections e.g. asan, since they may require
early setup which has not yet completed."

I just didn't talk about static vs. dynamic, I just forbid it
in general.

> the vdso is not yet set up either and the vsyscall mechanism
> uses ifunc now, so vdso does not work with static linking at
> all (!) clock_gettime goes through a syscall (i think this is
> a bug that can result in surprising perf regression for users
> who expect speedup from static linking so i opened BZ 19767).

Agreed.

> i suspect there might be other limitations on resolvers
> because ptr mangling is not set up either..

Maybe.

> probably static linking can be fixed by having two sets of
> ifunc resolvers: one that only the libc uses and runs early
> and another set that runs after some c runtime init is done
> similar to the dynamic linked case.

Right.

> i actually would like to use vdso from ifunc resolvers
> to do the ifunc dispatch based on information that is only
> available in the kernel and cannot be easily communicated
> through other means (e.g. sysfs stuff).

Sure. Examples needed.

> * yet another issue is that the ifunc resolver type
> signature is different on different targets.

This is really lame.

> (and if the user defined resolver takes no argument, but the
> dynamic linker calls it with arguments that is not strictly
> correct in c even if it happens to work for most call abis:
> there were hardening proposals based on type signature checks
> for indirect calls which the dynamic linker would violate).

Agreed, we need to fix this.

>> That way I can point users at this.
>>
>> In gperftools tcmalloc added an IFUNC use [1] which
>> violates some of the requirements under -Wl,-z,now,
>> so I have a need to document this support and discuss
>> with tcmalloc developers what we might do. Right now
>> they call way too much code for this to work.
>>
>> Cheers,
>> Carlos.
>>
>> [1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7
>>
> +static bool sized_delete_enabled(void) {
> +  if (tcmalloc_sized_delete_enabled != 0) {
> +    return !!tcmalloc_sized_delete_enabled();
> +  }
>
> i think this call happens to work because the func address
> check for the weak ref forces the reloc to happen at step 1).

OK.

> +  const char *flag = TCMallocGetenvSafe("TCMALLOC_ENABLE_SIZED_DELETE");
> +  return tcmalloc::commandlineflags::StringToBool(flag, false);
>
> i think this will crash if the address of delete is used
> (so ifunc resolver runs at step 2 while PLTGOT entries are
> uninitialized) independently of binding lazy vs now.
> with binding now it may crash without taking the address
> of delete.

Right.

> i'll try to update the wiki, but will wait for some
> feedback here for a while.

Thanks! Feel free to update the page!

Cheers,
Carlos.
* Re: Document use of IFUNC support outside of libc.
From: Szabolcs Nagy @ 2016-03-07 17:33 UTC
To: Carlos O'Donell, GNU C Library
Cc: nd, Ramana Radhakrishnan, Marcus Shawcroft

On 04/03/16 21:48, Carlos O'Donell wrote:
> On 03/04/2016 12:54 PM, Szabolcs Nagy wrote:
>> On 03/03/16 21:10, Carlos O'Donell wrote:
>>> I attempted to distill some of your notes here:
>>> https://sourceware.org/glibc/wiki/GNU_IFUNC
>>>
>>
>> thanks, i was meaning to write something about it on the wiki,
>> but it is a bit hard to separate the bugs from the features.
>
> I think we should make this work sensibly for a sensible set
> of use cases. In particular we are probably going to have to
> explicitly state what is and is not supported, and what functions
> you can and can't call. I'm happy for IFUNC to exist for user
> code if we impose limits like: only access local variables,
> only call local functions, only use POD data types, only call
> the following glibc functions, etc. etc.
>
>> i identified some issues:
>>
>> * the first point about bind now is not entirely correct,
>> lazy binding does not change that much.
>
> Clarified. I agree the ordering doesn't change, all I wanted to
> do was provide some background about *why* on certain machines
> this fails.
>
>> the reloc processing order at load time is:
>>
>> 1) DT_REL(A) relocs
>> 2) DT_REL(A) relocs that call ifunc resolvers
>> 3) DT_JMPREL relocs (may call ifunc resolvers or delay them)
>> 4) DT_JMPREL relocs that call ifunc resolvers
>
> This is the ordering per elf_dynamic_do_Rel right? Where
> we force IRELATIVE to be resolved after in every given
> group (but not across the groups e.g. 1) 3) 2) 4)).
>

_ELF_DYNAMIC_DO_RELOC in elf/dynamic-link.h orders 1,2 before 3,4
and elf_dynamic_do_Rel in elf/do-rel.h orders 3 before 4.

3 before 4 is also guaranteed by binutils ld since
https://sourceware.org/bugzilla/show_bug.cgi?id=13302

i think 1 is ordered before 2 only in recent binutils ld
https://sourceware.org/bugzilla/show_bug.cgi?id=18841
(and it seems it was only fixed for x86, ppc and s390)

i think JUMP_SLOT relocs within 3 are also sorted by ld
such that STT_GNU_IFUNC symbols come last.

>> (for example 1) can be data access through GOT, 2) is ifunc
>> resolved function address access through GOT, 3) is extern
>> function call, 4) is ifunc resolved function call that binds
>> locally e.g. static function with _IRELATIVE reloc.)
>>
>> the only difference between lazy binding and bind now is at
>> step 3): run time vs load time ifunc resolution.
>
> Agreed.
>
>> of course the ordering in 3) can break resolvers with bind
>> now that work with lazy binding, but the real problem is 2):
>> a resolver called there must only depend on relocs in 1).
>
> I was thinking about this.
>
> Would it be possible on ARM and PPC64 whose R_*_IRELATIVE
> relocs are in DT_REL* to reorder the processing in the dynamic
> loader? Resolve DT_JMPREL first then DT_REL*.
>
> That would give those machines feature parity with x86_64
> without needing to rewrite the relocations in binutils to
> handle this case?
>

i haven't looked at non-x86 targets yet.

i think glibc dynlinker can do the relocs in arbitrary order
(the order is only observable through ifunc resolvers), but
the code might become ugly if there is arch dependent ordering.

>> it is still possible to call extern functions from an ifunc
>> resolver, but only if it is forced to use relocs in 1) (e.g.
>> call through a volatile funcptr or -fno-plt). i'm not sure
>> if glibc wants to document this to work, because the user
>> needs to know about relocations (which is compiler/linker
>> internals). the nasty part is that the compiler is free to
>> add extern calls (into libc or compiler runtime) which can
>> break the resolver so it cannot be written in c or c++ in
>> principle :(
>
> Correct.
>
> On x86 with multiversioning the compiler emits multiple clones
> of a function with different optimizations and selects based
> on cpuid results. To get the cpuid results the ifunc resolver
> emitted by the compiler calls into libgcc. As it is
> implemented this multiversioning only works on x86 because of
> the relocation ordering.
>
>> the dynamic linker could do the reloc ordering a bit better
>> (so e.g. 2) happens after 3) in case of lazy binding), but
>> i'm not sure how much that would help if potentially all
>> functions may be ifunc resolved in a module.
>
> Could you expand on this a bit more? What would be the problem
> in having the dynamic loader do relocation processing in this
> order: 1) 3) 2) 4).
>

the ordering does not fix the case when ifunc resolvers
reference ifunc resolved functions in the same module.
(because the relocs are not ordered according to ifunc
dependency)

otherwise i think it would make the most common cases
work. (both lazy and non-lazy binding, although lazy
binding would work in more cases)

>> * an omission from that wiki page is static linking:
>> ifunc resolvers run very early then (so memcpy etc work
>> during libc initialization), and that breaks stack-protection
>> etc instrumentation: the thread pointer is not yet set up.
>
> I mentioned that?
>
> "The resolver must not be compiled with -fstack-protector-all
> or any similar protections e.g. asan, since they may require
> early setup which has not yet completed."
>
> I just didn't talk about static vs. dynamic, I just forbid it
> in general.
>

sorry, indeed it is documented, but i wanted to note that
it only fails with static linking because i think this is
undesirable. (that code is running without thread pointer
set up so accessing errno or other tls would crash).
>> the vdso is not yet set up either and the vsyscall mechanism
>> uses ifunc now, so vdso does not work with static linking at
>> all (!) clock_gettime goes through a syscall (i think this is
>> a bug that can result in surprising perf regression for users
>> who expect speedup from static linking so i opened BZ 19767).
>
> Agreed.
>
>> i suspect there might be other limitations on resolvers
>> because ptr mangling is not set up either..
>
> Maybe.
>
>> probably static linking can be fixed by having two sets of
>> ifunc resolvers: one that only the libc uses and runs early
>> and another set that runs after some c runtime init is done
>> similar to the dynamic linked case.
>
> Right.
>
>> i actually would like to use vdso from ifunc resolvers
>> to do the ifunc dispatch based on information that is only
>> available in the kernel and cannot be easily communicated
>> through other means (e.g. sysfs stuff).
>
> Sure. Examples needed.
>

there seems to be interest in optimizations/dispatch based
on the micro architecture which is not easily available in
userspace currently (on aarch64).

linux exports various cpu info in /sys but that is not
stable abi and users probably don't want large number of
syscalls traversing the /sys tree at process startup just
to get slightly better tuned memcpy or similar.

one idea by Adhemerval Zanella was to use vdso for this.
(the kernel can provide a versioned function symbol there
to return a pointer to some cpu info struct, which can be
read only thus shared across processes).
there is no proposed design for this yet either on kernel
or libc side, but it would make sense if ifunc could use it.

currently the only reliable mechanisms for ifunc dispatch
are hwcap feature bits (if passed as argument) or cpuid
like instruction (e.g. on aarch64 cpuid like instructions
are not available to userspace, but can be emulated by the
kernel or provided as syscall, in either case it would be
context switch into the kernel, which can be bad if large
number of ifunc resolvers do it e.g. because function
multi-versioning is implemented that way, unless there is some
caching mechanism which is also not easy to do in ifunc...)

>> * yet another issue is that the ifunc resolver type
>> signature is different on different targets.
>
> This is really lame.
>
>> (and if the user defined resolver takes no argument, but the
>> dynamic linker calls it with arguments that is not strictly
>> correct in c even if it happens to work for most call abis:
>> there were hardening proposals based on type signature checks
>> for indirect calls which the dynamic linker would violate).
>
> Agreed, we need to fix this.
>

i think it's not easy to fix:
binutils and gcc already have ifunc test cases (where
resolvers take no argument). most non-x86 archs take a
hwcap argument, but in the mips ifunc patch the resolver
has 3 arguments.

>>> That way I can point users at this.
>>>
>>> In gperftools tcmalloc added an IFUNC use [1] which
>>> violates some of the requirements under -Wl,-z,now,
>>> so I have a need to document this support and discuss
>>> with tcmalloc developers what we might do. Right now
>>> they call way too much code for this to work.
>>>
>>> Cheers,
>>> Carlos.
>>>
>>> [1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7
>>>
>> +static bool sized_delete_enabled(void) {
>> +  if (tcmalloc_sized_delete_enabled != 0) {
>> +    return !!tcmalloc_sized_delete_enabled();
>> +  }
>>
>> i think this call happens to work because the func address
>> check for the weak ref forces the reloc to happen at step 1).
>
> OK.
>
>> +  const char *flag = TCMallocGetenvSafe("TCMALLOC_ENABLE_SIZED_DELETE");
>> +  return tcmalloc::commandlineflags::StringToBool(flag, false);
>>
>> i think this will crash if the address of delete is used
>> (so ifunc resolver runs at step 2 while PLTGOT entries are
>> uninitialized) independently of binding lazy vs now.
>> with binding now it may crash without taking the address
>> of delete.
>
> Right.
>
>> i'll try to update the wiki, but will wait for some
>> feedback here for a while.
>
> Thanks! Feel free to update the page!
>
> Cheers,
> Carlos.
>
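The signature mismatch discussed in this message can be illustrated with
a small sketch in C. It assumes, following the statement above that most
non-x86 archs take a hwcap argument, that the dynamic linker passes
hwcap as the first resolver argument on aarch64, arm and powerpc, and
that x86 resolvers receive nothing. The macro names, the feature bit and
the function names are all hypothetical; check the target's ABI before
relying on any of this.

```c
#include <stddef.h>

/* Assumption: these targets pass hwcap to the resolver; everything
   else is treated like x86, where the resolver takes no argument. */
#if defined(__aarch64__) || defined(__arm__) || defined(__powerpc__)
# define IFUNC_RESOLVER_PARAMS unsigned long hwcap
# define IFUNC_HWCAP (hwcap)
#else
# define IFUNC_RESOLVER_PARAMS void
# define IFUNC_HWCAP (0UL)
#endif

/* Hypothetical feature bit, not a real HWCAP_* constant. */
#define HYPOTHETICAL_FAST_BIT (1UL << 1)

typedef unsigned (*sum_fn)(const unsigned char *, size_t);

/* Baseline implementation: one byte per iteration. */
static unsigned sum_generic(const unsigned char *p, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* Alternative implementation: two bytes per iteration plus a tail. */
static unsigned sum_fast(const unsigned char *p, size_t n)
{
    unsigned total = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2)
        total += (unsigned)p[i] + p[i + 1];
    if (i < n)
        total += p[i];
    return total;
}

/* The resolver's parameter list differs per target, which is the
   portability problem raised in the thread. */
static sum_fn resolve_sum(IFUNC_RESOLVER_PARAMS)
{
    return (IFUNC_HWCAP & HYPOTHETICAL_FAST_BIT) ? sum_fast : sum_generic;
}

unsigned sum_bytes(const unsigned char *p, size_t n)
    __attribute__((ifunc("resolve_sum")));
```

Both implementations return the same result, so callers of sum_bytes
cannot observe which one was picked. The per-target macro is only a
workaround for the inconsistent signature, not a fix for it.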
* Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.)
From: Siddhesh Poyarekar @ 2016-04-15 15:11 UTC
To: Szabolcs Nagy
Cc: Carlos O'Donell, GNU C Library, nd, Ramana Radhakrishnan, Marcus Shawcroft

On Mon, Mar 07, 2016 at 05:33:24PM +0000, Szabolcs Nagy wrote:
> there seems to be interest in optimizations/dispatch based
> on the micro architecture which is not easily available in
> userspace currently (on aarch64).

Sorry, I was interested in this conversation but completely missed it,
so starting it again. I hope it's not too late :)

> linux exports various cpu info in /sys but that is not
> stable abi and users probably don't want large number of
> syscalls traversing the /sys tree at process startup just
> to get slightly better tuned memcpy or similar.
>
> one idea by Adhemerval Zanella was to use vdso for this.
> (the kernel can provide a versioned function symbol there
> to return a pointer to some cpu info struct, which can be
> read only thus shared across processes).
> there is no proposed design for this yet either on kernel
> or libc side, but it would make sense if ifunc could use it.
>
> currently the only reliable mechanisms for ifunc dispatch
> are hwcap feature bits (if passed as argument) or cpuid
> like instruction (e.g. on aarch64 cpuid like instructions
> are not available to userspace, but can be emulated by the
> kernel or provided as syscall, in either case it would be
> context switch into the kernel, which can be bad if large
> number of ifunc resolvers do it e.g. because function
> multi-versioning is implemented that way, unless there is some
> caching mechanism which is also not easy to do in ifunc...)
The context switch is not the worst thing that can happen for the
emulated instructions because we can easily cache the result and
reduce the number of context switches to a minimum. The difficult bit
for the emulated instruction (MRS) is heterogenous systems, where it
would be difficult (impossible?) for userspace to just use the
emulated instruction to deterministically identify all of the
processor cores.

So the emulated instruction will only work for specific processor
cores that are known to always be in a homogenous configuration and
never otherwise. For anything else, we will need the kernel to give
us full information about all of the cores in another way, either via
sysfs or vdso. The sysfs route has been proposed earlier[1] but is
hairy for us because it traverses the filesystem to identify all CPU
cores, resulting in a proportional number of syscalls. The vdso
alternative is better because the kernel can then give us all of the
information in exactly one call and avoid the context switch at the
same time.

I had hacked up a patch to test using the sysfs patches in [1] and it
required reimplementing some string functions to avoid referencing
them but that was about the only thing needed to get it working.
Safety however is a completely different issue and I don't know if we
can even guarantee that during symbol resolution.

Siddhesh

[1] https://lkml.org/lkml/2015/9/16/452
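The caching idea in this message can be sketched as follows, in C. This
is only an illustration under stated assumptions: query_cpu_id_slow
stands in for the emulated MRS read or a syscall, its return value is a
made-up placeholder, and all function names are hypothetical. The sketch
is not thread-safe, and, as the message points out, a cached value
describes only the core it was read on, so it is meaningful only for
homogeneous configurations.

```c
/* Counts how many times the simulated expensive query actually ran. */
static unsigned long expensive_queries;

/* Stand-in for a trapping instruction or syscall. The returned value
   is a placeholder, not a real CPU identifier. */
static unsigned long query_cpu_id_slow(void)
{
    expensive_queries++;
    return 0x1234UL;
}

/* Returns the identifier, performing the expensive query at most once
   per process (single-threaded use assumed). */
unsigned long cached_cpu_id(void)
{
    static unsigned long cached;
    static int valid;

    if (!valid) {
        cached = query_cpu_id_slow();
        valid = 1;
    }
    return cached;
}

/* Exposes the counter so that the caching effect can be observed. */
unsigned long cpu_id_query_count(void)
{
    return expensive_queries;
}
```

However many resolvers call cached_cpu_id, the expensive path runs once.
Whether writable static data like this is safe to touch during symbol
resolution is exactly the open safety question raised above.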
* Re: Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.)
From: pinskia @ 2016-04-15 23:51 UTC
To: Siddhesh Poyarekar
Cc: Szabolcs Nagy, Carlos O'Donell, GNU C Library, nd, Ramana Radhakrishnan, Marcus Shawcroft

> On Apr 15, 2016, at 8:10 AM, Siddhesh Poyarekar <sid@reserved-bit.com> wrote:
>
>> On Mon, Mar 07, 2016 at 05:33:24PM +0000, Szabolcs Nagy wrote:
>> there seems to be interest in optimizations/dispatch based
>> on the micro architecture which is not easily available in
>> userspace currently (on aarch64).
>
> Sorry, I was interested in this conversation but completely missed it,
> so starting it again. I hope it's not too late :)
>
>> linux exports various cpu info in /sys but that is not
>> stable abi and users probably don't want large number of
>> syscalls traversing the /sys tree at process startup just
>> to get slightly better tuned memcpy or similar.
>>
>> one idea by Adhemerval Zanella was to use vdso for this.
>> (the kernel can provide a versioned function symbol there
>> to return a pointer to some cpu info struct, which can be
>> read only thus shared across processes).
>> there is no proposed design for this yet either on kernel
>> or libc side, but it would make sense if ifunc could use it.
>>
>> currently the only reliable mechanisms for ifunc dispatch
>> are hwcap feature bits (if passed as argument) or cpuid
>> like instruction (e.g. on aarch64 cpuid like instructions
>> are not available to userspace, but can be emulated by the
>> kernel or provided as syscall, in either case it would be
>> context switch into the kernel, which can be bad if large
>> number of ifunc resolvers do it e.g. because function
>> multi-versioning is implemented that way, unless there is some
>> caching mechanism which is also not easy to do in ifunc...)
>
> The context switch is not the worst thing that can happen for the
> emulated instructions because we can easily cache the result and
> reduce the number of context switches to a minimum. The difficult bit
> for the emulated instruction (MRS) is heterogenous systems, where it
> would be difficult (impossible?) for userspace to just use the
> emulated instruction to deterministically identify all of the
> processor cores.
>
> So the emulated instruction will only work for specific processor
> cores that are known to always be in a homogenous configuration and
> never otherwise. For anything else, we will need the kernel to give
> us full information about all of the cores in another way, either via
> sysfs or vdso. The sysfs route has been proposed earlier[1] but is
> hairy for us because it traverses the filesystem to identify all CPU
> cores, resulting in a proportional number of syscalls. The vdso
> alternative is better because the kernel can then give us all of the
> information in exactly one call and avoid the context switch at the
> same time.
>
> I had hacked up a patch to test using the sysfs patches in [1] and it
> required reimplementing some string functions to avoid referencing
> them but that was about the only thing needed to get it working.
> Safety however is a completely different issue and I don't know if we
> can even guarantee that during symbol resolution.

I gave an alternative to this approach by passing midr via the aux
vector. It still is useful and we can change the kernel to have it
return unknown for those known values which will be used for
big.little. I don't have a link to my implementation right now
though as I am traveling. This is much safer and easier to the
black listing inside the kernel and the aux vector is basically free
no open/read/close from ifunc or early launch either.
Thanks,
Andrew

>
> Siddhesh
>
> [1] https://lkml.org/lkml/2015/9/16/452
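For reference, reading an aux vector entry from ordinary C code looks
roughly like this on Linux with glibc: a minimal sketch using getauxval
from <sys/auxv.h> with the existing AT_HWCAP key. The MIDR entry
proposed in this message is deliberately not shown, because its key is
not named in the thread. The wrapper names read_hwcap and hwcap_has_bits
are hypothetical.

```c
#include <sys/auxv.h>

/* Returns the hwcap word the kernel placed in the aux vector.
   getauxval returns 0 if the entry is absent. */
unsigned long read_hwcap(void)
{
    return getauxval(AT_HWCAP);
}

/* Returns 1 if every bit in mask is set in hwcap, otherwise 0. */
int hwcap_has_bits(unsigned long mask)
{
    return (read_hwcap() & mask) == mask;
}
```

No file is opened and no syscall is made for the lookup, which is the
sense in which the aux vector is "basically free". Note that getauxval
is itself an extern function, so calling it from inside an ifunc
resolver is still subject to the relocation-ordering caveats discussed
earlier in the thread.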
* Re: Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.)
From: Siddhesh Poyarekar @ 2016-04-16 17:39 UTC
To: pinskia
Cc: Szabolcs Nagy, Carlos O'Donell, GNU C Library, nd, Ramana Radhakrishnan, Marcus Shawcroft

On Fri, Apr 15, 2016 at 03:09:38PM -0700, pinskia@gmail.com wrote:
> I gave an alternative to this approach by passing midr via the aux
> vector. It still is useful and we can change the kernel to have it
> return unknown for those known values which will be used for
> big.little. I don't have a link to my implementation right now
> though as I am traveling. This is much safer and easier to the
> black listing inside the kernel and the aux vector is basically free
> no open/read/close from ifunc or early launch either.

Is this[1] the patch you're referring to? It seems reasonable to me
given that we can never support big.little reliably with hotplug
potentially mixing things up. But it really depends on how seriously
we want to consider the possibility of having optimal routines for
big.little systems.

We could probably make this patch play nicely with Suzuki's patchset
and use the auxvec entry as a first check and then fall back to
trawling sysfs or do the vdso function call if we ever need to
implement optimal routines for a big.little system. However if
optimizing for big.little is a serious possibility then it makes sense
to solve that problem right now instead of burying it temporarily.

Siddhesh

[1] https://patches.linaro.org/patch/52856/
* Re: Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.)
From: Andrew Pinski @ 2016-05-10 6:34 UTC
To: Siddhesh Poyarekar
Cc: Szabolcs Nagy, Carlos O'Donell, GNU C Library, nd, Ramana Radhakrishnan, Marcus Shawcroft

On Sat, Apr 16, 2016 at 10:38 AM, Siddhesh Poyarekar <sid@reserved-bit.com> wrote:
> On Fri, Apr 15, 2016 at 03:09:38PM -0700, pinskia@gmail.com wrote:
>> I gave an alternative to this approach by passing midr via the aux
>> vector. It still is useful and we can change the kernel to have it
>> return unknown for those known values which will be used for
>> big.little. I don't have a link to my implementation right now
>> though as I am traveling. This is much safer and easier to the
>> black listing inside the kernel and the aux vector is basically free
>> no open/read/close from ifunc or early launch either.
>
> Is this[1] the patch you're referring to?

Yes.

> It seems reasonable to me
> given that we can never support big.little reliably with hotplug
> potentially mixing things up. But it really depends on how seriously
> we want to consider the possibility of having optimal routines for
> big.little systems.

I personally don't have any big.little system which I need to optimize
for. I need to optimize for ThunderX series of processors. I already
have a memcpy for ThunderX and a memset that I optimized but it is
dependent on this kernel patch being approved.

Thanks,
Andrew

>
> We could probably make this patch play nicely with Suzuki's patchset
> and use the auxvec entry as a first check and then fall back to
> trawling sysfs or do the vdso function call if we ever need to
> implement optimal routines for a big.little system. However if
> optimizing for big.little is a serious possibility then it makes sense
> to solve that problem right now instead of burying it temporarily.
>
> Siddhesh
>
> [1] https://patches.linaro.org/patch/52856/
* Re: Document use of IFUNC support outside of libc.
From: Florian Weimer @ 2016-03-04 21:56 UTC
To: Szabolcs Nagy; +Cc: Carlos O'Donell, GNU C Library, nd

* Szabolcs Nagy:

> it is still possible to call extern functions from an ifunc
> resolver, but only if it is forced to use relocs in 1) (e.g.
> call through a volatile funcptr or -fno-plt).

Does this change for architectures which use function descriptors?