public inbox for libc-alpha@sourceware.org
* Document use of IFUNC support outside of libc.
@ 2016-03-03 21:10 Carlos O'Donell
  2016-03-04 17:54 ` Szabolcs Nagy
  0 siblings, 1 reply; 9+ messages in thread
From: Carlos O'Donell @ 2016-03-03 21:10 UTC (permalink / raw)
  To: Szabolcs Nagy, GNU C Library

Szabolcs,

I attempted to distill some of your notes here:
https://sourceware.org/glibc/wiki/GNU_IFUNC

That way I can point users at this.

In gperftools tcmalloc added an IFUNC use [1] which
violates some of the requirements under -Wl,-z,now,
so I have a need to document this support and discuss
with tcmalloc developers what we might do. Right now
they call way too much code for this to work.

Cheers,
Carlos.

[1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Document use of IFUNC support outside of libc.
  2016-03-03 21:10 Document use of IFUNC support outside of libc Carlos O'Donell
@ 2016-03-04 17:54 ` Szabolcs Nagy
  2016-03-04 21:49   ` Carlos O'Donell
  2016-03-04 21:56   ` Document use of IFUNC support outside of libc Florian Weimer
  0 siblings, 2 replies; 9+ messages in thread
From: Szabolcs Nagy @ 2016-03-04 17:54 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library; +Cc: nd

On 03/03/16 21:10, Carlos O'Donell wrote:
> I attempted to distill some of your notes here:
> https://sourceware.org/glibc/wiki/GNU_IFUNC
> 

thanks, i was meaning to write something about it on the wiki,
but it is a bit hard to separate the bugs from the features.

i identified some issues:

* the first point about bind now is not entirely correct,
lazy binding does not change that much.

the reloc processing order at load time is:

1) DT_REL(A) relocs
2) DT_REL(A) relocs that call ifunc resolvers
3) DT_JMPREL relocs (may call ifunc resolvers or delay them)
4) DT_JMPREL relocs that call ifunc resolvers

(for example 1) can be data access through GOT, 2) is ifunc
resolved function address access through GOT, 3) is extern
function call, 4) is ifunc resolved function call that binds
locally e.g. static function with _IRELATIVE reloc.)

the only difference between lazy binding and bind now is at
step 3): run time vs load time ifunc resolution.

of course the ordering in 3) can break resolvers with bind
now that work with lazy binding, but the real problem is 2):
a resolver called there must only depend on relocs in 1).
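
A minimal sketch of a resolver that stays within this constraint,
assuming GCC or Clang on a glibc target that supports the ifunc
attribute. The names add_ints, resolve_add_ints, add_plain and
add_other are made up for illustration. The resolver only returns the
address of a function defined in the same module and calls nothing, so
it should not depend on any DT_JMPREL reloc having been processed.

```c
/* Two interchangeable implementations, both local to this module. */
static int add_plain(int a, int b) { return a + b; }
static int add_other(int a, int b) { return b + a; }

/* Hypothetical selection flag: a local constant, so reading it needs
   no relocation against another module. */
static const int use_other = 0;

/* Resolver: runs while relocations are still being processed, so it
   makes no extern calls and touches no extern data.  It is declared
   without parameters (the x86 convention); other targets may pass
   hwcap, which this sketch ignores. */
static int (*resolve_add_ints(void))(int, int)
{
    return use_other ? add_other : add_plain;
}

/* The exported symbol is bound to whatever the resolver returns. */
int add_ints(int a, int b) __attribute__((ifunc("resolve_add_ints")));
```

Callers simply call add_ints(2, 3) and get whichever implementation
the resolver returned.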

it is still possible to call extern functions from an ifunc
resolver, but only if it is forced to use relocs in 1) (e.g.
call through a volatile funcptr or -fno-plt).  i'm not sure
if glibc wants to document this to work, because the user
needs to know about relocations (which is compiler/linker
internals).  the nasty part is that the compiler is free to
add extern calls (into libc or compiler runtime) which can
break the resolver so it cannot be written in c or c++ in
principle :(

the dynamic linker could do the reloc ordering a bit better
(so e.g. 2) happens after 3) in case of lazy binding), but
i'm not sure how much that would help if potentially all
functions may be ifunc resolved in a module.


* an omission from that wiki page is static linking:
ifunc resolvers run very early then (so memcpy etc work
during libc initialization), and that breaks stack-protection
etc instrumentation: the thread pointer is not yet set up.

the vdso is not yet set up either and the vsyscall mechanism
uses ifunc now, so vdso does not work with static linking at
all (!) clock_gettime goes through a syscall (i think this is
a bug that can result in surprising perf regression for users
who expect speedup from static linking so i opened BZ 19767 ).

i suspect there might be other limitations on resolvers
because ptr mangling is not set up either..

probably static linking can be fixed by having two sets of
ifunc resolvers: one that only the libc uses and runs early
and another set that runs after some c runtime init is done
similar to the dynamic linked case.

i actually would like to use vdso from ifunc resolvers
to do the ifunc dispatch based on information that is only
available in the kernel and cannot be easily communicated
through other means (e.g. sysfs stuff).


* yet another issue is that the ifunc resolver type
signature is different on different targets.
(and if the user defined resolver takes no argument, but the
dynamic linker calls it with arguments that is not strictly
correct in c even if it happens to work for most call abis:
there were hardening proposals based on type signature checks
for indirect calls which the dynamic linker would violate).

> That way I can point users at this.
> 
> In gperftools tcmalloc added an IFUNC use [1] which
> violates some of the requirements under -Wl,-z,now,
> so I have a need to document this support and discuss
> with tcmalloc developers what we might do. Right now
> they call way too much code for this to work.
> 
> Cheers,
> Carlos.
> 
> [1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7
> 
+static bool sized_delete_enabled(void) {
+  if (tcmalloc_sized_delete_enabled != 0) {
+    return !!tcmalloc_sized_delete_enabled();
+  }

i think this call happens to work because the func address
check for the weak ref forces the reloc to happen at step 1).

+  const char *flag = TCMallocGetenvSafe("TCMALLOC_ENABLE_SIZED_DELETE");
+  return tcmalloc::commandlineflags::StringToBool(flag, false);

i think this will crash if the address of delete is used
(so ifunc resolver runs at step 2 while PLTGOT entries are
uninitialized) independently of binding lazy vs now.
with binding now it may crash without taking the address
of delete.
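
The weak reference pattern in the first hunk can be sketched as below.
optional_hook and feature_enabled are hypothetical names, not the
tcmalloc ones. The assumption, following the reasoning above, is that
comparing the function's address makes the compiler load it through a
GOT entry filled in at step 1) instead of going through a PLT slot;
this depends on the compiler and the code model.

```c
/* Weak reference: the address stays null if no module defines the hook. */
extern int optional_hook(void) __attribute__((weak));

/* Returns the hook's answer when the hook exists, otherwise 0. */
int feature_enabled(void)
{
    if (optional_hook != 0)        /* address check, not a call */
        return !!optional_hook();  /* call through the address just tested */
    return 0;
}
```

With no definition of optional_hook linked in, feature_enabled()
returns 0 without making any call.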


i'll try to update the wiki, but will wait for some
feedback here for a while.


* Re: Document use of IFUNC support outside of libc.
  2016-03-04 17:54 ` Szabolcs Nagy
@ 2016-03-04 21:49   ` Carlos O'Donell
  2016-03-07 17:33     ` Szabolcs Nagy
  2016-03-04 21:56   ` Document use of IFUNC support outside of libc Florian Weimer
  1 sibling, 1 reply; 9+ messages in thread
From: Carlos O'Donell @ 2016-03-04 21:49 UTC (permalink / raw)
  To: Szabolcs Nagy, GNU C Library; +Cc: nd

On 03/04/2016 12:54 PM, Szabolcs Nagy wrote:
> On 03/03/16 21:10, Carlos O'Donell wrote:
>> I attempted to distill some of your notes here:
>> https://sourceware.org/glibc/wiki/GNU_IFUNC
>>
> 
> thanks, i was meaning to write something about it on the wiki,
> but it is a bit hard to separate the bugs from the features.

I think we should make this work sensibly for a sensible set
of use cases. In particular we are probably going to have to
explicitly state what is and is not supported, and what functions
you can and can't call. I'm happy for IFUNC to exist for user
code if we impose limits like: only access local variables,
only call local functions, only use POD data types, only call
the following glibc functions, etc. etc.

> i identified some issues:
> 
> * the first point about bind now is not entirely correct,
> lazy binding does not change that much.

Clarified. I agree the ordering doesn't change, all I wanted to
do was provide some background about *why* on certain machines
this fails.

> the reloc processing order at load time is:
> 
> 1) DT_REL(A) relocs
> 2) DT_REL(A) relocs that call ifunc resolvers
> 3) DT_JMPREL relocs (may call ifunc resolvers or delay them)
> 4) DT_JMPREL relocs that call ifunc resolvers

This is the ordering per elf_dynamic_do_Rel, right? There
we force IRELATIVE to be resolved last within each given
group (but not across the groups, e.g. 1) 3) 2) 4)).

> (for example 1) can be data access through GOT, 2) is ifunc
> resolved function address access through GOT, 3) is extern
> function call, 4) is ifunc resolved function call that binds
> locally e.g. static function with _IRELATIVE reloc.)
> 
> the only difference between lazy binding and bind now is at
> step 3): run time vs load time ifunc resolution.

Agreed.

> of course the ordering in 3) can break resolvers with bind
> now that work with lazy binding, but the real problem is 2):
> a resolver called there must only depend on relocs in 1).

I was thinking about this.

Would it be possible on ARM and PPC64, whose R_*_IRELATIVE
relocs are in DT_REL*, to reorder the processing in the dynamic
loader? Resolve DT_JMPREL first, then DT_REL*.

That would give those machines feature parity with x86_64
without needing to rewrite the relocations in binutils to
handle this case?

> it is still possible to call extern functions from an ifunc
> resolver, but only if it is forced to use relocs in 1) (e.g.
> call through a volatile funcptr or -fno-plt).  i'm not sure
> if glibc wants to document this to work, because the user
> needs to know about relocations (which is compiler/linker
> internals).  the nasty part is that the compiler is free to
> add extern calls (into libc or compiler runtime) which can
> break the resolver so it cannot be written in c or c++ in
> principle :(

Correct.

On x86 with multiversioning the compiler emits multiple clones
of a function with different optimizations and selects based
on cpuid results. To get the cpuid results the ifunc resolver
emitted by the compiler calls into libgcc. As it is 
implemented this multiversioning only works on x86 because of
the relocation ordering.
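
Roughly what such a compiler-generated resolver does, written by hand
as a sketch. It assumes GCC or Clang; sum_default, sum_avx2, select_sum
and sum_ints are illustrative names, and the dispatch goes through a
cached function pointer rather than a real IFUNC symbol.
__builtin_cpu_init and __builtin_cpu_supports are the compiler builtins
that read the CPU data kept by the compiler runtime; on non-x86 targets
the sketch just falls back to the baseline.

```c
#include <stddef.h>

/* Baseline clone: valid on every CPU. */
static long sum_default(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

#if defined(__x86_64__) || defined(__i386__)
/* Clone the compiler is allowed to build with AVX2 instructions;
   it must only be called after the CPU check below. */
static __attribute__((target("avx2"))) long sum_avx2(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
#endif

typedef long (*sum_fn)(const int *, size_t);

/* Hand-written stand-in for the selection a multiversioning resolver makes. */
static sum_fn select_sum(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();               /* fills in the cpuid-derived data */
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_default;
}

long sum_ints(const int *v, size_t n)
{
    static sum_fn impl;                 /* cached after first use; not thread-safe */
    if (impl == NULL)
        impl = select_sum();
    return impl(v, n);
}
```

The call into the runtime happens inside select_sum, which is the part
that depends on relocation ordering when it runs as a real resolver.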

> the dynamic linker could do the reloc ordering a bit better
> (so e.g. 2) happens after 3) in case of lazy binding), but
> i'm not sure how much that would help if potentially all
> functions may be ifunc resolved in a module.

Could you expand on this a bit more? What would be the problem
in having the dynamic loader do relocation processing in this
order: 1) 3) 2) 4).
 
> * an omission from that wiki page is static linking:
> ifunc resolvers run very early then (so memcpy etc work
> during libc initialization), and that breaks stack-protection
> etc instrumentation: the thread pointer is not yet set up.

I mentioned that?

"The resolver must not be compiled with -fstack-protector-all
or any similar protections e.g. asan, since they may require
early setup which has not yet completed."

I just didn't talk about static vs. dynamic, I just forbid it
in general.

> the vdso is not yet set up either and the vsyscall mechanism
> uses ifunc now, so vdso does not work with static linking at
> all (!) clock_gettime goes through a syscall (i think this is
> a bug that can result in surprising perf regression for users
> who expect speedup from static linking so i opened BZ 19767 ).

Agreed.

> i suspect there might be other limitations on resolvers
> because ptr mangling is not set up either..

Maybe.

> probably static linking can be fixed by having two sets of
> ifunc resolvers: one that only the libc uses and runs early
> and another set that runs after some c runtime init is done
> similar to the dynamic linked case.

Right.

> i actually would like to use vdso from ifunc resolvers
> to do the ifunc dispatch based on information that is only
> available in the kernel and cannot be easily communicated
> through other means (e.g. sysfs stuff).

Sure. Examples needed.
 
> * yet another issue is that the ifunc resolver type
> signature is different on different targets.

This is really lame.

> (and if the user defined resolver takes no argument, but the
> dynamic linker calls it with arguments that is not strictly
> correct in c even if it happens to work for most call abis:
> there were hardening proposals based on type signature checks
> for indirect calls which the dynamic linker would violate).

Agreed, we need to fix this.

>> That way I can point users at this.
>>
>> In gperftools tcmalloc added an IFUNC use [1] which
>> violates some of the requirements under -Wl,-z,now,
>> so I have a need to document this support and discuss
>> with tcmalloc developers what we might do. Right now
>> they call way too much code for this to work.
>>
>> Cheers,
>> Carlos.
>>
>> [1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7
>>
> +static bool sized_delete_enabled(void) {
> +  if (tcmalloc_sized_delete_enabled != 0) {
> +    return !!tcmalloc_sized_delete_enabled();
> +  }
> 
> i think this call happens to work because the func address
> check for the weak ref forces the reloc to happen at step 1).

OK.

> +  const char *flag = TCMallocGetenvSafe("TCMALLOC_ENABLE_SIZED_DELETE");
> +  return tcmalloc::commandlineflags::StringToBool(flag, false);
> 
> i think this will crash if the address of delete is used
> (so ifunc resolver runs at step 2 while PLTGOT entries are
> uninitialized) independently of binding lazy vs now.
> with binding now it may crash without taking the address
> of delete.

Right.
 
> i'll try to update the wiki, but will wait for some
> feedback here for a while.

Thanks! Feel free to update the page!

Cheers,
Carlos.


* Re: Document use of IFUNC support outside of libc.
  2016-03-04 17:54 ` Szabolcs Nagy
  2016-03-04 21:49   ` Carlos O'Donell
@ 2016-03-04 21:56   ` Florian Weimer
  1 sibling, 0 replies; 9+ messages in thread
From: Florian Weimer @ 2016-03-04 21:56 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Carlos O'Donell, GNU C Library, nd

* Szabolcs Nagy:

> it is still possible to call extern functions from an ifunc
> resolver, but only if it is forced to use relocs in 1) (e.g.
> call through a volatile funcptr or -fno-plt).

Does this change for architectures which use function descriptors?


* Re: Document use of IFUNC support outside of libc.
  2016-03-04 21:49   ` Carlos O'Donell
@ 2016-03-07 17:33     ` Szabolcs Nagy
  2016-04-15 15:11       ` Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.) Siddhesh Poyarekar
  0 siblings, 1 reply; 9+ messages in thread
From: Szabolcs Nagy @ 2016-03-07 17:33 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library
  Cc: nd, Ramana Radhakrishnan, Marcus Shawcroft

On 04/03/16 21:48, Carlos O'Donell wrote:
> On 03/04/2016 12:54 PM, Szabolcs Nagy wrote:
>> On 03/03/16 21:10, Carlos O'Donell wrote:
>>> I attempted to distill some of your notes here:
>>> https://sourceware.org/glibc/wiki/GNU_IFUNC
>>>
>>
>> thanks, i was meaning to write something about it on the wiki,
>> but it is a bit hard to separate the bugs from the features.
> 
> I think we should make this work sensibly for a sensible set
> of use cases. In particular we are probably going to have to
> explicitly state what is and is not supported, and what functions
> you can and can't call. I'm happy for IFUNC to exist for user
> code if we impose limits like: only access local variables,
> only call local functions, only use POD data types, only call
> the following glibc functions, etc. etc.
> 
>> i identified some issues:
>>
>> * the first point about bind now is not entirely correct,
>> lazy binding does not change that much.
> 
> Clarified. I agree the ordering doesn't change, all I wanted to
> do was provide some background about *why* on certain machines
> this fails.
> 
>> the reloc processing order at load time is:
>>
>> 1) DT_REL(A) relocs
>> 2) DT_REL(A) relocs that call ifunc resolvers
>> 3) DT_JMPREL relocs (may call ifunc resolvers or delay them)
>> 4) DT_JMPREL relocs that call ifunc resolvers
> 
> This is the ordering per elf_dynamic_do_Rel, right? There
> we force IRELATIVE to be resolved last within each given
> group (but not across the groups, e.g. 1) 3) 2) 4)).
> 

_ELF_DYNAMIC_DO_RELOC in elf/dynamic-link.h orders 1,2
before 3,4 and elf_dynamic_do_Rel in elf/do-rel.h orders
3 before 4.

3 before 4 is also guaranteed by binutils ld since
https://sourceware.org/bugzilla/show_bug.cgi?id=13302

i think 1 is ordered before 2 only in recent binutils ld
https://sourceware.org/bugzilla/show_bug.cgi?id=18841
(and it seems it was only fixed for x86, ppc and s390)

i think JUMP_SLOT relocs within 3 are also sorted by ld
such that STT_GNU_IFUNC symbols come last.

>> (for example 1) can be data access through GOT, 2) is ifunc
>> resolved function address access through GOT, 3) is extern
>> function call, 4) is ifunc resolved function call that binds
>> locally e.g. static function with _IRELATIVE reloc.)
>>
>> the only difference between lazy binding and bind now is at
>> step 3): run time vs load time ifunc resolution.
> 
> Agreed.
> 
>> of course the ordering in 3) can break resolvers with bind
>> now that work with lazy binding, but the real problem is 2):
>> a resolver called there must only depend on relocs in 1).
> 
> I was thinking about this.
> 
> Would it be possible on ARM and PPC64, whose R_*_IRELATIVE
> relocs are in DT_REL*, to reorder the processing in the dynamic
> loader? Resolve DT_JMPREL first, then DT_REL*.
> 
> That would give those machines feature parity with x86_64
> without needing to rewrite the relocations in binutils to
> handle this case?
> 

i haven't looked at non-x86 targets yet.

i think glibc dynlinker can do the relocs in arbitrary order
(the order is only observable through ifunc resolvers), but
the code might become ugly if there is arch dependent ordering.

>> it is still possible to call extern functions from an ifunc
>> resolver, but only if it is forced to use relocs in 1) (e.g.
>> call through a volatile funcptr or -fno-plt).  i'm not sure
>> if glibc wants to document this to work, because the user
>> needs to know about relocations (which is compiler/linker
>> internals).  the nasty part is that the compiler is free to
>> add extern calls (into libc or compiler runtime) which can
>> break the resolver so it cannot be written in c or c++ in
>> principle :(
> 
> Correct.
> 
> On x86 with multiversioning the compiler emits multiple clones
> of a function with different optimizations and selects based
> on cpuid results. To get the cpuid results the ifunc resolver
> emitted by the compiler calls into libgcc. As it is 
> implemented this multiversioning only works on x86 because of
> the relocation ordering.
> 
>> the dynamic linker could do the reloc ordering a bit better
>> (so e.g. 2) happens after 3) in case of lazy binding), but
>> i'm not sure how much that would help if potentially all
>> functions may be ifunc resolved in a module.
> 
> Could you expand on this a bit more? What would be the problem
> in having the dynamic loader do relocation processing in this
> order: 1) 3) 2) 4).
>  

the ordering does not fix the case when ifunc resolvers
reference ifunc resolved functions in the same module.
(because the relocs are not ordered according to ifunc
dependency)

otherwise i think it would make the most common cases work.
(both lazy and non-lazy binding, although lazy binding would
work in more cases)

>> * an omission from that wiki page is static linking:
>> ifunc resolvers run very early then (so memcpy etc work
>> during libc initialization), and that breaks stack-protection
>> etc instrumentation: the thread pointer is not yet set up.
> 
> I mentioned that?
> 
> "The resolver must not be compiled with -fstack-protector-all
> or any similar protections e.g. asan, since they may require
> early setup which has not yet completed."
> 
> I just didn't talk about static vs. dynamic, I just forbid it
> in general.
> 

sorry, indeed it is documented, but i wanted to note that it only
fails with static linking because i think this is undesirable.
(that code is running without thread pointer set up so accessing
errno or other tls would crash).

>> the vdso is not yet set up either and the vsyscall mechanism
>> uses ifunc now, so vdso does not work with static linking at
>> all (!) clock_gettime goes through a syscall (i think this is
>> a bug that can result in surprising perf regression for users
>> who expect speedup from static linking so i opened BZ 19767 ).
> 
> Agreed.
> 
>> i suspect there might be other limitations on resolvers
>> because ptr mangling is not set up either..
> 
> Maybe.
> 
>> probably static linking can be fixed by having two sets of
>> ifunc resolvers: one that only the libc uses and runs early
>> and another set that runs after some c runtime init is done
>> similar to the dynamic linked case.
> 
> Right.
> 
>> i actually would like to use vdso from ifunc resolvers
>> to do the ifunc dispatch based on information that is only
>> available in the kernel and cannot be easily communicated
>> through other means (e.g. sysfs stuff).
> 
> Sure. Examples needed.
>  

there seems to be interest in optimizations/dispatch based
on the micro architecture which is not easily available in
userspace currently (on aarch64).

linux exports various cpu info in /sys but that is not
stable abi and users probably don't want large number of
syscalls traversing the /sys tree at process startup just
to get slightly better tuned memcpy or similar.

one idea by Adhemerval Zanella was to use vdso for this.
(the kernel can provide a versioned function symbol there
to return a pointer to some cpu info struct, which can be
read only thus shared across processes).
there is no proposed design for this yet either on kernel
or libc side, but it would make sense if ifunc could use it.

currently the only reliable mechanisms for ifunc dispatch
are hwcap feature bits (if passed as argument) or cpuid
like instruction (e.g. on aarch64 cpuid like instructions
are not available to userspace, but can be emulated by the
kernel or provided as syscall, in either case it would be
context switch into the kernel, which can be bad if large
number of ifunc resolvers do it e.g. because function multi-
versioning is implemented that way, unless there is some
caching mechanism which is also not easy to do in ifunc...)
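
A sketch of hwcap-based selection, assuming Linux with glibc's
getauxval. HYPOTHETICAL_HWCAP_FAST_COPY is not a real feature bit, and
copy_bytes / copy_tuned are placeholders. select_copy takes hwcap as a
parameter, the way a resolver on a target that passes hwcap would
receive it; select_copy_from_auxv makes the same choice from ordinary
code after startup, where calling getauxval is not a problem.

```c
#include <stddef.h>
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Generic variant: forward byte copy. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Placeholder for a tuned variant; it copies from the end so the two
   bodies differ, but the result is the same for non-overlapping buffers. */
static void *copy_tuned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        d[n] = s[n];
    return dst;
}

/* Made-up feature bit, for illustration only. */
#define HYPOTHETICAL_HWCAP_FAST_COPY (1UL << 0)

/* Pure selection on the hwcap word: no calls, no global state. */
copy_fn select_copy(unsigned long hwcap)
{
    return (hwcap & HYPOTHETICAL_HWCAP_FAST_COPY) ? copy_tuned : copy_bytes;
}

/* Same selection from normal code, reading hwcap from the aux vector. */
copy_fn select_copy_from_auxv(void)
{
    return select_copy(getauxval(AT_HWCAP));
}
```

Keeping the decision in a function of hwcap alone means the resolver
itself never has to enter the kernel.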

>> * yet another issue is that the ifunc resolver type
>> signature is different on different targets.
> 
> This is really lame.
> 
>> (and if the user defined resolver takes no argument, but the
>> dynamic linker calls it with arguments that is not strictly
>> correct in c even if it happens to work for most call abis:
>> there were hardening proposals based on type signature checks
>> for indirect calls which the dynamic linker would violate).
> 
> Agreed, we need to fix this.
> 

i think it's not easy to fix: binutils and gcc already
have ifunc test cases (where resolvers take no argument)

most non-x86 archs take a hwcap argument, but in the
mips ifunc patch the resolver has 3 arguments.

>>> That way I can point users at this.
>>>
>>> In gperftools tcmalloc added an IFUNC use [1] which
>>> violates some of the requirements under -Wl,-z,now,
>>> so I have a need to document this support and discuss
>>> with tcmalloc developers what we might do. Right now
>>> they call way too much code for this to work.
>>>
>>> Cheers,
>>> Carlos.
>>>
>>> [1] https://github.com/gperftools/gperftools/commit/6fdfc5a7f40ebcff3fdaada1a2994ff54be2f9c7
>>>
>> +static bool sized_delete_enabled(void) {
>> +  if (tcmalloc_sized_delete_enabled != 0) {
>> +    return !!tcmalloc_sized_delete_enabled();
>> +  }
>>
>> i think this call happens to work because the func address
>> check for the weak ref forces the reloc to happen at step 1).
> 
> OK.
> 
>> +  const char *flag = TCMallocGetenvSafe("TCMALLOC_ENABLE_SIZED_DELETE");
>> +  return tcmalloc::commandlineflags::StringToBool(flag, false);
>>
>> i think this will crash if the address of delete is used
>> (so ifunc resolver runs at step 2 while PLTGOT entries are
>> uninitialized) independently of binding lazy vs now.
>> with binding now it may crash without taking the address
>> of delete.
> 
> Right.
>  
>> i'll try to update the wiki, but will wait for some
>> feedback here for a while.
> 
> Thanks! Feel free to update the page!
> 
> Cheers,
> Carlos.
> 


* Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.)
  2016-03-07 17:33     ` Szabolcs Nagy
@ 2016-04-15 15:11       ` Siddhesh Poyarekar
  2016-04-15 23:51         ` pinskia
  0 siblings, 1 reply; 9+ messages in thread
From: Siddhesh Poyarekar @ 2016-04-15 15:11 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Carlos O'Donell, GNU C Library, nd, Ramana Radhakrishnan,
	Marcus Shawcroft

On Mon, Mar 07, 2016 at 05:33:24PM +0000, Szabolcs Nagy wrote:
> there seems to be interest in optimizations/dispatch based
> on the micro architecture which is not easily available in
> userspace currently (on aarch64).

Sorry, I was interested in this conversation but completely missed it,
so starting it again.  I hope it's not too late :)

> linux exports various cpu info in /sys but that is not
> stable abi and users probably don't want large number of
> syscalls traversing the /sys tree at process startup just
> to get slightly better tuned memcpy or similar.
> 
> one idea by Adhemerval Zanella was to use vdso for this.
> (the kernel can provide a versioned function symbol there
> to return a pointer to some cpu info struct, which can be
> read only thus shared across processes).
> there is no proposed design for this yet either on kernel
> or libc side, but it would make sense if ifunc could use it.
> 
> currently the only reliable mechanisms for ifunc dispatch
> are hwcap feature bits (if passed as argument) or cpuid
> like instruction (e.g. on aarch64 cpuid like instructions
> are not available to userspace, but can be emulated by the
> kernel or provided as syscall, in either case it would be
> context switch into the kernel, which can be bad if large
> number of ifunc resolvers do it e.g. because function multi-
> versioning is implemented that way, unless there is some
> caching mechanism which is also not easy to do in ifunc...)

The context switch is not the worst thing that can happen for the
emulated instructions because we can easily cache the result and
reduce the number of context switches to a minimum.  The difficult bit
for the emulated instruction (MRS) is heterogeneous systems, where it
would be difficult (impossible?) for userspace to just use the
emulated instruction to deterministically identify all of the
processor cores.

So the emulated instruction will only work for specific processor
cores that are known to always be in a homogeneous configuration and
never otherwise.  For anything else, we will need the kernel to give
us full information about all of the cores in another way, either via
sysfs or vdso.  The sysfs route has been proposed earlier[1] but is
hairy for us because it traverses the filesystem to identify all CPU
cores, resulting in a proportional number of syscalls.  The vdso
alternative is better because the kernel can then give us all of the
information in exactly one call and avoid the context switch at the
same time.

I had hacked up a patch to test using the sysfs patches in [1] and it
required reimplementing some string functions to avoid referencing
them but that was about the only thing needed to get it working.
Safety however is a completely different issue and I don't know if we
can even guarantee that during symbol resolution.
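
A sketch of the kind of local helpers this involves, with illustrative
names (local_strlen, local_streq), not the ones from that patch. They
are static, so no symbol lookup is needed to call them. The
NO_LOOP_TO_LIBCALL attribute is an assumption about GCC: it asks the
optimizer not to turn the loops back into libc calls, and other
compilers or versions may need a different switch.

```c
#include <stddef.h>

/* GCC only: keep the optimizer from replacing these loops with calls
   into libc.  Expands to nothing on other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
    __attribute__((optimize("-fno-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

/* Length of a NUL-terminated string, without calling strlen. */
NO_LOOP_TO_LIBCALL
static size_t local_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Nonzero if the two NUL-terminated strings are equal, without strcmp. */
NO_LOOP_TO_LIBCALL
static int local_streq(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;
}
```

This only removes the explicit references; it does not address the
safety question.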

Siddhesh

[1] https://lkml.org/lkml/2015/9/16/452


* Re: Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.)
  2016-04-15 15:11       ` Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.) Siddhesh Poyarekar
@ 2016-04-15 23:51         ` pinskia
  2016-04-16 17:39           ` Siddhesh Poyarekar
  0 siblings, 1 reply; 9+ messages in thread
From: pinskia @ 2016-04-15 23:51 UTC (permalink / raw)
  To: Siddhesh Poyarekar
  Cc: Szabolcs Nagy, Carlos O'Donell, GNU C Library, nd,
	Ramana Radhakrishnan, Marcus Shawcroft



> On Apr 15, 2016, at 8:10 AM, Siddhesh Poyarekar <sid@reserved-bit.com> wrote:
> 
>> On Mon, Mar 07, 2016 at 05:33:24PM +0000, Szabolcs Nagy wrote:
>> there seems to be interest in optimizations/dispatch based
>> on the micro architecture which is not easily available in
>> userspace currently (on aarch64).
> 
> Sorry, I was interested in this conversation but completely missed it,
> so starting it again.  I hope it's not too late :)
> 
>> linux exports various cpu info in /sys but that is not
>> stable abi and users probably don't want large number of
>> syscalls traversing the /sys tree at process startup just
>> to get slightly better tuned memcpy or similar.
>> 
>> one idea by Adhemerval Zanella was to use vdso for this.
>> (the kernel can provide a versioned function symbol there
>> to return a pointer to some cpu info struct, which can be
>> read only thus shared across processes).
>> there is no proposed design for this yet either on kernel
>> or libc side, but it would make sense if ifunc could use it.
>> 
>> currently the only reliable mechanisms for ifunc dispatch
>> are hwcap feature bits (if passed as argument) or cpuid
>> like instruction (e.g. on aarch64 cpuid like instructions
>> are not available to userspace, but can be emulated by the
>> kernel or provided as syscall, in either case it would be
>> context switch into the kernel, which can be bad if large
>> number of ifunc resolvers do it e.g. because function multi-
>> versioning is implemented that way, unless there is some
>> caching mechanism which is also not easy to do in ifunc...)
> 
> The context switch is not the worst thing that can happen for the
> emulated instructions because we can easily cache the result and
> reduce the number of context switches to a minimum.  The difficult bit
> for the emulated instruction (MRS) is heterogeneous systems, where it
> would be difficult (impossible?) for userspace to just use the
> emulated instruction to deterministically identify all of the
> processor cores.
> 
> So the emulated instruction will only work for specific processor
> cores that are known to always be in a homogeneous configuration and
> never otherwise.  For anything else, we will need the kernel to give
> us full information about all of the cores in another way, either via
> sysfs or vdso.  The sysfs route has been proposed earlier[1] but is
> hairy for us because it traverses the filesystem to identify all CPU
> cores, resulting in a proportional number of syscalls.  The vdso
> alternative is better because the kernel can then give us all of the
> information in exactly one call and avoid the context switch at the
> same time.
> 
> I had hacked up a patch to test using the sysfs patches in [1] and it
> required reimplementing some string functions to avoid referencing
> them but that was about the only thing needed to get it working.
> Safety however is a completely different issue and I don't know if we
> can even guarantee that during symbol resolution.

I gave an alternative to this approach by passing midr via the aux
vector. It still is useful and we can change the kernel to have it
return unknown for those known values which will be used for
big.little. I don't have a link to my implementation right now
though as I am traveling.  This is much safer, and it is easier to do
the blacklisting inside the kernel; the aux vector is also basically
free: no open/read/close from ifunc or early launch either.

Thanks,
Andrew


> 
> Siddhesh
> 
> [1] https://lkml.org/lkml/2015/9/16/452


* Re: Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.)
  2016-04-15 23:51         ` pinskia
@ 2016-04-16 17:39           ` Siddhesh Poyarekar
  2016-05-10  6:34             ` Andrew Pinski
  0 siblings, 1 reply; 9+ messages in thread
From: Siddhesh Poyarekar @ 2016-04-16 17:39 UTC (permalink / raw)
  To: pinskia
  Cc: Szabolcs Nagy, Carlos O'Donell, GNU C Library, nd,
	Ramana Radhakrishnan, Marcus Shawcroft

On Fri, Apr 15, 2016 at 03:09:38PM -0700, pinskia@gmail.com wrote:
> I gave an alternative to this approach by passing midr via the aux
> vector. It still is useful and we can change the kernel to have it
> return unknown for those known values which will be used for
> big.little. I don't have a link to my implementation right now
> though as I am traveling.  This is much safer and easier to the
> black listing inside the kernel and the aux vector is basically free
> no open/read/close from ifunc or early launch either.

Is this[1] the patch you're referring to?  It seems reasonable to me
given that we can never support big.little reliably with hotplug
potentially mixing things up.  But it really depends on how seriously
we want to consider the possibility of having optimal routines for
big.little systems.

We could probably make this patch play nicely with Suzuki's patchset:
use the auxvec entry as a first check, then fall back to trawling
sysfs or making the vdso function call if we ever need to implement
optimal routines for a big.little system.  However, if optimizing for
big.little is a serious possibility, then it makes sense to solve that
problem right now instead of burying it temporarily.
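
The ordering described above (auxvec first, slower sources only when the
kernel reports "unknown") could look roughly like the sketch below.  It
is only an illustration: `MIDR_UNKNOWN`, `detect_midr` and the two stub
probes are hypothetical names, the value 0 for "unknown" is an
assumption, and the stubs stand in for a real vdso call or sysfs walk.
It also collapses the per-core information a big.little system would
need into a single value.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the kernel reports 0 when it declines to give a MIDR. */
#define MIDR_UNKNOWN 0u

/* A fallback source returns a MIDR value or MIDR_UNKNOWN. */
typedef uint64_t (*midr_probe_fn)(void);

/* Prefer the value handed over via the aux vector; consult the
   fallbacks in order only when that value is unknown. */
static uint64_t detect_midr(uint64_t auxv_midr,
                            const midr_probe_fn *fallbacks,
                            size_t nfallbacks)
{
    if (auxv_midr != MIDR_UNKNOWN)
        return auxv_midr;

    for (size_t i = 0; i < nfallbacks; i++) {
        uint64_t midr = fallbacks[i]();
        if (midr != MIDR_UNKNOWN)
            return midr;
    }
    return MIDR_UNKNOWN;
}

/* Hypothetical stand-ins for a vdso call and a sysfs walk. */
static uint64_t probe_vdso_stub(void)  { return MIDR_UNKNOWN; }
static uint64_t probe_sysfs_stub(void) { return 0x410fd034u; }
```

In the common case the loop is never entered, which keeps the cheap
path free of any file or system call traffic.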

Siddhesh

[1] https://patches.linaro.org/patch/52856/

* Re: Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.)
  2016-04-16 17:39           ` Siddhesh Poyarekar
@ 2016-05-10  6:34             ` Andrew Pinski
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Pinski @ 2016-05-10  6:34 UTC (permalink / raw)
  To: Siddhesh Poyarekar
  Cc: Szabolcs Nagy, Carlos O'Donell, GNU C Library, nd,
	Ramana Radhakrishnan, Marcus Shawcroft

On Sat, Apr 16, 2016 at 10:38 AM, Siddhesh Poyarekar
<sid@reserved-bit.com> wrote:
> On Fri, Apr 15, 2016 at 03:09:38PM -0700, pinskia@gmail.com wrote:
>> I gave an alternative to this approach by passing midr via the aux
>> vector. It still is useful and we can change the kernel to have it
>> return unknown for those known values which will be used for
>> big.little. I don't have a link to my implementation right now
>> though as I am traveling.  This is much safer and easier to the
>> black listing inside the kernel and the aux vector is basically free
>> no open/read/close from ifunc or early launch either.
>
> Is this[1] the patch you're referring to?
Yes.

> It seems reasonable to me
> given that we can never support big.little reliably with hotplug
> potentially mixing things up.  But it really depends on how seriously
> we want to consider the possibility of having optimal routines for
> big.little systems.

I personally don't have any big.little system which I need to optimize
for.  I need to optimize for the ThunderX series of processors.  I
already have a memcpy and a memset optimized for ThunderX, but they
depend on this kernel patch being approved.
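
As a rough illustration of what a resolver could do once it has a MIDR
value, the sketch below decodes the implementer and part number fields
and picks a variant.  The names (`select_copy_variant`, `copy_variant`,
`MIDR_UNKNOWN`) are hypothetical and the value 0 for "unknown" is an
assumption.  The Cavium implementer code 0x43 and ThunderX part number
0x0a1 are, to my knowledge, the values the Linux kernel uses.  How the
MIDR reaches the resolver is deliberately left out.

```c
#include <stdint.h>

/* MIDR_EL1 fields used here: implementer in bits [31:24],
   part number in bits [15:4]. */
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfffu)

#define MIDR_UNKNOWN       0u      /* assumption: "unknown" sentinel */
#define IMPLEMENTER_CAVIUM 0x43u
#define PARTNUM_THUNDERX   0x0a1u

enum copy_variant { COPY_GENERIC, COPY_THUNDERX };

/* Pure function of its argument: no file access, no library calls,
   so it is cheap enough to run during early relocation processing. */
static enum copy_variant select_copy_variant(uint64_t midr)
{
    if (midr == MIDR_UNKNOWN)
        return COPY_GENERIC;
    if (MIDR_IMPLEMENTER(midr) == IMPLEMENTER_CAVIUM
        && MIDR_PARTNUM(midr) == PARTNUM_THUNDERX)
        return COPY_THUNDERX;
    return COPY_GENERIC;
}
```

An "unknown" value simply selects the generic routine, so a kernel
that withholds the MIDR on mixed-core systems costs nothing beyond
the missed optimization.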

Thanks,
Andrew

>
> We could probably make this patch play nicely with Suzuki's patchset
> and use the auxvec entry as a first check and then fall back to
> trawling sysfs or do the vdso function call if we ever need to
> implement optimal routines for a big.little system.  However if
> optimizing for big.little is a serious possibility then it makes sense
> to solve that problem right now instead of burying it temporarily.
>
> Siddhesh
>
> [1] https://patches.linaro.org/patch/52856/

Thread overview: 9+ messages
2016-03-03 21:10 Document use of IFUNC support outside of libc Carlos O'Donell
2016-03-04 17:54 ` Szabolcs Nagy
2016-03-04 21:49   ` Carlos O'Donell
2016-03-07 17:33     ` Szabolcs Nagy
2016-04-15 15:11       ` Doing more inside an ifunc (Was Re: Document use of IFUNC support outside of libc.) Siddhesh Poyarekar
2016-04-15 23:51         ` pinskia
2016-04-16 17:39           ` Siddhesh Poyarekar
2016-05-10  6:34             ` Andrew Pinski
2016-03-04 21:56   ` Document use of IFUNC support outside of libc Florian Weimer
