public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [RFC] ifunc suck, use ufunc.
@ 2015-05-25  2:23 Ondřej Bílka
  2015-05-25  4:41 ` Szabolcs Nagy
  0 siblings, 1 reply; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25  2:23 UTC (permalink / raw)
  To: libc-alpha

Hi, header cleanup reminded me a project in my backlog.
In short ifunc approach is inadequate for optimization.
There are numerous problems, main one is lack of granularity.
In different applications a different variant could be optimal, even
different call sites in same application could differ.

Then there is problem that selection happens too early, an optimal
variant is decided by runtime profiling and you cannot change that.

Simple example of problem is if say strncpy is called once per program
lifetime you waste lot of cycles by fetching implementation to cache.

A better approach would be to have small implementation and switch to
bigger one once you used it at least say 20 times.

So if we found that ifunc is problem not solution an solution is
suprisingly simple:

Do resolution in the userspace. It would also be faster as it saves
cycles on useless plt indirection.

This is my plan for string functions, essence is use following pattern
with actual resolution instead dlsym(RTLD_DEFAULT,"memset")

A main benefit would be interlibrary constant folding. Why waste cycles
on reinitializing constant, just save it to ufunc structure. Resolver 
then could precompute tables to improve speed.

As interposing these you would need to interpose resolver.

An gcc support is not needed but we could get something with alternate
calling convention as passing resolver struct is common and could be
preserved for loops with tail calls.

A future direction could be replace plt and linker with ufunc, it would
require adding function string pointer to structure and calling first
generic resolver to select specific resolver.

Comments?

An simplified example is here (for arch that pass four arguments in registers):

struct memset_ufunc
{
  void *(*fn)(void *, int, size_t, struct memset_ufunc *);
  char  __attribute__((aligned (32))) data[32];
};
# define memset(s, c, n) \
   (__extension__({ \
    static struct memset_ufunc __resolve = {.fn = memset_resolver}; \
    __resolve.fn (s, c, n, &__resolve); \
    }))

void *
memset_resolver (void *s, int c, size_t n, struct memset_ufunc *resolve)
{
  resolve->fn = dlsym (RTLD_DEFAULT, "memset");
  return resolve->fn (s, c, n, resolve);
}

void foo (char *c);
int main ()
{
  char u[32];
  memset (u, 1, 32);
}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  2:23 [RFC] ifunc suck, use ufunc Ondřej Bílka
@ 2015-05-25  4:41 ` Szabolcs Nagy
  2015-05-25  5:18   ` Rich Felker
                     ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Szabolcs Nagy @ 2015-05-25  4:41 UTC (permalink / raw)
  To: Ond??ej B?lka; +Cc: libc-alpha

* Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> A main benefit would be interlibrary constant folding. Why waste cycles
> on reinitializing constant, just save it to ufunc structure. Resolver 
> then could precompute tables to improve speed.
> 
> As interposing these you would need to interpose resolver.
> 
> An gcc support is not needed but we could get something with alternate
> calling convention as passing resolver struct is common and could be
> preserved for loops with tail calls.
> 
> A future direction could be replace plt and linker with ufunc, it would
> require adding function string pointer to structure and calling first
> generic resolver to select specific resolver.
> 
> Comments?
> 

this makes memset non-async-signal-safe. (qoi issue)

it is not thread-safe either and would need an acquire
load barrier on every invocation of memset to fix that
or the use of thread local storage. (conformance issue)

(in the example only resolve->fn is modified and idempotently,
this would work in practice but as soon as ->data is accessed
too the memory ordering guarantees are required.. which can
be made efficient on some archs but only in asm)

in the example memset is called through the wrong type
of function pointer: the resolver and resolvee are
incompatible so this is invalid c, only works in asm.

it is not clear to me how many such ufunc structs will be
in a program for a specific function and how their redundant
initialization is avoided.
(one for every call site? every tu? every dso?)

> An simplified example is here (for arch that pass four arguments in registers):
> 
> struct memset_ufunc
> {
>   void *(*fn)(void *, int, size_t, struct memset_ufunc *);
>   char  __attribute__((aligned (32))) data[32];
> };
> # define memset(s, c, n) \
>    (__extension__({ \
>     static struct memset_ufunc __resolve = {.fn = memset_resolver}; \
>     __resolve.fn (s, c, n, &__resolve); \
>     }))
> 
> void *
> memset_resolver (void *s, int c, size_t n, struct memset_ufunc *resolve)
> {
>   resolve->fn = dlsym (RTLD_DEFAULT, "memset");
>   return resolve->fn (s, c, n, resolve);
> }
> 
> void foo (char *c);
> int main ()
> {
>   char u[32];
>   memset (u, 1, 32);
> }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  4:41 ` Szabolcs Nagy
@ 2015-05-25  5:18   ` Rich Felker
  2015-05-25  6:11     ` Ondřej Bílka
  2015-05-25  6:03   ` [RFC] ifunc suck, use ufunc Ondřej Bílka
  2015-05-25 10:55   ` Ondřej Bílka
  2 siblings, 1 reply; 22+ messages in thread
From: Rich Felker @ 2015-05-25  5:18 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Ondřej Bílka, libc-alpha

On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> * Ondřej Bílka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > A main benefit would be interlibrary constant folding. Why waste cycles
> > on reinitializing constant, just save it to ufunc structure. Resolver 
> > then could precompute tables to improve speed.
> > 
> > As interposing these you would need to interpose resolver.
> > 
> > An gcc support is not needed but we could get something with alternate
> > calling convention as passing resolver struct is common and could be
> > preserved for loops with tail calls.
> > 
> > A future direction could be replace plt and linker with ufunc, it would
> > require adding function string pointer to structure and calling first
> > generic resolver to select specific resolver.
> > 
> > Comments?
> > 
> 
> this makes memset non-async-signal-safe. (qoi issue)
> 
> it is not thread-safe either and would need an acquire
> load barrier on every invocation of memset to fix that
> or the use of thread local storage. (conformance issue)
> 
> (in the example only resolve->fn is modified and idempotently,
> this would work in practice but as soon as ->data is accessed
> too the memory ordering guarantees are required.. which can
> be made efficient on some archs but only in asm)
> 
> in the example memset is called through the wrong type
> of function pointer: the resolver and resolvee are
> incompatible so this is invalid c, only works in asm.
> 
> it is not clear to me how many such ufunc structs will be
> in a program for a specific function and how their redundant
> initialization is avoided.
> (one for every call site? every tu? every dso?)

Agree with all points. IFUNC sucks, but the reasons that IFUNC sucks
are all exacerbated by this proposal, not fixed.

Rich

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  4:41 ` Szabolcs Nagy
  2015-05-25  5:18   ` Rich Felker
@ 2015-05-25  6:03   ` Ondřej Bílka
  2015-05-25  6:07     ` Rich Felker
  2015-05-25  8:12     ` Ondřej Bílka
  2015-05-25 10:55   ` Ondřej Bílka
  2 siblings, 2 replies; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25  6:03 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: libc-alpha

On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > A main benefit would be interlibrary constant folding. Why waste cycles
> > on reinitializing constant, just save it to ufunc structure. Resolver 
> > then could precompute tables to improve speed.
> > 
> > As interposing these you would need to interpose resolver.
> > 
> > An gcc support is not needed but we could get something with alternate
> > calling convention as passing resolver struct is common and could be
> > preserved for loops with tail calls.
> > 
> > A future direction could be replace plt and linker with ufunc, it would
> > require adding function string pointer to structure and calling first
> > generic resolver to select specific resolver.
> > 
> > Comments?
> > 
> 
> this makes memset non-async-signal-safe. (qoi issue)
>
Did I explicitly say that its architecture specific optimization or did
I forgot?
 
> it is not thread-safe either and would need an acquire
> load barrier on every invocation of memset to fix that
> or the use of thread local storage. (conformance issue)
> 
> (in the example only resolve->fn is modified and idempotently,
> this would work in practice but as soon as ->data is accessed
> too the memory ordering guarantees are required.. which can
> be made efficient on some archs but only in asm)
> 
> in the example memset is called through the wrong type
> of function pointer: the resolver and resolvee are
> incompatible so this is invalid c, only works in asm.
> 
Thats why I intended it as architecture-specific. On x64 it will work
along with memset prototype. Adding atomic/locking in resolver would be unnecessary
overhead.

Could make this generic by defining macros that expand to atomic read on
archs that don't act as pram.


> it is not clear to me how many such ufunc structs will be
> in a program for a specific function and how their redundant
> initialization is avoided.
> (one for every call site? every tu? every dso?)
> 
Main objective is neccessary on call-site basis. These aren't retundant
as data will be different for different call sites.
For example in sequence

memset (x,0,n);
memset (y,1,n);

In first memset data would contain 16 zeros, second will have 16 ones to
save cycles on repeated creating of mask.

Then there is planned optimization x64-specific where I need to change prototype
more to pass data in xmm0 register and end of string. Then you could
call different places of unrolled moves like

...
movdqu %xmm0, -64(%rdi)
movdqu %xmm0, -48(%rdi)
movdqu %xmm0, -32(%rdi)
movdqu %xmm0, -16(%rdi)
ret

It fixes that gcc does similar unrolling but only with 8-byte moves and
tends to be quite excessive with it and it would cause performance
penalty on cold paths. Also gcc couldn't do this without spliting
function to several per-cpu variants as on some arch this would be slow
without aligning, on others a rep stosq would be faster etc and you must
do resolution to determine what happened.

A per-dso would be possible with more bookkeeping (as I don't know how
convince compiler to do that), you would need to end
compilation by adding file with hidden variables with protected
attribute.

A similar idea would make more sense as gcc optimization to first
extract address from plt.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  6:03   ` [RFC] ifunc suck, use ufunc Ondřej Bílka
@ 2015-05-25  6:07     ` Rich Felker
  2015-05-25  7:02       ` Ondřej Bílka
  2015-05-25  8:12     ` Ondřej Bílka
  1 sibling, 1 reply; 22+ messages in thread
From: Rich Felker @ 2015-05-25  6:07 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: Szabolcs Nagy, libc-alpha

On Mon, May 25, 2015 at 04:36:52AM +0200, Ondřej Bílka wrote:
> On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> > * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > > A main benefit would be interlibrary constant folding. Why waste cycles
> > > on reinitializing constant, just save it to ufunc structure. Resolver 
> > > then could precompute tables to improve speed.
> > > 
> > > As interposing these you would need to interpose resolver.
> > > 
> > > An gcc support is not needed but we could get something with alternate
> > > calling convention as passing resolver struct is common and could be
> > > preserved for loops with tail calls.
> > > 
> > > A future direction could be replace plt and linker with ufunc, it would
> > > require adding function string pointer to structure and calling first
> > > generic resolver to select specific resolver.
> > > 
> > > Comments?
> > > 
> > 
> > this makes memset non-async-signal-safe. (qoi issue)
> >
> Did I explicitly say that its architecture specific optimization or did
> I forgot?

AS-safety is broken regardless of arch. Only the barrier stuff if
arch-specific.

> > it is not thread-safe either and would need an acquire
> > load barrier on every invocation of memset to fix that
> > or the use of thread local storage. (conformance issue)
> > 
> > (in the example only resolve->fn is modified and idempotently,
> > this would work in practice but as soon as ->data is accessed
> > too the memory ordering guarantees are required.. which can
> > be made efficient on some archs but only in asm)
> > 
> > in the example memset is called through the wrong type
> > of function pointer: the resolver and resolvee are
> > incompatible so this is invalid c, only works in asm.
> > 
> Thats why I intended it as architecture-specific. On x64 it will work
> along with memset prototype. Adding atomic/locking in resolver would be unnecessary
> overhead.
> 
> Could make this generic by defining macros that expand to atomic read on
> archs that don't act as pram.

Do you realize the relative cost of an atomic read (barrier) versus a
small memset? This is like driving an extra mile to a cheaper gas
station to save $0.01 per gallon...

> > it is not clear to me how many such ufunc structs will be
> > in a program for a specific function and how their redundant
> > initialization is avoided.
> > (one for every call site? every tu? every dso?)
> > 
> Main objective is neccessary on call-site basis. These aren't retundant
> as data will be different for different call sites.
> For example in sequence
> 
> memset (x,0,n);
> memset (y,1,n);
> 
> In first memset data would contain 16 zeros, second will have 16 ones to
> save cycles on repeated creating of mask.
> 
> Then there is planned optimization x64-specific where I need to change prototype
> more to pass data in xmm0 register and end of string. Then you could
> call different places of unrolled moves like
> 
> ....
> movdqu %xmm0, -64(%rdi)
> movdqu %xmm0, -48(%rdi)
> movdqu %xmm0, -32(%rdi)
> movdqu %xmm0, -16(%rdi)
> ret
> 
> It fixes that gcc does similar unrolling but only with 8-byte moves and
> tends to be quite excessive with it and it would cause performance
> penalty on cold paths. Also gcc couldn't do this without spliting
> function to several per-cpu variants as on some arch this would be slow
> without aligning, on others a rep stosq would be faster etc and you must
> do resolution to determine what happened.
> 
> A per-dso would be possible with more bookkeeping (as I don't know how
> convince compiler to do that), you would need to end
> compilation by adding file with hidden variables with protected
> attribute.
> 
> A similar idea would make more sense as gcc optimization to first
> extract address from plt.

This is utterly hideous...

Rich

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  5:18   ` Rich Felker
@ 2015-05-25  6:11     ` Ondřej Bílka
  2015-05-25  8:24       ` Carlos O'Donell
  0 siblings, 1 reply; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25  6:11 UTC (permalink / raw)
  To: Rich Felker; +Cc: Szabolcs Nagy, libc-alpha

On Sun, May 24, 2015 at 10:22:50PM -0400, Rich Felker wrote:
> On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> > * Ondřej Bílka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > > A main benefit would be interlibrary constant folding. Why waste cycles
> > > on reinitializing constant, just save it to ufunc structure. Resolver 
> > > then could precompute tables to improve speed.
> > > 
> > > As interposing these you would need to interpose resolver.
> > > 
> > > An gcc support is not needed but we could get something with alternate
> > > calling convention as passing resolver struct is common and could be
> > > preserved for loops with tail calls.
> > > 
> > > A future direction could be replace plt and linker with ufunc, it would
> > > require adding function string pointer to structure and calling first
> > > generic resolver to select specific resolver.
> > > 
> > > Comments?
> > > 
> > 
> > this makes memset non-async-signal-safe. (qoi issue)
> > 
> > it is not thread-safe either and would need an acquire
> > load barrier on every invocation of memset to fix that
> > or the use of thread local storage. (conformance issue)
> > 
> > (in the example only resolve->fn is modified and idempotently,
> > this would work in practice but as soon as ->data is accessed
> > too the memory ordering guarantees are required.. which can
> > be made efficient on some archs but only in asm)
> > 
> > in the example memset is called through the wrong type
> > of function pointer: the resolver and resolvee are
> > incompatible so this is invalid c, only works in asm.
> > 
> > it is not clear to me how many such ufunc structs will be
> > in a program for a specific function and how their redundant
> > initialization is avoided.
> > (one for every call site? every tu? every dso?)
> 
> Agree with all points. IFUNC sucks, but the reasons that IFUNC sucks
> are all exacerbated by this proposal, not fixed.
> 
These were easily fixable technical details that are pointless to argue
at this stage.

A problem is that optimal function selection depends on combination of
call site, current cpy and profile feedback. If you ignore one of
parameters you would obviously get worse performance.

So make a better proposal how to handle selection acccording to this
triplet. If you cannot make one its evidence that my ufuncs are best
approach so shut up.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  6:07     ` Rich Felker
@ 2015-05-25  7:02       ` Ondřej Bílka
  2015-05-25  8:58         ` Andrew Pinski
  0 siblings, 1 reply; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25  7:02 UTC (permalink / raw)
  To: Rich Felker; +Cc: Szabolcs Nagy, libc-alpha

On Sun, May 24, 2015 at 11:15:10PM -0400, Rich Felker wrote:
> On Mon, May 25, 2015 at 04:36:52AM +0200, Ondřej Bílka wrote:
> > On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> > > * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > > > A main benefit would be interlibrary constant folding. Why waste cycles
> > > > on reinitializing constant, just save it to ufunc structure. Resolver 
> > > > then could precompute tables to improve speed.
> > > > 
> > > > As interposing these you would need to interpose resolver.
> > > > 
> > > > An gcc support is not needed but we could get something with alternate
> > > > calling convention as passing resolver struct is common and could be
> > > > preserved for loops with tail calls.
> > > > 
> > > > A future direction could be replace plt and linker with ufunc, it would
> > > > require adding function string pointer to structure and calling first
> > > > generic resolver to select specific resolver.
> > > > 
> > > > Comments?
> > > > 
> > > 
> > > this makes memset non-async-signal-safe. (qoi issue)
> > >
> > Did I explicitly say that its architecture specific optimization or did
> > I forgot?
> 
> AS-safety is broken regardless of arch. Only the barrier stuff if
> arch-specific.
> 
> > > it is not thread-safe either and would need an acquire
> > > load barrier on every invocation of memset to fix that
> > > or the use of thread local storage. (conformance issue)
> > > 
> > > (in the example only resolve->fn is modified and idempotently,
> > > this would work in practice but as soon as ->data is accessed
> > > too the memory ordering guarantees are required.. which can
> > > be made efficient on some archs but only in asm)
> > > 
> > > in the example memset is called through the wrong type
> > > of function pointer: the resolver and resolvee are
> > > incompatible so this is invalid c, only works in asm.
> > > 
> > Thats why I intended it as architecture-specific. On x64 it will work
> > along with memset prototype. Adding atomic/locking in resolver would be unnecessary
> > overhead.
> > 
> > Could make this generic by defining macros that expand to atomic read on
> > archs that don't act as pram.
> 
> Do you realize the relative cost of an atomic read (barrier) versus a
> small memset? This is like driving an extra mile to a cheaper gas
> station to save $0.01 per gallon...
>
Since when does atomic read for x64 generates barrier?
And you don't need to establish any happens-before relashionship here,
so relaxed atomic read suffices. With atomic write you could only call a
function and you are ok or resolver multiple times. A resolver would
already take lock so its handled.

A problem is only for arch that don't do atomic write of pointers. Chips
that do this are typically so slow that if you care about performance
best you could do is sell it and buy something decent. Nobody would try
to improve performance by making several variants.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  6:03   ` [RFC] ifunc suck, use ufunc Ondřej Bílka
  2015-05-25  6:07     ` Rich Felker
@ 2015-05-25  8:12     ` Ondřej Bílka
  1 sibling, 0 replies; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25  8:12 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: libc-alpha

On Mon, May 25, 2015 at 04:36:52AM +0200, Ondřej Bílka wrote:
> On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> > * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > > A main benefit would be interlibrary constant folding. Why waste cycles
> > > on reinitializing constant, just save it to ufunc structure. Resolver 
> > > then could precompute tables to improve speed.
> > > 
> > > As interposing these you would need to interpose resolver.
> > > 
> > > An gcc support is not needed but we could get something with alternate
> > > calling convention as passing resolver struct is common and could be
> > > preserved for loops with tail calls.
> > > 
> > > A future direction could be replace plt and linker with ufunc, it would
> > > require adding function string pointer to structure and calling first
> > > generic resolver to select specific resolver.
> > > 
> > > Comments?
> > > 
> > 
> > this makes memset non-async-signal-safe. (qoi issue)
> >
> Did I explicitly say that its architecture specific optimization or did
> I forgot?
>  
> > it is not thread-safe either and would need an acquire
> > load barrier on every invocation of memset to fix that
> > or the use of thread local storage. (conformance issue)
> > 
> > (in the example only resolve->fn is modified and idempotently,
> > this would work in practice but as soon as ->data is accessed
> > too the memory ordering guarantees are required.. which can
> > be made efficient on some archs but only in asm)
> > 
> > in the example memset is called through the wrong type
> > of function pointer: the resolver and resolvee are
> > incompatible so this is invalid c, only works in asm.
> > 
> Thats why I intended it as architecture-specific. On x64 it will work
> along with memset prototype. Adding atomic/locking in resolver would be unnecessary
> overhead.
> 
> Could make this generic by defining macros that expand to atomic read on
> archs that don't act as pram.
> 
> 
> > it is not clear to me how many such ufunc structs will be
> > in a program for a specific function and how their redundant
> > initialization is avoided.
> > (one for every call site? every tu? every dso?)
> > 
> Main objective is neccessary on call-site basis. These aren't retundant
> as data will be different for different call sites.
> For example in sequence
> 
> memset (x,0,n);
> memset (y,1,n);
> 
Also this is just example so ignore low dimensionality. I realized that
going to logical conclusion would be add to glibc 4096 byte array with
precomputed values, then pass precomputed value for c as xmm argument of
memset/other string function variant.

You should ask around real problem that is profitability. Adding a
structure has a cost in cache pressure so question is when benefits
outweigth that cost.

Answer is that you need to run profiler for it. I forgotten to mention
how to do it for simplicity, I will write it in separate thread.

Obviously you would need more complicated wrapper.
A summary is that app programmer should need to first compile it and run given
application several times with say LD_PRELOAD=ufunc_profiler.so

Then it would produce a header to include and it would optimize away
ufunc where it was called less than given treshold, fill precomputed
tables so you could use simpler resolver etc.

Then there is possibility of using low-overhead profiling. A
disadvantage of this is lot of duplicated work among many users so its
better that programer would run profiling once and share results.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  6:11     ` Ondřej Bílka
@ 2015-05-25  8:24       ` Carlos O'Donell
  2015-05-25 10:38         ` [RFC] ifunc suck, use ufunc.\ Ondřej Bílka
  0 siblings, 1 reply; 22+ messages in thread
From: Carlos O'Donell @ 2015-05-25  8:24 UTC (permalink / raw)
  To: Ondřej Bílka, Rich Felker; +Cc: Szabolcs Nagy, libc-alpha

On 05/25/2015 12:21 AM, Ondřej Bílka wrote:
> So make a better proposal how to handle selection acccording to this
> triplet. If you cannot make one its evidence that my ufuncs are best
> approach so shut up.

Please refrain from ad hominem attacks.

Cheers,
Carlos.
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  7:02       ` Ondřej Bílka
@ 2015-05-25  8:58         ` Andrew Pinski
  2015-05-25 10:15           ` Ondřej Bílka
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Pinski @ 2015-05-25  8:58 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: Rich Felker, Szabolcs Nagy, GNU C Library

On Mon, May 25, 2015 at 12:40 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Sun, May 24, 2015 at 11:15:10PM -0400, Rich Felker wrote:
>> On Mon, May 25, 2015 at 04:36:52AM +0200, Ondřej Bílka wrote:
>> > On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
>> > > * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
>> > > > A main benefit would be interlibrary constant folding. Why waste cycles
>> > > > on reinitializing constant, just save it to ufunc structure. Resolver
>> > > > then could precompute tables to improve speed.
>> > > >
>> > > > As interposing these you would need to interpose resolver.
>> > > >
>> > > > An gcc support is not needed but we could get something with alternate
>> > > > calling convention as passing resolver struct is common and could be
>> > > > preserved for loops with tail calls.
>> > > >
>> > > > A future direction could be replace plt and linker with ufunc, it would
>> > > > require adding function string pointer to structure and calling first
>> > > > generic resolver to select specific resolver.
>> > > >
>> > > > Comments?
>> > > >
>> > >
>> > > this makes memset non-async-signal-safe. (qoi issue)
>> > >
>> > Did I explicitly say that its architecture specific optimization or did
>> > I forgot?
>>
>> AS-safety is broken regardless of arch. Only the barrier stuff if
>> arch-specific.
>>
>> > > it is not thread-safe either and would need an acquire
>> > > load barrier on every invocation of memset to fix that
>> > > or the use of thread local storage. (conformance issue)
>> > >
>> > > (in the example only resolve->fn is modified and idempotently,
>> > > this would work in practice but as soon as ->data is accessed
>> > > too the memory ordering guarantees are required.. which can
>> > > be made efficient on some archs but only in asm)
>> > >
>> > > in the example memset is called through the wrong type
>> > > of function pointer: the resolver and resolvee are
>> > > incompatible so this is invalid c, only works in asm.
>> > >
>> > Thats why I intended it as architecture-specific. On x64 it will work
>> > along with memset prototype. Adding atomic/locking in resolver would be unnecessary
>> > overhead.
>> >
>> > Could make this generic by defining macros that expand to atomic read on
>> > archs that don't act as pram.
>>
>> Do you realize the relative cost of an atomic read (barrier) versus a
>> small memset? This is like driving an extra mile to a cheaper gas
>> station to save $0.01 per gallon...
>>
> Since when does atomic read for x64 generates barrier?
> And you don't need to establish any happens-before relashionship here,
> so relaxed atomic read suffices. With atomic write you could only call a
> function and you are ok or resolver multiple times. A resolver would
> already take lock so its handled.

Since when is glibc x64 specific :).  There are other targets which
use ifuncs and that is what he is talking about.

Thanks,
Andrew

>
> A problem is only for arch that don't do atomic write of pointers. Chips
> that do this are typically so slow that if you care about performance
> best you could do is sell it and buy something decent. Nobody would try
> to improve performance by making several variants.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  8:58         ` Andrew Pinski
@ 2015-05-25 10:15           ` Ondřej Bílka
  2015-05-25 18:17             ` Adhemerval Zanella
  0 siblings, 1 reply; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25 10:15 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Rich Felker, Szabolcs Nagy, GNU C Library

On Mon, May 25, 2015 at 01:17:47PM +0800, Andrew Pinski wrote:
> On Mon, May 25, 2015 at 12:40 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Sun, May 24, 2015 at 11:15:10PM -0400, Rich Felker wrote:
> >> On Mon, May 25, 2015 at 04:36:52AM +0200, Ondřej Bílka wrote:
> >> > On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> >> > > * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> >> > > > A main benefit would be interlibrary constant folding. Why waste cycles
> >> > > > on reinitializing constant, just save it to ufunc structure. Resolver
> >> > > > then could precompute tables to improve speed.
> >> > > >
> >> > > > As interposing these you would need to interpose resolver.
> >> > > >
> >> > > > An gcc support is not needed but we could get something with alternate
> >> > > > calling convention as passing resolver struct is common and could be
> >> > > > preserved for loops with tail calls.
> >> > > >
> >> > > > A future direction could be replace plt and linker with ufunc, it would
> >> > > > require adding function string pointer to structure and calling first
> >> > > > generic resolver to select specific resolver.
> >> > > >
> >> > > > Comments?
> >> > > >
> >> > >
> >> > > this makes memset non-async-signal-safe. (qoi issue)
> >> > >
> >> > Did I explicitly say that its architecture specific optimization or did
> >> > I forgot?
> >>
> >> AS-safety is broken regardless of arch. Only the barrier stuff if
> >> arch-specific.
> >>
> >> > > it is not thread-safe either and would need an acquire
> >> > > load barrier on every invocation of memset to fix that
> >> > > or the use of thread local storage. (conformance issue)
> >> > >
> >> > > (in the example only resolve->fn is modified and idempotently,
> >> > > this would work in practice but as soon as ->data is accessed
> >> > > too the memory ordering guarantees are required.. which can
> >> > > be made efficient on some archs but only in asm)
> >> > >
> >> > > in the example memset is called through the wrong type
> >> > > of function pointer: the resolver and resolvee are
> >> > > incompatible so this is invalid c, only works in asm.
> >> > >
> >> > Thats why I intended it as architecture-specific. On x64 it will work
> >> > along with memset prototype. Adding atomic/locking in resolver would be unnecessary
> >> > overhead.
> >> >
> >> > Could make this generic by defining macros that expand to atomic read on
> >> > archs that don't act as pram.
> >>
> >> Do you realize the relative cost of an atomic read (barrier) versus a
> >> small memset? This is like driving an extra mile to a cheaper gas
> >> station to save $0.01 per gallon...
> >>
> > Since when does atomic read for x64 generates barrier?
> > And you don't need to establish any happens-before relashionship here,
> > so relaxed atomic read suffices. With atomic write you could only call a
> > function and you are ok or resolver multiple times. A resolver would
> > already take lock so its handled.
> 
> Since when is glibc x64 specific :).  There are other targets which
> use ifuncs and that is what he is talking about.
>

Its example where ufunc would be useful. We also newer used ifunc for
cross-platform selection its just always used for some specific
architecture.

So now we should focus on making them usable for x64. After that we
could port these to other architectures. As I said before its too early for
implementation details. Adding these now would result on rewriting them
five times until interface stabilizes and abi could change.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.\
  2015-05-25  8:24       ` Carlos O'Donell
@ 2015-05-25 10:38         ` Ondřej Bílka
  0 siblings, 0 replies; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25 10:38 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Rich Felker, Szabolcs Nagy, libc-alpha

On Mon, May 25, 2015 at 01:17:00AM -0400, Carlos O'Donell wrote:
> On 05/25/2015 12:21 AM, Ondřej Bílka wrote:
> > So make a better proposal how to handle selection acccording to this
> > triplet. If you cannot make one its evidence that my ufuncs are best
> > approach so shut up.
> 
> Please refrain from ad hominem attacks.
> 
Sorry, lost patience on Rich, as most of his arguments are too easy to
refute.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25  4:41 ` Szabolcs Nagy
  2015-05-25  5:18   ` Rich Felker
  2015-05-25  6:03   ` [RFC] ifunc suck, use ufunc Ondřej Bílka
@ 2015-05-25 10:55   ` Ondřej Bílka
  2015-05-25 18:27     ` Szabolcs Nagy
  2 siblings, 1 reply; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25 10:55 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: libc-alpha

On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > A main benefit would be interlibrary constant folding. Why waste cycles
> > on reinitializing constant, just save it to ufunc structure. Resolver 
> > then could precompute tables to improve speed.
> > 
> > As interposing these you would need to interpose resolver.
> > 
> > An gcc support is not needed but we could get something with alternate
> > calling convention as passing resolver struct is common and could be
> > preserved for loops with tail calls.
> > 
> > A future direction could be replace plt and linker with ufunc, it would
> > require adding function string pointer to structure and calling first
> > generic resolver to select specific resolver.
> > 
> > Comments?
> > 
> 
> this makes memset non-async-signal-safe. (qoi issue)
> 
How, could you elaborate where exactly would it cause problem with
signals? And dlsym don't count, its just implementation detail. 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25 10:15           ` Ondřej Bílka
@ 2015-05-25 18:17             ` Adhemerval Zanella
  2015-05-25 18:54               ` Ondřej Bílka
  0 siblings, 1 reply; 22+ messages in thread
From: Adhemerval Zanella @ 2015-05-25 18:17 UTC (permalink / raw)
  To: libc-alpha



On 25-05-2015 03:03, Ondřej Bílka wrote:
> On Mon, May 25, 2015 at 01:17:47PM +0800, Andrew Pinski wrote:
>> On Mon, May 25, 2015 at 12:40 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>> On Sun, May 24, 2015 at 11:15:10PM -0400, Rich Felker wrote:
>>>> On Mon, May 25, 2015 at 04:36:52AM +0200, Ondřej Bílka wrote:
>>>>> On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
>>>>>> * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
>>>>>>> A main benefit would be interlibrary constant folding. Why waste cycles
>>>>>>> on reinitializing constant, just save it to ufunc structure. Resolver
>>>>>>> then could precompute tables to improve speed.
>>>>>>>
>>>>>>> As interposing these you would need to interpose resolver.
>>>>>>>
>>>>>>> An gcc support is not needed but we could get something with alternate
>>>>>>> calling convention as passing resolver struct is common and could be
>>>>>>> preserved for loops with tail calls.
>>>>>>>
>>>>>>> A future direction could be replace plt and linker with ufunc, it would
>>>>>>> require adding function string pointer to structure and calling first
>>>>>>> generic resolver to select specific resolver.
>>>>>>>
>>>>>>> Comments?
>>>>>>>
>>>>>>
>>>>>> this makes memset non-async-signal-safe. (qoi issue)
>>>>>>
>>>>> Did I explicitly say that its architecture specific optimization or did
>>>>> I forgot?
>>>>
>>>> AS-safety is broken regardless of arch. Only the barrier stuff if
>>>> arch-specific.
>>>>
>>>>>> it is not thread-safe either and would need an acquire
>>>>>> load barrier on every invocation of memset to fix that
>>>>>> or the use of thread local storage. (conformance issue)
>>>>>>
>>>>>> (in the example only resolve->fn is modified and idempotently,
>>>>>> this would work in practice but as soon as ->data is accessed
>>>>>> too the memory ordering guarantees are required.. which can
>>>>>> be made efficient on some archs but only in asm)
>>>>>>
>>>>>> in the example memset is called through the wrong type
>>>>>> of function pointer: the resolver and resolvee are
>>>>>> incompatible so this is invalid c, only works in asm.
>>>>>>
>>>>> Thats why I intended it as architecture-specific. On x64 it will work
>>>>> along with memset prototype. Adding atomic/locking in resolver would be unnecessary
>>>>> overhead.
>>>>>
>>>>> Could make this generic by defining macros that expand to atomic read on
>>>>> archs that don't act as pram.
>>>>
>>>> Do you realize the relative cost of an atomic read (barrier) versus a
>>>> small memset? This is like driving an extra mile to a cheaper gas
>>>> station to save $0.01 per gallon...
>>>>
>>> Since when does atomic read for x64 generates barrier?
>>> And you don't need to establish any happens-before relashionship here,
>>> so relaxed atomic read suffices. With atomic write you could only call a
>>> function and you are ok or resolver multiple times. A resolver would
>>> already take lock so its handled.
>>
>> Since when is glibc x64 specific :).  There are other targets which
>> use ifuncs and that is what he is talking about.
>>
> 
> Its example where ufunc would be useful. We also newer used ifunc for
> cross-platform selection its just always used for some specific
> architecture.
> 
> So now we should focus on making them usable for x64. After that we
> could port these to other architectures. As I said before its too early for
> implementation details. Adding these now would result on rewriting them
> five times until interface stabilizes and abi could change.
> 

Although the reason for such mechanism seems reasonable, I do not see a
good approach to add more architecture specific behaviour on GLIBC. 
IFUNC is already not really on all platforms and it has its own 
idiosyncrasies for some ports (like hwcap init order).

Also, I think Rich and Szabolcs brought valid points: the devils in the
details.  For such proposal to have some traction the ideal implementation
proposed (either in pseudo-code or C implementation) should also be 
reasonable (and bashing criticise arguing that it is a easily fixable 
technical details is not the way to go).

Do you have some background on how other languages or runtime accomplish
such think? I know java hotspot do have some internal profiler and
dynamic recompilation based on profile feedback.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25 10:55   ` Ondřej Bílka
@ 2015-05-25 18:27     ` Szabolcs Nagy
  2015-05-25 19:09       ` Ondřej Bílka
  0 siblings, 1 reply; 22+ messages in thread
From: Szabolcs Nagy @ 2015-05-25 18:27 UTC (permalink / raw)
  To: Ond??ej B?lka; +Cc: libc-alpha

* Ond??ej B?lka <neleai@seznam.cz> [2015-05-25 08:11:31 +0200]:
> On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> > * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > > A main benefit would be interlibrary constant folding. Why waste cycles
> > > on reinitializing constant, just save it to ufunc structure. Resolver 
> > > then could precompute tables to improve speed.
> > > 
> > > As interposing these you would need to interpose resolver.
> > > 
> > > An gcc support is not needed but we could get something with alternate
> > > calling convention as passing resolver struct is common and could be
> > > preserved for loops with tail calls.
> > > 
> > > A future direction could be replace plt and linker with ufunc, it would
> > > require adding function string pointer to structure and calling first
> > > generic resolver to select specific resolver.
> > > 
> > > Comments?
> > > 
> > 
> > this makes memset non-async-signal-safe. (qoi issue)
> > 
> How, could you elaborate where exactly would it cause problem with
> signals? And dlsym don't count, its just implementation detail. 

with a strict reading of the standard, any access to objects
with static storge duration is invalid from a signal handler
unless their type is volatile sig_atomic_t or lock-free atomic.
(c11/5.1.2.3p5)

in practice it probably works if you only read/write one pointer object,
but if you start using the data member too then the data access can be
interrupted at the wrong time and then calling into the same code path
from the signal handler will cause problems. (so it's not memset that
becomes non-as-safe, but whatever function that calls memset..
assuming each memset call site has its own ufunc struct).


there are other problems: same security issues as with lazy binding
(modifiable function pointers for control flow hijack), interoperability
problems with other hacks (fortify source, interposition), significant
amount of non-sharable data, startup overhead, prevents compiler
optimizations (assuming gcc only sees a call through an ufunc pointer),
debugging issues.

if it can be done as a pure cpp hack (alternate string.h) then you
can experiment with that outside of libc and only if benchmarks show
consistent performance benefits it is worth thinking about.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25 18:17             ` Adhemerval Zanella
@ 2015-05-25 18:54               ` Ondřej Bílka
  2015-05-25 19:22                 ` Adhemerval Zanella
  0 siblings, 1 reply; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25 18:54 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, May 25, 2015 at 09:17:48AM -0300, Adhemerval Zanella wrote:
> 
> Although the reason for such mechanism seems reasonable, I do not see a
> good approach to add more architecture specific behaviour on GLIBC. 
> IFUNC is already not really on all platforms and it has its own 
> idiosyncrasies for some ports (like hwcap init order).
>
While it isn't wise to add another mechanism it is necessary. If you
have program where in 50% of call sites implementation A is better while
in other half B is better how would you resolve that? No matter how you
select you would get bad performance (and resolving selection on
backtrace would likely cost more than savings.)

Also you would lose performance from not being able to use precomputed
tables.

You could use ifunc with losing per-call site information. However
ifuncs would get ugly not simple if you have feature X select
implementation using X in order from newest to oldest.

You would need to at start of program read file and you would must
decide to select function according to profile from file. You should
probably do same with ufuncs but these provide some stack.

An aim would be replace ifuncs with ufunc or some variant. For that they
would need to coexist for some time until they are converted.

One side benefit is that you wont need add ifunc for new architectures,
just use ufuncs. One problem would be cost, as mentioned before without
harware atomic write it could be too costy. One possibility that I
mentioned was force double compilation on these arch so you could do
resolving for given cpu and then distribute separate binaries with each
cpu data. 
 
> Also, I think Rich and Szabolcs brought valid points: the devils in the
> details.  For such proposal to have some traction the ideal implementation
> proposed (either in pseudo-code or C implementation) should also be 
> reasonable (and bashing criticise arguing that it is a easily fixable 
> technical details is not the way to go).
> 
Sorry, its when I seen complaint that looked like that i should use tab
followed by space followed by tab on line 1 i tried to cut irrelevant
parts and move to real problems.

> Do you have some background on how other languages or runtime accomplish
> such think? I know java hotspot do have some internal profiler and
> dynamic recompilation based on profile feedback.

Yes, main problem why java-like approach is unsuitable is that in linux
lot of programs have very short lifetime so you need persistent storage
to collect enough data to select implementation with some reliability.
This is compounded that some functions are called rarely.

As i said before one benefit could be make resolver select
implementation that minimizes size for first twenty calls and then do
potentialy expensive profiling/reading profile to determine optimum.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25 18:27     ` Szabolcs Nagy
@ 2015-05-25 19:09       ` Ondřej Bílka
  0 siblings, 0 replies; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-25 19:09 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: libc-alpha

On Mon, May 25, 2015 at 02:49:19PM +0200, Szabolcs Nagy wrote:
> * Ond??ej B?lka <neleai@seznam.cz> [2015-05-25 08:11:31 +0200]:
> > On Mon, May 25, 2015 at 03:43:23AM +0200, Szabolcs Nagy wrote:
> > > * Ond??ej B?lka <neleai@seznam.cz> [2015-05-24 23:38:58 +0200]:
> > > > A main benefit would be interlibrary constant folding. Why waste cycles
> > > > on reinitializing constant, just save it to ufunc structure. Resolver 
> > > > then could precompute tables to improve speed.
> > > > 
> > > > As interposing these you would need to interpose resolver.
> > > > 
> > > > An gcc support is not needed but we could get something with alternate
> > > > calling convention as passing resolver struct is common and could be
> > > > preserved for loops with tail calls.
> > > > 
> > > > A future direction could be replace plt and linker with ufunc, it would
> > > > require adding function string pointer to structure and calling first
> > > > generic resolver to select specific resolver.
> > > > 
> > > > Comments?
> > > > 
> > > 
> > > this makes memset non-async-signal-safe. (qoi issue)
> > > 
> > How, could you elaborate where exactly would it cause problem with
> > signals? And dlsym don't count, its just implementation detail. 
>
Thanks, you first comment just looked that you glanced over
implementation and as dlsym isn't async-signal safe you claimed that
issue.
 
> with a strict reading of the standard, any access to objects
> with static storge duration is invalid from a signal handler
> unless their type is volatile sig_atomic_t or lock-free atomic.
> (c11/5.1.2.3p5)
> 
While correct our manual has following comment:

http://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html

In practice, you can assume that int is atomic. You can also assume that
pointer types are atomic; that is very convenient. Both of these
assumptions are true on all of the machines that the GNU C Library
supports and on all POSIX systems we know of.

> in practice it probably works if you only read/write one pointer object,
> but if you start using the data member too then the data access can be
> interrupted at the wrong time and then calling into the same code path
> from the signal handler will cause problems. (so it's not memset that
> becomes non-as-safe, but whatever function that calls memset..
> assuming each memset call site has its own ufunc struct).
>
That wouldn't be problem as standard gives you bare minimum and you
could exploit architecture knowledge as this is architecture specific
optimization.
if you are careful and don't read data then both writes would be same.

Anyway there is real problem that is bit related. One could create
recursion with resolvers. There is simple solution that solves both
problems. Add fallback implemenation f that dont read data. Then CAS
resolver fn for fallback. Only one resolver would succeed while other
will simply call fallback. Then after succeeding resolver finished he
will write real implemention.

Again this adds cost and question is if its worth benefit.

>
> there are other problems: same security issues as with lazy binding
> (modifiable function pointers for control flow hijack),

Thats already solved techinal detail. In libc you xor function pointer
with random value for randomization.

 interoperability
> problems with other hacks (fortify source, interposition), significant
these already don't work reliably as gcc could optimizes these away for
same reason and why shouldnt it work with fortify source.

> amount of non-sharable data, startup overhead,
Could be solved with profiling. Read previous mail. 

> prevents compiler
> optimizations (assuming gcc only sees a call through an ufunc pointer),
Which is bonus as compiler builtin cause more regressions than their
benefit which is that gcc could sometimes value of string function with
constant arguments.

> debugging issues.
> 
How its different from preexisting situation. With ifunc you would also
be redirected to random assembly according how gcc expands it.

These were just examples that if you add random issue without any
justification I could reject it with same ease. You need to write more
and you could find that isn't problem or that I should do X to avoid it.

> if it can be done as a pure cpp hack (alternate string.h) then you
> can experiment with that outside of libc and only if benchmarks show
> consistent performance benefits it is worth thinking about.

You will see, I don't see why do you use pure cpp limit as ufuncs are
for selecting optimized assembly. I will just use standard LD_PRELOAD
trick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25 18:54               ` Ondřej Bílka
@ 2015-05-25 19:22                 ` Adhemerval Zanella
  2015-05-26  6:49                   ` Ondřej Bílka
  0 siblings, 1 reply; 22+ messages in thread
From: Adhemerval Zanella @ 2015-05-25 19:22 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha



On 25-05-2015 10:49, Ondřej Bílka wrote:
> On Mon, May 25, 2015 at 09:17:48AM -0300, Adhemerval Zanella wrote:
>>
>> Although the reason for such mechanism seems reasonable, I do not see a
>> good approach to add more architecture specific behaviour on GLIBC. 
>> IFUNC is already not really on all platforms and it has its own 
>> idiosyncrasies for some ports (like hwcap init order).
>>
> While it isn't wise to add another mechanism it is necessary. If you
> have program where in 50% of call sites implementation A is better while
> in other half B is better how would you resolve that? No matter how you
> select you would get bad performance (and resolving selection on
> backtrace would likely cost more than savings.)
> 
> Also you would lose performance from not being able to use precomputed
> tables.
> 
> You could use ifunc with losing per-call site information. However
> ifuncs would get ugly not simple if you have feature X select
> implementation using X in order from newest to oldest.
> 
> You would need to at start of program read file and you would must
> decide to select function according to profile from file. You should
> probably do same with ufuncs but these provide some stack.
> 
> An aim would be replace ifuncs with ufunc or some variant. For that they
> would need to coexist for some time until they are converted.
> 
> One side benefit is that you wont need add ifunc for new architectures,
> just use ufuncs. One problem would be cost, as mentioned before without
> harware atomic write it could be too costy. One possibility that I
> mentioned was force double compilation on these arch so you could do
> resolving for given cpu and then distribute separate binaries with each
> cpu data. 

I understood the problem, but I see your initial idea troublesome. As pointed
out by Szabolcs, there is security issues about adding the function pointer
information in read-write segments (which recent changes to binutils seems
to avoid with copy segments and relro), it has potentially issues on debugging
and profiling tools, it relies on memory semantics that is too x86_64 centric, 
among others.

Also, I see IFUNC as *not* to fix this issue your are bringing, but rather to
provide chip specific general optimization for average case.  What you are
suggesting is a polymorphic way to change the function call based on profiling
feedback.  How are you going to the granularity? Per program base, per thread
base, per call?.  Also, as pointed out, what happen if you start to use more
profiling feedback information than call site and start to require more atomic
access to select/update the function call?

> 
> Yes, main problem why java-like approach is unsuitable is that in linux
> lot of programs have very short lifetime so you need persistent storage
> to collect enough data to select implementation with some reliability.
> This is compounded that some functions are called rarely.
> 
> As i said before one benefit could be make resolver select
> implementation that minimizes size for first twenty calls and then do
> potentialy expensive profiling/reading profile to determine optimum.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-25 19:22                 ` Adhemerval Zanella
@ 2015-05-26  6:49                   ` Ondřej Bílka
  2015-05-26  9:13                     ` Adhemerval Zanella
  0 siblings, 1 reply; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-26  6:49 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, May 25, 2015 at 03:11:35PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 25-05-2015 10:49, Ondřej Bílka wrote:
> > On Mon, May 25, 2015 at 09:17:48AM -0300, Adhemerval Zanella wrote:
> >>
> >> Although the reason for such mechanism seems reasonable, I do not see a
> >> good approach to add more architecture specific behaviour on GLIBC. 
> >> IFUNC is already not really on all platforms and it has its own 
> >> idiosyncrasies for some ports (like hwcap init order).
> >>
> > While it isn't wise to add another mechanism it is necessary. If you
> > have program where in 50% of call sites implementation A is better while
> > in other half B is better how would you resolve that? No matter how you
> > select you would get bad performance (and resolving selection on
> > backtrace would likely cost more than savings.)
> > 
> > Also you would lose performance from not being able to use precomputed
> > tables.
> > 
> > You could use ifunc with losing per-call site information. However
> > ifuncs would get ugly not simple if you have feature X select
> > implementation using X in order from newest to oldest.
> > 
> > You would need to at start of program read file and you would must
> > decide to select function according to profile from file. You should
> > probably do same with ufuncs but these provide some stack.
> > 
> > An aim would be replace ifuncs with ufunc or some variant. For that they
> > would need to coexist for some time until they are converted.
> > 
> > One side benefit is that you wont need add ifunc for new architectures,
> > just use ufuncs. One problem would be cost, as mentioned before without
> > harware atomic write it could be too costy. One possibility that I
> > mentioned was force double compilation on these arch so you could do
> > resolving for given cpu and then distribute separate binaries with each
> > cpu data. 
> 
> I understood the problem, but I see your initial idea troublesome. As pointed
> out by Szabolcs, there is security issues about adding the function pointer
> information in read-write segments (which recent changes to binutils seems
> to avoid with copy segments and relro), it has potentially issues on debugging
> and profiling tools, it relies on memory semantics that is too x86_64 centric, 
> among others.
>
As i wrote before thats solved problem. Use PTR_MANGLE/DEMANGLE equivalent which 
harms performance a bit. Also what is problem with debugging? Now
steping to string function would send you to random assembly so how
would adding ufunc different?
 
> Also, I see IFUNC as *not* to fix this issue your are bringing, but rather to
> provide chip specific general optimization for average case.  What you are
> suggesting is a polymorphic way to change the function call based on profiling
> feedback.  How are you going to the granularity? Per program base, per thread
> base, per call?.  Also, as pointed out, what happen if you start to use more
> profiling feedback information than call site and start to require more atomic
> access to select/update the function call?
>
Not quite. with profiling you should use ufunc to replace ifunc to stub status.
You said that you want chip specific general optimization. A natural
question is which one is best. For that you need to do profiling and
select one that looks best. Otherwise it would be suboptimal.

So ifunc selection would be granularity where you aggregate profile over
all programs.

As I mentioned before that you need two phases, first where you enable
profiling and compile with per-call granularity everywhere second where
you strip rarely used ufunc, precompute arrays and select implementation
for given cpu.
 
> > 
> > Yes, main problem why java-like approach is unsuitable is that in linux
> > lot of programs have very short lifetime so you need persistent storage
> > to collect enough data to select implementation with some reliability.
> > This is compounded that some functions are called rarely.
> > 
> > As i said before one benefit could be make resolver select
> > implementation that minimizes size for first twenty calls and then do
> > potentialy expensive profiling/reading profile to determine optimum.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-26  6:49                   ` Ondřej Bílka
@ 2015-05-26  9:13                     ` Adhemerval Zanella
  2015-05-29 10:56                       ` Andrew Pinski
  2015-05-31 21:37                       ` Ondřej Bílka
  0 siblings, 2 replies; 22+ messages in thread
From: Adhemerval Zanella @ 2015-05-26  9:13 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha



On 25-05-2015 16:01, Ondřej Bílka wrote:
> On Mon, May 25, 2015 at 03:11:35PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 25-05-2015 10:49, Ondřej Bílka wrote:
>>> On Mon, May 25, 2015 at 09:17:48AM -0300, Adhemerval Zanella wrote:
>>>>
>>>> Although the reason for such mechanism seems reasonable, I do not see a
>>>> good approach to add more architecture specific behaviour on GLIBC. 
>>>> IFUNC is already not really on all platforms and it has its own 
>>>> idiosyncrasies for some ports (like hwcap init order).
>>>>
>>> While it isn't wise to add another mechanism it is necessary. If you
>>> have program where in 50% of call sites implementation A is better while
>>> in other half B is better how would you resolve that? No matter how you
>>> select you would get bad performance (and resolving selection on
>>> backtrace would likely cost more than savings.)
>>>
>>> Also you would lose performance from not being able to use precomputed
>>> tables.
>>>
>>> You could use ifunc with losing per-call site information. However
>>> ifuncs would get ugly not simple if you have feature X select
>>> implementation using X in order from newest to oldest.
>>>
>>> You would need to at start of program read file and you would must
>>> decide to select function according to profile from file. You should
>>> probably do same with ufuncs but these provide some stack.
>>>
>>> An aim would be replace ifuncs with ufunc or some variant. For that they
>>> would need to coexist for some time until they are converted.
>>>
>>> One side benefit is that you wont need add ifunc for new architectures,
>>> just use ufuncs. One problem would be cost, as mentioned before without
>>> harware atomic write it could be too costy. One possibility that I
>>> mentioned was force double compilation on these arch so you could do
>>> resolving for given cpu and then distribute separate binaries with each
>>> cpu data. 
>>
>> I understood the problem, but I see your initial idea troublesome. As pointed
>> out by Szabolcs, there is security issues about adding the function pointer
>> information in read-write segments (which recent changes to binutils seems
>> to avoid with copy segments and relro), it has potentially issues on debugging
>> and profiling tools, it relies on memory semantics that is too x86_64 centric, 
>> among others.
>>
> As i wrote before thats solved problem. Use PTR_MANGLE/DEMANGLE equivalent which 
> harms performance a bit. Also what is problem with debugging? Now
> steping to string function would send you to random assembly so how
> would adding ufunc different?

My understanding is PTR_MANGLE/DEMANGLE is somewhat a hack for the problem
where you must have a RW segment where function points reside.  Recent binutils
options are aiming to exactly *avoid* this, since if you know the memory
where the seed reside you can read and defeat the mangle/demangle (although
it will require a more cleaver thread).  It also defeat another optimization with
relro which is mark the page as RO and thus allowing more sharing.

The problem which debugging imho is now you will have the teach debuggers and
profilers that some strings functions for some specific arch do no follow any
ELF defined way.

>  
>> Also, I see IFUNC as *not* to fix this issue your are bringing, but rather to
>> provide chip specific general optimization for average case.  What you are
>> suggesting is a polymorphic way to change the function call based on profiling
>> feedback.  How are you going to the granularity? Per program base, per thread
>> base, per call?.  Also, as pointed out, what happen if you start to use more
>> profiling feedback information than call site and start to require more atomic
>> access to select/update the function call?
>>
> Not quite. with profiling you should use ufunc to replace ifunc to stub status.
> You said that you want chip specific general optimization. A natural
> question is which one is best. For that you need to do profiling and
> select one that looks best. Otherwise it would be suboptimal.
> 
> So ifunc selection would be granularity where you aggregate profile over
> all programs.
> 
> As I mentioned before that you need two phases, first where you enable
> profiling and compile with per-call granularity everywhere second where
> you strip rarely used ufunc, precompute arrays and select implementation
> for given cpu.
>  
>>>
>>> Yes, main problem why java-like approach is unsuitable is that in linux
>>> lot of programs have very short lifetime so you need persistent storage
>>> to collect enough data to select implementation with some reliability.
>>> This is compounded that some functions are called rarely.
>>>
>>> As i said before one benefit could be make resolver select
>>> implementation that minimizes size for first twenty calls and then do
>>> potentialy expensive profiling/reading profile to determine optimum.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-26  9:13                     ` Adhemerval Zanella
@ 2015-05-29 10:56                       ` Andrew Pinski
  2015-05-31 21:37                       ` Ondřej Bílka
  1 sibling, 0 replies; 22+ messages in thread
From: Andrew Pinski @ 2015-05-29 10:56 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Ondřej Bílka, GNU C Library

On Tue, May 26, 2015 at 3:21 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 25-05-2015 16:01, Ondřej Bílka wrote:
>> On Mon, May 25, 2015 at 03:11:35PM -0300, Adhemerval Zanella wrote:
>>>
>>>
>>> On 25-05-2015 10:49, Ondřej Bílka wrote:
>>>> On Mon, May 25, 2015 at 09:17:48AM -0300, Adhemerval Zanella wrote:
>>>>>
>>>>> Although the reason for such mechanism seems reasonable, I do not see a
>>>>> good approach to add more architecture specific behaviour on GLIBC.
>>>>> IFUNC is already not really on all platforms and it has its own
>>>>> idiosyncrasies for some ports (like hwcap init order).
>>>>>
>>>> While it isn't wise to add another mechanism it is necessary. If you
>>>> have program where in 50% of call sites implementation A is better while
>>>> in other half B is better how would you resolve that? No matter how you
>>>> select you would get bad performance (and resolving selection on
>>>> backtrace would likely cost more than savings.)
>>>>
>>>> Also you would lose performance from not being able to use precomputed
>>>> tables.
>>>>
>>>> You could use ifunc with losing per-call site information. However
>>>> ifuncs would get ugly not simple if you have feature X select
>>>> implementation using X in order from newest to oldest.
>>>>
>>>> You would need to at start of program read file and you would must
>>>> decide to select function according to profile from file. You should
>>>> probably do same with ufuncs but these provide some stack.
>>>>
>>>> An aim would be replace ifuncs with ufunc or some variant. For that they
>>>> would need to coexist for some time until they are converted.
>>>>
>>>> One side benefit is that you wont need add ifunc for new architectures,
>>>> just use ufuncs. One problem would be cost, as mentioned before without
>>>> harware atomic write it could be too costy. One possibility that I
>>>> mentioned was force double compilation on these arch so you could do
>>>> resolving for given cpu and then distribute separate binaries with each
>>>> cpu data.
>>>
>>> I understood the problem, but I see your initial idea troublesome. As pointed
>>> out by Szabolcs, there is security issues about adding the function pointer
>>> information in read-write segments (which recent changes to binutils seems
>>> to avoid with copy segments and relro), it has potentially issues on debugging
>>> and profiling tools, it relies on memory semantics that is too x86_64 centric,
>>> among others.
>>>
>> As i wrote before thats solved problem. Use PTR_MANGLE/DEMANGLE equivalent which
>> harms performance a bit. Also what is problem with debugging? Now
>> steping to string function would send you to random assembly so how
>> would adding ufunc different?
>
> My understanding is PTR_MANGLE/DEMANGLE is somewhat a hack for the problem
> where you must have a RW segment where function points reside.  Recent binutils
> options are aiming to exactly *avoid* this, since if you know the memory
> where the seed reside you can read and defeat the mangle/demangle (although
> it will require a more cleaver thread).  It also defeat another optimization with
> relro which is mark the page as RO and thus allowing more sharing.

There are other places where pc addresses will be stored besides in
statically allocated memory.  Like setjmp buffer.

Thanks,
Andrew

>
> The problem which debugging imho is now you will have the teach debuggers and
> profilers that some strings functions for some specific arch do no follow any
> ELF defined way.
>
>>
>>> Also, I see IFUNC as *not* to fix this issue your are bringing, but rather to
>>> provide chip specific general optimization for average case.  What you are
>>> suggesting is a polymorphic way to change the function call based on profiling
>>> feedback.  How are you going to the granularity? Per program base, per thread
>>> base, per call?.  Also, as pointed out, what happen if you start to use more
>>> profiling feedback information than call site and start to require more atomic
>>> access to select/update the function call?
>>>
>> Not quite. with profiling you should use ufunc to replace ifunc to stub status.
>> You said that you want chip specific general optimization. A natural
>> question is which one is best. For that you need to do profiling and
>> select one that looks best. Otherwise it would be suboptimal.
>>
>> So ifunc selection would be granularity where you aggregate profile over
>> all programs.
>>
>> As I mentioned before that you need two phases, first where you enable
>> profiling and compile with per-call granularity everywhere second where
>> you strip rarely used ufunc, precompute arrays and select implementation
>> for given cpu.
>>
>>>>
>>>> Yes, main problem why java-like approach is unsuitable is that in linux
>>>> lot of programs have very short lifetime so you need persistent storage
>>>> to collect enough data to select implementation with some reliability.
>>>> This is compounded that some functions are called rarely.
>>>>
>>>> As i said before one benefit could be make resolver select
>>>> implementation that minimizes size for first twenty calls and then do
>>>> potentialy expensive profiling/reading profile to determine optimum.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] ifunc suck, use ufunc.
  2015-05-26  9:13                     ` Adhemerval Zanella
  2015-05-29 10:56                       ` Andrew Pinski
@ 2015-05-31 21:37                       ` Ondřej Bílka
  1 sibling, 0 replies; 22+ messages in thread
From: Ondřej Bílka @ 2015-05-31 21:37 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Mon, May 25, 2015 at 04:21:57PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 25-05-2015 16:01, Ondřej Bílka wrote:
> > On Mon, May 25, 2015 at 03:11:35PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 25-05-2015 10:49, Ondřej Bílka wrote:
> >>> On Mon, May 25, 2015 at 09:17:48AM -0300, Adhemerval Zanella wrote:
> >>>>
> >>>> Although the reason for such mechanism seems reasonable, I do not see a
> >>>> good approach to add more architecture specific behaviour on GLIBC. 
> >>>> IFUNC is already not really on all platforms and it has its own 
> >>>> idiosyncrasies for some ports (like hwcap init order).
> >>>>
> >>> While it isn't wise to add another mechanism it is necessary. If you
> >>> have program where in 50% of call sites implementation A is better while
> >>> in other half B is better how would you resolve that? No matter how you
> >>> select you would get bad performance (and resolving selection on
> >>> backtrace would likely cost more than savings.)
> >>>
> >>> Also you would lose performance from not being able to use precomputed
> >>> tables.
> >>>
> >>> You could use ifunc with losing per-call site information. However
> >>> ifuncs would get ugly not simple if you have feature X select
> >>> implementation using X in order from newest to oldest.
> >>>
> >>> You would need to at start of program read file and you would must
> >>> decide to select function according to profile from file. You should
> >>> probably do same with ufuncs but these provide some stack.
> >>>
> >>> An aim would be replace ifuncs with ufunc or some variant. For that they
> >>> would need to coexist for some time until they are converted.
> >>>
> >>> One side benefit is that you wont need add ifunc for new architectures,
> >>> just use ufuncs. One problem would be cost, as mentioned before without
> >>> harware atomic write it could be too costy. One possibility that I
> >>> mentioned was force double compilation on these arch so you could do
> >>> resolving for given cpu and then distribute separate binaries with each
> >>> cpu data. 
> >>
> >> I understood the problem, but I see your initial idea troublesome. As pointed
> >> out by Szabolcs, there is security issues about adding the function pointer
> >> information in read-write segments (which recent changes to binutils seems
> >> to avoid with copy segments and relro), it has potentially issues on debugging
> >> and profiling tools, it relies on memory semantics that is too x86_64 centric, 
> >> among others.
> >>
> > As i wrote before thats solved problem. Use PTR_MANGLE/DEMANGLE equivalent which 
> > harms performance a bit. Also what is problem with debugging? Now
> > steping to string function would send you to random assembly so how
> > would adding ufunc different?
> 
> My understanding is PTR_MANGLE/DEMANGLE is somewhat a hack for the problem
> where you must have a RW segment where function points reside.  Recent binutils
> options are aiming to exactly *avoid* this, since if you know the memory
> where the seed reside you can read and defeat the mangle/demangle (although
> it will require a more cleaver thread).  It also defeat another optimization with
> relro which is mark the page as RO and thus allowing more sharing.
>
First question how do you make page with ifunc result sharable, its same
situation?

My main problem is how to make these easy to use. If we mandate that
user needs instead writting make do

UFUNC_PROFILE=1 make; run program; make 

Then I can offer better solution. Main problem is cpp limitation that I
cant create counter in macro. Second limitation is collecting profiler
data.

With these two conditions I don't need lazy resolution. Just put all
these in one table aligned to page boundary, write a constructor that
resolves all ufuncs in dso and when finishes it mprotects it to be
read-only.

It would be bit ugly as I need to use macro like
 (__LINE__ == 324) ? 0 :
 (__LINE__ == 333) ? 1 : 
to find index and verify that table is correct.
  

 
> The problem which debugging imho is now you will have the teach debuggers and
> profilers that some strings functions for some specific arch do no follow any
> ELF defined way.
>
Still don't see it. Unless you start using assembly tricks gdb handles
function pointers just fine. With assembly you need to stepi to get into
these. And how its different than debuging with FORCE_INLINES where
these are replaced by ineffective inline assembly?

Profiling is relatively easy but technical problem. You only need to
check something like

  if (function != dlsym_real("function"))
    f->fn = function_wrapper;

where function_wrapper will properly call function. Only dlsym wouldn't
be enough due plt indirection. We would need to provide one that
provides actual pointer. This could be done.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-05-31  9:47 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-25  2:23 [RFC] ifunc suck, use ufunc Ondřej Bílka
2015-05-25  4:41 ` Szabolcs Nagy
2015-05-25  5:18   ` Rich Felker
2015-05-25  6:11     ` Ondřej Bílka
2015-05-25  8:24       ` Carlos O'Donell
2015-05-25 10:38         ` [RFC] ifunc suck, use ufunc.\ Ondřej Bílka
2015-05-25  6:03   ` [RFC] ifunc suck, use ufunc Ondřej Bílka
2015-05-25  6:07     ` Rich Felker
2015-05-25  7:02       ` Ondřej Bílka
2015-05-25  8:58         ` Andrew Pinski
2015-05-25 10:15           ` Ondřej Bílka
2015-05-25 18:17             ` Adhemerval Zanella
2015-05-25 18:54               ` Ondřej Bílka
2015-05-25 19:22                 ` Adhemerval Zanella
2015-05-26  6:49                   ` Ondřej Bílka
2015-05-26  9:13                     ` Adhemerval Zanella
2015-05-29 10:56                       ` Andrew Pinski
2015-05-31 21:37                       ` Ondřej Bílka
2015-05-25  8:12     ` Ondřej Bílka
2015-05-25 10:55   ` Ondřej Bílka
2015-05-25 18:27     ` Szabolcs Nagy
2015-05-25 19:09       ` Ondřej Bílka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).