Re: [PATCH 0/2] Multiarch hooks for memcpy variants

public inbox for libc-alpha@sourceware.org

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
@ 2017-08-11 11:11 Wilco Dijkstra
  2017-08-11 11:22 ` Siddhesh Poyarekar
  0 siblings, 1 reply; 27+ messages in thread
From: Wilco Dijkstra @ 2017-08-11 11:11 UTC (permalink / raw)
  To: libc-alpha, Siddhesh Poyarekar; +Cc: nd

Siddhesh Poyarekar wrote:
> Functions like mempcpy, __mempcpy_chk and __memcpy_chk continue to call the
> generic memcpy implementation.  These two patches fix this by adding ifunc
> entry points for these functions for generic, thunderx and falkor.

I don't understand what the goal of this is since on AArch64 we always transform
mempcpy to memcpy. Also why use ifuncs on the _chk variants? Are they ever
used in cases where the last 1% of performance matters?

Wilco

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-11 11:11 [PATCH 0/2] Multiarch hooks for memcpy variants Wilco Dijkstra
@ 2017-08-11 11:22 ` Siddhesh Poyarekar
  2017-08-11 17:58   ` Szabolcs Nagy
  0 siblings, 1 reply; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-11 11:22 UTC (permalink / raw)
  To: Wilco Dijkstra, libc-alpha; +Cc: nd

On Friday 11 August 2017 04:41 PM, Wilco Dijkstra wrote:
> Siddhesh Poyarekar wrote:
>> Functions like mempcpy, __mempcpy_chk and __memcpy_chk continue to call the
>> generic memcpy implementation.  These two patches fix this by adding ifunc
>> entry points for these functions for generic, thunderx and falkor.
> 
> I don't understand what the goal of this is since on AArch64 we always transform
> mempcpy to memcpy. Also why use ifuncs on the _chk variants? Are they ever
> used in cases where the last 1% of performance matters?

I started off by writing this for __memcpy_chk because gcc transforms
memcpy to __memcpy_chk in some cases with -O3 and then extended it to
mempcpy for completeness.  I'm not going to push too hard for mempcpy if
you strongly oppose it, but I am definitely interested in getting
__memcpy_chk ifuncs in.

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-11 11:22 ` Siddhesh Poyarekar
@ 2017-08-11 17:58   ` Szabolcs Nagy
  2017-08-11 18:06     ` Zack Weinberg
  0 siblings, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2017-08-11 17:58 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Wilco Dijkstra, libc-alpha; +Cc: nd

On 11/08/17 12:22, Siddhesh Poyarekar wrote:
> On Friday 11 August 2017 04:41 PM, Wilco Dijkstra wrote:
>> Siddhesh Poyarekar wrote:
>>> Functions like mempcpy, __mempcpy_chk and __memcpy_chk continue to call the
>>> generic memcpy implementation.  These two patches fix this by adding ifunc
>>> entry points for these functions for generic, thunderx and falkor.
>>
>> I don't understand what the goal of this is since on AArch64 we always transform
>> mempcpy to memcpy. Also why use ifuncs on the _chk variants? Are they ever
>> used in cases where the last 1% of performance matters?
> 
> I started off by writing this for __memcpy_chk because gcc transforms
> memcpy to __memcpy_chk in some cases with -O3 and then extended it to
> mempcpy for completeness.  I'm not going to push too hard for mempcpy if
> you strongly oppose it, but I am definitely interested in getting
> __memcpy_chk ifuncs in.
> 

ifuncs and asm implementations have a lot of non-trivial
maintenance costs.

as far as i understand *_chk calls are only generated when
compiling with -D_FORTIFY_SOURCE=* and most of these checks
could be inlined by the compiler (i.e. there is no need
for runtime support in principle).

may be the generic __memcpy_chk should call the ifunced
memcpy so it goes through an extra plt indirection, but
at least less target specific code is needed.
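
A minimal sketch of the wrapper suggested above, for illustration only.
The names sketch_memcpy_chk and sketch_selftest are hypothetical, and
abort() stands in for whatever failure handler libc would really use.
The length check is done once in generic C and the copy itself is left
to the ordinary memcpy symbol, so the implementation selected by the
ifunc still does the work:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a generic __memcpy_chk: verify that the
   destination object is large enough, then defer to memcpy.  Inside
   libc this memcpy call would be the one that goes through the PLT
   and so reaches the ifunc-selected implementation.  */
static void *
sketch_memcpy_chk (void *dst, const void *src, size_t len, size_t dstlen)
{
  if (dstlen < len)
    abort ();   /* Assumption: the real failure path differs.  */
  return memcpy (dst, src, len);
}

/* Small self-check: copy a string into a large-enough buffer and
   return 0 if the contents arrived intact.  */
static int
sketch_selftest (void)
{
  char buf[8];
  void *ret = sketch_memcpy_chk (buf, "abc", 4, sizeof buf);
  assert (ret == buf);
  return strcmp (buf, "abc");
}
```

The cost of this shape is the extra call and indirection on every
checked copy; the benefit is that no per-target assembly is needed for
the _chk entry point.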

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-11 17:58   ` Szabolcs Nagy
@ 2017-08-11 18:06     ` Zack Weinberg
  2017-08-11 18:53       ` Siddhesh Poyarekar
  0 siblings, 1 reply; 27+ messages in thread
From: Zack Weinberg @ 2017-08-11 18:06 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Siddhesh Poyarekar, Wilco Dijkstra, libc-alpha, nd

On Fri, Aug 11, 2017 at 1:58 PM, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
>
> as far as i understand *_chk calls are only generated when
> compiling with -D_FORTIFY_SOURCE=* and most of these checks
> could be inlined by the compiler (i.e. there is no need
> for runtime support in principle).

Calls to _chk functions that survive in the executable are for cases
where the compiler couldn't prove the call was safe - for instance, if
the size of data copied can vary at runtime.

Using _FORTIFY_SOURCE means you accept *some* extra overhead, but we
don't want it to sacrifice *all* machine-specific optimizations.
People do things like compiling entire "hardened" distributions with
it on.

> may be the generic __memcpy_chk should call the ifunced
> memcpy so it goes through an extra plt indirection, but
> at least less target specific code is needed.

I was thinking of making this suggestion myself.  I think that would
be a better maintainability/efficiency tradeoff.  (Of course, I also
think we shouldn't bypass ifuncs for intra-libc calls.)

zw

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-11 18:06     ` Zack Weinberg
@ 2017-08-11 18:53       ` Siddhesh Poyarekar
  2017-08-11 18:55         ` Zack Weinberg
  0 siblings, 1 reply; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-11 18:53 UTC (permalink / raw)
  To: Zack Weinberg, Szabolcs Nagy; +Cc: Wilco Dijkstra, libc-alpha, nd

On Friday 11 August 2017 11:36 PM, Zack Weinberg wrote:
>> may be the generic __memcpy_chk should call the ifunced
>> memcpy so it goes through an extra plt indirection, but
>> at least less target specific code is needed.
> 
> I was thinking of making this suggestion myself.  I think that would
> be a better maintainability/efficiency tradeoff.  (Of course, I also
> think we shouldn't bypass ifuncs for intra-libc calls.)

That was my initial approach, but I was under the impression that PLTs
in internal calls were frowned upon, hence the ifuncs similar to what's
done in x86.  If this is acceptable, I could do more tests to check
gains within the library if we were to call memcpy via ifunc.

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-11 18:53       ` Siddhesh Poyarekar
@ 2017-08-11 18:55         ` Zack Weinberg
  2017-08-14 10:36           ` Wilco Dijkstra
  0 siblings, 1 reply; 27+ messages in thread
From: Zack Weinberg @ 2017-08-11 18:55 UTC (permalink / raw)
  To: Siddhesh Poyarekar; +Cc: Szabolcs Nagy, Wilco Dijkstra, libc-alpha, nd

On Fri, Aug 11, 2017 at 2:53 PM, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
> On Friday 11 August 2017 11:36 PM, Zack Weinberg wrote:
>>> may be the generic __memcpy_chk should call the ifunced
>>> memcpy so it goes through an extra plt indirection, but
>>> at least less target specific code is needed.
>>
>> I was thinking of making this suggestion myself.  I think that would
>> be a better maintainability/efficiency tradeoff.  (Of course, I also
>> think we shouldn't bypass ifuncs for intra-libc calls.)
>
> That was my initial approach, but I was under the impression that PLTs
> in internal calls were frowned upon, hence the ifuncs similar to what's
> done in x86.  If this is acceptable, I could do more tests to check
> gains within the library if we were to call memcpy via ifunc.

There's been a bunch of inconclusive arguments about this in the past.
If you have the time and the resources to do some thorough testing and
properly resolve the question, that would be really great.

zw

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-11 18:55         ` Zack Weinberg
@ 2017-08-14 10:36           ` Wilco Dijkstra
  2017-08-14 12:14             ` Siddhesh Poyarekar
  2017-08-15 20:52             ` Zack Weinberg
  0 siblings, 2 replies; 27+ messages in thread
From: Wilco Dijkstra @ 2017-08-14 10:36 UTC (permalink / raw)
  To: Zack Weinberg, Siddhesh Poyarekar; +Cc: Szabolcs Nagy, libc-alpha, nd

Zack Weinberg wrote:
> On Fri, Aug 11, 2017 at 2:53 PM, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
> > On Friday 11 August 2017 11:36 PM, Zack Weinberg wrote:
>>>> may be the generic __memcpy_chk should call the ifunced
>>>> memcpy so it goes through an extra plt indirection, but
>>>> at least less target specific code is needed.
>>>
>>> I was thinking of making this suggestion myself.  I think that would
>>> be a better maintainability/efficiency tradeoff.  (Of course, I also
>>> think we shouldn't bypass ifuncs for intra-libc calls.)
>>
>> That was my initial approach, but I was under the impression that PLTs
>> in internal calls were frowned upon, hence the ifuncs similar to what's
>> done in x86.  If this is acceptable, I could do more tests to check
>> gains within the library if we were to call memcpy via ifunc.
>
> There's been a bunch of inconclusive arguments about this in the past.
> If you have the time and the resources to do some thorough testing and
> properly resolve the question, that would be really great.

I don't believe you can resolve this generally, it's highly dependent on the details.
If the generic implementation is very efficient, the possible gain of specialized
ifuncs may be so low that it can never offset the overhead of an ifunc. Also note
that you're always slowing down the generic case, so if that version is used in
many cases, an ifunc wouldn't make sense.

I haven't looked in detail at memcpy use in GLIBC, however if the statistics are
similar to typical use I measured then it makes no sense to use ifuncs. Large
copies can benefit from special tweaks, and in that case the overhead of an ifunc
would be much smaller (both relatively and absolutely due to lower frequency), so
that's where an ifunc might be useful.
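
To make the overhead argument concrete, here is a rough sketch that
approximates ifunc-style selection with a plain function pointer.  It
is not how the dynamic linker implements ifuncs (a real ifunc is
resolved by the loader and needs no NULL check), and every name in it
(copy_generic, copy_tuned, cpu_is_tuned_target, dispatch_copy) is
hypothetical.  It only shows that each call pays for an indirect
branch regardless of which variant is selected:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn) (void *, const void *, size_t);

/* Two hypothetical variants.  Both simply call memcpy here; a real
   build would supply separately tuned assembly routines.  */
static void *
copy_generic (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

static void *
copy_tuned (void *d, const void *s, size_t n)
{
  return memcpy (d, s, n);
}

/* Hypothetical CPU probe; always reports "not the tuned target".  */
static int
cpu_is_tuned_target (void)
{
  return 0;
}

/* Plays the role of an ifunc resolver: pick one variant.  */
static copy_fn
resolve_copy (void)
{
  return cpu_is_tuned_target () ? copy_tuned : copy_generic;
}

static copy_fn selected_copy;

/* Every call goes through the pointer, so the generic case also pays
   for the indirection, which is the cost discussed above.  */
static void *
dispatch_copy (void *d, const void *s, size_t n)
{
  if (selected_copy == NULL)
    selected_copy = resolve_copy ();
  assert (selected_copy != NULL);
  return selected_copy (d, s, n);
}
```

For a copy of a few bytes the indirect call can be a noticeable part
of the total; for a large copy it disappears into the copy loop.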

Wilco

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 10:36           ` Wilco Dijkstra
@ 2017-08-14 12:14             ` Siddhesh Poyarekar
  2017-08-14 13:20               ` Szabolcs Nagy
  2017-08-14 13:22               ` Wilco Dijkstra
  2017-08-15 20:52             ` Zack Weinberg
  1 sibling, 2 replies; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-14 12:14 UTC (permalink / raw)
  To: Wilco Dijkstra, Zack Weinberg; +Cc: Szabolcs Nagy, libc-alpha, nd

On Monday 14 August 2017 04:06 PM, Wilco Dijkstra wrote:
> I don't believe you can resolve this generally, it's highly dependent on the details.
> If the generic implementation is very efficient, the possible gain of specialized
> ifuncs may be so low that it can never offset the overhead of an ifunc. Also note
> that you're always slowing down the generic case, so if that version is used in
> many cases, an ifunc wouldn't make sense.
>
> I haven't looked in detail at memcpy use in GLIBC, however if the statistics are
> similar to typical use I measured then it makes no sense to use ifuncs. Large
> copies can benefit from special tweaks, and in that case the overhead of an ifunc
> would be much smaller (both relatively and absolutely due to lower frequency), so
> that's where an ifunc might be useful.

The first part is not true for falkor since its implementation is a good
10-15% faster on the falkor chip due to its design differences.  glibc
makes pretty extensive use of memcpy throughout, but I don't have data
on how much difference a core-specific memcpy will make there, so I
don't have enough grounds for a generic change.

Your last point about hurting everything else is very valid though; it's
very likely that adding an extra indirection in cases where
__memcpy_generic is going to be called anyway is going to be expensive
given that a bulk of the memcpy calls will be for small sizes of less
than 1k.

Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
waiver in check_localplt and that would become a blanket OK for PLT
usage for memcpy, which we don't want.  Hence my patch is probably the
best compromise, especially since there is precedent for the approach in
x86.

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 12:14             ` Siddhesh Poyarekar
@ 2017-08-14 13:20               ` Szabolcs Nagy
  2017-08-14 13:29                 ` Siddhesh Poyarekar
  2017-08-14 13:22               ` Wilco Dijkstra
  1 sibling, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2017-08-14 13:20 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Wilco Dijkstra, Zack Weinberg; +Cc: nd, libc-alpha

On 14/08/17 13:14, Siddhesh Poyarekar wrote:
> Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
> waiver in check_localplt and that would become a blanket OK for PLT

i only proposed plt for this specific case on
the grounds that i expect _chk to be obsoleted
and instead the check will be in user code
(which has the same semantics as if plt was
used in libc, the memcpy call is interposable)

> usage for memcpy, which we don't want.  Hence my patch is probably the
> best compromise, especially since there is precedent for the approach in
> x86.

it is probably best from performance point
of view.

but i'm not yet convinced that __memcpy_chk is
performance critical, if not, then i'd rather
not add ifuncs.

does anyone know how commonly distros build
with _FORTIFY_SOURCE?

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 12:14             ` Siddhesh Poyarekar
  2017-08-14 13:20               ` Szabolcs Nagy
@ 2017-08-14 13:22               ` Wilco Dijkstra
  2017-08-14 13:54                 ` Siddhesh Poyarekar
                                   ` (3 more replies)
  1 sibling, 4 replies; 27+ messages in thread
From: Wilco Dijkstra @ 2017-08-14 13:22 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Zack Weinberg; +Cc: Szabolcs Nagy, libc-alpha, nd

Siddhesh Poyarekar wrote:
> The first part is not true for falkor since its implementation is a good
> 10-15% faster on the falkor chip due to its design differences.  glibc
> makes pretty extensive use of memcpy throughout, but I don't have data
> on how much difference a core-specific memcpy will make there, so I
> don't have enough grounds for a generic change.

66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
for these small sizes (there is very little you can do different), that's at most 1
cycle faster, so the PLT indirection is going to be more expensive.

> Your last point about hurting everything else is very valid though; it's
> very likely that adding an extra indirection in cases where
> __memcpy_generic is going to be called anyway is going to be expensive
> given that a bulk of the memcpy calls will be for small sizes of less
> than 1k.

Note that the falkor version does quite well in memcpy-random across several
micro architectures so I think parts of it could be moved into the generic code.

> Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
> waiver in check_localplt and that would become a blanket OK for PLT
> usage for memcpy, which we don't want.  Hence my patch is probably the
> best compromise, especially since there is precedent for the approach in
> x86.

I still can't see any reason to even support these entry points in GLIBC, let
alone optimize them using ifuncs. The _chk functions should obviously be
inlined to avoid all the target specific complexity for no benefit. I think this
could trivially be done via the GLIBC headers already. (That's assuming they
are in any way performance critical.)

Wilco

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 13:20               ` Szabolcs Nagy
@ 2017-08-14 13:29                 ` Siddhesh Poyarekar
  2017-08-14 13:52                   ` Szabolcs Nagy
  0 siblings, 1 reply; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-14 13:29 UTC (permalink / raw)
  To: Szabolcs Nagy, Wilco Dijkstra, Zack Weinberg; +Cc: nd, libc-alpha

On Monday 14 August 2017 06:50 PM, Szabolcs Nagy wrote:
> i only proposed plt for this specific case on
> the grounds that i expect _chk to be obsoleted
> and instead the check will be in user code
> (which has the same semantics as if plt was
> used in libc, the memcpy call is interposable)

Right, but the test case puts in a blanket override, so it becomes
useless for testing any future inadvertent PLT introductions.

>> usage for memcpy, which we don't want.  Hence my patch is probably the
>> best compromise, especially since there is precedent for the approach in
>> x86.
> 
> it is probably best from performance point
> of view.
> 
> but i'm not yet convinced that __memcpy_chk is
> performance critical, if not, then i'd rather
> not add ifuncs.
> 
> does anyone know how commonly distros build
> with _FORTIFY_SOURCE?

Fedora does it by default:

-14: __global_compiler_flags	-O2 -g -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
--param=ssp-buffer-size=4 -grecord-gcc-switches %{_hardened_cflags}

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 13:29                 ` Siddhesh Poyarekar
@ 2017-08-14 13:52                   ` Szabolcs Nagy
  2017-08-14 13:56                     ` Siddhesh Poyarekar
  0 siblings, 1 reply; 27+ messages in thread
From: Szabolcs Nagy @ 2017-08-14 13:52 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Wilco Dijkstra, Zack Weinberg; +Cc: nd, libc-alpha

On 14/08/17 14:28, Siddhesh Poyarekar wrote:
> On Monday 14 August 2017 06:50 PM, Szabolcs Nagy wrote:
>> does anyone know how commonly distros build
>> with _FORTIFY_SOURCE?
> 
> Fedora does it by default:
> 
> -14: __global_compiler_flags	-O2 -g -pipe -Wall -Werror=format-security
> -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
> --param=ssp-buffer-size=4 -grecord-gcc-switches %{_hardened_cflags}
> 

it seems the firefox on my system is compiled with
_FORTIFY_SOURCE, so i checked libxul.so:

memcpy calls: 2570 (95.5%)
__memcpy_chk calls: 122 (4.5%)

it seems it's rare that the destination is an array
with known size, so _chk is rarely used, to me it
makes more sense to add static inline __memcpy_chk
to string.h than fiddle with asm and ifunc.
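
A rough sketch of what such a header-level check could look like,
assuming GCC or Clang (for __builtin_object_size) and using the
hypothetical names sketch_checked_copy and SKETCH_MEMCPY.  This is not
the existing glibc fortify header, only an illustration of doing the
check in the caller and then calling plain memcpy:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Check in the caller, then call the ordinary memcpy.  dstlen is
   (size_t) -1 when the compiler could not determine the destination
   size, in which case the comparison below never fires.  */
static inline void *
sketch_checked_copy (void *dst, const void *src, size_t len,
                     size_t dstlen)
{
  if (dstlen < len)
    abort ();   /* Assumption: a real header would report this differently.  */
  return memcpy (dst, src, len);
}

/* The object size must be taken at the call site, where the compiler
   can still see the destination object, hence a macro.  */
#define SKETCH_MEMCPY(dst, src, len) \
  sketch_checked_copy ((dst), (src), (len), \
                       __builtin_object_size ((dst), 0))
```

With `char buf[16];`, a call such as `SKETCH_MEMCPY (buf, "hello", 6)`
copies normally, and the check costs at most one compare in the caller
with no extra libc entry point involved.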


* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 13:22               ` Wilco Dijkstra
@ 2017-08-14 13:54                 ` Siddhesh Poyarekar
  2017-08-15 20:11                 ` Patrick McGehearty
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-14 13:54 UTC (permalink / raw)
  To: Wilco Dijkstra, Zack Weinberg; +Cc: Szabolcs Nagy, libc-alpha, nd

On Monday 14 August 2017 06:52 PM, Wilco Dijkstra wrote:
> 66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
> for these small sizes (there is very little you can do different), that's at most 1
> cycle faster, so the PLT indirection is going to be more expensive.

Yeah, I won't argue for copies of that size.

> Note that the falkor version does quite well in memcpy-random across several
> micro architectures so I think parts of it could be moved into the generic code.

That's interesting.  Not surprising though, since a lot of it was just
issue slot usage and alignments and nothing else.  I don't expect those
to be widely different between cores.

> I still can't see any reason to even support these entry points in GLIBC, let
> alone optimize them using ifuncs. The _chk functions should obviously be
> inlined to avoid all the target specific complexity for no benefit. I think this
> could trivially be done via the GLIBC headers already. (That's assuming they
> are in any way performance critical.)

These entry points are supported in the ABI, so you don't have a choice
in terms of supporting them.  Inlining by default has a different
problem - it will take effect only when a distribution does a full
rebuild and that happens very infrequently.  This will completely
discount backporting of these routines to any stable distribution.

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 13:52                   ` Szabolcs Nagy
@ 2017-08-14 13:56                     ` Siddhesh Poyarekar
  2017-08-15 21:55                       ` Wilco Dijkstra
  0 siblings, 1 reply; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-14 13:56 UTC (permalink / raw)
  To: Szabolcs Nagy, Wilco Dijkstra, Zack Weinberg; +Cc: nd, libc-alpha

On Monday 14 August 2017 07:22 PM, Szabolcs Nagy wrote:
> it seems the firefox on my system is compiled with
> _FORTIFY_SOURCE, so i checked libxul.so:
> 
> memcpy calls: 2570 (95.5%)
> __memcpy_chk calls: 122 (4.5%)
> 
> it seems it's rare that the destination is an array
> with known size, so _chk is rarely used, to me it
> makes more sense to add static inline __memcpy_chk
> to string.h than fiddle with asm and ifunc.

See my response to Wilco about static inlines - it requires a full
distro rebuild to have any effect and that's a pointless option to have
if we have to backport these routines to stable distributions.

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 13:22               ` Wilco Dijkstra
  2017-08-14 13:54                 ` Siddhesh Poyarekar
@ 2017-08-15 20:11                 ` Patrick McGehearty
       [not found]                 ` <a62a7627-1d1e-6a2b-8197-f4b16bfbdcc6@oracle.com>
  2017-08-15 21:02                 ` Zack Weinberg
  3 siblings, 0 replies; 27+ messages in thread
From: Patrick McGehearty @ 2017-08-15 20:11 UTC (permalink / raw)
  To: libc-alpha

On 8/14/2017 8:22 AM, Wilco Dijkstra wrote:
> Siddhesh Poyarekar wrote:
>> The first part is not true for falkor since its implementation is a good
>> 10-15% faster on the falkor chip due to its design differences.  glibc
>> makes pretty extensive use of memcpy throughout, but I don't have data
>> on how much difference a core-specific memcpy will make there, so I
>> don't have enough grounds for a generic change.
> 66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
> for these small sizes (there is very little you can do different), that's at most 1
> cycle faster, so the PLT indirection is going to be more expensive.
It is important to be careful about overemphasizing the frequency of
short memcpy calls.  Even though a high percentage of memcpy calls are
short, my experience is that a high percentage of time spent in memcpy
is on longer copies.

The following is just an example, not an expression of any specific
real application behavior: if 66% of calls are <=16 bytes (average
length 8, say) but the average length of the remaining 1/3 of calls is
1K bytes (i.e. > 100 times as long), then the vast majority of time in
memcpy would be spent in the longer copies.
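
The arithmetic behind this example can be sketched as follows.  The
helper name long_copy_byte_share is hypothetical, and bytes copied are
used as a crude proxy for time spent, which ignores per-call overhead:

```c
#include <assert.h>

/* Fraction of all copied bytes that belong to the long copies, given
   the share of calls that are short and the two average lengths.  */
static double
long_copy_byte_share (double short_call_share, double short_avg_len,
                      double long_avg_len)
{
  assert (short_call_share >= 0.0 && short_call_share <= 1.0);
  double short_bytes = short_call_share * short_avg_len;
  double long_bytes = (1.0 - short_call_share) * long_avg_len;
  return long_bytes / (short_bytes + long_bytes);
}
```

With the figures above, long_copy_byte_share (0.66, 8.0, 1024.0) is
348.16 / 353.44, roughly 0.985, so about 98.5% of the bytes moved are
in the long copies even though they are only a third of the calls.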

My experience with tuning libc memcpy off and on on multiple platforms
is that copies of length > 256 bytes are the ones that affect overall
application performance.  Really short copies where the length and/or
alignment might be known at compile time are best handled by inlining
the copy.

I've produced platform specific optimizations for memcpy many times
over the years.  By platform specific, I mean different code for
different generations/platforms of the same architecture.  These
versions have shown improvements from as little as 10% to as much as
250%, depending on how close the memory architecture of the latest
platform is to the prior platform.

Typical factors that can influence the best memcpy performance on a
specific platform for a given architecture include:
ideal prefetch distance ... depends on processor speed, cache/memory
    latency, depth of memory subsystem queues, details of memory
    subsystem priorities for prefetch vs demand fetch, and more.
number of alu operations that can be issued per cycle
number of memory operations that can be issued per cycle
number of total instructions that can be issued per cycle
branch misprediction latency; branch predictor behavior; other branch issues
and many other architectural features which can make occasional differences

I find it hard to imagine a single generic memcpy library routine that
can match the performance of a platform specific tuned routine over a
typical range of copy lengths, assuming the architecture has been
around long enough to go through several semiconductor process
redesigns.  With dynamic linking, the overhead of using platform
specific code for something frequently called should be relatively
minimal.

I do agree a good generic version should be available as the effort of
finding the best tuning for a particular platform can take weeks and
not all architecture/platform combinations will get that intense
attention.

- patrick mcgehearty

>
>> Your last point about hurting everything else is very valid though; it's
>> very likely that adding an extra indirection in cases where
>> __memcpy_generic is going to be called anyway is going to be expensive
>> given that a bulk of the memcpy calls will be for small sizes of less
>> than 1k.
> Note that the falkor version does quite well in memcpy-random across several
> micro architectures so I think parts of it could be moved into the generic code.
>
>> Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
>> waiver in check_localplt and that would become a blanket OK for PLT
>> usage for memcpy, which we don't want.  Hence my patch is probably the
>> best compromise, especially since there is precedent for the approach in
>> x86.
> I still can't see any reason to even support these entry points in GLIBC, let
> alone optimize them using ifuncs. The _chk functions should obviously be
> inlined to avoid all the target specific complexity for no benefit. I think this
> could trivially be done via the GLIBC headers already. (That's assuming they
> are in any way performance critical.)
>
> Wilco

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
       [not found]                 ` <a62a7627-1d1e-6a2b-8197-f4b16bfbdcc6@oracle.com>
@ 2017-08-15 20:14                   ` Patrick McGehearty
  0 siblings, 0 replies; 27+ messages in thread
From: Patrick McGehearty @ 2017-08-15 20:14 UTC (permalink / raw)
  To: libc-alpha

Apologies for the multiple versions of this message (again).
I've got to learn to edit my complete message outside email since I can't
seem to fix my habit of typing "^S" (save text) every time
I complete a paragraph. Unfortunately, Thunderbird treats
^S as a send command.
- patrick

On 8/15/2017 3:10 PM, Patrick McGehearty wrote:
> On 8/14/2017 8:22 AM, Wilco Dijkstra wrote:
>> Siddhesh Poyarekar wrote:
>>> The first part is not true for falkor since its implementation is a good
>>> 10-15% faster on the falkor chip due to its design differences.  glibc
>>> makes pretty extensive use of memcpy throughout, but I don't have data
>>> on how much difference a core-specific memcpy will make there, so I
>>> don't have enough grounds for a generic change.
>> 66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
>> for these small sizes (there is very little you can do different), that's at most 1
>> cycle faster, so the PLT indirection is going to be more expensive.
> It is important to be careful about overemphasizing the frequency of 
> short memcpy calls.
> Even though a high percentage of memcpy calls are short, my experience 
> is that a high
> percentage of time spent in memcpy is on longer copies.
>
> Following example is just that, an example, not an expression of any 
> specific real application behavior:
> If 66% of calls are <=16 bytes  (average length=8, say) but the 
> average length of the remaining
> 1/3 of calls was 1K bytes (i.e. > 100 times as long), then the vast 
> majority of time
> in memcpy would be in the longer copies.
>
> My experience with tuning libc memcpy off and on on multiple platforms 
> is that copies
> of length > 256 bytes are the ones that affect overall application 
> performance. Really short
> copies where the length and/or alignment might be known at compile 
> time are best handled
> by inlining the copy.
>
> I've produced platform specific optimizations for memcpy many times 
> over the years.  By platform
> specific, I mean different code for different generations/platforms of 
> the same architecture.
> These versions have shown improvements from as little as 10% to as much as 250%
> depending on how close the memory architecture of latest platform is 
> to the prior platform.
>
> Typical factors that can influence best memcpy performance a specific 
> platform for a given architecture include:
> ideal prefetch distance ... depends on processor speed, cache/memory 
> latency, depth of memory
>    subsystem queues, details of memory subsystem priorities for 
> prefetch vs demand fetch, and more.
> number of alu operations that can be issued per cycle
> number of memory operations that can be issued per cycle
> number of total instructions that can be issued per cycle
> branch misprediction latency; branch predictor behavior; other branch 
> issues
> and many other architectural features which can make occasional 
> differences
>
> I find it hard to imagine a single generic memcpy library routine that 
> can match the performance
> of a platform specific tuned routine over a typical range of copy 
> lengths, assuming the architecture
> has been around long enough to go through several semiconductor 
> process redesigns.
> With dynamic linking, the overhead of using platform specific code for 
> something frequently
> called should be relatively minimal.
>
> I do agree a good generic version should be available as the effort of 
> finding the best tuning
> for a particular platform can take weeks and not all 
> architecture/platform combinations
> will get that intense attention.
>
> - patrick mcgehearty
>
>>> Your last point about hurting everything else is very valid though; it's
>>> very likely that adding an extra indirection in cases where
>>> __memcpy_generic is going to be called anyway is going to be expensive
>>> given that a bulk of the memcpy calls will be for small sizes of less
>>> than 1k.
>> Note that the falkor version does quite well in memcpy-random across several
>> micro architectures so I think parts of it could be moved into the generic code.
>>
>>> Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
>>> waiver in check_localplt and that would become a blanket OK for PLT
>>> usage for memcpy, which we don't want.  Hence my patch is probably the
>>> best compromise, especially since there is precedent for the approach in
>>> x86.
>> I still can't see any reason to even support these entry points in GLIBC, let
>> alone optimize them using ifuncs. The _chk functions should obviously be
>> inlined to avoid all the target specific complexity for no benefit. I think this
>> could trivially be done via the GLIBC headers already. (That's assuming they
>> are in any way performance critical.)
>>
>> Wilco
>

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 10:36           ` Wilco Dijkstra
  2017-08-14 12:14             ` Siddhesh Poyarekar
@ 2017-08-15 20:52             ` Zack Weinberg
  2017-08-16 12:28               ` Wilco Dijkstra
  1 sibling, 1 reply; 27+ messages in thread
From: Zack Weinberg @ 2017-08-15 20:52 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Siddhesh Poyarekar, Szabolcs Nagy, libc-alpha, nd

On Mon, Aug 14, 2017 at 6:36 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Zack Weinberg wrote:
>> On Fri, Aug 11, 2017 at 2:53 PM, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
>> > On Friday 11 August 2017 11:36 PM, Zack Weinberg wrote:
>>>>> maybe the generic __memcpy_chk should call the ifunced
>>>>> memcpy so it goes through an extra plt indirection, but
>>>>> at least less target specific code is needed.
>>>>
>>>> I was thinking of making this suggestion myself.  I think that would
>>>> be a better maintainability/efficiency tradeoff.  (Of course, I also
>>>> think we shouldn't bypass ifuncs for intra-libc calls.)
>>>
>>> That was my initial approach, but I was under the impression that PLTs
>>> in internal calls were frowned upon, hence the ifuncs similar to what's
>>> done in x86.  If this is acceptable, I could do more tests to check
>>> gains within the library if we were to call memcpy via ifunc.
>>
>> There's been a bunch of inconclusive arguments about this in the past.
>> If you have the time and the resources to do some thorough testing and
>> properly resolve the question, that would be really great.
>
> I don't believe you can resolve this generally; it's highly dependent on the details.
> If the generic implementation is very efficient, the possible gain of specialized
> ifuncs may be so low that it can never offset the overhead of an ifunc. Also note
> that you're always slowing down the generic case, so if that version is used in
> many cases, an ifunc wouldn't make sense.
>
> I haven't looked in detail at memcpy use in GLIBC; however, if the statistics are
> similar to the typical use I measured, then it makes no sense to use ifuncs. Large
> copies can benefit from special tweaks, and in that case the overhead of an ifunc
> would be much smaller (both relatively and absolutely due to lower frequency), so
> that's where an ifunc might be useful.

Last time we had this argument, someone (Ondrej?) claimed that the
overhead of going through an ifunc for intra-libc calls (specifically
to memcpy, IIRC) was dwarfed by the I-cache costs of having both the
generic and the targeted version of the function get used. I would
really like to see measurements addressing that specific point.

zw
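
For reference, the indirection being debated can be modelled in
portable C.  This is only a sketch under stated assumptions:
copy_generic, copy_tuned, cpu_is_tuned_target, resolve_copy, copy_slot,
dispatch_copy and direct_copy are all hypothetical names, and a plain
function pointer stands in for the PLT/GOT slot that the dynamic linker
would fill in from a real ifunc resolver (in GCC,
__attribute__ ((ifunc ("resolver")))).  It measures nothing; it only
shows where the extra indirect call sits compared with a direct call to
the generic routine.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a generic and a CPU-tuned copy routine.
   Both simply defer to memcpy so the sketch stays portable.  */
static void *copy_generic (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *copy_tuned (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

typedef void *(*copy_fn) (void *, const void *, size_t);

/* Hypothetical CPU probe result; a real resolver would inspect
   hardware capability information instead.  */
static int cpu_is_tuned_target = 0;

/* Plays the role of the ifunc resolver: picks one implementation.  */
static copy_fn resolve_copy (void)
{
  return cpu_is_tuned_target ? copy_tuned : copy_generic;
}

/* Plays the role of the GOT slot.  It is filled on first use here; the
   dynamic linker performs this step when it processes the relocation.  */
static copy_fn copy_slot;

/* Models a call through the PLT: one extra load and an indirect branch
   on every call.  */
void *dispatch_copy (void *dst, const void *src, size_t n)
{
  if (copy_slot == NULL)
    copy_slot = resolve_copy ();
  return copy_slot (dst, src, n);
}

/* Models an intra-libc call that bypasses the PLT and always reaches
   the generic routine.  */
void *direct_copy (void *dst, const void *src, size_t n)
{
  return copy_generic (dst, src, n);
}
```

Both paths copy the same bytes; the open question in this thread is
whether the cost of the indirect call and of keeping two routines hot
in the I-cache outweighs the gain from the tuned variant.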

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 13:22               ` Wilco Dijkstra
                                   ` (2 preceding siblings ...)
       [not found]                 ` <a62a7627-1d1e-6a2b-8197-f4b16bfbdcc6@oracle.com>
@ 2017-08-15 21:02                 ` Zack Weinberg
  2017-08-15 21:41                   ` Wilco Dijkstra
  2017-08-16  5:10                   ` Siddhesh Poyarekar
  3 siblings, 2 replies; 27+ messages in thread
From: Zack Weinberg @ 2017-08-15 21:02 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Siddhesh Poyarekar, Szabolcs Nagy, libc-alpha, nd

On Mon, Aug 14, 2017 at 9:22 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> I still can't see any reason to even support these entry points in GLIBC, let
> alone optimize them using ifuncs.

They are already exported by libc.so.6 so we are stuck with them, but
what if we demote the copies in libc.so.6 to compat symbols and shove
the "real" versions into libc_nonshared.a? Then their calls to their
"normal" counterparts will naturally go through the PLT and hit the
"proper" ifuncs, without any messing around with assembly language.
The compat symbols can continue to call the 'generic' functions until
we get around to deciding we want to stop bypassing the PLT for
intra-libc ifunc calls.

Also it took something like 20 years to get all the counterproductive
inline functions out of string.h, let's not be putting them back now.
I _like_ the idea of inlining the runtime check done by the _chk entry
points, but it should be done in the compiler, and the compiler might
still prefer to call the _chk entry points when optimizing for size.

zw

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-15 21:02                 ` Zack Weinberg
@ 2017-08-15 21:41                   ` Wilco Dijkstra
  2017-08-15 22:06                     ` Zack Weinberg
  2017-08-16  5:10                   ` Siddhesh Poyarekar
  1 sibling, 1 reply; 27+ messages in thread
From: Wilco Dijkstra @ 2017-08-15 21:41 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Siddhesh Poyarekar, Szabolcs Nagy, libc-alpha, nd

Zack Weinberg wrote:
> On Mon, Aug 14, 2017 at 9:22 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> >
> > I still can't see any reason to even support these entry points in GLIBC, let
> > alone optimize them using ifuncs.
>
> They are already exported by libc.so.6 so we are stuck with them, but
> what if we demote the copies in libc.so.6 to compat symbols and shove
> the "real" versions into libc_nonshared.a? Then their calls to their
> "normal" counterparts will naturally go through the PLT and hit the
> "proper" ifuncs, without any messing around with assembly language.

But that means we still need to provide non-compat _chk entry points
indefinitely. See below for an alternative option that removes them now.

> Also it took something like 20 years to get all the counterproductive
> inline functions out of string.h, let's not be putting them back now.
> I _like_ the idea of inlining the runtime check done by the _chk entry
> points, but it should be done in the compiler, and the compiler might
> still prefer to call the _chk entry points when optimizing for size.

They are already inline functions that expand into GCC builtins which result
in calls to the _chk variants. The current implementation is a hopeless hack
and only checks a small percentage of the cases that could be checked
(I tried char arr[N]; memcpy (arr+1, p, n) and this is not checked...).

We can and should implement much more accurate bounds checking in GCC. 
The checks would always be inlined so the overhead is low and the normal
ifunc is called. This works with any library, not just GLIBC.

When building with an old GCC we could simply inline in the headers to stop
people adding ifuncs for this flawed "security" feature. That way the existing
symbols can immediately become compat symbols and all the _chk entry points
and ifuncs can be completely removed.

Wilco
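
A minimal sketch of what an always-inlined check could look like,
assuming a GCC-compatible compiler.  sketch_checked_memcpy and
SKETCH_MEMCPY are hypothetical names, not glibc or GCC interfaces, and
the helper returns NULL on overflow instead of aborting (the real
__memcpy_chk aborts via __chk_fail) purely so that the behaviour can be
tested.  __builtin_object_size returns (size_t) -1 when the compiler
cannot determine the destination size, in which case no check is
possible.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical inlined check.  dst_size is the known size of the
   destination object, or (size_t) -1 if it is unknown.  */
static inline void *sketch_checked_memcpy (void *dst, const void *src,
                                           size_t n, size_t dst_size)
{
  /* Reject the copy only when the size is known and too small.
     A real fortified build would abort here rather than return NULL.  */
  if (dst_size != (size_t) -1 && n > dst_size)
    return NULL;

  /* Ordinary memcpy call: this reaches whatever variant the normal
     memcpy symbol resolves to.  */
  return memcpy (dst, src, n);
}

/* Hypothetical header-level wrapper: the compiler supplies the
   destination size at the call site where it can work it out.  */
#define SKETCH_MEMCPY(dst, src, n) \
  sketch_checked_memcpy ((dst), (src), (n), \
                         __builtin_object_size ((dst), 0))
```

Because the helper ends in a plain memcpy call, no separate _chk entry
point is involved; how many call sites actually get a known size
depends entirely on what the compiler can prove.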

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-14 13:56                     ` Siddhesh Poyarekar
@ 2017-08-15 21:55                       ` Wilco Dijkstra
  2017-08-16  4:29                         ` Siddhesh Poyarekar
  0 siblings, 1 reply; 27+ messages in thread
From: Wilco Dijkstra @ 2017-08-15 21:55 UTC (permalink / raw)
  To: Siddhesh Poyarekar, Szabolcs Nagy, Zack Weinberg; +Cc: nd, libc-alpha

Siddhesh Poyarekar wrote:
> See my response to Wilco about static inlines - it requires a full
> distro rebuild to have any effect and that's a pointless option to have
> if we have to backport these routines to stable distributions.

I'm not sure what the argument is here: do we (a) backport ifuncs to older
GLIBCs, or (b) do distros upgrade to the latest GLIBC but nothing else?

Wilco

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-15 21:41                   ` Wilco Dijkstra
@ 2017-08-15 22:06                     ` Zack Weinberg
  2017-08-16  4:40                       ` Siddhesh Poyarekar
  0 siblings, 1 reply; 27+ messages in thread
From: Zack Weinberg @ 2017-08-15 22:06 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Siddhesh Poyarekar, Szabolcs Nagy, libc-alpha, nd

On Tue, Aug 15, 2017 at 5:40 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Zack Weinberg wrote:
>> On Mon, Aug 14, 2017 at 9:22 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> >
>> > I still can't see any reason to even support these entry points in GLIBC, let
>> > alone optimize them using ifuncs.
>>
>> They are already exported by libc.so.6 so we are stuck with them, but
>> what if we demote the copies in libc.so.6 to compat symbols and shove
>> the "real" versions into libc_nonshared.a? Then their calls to their
>> "normal" counterparts will naturally go through the PLT and hit the
>> "proper" ifuncs, without any messing around with assembly language.
>
> But that means we still need to provide non-compat _chk entry points
> indefinitely.

Not so; the libc_nonshared.a versions would call only the normal entry
points plus __fortify_fail (I think that's what it's called).  The
symbols in libc_nonshared.a can themselves go away whenever they
become unnecessary.

Also, "have to provide non-compat _chk entry points indefinitely" is
not significantly worse than "have to provide compat _chk entry points
indefinitely", and no matter what we do here we have to provide at
least compat _chk entry points indefinitely.

> See below for an alternative option that removes them now [...]

I'm sorry, that seems an awful lot more complicated, and most of the
complexity is in the headers, which is the exact place I think it
_shouldn't_ be.  And is anyone volunteering to do the compiler work?
You, perhaps?

zw

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-15 21:55                       ` Wilco Dijkstra
@ 2017-08-16  4:29                         ` Siddhesh Poyarekar
  0 siblings, 0 replies; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-16  4:29 UTC (permalink / raw)
  To: Wilco Dijkstra, Szabolcs Nagy, Zack Weinberg; +Cc: nd, libc-alpha

On Wednesday 16 August 2017 03:15 AM, Wilco Dijkstra wrote:
> I'm not sure what the argument is here: do we (a) backport ifuncs to older
> GLIBCs, or (b) do distros upgrade to the latest GLIBC but nothing else?

Backport IFUNCs to older glibcs since it does not introduce new ABI;
upgrading to latest glibc introduces new ABI, which server distros will
almost never do.

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-15 22:06                     ` Zack Weinberg
@ 2017-08-16  4:40                       ` Siddhesh Poyarekar
  0 siblings, 0 replies; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-16  4:40 UTC (permalink / raw)
  To: Zack Weinberg, Wilco Dijkstra; +Cc: Szabolcs Nagy, libc-alpha, nd

On Wednesday 16 August 2017 03:36 AM, Zack Weinberg wrote:
> I'm sorry, that seems an awful lot more complicated, and most of the
> complexity is in the headers, which is the exact place I think it
> _shouldn't_ be.  And is anyone volunteering to do the compiler work?
> You, perhaps?

It also precludes the possibility of backports, so as far as my use case
is concerned, it doesn't solve the problem.  It could be done in
addition to the ifuncs/PLT solution to phase out the symbols in future
if there is consensus on that.  That is not an immediate interest for
me, at least not for the next 5 years or so.

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-15 21:02                 ` Zack Weinberg
  2017-08-15 21:41                   ` Wilco Dijkstra
@ 2017-08-16  5:10                   ` Siddhesh Poyarekar
  1 sibling, 0 replies; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-16  5:10 UTC (permalink / raw)
  To: Zack Weinberg, Wilco Dijkstra; +Cc: Szabolcs Nagy, libc-alpha, nd

On Wednesday 16 August 2017 02:31 AM, Zack Weinberg wrote:
> They are already exported by libc.so.6 so we are stuck with them, but
> what if we demote the copies in libc.so.6 to compat symbols and shove
> the "real" versions into libc_nonshared.a? Then their calls to their
> "normal" counterparts will naturally go through the PLT and hit the
> "proper" ifuncs, without any messing around with assembly language.
> The compat symbols can continue to call the 'generic' functions until
> we get around to deciding we want to stop bypassing the PLT for
> intra-libc ifunc calls.

Hmm, this may work for me (i.e. do memcpy via PLT instead and then add
it as a compat symbol to libc.so and also into libc_nonshared.a) so that
only the backport needs the check-localplt override and not upstream.
I'll give it a shot next month since I'm going underground next week for
the rest of the month.

Siddhesh

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-15 20:52             ` Zack Weinberg
@ 2017-08-16 12:28               ` Wilco Dijkstra
  2017-08-17  1:09                 ` Zack Weinberg
  0 siblings, 1 reply; 27+ messages in thread
From: Wilco Dijkstra @ 2017-08-16 12:28 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Siddhesh Poyarekar, Szabolcs Nagy, libc-alpha, nd

Zack Weinberg wrote:
>
> Last time we had this argument, someone (Ondrej?) claimed that the
> overhead of going through an ifunc for intra-libc calls (specifically
> to memcpy, IIRC) was dwarfed by the I-cache costs of having both the
> generic and the targeted version of the function get used. I would
> really like to see measurements addressing that specific point.

I think it might be more easily measured if we make the effect much worse,
for example by adding several KB of NOPs at entry of generic memcpy.

I could easily generate a trace of internal calls to memcpy; however, the key
question is which functions in GLIBC use memcpy in performance critical
ways and which applications make heavy use of those?

Wilco
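
As an illustration of what such a trace could be reduced to, here is a
minimal sketch that buckets copy lengths so the share of small copies
(under 1k, as mentioned earlier in the thread) can be read off.  The
struct, the bucket limits and the function names are all hypothetical,
and how the lengths are collected (for example by wrapping memcpy) is
deliberately left out.

```c
#include <stddef.h>

/* Hypothetical bucket boundaries: [0,16) [16,64) [64,256) [256,1024)
   [1024,4096) and everything from 4096 upwards.  */
#define NUM_BUCKETS 6
static const size_t bucket_limit[NUM_BUCKETS - 1] =
  { 16, 64, 256, 1024, 4096 };

struct copy_histogram
{
  unsigned long count[NUM_BUCKETS];
};

/* Record one copy of n bytes in the first bucket whose upper limit
   exceeds n; lengths of 4096 and above land in the last bucket.  */
void record_copy (struct copy_histogram *h, size_t n)
{
  size_t i = 0;
  while (i < NUM_BUCKETS - 1 && n >= bucket_limit[i])
    i++;
  h->count[i]++;
}

/* Number of recorded copies shorter than 1024 bytes (buckets 0..3).  */
unsigned long copies_below_1k (const struct copy_histogram *h)
{
  return h->count[0] + h->count[1] + h->count[2] + h->count[3];
}
```

A histogram like this only says how often each size occurs; it does
not by itself answer which callers are performance critical.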

* Re: [PATCH 0/2] Multiarch hooks for memcpy variants
  2017-08-16 12:28               ` Wilco Dijkstra
@ 2017-08-17  1:09                 ` Zack Weinberg
  0 siblings, 0 replies; 27+ messages in thread
From: Zack Weinberg @ 2017-08-17  1:09 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Siddhesh Poyarekar, Szabolcs Nagy, libc-alpha, nd

On Wed, Aug 16, 2017 at 8:28 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Zack Weinberg wrote:
>>
>> Last time we had this argument, someone (Ondrej?) claimed that the
>> overhead of going through an ifunc for intra-libc calls (specifically
>> to memcpy, IIRC) was dwarfed by the I-cache costs of having both the
>> generic and the targeted version of the function get used. I would
>> really like to see measurements addressing that specific point.
>
> I think it might be more easily measured if we make the effect much worse,
> for example by adding several KB of NOPs at entry of generic memcpy.

I think this needs to be an A/B test of the real code before and after
the real proposed change (i.e. sending intra-libc calls to memcpy
through the PLT and the ifuncs) in order to resolve the argument to
everyone's satisfaction.  `perf`, looking specifically at all levels
of cache misses, ought to be able to pick out the signal even without
an artificial penalty.

> I could easily generate a trace of internal calls to memcpy; however, the key
> question is which functions in GLIBC use memcpy in performance critical
> ways and which applications make heavy use of those?

I don't know.  Maybe start with whole-program tests on big complicated
applications like Firefox and LibreOffice?  Web and database servers
might also be interesting.

zw

* [PATCH 0/2] Multiarch hooks for memcpy variants
@ 2017-08-11  7:14 Siddhesh Poyarekar
  0 siblings, 0 replies; 27+ messages in thread
From: Siddhesh Poyarekar @ 2017-08-11  7:14 UTC (permalink / raw)
  To: libc-alpha

Functions like mempcpy, __mempcpy_chk and __memcpy_chk continue to call the
generic memcpy implementation.  These two patches fix this by adding ifunc
entry points for these functions for generic, thunderx and falkor.

Siddhesh Poyarekar (2):
  aarch64: Add multiarch variants of __memcpy_chk
  Call the correct memcpy function through mempcpy

 sysdeps/aarch64/memcpy.S                          | 16 ++++++-
 sysdeps/aarch64/multiarch/Makefile                |  7 ++-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c       | 12 +++++
 sysdeps/aarch64/multiarch/memcpy_falkor.S         | 13 +++++-
 sysdeps/aarch64/multiarch/memcpy_generic.S        |  5 +++
 sysdeps/aarch64/multiarch/memcpy_thunderx.S       | 13 +++++-
 sysdeps/aarch64/multiarch/mempcpy.c               | 47 +++++++++++++++++++
 sysdeps/aarch64/multiarch/mempcpy_chk-nonshared.S | 28 ++++++++++++
 sysdeps/aarch64/multiarch/mempcpy_chk.c           | 35 +++++++++++++++
 sysdeps/aarch64/multiarch/mempcpy_falkor.S        | 23 ++++++++++
 sysdeps/aarch64/multiarch/mempcpy_generic.S       | 55 +++++++++++++++++++++++
 sysdeps/aarch64/multiarch/mempcpy_thunderx.S      | 23 ++++++++++
 12 files changed, 273 insertions(+), 4 deletions(-)
 create mode 100644 sysdeps/aarch64/multiarch/mempcpy.c
 create mode 100644 sysdeps/aarch64/multiarch/mempcpy_chk-nonshared.S
 create mode 100644 sysdeps/aarch64/multiarch/mempcpy_chk.c
 create mode 100644 sysdeps/aarch64/multiarch/mempcpy_falkor.S
 create mode 100644 sysdeps/aarch64/multiarch/mempcpy_generic.S
 create mode 100644 sysdeps/aarch64/multiarch/mempcpy_thunderx.S

-- 
2.7.4
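
For context, mempcpy differs from memcpy only in its return value (a
pointer just past the last byte written), which is why it can be
expressed on top of memcpy.  A minimal sketch; sketch_mempcpy is a
hypothetical name, not one of the entry points added by these patches:

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes and return a pointer to the byte following the last
   one written, which is the mempcpy contract.  */
void *sketch_mempcpy (void *dst, const void *src, size_t n)
{
  return (char *) memcpy (dst, src, n) + n;
}
```

Written this way, the copy itself is done by whichever memcpy the
normal symbol resolves to.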

Thread overview: 27+ messages
2017-08-11 11:11 [PATCH 0/2] Multiarch hooks for memcpy variants Wilco Dijkstra
2017-08-11 11:22 ` Siddhesh Poyarekar
2017-08-11 17:58   ` Szabolcs Nagy
2017-08-11 18:06     ` Zack Weinberg
2017-08-11 18:53       ` Siddhesh Poyarekar
2017-08-11 18:55         ` Zack Weinberg
2017-08-14 10:36           ` Wilco Dijkstra
2017-08-14 12:14             ` Siddhesh Poyarekar
2017-08-14 13:20               ` Szabolcs Nagy
2017-08-14 13:29                 ` Siddhesh Poyarekar
2017-08-14 13:52                   ` Szabolcs Nagy
2017-08-14 13:56                     ` Siddhesh Poyarekar
2017-08-15 21:55                       ` Wilco Dijkstra
2017-08-16  4:29                         ` Siddhesh Poyarekar
2017-08-14 13:22               ` Wilco Dijkstra
2017-08-14 13:54                 ` Siddhesh Poyarekar
2017-08-15 20:11                 ` Patrick McGehearty
     [not found]                 ` <a62a7627-1d1e-6a2b-8197-f4b16bfbdcc6@oracle.com>
2017-08-15 20:14                   ` Patrick McGehearty
2017-08-15 21:02                 ` Zack Weinberg
2017-08-15 21:41                   ` Wilco Dijkstra
2017-08-15 22:06                     ` Zack Weinberg
2017-08-16  4:40                       ` Siddhesh Poyarekar
2017-08-16  5:10                   ` Siddhesh Poyarekar
2017-08-15 20:52             ` Zack Weinberg
2017-08-16 12:28               ` Wilco Dijkstra
2017-08-17  1:09                 ` Zack Weinberg
  -- strict thread matches above, loose matches on Subject: below --
2017-08-11  7:14 Siddhesh Poyarekar
