Review decision to inline mempcpy to memcpy.

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* Review decision to inline mempcpy to memcpy.
@ 2016-03-03 15:23 Carlos O'Donell
  2016-03-03 20:55 ` H.J. Lu
       [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Carlos O'Donell @ 2016-03-03 15:23 UTC (permalink / raw)
  To: GNU C Library, Ondrej Bilka, Joseph S. Myers, Wilco Dijkstra,
	Jakub Jelinek, Jeff Law

The upstream gcc PR/70055 requests that glibc revert commit
05a910f7b420c2b831f35ba90e61c80f001c0606 and instead work
with gcc to make the builtin mempcpy better (for various
aspects of performance).

The crux of the argument is that the compiler may be able
to do a better job of optimizing if it knows the call was
a mempcpy as opposed to memcpy + addition.

I understand the need of glibc machine maintainers to produce
a library whose performance is as optimal as they can given
the compilers we have today. This leads to decisions like those
we made to transform mempcpy to memcpy + addition.

I also understand the worries that the compiler developers see
in such a transformation. Information has been lost to the
compiler that might generated better code.

Is there a middle ground here? Should machines with mempcpy.S,
particularly x86_64 define _HAVE_STRING_ARCH_mempcpy in their
string.h? I think this question wasn't clearly answered before
the patch went in. However, the microbenchmarks show this is
a clear gain given modern compilers for x86_64. Is this because
x86_64's mempcpy.S is flawed? Does it need to be fixed as
opposed to transforming mempcpy to memcpy+addition?

Ondrej,

Did you file gcc bugs to revuew the optimization issues
with current mempcpy as suggested in [2] and [3]?

Wilco,

Were the changes in glibc to optimize mempcpy as memcpy
originally motivated by performance for ARM?

Particularly because ARM does not have an optimized
mempcpy implementation in glibc?

Cheers,
Carlos.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055
[2] https://sourceware.org/ml/libc-alpha/2015-05/msg00778.html
[3] https://sourceware.org/ml/libc-alpha/2015-05/msg00780.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Review decision to inline mempcpy to memcpy.
  2016-03-03 15:23 Review decision to inline mempcpy to memcpy Carlos O'Donell
@ 2016-03-03 20:55 ` H.J. Lu
  2016-03-03 21:37   ` Carlos O'Donell
       [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com>
  1 sibling, 1 reply; 5+ messages in thread
From: H.J. Lu @ 2016-03-03 20:55 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: GNU C Library, Ondrej Bilka, Joseph S. Myers, Wilco Dijkstra,
	Jakub Jelinek, Jeff Law

On Thu, Mar 3, 2016 at 7:23 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> The upstream gcc PR/70055 requests that glibc revert commit
> 05a910f7b420c2b831f35ba90e61c80f001c0606 and instead work
> with gcc to make the builtin mempcpy better (for various
> aspects of performance).
>
> The crux of the argument is that the compiler may be able
> to do a better job of optimizing if it knows the call was
> a mempcpy as opposed to memcpy + addition.

I opened:

https://sourceware.org/bugzilla/show_bug.cgi?id=19759

> I understand the need of glibc machine maintainers to produce
> a library whose performance is as optimal as they can given
> the compilers we have today. This leads to decisions like those
> we made to transform mempcpy to memcpy + addition.
>
> I also understand the worries that the compiler developers see
> in such a transformation. Information has been lost to the
> compiler that might generated better code.
>
> Is there a middle ground here? Should machines with mempcpy.S,
> particularly x86_64 define _HAVE_STRING_ARCH_mempcpy in their
> string.h? I think this question wasn't clearly answered before
> the patch went in. However, the microbenchmarks show this is
> a clear gain given modern compilers for x86_64. Is this because
> x86_64's mempcpy.S is flawed? Does it need to be fixed as
> opposed to transforming mempcpy to memcpy+addition?
>
> Ondrej,
>
> Did you file gcc bugs to revuew the optimization issues
> with current mempcpy as suggested in [2] and [3]?
>
> Wilco,
>
> Were the changes in glibc to optimize mempcpy as memcpy
> originally motivated by performance for ARM?

Can we make it opt-in?

> Particularly because ARM does not have an optimized
> mempcpy implementation in glibc?

It is 3 lines of code for x86-64:

#ifdef USE_AS_MEMPCPY
        add     %rdx, %rax
#endif

How hard will ARM be?

-- 
H.J.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Review decision to inline mempcpy to memcpy.
  2016-03-03 20:55 ` H.J. Lu
@ 2016-03-03 21:37   ` Carlos O'Donell
  0 siblings, 0 replies; 5+ messages in thread
From: Carlos O'Donell @ 2016-03-03 21:37 UTC (permalink / raw)
  To: H.J. Lu
  Cc: GNU C Library, Ondrej Bilka, Joseph S. Myers, Wilco Dijkstra,
	Jakub Jelinek, Jeff Law

On 03/03/2016 03:55 PM, H.J. Lu wrote:
> On Thu, Mar 3, 2016 at 7:23 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> The upstream gcc PR/70055 requests that glibc revert commit
>> 05a910f7b420c2b831f35ba90e61c80f001c0606 and instead work
>> with gcc to make the builtin mempcpy better (for various
>> aspects of performance).
>>
>> The crux of the argument is that the compiler may be able
>> to do a better job of optimizing if it knows the call was
>> a mempcpy as opposed to memcpy + addition.
> 
> I opened:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=19759

I think x86_64 should define  
#define _HAVE_STRING_ARCH_mempcpy 1 in string.h?

That way the inline is avoided.

HJ, Can you show that doing that gives a speedup?

The point Ondrej makes in his original emails is that
using the inline gave better performance on x86_64.
Should we ignore such findings? Or are the microbenchmarks
we have insufficient to show the overall gains we're
talking about (in terms of the register usage)?

That is to say, should the change be made simply from
first principles that it's a good change, or do we need
a performance justification?

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 5+ messages in thread

[parent not found: <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com>]

* Re: Review decision to inline mempcpy to memcpy.
       [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com>
@ 2016-03-04 16:57   ` Jakub Jelinek
  2016-03-04 20:20   ` Wilco Dijkstra
  1 sibling, 0 replies; 5+ messages in thread
From: Jakub Jelinek @ 2016-03-04 16:57 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Carlos O'Donell, GNU C Library, Ondrej Bilka,
	Joseph S. Myers, Jeff Law

On Fri, Mar 04, 2016 at 04:48:00PM +0000, Wilco Dijkstra wrote:
> > Were the changes in glibc to optimize mempcpy as memcpy
> > originally motivated by performance for ARM?
> 
> OK, so the goal behind this was to provide the best possible out of the box performance
> in GLIBC without requiring all targets to write a lot of assembler code. For less

If people use mempcpy in their code, they do it for a reason.

> > The crux of the argument is that the compiler may be able
> > to do a better job of optimizing if it knows the call was
> > a mempcpy as opposed to memcpy + addition.
> 
> No, unfortunately even GCC6 optimizes memcpy better than mempcpy:

Must be some aarch64 backend issue then.  On x86_64 I get for memcpy (x, y,
32):
        movq    (%rsi), %rdx
        movq    %rdi, %rax
        movq    %rdx, (%rdi)
        movq    8(%rsi), %rdx
        movq    %rdx, 8(%rdi)
        movq    16(%rsi), %rdx
        movq    %rdx, 16(%rdi)
        movq    24(%rsi), %rdx
        movq    %rdx, 24(%rdi)
        ret
and for mempcpy (x, y, 32):
        movq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        leaq    32(%rdi), %rax
        ret

> If we can get GCC to do the right thing depending of the preference of the target and

That is wrong, the preference isn't cast in stone, but you force it into the
apps.

	Jakub

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Review decision to inline mempcpy to memcpy.
       [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com>
  2016-03-04 16:57   ` Jakub Jelinek
@ 2016-03-04 20:20   ` Wilco Dijkstra
  1 sibling, 0 replies; 5+ messages in thread
From: Wilco Dijkstra @ 2016-03-04 20:20 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, Ondrej Bilka,
	Joseph S. Myers, Jakub Jelinek, Jeff Law
  Cc: nd

Hi,

(resend to post to GLIBC list too)

> Were the changes in glibc to optimize mempcpy as memcpy
> originally motivated by performance for ARM?

OK, so the goal behind this was to provide the best possible out of the box performance
in GLIBC without requiring all targets to write a lot of assembler code. For less
frequently used functions which are identical to a standard function but with a different
return value the most obvious and efficient implementation is to inline a call to the most
commonly used version. There are several good reasons to do this for mempcpy:

1. Few targets implement mempcpy.S, so currently use the slow mempcpy.c veneer.
2. On most targets merging mempcpy into memcpy looks impossible without
   slowing down memcpy as a result.
3. Adding a separate mempcpy.S implementation increases I-cache pressure as
   now you need to load mempcpy too even if memcpy is already resident in L1/L2 cache.
4. GCC doesn't optimize/inline mempcpy as well as it does memcpy (see below)

> The crux of the argument is that the compiler may be able
> to do a better job of optimizing if it knows the call was
> a mempcpy as opposed to memcpy + addition.

No, unfortunately even GCC6 optimizes memcpy better than mempcpy:

return __builtin_memcpy(x, y, 32);

        ldp     x4, x5, [x1]
        stp     x4, x5, [x0]
        ldp     x4, x5, [x1, 16]
        stp     x4, x5, [x0, 16]
        ret

return __builtin_mempcpy(x, y, 32);

        mov     x2, 32
        b       mempcpy

return mempcpy(x, y, 32);  // using GLIBC2.23 inline

        mov     x2, x0
        add     x0, x0, 32
        ldp     x4, x5, [x1]
        stp     x4, x5, [x2]
        ldp     x4, x5, [x1, 16]
        stp     x4, x5, [x2, 16]
        ret

So the only case where I can see a clear win for mempcpy is if you can do a good
merged implementation and GCC is fixed to optimize mempcpy always in exactly the
same way as memcpy. In that case just define _HAVE_STRING_ARCH_mempcpy.

If we can get GCC to do the right thing depending of the preference of the target and
library then things would be perfect.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-03-04 20:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-03 15:23 Review decision to inline mempcpy to memcpy Carlos O'Donell
2016-03-03 20:55 ` H.J. Lu
2016-03-03 21:37   ` Carlos O'Donell
     [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com>
2016-03-04 16:57   ` Jakub Jelinek
2016-03-04 20:20   ` Wilco Dijkstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).