* Review decision to inline mempcpy to memcpy. @ 2016-03-03 15:23 Carlos O'Donell 2016-03-03 20:55 ` H.J. Lu [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com> 0 siblings, 2 replies; 5+ messages in thread From: Carlos O'Donell @ 2016-03-03 15:23 UTC (permalink / raw) To: GNU C Library, Ondrej Bilka, Joseph S. Myers, Wilco Dijkstra, Jakub Jelinek, Jeff Law The upstream gcc PR/70055 requests that glibc revert commit 05a910f7b420c2b831f35ba90e61c80f001c0606 and instead work with gcc to make the builtin mempcpy better (for various aspects of performance). The crux of the argument is that the compiler may be able to do a better job of optimizing if it knows the call was a mempcpy as opposed to memcpy + addition. I understand the need of glibc machine maintainers to produce a library whose performance is as optimal as they can given the compilers we have today. This leads to decisions like those we made to transform mempcpy to memcpy + addition. I also understand the worries that the compiler developers see in such a transformation. Information has been lost to the compiler that might generated better code. Is there a middle ground here? Should machines with mempcpy.S, particularly x86_64 define _HAVE_STRING_ARCH_mempcpy in their string.h? I think this question wasn't clearly answered before the patch went in. However, the microbenchmarks show this is a clear gain given modern compilers for x86_64. Is this because x86_64's mempcpy.S is flawed? Does it need to be fixed as opposed to transforming mempcpy to memcpy+addition? Ondrej, Did you file gcc bugs to revuew the optimization issues with current mempcpy as suggested in [2] and [3]? Wilco, Were the changes in glibc to optimize mempcpy as memcpy originally motivated by performance for ARM? Particularly because ARM does not have an optimized mempcpy implementation in glibc? Cheers, Carlos. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055 [2] https://sourceware.org/ml/libc-alpha/2015-05/msg00778.html [3] https://sourceware.org/ml/libc-alpha/2015-05/msg00780.html ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Review decision to inline mempcpy to memcpy. 2016-03-03 15:23 Review decision to inline mempcpy to memcpy Carlos O'Donell @ 2016-03-03 20:55 ` H.J. Lu 2016-03-03 21:37 ` Carlos O'Donell [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com> 1 sibling, 1 reply; 5+ messages in thread From: H.J. Lu @ 2016-03-03 20:55 UTC (permalink / raw) To: Carlos O'Donell Cc: GNU C Library, Ondrej Bilka, Joseph S. Myers, Wilco Dijkstra, Jakub Jelinek, Jeff Law On Thu, Mar 3, 2016 at 7:23 AM, Carlos O'Donell <carlos@redhat.com> wrote: > The upstream gcc PR/70055 requests that glibc revert commit > 05a910f7b420c2b831f35ba90e61c80f001c0606 and instead work > with gcc to make the builtin mempcpy better (for various > aspects of performance). > > The crux of the argument is that the compiler may be able > to do a better job of optimizing if it knows the call was > a mempcpy as opposed to memcpy + addition. I opened: https://sourceware.org/bugzilla/show_bug.cgi?id=19759 > I understand the need of glibc machine maintainers to produce > a library whose performance is as optimal as they can given > the compilers we have today. This leads to decisions like those > we made to transform mempcpy to memcpy + addition. > > I also understand the worries that the compiler developers see > in such a transformation. Information has been lost to the > compiler that might generated better code. > > Is there a middle ground here? Should machines with mempcpy.S, > particularly x86_64 define _HAVE_STRING_ARCH_mempcpy in their > string.h? I think this question wasn't clearly answered before > the patch went in. However, the microbenchmarks show this is > a clear gain given modern compilers for x86_64. Is this because > x86_64's mempcpy.S is flawed? Does it need to be fixed as > opposed to transforming mempcpy to memcpy+addition? > > Ondrej, > > Did you file gcc bugs to revuew the optimization issues > with current mempcpy as suggested in [2] and [3]? > > Wilco, > > Were the changes in glibc to optimize mempcpy as memcpy > originally motivated by performance for ARM? Can we make it opt-in? > Particularly because ARM does not have an optimized > mempcpy implementation in glibc? It is 3 lines of code for x86-64: #ifdef USE_AS_MEMPCPY add %rdx, %rax #endif How hard will ARM be? -- H.J. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Review decision to inline mempcpy to memcpy. 2016-03-03 20:55 ` H.J. Lu @ 2016-03-03 21:37 ` Carlos O'Donell 0 siblings, 0 replies; 5+ messages in thread From: Carlos O'Donell @ 2016-03-03 21:37 UTC (permalink / raw) To: H.J. Lu Cc: GNU C Library, Ondrej Bilka, Joseph S. Myers, Wilco Dijkstra, Jakub Jelinek, Jeff Law On 03/03/2016 03:55 PM, H.J. Lu wrote: > On Thu, Mar 3, 2016 at 7:23 AM, Carlos O'Donell <carlos@redhat.com> wrote: >> The upstream gcc PR/70055 requests that glibc revert commit >> 05a910f7b420c2b831f35ba90e61c80f001c0606 and instead work >> with gcc to make the builtin mempcpy better (for various >> aspects of performance). >> >> The crux of the argument is that the compiler may be able >> to do a better job of optimizing if it knows the call was >> a mempcpy as opposed to memcpy + addition. > > I opened: > > https://sourceware.org/bugzilla/show_bug.cgi?id=19759 I think x86_64 should define #define _HAVE_STRING_ARCH_mempcpy 1 in string.h? That way the inline is avoided. HJ, Can you show that doing that gives a speedup? The point Ondrej makes in his original emails is that using the inline gave better performance on x86_64. Should we ignore such findings? Or are the microbenchmarks we have insufficient to show the overall gains we're talking about (in terms of the register usage)? That is to say, should the change be made simply from first principles that it's a good change, or do we need a performance justification? Cheers, Carlos. ^ permalink raw reply [flat|nested] 5+ messages in thread
[parent not found: <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com>]
* Re: Review decision to inline mempcpy to memcpy. [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com> @ 2016-03-04 16:57 ` Jakub Jelinek 2016-03-04 20:20 ` Wilco Dijkstra 1 sibling, 0 replies; 5+ messages in thread From: Jakub Jelinek @ 2016-03-04 16:57 UTC (permalink / raw) To: Wilco Dijkstra Cc: Carlos O'Donell, GNU C Library, Ondrej Bilka, Joseph S. Myers, Jeff Law On Fri, Mar 04, 2016 at 04:48:00PM +0000, Wilco Dijkstra wrote: > > Were the changes in glibc to optimize mempcpy as memcpy > > originally motivated by performance for ARM? > > OK, so the goal behind this was to provide the best possible out of the box performance > in GLIBC without requiring all targets to write a lot of assembler code. For less If people use mempcpy in their code, they do it for a reason. > > The crux of the argument is that the compiler may be able > > to do a better job of optimizing if it knows the call was > > a mempcpy as opposed to memcpy + addition. > > No, unfortunately even GCC6 optimizes memcpy better than mempcpy: Must be some aarch64 backend issue then. On x86_64 I get for memcpy (x, y, 32): movq (%rsi), %rdx movq %rdi, %rax movq %rdx, (%rdi) movq 8(%rsi), %rdx movq %rdx, 8(%rdi) movq 16(%rsi), %rdx movq %rdx, 16(%rdi) movq 24(%rsi), %rdx movq %rdx, 24(%rdi) ret and for mempcpy (x, y, 32): movq (%rsi), %rax movq %rax, (%rdi) movq 8(%rsi), %rax movq %rax, 8(%rdi) movq 16(%rsi), %rax movq %rax, 16(%rdi) movq 24(%rsi), %rax movq %rax, 24(%rdi) leaq 32(%rdi), %rax ret > If we can get GCC to do the right thing depending of the preference of the target and That is wrong, the preference isn't cast in stone, but you force it into the apps. Jakub ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Review decision to inline mempcpy to memcpy. [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com> 2016-03-04 16:57 ` Jakub Jelinek @ 2016-03-04 20:20 ` Wilco Dijkstra 1 sibling, 0 replies; 5+ messages in thread From: Wilco Dijkstra @ 2016-03-04 20:20 UTC (permalink / raw) To: Carlos O'Donell, GNU C Library, Ondrej Bilka, Joseph S. Myers, Jakub Jelinek, Jeff Law Cc: nd Hi, (resend to post to GLIBC list too) > Were the changes in glibc to optimize mempcpy as memcpy > originally motivated by performance for ARM? OK, so the goal behind this was to provide the best possible out of the box performance in GLIBC without requiring all targets to write a lot of assembler code. For less frequently used functions which are identical to a standard function but with a different return value the most obvious and efficient implementation is to inline a call to the most commonly used version. There are several good reasons to do this for mempcpy: 1. Few targets implement mempcpy.S, so currently use the slow mempcpy.c veneer. 2. On most targets merging mempcpy into memcpy looks impossible without slowing down memcpy as a result. 3. Adding a separate mempcpy.S implementation increases I-cache pressure as now you need to load mempcpy too even if memcpy is already resident in L1/L2 cache. 4. GCC doesn't optimize/inline mempcpy as well as it does memcpy (see below) > The crux of the argument is that the compiler may be able > to do a better job of optimizing if it knows the call was > a mempcpy as opposed to memcpy + addition. No, unfortunately even GCC6 optimizes memcpy better than mempcpy: return __builtin_memcpy(x, y, 32); ldp x4, x5, [x1] stp x4, x5, [x0] ldp x4, x5, [x1, 16] stp x4, x5, [x0, 16] ret return __builtin_mempcpy(x, y, 32); mov x2, 32 b mempcpy return mempcpy(x, y, 32); // using GLIBC2.23 inline mov x2, x0 add x0, x0, 32 ldp x4, x5, [x1] stp x4, x5, [x2] ldp x4, x5, [x1, 16] stp x4, x5, [x2, 16] ret So the only case where I can see a clear win for mempcpy is if you can do a good merged implementation and GCC is fixed to optimize mempcpy always in exactly the same way as memcpy. In that case just define _HAVE_STRING_ARCH_mempcpy. If we can get GCC to do the right thing depending of the preference of the target and library then things would be perfect. Cheers, Wilco ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2016-03-04 20:20 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-03-03 15:23 Review decision to inline mempcpy to memcpy Carlos O'Donell 2016-03-03 20:55 ` H.J. Lu 2016-03-03 21:37 ` Carlos O'Donell [not found] ` <AM3PR08MB0088D8CBEE224AA54E620F6983BE0@AM3PR08MB0088.eurprd08.prod.outlook.com> 2016-03-04 16:57 ` Jakub Jelinek 2016-03-04 20:20 ` Wilco Dijkstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).