* Gcc builtin review: memcpy, mempcpy, memmove.
From: Ondřej Bílka @ 2015-05-25 12:18 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Andrew, as promised, here is the review of what you would need to fix for
x64 gcc to remove the ugly gcc hacks and replace them with nice headers :-)

I will take the functions in order from string.h and comment on each.
Lose all hope: these are impossible to optimize purely within gcc.

These are the hardest functions to optimize. The optimal implementation
depends on the cpu and the size, and on whether the inputs are likely to be
in L1, L2 or L3 cache, and the relationship is quite chaotic.

With gcc, unless you start doing per-cpu function cloning and optimizing
for each cpu (which would be a good idea if you could convince the
maintainers to accept the size bloat), there will always be some cpu where
you lose.
That is because you have loops A and B where A is 50% faster than B on
cpu 1 but B is 50% faster on cpu 2.
The main weakness on current cpus lies in the 100-1000 byte range. In the
following benchmark the -march=native gcc expansion is outperformed by at
least 10% by the obsolete memcpy_sse2 from glibc-2.13 on sandy_bridge and
haswell. As memcpy_sse2 uses only 8-byte loads, that speaks for itself, and
newer implementations are faster still.
A benchmark to check the builtin expansion follows. Compile a.c with
-mstringop-strategy=libcall and you will see a performance improvement over
gcc's inline expansion. Of course, compile a.c and main.c separately so
that the memcpy cannot be optimized out.
a.c (constant size):

#include <string.h>

void *memcpy2 (char *c, char *s)
{
  return memcpy (c, s, 200);
}

a.c (variable size):

#include <string.h>

extern int n;

void *memcpy2 (char *c, char *s)
{
  return memcpy (c, s, n);
}
main.c:

#include <stdlib.h>

extern void *memcpy2 (char *c, char *s);

int n = 200;

int main (void)
{
  int i;
  char u[40000], v[40000];
  for (i = 0; i < 1000000; i++)
    memcpy2 (u + 16 * (((unsigned int) rand ()) % (1 << 10)),
             v + 16 * (((unsigned int) rand ()) % (1 << 10)));
  return 0;
}
So much for suboptimal strategy selection. The next problem is that the
optimum also depends on profile feedback: even if you fixed the selection
above, it would still be wrong if you know that the input is mostly in L3
cache. For some reason rep movsq is by far the best implementation for
such inputs on almost any intel cpu.
The next problem is code size; I filed a bug about that. Gcc does too
aggressive an expansion:

memcpy (c, s, 120);

gets expanded to a 125-byte sequence of moves. While it may improve peak
performance, most of these expansions are a loss because they lie on cold
paths and can easily add a 60-cycle penalty just for fetching the extra
instructions.
An ugly problem is that up to a certain size gcc needs to expand these
calls for other optimizations to work: one wants to be able to determine
that b[1] is 'b' in the following fragment and do constant propagation
through copied structures. The question is above what size that stops
making sense.

char b[4];
char *a = "abc";
memcpy (b, a, 3);
return b[1];
* Gcc builtin review: memset
From: Ondřej Bílka @ 2015-05-25 12:49 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
First, a quick note that memccpy is ok, as gcc doesn't do anything with it.

memset suffers from the same flaws as memcpy. If you modify the previous
memcpy benchmark to the following, then for size 120 the gcc expansion is
around a 25% regression even against the obsolete glibc-2.13.
a.c:

#include <string.h>

/* Drop-in replacement for the memcpy2 called by main.c above.  */
void *memcpy2 (char *c, char *s)
{
  return memset (c, 0, 120);
}
The size bloat is smaller here.
Then there is the optimization that on some archs you should call
__bzero (s, n) instead of memset (s, 0, n). An effective __bzero consists
of a small prologue that jumps into memset after initialization; it saves
the cycles of creating a mask from c and of passing the additional
argument.

The main disadvantage would be the renaming, as bzero doesn't return
anything, but the x64 implementation does return the same value as memset.
Also, as I mentioned with the ufunc idea, memset is a target for getting
better performance, were it not for the problem that you cannot select by
cpu. Note that on haswell using avx2 intrinsics could improve performance;
how are you going to exploit that with gcc?
* Gcc builtin review: memchr, memrchr, rawmemchr
From: Ondřej Bílka @ 2015-05-25 13:50 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Gcc doesn't do any optimizations with these, and it misses several.

I already submitted that rawmemchr (s, 0) is silly; you should use
s + strlen (s), as strlen is better optimized (AFAIR around 50% faster
than rawmemchr).
Gcc also doesn't optimize memchr with a small constant n, which is possible
but a bit obscure. I could get a performance improvement similar to the
strcmp case; the question is whether it's worth the maintenance. But feel
free to add a gcc optimization for that.
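As a hedged illustration of the kind of expansion I mean for a small
constant n, say n == 3 (this is an assumption about the shape, not anything
gcc or glibc currently emits):

#include <stddef.h>

/* Sketch: memchr (s, c, 3) unrolled into explicit byte compares.  */
static inline void *
memchr_3 (const void *s, int c)
{
  const unsigned char *p = s;
  unsigned char uc = (unsigned char) c;
  if (p[0] == uc) return (void *) (p + 0);
  if (p[1] == uc) return (void *) (p + 1);
  if (p[2] == uc) return (void *) (p + 2);
  return NULL;
}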
Another use case is how wordexp abuses strchr/memchr to test membership in
a set. For example it checks for whitespace with

if (strchr (" \t\r\n", c))

While this would be relatively easy to implement with a header, and would
be a candidate for a ufunc, calculating the table inside gcc would be a
really ugly hack.
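For illustration, a minimal sketch of the table such a header could expand
that test into (the table name is made up; note that strchr also matches
the terminating NUL, so the table has to reproduce that):

/* Sketch only: expansion of  if (strchr (" \t\r\n", c))  */
static const unsigned char ws_or_nul[256] = {
  [0] = 1,                /* strchr finds the terminating NUL too */
  [' '] = 1, ['\t'] = 1, ['\r'] = 1, ['\n'] = 1,
};
/* ... if (ws_or_nul[(unsigned char) c]) ... */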
* Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
From: Ondřej Bílka @ 2015-05-25 13:53 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
I will omit the expansion to memcpy and the flaws from the previous thread.
The first problem is that for

void foo (char *c, char *c2)
{
  stpcpy (c, c2);
}

gcc replaces the stpcpy call with strcpy. One could argue that the opposite
replacement, strcpy by stpcpy, is the faster one.

The reason is register pressure: strcpy needs an extra register to save its
return value, while stpcpy already has the return value in the register it
uses to write the terminating zero.
Also, to decrease instruction cache pressure you should always call stpcpy
and never strcpy. You can't do it the other way around, as you would need
an expensive strlen. The cache pressure matters because strcpy tends to be
used relatively rarely.

A possible advantage of strcpy is that gcc 4.9 started to exploit that it
returns x, saving a register spill in the caller. I don't know how useful
that is.
With stpcpy and strcat there are more optimizations with a higher payoff.
You should exploit that stpcpy returns the end of the string, so gcc should
replace

strcpy (x, y);
return strlen (x);

with

return stpcpy (x, y) - x;
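As complete functions, the same rewrite looks like this (a sketch; the
function names are made up):

#define _GNU_SOURCE
#include <string.h>

/* Before: copies the string and then rescans it.  */
size_t
copy_and_len (char *x, const char *y)
{
  strcpy (x, y);
  return strlen (x);
}

/* After: a single pass, using stpcpy's return value.  */
size_t
copy_and_len_2 (char *x, const char *y)
{
  return (size_t) (stpcpy (x, y) - x);
}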
Then there is the famous problem of the quadratic performance of a strcat
loop. That is something gcc could eliminate by tracking the end of the
string.
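For illustration, here are the classic quadratic pattern and the
end-tracking rewrite (a sketch with made-up names; gcc does not do this
today):

#define _GNU_SOURCE
#include <string.h>

/* Quadratic: every strcat rescans the whole of buf.  */
void
join_slow (char *buf, char **piece, int n)
{
  int i;
  buf[0] = '\0';
  for (i = 0; i < n; i++)
    strcat (buf, piece[i]);
}

/* Linear: keep a pointer to the current end of the string.  */
void
join_fast (char *buf, char **piece, int n)
{
  char *end = buf;
  int i;
  *end = '\0';
  for (i = 0; i < n; i++)
    end = stpcpy (end, piece[i]);
}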
Also there was a discussion on libc-alpha about adding stpcat. That would
also save some end-of-string calculations.
Otherwise strcat looks ok, modulo the memcpy-related problems.
* Re: Gcc builtin review: strcmp, strncmp, memcmp.
From: Ondřej Bílka @ 2015-05-25 17:46 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
As mentioned in the other thread, the expansion of strcmp and strncmp with
a constant argument has terrible performance. Thankfully there is no such
expansion of memcmp.

Then there are strcoll and strxfrm, which are rarely used.
* Gcc builtin review: strdup
From: Ondřej Bílka @ 2015-05-25 17:52 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Again there is a missed optimization. Consider

char *bar (void)
{
  return strdup ("xyz");
}

This could clearly be expanded to

char *bar (void)
{
  char *c = malloc (strlen ("xyz") + 1);
  if (c != NULL)
    memcpy (c, "xyz", strlen ("xyz") + 1);
  return c;
}
One consequence of this missed optimization is that in

int bar (void)
{
  char *c = strdup ("xyz");
  return c[1];
}

gcc cannot determine that c[1] is 'y', while it can in the memcpy case.
* Gcc builtin review: strchr, strrchr, strchrnul
From: Ondřej Bílka @ 2015-05-25 18:11 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Then we have strchr, strrchr and strchrnul.

Aside from the suggestion of replacing every strchr with strchrnul, there
are missed optimizations of strchr (x, 0) and strchrnul (x, 0).

strrchr (x, 0) expands just to strchr (x, 0), which is not enough; it
should do the full expansion to x + strlen (x).
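As a sketch, the expansion I mean, expressed as a header inline (the name
is made up):

#include <string.h>

/* strchr (x, 0), strchrnul (x, 0) and strrchr (x, 0) all locate the
   terminating NUL, so each can be expanded to x + strlen (x).  */
static inline char *
nul_position (const char *x)
{
  return (char *) x + strlen (x);
}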
Then you miss the case where x is constant, as in wordexp.
* Re: Gcc builtin review: strchr, strrchr, strchrnul
From: Ondřej Bílka @ 2015-05-25 19:15 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
On Mon, May 25, 2015 at 02:16:34PM +0200, Ondřej Bílka wrote:
> Then we have strchr, strrchr and strchrnul.
>
> Aside from the suggestion of replacing every strchr with strchrnul, there
> are missed optimizations of strchr (x, 0) and strchrnul (x, 0).
>
> strrchr (x, 0) expands just to strchr (x, 0), which is not enough; it
> should do the full expansion to x + strlen (x).
>
> Then you miss the case where x is constant, as in wordexp.
The second problem is that you want the builtin to evaluate constant string
arguments. Why can gcc optimize away

int foo (void)
{
  return strchr ("foo", 'y') != NULL;
}

but not when strchr is replaced by rawmemchr or strchrnul?

Also, as for efficiency, the correct primitive is strchrnul, and strchr
should be expanded in terms of it; that this doesn't happen is a problem.
* Gcc builtin review: strpbrk, strspn, strcspn
From: Ondřej Bílka @ 2015-05-25 19:28 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
These do a relatively good job.

They would be a typical use case for ufuncs. In most uses the accept
argument is constant, so you want to precompute it as an array: one
implementation builds a table of the accepted characters and then checks
each input character against that table.

But on x64 there is a better implementation, which is the only case where
sse4.2 instructions are useful in a glibc string function. So you need to
resolve whether to expand into the table based on whether the cpu has
sse4.2, which gcc cannot do.
* Gcc builtin review: strstr, strcasestr, memmem
From: Ondřej Bílka @ 2015-05-26 8:00 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Gcc does few optimizations here, just constant arguments.

And why doesn't gcc optimize memmem when the arguments are constant? With a
size-1 needle it could use memchr:

#define _GNU_SOURCE
#include <string.h>

int foo (void)
{
  return memmem ("tauhtatu", 4, "f", 1) != NULL;
}
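The rewrite I have in mind for a 1-byte needle, as a sketch (not something
gcc currently does):

#include <string.h>

/* memmem with a single-byte needle is just memchr over the haystack.  */
static inline void *
memmem_1 (const void *hay, size_t haylen, const unsigned char *needle)
{
  return memchr (hay, needle[0], haylen);
}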
* Re: Gcc builtin review: strstr, strcasestr, memmem
From: Ondřej Bílka @ 2015-05-26 8:55 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
On Mon, May 25, 2015 at 09:09:23PM +0200, Ondřej Bílka wrote:
> Gcc does few optimizations here, just constant arguments.
>
> And why doesn't gcc optimize memmem when the arguments are constant?
> With a size-1 needle it could use memchr:
>
I forgot to add that these could be candidates for the ufunc pattern.
While I added a heuristic to delay the critical factorization as far as
possible, it could be possible to get a faster algorithm when you don't
have to spend 90% of the strstr time recomputing the factorization, which
was previously the case for small haystacks.

This is also quite unsuitable for a gcc pass; as the implementation could
change, it could easily become obsolete and cause problems.
* Re: Gcc builtin review: strfry, basename, strncpy, memfrob
From: Ondřej Bílka @ 2015-05-26 10:45 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
And I was most disappointed by the lack of gcc support for these functions.

Memfrob is an essential security feature, but gcc doesn't optimize it at
all.

Sadly, with strfry gcc could use the opportunity to fix a glibc bug, but it
doesn't.

While gcc optimizes strncpy a bit, it doesn't do it enough. It should
expand to a sequence of stores for sizes of at least 1000. A strncpy is the
best way to copy strings. And what's more, you could also use strncpy as a
memset by passing "" as the second argument.

Also, basename should always be inlined, as it almost always lies on a hot
path followed by opening a file.
* Re: Gcc builtin review: isinf, insnan ...
From: Ondřej Bílka @ 2015-05-26 10:58 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
I raised this issue before but didn't write a patch, so I should do it now.
I will be silent about glibc here, as it shares the same flaw as gcc.

The main problem is that these functions try to be branchless, which causes
a performance regression versus branched code for most applications.

The problem is that a predicted branch is free while a conditional move
always costs a cycle, so you need an unpredictable branch for branchless
code to give a gain. When the branch is 95% predicted, the branchless code
doesn't pay for itself if it adds one cycle versus the branched version,
even with a 20-cycle misprediction penalty (0.05 * 20 = 1 cycle on
average).

And NaN is quite an exceptional value, so these branches will almost always
be predicted. Otherwise the user has other problems: if 5% of his data are
NaNs, the result will likely be garbage anyway.

Then you have the problem that with a modern gcc you likely won't save the
branch anyway. Most of these calls are surrounded by an if, and from
gcc-4.9 on it will optimize that branch out, as it is predicated, which
results in simpler code.

More evidence: I took the assembly of the benchmark below and changed the
conditional move to a jump, which brings performance back up by 10%.

To show this I wrote a simple example of a branched isinf that is around
10% faster than the builtin.
#include <stdlib.h>
#include <stdint.h>

#ifdef BRANCHED
static inline int
isinf (double dx)
{
  union u
  {
    double d;
    uint64_t l;
  };
  union u u;
  u.d = dx;
  uint64_t x = u.l;
  return 2 * x == 0xffe0000000000000ULL
         ? (x == 0x7ff0000000000000ULL ? 1 : -1) : 0;
}
#else
# include <math.h>   /* use the standard isinf, which gcc expands inline */
#endif

int main (void)
{
  int ret = 0;
  int i, j;
  /* Freshly mmapped memory is zero-filled, so no element is Inf.  */
  double *d = malloc (800000);
  for (j = 0; j < 1000000; j++)
    for (i = 0; i < 1000; i++)
      if (__builtin_expect (isinf (d[i]), 0))
        ret += 42;
  return ret;
}
.file "inf.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB2:
.section .text.startup,"ax",@progbits
.LHOTB2:
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movl $800000, %edi
call malloc
movsd .LC0(%rip), %xmm2
leaq 8000(%rax), %rsi
movsd .LC1(%rip), %xmm1
movl $1000000, %edi
.p2align 4,,10
.p2align 3
.L2:
movq %rax, %rdx
.p2align 4,,10
.p2align 3
.L3:
movsd (%rdx), %xmm0
andpd %xmm2, %xmm0
ucomisd %xmm1, %xmm0
ja .LC
leal 42(%rbx), %ebx
.LC:
addq $8, %rdx
cmpq %rsi, %rdx
jne .L3
subl $1, %edi
jne .L2
movl %ebx, %eax
popq %rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.section .text.unlikely
.LCOLDE2:
.section .text.startup
.LHOTE2:
.section .rodata.cst16,"aM",@progbits,16
.align 16
.LC0:
.long 4294967295
.long 2147483647
.long 0
.long 0
.section .rodata.cst8,"aM",@progbits,8
.align 8
.LC1:
.long 4294967295
.long 2146435071
.ident "GCC: (Debian 4.9.2-9) 4.9.2"
.section .note.GNU-stack,"",@progbits
* Re: Gcc builtin review: memchr, memrchr, rawmemchr
From: Ondřej Bílka @ 2015-05-27 6:36 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
On Mon, May 25, 2015 at 12:55:40PM +0200, Ondřej Bílka wrote:
> Gcc doesn't do any optimizations with these, and it misses several.
>
The next optimization is that memchr (s, 0, n) could be converted into a
call to strnlen (s, n), converting the result back into a pointer.
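As a sketch of that conversion (illustrative only; the function name is
made up):

#define _GNU_SOURCE
#include <string.h>

/* memchr (s, 0, n) in terms of strnlen: strnlen returns n when there is
   no NUL in the first n bytes, which maps to memchr's NULL result.  */
static inline void *
memchr_nul (const char *s, size_t n)
{
  size_t len = strnlen (s, n);
  return len == n ? NULL : (void *) (s + len);
}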
* Re: Gcc builtin review: memcpy, mempcpy, memmove.
From: Joseph Myers @ 2015-05-28 16:57 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On Mon, 25 May 2015, Ondřej Bílka wrote:
> Andrew, as promised, here is the review of what you would need to fix
> for x64 gcc to remove the ugly gcc hacks and replace them with nice
> headers :-)
>
> I will take the functions in order from string.h and comment on each.
Reviews of optimization issues with GCC built-in functions belong
primarily in the GCC context, not glibc at all. That is, file one report
in GCC Bugzilla for each issue you find that isn't already reported there
(making sure to give the usual details of GCC version, target, compilation
options etc.). File a meta-bug in GCC Bugzilla depending on all those
bugs (and any pre-existing bugs for issues you found that you didn't
report because they were already reported there). Then just send a
summary to the libc-alpha and gcc lists pointing to that meta-bug.
This applies to most of your reviews of GCC built-in functions - send the
reports where they belong, and any temporary implementations of
optimizations in glibc headers need comments pointing to the GCC bug
reports which point back to those comments. Only where an optimization
involves calling glibc-specific functions are compiler optimizations more
troublesome (and even there, GCC knows when it's compiling for a glibc
target and knows the minimum glibc version supported on the target, so
it's entirely reasonable to agree that e.g. __rawmemchr will remain
available as a public glibc ABI so that GCC can generate code calling it
on glibc targets unless told not to because e.g. generating kernel code).
> The next problem is that the optimum also depends on profile feedback.
Which is clearly better handled in the GCC context (GCC has the
infrastructure to handle profile feedback data from training runs of the
program) than in headers.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
From: Joseph Myers @ 2015-05-28 17:02 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On Mon, 25 May 2015, Ondřej Bílka wrote:
> You should exploit that stpcpy returns the end of the string, so gcc
> should replace
>
> strcpy (x, y);
> return strlen (x);
>
> with
>
> return stpcpy (x, y) - x;
Subject to namespace issues (we could agree that __stpcpy is a stable ABI
- won't become a compat symbol - so that GCC can safely generate calls to
it).
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Gcc builtin review: memcpy, mempcpy, memmove.
From: Jeff Law @ 2015-05-28 17:08 UTC (permalink / raw)
To: Joseph Myers, Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On 05/28/2015 10:36 AM, Joseph Myers wrote:
> On Mon, 25 May 2015, Ondřej Bílka wrote:
>
>> Andrew, as promised, here is the review of what you would need to fix
>> for x64 gcc to remove the ugly gcc hacks and replace them with nice
>> headers :-)
>>
>> I will take the functions in order from string.h and comment on each.
>
> Reviews of optimization issues with GCC built-in functions belong
> primarily in the GCC context, not glibc at all.
Agreed. Trying to address all this stuff in glibc is wrong. Each issue
should be evaluated and a determination made individually whether the
issue should be addressed in gcc or glibc.
Jeff
* Re: Gcc builtin review: memcpy, mempcpy, memmove.
From: Ondřej Bílka @ 2015-06-04 10:25 UTC (permalink / raw)
To: Jeff Law; +Cc: Joseph Myers, libc-alpha, Andrew Pinski
On Thu, May 28, 2015 at 10:41:04AM -0600, Jeff Law wrote:
> On 05/28/2015 10:36 AM, Joseph Myers wrote:
> > On Mon, 25 May 2015, Ondřej Bílka wrote:
> >
> >> Andrew, as promised, here is the review of what you would need to fix
> >> for x64 gcc to remove the ugly gcc hacks and replace them with nice
> >> headers :-)
> >>
> >> I will take the functions in order from string.h and comment on each.
> >
> >Reviews of optimization issues with GCC built-in functions belong
> >primarily in the GCC context, not glibc at all.
> Agreed. Trying to address all this stuff in glibc is wrong. Each
> issue should be evaluated and a determination made individually
> whether the issue should be addressed in gcc or glibc.
>
>
> Jeff
Ok, I will crosspost the issues that need to be addressed. The two main
problems are, first, that a lot of special cases should be handled as
ordinary code, not by adding an extra pass every time you notice that foo
could sometimes be expanded into bar and baz. Second, there are
implementation details, and the correct decision depends on them: for these
optimizations you need to know how fast the different libc functions are,
which is easier to check inside libc than to keep that information in sync
in gcc.
* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
From: Richard Earnshaw @ 2015-06-04 10:34 UTC (permalink / raw)
To: Ondřej Bílka, libc-alpha; +Cc: Andrew Pinski
On 25/05/15 12:45, Ondřej Bílka wrote:
> gcc replaces the stpcpy call with strcpy. One could argue that the
> opposite replacement, strcpy by stpcpy, is the faster one.
>
> The reason is register pressure: strcpy needs an extra register to save
> its return value, while stpcpy already has the return value in the
> register it uses to write the terminating zero.
Depends on your architecture. On aarch64 we have plenty of spare
registers, so strcpy simply copies the destination register into a
scratch. It then doesn't have to carefully calculate the return value
at the end of the function (making the tail code simpler - there are
multiple return statements, but only one entry point).
R.
* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
From: Ondřej Bílka @ 2015-06-04 12:33 UTC (permalink / raw)
To: Richard Earnshaw; +Cc: libc-alpha, Andrew Pinski
On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> On 25/05/15 12:45, Ondřej Bílka wrote:
> > gcc replaces the stpcpy call with strcpy. One could argue that the
> > opposite replacement, strcpy by stpcpy, is the faster one.
> >
> > The reason is register pressure: strcpy needs an extra register to save
> > its return value, while stpcpy already has the return value in the
> > register it uses to write the terminating zero.
>
>
> Depends on your architecture. On aarch64 we have plenty of spare
> registers, so strcpy simply copies the destination register into a
> scratch. It then doesn't have to carefully calculate the return value
> at the end of the function (making the tail code simpler - there are
> multiple return statements, but only one entry point).
>
That's correct; the main saving comes from the return value being in the
first register, which forces an extra copy, and that is suboptimal.

I don't have data on how often strcpy and stpcpy are mixed, and I want to
know whether a few extra cycles are worth it when these aren't called all
that often; I will try to think about how to test this.
* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
From: Richard Earnshaw @ 2015-06-04 13:50 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On 04/06/15 13:28, Ondřej Bílka wrote:
> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>> On 25/05/15 12:45, Ondřej Bílka wrote:
>>> gcc replaces the stpcpy call with strcpy. One could argue that the
>>> opposite replacement, strcpy by stpcpy, is the faster one.
>>>
>>> The reason is register pressure: strcpy needs an extra register to save
>>> its return value, while stpcpy already has the return value in the
>>> register it uses to write the terminating zero.
>>
>>
>> Depends on your architecture. On aarch64 we have plenty of spare
>> registers, so strcpy simply copies the destination register into a
>> scratch. It then doesn't have to carefully calculate the return value
>> at the end of the function (making the tail code simpler - there are
>> multiple return statements, but only one entry point).
>>
> That's correct; the main saving comes from the return value being in the
> first register, which forces an extra copy, and that is suboptimal.
No, look at the AArch64 code. The only time we ever end up with a
simple MOV instruction to copy the register from one location to another
is in the stPcpy code. In strcpy it always ends up folded into some
other operation that we have to do anyway. Once it's been copied to
that other register we never have to use it elsewhere again.
Furthermore, the need to handle smallish copies with overlapping stores
means we need both the original base address /and/ the final result, so
we'd still need to end up saving it for stpcpy.
>
> I don't have data on how often strcpy and stpcpy are mixed, and I want to
> know whether a few extra cycles are worth it when these aren't called all
> that often; I will try to think about how to test this.
>
Well, stpcpy is Posix, while strcpy is part of ISO C. That suggests to
me we're far more likely to see folk using strcpy than stpcpy,
especially when they don't care about the result.
R.
* [RFC] Aarch64: optimize stpcpy a bit.
From: Ondřej Bílka @ 2015-06-04 14:35 UTC (permalink / raw)
To: Richard Earnshaw; +Cc: libc-alpha, Andrew Pinski
On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
> On 04/06/15 13:28, Ondřej Bílka wrote:
> > On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> >> On 25/05/15 12:45, Ondřej Bílka wrote:
> >>> gcc replaces the stpcpy call with strcpy. One could argue that the
> >>> opposite replacement, strcpy by stpcpy, is the faster one.
> >>>
> >>> The reason is register pressure: strcpy needs an extra register to
> >>> save its return value, while stpcpy already has the return value in
> >>> the register it uses to write the terminating zero.
> >>
> >>
> >> Depends on your architecture. On aarch64 we have plenty of spare
> >> registers, so strcpy simply copies the destination register into a
> >> scratch. It then doesn't have to carefully calculate the return value
> >> at the end of the function (making the tail code simpler - there are
> >> multiple return statements, but only one entry point).
> >>
> > That's correct; the main saving comes from the return value being in
> > the first register, which forces an extra copy, and that is suboptimal.
>
> No, look at the AArch64 code. The only time we ever end up with a
> simple MOV instruction to copy the register from one location to another
> is in the stPcpy code. In strcpy it always ends up folded into some
> other operation that we have to do anyway. Once it's been copied to
> that other register we never have to use it elsewhere again.
> Furthermore, the need to handle smallish copies with overlapping stores
> means we need both the original base address /and/ the final result, so
> we'd still need to end up saving it for stpcpy.
>
I wrote too fast; I was referring to the copy you need in the small-size
case. With dest in a different register I could make strcpy and stpcpy cost
the same instructions in most cases, except for sizes 8-15, by adjusting
offsets with some constants.

Also, I think you could remove the extra instructions from the stpcpy loop
with the following, which also removes one instruction from strcpy if I
read the code correctly.
diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
index 28846fb..7ca0412 100644
--- a/sysdeps/aarch64/strcpy.S
+++ b/sysdeps/aarch64/strcpy.S
@@ -204,6 +204,9 @@ L(bulk_entry):
sub to_align, to_align, #16
stp data1, data2, [dstin]
sub src, srcin, to_align
+#ifdef BUILD_STPCPY
+# define dst dstin
+#endif
sub dst, dstin, to_align
b L(entry_no_page_cross)
@@ -243,17 +246,16 @@ L(entry_no_page_cross):
#endif
rev has_nul1, has_nul1
clz pos, has_nul1
- add tmp1, pos, #72
- add pos, pos, #8
+ add tmp1, pos, #64
csel pos, pos, tmp1, ne
add src, src, pos, lsr #3
add dst, dst, pos, lsr #3
- ldp data1, data2, [src, #-32]
- stp data1, data2, [dst, #-16]
+ ldp data1, data2, [src, #-31]
+ stp data1, data2, [dst, #-15]
+ ret
#ifdef BUILD_STPCPY
- sub dstin, dst, #1
+# undef dst
#endif
- ret
L(page_cross):
bic src, srcin, #15
* Re: [RFC] Aarch64: optimize stpcpy a bit.
From: Richard Earnshaw @ 2015-06-04 16:41 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On 04/06/15 15:16, Ondřej Bílka wrote:
> On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
>> On 04/06/15 13:28, Ondřej Bílka wrote:
>>> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>>>> On 25/05/15 12:45, Ondřej Bílka wrote:
>>>>> gcc replaces the stpcpy call with strcpy. One could argue that the
>>>>> opposite replacement, strcpy by stpcpy, is the faster one.
>>>>>
>>>>> The reason is register pressure: strcpy needs an extra register to
>>>>> save its return value, while stpcpy already has the return value in
>>>>> the register it uses to write the terminating zero.
>>>>
>>>>
>>>> Depends on your architecture. On aarch64 we have plenty of spare
>>>> registers, so strcpy simply copies the destination register into a
>>>> scratch. It then doesn't have to carefully calculate the return value
>>>> at the end of the function (making the tail code simpler - there are
>>>> multiple return statements, but only one entry point).
>>>>
>>> That's correct; the main saving comes from the return value being in
>>> the first register, which forces an extra copy, and that is suboptimal.
>>
>> No, look at the AArch64 code. The only time we ever end up with a
>> simple MOV instruction to copy the register from one location to another
>> is in the stPcpy code. In strcpy it always ends up folded into some
>> other operation that we have to do anyway. Once it's been copied to
>> that other register we never have to use it elsewhere again.
>> Furthermore, the need to handle smallish copies with overlapping stores
>> means we need both the original base address /and/ the final result, so
>> we'd still need to end up saving it for stpcpy.
>>
> I wrote too fast; I was referring to the copy you need in the small-size
> case. With dest in a different register I could make strcpy and stpcpy
> cost the same instructions in most cases, except for sizes 8-15, by
> adjusting offsets with some constants.
>
> Also, I think you could remove the extra instructions from the stpcpy
> loop with the following, which also removes one instruction from strcpy
> if I read the code correctly.
>
> diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
> index 28846fb..7ca0412 100644
> --- a/sysdeps/aarch64/strcpy.S
> +++ b/sysdeps/aarch64/strcpy.S
> @@ -204,6 +204,9 @@ L(bulk_entry):
> sub to_align, to_align, #16
> stp data1, data2, [dstin]
> sub src, srcin, to_align
> +#ifdef BUILD_STPCPY
> +# define dst dstin
> +#endif
> sub dst, dstin, to_align
> b L(entry_no_page_cross)
>
> @@ -243,17 +246,16 @@ L(entry_no_page_cross):
> #endif
> rev has_nul1, has_nul1
> clz pos, has_nul1
> - add tmp1, pos, #72
> - add pos, pos, #8
> + add tmp1, pos, #64
> csel pos, pos, tmp1, ne
> add src, src, pos, lsr #3
> add dst, dst, pos, lsr #3
> - ldp data1, data2, [src, #-32]
> - stp data1, data2, [dst, #-16]
> + ldp data1, data2, [src, #-31]
> + stp data1, data2, [dst, #-15]
That's not valid, the offset has to be a multiple of the register size
(8 in this case).
> + ret
> #ifdef BUILD_STPCPY
> - sub dstin, dst, #1
> +# undef dst
Nor is this, dst is already a #define, so this leaves it unspecified.
But since you can't avoid the late subtract, that's irrelevant anyway.
R.
> #endif
> - ret
>
> L(page_cross):
> bic src, srcin, #15
>