* Gcc builtin review: memcpy, mempcpy, memmove.
@ 2015-05-25 12:18 Ondřej Bílka
2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
` (2 more replies)
0 siblings, 3 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 12:18 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Andrew, as promised, here is the review of what you would need to fix in x64 gcc to remove the ugly gcc hacks and replace them with nice headers :-)

I will take the functions in order from string.h and comment on each.

Lose all hope: these are impossible to optimize purely within gcc.

These are the hardest functions to optimize. An optimal implementation depends on the cpu and the size, and on whether inputs are likely to be in the L1, L2 or L3 cache, and the relationship is quite chaotic.

With gcc, unless you start doing per-cpu function cloning and optimize for each cpu (which would be a good idea if you could convince maintainers to accept the size bloat), there will always be some cpu where you lose. That is because there are loops A and B where A is 50% faster than B on cpu 1 but B is 50% faster on cpu 2.

The main weakness on current cpus lies in the 100-1000 byte range. In the benchmark below, the -march=native gcc expansion is outperformed by at least 10% by the obsolete memcpy_sse2 from glibc-2.13 on sandy_bridge and haswell. Given that memcpy_sse2 uses only 8-byte loads, that speaks for itself; newer implementations are faster still.

A benchmark to check the inline expansion follows. Compile a.c with -mstringop-strategy=libcall and you will see a performance improvement. Of course, compile a.c and main.c separately so the memcpy is not optimized out. (a.c below shows two variants of memcpy2, constant size and run-time size; build one of them at a time.)
a.c:

#include <string.h>

/* variant 1: constant size, expanded inline by gcc */
char *memcpy2 (char *c, char *s)
{
  return memcpy (c, s, 200);
}

/* variant 2: run-time size, always a library call; n is defined in main.c */
extern int n;
char *memcpy2 (char *c, char *s)
{
  return memcpy (c, s, n);
}
main.c:

#include <stdlib.h>

int n = 200;
char *memcpy2 (char *c, char *s);

int main (void)
{
  int i;
  char u[40000], v[40000];
  for (i = 0; i < 1000000; i++)
    memcpy2 (u + 16 * ((unsigned int) rand ()) % (1 << 10),
             v + 16 * ((unsigned int) rand ()) % (1 << 10));
  return 0;
}
So much for suboptimal strategy selection.

The next problem is that the optimum also depends on profile feedback. Even if you fixed the selection above, it would still be wrong when you know the input is mostly in the L3 cache; for some reason rep movsq is by far the best implementation for such inputs on almost any intel cpu.

Then the next problem is code size; I filed a bug about that. Gcc expands too aggressively:

memcpy (c, s, 120);

gets expanded to a 125-byte sequence of movs. While that may improve performance in a benchmark, most of these expansions are a loss, as they lie on cold paths and can easily introduce an extra 60-cycle penalty for fetching the instructions.

An ugly problem is that up to a certain size gcc needs to expand these for other optimizations to work: one wants to be able to determine that b[1] is 'b' in the following fragment and do constant propagation of copied structures. The question is beyond what size that stops making sense.
char b[4];
char *a = "abc";
memcpy (b, a, 3);
return b[1]; /* should constant-fold to 'b' */
* Gcc builtin review: memset
2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
@ 2015-05-25 12:49 ` Ondřej Bílka
2015-05-25 13:50 ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
2 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 12:49 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
First, a quick note that memccpy is ok, as gcc doesn't do anything with it.

Memset suffers from the same flaws as memcpy. If you modify the previous memcpy benchmark as follows, then for size 120 you get a performance regression of around 25% even against the obsolete glibc-2.13 implementation.
a.c:

#include <string.h>

/* reuses the harness from the memcpy benchmark; s is unused here */
char *memcpy2 (char *c, char *s)
{
  return memset (c, 0, 120);
}
The size bloat is smaller here.

Then there is the optimization that on some archs you should call __bzero (s, n) instead of memset (s, 0, n). An effective __bzero consists of a small prologue that jumps into memset after initialization. It saves the cycles spent creating the mask from c and passing the additional argument.

The main disadvantage would be the renaming, as bzero doesn't return anything, although the x64 implementation does return the same value as memset.
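A minimal sketch of the kind of header redirect meant here, assuming __bzero is actually exported with this prototype (on many targets it is an internal symbol, so treat this as an illustration only):

#include <string.h>

extern void __bzero (void *s, size_t n);

/* Redirect memset (s, 0, n) with a constant zero to __bzero; the inner
   memset call is not re-expanded because a function-like macro does not
   expand recursively.  */
#define memset(s, c, n)                     \
  (__builtin_constant_p (c) && (c) == 0     \
   ? (__bzero ((s), (n)), (void *) (s))     \
   : memset ((s), (c), (n)))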
Also, as I mentioned for ufuncs, memset is a target for getting better performance without the problem that you cannot select the cpu. Note that on haswell using avx2 intrinsics could improve performance; how are you going to exploit that with gcc?
* Gcc builtin review: memchr, memrchr, rawmemchr
2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
@ 2015-05-25 13:50 ` Ondřej Bílka
2015-05-27 6:36 ` Ondřej Bílka
0 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 13:50 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Gcc doesn't do any optimizations with these, and it misses several.

I already submitted that rawmemchr (s, 0) is silly; you should use s + strlen (s) instead, as strlen is better optimized (AFAIR around 50% faster than rawmemchr).
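Spelled out, the rewrite is simply the following (a sketch; the wrapper name is made up):

#include <string.h>

/* what gcc could emit for rawmemchr (s, 0) */
static inline char *
rawmemchr_zero (const char *s)
{
  return (char *) s + strlen (s);
}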
Gcc also doesn't optimize memchr with a small constant n, which is possible but a bit obscure. I could get a performance improvement similar to the strcmp case; the question is whether it's worth the maintenance. But feel free to add a gcc optimization for that.
Another use case is how wordexp abuses strchr/memchr to test membership in a set. For example it checked for whitespace with

if (strchr (" \t\r\n", c))

While this would be relatively easy to implement with a header, and it would be a candidate for a ufunc, computing the table inside gcc would be a really ugly hack.
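A sketch of the table-based membership test meant here; the table name and layout are made up for illustration:

/* precomputed membership table for the constant set " \t\r\n" */
static const unsigned char is_sep[256] = {
  [' '] = 1, ['\t'] = 1, ['\r'] = 1, ['\n'] = 1
};

static inline int
in_sep_set (char c)
{
  return is_sep[(unsigned char) c];
}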
* Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
@ 2015-05-25 13:53 ` Ondřej Bílka
2015-05-25 17:46 ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
` (6 more replies)
2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
2 siblings, 7 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 13:53 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
I will omit the expansion to memcpy and the flaws from the previous thread.

The first problem is that for

#include <string.h>

void foo (char *c, char *c2)
{
  stpcpy (c, c2);
}

gcc replaces the stpcpy with strcpy. One could argue that the opposite replacement, strcpy with stpcpy, is the faster one.

The reason is register pressure. Strcpy needs an extra register to save the return value, while stpcpy already has its return value in the register used for writing the terminating zero.

Also, to decrease instruction cache pressure you should always call stpcpy and never strcpy. You can't do it the other way around, as that would need an expensive strlen. The cache pressure is a problem because strcpy tends to be used relatively rarely.

A possible advantage of strcpy is that gcc-4.9 started to exploit that it returns its first argument, saving a register spill in the caller. I don't know how useful that is.
With stpcpy and strcat there are more optimizations with a higher payoff. You should exploit that stpcpy returns the end of the string, so gcc should replace

strcpy (x, y);
return strlen (x);

with

return stpcpy (x, y) - x;
Then there is the famous problem of the quadratic performance of a strcat loop. That is something gcc could eliminate by tracking the end of the string, as in the sketch below.
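For illustration, the classic quadratic pattern and the end-pointer rewrite (a sketch; the function names and arguments are made up):

#include <string.h>

/* quadratic: every strcat call rescans dst from the beginning */
static void
join_slow (char *dst, char **piece, int n)
{
  int i;
  dst[0] = '\0';
  for (i = 0; i < n; i++)
    strcat (dst, piece[i]);
}

/* linear: keep the end of the string in p and append with stpcpy */
static void
join_fast (char *dst, char **piece, int n)
{
  int i;
  char *p = dst;
  *p = '\0';
  for (i = 0; i < n; i++)
    p = stpcpy (p, piece[i]);
}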
Also there was a discussion on libc-alpha about adding stpcat. That would also save some end-of-string calculations.

Otherwise strcat looks ok, modulo the memcpy-related problems.
* Re: Gcc builtin review: strcmp, strncmp, memcmp.
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
@ 2015-05-25 17:46 ` Ondřej Bílka
2015-05-25 19:28 ` Gcc builtin review: strpbrk, strspn, strcspn Ondřej Bílka
2015-05-25 17:52 ` Gcc builtin review: strdup Ondřej Bílka
` (5 subsequent siblings)
6 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 17:46 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
As mentioned in the other thread, the expansion of strcmp and strncmp with a constant argument has terrible performance. Thankfully there is no such optimization of memcmp.

Then there are strcoll and strxfrm, which are rarely used.
* Gcc builtin review: strdup
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
2015-05-25 17:46 ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
@ 2015-05-25 17:52 ` Ondřej Bílka
2015-05-25 18:11 ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
` (4 subsequent siblings)
6 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 17:52 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Again there is a missed optimization. Consider

#include <string.h>
#include <stdlib.h>

char *bar (void)
{
  return strdup ("xyz");
}

This could clearly be expanded to

char *bar (void)
{
  char *c = malloc (strlen ("xyz") + 1);
  memcpy (c, "xyz", strlen ("xyz") + 1);
  return c;
}

One of the consequences of this missed optimization is that in

int bar (void)
{
  char *c = strdup ("xyz");
  return c[1];
}

gcc cannot determine that c[1] is 'y', while it can in the memcpy case.
* Gcc builtin review: strchr, strrchr, strchrnul
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
2015-05-25 17:46 ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
2015-05-25 17:52 ` Gcc builtin review: strdup Ondřej Bílka
@ 2015-05-25 18:11 ` Ondřej Bílka
2015-05-25 19:15 ` Ondřej Bílka
2015-05-26 8:00 ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
` (3 subsequent siblings)
6 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 18:11 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Then we have strchr, strrchr, strchrnul.

Aside from the suggestion of replacing every strchr with strchrnul, there are missed optimizations of strchr(x,0) and strchrnul(x,0).

A strrchr(x,0) expands just to strchr(x,0), which is not enough: it should do the full expansion to x+strlen(x).

Then you miss the case when x is constant, as in wordexp.
* Re: Gcc builtin review: strchr, strrchr, strchrnul
2015-05-25 18:11 ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
@ 2015-05-25 19:15 ` Ondřej Bílka
2015-05-26 10:58 ` Gcc builtin review: isinf, insnan Ondřej Bílka
0 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 19:15 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
On Mon, May 25, 2015 at 02:16:34PM +0200, Ondřej Bílka wrote:
> Then we have strchr, strrchr, strchrnul.
>
> Aside from suggestion of replacing every strchr with strchrnul there are
> missed optimizations of strchr(x,0) and strchrnul(x,0).
>
> A strrchr(x,0) expands just to strchr(x,0) which is not enough as it
> should do full expansion to x+strlen(x).
>
> Then you miss case when x is constant as in wordexp.
The second problem is that you want the builtin to evaluate constant string arguments. Why can gcc optimize away

int foo (void)
{
  return strchr ("foo", 'y') != NULL;
}

but not when strchr is replaced by rawmemchr or strchrnul?

Also, as for efficiency, the correct primitive is strchrnul and strchr should be expanded into it; that this doesn't happen is a problem.
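For reference, expressing strchr in terms of the strchrnul primitive is trivial (a sketch):

#define _GNU_SOURCE
#include <string.h>

/* strchr built on strchrnul; also correct for c == 0 */
static inline char *
strchr_via_strchrnul (const char *s, int c)
{
  char *r = strchrnul (s, c);
  return *r == (char) c ? r : NULL;
}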
* Gcc builtin review: strpbrk, strspn, strcspn
2015-05-25 17:46 ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
@ 2015-05-25 19:28 ` Ondřej Bílka
0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 19:28 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
These do a relatively good job.

They would be a typical use case for ufuncs. In most use cases the accept argument is constant, so you want to precompute it as an array: one implementation computes a table of the accepted characters and then checks whether each character is in the table (see the sketch below).
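For a constant accept string the precomputation looks roughly like this (a sketch for accept = "0123456789"; the names are illustrative):

#include <stddef.h>

/* table of accepted characters, filled at compile time */
static const unsigned char in_accept[256] = {
  ['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1, ['4'] = 1,
  ['5'] = 1, ['6'] = 1, ['7'] = 1, ['8'] = 1, ['9'] = 1
};

/* equivalent of strspn (s, "0123456789") */
static inline size_t
strspn_digits (const char *s)
{
  size_t n = 0;
  while (in_accept[(unsigned char) s[n]])
    n++;
  return n;
}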
But on x64 there is a better implementation, which is the only case where the sse4.2 instructions are useful in a glibc string function. So you need to resolve whether to expand the table based on whether the cpu has sse4.2, which gcc cannot do.
* Gcc builtin review: strstr, strcasestr, memmem
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
` (2 preceding siblings ...)
2015-05-25 18:11 ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
@ 2015-05-26 8:00 ` Ondřej Bílka
2015-05-26 8:55 ` Ondřej Bílka
2015-05-26 10:45 ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
` (2 subsequent siblings)
6 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-26 8:00 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
Gcc does little optimization here, just folding constant arguments.

And why doesn't gcc optimize memmem below? The arguments are constant and with a size-1 needle it could use memchr.
#define _GNU_SOURCE
#include <string.h>

int foo (void)
{
  return memmem ("tauhtatu", 4, "f", 1) != NULL;
}
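The single-byte-needle rewrite amounts to this (a sketch; the wrapper name is made up):

#define _GNU_SOURCE
#include <string.h>

/* memmem (h, hlen, "f", 1) rewritten via memchr */
static inline void *
memmem_one_byte (const void *h, size_t hlen)
{
  return memchr (h, 'f', hlen);
}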
* Re: Gcc builtin review: strstr, strcasestr, memmem
2015-05-26 8:00 ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
@ 2015-05-26 8:55 ` Ondřej Bílka
0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-26 8:55 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
On Mon, May 25, 2015 at 09:09:23PM +0200, Ondřej Bílka wrote:
> A gcc does little optimizations, just constant arguments.
>
> And why gcc doesnt optimize memmem as arguments are constant and with
> size 1 needle it could use memchr?
>
Forgot to add that these could be candidates for the ufunc pattern. While I added a heuristic to delay the critical factorization as far as possible, it might be possible to get a faster algorithm when you don't have to spend 90% of the strstr time recomputing the factorization, which was previously the case for small haystacks.

This is also quite unsuitable for a gcc pass: as the implementation could change, such a pass could easily become obsolete and cause problems.
* Re: Gcc builtin review: strfry, basename, strncpy, memfrob
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
` (3 preceding siblings ...)
2015-05-26 8:00 ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
@ 2015-05-26 10:45 ` Ondřej Bílka
2015-05-28 17:02 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
2015-06-04 10:34 ` Richard Earnshaw
6 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-26 10:45 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
And I was most disappointed by the lack of gcc support for these functions.

Memfrob is an essential security feature but gcc doesn't optimize it at all. Sadly, with strfry gcc could use the opportunity to fix a glibc bug, but it doesn't.

While gcc optimizes strncpy a bit, it doesn't do so nearly enough. It should expand it to a sequence of stores for sizes up to at least 1000. A strncpy is the best way to copy strings. And what's more, you could also use strncpy as a memset by passing "" as the second argument.

Also, basename should always be inlined, as it almost always lies on a hot path followed by opening a file.
* Re: Gcc builtin review: isinf, insnan ...
2015-05-25 19:15 ` Ondřej Bílka
@ 2015-05-26 10:58 ` Ondřej Bílka
0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-26 10:58 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
I raised this issue before but didn't write a patch, so I should do it now. I will be silent about glibc here, as it shares the same flaw as gcc.

The main problem is that these functions try to be branchless, which causes a performance regression for most applications compared to branched code.

The issue is that a predicted branch is free while a conditional move or store always costs a cycle. So you need an unpredictable branch for branchless code to give a gain. When a branch is 95% predicted, the branchless code doesn't pay for itself if it adds one cycle versus the branched version and a misprediction costs 20 cycles.

And NaN is quite an exceptional value, so these branches will almost always be predicted. Otherwise the user has other problems: if 5% of his data are NaNs then the result will likely be garbage anyway.

Then there is the problem that with a modern gcc you likely won't save a branch anyway. Most of these functions are surrounded by an if, and from gcc-4.9 on it will optimize that branch away as it gets if-converted, which results in simpler code.

More evidence for this: I took the assembly of the benchmark below and changed the conditional move to a jump, which improved performance by around 10%.

To show this I wrote a simple example of a branched isinf that is around 10% faster than the builtin.
#include <stdlib.h>

#ifdef BRANCHED
static inline int
isinf (double dx)
{
  union u {
    double d;
    long l;
  };
  union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 : -1) : 0;
}
#else
# include <math.h>
#endif

int main (void)
{
  int ret = 0;
  int i, j;
  double *d = calloc (100000, sizeof (double));
  for (j = 0; j < 1000000; j++)
    for (i = 0; i < 1000; i++)
      if (__builtin_expect (isinf (d[i]), 0))
        ret += 42;
  return ret;
}
.file "inf.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB2:
.section .text.startup,"ax",@progbits
.LHOTB2:
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movl $800000, %edi
call malloc
movsd .LC0(%rip), %xmm2
leaq 8000(%rax), %rsi
movsd .LC1(%rip), %xmm1
movl $1000000, %edi
.p2align 4,,10
.p2align 3
.L2:
movq %rax, %rdx
.p2align 4,,10
.p2align 3
.L3:
movsd (%rdx), %xmm0
andpd %xmm2, %xmm0
ucomisd %xmm1, %xmm0
ja .LC
leal 42(%rbx), %ebx
.LC:
addq $8, %rdx
cmpq %rsi, %rdx
jne .L3
subl $1, %edi
jne .L2
movl %ebx, %eax
popq %rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.section .text.unlikely
.LCOLDE2:
.section .text.startup
.LHOTE2:
.section .rodata.cst16,"aM",@progbits,16
.align 16
.LC0:
.long 4294967295
.long 2147483647
.long 0
.long 0
.section .rodata.cst8,"aM",@progbits,8
.align 8
.LC1:
.long 4294967295
.long 2146435071
.ident "GCC: (Debian 4.9.2-9) 4.9.2"
.section .note.GNU-stack,"",@progbits
* Re: Gcc builtin review: memchr, memrchr, rawmemchr
2015-05-25 13:50 ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
@ 2015-05-27 6:36 ` Ondřej Bílka
0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-27 6:36 UTC (permalink / raw)
To: libc-alpha; +Cc: Andrew Pinski
On Mon, May 25, 2015 at 12:55:40PM +0200, Ondřej Bílka wrote:
> Gcc doesn't do any optimizations with these but misses several.
>
Another missed optimization: memchr (s, 0, n) could be converted into a call to strnlen (s, n), with the result converted back to a pointer.
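Spelled out (a sketch; the wrapper name is made up):

#define _GNU_SOURCE
#include <string.h>

/* memchr (s, 0, n) expressed via strnlen */
static inline void *
memchr_zero (const char *s, size_t n)
{
  size_t len = strnlen (s, n);
  return len == n ? NULL : (void *) (s + len);
}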
* Re: Gcc builtin review: memcpy, mempcpy, memmove.
2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
@ 2015-05-28 16:57 ` Joseph Myers
2015-05-28 17:08 ` Jeff Law
2 siblings, 1 reply; 34+ messages in thread
From: Joseph Myers @ 2015-05-28 16:57 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On Mon, 25 May 2015, Ondřej Bílka wrote:
> Andrew as I promised review and what you would need to fix for x64 gcc
> and remove ugly gcc hacks and replace them with nice headers :-)
>
> I will take a functions in order from string.h and comment about
> these.
Reviews of optimization issues with GCC built-in functions belong
primarily in the GCC context, not glibc at all. That is, file one report
in GCC Bugzilla for each issue you find that isn't already reported there
(making sure to give the usual details of GCC version, target, compilation
options etc.). File a meta-bug in GCC Bugzilla depending on all those
bugs (and any pre-existing bugs for issues you found that you didn't
report because they were already reported there). Then just send a
summary to the libc-alpha and gcc lists pointing to that meta-bug.
This applies to most of your reviews of GCC built-in functions - send the
reports where they belong, and any temporary implementations of
optimizations in glibc headers need comments pointing to the GCC bug
reports which point back to those comments. Only where an optimization
involves calling glibc-specific functions are compiler optimizations more
troublesome (and even there, GCC knows when it's compiling for a glibc
target and knows the minimum glibc version supported on the target, so
it's entirely reasonable to agree that e.g. __rawmemchr will remain
available as a public glibc ABI so that GCC can generate code calling it
on glibc targets unless told not to because e.g. generating kernel code).
> Then you have problem that optimum also depends on profile feedback. If
Which is clearly better handled in the GCC context (GCC has the
infrastructure to handle profile feedback data from training runs of the
program) than in headers.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
` (4 preceding siblings ...)
2015-05-26 10:45 ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
@ 2015-05-28 17:02 ` Joseph Myers
2015-06-04 10:34 ` Richard Earnshaw
6 siblings, 0 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-28 17:02 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On Mon, 25 May 2015, Ondřej Bílka wrote:
> You should expoit that stpcpy returns end of string so gcc should
> replace
>
> strcpy (x,y)
> return strlen (x)
>
> with
>
> return stpcpy (x,y) - x
Subject to namespace issues (we could agree that __stpcpy is a stable ABI
- won't become a compat symbol - so that GCC can safely generate calls to
it).
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Gcc builtin review: memcpy, mempcpy, memmove.
2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
@ 2015-05-28 17:08 ` Jeff Law
2015-06-04 10:25 ` Ondřej Bílka
0 siblings, 1 reply; 34+ messages in thread
From: Jeff Law @ 2015-05-28 17:08 UTC (permalink / raw)
To: Joseph Myers, Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On 05/28/2015 10:36 AM, Joseph Myers wrote:
> On Mon, 25 May 2015, Ondřej Bílka wrote:
>
>> Andrew as I promised review and what you would need to fix for x64 gcc
>> and remove ugly gcc hacks and replace them with nice headers :-)
>>
>> I will take a functions in order from string.h and comment about
>> these.
>
> Reviews of optimization issues with GCC built-in functions belong
> primarily in the GCC context, not glibc at all.
Agreed. Trying to address all this stuff in glibc is wrong. Each issue
should be evaluated and a determination made individually whether the
issue should be addressed in gcc or glibc.
Jeff
* Re: Gcc builtin review: memcpy, mempcpy, memmove.
2015-05-28 17:08 ` Jeff Law
@ 2015-06-04 10:25 ` Ondřej Bílka
0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-06-04 10:25 UTC (permalink / raw)
To: Jeff Law; +Cc: Joseph Myers, libc-alpha, Andrew Pinski
On Thu, May 28, 2015 at 10:41:04AM -0600, Jeff Law wrote:
> On 05/28/2015 10:36 AM, Joseph Myers wrote:
> >On Mon, 25 May 2015, Ondřej Bílka wrote:
> >
> >>Andrew as I promised review and what you would need to fix for x64 gcc
> >> and remove ugly gcc hacks and replace them with nice headers :-)
> >>
> >>I will take a functions in order from string.h and comment about
> >>these.
> >
> >Reviews of optimization issues with GCC built-in functions belong
> >primarily in the GCC context, not glibc at all.
> Agreed. Trying to address all this stuff in glibc is wrong. Each
> issue should be evaluated and a determination made individually
> whether the issue should be addressed in gcc or glibc.
>
>
> Jeff
Ok, I will crosspost the issues that need to be addressed. The main two problems are, first, that a lot of these special cases should be handled as ordinary code rather than by adding an extra pass every time you notice that you could sometimes expand foo into bar and baz; and second, that the correct decision depends on implementation details. For these optimizations you need to know how fast the different libc functions are, which is easier to check inside libc than to keep that information in sync in gcc.
* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
` (5 preceding siblings ...)
2015-05-28 17:02 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
@ 2015-06-04 10:34 ` Richard Earnshaw
2015-06-04 12:33 ` Ondřej Bílka
6 siblings, 1 reply; 34+ messages in thread
From: Richard Earnshaw @ 2015-06-04 10:34 UTC (permalink / raw)
To: Ondřej Bílka, libc-alpha; +Cc: Andrew Pinski
On 25/05/15 12:45, Ondřej Bílka wrote:
> Replaces it with strcpy. One could argue that opposite way to replace
> strcpy with stpcpy is faster.
>
> Reason is register pressure. Strcpy needs extra register to save return
> value while stpcpy has return value already in register used for writing
> terminating zero.
Depends on your architecture. On aarch64 we have plenty of spare
registers, so strcpy simply copies the destination register into a
scratch. It then doesn't have to carefully calculate the return value
at the end of the function (making the tail code simpler - there are
multiple return statements, but only one entry point).
R.
* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
2015-06-04 10:34 ` Richard Earnshaw
@ 2015-06-04 12:33 ` Ondřej Bílka
2015-06-04 13:50 ` Richard Earnshaw
0 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-06-04 12:33 UTC (permalink / raw)
To: Richard Earnshaw; +Cc: libc-alpha, Andrew Pinski
On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> On 25/05/15 12:45, Ondřej Bílka wrote:
> > Replaces it with strcpy. One could argue that opposite way to replace
> > strcpy with stpcpy is faster.
> >
> > Reason is register pressure. Strcpy needs extra register to save return
> > value while stpcpy has return value already in register used for writing
> > terminating zero.
>
>
> Depends on your architecture. On aarch64 we have plenty of spare
> registers, so strcpy simply copies the destination register into a
> scratch. It then doesn't have to carefully calculate the return value
> at the end of the function (making the tail code simpler - there are
> multiple return statements, but only one entry point).
>
That's correct; the main saving comes from the return value being the first register, which forces an extra copy, which is suboptimal.

I don't have data on how strcpy and stpcpy calls mix, and I want to know whether a few extra cycles are worth it when these aren't called all that often; I will try to think about how to test this.
* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
2015-06-04 12:33 ` Ondřej Bílka
@ 2015-06-04 13:50 ` Richard Earnshaw
2015-06-04 14:35 ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
0 siblings, 1 reply; 34+ messages in thread
From: Richard Earnshaw @ 2015-06-04 13:50 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On 04/06/15 13:28, Ondřej Bílka wrote:
> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>> On 25/05/15 12:45, Ondřej Bílka wrote:
>>> Replaces it with strcpy. One could argue that opposite way to replace
>>> strcpy with stpcpy is faster.
>>>
>>> Reason is register pressure. Strcpy needs extra register to save return
>>> value while stpcpy has return value already in register used for writing
>>> terminating zero.
>>
>>
>> Depends on your architecture. On aarch64 we have plenty of spare
>> registers, so strcpy simply copies the destination register into a
>> scratch. It then doesn't have to carefully calculate the return value
>> at the end of the function (making the tail code simpler - there are
>> multiple return statements, but only one entry point).
>>
> Thats correct, main saving you get is from return value is first register, that
> forces needing extra copy which is suboptimal.
No, look at the AArch64 code. The only time we ever end up with a
simple MOV instruction to copy the register from one location to another
is in the stPcpy code. In strcpy it always ends up folded into some
other operation that we have to do anyway. Once it's been copied to
that other register we never have to use it elsewhere again.
Furthermore, the need to handle smallish copies with overlapping stores
means we need both the original base address /and/ the final result, so
we'd still need to end up saving it for stpcpy.
>
> I don't have data how strcpy and stpcpy mix and want to know if few
> extra cycles are worth it when these aren't called exactly often, I will
> try to think how test these.
>
Well, stpcpy is Posix, while strcpy is part of ISO C. That suggests to
me we're far more likely to see folk using strcpy than stpcpy,
especially when they don't care about the result.
R.
* [RFC] Aarch64: optimize stpcpy a bit.
2015-06-04 13:50 ` Richard Earnshaw
@ 2015-06-04 14:35 ` Ondřej Bílka
2015-06-04 16:41 ` Richard Earnshaw
0 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-06-04 14:35 UTC (permalink / raw)
To: Richard Earnshaw; +Cc: libc-alpha, Andrew Pinski
On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
> On 04/06/15 13:28, Ondřej Bílka wrote:
> > On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> >> On 25/05/15 12:45, Ondřej Bílka wrote:
> >>> Replaces it with strcpy. One could argue that opposite way to replace
> >>> strcpy with stpcpy is faster.
> >>>
> >>> Reason is register pressure. Strcpy needs extra register to save return
> >>> value while stpcpy has return value already in register used for writing
> >>> terminating zero.
> >>
> >>
> >> Depends on your architecture. On aarch64 we have plenty of spare
> >> registers, so strcpy simply copies the destination register into a
> >> scratch. It then doesn't have to carefully calculate the return value
> >> at the end of the function (making the tail code simpler - there are
> >> multiple return statements, but only one entry point).
> >>
> > Thats correct, main saving you get is from return value is first register, that
> > forces needing extra copy which is suboptimal.
>
> No, look at the AArch64 code. The only time we ever end up with a
> simple MOV instruction to copy the register from one location to another
> is in the stPcpy code. In strcpy it always ends up folded into some
> other operation that we have to do anyway. Once it's been copied to
> that other register we never have to use it elsewhere again.
> Furthermore, the need to handle smallish copies with overlapping stores
> means we need both the original base address /and/ the final result, so
> we'd still need to end up saving it for stpcpy.
>
I wrote too quickly; I was referring to the copy of the base address you would need for small copies. With dest in a different register I could make strcpy and stpcpy cost the same instructions in most cases, except sizes 8-15, by adjusting offsets with some constants.

Also I think you could remove the extra instructions from the stpcpy loop with the following, which also removes one instruction from strcpy if I read the code correctly.
diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
index 28846fb..7ca0412 100644
--- a/sysdeps/aarch64/strcpy.S
+++ b/sysdeps/aarch64/strcpy.S
@@ -204,6 +204,9 @@ L(bulk_entry):
sub to_align, to_align, #16
stp data1, data2, [dstin]
sub src, srcin, to_align
+#ifdef BUILD_STPCPY
+# define dst dstin
+#endif
sub dst, dstin, to_align
b L(entry_no_page_cross)
@@ -243,17 +246,16 @@ L(entry_no_page_cross):
#endif
rev has_nul1, has_nul1
clz pos, has_nul1
- add tmp1, pos, #72
- add pos, pos, #8
+ add tmp1, pos, #64
csel pos, pos, tmp1, ne
add src, src, pos, lsr #3
add dst, dst, pos, lsr #3
- ldp data1, data2, [src, #-32]
- stp data1, data2, [dst, #-16]
+ ldp data1, data2, [src, #-31]
+ stp data1, data2, [dst, #-15]
+ ret
#ifdef BUILD_STPCPY
- sub dstin, dst, #1
+# undef dst
#endif
- ret
L(page_cross):
bic src, srcin, #15
* Re: [RFC] Aarch64: optimize stpcpy a bit.
2015-06-04 14:35 ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
@ 2015-06-04 16:41 ` Richard Earnshaw
0 siblings, 0 replies; 34+ messages in thread
From: Richard Earnshaw @ 2015-06-04 16:41 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski
On 04/06/15 15:16, Ondřej Bílka wrote:
> On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
>> On 04/06/15 13:28, Ondřej Bílka wrote:
>>> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>>>> On 25/05/15 12:45, Ondřej Bílka wrote:
>>>>> Replaces it with strcpy. One could argue that opposite way to replace
>>>>> strcpy with stpcpy is faster.
>>>>>
>>>>> Reason is register pressure. Strcpy needs extra register to save return
>>>>> value while stpcpy has return value already in register used for writing
>>>>> terminating zero.
>>>>
>>>>
>>>> Depends on your architecture. On aarch64 we have plenty of spare
>>>> registers, so strcpy simply copies the destination register into a
>>>> scratch. It then doesn't have to carefully calculate the return value
>>>> at the end of the function (making the tail code simpler - there are
>>>> multiple return statements, but only one entry point).
>>>>
>>> Thats correct, main saving you get is from return value is first register, that
>>> forces needing extra copy which is suboptimal.
>>
>> No, look at the AArch64 code. The only time we ever end up with a
>> simple MOV instruction to copy the register from one location to another
>> is in the stPcpy code. In strcpy it always ends up folded into some
>> other operation that we have to do anyway. Once it's been copied to
>> that other register we never have to use it elsewhere again.
>> Furthermore, the need to handle smallish copies with overlapping stores
>> means we need both the original base address /and/ the final result, so
>> we'd still need to end up saving it for stpcpy.
>>
> Wrote too fast, was refering that you would need to copy that on small
> address. With dest in different register I could make strcpy and stpcpy
> const same instructions in most cases except size 8-15 by adjusting
> offsets with some constants.
>
> Also if I think that you could remove extra instructions for stpcpy loop
> with following, which also removes one instruction from strcpy if I read
> code correctly.
>
> diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
> index 28846fb..7ca0412 100644
> --- a/sysdeps/aarch64/strcpy.S
> +++ b/sysdeps/aarch64/strcpy.S
> @@ -204,6 +204,9 @@ L(bulk_entry):
> sub to_align, to_align, #16
> stp data1, data2, [dstin]
> sub src, srcin, to_align
> +#ifdef BUILD_STPCPY
> +# define dst dstin
> +#endif
> sub dst, dstin, to_align
> b L(entry_no_page_cross)
>
> @@ -243,17 +246,16 @@ L(entry_no_page_cross):
> #endif
> rev has_nul1, has_nul1
> clz pos, has_nul1
> - add tmp1, pos, #72
> - add pos, pos, #8
> + add tmp1, pos, #64
> csel pos, pos, tmp1, ne
> add src, src, pos, lsr #3
> add dst, dst, pos, lsr #3
> - ldp data1, data2, [src, #-32]
> - stp data1, data2, [dst, #-16]
> + ldp data1, data2, [src, #-31]
> + stp data1, data2, [dst, #-15]
That's not valid, the offset has to be a multiple of the register size
(8 in this case).
> + ret
> #ifdef BUILD_STPCPY
> - sub dstin, dst, #1
> +# undef dst
Nor is this, dst is already a #define, so this leaves it unspecified.
But since you can't avoid the late subtract, that's irrelevant anyway.
R.
> #endif
> - ret
>
> L(page_cross):
> bic src, srcin, #15
>
* RE: Gcc builtin review: isinf, insnan ...
2015-05-29 20:58 ` Wilco Dijkstra
@ 2015-05-30 19:14 ` Joseph Myers
0 siblings, 0 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-30 19:14 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: Ondřej Bílka, libc-alpha, pinskia
On Fri, 29 May 2015, Wilco Dijkstra wrote:
> Note another issue is that GCC tends to use FP arithmetic with -ffast-math
> for FP related built-ins (even when it uses integer arithmetic with -O2/-O3),
> presumably on the assumption that FP calculations are faster. This is not true
> on most CPUs but it spectacularly backfires when using software floating point...
>
> int test(double x)
> {
> return __builtin_signbit (x);
> }
>
> -O3 -mfloat-abi=soft:
> and r0, r1, #-2147483648
> bx lr
>
> -Ofast -mfloat-abi=soft:
> push {r3, lr}
> movs r2, #0
> movs r3, #0
> bl __aeabi_dcmplt
> adds r0, r0, #0
> it ne
> movne r0, #1
> pop {r3, pc}
GCC bug report number? I don't think this is an assumption that FP
calculations are faster; it seems a simple inconsistency between tree
folding (prefers expanding to a comparison when no signed zeroes) and RTL
expansion (prefers bit manipulation whenever a read-only sign bit is
available, which it always is for all currently-supported floating-point
formats).
> Agreed, I'd much prefer the GCC builtins doing the right thing. However what
> to do when they don't? Assuming we agree to just support inline functions in
> headers from GCC4.0 onwards, given the built-ins in GCC are not good enough,
> we still need explicit optimized implementations for all of signbit, isinf etc
> even if they were fixed in GCC6.
Well, if you want a -ffast-math version of signbit (commented to be
working around a particular, preferably fixed, GCC bug, with __GNUC_PREREQ
conditionals) - if that's considered sufficiently important - at least it
ought to be architecture-independent (maybe one bits/ header per
floating-point format) rather than duplicated per-architecture.
--
Joseph S. Myers
joseph@codesourcery.com
* RE: Gcc builtin review: isinf, insnan ...
2015-05-29 14:26 ` Joseph Myers
@ 2015-05-29 20:58 ` Wilco Dijkstra
2015-05-30 19:14 ` Joseph Myers
0 siblings, 1 reply; 34+ messages in thread
From: Wilco Dijkstra @ 2015-05-29 20:58 UTC (permalink / raw)
To: 'Joseph Myers', Ondřej Bílka; +Cc: libc-alpha, pinskia
> Joseph Myers wrote:
> On Fri, 29 May 2015, Ondřej Bílka wrote:
>
> > On Thu, May 28, 2015 at 05:49:05PM +0000, Joseph Myers wrote:
> > > On Wed, 27 May 2015, Wilco Dijkstra wrote:
> > >
> > > It's not obvious that integer arithmetic is optimal in all cases (if the
> > > implementation is otherwise correct given the compiler options passed).
> > > For example, if the argument is in a floating-point register and an
> > > integer implementation would involve the (architecture-specific) costs of
> > > moving it to an integer register - obviously something that can only be
> > > decided quite late in code generation.
> > >
> > Naturally one should check which implementation is best. Could you send
> > link to numeric application and how to measure it that was mentioned
> > before?
>
> I have no idea which is best - all I assert here is that it's not obvious
> that "should not be used until they are fixed to use integer arithmetic"
> is true (any patch to use the GCC built-in functions should come with
> benchmark data as well as information on correctness). It may well be the
> case that changing the GCC built-in functions to use integer arithmetic is
> unconditionally beneficial; such a change obviously needs benchmark data
> (in a way that changing for correctness with -fsignaling-nans only
> wouldn't).
Note another issue is that GCC tends to use FP arithmetic with -ffast-math
for FP related built-ins (even when it uses integer arithmetic with -O2/-O3),
presumably on the assumption that FP calculations are faster. This is not true
on most CPUs but it spectacularly backfires when using software floating point...
int test(double x)
{
return __builtin_signbit (x);
}
-O3 -mfloat-abi=soft:
and r0, r1, #-2147483648
bx lr
-Ofast -mfloat-abi=soft:
push {r3, lr}
movs r2, #0
movs r3, #0
bl __aeabi_dcmplt
adds r0, r0, #0
it ne
movne r0, #1
pop {r3, pc}
So these built-ins "should not be used until they are fixed to use integer
arithmetic" :-)
> All other things being equal, use of type-generic built-in functions has
> the advantage that the macro expansion text only contains the argument
> text once - generally a good idea when macro calls may be nested inside
> arguments to other macros.
Agreed, I'd much prefer the GCC builtins doing the right thing. However what
to do when they don't? Assuming we agree to just support inline functions in
headers from GCC4.0 onwards, given the built-ins in GCC are not good enough,
we still need explicit optimized implementations for all of signbit, isinf etc
even if they were fixed in GCC6.
Wilco
* Re: Gcc builtin review: isinf, insnan ...
2015-05-29 14:02 ` Ondřej Bílka
@ 2015-05-29 14:26 ` Joseph Myers
2015-05-29 20:58 ` Wilco Dijkstra
0 siblings, 1 reply; 34+ messages in thread
From: Joseph Myers @ 2015-05-29 14:26 UTC (permalink / raw)
To: Ondřej Bílka; +Cc: Wilco Dijkstra, libc-alpha, pinskia
On Fri, 29 May 2015, Ondřej Bílka wrote:
> On Thu, May 28, 2015 at 05:49:05PM +0000, Joseph Myers wrote:
> > On Wed, 27 May 2015, Wilco Dijkstra wrote:
> >
> > It's not obvious that integer arithmetic is optimal in all cases (if the
> > implementation is otherwise correct given the compiler options passed).
> > For example, if the argument is in a floating-point register and an
> > integer implementation would involve the (architecture-specific) costs of
> > moving it to an integer register - obviously something that can only be
> > decided quite late in code generation.
> >
> Naturally one should check which implementation is best. Could you send
> link to numeric application and how to measure it that was mentioned
> before?
I have no idea which is best - all I assert here is that it's not obvious
that "should not be used until they are fixed to use integer arithmetic"
is true (any patch to use the GCC built-in functions should come with
benchmark data as well as information on correctness). It may well be the
case that changing the GCC built-in functions to use integer arithmetic is
unconditionally beneficial; such a change obviously needs benchmark data
(in a way that changing for correctness with -fsignaling-nans only
wouldn't).
All other things being equal, use of type-generic built-in functions has
the advantage that the macro expansion text only contains the argument
text once - generally a good idea when macro calls may be nested inside
arguments to other macros.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Gcc builtin review: isinf, insnan ...
2015-05-28 19:33 ` Joseph Myers
2015-05-29 13:22 ` Wilco Dijkstra
@ 2015-05-29 14:02 ` Ondřej Bílka
2015-05-29 14:26 ` Joseph Myers
1 sibling, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-29 14:02 UTC (permalink / raw)
To: Joseph Myers; +Cc: Wilco Dijkstra, libc-alpha, pinskia
On Thu, May 28, 2015 at 05:49:05PM +0000, Joseph Myers wrote:
> On Wed, 27 May 2015, Wilco Dijkstra wrote:
>
> It's not obvious that integer arithmetic is optimal in all cases (if the
> implementation is otherwise correct given the compiler options passed).
> For example, if the argument is in a floating-point register and an
> integer implementation would involve the (architecture-specific) costs of
> moving it to an integer register - obviously something that can only be
> decided quite late in code generation.
>
Naturally one should check which implementation is best. Could you send a link to the numeric application, and to how to measure it, that was mentioned before?

I have some doubt that floating point would do that well, but it needs to be tested. Against the cost of moving a float to an integer register you also have the cost of loading a floating-point constant, which could be just as expensive; how do you tell which is faster? The integer computation may even help: due to pipelining it would be executed in parallel, while the floating-point operations would be sequential.

Also you run into diminishing returns. While you could gain a lot by using a header, such specific optimizations would only rarely save a cycle, so are they worth the maintenance burden?
* RE: Gcc builtin review: isinf, insnan ...
2015-05-29 13:22 ` Wilco Dijkstra
2015-05-29 13:25 ` Joseph Myers
@ 2015-05-29 13:30 ` Joseph Myers
1 sibling, 0 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-29 13:30 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: 'Ondřej Bílka', libc-alpha, pinskia
On Fri, 29 May 2015, Wilco Dijkstra wrote:
> so shouldn't be the default. Both GCC and GLIBC are too conservative and don't
> generate quality floating point code by default. And despite conservative and
The very slow multiple-precision paths in some dbl-64 functions that
attempt to be correctly rounding should simply be removed as out of scope
for the goals of those functions, subject to error analysis showing that
the previous paths in the code always generate results within glibc's
accuracy goals (a few ulps).
--
Joseph S. Myers
joseph@codesourcery.com
* RE: Gcc builtin review: isinf, insnan ...
2015-05-29 13:22 ` Wilco Dijkstra
@ 2015-05-29 13:25 ` Joseph Myers
2015-05-29 13:30 ` Joseph Myers
1 sibling, 0 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-29 13:25 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: 'Ondřej Bílka', libc-alpha, pinskia
On Fri, 29 May 2015, Wilco Dijkstra wrote:
> > It's certainly the case that substantial GCC work is needed to make
> > floating-point properly follow even C99/C11 Annex F regarding exceptions /
> > rounding modes, let alone make signaling NaNs follow TS 18661-1, though
> > the signaling NaNs fixes would probably be more straightforward (reviewing
> > lots of code and making local changes, rather than any tricky design
> > issues involved for other floating-point areas). While I think such
> > support would be desirable, it also seems there's limited demand for it.
>
> IMHO support for different rounding modes, trapping exceptions and signalling
> NaNs are not required by many (most ARM CPUs don't even support FP traps...),
-ftrapping-math (default) is nothing to do with traps in the sense of
SIGFPE or change of flow of control, only about the raising of exception
flags that can be tested with <fenv.h> interfaces. Annex F and TS 18661
also have nothing to with the traps (part 5 deals with alternate exception
handling, but I doubt processor-level traps will be particularly useful in
implementing it). -frounding-math and -fsignaling-nans are not enabled by
default.
--
Joseph S. Myers
joseph@codesourcery.com
* RE: Gcc builtin review: isinf, insnan ...
2015-05-28 19:33 ` Joseph Myers
@ 2015-05-29 13:22 ` Wilco Dijkstra
2015-05-29 13:25 ` Joseph Myers
2015-05-29 13:30 ` Joseph Myers
2015-05-29 14:02 ` Ondřej Bílka
1 sibling, 2 replies; 34+ messages in thread
From: Wilco Dijkstra @ 2015-05-29 13:22 UTC (permalink / raw)
To: 'Joseph Myers'
Cc: 'Ondřej Bílka', libc-alpha, pinskia
> Joseph Myers wrote:
> On Wed, 27 May 2015, Wilco Dijkstra wrote:
>
> > Note the GCC built-ins are actually incorrect and should not be used until
> > they are fixed to use integer arithmetic. The GLIBC versions are never
>
> GCC bug report number? Is the problem something other than general issues
> with -fsignaling-nans not making everything correct for the presence of
> signaling NaNs (as documented)? If there are signaling NaN issues but not
> other issues, the built-in functions could be used conditional on
> __SUPPORT_SNAN__ to fix bugs 15367 and 17441, provided (a) comments
> referred to applicable GCC bug reports for any signaling NaN issues, with
> those bug reports referring back to getting the glibc comments updated
> when fixed in GCC, and (b) if uses in glibc rely on proper signaling NaN
> handling, glibc were made to build with -fsignaling-nans (not the default
> in GCC).
Yes I meant the signalling NaN issue.
> It's certainly the case that substantial GCC work is needed to make
> floating-point properly follow even C99/C11 Annex F regarding exceptions /
> rounding modes, let alone make signaling NaNs follow TS 18661-1, though
> the signaling NaNs fixes would probably be more straightforward (reviewing
> lots of code and making local changes, rather than any tricky design
> issues involved for other floating-point areas). While I think such
> support would be desirable, it also seems there's limited demand for it.
IMHO support for different rounding modes, trapping exceptions and signalling
NaNs are not required by many (most ARM CPUs don't even support FP traps...),
so shouldn't be the default. Both GCC and GLIBC are too conservative and don't
generate quality floating point code by default. And despite being conservative and slow, the produced code is still not good enough to fully support the above features, so everybody pays the price of slow code for no useful benefit...
I mostly use -Ofast as that's the only way to get half decent FP code out of GCC.
GLIBC should also have a mode with fastmath versions of all the math functions.
> It's not obvious that integer arithmetic is optimal in all cases (if the
> implementation is otherwise correct given the compiler options passed).
> For example, if the argument is in a floating-point register and an
> integer implementation would involve the (architecture-specific) costs of
> moving it to an integer register - obviously something that can only be
> decided quite late in code generation.
For simple bitmanipulation functions like isinf/isnan etc you can't get anywhere
near the latency of integer arithmetic on most targets. Note that while FP->int
moves may be expensive, FP comparisons also move the condition code to the integer
side, and thus have similar latency. If you do multiple isinf and isnan tests on
the same value (as is common in GLIBC) then the FP->int moves will be CSEd while
comparisons cannot.
However it would be possible to give targets the option to choose the GCC
builtins if signalling NaNs are disabled - perhaps that might be a way to get
inlining enabled quickly using a simple patch (for the integer versions we'd
need to move the various EXTRACT_WORDS macros into math.h which may be tricky).
Wilco
* RE: Gcc builtin review: isinf, insnan ...
2015-05-27 17:55 ` Gcc builtin review: isinf, insnan Wilco Dijkstra
2015-05-27 21:07 ` Julian Taylor
@ 2015-05-28 19:33 ` Joseph Myers
2015-05-29 13:22 ` Wilco Dijkstra
2015-05-29 14:02 ` Ondřej Bílka
1 sibling, 2 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-28 19:33 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: 'Ondřej Bílka', libc-alpha, pinskia
On Wed, 27 May 2015, Wilco Dijkstra wrote:
> Note the GCC built-ins are actually incorrect and should not be used until
> they are fixed to use integer arithmetic. The GLIBC versions are never
GCC bug report number? Is the problem something other than general issues
with -fsignaling-nans not making everything correct for the presence of
signaling NaNs (as documented)? If there are signaling NaN issues but not
other issues, the built-in functions could be used conditional on
__SUPPORT_SNAN__ to fix bugs 15367 and 17441, provided (a) comments
referred to applicable GCC bug reports for any signaling NaN issues, with
those bug reports referring back to getting the glibc comments updated
when fixed in GCC, and (b) if uses in glibc rely on proper signaling NaN
handling, glibc were made to build with -fsignaling-nans (not the default
in GCC).
It's certainly the case that substantial GCC work is needed to make
floating-point properly follow even C99/C11 Annex F regarding exceptions /
rounding modes, let alone make signaling NaNs follow TS 18661-1, though
the signaling NaNs fixes would probably be more straightforward (reviewing
lots of code and making local changes, rather than any tricky design
issues involved for other floating-point areas). While I think such
support would be desirable, it also seems there's limited demand for it.
It's not obvious that integer arithmetic is optimal in all cases (if the
implementation is otherwise correct given the compiler options passed).
For example, if the argument is in a floating-point register and an
integer implementation would involve the (architecture-specific) costs of
moving it to an integer register - obviously something that can only be
decided quite late in code generation.
--
Joseph S. Myers
joseph@codesourcery.com
* Re: Gcc builtin review: isinf, insnan ...
2015-05-27 21:07 ` Julian Taylor
@ 2015-05-27 21:38 ` Ondřej Bílka
0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-27 21:38 UTC (permalink / raw)
To: Julian Taylor; +Cc: Wilco Dijkstra, libc-alpha, pinskia
On Wed, May 27, 2015 at 07:15:32PM +0200, Julian Taylor wrote:
> On 27.05.2015 16:57, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> >> I raised this issue before but didn't wrote patch so I should do it now.
> >> I would be silent about glibc as it shares same flaw as gcc.
> >>
> >> Main problem that these functions try to be branchless. Which causes
> >> performance regression for most applications versus branched code.
> >
> > Being branchless is one issue indeed but the main issue they are never
> > inlined on any target as GLIBC headers explicitly disable inlining by GCC.
> >
>
>
> This is a huge problem for almost all numerical applications that don't
> care about signaling nans, so basically all.
> I had to fix too many applications having poor performance due to that
> (e.g.
> http://yarikoptic.github.io/numpy-vbench/vb_vb_ufunc.html#numpy-isnan-a-10types)
>
> I would love when glibc would somehow allow programs to use the builtin
> via the isnan function in an easy way, so not via compiler specific
> compiler flags like -fno-signaling-nan or preprocessor compiler version
> checks.
> Maybe alway direct to the builtin/fastest possible non signaling
> implementation when some GIVE_ME_FAST_BOOL_ISNAN is defined before
> including <math.h>?
>
> Also note that gcc has updated their documentation which might allow to
> now use __builtin_isinf
> (https://sourceware.org/bugzilla/show_bug.cgi?id=15367)
A simple inclusion of math.h should be enough. Right now redirecting to the builtin would be slower than offering an inline in the header, which would also fix the signalling issue.

The only case where a non-constant builtin would make sense is if gcc started to analyze floats to prove that operations are finite. It doesn't do that; it keeps the isinf even in the simplest cases like

#include <math.h>

int
foo (double d)
{
  if (d < 11) return 1;
  if (d > 444) return 2;
  if (isinf (d)) return 3;
  return 4;
}
* Re: Gcc builtin review: isinf, insnan ...
2015-05-27 17:55 ` Gcc builtin review: isinf, insnan Wilco Dijkstra
@ 2015-05-27 21:07 ` Julian Taylor
2015-05-27 21:38 ` Ondřej Bílka
2015-05-28 19:33 ` Joseph Myers
1 sibling, 1 reply; 34+ messages in thread
From: Julian Taylor @ 2015-05-27 21:07 UTC (permalink / raw)
To: Wilco Dijkstra, 'Ondřej Bílka'; +Cc: libc-alpha, pinskia
On 27.05.2015 16:57, Wilco Dijkstra wrote:
> Ondřej Bílka wrote:
>> I raised this issue before but didn't wrote patch so I should do it now.
>> I would be silent about glibc as it shares same flaw as gcc.
>>
>> Main problem that these functions try to be branchless. Which causes
>> performance regression for most applications versus branched code.
>
> Being branchless is one issue indeed but the main issue they are never
> inlined on any target as GLIBC headers explicitly disable inlining by GCC.
>
This is a huge problem for almost all numerical applications that don't
care about signaling nans, so basically all.
I had to fix too many applications having poor performance due to that
(e.g.
http://yarikoptic.github.io/numpy-vbench/vb_vb_ufunc.html#numpy-isnan-a-10types)
I would love it if glibc would somehow allow programs to use the builtin via the isnan function in an easy way, i.e. not via compiler-specific flags like -fno-signaling-nan or preprocessor compiler-version checks.
Maybe always redirect to the builtin/fastest possible non-signaling implementation when some GIVE_ME_FAST_BOOL_ISNAN is defined before including <math.h>?
Also note that gcc has updated their documentation, which might allow us to now use __builtin_isinf
(https://sourceware.org/bugzilla/show_bug.cgi?id=15367)
* RE: Gcc builtin review: isinf, insnan ...
[not found] <A610E03AD50BFC4D95529A36D37FA55E769AD3FAC3@GEORGE.Emea.Arm.com>
@ 2015-05-27 17:55 ` Wilco Dijkstra
2015-05-27 21:07 ` Julian Taylor
2015-05-28 19:33 ` Joseph Myers
0 siblings, 2 replies; 34+ messages in thread
From: Wilco Dijkstra @ 2015-05-27 17:55 UTC (permalink / raw)
To: 'Ondřej Bílka'; +Cc: libc-alpha, pinskia
Ondřej Bílka wrote:
> I raised this issue before but didn't wrote patch so I should do it now.
> I would be silent about glibc as it shares same flaw as gcc.
>
> Main problem that these functions try to be branchless. Which causes
> performance regression for most applications versus branched code.
Being branchless is one issue indeed, but the main issue is that they are never inlined on any target, as GLIBC headers explicitly disable inlining by GCC.
> A problem is that predicted branch is free while conditional store
> always cost cycle. So you need to have unpredictable branch to get
> performance gain. When branch is 95% predicted then branchless code
> wouldn't pay for itself if it adds one cycle versus branched and
> misprediction costs 20 cycles.
>
> And NaN is quite exceptional value so branches will almost always be
> predicted. Otherwise user has other problems, like that if 5% of his
> data are NaN's then result will likely be garbage.
>
> Then you have problem that with modern gcc you wont likely save branch.
> Most of these functions are surrounded by if. From gcc-4.9 it will
> optimize out that branch as its predicated and it results in simpler
> code.
>
> More evidence about that is that I took assembly of benchmark below and
> changed conditional move to jump which improves performance back by 10%
>
> For showing that I wrote simple example of branched isinf that is around
> 10% faster than builtin.
Note the GCC built-ins are actually incorrect and should not be used until
they are fixed to use integer arithmetic. The GLIBC versions are never
inlined on any target, and adding generic inline implementations gives a 4-6
times speedup. Isnan, isnormal, isfinite, issignalling are equally trivial,
needing ~3-4 instructions. An optimized fpclassify implementation seems
small enough to be fully inlineable (useful given it is used in lots of
complex math functions), however it could be partially inlined like:
__glibc_likely(isnormal(x)) ? FP_NORMAL : __fpclassify(x)
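For concreteness, the kind of integer-arithmetic inline meant here looks roughly like this for double (a sketch, not the form glibc would actually ship):

#include <stdint.h>
#include <string.h>

/* isnan via integer arithmetic: NaN iff the absolute-value bits
   exceed the +Inf bit pattern */
static inline int
isnan_bits (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u); /* move the bits without aliasing issues */
  return (u & UINT64_C (0x7fffffffffffffff)) > UINT64_C (0x7ff0000000000000);
}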
Just checking, are you planning to post patches for these?
Wilco
Thread overview: 34+ messages
2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
2015-05-25 13:50 ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
2015-05-27 6:36 ` Ondřej Bílka
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
2015-05-25 17:46 ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
2015-05-25 19:28 ` Gcc builtin review: strpbrk, strspn, strcspn Ondřej Bílka
2015-05-25 17:52 ` Gcc builtin review: strdup Ondřej Bílka
2015-05-25 18:11 ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
2015-05-25 19:15 ` Ondřej Bílka
2015-05-26 10:58 ` Gcc builtin review: isinf, insnan Ondřej Bílka
2015-05-26 8:00 ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
2015-05-26 8:55 ` Ondřej Bílka
2015-05-26 10:45 ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
2015-05-28 17:02 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
2015-06-04 10:34 ` Richard Earnshaw
2015-06-04 12:33 ` Ondřej Bílka
2015-06-04 13:50 ` Richard Earnshaw
2015-06-04 14:35 ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
2015-06-04 16:41 ` Richard Earnshaw
2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
2015-05-28 17:08 ` Jeff Law
2015-06-04 10:25 ` Ondřej Bílka
[not found] <A610E03AD50BFC4D95529A36D37FA55E769AD3FAC3@GEORGE.Emea.Arm.com>
2015-05-27 17:55 ` Gcc builtin review: isinf, insnan Wilco Dijkstra
2015-05-27 21:07 ` Julian Taylor
2015-05-27 21:38 ` Ondřej Bílka
2015-05-28 19:33 ` Joseph Myers
2015-05-29 13:22 ` Wilco Dijkstra
2015-05-29 13:25 ` Joseph Myers
2015-05-29 13:30 ` Joseph Myers
2015-05-29 14:02 ` Ondřej Bílka
2015-05-29 14:26 ` Joseph Myers
2015-05-29 20:58 ` Wilco Dijkstra
2015-05-30 19:14 ` Joseph Myers