Gcc builtin review: memcpy, mempcpy, memmove.

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* Gcc builtin review: memcpy, mempcpy, memmove.
@ 2015-05-25 12:18 Ondřej Bílka
  2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 12:18 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

Andrew as I promised review and what you would need to fix for x64 gcc
 and remove ugly gcc hacks and replace them with nice headers :-)

I will take a functions in order from string.h and comment about
these.

Lose all hope, these are imposible to optimize purely with gcc.

These are hardest functions to optimize. A optimal implementation
depends on cpu and size, whether inputs are likely in L1 or L2 or L3
cache  and relationship is quite chaotic.

With gcc unless you start doing per-cpu function cloning and optimize
for each cpu (which would be good idea if you could convince maintainers
to deal with size bloat) then there will always be some cpu where you
lose.

Its because there are loop A and B and A is 50% faster than B on
cpu 1 but B is 50% faster on cpu 2

A main weakness for current cpus lies in 100-1000 byte range. A in
following benchmark an -march=native gcc expansion is outperformed 
by at least 10% with obsolete memcpy_sse2 from glibc-2.13 on 
sandy_bridge and haswell

As memcpy_sse2 uses only 8-byte loads that speaks for itself and newer
implementations are faster.

A benchmark to check instrinct is following. 
Compile a.c with -mstringop-strategy=libcall and you will see
performance improvement. Of course compile a.c and main separately to
avoid optimizing memcpy out.

a.c:

#include <string.h>
int memcpy2(char *c, char *s)
{
  return memcpy(c, s, 200);
}

#include <string.h>
extern int n;
int memcpy2(char *c, char *s)
{
  return memcpy(c, s, n);
}

main.c:

int n = 200;
int main(){
  int i;
  char u[40000], v[40000];
  for (i=0;i<1000000;i++)
    memcpy2(u + 16 * ((unsigned int) rand ()) % (1 << 10), v + + 16 * ((unsigned int) rand ()) % (1 << 10));
}

We dealt with suboptimal selection. 

Then you have problem that optimum also depends on profile feedback. If
you fixed previous selection it would be still wrong if you know that
input is mainly in L3 cache. For some reason rep stosq is by far best
implementation for these inputs in almost any intel cpu.

Then next problem is code size, I filled bug about that. Gcc does too 
excessive expansion.
memcpy(c, s, 120);
gets expanded to 125 byte sequence of movs 

While it may improve overall performance most of these expansions are
wrong as they lie on cold path and could easily introduce extra 60 cycle
penalty for fetching instructions.

A ugly problem is that upto certain size gcc needs to expand these for
optimizations to work, one wants to be able to determine that b[1] is b
in following fragment and do constant propagation of copied structures.
A question is after what size that doesn't make sense.

char *a = "abc";
memcpy(b,a,3);
return b[1];

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Gcc builtin review: memset
  2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
@ 2015-05-25 12:49 ` Ondřej Bílka
  2015-05-25 13:50   ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
  2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
  2 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 12:49 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

First quick statement that memccpy is ok as gcc doesn't do anything.

Then memset suffers for same flaws as memcpy. If modify previous memcpy
benchmark to following then for size 120 and obsolete glibc-2.13 you
have an around 25% performance regression.

a.c:
#include <string.h>
int memcpy2(char *c, char *s)
{
  return memset(c, 0, 120);
}

An size bloat is lesser.

Then there is optimization that on some archs you should call
__bzero (s, n) instead memset (s, 0, n). A effective __bzero consist of
small prologue that jumps to memset after initialization. It saves
cycles of creating mask from c and passing additional argument.

A main diadvantage would be renaming as bzero doesn't return anything
but x64 does return same value as memset.

Also as I mentioned ufunc a memset is target to get better performance
without troubles that you couldn't select cpu. Note that on haswell
using avx2 instincs could improve performance. How are you going exploit
that with gcc?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Gcc builtin review: memchr, memrchr, rawmemchr
  2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
@ 2015-05-25 13:50   ` Ondřej Bílka
  2015-05-27  6:36     ` Ondřej Bílka
  0 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 13:50 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

Gcc doesn't do any optimizations with these but misses several.

I already submitted that rawmemchr (s, 0) is silly, you should use
s + strlen (s) as strlen is better optimized (AFAIR around 50% faster
than rawmemchr)

Then gcc also doesn't optimize memchr with n small constant which is
possible but bit obscure. I could get similar performance improvement as
strcmp question is if its worth mainatainence. But fell free to add gcc
optimization for that.

Another use case is how wordexp abuses strchr/memchr to test membership
in set. For example it checked whitespace with

if (strchr (" \t\r\n", c))

While this would be relatively easy to implement with header and would
be candidate for ufunc calculating table in gcc would be really ugly
hack.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
  2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
@ 2015-05-25 13:53 ` Ondřej Bílka
  2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
                     ` (6 more replies)
  2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
  2 siblings, 7 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 13:53 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

I will omit expansion to memcpy and flaws from previous thread.

First problem is that gcc in

int foo (char *c, char *c2){
  stpcpy (c, c2);
}

Replaces it with strcpy. One could argue that opposite way to replace
strcpy with stpcpy is faster.

Reason is register pressure. Strcpy needs extra register to save return
value while stpcpy has return value already in register used for writing
terminating zero. 

Also to decrease instruction cache pressure you should always call
stpcpy and newer strcpy. You can't do it other way as you would need
expensive strlen. A pressure is problem as strcpy tend to be relatively rarely
used.

A possible advantage of strcpy is that gcc4.9 started to exploit that it
returns x saving register spill in caller. I don't know how is that
useful.

With stpcpy and strcat there are more optimizations with higher payoff.

You should expoit that stpcpy returns end of string so gcc should
replace

strcpy (x,y)
return strlen (x)

with 

return stpcpy (x,y) - x

Then there is famous problem of quadratic performance of strcat loop.
That is something that gcc could eliminate with tracking end of string.

Also there was discussion on libc-alpha about adding stpcat. That would
also save some end of string calculations.

Otherwise strcat look ok modulo memcpy related problems.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: strcmp, strncmp, memcmp.
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
@ 2015-05-25 17:46   ` Ondřej Bílka
  2015-05-25 19:28     ` Gcc builtin review: strpbrk, strspn, strcspn Ondřej Bílka
  2015-05-25 17:52   ` Gcc builtin review: strdup Ondřej Bílka
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 17:46 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

As mentioned in other thread expansion of strcmp and strncmp with
constant has terrible performance. Thankfully there is no optimization
of memcmp.

Then there are strcoll, strxfrm that are rarely used. 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Gcc builtin review: strdup
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
  2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
@ 2015-05-25 17:52   ` Ondřej Bílka
  2015-05-25 18:11   ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 17:52 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

Again there is missed optimization. Consider

int bar()
{
  return strdup("xyz");
}

This could be clearly expanded to

int bar()
{
  char *c = malloc (strlen ("xyz") + 1);
  memcpy (c, "xyz", strlen ("xyz") + 1);
}

One of consequences of this missed optimization is that in

int bar()
{
  char *c = strdup("xyz");
  return c[1];
}

gcc cannot determine that c[1] is y while it can in memcpy case.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Gcc builtin review: strchr, strrchr, strchrnul
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
  2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
  2015-05-25 17:52   ` Gcc builtin review: strdup Ondřej Bílka
@ 2015-05-25 18:11   ` Ondřej Bílka
  2015-05-25 19:15     ` Ondřej Bílka
  2015-05-26  8:00   ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 18:11 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

Then we have strchr, strrchr, strchrnul.

Aside from suggestion of replacing every strchr with strchrnul there are
missed optimizations of strchr(x,0) and strchrnul(x,0). 

A strrchr(x,0) expands just to strchr(x,0) which is not enough as it
should do full expansion to x+strlen(x).

Then you miss case when x is constant as in wordexp. 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: strchr, strrchr, strchrnul
  2015-05-25 18:11   ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
@ 2015-05-25 19:15     ` Ondřej Bílka
  2015-05-26 10:58       ` Gcc builtin review: isinf, insnan Ondřej Bílka
  0 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 19:15 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

On Mon, May 25, 2015 at 02:16:34PM +0200, OndÅ™ej BÃlka wrote:
> Then we have strchr, strrchr, strchrnul.
> 
> Aside from suggestion of replacing every strchr with strchrnul there are
> missed optimizations of strchr(x,0) and strchrnul(x,0). 
> 
> A strrchr(x,0) expands just to strchr(x,0) which is not enough as it
> should do full expansion to x+strlen(x).
> 
> Then you miss case when x is constant as in wordexp. 

Second problem is that you want builtin to evaluate constant strings
arguments. So why gcc could optimize away

int foo(){
  return strchr("foo", 'y');
}

but not when strchr is replaced by rawmemchr and strchrnul?

Also as for effectivity a correct primitive is strchrnul and strchr
should be expanded into it this is problem.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Gcc builtin review: strpbrk, strspn, strcspn
  2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
@ 2015-05-25 19:28     ` Ondřej Bílka
  0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-25 19:28 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

These do relatively good job.

These would be typical use-case for ufuncs. In most use cases accept
argument is consant so you want to precompute it as array as one
implementation computes table with characters and then checks if each
character is in table.

But on x64 there is better implementation which is only case where
sse4.2 instructions are useful in glibc string function.

So you need to do resolving if to expand table based on if cpu uses
sse4.2 which gcc cannot do.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Gcc builtin review: strstr, strcasestr, memmem
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
                     ` (2 preceding siblings ...)
  2015-05-25 18:11   ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
@ 2015-05-26  8:00   ` Ondřej Bílka
  2015-05-26  8:55     ` Ondřej Bílka
  2015-05-26 10:45   ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-26  8:00 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

A gcc does little optimizations, just constant arguments.

And why gcc doesnt optimize memmem as arguments are constant and with
size 1 needle it could use memchr?

#include <string.h>
int foo(char *x)
{
  return memmem ("tauhtatu", 4, "f", 1);
}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: strstr, strcasestr, memmem
  2015-05-26  8:00   ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
@ 2015-05-26  8:55     ` Ondřej Bílka
  0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-26  8:55 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

On Mon, May 25, 2015 at 09:09:23PM +0200, OndÅ™ej BÃlka wrote:
> A gcc does little optimizations, just constant arguments.
> 
> And why gcc doesnt optimize memmem as arguments are constant and with
> size 1 needle it could use memchr?
> 
Forgot to add that these could be candidate to ufunc pattern. While I
added heuristic to delay critical factorization as far as possible it
could be possible to get faster algorithm when you don't have to spend
90% of strstr time to recompute factorization again which was previously
case of small haystacks.

This is also quite unsuitable for gcc pass and as implementation could
change it could easily become obsolete and cause problems.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: strfry, basename, strncpy, memfrob
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
                     ` (3 preceding siblings ...)
  2015-05-26  8:00   ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
@ 2015-05-26 10:45   ` Ondřej Bílka
  2015-05-28 17:02   ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
  2015-06-04 10:34   ` Richard Earnshaw
  6 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-26 10:45 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

And I was most disapointed on lack of gcc support with these functions.

Memfrob is essential security feature but gcc doesn't optimize it at all.

Sadly in strfry gcc could use opportunity of fixing glibc bug but it
doesn't do it.

While gcc optimizes strncpy a bit it doesn't do that enough. It should
expand to sequence of stores for at least size 1000. A strncpy is best
way to copy strings. And what more. You could also use strncpy as memset
by passing "" as second argument.

Also basename should be always inlined as it almost always lies in hot
path followed by opening file.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: isinf, insnan ...
  2015-05-25 19:15     ` Ondřej Bílka
@ 2015-05-26 10:58       ` Ondřej Bílka
  0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-26 10:58 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

I raised this issue before but didn't wrote patch so I should do it now.
I would be silent about glibc as it shares same flaw as gcc.

Main problem that these functions try to be branchless. Which causes
performance regression for most applications versus branched code.

A problem is that predicted branch is free while conditional store
always cost cycle. So you need to have unpredictable branch to get
performance gain. When branch is 95% predicted then branchless code
wouldn't pay for itself if it adds one cycle versus branched and
misprediction costs 20 cycles.

And NaN is quite exceptional value so branches will almost always be
predicted. Otherwise user has other problems, like that if 5% of his
data are NaN's then result will likely be garbage.

Then you have problem that with modern gcc you wont likely save branch.
Most of these functions are surrounded by if. From gcc-4.9 it will
optimize out that branch as its predicated and it results in simpler
code.

More evidence about that is that I took assembly of benchmark below and
changed conditional move to jump which improves performance back by 10%

For showing that I wrote simple example of branched isinf that is around
10% faster than builtin. 

#ifdef BRANCHED
static inline int
isinf (double dx)
{
  union u {
    double d;
    long l;
  };
  union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 : -1) : 0;
}
#endif
int main()
{
  int ret;
  int i, j;
  double *d = malloc (800000);
  for (j=0; j<1000000; j++)
  for (i=0; i<1000; i++)
    if (__builtin_expect(isinf (d[i]),0))
      ret += 42;
  return ret;
}

	.file	"inf.c"
	.section	.text.unlikely,"ax",@progbits
.LCOLDB2:
	.section	.text.startup,"ax",@progbits
.LHOTB2:
	.p2align 4,,15
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movl	$800000, %edi
	call	malloc
	movsd	.LC0(%rip), %xmm2
	leaq	8000(%rax), %rsi
	movsd	.LC1(%rip), %xmm1
	movl	$1000000, %edi
	.p2align 4,,10
	.p2align 3
.L2:
	movq	%rax, %rdx
	.p2align 4,,10
	.p2align 3
.L3:
	movsd	(%rdx), %xmm0
	andpd	%xmm2, %xmm0
	ucomisd	%xmm1, %xmm0
	ja	.LC
	leal	42(%rbx), %ebx
	.LC:
	addq	$8, %rdx
	cmpq	%rsi, %rdx
	jne	.L3
	subl	$1, %edi
	jne	.L2
	movl	%ebx, %eax
	popq	%rbx
	.cfi_def_cfa_offset 8
	ret

	.cfi_endproc
.LFE0:
	.size	main, .-main
	.section	.text.unlikely
.LCOLDE2:
	.section	.text.startup
.LHOTE2:
	.section	.rodata.cst16,"aM",@progbits,16
	.align 16
.LC0:
	.long	4294967295
	.long	2147483647
	.long	0
	.long	0
	.section	.rodata.cst8,"aM",@progbits,8
	.align 8
.LC1:
	.long	4294967295
	.long	2146435071
	.ident	"GCC: (Debian 4.9.2-9) 4.9.2"
	.section	.note.GNU-stack,"",@progbits

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: memchr, memrchr, rawmemchr
  2015-05-25 13:50   ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
@ 2015-05-27  6:36     ` Ondřej Bílka
  0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-27  6:36 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

On Mon, May 25, 2015 at 12:55:40PM +0200, OndÅ™ej BÃlka wrote:
> Gcc doesn't do any optimizations with these but misses several.
> 
Also next optimization is that memchr(s, 0, n) could be converted to
calling strnlen (s, n) and converting result.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: memcpy, mempcpy, memmove.
  2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
  2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
@ 2015-05-28 16:57 ` Joseph Myers
  2015-05-28 17:08   ` Jeff Law
  2 siblings, 1 reply; 34+ messages in thread
From: Joseph Myers @ 2015-05-28 16:57 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

[-- Attachment #1: Type: text/plain, Size: 1841 bytes --]

On Mon, 25 May 2015, OndĂ¸ej BĂlka wrote:

> Andrew as I promised review and what you would need to fix for x64 gcc
>  and remove ugly gcc hacks and replace them with nice headers :-)
> 
> I will take a functions in order from string.h and comment about
> these.

Reviews of optimization issues with GCC built-in functions belong 
primarily in the GCC context, not glibc at all.  That is, file one report 
in GCC Bugzilla for each issue you find that isn't already reported there 
(making sure to give the usual details of GCC version, target, compilation 
options etc.).  File a meta-bug in GCC Bugzilla depending on all those 
bugs (and any pre-existing bugs for issues you found that you didn't 
report because they were already reported there).  Then just send a 
summary to the libc-alpha and gcc lists pointing to that meta-bug.

This applies to most of your reviews of GCC built-in functions - send the 
reports where they belong, and any temporary implementations of 
optimizations in glibc headers need comments pointing to the GCC bug 
reports which point back to those comments.  Only where an optimization 
involves calling glibc-specific functions are compiler optimizations more 
troublesome (and even there, GCC knows when it's compiling for a glibc 
target and knows the minimum glibc version supported on the target, so 
it's entirely reasonable to agree that e.g. __rawmemchr will remain 
available as a public glibc ABI so that GCC can generate code calling it 
on glibc targets unless told not to because e.g. generating kernel code).

> Then you have problem that optimum also depends on profile feedback. If

Which is clearly better handled in the GCC context (GCC has the 
infrastructure to handle profile feedback data from training runs of the 
program) than in headers.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
                     ` (4 preceding siblings ...)
  2015-05-26 10:45   ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
@ 2015-05-28 17:02   ` Joseph Myers
  2015-06-04 10:34   ` Richard Earnshaw
  6 siblings, 0 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-28 17:02 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

[-- Attachment #1: Type: text/plain, Size: 400 bytes --]

On Mon, 25 May 2015, OndĂ¸ej BĂlka wrote:

> You should expoit that stpcpy returns end of string so gcc should
> replace
> 
> strcpy (x,y)
> return strlen (x)
> 
> with 
> 
> return stpcpy (x,y) - x

Subject to namespace issues (we could agree that __stpcpy is a stable ABI 
- won't become a compat symbol - so that GCC can safely generate calls to 
it).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: memcpy, mempcpy, memmove.
  2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
@ 2015-05-28 17:08   ` Jeff Law
  2015-06-04 10:25     ` Ondřej Bílka
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff Law @ 2015-05-28 17:08 UTC (permalink / raw)
  To: Joseph Myers, Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

On 05/28/2015 10:36 AM, Joseph Myers wrote:
> On Mon, 25 May 2015, OndĂ¸ej BĂlka wrote:
>
>> Andrew as I promised review and what you would need to fix for x64 gcc
>>   and remove ugly gcc hacks and replace them with nice headers :-)
>>
>> I will take a functions in order from string.h and comment about
>> these.
>
> Reviews of optimization issues with GCC built-in functions belong
> primarily in the GCC context, not glibc at all.
Agreed.  Trying to address all this stuff in glibc is wrong.  Each issue 
should be evaluated and a determination made individually whether the 
issue should be addressed in gcc or glibc.


Jeff


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: memcpy, mempcpy, memmove.
  2015-05-28 17:08   ` Jeff Law
@ 2015-06-04 10:25     ` Ondřej Bílka
  0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-06-04 10:25 UTC (permalink / raw)
  To: Jeff Law; +Cc: Joseph Myers, libc-alpha, Andrew Pinski

On Thu, May 28, 2015 at 10:41:04AM -0600, Jeff Law wrote:
> On 05/28/2015 10:36 AM, Joseph Myers wrote:
> >On Mon, 25 May 2015, OndÅ™ej BÃlka wrote:
> >
> >>Andrew as I promised review and what you would need to fix for x64 gcc
> >>  and remove ugly gcc hacks and replace them with nice headers :-)
> >>
> >>I will take a functions in order from string.h and comment about
> >>these.
> >
> >Reviews of optimization issues with GCC built-in functions belong
> >primarily in the GCC context, not glibc at all.
> Agreed.  Trying to address all this stuff in glibc is wrong.  Each
> issue should be evaluated and a determination made individually
> whether the issue should be addressed in gcc or glibc.
> 
> 
> Jeff

Ok, I will crosspost issues that needs to be addressed. The main two
problems are that lot of special cases should be handled as ordinary
code, not trying to add extra pass every time you notice that sometimes
you could expand foo into bar and baz. Then there are implementation
details and that correct decision depends on these. For optimizations
you need to know how fast are different libc functions which is easier
checkable inside libc than trying to sync that information in gcc.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
                     ` (5 preceding siblings ...)
  2015-05-28 17:02   ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
@ 2015-06-04 10:34   ` Richard Earnshaw
  2015-06-04 12:33     ` Ondřej Bílka
  6 siblings, 1 reply; 34+ messages in thread
From: Richard Earnshaw @ 2015-06-04 10:34 UTC (permalink / raw)
  To: Ondřej Bílka, libc-alpha; +Cc: Andrew Pinski

On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
> Replaces it with strcpy. One could argue that opposite way to replace
> strcpy with stpcpy is faster.
> 
> Reason is register pressure. Strcpy needs extra register to save return
> value while stpcpy has return value already in register used for writing
> terminating zero.


Depends on your architecture.  On aarch64 we have plenty of spare
registers, so strcpy simply copies the destination register into a
scratch.  It then doesn't have to carefully calculate the return value
at the end of the function (making the tail code simpler - there are
multiple return statements, but only one entry point).

R.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-06-04 10:34   ` Richard Earnshaw
@ 2015-06-04 12:33     ` Ondřej Bílka
  2015-06-04 13:50       ` Richard Earnshaw
  0 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-06-04 12:33 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: libc-alpha, Andrew Pinski

On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
> > Replaces it with strcpy. One could argue that opposite way to replace
> > strcpy with stpcpy is faster.
> > 
> > Reason is register pressure. Strcpy needs extra register to save return
> > value while stpcpy has return value already in register used for writing
> > terminating zero.
> 
> 
> Depends on your architecture.  On aarch64 we have plenty of spare
> registers, so strcpy simply copies the destination register into a
> scratch.  It then doesn't have to carefully calculate the return value
> at the end of the function (making the tail code simpler - there are
> multiple return statements, but only one entry point).
> 
Thats correct, main saving you get is from return value is first register, that
forces needing extra copy which is suboptimal.

I don't have data how strcpy and stpcpy mix and want to know if few
extra cycles are worth it when these aren't called exactly often, I will
try to think how test these.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-06-04 12:33     ` Ondřej Bílka
@ 2015-06-04 13:50       ` Richard Earnshaw
  2015-06-04 14:35         ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
  0 siblings, 1 reply; 34+ messages in thread
From: Richard Earnshaw @ 2015-06-04 13:50 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

On 04/06/15 13:28, OndÅ™ej BÃlka wrote:
> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>> On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
>>> Replaces it with strcpy. One could argue that opposite way to replace
>>> strcpy with stpcpy is faster.
>>>
>>> Reason is register pressure. Strcpy needs extra register to save return
>>> value while stpcpy has return value already in register used for writing
>>> terminating zero.
>>
>>
>> Depends on your architecture.  On aarch64 we have plenty of spare
>> registers, so strcpy simply copies the destination register into a
>> scratch.  It then doesn't have to carefully calculate the return value
>> at the end of the function (making the tail code simpler - there are
>> multiple return statements, but only one entry point).
>>
> Thats correct, main saving you get is from return value is first register, that
> forces needing extra copy which is suboptimal.

No, look at the AArch64 code.  The only time we ever end up with a
simple MOV instruction to copy the register from one location to another
is in the stPcpy code.  In strcpy it always ends up folded into some
other operation that we have to do anyway.  Once it's been copied to
that other register we never have to use it elsewhere again.
Furthermore, the need to handle smallish copies with overlapping stores
means we need both the original base address /and/ the final result, so
we'd still need to end up saving it for stpcpy.

> 
> I don't have data how strcpy and stpcpy mix and want to know if few
> extra cycles are worth it when these aren't called exactly often, I will
> try to think how test these.
> 

Well, stpcpy is Posix, while strcpy is part of ISO C.  That suggests to
me we're far more likely to see folk using strcpy than stpcpy,
especially when they don't care about the result.

R.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC] Aarch64: optimize stpcpy a bit.
  2015-06-04 13:50       ` Richard Earnshaw
@ 2015-06-04 14:35         ` Ondřej Bílka
  2015-06-04 16:41           ` Richard Earnshaw
  0 siblings, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-06-04 14:35 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: libc-alpha, Andrew Pinski

On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
> On 04/06/15 13:28, OndÅ™ej BÃlka wrote:
> > On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> >> On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
> >>> Replaces it with strcpy. One could argue that opposite way to replace
> >>> strcpy with stpcpy is faster.
> >>>
> >>> Reason is register pressure. Strcpy needs extra register to save return
> >>> value while stpcpy has return value already in register used for writing
> >>> terminating zero.
> >>
> >>
> >> Depends on your architecture.  On aarch64 we have plenty of spare
> >> registers, so strcpy simply copies the destination register into a
> >> scratch.  It then doesn't have to carefully calculate the return value
> >> at the end of the function (making the tail code simpler - there are
> >> multiple return statements, but only one entry point).
> >>
> > Thats correct, main saving you get is from return value is first register, that
> > forces needing extra copy which is suboptimal.
> 
> No, look at the AArch64 code.  The only time we ever end up with a
> simple MOV instruction to copy the register from one location to another
> is in the stPcpy code.  In strcpy it always ends up folded into some
> other operation that we have to do anyway.  Once it's been copied to
> that other register we never have to use it elsewhere again.
> Furthermore, the need to handle smallish copies with overlapping stores
> means we need both the original base address /and/ the final result, so
> we'd still need to end up saving it for stpcpy.
>
Wrote too fast, was refering that you would need to copy that on small
address. With dest in different register I could make strcpy and stpcpy
const same  instructions in most cases except size 8-15 by adjusting 
offsets with some constants.

Also if I think that you could remove extra instructions for stpcpy loop 
with following, which also removes one instruction from strcpy if I read
code correctly.

diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
index 28846fb..7ca0412 100644
--- a/sysdeps/aarch64/strcpy.S
+++ b/sysdeps/aarch64/strcpy.S
@@ -204,6 +204,9 @@ L(bulk_entry):
 	sub	to_align, to_align, #16
 	stp	data1, data2, [dstin]
 	sub	src, srcin, to_align
+#ifdef BUILD_STPCPY
+# define dst dstin
+#endif
 	sub	dst, dstin, to_align
 	b	L(entry_no_page_cross)
 
@@ -243,17 +246,16 @@ L(entry_no_page_cross):
 #endif
 	rev	has_nul1, has_nul1
 	clz	pos, has_nul1
-	add	tmp1, pos, #72
-	add	pos, pos, #8
+	add	tmp1, pos, #64
 	csel	pos, pos, tmp1, ne
 	add	src, src, pos, lsr #3
 	add	dst, dst, pos, lsr #3
-	ldp	data1, data2, [src, #-32]
-	stp	data1, data2, [dst, #-16]
+	ldp	data1, data2, [src, #-31]
+	stp	data1, data2, [dst, #-15]
+	ret
 #ifdef BUILD_STPCPY
-	sub	dstin, dst, #1
+# undef dst
 #endif
-	ret
 
 L(page_cross):
 	bic	src, srcin, #15

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC] Aarch64: optimize stpcpy a bit.
  2015-06-04 14:35         ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
@ 2015-06-04 16:41           ` Richard Earnshaw
  0 siblings, 0 replies; 34+ messages in thread
From: Richard Earnshaw @ 2015-06-04 16:41 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

On 04/06/15 15:16, OndÅ™ej BÃlka wrote:
> On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
>> On 04/06/15 13:28, OndÅ™ej BÃlka wrote:
>>> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>>>> On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
>>>>> Replaces it with strcpy. One could argue that opposite way to replace
>>>>> strcpy with stpcpy is faster.
>>>>>
>>>>> Reason is register pressure. Strcpy needs extra register to save return
>>>>> value while stpcpy has return value already in register used for writing
>>>>> terminating zero.
>>>>
>>>>
>>>> Depends on your architecture.  On aarch64 we have plenty of spare
>>>> registers, so strcpy simply copies the destination register into a
>>>> scratch.  It then doesn't have to carefully calculate the return value
>>>> at the end of the function (making the tail code simpler - there are
>>>> multiple return statements, but only one entry point).
>>>>
>>> Thats correct, main saving you get is from return value is first register, that
>>> forces needing extra copy which is suboptimal.
>>
>> No, look at the AArch64 code.  The only time we ever end up with a
>> simple MOV instruction to copy the register from one location to another
>> is in the stPcpy code.  In strcpy it always ends up folded into some
>> other operation that we have to do anyway.  Once it's been copied to
>> that other register we never have to use it elsewhere again.
>> Furthermore, the need to handle smallish copies with overlapping stores
>> means we need both the original base address /and/ the final result, so
>> we'd still need to end up saving it for stpcpy.
>>
> Wrote too fast, was refering that you would need to copy that on small
> address. With dest in different register I could make strcpy and stpcpy
> const same  instructions in most cases except size 8-15 by adjusting 
> offsets with some constants.
> 
> Also if I think that you could remove extra instructions for stpcpy loop 
> with following, which also removes one instruction from strcpy if I read
> code correctly.
> 
> diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
> index 28846fb..7ca0412 100644
> --- a/sysdeps/aarch64/strcpy.S
> +++ b/sysdeps/aarch64/strcpy.S
> @@ -204,6 +204,9 @@ L(bulk_entry):
>  	sub	to_align, to_align, #16
>  	stp	data1, data2, [dstin]
>  	sub	src, srcin, to_align
> +#ifdef BUILD_STPCPY
> +# define dst dstin
> +#endif
>  	sub	dst, dstin, to_align
>  	b	L(entry_no_page_cross)
>  
> @@ -243,17 +246,16 @@ L(entry_no_page_cross):
>  #endif
>  	rev	has_nul1, has_nul1
>  	clz	pos, has_nul1
> -	add	tmp1, pos, #72
> -	add	pos, pos, #8
> +	add	tmp1, pos, #64
>  	csel	pos, pos, tmp1, ne
>  	add	src, src, pos, lsr #3
>  	add	dst, dst, pos, lsr #3
> -	ldp	data1, data2, [src, #-32]
> -	stp	data1, data2, [dst, #-16]
> +	ldp	data1, data2, [src, #-31]
> +	stp	data1, data2, [dst, #-15]

That's not valid, the offset has to be a multiple of the register size
(8 in this case).
> +	ret
>  #ifdef BUILD_STPCPY
> -	sub	dstin, dst, #1
> +# undef dst

Nor is this, dst is already a #define, so this leaves it unspecified.
But since you can't avoid the late subtract, that's irrelevant anyway.

R.

>  #endif
> -	ret
>  
>  L(page_cross):
>  	bic	src, srcin, #15
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Gcc builtin review: isinf, insnan ...
  2015-05-29 20:58         ` Wilco Dijkstra
@ 2015-05-30 19:14           ` Joseph Myers
  0 siblings, 0 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-30 19:14 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Ondřej Bílka, libc-alpha, pinskia

On Fri, 29 May 2015, Wilco Dijkstra wrote:

> Note another issue is that GCC tends to use FP arithmetic with -ffast-math
> for FP related built-ins (even when it uses integer arithmetic with -O2/-O3), 
> presumably on the assumption that FP calculations are faster. This is not true
> on most CPUs but it spectacularly backfires when using software floating point...
> 
> int test(double x)
> {
>   return __builtin_signbit (x);
> }
> 
> -O3 -mfloat-abi=soft:
>         and     r0, r1, #-2147483648
>         bx      lr
> 
> -Ofast -mfloat-abi=soft:
>         push    {r3, lr}
>         movs    r2, #0
>         movs    r3, #0
>         bl      __aeabi_dcmplt
>         adds    r0, r0, #0
>         it      ne
>         movne   r0, #1
>         pop     {r3, pc}

GCC bug report number?  I don't think this is an assumption that FP 
calculations are faster; it seems a simple inconsistency between tree 
folding (prefers expanding to a comparison when no signed zeroes) and RTL 
expansion (prefers bit manipulation whenever a read-only sign bit is 
available, which it always is for all currently-supported floating-point 
formats).

> Agreed, I'd much prefer the GCC builtins doing the right thing. However what 
> to do when they don't? Assuming we agree to just support inline functions in 
> headers from GCC4.0 onwards, given the built-ins in GCC are not good enough, 
> we still need explicit optimized implementations for all of signbit, isinf etc
> even if they were fixed in GCC6.

Well, if you want a -ffast-math version of signbit (commented to be 
working around a particular, preferably fixed, GCC bug, with __GNUC_PREREQ 
conditionals) - if that's considered sufficiently important - at least it 
ought to be architecture-independent (maybe one bits/ header per 
floating-point format) rather than duplicated per-architecture.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Gcc builtin review: isinf, insnan ...
  2015-05-29 14:26       ` Joseph Myers
@ 2015-05-29 20:58         ` Wilco Dijkstra
  2015-05-30 19:14           ` Joseph Myers
  0 siblings, 1 reply; 34+ messages in thread
From: Wilco Dijkstra @ 2015-05-29 20:58 UTC (permalink / raw)
  To: 'Joseph Myers', Ondřej Bílka; +Cc: libc-alpha, pinskia

> Joseph Myers wrote:
> On Fri, 29 May 2015, Ondřej Bílka wrote:
> 
> > On Thu, May 28, 2015 at 05:49:05PM +0000, Joseph Myers wrote:
> > > On Wed, 27 May 2015, Wilco Dijkstra wrote:
> > >
> > > It's not obvious that integer arithmetic is optimal in all cases (if the
> > > implementation is otherwise correct given the compiler options passed).
> > > For example, if the argument is in a floating-point register and an
> > > integer implementation would involve the (architecture-specific) costs of
> > > moving it to an integer register - obviously something that can only be
> > > decided quite late in code generation.
> > >
> > Naturally one should check which implementation is best. Could you send
> > link to numeric application and how to measure it that was mentioned
> > before?
> 
> I have no idea which is best - all I assert here is that it's not obvious
> that "should not be used until they are fixed to use integer arithmetic"
> is true (any patch to use the GCC built-in functions should come with
> benchmark data as well as information on correctness).  It may well be the
> case that changing the GCC built-in functions to use integer arithmetic is
> unconditionally beneficial; such a change obviously needs benchmark data
> (in a way that changing for correctness with -fsignaling-nans only
> wouldn't).

Note another issue is that GCC tends to use FP arithmetic with -ffast-math
for FP related built-ins (even when it uses integer arithmetic with -O2/-O3), 
presumably on the assumption that FP calculations are faster. This is not true
on most CPUs but it spectacularly backfires when using software floating point...

int test(double x)
{
  return __builtin_signbit (x);
}

-O3 -mfloat-abi=soft:
        and     r0, r1, #-2147483648
        bx      lr

-Ofast -mfloat-abi=soft:
        push    {r3, lr}
        movs    r2, #0
        movs    r3, #0
        bl      __aeabi_dcmplt
        adds    r0, r0, #0
        it      ne
        movne   r0, #1
        pop     {r3, pc}

So these built-ins "should not be used until they are fixed to use integer 
arithmetic" :-)

> All other things being equal, use of type-generic built-in functions has
> the advantage that the macro expansion text only contains the argument
> text once - generally a good idea when macro calls may be nested inside
> arguments to other macros.

Agreed, I'd much prefer the GCC builtins doing the right thing. However what 
to do when they don't? Assuming we agree to just support inline functions in 
headers from GCC4.0 onwards, given the built-ins in GCC are not good enough, 
we still need explicit optimized implementations for all of signbit, isinf etc
even if they were fixed in GCC6.

Wilco


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: isinf, insnan ...
  2015-05-29 14:02     ` Ondřej Bílka
@ 2015-05-29 14:26       ` Joseph Myers
  2015-05-29 20:58         ` Wilco Dijkstra
  0 siblings, 1 reply; 34+ messages in thread
From: Joseph Myers @ 2015-05-29 14:26 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: Wilco Dijkstra, libc-alpha, pinskia

[-- Attachment #1: Type: text/plain, Size: 1565 bytes --]

On Fri, 29 May 2015, OndĂ¸ej BĂlka wrote:

> On Thu, May 28, 2015 at 05:49:05PM +0000, Joseph Myers wrote:
> > On Wed, 27 May 2015, Wilco Dijkstra wrote:
> > 
> > It's not obvious that integer arithmetic is optimal in all cases (if the 
> > implementation is otherwise correct given the compiler options passed).  
> > For example, if the argument is in a floating-point register and an 
> > integer implementation would involve the (architecture-specific) costs of 
> > moving it to an integer register - obviously something that can only be 
> > decided quite late in code generation.
> > 
> Naturally one should check which implementation is best. Could you send
> link to numeric application and how to measure it that was mentioned
> before?

I have no idea which is best - all I assert here is that it's not obvious 
that "should not be used until they are fixed to use integer arithmetic" 
is true (any patch to use the GCC built-in functions should come with 
benchmark data as well as information on correctness).  It may well be the 
case that changing the GCC built-in functions to use integer arithmetic is 
unconditionally beneficial; such a change obviously needs benchmark data 
(in a way that changing for correctness with -fsignaling-nans only 
wouldn't).

All other things being equal, use of type-generic built-in functions has 
the advantage that the macro expansion text only contains the argument 
text once - generally a good idea when macro calls may be nested inside 
arguments to other macros.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: isinf, insnan ...
  2015-05-28 19:33   ` Joseph Myers
  2015-05-29 13:22     ` Wilco Dijkstra
@ 2015-05-29 14:02     ` Ondřej Bílka
  2015-05-29 14:26       ` Joseph Myers
  1 sibling, 1 reply; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-29 14:02 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Wilco Dijkstra, libc-alpha, pinskia

On Thu, May 28, 2015 at 05:49:05PM +0000, Joseph Myers wrote:
> On Wed, 27 May 2015, Wilco Dijkstra wrote:
> 
> It's not obvious that integer arithmetic is optimal in all cases (if the 
> implementation is otherwise correct given the compiler options passed).  
> For example, if the argument is in a floating-point register and an 
> integer implementation would involve the (architecture-specific) costs of 
> moving it to an integer register - obviously something that can only be 
> decided quite late in code generation.
> 
Naturally one should check which implementation is best. Could you send
link to numeric application and how to measure it that was mentioned
before?

I have bit doubt that floating point would do that well but it needs to
be tested. Versus a cost of moving float to integer register you also
have a cost of loading floating constant which could be as expensive how
do you tell which is faster? Integer computation may even help as due to
pipelining it would be executed in parallel while floating ones would be
sequential.

Also you run into diminishing returns. While you could get lot by using
header a specific optimizations could only rarely save one cycle so are
they worth maintainance burden.  

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Gcc builtin review: isinf, insnan ...
  2015-05-29 13:22     ` Wilco Dijkstra
  2015-05-29 13:25       ` Joseph Myers
@ 2015-05-29 13:30       ` Joseph Myers
  1 sibling, 0 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-29 13:30 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: 'Ondřej Bílka', libc-alpha, pinskia

On Fri, 29 May 2015, Wilco Dijkstra wrote:

> so shouldn't be the default. Both GCC and GLIBC are too conservative and don't
> generate quality floating point code by default. And despite conservative and 

The very slow multiple-precision paths in some dbl-64 functions that 
attempt to be correctly rounding should simply be removed as out of scope 
for the goals of those functions, subject to error analysis showing that 
the previous paths in the code always generate results within glibc's 
accuracy goals (a few ulps).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Gcc builtin review: isinf, insnan ...
  2015-05-29 13:22     ` Wilco Dijkstra
@ 2015-05-29 13:25       ` Joseph Myers
  2015-05-29 13:30       ` Joseph Myers
  1 sibling, 0 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-29 13:25 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: 'Ondřej Bílka', libc-alpha, pinskia

On Fri, 29 May 2015, Wilco Dijkstra wrote:

> > It's certainly the case that substantial GCC work is needed to make
> > floating-point properly follow even C99/C11 Annex F regarding exceptions /
> > rounding modes, let alone make signaling NaNs follow TS 18661-1, though
> > the signaling NaNs fixes would probably be more straightforward (reviewing
> > lots of code and making local changes, rather than any tricky design
> > issues involved for other floating-point areas).  While I think such
> > support would be desirable, it also seems there's limited demand for it.
> 
> IMHO support for different rounding modes, trapping exceptions and signalling
> NaNs are not required by many (most ARM CPUs don't even support FP traps...), 

-ftrapping-math (default) is nothing to do with traps in the sense of 
SIGFPE or change of flow of control, only about the raising of exception 
flags that can be tested with <fenv.h> interfaces.  Annex F and TS 18661 
also have nothing to with the traps (part 5 deals with alternate exception 
handling, but I doubt processor-level traps will be particularly useful in 
implementing it).  -frounding-math and -fsignaling-nans are not enabled by 
default.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Gcc builtin review: isinf, insnan ...
  2015-05-28 19:33   ` Joseph Myers
@ 2015-05-29 13:22     ` Wilco Dijkstra
  2015-05-29 13:25       ` Joseph Myers
  2015-05-29 13:30       ` Joseph Myers
  2015-05-29 14:02     ` Ondřej Bílka
  1 sibling, 2 replies; 34+ messages in thread
From: Wilco Dijkstra @ 2015-05-29 13:22 UTC (permalink / raw)
  To: 'Joseph Myers'
  Cc: 'Ondřej Bílka', libc-alpha, pinskia

> Joseph Myers wrote:
> On Wed, 27 May 2015, Wilco Dijkstra wrote:
> 
> > Note the GCC built-ins are actually incorrect and should not be used until
> > they are fixed to use integer arithmetic. The GLIBC versions are never
> 
> GCC bug report number?  Is the problem something other than general issues
> with -fsignaling-nans not making everything correct for the presence of
> signaling NaNs (as documented)?  If there are signaling NaN issues but not
> other issues, the built-in functions could be used conditional on
> __SUPPORT_SNAN__ to fix bugs 15367 and 17441, provided (a) comments
> referred to applicable GCC bug reports for any signaling NaN issues, with
> those bug reports referring back to getting the glibc comments updated
> when fixed in GCC, and (b) if uses in glibc rely on proper signaling NaN
> handling, glibc were made to build with -fsignaling-nans (not the default
> in GCC).

Yes I meant the signalling NaN issue. 

> It's certainly the case that substantial GCC work is needed to make
> floating-point properly follow even C99/C11 Annex F regarding exceptions /
> rounding modes, let alone make signaling NaNs follow TS 18661-1, though
> the signaling NaNs fixes would probably be more straightforward (reviewing
> lots of code and making local changes, rather than any tricky design
> issues involved for other floating-point areas).  While I think such
> support would be desirable, it also seems there's limited demand for it.

IMHO support for different rounding modes, trapping exceptions and signalling
NaNs are not required by many (most ARM CPUs don't even support FP traps...), 
so shouldn't be the default. Both GCC and GLIBC are too conservative and don't
generate quality floating point code by default. And despite conservative and 
slow, the produced code is still not good enough to fully support the above 
features, so everybody pays the price of slow code for no useful benefit...

I mostly use -Ofast as that's the only way to get half decent FP code out of GCC. 
GLIBC should also have a mode with fastmath versions of all the math functions.

> It's not obvious that integer arithmetic is optimal in all cases (if the
> implementation is otherwise correct given the compiler options passed).
> For example, if the argument is in a floating-point register and an
> integer implementation would involve the (architecture-specific) costs of
> moving it to an integer register - obviously something that can only be
> decided quite late in code generation.

For simple bitmanipulation functions like isinf/isnan etc you can't get anywhere 
near the latency of integer arithmetic on most targets. Note that while FP->int 
moves may be expensive, FP comparisons also move the condition code to the integer
side, and thus have similar latency. If you do multiple isinf and isnan tests on 
the same value (as is common in GLIBC) then the FP->int moves will be CSEd while 
comparisons cannot.

However it would be possible to give targets the option to choose the GCC
builtins if signalling NaNs are disabled - perhaps that might be a way to get
inlining enabled quickly using a simple patch (for the integer versions we'd
need to move the various EXTRACT_WORDS macros into math.h which may be tricky).

Wilco

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Gcc builtin review: isinf, insnan ...
  2015-05-27 17:55 ` Gcc builtin review: isinf, insnan Wilco Dijkstra
  2015-05-27 21:07   ` Julian Taylor
@ 2015-05-28 19:33   ` Joseph Myers
  2015-05-29 13:22     ` Wilco Dijkstra
  2015-05-29 14:02     ` Ondřej Bílka
  1 sibling, 2 replies; 34+ messages in thread
From: Joseph Myers @ 2015-05-28 19:33 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: 'Ondřej Bílka', libc-alpha, pinskia

On Wed, 27 May 2015, Wilco Dijkstra wrote:

> Note the GCC built-ins are actually incorrect and should not be used until
> they are fixed to use integer arithmetic. The GLIBC versions are never

GCC bug report number?  Is the problem something other than general issues 
with -fsignaling-nans not making everything correct for the presence of 
signaling NaNs (as documented)?  If there are signaling NaN issues but not 
other issues, the built-in functions could be used conditional on 
__SUPPORT_SNAN__ to fix bugs 15367 and 17441, provided (a) comments 
referred to applicable GCC bug reports for any signaling NaN issues, with 
those bug reports referring back to getting the glibc comments updated 
when fixed in GCC, and (b) if uses in glibc rely on proper signaling NaN 
handling, glibc were made to build with -fsignaling-nans (not the default 
in GCC).

It's certainly the case that substantial GCC work is needed to make 
floating-point properly follow even C99/C11 Annex F regarding exceptions / 
rounding modes, let alone make signaling NaNs follow TS 18661-1, though 
the signaling NaNs fixes would probably be more straightforward (reviewing 
lots of code and making local changes, rather than any tricky design 
issues involved for other floating-point areas).  While I think such 
support would be desirable, it also seems there's limited demand for it.

It's not obvious that integer arithmetic is optimal in all cases (if the 
implementation is otherwise correct given the compiler options passed).  
For example, if the argument is in a floating-point register and an 
integer implementation would involve the (architecture-specific) costs of 
moving it to an integer register - obviously something that can only be 
decided quite late in code generation.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: isinf, insnan ...
  2015-05-27 21:07   ` Julian Taylor
@ 2015-05-27 21:38     ` Ondřej Bílka
  0 siblings, 0 replies; 34+ messages in thread
From: Ondřej Bílka @ 2015-05-27 21:38 UTC (permalink / raw)
  To: Julian Taylor; +Cc: Wilco Dijkstra, libc-alpha, pinskia

On Wed, May 27, 2015 at 07:15:32PM +0200, Julian Taylor wrote:
> On 27.05.2015 16:57, Wilco Dijkstra wrote:
> > OndÅ™ej BÃlka wrote:
> >> I raised this issue before but didn't wrote patch so I should do it now.
> >> I would be silent about glibc as it shares same flaw as gcc.
> >>
> >> Main problem that these functions try to be branchless. Which causes
> >> performance regression for most applications versus branched code.
> > 
> > Being branchless is one issue indeed but the main issue they are never
> > inlined on any target as GLIBC headers explicitly disable inlining by GCC.
> > 
> 
> 
> This is a huge problem for almost all numerical applications that don't
> care about signaling nans, so basically all.
> I had to fix too many applications having poor performance due to that
> (e.g.
> http://yarikoptic.github.io/numpy-vbench/vb_vb_ufunc.html#numpy-isnan-a-10types)
> 
> I would love when glibc would somehow allow programs to use the builtin
> via the isnan function in an easy way, so not via compiler specific
> compiler flags like -fno-signaling-nan or preprocessor compiler version
> checks.
> Maybe alway direct to the builtin/fastest possible non signaling
> implementation when some GIVE_ME_FAST_BOOL_ISNAN is defined before
> including <math.h>?
> 
> Also note that gcc has updated their documentation which might allow to
> now use __builtin_isinf
> (https://sourceware.org/bugzilla/show_bug.cgi?id=15367)

A simple inclusion of math.h should be enough. As now redirecting to
builtin would be slower than offering one in header which would also fix
signaling issue.

Only reason when nonconstant builtin would make sense would be if gcc
started to analyze floats to prove that operations are finite. It
doesn't do that. It keeps isinf even in simplest cases like

int 
foo (double d)
{
  if (d < 11) return 1;
  if (d > 444) return 2;
  if (isinf (d)) return 3;
  return 4;
}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Gcc builtin review: isinf, insnan ...
  2015-05-27 17:55 ` Gcc builtin review: isinf, insnan Wilco Dijkstra
@ 2015-05-27 21:07   ` Julian Taylor
  2015-05-27 21:38     ` Ondřej Bílka
  2015-05-28 19:33   ` Joseph Myers
  1 sibling, 1 reply; 34+ messages in thread
From: Julian Taylor @ 2015-05-27 21:07 UTC (permalink / raw)
  To: Wilco Dijkstra, 'Ondřej Bílka'; +Cc: libc-alpha, pinskia

On 27.05.2015 16:57, Wilco Dijkstra wrote:
> OndĂ¸ej BĂlka wrote:
>> I raised this issue before but didn't wrote patch so I should do it now.
>> I would be silent about glibc as it shares same flaw as gcc.
>>
>> Main problem that these functions try to be branchless. Which causes
>> performance regression for most applications versus branched code.
> 
> Being branchless is one issue indeed but the main issue they are never
> inlined on any target as GLIBC headers explicitly disable inlining by GCC.
> 

This is a huge problem for almost all numerical applications that don't
care about signaling nans, so basically all.
I had to fix too many applications having poor performance due to that
(e.g.
http://yarikoptic.github.io/numpy-vbench/vb_vb_ufunc.html#numpy-isnan-a-10types)

I would love when glibc would somehow allow programs to use the builtin
via the isnan function in an easy way, so not via compiler specific
compiler flags like -fno-signaling-nan or preprocessor compiler version
checks.
Maybe alway direct to the builtin/fastest possible non signaling
implementation when some GIVE_ME_FAST_BOOL_ISNAN is defined before
including <math.h>?

Also note that gcc has updated their documentation which might allow to
now use __builtin_isinf
(https://sourceware.org/bugzilla/show_bug.cgi?id=15367)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: Gcc builtin review: isinf, insnan ...
       [not found] <A610E03AD50BFC4D95529A36D37FA55E769AD3FAC3@GEORGE.Emea.Arm.com>
@ 2015-05-27 17:55 ` Wilco Dijkstra
  2015-05-27 21:07   ` Julian Taylor
  2015-05-28 19:33   ` Joseph Myers
  0 siblings, 2 replies; 34+ messages in thread
From: Wilco Dijkstra @ 2015-05-27 17:55 UTC (permalink / raw)
  To: 'Ondřej Bílka'; +Cc: libc-alpha, pinskia

Ondřej Bílka wrote:
> I raised this issue before but didn't wrote patch so I should do it now.
> I would be silent about glibc as it shares same flaw as gcc.
> 
> Main problem that these functions try to be branchless. Which causes
> performance regression for most applications versus branched code.

Being branchless is one issue indeed but the main issue they are never
inlined on any target as GLIBC headers explicitly disable inlining by GCC.

> A problem is that predicted branch is free while conditional store
> always cost cycle. So you need to have unpredictable branch to get
> performance gain. When branch is 95% predicted then branchless code
> wouldn't pay for itself if it adds one cycle versus branched and
> misprediction costs 20 cycles.
> 
> And NaN is quite exceptional value so branches will almost always be
> predicted. Otherwise user has other problems, like that if 5% of his
> data are NaN's then result will likely be garbage.
> 
> Then you have problem that with modern gcc you wont likely save branch.
> Most of these functions are surrounded by if. From gcc-4.9 it will
> optimize out that branch as its predicated and it results in simpler
> code.
> 
> More evidence about that is that I took assembly of benchmark below and
> changed conditional move to jump which improves performance back by 10%
> 
> For showing that I wrote simple example of branched isinf that is around
> 10% faster than builtin.

Note the GCC built-ins are actually incorrect and should not be used until
they are fixed to use integer arithmetic. The GLIBC versions are never
inlined on any target, and adding generic inline implementations gives a 4-6
times speedup. Isnan, isnormal, isfinite, issignalling are equally trivial, 
needing ~3-4 instructions. An optimized fpclassify implementation seems
small enough to be fully inlineable (useful given it is used in lots of
complex math functions), however it could be partially inlined like:

__glibc_likely(isnormal(x)) ? FP_NORMAL : __fpclassify(x)

Just checking, are you planning to post patches for these?

Wilco


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2015-06-04 15:44 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
2015-05-25 13:50   ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
2015-05-27  6:36     ` Ondřej Bílka
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
2015-05-25 19:28     ` Gcc builtin review: strpbrk, strspn, strcspn Ondřej Bílka
2015-05-25 17:52   ` Gcc builtin review: strdup Ondřej Bílka
2015-05-25 18:11   ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
2015-05-25 19:15     ` Ondřej Bílka
2015-05-26 10:58       ` Gcc builtin review: isinf, insnan Ondřej Bílka
2015-05-26  8:00   ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
2015-05-26  8:55     ` Ondřej Bílka
2015-05-26 10:45   ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
2015-05-28 17:02   ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
2015-06-04 10:34   ` Richard Earnshaw
2015-06-04 12:33     ` Ondřej Bílka
2015-06-04 13:50       ` Richard Earnshaw
2015-06-04 14:35         ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
2015-06-04 16:41           ` Richard Earnshaw
2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
2015-05-28 17:08   ` Jeff Law
2015-06-04 10:25     ` Ondřej Bílka
     [not found] <A610E03AD50BFC4D95529A36D37FA55E769AD3FAC3@GEORGE.Emea.Arm.com>
2015-05-27 17:55 ` Gcc builtin review: isinf, insnan Wilco Dijkstra
2015-05-27 21:07   ` Julian Taylor
2015-05-27 21:38     ` Ondřej Bílka
2015-05-28 19:33   ` Joseph Myers
2015-05-29 13:22     ` Wilco Dijkstra
2015-05-29 13:25       ` Joseph Myers
2015-05-29 13:30       ` Joseph Myers
2015-05-29 14:02     ` Ondřej Bílka
2015-05-29 14:26       ` Joseph Myers
2015-05-29 20:58         ` Wilco Dijkstra
2015-05-30 19:14           ` Joseph Myers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).