Gcc builtin review: memcpy, mempcpy, memmove.

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* Gcc builtin review: memcpy, mempcpy, memmove.
@ 2015-05-25 12:18 Ondřej Bílka
  2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 12:18 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

Andrew as I promised review and what you would need to fix for x64 gcc
 and remove ugly gcc hacks and replace them with nice headers :-)

I will take a functions in order from string.h and comment about
these.

Lose all hope, these are imposible to optimize purely with gcc.

These are hardest functions to optimize. A optimal implementation
depends on cpu and size, whether inputs are likely in L1 or L2 or L3
cache  and relationship is quite chaotic.

With gcc unless you start doing per-cpu function cloning and optimize
for each cpu (which would be good idea if you could convince maintainers
to deal with size bloat) then there will always be some cpu where you
lose.

Its because there are loop A and B and A is 50% faster than B on
cpu 1 but B is 50% faster on cpu 2

A main weakness for current cpus lies in 100-1000 byte range. A in
following benchmark an -march=native gcc expansion is outperformed 
by at least 10% with obsolete memcpy_sse2 from glibc-2.13 on 
sandy_bridge and haswell

As memcpy_sse2 uses only 8-byte loads that speaks for itself and newer
implementations are faster.

A benchmark to check instrinct is following. 
Compile a.c with -mstringop-strategy=libcall and you will see
performance improvement. Of course compile a.c and main separately to
avoid optimizing memcpy out.

a.c:

#include <string.h>
int memcpy2(char *c, char *s)
{
  return memcpy(c, s, 200);
}

#include <string.h>
extern int n;
int memcpy2(char *c, char *s)
{
  return memcpy(c, s, n);
}

main.c:

int n = 200;
int main(){
  int i;
  char u[40000], v[40000];
  for (i=0;i<1000000;i++)
    memcpy2(u + 16 * ((unsigned int) rand ()) % (1 << 10), v + + 16 * ((unsigned int) rand ()) % (1 << 10));
}

We dealt with suboptimal selection. 

Then you have problem that optimum also depends on profile feedback. If
you fixed previous selection it would be still wrong if you know that
input is mainly in L3 cache. For some reason rep stosq is by far best
implementation for these inputs in almost any intel cpu.

Then next problem is code size, I filled bug about that. Gcc does too 
excessive expansion.
memcpy(c, s, 120);
gets expanded to 125 byte sequence of movs 

While it may improve overall performance most of these expansions are
wrong as they lie on cold path and could easily introduce extra 60 cycle
penalty for fetching instructions.

A ugly problem is that upto certain size gcc needs to expand these for
optimizations to work, one wants to be able to determine that b[1] is b
in following fragment and do constant propagation of copied structures.
A question is after what size that doesn't make sense.

char *a = "abc";
memcpy(b,a,3);
return b[1];

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Gcc builtin review: memset
  2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
@ 2015-05-25 12:49 ` Ondřej Bílka
  2015-05-25 13:50   ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
  2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
  2 siblings, 1 reply; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 12:49 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

First quick statement that memccpy is ok as gcc doesn't do anything.

Then memset suffers for same flaws as memcpy. If modify previous memcpy
benchmark to following then for size 120 and obsolete glibc-2.13 you
have an around 25% performance regression.

a.c:
#include <string.h>
int memcpy2(char *c, char *s)
{
  return memset(c, 0, 120);
}

An size bloat is lesser.

Then there is optimization that on some archs you should call
__bzero (s, n) instead memset (s, 0, n). A effective __bzero consist of
small prologue that jumps to memset after initialization. It saves
cycles of creating mask from c and passing additional argument.

A main diadvantage would be renaming as bzero doesn't return anything
but x64 does return same value as memset.

Also as I mentioned ufunc a memset is target to get better performance
without troubles that you couldn't select cpu. Note that on haswell
using avx2 instincs could improve performance. How are you going exploit
that with gcc?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Gcc builtin review: memchr, memrchr, rawmemchr
  2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
@ 2015-05-25 13:50   ` Ondřej Bílka
  2015-05-27  6:36     ` Ondřej Bílka
  0 siblings, 1 reply; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 13:50 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

Gcc doesn't do any optimizations with these but misses several.

I already submitted that rawmemchr (s, 0) is silly, you should use
s + strlen (s) as strlen is better optimized (AFAIR around 50% faster
than rawmemchr)

Then gcc also doesn't optimize memchr with n small constant which is
possible but bit obscure. I could get similar performance improvement as
strcmp question is if its worth mainatainence. But fell free to add gcc
optimization for that.

Another use case is how wordexp abuses strchr/memchr to test membership
in set. For example it checked whitespace with

if (strchr (" \t\r\n", c))

While this would be relatively easy to implement with header and would
be candidate for ufunc calculating table in gcc would be really ugly
hack.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
  2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
@ 2015-05-25 13:53 ` Ondřej Bílka
  2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
                     ` (6 more replies)
  2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
  2 siblings, 7 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 13:53 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

I will omit expansion to memcpy and flaws from previous thread.

First problem is that gcc in

int foo (char *c, char *c2){
  stpcpy (c, c2);
}

Replaces it with strcpy. One could argue that opposite way to replace
strcpy with stpcpy is faster.

Reason is register pressure. Strcpy needs extra register to save return
value while stpcpy has return value already in register used for writing
terminating zero. 

Also to decrease instruction cache pressure you should always call
stpcpy and newer strcpy. You can't do it other way as you would need
expensive strlen. A pressure is problem as strcpy tend to be relatively rarely
used.

A possible advantage of strcpy is that gcc4.9 started to exploit that it
returns x saving register spill in caller. I don't know how is that
useful.

With stpcpy and strcat there are more optimizations with higher payoff.

You should expoit that stpcpy returns end of string so gcc should
replace

strcpy (x,y)
return strlen (x)

with 

return stpcpy (x,y) - x

Then there is famous problem of quadratic performance of strcat loop.
That is something that gcc could eliminate with tracking end of string.

Also there was discussion on libc-alpha about adding stpcat. That would
also save some end of string calculations.

Otherwise strcat look ok modulo memcpy related problems.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: strcmp, strncmp, memcmp.
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
@ 2015-05-25 17:46   ` Ondřej Bílka
  2015-05-25 19:28     ` Gcc builtin review: strpbrk, strspn, strcspn Ondřej Bílka
  2015-05-25 17:52   ` Gcc builtin review: strdup Ondřej Bílka
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 17:46 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

As mentioned in other thread expansion of strcmp and strncmp with
constant has terrible performance. Thankfully there is no optimization
of memcmp.

Then there are strcoll, strxfrm that are rarely used. 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Gcc builtin review: strdup
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
  2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
@ 2015-05-25 17:52   ` Ondřej Bílka
  2015-05-25 18:11   ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 17:52 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

Again there is missed optimization. Consider

int bar()
{
  return strdup("xyz");
}

This could be clearly expanded to

int bar()
{
  char *c = malloc (strlen ("xyz") + 1);
  memcpy (c, "xyz", strlen ("xyz") + 1);
}

One of consequences of this missed optimization is that in

int bar()
{
  char *c = strdup("xyz");
  return c[1];
}

gcc cannot determine that c[1] is y while it can in memcpy case.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Gcc builtin review: strchr, strrchr, strchrnul
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
  2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
  2015-05-25 17:52   ` Gcc builtin review: strdup Ondřej Bílka
@ 2015-05-25 18:11   ` Ondřej Bílka
  2015-05-25 19:15     ` Ondřej Bílka
  2015-05-26  8:00   ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 18:11 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

Then we have strchr, strrchr, strchrnul.

Aside from suggestion of replacing every strchr with strchrnul there are
missed optimizations of strchr(x,0) and strchrnul(x,0). 

A strrchr(x,0) expands just to strchr(x,0) which is not enough as it
should do full expansion to x+strlen(x).

Then you miss case when x is constant as in wordexp. 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: strchr, strrchr, strchrnul
  2015-05-25 18:11   ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
@ 2015-05-25 19:15     ` Ondřej Bílka
  2015-05-26 10:58       ` Gcc builtin review: isinf, insnan Ondřej Bílka
  0 siblings, 1 reply; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 19:15 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

On Mon, May 25, 2015 at 02:16:34PM +0200, OndÅ™ej BÃlka wrote:
> Then we have strchr, strrchr, strchrnul.
> 
> Aside from suggestion of replacing every strchr with strchrnul there are
> missed optimizations of strchr(x,0) and strchrnul(x,0). 
> 
> A strrchr(x,0) expands just to strchr(x,0) which is not enough as it
> should do full expansion to x+strlen(x).
> 
> Then you miss case when x is constant as in wordexp. 

Second problem is that you want builtin to evaluate constant strings
arguments. So why gcc could optimize away

int foo(){
  return strchr("foo", 'y');
}

but not when strchr is replaced by rawmemchr and strchrnul?

Also as for effectivity a correct primitive is strchrnul and strchr
should be expanded into it this is problem.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Gcc builtin review: strpbrk, strspn, strcspn
  2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
@ 2015-05-25 19:28     ` Ondřej Bílka
  0 siblings, 0 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-25 19:28 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

These do relatively good job.

These would be typical use-case for ufuncs. In most use cases accept
argument is consant so you want to precompute it as array as one
implementation computes table with characters and then checks if each
character is in table.

But on x64 there is better implementation which is only case where
sse4.2 instructions are useful in glibc string function.

So you need to do resolving if to expand table based on if cpu uses
sse4.2 which gcc cannot do.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Gcc builtin review: strstr, strcasestr, memmem
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
                     ` (2 preceding siblings ...)
  2015-05-25 18:11   ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
@ 2015-05-26  8:00   ` Ondřej Bílka
  2015-05-26  8:55     ` Ondřej Bílka
  2015-05-26 10:45   ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-26  8:00 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

A gcc does little optimizations, just constant arguments.

And why gcc doesnt optimize memmem as arguments are constant and with
size 1 needle it could use memchr?

#include <string.h>
int foo(char *x)
{
  return memmem ("tauhtatu", 4, "f", 1);
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: strstr, strcasestr, memmem
  2015-05-26  8:00   ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
@ 2015-05-26  8:55     ` Ondřej Bílka
  0 siblings, 0 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-26  8:55 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

On Mon, May 25, 2015 at 09:09:23PM +0200, OndÅ™ej BÃlka wrote:
> A gcc does little optimizations, just constant arguments.
> 
> And why gcc doesnt optimize memmem as arguments are constant and with
> size 1 needle it could use memchr?
> 
Forgot to add that these could be candidate to ufunc pattern. While I
added heuristic to delay critical factorization as far as possible it
could be possible to get faster algorithm when you don't have to spend
90% of strstr time to recompute factorization again which was previously
case of small haystacks.

This is also quite unsuitable for gcc pass and as implementation could
change it could easily become obsolete and cause problems.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: strfry, basename, strncpy, memfrob
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
                     ` (3 preceding siblings ...)
  2015-05-26  8:00   ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
@ 2015-05-26 10:45   ` Ondřej Bílka
  2015-05-28 17:02   ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
  2015-06-04 10:34   ` Richard Earnshaw
  6 siblings, 0 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-26 10:45 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

And I was most disapointed on lack of gcc support with these functions.

Memfrob is essential security feature but gcc doesn't optimize it at all.

Sadly in strfry gcc could use opportunity of fixing glibc bug but it
doesn't do it.

While gcc optimizes strncpy a bit it doesn't do that enough. It should
expand to sequence of stores for at least size 1000. A strncpy is best
way to copy strings. And what more. You could also use strncpy as memset
by passing "" as second argument.

Also basename should be always inlined as it almost always lies in hot
path followed by opening file.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: isinf, insnan ...
  2015-05-25 19:15     ` Ondřej Bílka
@ 2015-05-26 10:58       ` Ondřej Bílka
  0 siblings, 0 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-26 10:58 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

I raised this issue before but didn't wrote patch so I should do it now.
I would be silent about glibc as it shares same flaw as gcc.

Main problem that these functions try to be branchless. Which causes
performance regression for most applications versus branched code.

A problem is that predicted branch is free while conditional store
always cost cycle. So you need to have unpredictable branch to get
performance gain. When branch is 95% predicted then branchless code
wouldn't pay for itself if it adds one cycle versus branched and
misprediction costs 20 cycles.

And NaN is quite exceptional value so branches will almost always be
predicted. Otherwise user has other problems, like that if 5% of his
data are NaN's then result will likely be garbage.

Then you have problem that with modern gcc you wont likely save branch.
Most of these functions are surrounded by if. From gcc-4.9 it will
optimize out that branch as its predicated and it results in simpler
code.

More evidence about that is that I took assembly of benchmark below and
changed conditional move to jump which improves performance back by 10%

For showing that I wrote simple example of branched isinf that is around
10% faster than builtin. 

#ifdef BRANCHED
static inline int
isinf (double dx)
{
  union u {
    double d;
    long l;
  };
  union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 : -1) : 0;
}
#endif
int main()
{
  int ret;
  int i, j;
  double *d = malloc (800000);
  for (j=0; j<1000000; j++)
  for (i=0; i<1000; i++)
    if (__builtin_expect(isinf (d[i]),0))
      ret += 42;
  return ret;
}

	.file	"inf.c"
	.section	.text.unlikely,"ax",@progbits
.LCOLDB2:
	.section	.text.startup,"ax",@progbits
.LHOTB2:
	.p2align 4,,15
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movl	$800000, %edi
	call	malloc
	movsd	.LC0(%rip), %xmm2
	leaq	8000(%rax), %rsi
	movsd	.LC1(%rip), %xmm1
	movl	$1000000, %edi
	.p2align 4,,10
	.p2align 3
.L2:
	movq	%rax, %rdx
	.p2align 4,,10
	.p2align 3
.L3:
	movsd	(%rdx), %xmm0
	andpd	%xmm2, %xmm0
	ucomisd	%xmm1, %xmm0
	ja	.LC
	leal	42(%rbx), %ebx
	.LC:
	addq	$8, %rdx
	cmpq	%rsi, %rdx
	jne	.L3
	subl	$1, %edi
	jne	.L2
	movl	%ebx, %eax
	popq	%rbx
	.cfi_def_cfa_offset 8
	ret

	.cfi_endproc
.LFE0:
	.size	main, .-main
	.section	.text.unlikely
.LCOLDE2:
	.section	.text.startup
.LHOTE2:
	.section	.rodata.cst16,"aM",@progbits,16
	.align 16
.LC0:
	.long	4294967295
	.long	2147483647
	.long	0
	.long	0
	.section	.rodata.cst8,"aM",@progbits,8
	.align 8
.LC1:
	.long	4294967295
	.long	2146435071
	.ident	"GCC: (Debian 4.9.2-9) 4.9.2"
	.section	.note.GNU-stack,"",@progbits

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: memchr, memrchr, rawmemchr
  2015-05-25 13:50   ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
@ 2015-05-27  6:36     ` Ondřej Bílka
  0 siblings, 0 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-05-27  6:36 UTC (permalink / raw)
  To: libc-alpha; +Cc: Andrew Pinski

On Mon, May 25, 2015 at 12:55:40PM +0200, OndÅ™ej BÃlka wrote:
> Gcc doesn't do any optimizations with these but misses several.
> 
Also next optimization is that memchr(s, 0, n) could be converted to
calling strnlen (s, n) and converting result.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: memcpy, mempcpy, memmove.
  2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
  2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
@ 2015-05-28 16:57 ` Joseph Myers
  2015-05-28 17:08   ` Jeff Law
  2 siblings, 1 reply; 23+ messages in thread
From: Joseph Myers @ 2015-05-28 16:57 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

[-- Attachment #1: Type: text/plain, Size: 1841 bytes --]

On Mon, 25 May 2015, OndĂ¸ej BĂlka wrote:

> Andrew as I promised review and what you would need to fix for x64 gcc
>  and remove ugly gcc hacks and replace them with nice headers :-)
> 
> I will take a functions in order from string.h and comment about
> these.

Reviews of optimization issues with GCC built-in functions belong 
primarily in the GCC context, not glibc at all.  That is, file one report 
in GCC Bugzilla for each issue you find that isn't already reported there 
(making sure to give the usual details of GCC version, target, compilation 
options etc.).  File a meta-bug in GCC Bugzilla depending on all those 
bugs (and any pre-existing bugs for issues you found that you didn't 
report because they were already reported there).  Then just send a 
summary to the libc-alpha and gcc lists pointing to that meta-bug.

This applies to most of your reviews of GCC built-in functions - send the 
reports where they belong, and any temporary implementations of 
optimizations in glibc headers need comments pointing to the GCC bug 
reports which point back to those comments.  Only where an optimization 
involves calling glibc-specific functions are compiler optimizations more 
troublesome (and even there, GCC knows when it's compiling for a glibc 
target and knows the minimum glibc version supported on the target, so 
it's entirely reasonable to agree that e.g. __rawmemchr will remain 
available as a public glibc ABI so that GCC can generate code calling it 
on glibc targets unless told not to because e.g. generating kernel code).

> Then you have problem that optimum also depends on profile feedback. If

Which is clearly better handled in the GCC context (GCC has the 
infrastructure to handle profile feedback data from training runs of the 
program) than in headers.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
                     ` (4 preceding siblings ...)
  2015-05-26 10:45   ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
@ 2015-05-28 17:02   ` Joseph Myers
  2015-06-04 10:34   ` Richard Earnshaw
  6 siblings, 0 replies; 23+ messages in thread
From: Joseph Myers @ 2015-05-28 17:02 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

[-- Attachment #1: Type: text/plain, Size: 400 bytes --]

On Mon, 25 May 2015, OndĂ¸ej BĂlka wrote:

> You should expoit that stpcpy returns end of string so gcc should
> replace
> 
> strcpy (x,y)
> return strlen (x)
> 
> with 
> 
> return stpcpy (x,y) - x

Subject to namespace issues (we could agree that __stpcpy is a stable ABI 
- won't become a compat symbol - so that GCC can safely generate calls to 
it).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: memcpy, mempcpy, memmove.
  2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
@ 2015-05-28 17:08   ` Jeff Law
  2015-06-04 10:25     ` Ondřej Bílka
  0 siblings, 1 reply; 23+ messages in thread
From: Jeff Law @ 2015-05-28 17:08 UTC (permalink / raw)
  To: Joseph Myers, Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

On 05/28/2015 10:36 AM, Joseph Myers wrote:
> On Mon, 25 May 2015, OndĂ¸ej BĂlka wrote:
>
>> Andrew as I promised review and what you would need to fix for x64 gcc
>>   and remove ugly gcc hacks and replace them with nice headers :-)
>>
>> I will take a functions in order from string.h and comment about
>> these.
>
> Reviews of optimization issues with GCC built-in functions belong
> primarily in the GCC context, not glibc at all.
Agreed.  Trying to address all this stuff in glibc is wrong.  Each issue 
should be evaluated and a determination made individually whether the 
issue should be addressed in gcc or glibc.


Jeff


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: memcpy, mempcpy, memmove.
  2015-05-28 17:08   ` Jeff Law
@ 2015-06-04 10:25     ` Ondřej Bílka
  0 siblings, 0 replies; 23+ messages in thread
From: Ondřej Bílka @ 2015-06-04 10:25 UTC (permalink / raw)
  To: Jeff Law; +Cc: Joseph Myers, libc-alpha, Andrew Pinski

On Thu, May 28, 2015 at 10:41:04AM -0600, Jeff Law wrote:
> On 05/28/2015 10:36 AM, Joseph Myers wrote:
> >On Mon, 25 May 2015, OndÅ™ej BÃlka wrote:
> >
> >>Andrew as I promised review and what you would need to fix for x64 gcc
> >>  and remove ugly gcc hacks and replace them with nice headers :-)
> >>
> >>I will take a functions in order from string.h and comment about
> >>these.
> >
> >Reviews of optimization issues with GCC built-in functions belong
> >primarily in the GCC context, not glibc at all.
> Agreed.  Trying to address all this stuff in glibc is wrong.  Each
> issue should be evaluated and a determination made individually
> whether the issue should be addressed in gcc or glibc.
> 
> 
> Jeff

Ok, I will crosspost issues that needs to be addressed. The main two
problems are that lot of special cases should be handled as ordinary
code, not trying to add extra pass every time you notice that sometimes
you could expand foo into bar and baz. Then there are implementation
details and that correct decision depends on these. For optimizations
you need to know how fast are different libc functions which is easier
checkable inside libc than trying to sync that information in gcc.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
                     ` (5 preceding siblings ...)
  2015-05-28 17:02   ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
@ 2015-06-04 10:34   ` Richard Earnshaw
  2015-06-04 12:33     ` Ondřej Bílka
  6 siblings, 1 reply; 23+ messages in thread
From: Richard Earnshaw @ 2015-06-04 10:34 UTC (permalink / raw)
  To: Ondřej Bílka, libc-alpha; +Cc: Andrew Pinski

On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
> Replaces it with strcpy. One could argue that opposite way to replace
> strcpy with stpcpy is faster.
> 
> Reason is register pressure. Strcpy needs extra register to save return
> value while stpcpy has return value already in register used for writing
> terminating zero.


Depends on your architecture.  On aarch64 we have plenty of spare
registers, so strcpy simply copies the destination register into a
scratch.  It then doesn't have to carefully calculate the return value
at the end of the function (making the tail code simpler - there are
multiple return statements, but only one entry point).

R.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-06-04 10:34   ` Richard Earnshaw
@ 2015-06-04 12:33     ` Ondřej Bílka
  2015-06-04 13:50       ` Richard Earnshaw
  0 siblings, 1 reply; 23+ messages in thread
From: Ondřej Bílka @ 2015-06-04 12:33 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: libc-alpha, Andrew Pinski

On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
> > Replaces it with strcpy. One could argue that opposite way to replace
> > strcpy with stpcpy is faster.
> > 
> > Reason is register pressure. Strcpy needs extra register to save return
> > value while stpcpy has return value already in register used for writing
> > terminating zero.
> 
> 
> Depends on your architecture.  On aarch64 we have plenty of spare
> registers, so strcpy simply copies the destination register into a
> scratch.  It then doesn't have to carefully calculate the return value
> at the end of the function (making the tail code simpler - there are
> multiple return statements, but only one entry point).
> 
Thats correct, main saving you get is from return value is first register, that
forces needing extra copy which is suboptimal.

I don't have data how strcpy and stpcpy mix and want to know if few
extra cycles are worth it when these aren't called exactly often, I will
try to think how test these.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Gcc builtin review: strcpy, stpcpy, strcat, stpcat?
  2015-06-04 12:33     ` Ondřej Bílka
@ 2015-06-04 13:50       ` Richard Earnshaw
  2015-06-04 14:35         ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Earnshaw @ 2015-06-04 13:50 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

On 04/06/15 13:28, OndÅ™ej BÃlka wrote:
> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>> On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
>>> Replaces it with strcpy. One could argue that opposite way to replace
>>> strcpy with stpcpy is faster.
>>>
>>> Reason is register pressure. Strcpy needs extra register to save return
>>> value while stpcpy has return value already in register used for writing
>>> terminating zero.
>>
>>
>> Depends on your architecture.  On aarch64 we have plenty of spare
>> registers, so strcpy simply copies the destination register into a
>> scratch.  It then doesn't have to carefully calculate the return value
>> at the end of the function (making the tail code simpler - there are
>> multiple return statements, but only one entry point).
>>
> Thats correct, main saving you get is from return value is first register, that
> forces needing extra copy which is suboptimal.

No, look at the AArch64 code.  The only time we ever end up with a
simple MOV instruction to copy the register from one location to another
is in the stPcpy code.  In strcpy it always ends up folded into some
other operation that we have to do anyway.  Once it's been copied to
that other register we never have to use it elsewhere again.
Furthermore, the need to handle smallish copies with overlapping stores
means we need both the original base address /and/ the final result, so
we'd still need to end up saving it for stpcpy.

> 
> I don't have data how strcpy and stpcpy mix and want to know if few
> extra cycles are worth it when these aren't called exactly often, I will
> try to think how test these.
> 

Well, stpcpy is Posix, while strcpy is part of ISO C.  That suggests to
me we're far more likely to see folk using strcpy than stpcpy,
especially when they don't care about the result.

R.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC] Aarch64: optimize stpcpy a bit.
  2015-06-04 13:50       ` Richard Earnshaw
@ 2015-06-04 14:35         ` Ondřej Bílka
  2015-06-04 16:41           ` Richard Earnshaw
  0 siblings, 1 reply; 23+ messages in thread
From: Ondřej Bílka @ 2015-06-04 14:35 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: libc-alpha, Andrew Pinski

On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
> On 04/06/15 13:28, OndÅ™ej BÃlka wrote:
> > On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> >> On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
> >>> Replaces it with strcpy. One could argue that opposite way to replace
> >>> strcpy with stpcpy is faster.
> >>>
> >>> Reason is register pressure. Strcpy needs extra register to save return
> >>> value while stpcpy has return value already in register used for writing
> >>> terminating zero.
> >>
> >>
> >> Depends on your architecture.  On aarch64 we have plenty of spare
> >> registers, so strcpy simply copies the destination register into a
> >> scratch.  It then doesn't have to carefully calculate the return value
> >> at the end of the function (making the tail code simpler - there are
> >> multiple return statements, but only one entry point).
> >>
> > Thats correct, main saving you get is from return value is first register, that
> > forces needing extra copy which is suboptimal.
> 
> No, look at the AArch64 code.  The only time we ever end up with a
> simple MOV instruction to copy the register from one location to another
> is in the stPcpy code.  In strcpy it always ends up folded into some
> other operation that we have to do anyway.  Once it's been copied to
> that other register we never have to use it elsewhere again.
> Furthermore, the need to handle smallish copies with overlapping stores
> means we need both the original base address /and/ the final result, so
> we'd still need to end up saving it for stpcpy.
>
Wrote too fast, was refering that you would need to copy that on small
address. With dest in different register I could make strcpy and stpcpy
const same  instructions in most cases except size 8-15 by adjusting 
offsets with some constants.

Also if I think that you could remove extra instructions for stpcpy loop 
with following, which also removes one instruction from strcpy if I read
code correctly.

diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
index 28846fb..7ca0412 100644
--- a/sysdeps/aarch64/strcpy.S
+++ b/sysdeps/aarch64/strcpy.S
@@ -204,6 +204,9 @@ L(bulk_entry):
 	sub	to_align, to_align, #16
 	stp	data1, data2, [dstin]
 	sub	src, srcin, to_align
+#ifdef BUILD_STPCPY
+# define dst dstin
+#endif
 	sub	dst, dstin, to_align
 	b	L(entry_no_page_cross)
 
@@ -243,17 +246,16 @@ L(entry_no_page_cross):
 #endif
 	rev	has_nul1, has_nul1
 	clz	pos, has_nul1
-	add	tmp1, pos, #72
-	add	pos, pos, #8
+	add	tmp1, pos, #64
 	csel	pos, pos, tmp1, ne
 	add	src, src, pos, lsr #3
 	add	dst, dst, pos, lsr #3
-	ldp	data1, data2, [src, #-32]
-	stp	data1, data2, [dst, #-16]
+	ldp	data1, data2, [src, #-31]
+	stp	data1, data2, [dst, #-15]
+	ret
 #ifdef BUILD_STPCPY
-	sub	dstin, dst, #1
+# undef dst
 #endif
-	ret
 
 L(page_cross):
 	bic	src, srcin, #15

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC] Aarch64: optimize stpcpy a bit.
  2015-06-04 14:35         ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
@ 2015-06-04 16:41           ` Richard Earnshaw
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Earnshaw @ 2015-06-04 16:41 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: libc-alpha, Andrew Pinski

On 04/06/15 15:16, OndÅ™ej BÃlka wrote:
> On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
>> On 04/06/15 13:28, OndÅ™ej BÃlka wrote:
>>> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>>>> On 25/05/15 12:45, OndÅ™ej BÃlka wrote:
>>>>> Replaces it with strcpy. One could argue that opposite way to replace
>>>>> strcpy with stpcpy is faster.
>>>>>
>>>>> Reason is register pressure. Strcpy needs extra register to save return
>>>>> value while stpcpy has return value already in register used for writing
>>>>> terminating zero.
>>>>
>>>>
>>>> Depends on your architecture.  On aarch64 we have plenty of spare
>>>> registers, so strcpy simply copies the destination register into a
>>>> scratch.  It then doesn't have to carefully calculate the return value
>>>> at the end of the function (making the tail code simpler - there are
>>>> multiple return statements, but only one entry point).
>>>>
>>> Thats correct, main saving you get is from return value is first register, that
>>> forces needing extra copy which is suboptimal.
>>
>> No, look at the AArch64 code.  The only time we ever end up with a
>> simple MOV instruction to copy the register from one location to another
>> is in the stPcpy code.  In strcpy it always ends up folded into some
>> other operation that we have to do anyway.  Once it's been copied to
>> that other register we never have to use it elsewhere again.
>> Furthermore, the need to handle smallish copies with overlapping stores
>> means we need both the original base address /and/ the final result, so
>> we'd still need to end up saving it for stpcpy.
>>
> Wrote too fast, was refering that you would need to copy that on small
> address. With dest in different register I could make strcpy and stpcpy
> const same  instructions in most cases except size 8-15 by adjusting 
> offsets with some constants.
> 
> Also if I think that you could remove extra instructions for stpcpy loop 
> with following, which also removes one instruction from strcpy if I read
> code correctly.
> 
> diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
> index 28846fb..7ca0412 100644
> --- a/sysdeps/aarch64/strcpy.S
> +++ b/sysdeps/aarch64/strcpy.S
> @@ -204,6 +204,9 @@ L(bulk_entry):
>  	sub	to_align, to_align, #16
>  	stp	data1, data2, [dstin]
>  	sub	src, srcin, to_align
> +#ifdef BUILD_STPCPY
> +# define dst dstin
> +#endif
>  	sub	dst, dstin, to_align
>  	b	L(entry_no_page_cross)
>  
> @@ -243,17 +246,16 @@ L(entry_no_page_cross):
>  #endif
>  	rev	has_nul1, has_nul1
>  	clz	pos, has_nul1
> -	add	tmp1, pos, #72
> -	add	pos, pos, #8
> +	add	tmp1, pos, #64
>  	csel	pos, pos, tmp1, ne
>  	add	src, src, pos, lsr #3
>  	add	dst, dst, pos, lsr #3
> -	ldp	data1, data2, [src, #-32]
> -	stp	data1, data2, [dst, #-16]
> +	ldp	data1, data2, [src, #-31]
> +	stp	data1, data2, [dst, #-15]

That's not valid, the offset has to be a multiple of the register size
(8 in this case).
> +	ret
>  #ifdef BUILD_STPCPY
> -	sub	dstin, dst, #1
> +# undef dst

Nor is this, dst is already a #define, so this leaves it unspecified.
But since you can't avoid the late subtract, that's irrelevant anyway.

R.

>  #endif
> -	ret
>  
>  L(page_cross):
>  	bic	src, srcin, #15
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-06-04 15:44 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-25 12:18 Gcc builtin review: memcpy, mempcpy, memmove Ondřej Bílka
2015-05-25 12:49 ` Gcc builtin review: memset Ondřej Bílka
2015-05-25 13:50   ` Gcc builtin review: memchr, memrchr, rawmemchr Ondřej Bílka
2015-05-27  6:36     ` Ondřej Bílka
2015-05-25 13:53 ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Ondřej Bílka
2015-05-25 17:46   ` Gcc builtin review: strcmp, strncmp, memcmp Ondřej Bílka
2015-05-25 19:28     ` Gcc builtin review: strpbrk, strspn, strcspn Ondřej Bílka
2015-05-25 17:52   ` Gcc builtin review: strdup Ondřej Bílka
2015-05-25 18:11   ` Gcc builtin review: strchr, strrchr, strchrnul Ondřej Bílka
2015-05-25 19:15     ` Ondřej Bílka
2015-05-26 10:58       ` Gcc builtin review: isinf, insnan Ondřej Bílka
2015-05-26  8:00   ` Gcc builtin review: strstr, strcasestr, memmem Ondřej Bílka
2015-05-26  8:55     ` Ondřej Bílka
2015-05-26 10:45   ` Gcc builtin review: strfry, basename, strncpy, memfrob Ondřej Bílka
2015-05-28 17:02   ` Gcc builtin review: strcpy, stpcpy, strcat, stpcat? Joseph Myers
2015-06-04 10:34   ` Richard Earnshaw
2015-06-04 12:33     ` Ondřej Bílka
2015-06-04 13:50       ` Richard Earnshaw
2015-06-04 14:35         ` [RFC] Aarch64: optimize stpcpy a bit Ondřej Bílka
2015-06-04 16:41           ` Richard Earnshaw
2015-05-28 16:57 ` Gcc builtin review: memcpy, mempcpy, memmove Joseph Myers
2015-05-28 17:08   ` Jeff Law
2015-06-04 10:25     ` Ondřej Bílka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).