public inbox for libc-alpha@sourceware.org
* [PATCH] x86_64: Optimize ffsll function code size.
@ 2023-07-26 16:05 Sunil K Pandey
  2023-07-26 16:38 ` Richard Henderson
  2023-07-26 17:04 ` Noah Goldstein
  0 siblings, 2 replies; 37+ messages in thread
From: Sunil K Pandey @ 2023-07-26 16:05 UTC (permalink / raw)
  To: libc-alpha; +Cc: hjl.tools

Ffsll function size is 17 bytes; this patch optimizes the size to 16 bytes.
Currently the ffsll function randomly regresses by ~20%, depending on how
the code gets aligned.

This patch fixes the ffsll function's random performance regression.
---
 sysdeps/x86_64/ffsll.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
index a1c13d4906..dbded6f0a1 100644
--- a/sysdeps/x86_64/ffsll.c
+++ b/sysdeps/x86_64/ffsll.c
@@ -29,7 +29,7 @@ ffsll (long long int x)
   long long int tmp;
 
   asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
-       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
+       "cmove %k1,%k0\n"	/* If number was zero, use -1 as result.  */
        : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
 
   return cnt + 1;
-- 
2.41.0


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 16:05 [PATCH] x86_64: Optimize ffsll function code size Sunil K Pandey
@ 2023-07-26 16:38 ` Richard Henderson
  2023-07-26 16:50   ` H.J. Lu
                     ` (2 more replies)
  2023-07-26 17:04 ` Noah Goldstein
  1 sibling, 3 replies; 37+ messages in thread
From: Richard Henderson @ 2023-07-26 16:38 UTC (permalink / raw)
  To: Sunil K Pandey, libc-alpha; +Cc: hjl.tools

On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> Currently ffsll function randomly regress by ~20%, depending on how
> code get aligned.
> 
> This patch fixes ffsll function random performance regression.
> ---
>   sysdeps/x86_64/ffsll.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..dbded6f0a1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -29,7 +29,7 @@ ffsll (long long int x)
>     long long int tmp;
>   
>     asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
> +       "cmove %k1,%k0\n"	/* If number was zero, use -1 as result.  */

This no longer produces -1, but 0xffffffff in cnt.  However, since the return type is 
'int', cnt need not be 'long long int' either.  I'm not sure why tmp exists at all, since 
cnt is the only register modified.


r~

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 16:38 ` Richard Henderson
@ 2023-07-26 16:50   ` H.J. Lu
  2023-07-26 16:51   ` Noah Goldstein
  2023-07-26 16:51   ` Sunil Pandey
  2 siblings, 0 replies; 37+ messages in thread
From: H.J. Lu @ 2023-07-26 16:50 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Sunil K Pandey, libc-alpha

On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> > ---
> >   sysdeps/x86_64/ffsll.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..dbded6f0a1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >     long long int tmp;
> >
> >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as result.  */
> > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.  */
>
> This no longer produces -1, but 0xffffffff in cnt.  However, since the return type is
> 'int', cnt need not be 'long long int' either.  I'm not sure why tmp exists at all, since
> cnt is the only register modified.
>
>
> r~

tmp was initialized to -1 with

: "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));

H.J.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 16:38 ` Richard Henderson
  2023-07-26 16:50   ` H.J. Lu
@ 2023-07-26 16:51   ` Noah Goldstein
  2023-07-26 16:51   ` Sunil Pandey
  2 siblings, 0 replies; 37+ messages in thread
From: Noah Goldstein @ 2023-07-26 16:51 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Sunil K Pandey, libc-alpha, hjl.tools

On Wed, Jul 26, 2023 at 11:38 AM Richard Henderson via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> > ---
> >   sysdeps/x86_64/ffsll.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..dbded6f0a1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >     long long int tmp;
> >
> >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as result.  */
> > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.  */
Alternatively just store `-1` in dst before the `bsfq` and drop the
cmov altogether.
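
A sketch of what that could look like (hypothetical, not what was
committed; it relies on bsf leaving the destination untouched when the
source is 0):

```
int cnt;
asm ("mov $-1,%0\n"   /* Preset CNT to -1.  */
     "bsfq %1,%q0\n"  /* Nonzero X overwrites CNT with the bit index.  */
     : "=&r" (cnt) : "r" (x));
return cnt + 1;
```
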
>
> This no longer produces -1, but 0xffffffff in cnt.  However, since the return type is
> 'int', cnt need not be 'long long int' either.  I'm not sure why tmp exists at all, since
> cnt is the only register modified.
>
>
> r~

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 16:38 ` Richard Henderson
  2023-07-26 16:50   ` H.J. Lu
  2023-07-26 16:51   ` Noah Goldstein
@ 2023-07-26 16:51   ` Sunil Pandey
  2023-07-26 16:59     ` Noah Goldstein
  2 siblings, 1 reply; 37+ messages in thread
From: Sunil Pandey @ 2023-07-26 16:51 UTC (permalink / raw)
  To: Richard Henderson; +Cc: libc-alpha, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 1678 bytes --]

On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
richard.henderson@linaro.org> wrote:

> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> > ---
> >   sysdeps/x86_64/ffsll.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..dbded6f0a1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >     long long int tmp;
> >
> >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
> in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
> result.  */
> > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
> */
>
> This no longer produces -1, but 0xffffffff in cnt.  However, since the
> return type is
> 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
> exists at all, since
> cnt is the only register modified.
>

Here is the exact assembly produced with this change.
./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <ffsll>:
   0: ba ff ff ff ff       mov    $0xffffffff,%edx
   5: 48 0f bc c7           bsf    %rdi,%rax
   9: 0f 44 c2             cmove  %edx,%eax
   c: 83 c0 01             add    $0x1,%eax
   f: c3                   ret
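
For reference, the byte accounting behind the 17 -> 16 change, given the
layout above:

```
mov $0xffffffff,%edx   5 bytes
bsf %rdi,%rax          4 bytes
cmove %edx,%eax        3 bytes  (the old cmoveq %rdx,%rax needs a REX prefix: 4)
add $0x1,%eax          3 bytes
ret                    1 byte
                      16 bytes total (17 with cmoveq)
```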


>
>
> r~
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 16:51   ` Sunil Pandey
@ 2023-07-26 16:59     ` Noah Goldstein
  2023-07-26 17:11       ` Adhemerval Zanella Netto
  2023-07-26 20:43       ` Sunil Pandey
  0 siblings, 2 replies; 37+ messages in thread
From: Noah Goldstein @ 2023-07-26 16:59 UTC (permalink / raw)
  To: Sunil Pandey; +Cc: Richard Henderson, libc-alpha, hjl.tools

On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
> richard.henderson@linaro.org> wrote:
>
> > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > > Currently ffsll function randomly regress by ~20%, depending on how
> > > code get aligned.
> > >
> > > This patch fixes ffsll function random performance regression.
> > > ---
> > >   sysdeps/x86_64/ffsll.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > > index a1c13d4906..dbded6f0a1 100644
> > > --- a/sysdeps/x86_64/ffsll.c
> > > +++ b/sysdeps/x86_64/ffsll.c
> > > @@ -29,7 +29,7 @@ ffsll (long long int x)
> > >     long long int tmp;
> > >
> > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
> > in %1.  */
> > > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
> > result.  */
> > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
> > */
> >
> > This no longer produces -1, but 0xffffffff in cnt.  However, since the
> > return type is
> > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
> > exists at all, since
> > cnt is the only register modified.
> >
>
> Here is the exact assembly produced with this change.
> ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <ffsll>:
>    0: ba ff ff ff ff       mov    $0xffffffff,%edx
>    5: 48 0f bc c7           bsf    %rdi,%rax
>    9: 0f 44 c2             cmove  %edx,%eax
>    c: 83 c0 01             add    $0x1,%eax
>    f: c3                   ret
>

FWIW it should be:
```
0000000000000000 <.text>:
   0: b8 ff ff ff ff        mov    $0xffffffff,%eax
   5: 48 0f bc c7          bsf    %rdi,%rax
   9: ff c0                inc    %eax
```

And since it's inline asm, there's no reason not to get that.
>
> >
> >
> > r~
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 16:05 [PATCH] x86_64: Optimize ffsll function code size Sunil K Pandey
  2023-07-26 16:38 ` Richard Henderson
@ 2023-07-26 17:04 ` Noah Goldstein
  2023-07-26 17:25   ` Andrew Pinski
  1 sibling, 1 reply; 37+ messages in thread
From: Noah Goldstein @ 2023-07-26 17:04 UTC (permalink / raw)
  To: Sunil K Pandey; +Cc: libc-alpha, hjl.tools

On Wed, Jul 26, 2023 at 11:06 AM Sunil K Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> Currently ffsll function randomly regress by ~20%, depending on how
> code get aligned.
>
> This patch fixes ffsll function random performance regression.
> ---
>  sysdeps/x86_64/ffsll.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..dbded6f0a1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -29,7 +29,7 @@ ffsll (long long int x)
>    long long int tmp;
>
>    asm ("bsfq %2,%0\n"          /* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"                /* If number was zero, use -1 as result.  */
> +       "cmove %k1,%k0\n"       /* If number was zero, use -1 as result.  */
>         : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
>
Note, there is technically a "bug" in this code in that we clobber
"cc" but don't specify so.
Might as well fix that.
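
For illustration, a sketch with the clobber spelled out (everything else
as in the patch):

```
asm ("bsfq %2,%0\n"     /* Count low bits in X and store in %0.  */
     "cmove %k1,%k0\n"  /* If X was zero, use -1 as result.  */
     : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1) : "cc");
```
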
>    return cnt + 1;
> --
> 2.41.0
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 16:59     ` Noah Goldstein
@ 2023-07-26 17:11       ` Adhemerval Zanella Netto
  2023-07-27  0:31         ` Cristian Rodríguez
  2023-07-26 20:43       ` Sunil Pandey
  1 sibling, 1 reply; 37+ messages in thread
From: Adhemerval Zanella Netto @ 2023-07-26 17:11 UTC (permalink / raw)
  To: Noah Goldstein, Sunil Pandey; +Cc: Richard Henderson, libc-alpha, hjl.tools



On 26/07/23 13:59, Noah Goldstein via Libc-alpha wrote:
> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
>> richard.henderson@linaro.org> wrote:
>>
>>> On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
>>>> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
>>>> Currently ffsll function randomly regress by ~20%, depending on how
>>>> code get aligned.
>>>>
>>>> This patch fixes ffsll function random performance regression.
>>>> ---
>>>>   sysdeps/x86_64/ffsll.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>>>> index a1c13d4906..dbded6f0a1 100644
>>>> --- a/sysdeps/x86_64/ffsll.c
>>>> +++ b/sysdeps/x86_64/ffsll.c
>>>> @@ -29,7 +29,7 @@ ffsll (long long int x)
>>>>     long long int tmp;
>>>>
>>>>     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
>>> in %1.  */
>>>> -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
>>> result.  */
>>>> +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
>>> */
>>>
>>> This no longer produces -1, but 0xffffffff in cnt.  However, since the
>>> return type is
>>> 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
>>> exists at all, since
>>> cnt is the only register modified.
>>>
>>
>> Here is the exact assembly produced with this change.
>> ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
>>
>>
>> Disassembly of section .text:
>>
>> 0000000000000000 <ffsll>:
>>    0: ba ff ff ff ff       mov    $0xffffffff,%edx
>>    5: 48 0f bc c7           bsf    %rdi,%rax
>>    9: 0f 44 c2             cmove  %edx,%eax
>>    c: 83 c0 01             add    $0x1,%eax
>>    f: c3                   ret
>>
> 
> FWIW it should be:
> ```
> 0000000000000000 <.text>:
>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
>    5: 48 0f bc c7          bsf    %rdi,%rax
>    9: ff c0                inc    %eax
> ```
> 
> And since its in inline asm no reason not to get that.

Can't we just use compiler builtins instead and remove a bunch of asm?
GCC already optimizes ffsl/ffsll to a builtin if the architecture allows it,
so I think micro-optimizing it in libc.so with arch-specific code is really
not ideal.
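
For instance, a minimal sketch of the builtin-based fallback (the actual
proposal is [1] below):

```
int
__ffsll (long long int x)
{
  return __builtin_ffsll (x);
}
```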

With gcc 13 and my patch [1] I see:

$ objdump -d ./string/ffsll.os

./string/ffsll.os:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <__ffsll>:
   0:   48 0f bc ff             bsf    %rdi,%rdi
   4:   48 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%rax
   b:   48 0f 44 f8             cmove  %rax,%rdi
   f:   8d 47 01                lea    0x1(%rdi),%eax
  12:   c3                      ret


[1] https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 17:04 ` Noah Goldstein
@ 2023-07-26 17:25   ` Andrew Pinski
  0 siblings, 0 replies; 37+ messages in thread
From: Andrew Pinski @ 2023-07-26 17:25 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Sunil K Pandey, libc-alpha, hjl.tools

On Wed, Jul 26, 2023 at 10:05 AM Noah Goldstein via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Jul 26, 2023 at 11:06 AM Sunil K Pandey via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> > ---
> >  sysdeps/x86_64/ffsll.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..dbded6f0a1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >    long long int tmp;
> >
> >    asm ("bsfq %2,%0\n"          /* Count low bits in X and store in %1.  */
> > -       "cmoveq %1,%0\n"                /* If number was zero, use -1 as result.  */
> > +       "cmove %k1,%k0\n"       /* If number was zero, use -1 as result.  */
> >         : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> >
> Note, there is technically a "bug" in this code in that we clobber
> "cc" but don't
> specify so.
> Mightly as well fix that.

On x86, flags/cc is always considered clobbered by inline asm.
ix86_md_asm_adjust will add the clobber if there was no asm flags output.

Thanks,
Andrew Pinski

> >    return cnt + 1;
> > --
> > 2.41.0
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 16:59     ` Noah Goldstein
  2023-07-26 17:11       ` Adhemerval Zanella Netto
@ 2023-07-26 20:43       ` Sunil Pandey
  2023-07-26 21:05         ` Noah Goldstein
  1 sibling, 1 reply; 37+ messages in thread
From: Sunil Pandey @ 2023-07-26 20:43 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Richard Henderson, libc-alpha, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 2596 bytes --]

On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
> > richard.henderson@linaro.org> wrote:
> >
> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> > > > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > > > Currently ffsll function randomly regress by ~20%, depending on how
> > > > code get aligned.
> > > >
> > > > This patch fixes ffsll function random performance regression.
> > > > ---
> > > >   sysdeps/x86_64/ffsll.c | 2 +-
> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > > > index a1c13d4906..dbded6f0a1 100644
> > > > --- a/sysdeps/x86_64/ffsll.c
> > > > +++ b/sysdeps/x86_64/ffsll.c
> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
> > > >     long long int tmp;
> > > >
> > > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and
> store
> > > in %1.  */
> > > > -       "cmoveq %1,%0\n"              /* If number was zero, use -1
> as
> > > result.  */
> > > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as
> result.
> > > */
> > >
> > > This no longer produces -1, but 0xffffffff in cnt.  However, since the
> > > return type is
> > > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
> > > exists at all, since
> > > cnt is the only register modified.
> > >
> >
> > Here is the exact assembly produced with this change.
> > ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
> >
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <ffsll>:
> >    0: ba ff ff ff ff       mov    $0xffffffff,%edx
> >    5: 48 0f bc c7           bsf    %rdi,%rax
> >    9: 0f 44 c2             cmove  %edx,%eax
> >    c: 83 c0 01             add    $0x1,%eax
> >    f: c3                   ret
> >
>
> FWIW it should be:
> ```
> 0000000000000000 <.text>:
>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
>    5: 48 0f bc c7          bsf    %rdi,%rax
>    9: ff c0                inc    %eax
> ```
>
> And since its in inline asm no reason not to get that.
>

We shouldn't remove the cmove because, per the Intel manual for the BSF
instruction, if the content of the source operand is 0, the content of the
destination operand is undefined. Also, removing the cmove doesn't provide
any perf advantage.



> >
> > >
> > >
> > > r~
> > >
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 20:43       ` Sunil Pandey
@ 2023-07-26 21:05         ` Noah Goldstein
  2023-07-26 22:37           ` Sunil Pandey
  0 siblings, 1 reply; 37+ messages in thread
From: Noah Goldstein @ 2023-07-26 21:05 UTC (permalink / raw)
  To: Sunil Pandey; +Cc: Richard Henderson, libc-alpha, hjl.tools

On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
>
>
> On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
>> <libc-alpha@sourceware.org> wrote:
>> >
>> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
>> > richard.henderson@linaro.org> wrote:
>> >
>> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
>> > > > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
>> > > > Currently ffsll function randomly regress by ~20%, depending on how
>> > > > code get aligned.
>> > > >
>> > > > This patch fixes ffsll function random performance regression.
>> > > > ---
>> > > >   sysdeps/x86_64/ffsll.c | 2 +-
>> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
>> > > >
>> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>> > > > index a1c13d4906..dbded6f0a1 100644
>> > > > --- a/sysdeps/x86_64/ffsll.c
>> > > > +++ b/sysdeps/x86_64/ffsll.c
>> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
>> > > >     long long int tmp;
>> > > >
>> > > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
>> > > in %1.  */
>> > > > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
>> > > result.  */
>> > > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
>> > > */
>> > >
>> > > This no longer produces -1, but 0xffffffff in cnt.  However, since the
>> > > return type is
>> > > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
>> > > exists at all, since
>> > > cnt is the only register modified.
>> > >
>> >
>> > Here is the exact assembly produced with this change.
>> > ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
>> >
>> >
>> > Disassembly of section .text:
>> >
>> > 0000000000000000 <ffsll>:
>> >    0: ba ff ff ff ff       mov    $0xffffffff,%edx
>> >    5: 48 0f bc c7           bsf    %rdi,%rax
>> >    9: 0f 44 c2             cmove  %edx,%eax
>> >    c: 83 c0 01             add    $0x1,%eax
>> >    f: c3                   ret
>> >
>>
>> FWIW it should be:
>> ```
>> 0000000000000000 <.text>:
>>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
>>    5: 48 0f bc c7          bsf    %rdi,%rax
>>    9: ff c0                inc    %eax
>> ```
>>
>> And since its in inline asm no reason not to get that.
>
>
> We shouldn't remove cmove because as per Intel BSF instruction manual, if the content of source operand
>  is 0, the content of the destination operand is undefined. Also removing cmove doesn't provide any perf
> advantage.

We rely on that behavior in other areas (memchr-evex512 for example).
It's been confirmed it is defined to not change the destination (like AMD).

It saves an instruction.

Either way, however, I think adhemerval is right and we should use builtins
for this.
>
>
>>
>> >
>> > >
>> > >
>> > > r~
>> > >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 21:05         ` Noah Goldstein
@ 2023-07-26 22:37           ` Sunil Pandey
  2023-07-27  0:00             ` Noah Goldstein
  0 siblings, 1 reply; 37+ messages in thread
From: Sunil Pandey @ 2023-07-26 22:37 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Richard Henderson, libc-alpha, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 3616 bytes --]

On Wed, Jul 26, 2023 at 2:05 PM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

> On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
> >
> >
> >
> > On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com>
> wrote:
> >>
> >> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
> >> <libc-alpha@sourceware.org> wrote:
> >> >
> >> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
> >> > richard.henderson@linaro.org> wrote:
> >> >
> >> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
> >> > > > Ffsll function size is 17 byte, this patch optimizes size to 16
> byte.
> >> > > > Currently ffsll function randomly regress by ~20%, depending on
> how
> >> > > > code get aligned.
> >> > > >
> >> > > > This patch fixes ffsll function random performance regression.
> >> > > > ---
> >> > > >   sysdeps/x86_64/ffsll.c | 2 +-
> >> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> >> > > >
> >> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> >> > > > index a1c13d4906..dbded6f0a1 100644
> >> > > > --- a/sysdeps/x86_64/ffsll.c
> >> > > > +++ b/sysdeps/x86_64/ffsll.c
> >> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
> >> > > >     long long int tmp;
> >> > > >
> >> > > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and
> store
> >> > > in %1.  */
> >> > > > -       "cmoveq %1,%0\n"              /* If number was zero, use
> -1 as
> >> > > result.  */
> >> > > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as
> result.
> >> > > */
> >> > >
> >> > > This no longer produces -1, but 0xffffffff in cnt.  However, since
> the
> >> > > return type is
> >> > > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
> >> > > exists at all, since
> >> > > cnt is the only register modified.
> >> > >
> >> >
> >> > Here is the exact assembly produced with this change.
> >> > ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
> >> >
> >> >
> >> > Disassembly of section .text:
> >> >
> >> > 0000000000000000 <ffsll>:
> >> >    0: ba ff ff ff ff       mov    $0xffffffff,%edx
> >> >    5: 48 0f bc c7           bsf    %rdi,%rax
> >> >    9: 0f 44 c2             cmove  %edx,%eax
> >> >    c: 83 c0 01             add    $0x1,%eax
> >> >    f: c3                   ret
> >> >
> >>
> >> FWIW it should be:
> >> ```
> >> 0000000000000000 <.text>:
> >>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
> >>    5: 48 0f bc c7          bsf    %rdi,%rax
> >>    9: ff c0                inc    %eax
> >> ```
> >>
> >> And since its in inline asm no reason not to get that.
> >
> >
> > We shouldn't remove cmove because as per Intel BSF instruction manual,
> if the content of source operand
> >  is 0, the content of the destination operand is undefined. Also
> removing cmove doesn't provide any perf
> > advantage.
>
> We rely on that behavior in other areas (memchr-evex512 for example).
> It's been confirmed it is defined to not changed the destination (like
> AMD).
>
> It saves an instructions..
>
> Either way, however, I think adhemerval is right and we should use builtins
> for this.
>

A few things to notice here.

      This code is more than 20 years old; it's not clear which Intel
processors it may be running on.
      The builtin will have the same issue, unless it can produce code of
16 bytes or less.
      Also, the purpose of this patch is to fix a random, unexpected ~20%
perf loss.

>
> >
> >>
> >> >
> >> > >
> >> > >
> >> > > r~
> >> > >
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 22:37           ` Sunil Pandey
@ 2023-07-27  0:00             ` Noah Goldstein
  2023-07-27  8:16               ` Florian Weimer
  0 siblings, 1 reply; 37+ messages in thread
From: Noah Goldstein @ 2023-07-27  0:00 UTC (permalink / raw)
  To: Sunil Pandey; +Cc: Richard Henderson, libc-alpha, hjl.tools

On Wed, Jul 26, 2023 at 5:38 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
>
>
> On Wed, Jul 26, 2023 at 2:05 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Wed, Jul 26, 2023 at 3:44 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>> >
>> >
>> >
>> > On Wed, Jul 26, 2023 at 10:00 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>> >>
>> >> On Wed, Jul 26, 2023 at 11:52 AM Sunil Pandey via Libc-alpha
>> >> <libc-alpha@sourceware.org> wrote:
>> >> >
>> >> > On Wed, Jul 26, 2023 at 9:38 AM Richard Henderson <
>> >> > richard.henderson@linaro.org> wrote:
>> >> >
>> >> > > On 7/26/23 09:05, Sunil K Pandey via Libc-alpha wrote:
>> >> > > > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
>> >> > > > Currently ffsll function randomly regress by ~20%, depending on how
>> >> > > > code get aligned.
>> >> > > >
>> >> > > > This patch fixes ffsll function random performance regression.
>> >> > > > ---
>> >> > > >   sysdeps/x86_64/ffsll.c | 2 +-
>> >> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
>> >> > > >
>> >> > > > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>> >> > > > index a1c13d4906..dbded6f0a1 100644
>> >> > > > --- a/sysdeps/x86_64/ffsll.c
>> >> > > > +++ b/sysdeps/x86_64/ffsll.c
>> >> > > > @@ -29,7 +29,7 @@ ffsll (long long int x)
>> >> > > >     long long int tmp;
>> >> > > >
>> >> > > >     asm ("bsfq %2,%0\n"               /* Count low bits in X and store
>> >> > > in %1.  */
>> >> > > > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
>> >> > > result.  */
>> >> > > > +       "cmove %k1,%k0\n"     /* If number was zero, use -1 as result.
>> >> > > */
>> >> > >
>> >> > > This no longer produces -1, but 0xffffffff in cnt.  However, since the
>> >> > > return type is
>> >> > > 'int', cnt need not be 'long long int' either.  I'm not sure why tmp
>> >> > > exists at all, since
>> >> > > cnt is the only register modified.
>> >> > >
>> >> >
>> >> > Here is the exact assembly produced with this change.
>> >> > ./build-x86_64-linux/string/ffsll.o:     file format elf64-x86-64
>> >> >
>> >> >
>> >> > Disassembly of section .text:
>> >> >
>> >> > 0000000000000000 <ffsll>:
>> >> >    0: ba ff ff ff ff       mov    $0xffffffff,%edx
>> >> >    5: 48 0f bc c7           bsf    %rdi,%rax
>> >> >    9: 0f 44 c2             cmove  %edx,%eax
>> >> >    c: 83 c0 01             add    $0x1,%eax
>> >> >    f: c3                   ret
>> >> >
>> >>
>> >> FWIW it should be:
>> >> ```
>> >> 0000000000000000 <.text>:
>> >>    0: b8 ff ff ff ff        mov    $0xffffffff,%eax
>> >>    5: 48 0f bc c7          bsf    %rdi,%rax
>> >>    9: ff c0                inc    %eax
>> >> ```
>> >>
>> >> And since its in inline asm no reason not to get that.
>> >
>> >
>> > We shouldn't remove cmove because as per Intel BSF instruction manual, if the content of source operand
>> >  is 0, the content of the destination operand is undefined. Also removing cmove doesn't provide any perf
>> > advantage.
>>
>> We rely on that behavior in other areas (memchr-evex512 for example).
>> It's been confirmed it is defined to not changed the destination (like AMD).
>>
>> It saves an instructions..
>>
>> Either way, however, I think adhemerval is right and we should use builtins
>> for this.
>
>
> Few things to notice here.
>
>       This code is more than 20 years old, not sure what all intel processors this code may be running.
>       Builtin will have the same issue, unless it can produce code 16 byte or less.
>       Also the purpose of this patch is to fix random unexpected ~20% perf loss.
>
Likewise for the string/memory library....
Why not just update it w/o the cmov? It just seems like a waste not to
address it while we're at it.

>> >
>> >
>> >>
>> >> >
>> >> > >
>> >> > >
>> >> > > r~
>> >> > >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-26 17:11       ` Adhemerval Zanella Netto
@ 2023-07-27  0:31         ` Cristian Rodríguez
  0 siblings, 0 replies; 37+ messages in thread
From: Cristian Rodríguez @ 2023-07-27  0:31 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: Noah Goldstein, Sunil Pandey, Richard Henderson, libc-alpha, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 295 bytes --]

On Wed, Jul 26, 2023 at 1:12 PM Adhemerval Zanella Netto via Libc-alpha <
libc-alpha@sourceware.org> wrote:

>
>
> Can't we just use compiler builtins instead and remove a bunch of asm?
>

Yes, please. And any better code-generation strategy is better directed at
GCC, not glibc.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27  0:00             ` Noah Goldstein
@ 2023-07-27  8:16               ` Florian Weimer
  2023-07-27 11:46                 ` Alexander Monakov
  0 siblings, 1 reply; 37+ messages in thread
From: Florian Weimer @ 2023-07-27  8:16 UTC (permalink / raw)
  To: Noah Goldstein via Libc-alpha
  Cc: Sunil Pandey, Noah Goldstein, Richard Henderson, hjl.tools

* Noah Goldstein via Libc-alpha:

> Likewise for the string/memory library....
> Why not just update it w.o cmov? Just seems like a waste not to
> address it while we're at it.

Given that it violates the spec, doing it in an inline function seems
kind of risky.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27  8:16               ` Florian Weimer
@ 2023-07-27 11:46                 ` Alexander Monakov
  2023-07-27 12:10                   ` Florian Weimer
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Monakov @ 2023-07-27 11:46 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Noah Goldstein via Libc-alpha, Sunil Pandey, Noah Goldstein,
	Richard Henderson, hjl.tools


On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:

> * Noah Goldstein via Libc-alpha:
> 
> > Likewise for the string/memory library....
> > Why not just update it w.o cmov? Just seems like a waste not to
> > address it while we're at it.
> 
> Given that it violates the spec, doing it in an inline function seems
> kind of risky.

Sorry, what inline function? The patch seems to modify an implementation
in ffsll.c, nothing else.

Alexander

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 11:46                 ` Alexander Monakov
@ 2023-07-27 12:10                   ` Florian Weimer
  2023-07-27 13:59                     ` Cristian Rodríguez
  2023-07-27 14:00                     ` Alexander Monakov
  0 siblings, 2 replies; 37+ messages in thread
From: Florian Weimer @ 2023-07-27 12:10 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Noah Goldstein via Libc-alpha, Sunil Pandey, Noah Goldstein,
	Richard Henderson, hjl.tools

* Alexander Monakov:

> On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
>
>> * Noah Goldstein via Libc-alpha:
>> 
>> > Likewise for the string/memory library....
>> > Why not just update it w.o cmov? Just seems like a waste not to
>> > address it while we're at it.
>> 
>> Given that it violates the spec, doing it in an inline function seems
>> kind of risky.
>
> Sorry, what inline function? The function seems to modify an implementation
> in ffsll.c, nothing else.

Yeah, sorry, I was confused.  There's a GCC built-in, right?  So the
glibc implementation probably isn't that important on x86.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 12:10                   ` Florian Weimer
@ 2023-07-27 13:59                     ` Cristian Rodríguez
  2023-07-27 14:00                     ` Alexander Monakov
  1 sibling, 0 replies; 37+ messages in thread
From: Cristian Rodríguez @ 2023-07-27 13:59 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Alexander Monakov, Noah Goldstein via Libc-alpha, Sunil Pandey,
	Noah Goldstein, Richard Henderson, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 245 bytes --]

On Thu, Jul 27, 2023 at 8:11 AM Florian Weimer via Libc-alpha <
libc-alpha@sourceware.org> wrote:

>
> Yeah, sorry, I was confused.  There's a GCC built-in, right?


Yes, in all targets that have an ffs pattern; the others need libgcc.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 12:10                   ` Florian Weimer
  2023-07-27 13:59                     ` Cristian Rodríguez
@ 2023-07-27 14:00                     ` Alexander Monakov
  2023-07-27 15:13                       ` Sunil Pandey
  1 sibling, 1 reply; 37+ messages in thread
From: Alexander Monakov @ 2023-07-27 14:00 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Noah Goldstein via Libc-alpha, Sunil Pandey, Noah Goldstein,
	Richard Henderson, hjl.tools


On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:

> * Alexander Monakov:
> 
> > On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
> >
> >> * Noah Goldstein via Libc-alpha:
> >> 
> >> > Likewise for the string/memory library....
> >> > Why not just update it w.o cmov? Just seems like a waste not to
> >> > address it while we're at it.
> >> 
> >> Given that it violates the spec, doing it in an inline function seems
> >> kind of risky.
> >
> > Sorry, what inline function? The function seems to modify an implementation
> > in ffsll.c, nothing else.
> 
> Yeah, sorry, I was confused.  There's a GCC built-in, right?  So the
> glibc implementation probably isn't that important on x86.

Yep. The built-in gets expanded even at -Os, so it's quite unusual for Glibc's
implementation to get called. I see the following possibilities:

* at -O0 (but then performance doesn't matter, presumably)
* with -fno-builtin
* when called via a function pointer (see the sketch below)
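
A minimal illustration of the function-pointer case (a hypothetical test
program; the indirect call can't be expanded inline, so the out-of-line
libc ffsll is what actually runs):

```
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>   /* declares ffsll under _GNU_SOURCE */

int
main (void)
{
  int (*pffs) (long long int) = ffsll;  /* defeats builtin expansion */
  printf ("%d\n", pffs (0x10LL));       /* prints 5: lowest set bit is bit 4 */
  return 0;
}
```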

Sunil, could you clear this up, please?

Alexander

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 14:00                     ` Alexander Monakov
@ 2023-07-27 15:13                       ` Sunil Pandey
  2023-07-27 15:50                         ` Alexander Monakov
  2023-07-27 16:24                         ` Florian Weimer
  0 siblings, 2 replies; 37+ messages in thread
From: Sunil Pandey @ 2023-07-27 15:13 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Florian Weimer, Noah Goldstein via Libc-alpha, Noah Goldstein,
	Richard Henderson, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 2213 bytes --]

On Thu, Jul 27, 2023 at 7:00 AM Alexander Monakov <amonakov@ispras.ru>
wrote:

>
> On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
>
> > * Alexander Monakov:
> >
> > > On Thu, 27 Jul 2023, Florian Weimer via Libc-alpha wrote:
> > >
> > >> * Noah Goldstein via Libc-alpha:
> > >>
> > >> > Likewise for the string/memory library....
> > >> > Why not just update it w.o cmov? Just seems like a waste not to
> > >> > address it while we're at it.
> > >>
> > >> Given that it violates the spec, doing it in an inline function seems
> > >> kind of risky.
> > >
> > > Sorry, what inline function? The function seems to modify an
> implementation
> > > in ffsll.c, nothing else.
> >
> > Yeah, sorry, I was confused.  There's a GCC built-in, right?  So the
> > glibc implementation probably isn't that important on x86.
>
> Yep. The built-in gets expanded even at -Os, so it's quite unusual that
> Glibc's
> implementation get called. I see the following possibilities:
>
> * at -O0 (but then performance doesn't matter, presumably)
> * with -fno-builtin
> * when called via a function pointer
>
> Sunil, could you clear this up, please?
>

This issue only matters if the ffsll functionality is implemented in a
function (size > 16 bytes) and the function is not inlined (it doesn't
matter whether it's implemented in C or assembly).

By default the function entry point gets aligned to a 16-byte boundary,
so the following layouts are all valid (see the tally below):

64-byte aligned: no issue, as the 17-byte function will not cause another cache line load.
48-byte aligned: ~20% regression, as the 17-byte function will trigger another cache line load.
32-byte aligned: no issue, as the 17-byte function will not cause another cache line load.
16-byte aligned: no issue, as the 17-byte function will not cause another cache line load.
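
The arithmetic behind the one bad case, assuming 64-byte cache lines:

```
48-byte offset: 48 + 17 = 65 > 64  -> the last byte spills into the next line
0/16/32-byte offsets: 17 bytes fit in the 64/48/32 bytes left in the line
```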

Ffsll is one of the benchmark tests in the Phoronix Test Suite; it's not
clear how much it matters to applications. Lots of people are involved in
Phoronix benchmark testing/tracking, and this kind of random perf behavior
wastes their time.

Again, I'm not against GCC, but if this function exists in glibc, I don't
see any harm in fixing it.


>
> Alexander
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 15:13                       ` Sunil Pandey
@ 2023-07-27 15:50                         ` Alexander Monakov
  2023-07-27 16:24                         ` Florian Weimer
  1 sibling, 0 replies; 37+ messages in thread
From: Alexander Monakov @ 2023-07-27 15:50 UTC (permalink / raw)
  To: Sunil Pandey
  Cc: Florian Weimer, Noah Goldstein via Libc-alpha, Noah Goldstein,
	Richard Henderson, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 1057 bytes --]


On Thu, 27 Jul 2023, Sunil Pandey wrote:

> Ffsll is one of the benchmark tests in the phoronix test suite, not
> sure how much it matters to the application. Lots of people involved
> in phoronix benchmark testing/tracking and this kind of random perf
> behavior wastes their time.

Ah, I see — PTS runs Glibc benchtests (which use -fno-builtin).

> Again I'm not against GCC, but if this function exists in glibc, I don't
> see any harm in fixing it.

Sure. But as Richard seems to be pointing out, the function seems
needlessly obfuscated (even the first comment is wrong!). So if Noah's
suggestion is not accepted, I can offer the following cleaned-up
variant:

int
ffsll (long long int x)
{
  int cnt;

  asm ("bsfq %1,%q0\n"          /* Count low bits in X and store in %0.  */
       "cmovel %2,%0\n"         /* If number was zero, use -1 as result.  */
       : "=&r" (cnt) : "r" (x), "r" (-1));

  return cnt + 1;
}

(note that the initial bsfq has an undesirable false dependency on the
output operand; Noah's variant fixes that too)

Alexander

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 15:13                       ` Sunil Pandey
  2023-07-27 15:50                         ` Alexander Monakov
@ 2023-07-27 16:24                         ` Florian Weimer
  2023-07-27 16:35                           ` Adhemerval Zanella Netto
  2023-07-27 16:40                           ` [PATCH] " Sunil Pandey
  1 sibling, 2 replies; 37+ messages in thread
From: Florian Weimer @ 2023-07-27 16:24 UTC (permalink / raw)
  To: Sunil Pandey
  Cc: Alexander Monakov, Noah Goldstein via Libc-alpha, Noah Goldstein,
	Richard Henderson, hjl.tools

* Sunil Pandey:

> Ffsll is one of the benchmark tests in the phoronix test suite, not
> sure how much it matters to the application. Lots of people involved
> in phoronix benchmark testing/tracking and this kind of random perf
> behavior wastes their time.

That's a good point.  I've seen similar reports before (sadly I don't
recall if they were specifically about ffsll).

Regarding the mechanics of fixing it, if the instruction ordering and
sizing is so sensitive, should this be an assembler implementation
instead?  And will the fix even work for distributions that build with
--enable-cet, considering that there's going to be an additional 4-byte
NOP at the start of the function?
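
The NOP in question is endbr64 (f3 0f 1e fa, 4 bytes), so a
back-of-the-envelope tally for the fixed 16-byte version would be:

```
endbr64            4 bytes  (added by --enable-cet)
function body     16 bytes
total             20 bytes  -> at a 48-byte offset, 48 + 20 = 68 > 64,
                              so the function again straddles a cache line
```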

Thanks,
Florian


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 16:24                         ` Florian Weimer
@ 2023-07-27 16:35                           ` Adhemerval Zanella Netto
  2023-07-27 17:09                             ` Florian Weimer
  2023-07-27 16:40                           ` [PATCH] " Sunil Pandey
  1 sibling, 1 reply; 37+ messages in thread
From: Adhemerval Zanella Netto @ 2023-07-27 16:35 UTC (permalink / raw)
  To: Florian Weimer, Sunil Pandey
  Cc: Alexander Monakov, Noah Goldstein via Libc-alpha, Noah Goldstein,
	Richard Henderson, hjl.tools



On 27/07/23 13:24, Florian Weimer via Libc-alpha wrote:
> * Sunil Pandey:
> 
>> Ffsll is one of the benchmark tests in the phoronix test suite, not
>> sure how much it matters to the application. Lots of people involved
>> in phoronix benchmark testing/tracking and this kind of random perf
>> behavior wastes their time.
> 
> That's a good point.  I've seen similar reports before (sadly I don't
> recall if they were specifically about ffsll).
> 
> Regarding the mechanics of fixing it, if the instruction ordering and
> sizing is so sensitive, should this be an assembler implementation
> instead?  And will the fix even work for distributions that build with
> --enable-cet, considering that there's going to be an additional 4-byte
> NOP at the start of the function?
> 


Sigh... do we really need to care about this synthetic benchmark that is
exercising a fallback path, since the compiler will most likely issue the
inline builtin? And even if this is really important, tuning function
alignment and size to fit in a cache line should be done by the compiler,
especially in the case where we can implement it using a builtin.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 16:24                         ` Florian Weimer
  2023-07-27 16:35                           ` Adhemerval Zanella Netto
@ 2023-07-27 16:40                           ` Sunil Pandey
  1 sibling, 0 replies; 37+ messages in thread
From: Sunil Pandey @ 2023-07-27 16:40 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Alexander Monakov, Noah Goldstein via Libc-alpha, Noah Goldstein,
	Richard Henderson, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 1320 bytes --]

On Thu, Jul 27, 2023 at 9:24 AM Florian Weimer <fweimer@redhat.com> wrote:

> * Sunil Pandey:
>
> > Ffsll is one of the benchmark tests in the phoronix test suite, not
> > sure how much it matters to the application. Lots of people involved
> > in phoronix benchmark testing/tracking and this kind of random perf
> > behavior wastes their time.
>
> That's a good point.  I've seen similar reports before (sadly I don't
> recall if they were specifically about ffsll).
>
> Regarding the mechanics of fixing it, if the instruction ordering and
> sizing is so sensitive, should this be an assembler implementation
> instead?


Instruction ordering and sizing pretty much always matter. Many instructions
can't be used in low-latency applications because their encoding size is big.
Unfortunately, assemblers can't do much in this case.


>   And will the fix even work for distributions that build with
> --enable-cet, considering that there's going to be an additional 4-byte
> NOP at the start of the function?
>

The extra 4 bytes from --enable-cet can affect the performance of many
small, highly optimized routines and can potentially create perf variation
depending on how a function gets loaded into memory.  But again, this is
one of the costs of using CET.


> Thanks,
> Florian
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 16:35                           ` Adhemerval Zanella Netto
@ 2023-07-27 17:09                             ` Florian Weimer
  2023-07-27 17:25                               ` Sunil Pandey
  0 siblings, 1 reply; 37+ messages in thread
From: Florian Weimer @ 2023-07-27 17:09 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: Sunil Pandey, Alexander Monakov, Noah Goldstein via Libc-alpha,
	Noah Goldstein, Richard Henderson, hjl.tools

* Adhemerval Zanella Netto:

> On 27/07/23 13:24, Florian Weimer via Libc-alpha wrote:
>> * Sunil Pandey:
>> 
>>> Ffsll is one of the benchmark tests in the phoronix test suite, not
>>> sure how much it matters to the application. Lots of people involved
>>> in phoronix benchmark testing/tracking and this kind of random perf
>>> behavior wastes their time.
>> 
>> That's a good point.  I've seen similar reports before (sadly I don't
>> recall if they were specifically about ffsll).
>> 
>> Regarding the mechanics of fixing it, if the instruction ordering and
>> sizing is so sensitive, should this be an assembler implementation
>> instead?  And will the fix even work for distributions that build with
>> --enable-cet, considering that there's going to be an additional 4-byte
>> NOP at the start of the function?

> Sigh... do we really need to care about this synthetic benchmark that is
> exercising a fallback path since compiler will most likely issue the
> inline builtin? And even this is really important, tune function alignment
> and size to fit on a cacheline should be done by the compiler, specially
> in the case where we can implement by using a builtin.

I think avoiding wasting people's time with spurious benchmark
differences is useful.  Compared to things like the PID cache, this one
seems pretty harmless.

Maybe we should increase function alignment to cover the CET case, too?
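
For instance (a hypothetical knob, not a concrete proposal), something like:

```
gcc -falign-functions=32 ...
```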

Thanks,
Florian


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] x86_64: Optimize ffsll function code size.
  2023-07-27 17:09                             ` Florian Weimer
@ 2023-07-27 17:25                               ` Sunil Pandey
  2023-07-31 18:35                                 ` [PATCH v2] " Sunil K Pandey
  0 siblings, 1 reply; 37+ messages in thread
From: Sunil Pandey @ 2023-07-27 17:25 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Adhemerval Zanella Netto, Alexander Monakov,
	Noah Goldstein via Libc-alpha, Noah Goldstein, Richard Henderson,
	hjl.tools

[-- Attachment #1: Type: text/plain, Size: 1873 bytes --]

On Thu, Jul 27, 2023 at 10:09 AM Florian Weimer <fweimer@redhat.com> wrote:

> * Adhemerval Zanella Netto:
>
> > On 27/07/23 13:24, Florian Weimer via Libc-alpha wrote:
> >> * Sunil Pandey:
> >>
> >>> Ffsll is one of the benchmark tests in the phoronix test suite, not
> >>> sure how much it matters to the application. Lots of people involved
> >>> in phoronix benchmark testing/tracking and this kind of random perf
> >>> behavior wastes their time.
> >>
> >> That's a good point.  I've seen similar reports before (sadly I don't
> >> recall if they were specifically about ffsll).
> >>
> >> Regarding the mechanics of fixing it, if the instruction ordering and
> >> sizing is so sensitive, should this be an assembler implementation
> >> instead?  And will the fix even work for distributions that build with
> >> --enable-cet, considering that there's going to be an additional 4-byte
> >> NOP at the start of the function?
>
> > Sigh... do we really need to care about this synthetic benchmark that is
> > exercising a fallback path since compiler will most likely issue the
> > inline builtin? And even this is really important, tune function
> alignment
> > and size to fit on a cacheline should be done by the compiler, specially
> > in the case where we can implement by using a builtin.
>
> I think avoiding wasting people's time with spurious benchmark
> differences is useful.  Compared to things like the PID cache, this one
> seems pretty harmless.
>
> Maybe we should increase function alignment to cover the CET case, too?
>

Good point. I have seen some other cases where function alignment creates
a big perf swing. I think aligning function entry points to the cache line
size will fix most of the alignment-related issues. The trade-off of bigger
alignment will be increased memory usage.


> Thanks,
> Florian
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-27 17:25                               ` Sunil Pandey
@ 2023-07-31 18:35                                 ` Sunil K Pandey
  2023-07-31 20:12                                   ` Adhemerval Zanella Netto
  2024-01-10 19:19                                   ` Carlos O'Donell
  0 siblings, 2 replies; 37+ messages in thread
From: Sunil K Pandey @ 2023-07-31 18:35 UTC (permalink / raw)
  To: libc-alpha; +Cc: hjl.tools

Ffsll function size is 17 bytes; this patch optimizes the size to 16 bytes.
Currently the ffsll function randomly regresses by ~20%, depending on how
the code gets aligned.

This patch fixes the ffsll function's random performance regression.

Changes from v1:
- Further reduce the ffsll function size to 12 bytes.
---
 sysdeps/x86_64/ffsll.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
index a1c13d4906..6a5803c7c1 100644
--- a/sysdeps/x86_64/ffsll.c
+++ b/sysdeps/x86_64/ffsll.c
@@ -26,13 +26,13 @@ int
 ffsll (long long int x)
 {
   long long int cnt;
-  long long int tmp;
 
-  asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
-       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
-       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
+  asm ("mov $-1,%k0\n"	/* Intialize CNT to -1.  */
+       "bsf %1,%0\n"	/* Count low bits in X and store in CNT.  */
+       "inc %k0\n"	/* Increment CNT by 1.  */
+       : "=&r" (cnt) : "r" (x));
 
-  return cnt + 1;
+  return cnt;
 }
 
 #ifndef __ILP32__
-- 
2.41.0


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-31 18:35                                 ` [PATCH v2] " Sunil K Pandey
@ 2023-07-31 20:12                                   ` Adhemerval Zanella Netto
  2023-07-31 20:58                                     ` Sunil Pandey
  2024-01-10 19:19                                   ` Carlos O'Donell
  1 sibling, 1 reply; 37+ messages in thread
From: Adhemerval Zanella Netto @ 2023-07-31 20:12 UTC (permalink / raw)
  To: Sunil K Pandey, libc-alpha; +Cc: hjl.tools



On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> Currently ffsll function randomly regress by ~20%, depending on how
> code get aligned.
> 
> This patch fixes ffsll function random performance regression.
> 
> Changes from v1:
> - Further reduce size ffsll function size to 12 bytes.
> ---
>  sysdeps/x86_64/ffsll.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..6a5803c7c1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -26,13 +26,13 @@ int
>  ffsll (long long int x)
>  {
>    long long int cnt;
> -  long long int tmp;
>  
> -  asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
> -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> +  asm ("mov $-1,%k0\n"	/* Intialize CNT to -1.  */
> +       "bsf %1,%0\n"	/* Count low bits in X and store in CNT.  */
> +       "inc %k0\n"	/* Increment CNT by 1.  */
> +       : "=&r" (cnt) : "r" (x));
>  
> -  return cnt + 1;
> +  return cnt;
>  }
>  
>  #ifndef __ILP32__



I still prefer that we just remove this arch-optimized function in favor
of compiler builtins.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-31 20:12                                   ` Adhemerval Zanella Netto
@ 2023-07-31 20:58                                     ` Sunil Pandey
  2023-07-31 22:57                                       ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 37+ messages in thread
From: Sunil Pandey @ 2023-07-31 20:58 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: libc-alpha, hjl.tools

[-- Attachment #1: Type: text/plain, Size: 1682 bytes --]

On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <
adhemerval.zanella@linaro.org> wrote:

>
>
> On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
> >
> > Changes from v1:
> > - Further reduce size ffsll function size to 12 bytes.
> > ---
> >  sysdeps/x86_64/ffsll.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..6a5803c7c1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -26,13 +26,13 @@ int
> >  ffsll (long long int x)
> >  {
> >    long long int cnt;
> > -  long long int tmp;
> >
> > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and store
> in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
> result.  */
> > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> > +  asm ("mov $-1,%k0\n"       /* Intialize CNT to -1.  */
> > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
> > +       "inc %k0\n"   /* Increment CNT by 1.  */
> > +       : "=&r" (cnt) : "r" (x));
> >
> > -  return cnt + 1;
> > +  return cnt;
> >  }
> >
> >  #ifndef __ILP32__
>
>
>
> I still prefer if we can just remove this arch-optimized function in favor
> of compiler builtins.
>

Sure, the compiler builtin should replace it in the long run.
In the meantime, can it get fixed?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-31 20:58                                     ` Sunil Pandey
@ 2023-07-31 22:57                                       ` Adhemerval Zanella Netto
  2023-07-31 23:44                                         ` Sunil Pandey
  0 siblings, 1 reply; 37+ messages in thread
From: Adhemerval Zanella Netto @ 2023-07-31 22:57 UTC (permalink / raw)
  To: Sunil Pandey; +Cc: libc-alpha, hjl.tools



On 31/07/23 17:58, Sunil Pandey wrote:
> 
> 
> On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org <mailto:adhemerval.zanella@linaro.org>> wrote:
> 
> 
> 
>     On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
>     > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
>     > Currently ffsll function randomly regress by ~20%, depending on how
>     > code get aligned.
>     >
>     > This patch fixes ffsll function random performance regression.
>     >
>     > Changes from v1:
>     > - Further reduce size ffsll function size to 12 bytes.
>     > ---
>     >  sysdeps/x86_64/ffsll.c | 10 +++++-----
>     >  1 file changed, 5 insertions(+), 5 deletions(-)
>     >
>     > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
>     > index a1c13d4906..6a5803c7c1 100644
>     > --- a/sysdeps/x86_64/ffsll.c
>     > +++ b/sysdeps/x86_64/ffsll.c
>     > @@ -26,13 +26,13 @@ int
>     >  ffsll (long long int x)
>     >  {
>     >    long long int cnt;
>     > -  long long int tmp;
>     > 
>     > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and store in %1.  */
>     > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as result.  */
>     > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
>     > +  asm ("mov $-1,%k0\n"       /* Initialize CNT to -1.  */
>     > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
>     > +       "inc %k0\n"   /* Increment CNT by 1.  */
>     > +       : "=&r" (cnt) : "r" (x));
>     > 
>     > -  return cnt + 1;
>     > +  return cnt;
>     >  }
>     > 
>     >  #ifndef __ILP32__
> 
> 
> 
>     I still prefer if we can just remove this arch-optimized function in favor
>     of compiler builtins.
> 
> 
> Sure, compiler builtin should replace it in the long run.
> In the meantime, can it get fixed? 

This fix only works if the compiler does not insert anything in the prologue;
if you use CET or stack-protector-strong it might not work.  And I *really*
do not want to add another assembly optimization for a symbol that won't
be used in most real programs.

And we already have a fix to use compiler builtins [1].

[1] https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
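
The generic version in [1] presumably reduces to a thin wrapper over the
compiler builtin, along these lines (a sketch of the approach, not the
exact patch; ffs and ffsl would be handled the same way):

int
ffsll (long long int x)
{
  return __builtin_ffsll (x);
}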

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-31 22:57                                       ` Adhemerval Zanella Netto
@ 2023-07-31 23:44                                         ` Sunil Pandey
  2023-07-31 23:54                                           ` Noah Goldstein
  2023-08-01 13:46                                           ` Adhemerval Zanella Netto
  0 siblings, 2 replies; 37+ messages in thread
From: Sunil Pandey @ 2023-07-31 23:44 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: libc-alpha, hjl.tools


On Mon, Jul 31, 2023 at 3:57 PM Adhemerval Zanella Netto <
adhemerval.zanella@linaro.org> wrote:

>
>
> On 31/07/23 17:58, Sunil Pandey wrote:
> >
> >
> > On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <
> adhemerval.zanella@linaro.org <mailto:adhemerval.zanella@linaro.org>>
> wrote:
> >
> >
> >
> >     On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> >     > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> >     > Currently ffsll function randomly regress by ~20%, depending on how
> >     > code get aligned.
> >     >
> >     > This patch fixes ffsll function random performance regression.
> >     >
> >     > Changes from v1:
> >     > - Further reduce size ffsll function size to 12 bytes.
> >     > ---
> >     >  sysdeps/x86_64/ffsll.c | 10 +++++-----
> >     >  1 file changed, 5 insertions(+), 5 deletions(-)
> >     >
> >     > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> >     > index a1c13d4906..6a5803c7c1 100644
> >     > --- a/sysdeps/x86_64/ffsll.c
> >     > +++ b/sysdeps/x86_64/ffsll.c
> >     > @@ -26,13 +26,13 @@ int
> >     >  ffsll (long long int x)
> >     >  {
> >     >    long long int cnt;
> >     > -  long long int tmp;
> >     >
> >     > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and store in %1.  */
> >     > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as result.  */
> >     > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> >     > +  asm ("mov $-1,%k0\n"       /* Initialize CNT to -1.  */
> >     > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
> >     > +       "inc %k0\n"   /* Increment CNT by 1.  */
> >     > +       : "=&r" (cnt) : "r" (x));
> >     >
> >     > -  return cnt + 1;
> >     > +  return cnt;
> >     >  }
> >     >
> >     >  #ifndef __ILP32__
> >
> >
> >
> >     I still prefer if we can just remove this arch-optimized function in
> >     favor of compiler builtins.
> >
> >
> > Sure, compiler builtin should replace it in the long run.
> > In the meantime, can it get fixed?
>
> This fix only works if compiler does not insert anything in the prologue,
> if
> you use CET or stack protector strong it might not work.  And I *really*
> do not want to add another assembly optimization to a symbol that won't
> be used in most real programs.
>

v2 will fix it, as the CET overhead is 4 bytes and the latest code size is
only 12 bytes, so the total code size will be 16 bytes even if CET gets
enabled.
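
As a back-of-the-envelope check of those numbers, with typical encodings
(assumed here, not measured from an actual build):

endbr64            # f3 0f 1e fa      4 bytes, emitted only with CET
mov    $-1,%eax    # b8 ff ff ff ff   5 bytes
bsf    %rdi,%rax   # 48 0f bc c7      4 bytes
inc    %eax        # ff c0            2 bytes
ret                # c3               1 byte

That is 12 bytes without CET and 16 with it, so the function fits in a
single 16-byte block either way.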


> And already have a fix to use compiler builtins [1].
>

Here is the code generated from the builtin patch.

(gdb) b ffsll
Breakpoint 1 at 0x10a0
(gdb) run --direct
Starting program: benchtests/bench-ffsll --direct
Breakpoint 1, __ffsll (i=66900473708975203) at ffsll.c:30
30  return __builtin_ffsll (i);
(gdb) disass
Dump of assembler code for function __ffsll:
=> 0x00007ffff7c955a0 <+0>: bsf    %rdi,%rdi
   0x00007ffff7c955a4 <+4>: mov    $0xffffffffffffffff,%rax
   0x00007ffff7c955ab <+11>: cmove  %rax,%rdi
   0x00007ffff7c955af <+15>: lea    0x1(%rdi),%eax
   0x00007ffff7c955b2 <+18>: ret
End of assembler dump.
(gdb)

It is not going to fix the problem: the builtin version above is 19 bytes
(its ret sits at offset +18), so the random ~20% variation from code
alignment will continue in benchmarking even with the builtin patch.

I do not see anything wrong with using architecture features if it helps
people save their valuable debugging time.  After all, it is a valuable
glibc feature to be able to override a generic implementation with
architecture-specific code.


> [1]
> https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-31 23:44                                         ` Sunil Pandey
@ 2023-07-31 23:54                                           ` Noah Goldstein
  2023-08-01  6:47                                             ` Andreas Schwab
  2023-08-01 13:46                                           ` Adhemerval Zanella Netto
  1 sibling, 1 reply; 37+ messages in thread
From: Noah Goldstein @ 2023-07-31 23:54 UTC (permalink / raw)
  To: Sunil Pandey; +Cc: Adhemerval Zanella Netto, libc-alpha, hjl.tools

On Mon, Jul 31, 2023 at 6:44 PM Sunil Pandey via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Mon, Jul 31, 2023 at 3:57 PM Adhemerval Zanella Netto <
> adhemerval.zanella@linaro.org> wrote:
>
> >
> >
> > On 31/07/23 17:58, Sunil Pandey wrote:
> > >
> > >
> > > On Mon, Jul 31, 2023 at 1:12 PM Adhemerval Zanella Netto <
> > adhemerval.zanella@linaro.org <mailto:adhemerval.zanella@linaro.org>>
> > wrote:
> > >
> > >
> > >
> > >     On 31/07/23 15:35, Sunil K Pandey via Libc-alpha wrote:
> > >     > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > >     > Currently ffsll function randomly regress by ~20%, depending on how
> > >     > code get aligned.
> > >     >
> > >     > This patch fixes ffsll function random performance regression.
> > >     >
> > >     > Changes from v1:
> > >     > - Further reduce size ffsll function size to 12 bytes.
> > >     > ---
> > >     >  sysdeps/x86_64/ffsll.c | 10 +++++-----
> > >     >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >     >
> > >     > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > >     > index a1c13d4906..6a5803c7c1 100644
> > >     > --- a/sysdeps/x86_64/ffsll.c
> > >     > +++ b/sysdeps/x86_64/ffsll.c
> > >     > @@ -26,13 +26,13 @@ int
> > >     >  ffsll (long long int x)
> > >     >  {
> > >     >    long long int cnt;
> > >     > -  long long int tmp;
> > >     >
> > >     > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and store in %1.  */
> > >     > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as result.  */
> > >     > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> > >     > +  asm ("mov $-1,%k0\n"       /* Initialize CNT to -1.  */
> > >     > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
> > >     > +       "inc %k0\n"   /* Increment CNT by 1.  */
> > >     > +       : "=&r" (cnt) : "r" (x));
> > >     >
> > >     > -  return cnt + 1;
> > >     > +  return cnt;
> > >     >  }
> > >     >
> > >     >  #ifndef __ILP32__
> > >
> > >
> > >
> > >     I still prefer if we can just remove this arch-optimized function in
> > >     favor of compiler builtins.
> > >
> > >
> > > Sure, compiler builtin should replace it in the long run.
> > > In the meantime, can it get fixed?
> >
> > This fix only works if compiler does not insert anything in the prologue,
> > if
> > you use CET or stack protector strong it might not work.  And I *really*
> > do not want to add another assembly optimization to a symbol that won't
> > be used in most real programs.
> >
>
> v2 will fix it, as CET overhead is 4 byte and the latest code size is only
> 12 byte.
> So toal code size will be 16 byte even if CET gets enabled.
>
>
> > And already have a fix to use compiler builtins [1].
> >
>
> Here is code generated from the builtin patch.
>
> (gdb) b ffsll
> Breakpoint 1 at 0x10a0
> (gdb) run --direct
> Starting program: benchtests/bench-ffsll --direct
> Breakpoint 1, __ffsll (i=66900473708975203) at ffsll.c:30
> 30  return __builtin_ffsll (i);
> (gdb) disass
> Dump of assembler code for function __ffsll:
> => 0x00007ffff7c955a0 <+0>: bsf    %rdi,%rdi
>    0x00007ffff7c955a4 <+4>: mov    $0xffffffffffffffff,%rax
>    0x00007ffff7c955ab <+11>: cmove  %rax,%rdi
>    0x00007ffff7c955af <+15>: lea    0x1(%rdi),%eax
>    0x00007ffff7c955b2 <+18>: ret
> End of assembler dump.
> (gdb)
>
> It is not going to fix the problem. Random 20% variation will continue even
> with
> builtin patch in benchmarking.
>
> I do not see anything wrong using architecture features, if it helps
> people save their valuable debugging time. After all, its valuable
> glibc feature to override generic implementation with architecture specific
> code.
>
>
> > [1]
> > https://patchwork.sourceware.org/project/glibc/patch/20230717143431.2075924-1-adhemerval.zanella@linaro.org/
> >

For my money this patch is fine.
It's not really that important to optimize (or worth the time that's been
spent in argument), and I don't really see the harm in the asm impl,
esp since GCC doesn't really produce good codegen for the builtin :(

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-31 23:54                                           ` Noah Goldstein
@ 2023-08-01  6:47                                             ` Andreas Schwab
  0 siblings, 0 replies; 37+ messages in thread
From: Andreas Schwab @ 2023-08-01  6:47 UTC (permalink / raw)
  To: Noah Goldstein via Libc-alpha
  Cc: Sunil Pandey, Noah Goldstein, Adhemerval Zanella Netto, hjl.tools

On Jul 31 2023, Noah Goldstein via Libc-alpha wrote:

> esp since GCC doesn't really produce good codegen for the builtin :(

So this is a GCC bug.  Please report it.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-31 23:44                                         ` Sunil Pandey
  2023-07-31 23:54                                           ` Noah Goldstein
@ 2023-08-01 13:46                                           ` Adhemerval Zanella Netto
  2023-08-01 18:25                                             ` Cristian Rodríguez
  1 sibling, 1 reply; 37+ messages in thread
From: Adhemerval Zanella Netto @ 2023-08-01 13:46 UTC (permalink / raw)
  To: Sunil Pandey, Noah Goldstein, Andreas Schwab; +Cc: libc-alpha, hjl.tools



On 31/07/23 20:44, Sunil Pandey wrote:
>
> 
> It is not going to fix the problem. Random 20% variation will continue even with
> builtin patch in benchmarking.
> 
> I do not see anything wrong using architecture features, if it helps
> people save their valuable debugging time. After all, its valuable
> glibc feature to override generic implementation with architecture specific
> code.
>
Because it is a synthetic benchmark that is exercising code that should not be
stressed with the default compiler flags.  And the problem with using arch-specific
code is that, when you already have support from the compiler to generate optimized
code, it is just extra complexity plus maintenance.

On my system, ffsll is only used by a single program (/usr/bin/nvidia-persistenced),
and I am not sure how it is compiled because it is a closed-source one.  ffs is
used by a couple more (gdb, lvm, and some mesa drivers), but again it is far from
being a hotspot.

And as Andreas has said, the best course of action here is to fix the compiler
if it is not generating good enough code.  Fixing gcc means that we will get
any optimization for free (by using the builtin), and so will any code that
actually uses ffs/ffsl/ffsll.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-08-01 13:46                                           ` Adhemerval Zanella Netto
@ 2023-08-01 18:25                                             ` Cristian Rodríguez
  0 siblings, 0 replies; 37+ messages in thread
From: Cristian Rodríguez @ 2023-08-01 18:25 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: Sunil Pandey, Noah Goldstein, Andreas Schwab, libc-alpha, hjl.tools


On Tue, Aug 1, 2023 at 9:47 AM Adhemerval Zanella Netto via Libc-alpha <
libc-alpha@sourceware.org> wrote:

>
>
> On 31/07/23 20:44, Sunil Pandey wrote:
>
> On my system, ffssl is only used by a single program
> (/usr/bin/nvidia-persistenced)
> which I am not sure how it is compile because it a closed source one.


It is probably built with some strict-standards setting, -ansi or whatever.


>   The ffs is
> used by a couple more (gdb, lvm, and some mesa drivers), but again far
> from being
> a hotspot.
>
https://cgit.freedesktop.org/drm/libdrm/tree/meson.build#n27 --> built
with std=c11 instead of gnu11, and therefore not benefiting from the compiler's
use of builtins.
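
A minimal way to reproduce the effect described here (a hypothetical test
case, not taken from libdrm):

/* ffs-std.c: build with -O2 -std=gnu11, then with -O2 -std=c11.
   Under gnu11, GCC treats ffsll () as its builtin and can expand it
   inline; under strict c11, ffs/ffsll are not ISO C functions, so the
   builtin is disabled and an out-of-line call to the libc function is
   emitted instead.  */
#define _GNU_SOURCE             /* glibc declares ffsll in <string.h>.  */
#include <string.h>

int
lowest_set_bit (long long int x)
{
  return ffsll (x);
}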

> And as Andreas have said, the best course of action here is to fix the
> compiler
> if it is not generating code enough code.  Fixing gcc means that we will
> get
> any optimization by free (by using the builtin) and any code that actually
> uses
> ffs/ffsl/ffsll.
>

I agree.  The compiler already does this; if it does so suboptimally, the ball is
in GCC's court to fix it.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2023-07-31 18:35                                 ` [PATCH v2] " Sunil K Pandey
  2023-07-31 20:12                                   ` Adhemerval Zanella Netto
@ 2024-01-10 19:19                                   ` Carlos O'Donell
  2024-01-25  3:10                                     ` Sunil Pandey
  1 sibling, 1 reply; 37+ messages in thread
From: Carlos O'Donell @ 2024-01-10 19:19 UTC (permalink / raw)
  To: Sunil K Pandey, libc-alpha, Adhemerval Zanella; +Cc: hjl.tools

On 7/31/23 14:35, Sunil K Pandey via Libc-alpha wrote:
> Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> Currently ffsll function randomly regress by ~20%, depending on how
> code get aligned.
> 
> This patch fixes ffsll function random performance regression.

Here is my suggestion.

Commit this to glibc 2.39 as an incremental improvement that we can backport
to any active branch.

I have already approved the removal for glibc 2.40 via Adhemerval's patch.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> Changes from v1:
> - Further reduce size ffsll function size to 12 bytes.
> ---
>  sysdeps/x86_64/ffsll.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> index a1c13d4906..6a5803c7c1 100644
> --- a/sysdeps/x86_64/ffsll.c
> +++ b/sysdeps/x86_64/ffsll.c
> @@ -26,13 +26,13 @@ int
>  ffsll (long long int x)
>  {
>    long long int cnt;
> -  long long int tmp;
>  
> -  asm ("bsfq %2,%0\n"		/* Count low bits in X and store in %1.  */
> -       "cmoveq %1,%0\n"		/* If number was zero, use -1 as result.  */
> -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> +  asm ("mov $-1,%k0\n"	/* Intialize CNT to -1.  */
> +       "bsf %1,%0\n"	/* Count low bits in X and store in CNT.  */
> +       "inc %k0\n"	/* Increment CNT by 1.  */
> +       : "=&r" (cnt) : "r" (x));
>  
> -  return cnt + 1;
> +  return cnt;
>  }
>  
>  #ifndef __ILP32__

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2] x86_64: Optimize ffsll function code size.
  2024-01-10 19:19                                   ` Carlos O'Donell
@ 2024-01-25  3:10                                     ` Sunil Pandey
  0 siblings, 0 replies; 37+ messages in thread
From: Sunil Pandey @ 2024-01-25  3:10 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: libc-alpha, Adhemerval Zanella, hjl.tools


On Wed, Jan 10, 2024 at 11:19 AM Carlos O'Donell <carlos@redhat.com> wrote:

> On 7/31/23 14:35, Sunil K Pandey via Libc-alpha wrote:
> > Ffsll function size is 17 byte, this patch optimizes size to 16 byte.
> > Currently ffsll function randomly regress by ~20%, depending on how
> > code get aligned.
> >
> > This patch fixes ffsll function random performance regression.
>
> Here is my suggestion.
>
> Commit this to glibc 2.39 as an incremental improvement that we can
> backport
> to any active branch.
>
> I have already approved the removal for glibc 2.40 via Adhemerval's patch.
>
> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
>
> > Changes from v1:
> > - Further reduce size ffsll function size to 12 bytes.
> > ---
> >  sysdeps/x86_64/ffsll.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/ffsll.c b/sysdeps/x86_64/ffsll.c
> > index a1c13d4906..6a5803c7c1 100644
> > --- a/sysdeps/x86_64/ffsll.c
> > +++ b/sysdeps/x86_64/ffsll.c
> > @@ -26,13 +26,13 @@ int
> >  ffsll (long long int x)
> >  {
> >    long long int cnt;
> > -  long long int tmp;
> >
> > -  asm ("bsfq %2,%0\n"                /* Count low bits in X and store
> in %1.  */
> > -       "cmoveq %1,%0\n"              /* If number was zero, use -1 as
> result.  */
> > -       : "=&r" (cnt), "=r" (tmp) : "rm" (x), "1" (-1));
> > +  asm ("mov $-1,%k0\n"       /* Intialize CNT to -1.  */
> > +       "bsf %1,%0\n" /* Count low bits in X and store in CNT.  */
> > +       "inc %k0\n"   /* Increment CNT by 1.  */
> > +       : "=&r" (cnt) : "r" (x));
> >
> > -  return cnt + 1;
> > +  return cnt;
> >  }
> >
> >  #ifndef __ILP32__
>
> --
> Cheers,
> Carlos.
>

I would like to backport this patch to release branches from 2.28 to 2.38.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2024-01-25  3:10 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-26 16:05 [PATCH] x86_64: Optimize ffsll function code size Sunil K Pandey
2023-07-26 16:38 ` Richard Henderson
2023-07-26 16:50   ` H.J. Lu
2023-07-26 16:51   ` Noah Goldstein
2023-07-26 16:51   ` Sunil Pandey
2023-07-26 16:59     ` Noah Goldstein
2023-07-26 17:11       ` Adhemerval Zanella Netto
2023-07-27  0:31         ` Cristian Rodríguez
2023-07-26 20:43       ` Sunil Pandey
2023-07-26 21:05         ` Noah Goldstein
2023-07-26 22:37           ` Sunil Pandey
2023-07-27  0:00             ` Noah Goldstein
2023-07-27  8:16               ` Florian Weimer
2023-07-27 11:46                 ` Alexander Monakov
2023-07-27 12:10                   ` Florian Weimer
2023-07-27 13:59                     ` Cristian Rodríguez
2023-07-27 14:00                     ` Alexander Monakov
2023-07-27 15:13                       ` Sunil Pandey
2023-07-27 15:50                         ` Alexander Monakov
2023-07-27 16:24                         ` Florian Weimer
2023-07-27 16:35                           ` Adhemerval Zanella Netto
2023-07-27 17:09                             ` Florian Weimer
2023-07-27 17:25                               ` Sunil Pandey
2023-07-31 18:35                                 ` [PATCH v2] " Sunil K Pandey
2023-07-31 20:12                                   ` Adhemerval Zanella Netto
2023-07-31 20:58                                     ` Sunil Pandey
2023-07-31 22:57                                       ` Adhemerval Zanella Netto
2023-07-31 23:44                                         ` Sunil Pandey
2023-07-31 23:54                                           ` Noah Goldstein
2023-08-01  6:47                                             ` Andreas Schwab
2023-08-01 13:46                                           ` Adhemerval Zanella Netto
2023-08-01 18:25                                             ` Cristian Rodríguez
2024-01-10 19:19                                   ` Carlos O'Donell
2024-01-25  3:10                                     ` Sunil Pandey
2023-07-27 16:40                           ` [PATCH] " Sunil Pandey
2023-07-26 17:04 ` Noah Goldstein
2023-07-26 17:25   ` Andrew Pinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).