Priority of builtins expansion strategies


* Priority of builtins expansion strategies
@ 2021-07-12 11:29 Christoph Müllner
  2021-07-13  0:10 ` Alexandre Oliva
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Müllner @ 2021-07-12 11:29 UTC
  To: gcc; +Cc: Martin Sebor, Alexandre Oliva

Hi,

I'm working on some platform-specific optimizations for
memset/memcpy/strcpy/strncpy.
However, I am having difficulties understanding how my code should be
integrated.
Initially, I got inspired by rs6000-string.c, where I see expansion
code for instructions
like setmemsi or cmpstrsi. However, that expansion code is not always called.
Instead, the first strategy is using the generic by-pieces infrastructure.

To understand what I mean, let's have a look at memset
(expand_builtin_memset_args).
The backend can provide a tailored code sequence by expanding setmem.
However, there is also a generic solution available using the
by-pieces infrastructure.
The generic by-pieces infrastructure has a higher priority than the
target-specific setmem
expansion. However, the recently added by-multiple-pieces
infrastructure has lower priority
than setmem.
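
As a rough model of that order (a standalone sketch, not the code in
builtins.c): the enum, the struct and its fields are hypothetical names
made up for illustration, and the final library-call fallback is an
assumption about what happens when nothing else applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the strategy order described above.  The enum and
   the struct are hypothetical; they are not GCC's data structures.  */
enum strategy { BY_PIECES, SETMEM_INSN, BY_MULTIPLE_PIECES, LIBCALL };

struct target_model
{
  bool by_pieces_ok;           /* stands in for the by-pieces target hook */
  bool has_setmem;             /* backend provides a setmem pattern */
  bool setmem_succeeds;        /* the pattern's expander accepts the operands */
  bool by_multiple_pieces_ok;  /* generic fallback is usable */
};

/* Return the first strategy that applies, in the order discussed:
   by-pieces, then setmem, then by-multiple-pieces, then a call.  */
enum strategy
pick_memset_strategy (const struct target_model *t, bool len_is_constant)
{
  assert (t != NULL);
  if (len_is_constant && t->by_pieces_ok)
    return BY_PIECES;
  if (t->has_setmem && t->setmem_succeeds)
    return SETMEM_INSN;
  if (t->by_multiple_pieces_ok)
    return BY_MULTIPLE_PIECES;
  return LIBCALL;  /* assumed last resort */
}
```

In this model a backend can only reach its setmem expansion for a
constant length by making the by-pieces check fail first.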

See:
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/builtins.c;h=39ab139b7e1c06c98d2db1aef2b3a6095ffbec63;hb=HEAD#l7004

The same observation is true for most (all?) other uses of builtins.

The current priority requires me to duplicate the condition that decides
whether my optimization can be applied in two places:
1) in TARGET_USE_BY_PIECES_INFRASTRUCTURE_P () to block by-pieces
2) in the setmem expansion to gate the optimization

As I would expect a target-specific mechanism to be preferred over
a generic mechanism, my questions are:
* Why does the generic by-pieces infrastructure have a higher priority
than the target-specific expansion via INSNs like setmem?
* And if there are no particular reasons, would it be acceptable to
change the order?

Thanks,
Christoph


* Re: Priority of builtins expansion strategies
  2021-07-12 11:29 Priority of builtins expansion strategies Christoph Müllner
@ 2021-07-13  0:10 ` Alexandre Oliva
  2021-07-13 12:18   ` Christoph Müllner
  0 siblings, 1 reply; 5+ messages in thread
From: Alexandre Oliva @ 2021-07-13  0:10 UTC
  To: Christoph Müllner; +Cc: gcc, Martin Sebor

On Jul 12, 2021, Christoph Müllner <cmuellner@gcc.gnu.org> wrote:

> * Why does the generic by-pieces infrastructure have a higher priority
> than the target-specific expansion via INSNs like setmem?

by-pieces was not affected by the recent change, and IMHO it generally
makes sense for it to have priority over setmem.  It generates only
straight-line code for constant-sized blocks.  Even if you can beat that
with some machine-specific logic, you'll probably end up generating
equivalent code at least in some cases, and then, you probably want to
carefully tune the settings that select one or the other, or disable
by-pieces altogether.

by-multiple-pieces, OTOH, is likely to be beaten by machine-specific
looping constructs, if any are available, so setmem takes precedence.

My testing involved bringing it ahead of the insns, to exercise the code
more thoroughly even on x86*, but the submitted patch only used
by-multiple-pieces as a fallback.

> * And if there are no particular reasons, would it be acceptable to
> change the order?

I suppose moving insns ahead of by-pieces might break careful tuning of
multiple platforms, so I'd rather we did not make that change.

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>


* Re: Priority of builtins expansion strategies
  2021-07-13  0:10 ` Alexandre Oliva
@ 2021-07-13 12:18   ` Christoph Müllner
  2021-07-13 12:59     ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Müllner @ 2021-07-13 12:18 UTC
  To: Alexandre Oliva; +Cc: gcc, Martin Sebor

On Tue, Jul 13, 2021 at 2:11 AM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On Jul 12, 2021, Christoph Müllner <cmuellner@gcc.gnu.org> wrote:
>
> > * Why does the generic by-pieces infrastructure have a higher priority
> > than the target-specific expansion via INSNs like setmem?
>
> by-pieces was not affected by the recent change, and IMHO it generally
> makes sense for it to have priority over setmem.  It generates only
> straight-line code for constant-sized blocks.  Even if you can beat that
> with some machine-specific logic, you'll probably end up generating
> equivalent code at least in some cases, and then, you probably want to
> carefully tune the settings that select one or the other, or disable
> by-pieces altogether.
>
>
> by-multiple-pieces, OTOH, is likely to be beaten by machine-specific
> looping constructs, if any are available, so setmem takes precedence.
>
> My testing involved bringing it ahead of the insns, to exercise the code
> more thoroughly even on x86*, but the submitted patch only used
> by-multiple-pieces as a fallback.

Let me give you an example of what by-pieces does on RISC-V (RV64GC).
The following code...

void* do_memset0_8 (void *p)
{
    return memset (p, 0, 8);
}

void* do_memset0_15 (void *p)
{
    return memset (p, 0, 15);
}

...becomes (you can validate that with compiler explorer):

do_memset0_8(void*):
        sb      zero,0(a0)
        sb      zero,1(a0)
        sb      zero,2(a0)
        sb      zero,3(a0)
        sb      zero,4(a0)
        sb      zero,5(a0)
        sb      zero,6(a0)
        sb      zero,7(a0)
        ret
do_memset0_15(void*):
        sb      zero,0(a0)
        sb      zero,1(a0)
        sb      zero,2(a0)
        sb      zero,3(a0)
        sb      zero,4(a0)
        sb      zero,5(a0)
        sb      zero,6(a0)
        sb      zero,7(a0)
        sb      zero,8(a0)
        sb      zero,9(a0)
        sb      zero,10(a0)
        sb      zero,11(a0)
        sb      zero,12(a0)
        sb      zero,13(a0)
        sb      zero,14(a0)
        ret

Here is what a setmemsi expansion in the backend can do (in case
unaligned access is cheap):

000000000000003c <do_memset0_8>:
  3c:   00053023                sd      zero,0(a0)
  40:   8082                    ret

000000000000007e <do_memset0_15>:
  7e:   00053023                sd      zero,0(a0)
  82:   000533a3                sd      zero,7(a0)
  86:   8082                    ret

Is there a way to generate similar code with the by-pieces infrastructure?
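
The 15-byte sequence works because the second store overlaps the first
by one byte. A small portable C sketch of the same idea (the function
name is made up for illustration; memcpy stands in for the unaligned sd
so that the behavior is well-defined C):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Zero exactly 15 bytes at P with two 8-byte stores, the second one
   overlapping the first by one byte (offsets 0 and 7).  */
void
zero15_overlapping (unsigned char *p)
{
  const uint64_t zero = 0;
  assert (p != NULL);
  memcpy (p, &zero, sizeof zero);      /* bytes 0..7  */
  memcpy (p + 7, &zero, sizeof zero);  /* bytes 7..14 */
}
```

Byte 7 is written twice and byte 15 is never touched.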

> > * And if there are no particular reasons, would it be acceptable to
> > change the order?
>
> I suppose moving insns ahead of by-pieces might break careful tuning of
> multiple platforms, so I'd rather we did not make that change.

Only platforms that have "setmemsi" implemented would be affected.
And those platforms (arm, frv, ft32, nds32, pa, rs6000, rx, visium)
have a carefully tuned implementation of the setmem expansion. I can't
imagine that these setmem expansions produce less optimal code than the
by-pieces infrastructure (which has less knowledge about the target).

Thanks,
Christoph


* Re: Priority of builtins expansion strategies
  2021-07-13 12:18   ` Christoph Müllner
@ 2021-07-13 12:59     ` Richard Biener
  2021-07-13 14:04       ` Christoph Müllner
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2021-07-13 12:59 UTC
  To: Christoph Müllner; +Cc: Alexandre Oliva, GCC Development, Martin Sebor

On Tue, Jul 13, 2021 at 2:19 PM Christoph Müllner via Gcc
<gcc@gcc.gnu.org> wrote:
>
> On Tue, Jul 13, 2021 at 2:11 AM Alexandre Oliva <oliva@adacore.com> wrote:
> >
> > On Jul 12, 2021, Christoph Müllner <cmuellner@gcc.gnu.org> wrote:
> >
> > > * Why does the generic by-pieces infrastructure have a higher priority
> > > than the target-specific expansion via INSNs like setmem?
> >
> > by-pieces was not affected by the recent change, and IMHO it generally
> > makes sense for it to have priority over setmem.  It generates only
> > straight-line code for constant-sized blocks.  Even if you can beat that
> > with some machine-specific logic, you'll probably end up generating
> > equivalent code at least in some cases, and then, you probably want to
> > carefully tune the settings that select one or the other, or disable
> > by-pieces altogether.
> >
> >
> > by-multiple-pieces, OTOH, is likely to be beaten by machine-specific
> > looping constructs, if any are available, so setmem takes precedence.
> >
> > My testing involved bringing it ahead of the insns, to exercise the code
> > more thoroughly even on x86*, but the submitted patch only used
> > by-multiple-pieces as a fallback.
>
> Let me give you an example of what by-pieces does on RISC-V (RV64GC).
> The following code...
>
> void* do_memset0_8 (void *p)
> {
>     return memset (p, 0, 8);
> }
>
> void* do_memset0_15 (void *p)
> {
>     return memset (p, 0, 15);
> }
>
> ...becomes (you can validate that with compiler explorer):
>
> do_memset0_8(void*):
>         sb      zero,0(a0)
>         sb      zero,1(a0)
>         sb      zero,2(a0)
>         sb      zero,3(a0)
>         sb      zero,4(a0)
>         sb      zero,5(a0)
>         sb      zero,6(a0)
>         sb      zero,7(a0)
>         ret
> do_memset0_15(void*):
>         sb      zero,0(a0)
>         sb      zero,1(a0)
>         sb      zero,2(a0)
>         sb      zero,3(a0)
>         sb      zero,4(a0)
>         sb      zero,5(a0)
>         sb      zero,6(a0)
>         sb      zero,7(a0)
>         sb      zero,8(a0)
>         sb      zero,9(a0)
>         sb      zero,10(a0)
>         sb      zero,11(a0)
>         sb      zero,12(a0)
>         sb      zero,13(a0)
>         sb      zero,14(a0)
>         ret
>
> Here is what a setmemsi expansion in the backend can do (in case
> unaligned access is cheap):
>
> 000000000000003c <do_memset0_8>:
>   3c:   00053023                sd      zero,0(a0)
>   40:   8082                    ret
>
> 000000000000007e <do_memset0_15>:
>   7e:   00053023                sd      zero,0(a0)
>   82:   000533a3                sd      zero,7(a0)
>   86:   8082                    ret
>
> Is there a way to generate similar code with the by-pieces infrastructure?

Sure - tell it unaligned access is cheap.  See alignment_for_piecewise_move
and how it uses slow_unaligned_access.
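
To illustrate why that setting matters, here is a simplified standalone
model (not the code in expr.c; the function and parameter names are made
up). It assumes that the widest usable piece is limited by the known
alignment when unaligned access is slow, and by the maximum piece size
otherwise, and that pieces do not overlap.

```c
#include <assert.h>
#include <stdbool.h>

/* Count the non-overlapping stores needed to clear SIZE bytes, widest
   piece first.  ALIGN_BYTES is the known alignment of the destination
   and MAX_PIECE the widest store available; both are powers of two.  */
unsigned
count_stores (unsigned size, unsigned align_bytes, bool slow_unaligned,
              unsigned max_piece)
{
  assert (align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
  assert (max_piece != 0 && (max_piece & (max_piece - 1)) == 0);

  /* With slow unaligned access the alignment caps the piece width.  */
  unsigned width = slow_unaligned ? align_bytes : max_piece;
  if (width > max_piece)
    width = max_piece;

  unsigned n = 0;
  for (; width >= 1; width /= 2)
    {
      n += size / width;
      size %= width;
    }
  return n;
}
```

For a byte-aligned pointer and 8-byte pieces, the model gives 8 and 15
stores for sizes 8 and 15 when unaligned access is slow, which matches
the sb sequences quoted above, and 1 and 4 stores when it is cheap.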

Richard.



* Re: Priority of builtins expansion strategies
  2021-07-13 12:59     ` Richard Biener
@ 2021-07-13 14:04       ` Christoph Müllner
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Müllner @ 2021-07-13 14:04 UTC
  To: Richard Biener; +Cc: Alexandre Oliva, GCC Development, Martin Sebor

On Tue, Jul 13, 2021 at 2:59 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Jul 13, 2021 at 2:19 PM Christoph Müllner via Gcc
> <gcc@gcc.gnu.org> wrote:
> >
> > On Tue, Jul 13, 2021 at 2:11 AM Alexandre Oliva <oliva@adacore.com> wrote:
> > >
> > > On Jul 12, 2021, Christoph Müllner <cmuellner@gcc.gnu.org> wrote:
> > >
> > > > * Why does the generic by-pieces infrastructure have a higher priority
> > > > than the target-specific expansion via INSNs like setmem?
> > >
> > > by-pieces was not affected by the recent change, and IMHO it generally
> > > makes sense for it to have priority over setmem.  It generates only
> > > > straight-line code for constant-sized blocks.  Even if you can beat that
> > > with some machine-specific logic, you'll probably end up generating
> > > equivalent code at least in some cases, and then, you probably want to
> > > carefully tune the settings that select one or the other, or disable
> > > by-pieces altogether.
> > >
> > >
> > > by-multiple-pieces, OTOH, is likely to be beaten by machine-specific
> > > looping constructs, if any are available, so setmem takes precedence.
> > >
> > > My testing involved bringing it ahead of the insns, to exercise the code
> > > more thoroughly even on x86*, but the submitted patch only used
> > > by-multiple-pieces as a fallback.
> >
> > Let me give you an example of what by-pieces does on RISC-V (RV64GC).
> > The following code...
> >
> > void* do_memset0_8 (void *p)
> > {
> >     return memset (p, 0, 8);
> > }
> >
> > void* do_memset0_15 (void *p)
> > {
> >     return memset (p, 0, 15);
> > }
> >
> > ...becomes (you can validate that with compiler explorer):
> >
> > do_memset0_8(void*):
> >         sb      zero,0(a0)
> >         sb      zero,1(a0)
> >         sb      zero,2(a0)
> >         sb      zero,3(a0)
> >         sb      zero,4(a0)
> >         sb      zero,5(a0)
> >         sb      zero,6(a0)
> >         sb      zero,7(a0)
> >         ret
> > do_memset0_15(void*):
> >         sb      zero,0(a0)
> >         sb      zero,1(a0)
> >         sb      zero,2(a0)
> >         sb      zero,3(a0)
> >         sb      zero,4(a0)
> >         sb      zero,5(a0)
> >         sb      zero,6(a0)
> >         sb      zero,7(a0)
> >         sb      zero,8(a0)
> >         sb      zero,9(a0)
> >         sb      zero,10(a0)
> >         sb      zero,11(a0)
> >         sb      zero,12(a0)
> >         sb      zero,13(a0)
> >         sb      zero,14(a0)
> >         ret
> >
> > Here is what a setmemsi expansion in the backend can do (in case
> > unaligned access is cheap):
> >
> > 000000000000003c <do_memset0_8>:
> >   3c:   00053023                sd      zero,0(a0)
> >   40:   8082                    ret
> >
> > 000000000000007e <do_memset0_15>:
> >   7e:   00053023                sd      zero,0(a0)
> >   82:   000533a3                sd      zero,7(a0)
> >   86:   8082                    ret
> >
> > Is there a way to generate similar code with the by-pieces infrastructure?
>
> Sure - tell it unaligned access is cheap.  See alignment_for_piecewise_move
> and how it uses slow_unaligned_access.

Thanks for the pointer.
I already knew about slow_unaligned_access, but I was not aware of
overlap_op_by_pieces_p.
Enabling both gives exactly the same code as above.
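
The effect of allowing overlap can be sketched with a small standalone
planner. This is only a simplified model under the assumption that
unaligned access is cheap; the struct and function names are made up and
it is not the by-pieces implementation itself.

```c
#include <assert.h>
#include <stdbool.h>

struct store { unsigned offset, width; };

/* Append one store to OUT and return the new count.  */
static unsigned
emit (struct store *out, unsigned n, unsigned cap,
      unsigned offset, unsigned width)
{
  assert (n < cap);
  out[n].offset = offset;
  out[n].width = width;
  return n + 1;
}

/* Plan the stores that clear SIZE bytes using pieces of at most
   MAX_PIECE bytes (a power of two).  With OVERLAP set, the tail is
   covered by one store that reaches back into bytes already written.  */
unsigned
plan_stores (unsigned size, unsigned max_piece, bool overlap,
             struct store *out, unsigned cap)
{
  unsigned n = 0, offset = 0;
  assert (max_piece != 0 && (max_piece & (max_piece - 1)) == 0);

  /* Full-width pieces first.  */
  while (size - offset >= max_piece)
    {
      n = emit (out, n, cap, offset, max_piece);
      offset += max_piece;
    }

  unsigned rem = size - offset;
  if (rem != 0 && overlap)
    {
      /* Smallest power-of-two width that covers the tail.  */
      unsigned w = 1;
      while (w < rem)
        w *= 2;
      if (w <= size)
        return emit (out, n, cap, size - w, w);
    }

  /* No overlap: finish with progressively narrower pieces.  */
  for (unsigned w = max_piece / 2; w >= 1; w /= 2)
    if (rem >= w)
      {
        n = emit (out, n, cap, offset, w);
        offset += w;
        rem -= w;
      }
  return n;
}
```

For 15 bytes and 8-byte pieces, the model plans stores at offsets 0 and 7
with overlap enabled, and four stores (8, 4, 2 and 1 bytes wide) without
it.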

Thanks,
Christoph


