* [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Alexandre Oliva @ 2023-05-24  5:47 UTC
To: gcc-patches, H.J. Lu; +Cc: Jan Hubicka, Uros Bizjak

MOVE_MAX on x86* used to accept up to 16 bytes, even without SSE,
which enabled inlining of small memmove by loading and then storing
the entire range.  After the "x86: Update piecewise move and store"
r12-2666 change, memmove of more than 4 bytes would not be inlined in
gimple_fold_builtin_memory_op, failing the expectations of a few tests.

I can see how lowering it for MOVE_MAX_PIECES can get us better
codegen decisions overall, but surely inlining memmove with 2 32-bit
loads and stores is better than an outline call that requires setting
up 3 arguments.  I suppose even 3 or 4 could do better.  But maybe it
is gimple_fold_builtin_memory_op that needs tweaking?

Anyhow, this patch raises MOVE_MAX back a little for non-SSE targets,
while preserving the new value for MOVE_MAX_PIECES.

Bootstrapped on x86_64-linux-gnu.  Also tested on ppc- and x86-vx7r2
with gcc-12.

for  gcc/ChangeLog

	* config/i386/i386.h (MOVE_MAX): Rename to...
	(MOVE_MAX_VEC): ... this.  Add NONVEC parameter, and use it as
	the last resort, instead of UNITS_PER_WORD.
	(MOVE_MAX): Reintroduce in terms of MOVE_MAX_VEC, with
	2*UNITS_PER_WORD.
	(MOVE_MAX_PIECES): Likewise, but with UNITS_PER_WORD.
---
 gcc/config/i386/i386.h |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index c7439f89bdf92..5293a332a969a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1801,7 +1801,9 @@ typedef struct ix86_args {
    is the number of bytes at a time which we can move efficiently.
    MOVE_MAX_PIECES defaults to MOVE_MAX.  */

-#define MOVE_MAX \
+#define MOVE_MAX MOVE_MAX_VEC (2 * UNITS_PER_WORD)
+#define MOVE_MAX_PIECES MOVE_MAX_VEC (UNITS_PER_WORD)
+#define MOVE_MAX_VEC(NONVEC) \
 ((TARGET_AVX512F \
   && (ix86_move_max == PVW_AVX512 \
       || ix86_store_max == PVW_AVX512)) \
@@ -1813,7 +1815,7 @@ typedef struct ix86_args {
      : ((TARGET_SSE2 \
          && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
          && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
-        ? 16 : UNITS_PER_WORD)))
+        ? 16 : (NONVEC))))

 /* STORE_MAX_PIECES is the number of bytes at a time that we can store
    efficiently.  Allow 16/32/64 bytes only if inter-unit move is enabled

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
Free Software Activist                     GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Richard Biener @ 2023-05-24  9:12 UTC
To: Alexandre Oliva; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On Wed, May 24, 2023 at 7:47 AM Alexandre Oliva via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> MOVE_MAX on x86* used to accept up to 16 bytes, even without SSE,
> which enabled inlining of small memmove by loading and then storing
> the entire range.  After the "x86: Update piecewise move and store"
> r12-2666 change, memmove of more than 4 bytes would not be inlined in
> gimple_fold_builtin_memory_op, failing the expectations of a few tests.
>
> I can see how lowering it for MOVE_MAX_PIECES can get us better
> codegen decisions overall, but surely inlining memmove with 2 32-bit
> loads and stores is better than an outline call that requires setting
> up 3 arguments.  I suppose even 3 or 4 could do better.  But maybe it
> is gimple_fold_builtin_memory_op that needs tweaking?

gimple_fold_builtin_memory_op tries to expand the call to a single
load plus a single store so we can handle overlaps by first loading
everything to registers and then storing:

          /* If we can perform the copy efficiently with first doing all loads
             and then all stores inline it that way.  Currently efficiently
             means that we can load all the memory into a single integer
             register which is what MOVE_MAX gives us.  */

using DImode on i?86 without SSE means we eventually perform two
loads and two stores which means we need two registers available.
That might not be an issue on x86_64 doing 16 bytes with two DImode
ops (and -mno-sse) since there's plenty of regs available.

So I think if we want to expand this further at the GIMPLE level we
should still honor MOVE_MAX but eventually emit multiple loads/stores
honoring the MOVE_MAX_PIECES set of constraints there and avoid
expanding to sequences where we cannot interleave the loads/stores
(aka for the memmove case).

> Anyhow, this patch raises MOVE_MAX back a little for non-SSE targets,
> while preserving the new value for MOVE_MAX_PIECES.
>
> Bootstrapped on x86_64-linux-gnu.  Also tested on ppc- and x86-vx7r2
> with gcc-12.
>
> for  gcc/ChangeLog
>
> 	* config/i386/i386.h (MOVE_MAX): Rename to...
> 	(MOVE_MAX_VEC): ... this.  Add NONVEC parameter, and use it as
> 	the last resort, instead of UNITS_PER_WORD.
> 	(MOVE_MAX): Reintroduce in terms of MOVE_MAX_VEC, with
> 	2*UNITS_PER_WORD.
> 	(MOVE_MAX_PIECES): Likewise, but with UNITS_PER_WORD.
> ---
>  gcc/config/i386/i386.h |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index c7439f89bdf92..5293a332a969a 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -1801,7 +1801,9 @@ typedef struct ix86_args {
>     is the number of bytes at a time which we can move efficiently.
>     MOVE_MAX_PIECES defaults to MOVE_MAX.  */
>
> -#define MOVE_MAX \
> +#define MOVE_MAX MOVE_MAX_VEC (2 * UNITS_PER_WORD)
> +#define MOVE_MAX_PIECES MOVE_MAX_VEC (UNITS_PER_WORD)
> +#define MOVE_MAX_VEC(NONVEC) \
>  ((TARGET_AVX512F \
>    && (ix86_move_max == PVW_AVX512 \
>        || ix86_store_max == PVW_AVX512)) \
> @@ -1813,7 +1815,7 @@ typedef struct ix86_args {
>       : ((TARGET_SSE2 \
>           && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>           && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> -        ? 16 : UNITS_PER_WORD)))
> +        ? 16 : (NONVEC))))
>
>  /* STORE_MAX_PIECES is the number of bytes at a time that we can store
>     efficiently.  Allow 16/32/64 bytes only if inter-unit move is enabled
>
> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
> Free Software Activist                     GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Alexandre Oliva @ 2023-05-25 10:01 UTC
To: Richard Biener; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On May 24, 2023, Richard Biener <richard.guenther@gmail.com> wrote:

> gimple_fold_builtin_memory_op tries to expand the call to a single
> load plus a single store so we can handle overlaps by first loading
> everything to registers and then storing:

*nod*, that's why I figured we could afford to go back to allowing
DImode (with -m32) or TImode (with -m64) even without vector modes:
we'd just use a pair of registers, a single insn, even though not a
single hardware instruction.

> using DImode on i?86 without SSE means we eventually perform two
> loads and two stores which means we need two registers available.

*nod*.  But the alternative is to issue an out-of-line call to memmove,
which would clobber more than 2 registers.  ISTM that inlining such
calls is better, whether optimizing for speed or size.

> So I think if we want to expand this further at the GIMPLE level we
> should still honor MOVE_MAX but eventually emit multiple loads/stores
> honoring the MOVE_MAX_PIECES set of constraints there and avoid
> expanding to sequences where we cannot interleave the loads/stores
> (aka for the memmove case).

But... don't we already?  If I'm reading the code right, we'll already
issue gimple code to load the whole block into a temporary and then
store it, but current MOVE_MAX won't let us go past 4 bytes on SSE-less
x86.

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
Free Software Activist                     GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Richard Biener @ 2023-05-25 10:49 UTC
To: Alexandre Oliva; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On Thu, May 25, 2023 at 12:01 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On May 24, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > gimple_fold_builtin_memory_op tries to expand the call to a single
> > load plus a single store so we can handle overlaps by first loading
> > everything to registers and then storing:
>
> *nod*, that's why I figured we could afford to go back to allowing
> DImode (with -m32) or TImode (with -m64) even without vector modes: we'd
> just use a pair of registers, a single insn, even though not a single
> hardware instruction.
>
> > using DImode on i?86 without SSE means we eventually perform two
> > loads and two stores which means we need two registers available.
>
> *nod*.  But the alternative is to issue an out-of-line call to memmove,
> which would clobber more than 2 registers.  ISTM that inlining such
> calls is better, whether optimizing for speed or size.
>
> > So I think if we want to expand this further at the GIMPLE level we
> > should still honor MOVE_MAX but eventually emit multiple loads/stores
> > honoring the MOVE_MAX_PIECES set of constraints there and avoid
> > expanding to sequences where we cannot interleave the loads/stores
> > (aka for the memmove case).
>
> But... don't we already?  If I'm reading the code right, we'll already
> issue gimple code to load the whole block into a temporary and then
> store it, but current MOVE_MAX won't let us go past 4 bytes on SSE-less
> x86.

I mean we could do what RTL expansion would do later and do by-pieces,
thus emit multiple loads/stores but not n loads and then n stores but
interleaved.

Richard.

>
> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
> Free Software Activist                     GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Alexandre Oliva @ 2023-05-25 11:10 UTC
To: Richard Biener; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:

> I mean we could do what RTL expansion would do later and do
> by-pieces, thus emit multiple loads/stores but not n loads and then
> n stores but interleaved.

That wouldn't help e.g. gcc.dg/memcpy-6.c's fold_move_8, because
MOVE_MAX and MOVE_MAX_PIECES currently limit inline expansion to 4
bytes on x86 without SSE, both in gimple and RTL, and interleaved loads
and stores wouldn't help with memmove.  We can't fix that by changing
code that uses MOVE_MAX and/or MOVE_MAX_PIECES, when these limits are
set too low.

I'm also concerned that doing more such expansion in gimple folding
would be reversed in later gimple passes.  That's good in that it would
enable efficient rtl movmem/cpymem instruction selection, but it's not
clear to me that there would generally be benefits to such early
open-coding in gimple.

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
Free Software Activist                     GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Richard Biener @ 2023-05-25 11:33 UTC
To: Alexandre Oliva; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On Thu, May 25, 2023 at 1:10 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > I mean we could do what RTL expansion would do later and do
> > by-pieces, thus emit multiple loads/stores but not n loads and then
> > n stores but interleaved.
>
> That wouldn't help e.g. gcc.dg/memcpy-6.c's fold_move_8, because
> MOVE_MAX and MOVE_MAX_PIECES currently limit inline expansion to 4
> bytes on x86 without SSE, both in gimple and RTL, and interleaved loads
> and stores wouldn't help with memmove.  We can't fix that by changing
> code that uses MOVE_MAX and/or MOVE_MAX_PIECES, when these limits are
> set too low.
>
> I'm also concerned that doing more such expansion in gimple folding
> would be reversed in later gimple passes.  That's good in that it would
> enable efficient rtl movmem/cpymem instruction selection, but it's not
> clear to me that there would generally be benefits to such early
> open-coding in gimple.

Btw, there was a short period where the MOVE_MAX limit was restricted
but that had fallout and we've reverted since then.

Richard.

> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
> Free Software Activist                     GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Alexandre Oliva @ 2023-05-25 13:25 UTC
To: Richard Biener; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:

> On Thu, May 25, 2023 at 1:10 PM Alexandre Oliva <oliva@adacore.com> wrote:
>>
>> On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
>>
>> > I mean we could do what RTL expansion would do later and do
>> > by-pieces, thus emit multiple loads/stores but not n loads and then
>> > n stores but interleaved.
>>
>> That wouldn't help e.g. gcc.dg/memcpy-6.c's fold_move_8, because
>> MOVE_MAX and MOVE_MAX_PIECES currently limit inline expansion to 4
>> bytes on x86 without SSE, both in gimple and RTL, and interleaved loads
>> and stores wouldn't help with memmove.  We can't fix that by changing
>> code that uses MOVE_MAX and/or MOVE_MAX_PIECES, when these limits are
>> set too low.

> Btw, there was a short period where the MOVE_MAX limit was restricted
> but that had fallout and we've reverted since then.

Erhm...  Are we even talking about the same issue?

i386/i386.h reduced the 32-bit non-SSE MOVE_MAX from 16 to 4, which
broke this test; I'm proposing to bounce it back up to 8, so that we get
a little more memmove inlining, enough for tests that expect that much
to pass.

You may be focusing on the gimple-fold bit, because I mentioned it, but
even the rtl expander is failing to expand the memmove because of the
setting, as evidenced by the test's failure in the scan for memmove in
the final dump.

That MOVE_MAX change was a significant regression in codegen for 32-bit
non-SSE x86, and I'm proposing to fix that.  Compensating for that
regression elsewhere doesn't seem desirable to me: MOVE_MAX can be much
higher even on other x86 variants, so the effects of such attempts may
harm quite significantly more modern CPUs.

Conversely, I don't expect the reduction of MOVE_MAX on SSE-less x86 a
couple of years ago to have been measured for performance effects, given
the little overall relevance of such CPUs, and the very visible and
undesirable effects on codegen that change brought onto them.  And yet,
I'm being very conservative in the proposed reversion, because
benchmarking such targets in any meaningful way would be somewhat
challenging for myself as well.

So, could we please have this narrow fix of this limited regression at
the spot where it was introduced accepted, rather than debating
tangents?

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
Free Software Activist                     GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Richard Biener @ 2023-05-25 13:32 UTC
To: Alexandre Oliva; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On Thu, May 25, 2023 at 3:25 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > On Thu, May 25, 2023 at 1:10 PM Alexandre Oliva <oliva@adacore.com> wrote:
> >>
> >> On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
> >>
> >> > I mean we could do what RTL expansion would do later and do
> >> > by-pieces, thus emit multiple loads/stores but not n loads and then
> >> > n stores but interleaved.
> >>
> >> That wouldn't help e.g. gcc.dg/memcpy-6.c's fold_move_8, because
> >> MOVE_MAX and MOVE_MAX_PIECES currently limit inline expansion to 4
> >> bytes on x86 without SSE, both in gimple and RTL, and interleaved loads
> >> and stores wouldn't help with memmove.  We can't fix that by changing
> >> code that uses MOVE_MAX and/or MOVE_MAX_PIECES, when these limits are
> >> set too low.
>
> > Btw, there was a short period where the MOVE_MAX limit was restricted
> > but that had fallout and we've reverted since then.
>
> Erhm...  Are we even talking about the same issue?
>
> i386/i386.h reduced the 32-bit non-SSE MOVE_MAX from 16 to 4, which
> broke this test; I'm proposing to bounce it back up to 8, so that we get
> a little more memmove inlining, enough for tests that expect that much
> to pass.
>
> You may be focusing on the gimple-fold bit, because I mentioned it, but
> even the rtl expander is failing to expand the memmove because of the
> setting, as evidenced by the test's failure in the scan for memmove in
> the final dump.

So indeed fold_move_8 expands to the following, even with
-minline-all-stringops

fold_move_8:
.LFB5:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	subl	$8, %esp
	movl	$a+3, %eax
	subl	$4, %esp
	pushl	$8
	pushl	$a
	pushl	%eax
	call	memmove
	addl	$16, %esp
	nop

I do think it's still up to RTL expansion or the target to decide
whether its worth spending two registers to handle the overlap or maybe
emit a compare & jump to do forward and backward variants.  Yes,
increasing MOVE_MAX to 8 makes this expand at the GIMPLE level already,
which I believe is premature and difficult to undo.

> That MOVE_MAX change was a significant regression in codegen for 32-bit
> non-SSE x86, and I'm proposing to fix that.  Compensating for that
> regression elsewhere doesn't seem desirable to me: MOVE_MAX can be much
> higher even on other x86 variants, so the effects of such attempts may
> harm quite significantly more modern CPUs.
>
> Conversely, I don't expect the reduction of MOVE_MAX on SSE-less x86 a
> couple of years ago to have been measured for performance effects, given
> the little overall relevance of such CPUs, and the very visible and
> undesirable effects on codegen that change brought onto them.  And yet,
> I'm being very conservative in the proposed reversion, because
> benchmarking such targets in any meaningful way would be somewhat
> challenging for myself as well.
>
> So, could we please have this narrow fix of this limited regression at
> the spot where it was introduced accepted, rather than debating
> tangents?
>
> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
> Free Software Activist                     GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>