* [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Alexandre Oliva @ 2023-05-24  5:47 UTC
To: gcc-patches, H.J. Lu; +Cc: Jan Hubicka, Uros Bizjak

MOVE_MAX on x86* used to accept up to 16 bytes, even without SSE,
which enabled inlining of small memmove by loading and then storing
the entire range.  After the "x86: Update piecewise move and store"
r12-2666 change, memmove of more than 4 bytes would not be inlined in
gimple_fold_builtin_memory_op, failing the expectations of a few tests.

I can see how lowering it for MOVE_MAX_PIECES can get us better
codegen decisions overall, but surely inlining memmove with 2 32-bit
loads and stores is better than an outline call that requires setting
up 3 arguments.  I suppose even 3 or 4 could do better.  But maybe it
is gimple_fold_builtin_memory_op that needs tweaking?

Anyhow, this patch raises MOVE_MAX back a little for non-SSE targets,
while preserving the new value for MOVE_MAX_PIECES.

Bootstrapped on x86_64-linux-gnu.  Also tested on ppc- and x86-vx7r2
with gcc-12.

for  gcc/ChangeLog

	* config/i386/i386.h (MOVE_MAX): Rename to...
	(MOVE_MAX_VEC): ... this.  Add NONVEC parameter, and use it as
	the last resort, instead of UNITS_PER_WORD.
	(MOVE_MAX): Reintroduce in terms of MOVE_MAX_VEC, with
	2*UNITS_PER_WORD.
	(MOVE_MAX_PIECES): Likewise, but with UNITS_PER_WORD.
---
 gcc/config/i386/i386.h |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index c7439f89bdf92..5293a332a969a 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1801,7 +1801,9 @@ typedef struct ix86_args {
    is the number of bytes at a time which we can move efficiently.
    MOVE_MAX_PIECES defaults to MOVE_MAX.  */

-#define MOVE_MAX \
+#define MOVE_MAX MOVE_MAX_VEC (2 * UNITS_PER_WORD)
+#define MOVE_MAX_PIECES MOVE_MAX_VEC (UNITS_PER_WORD)
+#define MOVE_MAX_VEC(NONVEC) \
 ((TARGET_AVX512F \
   && (ix86_move_max == PVW_AVX512 \
       || ix86_store_max == PVW_AVX512)) \
@@ -1813,7 +1815,7 @@ typedef struct ix86_args {
      : ((TARGET_SSE2 \
          && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
          && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
-        ? 16 : UNITS_PER_WORD)))
+        ? 16 : (NONVEC))))

 /* STORE_MAX_PIECES is the number of bytes at a time that we can store
    efficiently.  Allow 16/32/64 bytes only if inter-unit move is enabled

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
Free Software Activist                     GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Richard Biener @ 2023-05-24  9:12 UTC
To: Alexandre Oliva; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On Wed, May 24, 2023 at 7:47 AM Alexandre Oliva via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> MOVE_MAX on x86* used to accept up to 16 bytes, even without SSE,
> which enabled inlining of small memmove by loading and then storing
> the entire range.  After the "x86: Update piecewise move and store"
> r12-2666 change, memmove of more than 4 bytes would not be inlined in
> gimple_fold_builtin_memory_op, failing the expectations of a few tests.
>
> I can see how lowering it for MOVE_MAX_PIECES can get us better
> codegen decisions overall, but surely inlining memmove with 2 32-bit
> loads and stores is better than an outline call that requires setting
> up 3 arguments.  I suppose even 3 or 4 could do better.  But maybe it
> is gimple_fold_builtin_memory_op that needs tweaking?

gimple_fold_builtin_memory_op tries to expand the call to a single
load plus a single store so we can handle overlaps by first loading
everything to registers and then storing:

          /* If we can perform the copy efficiently with first doing all loads
             and then all stores inline it that way.  Currently efficiently
             means that we can load all the memory into a single integer
             register which is what MOVE_MAX gives us.  */

using DImode on i?86 without SSE means we eventually perform two
loads and two stores which means we need two registers available.
That might not be an issue on x86_64 doing 16 bytes with two DImode
ops (and -mno-sse) since there's plenty of regs available.

So I think if we want to expand this further at the GIMPLE level we
should still honor MOVE_MAX but eventually emit multiple loads/stores
honoring the MOVE_MAX_PIECES set of constraints there and avoid
expanding to sequences where we cannot interleave the loads/stores
(aka for the memmove case).

> Anyhow, this patch raises MOVE_MAX back a little for non-SSE targets,
> while preserving the new value for MOVE_MAX_PIECES.
>
> Bootstrapped on x86_64-linux-gnu.  Also tested on ppc- and x86-vx7r2
> with gcc-12.
>
> for  gcc/ChangeLog
>
> 	* config/i386/i386.h (MOVE_MAX): Rename to...
> 	(MOVE_MAX_VEC): ... this.  Add NONVEC parameter, and use it as
> 	the last resort, instead of UNITS_PER_WORD.
> 	(MOVE_MAX): Reintroduce in terms of MOVE_MAX_VEC, with
> 	2*UNITS_PER_WORD.
> 	(MOVE_MAX_PIECES): Likewise, but with UNITS_PER_WORD.
> ---
>  gcc/config/i386/i386.h |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index c7439f89bdf92..5293a332a969a 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -1801,7 +1801,9 @@ typedef struct ix86_args {
>     is the number of bytes at a time which we can move efficiently.
>     MOVE_MAX_PIECES defaults to MOVE_MAX.  */
>
> -#define MOVE_MAX \
> +#define MOVE_MAX MOVE_MAX_VEC (2 * UNITS_PER_WORD)
> +#define MOVE_MAX_PIECES MOVE_MAX_VEC (UNITS_PER_WORD)
> +#define MOVE_MAX_VEC(NONVEC) \
>  ((TARGET_AVX512F \
>    && (ix86_move_max == PVW_AVX512 \
>        || ix86_store_max == PVW_AVX512)) \
> @@ -1813,7 +1815,7 @@ typedef struct ix86_args {
>       : ((TARGET_SSE2 \
>           && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>           && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> -        ? 16 : UNITS_PER_WORD)))
> +        ? 16 : (NONVEC))))
>
>  /* STORE_MAX_PIECES is the number of bytes at a time that we can store
>     efficiently.  Allow 16/32/64 bytes only if inter-unit move is enabled
>
> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
> Free Software Activist                     GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Alexandre Oliva @ 2023-05-25 10:01 UTC
To: Richard Biener; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On May 24, 2023, Richard Biener <richard.guenther@gmail.com> wrote:

> gimple_fold_builtin_memory_op tries to expand the call to a single
> load plus a single store so we can handle overlaps by first loading
> everything to registers and then storing:

*nod*, that's why I figured we could afford to go back to allowing
DImode (with -m32) or TImode (with -m64) even without vector modes:
we'd just use a pair of registers, a single insn, even though not a
single hardware instruction.

> using DImode on i?86 without SSE means we eventually perform two
> loads and two stores which means we need two registers available.

*nod*.  But the alternative is to issue an out-of-line call to memmove,
which would clobber more than 2 registers.  ISTM that inlining such
calls is better, whether optimizing for speed or size.

> So I think if we want to expand this further at the GIMPLE level we
> should still honor MOVE_MAX but eventually emit multiple loads/stores
> honoring the MOVE_MAX_PIECES set of constraints there and avoid
> expanding to sequences where we cannot interleave the loads/stores
> (aka for the memmove case).

But... don't we already?  If I'm reading the code right, we'll already
issue gimple code to load the whole block into a temporary and then
store it, but current MOVE_MAX won't let us go past 4 bytes on SSE-less
x86.

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
Free Software Activist                     GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Richard Biener @ 2023-05-25 10:49 UTC
To: Alexandre Oliva; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On Thu, May 25, 2023 at 12:01 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On May 24, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > gimple_fold_builtin_memory_op tries to expand the call to a single
> > load plus a single store so we can handle overlaps by first loading
> > everything to registers and then storing:
>
> *nod*, that's why I figured we could afford to go back to allowing
> DImode (with -m32) or TImode (with -m64) even without vector modes: we'd
> just use a pair of registers, a single insn, even though not a single
> hardware instruction.
>
> > using DImode on i?86 without SSE means we eventually perform two
> > loads and two stores which means we need two registers available.
>
> *nod*.  But the alternative is to issue an out-of-line call to memmove,
> which would clobber more than 2 registers.  ISTM that inlining such
> calls is better, whether optimizing for speed or size.
>
> > So I think if we want to expand this further at the GIMPLE level we
> > should still honor MOVE_MAX but eventually emit multiple loads/stores
> > honoring the MOVE_MAX_PIECES set of constraints there and avoid
> > expanding to sequences where we cannot interleave the loads/stores
> > (aka for the memmove case).
>
> But... don't we already?  If I'm reading the code right, we'll already
> issue gimple code to load the whole block into a temporary and then
> store it, but current MOVE_MAX won't let us go past 4 bytes on SSE-less
> x86.

I mean we could do what RTL expansion would do later and do by-pieces,
thus emit multiple loads/stores but not n loads and then n stores but
interleaved.

Richard.

>
> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
> Free Software Activist                     GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Alexandre Oliva @ 2023-05-25 11:10 UTC
To: Richard Biener; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:

> I mean we could do what RTL expansion would do later and do
> by-pieces, thus emit multiple loads/stores but not n loads and then
> n stores but interleaved.

That wouldn't help e.g. gcc.dg/memcpy-6.c's fold_move_8, because
MOVE_MAX and MOVE_MAX_PIECES currently limit inline expansion to 4
bytes on x86 without SSE, both in gimple and RTL, and interleaved loads
and stores wouldn't help with memmove.  We can't fix that by changing
code that uses MOVE_MAX and/or MOVE_MAX_PIECES, when these limits are
set too low.

I'm also concerned that doing more such expansion in gimple folding
would be reversed in later gimple passes.  That's good in that it would
enable efficient rtl movmem/cpymem instruction selection, but it's not
clear to me that there would generally be benefits to such early
open-coding in gimple.

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
Free Software Activist                     GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Richard Biener @ 2023-05-25 11:33 UTC
To: Alexandre Oliva; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On Thu, May 25, 2023 at 1:10 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > I mean we could do what RTL expansion would do later and do
> > by-pieces, thus emit multiple loads/stores but not n loads and then
> > n stores but interleaved.
>
> That wouldn't help e.g. gcc.dg/memcpy-6.c's fold_move_8, because
> MOVE_MAX and MOVE_MAX_PIECES currently limit inline expansion to 4
> bytes on x86 without SSE, both in gimple and RTL, and interleaved loads
> and stores wouldn't help with memmove.  We can't fix that by changing
> code that uses MOVE_MAX and/or MOVE_MAX_PIECES, when these limits are
> set too low.
>
> I'm also concerned that doing more such expansion in gimple folding
> would be reversed in later gimple passes.  That's good in that it would
> enable efficient rtl movmem/cpymem instruction selection, but it's not
> clear to me that there would generally be benefits to such early
> open-coding in gimple.

Btw, there was a short period where the MOVE_MAX limit was restricted
but that had fallout and we've reverted since then.

Richard.

> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
> Free Software Activist                     GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Alexandre Oliva @ 2023-05-25 13:25 UTC
To: Richard Biener; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:

> On Thu, May 25, 2023 at 1:10 PM Alexandre Oliva <oliva@adacore.com> wrote:
>>
>> On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
>>
>> > I mean we could do what RTL expansion would do later and do
>> > by-pieces, thus emit multiple loads/stores but not n loads and then
>> > n stores but interleaved.
>>
>> That wouldn't help e.g. gcc.dg/memcpy-6.c's fold_move_8, because
>> MOVE_MAX and MOVE_MAX_PIECES currently limit inline expansion to 4
>> bytes on x86 without SSE, both in gimple and RTL, and interleaved loads
>> and stores wouldn't help with memmove.  We can't fix that by changing
>> code that uses MOVE_MAX and/or MOVE_MAX_PIECES, when these limits are
>> set too low.

> Btw, there was a short period where the MOVE_MAX limit was restricted
> but that had fallout and we've reverted since then.

Erhm...  Are we even talking about the same issue?

i386/i386.h reduced the 32-bit non-SSE MOVE_MAX from 16 to 4, which
broke this test; I'm proposing to bounce it back up to 8, so that we get
a little more memmove inlining, enough for tests that expect that much
to pass.

You may be focusing on the gimple-fold bit, because I mentioned it, but
even the rtl expander is failing to expand the memmove because of the
setting, as evidenced by the test's failure in the scan for memmove in
the final dump.

That MOVE_MAX change was a significant regression in codegen for 32-bit
non-SSE x86, and I'm proposing to fix that.  Compensating for that
regression elsewhere doesn't seem desirable to me: MOVE_MAX can be much
higher even on other x86 variants, so the effects of such attempts may
harm quite significantly more modern CPUs.

Conversely, I don't expect the reduction of MOVE_MAX on SSE-less x86 a
couple of years ago to have been measured for performance effects, given
the little overall relevance of such CPUs, and the very visible and
undesirable effects on codegen that change brought onto them.  And yet,
I'm being very conservative in the proposed reversion, because
benchmarking such targets in any meaningful way would be somewhat
challenging for myself as well.

So, could we please have this narrow fix of this limited regression at
the spot where it was introduced accepted, rather than debating
tangents?

-- 
Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
Free Software Activist                     GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
* Re: [PATCH] [x86] reenable dword MOVE_MAX for better memmove inlining
From: Richard Biener @ 2023-05-25 13:32 UTC
To: Alexandre Oliva; +Cc: gcc-patches, H.J. Lu, Jan Hubicka, Uros Bizjak

On Thu, May 25, 2023 at 3:25 PM Alexandre Oliva <oliva@adacore.com> wrote:
>
> On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
>
> > On Thu, May 25, 2023 at 1:10 PM Alexandre Oliva <oliva@adacore.com> wrote:
> >>
> >> On May 25, 2023, Richard Biener <richard.guenther@gmail.com> wrote:
> >>
> >> > I mean we could do what RTL expansion would do later and do
> >> > by-pieces, thus emit multiple loads/stores but not n loads and then
> >> > n stores but interleaved.
> >>
> >> That wouldn't help e.g. gcc.dg/memcpy-6.c's fold_move_8, because
> >> MOVE_MAX and MOVE_MAX_PIECES currently limit inline expansion to 4
> >> bytes on x86 without SSE, both in gimple and RTL, and interleaved loads
> >> and stores wouldn't help with memmove.  We can't fix that by changing
> >> code that uses MOVE_MAX and/or MOVE_MAX_PIECES, when these limits are
> >> set too low.
>
> > Btw, there was a short period where the MOVE_MAX limit was restricted
> > but that had fallout and we've reverted since then.
>
> Erhm...  Are we even talking about the same issue?
>
> i386/i386.h reduced the 32-bit non-SSE MOVE_MAX from 16 to 4, which
> broke this test; I'm proposing to bounce it back up to 8, so that we get
> a little more memmove inlining, enough for tests that expect that much
> to pass.
>
> You may be focusing on the gimple-fold bit, because I mentioned it, but
> even the rtl expander is failing to expand the memmove because of the
> setting, as evidenced by the test's failure in the scan for memmove in
> the final dump.

So indeed fold_move_8 expands to the following, even with
-minline-all-stringops

fold_move_8:
.LFB5:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	subl	$8, %esp
	movl	$a+3, %eax
	subl	$4, %esp
	pushl	$8
	pushl	$a
	pushl	%eax
	call	memmove
	addl	$16, %esp
	nop

I do think it's still up to RTL expansion or the target to decide
whether its worth spending two registers to handle the overlap or maybe
emit a compare & jump to do forward and backward variants.  Yes,
increasing MOVE_MAX to 8 makes this expand at the GIMPLE level already,
which I believe is premature and difficult to undo.

> That MOVE_MAX change was a significant regression in codegen for 32-bit
> non-SSE x86, and I'm proposing to fix that.  Compensating for that
> regression elsewhere doesn't seem desirable to me: MOVE_MAX can be much
> higher even on other x86 variants, so the effects of such attempts may
> harm quite significantly more modern CPUs.
>
> Conversely, I don't expect the reduction of MOVE_MAX on SSE-less x86 a
> couple of years ago to have been measured for performance effects, given
> the little overall relevance of such CPUs, and the very visible and
> undesirable effects on codegen that change brought onto them.  And yet,
> I'm being very conservative in the proposed reversion, because
> benchmarking such targets in any meaningful way would be somewhat
> challenging for myself as well.
>
> So, could we please have this narrow fix of this limited regression at
> the spot where it was introduced accepted, rather than debating
> tangents?
>
> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
> Free Software Activist                     GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>