[x86 PATCH take #2] Add peephole2 to reduce double word register shuffling

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [x86 PATCH take #2] Add peephole2 to reduce double word register shuffling
@ 2022-08-07 17:04 Roger Sayle
  2022-08-08  7:22 ` Uros Bizjak
  0 siblings, 1 reply; 2+ messages in thread
From: Roger Sayle @ 2022-08-07 17:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: 'Uros Bizjak', 'Richard Biener'

[-- Attachment #1: Type: text/plain, Size: 2749 bytes --]

This is a resubmission of my patch from June to fix some forms of
inefficient
register allocation using an additional peephole2 in i386.md.
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596064.html

Since the original, a number of supporting patches/improvements have
been reviewed and approved, making this peephole even more effective.
Hence for the simple function:
__int128 foo(__int128 x, __int128 y) { return x+y; }

mainline GCC on x86_64 with -O2 currently generates:
        movq    %rsi, %rax
        movq    %rdi, %r8
        movq    %rax, %rdi
        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %r8, %rax
        adcq    %rdi, %rdx
        ret

with this patch we now generate (a three insn improvement):
        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %rdi, %rax
        adcq    %rsi, %rdx
        ret

Back in June the review of the original patch stalled, as peephole2
isn't the ideal place to fix this (with which I fully agree), and this patch
is really just a workaround for a deeper deficiency in reload/lra.
To address this I've now filed a new enhancement PR in Bugzilla,
PR rtl-optimization/106518, that describes that underlying issue,
which might make an interesting (GSoC) project for anyone brave
(fool hardy) enough to tweak GCC's register allocation.

By comparison, this single peephole can't adversely affect other targets,
and should the happy day come that it's no longer required, at worst
would just become a harmless legacy transform that no longer triggers.

I'm also investigating Uros' suggestion that it may be possible for RTL
expansion to do a better job expanding the function prologue, but
ultimately the hard register placement constraints are fixed by the
target ABI, and poor allocation/assignment of hard registers is the
responsibility/fault of the register allocation passes.
But it may still be possible to reduce register pressure, but avoiding the
use of SUBREGs (which keep the source and destination double words
live during shuffling) along the lines of Richard's CONCAT suggestion.

This patch has been retested again mainline using make bootstrap and
make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok mainline?  

2022-08-07  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/43644
        PR rtl-optimization/97756
        PR rtl-optimization/98438
        * config/i386/i386.md (define_peephole2): Recognize double word
        swap sequences, and replace them with more efficient idioms,
        including using xchg when optimizing for size.

gcc/testsuite/ChangeLog
        PR target/43644
        * gcc.target/i386/pr43644.c: New test case.

Thanks in advance,
Roger
--

[-- Attachment #2: patchxg2.txt --]
[-- Type: text/plain, Size: 2156 bytes --]

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 298e4b3..a11fd5b 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3039,6 +3039,36 @@
   [(parallel [(set (match_dup 0) (match_dup 1))
 	      (set (match_dup 1) (match_dup 0))])])

+;; Replace a double word swap that requires 4 mov insns with a
+;; 3 mov insn implementation (or an xchg when optimizing for size).
+(define_peephole2
+  [(set (match_operand:DWIH 0 "general_reg_operand")
+	(match_operand:DWIH 1 "general_reg_operand"))
+   (set (match_operand:DWIH 2 "general_reg_operand")
+	(match_operand:DWIH 3 "general_reg_operand"))
+   (clobber (match_operand:<DWI> 4 "general_reg_operand"))
+   (set (match_dup 3) (match_dup 0))
+   (set (match_dup 1) (match_dup 2))]
+  "REGNO (operands[0]) != REGNO (operands[3])
+   && REGNO (operands[1]) != REGNO (operands[2])
+   && REGNO (operands[1]) != REGNO (operands[3])
+   && REGNO (operands[3]) == REGNO (operands[4])
+   && peep2_reg_dead_p (4, operands[0])
+   && peep2_reg_dead_p (5, operands[2])"
+  [(parallel [(set (match_dup 1) (match_dup 3))
+	      (set (match_dup 3) (match_dup 1))])]
+{
+  if (!optimize_insn_for_size_p ())
+    {
+      rtx tmp = REGNO (operands[0]) > REGNO (operands[2]) ? operands[0]
+							  : operands[2];
+      emit_move_insn (tmp, operands[1]);
+      emit_move_insn (operands[1], operands[3]);
+      emit_move_insn (operands[3], tmp);
+      DONE;
+    }
+})
+
 (define_expand "movstrict<mode>"
   [(set (strict_low_part (match_operand:SWI12 0 "register_operand"))
 	(match_operand:SWI12 1 "general_operand"))]
diff --git a/gcc/testsuite/gcc.target/i386/pr43644.c b/gcc/testsuite/gcc.target/i386/pr43644.c
new file mode 100644
index 0000000..ffdf31c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr43644.c
@@ -0,0 +1,11 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+__int128 foo(__int128 x, __int128 y)
+{
+  return x+y;
+}
+
+/* { dg-final { scan-assembler-times "movq" 2 } } */
+/* { dg-final { scan-assembler-not "push" } } */
+/* { dg-final { scan-assembler-not "pop" } } */

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [x86 PATCH take #2] Add peephole2 to reduce double word register shuffling
  2022-08-07 17:04 [x86 PATCH take #2] Add peephole2 to reduce double word register shuffling Roger Sayle
@ 2022-08-08  7:22 ` Uros Bizjak
  0 siblings, 0 replies; 2+ messages in thread
From: Uros Bizjak @ 2022-08-08  7:22 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc-patches, Richard Biener

On Sun, Aug 7, 2022 at 7:04 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This is a resubmission of my patch from June to fix some forms of
> inefficient
> register allocation using an additional peephole2 in i386.md.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596064.html
>
> Since the original, a number of supporting patches/improvements have
> been reviewed and approved, making this peephole even more effective.
> Hence for the simple function:
> __int128 foo(__int128 x, __int128 y) { return x+y; }
>
> mainline GCC on x86_64 with -O2 currently generates:
>         movq    %rsi, %rax
>         movq    %rdi, %r8
>         movq    %rax, %rdi
>         movq    %rdx, %rax
>         movq    %rcx, %rdx
>         addq    %r8, %rax
>         adcq    %rdi, %rdx
>         ret
>
> with this patch we now generate (a three insn improvement):
>         movq    %rdx, %rax
>         movq    %rcx, %rdx
>         addq    %rdi, %rax
>         adcq    %rsi, %rdx
>         ret
>
> Back in June the review of the original patch stalled, as peephole2
> isn't the ideal place to fix this (with which I fully agree), and this patch
> is really just a workaround for a deeper deficiency in reload/lra.
> To address this I've now filed a new enhancement PR in Bugzilla,
> PR rtl-optimization/106518, that describes that underlying issue,
> which might make an interesting (GSoC) project for anyone brave
> (fool hardy) enough to tweak GCC's register allocation.
>
> By comparison, this single peephole can't adversely affect other targets,
> and should the happy day come that it's no longer required, at worst
> would just become a harmless legacy transform that no longer triggers.
>
> I'm also investigating Uros' suggestion that it may be possible for RTL
> expansion to do a better job expanding the function prologue, but
> ultimately the hard register placement constraints are fixed by the
> target ABI, and poor allocation/assignment of hard registers is the
> responsibility/fault of the register allocation passes.
> But it may still be possible to reduce register pressure, but avoiding the
> use of SUBREGs (which keep the source and destination double words
> live during shuffling) along the lines of Richard's CONCAT suggestion.
>
> This patch has been retested again mainline using make bootstrap and
> make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  Ok mainline?
>
>
> 2022-08-07  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/43644
>         PR rtl-optimization/97756
>         PR rtl-optimization/98438
>         * config/i386/i386.md (define_peephole2): Recognize double word
>         swap sequences, and replace them with more efficient idioms,
>         including using xchg when optimizing for size.
>
> gcc/testsuite/ChangeLog
>         PR target/43644
>         * gcc.target/i386/pr43644.c: New test case.

+;; Replace a double word swap that requires 4 mov insns with a
+;; 3 mov insn implementation (or an xchg when optimizing for size).
+(define_peephole2
+  [(set (match_operand:DWIH 0 "general_reg_operand")
+ (match_operand:DWIH 1 "general_reg_operand"))
+   (set (match_operand:DWIH 2 "general_reg_operand")
+ (match_operand:DWIH 3 "general_reg_operand"))
+   (clobber (match_operand:<DWI> 4 "general_reg_operand"))
+   (set (match_dup 3) (match_dup 0))
+   (set (match_dup 1) (match_dup 2))]
+  "REGNO (operands[0]) != REGNO (operands[3])
+   && REGNO (operands[1]) != REGNO (operands[2])
+   && REGNO (operands[1]) != REGNO (operands[3])
+   && REGNO (operands[3]) == REGNO (operands[4])
+   && peep2_reg_dead_p (4, operands[0])
+   && peep2_reg_dead_p (5, operands[2])"
+  [(parallel [(set (match_dup 1) (match_dup 3))
+      (set (match_dup 3) (match_dup 1))])]

I'm not sure it is correct to remove the clobber here. Some RTL expert
should comment on this change.

+  if (!optimize_insn_for_size_p ())
+    {
+      rtx tmp = REGNO (operands[0]) > REGNO (operands[2]) ? operands[0]
+  : operands[2];

Hm, this is a strange relation, and it is not obvious why it is done
in that way. Usually, REGNO (op1) != REGNO (op2) does the trick. At
least a comment should be added here.

Uros.

+      emit_move_insn (tmp, operands[1]);
+      emit_move_insn (operands[1], operands[3]);
+      emit_move_insn (operands[3], tmp);
+      DONE;
+    }





>
> Thanks in advance,
> Roger
> --
>

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2022-08-08  7:22 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-07 17:04 [x86 PATCH take #2] Add peephole2 to reduce double word register shuffling Roger Sayle
2022-08-08  7:22 ` Uros Bizjak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).