Re: [x86 PATCH] Add peephole2 to reduce double word register shuffling.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Uros Bizjak <ubizjak@gmail.com>
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [x86 PATCH] Add peephole2 to reduce double word register shuffling.
Date: Thu, 2 Jun 2022 11:32:52 +0200	[thread overview]
Message-ID: <CAFULd4ZqXDc6p9eMqcKJ30niEporP5hoy7q3EE0Fyhj6UgNvBA@mail.gmail.com> (raw)
In-Reply-To: <032501d87651$43cf0960$cb6d1c20$@nextmovesoftware.com>

[-- Attachment #1: Type: text/plain, Size: 3642 bytes --]

On Thu, Jun 2, 2022 at 9:20 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> The simple test case below demonstrates an interesting register
> allocation challenge facing x86_64, imposed by ABI requirements
> on int128.
>
> __int128 foo(__int128 x, __int128 y)
> {
>   return x+y;
> }
>
> For which GCC currently generates the unusual sequence:
>
>         movq    %rsi, %rax
>         movq    %rdi, %r8
>         movq    %rax, %rdi
>         movq    %rdx, %rax
>         movq    %rcx, %rdx
>         addq    %r8, %rax
>         adcq    %rdi, %rdx
>         ret
>
> The challenge is that the x86_64 ABI requires passing the first __int128,
> x, in %rsi:%rdi (highpart in %rsi, lowpart in %rdi), where internally
> GCC prefers TI mode (double word) integers to be register allocated as
> %rdi:%rsi (highpart in %rdi, lowpart in %rsi).  So after reload, we have
> four mov instructions, two to move the double word to temporary registers
> and then two to move them back.
>
> This patch adds a peephole2 to spot this register shuffling, and with
> -Os generates a xchg instruction, to produce:
>
>         xchgq   %rsi, %rdi
>         movq    %rdx, %rax
>         movq    %rcx, %rdx
>         addq    %rsi, %rax
>         adcq    %rdi, %rdx
>         ret
>
> or when optimizing for speed, a three mov sequence, using just one of
> the temporary registers, which ultimately results in the improved:
>
>         movq    %rdi, %r8
>         movq    %rdx, %rax
>         movq    %rcx, %rdx
>         addq    %r8, %rax
>         adcq    %rsi, %rdx
>         ret
>
> I've a follow-up patch which improves things further, and with the
> output in flux, I'd like to add the new testcase with part 2, once
> we're back down to requiring only two movq instructions.

Shouldn't we rather do something about:

(insn 2 9 3 2 (set (reg:DI 85)
       (reg:DI 5 di [ x ])) "dword-2.c":2:1 82 {*movdi_internal}
    (nil))
(insn 3 2 4 2 (set (reg:DI 86)
       (reg:DI 4 si [ x+8 ])) "dword-2.c":2:1 82 {*movdi_internal}
    (nil))
(insn 4 3 5 2 (set (reg:TI 84)
       (subreg:TI (reg:DI 85) 0)) "dword-2.c":2:1 81 {*movti_internal}
    (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 84) 8)
       (reg:DI 86)) "dword-2.c":2:1 82 {*movdi_internal}
    (nil))
(insn 6 5 7 2 (set (reg/v:TI 83 [ x ])
       (reg:TI 84)) "dword-2.c":2:1 81 {*movti_internal}
    (nil))

The above is how the functionTImode argument is constructed.

The other problem is that double-word addition gets split only after
reload, mostly due to RA reasons. In the past it was determined that
RA creates better code when registers are split late (this reason
probably does not hold anymore), but nowadays the limitation remains
only for arithmetic and shifts.

Attached to this message, please find the patch that performs dual
word mode arithmetic splitting before reload. It improves generated
code somehow, but due to the above argument construction sequence, the
bulk of moves remain. Unfortunately, when under register pressure
(e.g. 32-bit targets), the peephole approach gets ineffective due to
register spilling, so IMO the root of the problem should be fixed.

Uros.


>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32} with
> no new failures.  Ok for mainline?
>
>
> 2022-06-02  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386.md (define_peephole2): Recognize double word
>         swap sequences, and replace them with more efficient idioms,
>         including using xchg when optimizing for size.
>
>
> Thanks in advance,
> Roger
> --
>

[-- Attachment #2: dw-arith.diff.txt --]
[-- Type: text/plain, Size: 6113 bytes --]

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 050dee7d43a..876220d57f9 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -5730,9 +5730,10 @@
 	  (match_operand:<DWI> 1 "nonimmediate_operand" "%0,0")
 	  (match_operand:<DWI> 2 "x86_64_hilo_general_operand" "r<di>,o")))
    (clobber (reg:CC FLAGS_REG))]
-  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
+  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CCC FLAGS_REG)
 		   (compare:CCC
 		     (plus:DWIH (match_dup 1) (match_dup 2))
@@ -5750,6 +5751,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       ix86_expand_binary_operator (PLUS, <MODE>mode, &operands[3]);
       DONE;
     }
@@ -6539,9 +6541,10 @@
 	    (plus:<DWI> (match_dup 1) (match_dup 2)))))
    (set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro,r")
 	(plus:<DWI> (match_dup 1) (match_dup 2)))]
-  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
+  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CCC FLAGS_REG)
 		   (compare:CCC
 		     (plus:DWIH (match_dup 1) (match_dup 2))
@@ -6567,9 +6570,7 @@
 		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))
 		       (match_dup 4))
 		     (match_dup 5)))])]
-{
-  split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
-})
+  "split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);")
 
 (define_insn_and_split "*addv<dwi>4_doubleword_1"
   [(set (reg:CCO FLAGS_REG)
@@ -6586,9 +6587,10 @@
 	(plus:<DWI> (match_dup 1) (match_dup 2)))]
   "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)
    && CONST_SCALAR_INT_P (operands[2])
-   && rtx_equal_p (operands[2], operands[3])"
+   && rtx_equal_p (operands[2], operands[3])
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CCC FLAGS_REG)
 		   (compare:CCC
 		     (plus:DWIH (match_dup 1) (match_dup 2))
@@ -6618,6 +6620,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       emit_insn (gen_addv<mode>4_1 (operands[3], operands[4], operands[5],
 				    operands[5]));
       DONE;
@@ -6880,9 +6883,10 @@
 	  (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")
 	  (match_operand:<DWI> 2 "x86_64_hilo_general_operand" "r<di>,o")))
    (clobber (reg:CC FLAGS_REG))]
-  "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)"
+  "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CC FLAGS_REG)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0)
@@ -6898,6 +6902,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       ix86_expand_binary_operator (MINUS, <MODE>mode, &operands[3]);
       DONE;
     }
@@ -7080,9 +7085,10 @@
 	    (minus:<DWI> (match_dup 1) (match_dup 2)))))
    (set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro,r")
 	(minus:<DWI> (match_dup 1) (match_dup 2)))]
-  "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)"
+  "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CC FLAGS_REG)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0)
@@ -7106,9 +7112,7 @@
 		       (match_dup 4)
 		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0)))
 		     (match_dup 5)))])]
-{
-  split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
-})
+  "split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);")
 
 (define_insn_and_split "*subv<dwi>4_doubleword_1"
   [(set (reg:CCO FLAGS_REG)
@@ -7125,9 +7129,10 @@
 	(minus:<DWI> (match_dup 1) (match_dup 2)))]
   "ix86_binary_operator_ok (MINUS, <MODE>mode, operands)
    && CONST_SCALAR_INT_P (operands[2])
-   && rtx_equal_p (operands[2], operands[3])"
+   && rtx_equal_p (operands[2], operands[3])
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CC FLAGS_REG)
 		   (compare:CC (match_dup 1) (match_dup 2)))
 	      (set (match_dup 0)
@@ -7155,6 +7160,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       emit_insn (gen_subv<mode>4_1 (operands[3], operands[4], operands[5],
 				    operands[5]));
       DONE;
@@ -7805,9 +7811,10 @@
 	  (match_dup 1)))
    (set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro,r")
 	(plus:<DWI> (match_dup 1) (match_dup 2)))]
-  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
+  "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel [(set (reg:CCC FLAGS_REG)
 		   (compare:CCC
 		     (plus:DWIH (match_dup 1) (match_dup 2))
@@ -7834,6 +7841,7 @@
   split_double_mode (<DWI>mode, &operands[0], 3, &operands[0], &operands[3]);
   if (operands[2] == const0_rtx)
     {
+      emit_move_insn (operands[0], operands[1]);
       emit_insn (gen_addcarry<mode>_0 (operands[3], operands[4], operands[5]));
       DONE;
     }
@@ -10977,9 +10985,10 @@
   [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=ro")
 	(neg:<DWI> (match_operand:<DWI> 1 "nonimmediate_operand" "0")))
    (clobber (reg:CC FLAGS_REG))]
-  "ix86_unary_operator_ok (NEG, <DWI>mode, operands)"
+  "ix86_unary_operator_ok (NEG, <DWI>mode, operands)
+   && ix86_pre_reload_split ()"
   "#"
-  "&& reload_completed"
+  "&& 1"
   [(parallel
     [(set (reg:CCC FLAGS_REG)
 	  (ne:CCC (match_dup 1) (const_int 0)))

next prev parent reply	other threads:[~2022-06-02  9:33 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-02  7:20 Roger Sayle
2022-06-02  9:30 ` Richard Biener
2022-06-02  9:32 ` Uros Bizjak [this message]
2022-06-02  9:36   ` Uros Bizjak
2022-06-02 10:04   ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAFULd4ZqXDc6p9eMqcKJ30niEporP5hoy7q3EE0Fyhj6UgNvBA@mail.gmail.com \
    --to=ubizjak@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=roger@nextmovesoftware.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).