public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [x86 PATCH] PR target/91681: zero_extendditi2 pattern for more optimizations.
@ 2022-06-03  9:49 Roger Sayle
  2022-06-03 10:08 ` Uros Bizjak
  0 siblings, 1 reply; 5+ messages in thread
From: Roger Sayle @ 2022-06-03  9:49 UTC (permalink / raw)
  To: 'GCC Patches'

[-- Attachment #1: Type: text/plain, Size: 3650 bytes --]


Technically, PR target/91681 has already been resolved; we now recognize the
highpart multiplication at the tree-level, we no longer use the stack, and
we currently generate the same number of instructions as LLVM.  However, it
is still possible to do better, the current x86_64 code to generate a double
word addition of a zero extended operand, looks like:

        xorl    %r11d, %r11d
        addq    %r10, %rax
        adcq    %r11, %rdx

when it's possible (as LLVM does) to use an immediate constant:

        addq    %r10, %rax
        adcq    $0, %rdx

To do this, the backend required one or two simple changes, that
then themselves required one or two more obscure tweaks.

The simple starting point is to define a zero_extendditi2 pattern,
for zero extension from DImode to TImode on TARGET_64BIT that is
split after reload.  Double word (TImode) addition/subtraction is
split after reload, so that constrains when things should happen.

With zero extension now visible to combine, we add two new
define_insn_and_split that add/subtract a zero extended operand
in double word mode.  These apply to both 32-bit and 64-bit code
generation, to produce adc $0 and sbb $0.

The first strange tweak is that these new patterns interfere with
the optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair
of register moves, or more accurately the combine splitter no longer
triggers as we're now converting two instructions into two instructions
(not three instructions into two instructions).  This is easily
repaired (and extended to handle TImode) by changing from a pair
of define_split (that handle operand commutativity) to a set of
four define_insn_and_split (again to handle operand commutativity).

The other/final strange tweak that the above splitters now interfere
with AVX512's kunpckdq instruction which is defined as identical RTL,
DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish this, and
also avoid AVX512 mask registers being used by reload to perform
SImode scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP)
to the unpack mask operations, which matches what sse.md does for
the other mask specific (logic) operations.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-06-03  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/91681
        * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
        (*add<dwi>3_doubleword_zext): New define_insn_and_split.
        (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
        (*concat<mode><dwi>3_1): New define_insn_and_split replacing
        previous define_split for implementing DST = (HI<<32)|LO as
        pair of move instructions, setting lopart and hipart.
        (*concat<mode><dwi>3_2): Likewise.
        (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
        (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
        * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
        (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
        (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
        (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
unspec.
        (vec_pack_trunc_<mode>): Likewise.

gcc/testsuite/ChangeLog
        PR target/91681
        * g++.target/i386/pr91681.C: New test case (from the PR).
        * gcc.target/i386/pr91681-1.c: New int128 test case.
        * gcc.target/i386/pr91681-2.c: Likewise.
        * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.

Thanks in advance,
Roger
--


[-- Attachment #2: patchzt4.txt --]
[-- Type: text/plain, Size: 12315 bytes --]

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 602dfa7..6cce256 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4325,6 +4325,16 @@
    (set_attr "type" "imovx,mskmov,mskmov")
    (set_attr "mode" "SI,QI,QI")])
 
+(define_insn_and_split "zero_extendditi2"
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=r,o")
+	(zero_extend:TI (match_operand:DI 1 "nonimmediate_operand" "rm,r")))]
+  "TARGET_64BIT"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 3) (match_dup 1))
+   (set (match_dup 4) (const_int 0))]
+  "split_double_mode (TImode, &operands[0], 1, &operands[3], &operands[4]);")
+
 ;; Transform xorl; mov[bw] (set strict_low_part) into movz[bw]l.
 (define_peephole2
   [(parallel [(set (match_operand:SWI48 0 "general_reg_operand")
@@ -6453,6 +6463,33 @@
   [(set_attr "type" "alu")
    (set_attr "mode" "QI")])
 
+(define_insn_and_split "*add<dwi>3_doubleword_zext"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o")
+	(plus:<DWI>
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 2 "nonimmediate_operand" "rm,r")) 
+	  (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")))
+   (clobber (reg:CC FLAGS_REG))]
+  "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1])
+			 && !MEM_P (operands[2])
+		       : !MEM_P (operands[1])"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CCC FLAGS_REG)
+		   (compare:CCC
+		     (plus:DWIH (match_dup 1) (match_dup 2))
+		     (match_dup 1)))
+	      (set (match_dup 0)
+		   (plus:DWIH (match_dup 1) (match_dup 2)))])
+   (parallel [(set (match_dup 3)
+		   (plus:DWIH
+		     (plus:DWIH
+		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))
+		       (match_dup 4))
+		     (const_int 0)))
+	      (clobber (reg:CC FLAGS_REG))])]
+ "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);")
+
 ;; Like DWI, but use POImode instead of OImode.
 (define_mode_attr DPWI [(QI "HI") (HI "SI") (SI "DI") (DI "TI") (TI "POI")])
 
@@ -6903,6 +6940,31 @@
     }
 })
 
+(define_insn_and_split "*sub<dwi>3_doubleword_zext"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o")
+	(minus:<DWI>
+	  (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 2 "nonimmediate_operand" "rm,r"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1])
+			 && !MEM_P (operands[2])
+		       : !MEM_P (operands[1])"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CC FLAGS_REG)
+		   (compare:CC (match_dup 1) (match_dup 2)))
+	      (set (match_dup 0)
+		   (minus:DWIH (match_dup 1) (match_dup 2)))])
+   (parallel [(set (match_dup 3)
+		   (minus:DWIH
+		     (minus:DWIH
+		       (match_dup 4)
+		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0)))
+		     (const_int 0)))
+	      (clobber (reg:CC FLAGS_REG))])]
+  "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);")
+
 (define_insn "*sub<mode>_1"
   [(set (match_operand:SWI 0 "nonimmediate_operand" "=<r>m,<r>")
 	(minus:SWI
@@ -10956,34 +11018,76 @@
 
 ;; Split DST = (HI<<32)|LO early to minimize register usage.
 (define_code_iterator any_or_plus [plus ior xor])
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-	(any_or_plus:DI
-	  (ashift:DI (match_operand:DI 1 "register_operand")
-		     (const_int 32))
-	  (zero_extend:DI (match_operand:SI 2 "register_operand"))))]
-  "!TARGET_64BIT"
-  [(set (match_dup 3) (match_dup 4))
-   (set (match_dup 5) (match_dup 2))]
+(define_insn_and_split "*concat<mode><dwi>3_1"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (ashift:<DWI> (match_operand:<DWI> 1 "register_operand" "r")
+			(match_operand:<DWI> 2 "const_int_operand" "n"))
+	  (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+  && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
+   (set (match_dup 5) (match_dup 6))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[1]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[1]);
 })
 
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-	(any_or_plus:DI
-	  (zero_extend:DI (match_operand:SI 1 "register_operand"))
-	  (ashift:DI (match_operand:DI 2 "register_operand")
-		     (const_int 32))))]
-  "!TARGET_64BIT"
-  [(set (match_dup 3) (match_dup 4))
+(define_insn_and_split "*concat<mode><dwi>3_2"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	  (ashift:<DWI> (match_operand:<DWI> 2 "register_operand" "r")
+			(match_operand:<DWI> 3 "const_int_operand" "n"))))]
+  "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 1))
+   (set (match_dup 5) (match_dup 6))]
+{
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[2]);
+})
+
+(define_insn_and_split "*concat<mode><dwi>3_3"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (ashift:<DWI>
+	    (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	    (match_operand:<DWI> 2 "const_int_operand" "n"))
+	  (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+  && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
    (set (match_dup 5) (match_dup 1))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[2]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+})
+
+(define_insn_and_split "*concat<mode><dwi>3_4"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	  (ashift:<DWI>
+	    (zero_extend:<DWI> (match_operand:DWIH 2 "register_operand" "r"))
+	    (match_operand:<DWI> 3 "const_int_operand" "n"))))]
+  "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 1))
+   (set (match_dup 5) (match_dup 2))]
+{
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
 })
 \f
 ;; Negation instructions
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8b2602b..0198156 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2070,7 +2070,8 @@
 	  (ashift:HI
 	    (zero_extend:HI (match_operand:QI 1 "register_operand" "k"))
 	    (const_int 8))
-	  (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))]
+	  (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512F"
   "kunpckbw\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "HI")
@@ -2083,7 +2084,8 @@
 	  (ashift:SI
 	    (zero_extend:SI (match_operand:HI 1 "register_operand" "k"))
 	    (const_int 16))
-	  (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))]
+	  (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckwd\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "SI")])
@@ -2094,7 +2096,8 @@
 	  (ashift:DI
 	    (zero_extend:DI (match_operand:SI 1 "register_operand" "k"))
 	    (const_int 32))
-	  (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))]
+	  (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckdq\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "DI")])
@@ -17398,21 +17401,26 @@
 })
 
 (define_expand "vec_pack_trunc_qi"
-  [(set (match_operand:HI 0 "register_operand")
-	(ior:HI (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand"))
-                           (const_int 8))
-		(zero_extend:HI (match_operand:QI 1 "register_operand"))))]
+  [(parallel
+    [(set (match_operand:HI 0 "register_operand")
+	(ior:HI
+	   (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand"))
+		      (const_int 8))
+	   (zero_extend:HI (match_operand:QI 1 "register_operand"))))
+     (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512F")
 
 (define_expand "vec_pack_trunc_<mode>"
-  [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand")
-	(ior:<DOUBLEMASKMODE>
-	  (ashift:<DOUBLEMASKMODE>
+  [(parallel
+    [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand")
+	  (ior:<DOUBLEMASKMODE>
+	    (ashift:<DOUBLEMASKMODE>
+	      (zero_extend:<DOUBLEMASKMODE>
+	        (match_operand:SWI24 2 "register_operand"))
+	      (match_dup 3))
 	    (zero_extend:<DOUBLEMASKMODE>
-	      (match_operand:SWI24 2 "register_operand"))
-	    (match_dup 3))
-	  (zero_extend:<DOUBLEMASKMODE>
-	    (match_operand:SWI24 1 "register_operand"))))]
+	      (match_operand:SWI24 1 "register_operand"))))
+     (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512BW"
 {
   operands[3] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));
diff --git a/gcc/testsuite/g++.target/i386/pr91681.C b/gcc/testsuite/g++.target/i386/pr91681.C
new file mode 100644
index 0000000..0271e43
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr91681.C
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+void multiply128x64x2_3 ( 
+    const unsigned long a, 
+    const unsigned long b, 
+    const unsigned long c, 
+    const unsigned long d, 
+    __uint128_t o[2])
+{
+    __uint128_t B0 = (__uint128_t) b * c;
+    __uint128_t B2 = (__uint128_t) a * c;
+    __uint128_t B1 = (__uint128_t) b * d;
+    __uint128_t B3 = (__uint128_t) a * d;
+
+    o[0] = B2 + (B0 >> 64);
+    o[1] = B3 + (B1 >> 64);
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-1.c b/gcc/testsuite/gcc.target/i386/pr91681-1.c
new file mode 100644
index 0000000..ab83cc4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+unsigned __int128 m;
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+    return x + y;
+}
+
+void bar(unsigned __int128 x, unsigned long long y)
+{
+    m = x + y;
+}
+
+void baz(unsigned long long y)
+{
+    m += y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-2.c b/gcc/testsuite/gcc.target/i386/pr91681-2.c
new file mode 100644
index 0000000..ea52c72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-2.c
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+unsigned __int128 m;
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+    return x - y;
+}
+
+void bar(unsigned __int128 x, unsigned long long y)
+{
+    m = x - y;
+}
+
+void baz(unsigned long long y)
+{
+    m -= y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-3.c b/gcc/testsuite/gcc.target/i386/pr91681-3.c
new file mode 100644
index 0000000..22a03c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-3.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2" } */
+
+unsigned long long m;
+
+unsigned long long foo(unsigned long long x, unsigned int y)
+{
+    return x - y;
+}
+
+void bar(unsigned long long x, unsigned int y)
+{
+    m = x - y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-06-05 18:57 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-03  9:49 [x86 PATCH] PR target/91681: zero_extendditi2 pattern for more optimizations Roger Sayle
2022-06-03 10:08 ` Uros Bizjak
2022-06-03 10:15   ` Uros Bizjak
2022-06-05 11:48   ` Roger Sayle
2022-06-05 18:57     ` Uros Bizjak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).