public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons
@ 2021-09-29 16:19 Tamar Christina
  2021-09-29 16:19 ` [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow Tamar Christina
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:19 UTC
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 3438 bytes --]

Hi All,

This patch series optimizes AArch64 codegen for narrowing operations,
shift-and-narrow sequences, and some comparisons against bitmasks.

There is more to come, but this is the first batch.

The series shows a 2% gain on x264 in SPECCPU2017 with a 0.05% size
reduction, and a 5-10% performance gain on various real-world libraries
that are optimized with intrinsics.

One missing piece that needs additional work is the ability to combine
stores to sequential locations.  Consider:

#include <arm_neon.h>

#define SIZE 8
#define SIZE2 8 * 8 * 8

extern void pop (uint8_t*);

void foo (int16x8_t row0, int16x8_t row1, int16x8_t row2, int16x8_t row3,
          int16x8_t row4, int16x8_t row5, int16x8_t row6, int16x8_t row7) {
    uint8_t block_nbits[SIZE2];

    uint8x8_t row0_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row0)));
    uint8x8_t row1_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row1)));
    uint8x8_t row2_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row2)));
    uint8x8_t row3_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row3)));
    uint8x8_t row4_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row4)));
    uint8x8_t row5_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row5)));
    uint8x8_t row6_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row6)));
    uint8x8_t row7_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row7)));

    vst1_u8(block_nbits + 0 * SIZE, row0_nbits);
    vst1_u8(block_nbits + 1 * SIZE, row1_nbits);
    vst1_u8(block_nbits + 2 * SIZE, row2_nbits);
    vst1_u8(block_nbits + 3 * SIZE, row3_nbits);
    vst1_u8(block_nbits + 4 * SIZE, row4_nbits);
    vst1_u8(block_nbits + 5 * SIZE, row5_nbits);
    vst1_u8(block_nbits + 6 * SIZE, row6_nbits);
    vst1_u8(block_nbits + 7 * SIZE, row7_nbits);

    pop (block_nbits);
}

currently generates:

movi v1.8b, #0x10

xtn v17.8b, v17.8h
xtn v23.8b, v23.8h
xtn v22.8b, v22.8h
xtn v4.8b, v21.8h
xtn v20.8b, v20.8h
xtn v19.8b, v19.8h
xtn v18.8b, v18.8h
xtn v24.8b, v24.8h

sub v17.8b, v1.8b, v17.8b
sub v23.8b, v1.8b, v23.8b
sub v22.8b, v1.8b, v22.8b
sub v16.8b, v1.8b, v4.8b
sub v8.8b, v1.8b, v20.8b
sub v4.8b, v1.8b, v19.8b
sub v2.8b, v1.8b, v18.8b
sub v1.8b, v1.8b, v24.8b

stp d17, d23, [sp, #224]
stp d22, d16, [sp, #240]
stp d8, d4, [sp, #256]
stp d2, d1, [sp, #272]

where optimized codegen for this is:

movi v1.16b, #0x10

uzp1 v17.16b, v17.16b, v23.16b
uzp1 v22.16b, v22.16b, v4.16b
uzp1 v20.16b, v20.16b, v19.16b
uzp1 v24.16b, v18.16b, v24.16b

sub v17.16b, v1.16b, v17.16b
sub v18.16b, v1.16b, v22.16b
sub v19.16b, v1.16b, v20.16b
sub v20.16b, v1.16b, v24.16b

stp q17, q18, [sp, #224]
stp q19, q20, [sp, #256]

which requires us to recognize the stores to sequential locations (the multiple
stp d instructions in the current output) and merge them into one.

This pattern occurs reasonably often, but I am unsure how to handle it.  For one,
it requires st1 and friends to not be unspecs, which is currently the focus of

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579582.html

Thanks,
Tamar

--- inline copy of patch -- 

-- 

[-- Attachment #2: rb14899.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]




* [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow
  2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
@ 2021-09-29 16:19 ` Tamar Christina
  2021-09-30  8:50   ` Kyrylo Tkachov
  2021-09-29 16:19 ` [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle) Tamar Christina
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:19 UTC
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 4109 bytes --]

Hi All,

This adds a simple pattern for combining right shifts and narrows into
shifted narrows.

i.e.

typedef short int16_t;
typedef unsigned short uint16_t;

void foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
    for( int i = 0; i < n; i++ )
      d[i] = (a[i] * a[i]) >> 10;
}

now generates:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        shrn    v1.4h, v1.4s, 10
        shrn2   v1.8h, v0.4s, 10
        str     q1, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

instead of:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        sshr    v1.4s, v1.4s, 10
        sshr    v0.4s, v0.4s, 10
        xtn     v1.4h, v1.4s
        xtn2    v1.8h, v0.4s
        str     q1, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4
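
For reference, a minimal scalar sketch of why both the arithmetic and the
logical right shift can map onto the same shrn here.  It is not part of the
patch: the helper names are hypothetical, it models a single 32-bit lane
narrowed to 16 bits, and it assumes shift amounts of 1 to 16.  In that range
the truncation keeps bits [n, n+15] of the input, so the bits filled in at the
top by either shift never reach the result.

```c
#include <assert.h>
#include <stdint.h>

/* Portable arithmetic shift right of a 32-bit lane, for 1 <= n <= 31.
   (>> on negative signed values is implementation-defined in ISO C,
   so the sign fill is built by hand.)  */
static uint32_t asr32 (uint32_t x, unsigned n)
{
  uint32_t fill = (x & 0x80000000u) ? ~(UINT32_MAX >> n) : 0;
  return (x >> n) | fill;
}

/* Logical shift right of a 32-bit lane.  */
static uint32_t lsr32 (uint32_t x, unsigned n)
{
  return x >> n;
}

/* Hypothetical scalar model of one SHRN lane: keep bits [n, n+15].  */
static uint16_t shrn_lane (uint32_t x, unsigned n)
{
  return (uint16_t) (x >> n);
}

/* Assert that truncating either kind of shift matches the SHRN model
   for every shift amount from 1 to 16.  */
static void check_shrn_lane (uint32_t x)
{
  for (unsigned n = 1; n <= 16; n++)
    {
      assert ((uint16_t) asr32 (x, n) == shrn_lane (x, n));
      assert ((uint16_t) lsr32 (x, n) == shrn_lane (x, n));
    }
}
```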

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_<srn_op>shrn<mode>_vect,
	*aarch64_<srn_op>shrn<mode>2_vect): New.
	* config/aarch64/iterators.md (srn_op): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shrn-combine.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 48eddf64e05afe3788abfa05141f6544a9323ea1..d7b6cae424622d259f97a3d5fa9093c0fb0bd5ce 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1818,6 +1818,28 @@ (define_insn "aarch64_shrn<mode>_insn_be"
   [(set_attr "type" "neon_shift_imm_narrow_q")]
 )
 
+(define_insn "*aarch64_<srn_op>shrn<mode>_vect"
+  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=w")
+        (truncate:<VNARROWQ>
+          (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
+            (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_<vn_mode>"))))]
+  "TARGET_SIMD"
+  "shrn\\t%0.<Vntype>, %1.<Vtype>, %2"
+  [(set_attr "type" "neon_shift_imm_narrow_q")]
+)
+
+(define_insn "*aarch64_<srn_op>shrn<mode>2_vect"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+	  (match_operand:<VNARROWQ> 1 "register_operand" "0")
+	  (truncate:<VNARROWQ>
+	    (SHIFTRT:VQN (match_operand:VQN 2 "register_operand" "w")
+	      (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_<vn_mode>")))))]
+  "TARGET_SIMD"
+  "shrn2\\t%0.<V2ntype>, %2.<Vtype>, %3"
+  [(set_attr "type" "neon_shift_imm_narrow_q")]
+)
+
 (define_expand "aarch64_shrn<mode>"
   [(set (match_operand:<VNARROWQ> 0 "register_operand")
 	(truncate:<VNARROWQ>
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index caa42f8f169fbf2cf46a90cf73dee05619acc300..8dbeed3b0d4a44cdc17dd333ed397b39a33f386a 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -2003,6 +2003,9 @@ (define_code_attr shift [(ashift "lsl") (ashiftrt "asr")
 ;; Op prefix for shift right and accumulate.
 (define_code_attr sra_op [(ashiftrt "s") (lshiftrt "u")])
 
+;; op prefix for shift right and narrow.
+(define_code_attr srn_op [(ashiftrt "r") (lshiftrt "")])
+
 ;; Map shift operators onto underlying bit-field instructions
 (define_code_attr bfshift [(ashift "ubfiz") (ashiftrt "sbfx")
 			   (lshiftrt "ubfx") (rotatert "extr")])
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine.c
new file mode 100644
index 0000000000000000000000000000000000000000..0187f49f4dcc76182c90366caaf00d294e835707
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+typedef short int16_t;
+typedef unsigned short uint16_t;
+
+void foo (uint16_t * restrict a, int16_t * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = (a[i] * a[i]) >> 10;
+}
+
+/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tshrn2\t} 1 } } */


-- 

[-- Attachment #2: rb14892.patch --]
[-- Type: text/x-diff, Size: 2804 bytes --]



* [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle)
  2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
  2021-09-29 16:19 ` [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow Tamar Christina
@ 2021-09-29 16:19 ` Tamar Christina
  2021-09-30  8:54   ` Kyrylo Tkachov
  2021-09-29 16:20 ` [PATCH 3/7]AArch64 Add pattern for sshr to cmlt Tamar Christina
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:19 UTC
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 5470 bytes --]

Hi All,

When doing a (narrowing) right shift by half the width of the original type, we
are essentially shuffling the top half of each element down into the narrowed
result.

If we have a hi/lo pair, we can use a single shuffle instead of two shifts.

i.e.

typedef short int16_t;
typedef unsigned short uint16_t;

void foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
    for( int i = 0; i < n; i++ )
      d[i] = (a[i] * a[i]) >> 16;
}

now generates:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        uzp2    v0.8h, v1.8h, v0.8h
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

instead of

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        sshr    v1.4s, v1.4s, 16
        sshr    v0.4s, v0.4s, 16
        xtn     v1.4h, v1.4s
        xtn2    v1.8h, v0.4s
        str     q1, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4
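
For reference, a minimal scalar sketch of the equivalence between the two
sequences.  It is not part of the patch: the helper names are hypothetical, it
assumes little-endian lane numbering, and it models only the plain
(non-rounding) shift by 16 on 32-bit lanes.  Viewed as 16-bit elements, the top
half of each 32-bit lane is the odd-numbered element, which is what uzp2
selects.

```c
#include <assert.h>
#include <stdint.h>

/* View four 32-bit lanes as eight 16-bit lanes in little-endian lane
   order: h[2*i] is the low half of lane i, h[2*i + 1] the high half.  */
static void as_halfwords (const uint32_t v[4], uint16_t h[8])
{
  for (int i = 0; i < 4; i++)
    {
      h[2 * i] = (uint16_t) v[i];
      h[2 * i + 1] = (uint16_t) (v[i] >> 16);
    }
}

/* Scalar model of UZP2 on .8h: odd-numbered elements of a, then of b.  */
static void uzp2_8h (const uint16_t a[8], const uint16_t b[8],
                     uint16_t out[8])
{
  for (int i = 0; i < 4; i++)
    {
      out[i] = a[2 * i + 1];
      out[i + 4] = b[2 * i + 1];
    }
}

/* Shift every 32-bit lane right by 16 and narrow; lo fills result
   lanes 0-3 and hi fills lanes 4-7.  */
static void shift16_narrow (const uint32_t lo[4], const uint32_t hi[4],
                            uint16_t out[8])
{
  for (int i = 0; i < 4; i++)
    {
      out[i] = (uint16_t) (lo[i] >> 16);
      out[i + 4] = (uint16_t) (hi[i] >> 16);
    }
}

/* Return 1 when both formulations agree on the given inputs.  */
static int topbits_shuffle_agrees (const uint32_t lo[4],
                                   const uint32_t hi[4])
{
  uint16_t a[8], b[8], via_uzp2[8], via_shift[8];
  as_halfwords (lo, a);
  as_halfwords (hi, b);
  uzp2_8h (a, b, via_uzp2);
  shift16_narrow (lo, hi, via_shift);
  for (int i = 0; i < 8; i++)
    if (via_uzp2[i] != via_shift[i])
      return 0;
  return 1;
}
```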

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(*aarch64_<srn_op>topbits_shuffle<mode>,
	*aarch64_topbits_shuffle<mode>): New.
	* config/aarch64/predicates.md
	(aarch64_simd_shift_imm_vec_exact_top): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shrn-combine-2.c: New test.
	* gcc.target/aarch64/shrn-combine-3.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index d7b6cae424622d259f97a3d5fa9093c0fb0bd5ce..300bf001b59ca7fa197c580b10adb7f70f20d1e0 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1840,6 +1840,36 @@ (define_insn "*aarch64_<srn_op>shrn<mode>2_vect"
   [(set_attr "type" "neon_shift_imm_narrow_q")]
 )
 
+(define_insn "*aarch64_<srn_op>topbits_shuffle<mode>"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+          (truncate:<VNARROWQ>
+            (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
+	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")))
+	  (truncate:<VNARROWQ>
+	    (SHIFTRT:VQN (match_operand:VQN 3 "register_operand" "w")
+	      (match_dup 2)))))]
+  "TARGET_SIMD"
+  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
+(define_insn "*aarch64_topbits_shuffle<mode>"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+          (unspec:<VNARROWQ> [
+              (match_operand:VQN 1 "register_operand" "w")
+	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")
+	     ] UNSPEC_RSHRN)
+	  (unspec:<VNARROWQ> [
+	      (match_operand:VQN 3 "register_operand" "w")
+	      (match_dup 2)
+	     ] UNSPEC_RSHRN)))]
+  "TARGET_SIMD"
+  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
 (define_expand "aarch64_shrn<mode>"
   [(set (match_operand:<VNARROWQ> 0 "register_operand")
 	(truncate:<VNARROWQ>
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 49f02ae0381359174fed80c2a2264295c75bc189..7fd4f9e7d06d3082d6f3047290f0446789e1d0d2 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -545,6 +545,12 @@ (define_predicate "aarch64_simd_shift_imm_offset_di"
   (and (match_code "const_int")
        (match_test "IN_RANGE (INTVAL (op), 1, 64)")))
 
+(define_predicate "aarch64_simd_shift_imm_vec_exact_top"
+  (and (match_code "const_vector")
+       (match_test "aarch64_const_vec_all_same_in_range_p (op,
+			GET_MODE_UNIT_BITSIZE (GET_MODE (op)) / 2,
+			GET_MODE_UNIT_BITSIZE (GET_MODE (op)) / 2)")))
+
 (define_predicate "aarch64_simd_shift_imm_vec_qi"
   (and (match_code "const_vector")
        (match_test "aarch64_const_vec_all_same_in_range_p (op, 1, 8)")))
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..924b3b849e449082b8c0b7dc6b955a2bad8d0911
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+typedef short int16_t;
+typedef unsigned short uint16_t;
+
+void foo (uint16_t * restrict a, int16_t * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = (a[i] * a[i]) >> 16;
+}
+
+/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..929a55c5c338844e6a5c5ad249af482286ab9c61
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+
+#include <arm_neon.h>
+
+uint16x8_t foo (uint32x4_t a, uint32x4_t b)
+{
+  return vrshrn_high_n_u32 (vrshrn_n_u32 (a, 16), b, 16);
+}
+
+/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */


-- 

[-- Attachment #2: rb14893.patch --]
[-- Type: text/x-diff, Size: 3956 bytes --]




* [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
  2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
  2021-09-29 16:19 ` [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow Tamar Christina
  2021-09-29 16:19 ` [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle) Tamar Christina
@ 2021-09-29 16:20 ` Tamar Christina
  2021-09-30  9:27   ` Kyrylo Tkachov
  2021-09-29 16:20 ` [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2 Tamar Christina
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:20 UTC
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 3561 bytes --]

Hi All,

This optimizes a signed right shift by BITSIZE-1 into a cmlt operation, which is
preferable because compares generally have a higher throughput than shifts.

On AArch64 the result of the shift would have been either -1 or 0, which is
exactly the result of the compare.

i.e.

void e (int * restrict a, int *b, int n)
{
    for (int i = 0; i < n; i++)
      b[i] = a[i] >> 31;
}

now generates:

.L4:
        ldr     q0, [x0, x3]
        cmlt    v0.4s, v0.4s, #0
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

instead of:

.L4:
        ldr     q0, [x0, x3]
        sshr    v0.4s, v0.4s, 31
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4
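
For reference, a minimal scalar sketch of the identity being used.  It is not
part of the patch: the helper names are hypothetical and it models a single
32-bit lane, returning the result as a bit pattern.  An arithmetic shift by 31
replicates the sign bit into every bit, which is the all-ones or all-zeros mask
that a compare-less-than-zero produces.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one SSHR #31 lane.  The sign fill is built by hand
   because >> on negative signed values is implementation-defined in
   ISO C.  */
static uint32_t sshr31_lane (int32_t x)
{
  uint32_t u = (uint32_t) x;
  /* Bits 1..31 are filled with copies of the sign bit.  */
  uint32_t fill = (u & 0x80000000u) ? ~(UINT32_MAX >> 31) : 0;
  return (u >> 31) | fill;
}

/* Scalar model of one CMLT #0 lane: all ones if x < 0, else zero.  */
static uint32_t cmlt_zero_lane (int32_t x)
{
  return x < 0 ? UINT32_MAX : 0;
}

/* Return 1 when the shift and the compare agree for this input.  */
static int sshr31_matches_cmlt (int32_t x)
{
  return sshr31_lane (x) == cmlt_zero_lane (x);
}
```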

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_simd_ashr<mode>): Add cmp
	case.
	* config/aarch64/constraints.md (D1): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shl-combine-2.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 300bf001b59ca7fa197c580b10adb7f70f20d1e0..19b2d0ad4dab4d5742698297ded861228ee22007 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1127,12 +1127,14 @@ (define_insn "aarch64_simd_lshr<mode>"
 )
 
 (define_insn "aarch64_simd_ashr<mode>"
- [(set (match_operand:VDQ_I 0 "register_operand" "=w")
-       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
-		     (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm" "Dr")))]
+ [(set (match_operand:VDQ_I 0 "register_operand" "=w,w")
+       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w,w")
+		     (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm" "D1,Dr")))]
  "TARGET_SIMD"
- "sshr\t%0.<Vtype>, %1.<Vtype>, %2"
-  [(set_attr "type" "neon_shift_imm<q>")]
+ "@
+  cmlt\t%0.<Vtype>, %1.<Vtype>, #0
+  sshr\t%0.<Vtype>, %1.<Vtype>, %2"
+  [(set_attr "type" "neon_compare<q>,neon_shift_imm<q>")]
 )
 
 (define_insn "*aarch64_simd_sra<mode>"
diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
index 3b49b452119c49320020fa9183314d9a25b92491..18630815ffc13f2168300a899db69fd428dfb0d6 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -437,6 +437,14 @@ (define_constraint "Dl"
       (match_test "aarch64_simd_shift_imm_p (op, GET_MODE (op),
 						 true)")))
 
+(define_constraint "D1"
+  "@internal
+ A constraint that matches vector of immediates that is bits(mode)-1."
+ (and (match_code "const,const_vector")
+      (match_test "aarch64_const_vec_all_same_in_range_p (op,
+			GET_MODE_UNIT_BITSIZE (mode) - 1,
+			GET_MODE_UNIT_BITSIZE (mode) - 1)")))
+
 (define_constraint "Dr"
   "@internal
  A constraint that matches vector of immediates for right shifts."
diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..bdfe35d09ffccc7928947c9e57f1034f7ca2c798
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
@@ -0,0 +1,12 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+void e (int * restrict a, int *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      b[i] = a[i] >> 31;
+}
+
+/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tsshr\t} } } */
+


-- 

[-- Attachment #2: rb14894.patch --]
[-- Type: text/x-diff, Size: 2471 bytes --]




* [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
  2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
                   ` (2 preceding siblings ...)
  2021-09-29 16:20 ` [PATCH 3/7]AArch64 Add pattern for sshr to cmlt Tamar Christina
@ 2021-09-29 16:20 ` Tamar Christina
  2021-09-30  9:28   ` Kyrylo Tkachov
  2021-09-29 16:21 ` [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2> Tamar Christina
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:20 UTC
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 4467 bytes --]

Hi All,

This turns a truncate of a hi/lo pair into a single permute on elements of half
the input bit size, simply ignoring the top bits (which are truncated away).

i.e.

void d2 (short * restrict a, int *b, int n)
{
    for (int i = 0; i < n; i++)
      a[i] = b[i];
}

now generates:

.L4:
        ldp     q0, q1, [x3]
        add     x3, x3, 32
        uzp1    v0.8h, v0.8h, v1.8h
        str     q0, [x5], 16
        cmp     x4, x3
        bne     .L4

instead of

.L4:
        ldp     q0, q1, [x3]
        add     x3, x3, 32
        xtn     v0.4h, v0.4s
        xtn2    v0.8h, v1.4s
        str     q0, [x5], 16
        cmp     x4, x3
        bne     .L4
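
For reference, a minimal scalar sketch of the equivalence between xtn + xtn2
and uzp1.  It is not part of the patch: the helper names are hypothetical, it
assumes little-endian lane numbering, and it models 32-bit lanes narrowed to
16 bits.  Viewed as 16-bit elements, the low half of each 32-bit lane is the
even-numbered element, which is what uzp1 selects.

```c
#include <assert.h>
#include <stdint.h>

/* View four 32-bit lanes as eight 16-bit lanes in little-endian lane
   order: h[2*i] is the low half of lane i, h[2*i + 1] the high half.  */
static void as_halfwords (const uint32_t v[4], uint16_t h[8])
{
  for (int i = 0; i < 4; i++)
    {
      h[2 * i] = (uint16_t) v[i];
      h[2 * i + 1] = (uint16_t) (v[i] >> 16);
    }
}

/* Scalar model of UZP1 on .8h: even-numbered elements of a, then of b.  */
static void uzp1_8h (const uint16_t a[8], const uint16_t b[8],
                     uint16_t out[8])
{
  for (int i = 0; i < 4; i++)
    {
      out[i] = a[2 * i];
      out[i + 4] = b[2 * i];
    }
}

/* Scalar model of XTN followed by XTN2: truncate lo into result lanes
   0-3 and hi into lanes 4-7.  */
static void xtn_xtn2 (const uint32_t lo[4], const uint32_t hi[4],
                      uint16_t out[8])
{
  for (int i = 0; i < 4; i++)
    {
      out[i] = (uint16_t) lo[i];
      out[i + 4] = (uint16_t) hi[i];
    }
}

/* Return 1 when both formulations agree on the given inputs.  */
static int narrow_trunc_agrees (const uint32_t lo[4], const uint32_t hi[4])
{
  uint16_t a[8], b[8], via_uzp1[8], via_xtn[8];
  as_halfwords (lo, a);
  as_halfwords (hi, b);
  uzp1_8h (a, b, via_uzp1);
  xtn_xtn2 (lo, hi, via_xtn);
  for (int i = 0; i < 8; i++)
    if (via_uzp1[i] != via_xtn[i])
      return 0;
  return 1;
}
```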

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_narrow_trunc<mode>): New.
	* config/aarch64/iterators.md (VNARROWSIMD, Vnarrowsimd): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/xtn-combine-1.c: New test.
	* gcc.target/aarch64/narrow_high_combine.c: Update case.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 36396ef236e8c476d5e2f1acee80dc54ec5ebe4e..33e3301d229366022a5b9481b6c3ae8f8d93f9e2 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1753,6 +1753,18 @@ (define_expand "aarch64_xtn2<mode>"
   }
 )
 
+(define_insn "*aarch64_narrow_trunc<mode>"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+          (truncate:<VNARROWQ>
+            (match_operand:VQN 1 "register_operand" "w"))
+	  (truncate:<VNARROWQ>
+	    (match_operand:VQN 2 "register_operand" "w"))))]
+  "TARGET_SIMD"
+  "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
 ;; Packing doubles.
 
 (define_expand "vec_pack_trunc_<mode>"
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 8dbeed3b0d4a44cdc17dd333ed397b39a33f386a..95b385c0c9405fe95fcd07262a9471ab13d5488e 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -270,6 +270,14 @@ (define_mode_iterator VDQHS [V4HI V8HI V2SI V4SI])
 ;; Advanced SIMD modes for H, S and D types.
 (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
 
+;; Modes for which we can narrow the element and increase the lane counts
+;; to preserve the same register size.
+(define_mode_attr VNARROWSIMD [(V4HI "V8QI") (V8HI "V16QI") (V4SI "V8HI")
+			       (V2SI "V4HI") (V2DI "V4SI")])
+
+(define_mode_attr Vnarrowsimd [(V4HI "v8qi") (V8HI "v16qi") (V4SI "v8hi")
+			       (V2SI "v4hi") (V2DI "v4si")])
+
 ;; Advanced SIMD and scalar integer modes for H and S.
 (define_mode_iterator VSDQ_HSI [V4HI V8HI V2SI V4SI HI SI])
 
diff --git a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
index 50ecab002a3552d37a5cc0d8921f42f6c3dba195..fa61196d3644caa48b12151e12b15dfeab8c7e71 100644
--- a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
+++ b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
@@ -225,7 +225,8 @@ TEST_2_UNARY (vqmovun, uint32x4_t, int64x2_t, s64, u32)
 /* { dg-final { scan-assembler-times "\\tuqshrn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqrshrn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tuqrshrn2\\tv" 6} }  */
-/* { dg-final { scan-assembler-times "\\txtn2\\tv" 12} }  */
+/* { dg-final { scan-assembler-times "\\txtn2\\tv" 6} }  */
+/* { dg-final { scan-assembler-times "\\tuzp1\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tuqxtn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqxtn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqxtun2\\tv" 6} }  */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..ed655cc970a602da4ace78dc8dbd64ab18b0d4ab
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
@@ -0,0 +1,12 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+void d2 (short * restrict a, int *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */


-- 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2>
  2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
                   ` (3 preceding siblings ...)
  2021-09-29 16:20 ` [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2 Tamar Christina
@ 2021-09-29 16:21 ` Tamar Christina
  2021-09-30  6:17   ` Richard Biener
  2021-09-29 16:21 ` [PATCH 6/7]AArch64 Add neg + cmle into cmgt Tamar Christina
  2021-09-29 16:21 ` [PATCH 7/7]AArch64 Combine cmeq 0 + not into cmtst Tamar Christina
  6 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:21 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, rguenther

[-- Attachment #1: Type: text/plain, Size: 16236 bytes --]

Hi All,

This optimizes the case where a mask Y which fulfills ~Y + 1 == pow2 is used to
clear some bits and the result is then compared against 0.  The sequence becomes
a single compare of the unmasked value against a different immediate.

We can do this for all unsigned compares and for signed we can do it for
comparisons of EQ and NE:

(x & (~255)) == 0 becomes x <= 255, which leaves it to the target to
deal with the comparison optimally.
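
As a quick sanity check of the unsigned case, the following C sketch compares
the two forms.  It is only an illustration: the function names are made up,
and it samples a range of x values rather than proving the identity.

```c
#include <stdint.h>

/* Nonzero if ~y + 1 is a power of two, i.e. y == ~(2^k - 1) for some k.  */
static int mask_qualifies (uint32_t y)
{
  uint32_t p = ~y + 1u;
  return p != 0 && (p & (p - 1u)) == 0;
}

/* Returns 1 if (x & y) == 0 and x <= ~y agree for every sampled x,
   0 as soon as one sample disagrees.  */
int forms_agree (uint32_t y)
{
  for (uint32_t i = 0; i < 4096; i++)
    {
      /* Sample small values and values near the top of the range.  */
      uint32_t xs[2] = { i, UINT32_MAX - i };
      for (int j = 0; j < 2; j++)
        if (((xs[j] & y) == 0) != (xs[j] <= (uint32_t) ~y))
          return 0;
    }
  return 1;
}
```

With y = ~255 the two forms agree on all samples.  With y = ~5, which does not
satisfy the power-of-two condition, x = 2 already gives different answers, so
the mask check is what makes the rewrite safe.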

This transformation has to be done in the mid-end because RTL does not carry
the signedness of the comparison operands, and because if the target needs the
immediate in a register, it should be hoisted out of the loop.

The RTL loop invariant hoisting is done before split1.
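
On the signedness point: for a negative x the masked value is nonzero, so the
rewritten compare only matches the original when x is viewed as unsigned.  A
small scalar sketch, with hypothetical helper names that are not part of the
patch:

```c
#include <stdint.h>

/* Original form on a signed element.  */
static int masked_eq_zero (int32_t x)
{
  return (x & ~255) == 0;
}

/* Rewritten form with the operand viewed as unsigned.  */
static int le_unsigned (int32_t x)
{
  return (uint32_t) x <= 255u;
}

/* Naive signed compare, shown only to demonstrate the mismatch.  */
static int le_signed (int32_t x)
{
  return x <= 255;
}
```

For x = -1 the original form and le_unsigned both yield 0, while le_signed
yields 1, so the comparison has to be emitted as an unsigned one.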

i.e.

void fun1(int32_t *x, int n)
{
    for (int i = 0; i < (n & -16); i++)
      x[i] = (x[i]&(~255)) == 0;
}

now generates:

.L3:
        ldr     q0, [x0]
        cmhs    v0.4s, v2.4s, v0.4s
        and     v0.16b, v1.16b, v0.16b
        str     q0, [x0], 16
        cmp     x0, x1
        bne     .L3

and floats the immediate out of the loop.

instead of:

.L3:
        ldr     q0, [x0]
        bic     v0.4s, #255
        cmeq    v0.4s, v0.4s, #0
        and     v0.16b, v1.16b, v0.16b
        str     q0, [x0], 16
        cmp     x0, x1
        bne     .L3

Bootstrapped and regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* match.pd: New bitmask compare pattern.

gcc/testsuite/ChangeLog:

	* gcc.dg/bic-bitmask-10.c: New test.
	* gcc.dg/bic-bitmask-11.c: New test.
	* gcc.dg/bic-bitmask-12.c: New test.
	* gcc.dg/bic-bitmask-2.c: New test.
	* gcc.dg/bic-bitmask-3.c: New test.
	* gcc.dg/bic-bitmask-4.c: New test.
	* gcc.dg/bic-bitmask-5.c: New test.
	* gcc.dg/bic-bitmask-6.c: New test.
	* gcc.dg/bic-bitmask-7.c: New test.
	* gcc.dg/bic-bitmask-8.c: New test.
	* gcc.dg/bic-bitmask-9.c: New test.
	* gcc.dg/bic-bitmask.h: New test.
	* gcc.target/aarch64/bic-bitmask-1.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/match.pd b/gcc/match.pd
index 0fcfd0ea62c043dc217d0d560ce5b7e569b70e7d..df9212cb27d172856b9d43b0875262f96e8993c4 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -4288,6 +4288,56 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
     (if (ic == ncmp)
      (ncmp @0 @1))))))
 
+/* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
+   where ~Y + 1 == pow2 and Z = ~Y.  */
+(for cmp (simple_comparison)
+ (simplify
+  (cmp (bit_and:c @0 VECTOR_CST@1) integer_zerop)
+   (if (VECTOR_INTEGER_TYPE_P (TREE_TYPE (@1))
+	&& uniform_vector_p (@1))
+    (with { tree elt = vector_cst_elt (@1, 0); }
+     (switch
+      (if (TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_uhwi_p (elt))
+	(with { unsigned HOST_WIDE_INT diff = tree_to_uhwi (elt);
+	        tree tdiff = wide_int_to_tree (TREE_TYPE (elt), (~diff) + 1);
+		tree newval = wide_int_to_tree (TREE_TYPE (elt), ~diff);
+		tree newmask = build_uniform_cst (TREE_TYPE (@1), newval); }
+	 (if (integer_pow2p (tdiff))
+	  (switch
+	   /* ((mask & x) < 0) -> 0.  */
+	   (if (cmp == LT_EXPR)
+	    { build_zero_cst (TREE_TYPE (@1)); })
+	   /* ((mask & x) <= 0) -> x <= ~mask.  */
+	   (if (cmp == LE_EXPR)
+	    (le @0 { newmask; }))
+	   /* ((mask & x) == 0) -> x <= ~mask.  */
+	   (if (cmp == EQ_EXPR)
+	    (le @0 { newmask; }))
+	   /* ((mask & x) != 0) -> x > ~mask.  */
+	   (if (cmp == NE_EXPR)
+	    (gt @0 { newmask; }))
+	   /* ((mask & x) >= 0) -> true for unsigned.  */
+	   (if (cmp == GE_EXPR)
+	    { constant_boolean_node (true, type); })
+	   /* ((mask & x) > 0) -> x > ~mask.  */
+	   (if (cmp == GT_EXPR)
+	    (gt @0 { newmask; }))))))
+      (if (!TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_shwi_p (elt))
+	(with { unsigned HOST_WIDE_INT diff = tree_to_shwi (elt);
+		tree ustype = unsigned_type_for (TREE_TYPE (elt));
+		tree uvtype = unsigned_type_for (TREE_TYPE (@1));
+	        tree tdiff = wide_int_to_tree (ustype, (~diff) + 1);
+	        tree udiff = wide_int_to_tree (ustype, ~diff);
+		tree cst = build_uniform_cst (uvtype, udiff); }
+	 (if (integer_pow2p (tdiff))
+	  (switch
+	    /* ((mask & x) == 0) -> (unsigned) x <= ~mask.  */
+	    (if (cmp == EQ_EXPR)
+	     (le (convert:uvtype @0) { cst; }))
+	    /* ((mask & x) != 0) -> (unsigned) x > ~mask.  */
+	    (if (cmp == NE_EXPR)
+	     (gt (convert:uvtype @0) { cst; })))))))))))
+
 /* Transform comparisons of the form X - Y CMP 0 to X CMP Y.
    ??? The transformation is valid for the other operators if overflow
    is undefined for the type, but performing it here badly interacts
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-10.c b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
new file mode 100644
index 0000000000000000000000000000000000000000..76a22a2313137a2a75dd711c2c15c2d3a34e15aa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#define TYPE int32_t
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ -256,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-11.c b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
new file mode 100644
index 0000000000000000000000000000000000000000..32553d7ba2f823f7a21237451990d0a216d2f912
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {>\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-12.c b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
new file mode 100644
index 0000000000000000000000000000000000000000..e10cbf7fabe2dbf7ce436cdf37b0f8b207c58408
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
@@ -0,0 +1,17 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+typedef unsigned int v4si __attribute__ ((vector_size (16)));
+
+__attribute__((noinline, noipa))
+void fun(v4si *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-2.c b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-3.c b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-4.c b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..1bcf23ccf1447d6c8c999ed1eb25ba0a450028e1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {=\s*.+\{ 1,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-5.c b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..6e5a2fca9992efbc01f8dbbc6f95936e86643028
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-6.c b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..018e7a4348c9fc461106c3d9d01291325d3406c2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-7.c b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
new file mode 100644
index 0000000000000000000000000000000000000000..798678fb7555052c93abc4ca34f617d640f73bb4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {__builtin_memset} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-8.c b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
new file mode 100644
index 0000000000000000000000000000000000000000..1dabe834ed57dfa0be48c1dc3dbb226092c79a1a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s*.+\{ 1,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967294,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-9.c b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
new file mode 100644
index 0000000000000000000000000000000000000000..9c1f8ee0adfc45d1b9fc212138ea26bb6b693e49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-not {<=\s*.+\{ 4294967289,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump {&\s*.+\{ 4294967290,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask.h b/gcc/testsuite/gcc.dg/bic-bitmask.h
new file mode 100644
index 0000000000000000000000000000000000000000..2b94065c025e0cbf71a21ac9b9d6314e24b0c2d9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask.h
@@ -0,0 +1,43 @@
+#include <stdio.h>
+
+#ifndef N
+#define N 50
+#endif
+
+#ifndef TYPE
+#define TYPE uint32_t
+#endif
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N);
+  fun2 (b, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
diff --git a/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..568c1ffc8bc4148efaeeba7a45a75ecbd3a7a3dd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 -save-temps" } */
+
+#include <arm_neon.h>
+
+uint32x4_t foo (int32x4_t a)
+{
+  int32x4_t cst = vdupq_n_s32 (255);
+  int32x4_t zero = vdupq_n_s32 (0);
+  return vceqq_s32 (vbicq_s32 (a, cst), zero);
+}
+
+/* { dg-final { scan-assembler-not {\tbic\t} { xfail { aarch64*-*-* } } } } */


-- 

[-- Attachment #2: rb14896.patch --]
[-- Type: text/x-diff, Size: 14260 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index 0fcfd0ea62c043dc217d0d560ce5b7e569b70e7d..df9212cb27d172856b9d43b0875262f96e8993c4 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -4288,6 +4288,56 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
     (if (ic == ncmp)
      (ncmp @0 @1))))))
 
+/* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
+   where ~Y + 1 == pow2 and Z = ~Y.  */
+(for cmp (simple_comparison)
+ (simplify
+  (cmp (bit_and:c @0 VECTOR_CST@1) integer_zerop)
+   (if (VECTOR_INTEGER_TYPE_P (TREE_TYPE (@1))
+	&& uniform_vector_p (@1))
+    (with { tree elt = vector_cst_elt (@1, 0); }
+     (switch
+      (if (TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_uhwi_p (elt))
+	(with { unsigned HOST_WIDE_INT diff = tree_to_uhwi (elt);
+	        tree tdiff = wide_int_to_tree (TREE_TYPE (elt), (~diff) + 1);
+		tree newval = wide_int_to_tree (TREE_TYPE (elt), ~diff);
+		tree newmask = build_uniform_cst (TREE_TYPE (@1), newval); }
+	 (if (integer_pow2p (tdiff))
+	  (switch
+	   /* ((mask & x) < 0) -> 0.  */
+	   (if (cmp == LT_EXPR)
+	    { build_zero_cst (TREE_TYPE (@1)); })
+	   /* ((mask & x) <= 0) -> x <= ~mask.  */
+	   (if (cmp == LE_EXPR)
+	    (le @0 { newmask; }))
+	   /* ((mask & x) == 0) -> x <= ~mask.  */
+	   (if (cmp == EQ_EXPR)
+	    (le @0 { newmask; }))
+	   /* ((mask & x) != 0) -> x > ~mask.  */
+	   (if (cmp == NE_EXPR)
+	    (gt @0 { newmask; }))
+	   /* ((mask & x) >= 0) -> true for unsigned.  */
+	   (if (cmp == GE_EXPR)
+	    { constant_boolean_node (true, type); })
+	   /* ((mask & x) > 0) -> x > ~mask.  */
+	   (if (cmp == GT_EXPR)
+	    (gt @0 { newmask; }))))))
+      (if (!TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_shwi_p (elt))
+	(with { unsigned HOST_WIDE_INT diff = tree_to_shwi (elt);
+		tree ustype = unsigned_type_for (TREE_TYPE (elt));
+		tree uvtype = unsigned_type_for (TREE_TYPE (@1));
+	        tree tdiff = wide_int_to_tree (ustype, (~diff) + 1);
+	        tree udiff = wide_int_to_tree (ustype, ~diff);
+		tree cst = build_uniform_cst (uvtype, udiff); }
+	 (if (integer_pow2p (tdiff))
+	  (switch
+	    /* ((mask & x) == 0) -> (unsigned) x <= ~mask.  */
+	    (if (cmp == EQ_EXPR)
+	     (le (convert:uvtype @0) { cst; }))
+	    /* ((mask & x) != 0) -> (unsigned) x > ~mask.  */
+	    (if (cmp == NE_EXPR)
+	     (gt (convert:uvtype @0) { cst; })))))))))))
+
 /* Transform comparisons of the form X - Y CMP 0 to X CMP Y.
    ??? The transformation is valid for the other operators if overflow
    is undefined for the type, but performing it here badly interacts
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-10.c b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
new file mode 100644
index 0000000000000000000000000000000000000000..76a22a2313137a2a75dd711c2c15c2d3a34e15aa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#define TYPE int32_t
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ -256,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-11.c b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
new file mode 100644
index 0000000000000000000000000000000000000000..32553d7ba2f823f7a21237451990d0a216d2f912
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {>\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-12.c b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
new file mode 100644
index 0000000000000000000000000000000000000000..e10cbf7fabe2dbf7ce436cdf37b0f8b207c58408
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
@@ -0,0 +1,17 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+typedef unsigned int v4si __attribute__ ((vector_size (16)));
+
+__attribute__((noinline, noipa))
+void fun(v4si *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-2.c b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-3.c b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-4.c b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..1bcf23ccf1447d6c8c999ed1eb25ba0a450028e1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {=\s*.+\{ 1,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-5.c b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..6e5a2fca9992efbc01f8dbbc6f95936e86643028
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-6.c b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..018e7a4348c9fc461106c3d9d01291325d3406c2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-7.c b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
new file mode 100644
index 0000000000000000000000000000000000000000..798678fb7555052c93abc4ca34f617d640f73bb4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {__builtin_memset} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-8.c b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
new file mode 100644
index 0000000000000000000000000000000000000000..1dabe834ed57dfa0be48c1dc3dbb226092c79a1a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s*.+\{ 1,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967294,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-9.c b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
new file mode 100644
index 0000000000000000000000000000000000000000..9c1f8ee0adfc45d1b9fc212138ea26bb6b693e49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-not {<=\s*.+\{ 4294967289,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump {&\s*.+\{ 4294967290,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask.h b/gcc/testsuite/gcc.dg/bic-bitmask.h
new file mode 100644
index 0000000000000000000000000000000000000000..2b94065c025e0cbf71a21ac9b9d6314e24b0c2d9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask.h
@@ -0,0 +1,43 @@
+#include <stdio.h>
+
+#ifndef N
+#define N 50
+#endif
+
+#ifndef TYPE
+#define TYPE uint32_t
+#endif
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N);
+  fun2 (b, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
diff --git a/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..568c1ffc8bc4148efaeeba7a45a75ecbd3a7a3dd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 -save-temps" } */
+
+#include <arm_neon.h>
+
+uint32x4_t foo (int32x4_t a)
+{
+  int32x4_t cst = vdupq_n_s32 (255);
+  int32x4_t zero = vdupq_n_s32 (0);
+  return vceqq_s32 (vbicq_s32 (a, cst), zero);
+}
+
+/* { dg-final { scan-assembler-not {\tbic\t} { xfail { aarch64*-*-* } } } } */



* [PATCH 6/7]AArch64 Add neg + cmle into cmgt
  2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
                   ` (4 preceding siblings ...)
  2021-09-29 16:21 ` [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2> Tamar Christina
@ 2021-09-29 16:21 ` Tamar Christina
  2021-09-30  9:34   ` Kyrylo Tkachov
  2021-09-29 16:21 ` [PATCH 7/7]AArch64 Combine cmeq 0 + not into cmtst Tamar Christina
  6 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:21 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 3321 bytes --]

Hi All,

This turns a negation followed by an arithmetic right shift by the sign bit
position into a comparison against 0.

i.e.

void fun1(int32_t *x, int n)
{
    for (int i = 0; i < (n & -16); i++)
      x[i] = (-x[i]) >> 31;
}

now generates:

.L3:
        ldr     q0, [x0]
        cmgt    v0.4s, v0.4s, #0
        str     q0, [x0], 16
        cmp     x0, x1
        bne     .L3

instead of:

.L3:
        ldr     q0, [x0]
        neg     v0.4s, v0.4s
        sshr    v0.4s, v0.4s, 31
        str     q0, [x0], 16
        cmp     x0, x1
        bne     .L3
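
As a scalar illustration of why the rewrite holds, the sketch below compares
the two forms for a single lane.  The helper names (neg_then_ashr, cmgt_zero,
check_neg_ashr) are made up for this note and are not part of the patch.  It
assumes >> on a negative int32_t is an arithmetic shift, as GCC implements it,
and it skips INT32_MIN because negating it overflows in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-lane view of the old sequence: neg, then sshr by 31.
   Assumes an arithmetic right shift for signed operands.  */
int32_t neg_then_ashr(int32_t x)
{
    return (-x) >> 31;
}

/* Per-lane view of cmgt against #0: all ones if x > 0, else zero.  */
int32_t cmgt_zero(int32_t x)
{
    return x > 0 ? -1 : 0;
}

/* Compare both forms on a handful of samples (INT32_MIN left out on
   purpose, since -INT32_MIN is undefined in C).  Returns 1 on agreement.  */
int check_neg_ashr(void)
{
    static const int32_t samples[] = {
        INT32_MIN + 1, -1000, -1, 0, 1, 2, 1000, INT32_MAX
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (neg_then_ashr(samples[i]) != cmgt_zero(samples[i]))
            return 0;
    return 1;
}
```

A shift by 30 instead of 31 does not produce a pure sign mask, which is why
fun2 in the test is not expected to use cmgt.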

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_simd_neg_ashr<mode>): New.
	* config/aarch64/predicates.md
	(aarch64_simd_shift_imm_vec_signbit): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/signbit-1.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 0045b100c6af1c007293ee26506199868be90e9f..9d936428b438c95b56614c94081d7e2ebc47d89f 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1137,6 +1137,18 @@ (define_insn "aarch64_simd_ashr<mode>"
   [(set_attr "type" "neon_compare<q>,neon_shift_imm<q>")]
 )
 
+;; Additional opt when we negate the sign bit and then shift right
+(define_insn "*aarch64_simd_neg_ashr<mode>"
+ [(set (match_operand:VDQ_I 0 "register_operand" "=w")
+       (ashiftrt:VDQ_I
+	 (neg:VDQ_I
+	   (match_operand:VDQ_I 1 "register_operand" "w"))
+	   (match_operand:VDQ_I 2 "aarch64_simd_shift_imm_vec_signbit" "D1")))]
+ "TARGET_SIMD"
+ "cmgt\t%0.<Vtype>, %1.<Vtype>, #0"
+  [(set_attr "type" "neon_compare_zero<q>")]
+)
+
 (define_insn "*aarch64_simd_sra<mode>"
  [(set (match_operand:VDQ_I 0 "register_operand" "=w")
 	(plus:VDQ_I
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 7fd4f9e7d06d3082d6f3047290f0446789e1d0d2..12e7d35da154b10f0190274d0279cab313563455 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -545,6 +545,12 @@ (define_predicate "aarch64_simd_shift_imm_offset_di"
   (and (match_code "const_int")
        (match_test "IN_RANGE (INTVAL (op), 1, 64)")))
 
+(define_predicate "aarch64_simd_shift_imm_vec_signbit"
+  (and (match_code "const_vector")
+       (match_test "aarch64_const_vec_all_same_in_range_p (op,
+			GET_MODE_UNIT_BITSIZE (mode) - 1,
+			GET_MODE_UNIT_BITSIZE (mode) - 1)")))
+
 (define_predicate "aarch64_simd_shift_imm_vec_exact_top"
   (and (match_code "const_vector")
        (match_test "aarch64_const_vec_all_same_in_range_p (op,
diff --git a/gcc/testsuite/gcc.target/aarch64/signbit-1.c b/gcc/testsuite/gcc.target/aarch64/signbit-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..3ebfb0586f37de29cf58635b27fe48503714447e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/signbit-1.c
@@ -0,0 +1,18 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps" } */
+
+#include <stdint.h>
+
+void fun1(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (-x[i]) >> 31;
+}
+
+void fun2(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (-x[i]) >> 30;
+}
+
+/* { dg-final { scan-assembler-times {\tcmgt\t} 1 } } */





* [PATCH 7/7]AArch64 Combine cmeq 0 + not into cmtst
  2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
                   ` (5 preceding siblings ...)
  2021-09-29 16:21 ` [PATCH 6/7]AArch64 Add neg + cmle into cmgt Tamar Christina
@ 2021-09-29 16:21 ` Tamar Christina
  2021-09-30  9:35   ` Kyrylo Tkachov
  6 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-09-29 16:21 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 2993 bytes --]

Hi All,

This turns a bitwise inverse of an equality comparison with 0 into a compare of
bitwise nonzero (cmtst).

We already have one pattern for cmtst; this adds another one which does not
require an additional bitwise and.

i.e.

#include <arm_neon.h>

uint8x8_t bar(int16x8_t abs_row0, int16x8_t row0) {
  uint16x8_t row0_diff =
    vreinterpretq_u16_s16(veorq_s16(abs_row0, vshrq_n_s16(row0, 15)));
  uint8x8_t abs_row0_gt0 =
    vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row0), vdupq_n_u16(0)));
  return abs_row0_gt0;
}

now generates:

bar:
        cmtst   v0.8h, v0.8h, v0.8h
        xtn     v0.8b, v0.8h
        ret

instead of:

bar:
        cmeq    v0.8h, v0.8h, #0
        not     v0.16b, v0.16b
        xtn     v0.8b, v0.8h
        ret
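
For reference, the per-lane equivalence can be checked exhaustively on 16-bit
lanes.  This is only a scalar sketch; the function names are invented for the
illustration and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Old sequence for one lane: cmeq against #0, then bitwise not.  */
uint16_t cmeq_zero_then_not(uint16_t x)
{
    uint16_t eq = (x == 0) ? 0xFFFFu : 0u;
    return (uint16_t) ~eq;
}

/* New sequence for one lane: cmtst of the value with itself,
   i.e. all ones when (x & x) is nonzero.  */
uint16_t cmtst_same(uint16_t x)
{
    return ((x & x) != 0) ? 0xFFFFu : 0u;
}

/* Walk every 16-bit value; returns 1 if both forms always agree.  */
int check_cmtst_same(void)
{
    for (uint32_t v = 0; v <= 0xFFFFu; v++)
        if (cmeq_zero_then_not((uint16_t) v) != cmtst_same((uint16_t) v))
            return 0;
    return 1;
}
```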

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_cmtst_same_<mode>): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/mvn-cmeq0-1.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9d936428b438c95b56614c94081d7e2ebc47d89f..bce01c36386074bf475b8b7e5c69a1959a13fef3 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -6585,6 +6585,23 @@ (define_insn "aarch64_cmtst<mode>"
   [(set_attr "type" "neon_tst<q>")]
 )
 
+;; One can also get a cmtst by having to combine a
+;; not (neq (eq x 0)) in which case you rewrite it to
+;; a comparison against itself
+
+(define_insn "*aarch64_cmtst_same_<mode>"
+  [(set (match_operand:<V_INT_EQUIV> 0 "register_operand" "=w")
+	(plus:<V_INT_EQUIV>
+	  (eq:<V_INT_EQUIV>
+	    (match_operand:VDQ_I 1 "register_operand" "w")
+	    (match_operand:VDQ_I 2 "aarch64_simd_imm_zero"))
+	  (match_operand:<V_INT_EQUIV> 3 "aarch64_simd_imm_minus_one")))
+  ]
+  "TARGET_SIMD"
+  "cmtst\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>1<Vmtype>"
+  [(set_attr "type" "neon_tst<q>")]
+)
+
 (define_insn_and_split "aarch64_cmtstdi"
   [(set (match_operand:DI 0 "register_operand" "=w,r")
 	(neg:DI
diff --git a/gcc/testsuite/gcc.target/aarch64/mvn-cmeq0-1.c b/gcc/testsuite/gcc.target/aarch64/mvn-cmeq0-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..59f3a230271c70d3bb51d0338d9ec2613bd4394b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mvn-cmeq0-1.c
@@ -0,0 +1,17 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#include <arm_neon.h>
+
+uint8x8_t bar(int16x8_t abs_row0, int16x8_t row0) {
+  uint16x8_t row0_diff =
+    vreinterpretq_u16_s16(veorq_s16(abs_row0, vshrq_n_s16(row0, 15)));
+  uint8x8_t abs_row0_gt0 =
+    vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row0), vdupq_n_u16(0)));
+  return abs_row0_gt0;
+}
+
+
+/* { dg-final { scan-assembler-times {\tcmtst\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tcmeq\t} } } */
+/* { dg-final { scan-assembler-not {\tnot\t} } } */





* Re: [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2>
  2021-09-29 16:21 ` [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2> Tamar Christina
@ 2021-09-30  6:17   ` Richard Biener
  2021-09-30  9:56     ` Tamar Christina
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Biener @ 2021-09-30  6:17 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd

On Wed, 29 Sep 2021, Tamar Christina wrote:

> Hi All,
> 
> This optimizes the case where a mask Y which fulfills ~Y + 1 == pow2 is used to
> clear some bits and the result is then compared against 0, turning it into a
> comparison without the masking against a different immediate.
> 
> We can do this for all unsigned compares and for signed we can do it for
> comparisons of EQ and NE:
> 
> (x & (~255)) == 0 becomes x <= 255, which leaves it to the target to
> optimally deal with the comparison.
> 
> This transformation has to be done in the mid-end because in RTL you don't have
> the signs of the comparison operands and if the target needs an immediate this
> should be floated outside of the loop.
> 
> The RTL loop invariant hoisting is done before split1.
> 
> i.e.
> 
> void fun1(int32_t *x, int n)
> {
>     for (int i = 0; i < (n & -16); i++)
>       x[i] = (x[i]&(~255)) == 0;
> }
> 
> now generates:
> 
> .L3:
>         ldr     q0, [x0]
>         cmhs    v0.4s, v2.4s, v0.4s
>         and     v0.16b, v1.16b, v0.16b
>         str     q0, [x0], 16
>         cmp     x0, x1
>         bne     .L3
> 
> and floats the immediate out of the loop.
> 
> instead of:
> 
> .L3:
>         ldr     q0, [x0]
>         bic     v0.4s, #255
>         cmeq    v0.4s, v0.4s, #0
>         and     v0.16b, v1.16b, v0.16b
>         str     q0, [x0], 16
>         cmp     x0, x1
>         bne     .L3
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu,
> x86_64-pc-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* match.pd: New bitmask compare pattern.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.dg/bic-bitmask-10.c: New test.
> 	* gcc.dg/bic-bitmask-11.c: New test.
> 	* gcc.dg/bic-bitmask-12.c: New test.
> 	* gcc.dg/bic-bitmask-2.c: New test.
> 	* gcc.dg/bic-bitmask-3.c: New test.
> 	* gcc.dg/bic-bitmask-4.c: New test.
> 	* gcc.dg/bic-bitmask-5.c: New test.
> 	* gcc.dg/bic-bitmask-6.c: New test.
> 	* gcc.dg/bic-bitmask-7.c: New test.
> 	* gcc.dg/bic-bitmask-8.c: New test.
> 	* gcc.dg/bic-bitmask-9.c: New test.
> 	* gcc.dg/bic-bitmask.h: New test.
> 	* gcc.target/aarch64/bic-bitmask-1.c: New test.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 0fcfd0ea62c043dc217d0d560ce5b7e569b70e7d..df9212cb27d172856b9d43b0875262f96e8993c4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -4288,6 +4288,56 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>      (if (ic == ncmp)
>       (ncmp @0 @1))))))
>  
> +/* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
> +   where ~Y + 1 == pow2 and Z = ~Y.  */
> +(for cmp (simple_comparison)
> + (simplify
> +  (cmp (bit_and:c @0 VECTOR_CST@1) integer_zerop)

Why not for INTEGER_CST as well?  We do have a related folding (only
for INTEGER_CST) that does

/* A & (2**N - 1) <= 2**K - 1 -> A & (2**N - 2**K) == 0
   A & (2**N - 1) >  2**K - 1 -> A & (2**N - 2**K) != 0

which could be extended for integer vectors.  That said, can you please
place the pattern next to the above?
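
To make the relation concrete, here is a small brute-force check of the first
line of that folding on scalars.  It is only a sketch: the function names are
invented for this note, and it covers 0 <= K <= N <= 12 over 16-bit inputs
rather than arbitrary precision.

```c
#include <assert.h>
#include <stdint.h>

/* Left-hand side of the existing folding: A & (2**N - 1) <= 2**K - 1.  */
int low_bits_le(uint32_t a, unsigned n, unsigned k)
{
    return (a & ((1u << n) - 1u)) <= ((1u << k) - 1u);
}

/* Right-hand side: A & (2**N - 2**K) == 0.  */
int mid_bits_clear(uint32_t a, unsigned n, unsigned k)
{
    return (a & ((1u << n) - (1u << k))) == 0;
}

/* Compare both sides for every K <= N <= 12 and every 16-bit A.
   Returns 1 if they always agree.  */
int check_related_folding(void)
{
    for (unsigned n = 0; n <= 12; n++)
        for (unsigned k = 0; k <= n; k++)
            for (uint32_t a = 0; a <= 0xFFFFu; a++)
                if (low_bits_le(a, n, k) != mid_bits_clear(a, n, k))
                    return 0;
    return 1;
}
```

The proposed pattern goes in the opposite direction (from the masked compare
against 0 to the range compare), with the mask covering all bits from 2**K up.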

Why does the transform only work for uniform vector constants?  (I see
that the implementation becomes simpler, but then you should also handle
the INTEGER_CST case at least)

> +   (if (VECTOR_INTEGER_TYPE_P (TREE_TYPE (@1))
> +	&& uniform_vector_p (@1))
> +    (with { tree elt = vector_cst_elt (@1, 0); }
> +     (switch
> +      (if (TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_uhwi_p (elt))

avoid tree_fits_uhwi_p and use wide_int here

> +	(with { unsigned HOST_WIDE_INT diff = tree_to_uhwi (elt);
> +	        tree tdiff = wide_int_to_tree (TREE_TYPE (elt), (~diff) + 1);
> +		tree newval = wide_int_to_tree (TREE_TYPE (elt), ~diff);
> +		tree newmask = build_uniform_cst (TREE_TYPE (@1), newval); }
> +	 (if (integer_pow2p (tdiff))

You don't seem to use 'tdiff' so please do this check in wide_int
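
For illustration only, the condition being tested (~Y + 1 is a power of two)
looks like this on a plain 32-bit scalar.  This is not wide_int code and the
helper name is hypothetical; it just shows which masks qualify.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when ~mask + 1 is a power of two,
   i.e. when mask clears exactly the low k bits for some k.  */
int mask_clears_low_bits(uint32_t mask)
{
    uint32_t diff = ~mask + 1u;   /* wraps modulo 2**32 */
    return diff != 0 && (diff & (diff - 1u)) == 0;
}
```

With this check ~255 and ~1 qualify (256 and 2 are powers of two), while ~5
does not (6 is not), which matches the bic-bitmask-9.c test expecting the
masking to stay.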

> +	  (switch
> +	   /* ((mask & x) < 0) -> 0.  */
> +	   (if (cmp == LT_EXPR)
> +	    { build_zero_cst (TREE_TYPE (@1)); })
> +	   /* ((mask & x) <= 0) -> x < mask.  */
> +	   (if (cmp == LE_EXPR)
> +	    (lt @0 { newmask; }))
> +	   /* ((mask & x) == 0) -> x < mask.  */
> +	   (if (cmp == EQ_EXPR)
> +	    (le @0 { newmask; }))
> +	   /* ((mask & x) != 0) -> x > mask.  */
> +	   (if (cmp == NE_EXPR)
> +	    (gt @0 { newmask; }))
> +	   /* ((mask & x) >= 0) -> x <= mask.  */
> +	   (if (cmp == GE_EXPR)
> +	    (le @0 { newmask; }))
> +	    /* ((mask & x) > 0) -> x < mask.  */
> +	   (if (cmp == GT_EXPR)
> +	    (lt @0 { newmask; }))))))

you can avoid this switch with a lock-step (for, that maps 'cmp'
to the result comparison code (for simplicity you can either keep
the LT_EXPR special-case or transform to an always true condition
which will be simplified).
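
As a scalar sanity check of what such a lock-step table could contain for
unsigned operands, the sketch below brute-forces each comparison against 0
and its candidate replacement.  The enum, the table and the function names
are all made up for this illustration and are not match.pd code; the mapping
is derived from the arithmetic, not taken from the patch.  For unsigned
values <= 0 and > 0 collapse to == 0 and != 0, < 0 is always false and >= 0
is always true.

```c
#include <assert.h>
#include <stdint.h>

enum cmp_code { CMP_LT, CMP_LE, CMP_EQ, CMP_NE, CMP_GE, CMP_GT,
                CMP_FALSE, CMP_TRUE };

/* Candidate lock-step mapping for unsigned (x & mask) CMP 0 -> x CMP2 ~mask.  */
static const enum cmp_code lockstep[6] = {
    [CMP_LT] = CMP_FALSE,
    [CMP_LE] = CMP_LE,
    [CMP_EQ] = CMP_LE,
    [CMP_NE] = CMP_GT,
    [CMP_GE] = CMP_TRUE,
    [CMP_GT] = CMP_GT,
};

/* Evaluate one comparison code on unsigned operands.  */
int apply_cmp(enum cmp_code c, uint32_t lhs, uint32_t rhs)
{
    switch (c) {
    case CMP_LT:   return lhs <  rhs;
    case CMP_LE:   return lhs <= rhs;
    case CMP_EQ:   return lhs == rhs;
    case CMP_NE:   return lhs != rhs;
    case CMP_GE:   return lhs >= rhs;
    case CMP_GT:   return lhs >  rhs;
    case CMP_TRUE: return 1;
    default:       return 0;   /* CMP_FALSE */
    }
}

/* Check the table for a mask clearing the low `low_bits` bits (1..31).
   Returns 1 if every sampled x agrees for all six comparisons.  */
int check_lockstep(unsigned low_bits)
{
    uint32_t z = (1u << low_bits) - 1u;   /* ~mask */
    uint32_t mask = ~z;
    for (uint32_t i = 0; i <= 0xFFFFu; i++) {
        uint32_t xs[3] = { i, i << 16, ~i };
        for (int j = 0; j < 3; j++)
            for (int c = CMP_LT; c <= CMP_GT; c++)
                if (apply_cmp((enum cmp_code) c, xs[j] & mask, 0u)
                    != apply_cmp(lockstep[c], xs[j], z))
                    return 0;
    }
    return 1;
}
```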

> +      (if (!TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_shwi_p (elt))
> +	(with { unsigned HOST_WIDE_INT diff = tree_to_shwi (elt);
> +		tree ustype = unsigned_type_for (TREE_TYPE (elt));
> +		tree uvtype = unsigned_type_for (TREE_TYPE (@1));
> +	        tree tdiff = wide_int_to_tree (ustype, (~diff) + 1);
> +	        tree udiff = wide_int_to_tree (ustype, ~diff);
> +		tree cst = build_uniform_cst (uvtype, udiff); }
> +	 (if (integer_pow2p (tdiff))
> +	  (switch
> +	    /* ((mask & x) == 0) -> x < mask.  */
> +	    (if (cmp == EQ_EXPR)
> +	     (le (convert:uvtype @0) { cst; }))
> +	    /* ((mask & x) != 0) -> x > mask.  */
> +	    (if (cmp == NE_EXPR)
> +	     (gt (convert:uvtype @0) { cst; })))))))))))
> +
>  /* Transform comparisons of the form X - Y CMP 0 to X CMP Y.
>     ??? The transformation is valid for the other operators if overflow
>     is undefined for the type, but performing it here badly interacts
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-10.c b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..76a22a2313137a2a75dd711c2c15c2d3a34e15aa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> @@ -0,0 +1,26 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(int32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(int32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#define TYPE int32_t
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-11.c b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..32553d7ba2f823f7a21237451990d0a216d2f912
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) != 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) != 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump {>\s*.+\{ 255,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-12.c b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..e10cbf7fabe2dbf7ce436cdf37b0f8b207c58408
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> @@ -0,0 +1,17 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +typedef unsigned int v4si __attribute__ ((vector_size (16)));
> +
> +__attribute__((noinline, noipa))
> +void fun(v4si *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-2.c b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-3.c b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-4.c b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..1bcf23ccf1447d6c8c999ed1eb25ba0a450028e1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) >= 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) >= 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {=\s*.+\{ 1,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-5.c b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..6e5a2fca9992efbc01f8dbbc6f95936e86643028
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) > 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) > 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {>\s*.+\{ 255,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-6.c b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..018e7a4348c9fc461106c3d9d01291325d3406c2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) <= 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) <= 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-7.c b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..798678fb7555052c93abc4ca34f617d640f73bb4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) < 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) < 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {__builtin_memset} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-8.c b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..1dabe834ed57dfa0be48c1dc3dbb226092c79a1a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) != 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) != 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {>\s*.+\{ 1,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967294,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-9.c b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..9c1f8ee0adfc45d1b9fc212138ea26bb6b693e49
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~5)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~5)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-not {<=\s*.+\{ 4294967289,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask.h b/gcc/testsuite/gcc.dg/bic-bitmask.h
> new file mode 100644
> index 0000000000000000000000000000000000000000..2b94065c025e0cbf71a21ac9b9d6314e24b0c2d9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask.h
> @@ -0,0 +1,43 @@
> +#include <stdio.h>
> +
> +#ifndef N
> +#define N 50
> +#endif
> +
> +#ifndef TYPE
> +#define TYPE uint32_t
> +#endif
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +int main ()
> +{
> +  TYPE a[N];
> +  TYPE b[N];
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 13;
> +      b[i] = BASE + i * 13;
> +      if (DEBUG)
> +        printf ("%d: 0x%x\n", i, a[i]);
> +    }
> +
> +  fun1 (a, N);
> +  fun2 (b, N);
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      if (DEBUG)
> +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> +      if (a[i] != b[i])
> +        __builtin_abort ();
> +    }
> +  return 0;
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..568c1ffc8bc4148efaeeba7a45a75ecbd3a7a3dd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> @@ -0,0 +1,13 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O2 -save-temps" } */
> +
> +#include <arm_neon.h>
> +
> +uint32x4_t foo (int32x4_t a)
> +{
> +  int32x4_t cst = vdupq_n_s32 (255);
> +  int32x4_t zero = vdupq_n_s32 (0);
> +  return vceqq_s32 (vbicq_s32 (a, cst), zero);
> +}
> +
> +/* { dg-final { scan-assembler-not {\tbic\t} { xfail { aarch64*-*-* } } } } */
> 
> 
> 
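
For reference, the scalar identities that the bic-bitmask tests above appear to rely on can be sketched in plain C. This is only an illustrative model and not part of the patch; the helper names are made up here, and it assumes a 32-bit unsigned element type as in the tests.

```c
#include <stdint.h>

/* Masked forms, as written in the bic-bitmask tests.  */
static int masked_gt (uint32_t x) { return (x & ~255u) > 0; }
static int masked_le (uint32_t x) { return (x & ~255u) <= 0; }
static int masked_ne (uint32_t x) { return (x & ~1u) != 0; }

/* Plain comparisons against the inverted mask that the tests scan for.  */
static int plain_gt (uint32_t x) { return x > 255; }
static int plain_le (uint32_t x) { return x <= 255; }
static int plain_ne (uint32_t x) { return x > 1; }

/* Returns 1 if every masked form agrees with its plain form for x.  */
int bitmask_forms_agree (uint32_t x)
{
  return masked_gt (x) == plain_gt (x)
	 && masked_le (x) == plain_le (x)
	 && masked_ne (x) == plain_ne (x);
}
```

The rewrite only works when the inverted mask is a run of low bits (255, 1), which is presumably why the ~5 test expects the bitwise and to stay. For an unsigned type, the `< 0` variant is always false, which matches the memset expected in bic-bitmask-7.c.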

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


* RE: [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow
  2021-09-29 16:19 ` [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow Tamar Christina
@ 2021-09-30  8:50   ` Kyrylo Tkachov
  2021-10-06 14:32     ` Richard Sandiford
  0 siblings, 1 reply; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-09-30  8:50 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford

Hi Tamar,

> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Wednesday, September 29, 2021 5:19 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH 1/7]AArch64 Add combine patterns for right shift and
> narrow
> 
> Hi All,
> 
> This adds a simple pattern for combining right shifts and narrows into
> shifted narrows.
> 
> i.e.
> 
> typedef short int16_t;
> typedef unsigned short uint16_t;
> 
> void foo (uint16_t * restrict a, int16_t * restrict d, int n)
> {
>     for( int i = 0; i < n; i++ )
>       d[i] = (a[i] * a[i]) >> 10;
> }
> 
> now generates:
> 
> .L4:
>         ldr     q0, [x0, x3]
>         umull   v1.4s, v0.4h, v0.4h
>         umull2  v0.4s, v0.8h, v0.8h
>         shrn    v1.4h, v1.4s, 10
>         shrn2   v1.8h, v0.4s, 10
>         str     q1, [x1, x3]
>         add     x3, x3, 16
>         cmp     x4, x3
>         bne     .L4
> 
> instead of:
> 
> .L4:
>         ldr     q0, [x0, x3]
>         umull   v1.4s, v0.4h, v0.4h
>         umull2  v0.4s, v0.8h, v0.8h
>         sshr    v1.4s, v1.4s, 10
>         sshr    v0.4s, v0.4s, 10
>         xtn     v1.4h, v1.4s
>         xtn2    v1.8h, v0.4s
>         str     q1, [x1, x3]
>         add     x3, x3, 16
>         cmp     x4, x3
>         bne     .L4
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> (*aarch64_<srn_op>shrn<mode>_vect,
> 	*aarch64_<srn_op>shrn<mode>2_vect): New.
> 	* config/aarch64/iterators.md (srn_op): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/shrn-combine.c: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 48eddf64e05afe3788abfa05141f6544a9323ea1..d7b6cae424622d259f97a3d5
> fa9093c0fb0bd5ce 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1818,6 +1818,28 @@ (define_insn "aarch64_shrn<mode>_insn_be"
>    [(set_attr "type" "neon_shift_imm_narrow_q")]
>  )
> 
> +(define_insn "*aarch64_<srn_op>shrn<mode>_vect"
> +  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=w")
> +        (truncate:<VNARROWQ>
> +          (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
> +            (match_operand:VQN 2
> "aarch64_simd_shift_imm_vec_<vn_mode>"))))]
> +  "TARGET_SIMD"
> +  "shrn\\t%0.<Vntype>, %1.<Vtype>, %2"
> +  [(set_attr "type" "neon_shift_imm_narrow_q")]
> +)
> +
> +(define_insn "*aarch64_<srn_op>shrn<mode>2_vect"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +	  (match_operand:<VNARROWQ> 1 "register_operand" "0")
> +	  (truncate:<VNARROWQ>
> +	    (SHIFTRT:VQN (match_operand:VQN 2 "register_operand" "w")
> +	      (match_operand:VQN 3
> "aarch64_simd_shift_imm_vec_<vn_mode>")))))]
> +  "TARGET_SIMD"
> +  "shrn2\\t%0.<V2ntype>, %2.<Vtype>, %3"
> +  [(set_attr "type" "neon_shift_imm_narrow_q")]
> +)

I think this needs to be guarded on !BYTES_BIG_ENDIAN and a similar pattern added for BYTES_BIG_ENDIAN with the vec_concat operands swapped around.
This is similar to the aarch64_xtn2<mode>_insn_be pattern, for example.

Thanks,
Kyrill
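
For readers following along, a rough scalar model of what the shrn/shrn2 pair computes may help. It is a sketch under my own assumptions (32-bit to 16-bit narrowing, shift amounts 1..16, hypothetical function names), not the patch's semantics as written in RTL.

```c
#include <stdint.h>

/* Scalar model of SHRN: shift each 32-bit lane right by n (1..16), keep
   the low 16 bits, and write them to the low half of the result.  */
void model_shrn (uint16_t out[8], const uint32_t in[4], unsigned n)
{
  for (int i = 0; i < 4; i++)
    out[i] = (uint16_t) (in[i] >> n);
}

/* Scalar model of SHRN2: same computation, written to the high half of
   the result while the low half is left untouched.  */
void model_shrn2 (uint16_t out[8], const uint32_t in[4], unsigned n)
{
  for (int i = 0; i < 4; i++)
    out[4 + i] = (uint16_t) (in[i] >> n);
}

/* Returns 1 if an arithmetic and a logical shift by n (1..16) give the
   same narrowed value.  Right-shifting a negative value is
   implementation-defined in ISO C; GCC defines it as arithmetic.  */
int shifts_agree_after_narrow (int32_t x, unsigned n)
{
  uint16_t arith = (uint16_t) (x >> n);
  uint16_t logical = (uint16_t) ((uint32_t) x >> n);
  return arith == logical;
}
```

The last helper illustrates why one narrowing instruction can serve both shift codes: for shifts up to the narrow element width, the bits where the two shifts differ are truncated away.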

> +
>  (define_expand "aarch64_shrn<mode>"
>    [(set (match_operand:<VNARROWQ> 0 "register_operand")
>  	(truncate:<VNARROWQ>
> diff --git a/gcc/config/aarch64/iterators.md
> b/gcc/config/aarch64/iterators.md
> index
> caa42f8f169fbf2cf46a90cf73dee05619acc300..8dbeed3b0d4a44cdc17dd333e
> d397b39a33f386a 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -2003,6 +2003,9 @@ (define_code_attr shift [(ashift "lsl") (ashiftrt "asr")
>  ;; Op prefix for shift right and accumulate.
>  (define_code_attr sra_op [(ashiftrt "s") (lshiftrt "u")])
> 
> +;; op prefix for shift right and narrow.
> +(define_code_attr srn_op [(ashiftrt "r") (lshiftrt "")])
> +
>  ;; Map shift operators onto underlying bit-field instructions
>  (define_code_attr bfshift [(ashift "ubfiz") (ashiftrt "sbfx")
>  			   (lshiftrt "ubfx") (rotatert "extr")])
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..0187f49f4dcc76182c90366c
> aaf00d294e835707
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine.c
> @@ -0,0 +1,14 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +typedef short int16_t;
> +typedef unsigned short uint16_t;
> +
> +void foo (uint16_t * restrict a, int16_t * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = (a[i] * a[i]) >> 10;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tshrn2\t} 1 } } */
> 
> 
> --


* RE: [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle)
  2021-09-29 16:19 ` [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle) Tamar Christina
@ 2021-09-30  8:54   ` Kyrylo Tkachov
  2021-10-12 16:23     ` Tamar Christina
  0 siblings, 1 reply; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-09-30  8:54 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Wednesday, September 29, 2021 5:20 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of
> half top bits (shuffle)
> 
> Hi All,
> 
> When doing a (narrowing) right shift by half the width of the original type
> we are essentially shuffling the top bits from the first number down.
> 
> If we have a hi/lo pair we can just use a single shuffle instead of needing
> two shifts.
> 
> i.e.
> 
> typedef short int16_t;
> typedef unsigned short uint16_t;
> 
> void foo (uint16_t * restrict a, int16_t * restrict d, int n)
> {
>     for( int i = 0; i < n; i++ )
>       d[i] = (a[i] * a[i]) >> 16;
> }
> 
> now generates:
> 
> .L4:
>         ldr     q0, [x0, x3]
>         umull   v1.4s, v0.4h, v0.4h
>         umull2  v0.4s, v0.8h, v0.8h
>         uzp2    v0.8h, v1.8h, v0.8h
>         str     q0, [x1, x3]
>         add     x3, x3, 16
>         cmp     x4, x3
>         bne     .L4
> 
> instead of
> 
> .L4:
>         ldr     q0, [x0, x3]
>         umull   v1.4s, v0.4h, v0.4h
>         umull2  v0.4s, v0.8h, v0.8h
>         sshr    v1.4s, v1.4s, 16
>         sshr    v0.4s, v0.4s, 16
>         xtn     v1.4h, v1.4s
>         xtn2    v1.8h, v0.4s
>         str     q1, [x1, x3]
>         add     x3, x3, 16
>         cmp     x4, x3
>         bne     .L4
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 

Ok.
Thanks,
Kyrill
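
As a sanity check of the idea for the plain (non-rounding) shift, here is a small scalar sketch showing that shifting right by 16 and narrowing picks the same halfwords as a uzp2 over 16-bit lanes. It assumes little-endian lane numbering and uses hypothetical helper names; it is not taken from the patch.

```c
#include <stdint.h>

/* View four 32-bit lanes as eight 16-bit lanes in (little-endian) lane
   order: lane 2*i is the low half of in[i], lane 2*i+1 the high half.  */
static void as_halfwords (uint16_t h[8], const uint32_t in[4])
{
  for (int i = 0; i < 4; i++)
    {
      h[2 * i] = (uint16_t) in[i];
      h[2 * i + 1] = (uint16_t) (in[i] >> 16);
    }
}

/* Scalar model of UZP2 on .8h: odd-numbered lanes of a, then of b.  */
void model_uzp2 (uint16_t out[8], const uint32_t a[4], const uint32_t b[4])
{
  uint16_t ha[8], hb[8];
  as_halfwords (ha, a);
  as_halfwords (hb, b);
  for (int i = 0; i < 4; i++)
    {
      out[i] = ha[2 * i + 1];
      out[4 + i] = hb[2 * i + 1];
    }
}

/* Reference: shift right by 16 and narrow, a into the low half of the
   result and b into the high half.  */
void shift16_narrow (uint16_t out[8], const uint32_t a[4], const uint32_t b[4])
{
  for (int i = 0; i < 4; i++)
    {
      out[i] = (uint16_t) (a[i] >> 16);
      out[4 + i] = (uint16_t) (b[i] >> 16);
    }
}
```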

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> 	(*aarch64_<srn_op>topbits_shuffle<mode>,
> 	*aarch64_topbits_shuffle<mode>): New.
> 	* config/aarch64/predicates.md
> 	(aarch64_simd_shift_imm_vec_exact_top): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/shrn-combine-2.c: New test.
> 	* gcc.target/aarch64/shrn-combine-3.c: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> d7b6cae424622d259f97a3d5fa9093c0fb0bd5ce..300bf001b59ca7fa197c580b
> 10adb7f70f20d1e0 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1840,6 +1840,36 @@ (define_insn
> "*aarch64_<srn_op>shrn<mode>2_vect"
>    [(set_attr "type" "neon_shift_imm_narrow_q")]
>  )
> 
> +(define_insn "*aarch64_<srn_op>topbits_shuffle<mode>"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +          (truncate:<VNARROWQ>
> +            (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
> +	      (match_operand:VQN 2
> "aarch64_simd_shift_imm_vec_exact_top")))
> +	  (truncate:<VNARROWQ>
> +	    (SHIFTRT:VQN (match_operand:VQN 3 "register_operand" "w")
> +	      (match_dup 2)))))]
> +  "TARGET_SIMD"
> +  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
> +(define_insn "*aarch64_topbits_shuffle<mode>"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +          (unspec:<VNARROWQ> [
> +              (match_operand:VQN 1 "register_operand" "w")
> +	      (match_operand:VQN 2
> "aarch64_simd_shift_imm_vec_exact_top")
> +	     ] UNSPEC_RSHRN)
> +	  (unspec:<VNARROWQ> [
> +	      (match_operand:VQN 3 "register_operand" "w")
> +	      (match_dup 2)
> +	     ] UNSPEC_RSHRN)))]
> +  "TARGET_SIMD"
> +  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
>  (define_expand "aarch64_shrn<mode>"
>    [(set (match_operand:<VNARROWQ> 0 "register_operand")
>  	(truncate:<VNARROWQ>
> diff --git a/gcc/config/aarch64/predicates.md
> b/gcc/config/aarch64/predicates.md
> index
> 49f02ae0381359174fed80c2a2264295c75bc189..7fd4f9e7d06d3082d6f30472
> 90f0446789e1d0d2 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -545,6 +545,12 @@ (define_predicate
> "aarch64_simd_shift_imm_offset_di"
>    (and (match_code "const_int")
>         (match_test "IN_RANGE (INTVAL (op), 1, 64)")))
> 
> +(define_predicate "aarch64_simd_shift_imm_vec_exact_top"
> +  (and (match_code "const_vector")
> +       (match_test "aarch64_const_vec_all_same_in_range_p (op,
> +			GET_MODE_UNIT_BITSIZE (GET_MODE (op)) / 2,
> +			GET_MODE_UNIT_BITSIZE (GET_MODE (op)) / 2)")))
> +
>  (define_predicate "aarch64_simd_shift_imm_vec_qi"
>    (and (match_code "const_vector")
>         (match_test "aarch64_const_vec_all_same_in_range_p (op, 1, 8)")))
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..924b3b849e449082b8c0b7
> dc6b955a2bad8d0911
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
> @@ -0,0 +1,15 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +typedef short int16_t;
> +typedef unsigned short uint16_t;
> +
> +void foo (uint16_t * restrict a, int16_t * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = (a[i] * a[i]) >> 16;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..929a55c5c338844e6a5c5ad
> 249af482286ab9c61
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
> @@ -0,0 +1,14 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +
> +#include <arm_neon.h>
> +
> +uint16x8_t foo (uint32x4_t a, uint32x4_t b)
> +{
> +  return vrshrn_high_n_u32 (vrshrn_n_u32 (a, 16), b, 16);
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
> 
> 
> --


* RE: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
  2021-09-29 16:20 ` [PATCH 3/7]AArch64 Add pattern for sshr to cmlt Tamar Christina
@ 2021-09-30  9:27   ` Kyrylo Tkachov
  2021-10-11 19:56     ` Andrew Pinski
  0 siblings, 1 reply; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-09-30  9:27 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford, apinski



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Wednesday, September 29, 2021 5:20 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
> 
> Hi All,
> 
> This optimizes a signed right shift by BITSIZE-1 into a cmlt operation, which
> is preferable because compares generally have a higher throughput than
> shifts.
> 
> On AArch64 the result of the shift would have been either -1 or 0, which is
> also the result of the compare.
> 
> i.e.
> 
> void e (int * restrict a, int *b, int n)
> {
>     for (int i = 0; i < n; i++)
>       b[i] = a[i] >> 31;
> }
> 
> now generates:
> 
> .L4:
>         ldr     q0, [x0, x3]
>         cmlt    v0.4s, v0.4s, #0
>         str     q0, [x1, x3]
>         add     x3, x3, 16
>         cmp     x4, x3
>         bne     .L4
> 
> instead of:
> 
> .L4:
>         ldr     q0, [x0, x3]
>         sshr    v0.4s, v0.4s, 31
>         str     q0, [x1, x3]
>         add     x3, x3, 16
>         cmp     x4, x3
>         bne     .L4
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

This should be okay (either a win or neutral) for Arm Cortex and Neoverse cores so I'm tempted to not ask for a CPU-specific tunable to guard it to keep the code clean.
Andrew, would this change be okay from a Thunder X line perspective?
Thanks,
Kyrill
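
For the record, the equivalence being relied on can be modelled per lane in scalar C. This is just a sketch with made-up function names; note that right-shifting a negative value is implementation-defined in ISO C, and the model assumes GCC's arithmetic-shift behaviour.

```c
#include <stdint.h>

/* Arithmetic shift by BITSIZE-1: copies the sign bit into every bit, so
   the result is -1 for negative x and 0 otherwise (assuming an
   arithmetic right shift, as GCC provides).  */
int32_t sign_by_shift (int32_t x)
{
  return x >> 31;
}

/* Per-lane model of CMLT #0: all ones if x < 0, all zeros otherwise.  */
int32_t sign_by_compare (int32_t x)
{
  return x < 0 ? -1 : 0;
}
```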

> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md (aarch64_simd_ashr<mode>): Add cmp
> 	case.
> 	* config/aarch64/constraints.md (D1): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/shl-combine-2.c: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 300bf001b59ca7fa197c580b10adb7f70f20d1e0..19b2d0ad4dab4d574269829
> 7ded861228ee22007 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1127,12 +1127,14 @@ (define_insn "aarch64_simd_lshr<mode>"
>  )
> 
>  (define_insn "aarch64_simd_ashr<mode>"
> - [(set (match_operand:VDQ_I 0 "register_operand" "=w")
> -       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
> -		     (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm"
> "Dr")))]
> + [(set (match_operand:VDQ_I 0 "register_operand" "=w,w")
> +       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w,w")
> +		     (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm"
> "D1,Dr")))]
>   "TARGET_SIMD"
> - "sshr\t%0.<Vtype>, %1.<Vtype>, %2"
> -  [(set_attr "type" "neon_shift_imm<q>")]
> + "@
> +  cmlt\t%0.<Vtype>, %1.<Vtype>, #0
> +  sshr\t%0.<Vtype>, %1.<Vtype>, %2"
> +  [(set_attr "type" "neon_compare<q>,neon_shift_imm<q>")]
>  )
> 
>  (define_insn "*aarch64_simd_sra<mode>"
> diff --git a/gcc/config/aarch64/constraints.md
> b/gcc/config/aarch64/constraints.md
> index
> 3b49b452119c49320020fa9183314d9a25b92491..18630815ffc13f2168300a89
> 9db69fd428dfb0d6 100644
> --- a/gcc/config/aarch64/constraints.md
> +++ b/gcc/config/aarch64/constraints.md
> @@ -437,6 +437,14 @@ (define_constraint "Dl"
>        (match_test "aarch64_simd_shift_imm_p (op, GET_MODE (op),
>  						 true)")))
> 
> +(define_constraint "D1"
> +  "@internal
> + A constraint that matches vector of immediates that is bits(mode)-1."
> + (and (match_code "const,const_vector")
> +      (match_test "aarch64_const_vec_all_same_in_range_p (op,
> +			GET_MODE_UNIT_BITSIZE (mode) - 1,
> +			GET_MODE_UNIT_BITSIZE (mode) - 1)")))
> +
>  (define_constraint "Dr"
>    "@internal
>   A constraint that matches vector of immediates for right shifts."
> diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..bdfe35d09ffccc7928947c9e
> 57f1034f7ca2c798
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +void e (int * restrict a, int *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      b[i] = a[i] >> 31;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tsshr\t} } } */
> +
> 
> 
> --


* RE: [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
  2021-09-29 16:20 ` [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2 Tamar Christina
@ 2021-09-30  9:28   ` Kyrylo Tkachov
  2021-10-12 16:25     ` Tamar Christina
  0 siblings, 1 reply; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-09-30  9:28 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Wednesday, September 29, 2021 5:20 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
> 
> Hi All,
> 
> This turns a hi/lo pair of truncate operations into a single permute at half
> the element size of the input, simply ignoring the top bits (which are
> truncated away).
> 
> i.e.
> 
> void d2 (short * restrict a, int *b, int n)
> {
>     for (int i = 0; i < n; i++)
>       a[i] = b[i];
> }
> 
> now generates:
> 
> .L4:
>         ldp     q0, q1, [x3]
>         add     x3, x3, 32
>         uzp1    v0.8h, v0.8h, v1.8h
>         str     q0, [x5], 16
>         cmp     x4, x3
>         bne     .L4
> 
> instead of
> 
> .L4:
>         ldp     q0, q1, [x3]
>         add     x3, x3, 32
>         xtn     v0.4h, v0.4s
>         xtn2    v0.8h, v1.4s
>         str     q0, [x5], 16
>         cmp     x4, x3
>         bne     .L4
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 

Ok.
Thanks,
Kyrill
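
A scalar sketch of why the single permute is enough: truncating each 32-bit lane keeps its low halfword, which is exactly the even-numbered 16-bit lane that uzp1 selects. The model assumes little-endian lane numbering and the function names are hypothetical.

```c
#include <stdint.h>

/* View four 32-bit lanes as eight 16-bit lanes in (little-endian) lane
   order: lane 2*i is the low half of in[i], lane 2*i+1 the high half.  */
static void as_halfwords (uint16_t h[8], const uint32_t in[4])
{
  for (int i = 0; i < 4; i++)
    {
      h[2 * i] = (uint16_t) in[i];
      h[2 * i + 1] = (uint16_t) (in[i] >> 16);
    }
}

/* Scalar model of UZP1 on .8h: even-numbered lanes of a, then of b.  */
void model_uzp1 (uint16_t out[8], const uint32_t a[4], const uint32_t b[4])
{
  uint16_t ha[8], hb[8];
  as_halfwords (ha, a);
  as_halfwords (hb, b);
  for (int i = 0; i < 4; i++)
    {
      out[i] = ha[2 * i];
      out[4 + i] = hb[2 * i];
    }
}

/* Reference: XTN of a into the low half, XTN2 of b into the high half.  */
void model_xtn_pair (uint16_t out[8], const uint32_t a[4], const uint32_t b[4])
{
  for (int i = 0; i < 4; i++)
    {
      out[i] = (uint16_t) a[i];
      out[4 + i] = (uint16_t) b[i];
    }
}
```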

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> (*aarch64_narrow_trunc<mode>): New.
> 	* config/aarch64/iterators.md (VNARROWSIMD, Vnarrowsimd):
> New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/xtn-combine-1.c: New test.
> 	* gcc.target/aarch64/narrow_high_combine.c: Update case.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 36396ef236e8c476d5e2f1acee80dc54ec5ebe4e..33e3301d229366022a5b9481
> b6c3ae8f8d93f9e2 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1753,6 +1753,18 @@ (define_expand "aarch64_xtn2<mode>"
>    }
>  )
> 
> +(define_insn "*aarch64_narrow_trunc<mode>"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +          (truncate:<VNARROWQ>
> +            (match_operand:VQN 1 "register_operand" "w"))
> +	  (truncate:<VNARROWQ>
> +	    (match_operand:VQN 2 "register_operand" "w"))))]
> +  "TARGET_SIMD"
> +  "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
>  ;; Packing doubles.
> 
>  (define_expand "vec_pack_trunc_<mode>"
> diff --git a/gcc/config/aarch64/iterators.md
> b/gcc/config/aarch64/iterators.md
> index
> 8dbeed3b0d4a44cdc17dd333ed397b39a33f386a..95b385c0c9405fe95fcd072
> 62a9471ab13d5488e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -270,6 +270,14 @@ (define_mode_iterator VDQHS [V4HI V8HI V2SI
> V4SI])
>  ;; Advanced SIMD modes for H, S and D types.
>  (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
> 
> +;; Modes for which we can narrow the element and increase the lane counts
> +;; to preserve the same register size.
> +(define_mode_attr VNARROWSIMD [(V4HI "V8QI") (V8HI "V16QI") (V4SI
> "V8HI")
> +			       (V2SI "V4HI") (V2DI "V4SI")])
> +
> +(define_mode_attr Vnarrowsimd [(V4HI "v8qi") (V8HI "v16qi") (V4SI "v8hi")
> +			       (V2SI "v4hi") (V2DI "v4si")])
> +
>  ;; Advanced SIMD and scalar integer modes for H and S.
>  (define_mode_iterator VSDQ_HSI [V4HI V8HI V2SI V4SI HI SI])
> 
> diff --git a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> index
> 50ecab002a3552d37a5cc0d8921f42f6c3dba195..fa61196d3644caa48b12151e
> 12b15dfeab8c7e71 100644
> --- a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> +++ b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> @@ -225,7 +225,8 @@ TEST_2_UNARY (vqmovun, uint32x4_t, int64x2_t,
> s64, u32)
>  /* { dg-final { scan-assembler-times "\\tuqshrn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqrshrn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tuqrshrn2\\tv" 6} }  */
> -/* { dg-final { scan-assembler-times "\\txtn2\\tv" 12} }  */
> +/* { dg-final { scan-assembler-times "\\txtn2\\tv" 6} }  */
> +/* { dg-final { scan-assembler-times "\\tuzp1\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tuqxtn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqxtn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqxtun2\\tv" 6} }  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
> b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..ed655cc970a602da4ace78d
> c8dbd64ab18b0d4ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +void d2 (short * restrict a, int *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> 
> 
> --


* RE: [PATCH 6/7]AArch64 Add neg + cmle into cmgt
  2021-09-29 16:21 ` [PATCH 6/7]AArch64 Add neg + cmle into cmgt Tamar Christina
@ 2021-09-30  9:34   ` Kyrylo Tkachov
  0 siblings, 0 replies; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-09-30  9:34 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Wednesday, September 29, 2021 5:22 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH 6/7]AArch64 Add neg + cmle into cmgt
> 
> Hi All,
> 
> This turns an inversion of the sign bit + arithmetic right shift into a
> comparison with 0.
> 
> i.e.
> 
> void fun1(int32_t *x, int n)
> {
>     for (int i = 0; i < (n & -16); i++)
>       x[i] = (-x[i]) >> 31;
> }
> 
> now generates:
> 
> .L3:
>         ldr     q0, [x0]
>         cmgt    v0.4s, v0.4s, #0
>         str     q0, [x0], 16
>         cmp     x0, x1
>         bne     .L3
> 
> instead of:
> 
> .L3:
>         ldr     q0, [x0]
>         neg     v0.4s, v0.4s
>         sshr    v0.4s, v0.4s, 31
>         str     q0, [x0], 16
>         cmp     x0, x1
>         bne     .L3
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> (*aarch64_simd_neg_ashr<mode>): New.
> 	* config/aarch64/predicates.md
> 	(aarch64_simd_shift_imm_vec_signbit): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/signbit-1.c: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 0045b100c6af1c007293ee26506199868be90e9f..9d936428b438c95b56614c94
> 081d7e2ebc47d89f 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1137,6 +1137,18 @@ (define_insn "aarch64_simd_ashr<mode>"
>    [(set_attr "type" "neon_compare<q>,neon_shift_imm<q>")]
>  )
> 
> +;; Additional opt when we negate the sign bit and then shift right
> +(define_insn "*aarch64_simd_neg_ashr<mode>"
> + [(set (match_operand:VDQ_I 0 "register_operand" "=w")
> +       (ashiftrt:VDQ_I
> +	 (neg:VDQ_I
> +	   (match_operand:VDQ_I 1 "register_operand" "w"))
> +	   (match_operand:VDQ_I 2 "aarch64_simd_shift_imm_vec_signbit"
> "D1")))]
> + "TARGET_SIMD"
> + "cmgt\t%0.<Vtype>, %1.<Vtype>, #0"
> +  [(set_attr "type" "neon_compare_zero<q>")]
> +)
> +
>  (define_insn "*aarch64_simd_sra<mode>"
>   [(set (match_operand:VDQ_I 0 "register_operand" "=w")
>  	(plus:VDQ_I
> diff --git a/gcc/config/aarch64/predicates.md
> b/gcc/config/aarch64/predicates.md
> index
> 7fd4f9e7d06d3082d6f3047290f0446789e1d0d2..12e7d35da154b10f0190274
> d0279cab313563455 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -545,6 +545,12 @@ (define_predicate
> "aarch64_simd_shift_imm_offset_di"
>    (and (match_code "const_int")
>         (match_test "IN_RANGE (INTVAL (op), 1, 64)")))
> 
> +(define_predicate "aarch64_simd_shift_imm_vec_signbit"
> +  (and (match_code "const_vector")
> +       (match_test "aarch64_const_vec_all_same_in_range_p (op,
> +			GET_MODE_UNIT_BITSIZE (mode) - 1,
> +			GET_MODE_UNIT_BITSIZE (mode) - 1)")))
> +
>  (define_predicate "aarch64_simd_shift_imm_vec_exact_top"
>    (and (match_code "const_vector")
>         (match_test "aarch64_const_vec_all_same_in_range_p (op,

Ok but....

> diff --git a/gcc/testsuite/gcc.target/aarch64/signbit-1.c
> b/gcc/testsuite/gcc.target/aarch64/signbit-1.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..3ebfb0586f37de29cf58635
> b27fe48503714447e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/signbit-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps" } */
> +
> +#include <stdint.h>
> +
> +void fun1(int32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (-x[i]) >> 31;
> +}
> +
> +void fun2(int32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (-x[i]) >> 30;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tcmgt\t} 1 } } */

... as discussed offline, can we also add test coverage for the other modes used in the iterators in this patch series? The extra tests can be added as separate follow-up patches.

Also, I'd appreciate a comment in the test explaining why only one of the functions is expected to generate a cmgt here (or remove the one that's irrelevant).
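
A minimal scalar sketch of why only fun1 is expected to match, modelling one 32-bit lane. It assumes an arithmetic right shift of negative values (as GCC implements it) and inputs other than INT32_MIN, since negating that overflows. The helper names are hypothetical and are not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of "cmgt x, #0": all ones if x > 0, else 0.  */
static int32_t lane_cmgt_zero(int32_t x)
{
    return x > 0 ? -1 : 0;
}

/* The source form used in the test: negate, then shift right.
   Assumes arithmetic shift and x != INT32_MIN.  */
static int32_t neg_then_shift(int32_t x, int shift)
{
    return (-x) >> shift;
}

static void check_signbit_shift(void)
{
    /* Shifting by 31 (element size - 1) broadcasts the sign bit of -x,
       which is set exactly when x > 0.  */
    for (int32_t x = -1000; x <= 1000; x++)
        assert(neg_then_shift(x, 31) == lane_cmgt_zero(x));

    /* Shifting by 30 keeps two bits, so the result need not be 0 or -1.  */
    assert(neg_then_shift(-(INT32_C(1) << 30), 30) == 1);
}
```

Calling check_signbit_shift () exercises both claims: the shift by 31 reproduces the compare-greater-than-zero mask, while the shift by 30 can produce a value that is not a mask at all.
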
Thanks,
Kyrill

> 
> 
> --

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 7/7]AArch64 Combine cmeq 0 + not into cmtst
  2021-09-29 16:21 ` [PATCH 7/7]AArch64 Combine cmeq 0 + not into cmtst Tamar Christina
@ 2021-09-30  9:35   ` Kyrylo Tkachov
  0 siblings, 0 replies; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-09-30  9:35 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Wednesday, September 29, 2021 5:22 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: [PATCH 7/7]AArch64 Combine cmeq 0 + not into cmtst
> 
> Hi All,
> 
> This turns a bitwise inverse of an equality comparison with 0 into a
> compare of bitwise nonzero (cmtst).
> 
> We already have one pattern for cmtst; this adds another one which does
> not require an additional bitwise and.
> 
> i.e.
> 
> #include <arm_neon.h>
> 
> uint8x8_t bar(int16x8_t abs_row0, int16x8_t row0) {
>   uint16x8_t row0_diff =
>     vreinterpretq_u16_s16(veorq_s16(abs_row0, vshrq_n_s16(row0, 15)));
>   uint8x8_t abs_row0_gt0 =
>     vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row0),
> vdupq_n_u16(0)));
>   return abs_row0_gt0;
> }
> 
> now generates:
> 
> bar:
>         cmtst   v0.8h, v0.8h, v0.8h
>         xtn     v0.8b, v0.8h
>         ret
> 
> instead of:
> 
> bar:
>         cmeq    v0.8h, v0.8h, #0
>         not     v0.16b, v0.16b
>         xtn     v0.8b, v0.8h
>         ret
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> (*aarch64_cmtst_same_<mode>): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/mvn-cmeq0-1.c: New test.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 9d936428b438c95b56614c94081d7e2ebc47d89f..bce01c36386074bf475b8b7
> e5c69a1959a13fef3 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -6585,6 +6585,23 @@ (define_insn "aarch64_cmtst<mode>"
>    [(set_attr "type" "neon_tst<q>")]
>  )
> 
> +;; One can also get a cmtst by having to combine a
> +;; not (neq (eq x 0)) in which case you rewrite it to
> +;; a comparison against itself
> +
> +(define_insn "*aarch64_cmtst_same_<mode>"
> +  [(set (match_operand:<V_INT_EQUIV> 0 "register_operand" "=w")
> +	(plus:<V_INT_EQUIV>
> +	  (eq:<V_INT_EQUIV>
> +	    (match_operand:VDQ_I 1 "register_operand" "w")
> +	    (match_operand:VDQ_I 2 "aarch64_simd_imm_zero"))
> +	  (match_operand:<V_INT_EQUIV> 3
> "aarch64_simd_imm_minus_one")))
> +  ]
> +  "TARGET_SIMD"
> +  "cmtst\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>1<Vmtype>"
> +  [(set_attr "type" "neon_tst<q>")]
> +)
> +
>  (define_insn_and_split "aarch64_cmtstdi"
>    [(set (match_operand:DI 0 "register_operand" "=w,r")
>  	(neg:DI
> diff --git a/gcc/testsuite/gcc.target/aarch64/mvn-cmeq0-1.c
> b/gcc/testsuite/gcc.target/aarch64/mvn-cmeq0-1.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..59f3a230271c70d3bb51d03
> 38d9ec2613bd4394b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/mvn-cmeq0-1.c
> @@ -0,0 +1,17 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */

I don't think we need the param here (or anything higher than -O, really).
Ok otherwise.
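
For reference, a scalar sketch of the identity the new pattern relies on, modelling one 16-bit lane. The lane_* helpers are hypothetical names for illustration, not GCC or ACLE functions:

```c
#include <assert.h>
#include <stdint.h>

/* One lane of "cmeq x, #0": all ones if x == 0, else 0.  */
static uint16_t lane_cmeq_zero(uint16_t x)
{
    return x == 0 ? 0xFFFF : 0;
}

/* One lane of a bitwise "not".  */
static uint16_t lane_not(uint16_t m)
{
    return (uint16_t)~m;
}

/* One lane of "cmtst a, b": all ones if (a & b) != 0, else 0.  */
static uint16_t lane_cmtst(uint16_t a, uint16_t b)
{
    return (a & b) != 0 ? 0xFFFF : 0;
}

static void check_cmtst_same(void)
{
    /* not (cmeq x, #0) equals cmtst x, x for every 16-bit value.  */
    for (uint32_t x = 0; x <= 0xFFFF; x++)
        assert(lane_not(lane_cmeq_zero((uint16_t)x))
               == lane_cmtst((uint16_t)x, (uint16_t)x));
}
```

Since x & x is just x, testing a register against itself answers "is this lane nonzero" in a single instruction.
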
Thanks,
Kyrill


> +
> +#include <arm_neon.h>
> +
> +uint8x8_t bar(int16x8_t abs_row0, int16x8_t row0) {
> +  uint16x8_t row0_diff =
> +    vreinterpretq_u16_s16(veorq_s16(abs_row0, vshrq_n_s16(row0, 15)));
> +  uint8x8_t abs_row0_gt0 =
> +    vmovn_u16(vcgtq_u16(vreinterpretq_u16_s16(abs_row0),
> vdupq_n_u16(0)));
> +  return abs_row0_gt0;
> +}
> +
> +
> +/* { dg-final { scan-assembler-times {\tcmtst\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tcmeq\t} } } */
> +/* { dg-final { scan-assembler-not {\tnot\t} } } */
> 
> 
> --

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2>
  2021-09-30  6:17   ` Richard Biener
@ 2021-09-30  9:56     ` Tamar Christina
  2021-09-30 10:26       ` Richard Biener
  0 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-09-30  9:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, nd

> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Thursday, September 30, 2021 7:18 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: Re: [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0
> into cm<cc2> <imm2>
> 
> On Wed, 29 Sep 2021, Tamar Christina wrote:
> 
> > Hi All,
> >
> > This optimizes the case where a mask Y which fulfills ~Y + 1 == pow2
> > is used to clear some bits and the result is then compared against 0,
> > turning it into a single compare against a different immediate without
> > the masking.
> >
> > We can do this for all unsigned compares and for signed we can do it
> > for comparisons of EQ and NE:
> >
> > (x & (~255)) == 0 becomes x <= 255, which leaves it to the target
> > to optimally deal with the comparison.
> >
> > This transformation has to be done in the mid-end because in RTL you
> > don't have the signs of the comparison operands and if the target
> > needs an immediate this should be floated outside of the loop.
> >
> > The RTL loop invariant hoisting is done before split1.
> >
> > i.e.
> >
> > void fun1(int32_t *x, int n)
> > {
> >     for (int i = 0; i < (n & -16); i++)
> >       x[i] = (x[i]&(~255)) == 0;
> > }
> >
> > now generates:
> >
> > .L3:
> >         ldr     q0, [x0]
> >         cmhs    v0.4s, v2.4s, v0.4s
> >         and     v0.16b, v1.16b, v0.16b
> >         str     q0, [x0], 16
> >         cmp     x0, x1
> >         bne     .L3
> >
> > and floats the immediate out of the loop.
> >
> > instead of:
> >
> > .L3:
> >         ldr     q0, [x0]
> >         bic     v0.4s, #255
> >         cmeq    v0.4s, v0.4s, #0
> >         and     v0.16b, v1.16b, v0.16b
> >         str     q0, [x0], 16
> >         cmp     x0, x1
> >         bne     .L3
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> > and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* match.pd: New bitmask compare pattern.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.dg/bic-bitmask-10.c: New test.
> > 	* gcc.dg/bic-bitmask-11.c: New test.
> > 	* gcc.dg/bic-bitmask-12.c: New test.
> > 	* gcc.dg/bic-bitmask-2.c: New test.
> > 	* gcc.dg/bic-bitmask-3.c: New test.
> > 	* gcc.dg/bic-bitmask-4.c: New test.
> > 	* gcc.dg/bic-bitmask-5.c: New test.
> > 	* gcc.dg/bic-bitmask-6.c: New test.
> > 	* gcc.dg/bic-bitmask-7.c: New test.
> > 	* gcc.dg/bic-bitmask-8.c: New test.
> > 	* gcc.dg/bic-bitmask-9.c: New test.
> > 	* gcc.dg/bic-bitmask.h: New test.
> > 	* gcc.target/aarch64/bic-bitmask-1.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/match.pd b/gcc/match.pd index
> >
> 0fcfd0ea62c043dc217d0d560ce5b7e569b70e7d..df9212cb27d172856b9d43b08
> 752
> > 62f96e8993c4 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -4288,6 +4288,56 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >      (if (ic == ncmp)
> >       (ncmp @0 @1))))))
> >
> > +/* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
> > > +   where ~Y + 1 == pow2 and Z = ~Y.  */
> > > +(for cmp (simple_comparison)
> > +(simplify
> > +  (cmp (bit_and:c @0 VECTOR_CST@1) integer_zerop)
> 
> Why not for INTEGER_CST as well?  We do have a related folding (only for
> INTEGER_CST) that does
> 

Because of a slight concern about de-optimizing what targets currently generate for the flag-setting variants.
For example, AArch64 generates worse code for foo than it does for bar:

int foo (int x)
{
    if (x <= 0xFFFF)
      return 1;

    return 0;
}

int bar (int x)
{
    if (x & ~0xFFFF)
      return 1;

    return 0;
}

This is because the flag-setting bitmask form has been optimized more.  I can of course do this and fix
AArch64, but other targets may have the same issue.  For vectors this was less of a concern since
there is no flag setting there.
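
As a sanity check on the transformation itself, independent of the codegen concern above, here is a minimal scalar sketch of the rewrite for a 32-bit element. The helper names are hypothetical; signed inputs are converted to unsigned before the compare, which is my reading of the convert:uvtype step in the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when ~mask + 1 is a power of two, i.e. ~mask has the form 2^k - 1.  */
static bool mask_qualifies(uint32_t mask)
{
    uint32_t t = ~mask + 1u;
    return t != 0 && (t & (t - 1u)) == 0;
}

/* Original form: clear bits with the mask, then compare against zero.  */
static bool masked_eq_zero(uint32_t x, uint32_t mask)
{
    return (x & mask) == 0;
}

/* Rewritten form: a single unsigned compare against ~mask.  */
static bool rewritten_le(uint32_t x, uint32_t mask)
{
    return x <= ~mask;
}

static void check_bitmask_rewrite(void)
{
    const uint32_t masks[] = { ~255u, ~1u };

    for (unsigned m = 0; m < 2; m++) {
        assert(mask_qualifies(masks[m]));
        /* Negative inputs are included; they are compared as unsigned.  */
        for (int32_t x = -70000; x <= 70000; x++)
            assert(masked_eq_zero((uint32_t)x, masks[m])
                   == rewritten_le((uint32_t)x, masks[m]));
    }

    /* ~5 does not qualify (~(~5) + 1 == 6), and the rewrite would be
       wrong for it: x == 2 passes "x <= 5" but (2 & ~5) != 0.  */
    assert(!mask_qualifies(~5u));
    assert(masked_eq_zero(2u, ~5u) != rewritten_le(2u, ~5u));
}
```

The ~5 case corresponds to the mask used in bic-bitmask-9.c, where the masking is expected to stay in place.
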

Do you still want the scalar version?

Thanks,
Tamar

> /* A & (2**N - 1) <= 2**K - 1 -> A & (2**N - 2**K) == 0
>    A & (2**N - 1) >  2**K - 1 -> A & (2**N - 2**K) != 0
> 
> which could be extended for integer vectors.  That said, can you please place
> the pattern next to the above?
> 
> Why does the transform only work for uniform vector constants?  (I see that
> the implementation becomes simpler, but then you should also handle the
> INTEGER_CST case at least)
> 
> > +   (if (VECTOR_INTEGER_TYPE_P (TREE_TYPE (@1))
> > +	&& uniform_vector_p (@1))
> > +    (with { tree elt = vector_cst_elt (@1, 0); }
> > +     (switch
> > +      (if (TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_uhwi_p (elt))
> 
> avoid tree_fits_uhwi_p and use wide_int here
> 
> > +	(with { unsigned HOST_WIDE_INT diff = tree_to_uhwi (elt);
> > +	        tree tdiff = wide_int_to_tree (TREE_TYPE (elt), (~diff) + 1);
> > +		tree newval = wide_int_to_tree (TREE_TYPE (elt), ~diff);
> > +		tree newmask = build_uniform_cst (TREE_TYPE (@1),
> newval); }
> > +	 (if (integer_pow2p (tdiff))
> 
> You don't seem to use 'tdiff' so please do this check in wide_int
> 
> > +	  (switch
> > +	   /* ((mask & x) < 0) -> 0.  */
> > +	   (if (cmp == LT_EXPR)
> > +	    { build_zero_cst (TREE_TYPE (@1)); })
> > +	   /* ((mask & x) <= 0) -> x < mask.  */
> > +	   (if (cmp == LE_EXPR)
> > +	    (lt @0 { newmask; }))
> > +	   /* ((mask & x) == 0) -> x < mask.  */
> > +	   (if (cmp == EQ_EXPR)
> > +	    (le @0 { newmask; }))
> > +	   /* ((mask & x) != 0) -> x > mask.  */
> > +	   (if (cmp == NE_EXPR)
> > +	    (gt @0 { newmask; }))
> > +	   /* ((mask & x) >= 0) -> x <= mask.  */
> > +	   (if (cmp == GE_EXPR)
> > +	    (le @0 { newmask; }))
> > +	    /* ((mask & x) > 0) -> x < mask.  */
> > +	   (if (cmp == GT_EXPR)
> > +	    (lt @0 { newmask; }))))))
> 
> you can avoid this switch with a lock-step (for, that maps 'cmp'
> to the result comparison code (for simplicity you can either keep the LT_EXPR
> special-case or transform to an always true condition which will be simplified).
> 
> > +      (if (!TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_shwi_p (elt))
> > +	(with { unsigned HOST_WIDE_INT diff = tree_to_shwi (elt);
> > +		tree ustype = unsigned_type_for (TREE_TYPE (elt));
> > +		tree uvtype = unsigned_type_for (TREE_TYPE (@1));
> > +	        tree tdiff = wide_int_to_tree (ustype, (~diff) + 1);
> > +	        tree udiff = wide_int_to_tree (ustype, ~diff);
> > +		tree cst = build_uniform_cst (uvtype, udiff); }
> > +	 (if (integer_pow2p (tdiff))
> > +	  (switch
> > +	    /* ((mask & x) == 0) -> x < mask.  */
> > +	    (if (cmp == EQ_EXPR)
> > +	     (le (convert:uvtype @0) { cst; }))
> > +	    /* ((mask & x) != 0) -> x > mask.  */
> > +	    (if (cmp == NE_EXPR)
> > +	     (gt (convert:uvtype @0) { cst; })))))))))))
> > +
> >  /* Transform comparisons of the form X - Y CMP 0 to X CMP Y.
> >     ??? The transformation is valid for the other operators if overflow
> >     is undefined for the type, but performing it here badly interacts
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..76a22a2313137a2a75dd711c2c
> 15
> > c2d3a34e15aa
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> > @@ -0,0 +1,26 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(int32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) == 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(int32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) == 0;
> > +}
> > +
> > +#define TYPE int32_t
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
> > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..32553d7ba2f823f7a212374519
> 90
> > d0a216d2f912
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) != 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) != 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump {>\s*.+\{ 255,.+\}} dce7 } } */
> > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..e10cbf7fabe2dbf7ce436cdf37
> b0
> > f8b207c58408
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do assemble } */
> > +/* { dg-options "-O3 -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef unsigned int v4si __attribute__ ((vector_size (16)));
> > +
> > +__attribute__((noinline, noipa))
> > +void fun(v4si *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) == 0;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
> > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3f
> fa
> > ec0be577e13f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) == 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) == 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } }
> > +*/
> > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3f
> fa
> > ec0be577e13f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) == 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) == 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } }
> > +*/
> > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..1bcf23ccf1447d6c8c999ed1eb
> 25
> > ba0a450028e1
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) >= 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) >= 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump-times {=\s*.+\{ 1,.+\}} 1 dce7 } } */
> > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..6e5a2fca9992efbc01f8dbbc6f
> 95
> > 936e86643028
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) > 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) > 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump-times {>\s*.+\{ 255,.+\}} 1 dce7 } }
> > +*/
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..018e7a4348c9fc461106c3d9d0
> 12
> > 91325d3406c2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) <= 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~255)) <= 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } }
> > +*/
> > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..798678fb7555052c93abc4ca34
> f6
> > 17d640f73bb4
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> > @@ -0,0 +1,24 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~1)) < 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~1)) < 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump-times {__builtin_memset} 1 dce7 } } */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..1dabe834ed57dfa0be48c1dc3
> dbb
> > 226092c79a1a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~1)) != 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~1)) != 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump-times {>\s*.+\{ 1,.+\}} 1 dce7 } } */
> > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967294,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> > b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..9c1f8ee0adfc45d1b9fc212138
> ea
> > 26bb6b693e49
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > +
> > +#include <stdint.h>
> > +
> > +__attribute__((noinline, noipa))
> > +void fun1(uint32_t *x, int n)
> > +{
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~5)) == 0;
> > +}
> > +
> > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > +*x, int n) {
> > +    for (int i = 0; i < (n & -16); i++)
> > +      x[i] = (x[i]&(~5)) == 0;
> > +}
> > +
> > +#include "bic-bitmask.h"
> > +
> > +/* { dg-final { scan-tree-dump-not {<=\s*.+\{ 4294967289,.+\}} dce7 }
> > +} */
> > +/* { dg-final { scan-tree-dump {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > +aarch64*-*-* } } } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask.h
> > b/gcc/testsuite/gcc.dg/bic-bitmask.h
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..2b94065c025e0cbf71a21ac9b9
> d6
> > 314e24b0c2d9
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/bic-bitmask.h
> > @@ -0,0 +1,43 @@
> > +#include <stdio.h>
> > +
> > +#ifndef N
> > +#define N 50
> > +#endif
> > +
> > +#ifndef TYPE
> > +#define TYPE uint32_t
> > +#endif
> > +
> > +#ifndef DEBUG
> > +#define DEBUG 0
> > +#endif
> > +
> > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > +
> > +int main ()
> > +{
> > +  TYPE a[N];
> > +  TYPE b[N];
> > +
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      a[i] = BASE + i * 13;
> > +      b[i] = BASE + i * 13;
> > +      if (DEBUG)
> > +        printf ("%d: 0x%x\n", i, a[i]);
> > +    }
> > +
> > +  fun1 (a, N);
> > +  fun2 (b, N);
> > +
> > +  for (int i = 0; i < N; ++i)
> > +    {
> > +      if (DEBUG)
> > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > +
> > +      if (a[i] != b[i])
> > +        __builtin_abort ();
> > +    }
> > +  return 0;
> > +}
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> > b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> > new file mode 100644
> > index
> >
> 0000000000000000000000000000000000000000..568c1ffc8bc4148efaeeba7a45
> a7
> > 5ecbd3a7a3dd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do assemble } */
> > +/* { dg-options "-O2 -save-temps" } */
> > +
> > +#include <arm_neon.h>
> > +
> > +uint32x4_t foo (int32x4_t a)
> > +{
> > +  int32x4_t cst = vdupq_n_s32 (255);
> > +  int32x4_t zero = vdupq_n_s32 (0);
> > +  return vceqq_s32 (vbicq_s32 (a, cst), zero); }
> > +
> > +/* { dg-final { scan-assembler-not {\tbic\t} { xfail { aarch64*-*-* }
> > +} } } */
> >
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2>
  2021-09-30  9:56     ` Tamar Christina
@ 2021-09-30 10:26       ` Richard Biener
  2021-10-05 12:55         ` Tamar Christina
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Biener @ 2021-09-30 10:26 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd

On Thu, 30 Sep 2021, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Thursday, September 30, 2021 7:18 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: Re: [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0
> > into cm<cc2> <imm2>
> > 
> > On Wed, 29 Sep 2021, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > > This optimizes the case where a mask Y which fulfills ~Y + 1 == pow2
> > > > is used to clear some bits and the result is then compared against 0,
> > > > turning it into a single compare against a different immediate without
> > > > the masking.
> > > >
> > > > We can do this for all unsigned compares and for signed we can do it
> > > > for comparisons of EQ and NE:
> > > >
> > > > (x & (~255)) == 0 becomes x <= 255, which leaves it to the target
> > > > to optimally deal with the comparison.
> > >
> > > This transformation has to be done in the mid-end because in RTL you
> > > don't have the signs of the comparison operands and if the target
> > > needs an immediate this should be floated outside of the loop.
> > >
> > > The RTL loop invariant hoisting is done before split1.
> > >
> > > i.e.
> > >
> > > void fun1(int32_t *x, int n)
> > > {
> > >     for (int i = 0; i < (n & -16); i++)
> > >       x[i] = (x[i]&(~255)) == 0;
> > > }
> > >
> > > now generates:
> > >
> > > .L3:
> > >         ldr     q0, [x0]
> > >         cmhs    v0.4s, v2.4s, v0.4s
> > >         and     v0.16b, v1.16b, v0.16b
> > >         str     q0, [x0], 16
> > >         cmp     x0, x1
> > >         bne     .L3
> > >
> > > and floats the immediate out of the loop.
> > >
> > > instead of:
> > >
> > > .L3:
> > >         ldr     q0, [x0]
> > >         bic     v0.4s, #255
> > >         cmeq    v0.4s, v0.4s, #0
> > >         and     v0.16b, v1.16b, v0.16b
> > >         str     q0, [x0], 16
> > >         cmp     x0, x1
> > >         bne     .L3
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> > > and no issues.
> > >
> > > Ok for master?
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* match.pd: New bitmask compare pattern.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 	* gcc.dg/bic-bitmask-10.c: New test.
> > > 	* gcc.dg/bic-bitmask-11.c: New test.
> > > 	* gcc.dg/bic-bitmask-12.c: New test.
> > > 	* gcc.dg/bic-bitmask-2.c: New test.
> > > 	* gcc.dg/bic-bitmask-3.c: New test.
> > > 	* gcc.dg/bic-bitmask-4.c: New test.
> > > 	* gcc.dg/bic-bitmask-5.c: New test.
> > > 	* gcc.dg/bic-bitmask-6.c: New test.
> > > 	* gcc.dg/bic-bitmask-7.c: New test.
> > > 	* gcc.dg/bic-bitmask-8.c: New test.
> > > 	* gcc.dg/bic-bitmask-9.c: New test.
> > > 	* gcc.dg/bic-bitmask.h: New test.
> > > 	* gcc.target/aarch64/bic-bitmask-1.c: New test.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > >
> > 0fcfd0ea62c043dc217d0d560ce5b7e569b70e7d..df9212cb27d172856b9d43b08
> > 752
> > > 62f96e8993c4 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -4288,6 +4288,56 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > >      (if (ic == ncmp)
> > >       (ncmp @0 @1))))))
> > >
> > > +/* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
> > > > +   where ~Y + 1 == pow2 and Z = ~Y.  */
> > > > +(for cmp (simple_comparison)
> > > +(simplify
> > > +  (cmp (bit_and:c @0 VECTOR_CST@1) integer_zerop)
> > 
> > Why not for INTEGER_CST as well?  We do have a related folding (only for
> > INTEGER_CST) that does
> > 
> 
> Because of a slight concern about de-optimizing what targets currently generate for the flag-setting variants.
> For example, AArch64 generates worse code for foo than it does for bar:
> 
> int foo (int x)
> {
>     if (x <= 0xFFFF)
>       return 1;
> 
>     return 0;
> }
> 
> int bar (int x)
> {
>     if (x & ~0xFFFF)
>       return 1;
> 
>     return 0;
> }
> 
> This is because the flag-setting bitmask form has been optimized more.  I can of course do this and fix
> AArch64, but other targets may have the same issue.  For vectors this was less of a concern since
> there is no flag setting there.
> 
> Do you still want the scalar version?

Yes, the simplification result is simpler and thus more canonical on
GIMPLE.  On x86 we generate

        xorl    %eax, %eax
        cmpl    $65535, %edi
        setle   %al
        ret

vs

        xorl    %eax, %eax
        andl    $-65536, %edi
        setne   %al
        ret

which are equivalent, I think (and could easily be transformed
using a peephole if required).

Richard.

> Thanks,
> Tamar
> 
> > /* A & (2**N - 1) <= 2**K - 1 -> A & (2**N - 2**K) == 0
> >    A & (2**N - 1) >  2**K - 1 -> A & (2**N - 2**K) != 0
> > 
> > which could be extended for integer vectors.  That said, can you please place
> > the pattern next to the above?
> > 
> > Why does the transform only work for uniform vector constants?  (I see that
> > the implementation becomes simpler, but then you should also handle the
> > INTEGER_CST case at least)
> > 
> > > +   (if (VECTOR_INTEGER_TYPE_P (TREE_TYPE (@1))
> > > +	&& uniform_vector_p (@1))
> > > +    (with { tree elt = vector_cst_elt (@1, 0); }
> > > +     (switch
> > > +      (if (TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_uhwi_p (elt))
> > 
> > avoid tree_fits_uhwi_p and use wide_int here
> > 
> > > +	(with { unsigned HOST_WIDE_INT diff = tree_to_uhwi (elt);
> > > +	        tree tdiff = wide_int_to_tree (TREE_TYPE (elt), (~diff) + 1);
> > > +		tree newval = wide_int_to_tree (TREE_TYPE (elt), ~diff);
> > > +		tree newmask = build_uniform_cst (TREE_TYPE (@1),
> > newval); }
> > > +	 (if (integer_pow2p (tdiff))
> > 
> > You don't seem to use 'tdiff' so please do this check in wide_int
> > 
> > > +	  (switch
> > > +	   /* ((mask & x) < 0) -> 0.  */
> > > +	   (if (cmp == LT_EXPR)
> > > +	    { build_zero_cst (TREE_TYPE (@1)); })
> > > +	   /* ((mask & x) <= 0) -> x < mask.  */
> > > +	   (if (cmp == LE_EXPR)
> > > +	    (lt @0 { newmask; }))
> > > +	   /* ((mask & x) == 0) -> x < mask.  */
> > > +	   (if (cmp == EQ_EXPR)
> > > +	    (le @0 { newmask; }))
> > > +	   /* ((mask & x) != 0) -> x > mask.  */
> > > +	   (if (cmp == NE_EXPR)
> > > +	    (gt @0 { newmask; }))
> > > +	   /* ((mask & x) >= 0) -> x <= mask.  */
> > > +	   (if (cmp == GE_EXPR)
> > > +	    (le @0 { newmask; }))
> > > +	    /* ((mask & x) > 0) -> x < mask.  */
> > > +	   (if (cmp == GT_EXPR)
> > > +	    (lt @0 { newmask; }))))))
> > 
> > you can avoid this switch with a lock-step (for, that maps 'cmp'
> > to the result comparison code (for simplicity you can either keep the LT_EXPR
> > special-case or transform to an always true condition which will be simplified).
> > 
> > > +      (if (!TYPE_UNSIGNED (TREE_TYPE (@1)) && tree_fits_shwi_p (elt))
> > > +	(with { unsigned HOST_WIDE_INT diff = tree_to_shwi (elt);
> > > +		tree ustype = unsigned_type_for (TREE_TYPE (elt));
> > > +		tree uvtype = unsigned_type_for (TREE_TYPE (@1));
> > > +	        tree tdiff = wide_int_to_tree (ustype, (~diff) + 1);
> > > +	        tree udiff = wide_int_to_tree (ustype, ~diff);
> > > +		tree cst = build_uniform_cst (uvtype, udiff); }
> > > +	 (if (integer_pow2p (tdiff))
> > > +	  (switch
> > > +	    /* ((mask & x) == 0) -> x < mask.  */
> > > +	    (if (cmp == EQ_EXPR)
> > > +	     (le (convert:uvtype @0) { cst; }))
> > > +	    /* ((mask & x) != 0) -> x > mask.  */
> > > +	    (if (cmp == NE_EXPR)
> > > +	     (gt (convert:uvtype @0) { cst; })))))))))))
> > > +
> > >  /* Transform comparisons of the form X - Y CMP 0 to X CMP Y.
> > >     ??? The transformation is valid for the other operators if overflow
> > >     is undefined for the type, but performing it here badly interacts
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-10.c b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..76a22a2313137a2a75dd711c2c15c2d3a34e15aa
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> > > @@ -0,0 +1,26 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(int32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) == 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(int32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) == 0;
> > > +}
> > > +
> > > +#define TYPE int32_t
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-11.c b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..32553d7ba2f823f7a21237451990d0a216d2f912
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) != 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) != 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump {>\s*.+\{ 255,.+\}} dce7 } } */
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-12.c b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..e10cbf7fabe2dbf7ce436cdf37b0f8b207c58408
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> > > @@ -0,0 +1,17 @@
> > > +/* { dg-do assemble } */
> > > +/* { dg-options "-O3 -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +typedef unsigned int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun(v4si *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) == 0;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-2.c b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) == 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) == 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } }
> > > +*/
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-3.c b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) == 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) == 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } }
> > > +*/
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-4.c b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..1bcf23ccf1447d6c8c999ed1eb25ba0a450028e1
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) >= 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) >= 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump-times {=\s*.+\{ 1,.+\}} 1 dce7 } } */
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-5.c b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..6e5a2fca9992efbc01f8dbbc6f95936e86643028
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) > 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) > 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump-times {>\s*.+\{ 255,.+\}} 1 dce7 } }
> > > +*/
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-6.c b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..018e7a4348c9fc461106c3d9d01291325d3406c2
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) <= 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~255)) <= 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } }
> > > +*/
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-7.c b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..798678fb7555052c93abc4ca34f617d640f73bb4
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> > > @@ -0,0 +1,24 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~1)) < 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~1)) < 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump-times {__builtin_memset} 1 dce7 } } */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-8.c b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..1dabe834ed57dfa0be48c1dc3dbb226092c79a1a
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~1)) != 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~1)) != 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump-times {>\s*.+\{ 1,.+\}} 1 dce7 } } */
> > > +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967294,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-9.c b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..9c1f8ee0adfc45d1b9fc212138ea26bb6b693e49
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> > > @@ -0,0 +1,25 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +__attribute__((noinline, noipa))
> > > +void fun1(uint32_t *x, int n)
> > > +{
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~5)) == 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noipa, optimize("O1"))) void fun2(uint32_t
> > > +*x, int n) {
> > > +    for (int i = 0; i < (n & -16); i++)
> > > +      x[i] = (x[i]&(~5)) == 0;
> > > +}
> > > +
> > > +#include "bic-bitmask.h"
> > > +
> > > +/* { dg-final { scan-tree-dump-not {<=\s*.+\{ 4294967289,.+\}} dce7 }
> > > +} */
> > > +/* { dg-final { scan-tree-dump {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> > > +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target {
> > > +aarch64*-*-* } } } } */
> > > +
> > > diff --git a/gcc/testsuite/gcc.dg/bic-bitmask.h b/gcc/testsuite/gcc.dg/bic-bitmask.h
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..2b94065c025e0cbf71a21ac9b9d6314e24b0c2d9
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/bic-bitmask.h
> > > @@ -0,0 +1,43 @@
> > > +#include <stdio.h>
> > > +
> > > +#ifndef N
> > > +#define N 50
> > > +#endif
> > > +
> > > +#ifndef TYPE
> > > +#define TYPE uint32_t
> > > +#endif
> > > +
> > > +#ifndef DEBUG
> > > +#define DEBUG 0
> > > +#endif
> > > +
> > > +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> > > +
> > > +int main ()
> > > +{
> > > +  TYPE a[N];
> > > +  TYPE b[N];
> > > +
> > > +  for (int i = 0; i < N; ++i)
> > > +    {
> > > +      a[i] = BASE + i * 13;
> > > +      b[i] = BASE + i * 13;
> > > +      if (DEBUG)
> > > +        printf ("%d: 0x%x\n", i, a[i]);
> > > +    }
> > > +
> > > +  fun1 (a, N);
> > > +  fun2 (b, N);
> > > +
> > > +  for (int i = 0; i < N; ++i)
> > > +    {
> > > +      if (DEBUG)
> > > +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> > > +
> > > +      if (a[i] != b[i])
> > > +        __builtin_abort ();
> > > +    }
> > > +  return 0;
> > > +}
> > > +
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..568c1ffc8bc4148efaeeba7a45a75ecbd3a7a3dd
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> > > @@ -0,0 +1,13 @@
> > > +/* { dg-do assemble } */
> > > +/* { dg-options "-O2 -save-temps" } */
> > > +
> > > +#include <arm_neon.h>
> > > +
> > > +uint32x4_t foo (int32x4_t a)
> > > +{
> > > +  int32x4_t cst = vdupq_n_s32 (255);
> > > +  int32x4_t zero = vdupq_n_s32 (0);
> > > +  return vceqq_s32 (vbicq_s32 (a, cst), zero); }
> > > +
> > > +/* { dg-final { scan-assembler-not {\tbic\t} { xfail { aarch64*-*-* }
> > > +} } } */
> > >
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2>
  2021-09-30 10:26       ` Richard Biener
@ 2021-10-05 12:55         ` Tamar Christina
  2021-10-13 12:17           ` Richard Biener
  0 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-10-05 12:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, nd

[-- Attachment #1: Type: text/plain, Size: 26960 bytes --]

Hi All,

Here's a new version of the patch handling both scalar and vector modes
and non-uniform constant vectors.
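
As a rough illustration of the per-element condition the new pattern
relies on (the match.pd comment states it as ~Y + 1 being a power of two,
with Z = ~Y), here is a scalar sketch in C.  The function names are
hypothetical and this is not the code of bitmask_inv_cst_vector_p; it
assumes 32-bit unsigned elements and that a vector constant qualifies only
when every element does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if ~y + 1 is a power of two, i.e. ~y has the form 2**k - 1.
   On success store Z = ~y, the constant to compare against.  */
static bool inverted_low_mask (uint32_t y, uint32_t *z)
{
  uint32_t inv = ~y;
  uint32_t p = inv + 1;   /* wraps to 0 when y == 0 */
  if (p == 0 || (p & (p - 1)) != 0)
    return false;
  *z = inv;
  return true;
}

/* Assumed vector rule: every element must qualify, but the elements
   need not be equal to each other.  */
static bool inverted_low_mask_vec (const uint32_t *y, uint32_t *z, int n)
{
  for (int i = 0; i < n; i++)
    if (!inverted_low_mask (y[i], &z[i]))
      return false;
  return true;
}
```

With the masks used in bic-bitmask-23.c, ~{ 255, 15, 1, 0xFFFF }, every
element qualifies and Z comes out as { 255, 15, 1, 65535 }, the constant
that test scans for.  A mask such as ~5 is rejected because 5 + 1 is not a
power of two.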

Bootstrapped and regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu with no regressions.

In order to not break IVopts and CSE I have added a
requirement for the scalar version to be single use.

Thanks,
Tamar

gcc/ChangeLog:

	* tree.c (bitmask_inv_cst_vector_p): New.
	* tree.h (bitmask_inv_cst_vector_p): New.
	* match.pd: Use it in new bitmask compare pattern.

gcc/testsuite/ChangeLog:

	* gcc.dg/bic-bitmask-10.c: New test.
	* gcc.dg/bic-bitmask-11.c: New test.
	* gcc.dg/bic-bitmask-12.c: New test.
	* gcc.dg/bic-bitmask-13.c: New test.
	* gcc.dg/bic-bitmask-14.c: New test.
	* gcc.dg/bic-bitmask-15.c: New test.
	* gcc.dg/bic-bitmask-16.c: New test.
	* gcc.dg/bic-bitmask-17.c: New test.
	* gcc.dg/bic-bitmask-18.c: New test.
	* gcc.dg/bic-bitmask-19.c: New test.
	* gcc.dg/bic-bitmask-2.c: New test.
	* gcc.dg/bic-bitmask-20.c: New test.
	* gcc.dg/bic-bitmask-21.c: New test.
	* gcc.dg/bic-bitmask-22.c: New test.
	* gcc.dg/bic-bitmask-23.c: New test.
	* gcc.dg/bic-bitmask-3.c: New test.
	* gcc.dg/bic-bitmask-4.c: New test.
	* gcc.dg/bic-bitmask-5.c: New test.
	* gcc.dg/bic-bitmask-6.c: New test.
	* gcc.dg/bic-bitmask-7.c: New test.
	* gcc.dg/bic-bitmask-8.c: New test.
	* gcc.dg/bic-bitmask-9.c: New test.
	* gcc.dg/bic-bitmask.h: New test.
	* gcc.target/aarch64/bic-bitmask-1.c: New test.

--- inline copy of patch --

diff --git a/gcc/match.pd b/gcc/match.pd
index 0fcfd0ea62c043dc217d0d560ce5b7e569b70e7d..7d2a24dbc5e9644a09968f877e12a824d8ba1caa 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -37,7 +37,8 @@ along with GCC; see the file COPYING3.  If not see
    integer_pow2p
    uniform_integer_cst_p
    HONOR_NANS
-   uniform_vector_p)
+   uniform_vector_p
+   bitmask_inv_cst_vector_p)
 
 /* Operator lists.  */
 (define_operator_list tcc_comparison
@@ -4900,6 +4901,24 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
       (eqcmp (bit_and @1 { wide_int_to_tree (ty, mask - rhs); })
 	     { build_zero_cst (ty); }))))))
 
+/* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
+   where ~Y + 1 == pow2 and Z = ~Y.  */
+(for cst (VECTOR_CST INTEGER_CST)
+ (for cmp (le eq ne ge gt)
+      icmp (le le gt le gt)
+ (simplify
+  (cmp (bit_and:c@2 @0 cst@1) integer_zerop)
+   (with { tree csts = bitmask_inv_cst_vector_p (@1); }
+     (switch
+      (if (csts && TYPE_UNSIGNED (TREE_TYPE (@1))
+	   && (VECTOR_TYPE_P (TREE_TYPE (@1)) || single_use (@2)))
+       (icmp @0 { csts; }))
+      (if (csts && !TYPE_UNSIGNED (TREE_TYPE (@1))
+	   && (cmp == EQ_EXPR || cmp == NE_EXPR)
+	   && (VECTOR_TYPE_P (TREE_TYPE (@1)) || single_use (@2)))
+       (with { tree utype = unsigned_type_for (TREE_TYPE (@1)); }
+	(icmp (convert:utype @0) { csts; }))))))))
+
 /* -A CMP -B -> B CMP A.  */
 (for cmp (tcc_comparison)
      scmp (swapped_tcc_comparison)
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-10.c b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
new file mode 100644
index 0000000000000000000000000000000000000000..76a22a2313137a2a75dd711c2c15c2d3a34e15aa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#define TYPE int32_t
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-11.c b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
new file mode 100644
index 0000000000000000000000000000000000000000..32553d7ba2f823f7a21237451990d0a216d2f912
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {>\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-12.c b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
new file mode 100644
index 0000000000000000000000000000000000000000..e10cbf7fabe2dbf7ce436cdf37b0f8b207c58408
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
@@ -0,0 +1,17 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+typedef unsigned int v4si __attribute__ ((vector_size (16)));
+
+__attribute__((noinline, noipa))
+void fun(v4si *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-13.c b/gcc/testsuite/gcc.dg/bic-bitmask-13.c
new file mode 100644
index 0000000000000000000000000000000000000000..8ebaa30238c761b8831685209a7490f06591c000
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-13.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O0 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-14.c b/gcc/testsuite/gcc.dg/bic-bitmask-14.c
new file mode 100644
index 0000000000000000000000000000000000000000..0a2789433f8bc45a590d136179b8ee4ec5cda1c1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-14.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-15.c b/gcc/testsuite/gcc.dg/bic-bitmask-15.c
new file mode 100644
index 0000000000000000000000000000000000000000..53a2c986b00f159ae5fa839798850ac42e9b9504
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-15.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {=\s* 1} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-16.c b/gcc/testsuite/gcc.dg/bic-bitmask-16.c
new file mode 100644
index 0000000000000000000000000000000000000000..fff4670e2a47106c614c1224b8e8aac091d6e821
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-16.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s* 255} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-17.c b/gcc/testsuite/gcc.dg/bic-bitmask-17.c
new file mode 100644
index 0000000000000000000000000000000000000000..7b877fde017de0fb7aeabd4152a1593f07d52e71
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-17.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-18.c b/gcc/testsuite/gcc.dg/bic-bitmask-18.c
new file mode 100644
index 0000000000000000000000000000000000000000..8455bf5286a7cc9f08713489e29b5f2b6f6fc012
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-18.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {= 0} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-19.c b/gcc/testsuite/gcc.dg/bic-bitmask-19.c
new file mode 100644
index 0000000000000000000000000000000000000000..ad266cec0d10a1afc0bdacd8762d328bff1f6ea4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-19.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s* 1} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967294} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-2.c b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-20.c b/gcc/testsuite/gcc.dg/bic-bitmask-20.c
new file mode 100644
index 0000000000000000000000000000000000000000..befd25cb4aac6fe206110e7ca80816dd6fc0ed94
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-20.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-not {<=\s* 4294967289} dce7 } } */
+/* { dg-final { scan-tree-dump {&\s* 4294967290} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-21.c b/gcc/testsuite/gcc.dg/bic-bitmask-21.c
new file mode 100644
index 0000000000000000000000000000000000000000..b037ffc248eef1509c642abb0087b77882679150
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-21.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#define TYPE int32_t
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {<=\s* 255} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967290} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-22.c b/gcc/testsuite/gcc.dg/bic-bitmask-22.c
new file mode 100644
index 0000000000000000000000000000000000000000..20027a6367a1dcec12492435f260250a7b54aca1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-22.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {>\s* 255} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-23.c b/gcc/testsuite/gcc.dg/bic-bitmask-23.c
new file mode 100644
index 0000000000000000000000000000000000000000..9aef4baed796cf7bf4998d1b4f4ddfbc457add29
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-23.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+typedef unsigned int v4si __attribute__ ((vector_size (16)));
+
+__attribute__((noinline, noipa))
+v4si fun(v4si x)
+{
+    v4si mask = { 255, 15, 1, 0xFFFF };
+    v4si zeros = {0};
+    return (x & ~mask) == zeros;
+}
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255, 15, 1, 65535 \}} dce7 } } */
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-3.c b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-4.c b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..1bcf23ccf1447d6c8c999ed1eb25ba0a450028e1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {=\s*.+\{ 1,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-5.c b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..d053727076abedefdecfda7c4fea6f92d54a94a5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-6.c b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..018e7a4348c9fc461106c3d9d01291325d3406c2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-7.c b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
new file mode 100644
index 0000000000000000000000000000000000000000..798678fb7555052c93abc4ca34f617d640f73bb4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {__builtin_memset} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-8.c b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
new file mode 100644
index 0000000000000000000000000000000000000000..1dabe834ed57dfa0be48c1dc3dbb226092c79a1a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s*.+\{ 1,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967294,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-9.c b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
new file mode 100644
index 0000000000000000000000000000000000000000..9c1f8ee0adfc45d1b9fc212138ea26bb6b693e49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-not {<=\s*.+\{ 4294967289,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump {&\s*.+\{ 4294967290,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask.h b/gcc/testsuite/gcc.dg/bic-bitmask.h
new file mode 100644
index 0000000000000000000000000000000000000000..faf80b974db07a7d817a615cc64a35f1020e9764
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask.h
@@ -0,0 +1,43 @@
+#include <stdio.h>
+
+#ifndef N
+#define N 65
+#endif
+
+#ifndef TYPE
+#define TYPE uint32_t
+#endif
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N);
+  fun2 (b, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
diff --git a/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..568c1ffc8bc4148efaeeba7a45a75ecbd3a7a3dd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 -save-temps" } */
+
+#include <arm_neon.h>
+
+uint32x4_t foo (int32x4_t a)
+{
+  int32x4_t cst = vdupq_n_s32 (255);
+  int32x4_t zero = vdupq_n_s32 (0);
+  return vceqq_s32 (vbicq_s32 (a, cst), zero);
+}
+
+/* { dg-final { scan-assembler-not {\tbic\t} { xfail { aarch64*-*-* } } } } */
diff --git a/gcc/tree.h b/gcc/tree.h
index 905417fd17b1721f6c65d3cdf66a5fd5f920de7f..3583d51409cad9471fe45de64806e78d855bd97c 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -4862,6 +4862,11 @@ extern bool integer_minus_onep (const_tree);
 
 extern bool integer_pow2p (const_tree);
 
+/* If T is an integer constant or a constant integer vector and every element
+   E satisfies ~E + 1 == pow2, return ~E; otherwise return NULL_TREE.  */
+
+extern tree bitmask_inv_cst_vector_p (tree);
+
 /* integer_nonzerop (tree x) is nonzero if X is an integer constant
    with a nonzero value.  */
 
diff --git a/gcc/tree.c b/gcc/tree.c
index cba3bca41b3a50557939267b7a06df04b5d042b7..801f121a368b088d0f769f16f2ed29e320e71556 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -10246,6 +10246,59 @@ uniform_integer_cst_p (tree t)
   return NULL_TREE;
 }
 
+/* If T is an integer constant or a constant integer vector and every element
+   E satisfies ~E + 1 == pow2, return ~E; otherwise return NULL_TREE.  */
+
+tree
+bitmask_inv_cst_vector_p (tree t)
+{
+
+  tree_code code = TREE_CODE (t);
+  tree type = TREE_TYPE (t);
+
+  if (!INTEGRAL_TYPE_P (type)
+      && !VECTOR_INTEGER_TYPE_P (type))
+    return NULL_TREE;
+
+  unsigned HOST_WIDE_INT nelts = 1;
+  tree cst;
+  unsigned int idx = 0;
+  bool uniform = uniform_integer_cst_p (t);
+  tree newtype = unsigned_type_for (type);
+  tree_vector_builder builder;
+  if (code == INTEGER_CST)
+    cst = t;
+  else
+    {
+      if (!VECTOR_CST_NELTS (t).is_constant (&nelts))
+	return NULL_TREE;
+
+      cst = vector_cst_elt (t, 0);
+      builder.new_vector (newtype, nelts, 1);
+    }
+
+  tree ty = unsigned_type_for (TREE_TYPE (cst));
+
+  do {
+    if (idx > 0)
+      cst = vector_cst_elt (t, idx);
+    wide_int icst = wi::to_wide (cst);
+    wide_int inv = wi::bit_not (icst);
+    icst = wi::add (1, inv);
+    if (wi::popcount (icst) != 1)
+      return NULL_TREE;
+
+    tree newcst = wide_int_to_tree (ty, inv);
+
+    if (uniform)
+      return build_uniform_cst (newtype, newcst);
+
+    builder.quick_push (newcst);
+  } while (++idx < nelts);
+
+  return builder.build ();
+}
+
 /* If VECTOR_CST T has a single nonzero element, return the index of that
    element, otherwise return -1.  */


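For reference, the identity behind the match.pd rule can be checked outside the compiler.  The following is only a minimal sketch in plain C on uint32_t, not the code in the patch; the function names inverted_is_low_bitmask and rewrite_agrees are hypothetical and chosen for illustration.  It mirrors the per-element test (~E + 1 has exactly one bit set) and compares (X & Y) ==/!= 0 against X <=/> ~Y over a range of inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 when ~e + 1 has exactly one bit set, i.e. ~e has the form
   2^k - 1.  Hypothetical helper mirroring the per-element check.  */
int
inverted_is_low_bitmask (uint32_t e)
{
  uint32_t v = ~e + 1u;
  return v != 0 && (v & (v - 1u)) == 0;
}

/* Return 1 if, for every x in [0, limit), the masked comparisons against
   zero agree with the unsigned comparisons against z = ~y.  */
int
rewrite_agrees (uint32_t y, uint32_t limit)
{
  uint32_t z = ~y;
  for (uint32_t x = 0; x < limit; x++)
    {
      if (((x & y) == 0) != (x <= z))
	return 0;
      if (((x & y) != 0) != (x > z))
	return 0;
    }
  return 1;
}

/* Small self-check covering the masks used in the tests.  */
void
self_check (void)
{
  assert (inverted_is_low_bitmask (~255u));  /* ~(~255) + 1 == 256.  */
  assert (inverted_is_low_bitmask (~1u));    /* ~(~1) + 1 == 2.  */
  assert (!inverted_is_low_bitmask (~5u));   /* 5 + 1 == 6, not pow2.  */
  assert (rewrite_agrees (~255u, 100000u));
  assert (!rewrite_agrees (~5u, 16u));       /* x == 2 disagrees.  */
}
```

Calling self_check () from a small driver exercises both the accepted masks (~255, ~1) and the rejected one (~5), which is the case the bic-bitmask-9.c and bic-bitmask-20.c tests expect to be left alone.
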
[-- Attachment #2: rb14896.patch --]
[-- Type: application/octet-stream, Size: 24618 bytes --]

diff --git a/gcc/match.pd b/gcc/match.pd
index 0fcfd0ea62c043dc217d0d560ce5b7e569b70e7d..7d2a24dbc5e9644a09968f877e12a824d8ba1caa 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -37,7 +37,8 @@ along with GCC; see the file COPYING3.  If not see
    integer_pow2p
    uniform_integer_cst_p
    HONOR_NANS
-   uniform_vector_p)
+   uniform_vector_p
+   bitmask_inv_cst_vector_p)
 
 /* Operator lists.  */
 (define_operator_list tcc_comparison
@@ -4900,6 +4901,24 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
       (eqcmp (bit_and @1 { wide_int_to_tree (ty, mask - rhs); })
 	     { build_zero_cst (ty); }))))))
 
+/* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
+   where ~Y + 1 == pow2 and Z = ~Y.  */
+(for cst (VECTOR_CST INTEGER_CST)
+ (for cmp (le eq ne ge gt)
+      icmp (le le gt le gt)
+ (simplify
+  (cmp (bit_and:c@2 @0 cst@1) integer_zerop)
+   (with { tree csts = bitmask_inv_cst_vector_p (@1); }
+     (switch
+      (if (csts && TYPE_UNSIGNED (TREE_TYPE (@1))
+	   && (VECTOR_TYPE_P (TREE_TYPE (@1)) || single_use (@2)))
+       (icmp @0 { csts; }))
+      (if (csts && !TYPE_UNSIGNED (TREE_TYPE (@1))
+	   && (cmp == EQ_EXPR || cmp == NE_EXPR)
+	   && (VECTOR_TYPE_P (TREE_TYPE (@1)) || single_use (@2)))
+       (with { tree utype = unsigned_type_for (TREE_TYPE (@1)); }
+	(icmp (convert:utype @0) { csts; }))))))))
+
 /* -A CMP -B -> B CMP A.  */
 (for cmp (tcc_comparison)
      scmp (swapped_tcc_comparison)
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-10.c b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
new file mode 100644
index 0000000000000000000000000000000000000000..76a22a2313137a2a75dd711c2c15c2d3a34e15aa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
@@ -0,0 +1,26 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#define TYPE int32_t
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-11.c b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
new file mode 100644
index 0000000000000000000000000000000000000000..32553d7ba2f823f7a21237451990d0a216d2f912
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {>\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-12.c b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
new file mode 100644
index 0000000000000000000000000000000000000000..e10cbf7fabe2dbf7ce436cdf37b0f8b207c58408
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
@@ -0,0 +1,17 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+typedef unsigned int v4si __attribute__ ((vector_size (16)));
+
+__attribute__((noinline, noipa))
+void fun(v4si *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-13.c b/gcc/testsuite/gcc.dg/bic-bitmask-13.c
new file mode 100644
index 0000000000000000000000000000000000000000..8ebaa30238c761b8831685209a7490f06591c000
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-13.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O0 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-14.c b/gcc/testsuite/gcc.dg/bic-bitmask-14.c
new file mode 100644
index 0000000000000000000000000000000000000000..0a2789433f8bc45a590d136179b8ee4ec5cda1c1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-14.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-15.c b/gcc/testsuite/gcc.dg/bic-bitmask-15.c
new file mode 100644
index 0000000000000000000000000000000000000000..53a2c986b00f159ae5fa839798850ac42e9b9504
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-15.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {=\s* 1} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-16.c b/gcc/testsuite/gcc.dg/bic-bitmask-16.c
new file mode 100644
index 0000000000000000000000000000000000000000..fff4670e2a47106c614c1224b8e8aac091d6e821
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-16.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s* 255} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-17.c b/gcc/testsuite/gcc.dg/bic-bitmask-17.c
new file mode 100644
index 0000000000000000000000000000000000000000..7b877fde017de0fb7aeabd4152a1593f07d52e71
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-17.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-18.c b/gcc/testsuite/gcc.dg/bic-bitmask-18.c
new file mode 100644
index 0000000000000000000000000000000000000000..8455bf5286a7cc9f08713489e29b5f2b6f6fc012
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-18.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {= 0} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-19.c b/gcc/testsuite/gcc.dg/bic-bitmask-19.c
new file mode 100644
index 0000000000000000000000000000000000000000..ad266cec0d10a1afc0bdacd8762d328bff1f6ea4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-19.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s* 1} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967294} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-2.c b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-20.c b/gcc/testsuite/gcc.dg/bic-bitmask-20.c
new file mode 100644
index 0000000000000000000000000000000000000000..befd25cb4aac6fe206110e7ca80816dd6fc0ed94
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-20.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-not {<=\s* 4294967289} dce7 } } */
+/* { dg-final { scan-tree-dump {&\s* 4294967290} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-21.c b/gcc/testsuite/gcc.dg/bic-bitmask-21.c
new file mode 100644
index 0000000000000000000000000000000000000000..b037ffc248eef1509c642abb0087b77882679150
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-21.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(int32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#define TYPE int32_t
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {<=\s* 255} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967290} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-22.c b/gcc/testsuite/gcc.dg/bic-bitmask-22.c
new file mode 100644
index 0000000000000000000000000000000000000000..20027a6367a1dcec12492435f260250a7b54aca1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-22.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O0")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump {>\s* 255} dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-23.c b/gcc/testsuite/gcc.dg/bic-bitmask-23.c
new file mode 100644
index 0000000000000000000000000000000000000000..9aef4baed796cf7bf4998d1b4f4ddfbc457add29
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-23.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+typedef unsigned int v4si __attribute__ ((vector_size (16)));
+
+__attribute__((noinline, noipa))
+v4si fun(v4si x)
+{
+    v4si mask = { 255, 15, 1, 0xFFFF };
+    v4si zeros = {0};
+    return (x & ~mask) == zeros;
+}
+
+/* { dg-final { scan-tree-dump {<=\s*.+\{ 255, 15, 1, 65535 \}} dce7 } } */
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-3.c b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-4.c b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..1bcf23ccf1447d6c8c999ed1eb25ba0a450028e1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) >= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {=\s*.+\{ 1,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-5.c b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..d053727076abedefdecfda7c4fea6f92d54a94a5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) > 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-6.c b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..018e7a4348c9fc461106c3d9d01291325d3406c2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~255)) <= 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-7.c b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
new file mode 100644
index 0000000000000000000000000000000000000000..798678fb7555052c93abc4ca34f617d640f73bb4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) < 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {__builtin_memset} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-8.c b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
new file mode 100644
index 0000000000000000000000000000000000000000..1dabe834ed57dfa0be48c1dc3dbb226092c79a1a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~1)) != 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-times {>\s*.+\{ 1,.+\}} 1 dce7 } } */
+/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967294,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-9.c b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
new file mode 100644
index 0000000000000000000000000000000000000000..9c1f8ee0adfc45d1b9fc212138ea26bb6b693e49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
@@ -0,0 +1,25 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
+
+#include <stdint.h>
+
+__attribute__((noinline, noipa))
+void fun1(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+__attribute__((noinline, noipa, optimize("O1")))
+void fun2(uint32_t *x, int n)
+{
+    for (int i = 0; i < (n & -16); i++)
+      x[i] = (x[i]&(~5)) == 0;
+}
+
+#include "bic-bitmask.h"
+
+/* { dg-final { scan-tree-dump-not {<=\s*.+\{ 4294967289,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump {&\s*.+\{ 4294967290,.+\}} dce7 } } */
+/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
+
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask.h b/gcc/testsuite/gcc.dg/bic-bitmask.h
new file mode 100644
index 0000000000000000000000000000000000000000..faf80b974db07a7d817a615cc64a35f1020e9764
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/bic-bitmask.h
@@ -0,0 +1,43 @@
+#include <stdio.h>
+
+#ifndef N
+#define N 65
+#endif
+
+#ifndef TYPE
+#define TYPE uint32_t
+#endif
+
+#ifndef DEBUG
+#define DEBUG 0
+#endif
+
+#define BASE ((TYPE) -1 < 0 ? -126 : 4)
+
+int main ()
+{
+  TYPE a[N];
+  TYPE b[N];
+
+  for (int i = 0; i < N; ++i)
+    {
+      a[i] = BASE + i * 13;
+      b[i] = BASE + i * 13;
+      if (DEBUG)
+        printf ("%d: 0x%x\n", i, a[i]);
+    }
+
+  fun1 (a, N);
+  fun2 (b, N);
+
+  for (int i = 0; i < N; ++i)
+    {
+      if (DEBUG)
+        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
+
+      if (a[i] != b[i])
+        __builtin_abort ();
+    }
+  return 0;
+}
+
diff --git a/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..568c1ffc8bc4148efaeeba7a45a75ecbd3a7a3dd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options "-O2 -save-temps" } */
+
+#include <arm_neon.h>
+
+uint32x4_t foo (int32x4_t a)
+{
+  int32x4_t cst = vdupq_n_s32 (255);
+  int32x4_t zero = vdupq_n_s32 (0);
+  return vceqq_s32 (vbicq_s32 (a, cst), zero);
+}
+
+/* { dg-final { scan-assembler-not {\tbic\t} { xfail { aarch64*-*-* } } } } */
diff --git a/gcc/tree.h b/gcc/tree.h
index 905417fd17b1721f6c65d3cdf66a5fd5f920de7f..3583d51409cad9471fe45de64806e78d855bd97c 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -4862,6 +4862,11 @@ extern bool integer_minus_onep (const_tree);
 
 extern bool integer_pow2p (const_tree);
 
+/* If T is an integer constant or a constant vector in which every element E
+   satisfies ~E + 1 == pow2, return ~E; otherwise return NULL_TREE.  */
+
+extern tree bitmask_inv_cst_vector_p (tree);
+
 /* integer_nonzerop (tree x) is nonzero if X is an integer constant
    with a nonzero value.  */
 
diff --git a/gcc/tree.c b/gcc/tree.c
index cba3bca41b3a50557939267b7a06df04b5d042b7..801f121a368b088d0f769f16f2ed29e320e71556 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -10246,6 +10246,59 @@ uniform_integer_cst_p (tree t)
   return NULL_TREE;
 }
 
+/* If T is an integer constant or a constant vector in which every element E
+   satisfies ~E + 1 == pow2, return ~E; otherwise return NULL_TREE.  */
+
+tree
+bitmask_inv_cst_vector_p (tree t)
+{
+
+  tree_code code = TREE_CODE (t);
+  tree type = TREE_TYPE (t);
+
+  if (!INTEGRAL_TYPE_P (type)
+      && !VECTOR_INTEGER_TYPE_P (type))
+    return NULL_TREE;
+
+  unsigned HOST_WIDE_INT nelts = 1;
+  tree cst;
+  unsigned int idx = 0;
+  bool uniform = uniform_integer_cst_p (t);
+  tree newtype = unsigned_type_for (type);
+  tree_vector_builder builder;
+  if (code == INTEGER_CST)
+    cst = t;
+  else
+    {
+      if (!VECTOR_CST_NELTS (t).is_constant (&nelts))
+	return NULL_TREE;
+
+      cst = vector_cst_elt (t, 0);
+      builder.new_vector (newtype, nelts, 1);
+    }
+
+  tree ty = unsigned_type_for (TREE_TYPE (cst));
+
+  do {
+    if (idx > 0)
+      cst = vector_cst_elt (t, idx);
+    wide_int icst = wi::to_wide (cst);
+    wide_int inv = wi::bit_not (icst);
+    icst = wi::add (1, inv);
+    if (wi::popcount (icst) != 1)
+      return NULL_TREE;
+
+    tree newcst = wide_int_to_tree (ty, inv);
+
+    if (uniform)
+      return build_uniform_cst (newtype, newcst);
+
+    builder.quick_push (newcst);
+  } while (++idx < nelts);
+
+  return builder.build ();
+}
+
 /* If VECTOR_CST T has a single nonzero element, return the index of that
    element, otherwise return -1.  */
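
For readers following the tree.c hunk above, here is a scalar sketch of the
condition that bitmask_inv_cst_vector_p tests and of the comparison rewrite
it is meant to enable.  The helper names are hypothetical and are not part of
GCC; the sketch assumes 32-bit unsigned elements and only models the
arithmetic, not the tree building.  It is consistent with the expectations in
the tests: bic-bitmask-8.c (mask ~1) looks for a "> { 1, ... }" comparison,
while bic-bitmask-9.c (mask ~5) expects the AND to remain because 5 + 1 is not
a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Scalar model of the check: E qualifies when ~E + 1 has exactly one bit
   set.  On success *INV receives ~E.  Hypothetical helper, not a GCC API.  */
bool
bitmask_inv_u32 (uint32_t e, uint32_t *inv)
{
  uint32_t not_e = ~e;
  uint32_t p = not_e + 1u;	/* Wraps to 0 when E == 0.  */
  if (p == 0 || (p & (p - 1)) != 0)
    return false;		/* Zero or more than one bit set.  */
  *inv = not_e;
  return true;
}

/* For a qualifying mask E, (x & E) != 0 is equivalent to the unsigned
   comparison x > ~E.  Checked here over a small range of X only.  */
void
check_bitmask_equivalence (uint32_t e)
{
  uint32_t inv;
  if (!bitmask_inv_u32 (e, &inv))
    return;
  for (uint32_t x = 0; x < 1024; x++)
    assert (((x & e) != 0) == (x > inv));
}
```

Calling check_bitmask_equivalence (~1u) exercises the mask used in
bic-bitmask-8.c; a mask such as ~5u is rejected by bitmask_inv_u32 and so is
left alone.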
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow
  2021-09-30  8:50   ` Kyrylo Tkachov
@ 2021-10-06 14:32     ` Richard Sandiford
  2021-10-12 16:18       ` Tamar Christina
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Sandiford @ 2021-10-06 14:32 UTC (permalink / raw)
  To: Kyrylo Tkachov
  Cc: Tamar Christina, gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft

(Nice optimisations!)

Kyrylo Tkachov <Kyrylo.Tkachov@arm.com> writes:
> Hi Tamar,
>
>> -----Original Message-----
>> From: Tamar Christina <Tamar.Christina@arm.com>
>> Sent: Wednesday, September 29, 2021 5:19 PM
>> To: gcc-patches@gcc.gnu.org
>> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
>> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
>> <Kyrylo.Tkachov@arm.com>; Richard Sandiford
>> <Richard.Sandiford@arm.com>
>> Subject: [PATCH 1/7]AArch64 Add combine patterns for right shift and
>> narrow
>>
>> Hi All,
>>
>> This adds a simple pattern for combining right shifts and narrows into
>> shifted narrows.
>>
>> i.e.
>>
>> typedef short int16_t;
>> typedef unsigned short uint16_t;
>>
>> void foo (uint16_t * restrict a, int16_t * restrict d, int n)
>> {
>>     for( int i = 0; i < n; i++ )
>>       d[i] = (a[i] * a[i]) >> 10;
>> }
>>
>> now generates:
>>
>> .L4:
>>         ldr     q0, [x0, x3]
>>         umull   v1.4s, v0.4h, v0.4h
>>         umull2  v0.4s, v0.8h, v0.8h
>>         shrn    v1.4h, v1.4s, 10
>>         shrn2   v1.8h, v0.4s, 10
>>         str     q1, [x1, x3]
>>         add     x3, x3, 16
>>         cmp     x4, x3
>>         bne     .L4
>>
>> instead of:
>>
>> .L4:
>>         ldr     q0, [x0, x3]
>>         umull   v1.4s, v0.4h, v0.4h
>>         umull2  v0.4s, v0.8h, v0.8h
>>         sshr    v1.4s, v1.4s, 10
>>         sshr    v0.4s, v0.4s, 10
>>         xtn     v1.4h, v1.4s
>>         xtn2    v1.8h, v0.4s
>>         str     q1, [x1, x3]
>>         add     x3, x3, 16
>>         cmp     x4, x3
>>         bne     .L4
>>
>> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>>
>> Ok for master?
>>
>> Thanks,
>> Tamar
>>
>> gcc/ChangeLog:
>>
>>       * config/aarch64/aarch64-simd.md
>> (*aarch64_<srn_op>shrn<mode>_vect,
>>       *aarch64_<srn_op>shrn<mode>2_vect): New.
>>       * config/aarch64/iterators.md (srn_op): New.
>>
>> gcc/testsuite/ChangeLog:
>>
>>       * gcc.target/aarch64/shrn-combine.c: New test.
>>
>> --- inline copy of patch --
>> diff --git a/gcc/config/aarch64/aarch64-simd.md
>> b/gcc/config/aarch64/aarch64-simd.md
>> index
>> 48eddf64e05afe3788abfa05141f6544a9323ea1..d7b6cae424622d259f97a3d5
>> fa9093c0fb0bd5ce 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -1818,6 +1818,28 @@ (define_insn "aarch64_shrn<mode>_insn_be"
>>    [(set_attr "type" "neon_shift_imm_narrow_q")]
>>  )
>>
>> +(define_insn "*aarch64_<srn_op>shrn<mode>_vect"
>> +  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=w")
>> +        (truncate:<VNARROWQ>
>> +          (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
>> +            (match_operand:VQN 2
>> "aarch64_simd_shift_imm_vec_<vn_mode>"))))]
>> +  "TARGET_SIMD"
>> +  "shrn\\t%0.<Vntype>, %1.<Vtype>, %2"
>> +  [(set_attr "type" "neon_shift_imm_narrow_q")]
>> +)
>> +
>> +(define_insn "*aarch64_<srn_op>shrn<mode>2_vect"
>> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
>> +     (vec_concat:<VNARROWQ2>
>> +       (match_operand:<VNARROWQ> 1 "register_operand" "0")
>> +       (truncate:<VNARROWQ>
>> +         (SHIFTRT:VQN (match_operand:VQN 2 "register_operand" "w")
>> +           (match_operand:VQN 3
>> "aarch64_simd_shift_imm_vec_<vn_mode>")))))]
>> +  "TARGET_SIMD"
>> +  "shrn2\\t%0.<V2ntype>, %2.<Vtype>, %3"
>> +  [(set_attr "type" "neon_shift_imm_narrow_q")]
>> +)
>
> I think this needs to be guarded on !BYTES_BIG_ENDIAN and a similar pattern added for BYTES_BIG_ENDIAN with the vec_concat operands swapped around.
> This is similar to the aarch64_xtn2<mode>_insn_be pattern, for example.

Yeah.  I think that applies to 2/7 and 4/7 too.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
  2021-09-30  9:27   ` Kyrylo Tkachov
@ 2021-10-11 19:56     ` Andrew Pinski
  2021-10-12 12:19       ` Kyrylo Tkachov
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Pinski @ 2021-10-11 19:56 UTC (permalink / raw)
  To: Kyrylo Tkachov
  Cc: Tamar Christina, gcc-patches, apinski, Richard Earnshaw, nd,
	Marcus Shawcroft, Richard Sandiford

On Thu, Sep 30, 2021 at 2:28 AM Kyrylo Tkachov via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
> > -----Original Message-----
> > From: Tamar Christina <Tamar.Christina@arm.com>
> > Sent: Wednesday, September 29, 2021 5:20 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> > Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> > <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> > <Richard.Sandiford@arm.com>
> > Subject: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
> >
> > Hi All,
> >
> > This optimizes signed right shift by BITSIZE-1 into a cmlt operation which is
> > more optimal because generally compares have a higher throughput than
> > shifts.
> >
> > On AArch64 the result of the shift would have been either -1 or 0 which is the
> > results of the compare.
> >
> > i.e.
> >
> > void e (int * restrict a, int *b, int n)
> > {
> >     for (int i = 0; i < n; i++)
> >       b[i] = a[i] >> 31;
> > }
> >
> > now generates:
> >
> > .L4:
> >         ldr     q0, [x0, x3]
> >         cmlt    v0.4s, v0.4s, #0
> >         str     q0, [x1, x3]
> >         add     x3, x3, 16
> >         cmp     x4, x3
> >         bne     .L4
> >
> > instead of:
> >
> > .L4:
> >         ldr     q0, [x0, x3]
> >         sshr    v0.4s, v0.4s, 31
> >         str     q0, [x1, x3]
> >         add     x3, x3, 16
> >         cmp     x4, x3
> >         bne     .L4
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
>
> This should be okay (either a win or neutral) for Arm Cortex and Neoverse cores so I'm tempted to not ask for a CPU-specific tunable to guard it to keep the code clean.
> Andrew, would this change be okay from a Thunder X line perspective?

I don't know about ThunderX2 but here are the details for ThunderX1
(and OcteonX1) and OcteonX2:
The sshr and cmlt are handled the same in the pipeline as far as I can tell.

Thanks,
Andrew



> Thanks,
> Kyrill
>
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >       * config/aarch64/aarch64-simd.md (aarch64_simd_ashr<mode>):
> > Add case cmp
> >       case.
> >       * config/aarch64/constraints.md (D1): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/aarch64/shl-combine-2.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index
> > 300bf001b59ca7fa197c580b10adb7f70f20d1e0..19b2d0ad4dab4d574269829
> > 7ded861228ee22007 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -1127,12 +1127,14 @@ (define_insn "aarch64_simd_lshr<mode>"
> >  )
> >
> >  (define_insn "aarch64_simd_ashr<mode>"
> > - [(set (match_operand:VDQ_I 0 "register_operand" "=w")
> > -       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
> > -                  (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm"
> > "Dr")))]
> > + [(set (match_operand:VDQ_I 0 "register_operand" "=w,w")
> > +       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w,w")
> > +                  (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm"
> > "D1,Dr")))]
> >   "TARGET_SIMD"
> > - "sshr\t%0.<Vtype>, %1.<Vtype>, %2"
> > -  [(set_attr "type" "neon_shift_imm<q>")]
> > + "@
> > +  cmlt\t%0.<Vtype>, %1.<Vtype>, #0
> > +  sshr\t%0.<Vtype>, %1.<Vtype>, %2"
> > +  [(set_attr "type" "neon_compare<q>,neon_shift_imm<q>")]
> >  )
> >
> >  (define_insn "*aarch64_simd_sra<mode>"
> > diff --git a/gcc/config/aarch64/constraints.md
> > b/gcc/config/aarch64/constraints.md
> > index
> > 3b49b452119c49320020fa9183314d9a25b92491..18630815ffc13f2168300a89
> > 9db69fd428dfb0d6 100644
> > --- a/gcc/config/aarch64/constraints.md
> > +++ b/gcc/config/aarch64/constraints.md
> > @@ -437,6 +437,14 @@ (define_constraint "Dl"
> >        (match_test "aarch64_simd_shift_imm_p (op, GET_MODE (op),
> >                                                true)")))
> >
> > +(define_constraint "D1"
> > +  "@internal
> > + A constraint that matches vector of immediates that is bits(mode)-1."
> > + (and (match_code "const,const_vector")
> > +      (match_test "aarch64_const_vec_all_same_in_range_p (op,
> > +                     GET_MODE_UNIT_BITSIZE (mode) - 1,
> > +                     GET_MODE_UNIT_BITSIZE (mode) - 1)")))
> > +
> >  (define_constraint "Dr"
> >    "@internal
> >   A constraint that matches vector of immediates for right shifts."
> > diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> > b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> > new file mode 100644
> > index
> > 0000000000000000000000000000000000000000..bdfe35d09ffccc7928947c9e
> > 57f1034f7ca2c798
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do assemble } */
> > +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> > +
> > +void e (int * restrict a, int *b, int n)
> > +{
> > +    for (int i = 0; i < n; i++)
> > +      b[i] = a[i] >> 31;
> > +}
> > +
> > +/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
> > +/* { dg-final { scan-assembler-not {\tsshr\t} } } */
> > +
> >
> >
> > --
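
As a scalar illustration of why the two alternatives in
aarch64_simd_ashr<mode> produce the same value when the shift amount is
BITSIZE-1, the sketch below models one 32-bit lane.  The function names are
hypothetical.  It assumes that >> on a negative int32_t is an arithmetic
shift, which GCC documents but ISO C leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane model of "sshr v.4s, v.4s, 31".  Assumes arithmetic right shift
   of negative values (true for GCC).  */
int32_t
lane_sshr31 (int32_t x)
{
  return x >> 31;
}

/* Per-lane model of "cmlt v.4s, v.4s, #0": all ones if X < 0, else zero.  */
int32_t
lane_cmlt0 (int32_t x)
{
  return x < 0 ? -1 : 0;
}

/* Compare the two models on a few boundary values.  */
void
check_sshr_cmlt (void)
{
  static const int32_t samples[] =
    { INT32_MIN, -2, -1, 0, 1, 2, INT32_MAX };
  for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert (lane_sshr31 (samples[i]) == lane_cmlt0 (samples[i]));
}
```

The sign bit is replicated into every position by the shift, which is exactly
the all-ones or all-zeros mask that the compare produces.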

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
  2021-10-11 19:56     ` Andrew Pinski
@ 2021-10-12 12:19       ` Kyrylo Tkachov
  2021-10-12 16:20         ` Tamar Christina
  0 siblings, 1 reply; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-10-12 12:19 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Tamar Christina, gcc-patches, apinski, Richard Earnshaw, nd,
	Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Andrew Pinski <pinskia@gmail.com>
> Sent: Monday, October 11, 2021 8:56 PM
> To: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; gcc-patches@gcc.gnu.org;
> apinski@marvell.com; Richard Earnshaw <Richard.Earnshaw@arm.com>; nd
> <nd@arm.com>; Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Richard
> Sandiford <Richard.Sandiford@arm.com>
> Subject: Re: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
> 
> On Thu, Sep 30, 2021 at 2:28 AM Kyrylo Tkachov via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> > > -----Original Message-----
> > > From: Tamar Christina <Tamar.Christina@arm.com>
> > > Sent: Wednesday, September 29, 2021 5:20 PM
> > > To: gcc-patches@gcc.gnu.org
> > > Cc: nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>;
> > > Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> > > <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> > > <Richard.Sandiford@arm.com>
> > > Subject: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
> > >
> > > Hi All,
> > >
> > > This optimizes signed right shift by BITSIZE-1 into a cmlt operation which
> is
> > > more optimal because generally compares have a higher throughput than
> > > shifts.
> > >
> > > On AArch64 the result of the shift would have been either -1 or 0 which is
> the
> > > results of the compare.
> > >
> > > i.e.
> > >
> > > void e (int * restrict a, int *b, int n)
> > > {
> > >     for (int i = 0; i < n; i++)
> > >       b[i] = a[i] >> 31;
> > > }
> > >
> > > now generates:
> > >
> > > .L4:
> > >         ldr     q0, [x0, x3]
> > >         cmlt    v0.4s, v0.4s, #0
> > >         str     q0, [x1, x3]
> > >         add     x3, x3, 16
> > >         cmp     x4, x3
> > >         bne     .L4
> > >
> > > instead of:
> > >
> > > .L4:
> > >         ldr     q0, [x0, x3]
> > >         sshr    v0.4s, v0.4s, 31
> > >         str     q0, [x1, x3]
> > >         add     x3, x3, 16
> > >         cmp     x4, x3
> > >         bne     .L4
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > >
> > > Ok for master?
> >
> > This should be okay (either a win or neutral) for Arm Cortex and Neoverse
> cores so I'm tempted to not ask for a CPU-specific tunable to guard it to keep
> the code clean.
> > Andrew, would this change be okay from a Thunder X line perspective?
> 
> I don't know about ThunderX2 but here are the details for ThunderX1
> (and OcteonX1) and OcteonX2:
> The sshr and cmlt are handled the same in the pipeline as far as I can tell.
> 

Thanks for the info.
This patch is ok.
Kyrill

> Thanks,
> Andrew
> 
> 
> 
> > Thanks,
> > Kyrill
> >
> > >
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >       * config/aarch64/aarch64-simd.md (aarch64_simd_ashr<mode>):
> > > Add case cmp
> > >       case.
> > >       * config/aarch64/constraints.md (D1): New.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >       * gcc.target/aarch64/shl-combine-2.c: New test.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > > b/gcc/config/aarch64/aarch64-simd.md
> > > index
> > >
> 300bf001b59ca7fa197c580b10adb7f70f20d1e0..19b2d0ad4dab4d574269829
> > > 7ded861228ee22007 100644
> > > --- a/gcc/config/aarch64/aarch64-simd.md
> > > +++ b/gcc/config/aarch64/aarch64-simd.md
> > > @@ -1127,12 +1127,14 @@ (define_insn "aarch64_simd_lshr<mode>"
> > >  )
> > >
> > >  (define_insn "aarch64_simd_ashr<mode>"
> > > - [(set (match_operand:VDQ_I 0 "register_operand" "=w")
> > > -       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
> > > -                  (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm"
> > > "Dr")))]
> > > + [(set (match_operand:VDQ_I 0 "register_operand" "=w,w")
> > > +       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w,w")
> > > +                  (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm"
> > > "D1,Dr")))]
> > >   "TARGET_SIMD"
> > > - "sshr\t%0.<Vtype>, %1.<Vtype>, %2"
> > > -  [(set_attr "type" "neon_shift_imm<q>")]
> > > + "@
> > > +  cmlt\t%0.<Vtype>, %1.<Vtype>, #0
> > > +  sshr\t%0.<Vtype>, %1.<Vtype>, %2"
> > > +  [(set_attr "type" "neon_compare<q>,neon_shift_imm<q>")]
> > >  )
> > >
> > >  (define_insn "*aarch64_simd_sra<mode>"
> > > diff --git a/gcc/config/aarch64/constraints.md
> > > b/gcc/config/aarch64/constraints.md
> > > index
> > >
> 3b49b452119c49320020fa9183314d9a25b92491..18630815ffc13f2168300a89
> > > 9db69fd428dfb0d6 100644
> > > --- a/gcc/config/aarch64/constraints.md
> > > +++ b/gcc/config/aarch64/constraints.md
> > > @@ -437,6 +437,14 @@ (define_constraint "Dl"
> > >        (match_test "aarch64_simd_shift_imm_p (op, GET_MODE (op),
> > >                                                true)")))
> > >
> > > +(define_constraint "D1"
> > > +  "@internal
> > > + A constraint that matches vector of immediates that is bits(mode)-1."
> > > + (and (match_code "const,const_vector")
> > > +      (match_test "aarch64_const_vec_all_same_in_range_p (op,
> > > +                     GET_MODE_UNIT_BITSIZE (mode) - 1,
> > > +                     GET_MODE_UNIT_BITSIZE (mode) - 1)")))
> > > +
> > >  (define_constraint "Dr"
> > >    "@internal
> > >   A constraint that matches vector of immediates for right shifts."
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> > > b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> > > new file mode 100644
> > > index
> > >
> 0000000000000000000000000000000000000000..bdfe35d09ffccc7928947c9e
> > > 57f1034f7ca2c798
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> > > @@ -0,0 +1,12 @@
> > > +/* { dg-do assemble } */
> > > +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" }
> */
> > > +
> > > +void e (int * restrict a, int *b, int n)
> > > +{
> > > +    for (int i = 0; i < n; i++)
> > > +      b[i] = a[i] >> 31;
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
> > > +/* { dg-final { scan-assembler-not {\tsshr\t} } } */
> > > +
> > >
> > >
> > > --

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow
  2021-10-06 14:32     ` Richard Sandiford
@ 2021-10-12 16:18       ` Tamar Christina
  2021-10-12 16:35         ` Kyrylo Tkachov
  0 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-10-12 16:18 UTC (permalink / raw)
  To: Richard Sandiford, Kyrylo Tkachov
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft

[-- Attachment #1: Type: text/plain, Size: 6461 bytes --]

Hi All,

Here's a new version with big-endian support and more tests.

> >
> > I think this needs to be guarded on !BYTES_BIG_ENDIAN and a similar
> pattern added for BYTES_BIG_ENDIAN with the vec_concat operands
> swapped around.
> > This is similar to the aarch64_xtn2<mode>_insn_be pattern, for example.
> 
> Yeah.  I think that applies to 2/7 and 4/7 too.
> 

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_<srn_op>shrn<mode>_vect,
	*aarch64_<srn_op>shrn<mode>2_vect_le,
	*aarch64_<srn_op>shrn<mode>2_vect_be): New.
	* config/aarch64/iterators.md (srn_op): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shrn-combine-1.c: New test.
	* gcc.target/aarch64/shrn-combine-2.c: New test.
	* gcc.target/aarch64/shrn-combine-3.c: New test.
	* gcc.target/aarch64/shrn-combine-4.c: New test.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 48eddf64e05afe3788abfa05141f6544a9323ea1..5715db4e1e1386e724e4d4defd5e5ed9efd8a874 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1818,6 +1818,40 @@ (define_insn "aarch64_shrn<mode>_insn_be"
   [(set_attr "type" "neon_shift_imm_narrow_q")]
 )
 
+(define_insn "*aarch64_<srn_op>shrn<mode>_vect"
+  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=w")
+        (truncate:<VNARROWQ>
+          (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
+            (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_<vn_mode>"))))]
+  "TARGET_SIMD"
+  "shrn\\t%0.<Vntype>, %1.<Vtype>, %2"
+  [(set_attr "type" "neon_shift_imm_narrow_q")]
+)
+
+(define_insn "*aarch64_<srn_op>shrn<mode>2_vect_le"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+	  (match_operand:<VNARROWQ> 1 "register_operand" "0")
+	  (truncate:<VNARROWQ>
+	    (SHIFTRT:VQN (match_operand:VQN 2 "register_operand" "w")
+	      (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_<vn_mode>")))))]
+  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
+  "shrn2\\t%0.<V2ntype>, %2.<Vtype>, %3"
+  [(set_attr "type" "neon_shift_imm_narrow_q")]
+)
+
+(define_insn "*aarch64_<srn_op>shrn<mode>2_vect_be"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+	  (truncate:<VNARROWQ>
+	    (SHIFTRT:VQN (match_operand:VQN 2 "register_operand" "w")
+	      (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_<vn_mode>")))
+	  (match_operand:<VNARROWQ> 1 "register_operand" "0")))]
+  "TARGET_SIMD && BYTES_BIG_ENDIAN"
+  "shrn2\\t%0.<V2ntype>, %2.<Vtype>, %3"
+  [(set_attr "type" "neon_shift_imm_narrow_q")]
+)
+
 (define_expand "aarch64_shrn<mode>"
   [(set (match_operand:<VNARROWQ> 0 "register_operand")
 	(truncate:<VNARROWQ>
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index caa42f8f169fbf2cf46a90cf73dee05619acc300..8dbeed3b0d4a44cdc17dd333ed397b39a33f386a 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -2003,6 +2003,9 @@ (define_code_attr shift [(ashift "lsl") (ashiftrt "asr")
 ;; Op prefix for shift right and accumulate.
 (define_code_attr sra_op [(ashiftrt "s") (lshiftrt "u")])
 
+;; Op prefix for shift right and narrow.
+(define_code_attr srn_op [(ashiftrt "r") (lshiftrt "")])
+
 ;; Map shift operators onto underlying bit-field instructions
 (define_code_attr bfshift [(ashift "ubfiz") (ashiftrt "sbfx")
 			   (lshiftrt "ubfx") (rotatert "extr")])
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-1.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..a28524662edca8eb149e34c2242091b51a167b71
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-1.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE char
+
+void foo (unsigned TYPE * restrict a, TYPE * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = (a[i] * a[i]) >> 2;
+}
+
+/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tshrn2\t} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..012135b424f98abadc480e7ef13fcab080d99c28
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE short
+
+void foo (unsigned TYPE * restrict a, TYPE * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = (a[i] * a[i]) >> 2;
+}
+
+/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tshrn2\t} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..8b5b360de623b0ada0da1531795ba6b428c7f9e1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE int
+
+void foo (unsigned long long * restrict a, TYPE * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = a[i] >> 3;
+}
+
+/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tshrn2\t} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-4.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..fedca7621e2a82df0df9d12b91c5c0c9fd3dfc60
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-4.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE long long
+
+void foo (unsigned TYPE * restrict a, TYPE * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = (a[i] * a[i]) >> 2;
+}
+
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
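
A note on why one shrn can stand in for both ashiftrt and lshiftrt followed by
a truncate: when the shift amount is no larger than the width of the narrow
element, the bits that survive the truncation all come from the original
value, so the sign fill of an arithmetic shift is discarded.  The sketch
below models a single 32-bit to 16-bit lane.  The function names are
hypothetical, and it assumes (as I understand the
aarch64_simd_shift_imm_vec_<vn_mode> predicates) that the immediate is limited
to 1..16 for this lane size.  It also relies on GCC's arithmetic right shift
and modulo conversion to int32_t, both of which ISO C leaves
implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift right by N, then keep the low 16 bits.  */
uint16_t
lane_lshr_narrow (uint32_t x, unsigned n)
{
  return (uint16_t) (x >> n);
}

/* Arithmetic shift right by N, then keep the low 16 bits.  Assumes GCC
   behaviour for the int32_t conversion and for >> on negative values.  */
uint16_t
lane_ashr_narrow (uint32_t x, unsigned n)
{
  return (uint16_t) ((int32_t) x >> n);
}

/* For every N in 1..16 the two forms agree on the sample values.  */
void
check_shrn_equivalence (void)
{
  static const uint32_t samples[] =
    { 0u, 1u, 0x7fffu, 0x8000u, 0x12345678u, 0x80000000u, 0xffffffffu };
  for (unsigned n = 1; n <= 16; n++)
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
      assert (lane_lshr_narrow (samples[i], n)
	      == lane_ashr_narrow (samples[i], n));
}
```

With a shift of 17 the two forms can differ (for example on 0x80000000),
which is why the equivalence depends on the immediate range.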

[-- Attachment #2: rb14892.patch --]
[-- Type: application/octet-stream, Size: 5358 bytes --]


* RE: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
  2021-10-12 12:19       ` Kyrylo Tkachov
@ 2021-10-12 16:20         ` Tamar Christina
  0 siblings, 0 replies; 31+ messages in thread
From: Tamar Christina @ 2021-10-12 16:20 UTC (permalink / raw)
  To: Kyrylo Tkachov, Andrew Pinski
  Cc: gcc-patches, apinski, Richard Earnshaw, nd, Marcus Shawcroft,
	Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 7230 bytes --]

Thanks,

Just archiving a version with more tests as requested.

I will assume the OK still stands.

Regards,
Tamar
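
For anyone reading this from the archive, the equivalence described in the
original submission (quoted below) can be sanity-checked with a small scalar
model. This is only a sketch: the function names are made up for illustration
and are not part of the patch, and it assumes that `>>` on a negative signed
value is an arithmetic shift, which is what GCC documents but which ISO C
leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of "sshr v0.4s, v0.4s, 31".
   Assumption: >> on a negative int32_t shifts in copies of the sign bit.  */
int32_t sshr_top_bit (int32_t x)
{
  return x >> 31;
}

/* Scalar model of one 32-bit lane of "cmlt v0.4s, v0.4s, #0":
   all ones (-1) when the lane is negative, zero otherwise.  */
int32_t cmlt_zero (int32_t x)
{
  return x < 0 ? -1 : 0;
}

/* Returns 1 when both models agree on a handful of boundary values.  */
int models_agree (void)
{
  static const int32_t samples[] = { INT32_MIN, -2, -1, 0, 1, 2, INT32_MAX };
  for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
    if (sshr_top_bit (samples[i]) != cmlt_zero (samples[i]))
      return 0;
  return 1;
}
```

The same argument holds for the other element sizes as long as the shift
amount is exactly the element width minus one, which is what the new D1
constraint checks.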

> -----Original Message-----
> From: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Sent: Tuesday, October 12, 2021 1:19 PM
> To: Andrew Pinski <pinskia@gmail.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; gcc-patches@gcc.gnu.org;
> apinski@marvell.com; Richard Earnshaw <Richard.Earnshaw@arm.com>; nd
> <nd@arm.com>; Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Richard
> Sandiford <Richard.Sandiford@arm.com>
> Subject: RE: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
> 
> 
> 
> > -----Original Message-----
> > From: Andrew Pinski <pinskia@gmail.com>
> > Sent: Monday, October 11, 2021 8:56 PM
> > To: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> > Cc: Tamar Christina <Tamar.Christina@arm.com>;
> > gcc-patches@gcc.gnu.org; apinski@marvell.com; Richard Earnshaw
> > <Richard.Earnshaw@arm.com>; nd <nd@arm.com>; Marcus Shawcroft
> > <Marcus.Shawcroft@arm.com>; Richard Sandiford
> > <Richard.Sandiford@arm.com>
> > Subject: Re: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
> >
> > On Thu, Sep 30, 2021 at 2:28 AM Kyrylo Tkachov via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > > > -----Original Message-----
> > > > From: Tamar Christina <Tamar.Christina@arm.com>
> > > > Sent: Wednesday, September 29, 2021 5:20 PM
> > > > To: gcc-patches@gcc.gnu.org
> > > > Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> > > > Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> > > > <Kyrylo.Tkachov@arm.com>; Richard Sandiford
> > > > <Richard.Sandiford@arm.com>
> > > > Subject: [PATCH 3/7]AArch64 Add pattern for sshr to cmlt
> > > >
> > > > Hi All,
> > > >
> > > > This optimizes a signed right shift by BITSIZE-1 into a cmlt
> > > > operation, which is preferable because compares generally have a
> > > > higher throughput than shifts.
> > > >
> > > > On AArch64 the result of the shift would have been either -1 or 0,
> > > > which is the result of the compare.
> > > >
> > > > i.e.
> > > >
> > > > void e (int * restrict a, int *b, int n) {
> > > >     for (int i = 0; i < n; i++)
> > > >       b[i] = a[i] >> 31;
> > > > }
> > > >
> > > > now generates:
> > > >
> > > > .L4:
> > > >         ldr     q0, [x0, x3]
> > > >         cmlt    v0.4s, v0.4s, #0
> > > >         str     q0, [x1, x3]
> > > >         add     x3, x3, 16
> > > >         cmp     x4, x3
> > > >         bne     .L4
> > > >
> > > > instead of:
> > > >
> > > > .L4:
> > > >         ldr     q0, [x0, x3]
> > > >         sshr    v0.4s, v0.4s, 31
> > > >         str     q0, [x1, x3]
> > > >         add     x3, x3, 16
> > > >         cmp     x4, x3
> > > >         bne     .L4
> > > >
> > > > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> > > >
> > > > Ok for master?
> > >
> > > This should be okay (either a win or neutral) for Arm Cortex and
> > > Neoverse cores so I'm tempted to not ask for a CPU-specific tunable
> > > to guard it to keep the code clean.
> > > Andrew, would this change be okay from a Thunder X line perspective?
> >
> > I don't know about ThunderX2 but here are the details for ThunderX1
> > (and OcteonX1) and OcteonX2:
> > The sshr and cmlt are handled the same in the pipeline as far as I can tell.
> >
> 
> Thanks for the info.
> This patch is ok.
> Kyrill
> 
> > Thanks,
> > Andrew
> >
> >
> >
> > > Thanks,
> > > Kyrill
> > >
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >       * config/aarch64/aarch64-simd.md (aarch64_simd_ashr<mode>):
> > > >       Add cmp case.
> > > >       * config/aarch64/constraints.md (D1): New.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >       * gcc.target/aarch64/shl-combine-2.c: New test.
> > > >
> > > > --- inline copy of patch --
> > > > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> > > > index 300bf001b59ca7fa197c580b10adb7f70f20d1e0..19b2d0ad4dab4d5742698297ded861228ee22007 100644
> > > > --- a/gcc/config/aarch64/aarch64-simd.md
> > > > +++ b/gcc/config/aarch64/aarch64-simd.md
> > > > @@ -1127,12 +1127,14 @@ (define_insn "aarch64_simd_lshr<mode>"
> > > >  )
> > > >
> > > >  (define_insn "aarch64_simd_ashr<mode>"
> > > > - [(set (match_operand:VDQ_I 0 "register_operand" "=w")
> > > > -       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
> > > > -                  (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm" "Dr")))]
> > > > + [(set (match_operand:VDQ_I 0 "register_operand" "=w,w")
> > > > +       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w,w")
> > > > +                  (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm" "D1,Dr")))]
> > > >   "TARGET_SIMD"
> > > > - "sshr\t%0.<Vtype>, %1.<Vtype>, %2"
> > > > -  [(set_attr "type" "neon_shift_imm<q>")]
> > > > + "@
> > > > +  cmlt\t%0.<Vtype>, %1.<Vtype>, #0
> > > > +  sshr\t%0.<Vtype>, %1.<Vtype>, %2"
> > > > +  [(set_attr "type" "neon_compare<q>,neon_shift_imm<q>")]
> > > >  )
> > > >
> > > >  (define_insn "*aarch64_simd_sra<mode>"
> > > > diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
> > > > index 3b49b452119c49320020fa9183314d9a25b92491..18630815ffc13f2168300a899db69fd428dfb0d6 100644
> > > > --- a/gcc/config/aarch64/constraints.md
> > > > +++ b/gcc/config/aarch64/constraints.md
> > > > @@ -437,6 +437,14 @@ (define_constraint "Dl"
> > > >        (match_test "aarch64_simd_shift_imm_p (op, GET_MODE (op),
> > > >                                                true)")))
> > > >
> > > > +(define_constraint "D1"
> > > > +  "@internal
> > > > + A constraint that matches vector of immediates that is bits(mode)-1."
> > > > + (and (match_code "const,const_vector")
> > > > +      (match_test "aarch64_const_vec_all_same_in_range_p (op,
> > > > +                     GET_MODE_UNIT_BITSIZE (mode) - 1,
> > > > +                     GET_MODE_UNIT_BITSIZE (mode) - 1)")))
> > > > +
> > > >  (define_constraint "Dr"
> > > >    "@internal
> > > >   A constraint that matches vector of immediates for right shifts."
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> > > > new file mode 100644
> > > > index 0000000000000000000000000000000000000000..bdfe35d09ffccc7928947c9e57f1034f7ca2c798
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
> > > > @@ -0,0 +1,12 @@
> > > > +/* { dg-do assemble } */
> > > > +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> > > > +
> > > > +void e (int * restrict a, int *b, int n) {
> > > > +    for (int i = 0; i < n; i++)
> > > > +      b[i] = a[i] >> 31;
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
> > > > +/* { dg-final { scan-assembler-not {\tsshr\t} } } */
> > > > +
> > > >
> > > >
> > > > --


[-- Attachment #2: rb14894.patch --]
[-- Type: application/octet-stream, Size: 4581 bytes --]

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 7f0888ee2f81ae17ac97be1f8438a2e588587c2a..0b340b49fa06684b80d0b78cb712e49328ca92d5 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1127,12 +1127,14 @@ (define_insn "aarch64_simd_lshr<mode>"
 )
 
 (define_insn "aarch64_simd_ashr<mode>"
- [(set (match_operand:VDQ_I 0 "register_operand" "=w")
-       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
-		     (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm" "Dr")))]
+ [(set (match_operand:VDQ_I 0 "register_operand" "=w,w")
+       (ashiftrt:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w,w")
+		     (match_operand:VDQ_I  2 "aarch64_simd_rshift_imm" "D1,Dr")))]
  "TARGET_SIMD"
- "sshr\t%0.<Vtype>, %1.<Vtype>, %2"
-  [(set_attr "type" "neon_shift_imm<q>")]
+ "@
+  cmlt\t%0.<Vtype>, %1.<Vtype>, #0
+  sshr\t%0.<Vtype>, %1.<Vtype>, %2"
+  [(set_attr "type" "neon_compare<q>,neon_shift_imm<q>")]
 )
 
 (define_insn "*aarch64_simd_sra<mode>"
diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
index 3b49b452119c49320020fa9183314d9a25b92491..18630815ffc13f2168300a899db69fd428dfb0d6 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -437,6 +437,14 @@ (define_constraint "Dl"
       (match_test "aarch64_simd_shift_imm_p (op, GET_MODE (op),
 						 true)")))
 
+(define_constraint "D1"
+  "@internal
+ A constraint that matches vector of immediates that is bits(mode)-1."
+ (and (match_code "const,const_vector")
+      (match_test "aarch64_const_vec_all_same_in_range_p (op,
+			GET_MODE_UNIT_BITSIZE (mode) - 1,
+			GET_MODE_UNIT_BITSIZE (mode) - 1)")))
+
 (define_constraint "Dr"
   "@internal
  A constraint that matches vector of immediates for right shifts."
diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..6a0331fbe609ef5f28b67682a6074099c6bc4b1f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-2.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE char
+
+void e (signed TYPE * restrict a, signed TYPE *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      b[i] = a[i] >> (sizeof(TYPE)*8)-1;
+}
+
+/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tsshr\t} } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-3.c b/gcc/testsuite/gcc.target/aarch64/shl-combine-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..2086b24a3cb588e08655d0641810eede40c1603a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-3.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE short
+
+void e (signed TYPE * restrict a, signed TYPE *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      b[i] = a[i] >> (sizeof(TYPE)*8)-1;
+}
+
+/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tsshr\t} } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-4.c b/gcc/testsuite/gcc.target/aarch64/shl-combine-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..083181071f400ece0719f879d019f9dc05daa87b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-4.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE int
+
+void e (signed TYPE * restrict a, signed TYPE *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      b[i] = a[i] >> (sizeof(TYPE)*8)-1;
+}
+
+/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tsshr\t} } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/shl-combine-5.c b/gcc/testsuite/gcc.target/aarch64/shl-combine-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..6b2a6bd86b3658209e435c8cb4bbb14d5b11ddc4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shl-combine-5.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE long
+
+void e (signed TYPE * restrict a, signed TYPE *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      b[i] = a[i] >> (sizeof(TYPE)*8)-1;
+}
+
+/* { dg-final { scan-assembler-times {\tcmlt\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tsshr\t} } } */
+


* RE: [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle)
  2021-09-30  8:54   ` Kyrylo Tkachov
@ 2021-10-12 16:23     ` Tamar Christina
  2021-10-12 16:36       ` Kyrylo Tkachov
  0 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-10-12 16:23 UTC (permalink / raw)
  To: Kyrylo Tkachov, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 9216 bytes --]

Hi All,

This is a new version with more tests and big-endian (BE) support.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar
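
For readers following the new patterns: a truncating right shift by exactly
half the element width keeps only the top half of each wide lane, and on
little-endian that top half is the odd-numbered narrow element. That is why
a shrn/shrn2 pair with such a shift can be replaced by a single uzp2. The
scalar sketch below illustrates this for 16-bit to 8-bit narrowing. It only
models the non-rounding SHIFTRT patterns, it models the little-endian lane
layout explicitly rather than relying on the host, and all function names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8   /* 8 x 16-bit lanes, i.e. one 128-bit register.  */

/* Reference semantics of shrn + shrn2 with shift 8:
   out = { a >> 8, b >> 8 }, each lane truncated to 8 bits.  */
void shift_narrow_concat (const uint16_t *a, const uint16_t *b, uint8_t *out)
{
  for (int i = 0; i < LANES; i++)
    {
      out[i] = (uint8_t) (a[i] >> 8);
      out[LANES + i] = (uint8_t) (b[i] >> 8);
    }
}

/* View a vector of 16-bit lanes as 16 byte lanes, little-endian layout.  */
void bytes_le (const uint16_t *v, uint8_t *bytes)
{
  for (int i = 0; i < LANES; i++)
    {
      bytes[2 * i] = (uint8_t) (v[i] & 0xff);
      bytes[2 * i + 1] = (uint8_t) (v[i] >> 8);
    }
}

/* Model of uzp2 on byte lanes: odd-numbered elements of n, then of m.  */
void uzp2_bytes (const uint8_t *n, const uint8_t *m, uint8_t *out)
{
  for (int i = 0; i < LANES; i++)
    {
      out[i] = n[2 * i + 1];
      out[LANES + i] = m[2 * i + 1];
    }
}

/* Returns 1 when both routes produce the same 16 bytes.  */
int uzp2_matches_shift_narrow (const uint16_t *a, const uint16_t *b)
{
  uint8_t ref[2 * LANES], an[2 * LANES], bn[2 * LANES], got[2 * LANES];
  shift_narrow_concat (a, b, ref);
  bytes_le (a, an);
  bytes_le (b, bn);
  uzp2_bytes (an, bn, got);
  return memcmp (ref, got, sizeof ref) == 0;
}
```

The new predicate aarch64_simd_shift_imm_vec_exact_top restricts the match
to exactly this case, where every element of the shift vector equals half
the unit bit size.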

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(*aarch64_<srn_op>topbits_shuffle<mode>_le): New.
	(*aarch64_topbits_shuffle<mode>_le): New.
	(*aarch64_<srn_op>topbits_shuffle<mode>_be): New.
	(*aarch64_topbits_shuffle<mode>_be): New.
	* config/aarch64/predicates.md
	(aarch64_simd_shift_imm_vec_exact_top): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/shrn-combine-10.c: New test.
	* gcc.target/aarch64/shrn-combine-5.c: New test.
	* gcc.target/aarch64/shrn-combine-6.c: New test.
	* gcc.target/aarch64/shrn-combine-7.c: New test.
	* gcc.target/aarch64/shrn-combine-8.c: New test.
	* gcc.target/aarch64/shrn-combine-9.c: New test.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 5715db4e1e1386e724e4d4defd5e5ed9efd8a874..7f0888ee2f81ae17ac97be1f8438a2e588587c2a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1852,6 +1852,66 @@ (define_insn "*aarch64_<srn_op>shrn<mode>2_vect_be"
   [(set_attr "type" "neon_shift_imm_narrow_q")]
 )
 
+(define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_le"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+          (truncate:<VNARROWQ>
+            (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
+	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")))
+	  (truncate:<VNARROWQ>
+	    (SHIFTRT:VQN (match_operand:VQN 3 "register_operand" "w")
+	      (match_dup 2)))))]
+  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
+  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
+(define_insn "*aarch64_topbits_shuffle<mode>_le"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+          (unspec:<VNARROWQ> [
+              (match_operand:VQN 1 "register_operand" "w")
+	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")
+	     ] UNSPEC_RSHRN)
+	  (unspec:<VNARROWQ> [
+	      (match_operand:VQN 3 "register_operand" "w")
+	      (match_dup 2)
+	     ] UNSPEC_RSHRN)))]
+  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
+  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
+(define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+	  (truncate:<VNARROWQ>
+	    (SHIFTRT:VQN (match_operand:VQN 3 "register_operand" "w")
+	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")))
+          (truncate:<VNARROWQ>
+            (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
+	      (match_dup 2)))))]
+  "TARGET_SIMD && BYTES_BIG_ENDIAN"
+  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
+(define_insn "*aarch64_topbits_shuffle<mode>_be"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+	  (unspec:<VNARROWQ> [
+	      (match_operand:VQN 3 "register_operand" "w")
+	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")
+	     ] UNSPEC_RSHRN)
+          (unspec:<VNARROWQ> [
+              (match_operand:VQN 1 "register_operand" "w")
+	      (match_dup 2)
+	     ] UNSPEC_RSHRN)))]
+  "TARGET_SIMD && BYTES_BIG_ENDIAN"
+  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
 (define_expand "aarch64_shrn<mode>"
   [(set (match_operand:<VNARROWQ> 0 "register_operand")
 	(truncate:<VNARROWQ>
diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
index 49f02ae0381359174fed80c2a2264295c75bc189..7fd4f9e7d06d3082d6f3047290f0446789e1d0d2 100644
--- a/gcc/config/aarch64/predicates.md
+++ b/gcc/config/aarch64/predicates.md
@@ -545,6 +545,12 @@ (define_predicate "aarch64_simd_shift_imm_offset_di"
   (and (match_code "const_int")
        (match_test "IN_RANGE (INTVAL (op), 1, 64)")))
 
+(define_predicate "aarch64_simd_shift_imm_vec_exact_top"
+  (and (match_code "const_vector")
+       (match_test "aarch64_const_vec_all_same_in_range_p (op,
+			GET_MODE_UNIT_BITSIZE (GET_MODE (op)) / 2,
+			GET_MODE_UNIT_BITSIZE (GET_MODE (op)) / 2)")))
+
 (define_predicate "aarch64_simd_shift_imm_vec_qi"
   (and (match_code "const_vector")
        (match_test "aarch64_const_vec_all_same_in_range_p (op, 1, 8)")))
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-10.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-10.c
new file mode 100644
index 0000000000000000000000000000000000000000..3a1cfce93e9065e8d5b43a770b0ef24a17586411
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-10.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+
+#include <arm_neon.h>
+
+uint32x4_t foo (uint64x2_t a, uint64x2_t b)
+{
+  return vrshrn_high_n_u64 (vrshrn_n_u64 (a, 32), b, 32);
+}
+
+/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-5.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..408e85535788b2c1c9b05672a269e4e6567f2683
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-5.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE1 char
+#define TYPE2 short
+#define SHIFT 8
+
+void foo (TYPE2 * restrict a, TYPE1 * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = a[i] >> SHIFT;
+}
+
+/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-6.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..6211ba3e41c199f325b80217d298801767c8dad5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-6.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE1 short
+#define TYPE2 int
+#define SHIFT 16
+
+void foo (TYPE2 * restrict a, TYPE1 * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = a[i] >> SHIFT;
+}
+
+/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-7.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-7.c
new file mode 100644
index 0000000000000000000000000000000000000000..56cbeacc6de54f177f5b66d26b62ba6cefb921ad
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-7.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define TYPE1 int
+#define TYPE2 long long
+#define SHIFT 32
+
+void foo (TYPE2 * restrict a, TYPE1 * restrict d, int n)
+{
+    for( int i = 0; i < n; i++ )
+      d[i] = a[i] >> SHIFT;
+}
+
+/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-8.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-8.c
new file mode 100644
index 0000000000000000000000000000000000000000..6a47f3cdaee399e603c57a1c6a0c09c6cfd21abb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-8.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+
+#include <arm_neon.h>
+
+uint8x16_t foo (uint16x8_t a, uint16x8_t b)
+{
+  return vrshrn_high_n_u16 (vrshrn_n_u16 (a, 8), b, 8);
+}
+
+/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-9.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-9.c
new file mode 100644
index 0000000000000000000000000000000000000000..929a55c5c338844e6a5c5ad249af482286ab9c61
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-9.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+
+#include <arm_neon.h>
+
+uint16x8_t foo (uint32x4_t a, uint32x4_t b)
+{
+  return vrshrn_high_n_u32 (vrshrn_n_u32 (a, 16), b, 16);
+}
+
+/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
+/* { dg-final { scan-assembler-not {\tshrn\t} } } */
+/* { dg-final { scan-assembler-not {\tshrn2\t} } } */


* RE: [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
  2021-09-30  9:28   ` Kyrylo Tkachov
@ 2021-10-12 16:25     ` Tamar Christina
  2021-10-12 16:39       ` Kyrylo Tkachov
  0 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-10-12 16:25 UTC (permalink / raw)
  To: Kyrylo Tkachov, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 8708 bytes --]

Hi All,

This is a new version with big-endian (BE) support and more tests.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar
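
As background for the patterns below: truncating each wide lane keeps its low
half, and on little-endian the low half is the even-numbered narrow element.
Concatenating the truncations of two vectors is therefore the same as a uzp1
over the narrow view of both inputs, which is what lets xtn + xtn2 become a
single uzp1. The sketch below shows this for 32-bit to 16-bit narrowing. It
is a scalar model with an explicit little-endian lane layout, and the
function names are hypothetical rather than anything from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4   /* 4 x 32-bit lanes, i.e. one 128-bit register.  */

/* Reference semantics of xtn + xtn2: out = { trunc (a), trunc (b) }.  */
void trunc_concat (const uint32_t *a, const uint32_t *b, uint16_t *out)
{
  for (int i = 0; i < LANES; i++)
    {
      out[i] = (uint16_t) a[i];
      out[LANES + i] = (uint16_t) b[i];
    }
}

/* View a vector of 32-bit lanes as 8 halfword lanes, little-endian layout.  */
void halves_le (const uint32_t *v, uint16_t *halves)
{
  for (int i = 0; i < LANES; i++)
    {
      halves[2 * i] = (uint16_t) (v[i] & 0xffff);
      halves[2 * i + 1] = (uint16_t) (v[i] >> 16);
    }
}

/* Model of uzp1 on halfword lanes: even-numbered elements of n, then of m.  */
void uzp1_halves (const uint16_t *n, const uint16_t *m, uint16_t *out)
{
  for (int i = 0; i < LANES; i++)
    {
      out[i] = n[2 * i];
      out[LANES + i] = m[2 * i];
    }
}

/* Returns 1 when both routes produce the same 8 halfwords.  */
int uzp1_matches_trunc_concat (const uint32_t *a, const uint32_t *b)
{
  uint16_t ref[2 * LANES], an[2 * LANES], bn[2 * LANES], got[2 * LANES];
  trunc_concat (a, b, ref);
  halves_le (a, an);
  halves_le (b, bn);
  uzp1_halves (an, bn, got);
  return memcmp (ref, got, sizeof ref) == 0;
}
```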

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_narrow_trunc<mode>_le): New.
	(*aarch64_narrow_trunc<mode>_be): New.
	* config/aarch64/iterators.md (VNARROWSIMD, Vnarrowsimd): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/narrow_high_combine.c: Update case.
	* gcc.target/aarch64/xtn-combine-1.c: New test.
	* gcc.target/aarch64/xtn-combine-2.c: New test.
	* gcc.target/aarch64/xtn-combine-3.c: New test.
	* gcc.target/aarch64/xtn-combine-4.c: New test.
	* gcc.target/aarch64/xtn-combine-5.c: New test.
	* gcc.target/aarch64/xtn-combine-6.c: New test.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 0b340b49fa06684b80d0b78cb712e49328ca92d5..8435dece660a12aa747c4a489fbbda5bc0f83a86 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1753,6 +1753,30 @@ (define_expand "aarch64_xtn2<mode>"
   }
 )
 
+(define_insn "*aarch64_narrow_trunc<mode>_le"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+          (truncate:<VNARROWQ>
+            (match_operand:VQN 1 "register_operand" "w"))
+	  (truncate:<VNARROWQ>
+	    (match_operand:VQN 2 "register_operand" "w"))))]
+  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
+  "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
+(define_insn "*aarch64_narrow_trunc<mode>_be"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+	  (truncate:<VNARROWQ>
+	    (match_operand:VQN 2 "register_operand" "w"))
+          (truncate:<VNARROWQ>
+            (match_operand:VQN 1 "register_operand" "w"))))]
+  "TARGET_SIMD && BYTES_BIG_ENDIAN"
+  "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
 ;; Packing doubles.
 
 (define_expand "vec_pack_trunc_<mode>"
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 8dbeed3b0d4a44cdc17dd333ed397b39a33f386a..95b385c0c9405fe95fcd07262a9471ab13d5488e 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -270,6 +270,14 @@ (define_mode_iterator VDQHS [V4HI V8HI V2SI V4SI])
 ;; Advanced SIMD modes for H, S and D types.
 (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
 
+;; Modes for which we can narrow the element and increase the lane counts
+;; to preserve the same register size.
+(define_mode_attr VNARROWSIMD [(V4HI "V8QI") (V8HI "V16QI") (V4SI "V8HI")
+			       (V2SI "V4HI") (V2DI "V4SI")])
+
+(define_mode_attr Vnarrowsimd [(V4HI "v8qi") (V8HI "v16qi") (V4SI "v8hi")
+			       (V2SI "v4hi") (V2DI "v4si")])
+
 ;; Advanced SIMD and scalar integer modes for H and S.
 (define_mode_iterator VSDQ_HSI [V4HI V8HI V2SI V4SI HI SI])
 
diff --git a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
index 50ecab002a3552d37a5cc0d8921f42f6c3dba195..fa61196d3644caa48b12151e12b15dfeab8c7e71 100644
--- a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
+++ b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
@@ -225,7 +225,8 @@ TEST_2_UNARY (vqmovun, uint32x4_t, int64x2_t, s64, u32)
 /* { dg-final { scan-assembler-times "\\tuqshrn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqrshrn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tuqrshrn2\\tv" 6} }  */
-/* { dg-final { scan-assembler-times "\\txtn2\\tv" 12} }  */
+/* { dg-final { scan-assembler-times "\\txtn2\\tv" 6} }  */
+/* { dg-final { scan-assembler-times "\\tuzp1\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tuqxtn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqxtn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqxtun2\\tv" 6} }  */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..14e0414cd1478f1cb7b17766aa8d4451c5659977
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN signed
+#define TYPE1 char
+#define TYPE2 short
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..c259010442bca4ba008706e47b3ffcc50a910b52
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN signed
+#define TYPE1 short
+#define TYPE2 int
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..9a2065f65101f82ebe33519840919f9606192c43
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN signed
+#define TYPE1 int
+#define TYPE2 long long
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..77c3dce12049989d344e4652bd593de2e846ae25
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN unsigned
+#define TYPE1 char
+#define TYPE2 short
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..ae30e864ed7a239c40522423d7493ddd33830cc8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN unsigned
+#define TYPE1 short
+#define TYPE2 int
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..882f3d333e2cc42e964d880de5b58c505ec8b122
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN unsigned
+#define TYPE1 int
+#define TYPE2 long long
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */

[-- Attachment #2: rb14895.patch --]
[-- Type: application/octet-stream, Size: 7720 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow
  2021-10-12 16:18       ` Tamar Christina
@ 2021-10-12 16:35         ` Kyrylo Tkachov
  0 siblings, 0 replies; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-10-12 16:35 UTC (permalink / raw)
  To: Tamar Christina, Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Tuesday, October 12, 2021 5:18 PM
> To: Richard Sandiford <Richard.Sandiford@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>
> Subject: RE: [PATCH 1/7]AArch64 Add combine patterns for right shift and
> narrow
> 
> Hi All,
> 
> Here's a new version with big-endian support and more tests
> 
> > >
> > > I think this needs to be guarded on !BYTES_BIG_ENDIAN and a similar
> > pattern added for BYTES_BIG_ENDIAN with the vec_concat operands
> > swapped around.
> > > This is similar to the aarch64_xtn2<mode>_insn_be pattern, for example.
> >
> > Yeah.  I think that applies to 2/7 and 4/7 too.
> >
> 
> Bootstrapped and regression-tested on aarch64-none-linux-gnu with no issues.
> 
> Ok for master?

Ok.
Thanks,
Kyrill

> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> (*aarch64_<srn_op>shrn<mode>_vect,
> 	*aarch64_<srn_op>shrn<mode>2_vect_le,
> 	*aarch64_<srn_op>shrn<mode>2_vect_be): New.
> 	* config/aarch64/iterators.md (srn_op): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/shrn-combine-1.c: New test.
> 	* gcc.target/aarch64/shrn-combine-2.c: New test.
> 	* gcc.target/aarch64/shrn-combine-3.c: New test.
> 	* gcc.target/aarch64/shrn-combine-4.c: New test.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 48eddf64e05afe3788abfa05141f6544a9323ea1..5715db4e1e1386e724e4d4defd5e5ed9efd8a874 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1818,6 +1818,40 @@ (define_insn "aarch64_shrn<mode>_insn_be"
>    [(set_attr "type" "neon_shift_imm_narrow_q")]
>  )
> 
> +(define_insn "*aarch64_<srn_op>shrn<mode>_vect"
> +  [(set (match_operand:<VNARROWQ> 0 "register_operand" "=w")
> +        (truncate:<VNARROWQ>
> +          (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
> +            (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_<vn_mode>"))))]
> +  "TARGET_SIMD"
> +  "shrn\\t%0.<Vntype>, %1.<Vtype>, %2"
> +  [(set_attr "type" "neon_shift_imm_narrow_q")]
> +)
> +
> +(define_insn "*aarch64_<srn_op>shrn<mode>2_vect_le"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +	  (match_operand:<VNARROWQ> 1 "register_operand" "0")
> +	  (truncate:<VNARROWQ>
> +	    (SHIFTRT:VQN (match_operand:VQN 2 "register_operand" "w")
> +	      (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_<vn_mode>")))))]
> +  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
> +  "shrn2\\t%0.<V2ntype>, %2.<Vtype>, %3"
> +  [(set_attr "type" "neon_shift_imm_narrow_q")]
> +)
> +
> +(define_insn "*aarch64_<srn_op>shrn<mode>2_vect_be"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +	  (truncate:<VNARROWQ>
> +	    (SHIFTRT:VQN (match_operand:VQN 2 "register_operand" "w")
> +	      (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_<vn_mode>")))
> +	  (match_operand:<VNARROWQ> 1 "register_operand" "0")))]
> +  "TARGET_SIMD && BYTES_BIG_ENDIAN"
> +  "shrn2\\t%0.<V2ntype>, %2.<Vtype>, %3"
> +  [(set_attr "type" "neon_shift_imm_narrow_q")]
> +)
> +
>  (define_expand "aarch64_shrn<mode>"
>    [(set (match_operand:<VNARROWQ> 0 "register_operand")
>  	(truncate:<VNARROWQ>
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index caa42f8f169fbf2cf46a90cf73dee05619acc300..8dbeed3b0d4a44cdc17dd333ed397b39a33f386a 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -2003,6 +2003,9 @@ (define_code_attr shift [(ashift "lsl") (ashiftrt "asr")
>  ;; Op prefix for shift right and accumulate.
>  (define_code_attr sra_op [(ashiftrt "s") (lshiftrt "u")])
> 
> +;; op prefix for shift right and narrow.
> +(define_code_attr srn_op [(ashiftrt "r") (lshiftrt "")])
> +
>  ;; Map shift operators onto underlying bit-field instructions
>  (define_code_attr bfshift [(ashift "ubfiz") (ashiftrt "sbfx")
>  			   (lshiftrt "ubfx") (rotatert "extr")])
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-1.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..a28524662edca8eb149e34c2242091b51a167b71
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-1.c
> @@ -0,0 +1,13 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define TYPE char
> +
> +void foo (unsigned TYPE * restrict a, TYPE * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = (a[i] * a[i]) >> 2;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tshrn2\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..012135b424f98abadc480e7ef13fcab080d99c28
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-2.c
> @@ -0,0 +1,13 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define TYPE short
> +
> +void foo (unsigned TYPE * restrict a, TYPE * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = (a[i] * a[i]) >> 2;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tshrn2\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..8b5b360de623b0ada0da1531795ba6b428c7f9e1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-3.c
> @@ -0,0 +1,13 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define TYPE int
> +
> +void foo (unsigned long long * restrict a, TYPE * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = a[i] >> 3;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tshrn\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tshrn2\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-4.c b/gcc/testsuite/gcc.target/aarch64/shrn-combine-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..fedca7621e2a82df0df9d12b91c5c0c9fd3dfc60
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-4.c
> @@ -0,0 +1,13 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define TYPE long long
> +
> +void foo (unsigned TYPE * restrict a, TYPE * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = (a[i] * a[i]) >> 2;
> +}
> +
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */


^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle)
  2021-10-12 16:23     ` Tamar Christina
@ 2021-10-12 16:36       ` Kyrylo Tkachov
  0 siblings, 0 replies; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-10-12 16:36 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Tuesday, October 12, 2021 5:23 PM
> To: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: RE: [PATCH 2/7]AArch64 Add combine patterns for narrowing shift
> of half top bits (shuffle)
> 
> Hi All,
> 
> This is a new version with more tests and big-endian (BE) support.
> 
> Bootstrapped and regression-tested on aarch64-none-linux-gnu with no issues.
> 
> Ok for master?

Ok.
Thanks,
Kyrill

> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> 	(*aarch64_<srn_op>topbits_shuffle<mode>_le): New.
> 	(*aarch64_topbits_shuffle<mode>_le): New.
> 	(*aarch64_<srn_op>topbits_shuffle<mode>_be): New.
> 	(*aarch64_topbits_shuffle<mode>_be): New.
> 	* config/aarch64/predicates.md
> 	(aarch64_simd_shift_imm_vec_exact_top): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/shrn-combine-10.c: New test.
> 	* gcc.target/aarch64/shrn-combine-5.c: New test.
> 	* gcc.target/aarch64/shrn-combine-6.c: New test.
> 	* gcc.target/aarch64/shrn-combine-7.c: New test.
> 	* gcc.target/aarch64/shrn-combine-8.c: New test.
> 	* gcc.target/aarch64/shrn-combine-9.c: New test.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 5715db4e1e1386e724e4d4defd5e5ed9efd8a874..7f0888ee2f81ae17ac97be1f8438a2e588587c2a 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1852,6 +1852,66 @@ (define_insn "*aarch64_<srn_op>shrn<mode>2_vect_be"
>    [(set_attr "type" "neon_shift_imm_narrow_q")]
>  )
> 
> +(define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_le"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +          (truncate:<VNARROWQ>
> +            (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
> +	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")))
> +	  (truncate:<VNARROWQ>
> +	    (SHIFTRT:VQN (match_operand:VQN 3 "register_operand" "w")
> +	      (match_dup 2)))))]
> +  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
> +  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
> +(define_insn "*aarch64_topbits_shuffle<mode>_le"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +          (unspec:<VNARROWQ> [
> +              (match_operand:VQN 1 "register_operand" "w")
> +	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")
> +	     ] UNSPEC_RSHRN)
> +	  (unspec:<VNARROWQ> [
> +	      (match_operand:VQN 3 "register_operand" "w")
> +	      (match_dup 2)
> +	     ] UNSPEC_RSHRN)))]
> +  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
> +  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
> +(define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +	  (truncate:<VNARROWQ>
> +	    (SHIFTRT:VQN (match_operand:VQN 3 "register_operand" "w")
> +	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")))
> +          (truncate:<VNARROWQ>
> +            (SHIFTRT:VQN (match_operand:VQN 1 "register_operand" "w")
> +	      (match_dup 2)))))]
> +  "TARGET_SIMD && BYTES_BIG_ENDIAN"
> +  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
> +(define_insn "*aarch64_topbits_shuffle<mode>_be"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +	  (unspec:<VNARROWQ> [
> +	      (match_operand:VQN 3 "register_operand" "w")
> +	      (match_operand:VQN 2 "aarch64_simd_shift_imm_vec_exact_top")
> +	     ] UNSPEC_RSHRN)
> +          (unspec:<VNARROWQ> [
> +              (match_operand:VQN 1 "register_operand" "w")
> +	      (match_dup 2)
> +	     ] UNSPEC_RSHRN)))]
> +  "TARGET_SIMD && BYTES_BIG_ENDIAN"
> +  "uzp2\\t%0.<V2ntype>, %1.<V2ntype>, %3.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
>  (define_expand "aarch64_shrn<mode>"
>    [(set (match_operand:<VNARROWQ> 0 "register_operand")
>  	(truncate:<VNARROWQ>
> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
> index 49f02ae0381359174fed80c2a2264295c75bc189..7fd4f9e7d06d3082d6f3047290f0446789e1d0d2 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -545,6 +545,12 @@ (define_predicate "aarch64_simd_shift_imm_offset_di"
>    (and (match_code "const_int")
>         (match_test "IN_RANGE (INTVAL (op), 1, 64)")))
> 
> +(define_predicate "aarch64_simd_shift_imm_vec_exact_top"
> +  (and (match_code "const_vector")
> +       (match_test "aarch64_const_vec_all_same_in_range_p (op,
> +			GET_MODE_UNIT_BITSIZE (GET_MODE (op)) / 2,
> +			GET_MODE_UNIT_BITSIZE (GET_MODE (op)) / 2)")))
> +
>  (define_predicate "aarch64_simd_shift_imm_vec_qi"
>    (and (match_code "const_vector")
>         (match_test "aarch64_const_vec_all_same_in_range_p (op, 1, 8)")))
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-10.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine-10.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..3a1cfce93e9065e8d5b43a7
> 70b0ef24a17586411
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-10.c
> @@ -0,0 +1,14 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +
> +#include <arm_neon.h>
> +
> +uint32x4_t foo (uint64x2_t a, uint64x2_t b)
> +{
> +  return vrshrn_high_n_u64 (vrshrn_n_u64 (a, 32), b, 32);
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-5.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine-5.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..408e85535788b2c1c9b0567
> 2a269e4e6567f2683
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-5.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define TYPE1 char
> +#define TYPE2 short
> +#define SHIFT 8
> +
> +void foo (TYPE2 * restrict a, TYPE1 * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = a[i] >> SHIFT;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-6.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine-6.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..6211ba3e41c199f325b8021
> 7d298801767c8dad5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-6.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define TYPE1 short
> +#define TYPE2 int
> +#define SHIFT 16
> +
> +void foo (TYPE2 * restrict a, TYPE1 * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = a[i] >> SHIFT;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-7.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine-7.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..56cbeacc6de54f177f5b66d
> 26b62ba6cefb921ad
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-7.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define TYPE1 int
> +#define TYPE2 long long
> +#define SHIFT 32
> +
> +void foo (TYPE2 * restrict a, TYPE1 * restrict d, int n)
> +{
> +    for( int i = 0; i < n; i++ )
> +      d[i] = a[i] >> SHIFT;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-8.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine-8.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..6a47f3cdaee399e603c57a1
> c6a0c09c6cfd21abb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-8.c
> @@ -0,0 +1,14 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +
> +#include <arm_neon.h>
> +
> +uint8x16_t foo (uint16x8_t a, uint16x8_t b)
> +{
> +  return vrshrn_high_n_u16 (vrshrn_n_u16 (a, 8), b, 8);
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/shrn-combine-9.c
> b/gcc/testsuite/gcc.target/aarch64/shrn-combine-9.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..929a55c5c338844e6a5c5ad
> 249af482286ab9c61
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/shrn-combine-9.c
> @@ -0,0 +1,14 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +
> +#include <arm_neon.h>
> +
> +uint16x8_t foo (uint32x4_t a, uint32x4_t b)
> +{
> +  return vrshrn_high_n_u32 (vrshrn_n_u32 (a, 16), b, 16);
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp2\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\tshrn\t} } } */
> +/* { dg-final { scan-assembler-not {\tshrn2\t} } } */
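
The shrn-combine loop tests quoted above expect a right shift by exactly half the element width, followed by a narrowing, to come out as a single uzp2. A minimal C sketch of why that holds for 32-bit to 16-bit lanes, assuming little-endian lane numbering and unsigned elements (the tests use signed types, but the bits that survive are the same). The function names are illustrative only and are not part of the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* View four 32-bit lanes as eight 16-bit lanes in little-endian lane
   order: halfword lane 2*i is the low half of word lane i.  */
static void to_halfword_lanes(const uint32_t v[4], uint16_t h[8])
{
  for (int i = 0; i < 4; i++) {
    h[2 * i] = (uint16_t) (v[i] & 0xffff);
    h[2 * i + 1] = (uint16_t) (v[i] >> 16);
  }
}

/* Model of SHRN + SHRN2 with shift 16: shift each word lane right by 16
   and truncate to 16 bits; a fills the low half, b the high half.  */
static void shift_narrow_concat(const uint32_t a[4], const uint32_t b[4],
                                uint16_t out[8])
{
  for (int i = 0; i < 4; i++) {
    out[i] = (uint16_t) (a[i] >> 16);
    out[i + 4] = (uint16_t) (b[i] >> 16);
  }
}

/* Model of UZP2 on .8h: odd-numbered halfword lanes of n, then of m.  */
static void uzp2_8h(const uint16_t n[8], const uint16_t m[8], uint16_t out[8])
{
  for (int i = 0; i < 4; i++) {
    out[i] = n[2 * i + 1];
    out[i + 4] = m[2 * i + 1];
  }
}

/* Checks that both models agree on a sample input; returns 1 if so.  */
int shrn_matches_uzp2(void)
{
  const uint32_t a[4] = { 0x00010002u, 0x12345678u, 0xdeadbeefu, 0xffff0000u };
  const uint32_t b[4] = { 0x80000001u, 0x0000ffffu, 0xcafef00du, 0x7fff8000u };
  uint16_t ah[8], bh[8], via_shift[8], via_uzp2[8];

  to_halfword_lanes(a, ah);
  to_halfword_lanes(b, bh);
  shift_narrow_concat(a, b, via_shift);
  uzp2_8h(ah, bh, via_uzp2);

  int ok = memcmp(via_shift, via_uzp2, sizeof via_shift) == 0;
  assert(ok);
  return ok;
}
```

The sketch only covers the plain (non-rounding) shift used by the loop tests; it says nothing about the rounding intrinsics.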


* RE: [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
  2021-10-12 16:25     ` Tamar Christina
@ 2021-10-12 16:39       ` Kyrylo Tkachov
  2021-10-13 11:05         ` Tamar Christina
  0 siblings, 1 reply; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-10-12 16:39 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Tuesday, October 12, 2021 5:25 PM
> To: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: RE: [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
> 
> Hi All,
> 
> This is a new version with BE support and more tests.
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> 
> Ok for master?


+(define_insn "*aarch64_narrow_trunc<mode>_le"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+          (truncate:<VNARROWQ>
+            (match_operand:VQN 1 "register_operand" "w"))
+	  (truncate:<VNARROWQ>
+	    (match_operand:VQN 2 "register_operand" "w"))))]
+  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
+  "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+
+(define_insn "*aarch64_narrow_trunc<mode>_be"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+	  (truncate:<VNARROWQ>
+	    (match_operand:VQN 2 "register_operand" "w"))
+          (truncate:<VNARROWQ>
+            (match_operand:VQN 1 "register_operand" "w"))))]
+  "TARGET_SIMD && BYTES_BIG_ENDIAN"
+  "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
+  [(set_attr "type" "neon_permute<q>")]
+)
+

Hmmm, these patterns are identical in what they match; they just have the effect of printing operands 1 and 2 in a different order.
Perhaps it's more compact to change the output template into BYTES_BIG_ENDIAN ? "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>" : "uzp1\\t%0.<V2ntype>, %2.<V2ntype>, %1.<V2ntype>"
and avoid having a second pattern at all?

Thanks,
Kyrill
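
For readers following along, the vec_concat of two truncates in the patterns quoted above can be modelled in plain C to see why a single uzp1 on the narrower lane view gives the same result as xtn followed by xtn2. This is a minimal sketch assuming little-endian lane numbering and 16-bit to 8-bit narrowing; the function names are made up for illustration and do not come from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* View eight 16-bit lanes as sixteen 8-bit lanes in little-endian lane
   order: byte lane 2*i is the low byte of halfword lane i.  */
static void to_byte_lanes(const uint16_t v[8], uint8_t bytes[16])
{
  for (int i = 0; i < 8; i++) {
    bytes[2 * i] = (uint8_t) (v[i] & 0xff);
    bytes[2 * i + 1] = (uint8_t) (v[i] >> 8);
  }
}

/* Model of XTN + XTN2: truncate every lane of a into the low half of
   the result and every lane of b into the high half.  */
static void narrow_concat(const uint16_t a[8], const uint16_t b[8],
                          uint8_t out[16])
{
  for (int i = 0; i < 8; i++) {
    out[i] = (uint8_t) a[i];
    out[i + 8] = (uint8_t) b[i];
  }
}

/* Model of UZP1 on .16b: even-numbered byte lanes of n, then of m.  */
static void uzp1_16b(const uint8_t n[16], const uint8_t m[16], uint8_t out[16])
{
  for (int i = 0; i < 8; i++) {
    out[i] = n[2 * i];
    out[i + 8] = m[2 * i];
  }
}

/* Checks that both models agree on a sample input; returns 1 if so.  */
int uzp1_matches_narrow(void)
{
  uint16_t a[8], b[8];
  uint8_t ab[16], bb[16], via_xtn[16], via_uzp1[16];

  for (int i = 0; i < 8; i++) {
    a[i] = (uint16_t) (0x1234 + 0x1111 * i);
    b[i] = (uint16_t) (0xfedc - 0x0f0f * i);
  }

  to_byte_lanes(a, ab);
  to_byte_lanes(b, bb);
  narrow_concat(a, b, via_xtn);
  uzp1_16b(ab, bb, via_uzp1);

  int ok = memcmp(via_xtn, via_uzp1, sizeof via_xtn) == 0;
  assert(ok);
  return ok;
}
```

The model deliberately ignores big-endian, where the order of the two source operands is what the review comment above is about.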

> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md
> (*aarch64_narrow_trunc<mode>_le):
> 	(*aarch64_narrow_trunc<mode>_be): New.
> 	* config/aarch64/iterators.md (VNARROWSIMD, Vnarrowsimd):
> New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/narrow_high_combine.c: Update case.
> 	* gcc.target/aarch64/xtn-combine-1.c: New test.
> 	* gcc.target/aarch64/xtn-combine-2.c: New test.
> 	* gcc.target/aarch64/xtn-combine-3.c: New test.
> 	* gcc.target/aarch64/xtn-combine-4.c: New test.
> 	* gcc.target/aarch64/xtn-combine-5.c: New test.
> 	* gcc.target/aarch64/xtn-combine-6.c: New test.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 0b340b49fa06684b80d0b78cb712e49328ca92d5..8435dece660a12aa747c4a4
> 89fbbda5bc0f83a86 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1753,6 +1753,30 @@ (define_expand "aarch64_xtn2<mode>"
>    }
>  )
> 
> +(define_insn "*aarch64_narrow_trunc<mode>_le"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +          (truncate:<VNARROWQ>
> +            (match_operand:VQN 1 "register_operand" "w"))
> +	  (truncate:<VNARROWQ>
> +	    (match_operand:VQN 2 "register_operand" "w"))))]
> +  "TARGET_SIMD && !BYTES_BIG_ENDIAN"
> +  "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
> +(define_insn "*aarch64_narrow_trunc<mode>_be"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +	  (truncate:<VNARROWQ>
> +	    (match_operand:VQN 2 "register_operand" "w"))
> +          (truncate:<VNARROWQ>
> +            (match_operand:VQN 1 "register_operand" "w"))))]
> +  "TARGET_SIMD && BYTES_BIG_ENDIAN"
> +  "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
>  ;; Packing doubles.
> 
>  (define_expand "vec_pack_trunc_<mode>"
> diff --git a/gcc/config/aarch64/iterators.md
> b/gcc/config/aarch64/iterators.md
> index
> 8dbeed3b0d4a44cdc17dd333ed397b39a33f386a..95b385c0c9405fe95fcd072
> 62a9471ab13d5488e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -270,6 +270,14 @@ (define_mode_iterator VDQHS [V4HI V8HI V2SI
> V4SI])
>  ;; Advanced SIMD modes for H, S and D types.
>  (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
> 
> +;; Modes for which we can narrow the element and increase the lane counts
> +;; to preserve the same register size.
> +(define_mode_attr VNARROWSIMD [(V4HI "V8QI") (V8HI "V16QI") (V4SI
> "V8HI")
> +			       (V2SI "V4HI") (V2DI "V4SI")])
> +
> +(define_mode_attr Vnarrowsimd [(V4HI "v8qi") (V8HI "v16qi") (V4SI "v8hi")
> +			       (V2SI "v4hi") (V2DI "v4si")])
> +
>  ;; Advanced SIMD and scalar integer modes for H and S.
>  (define_mode_iterator VSDQ_HSI [V4HI V8HI V2SI V4SI HI SI])
> 
> diff --git a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> index
> 50ecab002a3552d37a5cc0d8921f42f6c3dba195..fa61196d3644caa48b12151e
> 12b15dfeab8c7e71 100644
> --- a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> +++ b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> @@ -225,7 +225,8 @@ TEST_2_UNARY (vqmovun, uint32x4_t, int64x2_t,
> s64, u32)
>  /* { dg-final { scan-assembler-times "\\tuqshrn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqrshrn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tuqrshrn2\\tv" 6} }  */
> -/* { dg-final { scan-assembler-times "\\txtn2\\tv" 12} }  */
> +/* { dg-final { scan-assembler-times "\\txtn2\\tv" 6} }  */
> +/* { dg-final { scan-assembler-times "\\tuzp1\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tuqxtn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqxtn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqxtun2\\tv" 6} }  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
> b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..14e0414cd1478f1cb7b1776
> 6aa8d4451c5659977
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN signed
> +#define TYPE1 char
> +#define TYPE2 short
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
> b/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..c259010442bca4ba008706e
> 47b3ffcc50a910b52
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN signed
> +#define TYPE1 short
> +#define TYPE2 int
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
> b/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..9a2065f65101f82ebe33519
> 840919f9606192c43
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN signed
> +#define TYPE1 int
> +#define TYPE2 long long
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
> b/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..77c3dce12049989d344e465
> 2bd593de2e846ae25
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN unsigned
> +#define TYPE1 char
> +#define TYPE2 short
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
> b/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..ae30e864ed7a239c4052242
> 3d7493ddd33830cc8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN unsigned
> +#define TYPE1 short
> +#define TYPE2 int
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
> b/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..882f3d333e2cc42e964d880
> de5b58c505ec8b122
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN unsigned
> +#define TYPE1 int
> +#define TYPE2 long long
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */


* RE: [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
  2021-10-12 16:39       ` Kyrylo Tkachov
@ 2021-10-13 11:05         ` Tamar Christina
  2021-10-13 12:52           ` Kyrylo Tkachov
  0 siblings, 1 reply; 31+ messages in thread
From: Tamar Christina @ 2021-10-13 11:05 UTC (permalink / raw)
  To: Kyrylo Tkachov, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 8618 bytes --]

> 
> Hmmm, these patterns are identical in what they match; they just have the
> effect of printing operands 1 and 2 in a different order.
> Perhaps it's more compact to change the output template into
> BYTES_BIG_ENDIAN ?
> "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>" :
> "uzp1\\t%0.<V2ntype>, %2.<V2ntype>, %1.<V2ntype>"
> and avoid having a second pattern at all?
> 

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_narrow_trunc<mode>): New.
	* config/aarch64/iterators.md (VNARROWSIMD, Vnarrowsimd): New.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/narrow_high_combine.c: Update case.
	* gcc.target/aarch64/xtn-combine-1.c: New test.
	* gcc.target/aarch64/xtn-combine-2.c: New test.
	* gcc.target/aarch64/xtn-combine-3.c: New test.
	* gcc.target/aarch64/xtn-combine-4.c: New test.
	* gcc.target/aarch64/xtn-combine-5.c: New test.
	* gcc.target/aarch64/xtn-combine-6.c: New test.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 0b340b49fa06684b80d0b78cb712e49328ca92d5..b0dda554466149817a7828dbf4e0ed372a91872b 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1753,6 +1753,23 @@ (define_expand "aarch64_xtn2<mode>"
   }
 )
 
+(define_insn "*aarch64_narrow_trunc<mode>"
+  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
+	(vec_concat:<VNARROWQ2>
+          (truncate:<VNARROWQ>
+            (match_operand:VQN 1 "register_operand" "w"))
+	  (truncate:<VNARROWQ>
+	    (match_operand:VQN 2 "register_operand" "w"))))]
+  "TARGET_SIMD"
+{
+  if (!BYTES_BIG_ENDIAN)
+    return "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>";
+  else
+    return "uzp1\\t%0.<V2ntype>, %2.<V2ntype>, %1.<V2ntype>";
+}
+  [(set_attr "type" "neon_permute<q>")]
+)
+
 ;; Packing doubles.
 
 (define_expand "vec_pack_trunc_<mode>"
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 8dbeed3b0d4a44cdc17dd333ed397b39a33f386a..95b385c0c9405fe95fcd07262a9471ab13d5488e 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -270,6 +270,14 @@ (define_mode_iterator VDQHS [V4HI V8HI V2SI V4SI])
 ;; Advanced SIMD modes for H, S and D types.
 (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
 
+;; Modes for which we can narrow the element and increase the lane counts
+;; to preserve the same register size.
+(define_mode_attr VNARROWSIMD [(V4HI "V8QI") (V8HI "V16QI") (V4SI "V8HI")
+			       (V2SI "V4HI") (V2DI "V4SI")])
+
+(define_mode_attr Vnarrowsimd [(V4HI "v8qi") (V8HI "v16qi") (V4SI "v8hi")
+			       (V2SI "v4hi") (V2DI "v4si")])
+
 ;; Advanced SIMD and scalar integer modes for H and S.
 (define_mode_iterator VSDQ_HSI [V4HI V8HI V2SI V4SI HI SI])
 
diff --git a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
index 50ecab002a3552d37a5cc0d8921f42f6c3dba195..fa61196d3644caa48b12151e12b15dfeab8c7e71 100644
--- a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
+++ b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
@@ -225,7 +225,8 @@ TEST_2_UNARY (vqmovun, uint32x4_t, int64x2_t, s64, u32)
 /* { dg-final { scan-assembler-times "\\tuqshrn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqrshrn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tuqrshrn2\\tv" 6} }  */
-/* { dg-final { scan-assembler-times "\\txtn2\\tv" 12} }  */
+/* { dg-final { scan-assembler-times "\\txtn2\\tv" 6} }  */
+/* { dg-final { scan-assembler-times "\\tuzp1\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tuqxtn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqxtn2\\tv" 6} }  */
 /* { dg-final { scan-assembler-times "\\tsqxtun2\\tv" 6} }  */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..14e0414cd1478f1cb7b17766aa8d4451c5659977
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN signed
+#define TYPE1 char
+#define TYPE2 short
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..c259010442bca4ba008706e47b3ffcc50a910b52
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN signed
+#define TYPE1 short
+#define TYPE2 int
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..9a2065f65101f82ebe33519840919f9606192c43
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN signed
+#define TYPE1 int
+#define TYPE2 long long
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
new file mode 100644
index 0000000000000000000000000000000000000000..77c3dce12049989d344e4652bd593de2e846ae25
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN unsigned
+#define TYPE1 char
+#define TYPE2 short
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
new file mode 100644
index 0000000000000000000000000000000000000000..ae30e864ed7a239c40522423d7493ddd33830cc8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN unsigned
+#define TYPE1 short
+#define TYPE2 int
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
new file mode 100644
index 0000000000000000000000000000000000000000..882f3d333e2cc42e964d880de5b58c505ec8b122
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
+
+#define SIGN unsigned
+#define TYPE1 int
+#define TYPE2 long long
+
+void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
+{
+    for (int i = 0; i < n; i++)
+      a[i] = b[i];
+}
+
+/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
+/* { dg-final { scan-assembler-not {\txtn\t} } } */
+/* { dg-final { scan-assembler-not {\txtn2\t} } } */

[-- Attachment #2: rb14895.patch --]
[-- Type: application/octet-stream, Size: 7366 bytes --]



* RE: [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2>
  2021-10-05 12:55         ` Tamar Christina
@ 2021-10-13 12:17           ` Richard Biener
  0 siblings, 0 replies; 31+ messages in thread
From: Richard Biener @ 2021-10-13 12:17 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc-patches, nd

On Tue, 5 Oct 2021, Tamar Christina wrote:

> Hi All,
> 
> Here's a new version of the patch handling both scalar and vector modes
> and non-uniform constant vectors.
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu and
> x86_64-pc-linux-gnu with no regressions.
> 
> In order to not break IVopts and CSE I have added a
> requirement for the scalar version to be single use.

OK.

Thanks,
Richard.
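
As an illustration of the equality and inequality cases of the transform discussed in this message, the following C sketch checks that masking off the low 8 bits and comparing with zero gives the same answer as an unsigned comparison against 255. It is a hedged sketch only: the helper names are invented here, the mask 255 is taken from the bic-bitmask-10.c test, and the signed case goes through an unsigned conversion in the same spirit as the convert:utype step in the pattern:

```c
#include <assert.h>
#include <stdint.h>

/* Original form: clear the low 8 bits and test the rest for zero.  */
static int masked_is_zero(uint32_t x)
{
  return (x & ~UINT32_C(255)) == 0;
}

/* Rewritten form: one unsigned comparison against the inverted mask.  */
static int fits_in_low_bits(uint32_t x)
{
  return x <= UINT32_C(255);
}

/* Checks that both forms agree on a range of inputs; returns 1 if so.  */
int bitmask_forms_agree(void)
{
  static const uint32_t edges[] = {
    0u, 1u, 254u, 255u, 256u, 257u,
    0x7fffffffu, 0x80000000u, 0xffffff00u, 0xffffffffu
  };
  int ok = 1;

  /* Unsigned: (x & ~255) == 0 is the same as x <= 255, and the
     negated forms (!= 0 versus > 255) follow from it.  */
  for (uint32_t x = 0; x < 100000u; x++)
    ok &= masked_is_zero(x) == fits_in_low_bits(x);
  for (unsigned i = 0; i < sizeof edges / sizeof edges[0]; i++)
    ok &= masked_is_zero(edges[i]) == fits_in_low_bits(edges[i]);

  /* Signed: only == and != are safe, after converting to unsigned.  */
  for (int32_t s = -1000; s <= 1000; s++)
    ok &= ((s & ~255) == 0) == ((uint32_t) s <= 255u);

  assert(ok);
  return ok;
}
```

Only the == and != comparisons are exercised here; the sketch makes no claim about the other comparison codes listed in the pattern.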

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* tree.c (bitmask_inv_cst_vector_p): New.
> 	* tree.h (bitmask_inv_cst_vector_p): New.
> 	* match.pd: Use it in new bitmask compare pattern.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.dg/bic-bitmask-10.c: New test.
> 	* gcc.dg/bic-bitmask-11.c: New test.
> 	* gcc.dg/bic-bitmask-12.c: New test.
> 	* gcc.dg/bic-bitmask-13.c: New test.
> 	* gcc.dg/bic-bitmask-14.c: New test.
> 	* gcc.dg/bic-bitmask-15.c: New test.
> 	* gcc.dg/bic-bitmask-16.c: New test.
> 	* gcc.dg/bic-bitmask-17.c: New test.
> 	* gcc.dg/bic-bitmask-18.c: New test.
> 	* gcc.dg/bic-bitmask-19.c: New test.
> 	* gcc.dg/bic-bitmask-2.c: New test.
> 	* gcc.dg/bic-bitmask-20.c: New test.
> 	* gcc.dg/bic-bitmask-21.c: New test.
> 	* gcc.dg/bic-bitmask-22.c: New test.
> 	* gcc.dg/bic-bitmask-23.c: New test.
> 	* gcc.dg/bic-bitmask-3.c: New test.
> 	* gcc.dg/bic-bitmask-4.c: New test.
> 	* gcc.dg/bic-bitmask-5.c: New test.
> 	* gcc.dg/bic-bitmask-6.c: New test.
> 	* gcc.dg/bic-bitmask-7.c: New test.
> 	* gcc.dg/bic-bitmask-8.c: New test.
> 	* gcc.dg/bic-bitmask-9.c: New test.
> 	* gcc.dg/bic-bitmask.h: New test.
> 	* gcc.target/aarch64/bic-bitmask-1.c: New test.
> 
> --- inline copy of patch --
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 0fcfd0ea62c043dc217d0d560ce5b7e569b70e7d..7d2a24dbc5e9644a09968f877e12a824d8ba1caa 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -37,7 +37,8 @@ along with GCC; see the file COPYING3.  If not see
>     integer_pow2p
>     uniform_integer_cst_p
>     HONOR_NANS
> -   uniform_vector_p)
> +   uniform_vector_p
> +   bitmask_inv_cst_vector_p)
>  
>  /* Operator lists.  */
>  (define_operator_list tcc_comparison
> @@ -4900,6 +4901,24 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>        (eqcmp (bit_and @1 { wide_int_to_tree (ty, mask - rhs); })
>  	     { build_zero_cst (ty); }))))))
>  
> +/* Transform comparisons of the form (X & Y) CMP 0 to X CMP2 Z
> +   where ~Y + 1 == pow2 and Z = ~Y.  */
> +(for cst (VECTOR_CST INTEGER_CST)
> + (for cmp (le eq ne ge gt)
> +      icmp (le le gt le gt)
> + (simplify
> +  (cmp (bit_and:c@2 @0 cst@1) integer_zerop)
> +   (with { tree csts = bitmask_inv_cst_vector_p (@1); }
> +     (switch
> +      (if (csts && TYPE_UNSIGNED (TREE_TYPE (@1))
> +	   && (VECTOR_TYPE_P (TREE_TYPE (@1)) || single_use (@2)))
> +       (icmp @0 { csts; }))
> +      (if (csts && !TYPE_UNSIGNED (TREE_TYPE (@1))
> +	   && (cmp == EQ_EXPR || cmp == NE_EXPR)
> +	   && (VECTOR_TYPE_P (TREE_TYPE (@1)) || single_use (@2)))
> +       (with { tree utype = unsigned_type_for (TREE_TYPE (@1)); }
> +	(icmp (convert:utype @0) { csts; }))))))))
> +
>  /* -A CMP -B -> B CMP A.  */
>  (for cmp (tcc_comparison)
>       scmp (swapped_tcc_comparison)
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-10.c b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..76a22a2313137a2a75dd711c2c15c2d3a34e15aa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-10.c
> @@ -0,0 +1,26 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(int32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(int32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#define TYPE int32_t
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-11.c b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..32553d7ba2f823f7a21237451990d0a216d2f912
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-11.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) != 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) != 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump {>\s*.+\{ 255,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-12.c b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..e10cbf7fabe2dbf7ce436cdf37b0f8b207c58408
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-12.c
> @@ -0,0 +1,17 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +typedef unsigned int v4si __attribute__ ((vector_size (16)));
> +
> +__attribute__((noinline, noipa))
> +void fun(v4si *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-13.c b/gcc/testsuite/gcc.dg/bic-bitmask-13.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..8ebaa30238c761b8831685209a7490f06591c000
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-13.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O0 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-14.c b/gcc/testsuite/gcc.dg/bic-bitmask-14.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..0a2789433f8bc45a590d136179b8ee4ec5cda1c1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-14.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-15.c b/gcc/testsuite/gcc.dg/bic-bitmask-15.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..53a2c986b00f159ae5fa839798850ac42e9b9504
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-15.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) >= 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) >= 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {=\s* 1} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-16.c b/gcc/testsuite/gcc.dg/bic-bitmask-16.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..fff4670e2a47106c614c1224b8e8aac091d6e821
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-16.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) > 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) > 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {>\s* 255} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-17.c b/gcc/testsuite/gcc.dg/bic-bitmask-17.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..7b877fde017de0fb7aeabd4152a1593f07d52e71
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-17.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) <= 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) <= 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s* 255} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s* 4294967040} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-18.c b/gcc/testsuite/gcc.dg/bic-bitmask-18.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..8455bf5286a7cc9f08713489e29b5f2b6f6fc012
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-18.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) < 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) < 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {= 0} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-19.c b/gcc/testsuite/gcc.dg/bic-bitmask-19.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..ad266cec0d10a1afc0bdacd8762d328bff1f6ea4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-19.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) != 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) != 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {>\s* 1} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s* 4294967294} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-2.c b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-2.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-20.c b/gcc/testsuite/gcc.dg/bic-bitmask-20.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..befd25cb4aac6fe206110e7ca80816dd6fc0ed94
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-20.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~5)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~5)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-not {<=\s* 4294967289} dce7 } } */
> +/* { dg-final { scan-tree-dump {&\s* 4294967290} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-21.c b/gcc/testsuite/gcc.dg/bic-bitmask-21.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..b037ffc248eef1509c642abb0087b77882679150
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-21.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(int32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(int32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#define TYPE int32_t
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump {<=\s* 255} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s* 4294967290} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-22.c b/gcc/testsuite/gcc.dg/bic-bitmask-22.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..20027a6367a1dcec12492435f260250a7b54aca1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-22.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) != 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O0")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) != 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump {>\s* 255} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s* 4294967290} dce7 } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-23.c b/gcc/testsuite/gcc.dg/bic-bitmask-23.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..9aef4baed796cf7bf4998d1b4f4ddfbc457add29
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-23.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O1 -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +typedef unsigned int v4si __attribute__ ((vector_size (16)));
> +
> +__attribute__((noinline, noipa))
> +v4si fun(v4si x)
> +{
> +    v4si mask = { 255, 15, 1, 0xFFFF };
> +    v4si zeros = {0};
> +    return (x & ~mask) == zeros;
> +}
> +
> +/* { dg-final { scan-tree-dump {<=\s*.+\{ 255, 15, 1, 65535 \}} dce7 } } */
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-3.c b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..da30fad89f6c8239baa4395b3ffaec0be577e13f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-3.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-4.c b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..1bcf23ccf1447d6c8c999ed1eb25ba0a450028e1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-4.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) >= 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) >= 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {=\s*.+\{ 1,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-5.c b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..d053727076abedefdecfda7c4fea6f92d54a94a5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-5.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) > 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) > 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {>\s*.+\{ 255,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-6.c b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..018e7a4348c9fc461106c3d9d01291325d3406c2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-6.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) <= 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~255)) <= 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {<=\s*.+\{ 255,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967040,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-7.c b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..798678fb7555052c93abc4ca34f617d640f73bb4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-7.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) < 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) < 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {__builtin_memset} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-8.c b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..1dabe834ed57dfa0be48c1dc3dbb226092c79a1a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-8.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) != 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~1)) != 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-times {>\s*.+\{ 1,.+\}} 1 dce7 } } */
> +/* { dg-final { scan-tree-dump-not {&\s*.+\{ 4294967294,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-9.c b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..9c1f8ee0adfc45d1b9fc212138ea26bb6b693e49
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask-9.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -save-temps -fdump-tree-dce" } */
> +
> +#include <stdint.h>
> +
> +__attribute__((noinline, noipa))
> +void fun1(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~5)) == 0;
> +}
> +
> +__attribute__((noinline, noipa, optimize("O1")))
> +void fun2(uint32_t *x, int n)
> +{
> +    for (int i = 0; i < (n & -16); i++)
> +      x[i] = (x[i]&(~5)) == 0;
> +}
> +
> +#include "bic-bitmask.h"
> +
> +/* { dg-final { scan-tree-dump-not {<=\s*.+\{ 4294967289,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump {&\s*.+\{ 4294967290,.+\}} dce7 } } */
> +/* { dg-final { scan-tree-dump-not {\s+bic\s+} dce7 { target { aarch64*-*-* } } } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/bic-bitmask.h b/gcc/testsuite/gcc.dg/bic-bitmask.h
> new file mode 100644
> index 0000000000000000000000000000000000000000..faf80b974db07a7d817a615cc64a35f1020e9764
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/bic-bitmask.h
> @@ -0,0 +1,43 @@
> +#include <stdio.h>
> +
> +#ifndef N
> +#define N 65
> +#endif
> +
> +#ifndef TYPE
> +#define TYPE uint32_t
> +#endif
> +
> +#ifndef DEBUG
> +#define DEBUG 0
> +#endif
> +
> +#define BASE ((TYPE) -1 < 0 ? -126 : 4)
> +
> +int main ()
> +{
> +  TYPE a[N];
> +  TYPE b[N];
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      a[i] = BASE + i * 13;
> +      b[i] = BASE + i * 13;
> +      if (DEBUG)
> +        printf ("%d: 0x%x\n", i, a[i]);
> +    }
> +
> +  fun1 (a, N);
> +  fun2 (b, N);
> +
> +  for (int i = 0; i < N; ++i)
> +    {
> +      if (DEBUG)
> +        printf ("%d = 0x%x == 0x%x\n", i, a[i], b[i]);
> +
> +      if (a[i] != b[i])
> +        __builtin_abort ();
> +    }
> +  return 0;
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..568c1ffc8bc4148efaeeba7a45a75ecbd3a7a3dd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/bic-bitmask-1.c
> @@ -0,0 +1,13 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O2 -save-temps" } */
> +
> +#include <arm_neon.h>
> +
> +uint32x4_t foo (int32x4_t a)
> +{
> +  int32x4_t cst = vdupq_n_s32 (255);
> +  int32x4_t zero = vdupq_n_s32 (0);
> +  return vceqq_s32 (vbicq_s32 (a, cst), zero);
> +}
> +
> +/* { dg-final { scan-assembler-not {\tbic\t} { xfail { aarch64*-*-* } } } } */
> diff --git a/gcc/tree.h b/gcc/tree.h
> index 905417fd17b1721f6c65d3cdf66a5fd5f920de7f..3583d51409cad9471fe45de64806e78d855bd97c 100644
> --- a/gcc/tree.h
> +++ b/gcc/tree.h
> @@ -4862,6 +4862,11 @@ extern bool integer_minus_onep (const_tree);
>  
>  extern bool integer_pow2p (const_tree);
>  
> +/* Checks to see if T is a constant or a constant vector and if each element E
> +   adheres to ~E + 1 == pow2 then return ~E otherwise NULL_TREE.  */
> +
> +extern tree bitmask_inv_cst_vector_p (tree);
> +
>  /* integer_nonzerop (tree x) is nonzero if X is an integer constant
>     with a nonzero value.  */
>  
> diff --git a/gcc/tree.c b/gcc/tree.c
> index cba3bca41b3a50557939267b7a06df04b5d042b7..801f121a368b088d0f769f16f2ed29e320e71556 100644
> --- a/gcc/tree.c
> +++ b/gcc/tree.c
> @@ -10246,6 +10246,59 @@ uniform_integer_cst_p (tree t)
>    return NULL_TREE;
>  }
>  
> +/* Checks to see if T is a constant or a constant vector and if each element E
> +   adheres to ~E + 1 == pow2 then return ~E otherwise NULL_TREE.  */
> +
> +tree
> +bitmask_inv_cst_vector_p (tree t)
> +{
> +
> +  tree_code code = TREE_CODE (t);
> +  tree type = TREE_TYPE (t);
> +
> +  if (!INTEGRAL_TYPE_P (type)
> +      && !VECTOR_INTEGER_TYPE_P (type))
> +    return NULL_TREE;
> +
> +  unsigned HOST_WIDE_INT nelts = 1;
> +  tree cst;
> +  unsigned int idx = 0;
> +  bool uniform = uniform_integer_cst_p (t);
> +  tree newtype = unsigned_type_for (type);
> +  tree_vector_builder builder;
> +  if (code == INTEGER_CST)
> +    cst = t;
> +  else
> +    {
> +      if (!VECTOR_CST_NELTS (t).is_constant (&nelts))
> +	return NULL_TREE;
> +
> +      cst = vector_cst_elt (t, 0);
> +      builder.new_vector (newtype, nelts, 1);
> +    }
> +
> +  tree ty = unsigned_type_for (TREE_TYPE (cst));
> +
> +  do {
> +    if (idx > 0)
> +      cst = vector_cst_elt (t, idx);
> +    wide_int icst = wi::to_wide (cst);
> +    wide_int inv =  wi::bit_not (icst);
> +    icst = wi::add (1, inv);
> +    if (wi::popcount (icst) != 1)
> +      return NULL_TREE;
> +
> +    tree newcst = wide_int_to_tree (ty, inv);
> +
> +    if (uniform)
> +      return build_uniform_cst (newtype, newcst);
> +
> +    builder.quick_push (newcst);
> +  } while (++idx < nelts);
> +
> +  return builder.build ();
> +}
> +
>  /* If VECTOR_CST T has a single nonzero element, return the index of that
>     element, otherwise return -1.  */
> 
> 
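
For illustration, the rewrite performed by the match.pd hunk above can be checked on plain unsigned scalars. This is only a sketch: the helper names (`is_inverted_low_mask`, `masked_eq_zero`, `le_inverted`, `forms_agree`) are made up for this example and are not part of the patch, which works on trees and wide_ints rather than `uint32_t`. It assumes the unsigned `==` case only, where `(X & Y) == 0` becomes `X <= ~Y` whenever `~Y + 1` is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar analogue of the check in the patch: ~Y + 1 must be a power of two,
   i.e. Y clears a contiguous run of low bits and keeps everything above.  */
static int is_inverted_low_mask (uint32_t y)
{
  uint32_t t = ~y + 1u;
  return t != 0 && (t & (t - 1)) == 0;
}

/* Original form: (X & Y) == 0.  */
static int masked_eq_zero (uint32_t x, uint32_t y)
{
  return (x & y) == 0;
}

/* Rewritten form: X <= Z with Z = ~Y.  */
static int le_inverted (uint32_t x, uint32_t y)
{
  return x <= (uint32_t) ~y;
}

/* Returns 1 when both forms give the same answer for every x in [0, limit).  */
static int forms_agree (uint32_t y, uint32_t limit)
{
  for (uint32_t x = 0; x < limit; x++)
    if (masked_eq_zero (x, y) != le_inverted (x, y))
      return 0;
  return 1;
}
```

With Y = ~255 the two forms agree, which is the case the bic-bitmask tests with `~255` exercise. With Y = ~5 the predicate rejects the constant and the forms really do differ (x = 2 is one counterexample), which matches the tests that expect the `&` to remain.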

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


* RE: [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
  2021-10-13 11:05         ` Tamar Christina
@ 2021-10-13 12:52           ` Kyrylo Tkachov
  0 siblings, 0 replies; 31+ messages in thread
From: Kyrylo Tkachov @ 2021-10-13 12:52 UTC (permalink / raw)
  To: Tamar Christina, gcc-patches
  Cc: nd, Richard Earnshaw, Marcus Shawcroft, Richard Sandiford



> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Wednesday, October 13, 2021 12:06 PM
> To: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Richard Sandiford
> <Richard.Sandiford@arm.com>
> Subject: RE: [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2
> 
> >
> > Hmmm, these patterns are identical in what they match; they just have the
> > effect of printing operands 1 and 2 in a different order.
> > Perhaps it's more compact to change the output template into a
> > BYTES_BIG_ENDIAN
> > ? "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>"
> > : "uzp1\\t%0.<V2ntype>, %2.<V2ntype>, %1.<V2ntype>"
> > and avoid having a second pattern at all?
> >
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> 
> Ok for master?

Sorry I should have noticed earlier but...
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 8dbeed3b0d4a44cdc17dd333ed397b39a33f386a..95b385c0c9405fe95fcd07262a9471ab13d5488e 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -270,6 +270,14 @@ (define_mode_iterator VDQHS [V4HI V8HI V2SI V4SI])
 ;; Advanced SIMD modes for H, S and D types.
 (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
 
+;; Modes for which we can narrow the element and increase the lane counts
+;; to preserve the same register size.
+(define_mode_attr VNARROWSIMD [(V4HI "V8QI") (V8HI "V16QI") (V4SI "V8HI")
+			       (V2SI "V4HI") (V2DI "V4SI")])
+
+(define_mode_attr Vnarrowsimd [(V4HI "v8qi") (V8HI "v16qi") (V4SI "v8hi")
+			       (V2SI "v4hi") (V2DI "v4si")])
+

These attributes don't seem to be needed.
So the patch is ok without this hunk.
Thanks,
Kyrill
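
As a rough illustration of why a single uzp1 can stand in for the xtn + xtn2 pair, the following sketch models both on plain arrays. It assumes little-endian lane layout only (the operand swap for big-endian in the pattern is not modelled), and every name in it (`narrow_concat`, `to_bytes`, `uzp1_bytes`, `uzp1_matches_narrow_concat`) is invented for this example rather than taken from GCC or arm_neon.h.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* xtn + xtn2: truncate each 16-bit lane of A into the low half of OUT and
   each 16-bit lane of B into the high half.  */
static void narrow_concat (const uint16_t a[LANES], const uint16_t b[LANES],
                           uint8_t out[2 * LANES])
{
  for (int i = 0; i < LANES; i++)
    {
      out[i] = (uint8_t) a[i];
      out[i + LANES] = (uint8_t) b[i];
    }
}

/* View a vector of 16-bit lanes as bytes, low byte of each lane first.  The
   layout is written out explicitly so the host byte order does not matter.  */
static void to_bytes (const uint16_t v[LANES], uint8_t bytes[2 * LANES])
{
  for (int i = 0; i < LANES; i++)
    {
      bytes[2 * i] = (uint8_t) (v[i] & 0xff);
      bytes[2 * i + 1] = (uint8_t) (v[i] >> 8);
    }
}

/* uzp1 on byte lanes: the even-numbered bytes of A followed by the
   even-numbered bytes of B.  */
static void uzp1_bytes (const uint16_t a[LANES], const uint16_t b[LANES],
                        uint8_t out[2 * LANES])
{
  uint8_t ab[2 * LANES], bb[2 * LANES];
  to_bytes (a, ab);
  to_bytes (b, bb);
  for (int i = 0; i < LANES; i++)
    {
      out[i] = ab[2 * i];
      out[i + LANES] = bb[2 * i];
    }
}

/* Returns 1 when both models produce the same 16 bytes for a sample input.  */
static int uzp1_matches_narrow_concat (void)
{
  uint16_t a[LANES], b[LANES];
  uint8_t x[2 * LANES], y[2 * LANES];

  for (int i = 0; i < LANES; i++)
    {
      a[i] = (uint16_t) (0x1234 + i * 0x0111);
      b[i] = (uint16_t) (0xfedc - i * 0x0101);
    }

  narrow_concat (a, b, x);
  uzp1_bytes (a, b, y);

  for (int i = 0; i < 2 * LANES; i++)
    if (x[i] != y[i])
      return 0;
  return 1;
}
```

The even-numbered bytes are exactly the low halves of the 16-bit lanes, so selecting them is the same as truncating each lane.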

> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* config/aarch64/aarch64-simd.md (*aarch64_narrow_trunc<mode>): New.
> 	* config/aarch64/iterators.md (VNARROWSIMD, Vnarrowsimd): New.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/narrow_high_combine.c: Update case.
> 	* gcc.target/aarch64/xtn-combine-1.c: New test.
> 	* gcc.target/aarch64/xtn-combine-2.c: New test.
> 	* gcc.target/aarch64/xtn-combine-3.c: New test.
> 	* gcc.target/aarch64/xtn-combine-4.c: New test.
> 	* gcc.target/aarch64/xtn-combine-5.c: New test.
> 	* gcc.target/aarch64/xtn-combine-6.c: New test.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 0b340b49fa06684b80d0b78cb712e49328ca92d5..b0dda554466149817a7828dbf4e0ed372a91872b 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1753,6 +1753,23 @@ (define_expand "aarch64_xtn2<mode>"
>    }
>  )
> 
> +(define_insn "*aarch64_narrow_trunc<mode>"
> +  [(set (match_operand:<VNARROWQ2> 0 "register_operand" "=w")
> +	(vec_concat:<VNARROWQ2>
> +          (truncate:<VNARROWQ>
> +            (match_operand:VQN 1 "register_operand" "w"))
> +	  (truncate:<VNARROWQ>
> +	    (match_operand:VQN 2 "register_operand" "w"))))]
> +  "TARGET_SIMD"
> +{
> +  if (!BYTES_BIG_ENDIAN)
> +    return "uzp1\\t%0.<V2ntype>, %1.<V2ntype>, %2.<V2ntype>";
> +  else
> +    return "uzp1\\t%0.<V2ntype>, %2.<V2ntype>, %1.<V2ntype>";
> +}
> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
>  ;; Packing doubles.
> 
>  (define_expand "vec_pack_trunc_<mode>"
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 8dbeed3b0d4a44cdc17dd333ed397b39a33f386a..95b385c0c9405fe95fcd07262a9471ab13d5488e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -270,6 +270,14 @@ (define_mode_iterator VDQHS [V4HI V8HI V2SI V4SI])
>  ;; Advanced SIMD modes for H, S and D types.
>  (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
> 
> +;; Modes for which we can narrow the element and increase the lane counts
> +;; to preserve the same register size.
> +(define_mode_attr VNARROWSIMD [(V4HI "V8QI") (V8HI "V16QI") (V4SI "V8HI")
> +			       (V2SI "V4HI") (V2DI "V4SI")])
> +
> +(define_mode_attr Vnarrowsimd [(V4HI "v8qi") (V8HI "v16qi") (V4SI "v8hi")
> +			       (V2SI "v4hi") (V2DI "v4si")])
> +
>  ;; Advanced SIMD and scalar integer modes for H and S.
>  (define_mode_iterator VSDQ_HSI [V4HI V8HI V2SI V4SI HI SI])
> 
> diff --git a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> index 50ecab002a3552d37a5cc0d8921f42f6c3dba195..fa61196d3644caa48b12151e12b15dfeab8c7e71 100644
> --- a/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> +++ b/gcc/testsuite/gcc.target/aarch64/narrow_high_combine.c
> @@ -225,7 +225,8 @@ TEST_2_UNARY (vqmovun, uint32x4_t, int64x2_t, s64, u32)
>  /* { dg-final { scan-assembler-times "\\tuqshrn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqrshrn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tuqrshrn2\\tv" 6} }  */
> -/* { dg-final { scan-assembler-times "\\txtn2\\tv" 12} }  */
> +/* { dg-final { scan-assembler-times "\\txtn2\\tv" 6} }  */
> +/* { dg-final { scan-assembler-times "\\tuzp1\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tuqxtn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqxtn2\\tv" 6} }  */
>  /* { dg-final { scan-assembler-times "\\tsqxtun2\\tv" 6} }  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..14e0414cd1478f1cb7b17766aa8d4451c5659977
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-1.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN signed
> +#define TYPE1 char
> +#define TYPE2 short
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c259010442bca4ba008706e47b3ffcc50a910b52
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-2.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN signed
> +#define TYPE1 short
> +#define TYPE2 int
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..9a2065f65101f82ebe33519840919f9606192c43
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-3.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN signed
> +#define TYPE1 int
> +#define TYPE2 long long
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..77c3dce12049989d344e4652bd593de2e846ae25
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-4.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN unsigned
> +#define TYPE1 char
> +#define TYPE2 short
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..ae30e864ed7a239c40522423d7493ddd33830cc8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-5.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN unsigned
> +#define TYPE1 short
> +#define TYPE2 int
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c b/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..882f3d333e2cc42e964d880de5b58c505ec8b122
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/xtn-combine-6.c
> @@ -0,0 +1,16 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O3 --save-temps --param=vect-epilogues-nomask=0" } */
> +
> +#define SIGN unsigned
> +#define TYPE1 int
> +#define TYPE2 long long
> +
> +void d2 (SIGN TYPE1 * restrict a, SIGN TYPE2 *b, int n)
> +{
> +    for (int i = 0; i < n; i++)
> +      a[i] = b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\tuzp1\t} 1 } } */
> +/* { dg-final { scan-assembler-not {\txtn\t} } } */
> +/* { dg-final { scan-assembler-not {\txtn2\t} } } */
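
For readers following the xtn-combine tests above: the reason a single uzp1 can stand in for an xtn + xtn2 pair is that, on little-endian, truncating each wide lane keeps its low half, which is the even-indexed narrow element of the same register. Below is a minimal scalar sketch of that equivalence for the 16-bit to 8-bit case. It is only a model in plain C, not the patch's implementation; the helper names (narrow_pair, as_bytes_le, uzp1_bytes, narrow_matches_uzp1) are made up for illustration and the little-endian lane layout is an assumption built into as_bytes_le.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8 /* 16-bit lanes in one 128-bit vector */

/* Model of xtn (fills the low half) followed by xtn2 (fills the high
   half): truncate every 16-bit lane of lo and hi to 8 bits.  */
static void narrow_pair(const uint16_t lo[LANES], const uint16_t hi[LANES],
                        uint8_t out[2 * LANES])
{
    for (int i = 0; i < LANES; i++) {
        out[i] = (uint8_t) lo[i];
        out[LANES + i] = (uint8_t) hi[i];
    }
}

/* View a vector of 16-bit lanes as 16 bytes, assuming little-endian
   lane layout (low byte of each lane first).  */
static void as_bytes_le(const uint16_t v[LANES], uint8_t bytes[2 * LANES])
{
    for (int i = 0; i < LANES; i++) {
        bytes[2 * i] = (uint8_t) (v[i] & 0xff);
        bytes[2 * i + 1] = (uint8_t) (v[i] >> 8);
    }
}

/* Model of uzp1 on byte vectors: the even-indexed elements of a,
   followed by the even-indexed elements of b.  */
static void uzp1_bytes(const uint8_t a[2 * LANES], const uint8_t b[2 * LANES],
                       uint8_t out[2 * LANES])
{
    for (int i = 0; i < LANES; i++) {
        out[i] = a[2 * i];
        out[LANES + i] = b[2 * i];
    }
}

/* Returns nonzero when both routes produce the same 16 bytes.  */
static int narrow_matches_uzp1(const uint16_t lo[LANES],
                               const uint16_t hi[LANES])
{
    uint8_t a[2 * LANES], b[2 * LANES];
    uint8_t via_xtn[2 * LANES], via_uzp1[2 * LANES];

    narrow_pair(lo, hi, via_xtn);
    as_bytes_le(lo, a);
    as_bytes_le(hi, b);
    uzp1_bytes(a, b, via_uzp1);
    return memcmp(via_xtn, via_uzp1, sizeof via_xtn) == 0;
}
```

Calling narrow_matches_uzp1 on any two input vectors should return nonzero. The same argument carries over to the 32-to-16 and 64-to-32 cases exercised by xtn-combine-2.c, -3.c, -5.c and -6.c, with uzp1 operating on halfword or word elements instead of bytes.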


end of thread, other threads:[~2021-10-13 12:52 UTC | newest]

Thread overview: 31+ messages
2021-09-29 16:19 [PATCH 0/7]AArch64 Optimize truncation, shifts and bitmask comparisons Tamar Christina
2021-09-29 16:19 ` [PATCH 1/7]AArch64 Add combine patterns for right shift and narrow Tamar Christina
2021-09-30  8:50   ` Kyrylo Tkachov
2021-10-06 14:32     ` Richard Sandiford
2021-10-12 16:18       ` Tamar Christina
2021-10-12 16:35         ` Kyrylo Tkachov
2021-09-29 16:19 ` [PATCH 2/7]AArch64 Add combine patterns for narrowing shift of half top bits (shuffle) Tamar Christina
2021-09-30  8:54   ` Kyrylo Tkachov
2021-10-12 16:23     ` Tamar Christina
2021-10-12 16:36       ` Kyrylo Tkachov
2021-09-29 16:20 ` [PATCH 3/7]AArch64 Add pattern for sshr to cmlt Tamar Christina
2021-09-30  9:27   ` Kyrylo Tkachov
2021-10-11 19:56     ` Andrew Pinski
2021-10-12 12:19       ` Kyrylo Tkachov
2021-10-12 16:20         ` Tamar Christina
2021-09-29 16:20 ` [PATCH 4/7]AArch64 Add pattern xtn+xtn2 to uzp2 Tamar Christina
2021-09-30  9:28   ` Kyrylo Tkachov
2021-10-12 16:25     ` Tamar Christina
2021-10-12 16:39       ` Kyrylo Tkachov
2021-10-13 11:05         ` Tamar Christina
2021-10-13 12:52           ` Kyrylo Tkachov
2021-09-29 16:21 ` [PATCH 5/7]middle-end Convert bitclear <imm> + cmp<cc> #0 into cm<cc2> <imm2> Tamar Christina
2021-09-30  6:17   ` Richard Biener
2021-09-30  9:56     ` Tamar Christina
2021-09-30 10:26       ` Richard Biener
2021-10-05 12:55         ` Tamar Christina
2021-10-13 12:17           ` Richard Biener
2021-09-29 16:21 ` [PATCH 6/7]AArch64 Add neg + cmle into cmgt Tamar Christina
2021-09-30  9:34   ` Kyrylo Tkachov
2021-09-29 16:21 ` [PATCH 7/7]AArch64 Combine cmeq 0 + not into cmtst Tamar Christina
2021-09-30  9:35   ` Kyrylo Tkachov
