From: Tamar Christina <tamar.christina@arm.com>
To: gcc-patches@gcc.gnu.org
Cc: nd@arm.com, Richard.Earnshaw@arm.com, Marcus.Shawcroft@arm.com,
Kyrylo.Tkachov@arm.com, richard.sandiford@arm.com
Subject: [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]
Date: Thu, 9 Feb 2023 17:22:09 +0000 [thread overview]
Message-ID: <Y+UrwQSz3hWz+Puo@arm.com> (raw)
In-Reply-To: <patch-16909-tamar@arm.com>
[-- Attachment #1: Type: text/plain, Size: 14446 bytes --]
Hi All,
This replaces the custom division hook with just an implementation through
add_highpart. For NEON we implement the add highpart (Addition + extraction of
the upper highpart of the register in the same precision) as ADD + LSR.
This representation allows us to easily optimize the sequence using existing
sequences. This gets us a pretty decent sequence using SRA:
umull v1.8h, v0.8b, v3.8b
umull2 v0.8h, v0.16b, v3.16b
add v5.8h, v1.8h, v2.8h
add v4.8h, v0.8h, v2.8h
usra v1.8h, v5.8h, 8
usra v0.8h, v4.8h, 8
uzp2 v1.16b, v1.16b, v0.16b
To get the most optimal sequence however we match (a + ((b + c) >> n)) where n
is half the precision of the mode of the operation into addhn + uaddw which is
a general good optimization on its own and gets us back to:
.L4:
ldr q0, [x3]
umull v1.8h, v0.8b, v5.8b
umull2 v0.8h, v0.16b, v5.16b
addhn v3.8b, v1.8h, v4.8h
addhn v2.8b, v0.8h, v4.8h
uaddw v1.8h, v1.8h, v3.8b
uaddw v0.8h, v0.8h, v2.8b
uzp2 v1.16b, v1.16b, v0.16b
str q1, [x3], 16
cmp x3, x4
bne .L4
For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:
.L3:
ld1b z0.h, p0/z, [x0, x3]
mul z0.h, p1/m, z0.h, z2.h
add z1.h, z0.h, z3.h
usra z0.h, z1.h, #8
lsr z0.h, z0.h, #8
st1b z0.h, p0, [x0, x3]
inch x3
whilelo p0.h, w3, w2
b.any .L3
.L1:
ret
and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
to addhnb which gets us to:
.L3:
ld1b z0.h, p0/z, [x0, x3]
mul z0.h, p1/m, z0.h, z2.h
addhnb z1.b, z0.h, z3.h
addhnb z0.b, z0.h, z1.h
st1b z0.h, p0, [x0, x3]
inch x3
whilelo p0.h, w3, w2
b.any .L3
There are multiple RTL representations possible for these optimizations, I did
not represent them using a zero_extend because we seem very inconsistent in this
in the backend. Since they are unspecs we won't match them from vector ops
anyway. I figured maintainers would prefer this, but my maintainer ouija board
is still out for repairs :)
There are no new test as new correctness tests were added to the mid-end and
the existing codegen tests for this already exist.
Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
Ok for master?
Thanks,
Tamar
gcc/ChangeLog:
PR target/108583
* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
(<su>add<mode>3_highpart, *bitmask_shift_plus<mode>): New.
* config/aarch64/aarch64-sve2.md (<su>add<mode>3_highpart,
*bitmask_shift_plus<mode>): New.
(@aarch64_bitmask_udiv<mode>3): Remove.
* config/aarch64/aarch64.cc
(aarch64_vectorize_can_special_div_by_constant): Removed.
* config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.
--- inline copy of patch --
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,62 +4867,48 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
}
)
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;; short value of 16-bits
-;; ┌──────────────┬────────────────┐
-;; │ │ │
-;; └──────────────┴────────────────┘
-;; 8-bit part1 ▲ 8-bit part2 ▲
-;; │ │
-;; │ │
-;; +1 +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte. Remember that the addition must be done in
-;; double the precision of the input. Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte. The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
- [(match_operand:VQN 0 "register_operand")
- (match_operand:VQN 1 "register_operand")
- (match_operand:VQN 2 "immediate_operand")]
+;; Implement add_highpart as ADD + RSHIFT, we have various optimization for
+;; narrowing represented as shifts and so this representation will allow us to
+;; further optimize this should the result require narrowing. The alternative
+;; representation of ADDHN + UXTL is less efficient and harder to futher
+;; optimize.
+(define_expand "<su>add<mode>3_highpart"
+ [(set (match_operand:VQN 0 "register_operand")
+ (unspec:VQN [(match_operand:VQN 1 "register_operand")
+ (match_operand:VQN 2 "register_operand")]
+ ADD_HIGHPART))]
+ "TARGET_SIMD"
+{
+ rtx result = gen_reg_rtx (<MODE>mode);
+ int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
+ rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
+ shift_amount);
+ emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
+ emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], result, shift_vector));
+ DONE;
+})
+
+;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+ [(set (match_operand:VQN 0 "register_operand" "=w")
+ (plus:VQN
+ (lshiftrt:VQN
+ (plus:VQN (match_operand:VQN 1 "register_operand" "w")
+ (match_operand:VQN 2 "register_operand" "w"))
+ (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+ (match_operand:VQN 4 "register_operand" "w")))]
"TARGET_SIMD"
+ "#"
+ "&& !reload_completed"
+ [(const_int 0)]
{
- unsigned HOST_WIDE_INT size
- = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
- rtx elt = unwrap_const_vec_duplicate (operands[2]);
- if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
- FAIL;
-
- rtx addend = gen_reg_rtx (<MODE>mode);
- rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
- emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
- rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
- rtx tmp2 = gen_reg_rtx (<MODE>mode);
- emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
- unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
- rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
- emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
- emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
+ rtx tmp = gen_reg_rtx (<VNARROWQ>mode);
+ emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
+ emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
DONE;
-})
+}
+ [(set_attr "type" "neon_add_halve<q>")]
+)
;; pmul.
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 40c0728a7e6f00c395c360ce7625bc2e4a018809..ad01c1ddf9257cec951ed0c16558a3c4d856813b 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -2317,39 +2317,51 @@ (define_insn "@aarch64_sve_<optab><mode>"
;; ---- [INT] Misc optab implementations
;; -------------------------------------------------------------------------
;; Includes:
-;; - aarch64_bitmask_udiv
+;; - add_highpart
;; -------------------------------------------------------------------------
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; See aarch64-simd.md for bigger explanation.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
- [(match_operand:SVE_FULL_HSDI 0 "register_operand")
- (match_operand:SVE_FULL_HSDI 1 "register_operand")
- (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
+;; Implement add_highpart as ADD + RSHIFT, we have various optimization for
+;; narrowing represented as shifts and so this representation will allow us to
+;; further optimize this should the result require narrowing. The alternative
+;; representation of ADDHN + UXTL is less efficient and harder to futher
+;; optimize.
+(define_expand "<su>add<mode>3_highpart"
+ [(set (match_operand:SVE_FULL_HSDI 0 "register_operand")
+ (unspec:SVE_FULL_HSDI
+ [(match_operand:SVE_FULL_HSDI 1 "register_operand")
+ (match_operand:SVE_FULL_HSDI 2 "register_operand")]
+ ADD_HIGHPART))]
"TARGET_SVE2"
{
- unsigned HOST_WIDE_INT size
- = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
- rtx elt = unwrap_const_vec_duplicate (operands[2]);
- if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
- FAIL;
+ rtx result = gen_reg_rtx (<MODE>mode);
+ int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
+ rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
+ shift_amount);
+ emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
+ emit_insn (gen_vlshr<mode>3 (operands[0], result, shift_vector));
+ DONE;
+})
- rtx addend = gen_reg_rtx (<MODE>mode);
+;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+ [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
+ (unspec:SVE_FULL_HSDI [
+ (match_operand:<VPRED> 1 "register_operand" "Upl")
+ (lshiftrt:SVE_FULL_HSDI
+ (plus:SVE_FULL_HSDI
+ (match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
+ (match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
+ (match_operand:SVE_FULL_HSDI 4 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+ ] UNSPEC_PRED_X))]
+ "TARGET_SVE2"
+ "#"
+ "&& !reload_completed"
+ [(const_int 0)]
+{
rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
- rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
- rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
- emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
- emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
- addend));
- emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
- lowpart_subreg (<MODE>mode, tmp1,
- <VNARROW>mode)));
+ emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[2], operands[3]));
emit_move_insn (operands[0],
- lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
+ lowpart_subreg (<MODE>mode, tmp1, <VNARROW>mode));
DONE;
})
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..8a04feb29e6bfb423a09dde2cd64853e69d0e1ba 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -24363,46 +24363,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
return ret;
}
-
-/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST. */
-
-bool
-aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
- tree vectype, wide_int cst,
- rtx *output, rtx in0, rtx in1)
-{
- if (code != TRUNC_DIV_EXPR
- || !TYPE_UNSIGNED (vectype))
- return false;
-
- machine_mode mode = TYPE_MODE (vectype);
- unsigned int flags = aarch64_classify_vector_mode (mode);
- if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
- return false;
-
- int pow = wi::exact_log2 (cst + 1);
- auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
- /* SVE actually has a div operator, we may have gotten here through
- that route. */
- if (pow != (int) (element_precision (vectype) / 2)
- || insn_code == CODE_FOR_nothing)
- return false;
-
- /* We can use the optimized pattern. */
- if (in0 == NULL_RTX && in1 == NULL_RTX)
- return true;
-
- gcc_assert (output);
-
- expand_operand ops[3];
- create_output_operand (&ops[0], *output, mode);
- create_input_operand (&ops[1], in0, mode);
- create_fixed_operand (&ops[2], in1);
- expand_insn (insn_code, 3, ops);
- *output = ops[0].value;
- return true;
-}
-
/* Generate a byte permute mask for a register of mode MODE,
which has NUNITS units. */
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 6cbc97cc82c06a68259bdf4dec8a0eab230081e5..ae627ae56cbd1e8b882e596dba974e74ef396e0e 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -750,6 +750,8 @@ (define_c_enum "unspec"
UNSPEC_REVH ; Used in aarch64-sve.md.
UNSPEC_REVW ; Used in aarch64-sve.md.
UNSPEC_REVBHW ; Used in aarch64-sve.md.
+ UNSPEC_SADD_HIGHPART ; Used in aarch64-sve.md.
+ UNSPEC_UADD_HIGHPART ; Used in aarch64-sve.md.
UNSPEC_SMUL_HIGHPART ; Used in aarch64-sve.md.
UNSPEC_UMUL_HIGHPART ; Used in aarch64-sve.md.
UNSPEC_FMLA ; Used in aarch64-sve.md.
@@ -2704,6 +2706,7 @@ (define_int_iterator UNPACK [UNSPEC_UNPACKSHI UNSPEC_UNPACKUHI
(define_int_iterator UNPACK_UNSIGNED [UNSPEC_UNPACKULO UNSPEC_UNPACKUHI])
+(define_int_iterator ADD_HIGHPART [UNSPEC_SADD_HIGHPART UNSPEC_UADD_HIGHPART])
(define_int_iterator MUL_HIGHPART [UNSPEC_SMUL_HIGHPART UNSPEC_UMUL_HIGHPART])
(define_int_iterator CLAST [UNSPEC_CLASTA UNSPEC_CLASTB])
@@ -3342,6 +3345,8 @@ (define_int_attr su [(UNSPEC_SADDV "s")
(UNSPEC_UNPACKUHI "u")
(UNSPEC_UNPACKSLO "s")
(UNSPEC_UNPACKULO "u")
+ (UNSPEC_SADD_HIGHPART "s")
+ (UNSPEC_UADD_HIGHPART "u")
(UNSPEC_SMUL_HIGHPART "s")
(UNSPEC_UMUL_HIGHPART "u")
(UNSPEC_COND_FCVTZS "s")
--
[-- Attachment #2: rb16910.patch --]
[-- Type: text/plain, Size: 11404 bytes --]
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,62 +4867,48 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
}
)
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;; short value of 16-bits
-;; ┌──────────────┬────────────────┐
-;; │ │ │
-;; └──────────────┴────────────────┘
-;; 8-bit part1 ▲ 8-bit part2 ▲
-;; │ │
-;; │ │
-;; +1 +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte. Remember that the addition must be done in
-;; double the precision of the input. Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte. The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
- [(match_operand:VQN 0 "register_operand")
- (match_operand:VQN 1 "register_operand")
- (match_operand:VQN 2 "immediate_operand")]
+;; Implement add_highpart as ADD + RSHIFT, we have various optimization for
+;; narrowing represented as shifts and so this representation will allow us to
+;; further optimize this should the result require narrowing. The alternative
+;; representation of ADDHN + UXTL is less efficient and harder to futher
+;; optimize.
+(define_expand "<su>add<mode>3_highpart"
+ [(set (match_operand:VQN 0 "register_operand")
+ (unspec:VQN [(match_operand:VQN 1 "register_operand")
+ (match_operand:VQN 2 "register_operand")]
+ ADD_HIGHPART))]
+ "TARGET_SIMD"
+{
+ rtx result = gen_reg_rtx (<MODE>mode);
+ int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
+ rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
+ shift_amount);
+ emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
+ emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], result, shift_vector));
+ DONE;
+})
+
+;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+ [(set (match_operand:VQN 0 "register_operand" "=w")
+ (plus:VQN
+ (lshiftrt:VQN
+ (plus:VQN (match_operand:VQN 1 "register_operand" "w")
+ (match_operand:VQN 2 "register_operand" "w"))
+ (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+ (match_operand:VQN 4 "register_operand" "w")))]
"TARGET_SIMD"
+ "#"
+ "&& !reload_completed"
+ [(const_int 0)]
{
- unsigned HOST_WIDE_INT size
- = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
- rtx elt = unwrap_const_vec_duplicate (operands[2]);
- if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
- FAIL;
-
- rtx addend = gen_reg_rtx (<MODE>mode);
- rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
- emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
- rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
- rtx tmp2 = gen_reg_rtx (<MODE>mode);
- emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
- unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
- rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
- emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
- emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
+ rtx tmp = gen_reg_rtx (<VNARROWQ>mode);
+ emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
+ emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
DONE;
-})
+}
+ [(set_attr "type" "neon_add_halve<q>")]
+)
;; pmul.
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 40c0728a7e6f00c395c360ce7625bc2e4a018809..ad01c1ddf9257cec951ed0c16558a3c4d856813b 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -2317,39 +2317,51 @@ (define_insn "@aarch64_sve_<optab><mode>"
;; ---- [INT] Misc optab implementations
;; -------------------------------------------------------------------------
;; Includes:
-;; - aarch64_bitmask_udiv
+;; - add_highpart
;; -------------------------------------------------------------------------
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; See aarch64-simd.md for bigger explanation.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
- [(match_operand:SVE_FULL_HSDI 0 "register_operand")
- (match_operand:SVE_FULL_HSDI 1 "register_operand")
- (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
+;; Implement add_highpart as ADD + RSHIFT, we have various optimization for
+;; narrowing represented as shifts and so this representation will allow us to
+;; further optimize this should the result require narrowing. The alternative
+;; representation of ADDHN + UXTL is less efficient and harder to futher
+;; optimize.
+(define_expand "<su>add<mode>3_highpart"
+ [(set (match_operand:SVE_FULL_HSDI 0 "register_operand")
+ (unspec:SVE_FULL_HSDI
+ [(match_operand:SVE_FULL_HSDI 1 "register_operand")
+ (match_operand:SVE_FULL_HSDI 2 "register_operand")]
+ ADD_HIGHPART))]
"TARGET_SVE2"
{
- unsigned HOST_WIDE_INT size
- = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
- rtx elt = unwrap_const_vec_duplicate (operands[2]);
- if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
- FAIL;
+ rtx result = gen_reg_rtx (<MODE>mode);
+ int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
+ rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
+ shift_amount);
+ emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
+ emit_insn (gen_vlshr<mode>3 (operands[0], result, shift_vector));
+ DONE;
+})
- rtx addend = gen_reg_rtx (<MODE>mode);
+;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+ [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
+ (unspec:SVE_FULL_HSDI [
+ (match_operand:<VPRED> 1 "register_operand" "Upl")
+ (lshiftrt:SVE_FULL_HSDI
+ (plus:SVE_FULL_HSDI
+ (match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
+ (match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
+ (match_operand:SVE_FULL_HSDI 4 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+ ] UNSPEC_PRED_X))]
+ "TARGET_SVE2"
+ "#"
+ "&& !reload_completed"
+ [(const_int 0)]
+{
rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
- rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
- rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
- emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
- emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
- addend));
- emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
- lowpart_subreg (<MODE>mode, tmp1,
- <VNARROW>mode)));
+ emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[2], operands[3]));
emit_move_insn (operands[0],
- lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
+ lowpart_subreg (<MODE>mode, tmp1, <VNARROW>mode));
DONE;
})
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..8a04feb29e6bfb423a09dde2cd64853e69d0e1ba 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -24363,46 +24363,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
return ret;
}
-
-/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST. */
-
-bool
-aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
- tree vectype, wide_int cst,
- rtx *output, rtx in0, rtx in1)
-{
- if (code != TRUNC_DIV_EXPR
- || !TYPE_UNSIGNED (vectype))
- return false;
-
- machine_mode mode = TYPE_MODE (vectype);
- unsigned int flags = aarch64_classify_vector_mode (mode);
- if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
- return false;
-
- int pow = wi::exact_log2 (cst + 1);
- auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
- /* SVE actually has a div operator, we may have gotten here through
- that route. */
- if (pow != (int) (element_precision (vectype) / 2)
- || insn_code == CODE_FOR_nothing)
- return false;
-
- /* We can use the optimized pattern. */
- if (in0 == NULL_RTX && in1 == NULL_RTX)
- return true;
-
- gcc_assert (output);
-
- expand_operand ops[3];
- create_output_operand (&ops[0], *output, mode);
- create_input_operand (&ops[1], in0, mode);
- create_fixed_operand (&ops[2], in1);
- expand_insn (insn_code, 3, ops);
- *output = ops[0].value;
- return true;
-}
-
/* Generate a byte permute mask for a register of mode MODE,
which has NUNITS units. */
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 6cbc97cc82c06a68259bdf4dec8a0eab230081e5..ae627ae56cbd1e8b882e596dba974e74ef396e0e 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -750,6 +750,8 @@ (define_c_enum "unspec"
UNSPEC_REVH ; Used in aarch64-sve.md.
UNSPEC_REVW ; Used in aarch64-sve.md.
UNSPEC_REVBHW ; Used in aarch64-sve.md.
+ UNSPEC_SADD_HIGHPART ; Used in aarch64-sve.md.
+ UNSPEC_UADD_HIGHPART ; Used in aarch64-sve.md.
UNSPEC_SMUL_HIGHPART ; Used in aarch64-sve.md.
UNSPEC_UMUL_HIGHPART ; Used in aarch64-sve.md.
UNSPEC_FMLA ; Used in aarch64-sve.md.
@@ -2704,6 +2706,7 @@ (define_int_iterator UNPACK [UNSPEC_UNPACKSHI UNSPEC_UNPACKUHI
(define_int_iterator UNPACK_UNSIGNED [UNSPEC_UNPACKULO UNSPEC_UNPACKUHI])
+(define_int_iterator ADD_HIGHPART [UNSPEC_SADD_HIGHPART UNSPEC_UADD_HIGHPART])
(define_int_iterator MUL_HIGHPART [UNSPEC_SMUL_HIGHPART UNSPEC_UMUL_HIGHPART])
(define_int_iterator CLAST [UNSPEC_CLASTA UNSPEC_CLASTB])
@@ -3342,6 +3345,8 @@ (define_int_attr su [(UNSPEC_SADDV "s")
(UNSPEC_UNPACKUHI "u")
(UNSPEC_UNPACKSLO "s")
(UNSPEC_UNPACKULO "u")
+ (UNSPEC_SADD_HIGHPART "s")
+ (UNSPEC_UADD_HIGHPART "u")
(UNSPEC_SMUL_HIGHPART "s")
(UNSPEC_UMUL_HIGHPART "u")
(UNSPEC_COND_FCVTZS "s")
next prev parent reply other threads:[~2023-02-09 17:22 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-09 17:16 [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
2023-02-09 17:22 ` Tamar Christina [this message]
2023-02-10 10:35 ` [PATCH 2/2]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583] Tamar Christina
2023-02-10 14:10 ` Richard Sandiford
2023-02-10 10:34 ` [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583] Tamar Christina
2023-02-10 13:13 ` Richard Biener
2023-02-10 13:36 ` Richard Sandiford
2023-02-10 13:52 ` Richard Biener
2023-02-10 14:13 ` Tamar Christina
2023-02-10 14:30 ` Richard Sandiford
2023-02-10 14:54 ` Tamar Christina
2023-02-27 11:09 ` Tamar Christina
2023-02-27 12:11 ` Richard Sandiford
2023-02-27 12:14 ` Tamar Christina
2023-02-27 21:33 ` Richard Sandiford
2023-02-27 22:10 ` Tamar Christina
2023-02-28 11:08 ` Richard Sandiford
2023-02-28 11:12 ` Tamar Christina
2023-02-28 12:03 ` Richard Sandiford
2023-03-01 11:30 ` Richard Biener
2023-02-10 15:56 ` Richard Sandiford
2023-02-10 16:09 ` Tamar Christina
2023-02-10 16:25 ` Richard Sandiford
2023-02-10 16:33 ` Tamar Christina
2023-02-10 16:57 ` Richard Sandiford
2023-02-10 17:01 ` Richard Sandiford
2023-02-10 17:14 ` Tamar Christina
2023-02-10 18:12 ` Richard Sandiford
2023-02-10 18:34 ` Richard Biener
2023-02-10 20:58 ` Andrew MacLeod
2023-02-13 9:54 ` Tamar Christina
2023-02-15 12:51 ` Tamar Christina
2023-02-15 16:05 ` Andrew MacLeod
2023-02-15 17:13 ` Tamar Christina
2023-02-15 17:50 ` Andrew MacLeod
2023-02-15 18:42 ` Andrew MacLeod
2023-02-22 12:51 ` Tamar Christina
2023-02-22 16:41 ` Andrew MacLeod
2023-02-22 18:03 ` Tamar Christina
2023-02-22 18:33 ` Andrew MacLeod
2023-02-23 8:36 ` Tamar Christina
2023-02-23 16:39 ` Andrew MacLeod
2023-02-23 16:56 ` Tamar Christina
2023-03-01 16:57 ` Andrew Carlotti
2023-03-01 18:16 ` Tamar Christina
2023-02-22 13:06 ` Tamar Christina
2023-02-22 15:19 ` Andrew MacLeod
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y+UrwQSz3hWz+Puo@arm.com \
--to=tamar.christina@arm.com \
--cc=Kyrylo.Tkachov@arm.com \
--cc=Marcus.Shawcroft@arm.com \
--cc=Richard.Earnshaw@arm.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=nd@arm.com \
--cc=richard.sandiford@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).