From: Delia Burduv <delia.burduv@arm.com>
To: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>,
"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Cc: "nickc@redhat.com" <nickc@redhat.com>,
Richard Earnshaw <Richard.Earnshaw@arm.com>,
Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>
Subject: Re: ACLE intrinsics: BFloat16 load intrinsics for AArch32
Date: Thu, 05 Mar 2020 16:39:00 -0000 [thread overview]
Message-ID: <03e394d8-9d16-ce0f-e478-e708b35bc3e1@arm.com> (raw)
In-Reply-To: <4b4edee7-e9e8-d12b-8f88-6c6be52e02a6@foss.arm.com>
[-- Attachment #1: Type: text/plain, Size: 8066 bytes --]
Hi,
This is the latest version of the patch. I am forcing -mfloat-abi=hard
because the code generated is slightly differently depending on the
float-abi used.
Thanks,
Delia
On 3/4/20 5:20 PM, Kyrill Tkachov wrote:
> Hi Delia,
>
> On 3/4/20 2:05 PM, Delia Burduv wrote:
>> Hi,
>>
>> The previous version of this patch shared part of its code with the
>> store intrinsics patch
>> (https://gcc.gnu.org/ml/gcc-patches/2020-03/msg00145.html) so I removed
>> any duplicated code. This patch now depends on the previously mentioned
>> store intrinsics patch.
>>
>> Here is the latest version and the updated ChangeLog.
>>
>> gcc/ChangeLog:
>>
>> 2019-03-04 Delia Burduv <delia.burduv@arm.com>
>>
>> * config/arm/arm_neon.h (bfloat16_t): New typedef.
>> (vld2_bf16): New.
>> (vld2q_bf16): New.
>> (vld3_bf16): New.
>> (vld3q_bf16): New.
>> (vld4_bf16): New.
>> (vld4q_bf16): New.
>> (vld2_dup_bf16): New.
>> (vld2q_dup_bf16): New.
>> (vld3_dup_bf16): New.
>> (vld3q_dup_bf16): New.
>> (vld4_dup_bf16): New.
>> (vld4q_dup_bf16): New.
>> * config/arm/arm_neon_builtins.def
>> (vld2): Changed to VAR13 and added v4bf, v8bf
>> (vld2_dup): Changed to VAR8 and added v4bf, v8bf
>> (vld3): Changed to VAR13 and added v4bf, v8bf
>> (vld3_dup): Changed to VAR8 and added v4bf, v8bf
>> (vld4): Changed to VAR13 and added v4bf, v8bf
>> (vld4_dup): Changed to VAR8 and added v4bf, v8bf
>> * config/arm/iterators.md (VDXBF): New iterator.
>> (VQ2BF): New iterator.
>> *config/arm/neon.md (vld2): Used new iterators.
>> (vld2_dup<mode>): Used new iterators.
>> (vld2_dupv8bf): New.
>> (vst3): Used new iterators.
>> (vst3qa): Used new iterators.
>> (vst3qb): Used new iterators.
>> (vld3_dup<mode>): Used new iterators.
>> (vld3_dupv8bf): New.
>> (vst4): Used new iterators.
>> (vst4qa): Used new iterators.
>> (vst4qb): Used new iterators.
>> (vld4_dup<mode>): Used new iterators.
>> (vld4_dupv8bf): New.
>>
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2019-03-04 Delia Burduv <delia.burduv@arm.com>
>>
>> * gcc.target/arm/simd/bf16_vldn_1.c: New test.
>>
>> Thanks,
>> Delia
>>
>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>> >
>> > Hi,
>> >
>> > Here is the latest version of the patch. It just has some minor
>> > formatting changes that were brought up by Richard Sandiford in the
>> > AArch64 patches
>> >
>> > Thanks,
>> > Delia
>> >
>> > On 1/22/20 5:31 PM, Delia Burduv wrote:
>> >> Ping.
>> >>
>> >> I will change the tests to use the exact input and output registers as
>> >> Richard Sandiford suggested for the AArch64 patches.
>> >>
>> >> On 12/20/19 6:48 PM, Delia Burduv wrote:
>> >>> This patch adds the ARMv8.6 ACLE BFloat16 load intrinsics
>> >>> vld<n>{q}_bf16 as part of the BFloat16 extension.
>> >>>
>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics)
>>
>> >>>
>> >>> The intrinsics are declared in arm_neon.h .
>> >>> A new test is added to check assembler output.
>> >>>
>> >>> This patch depends on the Arm back-end patche.
>> >>> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>> >>>
>> >>> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't
>> >>> have commit rights, so if this is ok can someone please commit it for
>> >>> me?
>> >>>
>> >>> gcc/ChangeLog:
>> >>>
>> >>> 2019-11-14 Delia Burduv <delia.burduv@arm.com>
>> >>>
>> >>> * config/arm/arm_neon.h (bfloat16_t): New typedef.
>> >>> (bfloat16x4x2_t): New typedef.
>> >>> (bfloat16x8x2_t): New typedef.
>> >>> (bfloat16x4x3_t): New typedef.
>> >>> (bfloat16x8x3_t): New typedef.
>> >>> (bfloat16x4x4_t): New typedef.
>> >>> (bfloat16x8x4_t): New typedef.
>> >>> (vld2_bf16): New.
>> >>> (vld2q_bf16): New.
>> >>> (vld3_bf16): New.
>> >>> (vld3q_bf16): New.
>> >>> (vld4_bf16): New.
>> >>> (vld4q_bf16): New.
>> >>> (vld2_dup_bf16): New.
>> >>> (vld2q_dup_bf16): New.
>> >>> (vld3_dup_bf16): New.
>> >>> (vld3q_dup_bf16): New.
>> >>> (vld4_dup_bf16): New.
>> >>> (vld4q_dup_bf16): New.
>> >>> * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>> >>> (VAR13): New.
>> >>> (arm_simd_types[Bfloat16x2_t]):New type.
>> >>> * config/arm/arm-modes.def (V2BF): New mode.
>> >>> * config/arm/arm-simd-builtin-types.def
>> >>> (Bfloat16x2_t): New entry.
>> >>> * config/arm/arm_neon_builtins.def
>> >>> (vld2): Changed to VAR13 and added v4bf, v8bf
>> >>> (vld2_dup): Changed to VAR8 and added v4bf, v8bf
>> >>> (vld3): Changed to VAR13 and added v4bf, v8bf
>> >>> (vld3_dup): Changed to VAR8 and added v4bf, v8bf
>> >>> (vld4): Changed to VAR13 and added v4bf, v8bf
>> >>> (vld4_dup): Changed to VAR8 and added v4bf, v8bf
>> >>> * config/arm/iterators.md (VDXBF): New iterator.
>> >>> (VQ2BF): New iterator.
>> >>> (V_elem): Added V4BF, V8BF.
>> >>> (V_sz_elem): Added V4BF, V8BF.
>> >>> (V_mode_nunits): Added V4BF, V8BF.
>> >>> (q): Added V4BF, V8BF.
>> >>> *config/arm/neon.md (vld2): Used new iterators.
>> >>> (vld2_dup<mode>): Used new iterators.
>> >>> (vld2_dupv8bf): New.
>> >>> (vst3): Used new iterators.
>> >>> (vst3qa): Used new iterators.
>> >>> (vst3qb): Used new iterators.
>> >>> (vld3_dup<mode>): Used new iterators.
>> >>> (vld3_dupv8bf): New.
>> >>> (vst4): Used new iterators.
>> >>> (vst4qa): Used new iterators.
>> >>> (vst4qb): Used new iterators.
>> >>> (vld4_dup<mode>): Used new iterators.
>> >>> (vld4_dupv8bf): New.
>> >>>
>> >>>
>> >>> gcc/testsuite/ChangeLog:
>> >>>
>> >>> 2019-11-14 Delia Burduv <delia.burduv@arm.com>
>> >>>
>> >>> * gcc.target/arm/simd/bf16_vldn_1.c: New test.
>
>
> diff --git a/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c
> b/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..7ff8b600827e5c2e313ce40d14382aa641b4bb31
>
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c
> @@ -0,0 +1,152 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-save-temps" } */
> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
> +/* { dg-add-options arm_v8_2a_bf16_neon } */
> +/* { dg-final { check-function-bodies "**" "" } } */
>
>
> I think this should include an optimisation option like -O2 because...
>
> +
> +#include "arm_neon.h"
> +
> +
> +/*
> +**test_vld2_bf16:
> +** ...
> +** vld2.16 {d16-d17}, \[r3\]
>
> ... this is unstable codegen depending on the -O0 register allocator
> moving the ptr argument to r3 from its initial r0.
> This should really be r0 and the load instruction should load the low D
> regs.
> So let's add an -O2 to the dg-options and scan for the result of that.
>
>
> Otherwise this is ok.
> Thanks!
> Kyrill
>
>
> +** ...
> +*/
> +bfloat16x4x2_t
> +test_vld2_bf16 (bfloat16_t * ptr)
> +{
> + vld2_bf16 (ptr);
> +}
> +
>
[-- Attachment #2: rb12473.patch --]
[-- Type: text/x-patch, Size: 18332 bytes --]
diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index 425a2a49b69d7e3070059dd0a79ae3d306400f4b..2573cca6bb64f5104a1efd1379ef956f56d0fe04 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -19504,6 +19504,114 @@ vst4q_bf16 (bfloat16_t * __ptr, bfloat16x8x4_t __val)
return __builtin_neon_vst4v8bf (__ptr, __bu.__o);
}
+__extension__ extern __inline bfloat16x4x2_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld2_bf16 (bfloat16_t const * __ptr)
+{
+ union { bfloat16x4x2_t __i; __builtin_neon_ti __o; } __rv;
+ __rv.__o = __builtin_neon_vld2v4bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x8x2_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld2q_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x8x2_t __i; __builtin_neon_oi __o; } __rv;
+ __rv.__o = __builtin_neon_vld2v8bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x4x3_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld3_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x4x3_t __i; __builtin_neon_ei __o; } __rv;
+ __rv.__o = __builtin_neon_vld3v4bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x8x3_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld3q_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x8x3_t __i; __builtin_neon_ci __o; } __rv;
+ __rv.__o = __builtin_neon_vld3v8bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x4x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld4_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x4x4_t __i; __builtin_neon_oi __o; } __rv;
+ __rv.__o = __builtin_neon_vld4v4bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x8x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld4q_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x8x4_t __i; __builtin_neon_xi __o; } __rv;
+ __rv.__o = __builtin_neon_vld4v8bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x4x2_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld2_dup_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x4x2_t __i; __builtin_neon_ti __o; } __rv;
+ __rv.__o = __builtin_neon_vld2_dupv4bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x8x2_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld2q_dup_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x8x2_t __i; __builtin_neon_oi __o; } __rv;
+ __rv.__o = __builtin_neon_vld2_dupv8bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x4x3_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld3_dup_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x4x3_t __i; __builtin_neon_ei __o; } __rv;
+ __rv.__o = __builtin_neon_vld3_dupv4bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x8x3_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld3q_dup_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x8x3_t __i; __builtin_neon_ci __o; } __rv;
+ __rv.__o = __builtin_neon_vld3_dupv8bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x4x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld4_dup_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x4x4_t __i; __builtin_neon_oi __o; } __rv;
+ __rv.__o = __builtin_neon_vld4_dupv4bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
+__extension__ extern __inline bfloat16x8x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vld4q_dup_bf16 (const bfloat16_t * __ptr)
+{
+ union { bfloat16x8x4_t __i; __builtin_neon_xi __o; } __rv;
+ __rv.__o = __builtin_neon_vld4_dupv8bf ((const __builtin_neon_hi *) __ptr);
+ return __rv.__i;
+}
+
#pragma GCC pop_options
#ifdef __cplusplus
diff --git a/gcc/config/arm/arm_neon_builtins.def b/gcc/config/arm/arm_neon_builtins.def
index d85a2d4b1fcf9e851f215dfdd4b305e59ded651c..e3c1652b9e92ff5024225279f26c1ccb197dcd69 100644
--- a/gcc/config/arm/arm_neon_builtins.def
+++ b/gcc/config/arm/arm_neon_builtins.def
@@ -320,29 +320,29 @@ VAR12 (STORE1, vst1,
v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v2di)
VAR12 (STORE1LANE, vst1_lane,
v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v2di)
-VAR11 (LOAD1, vld2,
- v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf)
+VAR13 (LOAD1, vld2,
+ v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v4bf, v8bf)
VAR9 (LOAD1LANE, vld2_lane,
v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
-VAR6 (LOAD1, vld2_dup, v8qi, v4hi, v4hf, v2si, v2sf, di)
+VAR8 (LOAD1, vld2_dup, v8qi, v4hi, v4hf, v2si, v2sf, di, v4bf, v8bf)
VAR13 (STORE1, vst2,
v8qi, v4hi, v4hf, v4bf, v2si, v2sf, di, v16qi, v8hi, v8hf, v8bf, v4si, v4sf)
VAR9 (STORE1LANE, vst2_lane,
v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
-VAR11 (LOAD1, vld3,
- v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf)
+VAR13 (LOAD1, vld3,
+ v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v4bf, v8bf)
VAR9 (LOAD1LANE, vld3_lane,
v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
-VAR6 (LOAD1, vld3_dup, v8qi, v4hi, v4hf, v2si, v2sf, di)
+VAR8 (LOAD1, vld3_dup, v8qi, v4hi, v4hf, v2si, v2sf, di, v4bf, v8bf)
VAR13 (STORE1, vst3,
v8qi, v4hi, v4hf, v4bf, v2si, v2sf, di, v16qi, v8hi, v8hf, v8bf, v4si, v4sf)
VAR9 (STORE1LANE, vst3_lane,
v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
-VAR11 (LOAD1, vld4,
- v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf)
+VAR13 (LOAD1, vld4,
+ v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v4bf, v8bf)
VAR9 (LOAD1LANE, vld4_lane,
v8qi, v4hi, v4hf, v2si, v2sf, v8hi, v8hf, v4si, v4sf)
-VAR6 (LOAD1, vld4_dup, v8qi, v4hi, v4hf, v2si, v2sf, di)
+VAR8 (LOAD1, vld4_dup, v8qi, v4hi, v4hf, v2si, v2sf, di, v4bf, v8bf)
VAR13 (STORE1, vst4,
v8qi, v4hi, v4hf, v4bf, v2si, v2sf, di, v16qi, v8hi, v8hf, v8bf, v4si, v4sf)
VAR9 (STORE1LANE, vst4_lane,
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index 0c03e747c3643e018f4f62dda5e832dfb1af758f..7401f16ef59b9854bbc85f98cfdcdd7a8a600337 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -87,6 +87,9 @@
;; Double-width vector modes plus 64-bit elements, including V4BF.
(define_mode_iterator VDXBF [V8QI V4HI V4HF (V4BF "TARGET_BF16_SIMD") V2SI V2SF DI])
+;; Double-width vector modes plus 64-bit elements, V4BF and V8BF.
+(define_mode_iterator VDXBF2 [V8QI V4HI V4HF V2SI V2SF DI (V4BF "TARGET_BF16_SIMD") (V8BF ("TARGET_BF16_SIMD"))])
+
;; Double-width vector modes plus 64-bit elements,
;; with V4BFmode added, suitable for moves.
(define_mode_iterator VDXMOV [V8QI V4HI V4HF V4BF V2SI V2SF DI])
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index fcf59aee32a955b6bb3e7b98a4d880a0e631b4be..5117f78dd2dce442bc738de6082686421fcdca52 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5428,7 +5428,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld2<mode>"
[(set (match_operand:TI 0 "s_register_operand" "=w")
(unspec:TI [(match_operand:TI 1 "neon_struct_operand" "Um")
- (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD2))]
"TARGET_NEON"
{
@@ -5453,7 +5453,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld2<mode>"
[(set (match_operand:OI 0 "s_register_operand" "=w")
(unspec:OI [(match_operand:OI 1 "neon_struct_operand" "Um")
- (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD2))]
"TARGET_NEON"
"vld2.<V_sz_elem>\t%h0, %A1"
@@ -5516,7 +5516,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld2_dup<mode>"
[(set (match_operand:TI 0 "s_register_operand" "=w")
(unspec:TI [(match_operand:<V_two_elem> 1 "neon_struct_operand" "Um")
- (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD2_DUP))]
"TARGET_NEON"
{
@@ -5531,6 +5531,27 @@ if (BYTES_BIG_ENDIAN)
(const_string "neon_load1_1reg<q>")))]
)
+(define_insn "neon_vld2_dupv8bf"
+ [(set (match_operand:OI 0 "s_register_operand" "=w")
+ (unspec:OI [(match_operand:V2BF 1 "neon_struct_operand" "Um")
+ (unspec:V8BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ UNSPEC_VLD2_DUP))]
+ "TARGET_BF16_SIMD"
+ {
+ rtx ops[5];
+ int tabbase = REGNO (operands[0]);
+
+ ops[4] = operands[1];
+ ops[0] = gen_rtx_REG (V4BFmode, tabbase);
+ ops[1] = gen_rtx_REG (V4BFmode, tabbase + 2);
+ ops[2] = gen_rtx_REG (V4BFmode, tabbase + 4);
+ ops[3] = gen_rtx_REG (V4BFmode, tabbase + 6);
+ output_asm_insn ("vld2.16\t{%P0, %P1, %P2, %P3}, %A4", ops);
+ return "";
+ }
+ [(set_attr "type" "neon_load2_all_lanes_q")]
+)
+
(define_expand "vec_store_lanesti<mode>"
[(set (match_operand:TI 0 "neon_struct_operand")
(unspec:TI [(match_operand:TI 1 "s_register_operand")
@@ -5637,7 +5658,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld3<mode>"
[(set (match_operand:EI 0 "s_register_operand" "=w")
(unspec:EI [(match_operand:EI 1 "neon_struct_operand" "Um")
- (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD3))]
"TARGET_NEON"
{
@@ -5665,7 +5686,7 @@ if (BYTES_BIG_ENDIAN)
(define_expand "neon_vld3<mode>"
[(match_operand:CI 0 "s_register_operand")
(match_operand:CI 1 "neon_struct_operand")
- (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
"TARGET_NEON"
{
rtx mem;
@@ -5680,7 +5701,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld3qa<mode>"
[(set (match_operand:CI 0 "s_register_operand" "=w")
(unspec:CI [(match_operand:EI 1 "neon_struct_operand" "Um")
- (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD3A))]
"TARGET_NEON"
{
@@ -5700,7 +5721,7 @@ if (BYTES_BIG_ENDIAN)
[(set (match_operand:CI 0 "s_register_operand" "=w")
(unspec:CI [(match_operand:EI 1 "neon_struct_operand" "Um")
(match_operand:CI 2 "s_register_operand" "0")
- (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD3B))]
"TARGET_NEON"
{
@@ -5777,7 +5798,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld3_dup<mode>"
[(set (match_operand:EI 0 "s_register_operand" "=w")
(unspec:EI [(match_operand:<V_three_elem> 1 "neon_struct_operand" "Um")
- (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD3_DUP))]
"TARGET_NEON"
{
@@ -5800,6 +5821,26 @@ if (BYTES_BIG_ENDIAN)
(const_string "neon_load3_all_lanes<q>")
(const_string "neon_load1_1reg<q>")))])
+(define_insn "neon_vld3_dupv8bf"
+ [(set (match_operand:CI 0 "s_register_operand" "=w")
+ (unspec:CI [(match_operand:V2BF 1 "neon_struct_operand" "Um")
+ (unspec:V8BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ UNSPEC_VLD2_DUP))]
+ "TARGET_BF16_SIMD"
+ {
+ rtx ops[4];
+ int tabbase = REGNO (operands[0]);
+
+ ops[3] = operands[1];
+ ops[0] = gen_rtx_REG (V4BFmode, tabbase);
+ ops[1] = gen_rtx_REG (V4BFmode, tabbase + 2);
+ ops[2] = gen_rtx_REG (V4BFmode, tabbase + 4);
+ output_asm_insn ("vld3.16\t{%P0[], %P1[], %P2[]}, %A3", ops);
+ return "";
+ }
+ [(set_attr "type" "neon_load3_all_lanes_q")]
+)
+
(define_expand "vec_store_lanesei<mode>"
[(set (match_operand:EI 0 "neon_struct_operand")
(unspec:EI [(match_operand:EI 1 "s_register_operand")
@@ -5955,7 +5996,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld4<mode>"
[(set (match_operand:OI 0 "s_register_operand" "=w")
(unspec:OI [(match_operand:OI 1 "neon_struct_operand" "Um")
- (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD4))]
"TARGET_NEON"
{
@@ -5983,7 +6024,7 @@ if (BYTES_BIG_ENDIAN)
(define_expand "neon_vld4<mode>"
[(match_operand:XI 0 "s_register_operand")
(match_operand:XI 1 "neon_struct_operand")
- (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
"TARGET_NEON"
{
rtx mem;
@@ -5998,7 +6039,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld4qa<mode>"
[(set (match_operand:XI 0 "s_register_operand" "=w")
(unspec:XI [(match_operand:OI 1 "neon_struct_operand" "Um")
- (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD4A))]
"TARGET_NEON"
{
@@ -6019,7 +6060,7 @@ if (BYTES_BIG_ENDIAN)
[(set (match_operand:XI 0 "s_register_operand" "=w")
(unspec:XI [(match_operand:OI 1 "neon_struct_operand" "Um")
(match_operand:XI 2 "s_register_operand" "0")
- (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VQ2BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD4B))]
"TARGET_NEON"
{
@@ -6099,7 +6140,7 @@ if (BYTES_BIG_ENDIAN)
(define_insn "neon_vld4_dup<mode>"
[(set (match_operand:OI 0 "s_register_operand" "=w")
(unspec:OI [(match_operand:<V_four_elem> 1 "neon_struct_operand" "Um")
- (unspec:VDX [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ (unspec:VDXBF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
UNSPEC_VLD4_DUP))]
"TARGET_NEON"
{
@@ -6125,6 +6166,27 @@ if (BYTES_BIG_ENDIAN)
(const_string "neon_load1_1reg<q>")))]
)
+(define_insn "neon_vld4_dupv8bf"
+ [(set (match_operand:XI 0 "s_register_operand" "=w")
+ (unspec:XI [(match_operand:V2BF 1 "neon_struct_operand" "Um")
+ (unspec:V8BF [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
+ UNSPEC_VLD2_DUP))]
+ "TARGET_BF16_SIMD"
+ {
+ rtx ops[5];
+ int tabbase = REGNO (operands[0]);
+
+ ops[4] = operands[1];
+ ops[0] = gen_rtx_REG (V4BFmode, tabbase);
+ ops[1] = gen_rtx_REG (V4BFmode, tabbase + 2);
+ ops[2] = gen_rtx_REG (V4BFmode, tabbase + 4);
+ ops[3] = gen_rtx_REG (V4BFmode, tabbase + 6);
+ output_asm_insn ("vld4.16\t{%P0[], %P1[], %P2[], %P3[]}, %A4", ops);
+ return "";
+ }
+ [(set_attr "type" "neon_load4_all_lanes_q")]
+)
+
(define_expand "vec_store_lanesoi<mode>"
[(set (match_operand:OI 0 "neon_struct_operand")
(unspec:OI [(match_operand:OI 1 "s_register_operand")
diff --git a/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c b/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..222e7af945383bd93b6b280b516a56e684f1d651
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c
@@ -0,0 +1,152 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon } */
+/* { dg-additional-options "-save-temps -O2 -mfloat-abi=hard" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "arm_neon.h"
+
+
+/*
+**test_vld2_bf16:
+** ...
+** vld2.16 {d0-d1}, \[r0\]
+** bx lr
+*/
+bfloat16x4x2_t
+test_vld2_bf16 (bfloat16_t * ptr)
+{
+ return vld2_bf16 (ptr);
+}
+
+/*
+**test_vld2q_bf16:
+** ...
+** vld2.16 {d0-d3}, \[r0\]
+** bx lr
+*/
+bfloat16x8x2_t
+test_vld2q_bf16 (bfloat16_t * ptr)
+{
+ return vld2q_bf16 (ptr);
+}
+
+/*
+**test_vld2_dup_bf16:
+** ...
+** vld2.16 {d0\[\], d1\[\]}, \[r0\]
+** bx lr
+*/
+bfloat16x4x2_t
+test_vld2_dup_bf16 (bfloat16_t * ptr)
+{
+ return vld2_dup_bf16 (ptr);
+}
+
+/*
+**test_vld2q_dup_bf16:
+** ...
+** vld2.16 {d0, d1, d2, d3}, \[r0\]
+** bx lr
+*/
+bfloat16x8x2_t
+test_vld2q_dup_bf16 (bfloat16_t * ptr)
+{
+ return vld2q_dup_bf16 (ptr);
+}
+
+/*
+**test_vld3_bf16:
+** ...
+** vld3.16 {d0-d2}, \[r0\]
+** bx lr
+*/
+bfloat16x4x3_t
+test_vld3_bf16 (bfloat16_t * ptr)
+{
+ return vld3_bf16 (ptr);
+}
+
+/*
+**test_vld3q_bf16:
+** ...
+** vld3.16 {d1, d3, d5}, \[r0\]
+** bx lr
+*/
+bfloat16x8x3_t
+test_vld3q_bf16 (bfloat16_t * ptr)
+{
+ return vld3q_bf16 (ptr);
+}
+
+/*
+**test_vld3_dup_bf16:
+** ...
+** vld3.16 {d0\[\], d1\[\], d2\[\]}, \[r0\]
+** bx lr
+*/
+bfloat16x4x3_t
+test_vld3_dup_bf16 (bfloat16_t * ptr)
+{
+ return vld3_dup_bf16 (ptr);
+}
+
+/*
+**test_vld3q_dup_bf16:
+** ...
+** vld3.16 {d0\[\], d1\[\], d2\[\]}, \[r0\]
+** bx lr
+*/
+bfloat16x8x3_t
+test_vld3q_dup_bf16 (bfloat16_t * ptr)
+{
+ return vld3q_dup_bf16 (ptr);
+}
+
+/*
+**test_vld4_bf16:
+** ...
+** vld4.16 {d0-d3}, \[r0\]
+** bx lr
+*/
+bfloat16x4x4_t
+test_vld4_bf16 (bfloat16_t * ptr)
+{
+ return vld4_bf16 (ptr);
+}
+
+/*
+**test_vld4q_bf16:
+** ...
+** vld4.16 {d1, d3, d5, d7}, \[r0\]
+** bx lr
+*/
+bfloat16x8x4_t
+test_vld4q_bf16 (bfloat16_t * ptr)
+{
+ return vld4q_bf16 (ptr);
+}
+
+/*
+**test_vld4_dup_bf16:
+** ...
+** vld4.16 {d0\[\], d1\[\], d2\[\], d3\[\]}, \[r0\]
+** bx lr
+*/
+bfloat16x4x4_t
+test_vld4_dup_bf16 (bfloat16_t * ptr)
+{
+ return vld4_dup_bf16 (ptr);
+}
+
+/*
+**test_vld4q_dup_bf16:
+** ...
+** vld4.16 {d0\[\], d1\[\], d2\[\], d3\[\]}, \[r0\]
+** bx lr
+*/
+bfloat16x8x4_t
+test_vld4q_dup_bf16 (bfloat16_t * ptr)
+{
+ return vld4q_dup_bf16 (ptr);
+}
next prev parent reply other threads:[~2020-03-05 16:39 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-12-20 19:04 Delia Burduv
2020-01-22 18:20 ` Delia Burduv
2020-01-28 17:18 ` Delia Burduv
2020-02-19 17:26 ` Delia Burduv
2020-03-04 14:05 ` Delia Burduv
2020-03-04 17:21 ` Kyrill Tkachov
2020-03-05 16:39 ` Delia Burduv [this message]
2020-03-06 10:45 ` Kyrill Tkachov
2020-03-09 10:18 ` Christophe Lyon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=03e394d8-9d16-ce0f-e478-e708b35bc3e1@arm.com \
--to=delia.burduv@arm.com \
--cc=Ramana.Radhakrishnan@arm.com \
--cc=Richard.Earnshaw@arm.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=kyrylo.tkachov@foss.arm.com \
--cc=nickc@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).