Hi,

This is the latest version of the patch. I am forcing -mfloat-abi=hard
because the generated code differs slightly depending on the float-abi
used.

Thanks,
Delia

On 3/4/20 5:20 PM, Kyrill Tkachov wrote:
> Hi Delia,
> 
> On 3/4/20 2:05 PM, Delia Burduv wrote:
>> Hi,
>>
>> The previous version of this patch shared part of its code with the
>> store intrinsics patch
>> (https://gcc.gnu.org/ml/gcc-patches/2020-03/msg00145.html) so I removed
>> any duplicated code. This patch now depends on the previously mentioned
>> store intrinsics patch.
>>
>> Here is the latest version and the updated ChangeLog.
>>
>> gcc/ChangeLog:
>>
>> 2019-03-04  Delia Burduv  <delia.burduv@arm.com>
>>
>>         * config/arm/arm_neon.h (bfloat16_t): New typedef.
>>         (vld2_bf16): New.
>>         (vld2q_bf16): New.
>>         (vld3_bf16): New.
>>         (vld3q_bf16): New.
>>         (vld4_bf16): New.
>>         (vld4q_bf16): New.
>>         (vld2_dup_bf16): New.
>>         (vld2q_dup_bf16): New.
>>         (vld3_dup_bf16): New.
>>         (vld3q_dup_bf16): New.
>>         (vld4_dup_bf16): New.
>>         (vld4q_dup_bf16): New.
>>          * config/arm/arm_neon_builtins.def
>>          (vld2): Changed to VAR13 and added v4bf, v8bf
>>          (vld2_dup): Changed to VAR8 and added v4bf, v8bf
>>          (vld3): Changed to VAR13 and added v4bf, v8bf
>>          (vld3_dup): Changed to VAR8 and added v4bf, v8bf
>>          (vld4): Changed to VAR13 and added v4bf, v8bf
>>          (vld4_dup): Changed to VAR8 and added v4bf, v8bf
>>          * config/arm/iterators.md (VDXBF): New iterator.
>>          (VQ2BF): New iterator.
>>          * config/arm/neon.md (vld2): Used new iterators.
>>          (vld2_dup<mode>): Used new iterators.
>>          (vld2_dupv8bf): New.
>>          (vst3): Used new iterators.
>>          (vst3qa): Used new iterators.
>>          (vst3qb): Used new iterators.
>>          (vld3_dup<mode>): Used new iterators.
>>          (vld3_dupv8bf): New.
>>          (vst4): Used new iterators.
>>          (vst4qa): Used new iterators.
>>          (vst4qb): Used new iterators.
>>          (vld4_dup<mode>): Used new iterators.
>>          (vld4_dupv8bf): New.
>>
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2019-03-04  Delia Burduv  <delia.burduv@arm.com>
>>
>>         * gcc.target/arm/simd/bf16_vldn_1.c: New test.
>>
>> Thanks,
>> Delia
>>
>> On 2/19/20 5:25 PM, Delia Burduv wrote:
>> >
>> > Hi,
>> >
>> > Here is the latest version of the patch. It just has some minor
>> > formatting changes that were brought up by Richard Sandiford in the
>> > AArch64 patches.
>> >
>> > Thanks,
>> > Delia
>> >
>> > On 1/22/20 5:31 PM, Delia Burduv wrote:
>> >> Ping.
>> >>
>> >> I will change the tests to use the exact input and output registers as
>> >> Richard Sandiford suggested for the AArch64 patches.
>> >>
>> >> On 12/20/19 6:48 PM, Delia Burduv wrote:
>> >>> This patch adds the ARMv8.6 ACLE BFloat16 load intrinsics
>> >>> vld<n>{q}_bf16 as part of the BFloat16 extension.
>> >>> (https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics)
>> >>>
>> >>> The intrinsics are declared in arm_neon.h.
>> >>> A new test is added to check assembler output.
>> >>>
>> >>> This patch depends on the Arm back-end patch.
>> >>> (https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01448.html)
>> >>>
>> >>> Tested for regression on arm-none-eabi and armeb-none-eabi. I don't
>> >>> have commit rights, so if this is ok can someone please commit it for
>> >>> me?
>> >>>
>> >>> gcc/ChangeLog:
>> >>>
>> >>> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>> >>>
>> >>>      * config/arm/arm_neon.h (bfloat16_t): New typedef.
>> >>>          (bfloat16x4x2_t): New typedef.
>> >>>          (bfloat16x8x2_t): New typedef.
>> >>>          (bfloat16x4x3_t): New typedef.
>> >>>          (bfloat16x8x3_t): New typedef.
>> >>>          (bfloat16x4x4_t): New typedef.
>> >>>          (bfloat16x8x4_t): New typedef.
>> >>>          (vld2_bf16): New.
>> >>>      (vld2q_bf16): New.
>> >>>      (vld3_bf16): New.
>> >>>      (vld3q_bf16): New.
>> >>>      (vld4_bf16): New.
>> >>>      (vld4q_bf16): New.
>> >>>      (vld2_dup_bf16): New.
>> >>>      (vld2q_dup_bf16): New.
>> >>>       (vld3_dup_bf16): New.
>> >>>      (vld3q_dup_bf16): New.
>> >>>      (vld4_dup_bf16): New.
>> >>>      (vld4q_dup_bf16): New.
>> >>>          * config/arm/arm-builtins.c (E_V2BFmode): New mode.
>> >>>          (VAR13): New.
>> >>>          (arm_simd_types[Bfloat16x2_t]): New type.
>> >>>          * config/arm/arm-modes.def (V2BF): New mode.
>> >>>          * config/arm/arm-simd-builtin-types.def
>> >>>          (Bfloat16x2_t): New entry.
>> >>>          * config/arm/arm_neon_builtins.def
>> >>>          (vld2): Changed to VAR13 and added v4bf, v8bf
>> >>>          (vld2_dup): Changed to VAR8 and added v4bf, v8bf
>> >>>          (vld3): Changed to VAR13 and added v4bf, v8bf
>> >>>          (vld3_dup): Changed to VAR8 and added v4bf, v8bf
>> >>>          (vld4): Changed to VAR13 and added v4bf, v8bf
>> >>>          (vld4_dup): Changed to VAR8 and added v4bf, v8bf
>> >>>          * config/arm/iterators.md (VDXBF): New iterator.
>> >>>          (VQ2BF): New iterator.
>> >>>          (V_elem): Added V4BF, V8BF.
>> >>>          (V_sz_elem): Added V4BF, V8BF.
>> >>>          (V_mode_nunits): Added V4BF, V8BF.
>> >>>          (q): Added V4BF, V8BF.
>> >>>          * config/arm/neon.md (vld2): Used new iterators.
>> >>>          (vld2_dup<mode>): Used new iterators.
>> >>>          (vld2_dupv8bf): New.
>> >>>          (vst3): Used new iterators.
>> >>>          (vst3qa): Used new iterators.
>> >>>          (vst3qb): Used new iterators.
>> >>>          (vld3_dup<mode>): Used new iterators.
>> >>>          (vld3_dupv8bf): New.
>> >>>          (vst4): Used new iterators.
>> >>>          (vst4qa): Used new iterators.
>> >>>          (vst4qb): Used new iterators.
>> >>>          (vld4_dup<mode>): Used new iterators.
>> >>>          (vld4_dupv8bf): New.
>> >>>
>> >>>
>> >>> gcc/testsuite/ChangeLog:
>> >>>
>> >>> 2019-11-14  Delia Burduv <delia.burduv@arm.com>
>> >>>
>> >>>      * gcc.target/arm/simd/bf16_vldn_1.c: New test.
> 
> 
> diff --git a/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c b/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..7ff8b600827e5c2e313ce40d14382aa641b4bb31
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/simd/bf16_vldn_1.c
> @@ -0,0 +1,152 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-save-temps" }  */
> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
> +/* { dg-add-options arm_v8_2a_bf16_neon } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> 
> 
> I think this should include an optimisation option like -O2 because...
> 
> +
> +#include "arm_neon.h"
> +
> +
> +/*
> +**test_vld2_bf16:
> +**    ...
> +**    vld2.16    {d16-d17}, \[r3\]
> 
> ... this is unstable codegen depending on the -O0 register allocator 
> moving the ptr argument to r3 from its initial r0.
> This should really be r0 and the load instruction should load the low D 
> regs.
> So let's add an -O2 to the dg-options and scan for the result of that.
> 
> 
> Otherwise this is ok.
> Thanks!
> Kyrill
> 
> 
> +**    ...
> +*/
> +bfloat16x4x2_t
> +test_vld2_bf16 (bfloat16_t * ptr)
> +{
> +  return vld2_bf16 (ptr);
> +}
> +
>