Hi all,

While optimising some vector math library code with intrinsics we stumbled upon the issue in the testcase. The compiler should be generating a FACGT instruction, but instead we generate:

foo(__Float32x4_t, __Float32x4_t, __Float32x4_t):
        fabs    v0.4s, v0.4s
        adrp    x0, .LC0
        ldr     q31, [x0, #:lo12:.LC0]
        fcmgt   v0.4s, v0.4s, v31.4s
        ret

This is because the vcagtq_f32 intrinsic is open-coded in arm_neon.h as:

  return vabsq_f32 (__a) > vabsq_f32 (__b);

relying on the optimisers to merge the two operations back together. But since one arm of the comparison is a vector constant, the combine pass folds the fabs of that constant into the constant itself and ends up trying to match:

(set (reg:V4SI 101)
    (neg:V4SI (gt:V4SI (reg:V4SF 100)
            (const_vector:V4SF [
                    (const_double:SF 1.0e+2 [0x0.c8p+7]) repeated x4
                ]))))

and

(set (reg:V4SI 101)
    (neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 104))
            (reg:V4SF 103))))

instead of what we want:

(insn 13 9 14 2 (set (reg/i:V4SI 32 v0)
        (neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 98))
                (abs:V4SF (reg:V4SF 96)))))

I don't see a good way around that with the current implementation of these intrinsics. Therefore this patch reimplements them with aarch64 builtins that generate the RTL for these instructions directly. The builtins are in fact already defined in aarch64-simd-builtins.def, and we already use them for the fp16 variants of these intrinsics.

I realise that this approach goes against the general principle of expressing intrinsics through higher-level constructs, so I'm willing to listen to counter-arguments. That said, the FACGT/FACGE instructions are as fast as the non-ABS comparison instructions on all microarchitectures I know of, so it should always be a win to keep them in the merged form rather than emitting the fabs as a separate step or trying to hoist it. And the testcase does come from real library code that we're trying to optimise.

With this patch, for the testcase we generate:

foo:
        adrp    x0, .LC0
        ldr     q31, [x0, #:lo12:.LC0]
        facgt   v0.4s, v0.4s, v31.4s
        ret

Bootstrapped and tested on aarch64-none-linux-gnu. I'll hold off on committing this to give folks a few days to comment, but will push by the end of next week if there are no objections.

Thanks,
Kyrill

gcc/ChangeLog:

	* config/aarch64/arm_neon.h (vcage_f64): Reimplement with builtins.
	(vcage_f32): Likewise.
	(vcages_f32): Likewise.
	(vcageq_f32): Likewise.
	(vcaged_f64): Likewise.
	(vcageq_f64): Likewise.
	(vcagts_f32): Likewise.
	(vcagt_f32): Likewise.
	(vcagt_f64): Likewise.
	(vcagtq_f32): Likewise.
	(vcagtd_f64): Likewise.
	(vcagtq_f64): Likewise.
	(vcale_f32): Likewise.
	(vcale_f64): Likewise.
	(vcaled_f64): Likewise.
	(vcales_f32): Likewise.
	(vcaleq_f32): Likewise.
	(vcaleq_f64): Likewise.
	(vcalt_f32): Likewise.
	(vcalt_f64): Likewise.
	(vcaltd_f64): Likewise.
	(vcaltq_f32): Likewise.
	(vcaltq_f64): Likewise.
	(vcalts_f32): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/simd/facgt_constpool_1.c: New test.
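
For illustration, the arm_neon.h side of the change is mechanical: each open-coded comparison becomes a direct call to the corresponding builtin, mirroring what the fp16 intrinsics already do. Taking vcagtq_f32 as the example (the builtin name below is extrapolated from the existing fp16 variant, __builtin_aarch64_facgtv8hf_uss, so treat it as illustrative and check the patch for the exact spelling):

 __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 vcagtq_f32 (float32x4_t __a, float32x4_t __b)
 {
-  return vabsq_f32 (__a) > vabsq_f32 (__b);
+  return __builtin_aarch64_facgtv4sf_uss (__a, __b);
 }

With the builtin, the expander emits the facgt RTL pattern directly, so there is nothing left for combine to get wrong when one operand is a constant.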
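
And for anyone who wants to reproduce the before/after, the new test is essentially of the following shape. This is a sketch rather than the exact file (in particular the dg-final directives in the committed test may differ); the three-parameter signature matches the assembly quoted above, and the 100.0f constant is the 1.0e+2 const_double that ends up in the constant pool:

/* { dg-do compile } */
/* { dg-options "-O2" } */

#include <arm_neon.h>

uint32x4_t
foo (float32x4_t a, float32x4_t b, float32x4_t c)
{
  /* abs-compare against a constant vector: should be a single FACGT
     against the constant-pool load, with no separate fabs.  */
  return vcagtq_f32 (a, vdupq_n_f32 (100.0f));
}

/* { dg-final { scan-assembler {\tfacgt\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s} } } */
/* { dg-final { scan-assembler-not {\tfabs\t} } } */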