From: Richard Biener
Subject: Re: [PATCH] middle-end Add optimized float addsub without needing VEC_PERM_EXPR.
Date: Sat, 18 Jun 2022 12:49:02 +0200
Message-Id: <1C4185AB-6EE6-4B8B-838C-465098DAFD3B@suse.de>
To: Andrew Pinski via Gcc-patches
Cc: Tamar Christina, nd

> On 17.06.2022 at 22:34, Andrew Pinski via Gcc-patches wrote:
>
> On Thu, Jun 16, 2022 at 3:59 AM Tamar Christina via Gcc-patches wrote:
>>
>> Hi All,
>>
>> For IEEE 754 floating point formats we can replace a sequence of
>> alternating +/- with an fneg of a wider type followed by an fadd.  This
>> eliminates the need for a permutation.  This patch adds a match.pd rule
>> to recognize and perform this rewriting.
>
> I don't think this is correct.  You don't check the format of the
> floating point type to make sure this is valid (e.g. REAL_MODE_FORMAT's
> signbit_rw/signbit_ro field).
> Also, wouldn't it be better to do the xor in integer mode (using the
> signbit_rw field for the correct bit), and then make sure the target
> optimizes the xor to the neg instruction when needed?

I'm also worried about using FP operations for the negate here.  When @1
is constant, do we still constant fold this correctly?

For costing purposes it would be nice to make this visible to the
vectorizer.

Also, is this really good for all targets?  Can there be issues with
reformatting when using FP ops as in your patch, or with using integer
XOR as suggested, making this more expensive than the blend?
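To make the integer-mode variant concrete, here is a scalar sketch of
what the xor would do to one 64-bit lane.  It is illustrative only, not
part of the patch, and it assumes a little-endian target with IEEE
binary32/binary64 formats:

  #include <stdint.h>
  #include <string.h>

  /* View a pair of adjacent floats as one 64-bit lane.  On little-endian,
     bit 63 of the lane is the sign bit of the odd-indexed float, so a
     single integer xor negates exactly that element, which is also what
     an fneg on the wider (V2DF) view of a V4SF vector does.  */
  static void
  negate_odd (float pair[2])
  {
    uint64_t lane;
    memcpy (&lane, pair, sizeof lane);
    lane ^= (uint64_t) 1 << 63;  /* xor of the sign bit, in integer mode.  */
    memcpy (pair, &lane, sizeof lane);
  }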
Richard.

> Thanks,
> Andrew Pinski
>
>
>>
>> For
>>
>> void f (float *restrict a, float *restrict b, float *res, int n)
>> {
>>    for (int i = 0; i < (n & -4); i+=2)
>>     {
>>       res[i+0] = a[i+0] + b[i+0];
>>       res[i+1] = a[i+1] - b[i+1];
>>     }
>> }
>>
>> we generate:
>>
>> .L3:
>>         ldr     q1, [x1, x3]
>>         ldr     q0, [x0, x3]
>>         fneg    v1.2d, v1.2d
>>         fadd    v0.4s, v0.4s, v1.4s
>>         str     q0, [x2, x3]
>>         add     x3, x3, 16
>>         cmp     x3, x4
>>         bne     .L3
>>
>> now instead of:
>>
>> .L3:
>>         ldr     q1, [x0, x3]
>>         ldr     q2, [x1, x3]
>>         fadd    v0.4s, v1.4s, v2.4s
>>         fsub    v1.4s, v1.4s, v2.4s
>>         tbl     v0.16b, {v0.16b - v1.16b}, v3.16b
>>         str     q0, [x2, x3]
>>         add     x3, x3, 16
>>         cmp     x3, x4
>>         bne     .L3
>>
>> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>>
>> Thanks to George Steed for the idea.
>>
>> Ok for master?
>>
>> Thanks,
>> Tamar
>>
>> gcc/ChangeLog:
>>
>>         * match.pd: Add fneg/fadd rule.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/aarch64/simd/addsub_1.c: New test.
>>         * gcc.target/aarch64/sve/addsub_1.c: New test.
>>
>> --- inline copy of patch --
>> diff --git a/gcc/match.pd b/gcc/match.pd
>> index 51b0a1b562409af535e53828a10c30b8a3e1ae2e..af1c98d4a2831f38258d6fc1bbe811c8ee6c7c6e 100644
>> --- a/gcc/match.pd
>> +++ b/gcc/match.pd
>> @@ -7612,6 +7612,49 @@ and,
>>   (simplify (reduc (op @0 VECTOR_CST@1))
>>     (op (reduc:type @0) (reduc:type @1))))
>>
>> +/* Simplify vector floating point operations of alternating sub/add pairs
>> +   into an fneg of a wider element type followed by a normal add.
>> +   Under IEEE 754 the fneg of the wider type negates every second entry
>> +   (on little-endian, the odd-indexed lanes), so the add that follows
>> +   subtracts in those lanes and adds in the others.  */
>> +(simplify
>> + (vec_perm (plus:c @0 @1) (minus @0 @1) VECTOR_CST@2)
>> + (if (!VECTOR_INTEGER_TYPE_P (type) && !BYTES_BIG_ENDIAN)
>> +  (with
>> +   {
>> +     /* Build a vector of integers from the tree mask.  */
>> +     vec_perm_builder builder;
>> +     if (!tree_to_vec_perm_builder (&builder, @2))
>> +       return NULL_TREE;
>> +
>> +     /* Create a vec_perm_indices for the integer vector.  */
>> +     poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
>> +     vec_perm_indices sel (builder, 2, nelts);
>> +   }
>> +   (if (sel.series_p (0, 2, 0, 2))
>> +    (with
>> +     {
>> +       machine_mode vec_mode = TYPE_MODE (type);
>> +       auto elem_mode = GET_MODE_INNER (vec_mode);
>> +       auto nunits = exact_div (GET_MODE_NUNITS (vec_mode), 2);
>> +       tree stype;
>> +       switch (elem_mode)
>> +         {
>> +         case E_HFmode:
>> +           stype = float_type_node;
>> +           break;
>> +         case E_SFmode:
>> +           stype = double_type_node;
>> +           break;
>> +         default:
>> +           return NULL_TREE;
>> +         }
>> +       tree ntype = build_vector_type (stype, nunits);
>> +       if (!ntype)
>> +         return NULL_TREE;
>> +     }
>> +     (plus (view_convert:type (negate (view_convert:ntype @1))) @0))))))
>> +
>>  (simplify
>>   (vec_perm @0 @1 VECTOR_CST@2)
>>   (with
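To sanity-check the lane semantics this pattern relies on (for V4SF the
mask being matched is e.g. { 0, 5, 2, 7 }: plus lanes at the even
indices, minus lanes at the odd ones), something like the following
standalone program can be used.  It is illustrative only, assumes a
little-endian target with IEEE formats, and is not part of the patch:

  #include <assert.h>
  #include <stdint.h>
  #include <string.h>

  int
  main (void)
  {
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 0.5f, 0.25f, -1.0f, 8.0f };
    float ref[4], neg[4], alt[4];

    /* Reference semantics: add in even lanes, subtract in odd lanes.  */
    for (int i = 0; i < 4; i++)
      ref[i] = (i & 1) ? a[i] - b[i] : a[i] + b[i];

    /* Emulate the fneg on the wider (double) view of b: flipping bit 63
       of each 64-bit lane negates every odd-indexed float.  */
    for (int i = 0; i < 4; i += 2)
      {
        uint64_t lane;
        memcpy (&lane, &b[i], sizeof lane);
        lane ^= (uint64_t) 1 << 63;
        memcpy (&neg[i], &lane, sizeof lane);
      }
    for (int i = 0; i < 4; i++)
      alt[i] = a[i] + neg[i];

    /* IEEE 754 defines x - y as x + (-y), so the two results must match
       bit for bit.  */
    assert (memcmp (ref, alt, sizeof ref) == 0);
    return 0;
  }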
>> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c b/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..1fb91a34c421bbd2894faa0dbbf1b47ad43310c4
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c
>> @@ -0,0 +1,56 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target arm_v8_2a_fp16_neon_ok } */
>> +/* { dg-options "-Ofast" } */
>> +/* { dg-add-options arm_v8_2a_fp16_neon } */
>> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
>> +
>> +#pragma GCC target "+nosve"
>> +
>> +/*
>> +** f1:
>> +** ...
>> +**   fneg    v[0-9]+.2d, v[0-9]+.2d
>> +**   fadd    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>> +** ...
>> +*/
>> +void f1 (float *restrict a, float *restrict b, float *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -4); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> +
>> +/*
>> +** d1:
>> +** ...
>> +**   fneg    v[0-9]+.4s, v[0-9]+.4s
>> +**   fadd    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
>> +** ...
>> +*/
>> +void d1 (_Float16 *restrict a, _Float16 *restrict b, _Float16 *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -8); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> +
>> +/*
>> +** e1:
>> +** ...
>> +**   fadd    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
>> +**   fsub    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
>> +**   ins     v[0-9]+.d\[1\], v[0-9]+.d\[1\]
>> +** ...
>> +*/
>> +void e1 (double *restrict a, double *restrict b, double *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -4); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/addsub_1.c b/gcc/testsuite/gcc.target/aarch64/sve/addsub_1.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..ea7f9d9db2c8c9a3efe5c7951a314a29b7a7a922
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/addsub_1.c
>> @@ -0,0 +1,52 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-Ofast" } */
>> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
>> +
>> +/*
>> +** f1:
>> +** ...
>> +**   fneg    z[0-9]+.d, p[0-9]+/m, z[0-9]+.d
>> +**   fadd    z[0-9]+.s, z[0-9]+.s, z[0-9]+.s
>> +** ...
>> +*/
>> +void f1 (float *restrict a, float *restrict b, float *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -4); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> +
>> +/*
>> +** d1:
>> +** ...
>> +**   fneg    z[0-9]+.s, p[0-9]+/m, z[0-9]+.s
>> +**   fadd    z[0-9]+.h, z[0-9]+.h, z[0-9]+.h
>> +** ...
>> +*/
>> +void d1 (_Float16 *restrict a, _Float16 *restrict b, _Float16 *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -8); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> +
>> +/*
>> +** e1:
>> +** ...
>> +**   fsub    z[0-9]+.d, z[0-9]+.d, z[0-9]+.d
>> +**   movprfx z[0-9]+.d, p[0-9]+/m, z[0-9]+.d
>> +**   fadd    z[0-9]+.d, p[0-9]+/m, z[0-9]+.d, z[0-9]+.d
>> +** ...
>> +*/
>> +void e1 (double *restrict a, double *restrict b, double *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -4); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>>
>>
>>
>> --