From: Richard Biener
Subject: Re: [PATCH] middle-end Add optimized float addsub without needing VEC_PERM_EXPR.
Date: Sat, 18 Jun 2022 12:49:02 +0200
Message-Id: <1C4185AB-6EE6-4B8B-838C-465098DAFD3B@suse.de>
To: Andrew Pinski via Gcc-patches
Cc: Tamar Christina, nd

> On 17.06.2022 at 22:34, Andrew Pinski via Gcc-patches wrote:
>
> On Thu, Jun 16, 2022 at 3:59 AM Tamar Christina via Gcc-patches wrote:
>>
>> Hi All,
>>
>> For IEEE 754 floating point formats we can replace a sequence of
>> alternating +/- with an fneg of a wider type followed by an fadd.  This
>> eliminates the need for a permutation.  This patch adds a match.pd rule
>> to recognize and perform this rewriting.
>
> I don't think this is correct.  You don't check the format of the
> floating point type to make sure this is valid (e.g. REAL_MODE_FORMAT's
> signbit_rw/signbit_ro field).
> Also, wouldn't it be better to do the xor in integer mode (using the
> signbit_rw field for the correct bit), and then make sure the target
> optimizes the xor to the neg instruction when needed?

I'm also worried about using FP operations for the negate here.  When @1
is constant, do we still constant fold this correctly?

For costing purposes it would be nice to make this visible to the
vectorizer.

Also, is this really good for all targets?  Can there be issues with
reformatting when using FP ops as in your patch, or with using integer
XOR as suggested, making this more expensive than the blend?
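To make the integer-mode variant concrete, here is a scalar sketch of
what the xor would do to one 64-bit lane.  It is illustrative only, not
part of the patch, and it assumes a little-endian target with IEEE
binary32/binary64 formats:

  #include <stdint.h>
  #include <string.h>

  /* View a pair of adjacent floats as one 64-bit lane.  On little-endian,
     bit 63 of the lane is the sign bit of the odd-indexed float, so a
     single integer xor negates exactly that element, which is also what
     an fneg on the wider (V2DF) view of a V4SF vector does.  */
  static void
  negate_odd (float pair[2])
  {
    uint64_t lane;
    memcpy (&lane, pair, sizeof lane);
    lane ^= (uint64_t) 1 << 63;  /* xor of the sign bit, in integer mode.  */
    memcpy (pair, &lane, sizeof lane);
  }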
Richard.

> Thanks,
> Andrew Pinski
>
>
>>
>> For
>>
>> void f (float *restrict a, float *restrict b, float *res, int n)
>> {
>>    for (int i = 0; i < (n & -4); i+=2)
>>     {
>>       res[i+0] = a[i+0] + b[i+0];
>>       res[i+1] = a[i+1] - b[i+1];
>>     }
>> }
>>
>> we generate:
>>
>> .L3:
>>         ldr     q1, [x1, x3]
>>         ldr     q0, [x0, x3]
>>         fneg    v1.2d, v1.2d
>>         fadd    v0.4s, v0.4s, v1.4s
>>         str     q0, [x2, x3]
>>         add     x3, x3, 16
>>         cmp     x3, x4
>>         bne     .L3
>>
>> now instead of:
>>
>> .L3:
>>         ldr     q1, [x0, x3]
>>         ldr     q2, [x1, x3]
>>         fadd    v0.4s, v1.4s, v2.4s
>>         fsub    v1.4s, v1.4s, v2.4s
>>         tbl     v0.16b, {v0.16b - v1.16b}, v3.16b
>>         str     q0, [x2, x3]
>>         add     x3, x3, 16
>>         cmp     x3, x4
>>         bne     .L3
>>
>> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>>
>> Thanks to George Steed for the idea.
>>
>> Ok for master?
>>
>> Thanks,
>> Tamar
>>
>> gcc/ChangeLog:
>>
>>         * match.pd: Add fneg/fadd rule.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/aarch64/simd/addsub_1.c: New test.
>>         * gcc.target/aarch64/sve/addsub_1.c: New test.
>>
>> --- inline copy of patch --
>> diff --git a/gcc/match.pd b/gcc/match.pd
>> index 51b0a1b562409af535e53828a10c30b8a3e1ae2e..af1c98d4a2831f38258d6fc1bbe811c8ee6c7c6e 100644
>> --- a/gcc/match.pd
>> +++ b/gcc/match.pd
>> @@ -7612,6 +7612,49 @@ and,
>>   (simplify (reduc (op @0 VECTOR_CST@1))
>>     (op (reduc:type @0) (reduc:type @1))))
>>
>> +/* Simplify vector floating point operations of alternating sub/add pairs
>> +   into an fneg of a wider element type followed by a normal add.
>> +   Under IEEE 754 the fneg of the wider type negates every second entry
>> +   (on little-endian, the odd-indexed lanes), so the add that follows
>> +   subtracts in those lanes and adds in the others.  */
>> +(simplify
>> + (vec_perm (plus:c @0 @1) (minus @0 @1) VECTOR_CST@2)
>> + (if (!VECTOR_INTEGER_TYPE_P (type) && !BYTES_BIG_ENDIAN)
>> +  (with
>> +   {
>> +     /* Build a vector of integers from the tree mask.  */
>> +     vec_perm_builder builder;
>> +     if (!tree_to_vec_perm_builder (&builder, @2))
>> +       return NULL_TREE;
>> +
>> +     /* Create a vec_perm_indices for the integer vector.  */
>> +     poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
>> +     vec_perm_indices sel (builder, 2, nelts);
>> +   }
>> +   (if (sel.series_p (0, 2, 0, 2))
>> +    (with
>> +     {
>> +       machine_mode vec_mode = TYPE_MODE (type);
>> +       auto elem_mode = GET_MODE_INNER (vec_mode);
>> +       auto nunits = exact_div (GET_MODE_NUNITS (vec_mode), 2);
>> +       tree stype;
>> +       switch (elem_mode)
>> +         {
>> +         case E_HFmode:
>> +           stype = float_type_node;
>> +           break;
>> +         case E_SFmode:
>> +           stype = double_type_node;
>> +           break;
>> +         default:
>> +           return NULL_TREE;
>> +         }
>> +       tree ntype = build_vector_type (stype, nunits);
>> +       if (!ntype)
>> +         return NULL_TREE;
>> +     }
>> +     (plus (view_convert:type (negate (view_convert:ntype @1))) @0))))))
>> +
>>  (simplify
>>   (vec_perm @0 @1 VECTOR_CST@2)
>>   (with
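To sanity-check the lane semantics this pattern relies on (for V4SF the
mask being matched is e.g. { 0, 5, 2, 7 }: plus lanes at the even
indices, minus lanes at the odd ones), something like the following
standalone program can be used.  It is illustrative only, assumes a
little-endian target with IEEE formats, and is not part of the patch:

  #include <assert.h>
  #include <stdint.h>
  #include <string.h>

  int
  main (void)
  {
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 0.5f, 0.25f, -1.0f, 8.0f };
    float ref[4], neg[4], alt[4];

    /* Reference semantics: add in even lanes, subtract in odd lanes.  */
    for (int i = 0; i < 4; i++)
      ref[i] = (i & 1) ? a[i] - b[i] : a[i] + b[i];

    /* Emulate the fneg on the wider (double) view of b: flipping bit 63
       of each 64-bit lane negates every odd-indexed float.  */
    for (int i = 0; i < 4; i += 2)
      {
        uint64_t lane;
        memcpy (&lane, &b[i], sizeof lane);
        lane ^= (uint64_t) 1 << 63;
        memcpy (&neg[i], &lane, sizeof lane);
      }
    for (int i = 0; i < 4; i++)
      alt[i] = a[i] + neg[i];

    /* IEEE 754 defines x - y as x + (-y), so the two results must match
       bit for bit.  */
    assert (memcmp (ref, alt, sizeof ref) == 0);
    return 0;
  }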
>> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c b/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..1fb91a34c421bbd2894faa0dbbf1b47ad43310c4
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/simd/addsub_1.c
>> @@ -0,0 +1,56 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target arm_v8_2a_fp16_neon_ok } */
>> +/* { dg-options "-Ofast" } */
>> +/* { dg-add-options arm_v8_2a_fp16_neon } */
>> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
>> +
>> +#pragma GCC target "+nosve"
>> +
>> +/*
>> +** f1:
>> +** ...
>> +**   fneg    v[0-9]+.2d, v[0-9]+.2d
>> +**   fadd    v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
>> +** ...
>> +*/
>> +void f1 (float *restrict a, float *restrict b, float *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -4); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> +
>> +/*
>> +** d1:
>> +** ...
>> +**   fneg    v[0-9]+.4s, v[0-9]+.4s
>> +**   fadd    v[0-9]+.8h, v[0-9]+.8h, v[0-9]+.8h
>> +** ...
>> +*/
>> +void d1 (_Float16 *restrict a, _Float16 *restrict b, _Float16 *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -8); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> +
>> +/*
>> +** e1:
>> +** ...
>> +**   fadd    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
>> +**   fsub    v[0-9]+.2d, v[0-9]+.2d, v[0-9]+.2d
>> +**   ins     v[0-9]+.d\[1\], v[0-9]+.d\[1\]
>> +** ...
>> +*/
>> +void e1 (double *restrict a, double *restrict b, double *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -4); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/addsub_1.c b/gcc/testsuite/gcc.target/aarch64/sve/addsub_1.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..ea7f9d9db2c8c9a3efe5c7951a314a29b7a7a922
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/addsub_1.c
>> @@ -0,0 +1,52 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-Ofast" } */
>> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
>> +
>> +/*
>> +** f1:
>> +** ...
>> +**   fneg    z[0-9]+.d, p[0-9]+/m, z[0-9]+.d
>> +**   fadd    z[0-9]+.s, z[0-9]+.s, z[0-9]+.s
>> +** ...
>> +*/
>> +void f1 (float *restrict a, float *restrict b, float *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -4); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> +
>> +/*
>> +** d1:
>> +** ...
>> +**   fneg    z[0-9]+.s, p[0-9]+/m, z[0-9]+.s
>> +**   fadd    z[0-9]+.h, z[0-9]+.h, z[0-9]+.h
>> +** ...
>> +*/
>> +void d1 (_Float16 *restrict a, _Float16 *restrict b, _Float16 *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -8); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>> +
>> +/*
>> +** e1:
>> +** ...
>> +**   fsub    z[0-9]+.d, z[0-9]+.d, z[0-9]+.d
>> +**   movprfx z[0-9]+.d, p[0-9]+/m, z[0-9]+.d
>> +**   fadd    z[0-9]+.d, p[0-9]+/m, z[0-9]+.d, z[0-9]+.d
>> +** ...
>> +*/
>> +void e1 (double *restrict a, double *restrict b, double *res, int n)
>> +{
>> +   for (int i = 0; i < (n & -4); i+=2)
>> +    {
>> +      res[i+0] = a[i+0] + b[i+0];
>> +      res[i+1] = a[i+1] - b[i+1];
>> +    }
>> +}
>>
>>
>>
>> --