From: Richard Sandiford
To: Pengxuan Zheng
Subject: Re: [PATCH] aarch64: Add vector popcount besides QImode [PR113859]
References: <20240501003143.5323-1-quic_pzheng@quicinc.com>
Date: Wed, 12 Jun 2024 12:54:59 +0100
In-Reply-To: <20240501003143.5323-1-quic_pzheng@quicinc.com> (Pengxuan Zheng's message of "Tue, 30 Apr 2024 17:31:43 -0700")

Pengxuan Zheng writes:
> This patch improves GCC’s vectorization of __builtin_popcount for aarch64 target
> by adding popcount patterns for vector modes besides QImode, i.e., HImode,
> SImode and DImode.
>
> With this patch, we now generate the following for HImode:
>   cnt     v1.16b, v.16b
>   uaddlp  v2.8h, v1.16b
>
> For SImode, we generate:
>   cnt     v1.16b, v.16b
>   uaddlp  v2.8h, v1.16b
>   uaddlp  v3.4s, v2.8h
>
> For V2DI, we generate:
>   cnt     v1.16b, v.16b
>   uaddlp  v2.8h, v1.16b
>   uaddlp  v3.4s, v2.8h
>   uaddlp  v4.2d, v3.4s
>
> gcc/ChangeLog:
>
>       PR target/113859
>       * config/aarch64/aarch64-simd.md (popcount<mode>2): New define_expand.
>
> gcc/testsuite/ChangeLog:
>
>       PR target/113859
>       * gcc.target/aarch64/popcnt-vec.c: New test.
>
> Signed-off-by: Pengxuan Zheng
> ---
>  gcc/config/aarch64/aarch64-simd.md            | 40 ++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/popcnt-vec.c | 48 +++++++++++++++++++
>  2 files changed, 88 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/popcnt-vec.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index f8bb973a278..093c32ee8ff 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3540,6 +3540,46 @@ (define_insn "popcount<mode>2"
>    [(set_attr "type" "neon_cnt")]
>  )
>  
> +(define_expand "popcount<mode>2"
> +  [(set (match_operand:VQN 0 "register_operand" "=w")
> +        (popcount:VQN (match_operand:VQN 1 "register_operand" "w")))]
> +  "TARGET_SIMD"
> +  {
> +    rtx v = gen_reg_rtx (V16QImode);
> +    rtx v1 = gen_reg_rtx (V16QImode);
> +    emit_move_insn (v, gen_lowpart (V16QImode, operands[1]));
> +    emit_insn (gen_popcountv16qi2 (v1, v));
> +    if (<MODE>mode == V8HImode)
> +      {
> +        /* For V8HI, we generate:
> +            cnt     v1.16b, v.16b
> +            uaddlp  v2.8h, v1.16b */
> +        emit_insn (gen_aarch64_uaddlpv16qi (operands[0], v1));
> +        DONE;
> +      }
> +    rtx v2 = gen_reg_rtx (V8HImode);
> +    emit_insn (gen_aarch64_uaddlpv16qi (v2, v1));
> +    if (<MODE>mode == V4SImode)
> +      {
> +        /* For V4SI, we generate:
> +            cnt     v1.16b, v.16b
> +            uaddlp  v2.8h, v1.16b
> +            uaddlp  v3.4s, v2.8h */
> +        emit_insn (gen_aarch64_uaddlpv8hi (operands[0], v2));
> +        DONE;
> +      }
> +    /* For V2DI, we generate:
> +        cnt     v1.16b, v.16b
> +        uaddlp  v2.8h, v1.16b
> +        uaddlp  v3.4s, v2.8h
> +        uaddlp  v4.2d, v3.4s */
> +    rtx v3 = gen_reg_rtx (V4SImode);
> +    emit_insn (gen_aarch64_uaddlpv8hi (v3, v2));
> +    emit_insn (gen_aarch64_uaddlpv4si (operands[0], v3));
> +    DONE;
> +  }
> +)
> +

Could you add support for V4HI and V2SI at the same time?
I think it's possible to handle all 5 modes iteratively, like so:

(define_expand "popcount<mode>2"
  [(set (match_operand:VDQHSD 0 "register_operand")
        (popcount:VDQHSD (match_operand:VDQHSD 1 "register_operand")))]
  "TARGET_SIMD"
  {
    /* Generate a byte popcount.  */
    machine_mode mode = <bitsize> == 64 ? V8QImode : V16QImode;
    rtx tmp = gen_reg_rtx (mode);
    auto icode = optab_handler (popcount_optab, mode);
    emit_insn (GEN_FCN (icode) (tmp, gen_lowpart (mode, operands[1])));

    /* Use a sequence of UADDLPs to accumulate the counts.  Each step doubles
       the element size and halves the number of elements.  */
    do
      {
        auto icode = code_for_aarch64_addlp (ZERO_EXTEND, GET_MODE (tmp));
        mode = insn_data[icode].operand[0].mode;
        rtx dest = mode == <MODE>mode ? operands[0] : gen_reg_rtx (mode);
        emit_insn (GEN_FCN (icode) (dest, tmp));
        tmp = dest;
      }
    while (mode != <MODE>mode);
    DONE;
})

(only lightly tested).  This requires changing:

(define_expand "aarch64_<su>addlp<mode>"

to:

(define_expand "@aarch64_<su>addlp<mode>"

Thanks,
Richard

>  ;; 'across lanes' max and min ops.
>  
>  ;; Template for outputting a scalar, so we can create __builtins which can be
> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt-vec.c b/gcc/testsuite/gcc.target/aarch64/popcnt-vec.c
> new file mode 100644
> index 00000000000..4c9a1b95990
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt-vec.c
> @@ -0,0 +1,48 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +/* This function should produce cnt v.16b. */
> +void
> +bar (unsigned char *__restrict b, unsigned char *__restrict d)
> +{
> +  for (int i = 0; i < 1024; i++)
> +    d[i] = __builtin_popcount (b[i]);
> +}
> +
> +/* This function should produce cnt v.16b and uaddlp (Add Long Pairwise). */
> +void
> +bar1 (unsigned short *__restrict b, unsigned short *__restrict d)
> +{
> +  for (int i = 0; i < 1024; i++)
> +    d[i] = __builtin_popcount (b[i]);
> +}
> +
> +/* This function should produce cnt v.16b and 2 uaddlp (Add Long Pairwise). */
> +void
> +bar2 (unsigned int *__restrict b, unsigned int *__restrict d)
> +{
> +  for (int i = 0; i < 1024; i++)
> +    d[i] = __builtin_popcount (b[i]);
> +}
> +
> +/* This function should produce cnt v.16b and 3 uaddlp (Add Long Pairwise). */
> +void
> +bar3 (unsigned long long *__restrict b, unsigned long long *__restrict d)
> +{
> +  for (int i = 0; i < 1024; i++)
> +    d[i] = __builtin_popcountll (b[i]);
> +}
> +
> +/* SLP
> +   This function should produce cnt v.16b and 3 uaddlp (Add Long Pairwise). */
> +void
> +bar4 (unsigned long long *__restrict b, unsigned long long *__restrict d)
> +{
> +  d[0] = __builtin_popcountll (b[0]);
> +  d[1] = __builtin_popcountll (b[1]);
> +}
> +
> +/* { dg-final { scan-assembler-not {\tbl\tpopcount} } } */
> +/* { dg-final { scan-assembler-times {cnt\t} 5 } } */
> +/* { dg-final { scan-assembler-times {uaddlp\t} 9 } } */
> +/* { dg-final { scan-assembler-times {ldr\tq} 5 } } */
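
For readers who want to check the arithmetic behind the CNT + UADDLP sequence discussed in this thread, here is a minimal scalar sketch in C. It is not GCC code and does not use any intrinsics: the names cnt_byte and popcount_lane_model are hypothetical, and a single lane is modelled as an array of per-byte counts. The assumption is only that each UADDLP step adds adjacent pairs of counts, so an N-bit lane needs log2(N/8) steps after the byte popcount.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scalar model of CNT on a single byte lane. */
static unsigned cnt_byte(uint8_t x)
{
    unsigned n = 0;
    while (x) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

/* Hypothetical model of one vector lane of width lane_bits (8, 16, 32 or 64).
   Step 1 mirrors CNT: one count per byte of the lane.
   Each pass of the while loop mirrors one UADDLP: adjacent counts are
   added pairwise, halving the number of elements. */
static unsigned popcount_lane_model(uint64_t x, unsigned lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16
           || lane_bits == 32 || lane_bits == 64);

    unsigned counts[8];
    unsigned n = lane_bits / 8;   /* number of byte counts in the lane */

    for (unsigned i = 0; i < n; i++)
        counts[i] = cnt_byte((uint8_t)(x >> (8 * i)));

    while (n > 1) {
        for (unsigned i = 0; i < n / 2; i++)
            counts[i] = counts[2 * i] + counts[2 * i + 1];
        n /= 2;
    }
    return counts[0];
}
```

With this model a 16-bit lane takes one pairwise step, a 32-bit lane two and a 64-bit lane three, which matches the number of uaddlp instructions listed per mode in the patch description.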