From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id B20DE3858412 for ; Sat, 12 Nov 2022 12:33:50 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org B20DE3858412 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 7BDCF1FB; Sat, 12 Nov 2022 04:33:56 -0800 (PST) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 57E853F534; Sat, 12 Nov 2022 04:33:49 -0800 (PST) From: Richard Sandiford To: Tamar Christina Mail-Followup-To: Tamar Christina ,gcc-patches@gcc.gnu.org, nd@arm.com, Richard.Earnshaw@arm.com, Marcus.Shawcroft@arm.com, Kyrylo.Tkachov@arm.com, richard.sandiford@arm.com Cc: gcc-patches@gcc.gnu.org, nd@arm.com, Richard.Earnshaw@arm.com, Marcus.Shawcroft@arm.com, Kyrylo.Tkachov@arm.com Subject: Re: [PATCH 4/4]AArch64 sve2: rewrite pack + NARROWB + NARROWB to NARROWB + NARROWT References: Date: Sat, 12 Nov 2022 12:33:48 +0000 In-Reply-To: (Richard Sandiford's message of "Sat, 12 Nov 2022 12:25:04 +0000") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-42.2 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Richard Sandiford writes: > Tamar Christina writes: >> Hi All, >> >> This adds an RTL pattern for when two NARROWB instructions are being combined >> with a PACK. The second NARROWB is then transformed into a NARROWT. >> >> For the example: >> >> void draw_bitmap1(uint8_t* restrict pixel, uint8_t level, int n) >> { >> for (int i = 0; i < (n & -16); i+=1) >> pixel[i] += (pixel[i] * level) / 0xff; >> } >> >> we generate: >> >> addhnb z6.b, z0.h, z4.h >> addhnb z5.b, z1.h, z4.h >> addhnb z0.b, z0.h, z6.h >> addhnt z0.b, z1.h, z5.h >> add z0.b, z0.b, z2.b >> >> instead of: >> >> addhnb z6.b, z1.h, z4.h >> addhnb z5.b, z0.h, z4.h >> addhnb z1.b, z1.h, z6.h >> addhnb z0.b, z0.h, z5.h >> uzp1 z0.b, z0.b, z1.b >> add z0.b, z0.b, z2.b >> >> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. >> >> Ok for master? >> >> Thanks, >> Tamar >> >> gcc/ChangeLog: >> >> * config/aarch64/aarch64-sve2.md (*aarch64_sve_pack_): >> New. >> * config/aarch64/iterators.md (binary_top): New. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.dg/vect/vect-div-bitmask-4.c: New test. >> * gcc.target/aarch64/sve2/div-by-bitmask_2.c: New test. >> >> --- inline copy of patch -- >> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md >> index ab5dcc369481311e5bd68a1581265e1ce99b4b0f..0ee46c8b0d43467da4a6b98ad3c41e5d05d8cf38 100644 >> --- a/gcc/config/aarch64/aarch64-sve2.md >> +++ b/gcc/config/aarch64/aarch64-sve2.md >> @@ -1600,6 +1600,25 @@ (define_insn "@aarch64_sve_" >> "\t%0., %2., %3." >> ) >> >> +(define_insn_and_split "*aarch64_sve_pack_" >> + [(set (match_operand: 0 "register_operand" "=w") >> + (unspec: >> + [(match_operand:SVE_FULL_HSDI 1 "register_operand" "w") > > "0" would be safer, in case the instruction is only split after RA. > >> + (subreg:SVE_FULL_HSDI (unspec: >> + [(match_operand:SVE_FULL_HSDI 2 "register_operand" "w") >> + (match_operand:SVE_FULL_HSDI 3 "register_operand" "w")] >> + SVE2_INT_BINARY_NARROWB) 0)] >> + UNSPEC_PACK))] > > I think ideally this would be the canonical pattern, so that we can > drop the separate top unspecs. That's more work though, and would > probably make sense to do once we have a generic way of representing > the pack. > > So OK with the "0" change above. Hmm, actually, I take that back. Is this transform really correct? I think the blend corresponds to a TRN1 rather than a UZP1. The bottom operations populate the lower half of each wider element and the top operations populate the upper half. Thanks, Richard