From: "andrew at sifive dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
Date: Mon, 22 Apr 2024 21:11:09 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

Andrew Waterman changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew at sifive dot com

--- Comment #2 from Andrew Waterman ---
To respond to some of Palmer's points:

In general, doing a single reduction at the end will perform better than doing
multiple reductions.  For the same total number of additions, sum reductions
tend to be slower (or at least no faster) than regular vector adds.

On some microarchitectures, vcpop.m results in a loss-of-decoupling event,
since it's consumed by the scalar unit.  To get reasonable performance on those
uarches, you need to use maximal LMUL to amortize the loss-of-decoupling event
over a greater amount of vector work.  (The alternative is to unroll the loop
such that each vcpop.m writes a different x-register, but that's far messier
than using large LMUL.)
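
For illustration only, the two loop shapes contrasted above can be modelled in
portable scalar C.  This is a sketch, not RVV intrinsics and not what GCC
emits: LANES is a hypothetical vector length, both function names are made up
for this example, and the inner loops merely stand in for one vector
instruction each.  The first function keeps per-lane partial counts (ordinary
vector adds) and reduces once after the loop; the second produces a scalar
count every iteration, which is the shape a per-iteration vcpop.m would have.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* hypothetical vector length in elements (assumption) */

/* Shape 1: accumulate per-lane counts, then do ONE reduction at the end.
   Each inner loop over l models a single element-wise vector add. */
size_t count_eq_single_reduction(const int32_t *a, size_t n, int32_t key)
{
    size_t lane_count[LANES] = {0};
    size_t i = 0;

    /* Full strips: lane l accumulates matches seen in position l. */
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            lane_count[l] += (a[i + l] == key);

    /* Tail strip: fewer than LANES elements remain, so l stays in range. */
    for (size_t l = 0; i < n; i++, l++)
        lane_count[l] += (a[i] == key);

    /* The only cross-lane reduction, done once after the loop. */
    size_t total = 0;
    for (size_t l = 0; l < LANES; l++)
        total += lane_count[l];
    return total;
}

/* Shape 2: reduce to a scalar in EVERY iteration (vcpop.m-like).  The
   scalar "pop" is consumed by the scalar side once per strip. */
size_t count_eq_per_iteration(const int32_t *a, size_t n, int32_t key)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i += LANES) {
        size_t vl = (n - i < LANES) ? (n - i) : LANES;
        size_t pop = 0; /* models the scalar result of one mask popcount */
        for (size_t l = 0; l < vl; l++)
            pop += (a[i + l] == key);
        total += pop;
    }
    return total;
}
```

Both functions return the same count; they differ only in how often a value
crosses from the vector side to the scalar side.  In this model a larger LANES
means fewer per-iteration scalar results in the second function for the same n,
which is the amortization effect attributed to maximal LMUL above.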