public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler
@ 2024-04-22 17:10 wojciech_mula at poczta dot onet.pl
2024-04-22 20:24 ` [Bug target/114809] " palmer at gcc dot gnu.org
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: wojciech_mula at poczta dot onet.pl @ 2024-04-22 17:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809
Bug ID: 114809
Summary: [RISC-V RVV] Counting elements might be simpler
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: wojciech_mula at poczta dot onet.pl
Target Milestone: ---
Consider this simple procedure
---
#include <cstdint>
#include <cstdlib>
size_t count_chars(const char *src, size_t len, char c) {
size_t count = 0;
for (size_t i=0; i < len; i++) {
count += src[i] == c;
}
return count;
}
---
Assembly for it (GCC 14.0, -march=rv64gcv -O3):
---
count_chars(char const*, unsigned long, char):
beq a1,zero,.L4
vsetvli a4,zero,e8,mf8,ta,ma
vmv.v.x v2,a2
vsetvli zero,zero,e64,m1,ta,ma
vmv.v.i v1,0
.L3:
vsetvli a5,a1,e8,mf8,ta,ma
vle8.v v0,0(a0)
sub a1,a1,a5
add a0,a0,a5
vmseq.vv v0,v0,v2
vsetvli zero,zero,e64,m1,tu,mu
vadd.vi v1,v1,1,v0.t
bne a1,zero,.L3
vsetvli a5,zero,e64,m1,ta,ma
li a4,0
vmv.s.x v2,a4
vredsum.vs v1,v1,v2
vmv.x.s a0,v1
ret
.L4:
li a0,0
ret
---
The counting procedure might use `vcpop.m` instead of updating vector of
counters (`v1`) and summing them in the end. This would move all mode switches
outside the loop.
And there's a missing peephole optimization:
li a4,0
vmv.s.x v2,a4
It can be:
vmv.s.x v2,zero
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
@ 2024-04-22 20:24 ` palmer at gcc dot gnu.org
2024-04-22 21:11 ` andrew at sifive dot com
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: palmer at gcc dot gnu.org @ 2024-04-22 20:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809
palmer at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Keywords| |missed-optimization
Ever confirmed|0 |1
CC| |palmer at gcc dot gnu.org
Last reconfirmed| |2024-04-22
--- Comment #1 from palmer at gcc dot gnu.org ---
Thanks.
Sounds like there's really two issues here: a missed peephole and a more
complex set of micro-architectural tradeoffs.
The peephole seems like a pretty straight-forward missed optimization, if
you've got a smaller reproducer it's probably worth filing another bug for it.
We're right at the end of the GCC-14 release process and ended up with some
last-minute breakages so stuff is pretty chaotic right now, having the bug will
make it easier to avoid forgetting about this.
The reduction looks way more complicated to me. Just thinking a bit as I'm
watching the regressions run, I think there's a few options for generating the
code here:
* Do we accumulate into a vector and then reduce, or reduce and then
accumulate?
* Do we reduce via a sum-reduction or a popcnt?
* Do we reconfigure to a wider type or handle the overflow?
I think this will depend on the cost model for the hardware: we're essentially
trading off operations of one flavor of op for another, and that's going to
depend on how these ops perform. Your suggestion is essentially a
reconfiguration vs reduction trade-off, which is probably going to be
implementation-specific.
Do you have a system that this code performs poorly on? If there's something
concrete to target and we're not generating good code that's pretty actionable,
otherwise I think this one is going to be hard to reason about for a bit.
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
2024-04-22 20:24 ` [Bug target/114809] " palmer at gcc dot gnu.org
@ 2024-04-22 21:11 ` andrew at sifive dot com
2024-04-22 21:59 ` pinskia at gcc dot gnu.org
2024-04-22 22:23 ` juzhe.zhong at rivai dot ai
3 siblings, 0 replies; 5+ messages in thread
From: andrew at sifive dot com @ 2024-04-22 21:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809
Andrew Waterman <andrew at sifive dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |andrew at sifive dot com
--- Comment #2 from Andrew Waterman <andrew at sifive dot com> ---
To respond to some of Palmer's points:
In general, doing a single reduction at the end will perform better than doing
multiple reductions. For the same total number of additions, sum reductions
tend to be slower (or at least no faster) than regular vector adds.
On some microarchitectures, vcpop.m results in a loss-of-decoupling event,
since it's consumed by the scalar unit. To get reasonable performance on those
uarches, you need to use maximal LMUL to amortize the loss-of-decoupling event
over a greater amount of vector work. (The alternative is to unroll the loop
such that each vcpop.m writes a different x-register, but that's far messier
than using large LMUL.)
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
2024-04-22 20:24 ` [Bug target/114809] " palmer at gcc dot gnu.org
2024-04-22 21:11 ` andrew at sifive dot com
@ 2024-04-22 21:59 ` pinskia at gcc dot gnu.org
2024-04-22 22:23 ` juzhe.zhong at rivai dot ai
3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-22 21:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
^ permalink raw reply [flat|nested] 5+ messages in thread
* [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
` (2 preceding siblings ...)
2024-04-22 21:59 ` pinskia at gcc dot gnu.org
@ 2024-04-22 22:23 ` juzhe.zhong at rivai dot ai
3 siblings, 0 replies; 5+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-04-22 22:23 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809
JuzheZhong <juzhe.zhong at rivai dot ai> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |juzhe.zhong at rivai dot ai
--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
For missed peephole optimization, I already noticed it long time ago,
and I have filed PR:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014
Such issue will gone after Richard Standiford @arm merged late-combine PASS in
GCC 15.
Also, GCC support dynamic LMUL optimization with -mrvv-max-lmul=dynamic:
https://godbolt.org/z/646nYoKbv
ASM:
count_chars(char const*, unsigned long, char):
beq a1,zero,.L4
vsetvli a4,zero,e8,m1,ta,ma
vmv.v.x v1,a2
vsetvli zero,zero,e64,m8,ta,ma
vmv.v.i v8,0
.L3:
vsetvli a5,a1,e8,m1,ta,ma
vle8.v v0,0(a0)
sub a1,a1,a5
add a0,a0,a5
vmseq.vv v0,v0,v1
vsetvli zero,zero,e64,m8,tu,mu
vadd.vi v8,v8,1,v0.t
bne a1,zero,.L3
vsetvli a5,zero,e64,m8,ta,ma
li a4,0
vmv.s.x v1,a4
vredsum.vs v8,v8,v1
vmv.x.s a0,v8
ret
.L4:
li a0,0
ret
GCC picks LMUL = 8, since it doesn't cause additional register spillings
according to the program register pressure.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-04-22 22:23 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
2024-04-22 20:24 ` [Bug target/114809] " palmer at gcc dot gnu.org
2024-04-22 21:11 ` andrew at sifive dot com
2024-04-22 21:59 ` pinskia at gcc dot gnu.org
2024-04-22 22:23 ` juzhe.zhong at rivai dot ai
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).