[Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler
@ 2024-04-22 17:10 wojciech_mula at poczta dot onet.pl
  2024-04-22 20:24 ` [Bug target/114809] " palmer at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: wojciech_mula at poczta dot onet.pl @ 2024-04-22 17:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

            Bug ID: 114809
           Summary: [RISC-V RVV] Counting elements might be simpler
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wojciech_mula at poczta dot onet.pl
  Target Milestone: ---

Consider this simple procedure

---
#include <cstdint>
#include <cstdlib>

size_t count_chars(const char *src, size_t len, char c) {
    size_t count = 0;
    for (size_t i=0; i < len; i++) {
        count += src[i] == c;
    }

    return count;
}
---

Assembly for it (GCC 14.0, -march=rv64gcv -O3):

---
count_chars(char const*, unsigned long, char):
        beq     a1,zero,.L4
        vsetvli a4,zero,e8,mf8,ta,ma
        vmv.v.x v2,a2
        vsetvli zero,zero,e64,m1,ta,ma
        vmv.v.i v1,0
.L3:
        vsetvli a5,a1,e8,mf8,ta,ma
        vle8.v  v0,0(a0)
        sub     a1,a1,a5
        add     a0,a0,a5
        vmseq.vv        v0,v0,v2
        vsetvli zero,zero,e64,m1,tu,mu
        vadd.vi v1,v1,1,v0.t
        bne     a1,zero,.L3
        vsetvli a5,zero,e64,m1,ta,ma
        li      a4,0
        vmv.s.x v2,a4
        vredsum.vs      v1,v1,v2
        vmv.x.s a0,v1
        ret
.L4:
        li      a0,0
        ret
---

The counting procedure might use `vcpop.m` instead of updating vector of
counters (`v1`) and summing them in the end. This would move all mode switches
outside the loop.

And there's a missing peephole optimization:

        li      a4,0
        vmv.s.x v2,a4

It can be:

        vmv.s.x v2,zero

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
  2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
@ 2024-04-22 20:24 ` palmer at gcc dot gnu.org
  2024-04-22 21:11 ` andrew at sifive dot com
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: palmer at gcc dot gnu.org @ 2024-04-22 20:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

palmer at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
                 CC|                            |palmer at gcc dot gnu.org
   Last reconfirmed|                            |2024-04-22

--- Comment #1 from palmer at gcc dot gnu.org ---
Thanks.

Sounds like there's really two issues here: a missed peephole and a more
complex set of micro-architectural tradeoffs.

The peephole seems like a pretty straight-forward missed optimization, if
you've got a smaller reproducer it's probably worth filing another bug for it. 
We're right at the end of the GCC-14 release process and ended up with some
last-minute breakages so stuff is pretty chaotic right now, having the bug will
make it easier to avoid forgetting about this.

The reduction looks way more complicated to me.  Just thinking a bit as I'm
watching the regressions run, I think there's a few options for generating the
code here:

* Do we accumulate into a vector and then reduce, or reduce and then
accumulate?
* Do we reduce via a sum-reduction or a popcnt?
* Do we reconfigure to a wider type or handle the overflow?

I think this will depend on the cost model for the hardware: we're essentially
trading off operations of one flavor of op for another, and that's going to
depend on how these ops perform.  Your suggestion is essentially a
reconfiguration vs reduction trade-off, which is probably going to be
implementation-specific.

Do you have a system that this code performs poorly on?  If there's something
concrete to target and we're not generating good code that's pretty actionable,
otherwise I think this one is going to be hard to reason about for a bit.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
  2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
  2024-04-22 20:24 ` [Bug target/114809] " palmer at gcc dot gnu.org
@ 2024-04-22 21:11 ` andrew at sifive dot com
  2024-04-22 21:59 ` pinskia at gcc dot gnu.org
  2024-04-22 22:23 ` juzhe.zhong at rivai dot ai
  3 siblings, 0 replies; 5+ messages in thread
From: andrew at sifive dot com @ 2024-04-22 21:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

Andrew Waterman <andrew at sifive dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew at sifive dot com

--- Comment #2 from Andrew Waterman <andrew at sifive dot com> ---
To respond to some of Palmer's points:

In general, doing a single reduction at the end will perform better than doing
multiple reductions.  For the same total number of additions, sum reductions
tend to be slower (or at least no faster) than regular vector adds.

On some microarchitectures, vcpop.m results in a loss-of-decoupling event,
since it's consumed by the scalar unit.  To get reasonable performance on those
uarches, you need to use maximal LMUL to amortize the loss-of-decoupling event
over a greater amount of vector work.  (The alternative is to unroll the loop
such that each vcpop.m writes a different x-register, but that's far messier
than using large LMUL.)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
  2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
  2024-04-22 20:24 ` [Bug target/114809] " palmer at gcc dot gnu.org
  2024-04-22 21:11 ` andrew at sifive dot com
@ 2024-04-22 21:59 ` pinskia at gcc dot gnu.org
  2024-04-22 22:23 ` juzhe.zhong at rivai dot ai
  3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-22 21:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug target/114809] [RISC-V RVV] Counting elements might be simpler
  2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
                   ` (2 preceding siblings ...)
  2024-04-22 21:59 ` pinskia at gcc dot gnu.org
@ 2024-04-22 22:23 ` juzhe.zhong at rivai dot ai
  3 siblings, 0 replies; 5+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-04-22 22:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114809

JuzheZhong <juzhe.zhong at rivai dot ai> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
For missed peephole optimization, I already noticed it long time ago,
and I have filed PR:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113014

Such issue will gone after Richard Standiford @arm merged late-combine PASS in
GCC 15.

Also, GCC support dynamic LMUL optimization with -mrvv-max-lmul=dynamic:

https://godbolt.org/z/646nYoKbv

ASM:

count_chars(char const*, unsigned long, char):
        beq     a1,zero,.L4
        vsetvli a4,zero,e8,m1,ta,ma
        vmv.v.x v1,a2
        vsetvli zero,zero,e64,m8,ta,ma
        vmv.v.i v8,0
.L3:
        vsetvli a5,a1,e8,m1,ta,ma
        vle8.v  v0,0(a0)
        sub     a1,a1,a5
        add     a0,a0,a5
        vmseq.vv        v0,v0,v1
        vsetvli zero,zero,e64,m8,tu,mu
        vadd.vi v8,v8,1,v0.t
        bne     a1,zero,.L3
        vsetvli a5,zero,e64,m8,ta,ma
        li      a4,0
        vmv.s.x v1,a4
        vredsum.vs      v8,v8,v1
        vmv.x.s a0,v8
        ret
.L4:
        li      a0,0
        ret

GCC picks LMUL = 8, since it doesn't cause additional register spillings
according to the program register pressure.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-04-22 22:23 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-22 17:10 [Bug target/114809] New: [RISC-V RVV] Counting elements might be simpler wojciech_mula at poczta dot onet.pl
2024-04-22 20:24 ` [Bug target/114809] " palmer at gcc dot gnu.org
2024-04-22 21:11 ` andrew at sifive dot com
2024-04-22 21:59 ` pinskia at gcc dot gnu.org
2024-04-22 22:23 ` juzhe.zhong at rivai dot ai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).