public inbox for gcc-bugs@sourceware.org
* [Bug target/114686] New: Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
@ 2024-04-10 20:52 camel-cdr at protonmail dot com
  2024-04-10 20:59 ` [Bug target/114686] " pinskia at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: camel-cdr at protonmail dot com @ 2024-04-10 20:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686

            Bug ID: 114686
           Summary: Feature request: Dynamic LMUL should be the default
                    for the RISC-V Vector extension
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: camel-cdr at protonmail dot com
  Target Milestone: ---

Currently, the default value for -mrvv-max-lmul is "m1"; it should be "dynamic"
instead, for the following reasons:

All currently available RVV implementations benefit from using the largest
LMUL when possible (C906, C908, C920, ara, bobcat; see also this comment about
the SiFive cores:
https://gcc.gnu.org/pipermail/gcc-patches/2024-February/644676.html).

Some benefit so much that you are basically always wasting 50% of the
performance by using LMUL=1 instead of LMUL=2 or above, as you can see here for
the C908: https://camel-cdr.github.io/rvv-bench-results/canmv_k230/index.html

I don't see any reason why this wouldn't be the case for the vast majority of
implementations. High-performance ones especially would benefit from having
more work to saturate the execution units with, since a larger LMUL works much
like loop unrolling.
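
To make the unrolling analogy concrete, here is a minimal sketch in plain C
(not RVV code) of how LMUL changes the trip count of a strip-mined loop. The
helper names vlmax and strip_mine_iterations are made up for illustration, and
only integer LMUL values are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Elements handled by one vector instruction: VLMAX = VLEN / SEW * LMUL.
   Hypothetical helper; only integer LMUL (1, 2, 4, 8) is modelled. */
size_t vlmax(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    assert(sew_bits > 0 && lmul >= 1 && lmul <= 8);
    return vlen_bits / sew_bits * lmul;
}

/* Iterations of a strip-mined loop over n elements (ceiling division). */
size_t strip_mine_iterations(size_t n, size_t vlen_bits,
                             size_t sew_bits, size_t lmul)
{
    size_t step = vlmax(vlen_bits, sew_bits, lmul);
    assert(step > 0);
    return (n + step - 1) / step;
}
```

With VLEN=128 and SEW=8, 1024 elements take 64 iterations at LMUL=1 and 8
iterations at LMUL=8, the same reduction in loop overhead that unrolling by
eight would give.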

Also consider that using a lower LMUL than possible would make mask
instructions more expensive because they happen more frequently. With any
LMUL/SEW the mask fits into a single LMUL=1 vector register and can thus
(usually) execute in the same number of cycles regardless of LMUL. So in a loop
with LMUL=4 the mask operations are four times as fast per element as with
LMUL=1, because they occur less frequently.
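
The mask argument can be checked with a little arithmetic. The sketch below is
a plain C model, assuming a fixed (hypothetical) number of mask instructions
per loop iteration; mask_bits and mask_ops_total are illustrative names, not
part of any API.

```c
#include <assert.h>
#include <stddef.h>

/* Mask bits needed for one register group: one bit per element, so
   VLEN / SEW * LMUL bits. With SEW >= 8 and LMUL <= 8 this never exceeds
   VLEN, i.e. the mask always fits into a single vector register. */
size_t mask_bits(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    assert(sew_bits >= 8 && lmul >= 1 && lmul <= 8);
    return vlen_bits / sew_bits * lmul;
}

/* Mask instructions executed over n elements, assuming mask_ops_per_iter
   mask instructions in every loop iteration (an assumption of this model). */
size_t mask_ops_total(size_t n, size_t vlen_bits, size_t sew_bits,
                      size_t lmul, size_t mask_ops_per_iter)
{
    size_t step = mask_bits(vlen_bits, sew_bits, lmul);
    assert(step > 0);
    size_t iterations = (n + step - 1) / step;
    return iterations * mask_ops_per_iter;
}
```

For 4096 elements at VLEN=128 and SEW=32 with one mask instruction per
iteration, this gives 1024 mask instructions at LMUL=1 and 256 at LMUL=4.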


Notes:

The vrgather.vv instruction should be exempt from that, because an LMUL=8
vrgather.vv is way more powerful than eight LMUL=1 vrgather.vv instructions,
and thus disproportionately complex to implement. When you don't need to cross
lanes, it's possible to unroll LMUL=1 vrgathers manually, instead of
choosing a higher LMUL.
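
A scalar C model may help show the difference. This is only a sketch of the
semantics, not RVV intrinsics code: gather_group models one vrgather.vv over a
register group of n elements, and gather_per_register (a hypothetical helper)
models the manually unrolled LMUL=1 variant, where each index is relative to
its own register and can never reach into a neighbouring one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one vrgather.vv over a group of n elements:
   dst[i] = src[idx[i]], or 0 when the index is out of range.
   dst must not overlap src or idx. */
void gather_group(uint8_t *dst, const uint8_t *src,
                  const uint8_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = idx[i] < n ? src[idx[i]] : 0;
}

/* The same data handled as independent register-sized chunks, one
   LMUL=1-style gather per chunk. Indices are relative to the chunk start. */
void gather_per_register(uint8_t *dst, const uint8_t *src,
                         const uint8_t *idx, size_t n, size_t per_reg)
{
    assert(per_reg > 0 && n % per_reg == 0);
    for (size_t base = 0; base < n; base += per_reg)
        gather_group(dst + base, src + base, idx + base, per_reg);
}
```

The per-register version only ever reads per_reg source elements per output
register, whereas the full-group gather must be able to read any of the n
source elements for every output element.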

Here are vrgather.vv throughput measurements on some existing implementations:
        VLEN e8m1 e8m2 e8m4 e8m8
c906    128  4    16   64   256
c908    128  4    16   64.9 261.1
c920    128  0.5  2.4  8.0  32.0
bobcat* 256  68   132  260  516
x280*   512  65   129  257  513

*bobcat: Note that it was explicitly stated that they didn't optimize the
         permutation instructions
*x280: the numbers are from llvm-mca, but I was told they match reality. There
       is also supposed to be a vrgather fast path for vl<=256. I think there
       was much incentive to make this fast, as the x280 mostly targets AI.
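
The scaling in the table above can be made explicit by dividing the e8m8
column by the e8m1 column. The sketch below just copies those two columns into
a C array (the struct and function names are made up for illustration). On the
three XuanTie cores the ratio comes out at about 64 to 65, i.e. roughly LMUL
squared, while bobcat and the x280 stay a little under 8.

```c
#include <assert.h>
#include <stddef.h>

/* e8m1 and e8m8 entries copied from the table above. */
struct gather_row {
    const char *core;
    double e8m1;
    double e8m8;
};

const struct gather_row gather_rows[] = {
    {"c906",   4.0,  256.0},
    {"c908",   4.0,  261.1},
    {"c920",   0.5,  32.0},
    {"bobcat", 68.0, 516.0},
    {"x280",   65.0, 513.0},
};

/* Ratio between the LMUL=8 and LMUL=1 entries of one row. */
double lmul8_over_lmul1(const struct gather_row *row)
{
    assert(row != NULL && row->e8m1 > 0.0);
    return row->e8m8 / row->e8m1;
}
```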

vcompress.vm doesn't scale linearly with LMUL on the XuanTie chips either, but
a better implementation is conceivable, because the work can be better
distributed/subdivided. GCC currently doesn't seem to generate vcompress.vm via
auto-vectorization anyway: https://godbolt.org/z/Mb5Kba865
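
For reference, a stream-compaction loop of the following shape is the kind of
code that vcompress.vm could in principle implement. This is an illustrative
sketch only; it is an assumption that it resembles the code behind the godbolt
link, and the function name compact_nonzero is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy the non-zero bytes of src to dst, preserving their order, and
   return how many were written. dst must have room for n bytes. */
size_t compact_nonzero(uint8_t *restrict dst, const uint8_t *restrict src,
                       size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] != 0)
            dst[out++] = src[i];   /* output position depends on earlier hits */
    }
    return out;
}
```

The data-dependent output index is what makes this loop hard to
auto-vectorize without a compress-style instruction.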


end of thread, other threads:[~2024-04-15 11:20 UTC | newest]

Thread overview: 4+ messages
2024-04-10 20:52 [Bug target/114686] New: Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension camel-cdr at protonmail dot com
2024-04-10 20:59 ` [Bug target/114686] " pinskia at gcc dot gnu.org
2024-04-13 22:46 ` juzhe.zhong at rivai dot ai
2024-04-15 11:20 ` rdapp at gcc dot gnu.org
