public inbox for gcc-bugs@sourceware.org
* [Bug target/114686] New: Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
@ 2024-04-10 20:52 camel-cdr at protonmail dot com
2024-04-10 20:59 ` [Bug target/114686] " pinskia at gcc dot gnu.org
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: camel-cdr at protonmail dot com @ 2024-04-10 20:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686
Bug ID: 114686
Summary: Feature request: Dynamic LMUL should be the default
for the RISC-V Vector extension
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: camel-cdr at protonmail dot com
Target Milestone: ---
Currently, the default value of -mrvv-max-lmul is "m1"; it should be "dynamic"
instead, for the following reasons:
All currently available RVV implementations benefit from using the largest
possible LMUL (C906, C908, C920, ara, bobcat; see also this comment about the
SiFive cores:
https://gcc.gnu.org/pipermail/gcc-patches/2024-February/644676.html)
Some benefit to the degree that using LMUL=1 instead of LMUL=2 or above wastes
roughly 50% of the attainable performance, as you can see here for the C908:
https://camel-cdr.github.io/rvv-bench-results/canmv_k230/index.html
I don't see any reason why this wouldn't be the case for the vast majority of
implementations; especially high-performance ones would benefit from having
more work to saturate the execution units, since a larger LMUL works much like
loop unrolling.
Also consider that using a lower LMUL than possible makes mask instructions
relatively more expensive, because they execute more frequently. For any
LMUL/SEW combination the mask fits into a single LMUL=1 vector register and can
thus (usually) execute in the same number of cycles regardless of LMUL. So in a
loop with LMUL=4 the mask operations cost a quarter as much per element as with
LMUL=1, because they occur a quarter as often.
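The amortization argument can be sketched with a toy cost model (illustrative
only; the VLEN/SEW values and the fixed one-cycle mask cost are assumptions,
not measurements of any real core):

```python
# Toy model: a mask op costs a fixed amount regardless of LMUL, because
# the mask for any LMUL/SEW fits in a single LMUL=1 vector register.
VLEN = 128  # bits, hypothetical implementation
SEW = 8     # element width in bits

def mask_cycles_per_element(lmul):
    elements_per_iteration = VLEN // SEW * lmul
    mask_op_cost = 1  # cycles, fixed and independent of LMUL (assumption)
    return mask_op_cost / elements_per_iteration

# The per-element mask cost shrinks linearly with LMUL: at LMUL=4 the
# same mask op is amortized over 4x as many elements as at LMUL=1.
assert mask_cycles_per_element(4) == mask_cycles_per_element(1) / 4
```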
Notes:
The vrgather.vv instruction should be exempt from this, because an LMUL=8
vrgather.vv is far more powerful than eight LMUL=1 vrgather.vv instructions,
and thus disproportionately complex to implement. When you don't need to cross
lanes, it's possible to unroll LMUL=1 vrgathers manually instead of choosing a
higher LMUL.
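One way to see why the wide gather is disproportionately complex is to count
the naive (destination, source) element pairs a full vrgather.vv must be able
to connect. This is a back-of-the-envelope model, not a description of any
real implementation:

```python
def vrgather_pairs(vl_per_reg, lmul):
    """Naive count of (destination, source) element pairs a full
    vrgather.vv at a given LMUL must be able to route: every
    destination element may index any source element."""
    vl = vl_per_reg * lmul
    return vl * vl

VL = 16  # elements per LMUL=1 register, hypothetical (VLEN=128, SEW=8)
one_m8 = vrgather_pairs(VL, 8)        # a single LMUL=8 gather
eight_m1 = 8 * vrgather_pairs(VL, 1)  # eight independent LMUL=1 gathers

# The single wide gather needs 8x the routing capability of the
# unrolled LMUL=1 version, because pairs grow quadratically with LMUL.
assert one_m8 == 8 * eight_m1
```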
Here are throughput measurements on some existing implementations:
         VLEN   e8m1   e8m2   e8m4   e8m8
c906      128      4     16     64    256
c908      128      4     16   64.9  261.1
c920      128    0.5    2.4    8.0   32.0
bobcat*   256     68    132    260    516
x280*     512     65    129    257    513
*bobcat: Note that it was explicitly stated that they didn't optimize the
permutation instructions.
*x280: The numbers are from llvm-mca, but I was told they match reality. There
is also supposed to be a vrgather fast path for vl<=256. I don't think there
was much incentive to make this fast, as the x280 mostly targets AI.
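Assuming the table lists cycles per vrgather.vv instruction, the per-element
cost on the C906 follows directly from those measurements:

```python
# Cycles per element for vrgather.vv on the C906 (VLEN=128, SEW=8),
# derived from the measured per-instruction cycle counts above.
VLEN, SEW = 128, 8
cycles = {1: 4, 2: 16, 4: 64, 8: 256}  # LMUL -> cycles per instruction

per_element = {lmul: c / (VLEN // SEW * lmul) for lmul, c in cycles.items()}

# Per-element cost doubles at every LMUL step, i.e. the total cycle
# count grows quadratically with LMUL rather than linearly.
assert per_element[8] == 8 * per_element[1]
```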
vcompress.vm doesn't scale linearly with LMUL on the XuanTie chips either, but
a better implementation is conceivable, because the work can be subdivided and
distributed more easily. GCC currently doesn't seem to generate vcompress.vm
via auto-vectorization anyway: https://godbolt.org/z/Mb5Kba865
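As a sketch of why the compress work subdivides: a wide vcompress can be
computed as independent per-register compressions spliced together, since each
chunk's output offset is just the popcount of the preceding mask bits. This is
plain-Python reference semantics, not real vector code:

```python
import random

def compress(elements, mask):
    """Reference vcompress.vm semantics: keep the mask-selected
    elements, packed to the front, preserving their order."""
    return [e for e, m in zip(elements, mask) if m]

def compress_subdivided(elements, mask, chunks=8):
    """Compress each LMUL=1-sized chunk independently, then
    concatenate; the implicit splice offsets are the popcounts of
    the earlier mask chunks."""
    n = len(elements) // chunks
    out = []
    for i in range(chunks):
        out += compress(elements[i * n:(i + 1) * n], mask[i * n:(i + 1) * n])
    return out

# The subdivided version matches the monolithic one on random input.
random.seed(0)
elems = list(range(64))
mask = [random.randint(0, 1) for _ in range(64)]
assert compress_subdivided(elems, mask) == compress(elems, mask)
```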
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
2024-04-10 20:52 [Bug target/114686] New: Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension camel-cdr at protonmail dot com
@ 2024-04-10 20:59 ` pinskia at gcc dot gnu.org
2024-04-13 22:46 ` juzhe.zhong at rivai dot ai
2024-04-15 11:20 ` rdapp at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-10 20:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I suspect this will happen for GCC 15. GCC 14's RVV support was done to test
out the middle-end and back-end parts and was not really tuned at all. Even the
default tuning uses no cost model, though you can turn one on.
* [Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
2024-04-10 20:52 [Bug target/114686] New: Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension camel-cdr at protonmail dot com
2024-04-10 20:59 ` [Bug target/114686] " pinskia at gcc dot gnu.org
@ 2024-04-13 22:46 ` juzhe.zhong at rivai dot ai
2024-04-15 11:20 ` rdapp at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-04-13 22:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686
JuzheZhong <juzhe.zhong at rivai dot ai> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |juzhe.zhong at rivai dot ai
--- Comment #2 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
CCing RISC-V folks who may be interested in it.
Yeah, I agree with making dynamic LMUL the default. I suggested it a long time
ago.
However, almost all other RISC-V folks disagree with that.
Here is data from Li Pan@intel:
https://github.com/Incarnation-p-lee/Incarnation-p-lee/blob/master/performance/coremark-pro/coremark-pro_in_k230_evb.png
It covers auto-vectorization of coremark-pro with both LLVM and GCC at every
LMUL setting, and it turns out dynamic LMUL is beneficial.
>> The vrgather.vv instruction should be except from that, because an LMUL=8
>> vrgather.vv is way more powerful than eight LMUL=1 vrgather.vv instructions,
>> and thus disproportionately complex to implement. When you don't need to cross
>> lanes, it's possible to unrolling LMUL=1 vrgathers manually, instead of
>> choosing a higher LMUL.
Agreed. I think for some instructions like vrgather we shouldn't pick a large
LMUL even when the register pressure of the program allows it.
We can treat large-LMUL vrgather as expensive in the dynamic LMUL cost model
and optimize it in GCC 15.
>> vcompress.vm doesn't scale linearly with LMUL on the XuanTie chips either, but
>> a better implementation is conceivable, because the work can be better
>> distributed/subdivided. GCC currently doesn't seem to generate vcompress.vm via
>> auto-vectorization anyway: https://godbolt.org/z/Mb5Kba865
GCC may generate vcompress during auto-vectorization; in your case GCC simply
failed to vectorize the loop, which we may optimize in GCC 15.
Here are some cases where GCC generates vcompress:
https://godbolt.org/z/5GKh4eM7z
* [Bug target/114686] Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
2024-04-10 20:52 [Bug target/114686] New: Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension camel-cdr at protonmail dot com
2024-04-10 20:59 ` [Bug target/114686] " pinskia at gcc dot gnu.org
2024-04-13 22:46 ` juzhe.zhong at rivai dot ai
@ 2024-04-15 11:20 ` rdapp at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-04-15 11:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686
--- Comment #3 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I think we have always maintained that this can definitely be a per-uarch
default but shouldn't be a generic default.
> I don't see any reason why this wouldn't be the case for the vast majority of
> implementations, especially high performance ones would benefit from having
> more work to saturate the execution units with, since a larger LMUL works
> quite
> similar to loop unrolling.
One argument is reduced freedom for renaming and the out of order machinery.
It's much easier to shuffle individual registers around than large blocks.
Also lower-latency insns are easier to schedule than longer-latency ones and
faults, rejects, aborts etc. get proportionally more expensive.
I was under the impression that unrolling doesn't help a whole lot (sometimes
even slows things down a bit) on modern cores and certainly is not
unconditionally helpful. Granted, I haven't seen a lot of data on it recently.
An exception is of course breaking dependency chains.
In general nothing stands in the way of having a particular tune target use
dynamic LMUL by default even now, but nobody has gone ahead and posted a patch
for theirs. One could maybe argue that it should be the default for in-order
uarchs?
Should it become obvious in the future that LMUL > 1 is indeed,
unconditionally, a "better unrolling" because of its favorable icache footprint
and other properties (which I doubt - happy to be proved wrong), then we will
surely re-evaluate the decision, or rather reach a different consensus.
The data we publicly have so far is all in-order cores and my expectation is
that the picture will change once out-of-order cores hit the scene.
end of thread, other threads:[~2024-04-15 11:20 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)