public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
@ 2024-09-02 16:47 law at gcc dot gnu.org
  2024-09-02 17:37 ` [Bug tree-optimization/116573] " rguenther at suse dot de
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: law at gcc dot gnu.org @ 2024-09-02 16:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

            Bug ID: 116573
           Summary: [15 Regression] Recent SLP work appears to generate
                    significantly worse code on RISC-V
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: law at gcc dot gnu.org
  Target Milestone: ---

This change:

commit 9aaedfc4146c5e4b8412913a6ca4092a2731c35c (HEAD)
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Jul 5 10:35:08 2024 +0200

    load and store-lanes with SLP

    The following is a prototype for how to represent load/store-lanes
    within SLP.  I've for now settled with having a single load node
    with multiple permute nodes acting as selection, one for each loaded lane
    and a single store node fed from all stored lanes.  For
[ ... ]

Is causing a multiple scan failures in the testsuite.  Several are "don't care"
changes in code generation.  But one class seems to indicate a notable
regression in the quality of the generated code:

Before the change for the attached testcase we would generate:

.L3:
        vsetvli a5,a3,e8,m1,ta,ma
        vle8.v  v2,0(a0)
        vle8.v  v3,0(a1)
        slli    a4,a5,1
        sub     a3,a3,a5
        add     a0,a0,a5
        add     a1,a1,a5
        vsseg2e8.v      v2,(a2)
        add     a2,a2,a4
        bne     a3,zero,.L3

Nothing really of note there.  Load up two values, then store them elsewhere
with a segmented store and the usual pointer updates.

After the change:
.L4:
        mv      a6,a3
        mv      a4,a3
        bleu    a3,a5,.L3
        csrr    a4,vlenb
.L3:
        vsetvli zero,a4,e8,m1,ta,ma
        vle8.v  v2,0(a0)
        vle8.v  v3,0(a1)
        sub     a3,a3,a5
        add     a0,a0,a5
        add     a1,a1,a5
        vsseg2e8.v      v2,(a2)
        add     a2,a2,a7
        bgtu    a6,a5,.L4

Ugh.  We've got a conditional branch in the middle of the loop, a CSR read and
a bit of sillyness with those extra move instructions.  Not really sure what
went wrong, but it's a reasonable assumption that this code is less performant
than the original.

Compile with "  -O3 -ftree-vectorize -std=c99 -march=rv32gcv_zvfh -mabi=ilp32d
-mrvv-vector-bits=scalable -fno-vect-cost-model"



typedef unsigned char uint8_t;
typedef signed char int8_t;
#ifndef TYPE
#define TYPE uint8_t
#define ITYPE int8_t
#endif

void __attribute__ ((noinline, noclone))
g2 (TYPE *__restrict a, TYPE *__restrict b, TYPE *__restrict c, ITYPE n)
{
  for (ITYPE i = 0; i < n; ++i)
    {
      c[i * 2] = a[i];
      c[i * 2 + 1] = b[i];
    }
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
@ 2024-09-02 17:37 ` rguenther at suse dot de
  2024-09-02 22:19 ` pinskia at gcc dot gnu.org
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenther at suse dot de @ 2024-09-02 17:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #1 from rguenther at suse dot de <rguenther at suse dot de> ---
> Am 02.09.2024 um 18:48 schrieb law at gcc dot gnu.org <gcc-bugzilla@gcc.gnu.org>:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573
> 
> Jeffrey A. Law <law at gcc dot gnu.org> changed:
> 
>           What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                 CC|                            |rguenth at gcc dot gnu.org

Can you explain the setvli difference?  I think some of the folks adding the
with-Len-and-mask IV support have to look, it’s probably going wrong there for
unknown reasons.


> --
> You are receiving this mail because:
> You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
  2024-09-02 17:37 ` [Bug tree-optimization/116573] " rguenther at suse dot de
@ 2024-09-02 22:19 ` pinskia at gcc dot gnu.org
  2024-09-03  6:56 ` rguenth at gcc dot gnu.org
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-09-02 22:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |15.0
             Blocks|                            |53947


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
  2024-09-02 17:37 ` [Bug tree-optimization/116573] " rguenther at suse dot de
  2024-09-02 22:19 ` pinskia at gcc dot gnu.org
@ 2024-09-03  6:56 ` rguenth at gcc dot gnu.org
  2024-09-06  8:48 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-03  6:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I don't see anything suspicious in the GIMPLE dump.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-09-03  6:56 ` rguenth at gcc dot gnu.org
@ 2024-09-06  8:48 ` rguenth at gcc dot gnu.org
  2024-09-12 22:58 ` juzhe.zhong at rivai dot ai
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-06  8:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-09-06
             Status|UNCONFIRMED                 |NEW

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So when investigating "future" fallout I've seen similar differences for
gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add-1.c for example with the
GIMPLE difference being that before we used .SELECT_VL but afterwards
there's a MIN_EXPR to compute the length.

I've tried to read up the RVV specification but there doesn't seem to be
a good overall operand documentation for vsetvli :(  I tried to understand

.L6:
        mv      a4,a3
        bleu    a3,a5,.L5  // this is likely the MIN?
        csrr    a4,vlenb   // save VLEN to a4(?)
.L5:
        vsetvli zero,a4,e8,m1,ta,ma // set VLEN to a4 and store new VLEN to
'zero'(?)
        vle8.v  v1,0(a1)
        vle8.v  v2,0(a2)
        vsetvli a6,zero,e8,m1,ta,ma // set VLEN to zero?!
        vsaddu.vv       v1,v1,v2
        vsetvli zero,a4,e8,m1,ta,ma // set VLEN to a4 again
        vse8.v  v1,0(a0)
        add     a1,a1,a5
        add     a2,a2,a5
        add     a0,a0,a5
        mv      a4,a3
        sub     a3,a3,a5
        bgtu    a4,a5,.L6

I think the GIMPLE looks straight-forward but the code the backend generates
looks bad, possibly the vsetvli pass is lacking here.

Now, the vectorizer doesn't use .SELECT_VL because

      if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
                                          OPTIMIZE_FOR_SPEED)
          && LOOP_VINFO_LENS (loop_vinfo).length () == 1
          && LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1 && !slp
          && (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
              || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
        LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;

see the !slp - the comment doesn't explain why, but for example
vectorizable_induction simply asserts !slp_node when
LOOP_VINFO_USING_SELECT_VL_P.  I would have expected it to be handled
more like LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P and be disabled when
we cannot handle code generation for a feature.

Simply removing the && !slp fixes the particular testcase above for me.

I'll leave this bug and the fallout to Ju-Zhe Zhong who added
LOOP_VINFO_USING_SELECT_VL_P support.

Anyway, confirmed.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-09-06  8:48 ` rguenth at gcc dot gnu.org
@ 2024-09-12 22:58 ` juzhe.zhong at rivai dot ai
  2024-09-13  6:06 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-09-12 22:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #4 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #3)
> So when investigating "future" fallout I've seen similar differences for
> gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add-1.c for example with the
> GIMPLE difference being that before we used .SELECT_VL but afterwards
> there's a MIN_EXPR to compute the length.
> 
> I've tried to read up the RVV specification but there doesn't seem to be
> a good overall operand documentation for vsetvli :(  I tried to understand
> 
> .L6:
>         mv      a4,a3
>         bleu    a3,a5,.L5  // this is likely the MIN?
>         csrr    a4,vlenb   // save VLEN to a4(?)
> .L5:
>         vsetvli zero,a4,e8,m1,ta,ma // set VLEN to a4 and store new VLEN to
> 'zero'(?)
>         vle8.v  v1,0(a1)
>         vle8.v  v2,0(a2)
>         vsetvli a6,zero,e8,m1,ta,ma // set VLEN to zero?!
>         vsaddu.vv       v1,v1,v2
>         vsetvli zero,a4,e8,m1,ta,ma // set VLEN to a4 again
>         vse8.v  v1,0(a0)
>         add     a1,a1,a5
>         add     a2,a2,a5
>         add     a0,a0,a5
>         mv      a4,a3
>         sub     a3,a3,a5
>         bgtu    a4,a5,.L6
> 
> I think the GIMPLE looks straight-forward but the code the backend generates
> looks bad, possibly the vsetvli pass is lacking here.
> 
> Now, the vectorizer doesn't use .SELECT_VL because
> 
>       if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
>                                           OPTIMIZE_FOR_SPEED)
>           && LOOP_VINFO_LENS (loop_vinfo).length () == 1
>           && LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1 && !slp
>           && (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>               || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
>         LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
> 
> see the !slp - the comment doesn't explain why, but for example
> vectorizable_induction simply asserts !slp_node when
> LOOP_VINFO_USING_SELECT_VL_P.  I would have expected it to be handled
> more like LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P and be disabled when
> we cannot handle code generation for a feature.
> 
> Simply removing the && !slp fixes the particular testcase above for me.
> 
> I'll leave this bug and the fallout to Ju-Zhe Zhong who added
> LOOP_VINFO_USING_SELECT_VL_P support.
> 
> Anyway, confirmed.

Hi, Richard. Thanks for the guide.

For RVV, we apply SELECT_VL to calculate the number of elements to be processed
when it's single rgroup (control) vectorization
since multiple rgroup vectorization will become more complicate and codengen
seems to be worse while using SELECT_VL to calculate the number of elements so
we prefer to using MIN in this scenario.

Single rgroup:

#define N 16
int src[N];
int dest[N];

void
foo (int n)
{
  for (int i = 0; i < n; i++)
    dest[i] = src[i] + 2;
}
Expect to use SELECT_VL:

_27 = .SELECT_VL (ivtmp_25, POLY_INT_CST [4, 4]);
  ivtmp_3 = _27 * 4;
  vect__1.8_17 = .MASK_LEN_LOAD (vectp_src.6_15, 32B, { -1, ... }, _27, 0);
  vect__2.9_19 = vect__1.8_17 + { 2, ... };
  .MASK_LEN_STORE (vectp_dest.10_21, 32B, { -1, ... }, _27, 0, vect__2.9_19);

Assembly:
.L3:
        vsetvli a5,a0,e32,m1,ta,ma
        vle32.v v1,0(a3)
        slli    a2,a5,2
        sub     a0,a0,a5
        add     a3,a3,a2
        vadd.vi v1,v1,2
        vse32.v v1,0(a4)
        add     a4,a4,a2
        bne     a0,zero,.L3


Wheras in this case (multiple rgroup):

We expect to use MIN to calculate elements:

  _42 = MIN_EXPR <ivtmp.18_5, POLY_INT_CST [16, 16]>;
  loop_len_32 = MIN_EXPR <_42, POLY_INT_CST [8, 8]>;
  loop_len_31 = _42 - loop_len_32;
  _47 = (void *) ivtmp.17_2;
  _48 = &MEM <vector([8,8]) short int> [(short int *)_47];
  .MASK_LEN_STORE (_48, 16B, { -1, ... }, loop_len_32, 0, { 1, 2, ... });
  _49 = (void *) ivtmp.21_10;
  _50 = &MEM <vector([8,8]) short int> [(short int *)_49];
  .MASK_LEN_STORE (_50, 16B, { -1, ... }, loop_len_31, 0, { 1, 2, ... });
  _21 = reciptmp_26 * loop_len_32;
  _51 = (void *) ivtmp.15_36;
  _52 = &MEM <vector([4,4]) int> [(int *)_51];
  .MASK_LEN_STORE (_52, 32B, { -1, ... }, _21, 0, { 3, ... });
  _38 = reciptmp_26 * loop_len_31;
  _53 = (void *) ivtmp.22_17;
  _54 = &MEM <vector([4,4]) int> [(int *)_53];
  .MASK_LEN_STORE (_54, 32B, { -1, ... }, _38, 0, { 3, ... });

So, If I am understanding correctly, it seems that Richard has change
vectorizer that all auto-vectorization are represented as SLP instance ?

So the !slp is not the correct condition in this case.

It seems to change it into :

LOOP_VINFO_SLP_INSTANCES.size() == 1?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-09-12 22:58 ` juzhe.zhong at rivai dot ai
@ 2024-09-13  6:06 ` rguenth at gcc dot gnu.org
  2024-09-13  7:51 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-13  6:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to JuzheZhong from comment #4)
> So, If I am understanding correctly, it seems that Richard has change
> vectorizer that all auto-vectorization are represented as SLP instance ?

Yes.

> So the !slp is not the correct condition in this case.

Correct.

> It seems to change it into :
> 
> LOOP_VINFO_SLP_INSTANCES.size() == 1?

AFAIU a single control group means that all scalar elements have the same
size?  Having a single SLP instance does not guarantee this nor would
having multiple SLP instances mean we have multiple control groups.

Instead I would have expected we compute control groups correctly with SLP
and thus can check whether we have multiple or a single one.

As said, I have no idea why that !slp condition is there - it definitely
seems to be a premature early out for SLP that shouldn't be necessary. I
would expect that you added that because you either didn't want to think
about SLP or that there were bugs you didn't want to chase down?

I'll throw removing that condition on the CI - I think that's the correct
thing to do.  And of course properly constrain SELECT_VL where necessary
then, like it must be done for non-SLP operation already.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2024-09-13  6:06 ` rguenth at gcc dot gnu.org
@ 2024-09-13  7:51 ` rguenth at gcc dot gnu.org
  2024-09-17  8:56 ` rdapp at gcc dot gnu.org
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-13  7:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, it does seem to be a correctness issue as well, I see multiple execute
FAILs in the gcc.dg/vect testsuite when removing the check and running with
-march=rv64gcv.

So I would have expected, similar to how we handle
LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P, that .SELECT_VL usage is disabled by
stmt level analysis when it cannot be used (possibly by vect_record_loop_len
which _should_ have all the required information?).

That said, analyzing a FAIL, like for example gcc.dg/vect/vect-vfa-slp.c
which looks very simple would help.  It seems that we fail to multiply
the .SELECT_VL result by the SLP group size there?

.L8:
        vsetvli a5,a4,e16,m1,ta,ma
        vle16.v v1,0(a0)
        slli    a3,a5,2
        sub     a4,a4,a5
        add     a0,a0,a3
        vadd.vv v1,v1,v2
        vse16.v v1,0(a1)
        add     a1,a1,a3
        bne     a4,zero,.L8

So to preserve previous behavior, instead of checking for !slp verifying
that each SLP instance only has single-lane operations (note though that
the stores and loads will still be represented as multi-lane, but
load-/store-lanes might work).  But as in principle SLP instances can
fork/merge controlling this individually would make more sense.

I don't know what the constraints are for vsetvli, but for the above case
I think we'd want to feed it 2*a4 since we know we uniformly need two
elements per iteration per vector.  I'd hope when feeding vsetvli for
example 12 as length that it will never actually set the length to an
odd number.  Similar with three elements per iteration, when we feed it,
say 9, how can we make sure it will set the length to a multiple of three?

It looks like you side-stepped all this by disabling .SELECT_VL with SLP.
As said, it should still work for uniform single-lane cases and for loads
when using load/store-lanes exclusively.  I can see to implement that as
check where we currently have the !slp check though as said, more carefully
handling cases we could support would be nice - like the above case with
two elements?  As said, I don't know the constraints RVV places on
implementations here and the spec isn't exactly helpful either.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2024-09-13  7:51 ` rguenth at gcc dot gnu.org
@ 2024-09-17  8:56 ` rdapp at gcc dot gnu.org
  2024-09-17  8:59 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-09-17  8:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #7 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I'm testing a patch that basically does what Richi proposes.

I was also playing around with mixed lane configurations where we potentially
reuse the pointer increment from another pointer update.  To me the code looked
promising and I think we could at least make it work for a subset of lane
configurations.  I didn't manage to get everything correct, though so the patch
tries to only restore the status quo. 

Some info about vsetvl because the question also came up on the cauldron -
according to the vector spec it has the (for the compiler) annoying  property
that it can basically set the length freely within a certain range.  This is
for load-balancing reasons and intended to give hardware implementations more
freedom.  (I'm not sure that is a useful tradeoff as the compiler's freedom is
significantly reduced)

vsetvl takes the "application vector length" (AVL) so the total number of
elements the whole loop wants to process and returns a vl.
VLMAX is the maximum number of elements a single vector (or vector group with
LMUL) can hold.

If the AVL is larger than VLMAX but <= 2 * VLMAX vsetvl can set vl to a value
inside the range
[ceil(AVL / 2), VLMAX].
So for e.g. AVL = 37, ceil(37/2) = 19 would, unfortunately, be a legal vl
value.
For the other possible values of AVL (<= VLMAX, > 2*VLMAX) the behavior is as
expected.

My hope is that most hardware implementations would take a saner approach and
have vsetvl always act as a "min (AVL, VLMAX)".  That would enable easy scalar
evolution and would possible also allow mixed-lane settings with reuse of the
vl value.  I suppose we could have a target hook or target query mechanism that
asks for "sane" behavior of vsetvl?  Thus we could have optimized SELECT_VL
behavior for those implementations.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-09-17  8:56 ` rdapp at gcc dot gnu.org
@ 2024-09-17  8:59 ` rguenth at gcc dot gnu.org
  2024-09-18  7:10 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-17  8:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, I've written up a patch as well:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 62c7f90779f..199d79029e4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3078,10 +3078,23 @@ start_over:
       if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
                                          OPTIMIZE_FOR_SPEED)
          && LOOP_VINFO_LENS (loop_vinfo).length () == 1
-         && LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1 && !slp
+         && LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1
          && (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
              || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
        LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
+
+      /* If any of the SLP instances cover more than a single lane
+        we cannot use .SELECT_VL at the moment, even if the number
+        of lanes is uniform throughout the SLP graph.  */
+      if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
+       for (slp_instance inst : LOOP_VINFO_SLP_INSTANCES (loop_vinfo))
+         if (SLP_TREE_LANES (SLP_INSTANCE_TREE (inst)) != 1
+             && !(SLP_INSTANCE_KIND (inst) == slp_inst_kind_store
+                  && SLP_INSTANCE_TREE (inst)->ldst_lanes))
+           {
+             LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false;
+             break;
+           }
     }

   /* Decide whether this loop_vinfo should use partial vectors or peeling,

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-09-17  8:59 ` rguenth at gcc dot gnu.org
@ 2024-09-18  7:10 ` rguenth at gcc dot gnu.org
  2024-09-18  7:19 ` juzhe.zhong at rivai dot ai
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-18  7:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with the patch I see tons of "regressions"
(https://github.com/ewlu/gcc-precommit-ci/issues/2248#issuecomment-2355417578)
like for example for
gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add-1.c we go from

.L6:
        mv      a4,a3
        bleu    a3,a5,.L5
        csrr    a4,vlenb
.L5:
        vsetvli zero,a4,e8,m1,ta,ma
        vle8.v  v1,0(a1)
        vle8.v  v2,0(a2)
        vsetvli a6,zero,e8,m1,ta,ma
        vsaddu.vv       v1,v1,v2
        vsetvli zero,a4,e8,m1,ta,ma
        vse8.v  v1,0(a0)
        add     a1,a1,a5
        add     a2,a2,a5
        add     a0,a0,a5
        mv      a4,a3
        sub     a3,a3,a5
        bgtu    a4,a5,.L6

to

.L5:
        vsetvli a5,a3,e8,m1,ta,ma
        vle8.v  v1,0(a1)
        vle8.v  v2,0(a2)
        vsaddu.vv       v1,v1,v2
        vse8.v  v1,0(a0)
        add     a1,a1,a5
        add     a2,a2,a5
        add     a0,a0,a5
        sub     a3,a3,a5
        bne     a3,zero,.L5

so we're now using .SELECT_VL where we didn't before.  Unfortunately all the
regressing testcases are compile-only :/  I think the new variant is OK,
we're accessing all uchar8_t data only, I don't know why we rejected the use
of .SELECT_VL for this earlier.

I do not feel like checking all of the 900 additional FAILs that appear in
the CI though.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2024-09-18  7:10 ` rguenth at gcc dot gnu.org
@ 2024-09-18  7:19 ` juzhe.zhong at rivai dot ai
  2024-09-18  7:25 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-09-18  7:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #10 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #9)
> So with the patch I see tons of "regressions"
> (https://github.com/ewlu/gcc-precommit-ci/issues/2248#issuecomment-
> 2355417578) like for example for
> gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add-1.c we go from
> 
> .L6:
>         mv      a4,a3
>         bleu    a3,a5,.L5
>         csrr    a4,vlenb
> .L5:
>         vsetvli zero,a4,e8,m1,ta,ma
>         vle8.v  v1,0(a1)
>         vle8.v  v2,0(a2)
>         vsetvli a6,zero,e8,m1,ta,ma
>         vsaddu.vv       v1,v1,v2
>         vsetvli zero,a4,e8,m1,ta,ma
>         vse8.v  v1,0(a0)
>         add     a1,a1,a5
>         add     a2,a2,a5
>         add     a0,a0,a5
>         mv      a4,a3
>         sub     a3,a3,a5
>         bgtu    a4,a5,.L6
> 
> to
> 
> .L5:
>         vsetvli a5,a3,e8,m1,ta,ma
>         vle8.v  v1,0(a1)
>         vle8.v  v2,0(a2)
>         vsaddu.vv       v1,v1,v2
>         vse8.v  v1,0(a0)
>         add     a1,a1,a5
>         add     a2,a2,a5
>         add     a0,a0,a5
>         sub     a3,a3,a5
>         bne     a3,zero,.L5
> 
> so we're now using .SELECT_VL where we didn't before.  Unfortunately all the
> regressing testcases are compile-only :/  I think the new variant is OK,
> we're accessing all uchar8_t data only, I don't know why we rejected the use
> of .SELECT_VL for this earlier.
> 
> I do not feel like checking all of the 900 additional FAILs that appear in
> the CI though.

Thanks Richard. I think your SELECT_VL is good to go.
The RISC-V testcases FAILs can leave them to RISC-V folks to fix them.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2024-09-18  7:19 ` juzhe.zhong at rivai dot ai
@ 2024-09-18  7:25 ` rguenth at gcc dot gnu.org
  2024-09-19 11:28 ` cvs-commit at gcc dot gnu.org
  2024-09-19 11:33 ` rguenth at gcc dot gnu.org
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-18  7:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 59135
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59135&action=edit
patch I tested

This is the patch for the CI with the observed issue.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2024-09-18  7:25 ` rguenth at gcc dot gnu.org
@ 2024-09-19 11:28 ` cvs-commit at gcc dot gnu.org
  2024-09-19 11:33 ` rguenth at gcc dot gnu.org
  13 siblings, 0 replies; 15+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-09-19 11:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #12 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:5e3a4a01785e2d5135528a07bb8116af9c55ddf8

commit r15-3712-g5e3a4a01785e2d5135528a07bb8116af9c55ddf8
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Sep 17 11:20:10 2024 +0200

    tree-optimization/116573 - .SELECT_VL for SLP

    The following restores the use of .SELECT_VL for testcases where it
    is safe to use even when using SLP.  I've for now restricted it
    to single-lane SLP plus optimistically allow store-lane nodes
    and assume single-lane roots are not widened but at most to
    load-lane who should be fine.

            PR tree-optimization/116573
            * tree-vect-loop.cc (vect_analyze_loop_2): Allow .SELECV_VL
            for SLP but disable it when there's multi-lane instances.
            * tree-vect-stmts.cc (vectorizable_store): Only compute the
            ptr increment when generating code.
            (vectorizable_load): Likewise.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V
  2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2024-09-19 11:28 ` cvs-commit at gcc dot gnu.org
@ 2024-09-19 11:33 ` rguenth at gcc dot gnu.org
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-09-19 11:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
This particular noticed issue should be fixed now.  Others might exist - please
open separate bugs for distinct causes.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-09-19 11:33 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-02 16:47 [Bug tree-optimization/116573] New: [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V law at gcc dot gnu.org
2024-09-02 17:37 ` [Bug tree-optimization/116573] " rguenther at suse dot de
2024-09-02 22:19 ` pinskia at gcc dot gnu.org
2024-09-03  6:56 ` rguenth at gcc dot gnu.org
2024-09-06  8:48 ` rguenth at gcc dot gnu.org
2024-09-12 22:58 ` juzhe.zhong at rivai dot ai
2024-09-13  6:06 ` rguenth at gcc dot gnu.org
2024-09-13  7:51 ` rguenth at gcc dot gnu.org
2024-09-17  8:56 ` rdapp at gcc dot gnu.org
2024-09-17  8:59 ` rguenth at gcc dot gnu.org
2024-09-18  7:10 ` rguenth at gcc dot gnu.org
2024-09-18  7:19 ` juzhe.zhong at rivai dot ai
2024-09-18  7:25 ` rguenth at gcc dot gnu.org
2024-09-19 11:28 ` cvs-commit at gcc dot gnu.org
2024-09-19 11:33 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).