public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug c/113247] New: RISC-V: Performance bug in SHA256
@ 2024-01-06  0:02 juzhe.zhong at rivai dot ai
  2024-01-09 14:47 ` [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization rdapp at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-06  0:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

            Bug ID: 113247
           Summary: RISC-V: Performance bug in SHA256
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: juzhe.zhong at rivai dot ai
  Target Milestone: ---

We found we have a performance bug while testing with various benchmarks.

This is the case from coremark-pro SHA256:

Tested on spike vector vs scalar in case of dynamic instruction count:

897210195 (vector) vs 418451694 (scalar)

Obviously vector dynamic instruction count as twice as scalar.

We tested on our hardware board and Thead C908 (K230),

Both vector performance drop about 60%+ in the real hardware in case of vector
vs scalar.

After investigation, the performance bug issue happens in the following case:

https://compiler-explorer.com/z/GcsnK7edn

#include <stdint.h>

#define Ch(x,y,z)   (z ^ (x & (y ^ z)))
#define Maj(x,y,z)  ((x & y) | (z & (x | y)))

#define SHR(x, n)    (x >> n)
#define ROTR(x,n)    (SHR(x,n) | (x << (32 - n)))
#define S1(x)        (ROTR(x, 6) ^ ROTR(x,11) ^ ROTR(x,25))
#define S0(x)        (ROTR(x, 2) ^ ROTR(x,13) ^ ROTR(x,22))

#define s1(x)        (ROTR(x,17) ^ ROTR(x,19) ^  SHR(x,10))
#define s0(x)        (ROTR(x, 7) ^ ROTR(x,18) ^  SHR(x, 3))

#define SHA256_STEP(a,b,c,d,e,f,g,h,x,K)                 \
{                                                        \
    tmp1 = h + S1(e) + Ch(e,f,g) + K + x;                \
    tmp2 = S0(a) + Maj(a,b,c);                           \
    h  = tmp1 + tmp2;                                    \
    d += tmp1;                                           \
}

#define BE_LOAD32(n,b,i) (n) = byteswap(*(uint32_t *)(b + i))

static uint32_t byteswap(uint32_t x)
{
    x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;
    x = (x & 0x00FF00FF) << 8 | (x & 0xFF00FF00) >> 8;  

    return x;
}

void sha256 (const uint8_t *in, uint32_t out[8])
{
    uint32_t tmp1, tmp2, a, b, c, d, e, f, g, h;
    uint32_t w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12, w13, w14,
w15;

    tmp1 = tmp2 = 0;
    w0 = w1 = w2 = w3 = w4 = w5 = w6 = w7 = w8 = w9 = w10 = w11 = w12 = w13 =
w14 = w15 = 0;

    BE_LOAD32 (  w0, in,  0 );
    BE_LOAD32 (  w1, in,  4 );
    BE_LOAD32 (  w2, in,  8 );
    BE_LOAD32 (  w3, in, 12 );
    BE_LOAD32 (  w4, in, 16 );
    BE_LOAD32 (  w5, in, 20 );
    BE_LOAD32 (  w6, in, 24 );
    BE_LOAD32 (  w7, in, 28 );
    BE_LOAD32 (  w8, in, 32 );
    BE_LOAD32 (  w9, in, 36 );
    BE_LOAD32 ( w10, in, 40 );
    BE_LOAD32 ( w11, in, 44 );
    BE_LOAD32 ( w12, in, 48 );
    BE_LOAD32 ( w13, in, 52 );
    BE_LOAD32 ( w14, in, 56 );
    BE_LOAD32 ( w15, in, 60 );

    a = out[0];
    b = out[1];
    c = out[2];
    d = out[3];
    e = out[4];
    f = out[5];
    g = out[6];
    h = out[7];

    SHA256_STEP(a, b, c, d, e, f, g, h,  w0, 0x428a2f98);
    SHA256_STEP(h, a, b, c, d, e, f, g,  w1, 0x71374491);
    SHA256_STEP(g, h, a, b, c, d, e, f,  w2, 0xb5c0fbcf);
    SHA256_STEP(f, g, h, a, b, c, d, e,  w3, 0xe9b5dba5);
    SHA256_STEP(e, f, g, h, a, b, c, d,  w4, 0x3956c25b);
    SHA256_STEP(d, e, f, g, h, a, b, c,  w5, 0x59f111f1);
    SHA256_STEP(c, d, e, f, g, h, a, b,  w6, 0x923f82a4);
    SHA256_STEP(b, c, d, e, f, g, h, a,  w7, 0xab1c5ed5);
    SHA256_STEP(a, b, c, d, e, f, g, h,  w8, 0xd807aa98);
    SHA256_STEP(h, a, b, c, d, e, f, g,  w9, 0x12835b01);
    SHA256_STEP(g, h, a, b, c, d, e, f, w10, 0x243185be);
    SHA256_STEP(f, g, h, a, b, c, d, e, w11, 0x550c7dc3);
    SHA256_STEP(e, f, g, h, a, b, c, d, w12, 0x72be5d74);
    SHA256_STEP(d, e, f, g, h, a, b, c, w13, 0x80deb1fe);
    SHA256_STEP(c, d, e, f, g, h, a, b, w14, 0x9bdc06a7);
    SHA256_STEP(b, c, d, e, f, g, h, a, w15, 0xc19bf174);

    w0 = s1(w14) + w9 + s0(w1) + w0;
    SHA256_STEP(a, b, c, d, e, f, g, h,  w0, 0xe49b69c1);
    w1 = s1(w15) + w10 + s0(w2) + w1;
    SHA256_STEP(h, a, b, c, d, e, f, g,  w1, 0xefbe4786);
    w2 = s1(w0) + w11 + s0(w3) + w2;
    SHA256_STEP(g, h, a, b, c, d, e, f,  w2, 0x0fc19dc6);
    w3 = s1(w1) + w12 + s0(w4) + w3;
    SHA256_STEP(f, g, h, a, b, c, d, e,  w3, 0x240ca1cc);
    w4 = s1(w2) + w13 + s0(w5) + w4;
    SHA256_STEP(e, f, g, h, a, b, c, d,  w4, 0x2de92c6f);
    w5 = s1(w3) + w14 + s0(w6) + w5;
    SHA256_STEP(d, e, f, g, h, a, b, c,  w5, 0x4a7484aa);
    w6 = s1(w4) + w15 + s0(w7) + w6;
    SHA256_STEP(c, d, e, f, g, h, a, b,  w6, 0x5cb0a9dc);
    w7 = s1(w5) + w0 + s0(w8) + w7;
    SHA256_STEP(b, c, d, e, f, g, h, a,  w7, 0x76f988da);
    w8 = s1(w6) + w1 + s0(w9) + w8;
    SHA256_STEP(a, b, c, d, e, f, g, h,  w8, 0x983e5152);
    w9 = s1(w7) + w2 + s0(w10) + w9;
    SHA256_STEP(h, a, b, c, d, e, f, g,  w9, 0xa831c66d);
    w10 = s1(w8) + w3 + s0(w11) + w10;
    SHA256_STEP(g, h, a, b, c, d, e, f, w10, 0xb00327c8);
    w11 = s1(w9) + w4 + s0(w12) + w11;
    SHA256_STEP(f, g, h, a, b, c, d, e, w11, 0xbf597fc7);
    w12 = s1(w10) + w5 + s0(w13) + w12;
    SHA256_STEP(e, f, g, h, a, b, c, d, w12, 0xc6e00bf3);
    w13 = s1(w11) + w6 + s0(w14) + w13;
    SHA256_STEP(d, e, f, g, h, a, b, c, w13, 0xd5a79147);
    w14 = s1(w12) + w7 + s0(w15) + w14;
    SHA256_STEP(c, d, e, f, g, h, a, b, w14, 0x06ca6351);
    w15 = s1(w13) + w8 + s0(w0) + w15;
    SHA256_STEP(b, c, d, e, f, g, h, a, w15, 0x14292967);

    w0 = s1(w14) + w9 + s0(w1) + w0;
    SHA256_STEP(a, b, c, d, e, f, g, h,  w0, 0x27b70a85);
    w1 = s1(w15) + w10 + s0(w2) + w1;
    SHA256_STEP(h, a, b, c, d, e, f, g,  w1, 0x2e1b2138);
    w2 = s1(w0) + w11 + s0(w3) + w2;
    SHA256_STEP(g, h, a, b, c, d, e, f,  w2, 0x4d2c6dfc);
    w3 = s1(w1) + w12 + s0(w4) + w3;
    SHA256_STEP(f, g, h, a, b, c, d, e,  w3, 0x53380d13);
    w4 = s1(w2) + w13 + s0(w5) + w4;
    SHA256_STEP(e, f, g, h, a, b, c, d,  w4, 0x650a7354);
    w5 = s1(w3) + w14 + s0(w6) + w5;
    SHA256_STEP(d, e, f, g, h, a, b, c,  w5, 0x766a0abb);
    w6 = s1(w4) + w15 + s0(w7) + w6;
    SHA256_STEP(c, d, e, f, g, h, a, b,  w6, 0x81c2c92e);
    w7 = s1(w5) + w0 + s0(w8) + w7;
    SHA256_STEP(b, c, d, e, f, g, h, a,  w7, 0x92722c85);
    w8 = s1(w6) + w1 + s0(w9) + w8;
    SHA256_STEP(a, b, c, d, e, f, g, h,  w8, 0xa2bfe8a1);
    w9 = s1(w7) + w2 + s0(w10) + w9;
    SHA256_STEP(h, a, b, c, d, e, f, g,  w9, 0xa81a664b);
    w10 = s1(w8) + w3 + s0(w11) + w10;
    SHA256_STEP(g, h, a, b, c, d, e, f, w10, 0xc24b8b70);
    w11 = s1(w9) + w4 + s0(w12) + w11;
    SHA256_STEP(f, g, h, a, b, c, d, e, w11, 0xc76c51a3);
    w12 = s1(w10) + w5 + s0(w13) + w12;
    SHA256_STEP(e, f, g, h, a, b, c, d, w12, 0xd192e819);
    w13 = s1(w11) + w6 + s0(w14) + w13;
    SHA256_STEP(d, e, f, g, h, a, b, c, w13, 0xd6990624);
    w14 = s1(w12) + w7 + s0(w15) + w14;
    SHA256_STEP(c, d, e, f, g, h, a, b, w14, 0xf40e3585);
    w15 = s1(w13) + w8 + s0(w0) + w15;
    SHA256_STEP(b, c, d, e, f, g, h, a, w15, 0x106aa070);

    w0 = s1(w14) + w9 + s0(w1) + w0;
    SHA256_STEP(a, b, c, d, e, f, g, h,  w0, 0x19a4c116);
    w1 = s1(w15) + w10 + s0(w2) + w1;
    SHA256_STEP(h, a, b, c, d, e, f, g,  w1, 0x1e376c08);
    w2 = s1(w0) + w11 + s0(w3) + w2;
    SHA256_STEP(g, h, a, b, c, d, e, f,  w2, 0x2748774c);
    w3 = s1(w1) + w12 + s0(w4) + w3;
    SHA256_STEP(f, g, h, a, b, c, d, e,  w3, 0x34b0bcb5);
    w4 = s1(w2) + w13 + s0(w5) + w4;
    SHA256_STEP(e, f, g, h, a, b, c, d,  w4, 0x391c0cb3);
    w5 = s1(w3) + w14 + s0(w6) + w5;
    SHA256_STEP(d, e, f, g, h, a, b, c,  w5, 0x4ed8aa4a);
    w6 = s1(w4) + w15 + s0(w7) + w6;
    SHA256_STEP(c, d, e, f, g, h, a, b,  w6, 0x5b9cca4f);
    w7 = s1(w5) + w0 + s0(w8) + w7;
    SHA256_STEP(b, c, d, e, f, g, h, a,  w7, 0x682e6ff3);
    w8 = s1(w6) + w1 + s0(w9) + w8;
    SHA256_STEP(a, b, c, d, e, f, g, h,  w8, 0x748f82ee);
    w9 = s1(w7) + w2 + s0(w10) + w9;
    SHA256_STEP(h, a, b, c, d, e, f, g,  w9, 0x78a5636f);
    w10 = s1(w8) + w3 + s0(w11) + w10;
    SHA256_STEP(g, h, a, b, c, d, e, f, w10, 0x84c87814);
    w11 = s1(w9) + w4 + s0(w12) + w11;
    SHA256_STEP(f, g, h, a, b, c, d, e, w11, 0x8cc70208);
    w12 = s1(w10) + w5 + s0(w13) + w12;
    SHA256_STEP(e, f, g, h, a, b, c, d, w12, 0x90befffa);
    w13 = s1(w11) + w6 + s0(w14) + w13;
    SHA256_STEP(d, e, f, g, h, a, b, c, w13, 0xa4506ceb);
    w14 = s1(w12) + w7 + s0(w15) + w14;
    SHA256_STEP(c, d, e, f, g, h, a, b, w14, 0xbef9a3f7);
    w15 = s1(w13) + w8 + s0(w0) + w15;
    SHA256_STEP(b, c, d, e, f, g, h, a, w15, 0xc67178f2);

    out[0] += a;
    out[1] += b;
    out[2] += c;
    out[3] += d;
    out[4] += e;
    out[5] += f;
    out[6] += g;
    out[7] += h;
}

The assembly is quite staigth-forward. RVV GCC vectorizes codes which is
profitable to be vectorized.


Confirm both ARM SVE GCC and RISC-V Clang don't vectorize this code.

So I believe we should teach cost model not vectorize it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
@ 2024-01-09 14:47 ` rdapp at gcc dot gnu.org
  2024-01-09 14:53 ` juzhe.zhong at rivai dot ai
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-01-09 14:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Hmm, so I tried reproducing this and without a vector cost model we indeed
vectorize.  My qemu dynamic instruction count results are not as abysmal as
yours but still bad enough (20-30% increase in dynamic instructions).

However, as soon as I use the vector cost model, enabled by -mtune=generic-ooo,
the sha256 function is not vectorized anymore:

bla.c:95:5: note: Cost model analysis for part in loop 0:
  Vector cost: 294
  Scalar cost: 185
bla.c:95:5: missed: not vectorized: vectorization is not profitable.

Without that we have:
bla.c:95:5: note: Cost model analysis for part in loop 0:
  Vector cost: 173
  Scalar cost: 185
bla.c:95:5: note: Basic block will be vectorized using SLP

(Those costs are obtained via default_builtin_vectorization_cost).

The main difference is vec_to_scalar cost being 1 by default and 2 in our cost
model, as well as vec_perm = 2.  Given our limited permute capabilities I think
a cost of 2 makes sense.  We can also argue in favor of vec_to_scalar = 2
because we need to slide down elements for extraction and cannot extract
directly.  Setting scalar_to_vec = 2 is debatable and I'd rather keep it at 1.

For the future we need to make a decision whether to continue with generic-ooo
as the default vector model or if we want to set latencies to a few uniform
values in order for scheduling not to introduce spilling and waiting for
dependencies.

To help with that decision you could run some benchmarks with the generic-ooo
tuning and see if things get better or worse?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
  2024-01-09 14:47 ` [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization rdapp at gcc dot gnu.org
@ 2024-01-09 14:53 ` juzhe.zhong at rivai dot ai
  2024-01-09 15:02 ` rdapp at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-09 14:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #2 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Robin Dapp from comment #1)
> Hmm, so I tried reproducing this and without a vector cost model we indeed
> vectorize.  My qemu dynamic instruction count results are not as abysmal as
> yours but still bad enough (20-30% increase in dynamic instructions).
> 
> However, as soon as I use the vector cost model, enabled by
> -mtune=generic-ooo, the sha256 function is not vectorized anymore:
> 
> bla.c:95:5: note: Cost model analysis for part in loop 0:
>   Vector cost: 294
>   Scalar cost: 185
> bla.c:95:5: missed: not vectorized: vectorization is not profitable.
> 
> Without that we have:
> bla.c:95:5: note: Cost model analysis for part in loop 0:
>   Vector cost: 173
>   Scalar cost: 185
> bla.c:95:5: note: Basic block will be vectorized using SLP
> 
> (Those costs are obtained via default_builtin_vectorization_cost).
> 
> The main difference is vec_to_scalar cost being 1 by default and 2 in our
> cost model, as well as vec_perm = 2.  Given our limited permute capabilities
> I think a cost of 2 makes sense.  We can also argue in favor of
> vec_to_scalar = 2 because we need to slide down elements for extraction and
> cannot extract directly.  Setting scalar_to_vec = 2 is debatable and I'd
> rather keep it at 1.
> 
> For the future we need to make a decision whether to continue with
> generic-ooo as the default vector model or if we want to set latencies to a
> few uniform values in order for scheduling not to introduce spilling and
> waiting for dependencies.
> 
> To help with that decision you could run some benchmarks with the
> generic-ooo tuning and see if things get better or worse?

generic-ooo is using the generic cost model I added.

Since the default tuning info is rocket which doesn't have cost model, then
we use default builtin cost model.

I think it's more reasonable adjust code in builtin_vectorize_cost:

if (!cost)
  cost = &generic_vector_cost;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
  2024-01-09 14:47 ` [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization rdapp at gcc dot gnu.org
  2024-01-09 14:53 ` juzhe.zhong at rivai dot ai
@ 2024-01-09 15:02 ` rdapp at gcc dot gnu.org
  2024-01-09 15:15 ` rdapp at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-01-09 15:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #3 from Robin Dapp <rdapp at gcc dot gnu.org> ---
Yes, sure and I gave a bit of detail why the values chosen there (same as
aarch64) make sense to me.

Using this generic vector cost model by default without adjusting the latencies
is possible.  I would be OK with such a change but would also rather not have
"rocket" at all by default ;)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (2 preceding siblings ...)
  2024-01-09 15:02 ` rdapp at gcc dot gnu.org
@ 2024-01-09 15:15 ` rdapp at gcc dot gnu.org
  2024-01-10  0:54 ` juzhe.zhong at rivai dot ai
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-01-09 15:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #4 from Robin Dapp <rdapp at gcc dot gnu.org> ---
The other option is to assert that all tune models have at least a vector cost
model rather than NULL...  But not falling back to the builtin costs still
makes sense.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (3 preceding siblings ...)
  2024-01-09 15:15 ` rdapp at gcc dot gnu.org
@ 2024-01-10  0:54 ` juzhe.zhong at rivai dot ai
  2024-01-10  6:29 ` juzhe.zhong at rivai dot ai
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-10  0:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Robin Dapp from comment #4)
> The other option is to assert that all tune models have at least a vector
> cost model rather than NULL...  But not falling back to the builtin costs
> still makes sense.

I have talked with kito. He just want rocket sifive-series-7 those tune chip
has 
no vector hardware so I think it makes more sense we adapt
builtin_vectorize_cost
to set cost to generic vector cost when there is no vector cost in tune info.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (4 preceding siblings ...)
  2024-01-10  0:54 ` juzhe.zhong at rivai dot ai
@ 2024-01-10  6:29 ` juzhe.zhong at rivai dot ai
  2024-01-10 10:00 ` juzhe.zhong at rivai dot ai
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-10  6:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #6 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
I have tried generic-ooo:

https://compiler-explorer.com/z/44dcePczz

There are still a few vectorized codes in the last couple lines of assembler:

        vsetivli        zero,4,e32,m1,ta,ma
        addw    a5,s10,t1
        addw    a0,a0,t3
        addw    a5,a5,t3
        vmv.v.x v2,a0
        ld      s6,56(sp)
        vmv.v.x v1,a5
        ld      s5,88(sp)
        addw    a3,s6,s4
        addw    a5,s5,a1
        vslide1down.vx  v2,v2,a3
        ld      s7,64(sp)
        vslide1down.vx  v1,v1,a5
        ld      s5,96(sp)
        addw    a3,s7,t4
        addw    a5,s5,a6
        vslide1down.vx  v2,v2,a3
        vslide1down.vx  v1,v1,a5
        ld      a5,72(sp)
        ld      a6,104(sp)
        addw    a3,a5,a7
        vslide1down.vx  v2,v2,a3
        addw    a5,a6,a4
        vslide1down.vx  v1,v1,a5
        ld      s1,120(sp)
        vse32.v v2,0(s1)
        addi    a5,s1,16
        vse32.v v1,0(a5)

I suspect this will still lower down the performance.

I can ask Li Pan to test it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (5 preceding siblings ...)
  2024-01-10  6:29 ` juzhe.zhong at rivai dot ai
@ 2024-01-10 10:00 ` juzhe.zhong at rivai dot ai
  2024-01-10 12:35 ` pan2.li at intel dot com
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-10 10:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #7 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Hi, Robin.

I sent a patch switching cost model into generic cost model:

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642428.html

But I tweak the cost back to default cost.

Since current generic cost model has inferior codegen in dynamic-lmul2-7.c,
I think we should switch it first, then we can together tune the cost model
together so that we don't block our work.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (6 preceding siblings ...)
  2024-01-10 10:00 ` juzhe.zhong at rivai dot ai
@ 2024-01-10 12:35 ` pan2.li at intel dot com
  2024-01-10 13:41 ` rdapp at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: pan2.li at intel dot com @ 2024-01-10 12:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

Li Pan <pan2.li at intel dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pan2.li at intel dot com

--- Comment #8 from Li Pan <pan2.li at intel dot com> ---
The performance ratio of sha-test compared to scalar is about -2.2% when build
with -mtune=generic-ooo.

Aka the option -mtune=generic-ooo makes the ratio (compares to scalar) from
-70% to -2.2%. I suppose the negative ratio may be caused by the part mentioned
by Juzhe.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (7 preceding siblings ...)
  2024-01-10 12:35 ` pan2.li at intel dot com
@ 2024-01-10 13:41 ` rdapp at gcc dot gnu.org
  2024-01-11  1:37 ` pan2.li at intel dot com
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rdapp at gcc dot gnu.org @ 2024-01-10 13:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #9 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I also noticed this (likely unwanted) vector snippet and wondered where it is
being created.  First I thought it's a vec_extract but doesn't look like it. 
I'm going to check why we create this.

Pan, the test was on real hardware I suppose?  So regardless of the fact that
we likely want to get rid of the snippet above, would you mind checking whether
generic-ooo has any effect on performance?  Maybe you could try -march=rv64gc
-mtune=generic-ooo.  Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (8 preceding siblings ...)
  2024-01-10 13:41 ` rdapp at gcc dot gnu.org
@ 2024-01-11  1:37 ` pan2.li at intel dot com
  2024-01-15 13:15 ` cvs-commit at gcc dot gnu.org
  2024-01-15 13:24 ` juzhe.zhong at rivai dot ai
  11 siblings, 0 replies; 13+ messages in thread
From: pan2.li at intel dot com @ 2024-01-11  1:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #10 from Li Pan <pan2.li at intel dot com> ---
(In reply to Robin Dapp from comment #9)
> I also noticed this (likely unwanted) vector snippet and wondered where it
> is being created.  First I thought it's a vec_extract but doesn't look like
> it.  I'm going to check why we create this.
> 
> Pan, the test was on real hardware I suppose?  

Yes.

> So regardless of the fact
> that we likely want to get rid of the snippet above, would you mind checking
> whether generic-ooo has any effect on performance?  Maybe you could try
> -march=rv64gc -mtune=generic-ooo.  Thanks.

Sure thing, actually I have some performance data that is under review to make
sure the alignment to the company policy before share to the community. I will
add a new column for generic-ooo.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (9 preceding siblings ...)
  2024-01-11  1:37 ` pan2.li at intel dot com
@ 2024-01-15 13:15 ` cvs-commit at gcc dot gnu.org
  2024-01-15 13:24 ` juzhe.zhong at rivai dot ai
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-01-15 13:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

--- Comment #11 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Pan Li <panli@gcc.gnu.org>:

https://gcc.gnu.org/g:7be87b7d2e330afd14a7cc028f64d88f80e12f40

commit r14-7245-g7be87b7d2e330afd14a7cc028f64d88f80e12f40
Author: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Date:   Mon Jan 15 20:00:14 2024 +0800

    RISC-V: Fix regression (GCC-14 compare with GCC-13.2) of SHA256 from
coremark-pro

    This patch fixes -70% performance drop from GCC-13.2 to GCC-14 with
-march=rv64gcv in real hardware.

    The root cause is incorrect cost model cause inefficient vectorization
which makes us performance drop significantly.

    So this patch does:

    1. Adjust vector to scalar cost by introducing v to scalar reg move.
    2. Adjust vec_construct cost since we does spend NUNITS instructions to
construct the vector.

    Tested on both RV32/RV64 no regression, Rebase to the trunk and commit it
as it is approved by Robin.

            PR target/113247

    gcc/ChangeLog:

            * config/riscv/riscv-protos.h (struct regmove_vector_cost): Add
vector to scalar regmove.
            * config/riscv/riscv-vector-costs.cc (adjust_stmt_cost): Ditto.
            * config/riscv/riscv.cc (riscv_builtin_vectorization_cost): Adjust
vec_construct cost.

    gcc/testsuite/ChangeLog:

            * gcc.target/riscv/rvv/autovec/vls/reduc-19.c: Adapt test.
            * gcc.target/riscv/rvv/autovec/vls/reduc-20.c: Ditto.
            * gcc.target/riscv/rvv/autovec/vls/reduc-21.c: Ditto.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113247-1.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113247-2.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113247-3.c: New test.
            * gcc.dg/vect/costmodel/riscv/rvv/pr113247-4.c: New test.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization
  2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
                   ` (10 preceding siblings ...)
  2024-01-15 13:15 ` cvs-commit at gcc dot gnu.org
@ 2024-01-15 13:24 ` juzhe.zhong at rivai dot ai
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2024-01-15 13:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113247

JuzheZhong <juzhe.zhong at rivai dot ai> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #12 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Fixed

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-01-15 13:24 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-06  0:02 [Bug c/113247] New: RISC-V: Performance bug in SHA256 juzhe.zhong at rivai dot ai
2024-01-09 14:47 ` [Bug target/113247] RISC-V: Performance bug in SHA256 after enabling RVV vectorization rdapp at gcc dot gnu.org
2024-01-09 14:53 ` juzhe.zhong at rivai dot ai
2024-01-09 15:02 ` rdapp at gcc dot gnu.org
2024-01-09 15:15 ` rdapp at gcc dot gnu.org
2024-01-10  0:54 ` juzhe.zhong at rivai dot ai
2024-01-10  6:29 ` juzhe.zhong at rivai dot ai
2024-01-10 10:00 ` juzhe.zhong at rivai dot ai
2024-01-10 12:35 ` pan2.li at intel dot com
2024-01-10 13:41 ` rdapp at gcc dot gnu.org
2024-01-11  1:37 ` pan2.li at intel dot com
2024-01-15 13:15 ` cvs-commit at gcc dot gnu.org
2024-01-15 13:24 ` juzhe.zhong at rivai dot ai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).