The new  patch looks reasonable to me now. Thanks for fixing it.

Could you append testcase after finishing test infrastructure ?
I prefer this patch with testcase after infrastructure. 

Thanks.


juzhe.zhong@rivai.ai
 
From: Joern Rennecke
Date: 2023-08-15 16:12
To: 钟居哲
CC: Jeff Law; gcc-patches; kito.cheng; kito.cheng; rdapp.gcc
Subject: Re: Re: cpymem for RISCV with v extension
On Sat, 5 Aug 2023 at 00:35, 钟居哲 <juzhe.zhong@rivai.ai> wrote:
>
> >> Testing what specifically?  Are you asking for correctness tests,
> >> performance/code quality tests?
>
> Add memcpy test using RVV instructions, just like we are adding testcases for auto-vectorization support.
 
I wanted to get in the test infrastructure first.
 
> void foo (int32_t * a, int32_t * b, int num)
> {
>   memcpy (a, b, num);
> }
>
>
> In my downstream LLVM/GCC codegen:
> foo:
> .L2:
>         vsetvli a5,a2,e8,m8,ta,ma
>         vle8.v  v24,(a1)
>         sub     a2,a2,a5
>         vse8.v  v24,(a0)
>         add     a1,a1,a5
>         add     a0,a0,a5
>         bne     a2,zero,.L2
>         ret
 
Yeah, it does that.
 
>
> Another example:
> void foo (int32_t * a, int32_t * b, int num)
> {
>   memcpy (a, b, 4);
> }
>
>
> My downstream LLVM/GCC assembly:
>
> foo:
> vsetvli zero,16,e8,m1,ta,ma
> vle8.v v24,(a1)
> vse8.v v24,(a0)
> ret
 
copying 16 bytes when asked to copy 4 is problematic.  Mine copies 4.
 
Note also for:
typedef struct { int a[31]; } s;
 
void foo (s *a, s *b)
{
  *a = *b;
}
 
You get:
 
        vsetivli        zero,31,e32,m8,ta,ma
        vle32.v v8,0(a1)
        vse32.v v8,0(a0)
 
Using memcpy, the compiler unfortunately discards the alignment.
 
> emit_insn (gen_pred_store...)
 
Thanks to pointing me in the right direction.  From the naming of the
patterns, the dearth of comments, and the default behaviour of the
compiler when optimizing with generic optimization options (i.e. no
vectorization) I had assumed that the infrastructure was still
missing.
 
I have attached a re-worked patch that uses pred_mov / pred_store and
as adapted to the refactored modes.
It lacks the strength reduction of the opaque pattern version for -O3,
though.  Would people also like to see that expanded into RTL?  Or
should I just drop in the opaque pattern for that?  Or not at all,
because everyone uses Superscalar Out-Of-Order execution?