The new patch looks reasonable to me now. Thanks for fixing it.

Could you add a testcase once the test infrastructure is finished?
I would prefer this patch to go in together with a testcase, after the infrastructure.

Thanks.

juzhe.zhong@rivai.ai

From: Joern Rennecke
Date: 2023-08-15 16:12
To: 钟居哲
CC: Jeff Law; gcc-patches; kito.cheng; kito.cheng; rdapp.gcc
Subject: Re: Re: cpymem for RISCV with v extension

On Sat, 5 Aug 2023 at 00:35, 钟居哲 wrote:
>
> >> Testing what specifically?  Are you asking for correctness tests,
> >> performance/code quality tests?
>
> Add memcpy test using RVV instructions, just like we are adding testcases for auto-vectorization support.

I wanted to get in the test infrastructure first.

> void foo (int32_t * a, int32_t * b, int num)
> {
>   memcpy (a, b, num);
> }
>
>
> In my downstream LLVM/GCC codegen:
> foo:
> .L2:
>         vsetvli a5,a2,e8,m8,ta,ma
>         vle8.v  v24,(a1)
>         sub     a2,a2,a5
>         vse8.v  v24,(a0)
>         add     a1,a1,a5
>         add     a0,a0,a5
>         bne     a2,zero,.L2
>         ret

Yeah, it does that.

>
> Another example:
> void foo (int32_t * a, int32_t * b, int num)
> {
>   memcpy (a, b, 4);
> }
>
>
> My downstream LLVM/GCC assembly:
>
> foo:
>         vsetvli zero,16,e8,m1,ta,ma
>         vle8.v  v24,(a1)
>         vse8.v  v24,(a0)
>         ret

Copying 16 bytes when asked to copy 4 is problematic.  Mine copies 4.

Note also for:

typedef struct { int a[31]; } s;

void foo (s *a, s *b)
{
  *a = *b;
}

You get:

        vsetivli        zero,31,e32,m8,ta,ma
        vle32.v v8,0(a1)
        vse32.v v8,0(a0)

Using memcpy, the compiler unfortunately discards the alignment.

> emit_insn (gen_pred_store...)

Thanks for pointing me in the right direction.  From the naming of the
patterns, the dearth of comments, and the default behaviour of the
compiler when optimizing with generic optimization options (i.e. no
vectorization), I had assumed that the infrastructure was still missing.

I have attached a re-worked patch that uses pred_mov / pred_store and
is adapted to the refactored modes.
It lacks the strength reduction of the opaque pattern version for -O3, though.
Would people also like to see that expanded into RTL? Or should I just drop in the opaque pattern for that? Or not at all, because everyone uses Superscalar Out-Of-Order execution?
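
For reference, the byte-count concern above can be checked with a small scalar model of the vsetvli / vle8.v / vse8.v loop. This is only a sketch, not part of the attached patch: model_vsetvli and model_rvv_memcpy are hypothetical names, and the model assumes the hardware grants vl = min (avl, vlmax), which is one choice the V extension permits but not the only one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for vsetvli: assume the granted vector length
   is min (avl, vlmax).  Real implementations may choose a different vl
   when avl lies between vlmax and 2 * vlmax.  */
static size_t
model_vsetvli (size_t avl, size_t vlmax)
{
  return avl < vlmax ? avl : vlmax;
}

/* Scalar model of the strip-mined copy loop quoted above.  Returns the
   number of bytes actually written, so a test can check that it equals
   the requested length and nothing more.  */
static size_t
model_rvv_memcpy (unsigned char *dst, const unsigned char *src,
                  size_t n, size_t vlmax)
{
  size_t written = 0;

  assert (vlmax > 0);
  while (n != 0)
    {
      size_t vl = model_vsetvli (n, vlmax); /* vsetvli a5,a2,e8,...  */
      memcpy (dst, src, vl);                /* vle8.v + vse8.v       */
      n -= vl;                              /* sub a2,a2,a5          */
      src += vl;                            /* add a1,a1,a5          */
      dst += vl;                            /* add a0,a0,a5          */
      written += vl;
    }
  return written;
}
```

Under these assumptions, a request for 4 bytes with vlmax = 16 writes exactly 4 bytes and leaves the fifth destination byte untouched, which is the behaviour a memcpy testcase would want to pin down.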