Fortunately, we won't have aggregates or arrays of vbool*_t in the future, so I don't think this is an issue.

juzhe.zhong@rivai.ai

From: Richard Biener
Date: 2023-03-02 16:25
To: juzhe.zhong
CC: richard.sandiford; pan2.li; gcc-patches; Pan Li; kito.cheng
Subject: Re: Re: [PATCH] RISC-V: Bugfix for rvv bool mode precision adjustment

On Thu, 2 Mar 2023, juzhe.zhong@rivai.ai wrote:

> >> Does the eventual value set by ADJUST_BYTESIZE equal the real number of
> >> bytes loaded by vlm.v and stored by vsm.v (after the appropriate vsetvl)?
> >> Or is the GCC size larger in some cases than the number of bytes
> >> loaded and stored?
> For VNx1BI, VNx2BI, VNx4BI and VNx8BI, we allocate the larger size of memory or stack for register spilling
> according to ADJUST_BYTESIZE.
> After the appropriate vsetvl, VNx1BI loads/stores 1/8 of ADJUST_BYTESIZE (vsetvl e8mf8).
> After the appropriate vsetvl, VNx2BI loads/stores 2/8 of ADJUST_BYTESIZE (vsetvl e8mf4).
> After the appropriate vsetvl, VNx4BI loads/stores 4/8 of ADJUST_BYTESIZE (vsetvl e8mf2).
> After the appropriate vsetvl, VNx8BI loads/stores 8/8 of ADJUST_BYTESIZE (vsetvl e8m1).
>
> Note: apart from these 4 machine modes, for all other RVV machine modes ADJUST_BYTESIZE
> equals the real number of bytes of the load/store instruction that the RVV ISA defines.
>
> Well, as I said, it's fine that we allocate larger memory for VNx1BI, VNx2BI and VNx4BI;
> we can emit the appropriate vsetvl to guarantee correctness in the RISC-V backend according
> to the machine_mode, as long as GCC doesn't do the incorrect elimination in the middle-end.
>
> Besides, poly (1,1) is 1/8 of the machine vector-length, which is already a really small number,
> and it is the real number of bytes loaded/stored for VNx8BI.
> You can say VNx1BI, VNx2BI and VNx4BI consume more memory than we actually load/store with the appropriate vsetvl,
> since they have the same ADJUST_BYTESIZE as VNx8BI.
> However, I think it's totally fine so far, as long as we can
> guarantee correctness, and I think optimizing this memory consumption is a trivial matter.
>
> >> And does it equal the size of the corresponding LLVM machine type?
>
> Well, for some reason, in the case of register spilling, LLVM consumes much more memory than GCC.
> They always do a whole-register load/store (a single vector register of vector-length) for register spilling.
> That's another story (I am not going to talk too much about this since it's quite an ugly implementation).
> They don't model the types accurately according to the RVV ISA for register spilling.
>
> In the case of a normal load/store like:
> vbool8_t v2 = *(vbool8_t*)in; *(vbool8_t*)(out + 100) = v2;
> the load/store instructions they generate are accurate.
> Even though their instructions are accurate in terms of load/store access behavior, I am not sure whether the size
> of their machine type is accurate.
>
> For example, in the IR representation: VNx1BI of GCC is represented as vscale x 1 x i1 and
> VNx2BI of GCC is represented as vscale x 2 x i1
> in LLVM IR.
> I am not sure of the bytesize of vscale x 1 x i1 and vscale x 2 x i1.
> I didn't take a deep look at it.
>
> I think this question is not that important. No matter whether VNx1BI and VNx2BI are modeled accurately in terms of ADJUST_BYTESIZE
> in GCC, or vscale x 1 x i1 and vscale x 2 x i1 are modeled accurately in terms of their bytesize,
> as long as we can emit the appropriate vsetvl + vlm/vsm, it's totally fine for RVV, even though in some cases the memory allocation
> is not accurate in the compiler.

I'm not sure how it works for variable-length types, but isn't
sizeof (vbool8_t) part of the ABI, and thus its TYPE_SIZE /
GET_MODE_SIZE are relevant there? It might of course be that
you can never have these types as part of aggregates or arrays, or have
objects of them address-taken, in which case the issue is moot?

Richard.
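
The relationship between the allocated size and the bytes actually transferred can be checked with a small model. This is only an illustrative sketch, not GCC or LLVM code: the function and table names are hypothetical, it assumes a minimum VLEN of 64 bits (so that poly (1,1) bytes equals VLEN / 64), and it relies on the RVV rule that vlm.v/vsm.v transfer ceil(vl / 8) bytes.

```python
# Toy model of the four small RVV mask modes; not a GCC or LLVM API.
# Assumes a minimum VLEN of 64 bits, so poly (1,1) bytes == VLEN / 64.
MASK_BITS_PER_CHUNK = {"VNx1BI": 1, "VNx2BI": 2, "VNx4BI": 4, "VNx8BI": 8}


def allocated_bytes(vlen_bits):
    """Bytes reserved per the thread's ADJUST_BYTESIZE: poly (1,1)."""
    return vlen_bits // 64


def accessed_bytes(mode, vlen_bits):
    """Bytes moved by vlm.v/vsm.v, which transfer ceil(vl / 8) bytes."""
    vl = MASK_BITS_PER_CHUNK[mode] * (vlen_bits // 64)  # mask bits in the mode
    return (vl + 7) // 8


# Compare accessed and allocated bytes for each mode at VLEN = 512.
for mode in MASK_BITS_PER_CHUNK:
    print(mode, accessed_bytes(mode, 512), "of", allocated_bytes(512), "bytes")
```

With VLEN = 512 the accessed sizes are exactly 1/8, 2/8, 4/8 and 8/8 of the allocation. At smaller VLEN the access rounds up to a whole byte, but in this model it never exceeds the allocated size.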
>
> juzhe.zhong@rivai.ai
>
> From: Richard Sandiford
> Date: 2023-03-02 00:14
> To: Li, Pan2
> CC: juzhe.zhong@rivai.ai; rguenther; gcc-patches; Pan Li; kito.cheng
> Subject: Re: [PATCH] RISC-V: Bugfix for rvv bool mode precision adjustment
> "Li, Pan2" writes:
> > Thanks all for so much valuable and helpful material.
> >
> > As I understand it (please correct me if I am mistaken), for the VNx*BI modes (namely 1, 2, 4, 8, 16, 32, 64),
> > the precision and mode size need to be adjusted as below.
> >
> > Precision size [1, 2, 4, 8, 16, 32, 64]
> > Mode size [1, 1, 1, 1, 2, 4, 8]
> >
> > Given that, if we ignore the self-test failure, the adjust_precision part alone is able to fix the bug I mentioned.
> > genmodes will first get the precision, and then use mode_size = exact_div / 8 to generate the size.
> > Meanwhile, it also provides the adjust_mode_size after the mode_size generation.
> >
> > The riscv part has the mode_size_adjust already, and the value of mode_size will be overridden by the adjustments.
>
> Ah, OK! In that case, would the following help:
>
> Turn:
>
>   mode_size[E_%smode] = exact_div (mode_precision[E_%smode], BITS_PER_UNIT);
>
> into:
>
>   if (!multiple_p (mode_precision[E_%smode], BITS_PER_UNIT,
>                    &mode_size[E_%smode]))
>     mode_size[E_%smode] = -1;
>
> where -1 is an "obviously wrong" value.
>
> Ports that might hit the -1 are then responsible for setting the size
> later, via ADJUST_BYTESIZE.
>
> After all the adjustments are complete, genmodes asserts that no size is
> known_eq to -1.
>
> That way, target-independent code doesn't need to guess what the
> correct behaviour is.
>
> Does the eventual value set by ADJUST_BYTESIZE equal the real number of
> bytes loaded by vlm.v and stored by vsm.v (after the appropriate vsetvl)?
> And does it equal the size of the corresponding LLVM machine type?
> Or is the GCC size larger in some cases than the number of bytes
> loaded and stored?
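
The sentinel scheme proposed above can be sketched outside of genmodes. The following is a minimal model under stated assumptions, not the real generator: genmodes emits C++ that works on poly_ints, whereas this sketch uses plain integers (the per-chunk values from the Precision/Mode size table above), and the names `initial_size`, `final_sizes`, `PRECISIONS` and `RISCV_ADJUST` are hypothetical.

```python
BITS_PER_UNIT = 8
UNSET = -1  # the "obviously wrong" sentinel from the proposal


def initial_size(precision_bits):
    """Use precision / 8 when it divides exactly, else leave the sentinel."""
    if precision_bits % BITS_PER_UNIT == 0:
        return precision_bits // BITS_PER_UNIT
    return UNSET


def final_sizes(precisions, adjust_bytesize):
    """Apply port overrides, then check that no sentinel survived."""
    sizes = {mode: initial_size(bits) for mode, bits in precisions.items()}
    sizes.update(adjust_bytesize)  # stands in for ADJUST_BYTESIZE
    missing = [mode for mode, size in sizes.items() if size == UNSET]
    if missing:
        raise ValueError(f"no ADJUST_BYTESIZE for: {missing}")
    return sizes


PRECISIONS = {f"VNx{n}BI": n for n in (1, 2, 4, 8, 16, 32, 64)}
RISCV_ADJUST = {"VNx1BI": 1, "VNx2BI": 1, "VNx4BI": 1}  # hypothetical overrides

print(final_sizes(PRECISIONS, RISCV_ADJUST))
```

The point of the design is that target-independent code never guesses: a mode whose precision is not a whole number of bytes stays at the sentinel until the port sets a size, and a port that forgets to do so fails loudly instead of silently getting a rounded value.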
>
> (You and Juzhe have probably answered that question before, sorry,
> but I'm still not 100% sure of the answer. Personally, I think I would
> find the ISA behaviour easier to understand if the explanation doesn't
> involve poly_ints. It would be good to understand things "as the
> architecture sees them" rather than in terms of GCC concepts.)
>
> Thanks,
> Richard
>
> > Unfortunately, the early-stage mode_size generation uses exact_div, which doesn't honor a precision size < 8
> > with the adjustment and fails on exact_div assertions.
> >
> > Besides the precision adjustment, I am not sure if we can narrow down the problem to:
> >
> >
> > 1. Define the real size of both the precision and mode size to align with the RISC-V ISA.
> > 2. Make the general mode_size = precision_size / 8 able to take care of both the exact_div case and the case where the dividend is less than the divisor (like 1/8 or 2/8).
> >
> > Could you please share your professional suggestions about this? Thank you all again and have a nice day!
> >
> > Pan
> >
> > From: juzhe.zhong@rivai.ai
> > Sent: Wednesday, March 1, 2023 10:19 PM
> > To: rguenther
> > Cc: richard.sandiford; gcc-patches; Pan Li; Li, Pan2; kito.cheng
> > Subject: Re: Re: [PATCH] RISC-V: Bugfix for rvv bool mode precision adjustment
> >
> >>> So given the above I think that modeling the size as being the same
> >>> but with accurate precision would work. It's then only the size of the
> >>> padding in bytes we cannot represent with poly-int which should be fine.
> >
> >>> Correct?
> > Yes.
> >
> >>> Btw, is storing a VNx1BI and then loading a VNx2BI from the same
> >>> memory address well-defined? That is, how is the padding handled
> >>> by the machine load/store instructions?
> >
> > Storing VNx1BI stores the data at addr 0 ~ 1/8 poly (1,1) and keeps the memory data at addr 1/8 poly (1,1) ~ 2/8 poly (1,1) unchanged.
> > Loading VNx2BI will load 0 ~ 2/8 poly (1,1); note that 0 ~ 1/8 poly (1,1) is the data that we stored above, and 1/8 poly (1,1) ~ 2/8 poly (1,1) is the original memory data.
> > You can see here for this case (LLVM):
> > https://godbolt.org/z/P9e1adrd3
> > foo: # @foo
> >         vsetvli a2, zero, e8, mf8, ta, ma
> >         vsm.v v0, (a0)
> >         vsetvli a2, zero, e8, mf4, ta, ma
> >         vlm.v v8, (a0)
> >         vsm.v v8, (a1)
> >         ret
> >
> > We can also do this in GCC as long as we can differentiate VNx1BI and VNx2BI, and GCC, going by precision, does not eliminate the statements even though
> > they have the same bytesize.
> >
> > First we emit vsetvl e8mf8 + vsm for VNx1BI
> > Then we emit vsetvl e8mf4 + vlm for VNx2BI
> >
> > Thanks.
> > ________________________________
> > juzhe.zhong@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-03-01 22:03
> > To: juzhe.zhong
> > CC: richard.sandiford; gcc-patches; Pan Li; pan2.li; kito.cheng
> > Subject: Re: Re: [PATCH] RISC-V: Bugfix for rvv bool mode precision adjustment
> > On Wed, 1 Mar 2023, Richard Biener wrote:
> >
> >> On Wed, 1 Mar 2023, juzhe.zhong@rivai.ai wrote:
> >>
> >> > Let me first introduce RVV load/store basics and stack allocation.
> >> > For scalable vector memory allocation, we allocate memory according to the machine vector-length.
> >> > To get this CPU vector-length value (runtime invariant but unknown at compile time), we have an instruction called csrr vlenb.
> >> > For example, csrr a5,vlenb stores the vector-length of a single CPU vector register (expressed as a bytesize) in the a5 register.
> >> > The size of a single register in bytes (GET_MODE_SIZE) is the poly value (8,8) bytes. That means after csrr a5,vlenb, a5 holds the size poly (8,8) bytes.
> >> >
> >> > Now, our problem is that VNx1BI, VNx2BI, VNx4BI and VNx8BI have the same bytesize poly (1,1). So their storage consumes the same size.
> >> > Meaning when we want to allocate memory storage or stack for register spills, we should first csrr a5,vlenb, then srli a5,a5,3 (meaning a5 = a5/8).
> >> > Then a5 has the bytesize value of poly (1,1). VNx1BI, VNx2BI, VNx4BI and VNx8BI all follow the same process as I described above. They all consume
> >> > the same memory storage size since we can't model them accurately according to precision or bitsize.
> >> >
> >> > They consume the same storage (I agree it's better to model them more accurately in terms of memory storage consumption).
> >> >
> >> > Well, even though they consume the same size of memory storage, I can make their memory access behavior (load/store) accurate by
> >> > emitting the accurate RVV instructions for them according to the RVV ISA.
> >> >
> >> > VNx1BI, VNx2BI, VNx4BI and VNx8BI consume the same memory storage with size poly (1,1).
> >> > The instructions for these modes are as follows:
> >> > VNx1BI: vsetvl e8mf8 + vlm, loading 1/8 of poly (1,1) storage.
> >> > VNx2BI: vsetvl e8mf4 + vlm, loading 1/4 of poly (1,1) storage.
> >> > VNx4BI: vsetvl e8mf2 + vlm, loading 1/2 of poly (1,1) storage.
> >> > VNx8BI: vsetvl e8m1 + vlm, loading 1 of poly (1,1) storage.
> >> >
> >> > So based on these, it's fine that we don't model VNx1BI, VNx2BI, VNx4BI and VNx8BI accurately according to precision or bitsize.
> >> > This implementation is fine even though their memory storage is not accurate.
> >> >
> >> > However, the problem is that since they have the same bytesize, GCC will think they are the same and do some incorrect statement elimination:
> >> >
> >> > (Note: Load same memory base)
> >> > load v0 VNx1BI from base0
> >> > load v1 VNx2BI from base0
> >> > load v2 VNx4BI from base0
> >> > load v3 VNx8BI from base0
> >> >
> >> > store v0 base1
> >> > store v1 base2
> >> > store v2 base3
> >> > store v3 base4
> >> >
> >> > For this program sequence, GCC will eliminate the last 3 load instructions.
> >> >
> >> > Then it will become:
> >> >
> >> > load v0 VNx1BI from base0 ===> vsetvl e8mf8 + vlm (only load 1/8 of poly size (1,1) memory data)
> >> >
> >> > store v0 base1
> >> > store v0 base2
> >> > store v0 base3
> >> > store v0 base4
> >> >
> >> > This is what we want to fix. I think as long as we have a way to differentiate VNx1BI, VNx2BI, VNx4BI and VNx8BI,
> >> > GCC will not do the incorrect elimination for RVV.
> >> >
> >> > I think it can work fine even though these 4 modes consume an inaccurate memory storage size
> >> > but have accurate load/store memory access behavior.
> >>
> >> So given the above I think that modeling the size as being the same
> >> but with accurate precision would work. It's then only the size of the
> >> padding in bytes we cannot represent with poly-int which should be fine.
> >>
> >> Correct?
> >
> > Btw, is storing a VNx1BI and then loading a VNx2BI from the same
> > memory address well-defined? That is, how is the padding handled
> > by the machine load/store instructions?
> >
> > Richard.
> >

--
Richard Biener
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
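
The incorrect elimination described in the oldest message of the thread can be reproduced with a toy redundancy check. This sketch is only a model under stated assumptions: it is not how any GCC pass is implemented, VLEN is fixed at 512 bits so that every mode covers a whole number of bytes, and `run_loads` is a hypothetical stand-in for "reuse an earlier load from the same address when the modes look identical".

```python
VLEN = 512  # assumed vector length in bits; gives whole-byte sizes
MODES = {"VNx1BI": 1, "VNx2BI": 2, "VNx4BI": 4, "VNx8BI": 8}


def bytesize(mode):
    """Allocated size, poly (1,1): identical for all four modes."""
    return VLEN // 64


def precision(mode):
    """Number of mask bits in the mode: differs per mode."""
    return MODES[mode] * (VLEN // 64)


def vlm(memory, base, mode):
    """Model of vsetvl + vlm.v: read ceil(precision / 8) bytes."""
    nbytes = (precision(mode) + 7) // 8
    return bytes(memory[base:base + nbytes])


def run_loads(memory, base, modes, key):
    """Load each mode from base, reusing a prior load when key(mode) matches."""
    seen = {}
    results = []
    for mode in modes:
        k = key(mode)
        if k not in seen:
            seen[k] = vlm(memory, base, mode)
        results.append(seen[k])
    return results


memory = bytes(range(1, 9))
by_size = run_loads(memory, 0, MODES, bytesize)
by_precision = run_loads(memory, 0, MODES, precision)
print([len(v) for v in by_size])       # bytes seen when keyed on bytesize
print([len(v) for v in by_precision])  # bytes seen when keyed on precision
```

Keyed on bytesize, the last three loads are folded into the first and every store would write only the VNx1BI data. Keyed on precision, the four loads stay distinct, which is the behaviour the patch is after.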