>> Not sure if you've covered this already in another thread but IIRC >> RVV uses "with-len" not only for loads and stores but for arithmetic >> instructions as well which is where (3) fails. Fortunately RVV uses >> element counts(?) Yes, RVV uses element count. But I did discover we have bugs for some arithmetic operations. For example, Division, we definitely need len_div (...) like cond_div in ARM SVE. But this is another story. I have support full features of RVV in my downstream GCC and works well for a year (I think fix all potential issue for RVV). So you could image I will post more middle-end patches for RVV auto-vectorization in the future. Thanks. juzhe.zhong@rivai.ai From: Richard Biener Date: 2023-05-22 18:12 To: Richard Sandiford; juzhe.zhong@rivai.ai; gcc-patches; rguenther Subject: Re: [PATCH V11] VECT: Add decrement IV support in Loop Vectorizer On Fri, May 19, 2023 at 12:59 PM Richard Sandiford via Gcc-patches wrote: > > "juzhe.zhong@rivai.ai" writes: > >>> I don't think this is a property of decrementing IVs. IIUC it's really > >>> a property of rgl->factor == 1 && factor == 1, where factor would need > >>> to be passed in by the caller. Because of that, it should probably be > >>> a separate patch. > > Is it right that I just post this part code as a seperate patch then merge it? > > No, not in its current form. Like I say, the test should be based on > factors rather than TYPE_VECTOR_SUBPARTS. But a fix for this problem > should come before the changes to IVs. > > >>> That is, current LOAD_LEN targets have two properties (IIRC): > >>> (1) all vectors used in a given piece of vector code have the same byte size > >>> (2) lengths are measured in bytes rather than elements > >>> For all cases, including SVE, the number of controls needed for a scalar > >>> statement is equal to the number of vectors needed for that scalar > >>> statement. > >>> Because of (1), on current LOADL_LEN targets, the number of controls > >>> needed for a scalar statement is also proportional to the total number > >>> of bytes occupied by the vectors generated for that scalar statement. > >>> And because of (2), the total number of bytes is the only thing that > >>> matters, so all users of a particular control can use the same control > >>> value. > >>> E.g. on current LOAD_LEN targets, 2xV16QI and 2xV8HI would use the same > >>> control (with no adjustment). 2xV16QI means 32 elements, while 2xV8HI > >>> means 16 elements. V16QI's nscalars_per_iter would therefore be double > >>> V8HI's, but V8HI's factor would be double V16QI's (2 vs 1), so things > >>> even out. > >>> The code structurally supports targets that count in elements rather > >>> than bytes, so that factor==1 for all element types. See the > >>> "rgl->factor == 1 && factor == 1" case in: > > >> if (rgl->max_nscalars_per_iter < nscalars_per_iter) >> { >> /* For now, we only support cases in which all loads and stores fall back to VnQI or none do. */ > > >> gcc_assert (!rgl->max_nscalars_per_iter>> || (rgl->factor == 1 && factor == 1) > > || (rgl->max_nscalars_per_iter * rgl->factor > >>> == nscalars_per_iter * factor)); > > >> rgl->max_nscalars_per_iter = nscalars_per_iter; >> rgl->type = vectype; >> rgl->factor = factor; >> }>> But it hasn't been tested, since no current target uses it. > >>> I think the above part of the patch shows that the current "factor is > >>> always 1" path is in fact broken, and the patch is a correctness fix on > >>> targets that measure in elements rather than bytes. > >>> So I think the above part of the patch should go in ahead of the IV changes. > >>> But the test should be based on factor rather than TYPE_VECTOR_SUBPARTS. > > Since the length control measured by bytes instead of bytes is not > > appropriate for RVV.You mean I can't support RVV auto-vectorization in > > upstream GCC middle-end and I can only support it in my downstream, is > > that right? > > No. I haven't said in this or previous reviews that something cannot be > supported in upstream GCC. > > I'm saying that the code in theory supports counting in bytes *or* > counting in elements. But only the first one has actually been tested. > And so, perhaps not surprisingly, the support for counting elements > needs a fix. > > The fix in your patch looks like it's on the right lines, but it should be > based on factor rather than TYPE_VECTOR_SUBPARTS. > > See get_len_load_store_mode for how this selection happens: > > (1) IFN_LOAD_LEN itself always counts in elements rather than bytes. > > (2) If a target has instructions that count in elements, it should > define load_len patterns for all vector modes that it supports. > > (3) If a target has instructions that count in bytes, it should define > load_len patterns only for byte modes. The vectoriser will then > use byte loads for all vector types (even things like V8HI). Not sure if you've covered this already in another thread but IIRC RVV uses "with-len" not only for loads and stores but for arithmetic instructions as well which is where (3) fails. Fortunately RVV uses element counts(?) > For (2), the loop controls will always have a factor of 1. > For (3), the loop controls will have a factor equal to the element > size in bytes. See: > > machine_mode vmode; > if (get_len_load_store_mode (vecmode, is_load).exists (&vmode)) > { > nvectors = group_memory_nvectors (group_size * vf, nunits); > vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); > unsigned factor = (vecmode == vmode) ? 1 : GET_MODE_UNIT_SIZE (vecmode); > vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor); > using_partial_vectors_p = true; > } > > This part should work correctly for RVV and any future targets that > measure in elements rather than bytes. The problem is here: > > tree final_len > = vect_get_loop_len (loop_vinfo, loop_lens, > vec_num * ncopies, > vec_num * j + i); > tree ptr = build_int_cst (ref_type, > align * BITS_PER_UNIT); > > machine_mode vmode = TYPE_MODE (vectype); > opt_machine_mode new_ovmode > = get_len_load_store_mode (vmode, true); > machine_mode new_vmode = new_ovmode.require (); > tree qi_type = unsigned_intQI_type_node; > > This should be rearranged so that: > > - new_vmode is calculated before final_len > - a "factor" is calculated in the same way as the above code > - this factor is passed to vect_get_loop_len > - vect_get_loop_len then uses this information to do a division. > > Thanks, > Richard