From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id EC04F3858D28 for ; Fri, 19 May 2023 10:58:42 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org EC04F3858D28 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 47B2A1FB; Fri, 19 May 2023 03:59:27 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id CCB853F762; Fri, 19 May 2023 03:58:41 -0700 (PDT) From: Richard Sandiford To: "juzhe.zhong\@rivai.ai" Mail-Followup-To: "juzhe.zhong\@rivai.ai" ,gcc-patches , rguenther , richard.sandiford@arm.com Cc: gcc-patches , rguenther Subject: Re: [PATCH V11] VECT: Add decrement IV support in Loop Vectorizer References: <20230516102319.163752-1-juzhe.zhong@rivai.ai> <39BD0264F732B3DE+2023051918370378242996@rivai.ai> Date: Fri, 19 May 2023 11:58:40 +0100 In-Reply-To: <39BD0264F732B3DE+2023051918370378242996@rivai.ai> (juzhe's message of "Fri, 19 May 2023 18:37:04 +0800") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-23.0 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: "juzhe.zhong@rivai.ai" writes: >>> I don't think this is a property of decrementing IVs. IIUC it's really >>> a property of rgl->factor == 1 && factor == 1, where factor would need >>> to be passed in by the caller. Because of that, it should probably be >>> a separate patch. > Is it right that I just post this part code as a seperate patch then merge it? No, not in its current form. Like I say, the test should be based on factors rather than TYPE_VECTOR_SUBPARTS. But a fix for this problem should come before the changes to IVs. >>> That is, current LOAD_LEN targets have two properties (IIRC): >>> (1) all vectors used in a given piece of vector code have the same byte size >>> (2) lengths are measured in bytes rather than elements >>> For all cases, including SVE, the number of controls needed for a scalar >>> statement is equal to the number of vectors needed for that scalar >>> statement. >>> Because of (1), on current LOADL_LEN targets, the number of controls >>> needed for a scalar statement is also proportional to the total number >>> of bytes occupied by the vectors generated for that scalar statement. >>> And because of (2), the total number of bytes is the only thing that >>> matters, so all users of a particular control can use the same control >>> value. >>> E.g. on current LOAD_LEN targets, 2xV16QI and 2xV8HI would use the same >>> control (with no adjustment). 2xV16QI means 32 elements, while 2xV8HI >>> means 16 elements. V16QI's nscalars_per_iter would therefore be double >>> V8HI's, but V8HI's factor would be double V16QI's (2 vs 1), so things >>> even out. >>> The code structurally supports targets that count in elements rather >>> than bytes, so that factor==1 for all element types. See the >>> "rgl->factor == 1 && factor == 1" case in: > >> if (rgl->max_nscalars_per_iter < nscalars_per_iter) >> { >> /* For now, we only support cases in which all loads and stores fall back to VnQI or none do. */ > >> gcc_assert (!rgl->max_nscalars_per_iter>> || (rgl->factor == 1 && factor == 1) > || (rgl->max_nscalars_per_iter * rgl->factor >>> == nscalars_per_iter * factor)); > >> rgl->max_nscalars_per_iter = nscalars_per_iter; >> rgl->type = vectype; >> rgl->factor = factor; >> }>> But it hasn't been tested, since no current target uses it. >>> I think the above part of the patch shows that the current "factor is >>> always 1" path is in fact broken, and the patch is a correctness fix on >>> targets that measure in elements rather than bytes. >>> So I think the above part of the patch should go in ahead of the IV changes. >>> But the test should be based on factor rather than TYPE_VECTOR_SUBPARTS. > Since the length control measured by bytes instead of bytes is not > appropriate for RVV.You mean I can't support RVV auto-vectorization in > upstream GCC middle-end and I can only support it in my downstream, is > that right? No. I haven't said in this or previous reviews that something cannot be supported in upstream GCC. I'm saying that the code in theory supports counting in bytes *or* counting in elements. But only the first one has actually been tested. And so, perhaps not surprisingly, the support for counting elements needs a fix. The fix in your patch looks like it's on the right lines, but it should be based on factor rather than TYPE_VECTOR_SUBPARTS. See get_len_load_store_mode for how this selection happens: (1) IFN_LOAD_LEN itself always counts in elements rather than bytes. (2) If a target has instructions that count in elements, it should define load_len patterns for all vector modes that it supports. (3) If a target has instructions that count in bytes, it should define load_len patterns only for byte modes. The vectoriser will then use byte loads for all vector types (even things like V8HI). For (2), the loop controls will always have a factor of 1. For (3), the loop controls will have a factor equal to the element size in bytes. See: machine_mode vmode; if (get_len_load_store_mode (vecmode, is_load).exists (&vmode)) { nvectors = group_memory_nvectors (group_size * vf, nunits); vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); unsigned factor = (vecmode == vmode) ? 1 : GET_MODE_UNIT_SIZE (vecmode); vect_record_loop_len (loop_vinfo, lens, nvectors, vectype, factor); using_partial_vectors_p = true; } This part should work correctly for RVV and any future targets that measure in elements rather than bytes. The problem is here: tree final_len = vect_get_loop_len (loop_vinfo, loop_lens, vec_num * ncopies, vec_num * j + i); tree ptr = build_int_cst (ref_type, align * BITS_PER_UNIT); machine_mode vmode = TYPE_MODE (vectype); opt_machine_mode new_ovmode = get_len_load_store_mode (vmode, true); machine_mode new_vmode = new_ovmode.require (); tree qi_type = unsigned_intQI_type_node; This should be rearranged so that: - new_vmode is calculated before final_len - a "factor" is calculated in the same way as the above code - this factor is passed to vect_get_loop_len - vect_get_loop_len then uses this information to do a division. Thanks, Richard