Message-ID: <62a49c62-8632-baff-c3d6-c4277fd669ca@gmail.com>
Date: Sun, 7 May 2023 09:19:53 -0600
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
To: juzhe.zhong@rivai.ai, gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com, rguenther@suse.de
References: <20230504132540.286148-1-juzhe.zhong@rivai.ai>
From: Jeff Law
In-Reply-To: <20230504132540.286148-1-juzhe.zhong@rivai.ai>

On 5/4/23 07:25, juzhe.zhong@rivai.ai wrote:
> From: Ju-Zhe Zhong
>
> This patch is fixing V3 patch:
> https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zhong@rivai.ai/
>
> Fix issues according to Richard Sandiford && Richard Biener.
>
> 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
> 2. Support multiple-rgroup for non-SLP auto-vectorization.
>
> For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total length:
>
> _36 = MIN_EXPR ;
>
> First length (MIN (X, VF/N)):
> loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
>
> Second length (X - MIN (X, 1 * VF/N)):
> loop_len_16 = _36 - loop_len_15;
>
> Third length (X - MIN (X, 2 * VF/N)):
> _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
> loop_len_17 = _36 - _38;
>
> Forth length (X - MIN (X, 3 * VF/N)):
> _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
> loop_len_18 = _36 - _39;
>
> The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length since using SELECT_VL
> to adapt induction IV consumes more instructions than just using MIN_EXPR. Also, during testing,
> I found it's hard to adjust length correctly according to SELECT_VL.
>
> So, this patch we only use SELECT_VL for single-rgroup with single length control.
>
> 3. Fix document of select_vl for Richard Biener (remove mode N).
> 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard Biener.
> 5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
> 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated at the caller site.
>
> More comments from Richard Biener:
>>> So it's not actually saturating. The saturating operation is done by .WHILE_LEN?
> I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, vf)) will make
> the loop control counter never underflow zero.
>
>>> I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>>> the above scheme should also work for single rgroups, no?
>>> As said, it _looks_ like you can progress without .WHILE_LEN and using
>>> .WHILE_LEN is a pure optimization?
> Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow
> target adjust any length = INRANGE (0, min (n, vf)) each iteration.
>
> Let me known if I missed something for the V3 patch.

So at a high level this is pretty good. I think there are some
improvements we should make in the documentation and comments, but I'm
comfortable with most of the implementation details.

>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
> operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
> @end smallexample
>
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated value.

This reads rather poorly. Is this still accurate?

Set operand 0 to the number of active elements in a vector to be updated
in a loop iteration based on the total number of elements to be updated,
the vectorization factor and vector properties of the target.

> +operand 1 is the total elements need to be updated value.

operand 1 is the total elements in the vector to be updated.

> +
> +The output of this pattern is not only used as IV of loop control counter, but also
> +is used as the IV of address calculation with multiply/shift operation. This allow
> +us dynamic adjust the number of elements is processed in each iteration of the loop.

This allows dynamic adjustment of the number of elements processed each
loop iteration.

Is that still accurate, and does it read better?

> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3. If not see
> so that we can free them all at once. */
> static bitmap_obstack loop_renamer_obstack;
>
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in LOOP.
> + If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> + If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
> It is expected that neither BASE nor STEP are shared with other expressions
> (unless the sharing rules allow this). Use VAR as a base var_decl for it
> (if NULL, a new temporary will be created). The increment will occur at

It's been pretty standard to stick with just PLUS_EXPR for this stuff and
instead negate the constant to produce the same effect as MINUS_EXPR. Is
there a reason we're not continuing that practice? Sorry if you've
answered this already. If you have, you can just point me at the prior
discussion and I'll read it.

> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 44bd5f2c805..d63ded5d4f0 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -385,6 +385,48 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
> return false;
> }
>
> +/* Try to use permutes to define the lens in DEST_RGM using the lens
> + in SRC_RGM, given that the former has twice as many lens as the
> + latter. Return true on success, adding any new statements to SEQ. */

I would suggest not using "permute" in this description. When I read
permute in the context of vectorization, I think of a vector permute to
scramble elements within a vector. This looks like you're just adjusting
how many vector elements you're operating on.

> + {
> + /* For SLP, we can't allow non-VF number of elements to be processed
> + in non-final iteration. We force the number of elements to be
> + processed in each non-final iteration is VF elements. If we allow
> + non-VF elements processing in non-final iteration will make SLP too
> + complicated and produce inferior codegen.

Looks like you may have mixed up spaces and tabs in the above comment.
Just a nit, but let's go ahead and get it fixed.
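
For readers following the SLP comment quoted above, here is a minimal
sketch of what "each non-final iteration processes VF elements" means for
a loop controlled by a decrementing IV. It is a hypothetical scalar model,
not GCC code: the function name decrementing_iv_lengths and its parameters
are made up for illustration, and it only models the MIN-based length.
Per the patch description, SELECT_VL may instead pick any length up to
min (n, vf), which this sketch does not attempt to model.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model (not GCC code): returns the number of elements handled
// in each iteration of a loop whose control IV counts down from N to zero.
std::vector<std::size_t>
decrementing_iv_lengths (std::size_t n, std::size_t vf)
{
  assert (vf > 0);
  std::vector<std::size_t> lens;
  std::size_t remaining = n;            // the decrementing IV
  while (remaining > 0)
    {
      // MIN-style length: exactly VF except possibly in the final iteration.
      std::size_t len = std::min (remaining, vf);
      lens.push_back (len);
      remaining -= len;                 // cannot wrap, since len <= remaining
    }
  return lens;
}
```

With n = 10 and vf = 4 the model yields lengths 4, 4, 2: only the final
iteration is partial, which is the property the SLP comment asks for.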
> @@ -703,6 +1040,10 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>
> bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> + tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> + bool use_vl_p = !use_masks_p
> + && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
> + OPTIMIZE_FOR_SPEED);

When you break a line with a logical operator like this, go ahead and add
parentheses and make sure the operator aligns just after the opening
paren, i.e.

bool use_vl_p = (!use_masks_p
                 && direct....

Alternatively, compute the direct_internal_fn_supported_p result into its
own boolean and then you don't need as much line wrapping. In general,
don't be afraid to use extra temporaries if doing so improves
readability.

> + else if (loop_lens && loop_lens->length () == 1
> + && direct_internal_fn_supported_p (
> + IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo),
> + OPTIMIZE_FOR_SPEED)
> + && memory_access_type != VMAT_INVARIANT)

This looks like a good example of code that would be easier to read if
the call to direct_internal_fn_supported_p was saved into a temporary.
Similarly for the instance you added in vectorizable_load.

I'd like to get this patch wrapped up soon. But I also want to give both
Richards a chance to chime in with their concerns.

Thanks,
Jeff
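
As a postscript, the two formatting options suggested for use_vl_p can be
sketched side by side. This is only an illustration under stated
assumptions: select_vl_supported_p is a hypothetical stand-in for the
direct_internal_fn_supported_p query (the real function takes different
arguments), and the two wrapper functions exist only for this example.

```cpp
#include <cassert>

// Hypothetical stand-in for the target query in the patch; the real
// direct_internal_fn_supported_p has a different signature.
static bool
select_vl_supported_p (bool target_has_select_vl)
{
  return target_has_select_vl;
}

// Option 1: parenthesize the condition so the continuation line starts
// with the operator, aligned just after the opening paren.
bool
use_vl_wrapped (bool use_masks_p, bool target_has_select_vl)
{
  bool use_vl_p = (!use_masks_p
                   && select_vl_supported_p (target_has_select_vl));
  return use_vl_p;
}

// Option 2: save the query result in a temporary, keeping the condition
// short enough to fit on one line.
bool
use_vl_with_temporary (bool use_masks_p, bool target_has_select_vl)
{
  bool select_vl_p = select_vl_supported_p (target_has_select_vl);
  bool use_vl_p = !use_masks_p && select_vl_p;
  return use_vl_p;
}
```

Both forms compute the same value. The one difference is that the
temporary is evaluated unconditionally, whereas the wrapped form skips the
query when use_masks_p is true, which only matters if the query is
expensive or has side effects.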