From: Richard Biener
Date: Thu, 20 Jun 2024 13:52:04 +0200
Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing
To: Feng Xue OS
Cc: "gcc-patches@gcc.gnu.org"
On Thu, Jun 20, 2024 at 7:47 AM Feng Xue OS wrote:
>
> >> +         if (lane_reducing_op_p (op.code))
> >> +           {
> >> +             unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0;
> >> +             tree op_type = TREE_TYPE (op.ops[0]);
> >> +             tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo,
> >> +                                                                op_type,
> >> +                                                                group_size);
> >
> > I think doing it this way does not adhere to the vector type size constraint
> > with loop vectorization.  You should use vect_is_simple_use like the
> > original code did, as the actual vector definition determines the vector type
> > used.
>
> OK, though this might be wordy.
>
> Actually, STMT_VINFO_REDUC_VECTYPE_IN is logically equivalent to nunits_vectype
> that is determined in vect_determine_vf_for_stmt_1 ().  So how about setting the
> type in that function?

But why would we need it here?  Isn't the info indirectly available
through the loop's vectorization factor?

> >
> > You are always using op.ops[0] here - I think that works because
> > reduc_idx is the last operand of all lane-reducing ops.  But then
> > we should assert reduc_idx != 0 here and add a comment.
>
> Already added in the following assertion.
>
> >> +
> >> +             /* The last operand of lane-reducing operation is for
> >> +                reduction.  */
> >> +             gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
> >                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> +
> >> +             /* For lane-reducing operation vectorizable analysis needs the
> >> +                reduction PHI information.  */
> >> +             STMT_VINFO_REDUC_DEF (def) = phi_info;
> >> +
> >> +             if (!new_vectype_in)
> >> +               return false;
> >> +
> >> +             /* Each lane-reducing operation has its own input vectype, while
> >> +                reduction PHI will record the input vectype with the least
> >> +                lanes.  */
> >> +             STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in;
> >> +
> >> +             /* To accommodate lane-reducing operations of mixed input
> >> +                vectypes, choose input vectype with the least lanes for the
> >> +                reduction PHI statement, which would result in the most
> >> +                ncopies for vectorized reduction results.  */
> >> +             if (!vectype_in
> >> +                 || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> >> +                     < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type))))
> >> +               vectype_in = new_vectype_in;
> >
> > I know this is a fragile area, but I always wonder: since the accumulating operand
> > is the largest (all lane-reducing ops are widening), and that will be equal to the
> > type of the PHI node, how can this condition ever be true?
>
> In the original code, the accumulating operand is skipped!  And rightly so: we
> should not count that operand, which is why we call the operation lane-reducing.
>
> >
> > ncopies is determined by the VF, so the comment is at least misleading.
> >
> >> +           }
> >> +         else
> >> +           vectype_in = STMT_VINFO_VECTYPE (phi_info);
> >
> > Please initialize vectype_in from phi_info before the loop (that
> > should never be NULL).
> >
>
> Maybe not, as explained below.
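
For reference, the scalar shape behind "all lane-reducing ops are widening"
looks roughly like the sketch below (illustrative only; the function name and
types are assumed, not taken from the patch):

  /* The multiply inputs are narrow (char) while the accumulator is wide
     (int), so the vectorized input type, e.g. <16 * char> on a 128-bit
     target, has more lanes than the <4 * int> accumulator/PHI vectype.  */
  int
  dot_prod_sketch (signed char *a, signed char *b, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];   /* may be recognized as a dot-product
                               (DOT_PROD_EXPR) when vectorized.  */
    return sum;
  }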
>
> > I'll note that with your patch it seems we'd initialize vectype_in to
> > the biggest non-accumulation vector type involved in lane-reducing ops,
> > but the accumulating type might still be larger.  Why, when we have
> > multiple lane-reducing ops, would we choose the largest input here?
> > I see we eventually do
> >
> >       if (slp_node)
> >         ncopies = 1;
> >       else
> >         ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> >
> > but then IIRC we always force a single cycle def for lane-reducing ops(?).
> >
> > In particular for vect_transform_reduction and SLP we rely on
> > SLP_TREE_NUMBER_OF_VEC_STMTS while non-SLP uses
> > STMT_VINFO_REDUC_VECTYPE_IN.
> >
> > So I wonder what breaks when we set vectype_in = vector type of PHI?
> >
>
> Yes, that is right, nothing is broken.  Suppose that a loop contains three
> dot_prods: two are <16 * char>, one is <8 * short>, and we choose <4 * int>
> as vectype_in.
>
> With patch #7, we get:
>
>   vector<4> int sum_v0 = { 0, 0, 0, 0 };
>   vector<4> int sum_v1 = { 0, 0, 0, 0 };
>   vector<4> int sum_v2 = { 0, 0, 0, 0 };
>   vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>   loop () {
>     sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0);
>
>     sum_v0 = dot_prod<16 * char>(char_b0, char_b1, sum_v0);
>
>     sum_v0 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v0);
>     sum_v1 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v1);
>
>     sum_v2 = sum_v2;
>     sum_v3 = sum_v3;
>   }
>
> The def/use cycles (sum_v2 and sum_v3) would be optimized away in the end,
> so this gets the same result as setting vectype_in to <8 * short>.
>
> With patch #8, we get:
>
>   vector<4> int sum_v0 = { 0, 0, 0, 0 };
>   vector<4> int sum_v1 = { 0, 0, 0, 0 };
>   vector<4> int sum_v2 = { 0, 0, 0, 0 };
>   vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>   loop () {
>     sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0);
>
>     sum_v1 = dot_prod<16 * char>(char_b0, char_b1, sum_v1);
>
>     sum_v2 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v2);
>     sum_v3 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v3);
>   }
>
> All dot_prods are assigned to separate def/use cycles with no dependency
> among them.  More def/use cycles mean higher instruction parallelism, but
> also extra cost in the epilogue to combine the results.
>
> So we consider a somewhat more compact def/use layout, similar to
> single-defuse-cycle, in which the two <16 * char> dot_prods stay independent
> and cycles 2 and 3 are not used; this is better than the first scheme:
>
>   vector<4> int sum_v0 = { 0, 0, 0, 0 };
>   vector<4> int sum_v1 = { 0, 0, 0, 0 };
>
>   loop () {
>     sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0);
>
>     sum_v1 = dot_prod<16 * char>(char_b0, char_b1, sum_v1);
>
>     sum_v0 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v0);
>     sum_v1 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v1);
>   }
>
> For this purpose, we need to track the vectype_in that results in the most
> ncopies; for this case, the type is <8 * short>.

So the VF determines the number of inputs to the lane-reducing ops
and in that way also the number of lane-reducing stmts, individually
for each lane-reducing op.  Given the soft constraint of having same-sized
input vectors, you'll probably never arrive at the situation that, for the
above, you end up with one dot_prod<16 * char> but two dot_prod<8 * short>?

That said, I still don't exactly see why we need the input vector type
rather than the accumulator vector type plus the VF to determine everything.
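
To make the epilogue cost mentioned in the quoted example concrete, the sketch
below (using GCC's generic vector extension; the <4 * int> accumulator shape is
assumed from the example above, and the names are illustrative, not from the
patch) shows the extra work needed to combine per-cycle partial sums after the
loop:

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Sum the per-cycle partial accumulators element-wise, then reduce the
     remaining four lanes to the scalar result.  More def/use cycles mean
     more of this work in the epilogue.  */
  static int
  combine_partial_sums (v4si sum_v0, v4si sum_v1, v4si sum_v2, v4si sum_v3)
  {
    v4si t = sum_v0 + sum_v1 + sum_v2 + sum_v3;
    return t[0] + t[1] + t[2] + t[3];
  }

With the more compact layout only sum_v0 and sum_v1 would need combining, which
is why fewer def/use cycles reduce the epilogue cost at the expense of some
in-loop instruction parallelism.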
I know it's pre-existing, but now you're extending it to cover multiple possibly
"different" (to what extent "different"?  Is the only requirement that the
accumulator type is the same?) lane-reducing ops.

> BTW: would you please also take a look at patches #7 and #8?

Sure, I thought they were logically dependent on this one (and I didn't get to
them yesterday).

Richard.

> Thanks,
> Feng
>
> ________________________________________
> From: Richard Biener
> Sent: Wednesday, June 19, 2024 9:01 PM
> To: Feng Xue OS
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing
>
> On Sun, Jun 16, 2024 at 9:25 AM Feng Xue OS wrote:
> >
> > The input vectype of a reduction PHI statement must be determined before
> > vect cost computation for the reduction.  Since a lane-reducing operation has
> > a different input vectype from a normal one, we need to traverse all reduction
> > statements to find out the input vectype with the least lanes, and set that on
> > the PHI statement.
> >
> > Thanks,
> > Feng
> >
> > ---
> > gcc/
> >         * tree-vect-loop.cc (vectorizable_reduction): Determine input vectype
> >         during traversal of reduction statements.
> > ---
> >  gcc/tree-vect-loop.cc | 72 +++++++++++++++++++++++++++++--------------
> >  1 file changed, 49 insertions(+), 23 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 0f7b125e72d..39aa5cb1197 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >      {
> >        stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
> >        stmt_vec_info vdef = vect_stmt_to_vectorize (def);
> > -      if (STMT_VINFO_REDUC_IDX (vdef) == -1)
> > +      int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
> > +
> > +      if (reduc_idx == -1)
> >         {
> >           if (dump_enabled_p ())
> >             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > @@ -7686,10 +7688,50 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >               return false;
> >             }
> >         }
> > -      else if (!stmt_info)
> > -       /* First non-conversion stmt.  */
> > -       stmt_info = vdef;
> > -      reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
> > +      else
> > +       {
> > +         /* First non-conversion stmt.  */
> > +         if (!stmt_info)
> > +           stmt_info = vdef;
> > +
> > +         if (lane_reducing_op_p (op.code))
> > +           {
> > +             unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0;
> > +             tree op_type = TREE_TYPE (op.ops[0]);
> > +             tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo,
> > +                                                                op_type,
> > +                                                                group_size);
>
> I think doing it this way does not adhere to the vector type size constraint
> with loop vectorization.  You should use vect_is_simple_use like the
> original code did, as the actual vector definition determines the vector type
> used.
>
> You are always using op.ops[0] here - I think that works because
> reduc_idx is the last operand of all lane-reducing ops.  But then
> we should assert reduc_idx != 0 here and add a comment.
>
> > +
> > +             /* The last operand of lane-reducing operation is for
> > +                reduction.  */
> > +             gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
> > +
> > +             /* For lane-reducing operation vectorizable analysis needs the
> > +                reduction PHI information.  */
> > +             STMT_VINFO_REDUC_DEF (def) = phi_info;
> > +
> > +             if (!new_vectype_in)
> > +               return false;
> > +
> > +             /* Each lane-reducing operation has its own input vectype, while
> > +                reduction PHI will record the input vectype with the least
> > +                lanes.  */
> > +             STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in;
> > +
> > +             /* To accommodate lane-reducing operations of mixed input
> > +                vectypes, choose input vectype with the least lanes for the
> > +                reduction PHI statement, which would result in the most
> > +                ncopies for vectorized reduction results.  */
> > +             if (!vectype_in
> > +                 || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> > +                     < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type))))
> > +               vectype_in = new_vectype_in;
>
> I know this is a fragile area, but I always wonder: since the accumulating operand
> is the largest (all lane-reducing ops are widening), and that will be equal to the
> type of the PHI node, how can this condition ever be true?
>
> ncopies is determined by the VF, so the comment is at least misleading.
>
> > +           }
> > +         else
> > +           vectype_in = STMT_VINFO_VECTYPE (phi_info);
>
> Please initialize vectype_in from phi_info before the loop (that
> should never be NULL).
>
> I'll note that with your patch it seems we'd initialize vectype_in to
> the biggest non-accumulation vector type involved in lane-reducing ops,
> but the accumulating type might still be larger.  Why, when we have
> multiple lane-reducing ops, would we choose the largest input here?
> I see we eventually do
>
>       if (slp_node)
>         ncopies = 1;
>       else
>         ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
>
> but then IIRC we always force a single cycle def for lane-reducing ops(?).
> In particular for vect_transform_reduction and SLP we rely on
> SLP_TREE_NUMBER_OF_VEC_STMTS while non-SLP uses
> STMT_VINFO_REDUC_VECTYPE_IN.
>
> So I wonder what breaks when we set vectype_in = vector type of PHI?
>
> Richard.
>
> > +           }
> > +
> > +      reduc_def = op.ops[reduc_idx];
> >        reduc_chain_length++;
> >        if (!stmt_info && slp_node)
> >          slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0];
> > @@ -7747,6 +7789,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >
> >    tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
> >    STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out;
> > +  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
> > +
> >    gimple_match_op op;
> >    if (!gimple_extract_op (stmt_info->stmt, &op))
> >      gcc_unreachable ();
> > @@ -7831,16 +7875,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >         = get_vectype_for_scalar_type (loop_vinfo,
> >                                        TREE_TYPE (op.ops[i]), slp_op[i]);
> >
> > -      /* To properly compute ncopies we are interested in the widest
> > -        non-reduction input type in case we're looking at a widening
> > -        accumulation that we later handle in vect_transform_reduction.  */
> > -      if (lane_reducing
> > -         && vectype_op[i]
> > -         && (!vectype_in
> > -             || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> > -                 < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_op[i]))))))
> > -       vectype_in = vectype_op[i];
> > -
> >       /* Record how the non-reduction-def value of COND_EXPR is defined.
> >          ??? For a chain of multiple CONDs we'd have to match them up all.  */
> >        if (op.code == COND_EXPR && reduc_chain_length == 1)
> > @@ -7859,14 +7893,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >             }
> >         }
> >     }
> > -  if (!vectype_in)
> > -    vectype_in = STMT_VINFO_VECTYPE (phi_info);
> > -  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
> > -
> > -  /* Each lane-reducing operation has its own input vectype, while reduction
> > -     PHI records the input vectype with least lanes.  */
> > -  if (lane_reducing)
> > -    STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;
> >
> >   enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info);
> >   STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type;
> > --
> > 2.17.1