From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <LV2PR01MB78393B9BE03402556710D12CF7CC2@LV2PR01MB7839.prod.exchangelabs.com>
 <CAFiYyc1F-GLqnkex12nriVTYqXKhNm0UyFUEETf70uCt_vnnPw@mail.gmail.com>
 <LV2PR01MB78399700661AC569E6CA40BAF7C82@LV2PR01MB7839.prod.exchangelabs.com> <LV2PR01MB7839AA7184054EAB80571117F7D62@LV2PR01MB7839.prod.exchangelabs.com>
In-Reply-To: <LV2PR01MB7839AA7184054EAB80571117F7D62@LV2PR01MB7839.prod.exchangelabs.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Fri, 28 Jun 2024 14:59:50 +0200
Message-ID: <CAFiYyc1mJzd1BwLV76J68_yHYcZzsNK-+8cVE9CxLWGyDm7-Pw@mail.gmail.com>
Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing
To: Feng Xue OS <fxue@os.amperecomputing.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

On Wed, Jun 26, 2024 at 4:48 PM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
>
> Updated the patches based on comments.
>
> The input vectype of the reduction PHI statement must be determined before
> vect cost computation for the reduction. Since a lane-reducing operation has
> a different input vectype from a normal one, we need to traverse all
> reduction statements to find the input vectype with the least lanes, and
> set that on the PHI statement.

OK

> ---
>  gcc/tree-vect-loop.cc | 79 ++++++++++++++++++++++++++++++-------------
>  1 file changed, 56 insertions(+), 23 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 347dac97e49..419f4b08d2b 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>      {
>        stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
>        stmt_vec_info vdef = vect_stmt_to_vectorize (def);
> -      if (STMT_VINFO_REDUC_IDX (vdef) == -1)
> +      int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
> +
> +      if (reduc_idx == -1)
>         {
>           if (dump_enabled_p ())
>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -7686,10 +7688,57 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>               return false;
>             }
>         }
> -      else if (!stmt_info)
> -       /* First non-conversion stmt.  */
> -       stmt_info = vdef;
> -      reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
> +      else
> +       {
> +         /* First non-conversion stmt.  */
> +         if (!stmt_info)
> +           stmt_info = vdef;
> +
> +         if (lane_reducing_op_p (op.code))
> +           {
> +             enum vect_def_type dt;
> +             tree vectype_op;
> +
> +             /* The last operand of lane-reducing operation is for
> +                reduction.  */
> +             gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
> +
> +             if (!vect_is_simple_use (op.ops[0], loop_vinfo, &dt, &vectype_op))
> +               return false;
> +
> +             tree type_op = TREE_TYPE (op.ops[0]);
> +
> +             if (!vectype_op)
> +               {
> +                 vectype_op = get_vectype_for_scalar_type (loop_vinfo,
> +                                                           type_op);
> +                 if (!vectype_op)
> +                   return false;
> +               }
> +
> +             /* For lane-reducing operation vectorizable analysis needs the
> +                reduction PHI information */
> +             STMT_VINFO_REDUC_DEF (def) = phi_info;
> +
> +             /* Each lane-reducing operation has its own input vectype, while
> +                reduction PHI will record the input vectype with the least
> +                lanes.  */
> +             STMT_VINFO_REDUC_VECTYPE_IN (vdef) = vectype_op;
> +
> +             /* To accommodate lane-reducing operations of mixed input
> +                vectypes, choose input vectype with the least lanes for the
> +                reduction PHI statement, which would result in the most
> +                ncopies for vectorized reduction results.  */
> +             if (!vectype_in
> +                 || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> +                      < GET_MODE_SIZE (SCALAR_TYPE_MODE (type_op))))
> +               vectype_in = vectype_op;
> +           }
> +         else
> +           vectype_in = STMT_VINFO_VECTYPE (phi_info);
> +       }
> +
> +      reduc_def = op.ops[reduc_idx];
>        reduc_chain_length++;
>        if (!stmt_info && slp_node)
>         slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0];
> @@ -7747,6 +7796,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>
>    tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
>    STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out;
> +  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
> +
>    gimple_match_op op;
>    if (!gimple_extract_op (stmt_info->stmt, &op))
>      gcc_unreachable ();
> @@ -7831,16 +7882,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>           = get_vectype_for_scalar_type (loop_vinfo,
>                                          TREE_TYPE (op.ops[i]), slp_op[i]);
>
> -      /* To properly compute ncopies we are interested in the widest
> -        non-reduction input type in case we're looking at a widening
> -        accumulation that we later handle in vect_transform_reduction.  */
> -      if (lane_reducing
> -         && vectype_op[i]
> -         && (!vectype_in
> -             || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> -                 < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_op[i]))))))
> -       vectype_in = vectype_op[i];
> -
>        /* Record how the non-reduction-def value of COND_EXPR is defined.
>          ???  For a chain of multiple CONDs we'd have to match them up all.  */
>        if (op.code == COND_EXPR && reduc_chain_length == 1)
> @@ -7859,14 +7900,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>             }
>         }
>      }
> -  if (!vectype_in)
> -    vectype_in = STMT_VINFO_VECTYPE (phi_info);
> -  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
> -
> -  /* Each lane-reducing operation has its own input vectype, while reduction
> -     PHI records the input vectype with least lanes.  */
> -  if (lane_reducing)
> -    STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;
>
>    enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info);
>    STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type;
> --
> 2.17.1
>
>
> ________________________________________
> From: Feng Xue OS <fxue@os.amperecomputing.com>
> Sent: Thursday, June 20, 2024 1:47 PM
> To: Richard Biener
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing
>
> >> +         if (lane_reducing_op_p (op.code))
> >> +           {
> >> +             unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0;
> >> +             tree op_type = TREE_TYPE (op.ops[0]);
> >> +             tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo,
> >> +                                                                op_type,
> >> +                                                                group_size);
> >
> > I think doing it this way does not adhere to the vector type size constraint
> > with loop vectorization.  You should use vect_is_simple_use like the
> > original code did as the actual vector definition determines the vector type
> > used.
>
> OK, though this might be wordy.
>
> Actually, STMT_VINFO_REDUC_VECTYPE_IN is logically equivalent to the
> nunits_vectype that is determined in vect_determine_vf_for_stmt_1(). So how
> about setting the type in that function?
>
> >
> > You are always using op.ops[0] here - I think that works because
> > reduc_idx is the last operand of all lane-reducing ops.  But then
> > we should assert reduc_idx != 0 here and add a comment.
>
> Already added in the following assertion.
>
> >> +
> >> +             /* The last operand of lane-reducing operation is for
> >> +                reduction.  */
> >> +             gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
>
>                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> +
> >> +             /* For lane-reducing operation vectorizable analysis needs the
> >> +                reduction PHI information */
> >> +             STMT_VINFO_REDUC_DEF (def) = phi_info;
> >> +
> >> +             if (!new_vectype_in)
> >> +               return false;
> >> +
> >> +             /* Each lane-reducing operation has its own input vectype, while
> >> +                reduction PHI will record the input vectype with the least
> >> +                lanes.  */
> >> +             STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in;
> >> +
> >> +             /* To accommodate lane-reducing operations of mixed input
> >> +                vectypes, choose input vectype with the least lanes for the
> >> +                reduction PHI statement, which would result in the most
> >> +                ncopies for vectorized reduction results.  */
> >> +             if (!vectype_in
> >> +                 || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> >> +                      < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type))))
> >> +               vectype_in = new_vectype_in;
> >
> > I know this is a fragile area but I always wonder, since the accumulating
> > operand is the largest (all lane-reducing ops are widening), and that will
> > be equal to the type of the PHI node, how this condition can ever be true.
>
> In the original code, the accumulating operand is skipped! That is correct:
> we should not count that operand, which is why we call the operation
> lane-reducing.
>
> >
> > ncopies is determined by the VF, so the comment is at least misleading.
> >
> >> +           }
> >> +         else
> >> +           vectype_in = STMT_VINFO_VECTYPE (phi_info);
> >
> > Please initialize vectype_in from phi_info before the loop (that
> > should never be NULL).
> >
>
> Maybe not, as explained below.
>
> > I'll note that with your patch it seems we'd initialize vectype_in to the
> > biggest non-accumulation vector type involved in lane-reducing ops but the
> > accumulating type might still be larger.  Why, when we have multiple
> > lane-reducing ops, would we choose the largest input here?  I see we
> > eventually do
> >
> >   if (slp_node)
> >     ncopies = 1;
> >   else
> >     ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> >
> > but then IIRC we always force a single cycle def for lane-reducing ops(?).
>
>
> > In particular for vect_transform_reduction and SLP we rely on
> > SLP_TREE_NUMBER_OF_VEC_STMTS while non-SLP uses
> > STMT_VINFO_REDUC_VECTYPE_IN.
> >
> > So I wonder what breaks when we set vectype_in = vector type of PHI?
> >
>
> Yes, that is right, nothing is broken. Suppose that a loop contains three
> dot_prods, two of <16 * char> and one of <8 * short>, and we choose
> <4 * int> as vectype_in:
>
> With the patch #7, we get:
>
>       vector<4> int sum_v0 = { 0, 0, 0, 0 };
>       vector<4> int sum_v1 = { 0, 0, 0, 0 };
>       vector<4> int sum_v2 = { 0, 0, 0, 0 };
>       vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>       loop () {
>          sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0);
>
>          sum_v0 = dot_prod<16 * char>(char_b0, char_b1, sum_v0);
>
>          sum_v0 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v0);
>          sum_v1 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v1);
>
>          sum_v2 = sum_v2;
>          sum_v3 = sum_v3;
>       }
>
> The unused def/use cycles (sum_v2 and sum_v3) would finally be optimized
> away. This then gives the same result as setting vectype_in to <8 * short>.
>
> With the patch #8, we get:
>
>       vector<4> int sum_v0 = { 0, 0, 0, 0 };
>       vector<4> int sum_v1 = { 0, 0, 0, 0 };
>       vector<4> int sum_v2 = { 0, 0, 0, 0 };
>       vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>       loop () {
>          sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0);
>
>          sum_v1 = dot_prod<16 * char>(char_b0, char_b1, sum_v1);
>
>          sum_v2 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v2);
>          sum_v3 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v3);
>       }
>
> All dot_prods are assigned to separate def/use cycles, with no
> dependency between them. More def/use cycles give higher instruction
> parallelism, but need extra cost in the epilogue to combine the results.
>
> So we consider a somewhat compact def/use layout similar to
> single-defuse-cycle, in which the two <16 * char> dot_prods are independent
> and cycles 2 and 3 are not used; this is better than the 1st scheme.
>
>       vector<4> int sum_v0 = { 0, 0, 0, 0 };
>       vector<4> int sum_v1 = { 0, 0, 0, 0 };
>
>       loop () {
>          sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0);
>
>          sum_v1 = dot_prod<16 * char>(char_b0, char_b1, sum_v1);
>
>          sum_v0 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v0);
>          sum_v1 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v1);
>       }
>
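The three layouts above can be reproduced with a small standalone model.
This is only a sketch for illustration, not code from the patch series: the
names LaneReducingOp and assign_cycles are hypothetical, and the model just
numbers the accumulators (sum_v0, sum_v1, ...) that each vector copy of a
dot_prod updates.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one lane-reducing op: a label and the number
// of vector statements (copies) it needs per vectorized iteration.
struct LaneReducingOp {
  std::string name;
  int ncopies;
};

// For every copy of every op, return the index of the accumulator
// (def/use cycle) that the copy updates.  NCYCLES is the number of
// accumulators; RESTART_PER_OP makes each op start again at cycle 0.
std::vector<std::vector<int>>
assign_cycles (const std::vector<LaneReducingOp> &ops, int ncycles,
               bool restart_per_op)
{
  assert (ncycles > 0);
  std::vector<std::vector<int>> layout;
  int next = 0;
  for (const LaneReducingOp &op : ops)
    {
      if (restart_per_op)
        next = 0;
      std::vector<int> cycles;
      for (int copy = 0; copy < op.ncopies; ++copy)
        cycles.push_back (next++ % ncycles);
      layout.push_back (cycles);
    }
  return layout;
}
```

With two single-copy <16 * char> ops and one two-copy <8 * short> op, calling
assign_cycles with (4 cycles, restart) models the first layout, (4 cycles, no
restart) the second, and (2 cycles, no restart) the compact one, where the
cycle count equals the largest ncopies among the ops.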
> For this purpose, we need to track the vectype_in that results in
> the most ncopies; in this case, the type is <8 * short>.
>
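The relation between lane count and ncopies in this example can be checked
with a simplified model. The sketch below is an assumption-laden
illustration, not GCC code: VecType, num_copies and pick_vectype_in are
hypothetical names, vectors are assumed to be a fixed 128 bits wide, the VF
is assumed to be 16, and ncopies is modelled as the VF divided by the number
of lanes. The selection loop mirrors the scalar-size comparison used in the
patch (keep the candidate with the widest element, hence the fewest lanes).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a vector type of fixed total width.
struct VecType {
  int elem_bits;    // width of one scalar element
  int vector_bits;  // total vector width
  int lanes () const { return vector_bits / elem_bits; }
};

// Simplified model of ncopies: vectorization factor divided by lane count.
int
num_copies (int vf, const VecType &vt)
{
  assert (vf % vt.lanes () == 0);
  return vf / vt.lanes ();
}

// Among the input vectypes of all lane-reducing ops, keep the one with the
// widest scalar element, which is the one with the fewest lanes.
VecType
pick_vectype_in (const std::vector<VecType> &inputs)
{
  assert (!inputs.empty ());
  VecType best = inputs.front ();
  for (const VecType &vt : inputs)
    if (best.elem_bits < vt.elem_bits)
      best = vt;
  return best;
}
```

Under these assumptions, two <16 * char> inputs and one <8 * short> input
select <8 * short>, giving ncopies of 2, while <4 * int> would give 4, which
matches the two-accumulator and four-accumulator layouts discussed above.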
> BTW: would you please also take a look at patch #7 and #8?
>
> Thanks,
> Feng
>
> ________________________________________
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Wednesday, June 19, 2024 9:01 PM
> To: Feng Xue OS
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing
>
> On Sun, Jun 16, 2024 at 9:25 AM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
> >
> > The input vectype of the reduction PHI statement must be determined before
> > vect cost computation for the reduction. Since a lane-reducing operation
> > has a different input vectype from a normal one, we need to traverse all
> > reduction statements to find the input vectype with the least lanes, and
> > set that on the PHI statement.
> >
> > Thanks,
> > Feng
> >
> > ---
> > gcc/
> >         * tree-vect-loop.cc (vectorizable_reduction): Determine input vectype
> >         during traversal of reduction statements.
> > ---
> >  gcc/tree-vect-loop.cc | 72 +++++++++++++++++++++++++++++--------------
> >  1 file changed, 49 insertions(+), 23 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 0f7b125e72d..39aa5cb1197 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >      {
> >        stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
> >        stmt_vec_info vdef = vect_stmt_to_vectorize (def);
> > -      if (STMT_VINFO_REDUC_IDX (vdef) == -1)
> > +      int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
> > +
> > +      if (reduc_idx == -1)
> >         {
> >           if (dump_enabled_p ())
> >             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > @@ -7686,10 +7688,50 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >               return false;
> >             }
> >         }
> > -      else if (!stmt_info)
> > -       /* First non-conversion stmt.  */
> > -       stmt_info = vdef;
> > -      reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
> > +      else
> > +       {
> > +         /* First non-conversion stmt.  */
> > +         if (!stmt_info)
> > +           stmt_info = vdef;
> > +
> > +         if (lane_reducing_op_p (op.code))
> > +           {
> > +             unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0;
> > +             tree op_type = TREE_TYPE (op.ops[0]);
> > +             tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo,
> > +                                                                op_type,
> > +                                                                group_size);
>
> I think doing it this way does not adhere to the vector type size constraint
> with loop vectorization.  You should use vect_is_simple_use like the
> original code did as the actual vector definition determines the vector type
> used.
>
> You are always using op.ops[0] here - I think that works because
> reduc_idx is the last operand of all lane-reducing ops.  But then
> we should assert reduc_idx != 0 here and add a comment.
>
> > +
> > +             /* The last operand of lane-reducing operation is for
> > +                reduction.  */
> > +             gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
> > +
> > +             /* For lane-reducing operation vectorizable analysis needs the
> > +                reduction PHI information */
> > +             STMT_VINFO_REDUC_DEF (def) = phi_info;
> > +
> > +             if (!new_vectype_in)
> > +               return false;
> > +
> > +             /* Each lane-reducing operation has its own input vectype, while
> > +                reduction PHI will record the input vectype with the least
> > +                lanes.  */
> > +             STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in;
> > +
> > +             /* To accommodate lane-reducing operations of mixed input
> > +                vectypes, choose input vectype with the least lanes for the
> > +                reduction PHI statement, which would result in the most
> > +                ncopies for vectorized reduction results.  */
> > +             if (!vectype_in
> > +                 || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> > +                      < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type))))
> > +               vectype_in = new_vectype_in;
>
> I know this is a fragile area but I always wonder, since the accumulating
> operand is the largest (all lane-reducing ops are widening), and that will
> be equal to the type of the PHI node, how this condition can ever be true.
>
> ncopies is determined by the VF, so the comment is at least misleading.
>
> > +           }
> > +         else
> > +           vectype_in = STMT_VINFO_VECTYPE (phi_info);
>
> Please initialize vectype_in from phi_info before the loop (that
> should never be NULL).
>
> I'll note that with your patch it seems we'd initialize vectype_in to the
> biggest non-accumulation vector type involved in lane-reducing ops but the
> accumulating type might still be larger.  Why, when we have multiple
> lane-reducing ops, would we choose the largest input here?  I see we
> eventually do
>
>   if (slp_node)
>     ncopies = 1;
>   else
>     ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
>
> but then IIRC we always force a single cycle def for lane-reducing ops(?).
> In particular for vect_transform_reduction and SLP we rely on
> SLP_TREE_NUMBER_OF_VEC_STMTS while non-SLP uses
> STMT_VINFO_REDUC_VECTYPE_IN.
>
> So I wonder what breaks when we set vectype_in = vector type of PHI?
>
> Richard.
>
> > +       }
> > +
> > +      reduc_def = op.ops[reduc_idx];
> >        reduc_chain_length++;
> >        if (!stmt_info && slp_node)
> >         slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0];
> > @@ -7747,6 +7789,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >
> >    tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
> >    STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out;
> > +  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
> > +
> >    gimple_match_op op;
> >    if (!gimple_extract_op (stmt_info->stmt, &op))
> >      gcc_unreachable ();
> > @@ -7831,16 +7875,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >           = get_vectype_for_scalar_type (loop_vinfo,
> >                                          TREE_TYPE (op.ops[i]), slp_op[i]);
> >
> > -      /* To properly compute ncopies we are interested in the widest
> > -        non-reduction input type in case we're looking at a widening
> > -        accumulation that we later handle in vect_transform_reduction.  */
> > -      if (lane_reducing
> > -         && vectype_op[i]
> > -         && (!vectype_in
> > -             || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
> > -                 < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_op[i]))))))
> > -       vectype_in = vectype_op[i];
> > -
> >        /* Record how the non-reduction-def value of COND_EXPR is defined.
> >          ???  For a chain of multiple CONDs we'd have to match them up all.  */
> >        if (op.code == COND_EXPR && reduc_chain_length == 1)
> > @@ -7859,14 +7893,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >             }
> >         }
> >      }
> > -  if (!vectype_in)
> > -    vectype_in = STMT_VINFO_VECTYPE (phi_info);
> > -  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
> > -
> > -  /* Each lane-reducing operation has its own input vectype, while reduction
> > -     PHI records the input vectype with least lanes.  */
> > -  if (lane_reducing)
> > -    STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;
> >
> >    enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info);
> >    STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type;
> > --
> > 2.17.1