From: Richard Biener
Date: Mon, 24 Jun 2024 14:58:05 +0200
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org

On Sun, Jun 23, 2024 at 5:10 PM Feng
Xue OS wrote: > > >> - if (slp_node) > >> + if (slp_node && SLP_TREE_LANES (slp_node) > 1) > > > > Hmm, that looks wrong. It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off > > instead, which is bad. > > > >> nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > >> else > >> nvectors = vect_get_num_copies (loop_vinfo, vectype_in); > >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, > >> } > >> } > >> > >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in > >> + the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC. > >> + Now there are three such kinds of operations: dot-prod/widen-sum/sad > >> + (sum-of-absolute-differences). > >> + > >> + For a lane-reducing operation, the loop reduction path that it lies in, > >> + may contain normal operation, or other lane-reducing operation of different > >> + input type size, an example as: > >> + > >> + int sum = 0; > >> + for (i) > >> + { > >> + ... > >> + sum += d0[i] * d1[i]; // dot-prod > >> + sum += w[i]; // widen-sum > >> + sum += abs(s0[i] - s1[i]); // sad > >> + sum += n[i]; // normal > >> + ... > >> + } > >> + > >> + Vectorization factor is essentially determined by operation whose input > >> + vectype has the most lanes ("vector(16) char" in the example), while we > >> + need to choose input vectype with the least lanes ("vector(4) int" in the > >> + example) for the reduction PHI statement.
*/ > >> + > >> +bool > >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, > >> + slp_tree slp_node, stmt_vector_for_cost *cost_vec) > >> +{ > >> + gimple *stmt = stmt_info->stmt; > >> + > >> + if (!lane_reducing_stmt_p (stmt)) > >> + return false; > >> + > >> + tree type = TREE_TYPE (gimple_assign_lhs (stmt)); > >> + > >> + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type)) > >> + return false; > >> + > >> + /* Do not try to vectorize bit-precision reductions. */ > >> + if (!type_has_mode_precision_p (type)) > >> + return false; > >> + > >> + if (!slp_node) > >> + return false; > >> + > >> + for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++) > >> + { > >> + stmt_vec_info def_stmt_info; > >> + slp_tree slp_op; > >> + tree op; > >> + tree vectype; > >> + enum vect_def_type dt; > >> + > >> + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op, > >> + &slp_op, &dt, &vectype, &def_stmt_info)) > >> + { > >> + if (dump_enabled_p ()) > >> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> + "use not simple.\n"); > >> + return false; > >> + } > >> + > >> + if (!vectype) > >> + { > >> + vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op), > >> + slp_op); > >> + if (!vectype) > >> + return false; > >> + } > >> + > >> + if (!vect_maybe_update_slp_op_vectype (slp_op, vectype)) > >> + { > >> + if (dump_enabled_p ()) > >> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> + "incompatible vector types for invariants\n"); > >> + return false; > >> + } > >> + > >> + if (i == STMT_VINFO_REDUC_IDX (stmt_info)) > >> + continue; > >> + > >> + /* There should be at most one cycle def in the stmt.
*/ > >> + if (VECTORIZABLE_CYCLE_DEF (dt)) > >> + return false; > >> + } > >> + > >> + stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)); > >> + > >> + /* TODO: Support lane-reducing operation that does not directly participate > >> + in loop reduction. */ > >> + if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0) > >> + return false; > >> + > >> + /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not > >> + recoginized. */ > >> + gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def); > >> + gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION); > >> + > >> + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); > >> + int ncopies_for_cost; > >> + > >> + if (SLP_TREE_LANES (slp_node) > 1) > >> + { > >> + /* Now lane-reducing operations in a non-single-lane slp node should only > >> + come from the same loop reduction path. */ > >> + gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info)); > >> + ncopies_for_cost = 1; > >> + } > >> + else > >> + { > >> + ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in); > > > > OK, so the fact that the ops are lane-reducing means they effectively > > change the VF for the result. That's only possible as we tightly control > > code generation and "adjust" to the expected VF (by inserting the copies > > you mentioned above), but only up to the highest number of outputs > > created in the reduction chain. In that sense instead of talking and recording > > "input vector types" wouldn't it make more sense to record the effective > > vectorization factor for the reduction instance? That VF would be at most > > the loops VF but could be as low as 1. Once we have a non-lane-reducing > > operation in the reduction chain it would be always equal to the loops VF. > > > > ncopies would then be always determined by that reduction instance VF and > > the accumulator vector type (STMT_VINFO_VECTYPE).
This reduction > > instance VF would also trivially indicate the force-single-def-use-cycle > > case, possibly simplifying code? > > I tried to add such an effective VF, while the vectype_in is still needed in some > scenarios, such as when checking whether a dot-prod stmt is emulated or not. > The former could be deduced from the latter, so recording both things seems > to be redundant. Another consideration is that for normal op, ncopies > is determined from type (STMT_VINFO_VECTYPE), but for lane-reducing op, > it is from VF. So, a better means to make them unified? AFAICS reductions are special in that they, for the accumulation SSA cycle, do not adhere to the loops VF but as optimization can choose a smaller one. OTOH STMT_VINFO_VECTYPE is for the vector type used for individual operations which even for lane-reducing ops is adhered to - those just may use a smaller VF, that of the reduction SSA cycle. So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather it's not fully redundant but needlessly replicated over all stmts participating in the reduction instead of recording the reduction VF in the reduc_info and using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies for stmts in the reduction cycle. At least that was my idea ... > >> + gcc_assert (ncopies_for_cost >= 1); > >> + } > >> + > >> + if (vect_is_emulated_mixed_dot_prod (stmt_info)) > >> + { > >> + /* We need extra two invariants: one that contains the minimum signed > >> + value and one that contains half of its negative. */ > >> + int prologue_stmts = 2; > >> + unsigned cost = record_stmt_cost (cost_vec, prologue_stmts, > >> + scalar_to_vec, stmt_info, 0, > >> + vect_prologue); > >> + if (dump_enabled_p ()) > >> + dump_printf (MSG_NOTE, "vectorizable_lane_reducing: " > >> + "extra prologue_cost = %d .\n", cost); > >> + > >> + /* Three dot-products and a subtraction.
*/ > >> + ncopies_for_cost *= 4; > >> + } > >> + > >> + record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0, > >> + vect_body); > >> + > >> + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) > >> + { > >> + enum tree_code code = gimple_assign_rhs_code (stmt); > >> + vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info, > >> + slp_node, code, type, > >> + vectype_in); > >> + } > >> + > > > > Add a comment: > > > > /* Transform via vect_transform_reduction. */ > > > >> + STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type; > >> + return true; > >> +} > >> + > >> /* Function vectorizable_reduction. > >> > >> Check if STMT_INFO performs a reduction operation that can be vectorized. > >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> if (!type_has_mode_precision_p (op.type)) > >> return false; > >> > >> - /* For lane-reducing ops we're reducing the number of reduction PHIs > >> - which means the only use of that may be in the lane-reducing operation. */ > >> - if (lane_reducing > >> - && reduc_chain_length != 1 > >> - && !only_slp_reduc_chain) > >> - { > >> - if (dump_enabled_p ()) > >> - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> - "lane-reducing reduction with extra stmts.\n"); > >> - return false; > >> - } > >> - > >> /* Lane-reducing ops also never can be used in a SLP reduction group > >> since we'll mix lanes belonging to different reductions. But it's > >> OK to use them in a reduction chain or when the reduction group > >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> && loop_vinfo->suggested_unroll_factor == 1) > >> single_defuse_cycle = true; > >> > >> - if (single_defuse_cycle || lane_reducing) > >> + if (single_defuse_cycle && !lane_reducing) > > > > If there's also a non-lane-reducing plus in the chain don't we have to > > check for that reduction op? So shouldn't it be > > single_defuse_cycle && ...
fact that we don't record > > (non-lane-reducing op there) ... > > Quite not understand this point. For a non-lane-reducing op in the chain, > it should be handled in its own vectorizable_xxx function? The below check > is only for the first statement (vect_reduction_def) in the reduction. Hmm. So we have vectorizable_lane_reducing_* for the check on the lane-reducing stmts, vectorizable_* for !single-def-use stmts. And the following is then just for the case there's a single def that's not lane-reducing and we're forcing a single-def-use and thus go via vect_transform_reduction? > > > >> { > >> gcc_assert (op.code != COND_EXPR); > >> > >> - /* 4. Supportable by target? */ > >> - bool ok = true; > >> - > >> - /* 4.1. check support for the operation in the loop > >> + /* 4. check support for the operation in the loop > >> > >> This isn't necessary for the lane reduction codes, since they > >> can only be produced by pattern matching, and it's up to the > >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> mixed-sign dot-products can be implemented using signed > >> dot-products. */ > >> machine_mode vec_mode = TYPE_MODE (vectype_in); > >> - if (!lane_reducing > >> - && !directly_supported_p (op.code, vectype_in, optab_vector)) > >> + if (!directly_supported_p (op.code, vectype_in, optab_vector)) > >> { > >> if (dump_enabled_p ()) > >> dump_printf (MSG_NOTE, "op not supported by target.\n"); > >> if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD) > >> || !vect_can_vectorize_without_simd_p (op.code)) > >> - ok = false; > >> + single_defuse_cycle = false; > >> else > >> if (dump_enabled_p ()) > >> dump_printf (MSG_NOTE, "proceeding using word mode.\n"); > >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> dump_printf (MSG_NOTE, "using word mode not possible.\n"); > >> return false; > >> } > >> - > >> - /* lane-reducing operations have to go through vect_transform_reduction.
> >> - For the other cases try without the single cycle optimization. */ > >> - if (!ok) > >> - { > >> - if (lane_reducing) > >> - return false; > >> - else > >> - single_defuse_cycle = false; > >> - } > >> } > >> if (dump_enabled_p () && single_defuse_cycle) > >> dump_printf_loc (MSG_NOTE, vect_location, > >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> "multiple vectors to one in the loop body\n"); > >> STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle; > >> > >> - /* If the reduction stmt is one of the patterns that have lane > >> - reduction embedded we cannot handle the case of ! single_defuse_cycle. */ > >> - if ((ncopies > 1 && ! single_defuse_cycle) > >> - && lane_reducing) > >> - { > >> - if (dump_enabled_p ()) > >> - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> - "multi def-use cycle not possible for lane-reducing " > >> - "reduction operation\n"); > >> - return false; > >> - } > >> + /* For lane-reducing operation, the below processing related to single > >> + defuse-cycle will be done in its own vectorizable function. One more > >> + thing to note is that the operation must not be involved in fold-left > >> + reduction. */ > >> + single_defuse_cycle &= !lane_reducing; > >> > >> if (slp_node > >> - && !(!single_defuse_cycle > >> - && !lane_reducing > >> - && reduction_type != FOLD_LEFT_REDUCTION)) > >> + && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION)) > >> for (i = 0; i < (int) op.num_ops; i++) > >> if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i])) > >> { > >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn, > >> reduction_type, ncopies, cost_vec); > >> /* Cost the reduction op inside the loop if transformed via > >> - vect_transform_reduction. Otherwise this is costed by the > >> - separate vectorizable_* routines.
*/ > >> - if (single_defuse_cycle || lane_reducing) > >> - { > >> - int factor = 1; > >> - if (vect_is_emulated_mixed_dot_prod (stmt_info)) > >> - /* Three dot-products and a subtraction. */ > >> - factor = 4; > >> - record_stmt_cost (cost_vec, ncopies * factor, vector_stmt, > >> - stmt_info, 0, vect_body); > >> - } > >> + vect_transform_reduction for non-lane-reducing operation. Otherwise > >> + this is costed by the separate vectorizable_* routines. */ > >> + if (single_defuse_cycle) > >> + record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body); > >> > >> if (dump_enabled_p () > >> && reduction_type == FOLD_LEFT_REDUCTION) > >> dump_printf_loc (MSG_NOTE, vect_location, > >> "using an in-order (fold-left) reduction.\n"); > >> STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type; > >> - /* All but single defuse-cycle optimized, lane-reducing and fold-left > >> - reductions go through their own vectorizable_* routines. */ > >> - if (!single_defuse_cycle > >> - && !lane_reducing > >> - && reduction_type != FOLD_LEFT_REDUCTION) > >> + > >> + /* All but single defuse-cycle optimized and fold-left reductions go > >> + through their own vectorizable_* routines. */ > >> + if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION) > >> { > >> stmt_vec_info tem > >> = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info)); > >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >> bool lane_reducing = lane_reducing_op_p (code); > >> gcc_assert (single_defuse_cycle || lane_reducing); > >> > >> + if (lane_reducing) > >> + { > >> + /* The last operand of lane-reducing op is for reduction. */ > >> + gcc_assert (reduc_index == (int) op.num_ops - 1); > >> + > >> + /* Now all lane-reducing ops are covered by some slp node.
*/ > >> + gcc_assert (slp_node); > >> + } > >> + > >> /* Create the destination vector */ > >> tree scalar_dest = gimple_get_lhs (stmt_info->stmt); > >> tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out); > >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >> reduc_index == 2 ? op.ops[2] : NULL_TREE, > >> &vec_oprnds[2]); > >> } > >> + else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1 > >> + && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ()) > >> + { > >> + /* For lane-reducing op covered by single-lane slp node, the input > >> + vectype of the reduction PHI determines copies of vectorized def-use > >> + cycles, which might be more than effective copies of vectorized lane- > >> + reducing reduction statements. This could be complemented by > >> + generating extra trivial pass-through copies. For example: > >> + > >> + int sum = 0; > >> + for (i) > >> + { > >> + sum += d0[i] * d1[i]; // dot-prod > >> + sum += abs(s0[i] - s1[i]); // sad > >> + sum += n[i]; // normal > >> + } > >> + > >> + The vector size is 128-bit, vectorization factor is 16. Reduction > >> + statements would be transformed as: > >> + > >> + vector<4> int sum_v0 = { 0, 0, 0, 0 }; > >> + vector<4> int sum_v1 = { 0, 0, 0, 0 }; > >> + vector<4> int sum_v2 = { 0, 0, 0, 0 }; > >> + vector<4> int sum_v3 = { 0, 0, 0, 0 }; > >> + > >> + for (i / 16) > >> + { > >> + sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); > >> + sum_v1 = sum_v1; // copy > >> + sum_v2 = sum_v2; // copy > >> + sum_v3 = sum_v3; // copy > >> + > >> + sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); > >> + sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); > >> + sum_v2 = sum_v2; // copy > >> + sum_v3 = sum_v3; // copy > >> + > >> + sum_v0 += n_v0[i: 0 ~ 3 ]; > >> + sum_v1 += n_v1[i: 4 ~ 7 ]; > >> + sum_v2 += n_v2[i: 8 ~ 11]; > >> + sum_v3 += n_v3[i: 12 ~ 15]; > >> + } > >> + */ > >> + unsigned using_ncopies = vec_oprnds[0].length (); > >> + unsigned reduc_ncopies = vec_oprnds[reduc_index].length (); > >> + > > > > assert reduc_ncopies >= using_ncopies? Maybe assert > > reduc_index == op.num_ops - 1 given you use one above > > and the other below? Or simply iterate till op.num_ops > > and skip i == reduc_index. > > > >> + for (unsigned i = 0; i < op.num_ops - 1; i++) > >> + { > >> + gcc_assert (vec_oprnds[i].length () == using_ncopies); > >> + vec_oprnds[i].safe_grow_cleared (reduc_ncopies); > >> + } > >> + } > >> > >> bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); > >> unsigned num = vec_oprnds[reduc_index == 0 ?
1 : 0].length (); > >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >> { > >> gimple *new_stmt; > >> tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; > >> - if (masked_loop_p && !mask_by_cond_expr) > >> + > >> + if (!vop[0] || !vop[1]) > >> + { > >> + tree reduc_vop = vec_oprnds[reduc_index][i]; > >> + > >> + /* Insert trivial copy if no need to generate vectorized > >> + statement. */ > >> + gcc_assert (reduc_vop); > >> + > >> + new_stmt = gimple_build_assign (vec_dest, reduc_vop); > >> + new_temp = make_ssa_name (vec_dest, new_stmt); > >> + gimple_set_lhs (new_stmt, new_temp); > >> + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > > > > I think you could simply do > > > > slp_node->push_vec_def (reduc_vop); > > continue; > > > > without any code generation. > > > > OK, that would be easy. Here comes another question, this patch assumes > lane-reducing op would always be contained in a slp node, since single-lane > slp node feature has been enabled. But I got some regression if I enforced > such constraint on lane-reducing op check. Those cases are founded to > be unvectorizable with single-lane slp, so this should not be what we want? > and need to be fixed? Yes, in the end we need to chase down all unsupported cases and fix them (there's known issues with load permutes, I'm working on that - hopefully when finding a continuous stretch of time...). > > >> + } > >> + else if (masked_loop_p && !mask_by_cond_expr) > >> { > >> /* No conditional ifns have been defined for lane-reducing op > >> yet.
*/ > >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >> > >> if (masked_loop_p && mask_by_cond_expr) > >> { > >> + tree stmt_vectype_in = vectype_in; > >> + unsigned nvectors = vec_num * ncopies; > >> + > >> + if (lane_reducing && SLP_TREE_LANES (slp_node) == 1) > >> + { > >> + /* Input vectype of the reduction PHI may be defferent from > > > > different > > > >> + that of lane-reducing operation. */ > >> + stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); > >> + nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in); > > > > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS. > > To partially vectorizing a dot_prod<16 * char> with 128-bit vector width, > we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1, vectype=<16 *char>) > to vect_get_loop_mask? Probably - it depends on the vectorization factor. What I wanted to point out is that vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong. The place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted, or we should forgo with it (but that's possibly a post-only-SLP cleanup to be done). See vect_slp_analyze_node_operations_1 where that's computed. For reductions it's probably not quite right (and we might have latent issues like those you are "fixing" with code like above). The order we analyze stmts might also be not optimal for reductions with SLP - in fact given that stmt analysis relies on a fixed VF it would probably make sense to determine the reduction VF in advance as well. But again this sounds like post-only-SLP cleanup opportunities. In the end I might suggest to always use reduct-VF and vectype to determine the number of vector stmts rather than computing ncopies/vec_num separately. Richard.
> Thanks, > Feng > > > ________________________________________ > From: Richard Biener > Sent: Thursday, June 20, 2024 8:26 PM > To: Feng Xue OS > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440] > > On Sun, Jun 16, 2024 at 9:31 AM Feng Xue OS wrote: > > > > For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current > > vectorizer could only handle the pattern if the reduction chain does not > > contain other operation, no matter the other is normal or lane-reducing. > > > > Actually, to allow multiple arbitrary lane-reducing operations, we need to > > support vectorization of loop reduction chain with mixed input vectypes. Since > > lanes of vectype may vary with operation, the effective ncopies of vectorized > > statements for operation also may not be same to each other, this causes > > mismatch on vectorized def-use cycles. A simple way is to align all operations > > with the one that has the most ncopies, the gap could be complemented by > > generating extra trivial pass-through copies. For example: > > > > int sum = 0; > > for (i) > > { > > sum += d0[i] * d1[i]; // dot-prod > > sum += w[i]; // widen-sum > > sum += abs(s0[i] - s1[i]); // sad > > sum += n[i]; // normal > > } > > > > The vector size is 128-bit, vectorization factor is 16.
Reduction statements > > would be transformed as: > > > > vector<4> int sum_v0 = { 0, 0, 0, 0 }; > > vector<4> int sum_v1 = { 0, 0, 0, 0 }; > > vector<4> int sum_v2 = { 0, 0, 0, 0 }; > > vector<4> int sum_v3 = { 0, 0, 0, 0 }; > > > > for (i / 16) > > { > > sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); > > sum_v1 = sum_v1; // copy > > sum_v2 = sum_v2; // copy > > sum_v3 = sum_v3; // copy > > > > sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); > > sum_v1 = sum_v1; // copy > > sum_v2 = sum_v2; // copy > > sum_v3 = sum_v3; // copy > > > > sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); > > sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); > > sum_v2 = sum_v2; // copy > > sum_v3 = sum_v3; // copy > > > > sum_v0 += n_v0[i: 0 ~ 3 ]; > > sum_v1 += n_v1[i: 4 ~ 7 ]; > > sum_v2 += n_v2[i: 8 ~ 11]; > > sum_v3 += n_v3[i: 12 ~ 15]; > > } > > > > Thanks, > > Feng > > > > --- > > gcc/ > > PR tree-optimization/114440 > > * tree-vectorizer.h (vectorizable_lane_reducing): New function > > declaration. > > * tree-vect-stmts.cc (vect_analyze_stmt): Call new function > > vectorizable_lane_reducing to analyze lane-reducing operation. > > * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation > > code related to emulated_mixed_dot_prod. > > (vect_reduction_update_partial_vector_usage): Compute ncopies as the > > original means for single-lane slp node. > > (vectorizable_lane_reducing): New function. > > (vectorizable_reduction): Allow multiple lane-reducing operations in > > loop reduction. Move some original lane-reducing related code to > > vectorizable_lane_reducing. > > (vect_transform_reduction): Extend transformation to support reduction > > statements with mixed input vectypes.
> > > > gcc/testsuite/ > > PR tree-optimization/114440 > > * gcc.dg/vect/vect-reduc-chain-1.c > > * gcc.dg/vect/vect-reduc-chain-2.c > > * gcc.dg/vect/vect-reduc-chain-3.c > > * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c > > * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c > > * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c > > * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c > > * gcc.dg/vect/vect-reduc-dot-slp-1.c > > --- > > .../gcc.dg/vect/vect-reduc-chain-1.c | 62 ++++ > > .../gcc.dg/vect/vect-reduc-chain-2.c | 77 +++++ > > .../gcc.dg/vect/vect-reduc-chain-3.c | 66 ++++ > > .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c | 95 +++++ > > .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c | 67 ++++ > > .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c | 79 +++++ > > .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c | 63 ++++ > > .../gcc.dg/vect/vect-reduc-dot-slp-1.c | 35 ++ > > gcc/tree-vect-loop.cc | 324 ++++++++++++++---- > > gcc/tree-vect-stmts.cc | 2 + > > gcc/tree-vectorizer.h | 2 + > > 11 files changed, 802 insertions(+), 70 deletions(-) > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > > new file mode 100644 > > index 00000000000..04bfc419dbd > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > > @@ -0,0 +1,62 @@ > > +/* Disabling epilogues until we find a better way to deal with scans.
*/ > > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#define N 50 > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *restrict a, > > + SIGNEDNESS_2 char *restrict b, > > + SIGNEDNESS_2 char *restrict c, > > + SIGNEDNESS_2 char *restrict d, > > + SIGNEDNESS_1 int *restrict e) > > +{ > > + for (int i = 0; i < N; ++i) > > + { > > + res += a[i] * b[i]; > > + res += c[i] * d[i]; > > + res += e[i]; > > + } > > + return res; > > +} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[N], b[N]; > > + SIGNEDNESS_2 char c[N], d[N]; > > + SIGNEDNESS_1 int e[N]; > > + int expected = 0x12345; > > + for (int i = 0; i < N; ++i) > > + { > > + a[i] = BASE + i * 5; > > + b[i] = BASE + OFFSET + i * 4; > > + c[i] = BASE + i * 2; > > + d[i] = BASE + OFFSET + i * 3; > > + e[i] = i; > > + asm volatile ("" ::: "memory"); > > + expected += a[i] * b[i]; > > + expected += c[i] * d[i]; > > + expected += e[i]; > > + } > > + if (f (0x12345, a, b, c, d, e) != expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ > > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > > new file mode 100644 > > index 00000000000..6c803b80120 >
> --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > > @@ -0,0 +1,77 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. = */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#define N 50 > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 unsigned > > +#define SIGNEDNESS_3 signed > > +#define SIGNEDNESS_4 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +fn (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *restrict a, > > + SIGNEDNESS_2 char *restrict b, > > + SIGNEDNESS_3 char *restrict c, > > + SIGNEDNESS_3 char *restrict d, > > + SIGNEDNESS_4 short *restrict e, > > + SIGNEDNESS_4 short *restrict f, > > + SIGNEDNESS_1 int *restrict g) > > +{ > > + for (int i =3D 0; i < N; ++i) > > + { > > + res +=3D a[i] * b[i]; > > + res +=3D i + 1; > > + res +=3D c[i] * d[i]; > > + res +=3D e[i] * f[i]; > > + res +=3D g[i]; > > + } > > + return res; > > +} > > + > > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4) > > +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? 
-1026 : 373) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[N], b[N]; > > + SIGNEDNESS_3 char c[N], d[N]; > > + SIGNEDNESS_4 short e[N], f[N]; > > + SIGNEDNESS_1 int g[N]; > > + int expected =3D 0x12345; > > + for (int i =3D 0; i < N; ++i) > > + { > > + a[i] =3D BASE2 + i * 5; > > + b[i] =3D BASE2 + OFFSET + i * 4; > > + c[i] =3D BASE3 + i * 2; > > + d[i] =3D BASE3 + OFFSET + i * 3; > > + e[i] =3D BASE4 + i * 6; > > + f[i] =3D BASE4 + OFFSET + i * 5; > > + g[i] =3D i; > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[i] * b[i]; > > + expected +=3D i + 1; > > + expected +=3D c[i] * d[i]; > > + expected +=3D e[i] * f[i]; > > + expected +=3D g[i]; > > + } > > + if (fn (0x12345, a, b, c, d, e, f, g) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PR= OD_EXPR" "vect" { target { vect_sdot_qi } } } } */ > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PR= OD_EXPR" "vect" { target { vect_udot_qi } } } } */ > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PR= OD_EXPR" "vect" { target { vect_sdot_hi } } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/tests= uite/gcc.dg/vect/vect-reduc-chain-3.c > > new file mode 100644 > > index 00000000000..a41e4b176c4 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c > > @@ -0,0 +1,66 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. 
= */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +#include "tree-vect.h" > > + > > +#define N 50 > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 unsigned > > +#define SIGNEDNESS_3 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *restrict a, > > + SIGNEDNESS_2 char *restrict b, > > + SIGNEDNESS_3 short *restrict c, > > + SIGNEDNESS_3 short *restrict d, > > + SIGNEDNESS_1 int *restrict e) > > +{ > > + for (int i =3D 0; i < N; ++i) > > + { > > + short diff =3D a[i] - b[i]; > > + SIGNEDNESS_2 short abs =3D diff < 0 ? -diff : diff; > > + res +=3D abs; > > + res +=3D c[i] * d[i]; > > + res +=3D e[i]; > > + } > > + return res; > > +} > > + > > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[N], b[N]; > > + SIGNEDNESS_3 short c[N], d[N]; > > + SIGNEDNESS_1 int e[N]; > > + int expected =3D 0x12345; > > + for (int i =3D 0; i < N; ++i) > > + { > > + a[i] =3D BASE2 + i * 5; > > + b[i] =3D BASE2 - i * 4; > > + c[i] =3D BASE3 + i * 2; > > + d[i] =3D BASE3 + OFFSET + i * 3; > > + e[i] =3D i; > > + asm volatile ("" ::: "memory"); > > + short diff =3D a[i] - b[i]; > > + SIGNEDNESS_2 short abs =3D diff < 0 ? 
-diff : diff; > > + expected +=3D abs; > > + expected +=3D c[i] * d[i]; > > + expected +=3D e[i]; > > + } > > + if (f (0x12345, a, b, c, d, e) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D SAD_EX= PR" "vect" { target vect_udot_qi } } } */ > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PR= OD_EXPR" "vect" { target vect_sdot_hi } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/g= cc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c > > new file mode 100644 > > index 00000000000..c2831fbcc8e > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c > > @@ -0,0 +1,95 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. = */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *a, > > + SIGNEDNESS_2 char *b, > > + int step, int n) > > +{ > > + for (int i =3D 0; i < n; i++) > > + { > > + res +=3D a[0] * b[0]; > > + res +=3D a[1] * b[1]; > > + res +=3D a[2] * b[2]; > > + res +=3D a[3] * b[3]; > > + res +=3D a[4] * b[4]; > > + res +=3D a[5] * b[5]; > > + res +=3D a[6] * b[6]; > > + res +=3D a[7] * b[7]; > > + res +=3D a[8] * b[8]; > > + res +=3D a[9] * b[9]; > > + res +=3D a[10] * b[10]; > > + res +=3D a[11] * b[11]; > > + res +=3D a[12] * b[12]; > > + res +=3D a[13] * b[13]; > > + res +=3D a[14] * b[14]; > > + res +=3D a[15] * b[15]; > > + > > + a +=3D step; > > + b +=3D step; > > + } > > + > > + return res; > > 
+} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[100], b[100]; > > + int expected =3D 0x12345; > > + int step =3D 16; > > + int n =3D 2; > > + int t =3D 0; > > + > > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i) > > + { > > + a[i] =3D BASE + i * 5; > > + b[i] =3D BASE + OFFSET + i * 4; > > + asm volatile ("" ::: "memory"); > > + } > > + > > + for (int i =3D 0; i < n; i++) > > + { > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[t + 0] * b[t + 0]; > > + expected +=3D a[t + 1] * b[t + 1]; > > + expected +=3D a[t + 2] * b[t + 2]; > > + expected +=3D a[t + 3] * b[t + 3]; > > + expected +=3D a[t + 4] * b[t + 4]; > > + expected +=3D a[t + 5] * b[t + 5]; > > + expected +=3D a[t + 6] * b[t + 6]; > > + expected +=3D a[t + 7] * b[t + 7]; > > + expected +=3D a[t + 8] * b[t + 8]; > > + expected +=3D a[t + 9] * b[t + 9]; > > + expected +=3D a[t + 10] * b[t + 10]; > > + expected +=3D a[t + 11] * b[t + 11]; > > + expected +=3D a[t + 12] * b[t + 12]; > > + expected +=3D a[t + 13] * b[t + 13]; > > + expected +=3D a[t + 14] * b[t + 14]; > > + expected +=3D a[t + 15] * b[t + 15]; > > + t +=3D step; > > + } > > + > > + if (f (0x12345, a, b, step, n) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } = } */ > > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D = DOT_PROD_EXPR" 16 "vect" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/g= cc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c > > new file mode 100644 > > index 00000000000..4114264a364 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c > > @@ -0,0 +1,67 @@ > > +/* Disabling epilogues until we find a better way to deal 
with scans. = */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *a, > > + SIGNEDNESS_2 char *b, > > + int n) > > +{ > > + for (int i =3D 0; i < n; i++) > > + { > > + res +=3D a[5 * i + 0] * b[5 * i + 0]; > > + res +=3D a[5 * i + 1] * b[5 * i + 1]; > > + res +=3D a[5 * i + 2] * b[5 * i + 2]; > > + res +=3D a[5 * i + 3] * b[5 * i + 3]; > > + res +=3D a[5 * i + 4] * b[5 * i + 4]; > > + } > > + > > + return res; > > +} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[100], b[100]; > > + int expected =3D 0x12345; > > + int n =3D 18; > > + > > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i) > > + { > > + a[i] =3D BASE + i * 5; > > + b[i] =3D BASE + OFFSET + i * 4; > > + asm volatile ("" ::: "memory"); > > + } > > + > > + for (int i =3D 0; i < n; i++) > > + { > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[5 * i + 0] * b[5 * i + 0]; > > + expected +=3D a[5 * i + 1] * b[5 * i + 1]; > > + expected +=3D a[5 * i + 2] * b[5 * i + 2]; > > + expected +=3D a[5 * i + 3] * b[5 * i + 3]; > > + expected +=3D a[5 * i + 4] * b[5 * i + 4]; > > + } > > + > > + if (f (0x12345, a, b, n) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } = } */ > > +/* { dg-final { scan-tree-dump-times 
"vectorizing statement: \\S+ =3D = DOT_PROD_EXPR" 5 "vect" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/g= cc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c > > new file mode 100644 > > index 00000000000..2cdecc36d16 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c > > @@ -0,0 +1,79 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. = */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 short *a, > > + SIGNEDNESS_2 short *b, > > + int step, int n) > > +{ > > + for (int i =3D 0; i < n; i++) > > + { > > + res +=3D a[0] * b[0]; > > + res +=3D a[1] * b[1]; > > + res +=3D a[2] * b[2]; > > + res +=3D a[3] * b[3]; > > + res +=3D a[4] * b[4]; > > + res +=3D a[5] * b[5]; > > + res +=3D a[6] * b[6]; > > + res +=3D a[7] * b[7]; > > + > > + a +=3D step; > > + b +=3D step; > > + } > > + > > + return res; > > +} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? 
-1026 : 373) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 short a[100], b[100]; > > + int expected =3D 0x12345; > > + int step =3D 8; > > + int n =3D 2; > > + int t =3D 0; > > + > > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i) > > + { > > + a[i] =3D BASE + i * 5; > > + b[i] =3D BASE + OFFSET + i * 4; > > + asm volatile ("" ::: "memory"); > > + } > > + > > + for (int i =3D 0; i < n; i++) > > + { > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[t + 0] * b[t + 0]; > > + expected +=3D a[t + 1] * b[t + 1]; > > + expected +=3D a[t + 2] * b[t + 2]; > > + expected +=3D a[t + 3] * b[t + 3]; > > + expected +=3D a[t + 4] * b[t + 4]; > > + expected +=3D a[t + 5] * b[t + 5]; > > + expected +=3D a[t + 6] * b[t + 6]; > > + expected +=3D a[t + 7] * b[t + 7]; > > + t +=3D step; > > + } > > + > > + if (f (0x12345, a, b, step, n) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } = } */ > > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D = DOT_PROD_EXPR" 8 "vect" { target vect_sdot_hi } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/g= cc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c > > new file mode 100644 > > index 00000000000..32c0f30c77b > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c > > @@ -0,0 +1,63 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. 
= */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 short *a, > > + SIGNEDNESS_2 short *b, > > + int n) > > +{ > > + for (int i =3D 0; i < n; i++) > > + { > > + res +=3D a[3 * i + 0] * b[3 * i + 0]; > > + res +=3D a[3 * i + 1] * b[3 * i + 1]; > > + res +=3D a[3 * i + 2] * b[3 * i + 2]; > > + } > > + > > + return res; > > +} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 short a[100], b[100]; > > + int expected =3D 0x12345; > > + int n =3D 18; > > + > > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i) > > + { > > + a[i] =3D BASE + i * 5; > > + b[i] =3D BASE + OFFSET + i * 4; > > + asm volatile ("" ::: "memory"); > > + } > > + > > + for (int i =3D 0; i < n; i++) > > + { > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[3 * i + 0] * b[3 * i + 0]; > > + expected +=3D a[3 * i + 1] * b[3 * i + 1]; > > + expected +=3D a[3 * i + 2] * b[3 * i + 2]; > > + } > > + > > + if (f (0x12345, a, b, n) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } = } */ > > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D = DOT_PROD_EXPR" 3 "vect" { target vect_sdot_hi } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/tes= 
tsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > > new file mode 100644 > > index 00000000000..e17d6291f75 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > > @@ -0,0 +1,35 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. = */ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res0, > > + SIGNEDNESS_1 int res1, > > + SIGNEDNESS_1 int res2, > > + SIGNEDNESS_1 int res3, > > + SIGNEDNESS_2 short *a, > > + SIGNEDNESS_2 short *b) > > +{ > > + for (int i =3D 0; i < 64; i +=3D 4) > > + { > > + res0 +=3D a[i + 0] * b[i + 0]; > > + res1 +=3D a[i + 1] * b[i + 1]; > > + res2 +=3D a[i + 2] * b[i + 2]; > > + res3 +=3D a[i + 3] * b[i + 3]; > > + } > > + > > + return res0 ^ res1 ^ res2 ^ res3; > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect= " } } */ > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > index e0561feddce..6d91665a341 100644 > > --- a/gcc/tree-vect-loop.cc > > +++ b/gcc/tree-vect-loop.cc > > @@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vin= fo, > > if (!gimple_extract_op (orig_stmt_info->stmt, &op)) > > gcc_unreachable (); > > > > - bool emulated_mixed_dot_prod =3D vect_is_emulated_mixed_dot_prod (st= mt_info); > > - > > if (reduction_type =3D=3D EXTRACT_LAST_REDUCTION) > > /* No extra instructions are needed in the prologue. 
The loop bod= y > > operations are costed in vectorizable_condition. */ > > @@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vi= nfo, > > initial result of the data reduction, initial value of the i= ndex > > reduction. */ > > prologue_stmts =3D 4; > > - else if (emulated_mixed_dot_prod) > > - /* We need the initial reduction value and two invariants: > > - one that contains the minimum signed value and one that > > - contains half of its negative. */ > > - prologue_stmts =3D 3; > > else > > + /* We need the initial reduction value. */ > > prologue_stmts =3D 1; > > prologue_cost +=3D record_stmt_cost (cost_vec, prologue_stmts, > > scalar_to_vec, stmt_info, 0, > > @@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_= vec_info loop_vinfo, > > vec_loop_lens *lens =3D &LOOP_VINFO_LENS (loop_vinfo); > > unsigned nvectors; > > > > - if (slp_node) > > + if (slp_node && SLP_TREE_LANES (slp_node) > 1) > > Hmm, that looks wrong. It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off > instead, which is bad. > > > nvectors =3D SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > > else > > nvectors =3D vect_get_num_copies (loop_vinfo, vectype_in); > > @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loo= p_vec_info loop_vinfo, > > } > > } > > > > +/* Check if STMT_INFO is a lane-reducing operation that can be vectori= zed in > > + the context of LOOP_VINFO, and vector cost will be recorded in COST= _VEC. > > + Now there are three such kinds of operations: dot-prod/widen-sum/sa= d > > + (sum-of-absolute-differences). > > + > > + For a lane-reducing operation, the loop reduction path that it lies= in, > > + may contain normal operation, or other lane-reducing operation of d= ifferent > > + input type size, an example as: > > + > > + int sum =3D 0; > > + for (i) > > + { > > + ... > > + sum +=3D d0[i] * d1[i]; // dot-prod > > + sum +=3D w[i]; // widen-sum > > + sum +=3D abs(s0[i] - s1[i]); // sad > > + sum +=3D n[i]; // normal > > + ... 
> > + } > > + > > + Vectorization factor is essentially determined by operation whose i= nput > > + vectype has the most lanes ("vector(16) char" in the example), whil= e we > > + need to choose input vectype with the least lanes ("vector(4) int" = in the > > + example) for the reduction PHI statement. */ > > + > > +bool > > +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info st= mt_info, > > + slp_tree slp_node, stmt_vector_for_cost *co= st_vec) > > +{ > > + gimple *stmt =3D stmt_info->stmt; > > + > > + if (!lane_reducing_stmt_p (stmt)) > > + return false; > > + > > + tree type =3D TREE_TYPE (gimple_assign_lhs (stmt)); > > + > > + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type)) > > + return false; > > + > > + /* Do not try to vectorize bit-precision reductions. */ > > + if (!type_has_mode_precision_p (type)) > > + return false; > > + > > + if (!slp_node) > > + return false; > > + > > + for (int i =3D 0; i < (int) gimple_num_ops (stmt) - 1; i++) > > + { > > + stmt_vec_info def_stmt_info; > > + slp_tree slp_op; > > + tree op; > > + tree vectype; > > + enum vect_def_type dt; > > + > > + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op= , > > + &slp_op, &dt, &vectype, &def_stmt_info)) > > + { > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > + "use not simple.\n"); > > + return false; > > + } > > + > > + if (!vectype) > > + { > > + vectype =3D get_vectype_for_scalar_type (loop_vinfo, TREE_TYP= E (op), > > + slp_op); > > + if (!vectype) > > + return false; > > + } > > + > > + if (!vect_maybe_update_slp_op_vectype (slp_op, vectype)) > > + { > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > + "incompatible vector types for invariants\= n"); > > + return false; > > + } > > + > > + if (i =3D=3D STMT_VINFO_REDUC_IDX (stmt_info)) > > + continue; > > + > > + /* There should be at most one cycle def in the stmt. 
*/
> > +      if (VECTORIZABLE_CYCLE_DEF (dt))
> > +       return false;
> > +    }
> > +
> > +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> > +
> > +  /* TODO: Support lane-reducing operation that does not directly participate
> > +     in loop reduction. */
> > +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> > +    return false;
> > +
> > +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> > +     recognized. */
> > +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> > +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> > +
> > +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +  int ncopies_for_cost;
> > +
> > +  if (SLP_TREE_LANES (slp_node) > 1)
> > +    {
> > +      /* Now lane-reducing operations in a non-single-lane slp node should only
> > +        come from the same loop reduction path. */
> > +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> > +      ncopies_for_cost = 1;
> > +    }
> > +  else
> > +    {
> > +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
>
> OK, so the fact that the ops are lane-reducing means they effectively
> change the VF for the result.  That's only possible because we tightly
> control code generation and "adjust" to the expected VF (by inserting the
> copies you mentioned above), but only up to the highest number of outputs
> created in the reduction chain.  In that sense, instead of talking about and
> recording "input vector types", wouldn't it make more sense to record the
> effective vectorization factor for the reduction instance?  That VF would
> be at most the loop's VF but could be as low as 1.  Once we have a
> non-lane-reducing operation in the reduction chain it would always be
> equal to the loop's VF.
>
> ncopies would then always be determined by that reduction-instance VF and
> the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction-instance
> VF would also trivially indicate the force-single-def-use-cycle case,
> possibly simplifying the code?
>
> > +      gcc_assert (ncopies_for_cost >= 1);
> > +    }
> > +
> > +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> > +    {
> > +      /* We need extra two invariants: one that contains the minimum signed
> > +        value and one that contains half of its negative. */
> > +      int prologue_stmts = 2;
> > +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> > +                                       scalar_to_vec, stmt_info, 0,
> > +                                       vect_prologue);
> > +      if (dump_enabled_p ())
> > +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> > +                    "extra prologue_cost = %d .\n", cost);
> > +
> > +      /* Three dot-products and a subtraction. */
> > +      ncopies_for_cost *= 4;
> > +    }
> > +
> > +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> > +                   vect_body);
> > +
> > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> > +    {
> > +      enum tree_code code = gimple_assign_rhs_code (stmt);
> > +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> > +                                                 slp_node, code, type,
> > +                                                 vectype_in);
> > +    }
> > +
>
> Add a comment:
>
>   /* Transform via vect_transform_reduction.  */
>
> > +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> > +  return true;
> > +}
> > +
> >  /* Function vectorizable_reduction.
> >
> >     Check if STMT_INFO performs a reduction operation that can be vectorized.
> > @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    if (!type_has_mode_precision_p (op.type))
> >      return false;
> >
> > -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> > -     which means the only use of that may be in the lane-reducing operation.
*/ > > - if (lane_reducing > > - && reduc_chain_length !=3D 1 > > - && !only_slp_reduc_chain) > > - { > > - if (dump_enabled_p ()) > > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > - "lane-reducing reduction with extra stmts.\n")= ; > > - return false; > > - } > > - > > /* Lane-reducing ops also never can be used in a SLP reduction group > > since we'll mix lanes belonging to different reductions. But it'= s > > OK to use them in a reduction chain or when the reduction group > > @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinf= o, > > && loop_vinfo->suggested_unroll_factor =3D=3D 1) > > single_defuse_cycle =3D true; > > > > - if (single_defuse_cycle || lane_reducing) > > + if (single_defuse_cycle && !lane_reducing) > > If there's also a non-lane-reducing plus in the chain don't we have to > check for that reduction op? So shouldn't it be > single_defuse_cycle && ... fact that we don't record > (non-lane-reducing op there) ... > > > { > > gcc_assert (op.code !=3D COND_EXPR); > > > > - /* 4. Supportable by target? */ > > - bool ok =3D true; > > - > > - /* 4.1. check support for the operation in the loop > > + /* 4. check support for the operation in the loop > > > > This isn't necessary for the lane reduction codes, since they > > can only be produced by pattern matching, and it's up to the > > @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinf= o, > > mixed-sign dot-products can be implemented using signed > > dot-products. 
*/ > > machine_mode vec_mode =3D TYPE_MODE (vectype_in); > > - if (!lane_reducing > > - && !directly_supported_p (op.code, vectype_in, optab_vector)) > > + if (!directly_supported_p (op.code, vectype_in, optab_vector)) > > { > > if (dump_enabled_p ()) > > dump_printf (MSG_NOTE, "op not supported by target.\n"); > > if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD) > > || !vect_can_vectorize_without_simd_p (op.code)) > > - ok =3D false; > > + single_defuse_cycle =3D false; > > else > > if (dump_enabled_p ()) > > dump_printf (MSG_NOTE, "proceeding using word mode.\n"); > > @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo= , > > dump_printf (MSG_NOTE, "using word mode not possible.\n"); > > return false; > > } > > - > > - /* lane-reducing operations have to go through vect_transform_re= duction. > > - For the other cases try without the single cycle optimization= . */ > > - if (!ok) > > - { > > - if (lane_reducing) > > - return false; > > - else > > - single_defuse_cycle =3D false; > > - } > > } > > if (dump_enabled_p () && single_defuse_cycle) > > dump_printf_loc (MSG_NOTE, vect_location, > > @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinf= o, > > "multiple vectors to one in the loop body\n"); > > STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) =3D single_defuse_cycle; > > > > - /* If the reduction stmt is one of the patterns that have lane > > - reduction embedded we cannot handle the case of ! single_defuse_c= ycle. */ > > - if ((ncopies > 1 && ! single_defuse_cycle) > > - && lane_reducing) > > - { > > - if (dump_enabled_p ()) > > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > - "multi def-use cycle not possible for lane-red= ucing " > > - "reduction operation\n"); > > - return false; > > - } > > + /* For lane-reducing operation, the below processing related to sing= le > > + defuse-cycle will be done in its own vectorizable function. 
One more
> > +     thing to note is that the operation must not be involved in fold-left
> > +     reduction.  */
> >    single_defuse_cycle &= !lane_reducing;
> >
> >    if (slp_node
> > -      && !(!single_defuse_cycle
> > -          && !lane_reducing
> > -          && reduction_type != FOLD_LEFT_REDUCTION))
> > +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >      for (i = 0; i < (int) op.num_ops; i++)
> >        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >         {
> > @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >                              reduction_type, ncopies, cost_vec);
> >    /* Cost the reduction op inside the loop if transformed via
> > -     vect_transform_reduction.  Otherwise this is costed by the
> > -     separate vectorizable_* routines.  */
> > -  if (single_defuse_cycle || lane_reducing)
> > -    {
> > -      int factor = 1;
> > -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> > -       /* Three dot-products and a subtraction.  */
> > -       factor = 4;
> > -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> > -                       stmt_info, 0, vect_body);
> > -    }
> > +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> > +     this is costed by the separate vectorizable_* routines.  */
> > +  if (single_defuse_cycle)
> > +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >
> >    if (dump_enabled_p ()
> >        && reduction_type == FOLD_LEFT_REDUCTION)
> >      dump_printf_loc (MSG_NOTE, vect_location,
> >                      "using an in-order (fold-left) reduction.\n");
> >    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> > -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> > -     reductions go through their own vectorizable_* routines.  */
> > -  if (!single_defuse_cycle
> > -      && !lane_reducing
> > -      && reduction_type != FOLD_LEFT_REDUCTION)
> > +
> > +  /* All but single defuse-cycle optimized and fold-left reductions go
> > +     through their own vectorizable_* routines.  */
> > +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >      {
> >        stmt_vec_info tem
> >         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> > @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >    bool lane_reducing = lane_reducing_op_p (code);
> >    gcc_assert (single_defuse_cycle || lane_reducing);
> >
> > +  if (lane_reducing)
> > +    {
> > +      /* The last operand of lane-reducing op is for reduction.  */
> > +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> > +
> > +      /* Now all lane-reducing ops are covered by some slp node.  */
> > +      gcc_assert (slp_node);
> > +    }
> > +
> >    /* Create the destination vector  */
> >    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> > @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >                              reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >                              &vec_oprnds[2]);
> >      }
> > +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> > +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> > +    {
> > +      /* For lane-reducing op covered by single-lane slp node, the input
> > +        vectype of the reduction PHI determines copies of vectorized def-use
> > +        cycles, which might be more than effective copies of vectorized lane-
> > +        reducing reduction statements.  This could be complemented by
> > +        generating extra trivial pass-through copies.  For example:
> > +
> > +          int sum = 0;
> > +          for (i)
> > +            {
> > +              sum += d0[i] * d1[i];       // dot-prod
> > +              sum += abs(s0[i] - s1[i]);  // sad
> > +              sum += n[i];                // normal
> > +            }
> > +
> > +        The vector size is 128-bit, vectorization factor is 16.  Reduction
> > +        statements would be transformed as:
> > +
> > +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> > +
> > +          for (i / 16)
> > +            {
> > +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> > +              sum_v1 = sum_v1;  // copy
> > +              sum_v2 = sum_v2;  // copy
> > +              sum_v3 = sum_v3;  // copy
> > +
> > +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> > +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> > +              sum_v2 = sum_v2;  // copy
> > +              sum_v3 = sum_v3;  // copy
> > +
> > +              sum_v0 += n_v0[i: 0 ~ 3 ];
> > +              sum_v1 += n_v1[i: 4 ~ 7 ];
> > +              sum_v2 += n_v2[i: 8 ~ 11];
> > +              sum_v3 += n_v3[i: 12 ~ 15];
> > +            }
> > +       */
> > +      unsigned using_ncopies = vec_oprnds[0].length ();
> > +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
>
> assert reduc_ncopies >= using_ncopies?  Maybe assert
> reduc_index == op.num_ops - 1 given you use one above
> and the other below?  Or simply iterate till op.num_ops
> and skip i == reduc_index.
>
> > +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> > +       {
> > +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> > +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> > +       }
> > +    }
> >
> >    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> > @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >      {
> >        gimple *new_stmt;
> >        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> > -      if (masked_loop_p && !mask_by_cond_expr)
> > +
> > +      if (!vop[0] || !vop[1])
> > +       {
> > +         tree reduc_vop = vec_oprnds[reduc_index][i];
> > +
> > +         /* Insert trivial copy if no need to generate vectorized
> > +            statement.  */
> > +         gcc_assert (reduc_vop);
> > +
> > +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> > +         new_temp = make_ssa_name (vec_dest, new_stmt);
> > +         gimple_set_lhs (new_stmt, new_temp);
> > +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
>
> I think you could simply do
>
>   slp_node->push_vec_def (reduc_vop);
>   continue;
>
> without any code generation.
>
> > +       }
> > +      else if (masked_loop_p && !mask_by_cond_expr)
> >         {
> >           /* No conditional ifns have been defined for lane-reducing op
> >              yet.  */
> > @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >
> >        if (masked_loop_p && mask_by_cond_expr)
> >         {
> > +         tree stmt_vectype_in = vectype_in;
> > +         unsigned nvectors = vec_num * ncopies;
> > +
> > +         if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> > +           {
> > +             /* Input vectype of the reduction PHI may be defferent from
>
> different
>
> > +                that of lane-reducing operation.  */
> > +             stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +             nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
>
> I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
>
> Otherwise the patch looks good to me.
>
> Richard.
>
> > +           }
> > +
> >           tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> > -                                         vec_num * ncopies, vectype_in, i);
> > +                                         nvectors, stmt_vectype_in, i);
> >           build_vect_cond_expr (code, vop, mask, gsi);
> >         }
> >
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index ca6052662a3..1b73ef01ade 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
> >                                   NULL, NULL, node, cost_vec)
> >           || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> >           || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> > +         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
> > +                                        stmt_info, node, cost_vec)
> >           || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
> >                                      node, node_instance, cost_vec)
> >           || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index 60224f4e284..94736736dcc 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
> >  extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
> >                                          slp_tree, slp_instance, int,
> >                                          bool, stmt_vector_for_cost *);
> > +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
> > +                                        slp_tree, stmt_vector_for_cost *);
> >  extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
> >                                     slp_tree, slp_instance,
> >                                     stmt_vector_for_cost *);
> > --
> > 2.17.1