From: Richard Biener
Date: Tue, 25 Jun 2024 12:26:03 +0200
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
To: Feng Xue OS
Cc: "gcc-patches@gcc.gnu.org"

On Tue, Jun 25, 2024 at 11:32 AM Feng Xue OS
wrote:
>
> >>
> >> >> -  if (slp_node)
> >> >> +  if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> >> >
> >> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> >> > instead, which is bad.
> >> >
> >> >>      nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >> >>    else
> >> >>      nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >> >>      }
> >> >>  }
> >> >>
> >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> >> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> >> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> >> >> +   (sum-of-absolute-differences).
> >> >> +
> >> >> +   For a lane-reducing operation, the loop reduction path that it lies in,
> >> >> +   may contain normal operation, or other lane-reducing operation of different
> >> >> +   input type size, an example as:
> >> >> +
> >> >> +     int sum = 0;
> >> >> +     for (i)
> >> >> +       {
> >> >> +         ...
> >> >> +         sum += d0[i] * d1[i];      // dot-prod
> >> >> +         sum += w[i];               // widen-sum
> >> >> +         sum += abs(s0[i] - s1[i]); // sad
> >> >> +         sum += n[i];               // normal
> >> >> +         ...
> >> >> +       }
> >> >> +
> >> >> +   Vectorization factor is essentially determined by operation whose input
> >> >> +   vectype has the most lanes ("vector(16) char" in the example), while we
> >> >> +   need to choose input vectype with the least lanes ("vector(4) int" in the
> >> >> +   example) for the reduction PHI statement.  */
> >> >> +
> >> >> +bool
> >> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> >> >> +                            slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> >> >> +{
> >> >> +  gimple *stmt = stmt_info->stmt;
> >> >> +
> >> >> +  if (!lane_reducing_stmt_p (stmt))
> >> >> +    return false;
> >> >> +
> >> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> >> >> +
> >> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> >> >> +    return false;
> >> >> +
> >> >> +  /* Do not try to vectorize bit-precision reductions.  */
> >> >> +  if (!type_has_mode_precision_p (type))
> >> >> +    return false;
> >> >> +
> >> >> +  if (!slp_node)
> >> >> +    return false;
> >> >> +
> >> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> >> >> +    {
> >> >> +      stmt_vec_info def_stmt_info;
> >> >> +      slp_tree slp_op;
> >> >> +      tree op;
> >> >> +      tree vectype;
> >> >> +      enum vect_def_type dt;
> >> >> +
> >> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> >> >> +                               &slp_op, &dt, &vectype, &def_stmt_info))
> >> >> +        {
> >> >> +          if (dump_enabled_p ())
> >> >> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                             "use not simple.\n");
> >> >> +          return false;
> >> >> +        }
> >> >> +
> >> >> +      if (!vectype)
> >> >> +        {
> >> >> +          vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> >> >> +                                                 slp_op);
> >> >> +          if (!vectype)
> >> >> +            return false;
> >> >> +        }
> >> >> +
> >> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> >> >> +        {
> >> >> +          if (dump_enabled_p ())
> >> >> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                             "incompatible vector types for invariants\n");
> >> >> +          return false;
> >> >> +        }
> >> >> +
> >> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> >> >> +        continue;
> >> >> +
> >> >> +      /* There should be at most one cycle def in the stmt.  */
> >> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> >> >> +        return false;
> >> >> +    }
> >> >> +
> >> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >> >> +
> >> >> +  /* TODO: Support lane-reducing operation that does not directly participate
> >> >> +     in loop reduction.  */
> >> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> >> >> +    return false;
> >> >> +
> >> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> >> >> +     recoginized.  */
> >> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> >> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> >> >> +
> >> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +  int ncopies_for_cost;
> >> >> +
> >> >> +  if (SLP_TREE_LANES (slp_node) > 1)
> >> >> +    {
> >> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> >> >> +         come from the same loop reduction path.  */
> >> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> >> >> +      ncopies_for_cost = 1;
> >> >> +    }
> >> >> +  else
> >> >> +    {
> >> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >
> >> > OK, so the fact that the ops are lane-reducing means they effectively
> >> > change the VF for the result.  That's only possible as we tightly control
> >> > code generation and "adjust" to the expected VF (by inserting the copies
> >> > you mentioned above), but only up to the highest number of outputs
> >> > created in the reduction chain.  In that sense instead of talking and recording
> >> > "input vector types" wouldn't it make more sense to record the effective
> >> > vectorization factor for the reduction instance?  That VF would be at most
> >> > the loops VF but could be as low as 1.  Once we have a non-lane-reducing
> >> > operation in the reduction chain it would be always equal to the loops VF.
> >> >
> >> > ncopies would then be always determined by that reduction instance VF and
> >> > the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
> >> > instance VF would also trivially indicate the force-single-def-use-cycle
> >> > case, possibly simplifying code?
> >>
> >> I tried to add such an effective VF, while the vectype_in is still needed in some
> >> scenarios, such as when checking whether a dot-prod stmt is emulated or not.
> >> The former could be deduced from the later, so recording both things seems
> >> to be redundant. Another consideration is that for normal op, ncopies
> >> is determined from type (STMT_VINFO_VECTYPE), but for lane-reducing op,
> >> it is from VF.  So, a better means to make them unified?
> >
> > AFAICS reductions are special in that they, for the accumulation SSA cycle,
> > do not adhere to the loops VF but as optimization can chose a smaller one.
> > OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> > operations which even for lane-reducing ops is adhered to - those just
> > may use a smaller VF, that of the reduction SSA cycle.
> >
> > So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather
> > it's not fully redundant but needlessly replicated over all stmts participating
> > in the reduction instead of recording the reduction VF in the reduc_info and
> > using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies
> > for stmts in the reduction cycle.
> >
> > At least that was my idea ...
> >
>
> For lane-reducing ops and single-defuse-cycle optimization, we could assume
> no lane would be reduced, and always generate vectorization statements
> according to the normal VF, if placeholder is needed, just insert some trivial
> statement like zero-initialization, or pass-through copy. And define a"effective VF or
> ncopies" to control lane-reducing related aspects in analysis and codegen (such
> as the below vect_get_loop_mask). Since all things will become SLP-based finally,
> I think a suitable place to add such a field might be in slp_node, as a supplement to
> "vect_stmts_size", and it is expected to be adjusted in vectorizable_reduction. So
> could we do the refinement as separate patches when non-slp code path is to be
> removed?

I suppose so.

Thanks,
Richard.

> >> >> +      gcc_assert (ncopies_for_cost >= 1);
> >> >> +    }
> >> >> +
> >> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> +    {
> >> >> +      /* We need extra two invariants: one that contains the minimum signed
> >> >> +         value and one that contains half of its negative.  */
> >> >> +      int prologue_stmts = 2;
> >> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> >> >> +                                        scalar_to_vec, stmt_info, 0,
> >> >> +                                        vect_prologue);
> >> >> +      if (dump_enabled_p ())
> >> >> +        dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> >> >> +                     "extra prologue_cost = %d .\n", cost);
> >> >> +
> >> >> +      /* Three dot-products and a subtraction.  */
> >> >> +      ncopies_for_cost *= 4;
> >> >> +    }
> >> >> +
> >> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> >> >> +                    vect_body);
> >> >> +
> >> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> >> >> +    {
> >> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> >> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> >> >> +                                                  slp_node, code, type,
> >> >> +                                                  vectype_in);
> >> >> +    }
> >> >> +
> >> >
> >> > Add a comment:
> >> >
> >> >   /* Transform via vect_transform_reduction.  */
> >> >
> >> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >> >> +  return true;
> >> >> +}
> >> >> +
> >> >>  /* Function vectorizable_reduction.
> >> >>
> >> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> >> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    if (!type_has_mode_precision_p (op.type))
> >> >>      return false;
> >> >>
> >> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> >> >> -     which means the only use of that may be in the lane-reducing operation.  */
> >> >> -  if (lane_reducing
> >> >> -      && reduc_chain_length != 1
> >> >> -      && !only_slp_reduc_chain)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                         "lane-reducing reduction with extra stmts.\n");
> >> >> -      return false;
> >> >> -    }
> >> >> -
> >> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
> >> >>       since we'll mix lanes belonging to different reductions.  But it's
> >> >>       OK to use them in a reduction chain or when the reduction group
> >> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>        && loop_vinfo->suggested_unroll_factor == 1)
> >> >>      single_defuse_cycle = true;
> >> >>
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> +  if (single_defuse_cycle && !lane_reducing)
> >> >
> >> > If there's also a non-lane-reducing plus in the chain don't we have to
> >> > check for that reduction op?  So shouldn't it be
> >> > single_defuse_cycle && ... fact that we don't record
> >> > (non-lane-reducing op there) ...
> >>
> >> Quite not understand this point. For a non-lane-reducing op in the chain,
> >> it should be handled in its own vectorizable_xxx function? The below check
> >> is only for the first statement (vect_reduction_def) in the reduction.
> >
> > Hmm.  So we have vectorizable_lane_reducing_* for the check on the
> > lane-reducing stmts, vectorizable_* for !single-def-use stmts.  And the
> > following is then just for the case there's a single def that's not
> > lane-reducing
> > and we're forcing a single-def-use and thus go via vect_transform_reduction?
>
> Yes. Non-lane-reducing with single-defuse-cycle is handled in the function.
> This logic is same as the original.
>
> >>
> >> >>      {
> >> >>        gcc_assert (op.code != COND_EXPR);
> >> >>
> >> >> -      /* 4. Supportable by target?  */
> >> >> -      bool ok = true;
> >> >> -
> >> >> -      /* 4.1. check support for the operation in the loop
> >> >> +      /* 4. check support for the operation in the loop
> >> >>
> >> >>          This isn't necessary for the lane reduction codes, since they
> >> >>          can only be produced by pattern matching, and it's up to the
> >> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>          mixed-sign dot-products can be implemented using signed
> >> >>          dot-products.  */
> >> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> >> >> -      if (!lane_reducing
> >> >> -          && !directly_supported_p (op.code, vectype_in, optab_vector))
> >> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >> >>          {
> >> >>            if (dump_enabled_p ())
> >> >>              dump_printf (MSG_NOTE, "op not supported by target.\n");
> >> >>            if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >> >>                || !vect_can_vectorize_without_simd_p (op.code))
> >> >> -            ok = false;
> >> >> +            single_defuse_cycle = false;
> >> >>            else
> >> >>              if (dump_enabled_p ())
> >> >>                dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> >> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>            dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >> >>            return false;
> >> >>          }
> >> >> -
> >> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
> >> >> -         For the other cases try without the single cycle optimization.  */
> >> >> -      if (!ok)
> >> >> -        {
> >> >> -          if (lane_reducing)
> >> >> -            return false;
> >> >> -          else
> >> >> -            single_defuse_cycle = false;
> >> >> -        }
> >> >>      }
> >> >>    if (dump_enabled_p () && single_defuse_cycle)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>                       "multiple vectors to one in the loop body\n");
> >> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >> >>
> >> >> -  /* If the reduction stmt is one of the patterns that have lane
> >> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> >> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
> >> >> -      && lane_reducing)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                         "multi def-use cycle not possible for lane-reducing "
> >> >> -                         "reduction operation\n");
> >> >> -      return false;
> >> >> -    }
> >> >> +  /* For lane-reducing operation, the below processing related to single
> >> >> +     defuse-cycle will be done in its own vectorizable function.  One more
> >> >> +     thing to note is that the operation must not be involved in fold-left
> >> >> +     reduction.  */
> >> >> +  single_defuse_cycle &= !lane_reducing;
> >> >>
> >> >>    if (slp_node
> >> >> -      && !(!single_defuse_cycle
> >> >> -           && !lane_reducing
> >> >> -           && reduction_type != FOLD_LEFT_REDUCTION))
> >> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >> >>      for (i = 0; i < (int) op.num_ops; i++)
> >> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >> >>          {
> >> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >> >>                               reduction_type, ncopies, cost_vec);
> >> >>    /* Cost the reduction op inside the loop if transformed via
> >> >> -     vect_transform_reduction.  Otherwise this is costed by the
> >> >> -     separate vectorizable_* routines.  */
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> -    {
> >> >> -      int factor = 1;
> >> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> -        /* Three dot-products and a subtraction.  */
> >> >> -        factor = 4;
> >> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> >> >> -                        stmt_info, 0, vect_body);
> >> >> -    }
> >> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> >> >> +     this is costed by the separate vectorizable_* routines.  */
> >> >> +  if (single_defuse_cycle)
> >> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >> >>
> >> >>    if (dump_enabled_p ()
> >> >>        && reduction_type == FOLD_LEFT_REDUCTION)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >>                       "using an in-order (fold-left) reduction.\n");
> >> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> >> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> >> >> -     reductions go through their own vectorizable_* routines.  */
> >> >> -  if (!single_defuse_cycle
> >> >> -      && !lane_reducing
> >> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
> >> >> +
> >> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
> >> >> +     through their own vectorizable_* routines.  */
> >> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >> >>      {
> >> >>        stmt_vec_info tem
> >> >>          = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> >> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>    bool lane_reducing = lane_reducing_op_p (code);
> >> >>    gcc_assert (single_defuse_cycle || lane_reducing);
> >> >>
> >> >> +  if (lane_reducing)
> >> >> +    {
> >> >> +      /* The last operand of lane-reducing op is for reduction.  */
> >> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> >> >> +
> >> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
> >> >> +      gcc_assert (slp_node);
> >> >> +    }
> >> >> +
> >> >>    /* Create the destination vector  */
> >> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>                           reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >> >>                           &vec_oprnds[2]);
> >> >>      }
> >> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> >> >> +           && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> >> >> +    {
> >> >> +      /* For lane-reducing op covered by single-lane slp node, the input
> >> >> +         vectype of the reduction PHI determines copies of vectorized def-use
> >> >> +         cycles, which might be more than effective copies of vectorized lane-
> >> >> +         reducing reduction statements.  This could be complemented by
> >> >> +         generating extra trivial pass-through copies.  For example:
> >> >> +
> >> >> +           int sum = 0;
> >> >> +           for (i)
> >> >> +             {
> >> >> +               sum += d0[i] * d1[i];      // dot-prod
> >> >> +               sum += abs(s0[i] - s1[i]); // sad
> >> >> +               sum += n[i];               // normal
> >> >> +             }
> >> >> +
> >> >> +         The vector size is 128-bit,vectorization factor is 16.  Reduction
> >> >> +         statements would be transformed as:
> >> >> +
> >> >> +           vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >> >> +           vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >> >> +           vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >> >> +           vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >> >> +
> >> >> +           for (i / 16)
> >> >> +             {
> >> >> +               sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >> >> +               sum_v1 = sum_v1;  // copy
> >> >> +               sum_v2 = sum_v2;  // copy
> >> >> +               sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +               sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >> >> +               sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >> >> +               sum_v2 = sum_v2;  // copy
> >> >> +               sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +               sum_v0 += n_v0[i: 0  ~ 3 ];
> >> >> +               sum_v1 += n_v1[i: 4  ~ 7 ];
> >> >> +               sum_v2 += n_v2[i: 8  ~ 11];
> >> >> +               sum_v3 += n_v3[i: 12 ~ 15];
> >> >> +             }
> >> >> +       */
> >> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
> >> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> >> >> +
> >> >
> >> > assert reduc_ncopies >= using_ncopies?  Maybe assert
> >> > reduc_index == op.num_ops - 1 given you use one above
> >> > and the other below?  Or simply iterate till op.num_ops
> >> > and sip i == reduc_index.
> >> >
> >> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> >> >> +        {
> >> >> +          gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> >> +          vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> >> +        }
> >> >> +    }
> >> >>
> >> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> >> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>      {
> >> >>        gimple *new_stmt;
> >> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> >> >> +
> >> >> +      if (!vop[0] || !vop[1])
> >> >> +        {
> >> >> +          tree reduc_vop = vec_oprnds[reduc_index][i];
> >> >> +
> >> >> +          /* Insert trivial copy if no need to generate vectorized
> >> >> +             statement.  */
> >> >> +          gcc_assert (reduc_vop);
> >> >> +
> >> >> +          new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> >> >> +          new_temp = make_ssa_name (vec_dest, new_stmt);
> >> >> +          gimple_set_lhs (new_stmt, new_temp);
> >> >> +          vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> >> >
> >> > I think you could simply do
> >> >
> >> >   slp_node->push_vec_def (reduc_vop);
> >> >   continue;
> >> >
> >> > without any code generation.
> >> >
> >>
> >> OK, that would be easy. Here comes another question, this patch assumes
> >> lane-reducing op would always be contained in a slp node, since single-lane
> >> slp node feature has been enabled. But I got some regression if I enforced
> >> such constraint on lane-reducing op check. Those cases are founded to
> >> be unvectorizable with single-lane slp, so this should not be what we want?
> >> and need to be fixed?
> >
> > Yes, in the end we need to chase down all unsupported cases and fix them
> > (there's known issues with load permutes, I'm working on that - hopefully
> > when finding a continuous stretch of time...).
> >
> >>
> >> >> +        }
> >> >> +      else if (masked_loop_p && !mask_by_cond_expr)
> >> >>          {
> >> >>            /* No conditional ifns have been defined for lane-reducing op
> >> >>               yet.  */
> >> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>
> >> >>            if (masked_loop_p && mask_by_cond_expr)
> >> >>              {
> >> >> +              tree stmt_vectype_in = vectype_in;
> >> >> +              unsigned nvectors = vec_num * ncopies;
> >> >> +
> >> >> +              if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> >> >> +                {
> >> >> +                  /* Input vectype of the reduction PHI may be defferent from
> >> >
> >> > different
> >> >
> >> >> +                     that of lane-reducing operation.  */
> >> >> +                  stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +                  nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> >> >
> >> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
> >>
> >> To partially vectorizing a dot_prod<16 * char> with 128-bit vector width,
> >> we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1, vectype=<16 *char>)
> >> to vect_get_loop_mask?
> >
> > Probably - it depends on the vectorization factor.  What I wanted to
> > point out is that
> > vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong.  The
> > place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted,
> > or we should forgo with it (but that's possibly a post-only-SLP
> > cleanup to be done).
> >
> > See vect_slp_analyze_node_operations_1 where that's computed.  For reductions
> > it's probably not quite right (and we might have latent issues like
> > those you are
> > "fixing" with code like above).  The order we analyze stmts might also be not
> > optimal for reductions with SLP - in fact given that stmt analysis
> > relies on a fixed VF
> > it would probably make sense to determine the reduction VF in advance as well.
> > But again this sounds like post-only-SLP cleanup opportunities.
> >
> > In the end I might suggest to always use reduct-VF and vectype to determine
> > the number of vector stmts rather than computing ncopies/vec_num separately.
>
> Thanks,
> Feng