From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.guenther@gmail.com>
Received: from mail-ed1-x52e.google.com (mail-ed1-x52e.google.com
 [IPv6:2a00:1450:4864:20::52e])
 by sourceware.org (Postfix) with ESMTPS id 8AEB9383D012
 for <gcc-patches@gcc.gnu.org>; Mon, 12 Jul 2021 06:32:55 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8AEB9383D012
Received: by mail-ed1-x52e.google.com with SMTP id l26so17962822eda.10
 for <gcc-patches@gcc.gnu.org>; Sun, 11 Jul 2021 23:32:55 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:content-transfer-encoding;
 bh=6sFFmE35B4Zv6SU7clfqNJ5pRA7OHRSGTFMlPF4xaBY=;
 b=j8bGDdKcVhJjK/Qugq/eVkTk9vnci++EdMDRTyfTGXFmVT4LuhI54Zirlxi0TQgapu
 mwvBj5Z5BGoI35TEn9tEQDjYYNIQWESVS5MAevY9fD7aSOD3Ukz3HdUhVojdFEDeU0IX
 CYOfAILQ3CWxtd42FknW+ZGukqPrwbeyE9FE8GrcC0ZjAkk3yyBnhVfsQRFQxE3rwo12
 JJx/3rXiH2uEbxAfQaD8RvPshYq+WcRbUrRSzD6P3EH38q98Suo6KTuW+f5vDlQuknQN
 aP3yo1tfrBVRoYbVp0g8Pc4CrGgMpF2SNTk6fAyVOmBgnqRFxitkfY6rB3V+CslteQgM
 pfCg==
X-Gm-Message-State: AOAM5333sRyn2PLT1yy8qSY+A/1l/xMf1d7IiMKBEPiyk7o/Q9q7pNYS
 ox3xOgI+FX7YEt2w4D7nqwCGTaafr1tKoDMP66o=
X-Google-Smtp-Source: ABdhPJwuuJp/ai3etwRhKhrk5F52iPHF5HG61pFBGI1P+ypDA7MS40J2W6TMmVoXJnGlvlneCt+bpGg9yi9w4HiOYIM=
X-Received: by 2002:aa7:c652:: with SMTP id z18mr4968840edr.361.1626071574462; 
 Sun, 11 Jul 2021 23:32:54 -0700 (PDT)
MIME-Version: 1.0
References: <mptpmvtrner.fsf@arm.com> <mpt35spq8mb.fsf@arm.com>
 <CAFiYyc1Ynh9a9gxSjJSun5SGD9rW6kFzdMb_qEPLKQBCoDavxA@mail.gmail.com>
 <mptbl7bljgf.fsf@arm.com>
In-Reply-To: <mptbl7bljgf.fsf@arm.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 12 Jul 2021 08:32:43 +0200
Message-ID: <CAFiYyc29J0w+yEktwnLfJ7Q-FanGXZ9o2=A8KzrkLx1qeN40oQ@mail.gmail.com>
Subject: Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops
To: Richard Biener <richard.guenther@gmail.com>,
 GCC Patches <gcc-patches@gcc.gnu.org>, 
 Richard Sandiford <richard.sandiford@arm.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-8.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Jul 2021 06:32:57 -0000

On Fri, Jul 9, 2021 at 3:12 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Thanks for the review.
>
> Richard Biener <richard.guenther@gmail.com> writes:
> >> @@ -588,6 +600,23 @@ public:
> >>    /* Unrolling factor  */
> >>    poly_uint64 vectorization_factor;
> >>
> >> +  /* If this loop is an epilogue loop whose main loop can be skipped,
> >> +     MAIN_LOOP_EDGE is the edge from the main loop to this loop's
> >> +     preheader.  SKIP_MAIN_LOOP_EDGE is then the edge that skips the
> >> +     main loop and goes straight to this loop's preheader.
> >> +
> >> +     Both fields are null otherwise.  */
> >> +  edge main_loop_edge;
> >> +  edge skip_main_loop_edge;
> >> +
> >> +  /* If this loop is an epilogue loop that might be skipped after exe=
cuting
> >> +     the main loop, this edge is the one that skips the epilogue.  */
> >> +  edge skip_this_loop_edge;
> >> +
> >> +  /* After vectorization, maps live-out SSA names to information abou=
t
> >> +     the reductions that generated them.  */
> >> +  hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
> >
> > Is that the LC PHI node defs or the definition inside of the loop?
> > If the latter we could attach the info directly to its stmt-info?
>
> Ah, yeah, I should improve the comment there.  It's the vectoriser's
> replacement for the original LC PHI node, i.e. the final scalar result
> after the reduction has taken place.

OK

> >> @@ -1186,6 +1215,21 @@ public:
> >>    /* The vector type for performing the actual reduction.  */
> >>    tree reduc_vectype;
> >>
> >> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> >> +     elements in parallel, this vector gives the initial values of th=
ese
> >> +     N elements.  */
> >
> > That's N scalar elements or N vector elements?  I suppose it's for
> > SLP reductions (rather than SLP reduction chains) and never non-SLP
> > reductions?
>
> Yeah, poor wording again, sorry.  I meant something closer to:
>
>   /* If IS_REDUC_INFO is true and if the vector code is performing
>      N scalar reductions in parallel, this vector gives the initial
>      scalar values of those N reductions.  */
>
> >> +  vec<tree> reduc_initial_values;
> >> +
> >> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> >> +     elements in parallel, this vector gives the scalar result of eac=
h
> >> +     reduction.  */
> >> +  vec<tree> reduc_scalar_results;
>
> Same change here.
>
> >> [=E2=80=A6]
> >> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> >> index 2909e8a0fc3..b7b0523e3c8 100644
> >> --- a/gcc/tree-vect-loop-manip.c
> >> +++ b/gcc/tree-vect-loop-manip.c
> >> @@ -2457,6 +2457,31 @@ vect_update_epilogue_niters (loop_vec_info epil=
ogue_vinfo,
> >>    return vect_determine_partial_vectors_and_peeling (epilogue_vinfo, =
true);
> >>  }
> >>
> >> +/* LOOP_VINFO is an epilogue loop and MAIN_LOOP_VALUE is available on=
 exit
> >> +   from the corresponding main loop.  Return a value that is availabl=
e in
> >> +   LOOP_VINFO's preheader, using SKIP_VALUE if the main loop is skipp=
ed.
> >> +   Passing a null SKIP_VALUE is equivalent to passing zero.  */
> >> +
> >> +tree
> >> +vect_get_main_loop_result (loop_vec_info loop_vinfo, tree main_loop_v=
alue,
> >> +                          tree skip_value)
> >> +{
> >> +  if (!loop_vinfo->main_loop_edge)
> >> +    return main_loop_value;
> >> +
> >> +  if (!skip_value)
> >> +    skip_value =3D build_zero_cst (TREE_TYPE (main_loop_value));
> >
> > shouldn't that be the initial value?
>
> For the current use case, the above two conditions are never true.
> I wrote it like this because I had a follow-on patch (which might
> not go anywhere) that needed this function for 0-based IVs.
>
> Maybe that's a bad risk/reward trade-off though.  Not having to pass
> zero makes things only slightly simpler for the follow-on patch,
> and I guess could be dangerous in other cases.
>
> Perhaps in that case though I should change loop_vinfo->main_loop_edge
> into a gcc_assert as well.

Yeah, I think asserts (and comments in case it's because we don't handle
some specific cases yet) are better than possibly wrong behavior.

> >> +  tree phi_result =3D make_ssa_name (TREE_TYPE (main_loop_value));
> >> +  basic_block bb =3D loop_vinfo->main_loop_edge->dest;
> >> +  gphi *new_phi =3D create_phi_node (phi_result, bb);
> >> +  add_phi_arg (new_phi, main_loop_value, loop_vinfo->main_loop_edge,
> >> +              UNKNOWN_LOCATION);
> >> +  add_phi_arg (new_phi, skip_value,
> >> +              loop_vinfo->skip_main_loop_edge, UNKNOWN_LOCATION);
> >> +  return phi_result;
> >> +}
> >> +
> >>  /* Function vect_do_peeling.
> >>
> >>     Input:
> >> [=E2=80=A6]
> >> @@ -4823,6 +4842,100 @@ info_for_reduction (vec_info *vinfo, stmt_vec_=
info stmt_info)
> >>    return stmt_info;
> >>  }
> >>
> >> +/* PHI is a reduction in LOOP_VINFO that we are going to vectorize us=
ing vector
> >> +   type VECTYPE.  See if LOOP_VINFO is an epilogue loop whose main lo=
op had a
> >> +   matching reduction that we can build on.  Adjust REDUC_INFO and re=
turn true
> >> +   if so, otherwise return false.  */
> >> +
> >> +static bool
> >> +vect_find_reusable_accumulator (loop_vec_info loop_vinfo,
> >> +                               stmt_vec_info reduc_info)
> >> +{
> >> +  loop_vec_info main_loop_vinfo =3D LOOP_VINFO_ORIG_LOOP_INFO (loop_v=
info);
> >> +  if (!main_loop_vinfo)
> >> +    return false;
> >> +
> >> +  if (STMT_VINFO_REDUC_TYPE (reduc_info) !=3D TREE_CODE_REDUCTION)
> >> +    return false;
> >> +
> >> +  unsigned int num_phis =3D reduc_info->reduc_initial_values.length (=
);
> >> +  auto_vec<tree, 16> main_loop_results (num_phis);
> >> +  auto_vec<tree, 16> initial_values (num_phis);
> >> +  if (edge main_loop_edge =3D loop_vinfo->main_loop_edge)
> >> +    {
> >> +      /* The epilogue loop can be entered either from the main loop o=
r
> >> +        from an earlier guard block.  */
> >> +      edge skip_edge =3D loop_vinfo->skip_main_loop_edge;
> >> +      for (tree incoming_value : reduc_info->reduc_initial_values)
> >> +       {
> >> +         /* Look for:
> >> +
> >> +              INCOMING_VALUE =3D phi<MAIN_LOOP_RESULT(main loop),
> >> +                                   INITIAL_VALUE(guard block)>.  */
> >> +         gcc_assert (TREE_CODE (incoming_value) =3D=3D SSA_NAME);
> >> +
> >> +         gphi *phi =3D as_a <gphi *> (SSA_NAME_DEF_STMT (incoming_val=
ue));
> >> +         gcc_assert (gimple_bb (phi) =3D=3D main_loop_edge->dest);
> >> +
> >> +         tree from_main_loop =3D PHI_ARG_DEF_FROM_EDGE (phi, main_loo=
p_edge);
> >> +         tree from_skip =3D PHI_ARG_DEF_FROM_EDGE (phi, skip_edge);
> >> +
> >> +         main_loop_results.quick_push (from_main_loop);
> >> +         initial_values.quick_push (from_skip);
> >> +       }
> >> +    }
> >> +  else
> >> +    /* The main loop dominates the epilogue loop.  */
> >> +    main_loop_results.splice (reduc_info->reduc_initial_values);
> >> +
> >> +  /* See if the main loop has the kind of accumulator we need.  */
> >> +  vect_reusable_accumulator *accumulator
> >> +    =3D main_loop_vinfo->reusable_accumulators.get (main_loop_results=
[0]);
> >> +  if (!accumulator
> >> +      || num_phis !=3D accumulator->reduc_info->reduc_scalar_results.=
length ()
> >> +      || !std::equal (main_loop_results.begin (), main_loop_results.e=
nd (),
> >> +                     accumulator->reduc_info->reduc_scalar_results.be=
gin ()))
> >> +    return false;
> >> +
> >> +  /* For now, only handle the case in which both loops are operating =
on the
> >> +     same vector types.  In future we could reduce wider vectors to n=
arrower
> >> +     ones as well.  */
> >> +  tree vectype =3D STMT_VINFO_VECTYPE (reduc_info);
> >> +  tree old_vectype =3D TREE_TYPE (accumulator->reduc_input);
> >> +  if (!useless_type_conversion_p (old_vectype, vectype))
> >
> > It should be indeed quite trivial to handle, likewise the case where we
> > have multiple PHIs - just reduce to a single input vector and have the
> > possibly multiple input vectors in the epilogue filled with neutral
> > elements.  I'll see if I can cook up stuff for this next week.
>
> Yeah, agreed.  The multi-vector epilogue case should be especially easy
> to handle, but it's not interesting for SVE as things stand, since:
>
> (a) non-SLP reductions use a single cycle for ncopies>1 (a misfeature
>     IMO -- on targets with wide pipelines we want exactly the opposite)
>
> (b) SLP reductions are limited to single vectors for variable-length targ=
ets.
>
> So it wasn't possible to trigger multiple epilogue vectors for the
> motivating SVE use case.

OK, I see.  If the series is in I'll see to create testcases for x86_64.

> >> [=E2=80=A6]
> >> @@ -5196,6 +5305,37 @@ vect_create_epilog_for_reduction (loop_vec_info=
 loop_vinfo,
> >>        reduc_inputs.safe_push (single_input);
> >>      }
> >>
> >> +  tree orig_reduc_input =3D reduc_inputs[0];
> >> +
> >> +  /* If this loop is an epilogue loop that can be skipped after the
> >> +     main loop, we can only share a reduction operation between the
> >> +     main loop and the epilogue if we put it at the target of the
> >> +     skip edge.
> >
> > Do you have a testcase where we cannot do this?
>
> No, it's being defensive.  I wasn't sure how the epilogue code would
> evolve in future.
>
> >> +     We can still reuse accumulators if this check fails.  Doing so h=
as
> >> +     the minor(?) benefit of making the epilogue loop's scalar result
> >> +     independent of the main loop's scalar result.  */
> >> +  bool unify_with_main_loop_p =3D false;
> >> +  if (reduc_info->reused_accumulator
> >> +      && loop_vinfo->skip_this_loop_edge
> >> +      && single_succ_p (exit_bb)
> >> +      && single_succ (exit_bb) =3D=3D loop_vinfo->skip_this_loop_edge=
->dest)
> >> +    {
> >> +      unify_with_main_loop_p =3D true;
> >> +
> >> +      basic_block reduc_block =3D loop_vinfo->skip_this_loop_edge->de=
st;
> >> +      reduc_inputs[0] =3D make_ssa_name (vectype);
> >> +      gphi *new_phi =3D create_phi_node (reduc_inputs[0], reduc_block=
);
> >> +      add_phi_arg (new_phi, orig_reduc_input, single_succ_edge (exit_=
bb),
> >> +                  UNKNOWN_LOCATION);
> >> +      add_phi_arg (new_phi, reduc_info->reused_accumulator->reduc_inp=
ut,
> >> +                  loop_vinfo->skip_this_loop_edge, UNKNOWN_LOCATION);
> >> +      exit_gsi =3D gsi_after_labels (reduc_block);
> >> +    }
> >> +
> >> +  /* Shouldn't be used beyond this point.  */
> >> +  exit_bb =3D nullptr;
> >> +
> >>    if (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D COND_REDUCTION
> >>        && reduc_fn !=3D IFN_LAST)
> >>      {
> >> @@ -5819,6 +5958,12 @@ vect_create_epilog_for_reduction (loop_vec_info=
 loop_vinfo,
> >>        scalar_results[0] =3D new_temp;
> >>      }
> >>
> >> +  /* Record this operation if it could be reused by the epilogue loop=
.  */
> >> +  if (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D TREE_CODE_REDUCTION
> >> +      && !double_reduc)
> >
> > what's the issue with double_reduc?
>
> Probably nothing TBH.  I haven't been able to construct a case that
> uses predicated double reductions with vect-partial-vector-usage=3D1,
> but that's probably a missed optimisation.
>
> There again, double reductions themselves seem to be hard to trigger
> now that we have loop interchange.  Is there a good way of testing
> them without -fno-loop-interchange?

there are a bunch of testcases in gcc.dg/vect/vect-double-reduc-?.c,
I don't see how interchange avoids the double reduction, in fact when
doing interchange we no longer can apply outer loop vectorization (but
it's still a double reduction, just only inner loop vectorized).
But eventually we don't do epilogue vectorization for outer loop
vectorizations with reductions.

Oh, and of course vect.exp runs with -O2 -ftree-vectorize, avoiding
any of the high-level loop opts ...

Richard.

> Thanks,
> Richard
>
> >> +    loop_vinfo->reusable_accumulators.put (scalar_results[0],
> >> +                                          { orig_reduc_input, reduc_i=
nfo });
> >> +
> >>    if (double_reduc)
> >>      loop =3D outer_loop;
> >>