From: Richard Biener
Date: Wed, 06 Nov 2019 11:22:00 -0000
Subject: Re: [11a/n] Avoid retrying with the same vector modes
To: Richard Sandiford
Cc: GCC Patches

On Wed, Nov 6, 2019 at 12:02 PM Richard Sandiford wrote:
>
> Richard Biener writes:
> > On Wed, Nov 6, 2019 at 11:21 AM Richard Sandiford wrote:
> >>
> >> Richard Biener writes:
> >> > On Tue, Nov 5, 2019 at 9:25 PM Richard Sandiford wrote:
> >> >>
> >> >> Patch 12/n makes the AArch64 port add four entries to
> >> >> autovectorize_vector_modes.  Each entry describes a different
> >> >> vector mode assignment for vector code that mixes 8-bit, 16-bit,
> >> >> 32-bit and 64-bit elements.  But if (as usual) the vector code has
> >> >> fewer element sizes than that, we could end up trying the same
> >> >> combination of vector modes multiple times.  This patch adds a
> >> >> check to prevent that.
> >> >>
> >> >> As before: each patch tested individually on aarch64-linux-gnu and the
> >> >> series as a whole on x86_64-linux-gnu.
> >> >>
> >> >>
> >> >> 2019-11-04  Richard Sandiford
> >> >>
> >> >> gcc/
> >> >>     * tree-vectorizer.h (vec_info::mode_set): New typedef.
> >> >>     (vec_info::used_vector_modes): New member variable.
> >> >>     (vect_chooses_same_modes_p): Declare.
> >> >>     * tree-vect-stmts.c (get_vectype_for_scalar_type): Record each
> >> >>     chosen vector mode in vec_info::used_vector_modes.
> >> >>     (vect_chooses_same_modes_p): New function.
> >> >>     * tree-vect-loop.c (vect_analyze_loop): Use it to avoid trying
> >> >>     the same vector statements multiple times.
> >> >>     * tree-vect-slp.c (vect_slp_bb_region): Likewise.
> >> >>
> >> >> Index: gcc/tree-vectorizer.h
> >> >> ===================================================================
> >> >> --- gcc/tree-vectorizer.h   2019-11-05 10:48:11.246092351 +0000
> >> >> +++ gcc/tree-vectorizer.h   2019-11-05 10:57:41.662071145 +0000
> >> >> @@ -298,6 +298,7 @@ typedef std::pair<tree, tree> vec_object
> >> >>  /* Vectorizer state common between loop and basic-block vectorization.  */
> >> >>  class vec_info {
> >> >>  public:
> >> >> +  typedef hash_set<int_hash<machine_mode, E_VOIDmode, E_BLKmode> > mode_set;
> >> >>    enum vec_kind { bb, loop };
> >> >>
> >> >>    vec_info (vec_kind, void *, vec_info_shared *);
> >> >> @@ -335,6 +336,9 @@ typedef std::pair<tree, tree> vec_object
> >> >>    /* Cost data used by the target cost model.  */
> >> >>    void *target_cost_data;
> >> >>
> >> >> +  /* The set of vector modes used in the vectorized region.  */
> >> >> +  mode_set used_vector_modes;
> >> >> +
> >> >>    /* The argument we should pass to related_vector_mode when looking up
> >> >>       the vector mode for a scalar mode, or VOIDmode if we haven't yet
> >> >>       made any decisions about which vector modes to use.  */
> >> >> @@ -1615,6 +1619,7 @@ extern tree get_related_vectype_for_scal
> >> >>  extern tree get_vectype_for_scalar_type (vec_info *, tree);
> >> >>  extern tree get_mask_type_for_scalar_type (vec_info *, tree);
> >> >>  extern tree get_same_sized_vectype (tree, tree);
> >> >> +extern bool vect_chooses_same_modes_p (vec_info *, machine_mode);
> >> >>  extern bool vect_get_loop_mask_type (loop_vec_info);
> >> >>  extern bool vect_is_simple_use (tree, vec_info *, enum vect_def_type *,
> >> >>                                  stmt_vec_info * = NULL, gimple ** = NULL);
> >> >> Index: gcc/tree-vect-stmts.c
> >> >> ===================================================================
> >> >> --- gcc/tree-vect-stmts.c   2019-11-05 10:48:11.242092379 +0000
> >> >> +++ gcc/tree-vect-stmts.c   2019-11-05 10:57:41.662071145 +0000
> >> >> @@ -11235,6 +11235,10 @@ get_vectype_for_scalar_type (vec_info *v
> >> >>                                                        scalar_type);
> >> >>    if (vectype && vinfo->vector_mode == VOIDmode)
> >> >>      vinfo->vector_mode = TYPE_MODE (vectype);
> >> >> +
> >> >> +  if (vectype)
> >> >> +    vinfo->used_vector_modes.add (TYPE_MODE (vectype));
> >> >> +
> >> >
> >> > Do we actually end up _using_ all types returned by this function?
> >>
> >> No, not all of them, so it's a bit crude.  E.g. some types might end up
> >> not being relevant after pattern recognition, or after we've made a
> >> final decision about which parts of an address calculation to include
> >> in a gather or scatter op.  So we can still end up retrying the same
> >> thing even after the patch.
> >>
> >> The problem is that we're trying to avoid pointless retries on failure
> >> as well as success, so we could end up stopping at arbitrary points.
> >> I wasn't sure where else to handle this.
> >
> > Yeah, I think this "iterating" is somewhat bogus (crude) now.
>
> I think it was crude even before the series though. :-)  Not sure the
> series is making things worse.
>
> The problem is that there's a chicken-and-egg relationship between how
> we decide to vectorise and which vector subarchitecture and VF we use.
> E.g. if we have:
>
>   unsigned char *x, *y;
>   ...
>   x[i] = (unsigned short) (x[i] + y[i] + 1) >> 1;
>
> do we build the SLP graph on the assumption that we need to use short
> elements, or on the assumption that we can use IFN_AVG_CEIL?  This
> affects the VF we get out: using IFN_AVG_CEIL gives double the VF
> relative to doing unsigned short arithmetic.
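
[For concreteness, a minimal hand-written sketch of the two strategies
discussed above, using Advanced SIMD intrinsics.  It is not part of the
patch or the thread; the function names are invented, and it assumes n
is a multiple of 16 and that x and y do not alias.]

#include <arm_neon.h>
#include <stddef.h>

/* Strategy 1: recognise the averaging idiom (what IFN_AVG_CEIL models).
   All arithmetic stays in 8-bit lanes, so one 128-bit vector covers
   16 elements: VF = 16.  */
void
avg_ceil_vf16 (uint8_t *x, const uint8_t *y, size_t n)
{
  for (size_t i = 0; i < n; i += 16)
    {
      uint8x16_t a = vld1q_u8 (x + i);
      uint8x16_t b = vld1q_u8 (y + i);
      /* vrhaddq_u8 computes (a + b + 1) >> 1 per lane without overflow.  */
      vst1q_u8 (x + i, vrhaddq_u8 (a, b));
    }
}

/* Strategy 2: do what the scalar code literally says, widening to
   16-bit lanes.  The widest element is now 16 bits, so one 128-bit
   vector covers only 8 elements: VF = 8, half of strategy 1.  */
void
widen_vf8 (uint8_t *x, const uint8_t *y, size_t n)
{
  for (size_t i = 0; i < n; i += 8)
    {
      uint16x8_t sum = vaddl_u8 (vld1_u8 (x + i), vld1_u8 (y + i));
      sum = vshrq_n_u16 (vaddq_u16 (sum, vdupq_n_u16 (1)), 1);
      vst1_u8 (x + i, vmovn_u16 (sum));
    }
}

The point of the sketch is only to show why the choice changes the VF:
which of the two is available (and cheaper) depends on the vector
subarchitecture being targeted.
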
> And we need to know which vector subarchitecture we're targeting when
> making that decision: e.g. Advanced SIMD and SVE2 have IFN_AVG_CEIL,
> but SVE doesn't.  On the other hand, SVE supports things that Advanced
> SIMD doesn't.  It's a similar story of course for the x86 vector subarchs.
>
> For one pattern like this, we could simply try both ways.
> But that becomes untenable if there are multiple potential patterns.
> Iterating over the vector subarchs gives us a sensible way of reducing
> the search space by only applying patterns that the subarch supports.
>
> So...
>
> > What we'd like to collect is, for all defs, the vector types we could
> > use, and then vectorizable_* defines constraints between input and
> > output vector types.  From that we'd arrive at a (possibly quite
> > large) set of "SLP graphs with vector types" we'd choose from.  I
> > believe we'll never want to truly explore the whole space, but I guess
> > we want to greedily compute those "SLP graphs with vector types"
> > starting from what the (grouped) datarefs tell us is possible (which
> > is kind of what we do now).
>
> ...I don't think we can/should use the same SLP graph to pick vector
> types for all subarchs, since the ideal graph depends on the subarch.
> I'm also not sure the vectorizable_* routines could say anything that
> isn't obvious from the input and output scalar types.  Won't it still be
> the case that within an SLP instance, all scalars of type T will become
> the same vector(N) T?

Not necessarily.  I can see the SLP graph containing "reductions" (like
those complex patterns proposed).  But yes, at the moment there's a
single group size per SLP instance.  Now, for the SLP _graph_ we may
have, for example, one instance with group size 4 and one with group
size 8, both sharing the same grouped load.  It may make sense to
vectorize the load with v8si and, for the group-size-4 SLP instance,
have a "reduction operation" that selects the appropriate part of the
loaded vector.

Now, vectorizable_* for, say, a conversion from int to double may be
able to vectorize a v4si input directly to v4df or to two v2df.  With
the size restriction relaxed, vectorizable_* could opt to choose
smaller or larger vectors specifically, and thus also have different
vector types for the same scalar type in the SLP graph.  I do expect
this to be profitable on x86 for some loops, due to asymmetries in the
ISA (and the extra cost of lane-crossing operations for, say, AVX,
where using SSE is cheaper for some ops even though you then need two
or more instructions).

Richard.

> Thanks,
> Richard
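
[Purely illustrative, not from the thread: the int-to-double conversion
mentioned above written with x86 intrinsics, showing how one v4si input
can become either one v4df or two v2df, i.e. how the same scalar type
can end up in different vector types.  The function names are invented.]

#include <immintrin.h>

/* One v4si -> one v4df: needs AVX and produces a single 256-bit result.  */
__m256d
convert_v4si_to_v4df (__m128i v)
{
  return _mm256_cvtepi32_pd (v);
}

/* One v4si -> two v2df: stays within 128-bit SSE2; two converts instead
   of one, but no 256-bit operation is required.  */
void
convert_v4si_to_2x_v2df (__m128i v, __m128d *lo, __m128d *hi)
{
  *lo = _mm_cvtepi32_pd (v);                          /* elements 0,1 */
  *hi = _mm_cvtepi32_pd (_mm_unpackhi_epi64 (v, v));  /* elements 2,3 */
}
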