public inbox for gcc-patches@gcc.gnu.org
* aarch64: Opt-in tweaks to the AArch64 vector cost model
@ 2021-03-26 16:12 Richard Sandiford
From: Richard Sandiford @ 2021-03-26 16:12 UTC
  To: gcc-patches

SVE uses VECT_COMPARE_COSTS to tell the vectoriser to try as many
variations as it knows and pick the one with the lowest cost.
This serves two purposes:

(1) It means we can compare SVE loops that operate on packed vectors
    with SVE loops that operate on unpacked vectors.

(2) It means that we can compare SVE with Advanced SIMD.

Although we used VECT_COMPARE_COSTS for both of these purposes from the
outset, the focus initially was more on (1).  Adding VECT_COMPARE_COSTS
allowed us to use SVE extending loads and truncating stores, in which
loads and stores effectively operate on unpacked rather than packed
vectors.  This part seems to work pretty well in practice.
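As a rough illustration of the "try every variation and keep the
cheapest" idea behind VECT_COMPARE_COSTS, here is a minimal C++ sketch.
The Candidate structure, the function name and the cost values are
hypothetical and are not the vectoriser's actual data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one way of vectorising a loop.
struct Candidate {
  std::string name;   // e.g. "SVE packed", "SVE unpacked", "Advanced SIMD"
  unsigned int cost;  // estimated cost of the loop; lower is better
};

// Return the index of the cheapest candidate.  Ties go to the earliest
// candidate, which models "prefer the first variation tried".
std::size_t pick_cheapest(const std::vector<Candidate> &candidates) {
  assert(!candidates.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < candidates.size(); ++i)
    if (candidates[i].cost < candidates[best].cost)
      best = i;
  return best;
}
```

The choice is only as good as the cost estimates fed into it, which is
why inaccurate costs matter so much for the comparison.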

However, it turns out that the second part (Advanced SIMD vs. SVE)
is less reliable.  There are three main reasons for this:

* At the moment, the AArch64 vector cost structures stick rigidly to the
  vect_cost_for_stmt enumeration provided by target-independent code.
  This particularly affects vec_to_scalar, which is used for at least:

  - reductions
  - extracting an element from a vector to do scalar arithmetic
  - extracting an element to store it out

  The vectoriser gives us the information we need to distinguish
  these cases, but the port wasn't using it.  Other problems include
  undercosting LD[234] and ST[234] instructions and scatter stores.

* Currently, the vectoriser costing works by adding up what are typically
  latency values.  As Richi mentioned recently in an x86 context,
  this effectively means that we treat the scalar and vector code
  as executing serially.  That already causes some problems for
  Advanced SIMD vs. scalar code, but it turns out to be a particular
  problem when comparing SVE with Advanced SIMD.  Scalar, Advanced
  SIMD and SVE can have significantly different issue characteristics,
  and summing latencies misses some important details, especially in
  loops involving reductions.

* Advanced SIMD code can be completely unrolled at compile time,
  but length-agnostic SVE code can't.  We weren't taking this into
  account when comparing the costs.
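To make the unrolling point concrete, here is a small C++ sketch of why
a fixed-length vector loop can have a compile-time iteration count while
a length-agnostic one cannot.  The function and its conventions (0 for
"length-agnostic" and 0 for "not known") are assumptions made for
illustration, and the sketch only handles trip counts that are an exact
multiple of the number of lanes.

```cpp
#include <cassert>

// Return the number of vector iterations if it is known at compile time,
// or 0 if it is not.  VECTOR_BITS == 0 models a length-agnostic vector
// (as with length-agnostic SVE); 128 would model Advanced SIMD Q registers.
unsigned int known_vector_iterations(unsigned int scalar_iterations,
                                     unsigned int element_bits,
                                     unsigned int vector_bits) {
  assert(element_bits > 0);
  if (vector_bits == 0)
    return 0;  // Vector length unknown: cannot unroll completely.
  unsigned int lanes = vector_bits / element_bits;
  if (lanes == 0 || scalar_iterations % lanes != 0)
    return 0;  // Simplification: ignore loops that need an epilogue.
  return scalar_iterations / lanes;
}
```

For example, 16 iterations over 32-bit elements become 4 iterations with
128-bit vectors, which could then be unrolled completely; with an unknown
vector length no such count exists at compile time.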

This series of patches tries to address these problems by making
some opt-in tweaks to the vector cost model.  It produces much better
results on the SVE workloads that we've tried internally.  We'd therefore
like to put this in for GCC 11.
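One of the problems above, summing latencies, can be illustrated with a
minimal C++ sketch that contrasts a latency sum with a simple issue-rate
bound.  The function names and the numbers are hypothetical; this is
not the costing code from the series.

```cpp
#include <cassert>
#include <vector>

// Cost obtained by adding up per-operation latencies, which effectively
// treats every operation as executing serially.
unsigned int summed_latency(const std::vector<unsigned int> &latencies) {
  unsigned int total = 0;
  for (unsigned int latency : latencies)
    total += latency;
  return total;
}

// Minimum number of cycles needed just to issue NUM_OPS operations on a
// machine that can issue OPS_PER_CYCLE of them per cycle (rounded up).
// Dependencies between operations are ignored.
unsigned int issue_cycles(unsigned int num_ops, unsigned int ops_per_cycle) {
  assert(ops_per_cycle > 0);
  return (num_ops + ops_per_cycle - 1) / ops_per_cycle;
}
```

Four independent operations with a latency of 4 sum to 16, but a machine
that issues two of them per cycle needs only 2 cycles of issue bandwidth.
A realistic estimate also has to account for dependency chains, which is
where reductions come in.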

I'm really sorry that this is landing so late in stage 4.  Clearly it
would have been much better to do this earlier.  However:

- The patches “only” change the vector cost hooks.  There are no changes
  elsewhere.  In other words, the SVE code we generate and the Advanced
  SIMD code we generate is unchanged: the “only” thing we're doing is
  using different heuristics to select between them.

- As mentioned above, almost all the new code is “opt-in”.  Therefore,
  only CPUs that explicitly want it (and will benefit from it) will be
  affected.  Most of the code is not executed otherwise.
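As a sketch of what "opt-in" could look like, the C++ below uses a
per-CPU flag to choose between a single generic vec_to_scalar cost and
separate costs for the three uses listed earlier.  The structure, field
names and flag are hypothetical and are not the actual AArch64 cost
tables.

```cpp
#include <cassert>

// The three uses of vec_to_scalar named in the cover letter.
enum class VecToScalarUse { Reduction, ExtractForArith, ExtractForStore };

// Hypothetical per-CPU cost table.
struct CpuVectorCosts {
  bool use_new_costs;              // opt-in flag (assumed name)
  unsigned int vec_to_scalar;      // single generic cost
  unsigned int reduction;          // cost of a reduction
  unsigned int extract_for_arith;  // extract an element for scalar arithmetic
  unsigned int extract_for_store;  // extract an element to store it out
};

unsigned int vec_to_scalar_cost(const CpuVectorCosts &cpu, VecToScalarUse use) {
  // CPUs that have not opted in keep the single generic cost.
  if (!cpu.use_new_costs)
    return cpu.vec_to_scalar;
  switch (use) {
    case VecToScalarUse::Reduction:       return cpu.reduction;
    case VecToScalarUse::ExtractForArith: return cpu.extract_for_arith;
    case VecToScalarUse::ExtractForStore: return cpu.extract_for_store;
  }
  assert(false && "unhandled VecToScalarUse");
  return cpu.vec_to_scalar;
}
```

With the flag clear, every use gets the same cost as before, so CPUs
that do not opt in see no change in behaviour.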

Tested on aarch64-linux-gnu (with and without SVE), pushed to trunk.

Richard

Thread overview: 14+ messages
2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
2021-03-26 16:14 ` [PATCH 01/13] aarch64: Add reduction costs to simd_vec_costs Richard Sandiford
2021-03-26 16:14 ` [PATCH 02/13] aarch64: Add vector costs for SVE CLAST[AB] and FADDA Richard Sandiford
2021-03-26 16:15 ` [PATCH 03/13] aarch64: Add costs for LD[234]/ST[234] permutes Richard Sandiford
2021-03-26 16:15 ` [PATCH 04/13] aarch64: Add costs for storing one element of a vector Richard Sandiford
2021-03-26 16:15 ` [PATCH 05/13] aarch64: Add costs for one element of a scatter store Richard Sandiford
2021-03-26 16:16 ` [PATCH 06/13] aarch64: Add a CPU-specific cost table for Neoverse V1 Richard Sandiford
2021-03-26 16:16 ` [PATCH 07/13] aarch64: Use an aarch64-specific structure for vector costing Richard Sandiford
2021-03-26 16:16 ` [PATCH 08/13] aarch64: Try to detect when Advanced SIMD code would be completely unrolled Richard Sandiford
2021-03-26 16:17 ` [PATCH 09/13] aarch64: Detect scalar extending loads Richard Sandiford
2021-03-26 16:17 ` [PATCH 10/13] aarch64: Cost comparisons embedded in COND_EXPRs Richard Sandiford
2021-03-26 16:18 ` [PATCH 11/13] aarch64: Ignore inductions when costing vector code Richard Sandiford
2021-03-26 16:18 ` [PATCH 12/13] aarch64: Take issue rate into account for vector loop costs Richard Sandiford
2021-03-26 16:18 ` [PATCH 13/13] aarch64: Add costs for LD[34] and ST[34] postincrements Richard Sandiford