public inbox for gcc-bugs@sourceware.org
* [Bug target/110625] New: [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large
@ 2023-07-11  9:15 hliu at amperecomputing dot com
  2023-07-11 10:41 ` [Bug target/110625] " rguenth at gcc dot gnu.org
                   ` (25 more replies)
  0 siblings, 26 replies; 27+ messages in thread
From: hliu at amperecomputing dot com @ 2023-07-11  9:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110625

            Bug ID: 110625
           Summary: [AArch64] Vect: SLP fails to vectorize a loop as the
                    reduction_latency calculated by new costs is too large
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

This problem causes a performance regression in SPEC2017 538.imagick.  For the
following simple case (modified from pr96208):

    typedef struct {
        unsigned short m1, m2, m3, m4;
    } the_struct_t;
    typedef struct {
        double m1, m2, m3, m4, m5;
    } the_struct2_t;

    double bar1 (the_struct2_t*);

    double foo (double* k, unsigned int n, the_struct_t* the_struct) {
        unsigned int u;
        the_struct2_t result;
        for (u=0; u < n; u++, k--) {
            result.m1 += (*k)*the_struct[u].m1;
            result.m2 += (*k)*the_struct[u].m2;
            result.m3 += (*k)*the_struct[u].m3;
            result.m4 += (*k)*the_struct[u].m4;
        }
        return bar1 (&result);
    }


Compile it with "-Ofast -S -mcpu=neoverse-n2 -fdump-tree-vect-details
-fno-tree-slp-vectorize". SLP fails to vectorize the loop because the vector
body cost is increased due to an overly large "reduction latency".  See the
dump of the vect pass:

    Original vector body cost = 51
    Scalar issue estimate:
      ...
      reduction latency = 2
      estimated min cycles per iteration = 2.000000
      estimated cycles per vector iteration (for VF 2) = 4.000000
    Vector issue estimate:
      ...
      reduction latency = 8      <-- Too large
      estimated min cycles per iteration = 8.000000
    Increasing body cost to 102 because scalar code would issue more quickly
    Cost model analysis: 
    Vector inside of loop cost: 102
    ...
    Scalar iteration cost: 44
    ...
    missed:  cost model: the vector iteration cost = 102 divided by the scalar
iteration cost = 44 is greater or equal to the vectorization factor = 2.
    missed:  not vectorized: vectorization not profitable.


SLP succeeds with "-mcpu=neoverse-n1", as N1 doesn't use the new vector costs
and the vector body cost is not increased.  The "reduction latency" is
calculated in aarch64.cc count_ops():
      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
         that's not yet the case.  */
      ops->reduction_latency = MAX (ops->reduction_latency, base * count);

For this case, the "base" is 2 and the "count" is 4.  To my understanding, the
"count" in SLP means the number of scalar stmts (i.e. result.m1 +=, ...) in a
permutation group to be merged into a vector stmt.  It does not seem reasonable
to multiply the cost by "count" (perhaps this code does not account for the SLP
situation).  So, I'm thinking of calculating it differently for the SLP case,
e.g.:

      unsigned int latency = PURE_SLP_STMT(stmt_info) ? base : base * count;
      ops->reduction_latency = MAX (ops->reduction_latency, latency);

Is this reasonable?

Thread overview: 27+ messages
2023-07-11  9:15 [Bug target/110625] New: [AArch64] Vect: SLP fails to vectorize a loop as the reduction_latency calculated by new costs is too large hliu at amperecomputing dot com
2023-07-11 10:41 ` [Bug target/110625] " rguenth at gcc dot gnu.org
2023-07-12  0:58 ` hliu at amperecomputing dot com
2023-07-14  8:58 ` hliu at amperecomputing dot com
2023-07-18 10:41 ` rsandifo at gcc dot gnu.org
2023-07-18 12:03 ` rsandifo at gcc dot gnu.org
2023-07-19  2:57 ` hliu at amperecomputing dot com
2023-07-19  8:55 ` rsandifo at gcc dot gnu.org
2023-07-19  9:36 ` hliu at amperecomputing dot com
2023-07-28 16:50 ` rsandifo at gcc dot gnu.org
2023-07-28 16:53 ` rsandifo at gcc dot gnu.org
2023-07-31  2:45 ` hliu at amperecomputing dot com
2023-07-31 12:56 ` cvs-commit at gcc dot gnu.org
2023-08-01  9:09 ` tnfchris at gcc dot gnu.org
2023-08-01  9:19 ` tnfchris at gcc dot gnu.org
2023-08-01  9:45 ` hliu at amperecomputing dot com
2023-08-01  9:49 ` tnfchris at gcc dot gnu.org
2023-08-01  9:50 ` hliu at amperecomputing dot com
2023-08-01 13:49 ` tnfchris at gcc dot gnu.org
2023-08-02  3:49 ` hliu at amperecomputing dot com
2023-08-04  2:34 ` cvs-commit at gcc dot gnu.org
2023-12-08 10:20 ` [Bug target/110625] [14 Regression][AArch64] " tnfchris at gcc dot gnu.org
2023-12-12 16:26 ` tnfchris at gcc dot gnu.org
2023-12-26 14:55 ` tnfchris at gcc dot gnu.org
2023-12-29 15:59 ` cvs-commit at gcc dot gnu.org
2023-12-29 16:16 ` tnfchris at gcc dot gnu.org
2023-12-30  8:34 ` hliu at amperecomputing dot com
