public inbox for gcc-patches@gcc.gnu.org
From: Hao Liu OS <hliu@os.amperecomputing.com>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: "GCC-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
Date: Thu, 3 Aug 2023 09:33:32 +0000	[thread overview]
Message-ID: <SJ2PR01MB8635FE0532272319D67B319AE108A@SJ2PR01MB8635.prod.exchangelabs.com> (raw)
In-Reply-To: <SJ2PR01MB86355A358E719CFEE0E2A785E10BA@SJ2PR01MB8635.prod.exchangelabs.com>

Gentle ping. Is it OK for master?

I'm afraid the ICE may cause trouble, and I hope it can be fixed ASAP.

Thanks,
Hao

________________________________________
From: Hao Liu OS <hliu@os.amperecomputing.com>
Sent: Wednesday, August 2, 2023 11:45
To: Richard Sandiford
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

Hi Richard,

I've updated the patch with a simple test case (see the case and comments below).  It shows that a live stmt may not have a reduction def, which introduces the ICE.

Is it OK for trunk?

----
Fix the assertion failure on an empty reduction definition in info_for_reduction.
Even if a stmt is live, it may still have an empty reduction definition.  Check the
reduction definition instead of the live info before calling info_for_reduction.

gcc/ChangeLog:

        PR target/110625
        * config/aarch64/aarch64.cc (aarch64_force_single_cycle): Check
        STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/pr110625_3.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc                 |  2 +-
 gcc/testsuite/gcc.target/aarch64/pr110625_3.c | 34 +++++++++++++++++++
 2 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_3.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
     return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_3.c b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
new file mode 100644
index 00000000000..35a50290cb0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mcpu=neoverse-n2" } */
+
+/* Avoid ICE on empty reduction def in single_defuse_cycle.
+
+   E.g.
+     <bb 3> [local count: 858993456]:
+     # sum_18 = PHI <sum_15(5), 0(2)>
+     sum.0_5 = (unsigned int) sum_18;
+     _6 = _4 + sum.0_5;     <-- it is "live" but doesn't have reduction def
+     sum_15 = (int) _6;
+     ...
+     if (ivtmp_29 != 0)
+       goto <bb 5>; [75.00%]
+     else
+       goto <bb 4>; [25.00%]
+
+     <bb 5> [local count: 644245086]:
+     goto <bb 3>; [100.00%]
+
+     <bb 4> [local count: 214748368]:
+     # _31 = PHI <_6(3)>
+     _8 = _31 >> 1;
+*/
+
+int
+f (unsigned int *tmp)
+{
+  int sum = 0;
+  for (int i = 0; i < 4; i++)
+    sum += tmp[i];
+
+  return (unsigned int) sum >> 1;
+}
--
2.34.1

________________________________________
From: Hao Liu OS <hliu@os.amperecomputing.com>
Sent: Tuesday, August 1, 2023 17:43
To: Richard Sandiford
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

Hi Richard,

This is a quick fix for the several reported ICEs.  It seems that even when STMT_VINFO_LIVE_P is true, some reduction stmts still don't have a REDUC_DEF.  So I changed the check to STMT_VINFO_REDUC_DEF.

Is it OK for trunk?

---
Fix the ICEs on an empty reduction definition.  Even when STMT_VINFO_LIVE_P is true, some
reduction stmts still don't have a definition.

gcc/ChangeLog:

        PR target/110625
        * config/aarch64/aarch64.cc (aarch64_force_single_cycle): Check
        STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.
---
 gcc/config/aarch64/aarch64.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
     return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
--
2.40.0


________________________________________
From: Richard Sandiford <richard.sandiford@arm.com>
Sent: Monday, July 31, 2023 17:11
To: Hao Liu OS
Cc: Richard Biener; GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

Hao Liu OS <hliu@os.amperecomputing.com> writes:
>> Which test case do you see this for?  The two tests in the patch still
>> seem to report correct latencies for me if I make the change above.
>
> Not the newly added tests.  It is still the existing case that caused the previous ICE (i.e. the assertion failure): gcc.target/aarch64/sve/cost_model_13.c.
>
> It's not that the test case itself fails, but that the vect dump says the "reduction latency" is 0:
>
> Before the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 1
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 2
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 2.000000
>
> After the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0         <--- seems inconsistent with the result above
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0         <--- seems inconsistent with the result above
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000         <--- seems inconsistent with the result above
>
> BTW, this should be caused by the reduction stmt not being "live"; "live" indicates whether a stmt is part of a computation whose result is used outside the loop (tree-vectorizer.h:1204):
>   <bb 3>:
>   # res_18 = PHI <res_15(7), 0(6)>
>   # i_20 = PHI <i_16(7), 0(6)>
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 2;
>   _3 = x_14(D) + _2;
>   _4 = *_3;
>   _5 = (unsigned short) _4;
>   res.0_6 = (unsigned short) res_18;
>   _7 = _5 + res.0_6;                             <-- This is not live, possibly because of the type cast stmt below.
>   res_15 = (short int) _7;
>   i_16 = i_20 + 1;
>   if (n_11(D) > i_16)
>     goto <bb 7>;
>   else
>     goto <bb 4>;
>
>   <bb 7>:
>   goto <bb 3>;
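>
> For reference, a loop of roughly the following shape reproduces that
> pattern (a reconstruction from the GIMPLE above, not necessarily the
> exact cost_model_13.c source):
>
>   /* Assumed shape only: with a short accumulator the add is done on
>      casted values, and it is the cast back to short, not the add
>      itself, whose result is used outside the loop.  */
>   short
>   reduc (short *x, int n)
>   {
>     short res = 0;
>     for (int i = 0; i < n; i++)
>       res += x[i];
>     return res;
>   }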

Ah, I see, thanks.  My concern was: if requiring !STMT_VINFO_LIVE_P stmts
can cause "normal" reductions to have a latency of 0, could the same thing
happen for single-cycle reductions?  But I suppose the answer is "no".
Introducing a cast like the above would cause reduc_chain_length > 1,
and so:

  if (ncopies > 1
      && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
      && reduc_chain_length == 1
      && loop_vinfo->suggested_unroll_factor == 1)
    single_defuse_cycle = true;

wouldn't trigger.  Which makes the single-cycle thing a bit hit-and-miss...

So yeah, I agree the patch is safe after all.

Please split the check out into a helper though, to avoid the awkward
formatting:

/* Return true if STMT_INFO is part of a reduction that has the form:

      r = r op ...;
      r = r op ...;

   with the single accumulator being read and written multiple times.  */
static bool
aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
{
  if (!STMT_VINFO_LIVE_P (stmt_info))
    return false;

  auto reduc_info = info_for_reduction (vinfo, stmt_info);
  return STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
}

OK with that change, thanks.

Richard


Thread overview: 21+ messages
2023-07-19  4:33 Hao Liu OS
2023-07-24  1:58 ` Hao Liu OS
2023-07-24 11:10 ` Richard Sandiford
2023-07-25  9:10   ` Hao Liu OS
2023-07-25  9:44     ` Richard Sandiford
2023-07-26  2:01       ` Hao Liu OS
2023-07-26  8:47         ` Richard Biener
2023-07-26  9:14           ` Richard Sandiford
2023-07-26 10:02             ` Richard Biener
2023-07-26 10:12               ` Richard Sandiford
2023-07-26 12:00                 ` Richard Biener
2023-07-26 12:54             ` Hao Liu OS
2023-07-28 10:06               ` Hao Liu OS
2023-07-28 17:35               ` Richard Sandiford
2023-07-31  2:39                 ` Hao Liu OS
2023-07-31  9:11                   ` Richard Sandiford
2023-07-31  9:25                     ` Hao Liu OS
2023-08-01  9:43                     ` Hao Liu OS
2023-08-02  3:45                       ` Hao Liu OS
2023-08-03  9:33                         ` Hao Liu OS [this message]
2023-08-03 10:10                         ` Richard Sandiford
