From: Richard Biener <richard.guenther@gmail.com>
To: liuhongt <hongtao.liu@intel.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Don't reduce estimated unrolled size for innermost loop.
Date: Mon, 13 May 2024 09:40:24 +0200
Message-ID: <CAFiYyc3snQb4M=mC8a7Bqx_kDr+104uxp35JnnpDQx4DYQLP1Q@mail.gmail.com>
In-Reply-To: <20240513022737.3105192-1-hongtao.liu@intel.com>

[-- Attachment #1: Type: text/plain, Size: 7193 bytes --]

On Mon, May 13, 2024 at 4:29 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> As the testcase in the PR shows, cunrolli at -O3 may prevent vectorization
> of the innermost loop and increase register pressure.
> The patch removes the 1/3 reduction of unr_insns for the innermost loop
> under UL_ALL.  The ul != UL_ALL condition is needed since complete
> unrolling of some small loops at -O2 relies on the reduction.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> No big impact for SPEC2017.
> Ok for trunk?

This removes the 1/3 reduction when unrolling a loop nest (the case I was
concerned about).  Unrolling of a nest is done by iterating in
tree_unroll_loops_completely, so the loop to be unrolled appears innermost.
So I think you need a new parameter on tree_unroll_loops_completely_1
indicating whether we're in the first iteration (or whether to assume
innermost loops will "simplify").
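
To sketch what I mean (the flag name is made up and the surrounding code
is from memory, so take it as a sketch only):

  /* In tree_unroll_loops_completely: only in the first fixpoint
     iteration is an "innermost" loop innermost in the original nest;
     later a loop only appears innermost because its children were
     already unrolled.  */
  bool first_p = true;
  do
    {
      changed = tree_unroll_loops_completely_1 (may_increase_size,
                                                unroll_outer, father_bbs,
                                                current_loops->tree_root,
                                                first_p);
      first_p = false;
      ...
    }
  while (changed);

tree_unroll_loops_completely_1 would pass the flag unchanged through its
recursion and down to estimated_unrolled_size, which would then keep the
2/3 scaling for loops that merely became innermost.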

A few comments below.

> gcc/ChangeLog:
>
>         PR tree-optimization/112325
>         * tree-ssa-loop-ivcanon.cc (estimated_unrolled_size): Add 2
>         new parameters: loop and ul, and remove unr_insns reduction
>         for innermost loop.
>         (try_unroll_loop_completely): Pass loop and ul to
>         estimated_unrolled_size.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/pr112325.c: New test.
>         * gcc.dg/vect/pr69783.c: Add extra option --param
>         max-completely-peeled-insns=300.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr112325.c | 57 ++++++++++++++++++++++++
>  gcc/testsuite/gcc.dg/vect/pr69783.c      |  2 +-
>  gcc/tree-ssa-loop-ivcanon.cc             | 16 +++++--
>  3 files changed, 71 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> new file mode 100644
> index 00000000000..14208b3e7f8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> +
> +typedef unsigned short ggml_fp16_t;
> +static float table_f32_f16[1 << 16];
> +
> +inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
> +    unsigned short s;
> +    __builtin_memcpy(&s, &f, sizeof(unsigned short));
> +    return table_f32_f16[s];
> +}
> +
> +typedef struct {
> +    ggml_fp16_t d;
> +    ggml_fp16_t m;
> +    unsigned char qh[4];
> +    unsigned char qs[32 / 2];
> +} block_q5_1;
> +
> +typedef struct {
> +    float d;
> +    float s;
> +    char qs[32];
> +} block_q8_1;
> +
> +void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
> +    const int qk = 32;
> +    const int nb = n / qk;
> +
> +    const block_q5_1 * restrict x = vx;
> +    const block_q8_1 * restrict y = vy;
> +
> +    float sumf = 0.0;
> +
> +    for (int i = 0; i < nb; i++) {
> +        unsigned qh;
> +        __builtin_memcpy(&qh, x[i].qh, sizeof(qh));
> +
> +        int sumi = 0;
> +
> +        for (int j = 0; j < qk/2; ++j) {
> +            const unsigned char xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
> +            const unsigned char xh_1 = ((qh >> (j + 12)) ) & 0x10;
> +
> +            const int x0 = (x[i].qs[j] & 0xF) | xh_0;
> +            const int x1 = (x[i].qs[j] >> 4) | xh_1;
> +
> +            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
> +        }
> +
> +        sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi + ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
> +    }
> +
> +    *s = sumf;
> +}
> +
> +/* { dg-final { scan-tree-dump {(?n)Not unrolling loop [1-9] \(--param max-completely-peel-times limit reached} "cunrolli"} } */
> diff --git a/gcc/testsuite/gcc.dg/vect/pr69783.c b/gcc/testsuite/gcc.dg/vect/pr69783.c
> index 5df95d0ce4e..a1f75514d72 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr69783.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr69783.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target vect_float } */
> -/* { dg-additional-options "-Ofast -funroll-loops" } */
> +/* { dg-additional-options "-Ofast -funroll-loops --param max-completely-peeled-insns=300" } */

If we rely on unrolling of a loop, can you put #pragma unroll [N]
before the respective loop instead?
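
I.e. sth like the following on the loop we rely on being unrolled
(untested - GCC spells it "#pragma GCC unroll N", and IIRC an explicit
unroll request bypasses the size limits, so the --param change wouldn't
be needed; the loop below is just an illustration, not the actual loop
from pr69783.c):

  #pragma GCC unroll 8
  for (i = 0; i < 8; i++)
    sum += a[i] * b[i];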

>  #define NXX 516
>  #define NYY 516
> diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
> index bf017137260..5e0eca647a1 100644
> --- a/gcc/tree-ssa-loop-ivcanon.cc
> +++ b/gcc/tree-ssa-loop-ivcanon.cc
> @@ -444,7 +444,9 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
>
>  static unsigned HOST_WIDE_INT
>  estimated_unrolled_size (struct loop_size *size,
> -                        unsigned HOST_WIDE_INT nunroll)
> +                        unsigned HOST_WIDE_INT nunroll,
> +                        enum unroll_level ul,
> +                        class loop* loop)
>  {
>    HOST_WIDE_INT unr_insns = ((nunroll)
>                              * (HOST_WIDE_INT) (size->overall
> @@ -453,7 +455,15 @@ estimated_unrolled_size (struct loop_size *size,
>      unr_insns = 0;
>    unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
>
> -  unr_insns = unr_insns * 2 / 3;
> +  /* For an innermost loop, the loop body is not likely to be simplified
> +     by as much as 1/3, and unrolling may increase register pressure a lot.
> +     UL != UL_ALL is needed to still unroll small loops at O2.  */
> +  class loop *loop_father = loop_outer (loop);
> +  if (loop->inner || !loop_father

Do we ever get here for !loop_father?  We shouldn't.

> +      || loop_father->latch == EXIT_BLOCK_PTR_FOR_FN (cfun)

This means you exempt all loops that are direct children of the root of
the loop tree (the root's latch is EXIT_BLOCK_PTR_FOR_FN, so the check
fires exactly for outermost loops).  That doesn't make much sense.

> +      || ul != UL_ALL)

This is also quite odd - we're being more optimistic for UL_NO_GROWTH
than for UL_ALL?  This doesn't make much sense.

Overall I think this means that removing the optimism doesn't work so well?

If we need some extra leeway for UL_NO_GROWTH for what we expect
to unroll it might be better to add something like --param
nogrowth-completely-peeled-insns specifying a fixed surplus size?
Or we need to look at what the problem is with the regressing testcases,
or with the one you are trying to fix.
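
That is, sth like (hypothetical - neither the param nor the exact spot it
would apply at exist, this just illustrates a fixed surplus instead of
the 2/3 scaling):

  /* In try_unroll_loop_completely: allow UL_NO_GROWTH unrolling to
     exceed the original loop size by a fixed number of insns rather
     than relying on the optimistic 2/3 reduction.  */
  else if (unr_insns
           > ninsns + (unsigned) param_nogrowth_completely_peeled_insns
           && ul == UL_NO_GROWTH)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "Not unrolling loop %d: "
                 "size would grow beyond the allowed surplus.\n",
                 loop->num);
      return false;
    }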

I did experiment at some point with better estimating the cleanup that is
done (see attached), but didn't get to finishing that (and as said, since
we're running VN on the result we'd ideally do that as part of the
estimation somehow).

Richard.

> +    unr_insns = unr_insns * 2 / 3;
> +
>    if (unr_insns <= 0)
>      unr_insns = 1;
>
> @@ -837,7 +847,7 @@ try_unroll_loop_completely (class loop *loop,
>
>           unsigned HOST_WIDE_INT ninsns = size.overall;
>           unsigned HOST_WIDE_INT unr_insns
> -           = estimated_unrolled_size (&size, n_unroll);
> +           = estimated_unrolled_size (&size, n_unroll, ul, loop);
>           if (dump_file && (dump_flags & TDF_DETAILS))
>             {
>               fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
> --
> 2.31.1
>

[-- Attachment #2: p --]
[-- Type: application/octet-stream, Size: 9048 bytes --]

From 009002633e3acce6031094c8d1910b7d99314f9f Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Mon, 14 Aug 2023 12:02:41 +0200
Subject: [PATCH] test unroll
To: gcc-patches@gcc.gnu.org

---
 .../gcc.dg/fstack-protector-strong.c          |   4 +-
 gcc/tree-ssa-loop-ivcanon.cc                  | 158 ++++++++++++------
 2 files changed, 113 insertions(+), 49 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/fstack-protector-strong.c b/gcc/testsuite/gcc.dg/fstack-protector-strong.c
index 94dc3508f1a..fafa1917449 100644
--- a/gcc/testsuite/gcc.dg/fstack-protector-strong.c
+++ b/gcc/testsuite/gcc.dg/fstack-protector-strong.c
@@ -28,7 +28,7 @@ foo1 ()
 struct ArrayStruct
 {
   int a;
-  int array[10];
+  int array[18];
 };
 
 struct AA
@@ -43,7 +43,7 @@ foo2 ()
 {
   struct AA aa;
   int i;
-  for (i = 0; i < 10; ++i)
+  for (i = 0; i < 18; ++i)
     {
       aa.as.array[i] = i * (i-1) + i / 2;
     }
diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index bf017137260..58e4168c209 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -159,6 +159,7 @@ struct loop_size
   int num_branches_on_hot_path;
 };
 
+#if 0
 /* Return true if OP in STMT will be constant after peeling LOOP.  */
 
 static bool
@@ -246,6 +247,7 @@ constant_after_peeling (tree op, gimple *stmt, class loop *loop)
     }
   return true;
 }
+#endif
 
 /* Computes an estimated number of insns in LOOP.
    EXIT (if non-NULL) is an exite edge that will be eliminated in all but last
@@ -277,6 +279,31 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Estimating sizes for loop %i\n", loop->num);
+
+  static hash_map<tree, tree> *vals;
+  vals = new hash_map<tree, tree>;
+  edge pe = loop_preheader_edge (loop);
+  for (auto si = gsi_start_phis (loop->header);
+       !gsi_end_p (si); gsi_next (&si))
+    {
+      if (virtual_operand_p (gimple_phi_result (*si)))
+	continue;
+      tree val = gimple_phi_arg_def_from_edge (*si, pe);
+      if (CONSTANT_CLASS_P (val))
+	{
+	  vals->put (gimple_phi_result (*si), val);
+	  tree ev = analyze_scalar_evolution (loop, gimple_phi_result (*si));
+	  if (!chrec_contains_undetermined (ev)
+	      && !chrec_contains_symbols (ev))
+	    size->constant_iv = true;
+	}
+    }
+
+  auto els_valueize = [] (tree op) -> tree
+    { if (tree *val = vals->get (op)) return *val; return op; };
+
+  auto process_loop = [&] () -> bool
+    {
   for (i = 0; i < loop->num_nodes; i++)
     {
       if (edge_to_cancel && body[i] != edge_to_cancel->src
@@ -323,57 +350,51 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
 			     "in last copy.\n");
 		  likely_eliminated_last = true;
 		}
-	      /* Sets of IV variables  */
-	      if (gimple_code (stmt) == GIMPLE_ASSIGN
-		  && constant_after_peeling (gimple_assign_lhs (stmt), stmt, loop))
+	      /* Stores are not eliminated.  */
+	      if (gimple_vdef (stmt))
+		goto account;
+	      /* Below we are using constant folding to decide whether
+		 we can elide a stmt.  While for the first iteration we
+		 could use the actual value for the rest we have to
+		 avoid the situation re-using a * 1 or + 0 operand, so
+		 require all SSA operands to be constants here.  */
+	      bool fail = false;
+	      ssa_op_iter iter;
+	      use_operand_p use_p;
+	      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)
+		if (!vals->get (USE_FROM_PTR (use_p)))
+		  {
+		    fail = true;
+		    break;
+		  }
+	      if (fail)
+		goto account;
+	      tree val;
+	      /* Switches are not handled by folding.  */
+	      if (gimple_code (stmt) == GIMPLE_SWITCH
+		  && ! is_gimple_min_invariant
+		  (gimple_switch_index (as_a <gswitch *> (stmt)))
+		  && vals->get (gimple_switch_index
+				(as_a <gswitch *> (stmt))))
 		{
 		  if (dump_file && (dump_flags & TDF_DETAILS))
-		    fprintf (dump_file, "   Induction variable computation will"
-			     " be folded away.\n");
+		    fprintf (dump_file, "   Constant conditional.\n");
 		  likely_eliminated = true;
 		}
-	      /* Assignments of IV variables.  */
-	      else if (gimple_code (stmt) == GIMPLE_ASSIGN
-		       && TREE_CODE (gimple_assign_lhs (stmt)) == SSA_NAME
-		       && constant_after_peeling (gimple_assign_rhs1 (stmt),
-						  stmt, loop)
-		       && (gimple_assign_rhs_class (stmt) != GIMPLE_BINARY_RHS
-			   || constant_after_peeling (gimple_assign_rhs2 (stmt),
-						      stmt, loop))
-		       && gimple_assign_rhs_class (stmt) != GIMPLE_TERNARY_RHS)
+	      else if ((val = gimple_fold_stmt_to_constant (stmt, els_valueize))
+		       && CONSTANT_CLASS_P (val))
 		{
-		  size->constant_iv = true;
 		  if (dump_file && (dump_flags & TDF_DETAILS))
 		    fprintf (dump_file,
 			     "   Constant expression will be folded away.\n");
 		  likely_eliminated = true;
-		}
-	      /* Conditionals.  */
-	      else if ((gimple_code (stmt) == GIMPLE_COND
-			&& constant_after_peeling (gimple_cond_lhs (stmt), stmt,
-						   loop)
-			&& constant_after_peeling (gimple_cond_rhs (stmt), stmt,
-						   loop)
-			/* We don't simplify all constant compares so make sure
-			   they are not both constant already.  See PR70288.  */
-			&& (! is_gimple_min_invariant (gimple_cond_lhs (stmt))
-			    || ! is_gimple_min_invariant
-				 (gimple_cond_rhs (stmt))))
-		       || (gimple_code (stmt) == GIMPLE_SWITCH
-			   && constant_after_peeling (gimple_switch_index (
-							as_a <gswitch *>
-							  (stmt)),
-						      stmt, loop)
-			   && ! is_gimple_min_invariant
-				   (gimple_switch_index
-				      (as_a <gswitch *> (stmt)))))
-		{
-		  if (dump_file && (dump_flags & TDF_DETAILS))
-		    fprintf (dump_file, "   Constant conditional.\n");
-		  likely_eliminated = true;
+		  if (tree lhs = gimple_get_lhs (stmt))
+		    if (TREE_CODE (lhs) == SSA_NAME)
+		      vals->put (lhs, val);
 		}
 	    }
 
+account:
 	  size->overall += num;
 	  if (likely_eliminated || likely_eliminated_peeled)
 	    size->eliminated_by_peeling += num;
@@ -386,11 +407,55 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
 	  if ((size->overall * 3 / 2 - size->eliminated_by_peeling
 	      - size->last_iteration_eliminated_by_peeling) > upper_bound)
 	    {
-              free (body);
-	      return true;
+	      free (body);
+	      delete vals;
+	      vals = nullptr;
+	      return false;
 	    }
 	}
     }
+  return true;
+  };
+
+  /* Estimate the size of the unrolled first iteration.  */
+  if (!process_loop ())
+    return true;
+
+  /* Determine whether the IVs will stay constant (we simply assume that
+     if the 2nd iteration receives a constant value the third and all
+     further will so as well).  */
+  for (auto si = gsi_start_phis (loop->header);
+       !gsi_end_p (si); gsi_next (&si))
+    {
+      if (virtual_operand_p (gimple_phi_result (*si)))
+	continue;
+      tree def = gimple_phi_arg_def_from_edge (*si, loop_latch_edge (loop));
+      if (CONSTANT_CLASS_P (def) || vals->get (def))
+	/* ???  If we compute the first iteration size separately we
+	   could also handle an invariant backedge value more
+	   optimistically.
+	   ???  Note the actual value we leave here may still have an
+	   effect on the constant-ness.  */
+	;
+      else
+	vals->remove (gimple_phi_result (*si));
+    }
+
+  /* Reset sizes and compute the size based on the adjustment above.
+     ???  We could keep the more precise and optimistic counts for
+     the first iteration.  */
+  size->overall = 0;
+  size->eliminated_by_peeling = 0;
+  size->last_iteration = 0;
+  size->last_iteration_eliminated_by_peeling = 0;
+  size->num_pure_calls_on_hot_path = 0;
+  size->num_non_pure_calls_on_hot_path = 0;
+  size->non_call_stmts_on_hot_path = 0;
+  size->num_branches_on_hot_path = 0;
+  size->constant_iv = 0;
+  if (!process_loop ())
+    return true;
+
   while (path.length ())
     {
       basic_block bb = path.pop ();
@@ -412,13 +477,10 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
 	  else if (gimple_code (stmt) != GIMPLE_DEBUG)
 	    size->non_call_stmts_on_hot_path++;
 	  if (((gimple_code (stmt) == GIMPLE_COND
-	        && (!constant_after_peeling (gimple_cond_lhs (stmt), stmt, loop)
-		    || !constant_after_peeling (gimple_cond_rhs (stmt), stmt,
-						loop)))
+		&& (!vals->get (gimple_cond_lhs (stmt))
+		    || !vals->get (gimple_cond_rhs (stmt))))
 	       || (gimple_code (stmt) == GIMPLE_SWITCH
-		   && !constant_after_peeling (gimple_switch_index (
-						 as_a <gswitch *> (stmt)),
-					       stmt, loop)))
+		   && !vals->get (gimple_switch_index (as_a <gswitch *> (stmt)))))
 	      && (!exit || bb != exit->src))
 	    size->num_branches_on_hot_path++;
 	}
@@ -430,6 +492,8 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
 	     size->last_iteration_eliminated_by_peeling);
 
   free (body);
+  delete vals;
+  vals = nullptr;
   return false;
 }
 
-- 
2.35.3


Thread overview: 12+ messages
2024-05-13  2:27 liuhongt
2024-05-13  7:40 ` Richard Biener [this message]
2024-05-15  2:14   ` Hongtao Liu
2024-05-15  9:24     ` Richard Biener
2024-05-15  9:49       ` Hongtao Liu
2024-05-21  2:35       ` Hongtao Liu
2024-05-21  6:14         ` Richard Biener
2024-05-22  5:07           ` [V2 PATCH] Don't reduce estimated unrolled size for innermost loop at cunrolli liuhongt
2024-05-23  1:55             ` Hongtao Liu
2024-05-23 11:59             ` Richard Biener
2024-05-24  7:29               ` [V3 PATCH] Don't reduce estimated unrolled size for innermost loop liuhongt
2024-05-29 11:22                 ` Richard Biener
