[V2 PATCH] Don't reduce estimated unrolled size for innermost loop at cunrolli.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: liuhongt <hongtao.liu@intel.com>
To: gcc-patches@gcc.gnu.org
Cc: richard.guenther@gmail.com
Subject: [V2 PATCH] Don't reduce estimated unrolled size for innermost loop at cunrolli.
Date: Wed, 22 May 2024 13:07:34 +0800	[thread overview]
Message-ID: <20240522050734.1129622-1-hongtao.liu@intel.com> (raw)
In-Reply-To: <CAFiYyc3xP_8X3SYbbPdqyiOOt_J6+QN=a4gmbPedCfcXS0EPoA@mail.gmail.com>

>> Hard to find a default value satisfying all testcases.
>> some require loop unroll with 7 insns increment, some don't want loop
>> unroll w/ 5 insn increment.
>> The original 2/3 reduction happened to meet all those testcases(or the
>> testcases are constructed based on the old 2/3).
>> Can we define the parameter as the size of the loop, below the size we
>> still do the reduction, so the small loop can be unrolled?

>Yeah, that's also a sensible possibility.  Does it work to have a parameter
>for the unrolled body size?  Thus, amend the existing
>--param max-completely-peeled-insns with a --param
>max-completely-peeled-insns-nogrowth?

Update V2:
It's still hard to find a default value for loop boday size. So I move the
2 / 3 reduction from estimated_unrolled_size to try_unroll_loop_completely.
For the check of body size shrink, 2 / 3 reduction is added, so small loops
can still be unrolled.
For the check of comparison between body size and param_max_completely_peeled_insns,
2 / 3 is conditionally added for loop->inner || !cunrolli.
Then the patch avoid gcc testsuite regression, and also prevent big inner loop
completely unrolled at cunrolli.

------------------

For the innermost loop, after completely loop unroll, it will most likely
not be able to reduce the body size to 2/3. The current 2/3 reduction
will make some of the larger loops completely unrolled during
cunrolli, which will then result in them not being able to be
vectorized. It also increases the register pressure. The patch move
from estimated_unrolled_size to
the 2/3 reduction at cunrolli.

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	PR tree-optimization/112325
	* tree-ssa-loop-ivcanon.cc (estimated_unrolled_size): Move the
	2 / 3 loop body size reduction to ..
	(try_unroll_loop_completely): .. here, add it for the check of
	body size shrink, and the check of comparison against
	param_max_completely_peeled_insns when
	(!cunrolli ||loop->inner).
	(canonicalize_loop_induction_variables): Add new parameter
	cunrolli and pass down.
	(tree_unroll_loops_completely_1): Ditto.
	(tree_unroll_loops_completely): Ditto.
	(canonicalize_induction_variables): Handle new parameter.
	(pass_complete_unrolli::execute): Ditto.
	(pass_complete_unroll::execute): Ditto.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/pr112325.c: New test.
	* gcc.dg/vect/pr69783.c: Add extra option --param
	max-completely-peeled-insns=300.
---
 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c | 57 ++++++++++++++++++++++++
 gcc/testsuite/gcc.dg/vect/pr69783.c      |  2 +-
 gcc/tree-ssa-loop-ivcanon.cc             | 45 ++++++++++---------
 3 files changed, 83 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
new file mode 100644
index 00000000000..14208b3e7f8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
@@ -0,0 +1,57 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
+
+typedef unsigned short ggml_fp16_t;
+static float table_f32_f16[1 << 16];
+
+inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
+    unsigned short s;
+    __builtin_memcpy(&s, &f, sizeof(unsigned short));
+    return table_f32_f16[s];
+}
+
+typedef struct {
+    ggml_fp16_t d;
+    ggml_fp16_t m;
+    unsigned char qh[4];
+    unsigned char qs[32 / 2];
+} block_q5_1;
+
+typedef struct {
+    float d;
+    float s;
+    char qs[32];
+} block_q8_1;
+
+void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
+    const int qk = 32;
+    const int nb = n / qk;
+
+    const block_q5_1 * restrict x = vx;
+    const block_q8_1 * restrict y = vy;
+
+    float sumf = 0.0;
+
+    for (int i = 0; i < nb; i++) {
+        unsigned qh;
+        __builtin_memcpy(&qh, x[i].qh, sizeof(qh));
+
+        int sumi = 0;
+
+        for (int j = 0; j < qk/2; ++j) {
+            const unsigned char xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
+            const unsigned char xh_1 = ((qh >> (j + 12)) ) & 0x10;
+
+            const int x0 = (x[i].qs[j] & 0xF) | xh_0;
+            const int x1 = (x[i].qs[j] >> 4) | xh_1;
+
+            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
+        }
+
+        sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi + ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
+    }
+
+    *s = sumf;
+}
+
+/* { dg-final { scan-tree-dump {(?n)Not unrolling loop [1-9] \(--param max-completely-peel-times limit reached} "cunrolli"} } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr69783.c b/gcc/testsuite/gcc.dg/vect/pr69783.c
index 5df95d0ce4e..a1f75514d72 100644
--- a/gcc/testsuite/gcc.dg/vect/pr69783.c
+++ b/gcc/testsuite/gcc.dg/vect/pr69783.c
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target vect_float } */
-/* { dg-additional-options "-Ofast -funroll-loops" } */
+/* { dg-additional-options "-Ofast -funroll-loops --param max-completely-peeled-insns=300" } */
 
 #define NXX 516
 #define NYY 516
diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index bf017137260..cc53eee1301 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -437,11 +437,7 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
    It is (NUNROLL + 1) * size of loop body with taking into account
    the fact that in last copy everything after exit conditional
    is dead and that some instructions will be eliminated after
-   peeling.
-
-   Loop body is likely going to simplify further, this is difficult
-   to guess, we just decrease the result by 1/3.  */
-
+   peeling.  */
 static unsigned HOST_WIDE_INT
 estimated_unrolled_size (struct loop_size *size,
 			 unsigned HOST_WIDE_INT nunroll)
@@ -453,7 +449,6 @@ estimated_unrolled_size (struct loop_size *size,
     unr_insns = 0;
   unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
 
-  unr_insns = unr_insns * 2 / 3;
   if (unr_insns <= 0)
     unr_insns = 1;
 
@@ -734,7 +729,8 @@ try_unroll_loop_completely (class loop *loop,
 			    edge exit, tree niter, bool may_be_zero,
 			    enum unroll_level ul,
 			    HOST_WIDE_INT maxiter,
-			    dump_user_location_t locus, bool allow_peel)
+			    dump_user_location_t locus, bool allow_peel,
+			    bool cunrolli)
 {
   unsigned HOST_WIDE_INT n_unroll = 0;
   bool n_unroll_found = false;
@@ -847,8 +843,9 @@ try_unroll_loop_completely (class loop *loop,
 
 	  /* If the code is going to shrink, we don't need to be extra
 	     cautious on guessing if the unrolling is going to be
-	     profitable.  */
-	  if (unr_insns
+	     profitable.
+	     Move from estimated_unrolled_size to unroll small loops.  */
+	  if (unr_insns * 2 / 3
 	      /* If there is IV variable that will become constant, we
 		 save one instruction in the loop prologue we do not
 		 account otherwise.  */
@@ -919,7 +916,13 @@ try_unroll_loop_completely (class loop *loop,
 			 loop->num);
 	      return false;
 	    }
-	  else if (unr_insns
+	  /* Move 2 / 3 reduction from estimated_unrolled_size, but don't reduce
+	     unrolled size for innermost loop when cunrolli.
+	     1) It could increase register pressure.
+	     2) Big loop after completely unroll may not be vectorized
+	     by BB vectorizer.  */
+	  else if ((cunrolli && !loop->inner
+		    ? unr_insns : unr_insns * 2 / 3)
 		   > (unsigned) param_max_completely_peeled_insns)
 	    {
 	      if (dump_file && (dump_flags & TDF_DETAILS))
@@ -1227,7 +1230,7 @@ try_peel_loop (class loop *loop,
 static bool
 canonicalize_loop_induction_variables (class loop *loop,
 				       bool create_iv, enum unroll_level ul,
-				       bool try_eval, bool allow_peel)
+				       bool try_eval, bool allow_peel, bool cunrolli)
 {
   edge exit = NULL;
   tree niter;
@@ -1314,7 +1317,7 @@ canonicalize_loop_induction_variables (class loop *loop,
 
   dump_user_location_t locus = find_loop_location (loop);
   if (try_unroll_loop_completely (loop, exit, niter, may_be_zero, ul,
-				  maxiter, locus, allow_peel))
+				  maxiter, locus, allow_peel, cunrolli))
     return true;
 
   if (create_iv
@@ -1358,7 +1361,7 @@ canonicalize_induction_variables (void)
     {
       changed |= canonicalize_loop_induction_variables (loop,
 							true, UL_SINGLE_ITER,
-							true, false);
+							true, false, false);
     }
   gcc_assert (!need_ssa_update_p (cfun));
 
@@ -1392,7 +1395,7 @@ canonicalize_induction_variables (void)
 
 static bool
 tree_unroll_loops_completely_1 (bool may_increase_size, bool unroll_outer,
-				bitmap father_bbs, class loop *loop)
+				bitmap father_bbs, class loop *loop, bool cunrolli)
 {
   class loop *loop_father;
   bool changed = false;
@@ -1410,7 +1413,7 @@ tree_unroll_loops_completely_1 (bool may_increase_size, bool unroll_outer,
 	if (!child_father_bbs)
 	  child_father_bbs = BITMAP_ALLOC (NULL);
 	if (tree_unroll_loops_completely_1 (may_increase_size, unroll_outer,
-					    child_father_bbs, inner))
+					    child_father_bbs, inner, cunrolli))
 	  {
 	    bitmap_ior_into (father_bbs, child_father_bbs);
 	    bitmap_clear (child_father_bbs);
@@ -1456,7 +1459,7 @@ tree_unroll_loops_completely_1 (bool may_increase_size, bool unroll_outer,
     ul = UL_NO_GROWTH;
 
   if (canonicalize_loop_induction_variables
-        (loop, false, ul, !flag_tree_loop_ivcanon, unroll_outer))
+      (loop, false, ul, !flag_tree_loop_ivcanon, unroll_outer, cunrolli))
     {
       /* If we'll continue unrolling, we need to propagate constants
 	 within the new basic blocks to fold away induction variable
@@ -1485,7 +1488,8 @@ tree_unroll_loops_completely_1 (bool may_increase_size, bool unroll_outer,
    size of the code does not increase.  */
 
 static unsigned int
-tree_unroll_loops_completely (bool may_increase_size, bool unroll_outer)
+tree_unroll_loops_completely (bool may_increase_size, bool unroll_outer,
+			      bool cunrolli)
 {
   bitmap father_bbs = BITMAP_ALLOC (NULL);
   bool changed;
@@ -1507,7 +1511,8 @@ tree_unroll_loops_completely (bool may_increase_size, bool unroll_outer)
 
       changed = tree_unroll_loops_completely_1 (may_increase_size,
 						unroll_outer, father_bbs,
-						current_loops->tree_root);
+						current_loops->tree_root,
+						cunrolli);
       if (changed)
 	{
 	  unsigned i;
@@ -1671,7 +1676,7 @@ pass_complete_unroll::execute (function *fun)
   if (flag_peel_loops)
     peeled_loops = BITMAP_ALLOC (NULL);
   unsigned int val = tree_unroll_loops_completely (flag_cunroll_grow_size, 
-						   true);
+						   true, false);
   if (peeled_loops)
     {
       BITMAP_FREE (peeled_loops);
@@ -1727,7 +1732,7 @@ pass_complete_unrolli::execute (function *fun)
   if (number_of_loops (fun) > 1)
     {
       scev_initialize ();
-      ret = tree_unroll_loops_completely (optimize >= 3, false);
+      ret = tree_unroll_loops_completely (optimize >= 3, false, true);
       scev_finalize ();
     }
   loop_optimizer_finalize ();
-- 
2.31.1

next prev parent reply	other threads:[~2024-05-22  5:07 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-13  2:27 [PATCH] Don't reduce estimated unrolled size for innermost loop liuhongt
2024-05-13  7:40 ` Richard Biener
2024-05-15  2:14   ` Hongtao Liu
2024-05-15  9:24     ` Richard Biener
2024-05-15  9:49       ` Hongtao Liu
2024-05-21  2:35       ` Hongtao Liu
2024-05-21  6:14         ` Richard Biener
2024-05-22  5:07           ` liuhongt [this message]
2024-05-23  1:55             ` [V2 PATCH] Don't reduce estimated unrolled size for innermost loop at cunrolli Hongtao Liu
2024-05-23 11:59             ` Richard Biener
2024-05-24  7:29               ` [V3 PATCH] Don't reduce estimated unrolled size for innermost loop liuhongt
2024-05-29 11:22                 ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240522050734.1129622-1-hongtao.liu@intel.com \
    --to=hongtao.liu@intel.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=richard.guenther@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).