[Bug tree-optimization/77689] Missing vectorization lead to huge performance loss

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss
       [not found] <bug-77689-4@http.gcc.gnu.org/bugzilla/>
@ 2021-08-25  6:07 ` pinskia at gcc dot gnu.org
  2021-08-25  7:38 ` crazylht at gmail dot com
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 6+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-25  6:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77689

--- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On the trunk, loop splitting code is there but it is not actually splitting the
loop

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss
       [not found] <bug-77689-4@http.gcc.gnu.org/bugzilla/>
  2021-08-25  6:07 ` [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss pinskia at gcc dot gnu.org
@ 2021-08-25  7:38 ` crazylht at gmail dot com
  2023-07-28  8:12 ` hubicka at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 6+ messages in thread
From: crazylht at gmail dot com @ 2021-08-25  7:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77689

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #13)
> On the trunk, loop splitting code is there but it is not actually splitting
> the loop

Does GCC perform loop peeling?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss
       [not found] <bug-77689-4@http.gcc.gnu.org/bugzilla/>
  2021-08-25  6:07 ` [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss pinskia at gcc dot gnu.org
  2021-08-25  7:38 ` crazylht at gmail dot com
@ 2023-07-28  8:12 ` hubicka at gcc dot gnu.org
  2023-07-28 11:53 ` hubicka at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 6+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-07-28  8:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77689

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
loop splitting gives up here because it only handles test
  i < val
or 
  i > val

Not i = val.  It could easily work out that i == 0 is same as i < 1 from vlaue
range of i.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss
       [not found] <bug-77689-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2023-07-28  8:12 ` hubicka at gcc dot gnu.org
@ 2023-07-28 11:53 ` hubicka at gcc dot gnu.org
  2023-07-28 13:11 ` hubicka at gcc dot gnu.org
  2023-07-28 14:19 ` cvs-commit at gcc dot gnu.org
  5 siblings, 0 replies; 6+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-07-28 11:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77689

--- Comment #16 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I am testing the following that makes loop splitting understand when first
iteration is special.

diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 70cd0aaefa7..1fd3ee1d1e5 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -42,6 +42,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "gimplify-me.h"
 #include "print-tree.h"
+#include "value-query.h"

 /* This file implements two kinds of loop splitting.

@@ -75,7 +76,8 @@ along with GCC; see the file COPYING3.  If not see
    point in *BORDER and the comparison induction variable in IV.  */

 static tree
-split_at_bb_p (class loop *loop, basic_block bb, tree *border, affine_iv *iv)
+split_at_bb_p (class loop *loop, basic_block bb, tree *border, affine_iv *iv,
+              enum tree_code *guard_code)
 {
   gcond *stmt;
   affine_iv iv2;
@@ -87,19 +89,6 @@ split_at_bb_p (class loop *loop, basic_block bb, tree
*border, affine_iv *iv)

   enum tree_code code = gimple_cond_code (stmt);

-  /* Only handle relational comparisons, for equality and non-equality
-     we'd have to split the loop into two loops and a middle statement.  */
-  switch (code)
-    {
-      case LT_EXPR:
-      case LE_EXPR:
-      case GT_EXPR:
-      case GE_EXPR:
-       break;
-      default:
-       return NULL_TREE;
-    }
-
   if (loop_exits_from_bb_p (loop, bb))
     return NULL_TREE;

@@ -129,6 +118,55 @@ split_at_bb_p (class loop *loop, basic_block bb, tree
*border, affine_iv *iv)
   if (!iv->no_overflow)
     return NULL_TREE;

+  /* Only handle relational comparisons, for equality and non-equality
+     we'd have to split the loop into two loops and a middle statement.  */
+  switch (code)
+    {
+      case LT_EXPR:
+      case LE_EXPR:
+      case GT_EXPR:
+      case GE_EXPR:
+       break;
+      case NE_EXPR:
+      case EQ_EXPR:
+       /* If the test check for first iteration, we can handle NE/EQ
+          with only one split loop.  */
+       if (operand_equal_p (iv->base, iv2.base, 0))
+         {
+           if (code == EQ_EXPR)
+             code = !tree_int_cst_sign_bit (iv->step) ? LE_EXPR : GE_EXPR;
+           else
+             code = !tree_int_cst_sign_bit (iv->step) ? GT_EXPR : LT_EXPR;
+           break;
+         }
+       /* Similarly when the test checks for minimal or maximal
+          value range.  */
+       else
+         {
+           int_range<2> r;
+           get_global_range_query ()->range_of_expr (r, op0, stmt);
+           if (!r.varying_p () && !r.undefined_p ())
+             {
+               wide_int val = wi::to_wide (op1);
+               if (known_eq (val, r.lower_bound ()))
+                 {
+                   code = (code == EQ_EXPR) ? LE_EXPR : GT_EXPR;
+                   break;
+                 }
+               else if (known_eq (val, r.upper_bound ()))
+                 {
+                   code = (code == EQ_EXPR) ? GE_EXPR : LT_EXPR;
+                   break;
+                 }
+             }
+         }
+       /* TODO: We can compare with exit condition; it seems that testing for
+          last iteration is common case.  */
+       return NULL_TREE;
+      default:
+       return NULL_TREE;
+    }
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, "Found potential split point: ");
@@ -143,6 +181,7 @@ split_at_bb_p (class loop *loop, basic_block bb, tree
*border, affine_iv *iv)
     }

   *border = iv2.base;
+  *guard_code = code;
   return op0;
 }

@@ -551,7 +590,7 @@ split_loop (class loop *loop1)
     {
       if (!niter.control.no_overflow)
        return false;
-      if (tree_int_cst_sign_bit (niter.control.step) > 0)
+      if (tree_int_cst_sign_bit (niter.control.step))
        niter.cmp = GT_EXPR;
       else
        niter.cmp = LT_EXPR;
@@ -566,8 +605,9 @@ split_loop (class loop *loop1)
     }

   /* Find a splitting opportunity.  */
+  enum tree_code guard_code;
   for (i = 0; i < loop1->num_nodes; i++)
-    if ((guard_iv = split_at_bb_p (loop1, bbs[i], &border, &iv)))
+    if ((guard_iv = split_at_bb_p (loop1, bbs[i], &border, &iv, &guard_code)))
       {
        profile_count entry_count = loop_preheader_edge (loop1)->count ();
        /* Handling opposite steps is not implemented yet.  Neither
@@ -585,7 +625,6 @@ split_loop (class loop *loop1)
        gcond *guard_stmt = as_a<gcond *> (*gsi_last_bb (bbs[i]));
        tree guard_init = PHI_ARG_DEF_FROM_EDGE (phi,
                                                 loop_preheader_edge (loop1));
-       enum tree_code guard_code = gimple_cond_code (guard_stmt);

        /* Loop splitting is implemented by versioning the loop, placing
           the new loop after the old loop, make the first loop iterate

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss
       [not found] <bug-77689-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2023-07-28 11:53 ` hubicka at gcc dot gnu.org
@ 2023-07-28 13:11 ` hubicka at gcc dot gnu.org
  2023-07-28 14:19 ` cvs-commit at gcc dot gnu.org
  5 siblings, 0 replies; 6+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-07-28 13:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77689

--- Comment #17 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I posted the patch.  With it we split the loop, but we don't get really big
improvements from that
h@ryzen3:~/gcc/build3/gcc> ./xgcc -B ./ -Ofast c.ii -S -fopt-info 2>&1 | grep
split ; perf stat ./a.out
c.C:15:9: optimized: loop split

 Performance counter stats for './a.out':

            862.32 msec task-clock:u                     #    0.978 CPUs
utilized             
                 0      context-switches:u               #    0.000 /sec        
                 0      cpu-migrations:u                 #    0.000 /sec        
            53,443      page-faults:u                    #   61.976 K/sec       
     3,295,805,448      cycles:u                         #    3.822 GHz        
                (83.44%)
        81,606,129      stalled-cycles-frontend:u        #    2.48% frontend
cycles idle        (83.33%)
         8,205,437      stalled-cycles-backend:u         #    0.25% backend
cycles idle         (83.32%)
     7,420,801,599      instructions:u                   #    2.25  insn per
cycle            
                                                  #    0.01  stalled cycles per
insn     (82.88%)
       903,367,479      branches:u                       #    1.048 G/sec      
                (83.50%)
            54,872      branch-misses:u                  #    0.01% of all
branches             (83.53%)

       0.881716607 seconds time elapsed

       0.782798000 seconds user
       0.079877000 seconds sys

jh@ryzen3:~/gcc/build3/gcc> ~/trunk-install/bin/g++ -Ofast c.ii -S -fopt-info
2>&1 | grep split ; perf stat ./a.out

 Performance counter stats for './a.out':

            905.76 msec task-clock:u                     #    0.998 CPUs
utilized             
                 0      context-switches:u               #    0.000 /sec        
                 0      cpu-migrations:u                 #    0.000 /sec        
            51,910      page-faults:u                    #   57.311 K/sec       
     3,459,244,533      cycles:u                         #    3.819 GHz        
                (83.24%)
        83,603,137      stalled-cycles-frontend:u        #    2.42% frontend
cycles idle        (83.24%)
        13,908,621      stalled-cycles-backend:u         #    0.40% backend
cycles idle         (83.25%)
     7,422,922,864      instructions:u                   #    2.15  insn per
cycle            
                                                  #    0.01  stalled cycles per
insn     (83.30%)
       899,226,266      branches:u                       #  992.791 M/sec      
                (83.67%)
            52,719      branch-misses:u                  #    0.01% of all
branches             (83.31%)

       0.907459830 seconds time elapsed

       0.810481000 seconds user
       0.095820000 seconds sys

optimized dump is:
  <bb 6> [local count: 679982665]:
  # ivtmp.62_101 = PHI <1(5), ivtmp.62_16(6)>
  _122 = MEM[(value_type &)_44 + ivtmp.62_101 * 4];
  _120 = MEM[(value_type &)_41 + ivtmp.62_101 * 4];
  _119 = _120 + _122;
  _114 = MEM[(value_type &)_41 + 18446744073709551612 + ivtmp.62_101 * 4];
  _113 = _114 * _119;
  _112 = (double) _113;
  _111 = (signed int) ivtmp.62_101;
  _110 = (double) _111;
  _109 = __builtin_log (_110);
  _108 = _109 * _112;
  _107 = (float) _108;
  MEM[(value_type &)_35 + ivtmp.62_101 * 4] = _107;
  ivtmp.62_16 = ivtmp.62_101 + 1;
  if (ivtmp.62_16 != 100000000)
    goto <bb 6>; [98.42%]
  else
    goto <bb 7>; [1.58%]

which looks reasonable. Vectorizer says
c.ii.174t.vect:c.C:18:26: missed:   versioning for alias required: can't
determine dependence between *_123 and *_106
c.ii.174t.vect:c.C:18:26: missed:   versioning for alias required: can't
determine dependence between *_121 and *_106
c.ii.174t.vect:c.C:18:34: missed:   versioning for alias required: can't
determine dependence between *_115 and *_106
c.ii.174t.vect:c.C:13:27: missed:   Unknown misalignment, naturally aligned
c.ii.174t.vect:c.C:13:27: missed:   Unknown misalignment, naturally aligned
c.ii.174t.vect:c.C:13:27: missed:   Unknown misalignment, naturally aligned
c.ii.174t.vect:c.C:13:27: missed:   Unknown misalignment, naturally aligned
c.ii.174t.vect:c.C:13:27: missed:   conversion not supported by target.
c.ii.174t.vect:c.C:13:27: missed:   no optab.
c.ii.174t.vect:c.C:13:27: missed:   function is not vectorizable.
c.ii.174t.vect:/home/jh/trunk-install/include/c++/14.0.0/cmath:353:27: missed: 
 not vectorized: relevant stmt not supported: _109 = __builtin_log (_110);
c.ii.174t.vect:c.C:13:27: missed:  bad operation or unsupported loop bound.
c.ii.174t.vect:c.C:18:26: missed:   versioning for alias required: can't
determine dependence between *_123 and *_106
c.ii.174t.vect:c.C:18:26: missed:   versioning for alias required: can't
determine dependence between *_121 and *_106
c.ii.174t.vect:c.C:18:34: missed:   versioning for alias required: can't
determine dependence between *_115 and *_106
c.ii.174t.vect:c.C:13:27: missed:   Unknown misalignment, naturally aligned
c.ii.174t.vect:c.C:13:27: missed:   Unknown misalignment, naturally aligned
c.ii.174t.vect:c.C:13:27: missed:   Unknown misalignment, naturally aligned
c.ii.174t.vect:c.C:13:27: missed:   Unknown misalignment, naturally aligned
c.ii.174t.vect:c.C:13:27: missed:   conversion not supported by target.
c.ii.174t.vect:c.C:13:27: missed:   no optab.
c.ii.174t.vect:c.C:13:27: missed:   function is not vectorizable.
c.ii.174t.vect:/home/jh/trunk-install/include/c++/14.0.0/cmath:353:27: missed: 
 not vectorized: relevant stmt not supported: _109 = __builtin_log (_110);
c.ii.174t.vect:c.C:13:27: missed:  bad operation or unsupported loop bound.
c.ii.174t.vect:c.C:18:26: missed:   versioning for alias required: can't
determine dependence between *_123 and *_106
c.ii.174t.vect:c.C:18:26: missed:   versioning for alias required: can't
determine dependence between *_121 and *_106
c.ii.174t.vect:c.C:18:34: missed:   versioning for alias required: can't
determine dependence between *_115 and *_106
c.ii.174t.vect:c.C:18:34: missed:   not vectorized: unsupported data-type
double
c.ii.174t.vect:c.C:13:27: missed:  can't determine vectorization factor.
c.ii.174t.vect:c.C:13:27: missed:   not vectorized: no vectype for stmt: _122 =
*_123;
c.ii.174t.vect:c.C:18:26: missed:   not vectorized: no vectype for stmt: _122 =
*_123;
c.ii.174t.vect:c.C:13:27: missed:  bad data references.
c.ii.174t.vect:c.C:13:27: missed: couldn't vectorize loop

At loop vectorization time the counter goes backwards:
  <bb 36> [local count: 679982665]:
  # i_127 = PHI <1(7), i_17(40)>
  # ivtmp_31 = PHI <99999999(7), ivtmp_28(40)>
  _125 = (long unsigned int) i_127;
  _124 = _125 * 4;
  _123 = _44 + _124;
  _122 = *_123;
  _121 = _41 + _124;
  _120 = *_121;
  _119 = _122 + _120;
  _118 = i_127 + 4294967295;
  _117 = (long unsigned int) _118;
  _116 = _117 * 4;
  _115 = _41 + _116;
  _114 = *_115;
  _113 = _119 * _114;
  _112 = (double) _113; 
  _111 = (signed int) i_127;
  _110 = (double) _111; 
  _109 = __builtin_log (_110);
  _108 = _112 * _109;
  _107 = (float) _108;
  _106 = _35 + _124;
  *_106 = _107;
  i_17 = i_127 + 1;
  ivtmp_28 = ivtmp_31 - 1;
  if (ivtmp_28 != 0)
    goto <bb 40>; [98.42%]
  else
    goto <bb 12>; [1.58%]

  <bb 40> [local count: 669250617]:
  goto <bb 36>; [100.00%]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss
       [not found] <bug-77689-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2023-07-28 13:11 ` hubicka at gcc dot gnu.org
@ 2023-07-28 14:19 ` cvs-commit at gcc dot gnu.org
  5 siblings, 0 replies; 6+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-28 14:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77689

--- Comment #18 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:f5fb9ff2396fd41fdd2e6d35a412e936d2d42f75

commit r14-2852-gf5fb9ff2396fd41fdd2e6d35a412e936d2d42f75
Author: Jan Hubicka <jh@suse.cz>
Date:   Fri Jul 28 16:18:32 2023 +0200

    loop-split improvements, part 3

    extend tree-ssa-loop-split to understand test of the form
     if (i==0)
    and
     if (i!=0)
    which triggers only during the first iteration.  Naturally we should
    also be able to trigger last iteration or split into 3 cases if
    the test indeed can fire in the middle of the loop.

    Last iteration is bit trickier pattern matching so I want to do it
    incrementally, but I implemented easy case using value range that handled
    loops with constant iterations.

    The testcase gets misupdated profile, I will also fix that incrementally.

    gcc/ChangeLog:

            PR middle-end/77689
            * tree-ssa-loop-split.cc: Include value-query.h.
            (split_at_bb_p): Analyze cases where EQ/NE can be turned
            into LT/LE/GT/GE; return updated guard code.
            (split_loop): Use guard code.

    gcc/testsuite/ChangeLog:

            PR middle-end/77689
            * g++.dg/tree-ssa/loop-split-1.C: New test.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-07-28 14:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-77689-4@http.gcc.gnu.org/bugzilla/>
2021-08-25  6:07 ` [Bug tree-optimization/77689] Missing vectorization lead to huge performance loss pinskia at gcc dot gnu.org
2021-08-25  7:38 ` crazylht at gmail dot com
2023-07-28  8:12 ` hubicka at gcc dot gnu.org
2023-07-28 11:53 ` hubicka at gcc dot gnu.org
2023-07-28 13:11 ` hubicka at gcc dot gnu.org
2023-07-28 14:19 ` cvs-commit at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).