public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization
@ 2024-05-16 16:03 acoplan at gcc dot gnu.org
  2024-05-17  6:45 ` [Bug tree-optimization/115120] " rguenth at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: acoplan at gcc dot gnu.org @ 2024-05-16 16:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

            Bug ID: 115120
           Summary: Bad interaction between ivcanon and early break
                    vectorization
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

Consider the following testcase on aarch64:

int arr[1024];
int *f()
{
    int i;
    for (i = 0; i < 1024; i++)
      if (arr[i] == 42)
        break;
    return arr + i;
}

compiled with -O3 we get the following vector loop body:

.L2:
        cmp     x2, x1
        beq     .L9
.L6:
        ldr     q31, [x1]
        add     x1, x1, 16
        mov     v27.16b, v29.16b
        mov     v28.16b, v30.16b
        cmeq    v31.4s, v31.4s, v26.4s
        add     v29.4s, v29.4s, v24.4s
        add     v30.4s, v30.4s, v25.4s
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x3, d31
        cbz     x3, .L2

it's somewhat surprising that there are two vector adds, looking at the
optimized dump:

<bb 3> [local count: 1063004408]:
  # vect_vec_iv_.6_28 = PHI <_29(10), { 0, 1, 2, 3 }(2)>
  # vect_vec_iv_.7_33 = PHI <_34(10), { 1024, 1023, 1022, 1021 }(2)>
  # ivtmp.18_19 = PHI <ivtmp.18_20(10), ivtmp.18_26(2)>
  _34 = vect_vec_iv_.7_33 + { 4294967292, 4294967292, 4294967292, 4294967292 };
  _29 = vect_vec_iv_.6_28 + { 4, 4, 4, 4 };
  _25 = (void *) ivtmp.18_19;
  vect__1.10_39 = MEM <vector(4) int> [(int *)_25];
  mask_patt_9.11_41 = vect__1.10_39 == { 42, 42, 42, 42 };
  if (mask_patt_9.11_41 != { 0, 0, 0, 0 })
    goto <bb 4>; [5.50%]
  else
    goto <bb 10>; [94.50%]

we can see that there are two IV updates that got vectorized.  It turns out
that
one of these comes from the ivcanon pass.  If I add -fno-tree-loop-ivcanon we
instead get the following vector loop body:

.L2:
        cmp     x2, x1
        beq     .L9
.L6:
        ldr     q31, [x1]
        add     x1, x1, 16
        mov     v29.16b, v30.16b
        add     v30.4s, v30.4s, v27.4s
        cmeq    v31.4s, v31.4s, v28.4s
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x3, d31
        cbz     x3, .L2

which is much cleaner.  Looking at the tree dumps, the ivcanon pass makes the
following transformation:

--- cddce2.tree 2024-05-16 13:49:10.426703350 +0000
+++ ivcanon.tree        2024-05-16 13:49:17.678874925 +0000
@@ -4,6 +4,8 @@
   int i;
   int _1;
   int * _8;
+  unsigned int ivtmp_11;
+  unsigned int ivtmp_12;
   long unsigned int _13;
   long unsigned int _15;
   long unsigned int prephitmp_16;
@@ -12,6 +14,7 @@

   <bb 3> [local count: 1063004408]:
   # i_10 = PHI <i_7(7), 0(2)>
+  # ivtmp_12 = PHI <ivtmp_11(7), 1024(2)>
   _1 = arr[i_10];
   if (_1 == 42)
     goto <bb 5>; [5.50%]
@@ -20,7 +23,8 @@

   <bb 4> [local count: 1004539166]:
   i_7 = i_10 + 1;
-  if (i_7 != 1024)
+  ivtmp_11 = ivtmp_12 - 1;
+  if (ivtmp_11 != 0)
     goto <bb 7>; [98.93%]
   else
     goto <bb 8>; [1.07%]

i.e. it introduces the backwards-counting IV.  It seems in the general case
without vectorization ivopts then cleans this up and ensures we only have a
single IV.

In the vectorized case it seems this problem only shows up with early break
vectorization. Looking at a simple reduction, such as:

int a[1024];
int g()
{
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];
    return sum;
}

although we still have the backwards-counting IV in ifcvt:

  <bb 3> [local count: 1063004408]:
  # sum_9 = PHI <sum_5(5), 0(2)>
  # i_11 = PHI <i_6(5), 0(2)>
  # ivtmp_8 = PHI <ivtmp_7(5), 1024(2)>
  _1 = a[i_11];
  sum_5 = _1 + sum_9;
  i_6 = i_11 + 1;
  ivtmp_7 = ivtmp_8 - 1;
  if (ivtmp_7 != 0)
    goto <bb 5>; [98.99%]
  else
    goto <bb 4>; [1.01%]

we end up with only scalar IVs after vectorization, and the backwards scalar IV
ends up getting deleted by dce6:

Deleting : ivtmp_7 = ivtmp_8 - 1;

I'm not sure what the right solution is but we should avoid having duplicated
IVs with early break vectorization.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization
  2024-05-16 16:03 [Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization acoplan at gcc dot gnu.org
@ 2024-05-17  6:45 ` rguenth at gcc dot gnu.org
  2024-05-17  6:46 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-17  6:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-05-17

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note this isn't really because of IVCANON but because the IV is live.  IVCANON
adds a downward counting IV historically to enable RTL doloop transforms.

One could argue this should nowadays be IVOPTs job (in the past IVOPTs was
done on RTL).

So what's really the issue is that IVOPTs does a bad job (I don't really
think it even tries) to replace one vector IV with another.

We could try to change IVCANON to avoid creating a canonical downward counting
IV if one already exists (I don't think it even avoids that case) and also
avoid creating a downward counting IV with step -1 when a upward counting
IV with step 1 already controls the exit and verify IVOPTs turns it into
a downward counting one when profitable.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization
  2024-05-16 16:03 [Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization acoplan at gcc dot gnu.org
  2024-05-17  6:45 ` [Bug tree-optimization/115120] " rguenth at gcc dot gnu.org
@ 2024-05-17  6:46 ` rguenth at gcc dot gnu.org
  2024-05-17  7:47 ` tnfchris at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-17  6:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
That said, looking into IVOPTs and making it detect vector IVs would be nice
as well.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization
  2024-05-16 16:03 [Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization acoplan at gcc dot gnu.org
  2024-05-17  6:45 ` [Bug tree-optimization/115120] " rguenth at gcc dot gnu.org
  2024-05-17  6:46 ` rguenth at gcc dot gnu.org
@ 2024-05-17  7:47 ` tnfchris at gcc dot gnu.org
  2024-06-24 18:09 ` tnfchris at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-05-17  7:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
That makes sense, though I also wonder how it works for scalar multi exit
loops, IVops has various checks on single exits.

I guess one problem is that the code in IVops that does this uses the exit to
determine niters.

But in the case of the multiple exits vector code the vectorizer could have
picked a different exit.

So I guess the question is how do we even tell which one is used or could the
transformation be driven from the PHI nodes themselves instead of an exit.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization
  2024-05-16 16:03 [Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization acoplan at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-05-17  7:47 ` tnfchris at gcc dot gnu.org
@ 2024-06-24 18:09 ` tnfchris at gcc dot gnu.org
  2024-06-25  7:03 ` tnfchris at gcc dot gnu.org
  2024-06-25 12:46 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-06-24 18:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
You asked why this doesn't happen with a normal vector loop Richi.

For a normal loop when IVcannon adds the downward counting loop there are two
main differences.

1. for a single exit loop, the downward IV is the main IV. which we ignore as
the vectorizer replaces the loop exit condition with a bound iteration check.

2.  when we peel, the main loop has a known iteration count.  So the starting
downward IV for the scalar loop is a known constant.  That means we statically
compute the start of the IV.  As such there's no data-flow from for this
downwards counting IV from the main loop into the scalar loop.

i.e. in this loop:

  <bb 3> [local count: 1063004408]:
  # i_8 = PHI <i_5(5), 0(2)>
  # ivtmp_2 = PHI <ivtmp_1(5), 1087(2)>
  res[i_8] = i_8;
  i_5 = i_8 + 1;
  ivtmp_1 = ivtmp_2 - 1;
  if (ivtmp_1 != 0)
    goto <bb 5>; [98.99%]
  else
    goto <bb 4>; [1.01%]

  <bb 5> [local count: 1052266995]:
  goto <bb 3>; [100.00%]

when we vectorize the final loop looks like:

  <bb 3> [local count: 1063004408]:
  # i_8 = PHI <i_5(5), 0(2)>
  # ivtmp_2 = PHI <ivtmp_1(5), 1087(2)>
  # vect_vec_iv_.6_19 = PHI <_20(5), { 0, 1, 2, 3 }(2)>
  # vectp_res.7_21 = PHI <vectp_res.7_22(5), &res(2)>
  # ivtmp_24 = PHI <ivtmp_25(5), 0(2)>
  _20 = vect_vec_iv_.6_19 + { 4, 4, 4, 4 };
  MEM <vector(4) int> [(int *)vectp_res.7_21] = vect_vec_iv_.6_19;
  i_5 = i_8 + 1;
  ivtmp_1 = ivtmp_2 - 1;
  vectp_res.7_22 = vectp_res.7_21 + 16;
  ivtmp_25 = ivtmp_24 + 1;
  if (ivtmp_25 < 271)
    goto <bb 5>; [98.99%]
  else
    goto <bb 7>; [1.01%]

  <bb 5> [local count: 1052266995]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 10737416]:
  # i_16 = PHI <i_5(3)>
  # ivtmp_17 = PHI <ivtmp_1(3)>

  <bb 8> [local count: 32212248]:
  # i_7 = PHI <i_13(9), 1084(7)>
  # ivtmp_11 = PHI <ivtmp_14(9), 3(7)>
  res[i_7] = i_7;
  i_13 = i_7 + 1;
  ivtmp_14 = ivtmp_11 - 1;
  if (ivtmp_14 != 0)
    goto <bb 9>; [66.67%]
  else
    goto <bb 4>; [33.33%]

  <bb 9> [local count: 21474835]:
  goto <bb 8>; [100.00%]

for a vector code neither assumption are no longer true.

1.  The vectorizer may pick another exit other than the downwards counting IV.
In particular if the early exit has a known iteration count lower than the main
exit.

2.  Because we don't know which exit the loop takes, we can't tell how many
iteration you have to do at a minimum for the scalar loop.  We only know the
maximum.  As such the loop reduction into the second loop is:

  <bb 15> [local count: 58465242]:
  # vect_vec_iv_.6_30 = PHI <vect_vec_iv_.6_28(3)>
  # vect_vec_iv_.7_35 = PHI <vect_vec_iv_.7_33(3)>
  _36 = BIT_FIELD_REF <vect_vec_iv_.7_35, 32, 0>;
  ivtmp_26 = _36;
  _31 = BIT_FIELD_REF <vect_vec_iv_.6_30, 32, 0>;
  i_25 = _31;
  goto <bb 11>; [100.00%]

  <bb 11> [local count: 214528238]:
  # i_3 = PHI <i_19(12), i_25(15)>
  # ivtmp_17 = PHI <ivtmp_20(12), ivtmp_26(15)>

Since we don't know the iteration count we require both IVs to be live.  the
downcounting IV is live because the scalar loop needs a starting point, and the
incrementing IV is live due to addressing mode usages.

This means neither can be removed.

In the single exit case, the downward IV is only used for loop control:

  <bb 8> [local count: 32212248]:
  # i_7 = PHI <i_13(9), 1084(7)>
  # ivtmp_11 = PHI <ivtmp_14(9), 3(7)>
  res[i_7] = i_7;
  i_13 = i_7 + 1;
  ivtmp_14 = ivtmp_11 - 1;
  if (ivtmp_14 != 0)
    goto <bb 9>; [66.67%]
  else
    goto <bb 4>; [33.33%]

  <bb 9> [local count: 21474835]:
  goto <bb 8>; [100.00%]

and so IVopts rewrites the addressing mode usages of `i` into

  <bb 8> [local count: 32212248]:
  # ivtmp.12_2 = PHI <ivtmp.12_8(9), 1084(7)>
  _5 = (unsigned int) ivtmp.12_2;
  i_7 = (int) _5;
  MEM[(int *)&res + ivtmp.12_2 * 4] = i_7;
  ivtmp.12_8 = ivtmp.12_2 + 1;
  if (ivtmp.12_8 != 1087)
    goto <bb 9>; [66.67%]
  else
    goto <bb 4>; [33.33%]

  <bb 9> [local count: 21474835]:
  goto <bb 8>; [100.00%]

and rewrites the loop back into an incrementing loop.  This also happens for
the early exit loop, that's why the scalar code doesn't have the double IVs.

But vector loop we have this issue due to needing the second IV live.

We might be able to rewrite the vector IVs as you say in IVopts,  however not
only does IVopts not rewrite vector IVs, it also doesn't rewrite multiple exit
loops in general. 

It has two checks:

  /* Make sure that the loop iterates till the loop bound is hit, as otherwise
     the calculation of the BOUND could overflow, making the comparison
     invalid.  */
  if (!data->loop_single_exit_p)
    return false;

and seems to lose a lot of information when niter_for_single_dom_exit (..) is
null, it seems that in order for this to work correctly IVopts needs to know
which exit we've chosen in the vectorizer. i.e. I think it would have issued
with a PEELED loop.

We also have the problem where both IVs are required:

int arr[1024];
int f()
{
    int i;
    for (i = 0; i < 1024; i++)
      if (arr[i] == 42)
        return i;
    return *(arr + i);
}

but with the downward counting IV enabled, we get a much more complicated
latch.

> Note this isn't really because of IVCANON but because the IV is live.  IVCANON adds a downward counting IV historically to enable RTL doloop transforms.

IVopts currently has:

  /* Similar to doloop_optimize, check iteration description to know it's
     suitable or not.  Keep it as simple as possible, feel free to extend it
     if you find any multiple exits cases matter.  */
  edge exit = single_dom_exit (loop);
  class tree_niter_desc *niter_desc;
  if (!exit || !(niter_desc = niter_for_exit (data, exit)))
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "Predict doloop failure due to"
                            " unexpected niters.\n");
      return false;
    }

in generic_predict_doloop_p so it looks like it just punts on doloop for loops
with multiple exits.

but indeed from a scan of the code I don't see a similar restriction in RTL
doloop.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization
  2024-05-16 16:03 [Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization acoplan at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-06-24 18:09 ` tnfchris at gcc dot gnu.org
@ 2024-06-25  7:03 ` tnfchris at gcc dot gnu.org
  2024-06-25 12:46 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-06-25  7:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
considering ivopts bails out on doloop prediction for multiple exits anyway,
what do you think about:

diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index 5ef24a91917..d1b25ad7dea 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -1319,7 +1319,8 @@ canonicalize_loop_induction_variables (class loop *loop,

   if (create_iv
       && niter && !chrec_contains_undetermined (niter)
-      && exit && just_once_each_iteration_p (loop, exit->src))
+      && exit && just_once_each_iteration_p (loop, exit->src)
+      && (single_dom_exit (loop) || targetm.predict_doloop_p (loop)))
     {
       tree iv_niter = niter;
       if (may_be_zero)

richi?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization
  2024-05-16 16:03 [Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization acoplan at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-06-25  7:03 ` tnfchris at gcc dot gnu.org
@ 2024-06-25 12:46 ` rguenth at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-06-25 12:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #5)
> considering ivopts bails out on doloop prediction for multiple exits anyway,
> what do you think about:
> 
> diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
> index 5ef24a91917..d1b25ad7dea 100644
> --- a/gcc/tree-ssa-loop-ivcanon.cc
> +++ b/gcc/tree-ssa-loop-ivcanon.cc
> @@ -1319,7 +1319,8 @@ canonicalize_loop_induction_variables (class loop
> *loop,
> 
>    if (create_iv
>        && niter && !chrec_contains_undetermined (niter)
> -      && exit && just_once_each_iteration_p (loop, exit->src))
> +      && exit && just_once_each_iteration_p (loop, exit->src)
> +      && (single_dom_exit (loop) || targetm.predict_doloop_p (loop)))
>      {
>        tree iv_niter = niter;
>        if (may_be_zero)
> 
> richi?

I think while IVOPTs might not care, loop-doloop.cc handles multiple exists
just fine I think.

What about moving iv_canon towards iv_optimize?  Ideally it would be
integrated with IVOPTs itself, but that can be done later.  In particular
I wonder which other passes might depend on iv_canon being run early
(yeah, it also elides single-iteration loops, so maybe you need to "split"
it)

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-06-25 12:46 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-16 16:03 [Bug tree-optimization/115120] New: Bad interaction between ivcanon and early break vectorization acoplan at gcc dot gnu.org
2024-05-17  6:45 ` [Bug tree-optimization/115120] " rguenth at gcc dot gnu.org
2024-05-17  6:46 ` rguenth at gcc dot gnu.org
2024-05-17  7:47 ` tnfchris at gcc dot gnu.org
2024-06-24 18:09 ` tnfchris at gcc dot gnu.org
2024-06-25  7:03 ` tnfchris at gcc dot gnu.org
2024-06-25 12:46 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).