From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 251F73858D35; Mon, 24 Jun 2024 18:09:26 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 251F73858D35
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1719252566;
	bh=ETF8ySV1Yq6N381KtH0R5kA2AEfEsV2ArI76pvPJjIw=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=S6C/phV7sIhiAQ5KHY1pvAVrqTEmhwvI58yUdurhuGkZWg1APpUTs4SEYi4USb0vp
	 NeeomMo5iagWG9F8N2dYjxQvwfRISt5vFaBLw5l3Jon34QApDczHwDsQk8ZKJ23mNq
	 38K4mgBy9etuu5C2pS4O9TeLV80x+ouR6XD/MwRI=
From: "tnfchris at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/115120] Bad interaction between ivcanon and
 early break vectorization
Date: Mon, 24 Jun 2024 18:09:24 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tnfchris at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-115120-4-RS8cl2C4Lb@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-115120-4@http.gcc.gnu.org/bugzilla/>
References: <bug-115120-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D115120
--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
You asked why this doesn't happen with a normal vector loop Richi.

For a normal loop when IVcannon adds the downward counting loop there are t=
wo
main differences.

1. for a single exit loop, the downward IV is the main IV. which we ignore =
as
the vectorizer replaces the loop exit condition with a bound iteration chec=
k.

2.  when we peel, the main loop has a known iteration count.  So the starti=
ng
downward IV for the scalar loop is a known constant.  That means we statica=
lly
compute the start of the IV.  As such there's no data-flow from for this
downwards counting IV from the main loop into the scalar loop.

i.e. in this loop:

  <bb 3> [local count: 1063004408]:
  # i_8 =3D PHI <i_5(5), 0(2)>
  # ivtmp_2 =3D PHI <ivtmp_1(5), 1087(2)>
  res[i_8] =3D i_8;
  i_5 =3D i_8 + 1;
  ivtmp_1 =3D ivtmp_2 - 1;
  if (ivtmp_1 !=3D 0)
    goto <bb 5>; [98.99%]
  else
    goto <bb 4>; [1.01%]

  <bb 5> [local count: 1052266995]:
  goto <bb 3>; [100.00%]

when we vectorize the final loop looks like:

  <bb 3> [local count: 1063004408]:
  # i_8 =3D PHI <i_5(5), 0(2)>
  # ivtmp_2 =3D PHI <ivtmp_1(5), 1087(2)>
  # vect_vec_iv_.6_19 =3D PHI <_20(5), { 0, 1, 2, 3 }(2)>
  # vectp_res.7_21 =3D PHI <vectp_res.7_22(5), &res(2)>
  # ivtmp_24 =3D PHI <ivtmp_25(5), 0(2)>
  _20 =3D vect_vec_iv_.6_19 + { 4, 4, 4, 4 };
  MEM <vector(4) int> [(int *)vectp_res.7_21] =3D vect_vec_iv_.6_19;
  i_5 =3D i_8 + 1;
  ivtmp_1 =3D ivtmp_2 - 1;
  vectp_res.7_22 =3D vectp_res.7_21 + 16;
  ivtmp_25 =3D ivtmp_24 + 1;
  if (ivtmp_25 < 271)
    goto <bb 5>; [98.99%]
  else
    goto <bb 7>; [1.01%]

  <bb 5> [local count: 1052266995]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 10737416]:
  # i_16 =3D PHI <i_5(3)>
  # ivtmp_17 =3D PHI <ivtmp_1(3)>

  <bb 8> [local count: 32212248]:
  # i_7 =3D PHI <i_13(9), 1084(7)>
  # ivtmp_11 =3D PHI <ivtmp_14(9), 3(7)>
  res[i_7] =3D i_7;
  i_13 =3D i_7 + 1;
  ivtmp_14 =3D ivtmp_11 - 1;
  if (ivtmp_14 !=3D 0)
    goto <bb 9>; [66.67%]
  else
    goto <bb 4>; [33.33%]

  <bb 9> [local count: 21474835]:
  goto <bb 8>; [100.00%]

for a vector code neither assumption are no longer true.

1.  The vectorizer may pick another exit other than the downwards counting =
IV.
In particular if the early exit has a known iteration count lower than the =
main
exit.

2.  Because we don't know which exit the loop takes, we can't tell how many
iteration you have to do at a minimum for the scalar loop.  We only know the
maximum.  As such the loop reduction into the second loop is:

  <bb 15> [local count: 58465242]:
  # vect_vec_iv_.6_30 =3D PHI <vect_vec_iv_.6_28(3)>
  # vect_vec_iv_.7_35 =3D PHI <vect_vec_iv_.7_33(3)>
  _36 =3D BIT_FIELD_REF <vect_vec_iv_.7_35, 32, 0>;
  ivtmp_26 =3D _36;
  _31 =3D BIT_FIELD_REF <vect_vec_iv_.6_30, 32, 0>;
  i_25 =3D _31;
  goto <bb 11>; [100.00%]

  <bb 11> [local count: 214528238]:
  # i_3 =3D PHI <i_19(12), i_25(15)>
  # ivtmp_17 =3D PHI <ivtmp_20(12), ivtmp_26(15)>

Since we don't know the iteration count we require both IVs to be live.  the
downcounting IV is live because the scalar loop needs a starting point, and=
 the
incrementing IV is live due to addressing mode usages.

This means neither can be removed.

In the single exit case, the downward IV is only used for loop control:

  <bb 8> [local count: 32212248]:
  # i_7 =3D PHI <i_13(9), 1084(7)>
  # ivtmp_11 =3D PHI <ivtmp_14(9), 3(7)>
  res[i_7] =3D i_7;
  i_13 =3D i_7 + 1;
  ivtmp_14 =3D ivtmp_11 - 1;
  if (ivtmp_14 !=3D 0)
    goto <bb 9>; [66.67%]
  else
    goto <bb 4>; [33.33%]

  <bb 9> [local count: 21474835]:
  goto <bb 8>; [100.00%]

and so IVopts rewrites the addressing mode usages of `i` into

  <bb 8> [local count: 32212248]:
  # ivtmp.12_2 =3D PHI <ivtmp.12_8(9), 1084(7)>
  _5 =3D (unsigned int) ivtmp.12_2;
  i_7 =3D (int) _5;
  MEM[(int *)&res + ivtmp.12_2 * 4] =3D i_7;
  ivtmp.12_8 =3D ivtmp.12_2 + 1;
  if (ivtmp.12_8 !=3D 1087)
    goto <bb 9>; [66.67%]
  else
    goto <bb 4>; [33.33%]

  <bb 9> [local count: 21474835]:
  goto <bb 8>; [100.00%]

and rewrites the loop back into an incrementing loop.  This also happens for
the early exit loop, that's why the scalar code doesn't have the double IVs.

But vector loop we have this issue due to needing the second IV live.

We might be able to rewrite the vector IVs as you say in IVopts,  however n=
ot
only does IVopts not rewrite vector IVs, it also doesn't rewrite multiple e=
xit
loops in general.=20

It has two checks:

  /* Make sure that the loop iterates till the loop bound is hit, as otherw=
ise
     the calculation of the BOUND could overflow, making the comparison
     invalid.  */
  if (!data->loop_single_exit_p)
    return false;

and seems to lose a lot of information when niter_for_single_dom_exit (..) =
is
null, it seems that in order for this to work correctly IVopts needs to know
which exit we've chosen in the vectorizer. i.e. I think it would have issued
with a PEELED loop.

We also have the problem where both IVs are required:

int arr[1024];
int f()
{
    int i;
    for (i =3D 0; i < 1024; i++)
      if (arr[i] =3D=3D 42)
        return i;
    return *(arr + i);
}

but with the downward counting IV enabled, we get a much more complicated
latch.

> Note this isn't really because of IVCANON but because the IV is live.  IV=
CANON adds a downward counting IV historically to enable RTL doloop transfo=
rms.

IVopts currently has:

  /* Similar to doloop_optimize, check iteration description to know it's
     suitable or not.  Keep it as simple as possible, feel free to extend it
     if you find any multiple exits cases matter.  */
  edge exit =3D single_dom_exit (loop);
  class tree_niter_desc *niter_desc;
  if (!exit || !(niter_desc =3D niter_for_exit (data, exit)))
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "Predict doloop failure due to"
                            " unexpected niters.\n");
      return false;
    }

in generic_predict_doloop_p so it looks like it just punts on doloop for lo=
ops
with multiple exits.

but indeed from a scan of the code I don't see a similar restriction in RTL
doloop.=