From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 4151E3858C98; Mon, 15 Apr 2024 11:07:22 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 4151E3858C98
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1713179243;
	bh=O3hDYjBqKrq5A/Hm0utp8gPaheaFphm7/YdAkAJCUmM=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=mFNfcnXGrNUQBNcX1fF25y/nheXA/5ANzG4+pN9nT2jGaHctAg85CqydA0nXlS6mp
	 j1XgiixqPSihAM1yHt13xsxtErYrF5Y4DPoQIpc/Rci0cnXehOHF99UfWDJ6oVo13i
	 y1l3IMUgaeX3mvY0+2xoBVMdn8EapDLvYqp8b3SA=
From: "cvs-commit at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/114403] [14 regression] LLVM miscompiled with
 -O3 -march=znver2 -fno-vect-cost-model since r14-6822-g01f4251b8775c8
Date: Mon, 15 Apr 2024 11:07:17 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: wrong-code
X-Bugzilla-Severity: normal
X-Bugzilla-Who: cvs-commit at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P1
X-Bugzilla-Assigned-To: tnfchris at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-114403-4-vVHmXBr6hk@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-114403-4@http.gcc.gnu.org/bugzilla/>
References: <bug-114403-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114403
--- Comment #28 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:85002f8085c25bb3e74ab013581a74e7c7ae006b

commit r14-9969-g85002f8085c25bb3e74ab013581a74e7c7ae006b
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Mon Apr 15 12:06:21 2024 +0100

    middle-end: adjust loop upper bounds when peeling for gaps and early break
    [PR114403].

    This fixes a bug with the interaction between peeling for gaps and early
    break.

    Before I go further, I'll first explain how I understand this to work for
    loops with a single exit.

    When peeling for gaps we peel N < VF iterations to scalar.
    This happens by removing N iterations from the calculation of niters such
    that vect_iters * VF == niters is always false.

    In other words, when we exit the vector loop we always fall to the scalar
    loop.  The loop bounds adjustment guarantees this.  Because of this we
    potentially execute one vector loop iteration less.  That is, at the
    boundary condition where niters % VF == 0, peeling one or more scalar
    iterations makes the vector loop execute one iteration less.
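
    The effect of withholding iterations from niters can be sketched as follows.
    This is only an illustration of the arithmetic described above, not GCC
    code; the function names and parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: full vector iterations executed once `gap` scalar
   iterations (0 < gap < vf) are withheld from niters for gap peeling.  */
unsigned vector_iters_with_gap_peel (unsigned niters, unsigned vf, unsigned gap)
{
  assert (vf > 0 && gap > 0 && gap < vf && niters >= gap);
  return (niters - gap) / vf;
}

/* Scalar iterations left over for the epilogue after the vector loop.
   Because gap >= 1, this is never zero, so vect_iters * VF == niters
   never holds and the scalar loop is always entered.  */
unsigned scalar_tail_iters (unsigned niters, unsigned vf, unsigned gap)
{
  return niters - vector_iters_with_gap_peel (niters, vf, gap) * vf;
}
```

    With niters = 16, VF = 4 and one peeled iteration, the vector loop runs 3
    times instead of 4 and the remaining 4 iterations run as scalar.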

    This is accounted for by the adjustments in vect_transform_loop.  This
    adjustment happens differently based on whether the vector loop can be
    partial or not:

    Peeling for gaps sets the bias to 0 and then:

    when not partial:  we take the floor of (scalar_upper_bound / VF) - 1 to get
                       the vector latch iteration count.

    when loop is partial:  For a single exit this means the loop is masked, we
                           take the ceil to account for the fact that the loop
                           can handle the final partial iteration using masking.

    Note that there's no difference between ceil and floor on the boundary
    condition.  There is a difference however when you're slightly above it,
    i.e. if scalar iterates 14 times and VF = 4 and we peel 1 iteration for
    gaps.

    The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations, and in
    effect the partial iteration is ignored and it's done as scalar.

    This is fine because the niters modification has capped the vector iteration
    at 2.  So that when we reduce the induction values you end up entering the
    scalar code with ind_var.2 = ind_var.1 + 2 * VF.
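
    The floor and ceil variants of the bound can be compared with a small
    sketch.  The helper names are hypothetical and the formulas only mirror the
    description above; they are not copied from the vectorizer.

```c
#include <assert.h>

/* Latch-count bound when the final vector iteration cannot be partial:
   floor ((scalar_bound + bias) / vf) - 1.  */
unsigned latch_bound_floor (unsigned scalar_bound, unsigned bias, unsigned vf)
{
  assert (vf > 0 && scalar_bound + bias >= vf);
  return (scalar_bound + bias) / vf - 1;
}

/* Latch-count bound when the final vector iteration may be partial
   (e.g. masked): ceil ((scalar_bound + bias) / vf) - 1.  */
unsigned latch_bound_ceil (unsigned scalar_bound, unsigned bias, unsigned vf)
{
  assert (vf > 0 && scalar_bound + bias >= 1);
  return (scalar_bound + bias + vf - 1) / vf - 1;
}
```

    For 13 remaining iterations, bias 0 and VF = 4 the floor form gives 2 and
    the ceil form gives 3; for exactly 12 iterations both give 2, which is the
    boundary condition on which the two agree.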

    Now let's look at early breaks.  To make it easier I'll focus on the
    specific testcase:

    char buffer[64];

    __attribute__ ((noipa))
    buff_t *copy (buff_t *first, buff_t *last)
    {
      char *buffer_ptr = buffer;
      char *const buffer_end = &buffer[SZ-1];
      int store_size = sizeof(first->Val);
      while (first != last && (buffer_ptr + store_size) <= buffer_end)
        {
          const char *value_data = (const char *)(&first->Val);
          __builtin_memcpy(buffer_ptr, value_data, store_size);
          buffer_ptr += store_size;
          ++first;
        }

      if (first == last)
        return 0;

      return first;
    }

    Here the first, early exit is on the condition:

      (buffer_ptr + store_size) <= buffer_end

    and the main exit is on condition:

      first != last

    This is important, as this bug only manifests itself when the first exit has
    a known constant iteration count that's lower than the latch exit count.

    Because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing
    16 bytes per iteration.  So the exit has a known bound of 8 + 1.

    The vectorizer correctly analyzes this:

    Statement (exit)if (ivtmp_21 != 0)
     is executed at most 8 (bounded by 8) + 1 times in loop 1.

    and as a consequence the IV is bound by 9:

      # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)>
      ...
      vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615 };
      mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 };
      if (mask_patt_22.17_126 == { -1, -1, -1, -1 })
        goto <bb 3>; [88.89%]
      else
        goto <bb 30>; [11.11%]
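
    The behaviour of this vector IV can be modelled in scalar C.  This is a
    rough sketch, not the generated code: the latch update is elided in the
    dump above, so the step of VF per iteration used here is an assumption, and
    the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VF 4

/* Counts how often the latch (back edge) is taken before some lane of the
   decremented IV becomes zero and the early exit is taken.
   Assumption: the elided latch update steps every lane down by VF.
   The loop only terminates if some lane is congruent to 1 modulo VF,
   which holds for VF consecutive start values such as { 9, 8, 7, 6 }.  */
unsigned latch_iters_before_early_exit (const uint64_t init[VF])
{
  uint64_t iv[VF];
  for (int i = 0; i < VF; i++)
    iv[i] = init[i];

  for (unsigned latch = 0;; latch++)
    {
      int all_nonzero = 1;
      for (int i = 0; i < VF; i++)
        {
          /* Adding 18446744073709551615 (2^64 - 1) is a decrement by one.  */
          uint64_t tmp = iv[i] + UINT64_MAX;
          if (tmp == 0)
            all_nonzero = 0;
        }
      if (!all_nonzero)
        return latch;           /* early exit taken */
      for (int i = 0; i < VF; i++)
        iv[i] -= VF;            /* assumed latch step */
    }
}
```

    Under that assumption, starting from { 9, 8, 7, 6 } the latch is taken
    twice and the early exit fires in the third vector iteration.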

    The important bits are these:

    In this example the value of last - first = 416.

    The calculated vector iteration count is:

        x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27

    The bounds generated, adjusting for gaps:

       x == (((x - 1) >> 2) << 2)

    which means we'll always fall through to the scalar code, as intended.
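
    The gap-adjusted comparison can be checked directly.  A minimal sketch with
    hypothetical function names, using the shift form quoted above:

```c
#include <assert.h>

/* Gap-adjusted bound from the text: (x - 1) rounded down to a multiple
   of 4.  */
unsigned long gap_adjusted_bound (unsigned long x)
{
  assert (x >= 1);
  return ((x - 1) >> 2) << 2;
}

/* Nonzero if the vector loop would cover every iteration, so that the
   scalar loop could be skipped.  Since the bound is at most x - 1, this
   is zero for every x >= 1.  */
int skips_scalar_loop (unsigned long x)
{
  return x == gap_adjusted_bound (x);
}
```

    For x = 27 the bound is 24, so the comparison fails and the scalar loop is
    entered.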

    Here are two key things to note:

    1. In this loop, the early exit will always be the one taken.  When it's
       taken we enter the scalar loop with the correct induction value to apply
       the gap peeling.

    2. If the main exit is taken, the induction values assume you've finished
       all vector iterations, i.e. they assume you have completed 24 iterations,
       as we treat the main exit the same for normal loop vect and early break
       when not PEELED.
       This means the induction value is adjusted to
       ind_var.2 = ind_var.1 + 24 * VF;

    So what's going wrong?  The vectorizer's codegen is correct and efficient,
    however when we adjust the upper bounds, that code knows that the loop's
    upper bound is based on the early exit, i.e. 8 latch iterations.  In other
    words, it thinks the loop iterates once.

    This is incorrect as the vector loop iterates twice, as it has set up the
    induction value such that it exits at the early exit.  So in effect it
    iterates 2.5 times.

    Because the upper bound is incorrect, when we unroll it now exits from the
    main exit, which uses the incorrect induction value.

    So there are three ways to fix this:

    1.  If we take the position that the main exit should support both premature
        exits and final exits then vect_update_ivs_after_vectorizer needs to be
        skipped for this case, and vectorizable_induction updated with a third
        case where we reduce with a LAST reduction based on the IVs instead of
        assuming you're at the end of the vector loop.

        I don't like this approach.  I don't think we should add a third
        induction style to cover up an issue introduced by unrolling.  It makes
        the code harder to follow and makes main exits harder to reason about.

    2. We could say that vec_init_loop_exit_info should pick the exit which has
       the smallest known iteration count.  This would turn this case into a
       PEELED case and the induction values would be correct as we'd always
       recalculate them from a reduction.  This is suboptimal though as the
       reason we pick the latch exit as the IV one is to prevent having to
       rotate the loop.  This results in more efficient code for what we assume
       is the common case, i.e. the main exit.

    3. In PR113734 we've established that for vectorization of early breaks we
       must always treat the loop as partial.  Here partiality means that we
       have enough vector elements to start the iteration, but we may take an
       early exit and so never reach the latch/main exit.

       This requirement is overwritten by the peeling for gaps adjustment of the
       upper bound.  I believe the bug is simply that this shouldn't be done.
       The adjustment here is to indicate that the main exit always leads to the
       scalar loop when peeling for gaps.

       But this invariant is already always true for all early exits.  Remember
       that early exits restart the scalar loop at the start of the vector
       iteration, so the induction values will start it where we want to do the
       gaps peeling.
    I think no. 3 is the correct fix, and also one that doesn't degrade code
    quality.
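
    The difference option 3 makes to the upper bound can be sketched as below.
    This is a simplified model, not the actual change to vect_transform_loop:
    the function names are hypothetical, and the bias of 1 used without the gap
    adjustment (converting a latch count into an iteration count) is an
    assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the vector latch upper bound.
   Assumption: the bias is 1 normally and 0 once the peeling for gaps
   adjustment is applied, as described in the text.  */
unsigned vector_latch_bound (unsigned scalar_latch_bound, unsigned vf,
                             bool apply_gap_adjustment, bool partial)
{
  assert (vf > 0);
  unsigned bias = apply_gap_adjustment ? 0 : 1;
  unsigned iters = scalar_latch_bound + bias;
  /* Partial loops round up, all others round down.  */
  unsigned vec_iters = partial ? (iters + vf - 1) / vf : iters / vf;
  assert (vec_iters >= 1);
  return vec_iters - 1;
}

/* Option 3 in this model: an early-break loop is always treated as
   partial and never receives the gap adjustment.  */
unsigned early_break_latch_bound (unsigned scalar_latch_bound, unsigned vf)
{
  return vector_latch_bound (scalar_latch_bound, vf, false, true);
}
```

    In this model, with 8 scalar latch iterations and VF = 4, applying the gap
    adjustment gives a bound of 1 latch iteration, while skipping it gives 2,
    which matches the two vector iterations the loop performs before the early
    exit.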

    gcc/ChangeLog:

            PR tree-optimization/114403
            * tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for
            when peeling for gaps and early break.

    gcc/testsuite/ChangeLog:

            PR tree-optimization/114403
            * gcc.dg/vect/vect-early-break_124-pr114403.c: New test.
            * gcc.dg/vect/vect-early-break_125-pr114403.c: New test.