From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id CF3E03858D35; Fri,  4 Feb 2022 07:58:14 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org CF3E03858D35
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/104368] [12 Regression] Failure to vectorise
 conditional grouped accesses after PR102659
Date: Fri, 04 Feb 2022 07:58:14 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: everconfirmed cf_reconfirmed_on bug_status cc
Message-ID: <bug-104368-4-AD49ZbRXIH@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-104368-4@http.gcc.gnu.org/bugzilla/>
References: <bug-104368-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 04 Feb 2022 07:58:14 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104368

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2022-02-04
             Status|UNCONFIRMED                 |NEW
                 CC|                            |amacleod at redhat dot com
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  On x86 with AVX2 we don't get this vectorized anymore for the
same reason.

t.c:5:15: missed:  failed: evolution of base is not affine.
        base_address:
        offset from base address:
        constant offset from base address:
        step:
        base alignment: 0
        base misalignment: 0
        offset alignment: 0
        step alignment: 0
        base_object: *_8
Creating dr for *_12

if-conversion now produces

...
  _47 = (unsigned long) y_21(D);
..
# i_26 = PHI <i_23(8), 0(15)>
_1 = (long unsigned int) i_26;
_2 = _1 * 4;
_3 = x_20(D) + _2;
_4 = *_3;
_45 = (unsigned int) i_26;
_46 = _45 * 2;
_5 = (int) _46;
_6 = (long unsigned int) _5;
_7 = _6 * 4;
_48 = _47 + _7;
_8 = (int *) _48;
_49 = _4 > 0;
_9 = .MASK_LOAD (_8, 32B, _49);
_10 = _6 + 1;
_11 = _10 * 4;
_51 = _11 + _47;
_12 = (int *) _51;
_13 = .MASK_LOAD (_12, 32B, _49);
_52 = (unsigned int) _9;
_53 = (unsigned int) _13;
_54 = _52 + _53;
_14 = (int) _54;
.MASK_STORE (_3, 32B, _49, _14);
i_23 = i_26 + 1;
if (n_19(D) > i_23)
  goto <bb 8>; [89.00%]
else
  goto <bb 6>; [11.00%]


Note that if-conversion is correct in rewriting i*2 and i*2 + 1 to unsigned
arithmetic, since they now execute unconditionally and may overflow.
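
For orientation, here is a sketch of a source loop that would produce IL of
this shape.  It is reconstructed from the dump above, not copied from the
testcase attached to the bug; the function name f and the exact signature
are assumptions (the restrict qualifiers are inferred from the points-to
info in the dump).

```c
/* Hypothetical reconstruction of the loop in t.c, inferred from the
   GIMPLE dump; the name and signature are assumptions.  */
void f(int *restrict x, int *restrict y, int n)
{
    for (int i = 0; i < n; ++i)
        if (x[i] > 0)
            /* Both loads are conditional in the source, so i*2 and
               i*2 + 1 are only evaluated when x[i] > 0.  After
               if-conversion they are evaluated on every iteration,
               which is why they get rewritten to unsigned.  */
            x[i] = y[i * 2] + y[i * 2 + 1];
}
```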

In the end the issue is that the multiplication by the element size is
done in sizetype and so y[i*2] and y[i*2+1] might not be adjacent.  What
we miss is that if the stmts were executed then, because signed overflow
is undefined, the accesses will always be adjacent.

IMHO the only good way to recover is to scrap the separate if-conversion
step and do vectorization on the original IL.  Or integrate the two passes
enough to allow dataref analysis on the not if-converted IL.

Another possibility (and long-standing TODO) is to teach SCEV analysis
to derive assumptions we can version the loop on - in this case that
i*2 + 1 does not overflow.
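
To illustrate what versioning on such an assumption would mean, here is a
hand-written source-level sketch.  It is not what GCC emits, and SCEV does
not derive this today; the function name f_versioned and the particular
guard (n <= INT_MAX / 2) are assumptions chosen for the example.

```c
#include <limits.h>

/* Hand-written illustration of loop versioning; hypothetical, not
   compiler output.  */
void f_versioned(int *restrict x, int *restrict y, int n)
{
    if (n <= INT_MAX / 2) {
        /* Guarded copy: i < n <= INT_MAX / 2, so i*2 + 1 <= INT_MAX
           holds on every iteration, executed or not.  This is the
           copy dataref analysis could treat as two adjacent
           accesses.  */
        for (int i = 0; i < n; ++i)
            if (x[i] > 0)
                x[i] = y[i * 2] + y[i * 2 + 1];
    } else {
        /* Fallback copy: no assumption available, keep the original
           conditional scalar loop.  */
        for (int i = 0; i < n; ++i)
            if (x[i] > 0)
                x[i] = y[i * 2] + y[i * 2 + 1];
    }
}
```

At the source level both copies are identical; the point is only that the
guard gives the first copy a bound from which non-overflow follows.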

Note in this particular case we probably fail to see that

i is in [0,INT_MAX-1] and thus (unsigned)i * 2 + 1 never wraps

(unless I'm missing something).  We have

  <bb 3> [local count: 955630226]:
  # RANGE [0, 2147483647] NONZERO 2147483647
  # i_26 = PHI <i_23(8), 0(15)>
  # RANGE [0, 2147483646] NONZERO 2147483647
  _1 = (long unsigned int) i_26;
  # RANGE [0, 8589934584] NONZERO 8589934588
  _2 = _1 * 4;
  # PT = null { D.2435 } (nonlocal, restrict)
  _3 = x_20(D) + _2;
  _4 = MEM[(int *)_3 clique 1 base 1];
  _45 = (unsigned int) i_26;
  _46 = _45 * 2;
  _5 = (int) _46;
  _6 = (long unsigned int) _5;
  _7 = _6 * 4;
  _48 = _47 + _7;

so unfortunately while _1 has that correct range, i_26 does not and the
ifcvt-generated stmts don't either.  It might be possible to run
ranger on the if-converted body.
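
The arithmetic behind the "never wraps" claim above can be checked with a
small helper.  This is only a sketch: the helper name is made up for
illustration, and it assumes a 32-bit unsigned int (enforced by the static
assertion).

```c
#include <limits.h>
#include <stdint.h>

_Static_assert(UINT_MAX == 0xffffffffu, "sketch assumes 32-bit unsigned int");

/* Hypothetical helper: returns 1 if (unsigned)i * 2 + 1, computed in
   32-bit unsigned arithmetic, equals the exact value computed in
   64 bits, i.e. if the 32-bit computation did not wrap.  */
static int twice_plus_one_is_exact(int i)
{
    unsigned int narrow = (unsigned int)i * 2u + 1u;
    uint64_t wide = (uint64_t)(unsigned int)i * 2u + 1u;
    return (uint64_t)narrow == wide;
}
```

For i = INT_MAX - 1 the exact value is 4294967293, which still fits in
32 bits, so the helper returns 1 at both ends of the [0, INT_MAX-1] range.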

Andrew - if we'd like to do that, in tree-if-conv.cc in tree_if_conversion ()
after we've produced the final IL (after the call to ifcvt_hoist_invariants),
is there a way to invoke ranger on the stmts of the (single-BB) loop
and have it adjust the global ranges?  In particular - see above, it
would need to somehow improve the global range of the i_26 IV.

The pass creates blocks and destroys edges, so I'm not sure we can
reasonably use a caching instance over its lifetime; cost per loop would
then be a limiting factor.