From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id C88CF385841E; Tue, 12 Mar 2024 09:59:38 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org C88CF385841E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1710237578;
	bh=Oo8PgptPwTpJVESwYATqvjnOyaEwJK9T5bs8g+PjKj8=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=tE6oFmMEPPZwPZXrpSw5ABf65rg/ZajjOo/eL5IQx6V0YN7ooyj0viKWTuK62/+Q5
	 N8PgkyG/B71nEbRCcQ0w0iTojolRZEx4OFUsVps8X5Tp6Tn/I53+Q01dKXvSoGknTd
	 kX8PQBtRm4yBUIWbBZj+SVqoGhmhCbRPPWdWD9tg=
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/114151] [14 Regression] weird and inefficient
 codegen and addressing modes since r14-9193
Date: Tue, 12 Mar 2024 09:59:33 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-114151-4-EnXnQMSmvC@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-114151-4@http.gcc.gnu.org/bugzilla/>
References: <bug-114151-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #19 from Richard Biener <rguenth at gcc dot gnu.org> ---
So what remains here is differences like

-  (chrec = {(long unsigned int) (col_stride_10 * _105), +, (long unsigned int) col_stride_10}_2)
+  (chrec = (long unsigned int) (int) {(unsigned int) col_stride_10 * (unsigned int) _105, +, (unsigned int) col_stride_10}_2)

where we can't pull the sign-extension inside the CHREC because it might
overflow.
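
To make the overflow concern concrete, here is a minimal sketch (not GCC
code) that evaluates both forms of the CHREC at iteration i.  The function
names and the fixed-width types are assumptions for illustration: int is
modelled as std::int32_t and long unsigned int as std::uint64_t, and the
initial product stride * start is assumed to fit in int, as it does in the
original program.

```cpp
#include <cstdint>

// Narrow form:
//   (long unsigned int) (int) {(unsigned) stride * (unsigned) start, +, (unsigned) stride}
// The evolution runs in 32-bit unsigned arithmetic (which may wrap) and is
// only sign-extended afterwards.
std::uint64_t narrow_then_extend(std::int32_t stride, std::int32_t start,
                                 std::uint32_t i)
{
  std::uint32_t u = static_cast<std::uint32_t>(stride)
                      * static_cast<std::uint32_t>(start)
                    + i * static_cast<std::uint32_t>(stride);
  // Sign-extend the 32-bit pattern by hand to stay portable.
  std::int64_t s = static_cast<std::int64_t>(u);
  if (u & 0x80000000u)
    s -= (std::int64_t{1} << 32);
  return static_cast<std::uint64_t>(s);
}

// Wide form:
//   {(long unsigned int) (stride * start), +, (long unsigned int) stride}
// The sign-extension has been pulled inside, so the evolution runs in
// 64-bit arithmetic.
std::uint64_t extend_then_step(std::int32_t stride, std::int32_t start,
                               std::uint32_t i)
{
  // Assumption: stride * start fits in int.
  std::int64_t init = static_cast<std::int64_t>(stride) * start;
  return static_cast<std::uint64_t>(init)
         + i * static_cast<std::uint64_t>(static_cast<std::int64_t>(stride));
}
```

With stride = 0x40000000 and start = 0 the two forms agree at i = 1 but
differ at i = 2, where the 32-bit evolution wraps into the sign bit.  That
is the case that prevents the rewrite unless ranges rule it out.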

And

 (set_scalar_evolution
   instantiated_below = 22
   (scalar = _59)
-  (scalar_evolution = {(long unsigned int) (col_stride_10 * _105) * 2, +, (long unsigned int) col_stride_10 * 2}_2))
+  (scalar_evolution = _59))
+)

which is a failure to analyze at all.  This one looks like

  <bb 4> [local count: 118111600]:
  # col_stride_10 = PHI <size_15(D)(11), 1(2)>
  if (size_15(D) > 0)
    goto <bb 21>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 5> [local count: 118111600]:
  return;
...
  <bb 15> [local count: 343854870]:
  # RANGE [irange] int [0, 2147483646]
  # j_73 = PHI <_105(22), _68(19)>
...
  col_i_61 = col_stride_10 * j_73;
  # RANGE [irange] long unsigned int [0, 2147483647][18446744071562067968, +INF]
  _60 = (long unsigned int) col_i_61;
  # RANGE [irange] long unsigned int [0, 4294967294][18446744069414584320, 18446744073709551614] MASK 0xfffffffffffffffe VALUE 0x0
  _59 = _60 * 2;

j_73 is {_105, +, 1}_2
col_i_61 is (int) {(unsigned int) col_stride_10 * (unsigned int) _105, +, (unsigned int) col_stride_10}_2
_60 is (long unsigned int) (int) {(unsigned int) col_stride_10 * (unsigned int) _105, +, (unsigned int) col_stride_10}_2

and on the _60 * 2 multiply we fail.  Applying Andrew's proposed patch
doesn't help here, since the range of col_stride_10 can only conditionally
be adjusted to positive.

SCEV caches a scalar evolution keyed on the SSA_NAME and the 'instantiated
below' block, which is "block_before_loop": a loop's preheader, or the
function ENTRY block for analyses of scalars in the loop tree root.
A conservative context for analysis of the SCEV might be
 1) the definition stmt of the SSA name
 2) the instantiated-below block (on-exit ranges of it)
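
To illustrate why the context has to be conservative, here is a small
sketch of a cache keyed the way described above.  This is not the SCEV
implementation: the class, the use of integers for SSA versions and block
indices, and the string payload are all hypothetical stand-ins.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical key: (SSA name version, index of the instantiated-below block).
// Note that the statement where the result is used is not part of the key.
typedef std::pair<int, int> ScevKey;

class ScevCacheSketch {
public:
  // Record the evolution computed for a name below a given block.
  void set(int ssa_version, int below_bb, const std::string &chrec)
  {
    entries_[ScevKey(ssa_version, below_bb)] = chrec;
  }

  // Return the cached evolution, or a null pointer if there is none.
  const std::string *get(int ssa_version, int below_bb) const
  {
    std::map<ScevKey, std::string>::const_iterator it
      = entries_.find(ScevKey(ssa_version, below_bb));
    return it == entries_.end() ? 0 : &it->second;
  }

  // Drop everything, e.g. before a pass that analyzes with better ranges.
  void clear() { entries_.clear(); }

private:
  std::map<ScevKey, std::string> entries_;
};
```

Since every user asking for the same (name, block) pair gets the same
answer, a result derived from ranges that only hold at one particular use
would be wrong for the others.  Ranges valid at the definition stmt or on
exit of the instantiated-below block hold for all of them.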

Doing 2) by feeding the last stmt of the block as context (which won't
work when the block is empty :/), the testcase is optimized again when
I discard the SCEV cache at the start of IVOPTs and wrap IVOPTs in a
ranger instance.

While ranger has a range_on_exit API, as far as I can see it doesn't work
on GENERIC expressions but only on SSA names.  I guess that could be "fixed",
given range_on_exit also looks at the last stmt and eventually defers to
range_of_expr (or range_on_entry), but possibly get_tree_range needs
variants for on_entry/on_exit (it doesn't seem to use its 'stmt' context
very consistently, notably not for SSA_NAMEs ...).

Interestingly enough we somehow still need the
diff --git a/gcc/gimple-range.cc b/gcc/gimple-range.cc
index c16b776c1e3..c0eda5fc51d 100644
--- a/gcc/gimple-range.cc
+++ b/gcc/gimple-range.cc
@@ -102,7 +102,15 @@ gimple_ranger::range_of_expr (vrange &r, tree expr, gimple *stmt)
   if (!stmt)
     {
       Value_Range tmp (TREE_TYPE (expr));
-      m_cache.get_global_range (r, expr);
+      // If there is no global range for EXPR yet, try to evaluate it.
+      // THis call does set R to a global range regardless.
+      if (!m_cache.get_global_range (r, expr))
+       {
+         gimple *s = SSA_NAME_DEF_STMT (expr);
+         // Calculate a range for S if it is safe to do so.
+         if (s && gimple_bb (s) && gimple_get_lhs (s) == expr)
+           return range_of_stmt (r, s);
+       }
       // Pick up implied context information from the on-entry cache
       // if current_bb is set.  Do not attempt any new calculations.
       if (current_bb && m_cache.block_range (tmp, current_bb, expr, false))

hunk of Andrew's patch to do it :/

There's one other detail - the problematic multiply folding is
col_stride_10 * {_105, +, 1}_2
I'm thinking that, similar to CHREC_LEFT == 0, we can handle CHREC_RIGHT == 1
without unsigned promotion.  In the second iteration we are replacing
(_105 + 1) * col_stride_10 with _105 * col_stride_10 + col_stride_10
but we know already that _105 * col_stride_10 doesn't overflow as we
computed that in the first iteration.  And 1 * X never overflows.
The third iteration is problematic - we only know that 2 * col_stride_10
doesn't overflow if _105 was zero; if it was not, it might have been -1, which
means the second iteration computed 0 * col_stride_10 originally.  Hmm,
so _105 == -1 is problematic, so no - I don't think we can handle
CHREC_RIGHT == 1 specially.
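
The _105 == -1 case can be checked with concrete numbers.  The sketch below
is an illustration only, with made-up helper names; it assumes long long is
wider than int and uses it to detect whether a value would still fit in int.

```cpp
#include <climits>

// True if V is representable in 'int'.
bool fits_int(long long v)
{
  return v >= INT_MIN && v <= INT_MAX;
}

// Original computation at iteration I: (start + i) * stride, done in int.
bool original_ok(int start, int stride, int i)
{
  long long j = static_cast<long long>(start) + i;
  return fits_int(j) && fits_int(j * stride);
}

// Split form at iteration I: start * stride + i * stride, with every
// intermediate result required to fit in int.
bool split_ok(int start, int stride, int i)
{
  long long a = static_cast<long long>(start) * stride;
  long long b = static_cast<long long>(i) * stride;
  return fits_int(a) && fits_int(b) && fits_int(a + b);
}
```

With start = -1 and stride = 0x60000000 the original values over the first
three iterations are -stride, 0 and stride, which all fit in int.  The split
form needs 2 * stride in the third iteration, which does not fit, so
original_ok (-1, 0x60000000, 2) holds while split_ok (-1, 0x60000000, 2)
does not.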