* [1/4][PATCH] Fix PR84032
2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
@ 2019-04-16 12:03 ` Roman Zhuykov
2019-04-16 12:05 ` [2/4][PATCH] Fix PR87979 Roman Zhuykov
` (3 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-16 12:03 UTC (permalink / raw)
To: gcc-patches
There is the following mistake in the logic behind the code.
We want to schedule branch instructions only as the last instruction
in a row. But when a branch has been scheduled and we add other
instructions into the partial schedule, we sometimes allow them to be
placed in the same row after the branch.
The issue happens later when we try to reschedule the branch into
another row; the algorithm there works like this:
(1) Remove the branch from the row where it is (say, the "previous row")
(2) Try to insert it into the needed row
(3) If successful, OK; continue scheduling other instructions
(4) But when the insertion (2) was not done, insert it back into the
"previous row"; this insertion must certainly succeed, which is checked
by an assertion.
But when at step (1) the branch is not last in its row, there is no
guarantee that at step (4) we can insert it back, because there we only
try the last-in-a-row position for it.
This patch solves this by completely preventing other instructions from
being scheduled after the branch in the same row.
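To make the invariant concrete, here is a reduced sketch of the check the
patch strengthens in ps_insn_find_column (hypothetical types and helper
name, not the real modulo-sched.c code): once a jump is seen among the
insns already placed in a row, no further column in that row is
acceptable, so the closing branch always stays last and step (4) can
never fail.

/* A minimal model with illustrative names, not the real GCC structures.  */
#include <stdbool.h>

struct model_ps_insn
{
  bool is_jump;
  struct model_ps_insn *next_in_row;
};

/* Return true iff another insn may still be placed in this row.  */
static bool
row_accepts_more_insns (struct model_ps_insn *first_in_row)
{
  for (struct model_ps_insn *p = first_in_row; p; p = p->next_in_row)
    if (p->is_jump)
      return false;  /* Nothing is allowed after the branch.  */
  return true;
}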
I've described patch testing in the cover letter. Ok for trunk?
gcc/ChangeLog:
2019-04-09 Roman Zhuykov <zhroma@ispras.ru>
PR rtl-optimization/84032
* modulo-sched.c (ps_insn_find_column): Change condition so that
branch will always be the last insn in a row inside partial schedule.
gcc/testsuite/ChangeLog:
2019-04-09 Roman Zhuykov <zhroma@ispras.ru>
PR rtl-optimization/84032
* gcc.dg/pr84032.c: New test.
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -2996,9 +2996,7 @@ ps_insn_find_column (partial_schedule_ptr ps, ps_insn_ptr ps_i,
last_must_precede = next_ps_i;
}
/* The closing branch must be the last in the row. */
- if (must_precede
- && bitmap_bit_p (must_precede, next_ps_i->id)
- && JUMP_P (ps_rtl_insn (ps, next_ps_i->id)))
+ if (JUMP_P (ps_rtl_insn (ps, next_ps_i->id)))
return false;
last_in_row = next_ps_i;
diff --git a/gcc/testsuite/gcc.dg/pr84032.c b/gcc/testsuite/gcc.dg/pr84032.c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr84032.c
@@ -0,0 +1,23 @@
+/* PR rtl-optimization/84032 */
+/* { dg-do compile } */
+/* { dg-options "-O1 -fmodulo-sched" } */
+/* { dg-additional-options "-mcpu=power6" { target { powerpc-*-* } } } */
+
+void
+yr (int cm)
+{
+ int ka = cm;
+
+ for (;;)
+ {
+ short int m0;
+
+ for (m0 = 0; m0 < 6; ++m0)
+ {
+ ka &= 1;
+ cm *= 2;
+ }
+
+ ka = (ka == 0) ? cm : 0;
+ }
+}
^ permalink raw reply [flat|nested] 17+ messages in thread
* [2/4][PATCH] Fix PR87979
2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
2019-04-16 12:03 ` [1/4][PATCH] Fix PR84032 Roman Zhuykov
@ 2019-04-16 12:05 ` Roman Zhuykov
2019-04-16 12:08 ` [3/4][PATCH] Fix PR90001 Roman Zhuykov
` (2 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-16 12:05 UTC (permalink / raw)
To: gcc-patches
We divide by zero when we try to schedule the loop body in zero cycles.
Both the res_mii and rec_mii estimates equal zero. We have to start
with one cycle in this situation.
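To make the failure mode concrete, here is a minimal sketch, assuming
the partial schedule maps an absolute cycle to a row by taking it modulo
the initiation interval (illustrative names, not the actual GCC code):

/* Illustrative only: with ii == 0 the modulo operation is a division
   by zero, which is exactly what clamping mii to 1 prevents.  */
static int
row_of_cycle (int cycle, int ii)
{
  return cycle % ii;  /* Traps (SIGFPE) when ii == 0.  */
}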
The same failure happens in the following tests on the ia64 platform
with -fmodulo-sched enabled (with any of -O1, -O2, -Os):
gcc.dg/torture/pr82762.c
gcc.c-torture/execute/20170419-1.c
I've described patch testing in the cover letter. Ok for trunk?
gcc/ChangeLog:
2019-04-08 Roman Zhuykov <zhroma@ispras.ru>
PR rtl-optimization/87979
* modulo-sched.c (sms_schedule): Start ii value "mii" should
not equal zero.
gcc/testsuite/ChangeLog:
2019-04-08 Roman Zhuykov <zhroma@ispras.ru>
PR rtl-optimization/87979
* gcc.dg/pr87979.c: New test.
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -1602,6 +1602,7 @@ sms_schedule (void)
mii = 1; /* Need to pass some estimate of mii. */
rec_mii = sms_order_nodes (g, mii, node_order, &max_asap);
mii = MAX (res_MII (g), rec_mii);
+ mii = MAX (mii, 1);
maxii = MAX (max_asap, MAXII_FACTOR * mii);
if (dump_file)
diff --git a/gcc/testsuite/gcc.dg/pr87979.c b/gcc/testsuite/gcc.dg/pr87979.c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr87979.c
@@ -0,0 +1,11 @@
+/* PR rtl-optimization/87979 */
+/* { dg-do compile } */
+/* { dg-options "-Os -fmodulo-sched -fno-tree-loop-im" } */
+/* { dg-additional-options "-march=z196" { target { s390*-*-* } } } */
+
+void foo(void)
+{
+ static int m;
+ for (int i = 0; i < 10; ++i)
+ m++;
+}
^ permalink raw reply [flat|nested] 17+ messages in thread
* [3/4][PATCH] Fix PR90001
2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
2019-04-16 12:03 ` [1/4][PATCH] Fix PR84032 Roman Zhuykov
2019-04-16 12:05 ` [2/4][PATCH] Fix PR87979 Roman Zhuykov
@ 2019-04-16 12:08 ` Roman Zhuykov
2019-04-26 15:42 ` Jeff Law
2019-04-16 12:26 ` [4/4][PATCH] Discussing PR83507 Roman Zhuykov
2019-04-22 19:16 ` [0/4] Addressing modulo-scheduling bugs Roman Zhuykov
4 siblings, 1 reply; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-16 12:08 UTC (permalink / raw)
To: gcc-patches
The current algorithm, which finds recurrence_length for all DDG strongly
connected components, works in roughly O(N^6) time, where N is the number
of nodes in the DDG. The time is that bad mostly for graphs with lots of
edges, close to N^2 of them. My proposed algorithm works in O(N^3).
The algorithm for finding the SCCs themselves is also not optimal (maybe
up to O(N^4)), but it is left untouched here.
In some situations, when the number of edges is smaller (for example,
equal to N), the new algorithm can unfortunately be slower than the old
one. But I think it's better to add some bail-out here, for example when
we get more than 1000 nodes.
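For reference, here is a self-contained sketch of the O(N^3) relaxation
(illustrative names, reduced to a single component): a Floyd-Warshall-like
loop over longest-path lengths. It is valid here because backarcs are
excluded, so the edges considered inside an SCC form an acyclic subgraph;
-1 means "unreachable".

#define NODES 4

/* dist[i][j] starts as the direct-edge latency, 0 on the diagonal and
   -1 when there is no edge; afterwards it holds longest-path lengths.  */
static void
relax_longest_paths (int dist[NODES][NODES])
{
  for (int k = 0; k < NODES; k++)
    for (int i = 0; i < NODES; i++)
      for (int j = 0; j < NODES; j++)
        if (dist[i][k] >= 0 && dist[k][j] >= 0
            && dist[i][j] < dist[i][k] + dist[k][j])
          dist[i][j] = dist[i][k] + dist[k][j];
}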
Before creating this patch, I tested a special version of it, where both
approaches were active and asserts were inserted to check that the
algorithms' results (longest_simple_path values) are exactly the same.
I can publish this special version if needed.
I've described patch testing in the cover letter. Ok for trunk?
gcc/ChangeLog:
2019-04-08 Roman Zhuykov <zhroma@ispras.ru>
PR rtl-optimization/90001
* ddg.c (create_ddg): Init max_dist array for each node.
(free_ddg): Free max_dist array.
(create_ddg_edge): Use bool field instead of aux union.
(set_recurrence_length): Use prepared max_dist information instead
of calling longest_simple_path.
(create_scc): Remove graph argument, fill node's aux.count with
SCC id, and move set_recurrence_length call to...
(create_ddg_all_sccs): ...here, after filling all max_dist arrays
using Floyd-Warshall-like algorithm.
(update_dist_to_successors): Remove the whole function.
(longest_simple_path): Likewise.
* ddg.h (struct ddg_node): Add max_dist pointer.
(struct ddg_edge): Use bool field instead of unused aux union.
diff --git a/gcc/ddg.c b/gcc/ddg.c
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -32,9 +32,6 @@ along with GCC; see the file COPYING3. If not see
#ifdef INSN_SCHEDULING
-/* A flag indicating that a ddg edge belongs to an SCC or not. */
-enum edge_flag {NOT_IN_SCC = 0, IN_SCC};
-
/* Forward declarations. */
static void add_backarc_to_ddg (ddg_ptr, ddg_edge_ptr);
static void add_backarc_to_scc (ddg_scc_ptr, ddg_edge_ptr);
@@ -564,7 +561,7 @@ create_ddg (basic_block bb, int closing_branch_deps)
{
ddg_ptr g;
rtx_insn *insn, *first_note;
- int i;
+ int i, j;
int num_nodes = 0;
g = (ddg_ptr) xcalloc (1, sizeof (struct ddg));
@@ -632,6 +629,12 @@ create_ddg (basic_block bb, int closing_branch_deps)
g->nodes[i].predecessors = sbitmap_alloc (num_nodes);
bitmap_clear (g->nodes[i].predecessors);
g->nodes[i].first_note = (first_note ? first_note : insn);
+
+ g->nodes[i].aux.count = -1;
+ g->nodes[i].max_dist = XCNEWVEC (int, num_nodes);
+ for (j = 0; j < num_nodes; j++)
+ g->nodes[i].max_dist[j] = -1;
+
g->nodes[i++].insn = insn;
first_note = NULL;
}
@@ -668,6 +671,7 @@ free_ddg (ddg_ptr g)
}
sbitmap_free (g->nodes[i].successors);
sbitmap_free (g->nodes[i].predecessors);
+ free (g->nodes[i].max_dist);
}
if (g->num_backarcs > 0)
free (g->backarcs);
@@ -792,7 +796,7 @@ create_ddg_edge (ddg_node_ptr src, ddg_node_ptr dest,
e->latency = l;
e->distance = d;
e->next_in = e->next_out = NULL;
- e->aux.info = 0;
+ e->in_scc = false;
return e;
}
@@ -820,7 +824,7 @@ add_edge_to_ddg (ddg_ptr g ATTRIBUTE_UNUSED, ddg_edge_ptr e)
for now that cycles in the data dependence graph contain a single
backarc.
This simplifies the algorithm, and can be generalized later. */
static void
-set_recurrence_length (ddg_scc_ptr scc, ddg_ptr g)
+set_recurrence_length (ddg_scc_ptr scc)
{
int j;
int result = -1;
@@ -828,17 +832,14 @@ set_recurrence_length (ddg_scc_ptr scc, ddg_ptr g)
for (j = 0; j < scc->num_backarcs; j++)
{
ddg_edge_ptr backarc = scc->backarcs[j];
- int length;
int distance = backarc->distance;
ddg_node_ptr src = backarc->dest;
ddg_node_ptr dest = backarc->src;
+ int length = src->max_dist[dest->cuid];
+
+ if (length < 0)
+ continue;
- length = longest_simple_path (g, src->cuid, dest->cuid, scc->nodes);
- if (length < 0 )
- {
- /* fprintf (stderr, "Backarc not on simple cycle in SCC.\n"); */
- continue;
- }
length += backarc->latency;
result = MAX (result, (length / distance));
}
@@ -846,9 +847,9 @@ set_recurrence_length (ddg_scc_ptr scc, ddg_ptr g)
}
/* Create a new SCC given the set of its nodes. Compute its recurrence_length
- and mark edges that belong to this scc as IN_SCC. */
+ and mark edges that belong to this scc. */
static ddg_scc_ptr
-create_scc (ddg_ptr g, sbitmap nodes)
+create_scc (ddg_ptr g, sbitmap nodes, int id)
{
ddg_scc_ptr scc;
unsigned int u = 0;
@@ -866,16 +867,18 @@ create_scc (ddg_ptr g, sbitmap nodes)
ddg_edge_ptr e;
ddg_node_ptr n = &g->nodes[u];
+ gcc_assert (n->aux.count == -1);
+ n->aux.count = id;
+
for (e = n->out; e; e = e->next_out)
if (bitmap_bit_p (nodes, e->dest->cuid))
{
- e->aux.count = IN_SCC;
+ e->in_scc = true;
if (e->distance > 0)
add_backarc_to_scc (scc, e);
}
}
- set_recurrence_length (scc, g);
return scc;
}
@@ -1018,7 +1021,7 @@ check_sccs (ddg_all_sccs_ptr sccs, int num_nodes)
ddg_all_sccs_ptr
create_ddg_all_sccs (ddg_ptr g)
{
- int i;
+ int i, j, k, scc, way;
int num_nodes = g->num_nodes;
auto_sbitmap from (num_nodes);
auto_sbitmap to (num_nodes);
@@ -1038,7 +1041,7 @@ create_ddg_all_sccs (ddg_ptr g)
ddg_node_ptr dest = backarc->dest;
/* If the backarc already belongs to an SCC, continue. */
- if (backarc->aux.count == IN_SCC)
+ if (backarc->in_scc)
continue;
bitmap_clear (scc_nodes);
@@ -1049,10 +1052,52 @@ create_ddg_all_sccs (ddg_ptr g)
if (find_nodes_on_paths (scc_nodes, g, from, to))
{
- scc = create_scc (g, scc_nodes);
+ scc = create_scc (g, scc_nodes, sccs->num_sccs);
add_scc_to_ddg (sccs, scc);
}
}
+
+ /* Init max_dist arrays for Floyd-Warshall-like
+ longest path calculation algorithm. */
+ for (k = 0; k < num_nodes; k++)
+ {
+ ddg_edge_ptr e;
+ ddg_node_ptr n = &g->nodes[k];
+
+ if (n->aux.count == -1)
+ continue;
+
+ n->max_dist[k] = 0;
+ for (e = n->out; e; e = e->next_out)
+ if (e->distance == 0 && g->nodes[e->dest->cuid].aux.count == n->aux.count)
+ n->max_dist[e->dest->cuid] = e->latency;
+ }
+
+ /* Run main Floyd-Warshall loop. We use only non-backarc edges
+ inside each scc. */
+ for (k = 0; k < num_nodes; k++)
+ {
+ scc = g->nodes[k].aux.count;
+ if (scc != -1)
+ {
+ for (i = 0; i < num_nodes; i++)
+ if (g->nodes[i].aux.count == scc)
+ for (j = 0; j < num_nodes; j++)
+ if (g->nodes[j].aux.count == scc
+ && g->nodes[i].max_dist[k] >= 0
+ && g->nodes[k].max_dist[j] >= 0)
+ {
+ way = g->nodes[i].max_dist[k] + g->nodes[k].max_dist[j];
+ if (g->nodes[i].max_dist[j] < way)
+ g->nodes[i].max_dist[j] = way;
+ }
+ }
+ }
+
+ /* Calculate recurrence_length using max_dist info. */
+ for (i = 0; i < sccs->num_sccs; i++)
+ set_recurrence_length (sccs->sccs[i]);
+
order_sccs (sccs);
if (flag_checking)
@@ -1155,72 +1200,4 @@ find_nodes_on_paths (sbitmap result, ddg_ptr g, sbitmap from, sbitmap to)
return bitmap_and (result, reachable_from, reach_to);
}
-
-/* Updates the counts of U_NODE's successors (that belong to NODES) to be
- at-least as large as the count of U_NODE plus the latency between them.
- Sets a bit in TMP for each successor whose count was changed (increased).
- Returns nonzero if any count was changed. */
-static int
-update_dist_to_successors (ddg_node_ptr u_node, sbitmap nodes, sbitmap tmp)
-{
- ddg_edge_ptr e;
- int result = 0;
-
- for (e = u_node->out; e; e = e->next_out)
- {
- ddg_node_ptr v_node = e->dest;
- int v = v_node->cuid;
-
- if (bitmap_bit_p (nodes, v)
- && (e->distance == 0)
- && (v_node->aux.count < u_node->aux.count + e->latency))
- {
- v_node->aux.count = u_node->aux.count + e->latency;
- bitmap_set_bit (tmp, v);
- result = 1;
- }
- }
- return result;
-}
-
-
-/* Find the length of a longest path from SRC to DEST in G,
- going only through NODES, and disregarding backarcs. */
-int
-longest_simple_path (struct ddg * g, int src, int dest, sbitmap nodes)
-{
- int i;
- unsigned int u = 0;
- int change = 1;
- int num_nodes = g->num_nodes;
- auto_sbitmap workset (num_nodes);
- auto_sbitmap tmp (num_nodes);
- for (i = 0; i < g->num_nodes; i++)
- g->nodes[i].aux.count = -1;
- g->nodes[src].aux.count = 0;
-
- bitmap_clear (tmp);
- bitmap_set_bit (tmp, src);
-
- while (change)
- {
- sbitmap_iterator sbi;
-
- change = 0;
- bitmap_copy (workset, tmp);
- bitmap_clear (tmp);
- EXECUTE_IF_SET_IN_BITMAP (workset, 0, u, sbi)
- {
- ddg_node_ptr u_node = &g->nodes[u];
-
- change |= update_dist_to_successors (u_node, nodes, tmp);
- }
- }
- return g->nodes[dest].aux.count;
-}
-
#endif /* INSN_SCHEDULING */
diff --git a/gcc/ddg.h b/gcc/ddg.h
--- a/gcc/ddg.h
+++ b/gcc/ddg.h
@@ -64,6 +64,10 @@ struct ddg_node
sbitmap successors;
sbitmap predecessors;
+ /* Temporary array used for Floyd-Warshall algorithm to find
+ scc recurrence length. */
+ int *max_dist;
+
/* For general use by algorithms manipulating the ddg. */
union {
int count;
@@ -95,11 +99,8 @@ struct ddg_edge
ddg_edge_ptr next_in;
ddg_edge_ptr next_out;
- /* For general use by algorithms manipulating the ddg. */
- union {
- int count;
- void *info;
- } aux;
+ /* Is true when edge is already in scc. */
+ bool in_scc;
};
/* This structure holds the Data Dependence Graph for a basic block. */
@@ -178,7 +179,6 @@ ddg_all_sccs_ptr create_ddg_all_sccs (ddg_ptr);
void free_ddg_all_sccs (ddg_all_sccs_ptr);
int find_nodes_on_paths (sbitmap result, ddg_ptr, sbitmap from, sbitmap to);
-int longest_simple_path (ddg_ptr, int from, int to, sbitmap via);
bool autoinc_var_is_used_p (rtx_insn *, rtx_insn *);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [3/4][PATCH] Fix PR90001
2019-04-16 12:08 ` [3/4][PATCH] Fix PR90001 Roman Zhuykov
@ 2019-04-26 15:42 ` Jeff Law
2019-04-26 17:41 ` Roman Zhuykov
0 siblings, 1 reply; 17+ messages in thread
From: Jeff Law @ 2019-04-26 15:42 UTC (permalink / raw)
To: Roman Zhuykov, gcc-patches
On 4/16/19 6:05 AM, Roman Zhuykov wrote:
> The current algorithm, which finds recurrence_length for all DDG strongly
> connected components, works in roughly O(N^6) time, where N is the number
> of nodes in the DDG. The time is that bad mostly for graphs with lots of
> edges, close to N^2 of them. My proposed algorithm works in O(N^3).
> The algorithm for finding the SCCs themselves is also not optimal (maybe
> up to O(N^4)), but it is left untouched here.
>
> In some situations, when the number of edges is smaller (for example,
> equal to N), the new algorithm can unfortunately be slower than the old
> one. But I think it's better to add some bail-out here, for example when
> we get more than 1000 nodes.
>
> Before creating this patch, I tested a special version of it, where both
> approaches were active and asserts were inserted to check that the
> algorithms' results (longest_simple_path values) are exactly the same.
> I can publish this special version if needed.
>
> I've described patch testing in the cover letter. Ok for trunk?
>
> gcc/ChangeLog:
>
> 2019-04-08 Roman Zhuykov <zhroma@ispras.ru>
>
> PR rtl-optimization/90001
> * ddg.c (create_ddg): Init max_dist array for each node.
> (free_ddg): Free max_dist array.
> (create_ddg_edge): Use bool field instead of aux union.
> (set_recurrence_length): Use prepared max_dist information instead
> of calling longest_simple_path.
> (create_scc): Remove graph argument, fill node's aux.count with
> SCC id, and move set_recurrence_length call to...
> (create_ddg_all_sccs): ...here, after filling all max_dist arrays
> using Floyd-Warshall-like algorithm.
> (update_dist_to_successors): Remove the whole function.
> (longest_simple_path): Likewise.
> * ddg.h (struct ddg_node): Add max_dist pointer.
> (struct ddg_edge): Use bool field instead of unused aux union.
So just an FYI. If ddg.c is only used by the modulo scheduler, then it
falls under your maintainership and you can decide when to install this
patch.
We generally try to avoid major changes right after the branch is cut,
but I doubt we'll be doing a lot of backporting in this code, so I think
you can go whenever you're ready.
jeff
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [3/4][PATCH] Fix PR90001
2019-04-26 15:42 ` Jeff Law
@ 2019-04-26 17:41 ` Roman Zhuykov
0 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-26 17:41 UTC (permalink / raw)
To: Jeff Law; +Cc: gcc-patches
> So just an FYI. If ddg.c is only used by the modulo scheduler, then it
> falls under your maintainership and you can decide when to install this
> patch.
Yes, I understand about ddg.c and ddg.h. I also studied everything
about loop-doloop.c because it is connected. I will try to participate
in doloop discussions, and will soon present a (mostly refactoring)
doloop patch for approval.
> We generally try to avoid major changes right after the branch is cut,
> but I doubt we'll be doing a lot of backporting in this code, so I think
> you can go whenever you're ready.
I'll now wait for the "not disrupting code RC phase" to finish, and
then will commit it together with posting my next patchset.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [4/4][PATCH] Discussing PR83507
2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
` (2 preceding siblings ...)
2019-04-16 12:08 ` [3/4][PATCH] Fix PR90001 Roman Zhuykov
@ 2019-04-16 12:26 ` Roman Zhuykov
2019-04-16 15:10 ` Segher Boessenkool
2019-04-22 19:16 ` [0/4] Addressing modulo-scheduling bugs Roman Zhuykov
4 siblings, 1 reply; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-16 12:26 UTC (permalink / raw)
To: gcc-patches; +Cc: Jakub Jelinek
This issue unfortunately was not solved correctly. In that example we
don't have -fmodulo-sched-allow-regmoves enabled and we should not
create any register moves at all.
I'll describe here everything I found while looking into the issue. As
a first step, I patched the trunk compiler like this:
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -735,6 +735,7 @@ schedule_reg_moves (partial_schedule_ptr ps)
continue;
/* Create NREG_MOVES register moves. */
+ gcc_assert (flag_modulo_sched_allow_regmoves);
first_move = ps->reg_moves.length ();
ps->reg_moves.safe_grow_cleared (first_move + nreg_moves);
extend_node_sched_params (ps);
@@ -744,8 +745,7 @@ schedule_reg_moves (partial_schedule_ptr ps)
/* Generate each move. */
old_reg = prev_reg = SET_DEST (set);
- if (HARD_REGISTER_P (old_reg))
- return false;
+ gcc_assert (!HARD_REGISTER_P (old_reg));
for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
{
On current trunk this patch doesn't give any failures on powerpc, but
I've checked other platforms and found gcc.c-torture/execute/pr84524.c
ICEing with -O1 -fmodulo-sched on aarch64. Speaking about sms-13.c, it
happens that it is no longer a proper loop for modulo scheduling on
current trunk. Changes in other passes caused this, and that file
actually tests nothing right now.
Then I added the 8 branch into my testing, and decided to first find a
proper way to fix sms-13.c on the 8 branch.
Without -fmodulo-sched-allow-regmoves each TRUE_DEP edge should be
paired with another (sometimes ANTI_DEP or OUTPUT_DEP) edge in the
opposite direction. My assertion failure means that the DDG was odd.
As Alexander mentioned here
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00028.html
"SMS uses sched-deps for intra-loop deps, and then separately uses DF
for cross-iteration deps, which means that it should be ready for
surprises when the two scanners are not 100% in sync."
The situation here is one of those surprises, and debugging shows that
we lose one of the dependencies here.
Consider these two instructions:
(insn 30 29 31 8 (parallel [
(set (reg:SI 142)
(minus:SI (reg/v:SI 133 [ x+-2 ])
(reg/v:SI 130 [ b+-2 ])))
(set (reg:SI 76 ca)
(leu:SI (reg/v:SI 130 [ b+-2 ])
(reg/v:SI 133 [ x+-2 ])))
]) "test.c":22 107 {subfsi3_carry}
(expr_list:REG_UNUSED (reg:SI 142)
(nil)))
(insn 31 30 32 8 (parallel [
(set (reg:SI 143)
(plus:SI (reg:SI 76 ca)
(const_int -1 [0xffffffffffffffff])))
(clobber (reg:SI 76 ca))
]) "test.c":22 116 {subfsi3_carry_in_xx}
(expr_list:REG_DEAD (reg:SI 76 ca)
(expr_list:REG_UNUSED (reg:SI 76 ca)
(nil))))
Sched-deps organizes the obvious 30->31 true dependency, and we want DF
to help us with the 31->30 output dependency here. CA is clobbered in
insn 31, and if we move insn 31 forward we have to stop at insn 30 of
the next iteration.
Build_inter_loop_deps doesn't create such an output dependency because
the reaching definitions "rd_bb_info->gen" set doesn't contain the CA
register. Maybe because it is a clobber, not a set. So we need to
somehow establish that dependency, and I've found that the df_live set
can be used here instead.
I proceeded with the following change:
diff --git a/gcc/ddg.c b/gcc/ddg.c
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -365,16 +365,14 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
static void
build_inter_loop_deps (ddg_ptr g)
{
- unsigned rd_num;
- struct df_rd_bb_info *rd_bb_info;
+ unsigned regno;
bitmap_iterator bi;
-
- rd_bb_info = DF_RD_BB_INFO (g->bb);
+ struct df_live_bb_info *live_bb_info = DF_LIVE_BB_INFO (g->bb);
/* Find inter-loop register output, true and anti deps. */
- EXECUTE_IF_SET_IN_BITMAP (&rd_bb_info->gen, 0, rd_num, bi)
+ EXECUTE_IF_SET_IN_BITMAP (&live_bb_info->gen, 0, regno, bi)
{
- df_ref rd = DF_DEFS_GET (rd_num);
+ df_ref rd = df_bb_regno_last_def_find (g->bb, regno);
add_cross_iteration_register_deps (g, rd);
}
(Here I skipped minor hunks which properly add/remove the df_live
problem in modulo-sched.c when optimize == 1.)
This fix corrected sms-13.c on the 8 branch, but pr84524.c mentioned
above was still ICEing on both trunk and the 8 branch. No other failures
were introduced while testing this intermediate patch on different
branches and platforms. The tested patch also temporarily contained
additional checks that the only difference between the "rd_bb_info->gen"
and "live_bb_info->gen" sets may be additional hard registers in the
live set.
In pr84524.c we got a loop with an extended inline asm:
asm volatile ("" : "+r" (v))
which also gives us a "surprising" situation Alexander predicted.
For the sched-deps scanner such a volatile asm is a "true barrier" and
we create dependencies to almost all other instructions, while the DF
scanners don't give us such information.
Maybe it is a good idea to somehow re-implement the modulo scheduler
using only one scanner instead of two, but at the moment the best thing
to do is to detect the situation earlier and skip such loops.
So, the last two hunks are:
@@ -1473,6 +1478,8 @@ sms_schedule (void)
if (CALL_P (insn)
|| BARRIER_P (insn)
+ || (NONDEBUG_INSN_P (insn) && extract_asm_operands (PATTERN (insn))
+ && MEM_VOLATILE_P (extract_asm_operands (PATTERN (insn))))
|| (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
&& !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
&& !reg_mentioned_p (count_reg, insn))
@@ -1489,6 +1496,11 @@ sms_schedule (void)
fprintf (dump_file, "SMS loop-with-call\n");
else if (BARRIER_P (insn))
fprintf (dump_file, "SMS loop-with-barrier\n");
+ else if (NONDEBUG_INSN_P (insn)
+ && extract_asm_operands (PATTERN (insn))
+ && MEM_VOLATILE_P (extract_asm_operands
+ (PATTERN (insn))))
+ fprintf (dump_file, "SMS loop-with-volatile-asm\n");
else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
&& !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
Patch testing is addressed in my cover letter. Ok for trunk?
gcc/ChangeLog:
2019-04-10 Roman Zhuykov <zhroma@ispras.ru>
PR target/83507
* ddg.c (build_inter_loop_deps): Use live_bb_info instead of
rd_bb_info to search last register definitions.
* modulo-sched.c (schedule_reg_moves): Add an assertion which
prevents creating register moves when it is not allowed.
(sms_schedule): Add/remove df_live problem when optimize == 1.
Bail out when we got volatile asm inside the loop.
diff --git a/gcc/ddg.c b/gcc/ddg.c
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -365,16 +365,14 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
static void
build_inter_loop_deps (ddg_ptr g)
{
- unsigned rd_num;
- struct df_rd_bb_info *rd_bb_info;
+ unsigned regno;
bitmap_iterator bi;
-
- rd_bb_info = DF_RD_BB_INFO (g->bb);
+ struct df_live_bb_info *live_bb_info = DF_LIVE_BB_INFO (g->bb);
/* Find inter-loop register output, true and anti deps. */
- EXECUTE_IF_SET_IN_BITMAP (&rd_bb_info->gen, 0, rd_num, bi)
+ EXECUTE_IF_SET_IN_BITMAP (&live_bb_info->gen, 0, regno, bi)
{
- df_ref rd = DF_DEFS_GET (rd_num);
+ df_ref rd = df_bb_regno_last_def_find (g->bb, regno);
add_cross_iteration_register_deps (g, rd);
}
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
/* Create NREG_MOVES register moves. */
+ gcc_assert (flag_modulo_sched_allow_regmoves);
first_move = ps->reg_moves.length ();
ps->reg_moves.safe_grow_cleared (first_move + nreg_moves);
extend_node_sched_params (ps);
@@ -744,8 +745,7 @@ schedule_reg_moves (partial_schedule_ptr ps)
/* Generate each move. */
old_reg = prev_reg = SET_DEST (set);
- if (HARD_REGISTER_P (old_reg))
- return false;
+ gcc_assert (!HARD_REGISTER_P (old_reg));
for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
{
@@ -1371,7 +1371,12 @@ sms_schedule (void)
else
issue_rate = 1;
- /* Initialize the scheduler. */
+ /* Initialize the scheduler, we need live info even at O1. */
+ if (optimize == 1)
+ {
+ df_live_add_problem ();
+ df_live_set_all_dirty ();
+ }
setup_sched_infos ();
haifa_sched_init ();
@@ -1473,6 +1478,8 @@ sms_schedule (void)
if (CALL_P (insn)
|| BARRIER_P (insn)
+ || (NONDEBUG_INSN_P (insn) && extract_asm_operands (PATTERN (insn))
+ && MEM_VOLATILE_P (extract_asm_operands (PATTERN (insn))))
|| (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
&& !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
&& !reg_mentioned_p (count_reg, insn))
@@ -1489,6 +1496,11 @@ sms_schedule (void)
fprintf (dump_file, "SMS loop-with-call\n");
else if (BARRIER_P (insn))
fprintf (dump_file, "SMS loop-with-barrier\n");
+ else if (NONDEBUG_INSN_P (insn)
+ && extract_asm_operands (PATTERN (insn))
+ && MEM_VOLATILE_P (extract_asm_operands
+ (PATTERN (insn))))
+ fprintf (dump_file, "SMS loop-with-volatile-asm\n");
else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
&& !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
fprintf (dump_file, "SMS loop-with-not-single-set\n");
@@ -1752,6 +1764,9 @@ sms_schedule (void)
/* Release scheduler data, needed until now because of DFA. */
haifa_sched_finish ();
loop_optimizer_finalize ();
+
+ if (optimize == 1)
+ df_remove_problem (df_live);
}
/* The SMS scheduling algorithm itself
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [4/4][PATCH] Discussing PR83507
2019-04-16 12:26 ` [4/4][PATCH] Discussing PR83507 Roman Zhuykov
@ 2019-04-16 15:10 ` Segher Boessenkool
2019-04-22 16:45 ` Roman Zhuykov
0 siblings, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2019-04-16 15:10 UTC (permalink / raw)
To: Roman Zhuykov; +Cc: gcc-patches, Jakub Jelinek
On Tue, Apr 16, 2019 at 03:08:23PM +0300, Roman Zhuykov wrote:
> This issue unfortunately was not solved correctly. In that example we
> don't have -fmodulo-sched-allow-regmoves enabled and we should not
> create any register moves at all.
Yes, but if we do for whatever reason, we should never create register
moves of hard registers. Because that is incorrect (there may not be
insns to do it). It's a separate issue.
You're extending Jakub's patch here, not replacing it, so that's fine.
> In pr84524.c we got a loop with an extended inline asm:
> asm volatile ("" : "+r" (v))
> which also gives us a "surprising" situation Alexander predicted.
>
> For the sched-deps scanner such a volatile asm is a "true barrier" and we
> create dependencies to almost all other instructions, while the DF
> scanners don't give us such information.
There is no such thing as a "true barrier" in extended asm. The only thing
volatile asm means is that the asm has a side effect unknown to the compiler;
this can *not* be a modification of a register or of memory contents, such
things are known by the compiler, that's what clobbers and "memory" clobber
are about.
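As an illustration of that point, a hedged sketch rather than code from
the thread: the compiler learns what an asm touches only through its
operands and clobbers; volatile by itself adds no such information.

/* Illustrative only: the "+r" constraint ties V to a register the asm
   may modify, and the "memory" clobber tells the compiler that memory
   may be read or written, so values cannot be cached across the asm.  */
static int
touch (int v)
{
  __asm__ volatile ("" : "+r" (v) : : "memory");
  return v;
}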
> Maybe it is a good idea to somehow re-implement the modulo scheduler using only
> one scanner instead of two, but at the moment the best thing to do is to
> detect the situation earlier and skip such loops.
Or fix the broken code...
Segher
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [4/4][PATCH] Discussing PR83507
2019-04-16 15:10 ` Segher Boessenkool
@ 2019-04-22 16:45 ` Roman Zhuykov
2019-04-23 15:55 ` Segher Boessenkool
0 siblings, 1 reply; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-22 16:45 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: gcc-patches, Jakub Jelinek
> > This issue unfortunately was not solved correctly. In that example we
> > don't have -fmodulo-sched-allow-regmoves enabled and we should not
> > create any register moves at all.
>
> Yes, but if we do for whatever reason, we should never create register
> moves of hard registers. Because that is incorrect (there may not be
> insns to do it). It's a separate issue.
>
> You're extending Jakub's patch here, not replacing it, so that's fine.
Certainly, the patch contains assertions to catch both situations:
when we try to create a reg_move when it wasn't allowed at all, and when
we try to create a reg_move for a hard register. Preventing both of them
can in theory easily be achieved when the SMS algorithm gets an
absolutely correct DDG as input.
> > In pr84524.c we got a loop with an extended inline asm:
> > asm volatile ("" : "+r" (v))
> > which also gives us a "surprising" situation Alexander predicted.
> >
> > For the sched-deps scanner such a volatile asm is a "true barrier" and we
> > create dependencies to almost all other instructions, while the DF
> > scanners don't give us such information.
>
> There is no such thing as a "true barrier" in extended asm. The only thing
> volatile asm means is that the asm has a side effect unknown to the compiler;
> this can *not* be a modification of a register or of memory contents, such
> things are known by the compiler, that's what clobbers and "memory" clobber
> are about.
In sched-deps.c we got:
case ASM_OPERANDS:
case ASM_INPUT:
{
/* Traditional and volatile asm instructions must be considered to use
and clobber all hard registers, all pseudo-registers and all of
memory. So must TRAP_IF and UNSPEC_VOLATILE operations.
Consider for instance a volatile asm that changes the fpu rounding
mode. An insn should not be moved across this even if it only uses
pseudo-regs because it might give an incorrectly rounded result. */
if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
&& !DEBUG_INSN_P (insn))
reg_pending_barrier = TRUE_BARRIER;
...
I understand that the mentioned "changing fpu rounding mode" example may
not be relevant in the current compiler. But we still get TRUE_BARRIER
for all volatile asms.
> > Maybe it is a good idea to somehow re-implement the modulo scheduler using only
> > one scanner instead of two, but at the moment the best thing to do is to
> > detect the situation earlier and skip such loops.
>
> Or fix the broken code...
I'm not sure what you mean here, but the only alternative way to build
a correct DDG is to change parts of sched-deps.c to make the analysis
consistent with DF, and that change will affect all schedulers. And
we'll have two rather big problems with such a change:
1) Testing correctness and quality of the generated code in all
imaginable situations in all schedulers.
2) Potential compilation slowdown. The sched-deps analysis algorithm
implements heuristic ideas, while DF aims to be a more accurate but
slower analysis. Trying to keep them in sync could probably increase
sched-deps analysis time.
A year ago there was a relevant discussion about some unnecessary
inline asm dependencies in DF:
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg01056.html
Roman
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [4/4][PATCH] Discussing PR83507
2019-04-22 16:45 ` Roman Zhuykov
@ 2019-04-23 15:55 ` Segher Boessenkool
2019-04-29 16:56 ` Roman Zhuykov
0 siblings, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2019-04-23 15:55 UTC (permalink / raw)
To: Roman Zhuykov; +Cc: gcc-patches, Jakub Jelinek
Hi Roman,
On Mon, Apr 22, 2019 at 07:36:40PM +0300, Roman Zhuykov wrote:
> > > In pr84524.c we got a loop with an extended inline asm:
> > > asm volatile ("" : "+r" (v))
> > > which also gives us a "surprising" situation Alexander predicted.
> > >
> > > For the sched-deps scanner such a volatile asm is a "true barrier" and we
> > > create dependencies to almost all other instructions, while the DF
> > > scanners don't give us such information.
> >
> > There is no such thing as a "true barrier" in extended asm. The only thing
> > volatile asm means is that the asm has a side effect unknown to the compiler;
> > this can *not* be a modification of a register or of memory contents, such
> > things are known by the compiler, that's what clobbers and "memory" clobber
> > are about.
>
> In sched-deps.c we got:
> case ASM_OPERANDS:
> case ASM_INPUT:
> {
> /* Traditional and volatile asm instructions must be considered to use
> and clobber all hard registers, all pseudo-registers and all of
> memory. So must TRAP_IF and UNSPEC_VOLATILE operations.
>
> Consider for instance a volatile asm that changes the fpu rounding
> mode. An insn should not be moved across this even if it only uses
> pseudo-regs because it might give an incorrectly rounded result. */
> if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> && !DEBUG_INSN_P (insn))
> reg_pending_barrier = TRUE_BARRIER;
> ...
This code was added in 1997 (r14770). In 2004 the documentation was
changed to clarify how things really work (r88999):
"Note that even a volatile @code{asm} instruction can be moved relative to
other code, including across jump instructions."
(followed by an example exactly about what this means for FPU control).
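In the spirit of that documentation example, a hedged sketch
(hypothetical and simplified, not the docs' exact code): volatile alone
keeps the asm from being deleted, but without a data dependency the
division below may still be scheduled before it.

/* Illustrative only: pretend the asm changes the FPU rounding mode
   based on MODE.  Nothing ties the division to it, so the scheduler may
   move the division across the asm; threading A through the asm as an
   in/out operand is what would actually enforce the order.  */
static double
div_in_mode (double a, double b, int mode)
{
  __asm__ volatile ("" : : "r" (mode));
  return a / b;
}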
> I understand that the mentioned "changing fpu rounding mode" example may
> not be relevant in the current compiler. But we still get TRUE_BARRIER
> for all volatile asms.
> > > Maybe it is a good idea to somehow re-implement the modulo scheduler using only
> > > one scanner instead of two, but at the moment the best thing to do is to
> > > detect the situation earlier and skip such loops.
> >
> > Or fix the broken code...
>
> I'm not sure what you mean here,
I mean have the modulo scheduler implement the correct asm semantics, not
some more restrictive thing that gets it into conflicts with DF, etc.
I don't think this will turn out to be a problem in any way. Some invalid
asm will break, sure.
Segher
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [4/4][PATCH] Discussing PR83507
2019-04-23 15:55 ` Segher Boessenkool
@ 2019-04-29 16:56 ` Roman Zhuykov
2019-04-29 17:51 ` Segher Boessenkool
2019-04-30 15:14 ` Jeff Law
0 siblings, 2 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-29 16:56 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: gcc-patches, Jakub Jelinek
Hi, Segher
> > > > In pr84524.c we got a loop with an extended inline asm:
> > > > asm volatile ("" : "+r" (v))
> > > > which also gives us a "surprising" situation Alexander predicted.
> > > >
> > > > For sched-deps scanner such volatile asm is a “true barrier” and we
> > > > create dependencies to almost all other instructions, while DF scanners
> > > > don’t give us such information.
> > >
> > > There is no such thing as a "true barrier" in extended asm. The only thing
> > > volatile asm means is that the asm has a side effect unknown to the compiler;
> > > this can *not* be a modification of a register or of memory contents, such
> > > things are known by the compiler, that's what clobbers and "memory" clobber
> > > are about.
> >
> > In sched-deps.c we got:
> > case ASM_OPERANDS:
> > case ASM_INPUT:
> > {
> > /* Traditional and volatile asm instructions must be considered to use
> > and clobber all hard registers, all pseudo-registers and all of
> > memory. So must TRAP_IF and UNSPEC_VOLATILE operations.
> >
> > Consider for instance a volatile asm that changes the fpu rounding
> > mode. An insn should not be moved across this even if it only uses
> > pseudo-regs because it might give an incorrectly rounded result. */
> > if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> > && !DEBUG_INSN_P (insn))
> > reg_pending_barrier = TRUE_BARRIER;
> > ...
>
> This code was added in 1997 (r14770). In 2004 the documentation was
> changed to clarify how things really work (r88999):
>
> "Note that even a volatile @code{asm} instruction can be moved relative to
> other code, including across jump instructions."
>
> (followed by an example exactly about what this means for FPU control).
Thanks for pointing to those changes! Unfortunately, sched-deps.c has
been more conservative for these 15 years...
Let's try to fix it.
> > > > Maybe it is a good idea to somehow re-implement the modulo scheduler using only
> > > > one scanner instead of two, but at the moment the best thing to do is to
> > > > detect the situation earlier and skip such loops.
> > >
> > > Or fix the broken code...
> >
> > I'm not sure what you mean here,
>
> I mean have the modulo scheduler implement the correct asm semantics, not
> some more restrictive thing that gets it into conflicts with DF, etc.
>
> I don't think this will turn out to be a problem in any way. Some invalid
> asm will break, sure.
I have started with applying this without any SMS changes:
diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
--- a/gcc/sched-deps.c
+++ b/gcc/sched-deps.c
@@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx x, rtx_insn *insn)
case UNSPEC_VOLATILE:
flush_pending_lists (deps, insn, true, true);
+
+ if (!DEBUG_INSN_P (insn))
+ reg_pending_barrier = TRUE_BARRIER;
/* FALLTHRU */
case ASM_OPERANDS:
case ASM_INPUT:
{
- /* Traditional and volatile asm instructions must be considered to use
- and clobber all hard registers, all pseudo-registers and all of
- memory. So must TRAP_IF and UNSPEC_VOLATILE operations.
-
- Consider for instance a volatile asm that changes the fpu rounding
- mode. An insn should not be moved across this even if it only uses
- pseudo-regs because it might give an incorrectly rounded result. */
- if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
- && !DEBUG_INSN_P (insn))
- reg_pending_barrier = TRUE_BARRIER;
-
/* For all ASM_OPERANDS, we must traverse the vector of input operands.
We cannot just fall through here since then we would be confused
by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate
Regstrapping it on x86-64 shows some failures. The first is with the
ms-sysv abi test and can be solved like this:
diff --git a/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc b/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
--- a/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
+++ b/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
@@ -511,7 +511,7 @@ void make_do_test (const vector<class arg> &args,
/* End if init_test call. */
if (f.get_realign () && unaligned == 1)
- out << " __asm__ __volatile__ (\"subq $8,%%rsp\":::\"cc\");"
+ out << " __asm__ __volatile__ (\"subq
$8,%%rsp\":::\"cc\", \"memory\");"
<< endl;
out << " ret = do_test_"
@@ -525,7 +525,7 @@ void make_do_test (const vector<class arg> &args,
out << ");" << endl;
if (f.get_realign () && unaligned == 1)
- out << " __asm__ __volatile__ (\"addq $8,%%rsp\":::\"cc\");"
+ out << " __asm__ __volatile__ (\"addq
$8,%%rsp\":::\"cc\", \"memory\");"
<< endl;
out << " check_results (ret);" << endl;
Here we have some asms which manipulate the stack pointer (sigh!). That
certainly may break at any moment, but right now the "memory" clobber
fixes everything for me.
Another failure is:
FAIL: gcc.dg/guality/pr58791-4.c -O2 -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
FAIL: gcc.dg/guality/pr58791-4.c -O2 -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487
FAIL: gcc.dg/guality/pr58791-4.c -O3 -g -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
FAIL: gcc.dg/guality/pr58791-4.c -O3 -g -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487
FAIL: gcc.dg/guality/pr58791-4.c -Os -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
FAIL: gcc.dg/guality/pr58791-4.c -Os -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487
FAIL: gcc.dg/guality/pr58791-4.c -O2 -flto -fno-use-linker-plugin -flto-partition=none -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
FAIL: gcc.dg/guality/pr58791-4.c -O2 -flto -fno-use-linker-plugin -flto-partition=none -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487
FAIL: gcc.dg/guality/pr58791-4.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
FAIL: gcc.dg/guality/pr58791-4.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487
It is connected to debug info, and I cannot solve it myself. I am not
sure how this should work when we try to print a dead-code variable (i2)
while debugging an executable compiled with -O2 (-O3/-Os). Jakub created
that test; he is in CC already.
I will also look at aarch64 regstrap a bit later. But that job should
also be done for all other targets. Segher, could you test it on
rs6000?
> > > > In pr84524.c we got a loop with an extended inline asm:
> > > > asm volatile ("" : "+r" (v))
It seems a good idea to add a "memory" clobber into the asm and recheck
the DDG in this test on aarch64 with both the SMS and sched-deps patches
applied. I'll check it.
Roman
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [4/4][PATCH] Discussing PR83507
2019-04-29 16:56 ` Roman Zhuykov
@ 2019-04-29 17:51 ` Segher Boessenkool
2019-04-30 11:35 ` Roman Zhuykov
2019-04-30 15:14 ` Jeff Law
1 sibling, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2019-04-29 17:51 UTC (permalink / raw)
To: Roman Zhuykov; +Cc: gcc-patches, Jakub Jelinek
Hi!
On Mon, Apr 29, 2019 at 07:28:12PM +0300, Roman Zhuykov wrote:
> > This code was added in 1997 (r14770). In 2004 the documentation was
> > changed to clarify how things really work (r88999):
> >
> > "Note that even a volatile @code{asm} instruction can be moved relative to
> > other code, including across jump instructions."
> >
> > (followed by an example exactly about what this means for FPU control).
>
> Thanks for pointing to those changes! Unfortunately, sched-deps.c has
> been more conservative for these 15 years...
> Let's try to fix it.
If it causes problems now, that would be a good idea yes :-)
> > I mean have the modulo scheduler implement the correct asm semantics, not
> > some more restrictive thing that gets it into conflicts with DF, etc.
> >
> > I don't think this will turn out to be a problem in any way. Some invalid
> > asm will break, sure.
>
> I have started with applying this without any SMS changes:
>
> diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
> --- a/gcc/sched-deps.c
> +++ b/gcc/sched-deps.c
> @@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx x, rtx_insn *insn)
>
> case UNSPEC_VOLATILE:
> flush_pending_lists (deps, insn, true, true);
> +
> + if (!DEBUG_INSN_P (insn))
> + reg_pending_barrier = TRUE_BARRIER;
> /* FALLTHRU */
>
> case ASM_OPERANDS:
> case ASM_INPUT:
> {
> - /* Traditional and volatile asm instructions must be considered to use
> - and clobber all hard registers, all pseudo-registers and all of
> - memory. So must TRAP_IF and UNSPEC_VOLATILE operations.
> -
> - Consider for instance a volatile asm that changes the fpu rounding
> - mode. An insn should not be moved across this even if it only uses
> - pseudo-regs because it might give an incorrectly rounded result. */
> - if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> - && !DEBUG_INSN_P (insn))
> - reg_pending_barrier = TRUE_BARRIER;
> -
> /* For all ASM_OPERANDS, we must traverse the vector of input operands.
> We cannot just fall through here since then we would be confused
> by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate
UNSPEC_VOLATILE and volatile asm should have the same semantics, ideally.
One step at a time I guess :-)
> Regstrapping it on x86-64 shows some failures. The first is with the
> ms-sysv abi test and can be solved like this:
[ snip ]
> Here we have some asms which manipulate the stack pointer (sigh!). That
> certainly may break at any moment, but right now the "memory" clobber
> fixes everything for me.
Makes sense.
> Another failure is:
>
> FAIL: gcc.dg/guality/pr58791-4.c -O2 -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
[ snip ]
> It is connected to debug info, and I cannot solve it myself. I am not
> sure how this should work when we try to print a dead-code variable (i2)
> while debugging an executable compiled with -O2 (-O3/-Os). Jakub created
> that test; he is in CC already.
What does PREVENT_OPTIMIZATION do? It probably needs to be made a bit
stronger.
(It seems to just add some "used"?)
> I will also look at aarch64 regstrap a bit later. But that job should
> also be done for all other targets. Segher, could you test it on
> rs6000?
Do you have an account on the compile farm? It has POWER7 (BE, 32/64),
as well as POWER8 and POWER9 machines (both LE).
https://cfarm.tetaneutral.net/
But I'll do it if you want. Just let me know.
> > > > > In pr84524.c we got a loop with an extended inline asm:
> > > > > asm volatile ("" : "+r" (v))
>
> It seems a good idea to add a "memory" clobber into the asm and recheck
> the DDG in this test on aarch64 with both the SMS and sched-deps patches applied.
> I'll check it.
Segher
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [4/4][PATCH] Discussing PR83507
2019-04-29 17:51 ` Segher Boessenkool
@ 2019-04-30 11:35 ` Roman Zhuykov
0 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-30 11:35 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: gcc-patches, Jakub Jelinek
> > > This code was added in 1997 (r14770). In 2004 the documentation was
> > > changed to clarify how things really work (r88999):
> > >
> > > "Note that even a volatile @code{asm} instruction can be moved relative to
> > > other code, including across jump instructions."
> > >
> > > (followed by an example exactly about what this means for FPU control).
> >
> > Thanks for pointing to those changes! Unfortunately, sched-deps.c has
> > been more conservative for these 15 years...
> > Let's try to fix it.
>
> If it causes problems now, that would be a good idea yes :-)
>
> > > I mean have the modulo scheduler implement the correct asm semantics, not
> > > some more restrictive thing that gets it into conflicts with DF, etc.
> > >
> > > I don't think this will turn out to be a problem in any way. Some invalid
> > > asm will break, sure.
> >
> > I have started with applying this without any SMS changes:
> >
> > diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
> > --- a/gcc/sched-deps.c
> > +++ b/gcc/sched-deps.c
> > @@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx x, rtx_insn *insn)
> >
> > case UNSPEC_VOLATILE:
> > flush_pending_lists (deps, insn, true, true);
> > +
> > + if (!DEBUG_INSN_P (insn))
> > + reg_pending_barrier = TRUE_BARRIER;
> > /* FALLTHRU */
> >
> > case ASM_OPERANDS:
> > case ASM_INPUT:
> > {
> > - /* Traditional and volatile asm instructions must be considered to use
> > - and clobber all hard registers, all pseudo-registers and all of
> > - memory. So must TRAP_IF and UNSPEC_VOLATILE operations.
> > -
> > - Consider for instance a volatile asm that changes the fpu rounding
> > - mode. An insn should not be moved across this even if it only uses
> > - pseudo-regs because it might give an incorrectly rounded result. */
> > - if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> > - && !DEBUG_INSN_P (insn))
> > - reg_pending_barrier = TRUE_BARRIER;
> > -
> > /* For all ASM_OPERANDS, we must traverse the vector of input operands.
> > We cannot just fall through here since then we would be confused
> > by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate
>
> UNSPEC_VOLATILE and volatile asm should have the same semantics, ideally.
> One step at a time I guess :-)
When the barrier for UNSPEC_VOLATILE is also dropped, there are more
issues on x86_64:
FAIL: gcc.dg/vect/pr62021.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/pr62021.c execution test
FAIL: gcc.target/i386/avx2-vect-aggressive.c execution test
I haven't looked at those vectorization tests.
> > Regstrapping it on x86-64 shows some failures. The first is with the
> > ms-sysv abi test and can be solved like this:
>
> [ snip ]
>
> > Here we have some asms which manipulate the stack pointer (sigh!). That
> > certainly may break at any moment, but right now the "memory" clobber
> > fixes everything for me.
>
> Makes sense.
>
> > Another failure is:
> >
> > FAIL: gcc.dg/guality/pr58791-4.c -O2 -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
>
> [ snip ]
>
> > It is connected to debug info, and I cannot solve it myself. I am not
> > sure how this should work when we try to print a dead-code variable (i2)
> > while debugging an executable compiled with -O2 (-O3/-Os). Jakub created
> > that test; he is in CC already.
>
> What does PREVENT_OPTIMIZATION do? It probably needs to be made a bit
> stronger.
It seems the PREVENT_OPTIMIZATION stuff influences only one test,
gcc.dg/guality/pr45882.c. Other tests do not contain the ATTRIBUTE_USED
macro.
> (It seems to just add some "used"?)
>
> > I will also look at aarch64 regstrap a bit later. But that job should
> > also be done for all other targets. Segher, could you test it on
> > rs6000?
>
> Do you have an account on the compile farm? It has POWER7 (BE, 32/64),
> as well as POWER8 and POWER9 machines (both LE).
> https://cfarm.tetaneutral.net/
Ok, I'll use the farm. I have filled in the "new user" form already.
Aarch64 is still testing; maybe next time I will use the farm instead of
qemu-system.
I forgot to mention in the previous letter that with this patch we also
drop dependencies between volatile asms and volatile mems. We had some
offline discussion with Alex, and it seems that since 2004 the docs have
never guaranteed such a dependency; the user should add relevant
constraints into the asm in all cases. But somebody may have another
opinion about this tricky point.
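For illustration, a hedged sketch of such a constraint (hypothetical
code, not from the thread): making the volatile object an explicit
memory operand of the asm expresses the ordering directly, instead of
relying on every volatile asm acting as a scheduling barrier.

/* Illustrative only: FLAG is an assumed volatile variable.  The "+m"
   operand tells the compiler the asm may read and write it, so the asm
   stays ordered against other accesses to FLAG.  */
static volatile int flag;

static void
poke_flag (void)
{
  __asm__ volatile ("" : "+m" (flag));
}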
Roman
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [4/4][PATCH] Discussing PR83507
2019-04-29 16:56 ` Roman Zhuykov
2019-04-29 17:51 ` Segher Boessenkool
@ 2019-04-30 15:14 ` Jeff Law
1 sibling, 0 replies; 17+ messages in thread
From: Jeff Law @ 2019-04-30 15:14 UTC (permalink / raw)
To: Roman Zhuykov, Segher Boessenkool; +Cc: gcc-patches, Jakub Jelinek
On 4/29/19 10:28 AM, Roman Zhuykov wrote:
> Hi, Segher
>
>>>>> In pr84524.c we got a loop with an extended inline asm:
>>>>> asm volatile ("" : "+r" (v))
>>>>> which also gives us a "surprising" situation Alexander predicted.
>>>>>
>>>>> For the sched-deps scanner such a volatile asm is a "true barrier" and we
>>>>> create dependencies to almost all other instructions, while the DF
>>>>> scanners don't give us such information.
>>>>
>>>> There is no such thing as a "true barrier" in extended asm. The only thing
>>>> volatile asm means is that the asm has a side effect unknown to the compiler;
>>>> this can *not* be a modification of a register or of memory contents, such
>>>> things are known by the compiler, that's what clobbers and "memory" clobber
>>>> are about.
>>>
>>> In sched-deps.c we got:
>>> case ASM_OPERANDS:
>>> case ASM_INPUT:
>>> {
>>> /* Traditional and volatile asm instructions must be considered to use
>>> and clobber all hard registers, all pseudo-registers and all of
>>> memory. So must TRAP_IF and UNSPEC_VOLATILE operations.
>>>
>>> Consider for instance a volatile asm that changes the fpu rounding
>>> mode. An insn should not be moved across this even if it only uses
>>> pseudo-regs because it might give an incorrectly rounded result. */
>>> if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
>>> && !DEBUG_INSN_P (insn))
>>> reg_pending_barrier = TRUE_BARRIER;
>>> ...
>>
>> This code was added in 1997 (r14770). In 2004 the documentation was
>> changed to clarify how things really work (r88999):
>>
>> "Note that even a volatile @code{asm} instruction can be moved relative to
>> other code, including across jump instructions."
>>
>> (followed by an example exactly about what this means for FPU control).
>
> Thanks for pointing to those changes! Unfortunately, sched-deps.c has
> been more conservative for these 15 years...
> Let's try to fix it.
Also note that the integration of the Haifa scheduler was a large
change. A small detail like this could have been missed.
Jeff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [0/4] Addressing modulo-scheduling bugs
2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
` (3 preceding siblings ...)
2019-04-16 12:26 ` [4/4][PATCH] Discussing PR83507 Roman Zhuykov
@ 2019-04-22 19:16 ` Roman Zhuykov
2019-04-26 15:45 ` Jeff Law
4 siblings, 1 reply; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-22 19:16 UTC (permalink / raw)
To: gcc-patches
As a freshly appointed maintainer I'm ready to commit my own
modulo-sched patches, but decided to ask here if there are any
objections. Maybe I should ask for additional approval at this stage?
If not, I'll start tomorrow with committing patches 1/4 and 2/4, which
are well-formed regression fixes. Patch 3/4 doesn't include a test
example, and I don't know how to add one there, so I am ready to
postpone it to stage 1. Patch 4/4 technically does not fix a regression.
The first two patches can also be committed into the 8 branch. If no
discussion occurs, I'll commit them later this week, e.g. on Friday.
Roman
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [0/4] Addressing modulo-scheduling bugs
2019-04-22 19:16 ` [0/4] Addressing modulo-scheduling bugs Roman Zhuykov
@ 2019-04-26 15:45 ` Jeff Law
2019-04-26 16:41 ` Roman Zhuykov
0 siblings, 1 reply; 17+ messages in thread
From: Jeff Law @ 2019-04-26 15:45 UTC (permalink / raw)
To: Roman Zhuykov, gcc-patches
On 4/22/19 10:45 AM, Roman Zhuykov wrote:
> As a freshly appointed maintainer I'm ready to commit my own
> modulo-sched patches, but decided to ask here if there are any
> objections. Maybe I should ask for additional approval at this stage?
> If not, I'll start tomorrow with committing patches 1/4 and 2/4, which
> are well-formed regression fixes. Patch 3/4 doesn't include a test
> example, and I don't know how to add one there, so I am ready to
> postpone it to stage 1. Patch 4/4 technically does not fix a regression.
>
> The first two patches can also be committed into the 8 branch. If no
> discussion occurs, I'll commit them later this week, e.g. on Friday.
As a maintainer we trust you to make these decisions. Though if you
want guidance, mine would have been to go with #1 and #2 immediately and
postpone #3 and #4 for gcc-10. That appears to be precisely what you've
done.
Jeff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [0/4] Addressing modulo-scheduling bugs
2019-04-26 15:45 ` Jeff Law
@ 2019-04-26 16:41 ` Roman Zhuykov
0 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-26 16:41 UTC (permalink / raw)
To: Jeff Law; +Cc: gcc-patches
Hello, Jeff
> > As a freshly appointed maintainer I'm ready to commit my own
> > modulo-sched patches, but decided to ask here if there are any
> > objections. Maybe I should ask for additional approval at this stage?
> > If not, I'll start tomorrow with committing patches 1/4 and 2/4, which
> > are well-formed regression fixes. Patch 3/4 doesn't include a test
> > example, and I don't know how to add one there, so I am ready to
> > postpone it to stage 1. Patch 4/4 technically does not fix a regression.
> >
> > The first two patches can also be committed into the 8 branch. If no
> > discussion occurs, I'll commit them later this week, e.g. on Friday.
> As a maintainer we trust you to make these decisions. Though if you
> want guidance, mine would have been to go with #1 and #2 immediately and
> postpone #3 and #4 for gcc-10. That appears to be precisely what you've
> done.
Thanks! I was a bit afraid of doing something wrong in soon-to-be-frozen
code. Certainly Alex and Andrey are helping me with my first maintainer
steps :)
A few minutes ago I backported #1 and #2 into the 8 branch as r270609.
^ permalink raw reply [flat|nested] 17+ messages in thread