public inbox for gcc-patches@gcc.gnu.org
* [0/4] Addressing modulo-scheduling bugs
@ 2019-04-16 11:59 zhroma
  2019-04-16 12:03 ` [1/4][PATCH] Fix PR84032 Roman Zhuykov
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: zhroma @ 2019-04-16 11:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: gcc, Jakub Jelinek, Jan Hubicka

Over the last few days I’ve added comments and patches to almost all 
modulo-scheduler PRs filed in the last two years.  Here I want to 
discuss five PRs.  First of all, I have four patches which fix 
regressions, so technically I can ask for them to be applied right now 
in stage 4.  But we don’t have a maintainer for the modulo-sched pass, 
so I’m not sure about their fast approval.

The last one, PR90040, is a meta-bug I’ve created.  It consolidates 5-6 
other bugzilla PRs which are really about the same issue.  Unfortunately, 
I can’t solve that myself.  Jakub, Honza, would you kindly help me and 
take a look at that PR to provide any suggestions you might have?

All my patches were successfully bootstrapped and regtested on x86_64.  
I’ve also checked with cross-compilers to s390, spu, aarch64, arm, ia64, 
ppc and ppc64 that the regtests show no new regressions.  The same 
testing was done with -fmodulo-sched enabled, and again with 
-fmodulo-sched -fmodulo-sched-allow-regmoves enabled.  Moreover, all of 
that was also done on the 8 branch, with no new issues introduced.  So, 
since I’ve done really heavy testing,
IMO all four fixes can go into trunk and even the 8 branch.

When development moves into stage 1, and after the PR90040 issue is 
somehow fixed, I will post an updated version of the patchset described 
here:
https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01647.html
(the set has been maintained locally on all branches since 4.9, with a 
lot of regtesting).

Regarding the modulo scheduling maintainership: if my patches look good, 
I’m happy to volunteer to maintain the code, as I have looked at it a 
lot in recent years.  Please let me know what steps I should take to 
make that happen.

Roman


* [1/4][PATCH] Fix PR84032
  2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
@ 2019-04-16 12:03 ` Roman Zhuykov
  2019-04-16 12:05 ` [2/4][PATCH] Fix PR87979 Roman Zhuykov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-16 12:03 UTC (permalink / raw)
  To: gcc-patches

There is the following mistake in the logic behind the code.

We want to schedule a branch instruction only as the last instruction 
in a row.  But after the branch has been scheduled, when we add other 
instructions into the partial schedule, we sometimes allow them to be 
placed in the same row after the branch.

The issue shows up later, when we try to reschedule the branch into 
another row.  The algorithm there works like this:
(1) Remove the branch from the row where it is (say, the “previous row”).
(2) Try to insert it into the needed row.
(3) On success, continue scheduling the other instructions.
(4) When the insertion in (2) was not done, insert the branch back into 
the “previous row”; this insertion must certainly succeed, which is 
checked by an assertion.

But when at step (1) the branch is not last in its row, there is no 
guarantee that at step (4) we can insert it back, because there we only 
try the last-in-a-row position for it.

This patch solves the problem by completely preventing other 
instructions from being scheduled after the branch in the same row.
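
To make it concrete, here is a tiny standalone model (invented types and
data, not GCC’s) of the fixed column search: scanning a row left to
right, meeting the closing branch now always aborts the attempt, so
nothing can ever be placed after it.

#include <stdbool.h>
#include <stdio.h>

enum kind { INSN, JUMP };

/* Sketch of the patched ps_insn_find_column check: refuse any column
   once a jump has been seen in the row.  */
static bool
can_append (const enum kind *row, int len)
{
  for (int i = 0; i < len; i++)
    if (row[i] == JUMP)
      return false;  /* the new insn would end up after the branch */
  return true;
}

int
main (void)
{
  enum kind row[] = { INSN, JUMP };
  printf ("append allowed: %s\n", can_append (row, 2) ? "yes" : "no");
  return 0;
}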

I’ve described the patch testing in the cover letter.  OK for trunk?

gcc/ChangeLog:

2019-04-09  Roman Zhuykov  <zhroma@ispras.ru>

	PR rtl-optimization/84032
	* modulo-sched.c (ps_insn_find_column): Change condition so that
	branch will always be the last insn in a row inside partial schedule.

gcc/testsuite/ChangeLog:

2019-04-09  Roman Zhuykov  <zhroma@ispras.ru>

	PR rtl-optimization/84032
	* gcc.dg/pr84032.c: New test.


diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -2996,9 +2996,7 @@ ps_insn_find_column (partial_schedule_ptr ps, ps_insn_ptr ps_i,
              last_must_precede = next_ps_i;
          }
        /* The closing branch must be the last in the row.  */
-      if (must_precede
-	  && bitmap_bit_p (must_precede, next_ps_i->id)
-	  && JUMP_P (ps_rtl_insn (ps, next_ps_i->id)))
+      if (JUMP_P (ps_rtl_insn (ps, next_ps_i->id)))
  	return false;

         last_in_row = next_ps_i;
diff --git a/gcc/testsuite/gcc.dg/pr84032.c b/gcc/testsuite/gcc.dg/pr84032.c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr84032.c
@@ -0,0 +1,23 @@
+/* PR rtl-optimization/84032 */
+/* { dg-do compile } */
+/* { dg-options "-O1 -fmodulo-sched" } */
+/* { dg-additional-options "-mcpu=power6" { target { powerpc-*-* } } } */
+
+void
+yr (int cm)
+{
+  int ka = cm;
+
+  for (;;)
+    {
+      short int m0;
+
+      for (m0 = 0; m0 < 6; ++m0)
+        {
+          ka &= 1;
+          cm *= 2;
+        }
+
+      ka = (ka == 0) ? cm : 0;
+    }
+}


* [2/4][PATCH] Fix PR87979
  2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
  2019-04-16 12:03 ` [1/4][PATCH] Fix PR84032 Roman Zhuykov
@ 2019-04-16 12:05 ` Roman Zhuykov
  2019-04-16 12:08 ` [3/4][PATCH] Fix PR90001 Roman Zhuykov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-16 12:05 UTC (permalink / raw)
  To: gcc-patches

We divide by zero when we try to schedule a loop body in zero cycles: 
both the res_mii and rec_mii estimations equal zero.  We have to start 
with one cycle in this situation.
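
As an illustration, here is a minimal standalone sketch of the failure
mode, assuming the partial schedule indexes rows by “cycle mod ii” (the
SMODULO macro in modulo-sched.c): with mii == 0 the very first row
computation divides by zero.

#include <stdio.h>

#define SMODULO(x, y) ((x) % (y) < 0 ? ((x) % (y) + (y)) : (x) % (y))

int
main (void)
{
  int res_mii = 0, rec_mii = 0;            /* both estimations are zero */
  int mii = res_mii > rec_mii ? res_mii : rec_mii;
  mii = mii > 1 ? mii : 1;                 /* the fix: never start at ii == 0 */
  printf ("row of cycle 5 at ii=%d: %d\n", mii, SMODULO (5, mii));
  return 0;
}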

The same failure happens in the following tests on the ia64 platform 
with -fmodulo-sched enabled (with any of -O1, -O2, -Os):
gcc.dg/torture/pr82762.c
gcc.c-torture/execute/20170419-1.c

I’ve described the patch testing in the cover letter.  OK for trunk?

gcc/ChangeLog:

2019-04-08  Roman Zhuykov  <zhroma@ispras.ru>

	PR rtl-optimization/87979
	* modulo-sched.c (sms_schedule): The starting ii value "mii" should
	never equal zero.

gcc/testsuite/ChangeLog:

2019-04-08  Roman Zhuykov  <zhroma@ispras.ru>

	PR rtl-optimization/87979
	* gcc.dg/pr87979.c: New test.


diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -1602,6 +1602,7 @@ sms_schedule (void)
        mii = 1; /* Need to pass some estimate of mii.  */
        rec_mii = sms_order_nodes (g, mii, node_order, &max_asap);
        mii = MAX (res_MII (g), rec_mii);
+      mii = MAX (mii, 1);
        maxii = MAX (max_asap, MAXII_FACTOR * mii);

        if (dump_file)
diff --git a/gcc/testsuite/gcc.dg/pr87979.c b/gcc/testsuite/gcc.dg/pr87979.c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr87979.c
@@ -0,0 +1,11 @@
+/* PR rtl-optimization/87979 */
+/* { dg-do compile } */
+/* { dg-options "-Os -fmodulo-sched -fno-tree-loop-im" } */
+/* { dg-additional-options "-march=z196" { target { s390*-*-* } } } */
+
+void foo(void)
+{
+  static int m;
+  for (int i = 0; i < 10; ++i)
+    m++;
+}


* [3/4][PATCH] Fix PR90001
  2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
  2019-04-16 12:03 ` [1/4][PATCH] Fix PR84032 Roman Zhuykov
  2019-04-16 12:05 ` [2/4][PATCH] Fix PR87979 Roman Zhuykov
@ 2019-04-16 12:08 ` Roman Zhuykov
  2019-04-26 15:42   ` Jeff Law
  2019-04-16 12:26 ` [4/4][PATCH] Discussing PR83507 Roman Zhuykov
  2019-04-22 19:16 ` [0/4] Addressing modulo-scheduling bugs Roman Zhuykov
  4 siblings, 1 reply; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-16 12:08 UTC (permalink / raw)
  To: gcc-patches

The current algorithm, which finds recurrence_length for all DDG strongly 
connected components, works in roughly O(N^6) time, where N is the number 
of nodes in the DDG.  The time gets that bad mostly for graphs with lots 
of edges, close to N^2 of them.  My proposed algorithm works in O(N^3).  
The algorithm for finding the SCCs themselves is also not optimal (maybe 
up to O(N^4)), but it is left untouched here.

In some situations, when the number of edges is smaller (say, close to 
N), the new algorithm can unfortunately be slower than the old one.  But 
I think the better remedy there is to add some bail-out when we get more 
than, for example, 1000 nodes.

Before creating this patch, I tested a special version of it where both 
approaches were active and asserts were inserted to check that the 
algorithms’ results (the longest_simple_path values) are exactly the 
same.  I can publish this special version if needed.
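
For reference, here is a standalone sketch (toy graph, plain C, not GCC
code) of the Floyd-Warshall-style relaxation the patch performs per SCC.
Since backarcs (edges with distance > 0) are excluded, the subgraph is
acyclic and the O(N^3) loop computes true longest paths.

#include <stdio.h>

#define N 4

int
main (void)
{
  /* d[i][j]: longest latency path from i to j; -1 means "no path yet",
     as in the patch's max_dist arrays.  */
  int d[N][N];
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      d[i][j] = (i == j) ? 0 : -1;

  /* A toy SCC body: 0 -(2)-> 1 -(3)-> 2 -(1)-> 3, plus 0 -(4)-> 2.  */
  d[0][1] = 2; d[1][2] = 3; d[2][3] = 1; d[0][2] = 4;

  /* Relax paths through every intermediate node k.  */
  for (int k = 0; k < N; k++)
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        if (d[i][k] >= 0 && d[k][j] >= 0 && d[i][k] + d[k][j] > d[i][j])
          d[i][j] = d[i][k] + d[k][j];

  printf ("longest 0 -> 3: %d\n", d[0][3]);  /* prints 6 (= 2 + 3 + 1) */
  return 0;
}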

I’ve described the patch testing in the cover letter.  OK for trunk?

gcc/ChangeLog:

2019-04-08  Roman Zhuykov  <zhroma@ispras.ru>

	PR rtl-optimization/90001
	* ddg.c (create_ddg): Init max_dist array for each node.
	(free_ddg): Free max_dist array.
	(create_ddg_edge): Use bool field instead of aux union.
	(set_recurrence_length): Use prepared max_dist information instead
	of calling longest_simple_path.
	(create_scc): Remove graph argument, fill node's aux.count with
	SCC id, and move set_recurrence_length call to...
	(create_ddg_all_sccs): ...here, after filling all max_dist arrays
	using Floyd–Warshall-like algorithm.
	(update_dist_to_successors): Remove the whole function.
	(longest_simple_path): Likewise.
	* ddg.h (struct ddg_node): Add max_dist pointer.
	(struct ddg_edge): Use bool field instead of unused aux union.


diff --git a/gcc/ddg.c b/gcc/ddg.c
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -32,9 +32,6 @@ along with GCC; see the file COPYING3.  If not see

  #ifdef INSN_SCHEDULING

-/* A flag indicating that a ddg edge belongs to an SCC or not.  */
-enum edge_flag {NOT_IN_SCC = 0, IN_SCC};
-
  /* Forward declarations.  */
  static void add_backarc_to_ddg (ddg_ptr, ddg_edge_ptr);
  static void add_backarc_to_scc (ddg_scc_ptr, ddg_edge_ptr);
@@ -564,7 +561,7 @@ create_ddg (basic_block bb, int closing_branch_deps)
  {
    ddg_ptr g;
    rtx_insn *insn, *first_note;
-  int i;
+  int i, j;
    int num_nodes = 0;

    g = (ddg_ptr) xcalloc (1, sizeof (struct ddg));
@@ -632,6 +629,12 @@ create_ddg (basic_block bb, int closing_branch_deps)
        g->nodes[i].predecessors = sbitmap_alloc (num_nodes);
        bitmap_clear (g->nodes[i].predecessors);
        g->nodes[i].first_note = (first_note ? first_note : insn);
+
+      g->nodes[i].aux.count = -1;
+      g->nodes[i].max_dist = XCNEWVEC (int, num_nodes);
+      for (j = 0; j < num_nodes; j++)
+         g->nodes[i].max_dist[j] = -1;
+
        g->nodes[i++].insn = insn;
        first_note = NULL;
      }
@@ -668,6 +671,7 @@ free_ddg (ddg_ptr g)
  	}
        sbitmap_free (g->nodes[i].successors);
        sbitmap_free (g->nodes[i].predecessors);
+      free (g->nodes[i].max_dist);
      }
    if (g->num_backarcs > 0)
      free (g->backarcs);
@@ -792,7 +796,7 @@ create_ddg_edge (ddg_node_ptr src, ddg_node_ptr dest,
    e->latency = l;
    e->distance = d;
    e->next_in = e->next_out = NULL;
-  e->aux.info = 0;
+  e->in_scc = false;
    return e;
  }

@@ -820,7 +824,7 @@ add_edge_to_ddg (ddg_ptr g ATTRIBUTE_UNUSED, ddg_edge_ptr e)
     for now that cycles in the data dependence graph contain a single backarc.
     This simplifies the algorithm, and can be generalized later.  */
  static void
-set_recurrence_length (ddg_scc_ptr scc, ddg_ptr g)
+set_recurrence_length (ddg_scc_ptr scc)
  {
    int j;
    int result = -1;
@@ -828,17 +832,14 @@ set_recurrence_length (ddg_scc_ptr scc, ddg_ptr g)
    for (j = 0; j < scc->num_backarcs; j++)
      {
        ddg_edge_ptr backarc = scc->backarcs[j];
-      int length;
        int distance = backarc->distance;
        ddg_node_ptr src = backarc->dest;
        ddg_node_ptr dest = backarc->src;
+      int length = src->max_dist[dest->cuid];
+
+      if (length < 0)
+        continue;

-      length = longest_simple_path (g, src->cuid, dest->cuid, scc->nodes);
-      if (length < 0 )
-	{
-	  /* fprintf (stderr, "Backarc not on simple cycle in SCC.\n"); */
-	  continue;
-	}
        length += backarc->latency;
        result = MAX (result, (length / distance));
      }
@@ -846,9 +847,9 @@ set_recurrence_length (ddg_scc_ptr scc, ddg_ptr g)
  }

  /* Create a new SCC given the set of its nodes.  Compute its recurrence_length
-   and mark edges that belong to this scc as IN_SCC.  */
+   and mark edges that belong to this scc.  */
  static ddg_scc_ptr
-create_scc (ddg_ptr g, sbitmap nodes)
+create_scc (ddg_ptr g, sbitmap nodes, int id)
  {
    ddg_scc_ptr scc;
    unsigned int u = 0;
@@ -866,16 +867,18 @@ create_scc (ddg_ptr g, sbitmap nodes)
        ddg_edge_ptr e;
        ddg_node_ptr n = &g->nodes[u];

+      gcc_assert (n->aux.count == -1);
+      n->aux.count = id;
+
        for (e = n->out; e; e = e->next_out)
  	if (bitmap_bit_p (nodes, e->dest->cuid))
  	  {
-	    e->aux.count = IN_SCC;
+	    e->in_scc = true;
  	    if (e->distance > 0)
  	      add_backarc_to_scc (scc, e);
  	  }
      }

-  set_recurrence_length (scc, g);
    return scc;
  }

@@ -1018,7 +1021,7 @@ check_sccs (ddg_all_sccs_ptr sccs, int num_nodes)
  ddg_all_sccs_ptr
  create_ddg_all_sccs (ddg_ptr g)
  {
-  int i;
+  int i, j, k, scc, way;
    int num_nodes = g->num_nodes;
    auto_sbitmap from (num_nodes);
    auto_sbitmap to (num_nodes);
@@ -1038,7 +1041,7 @@ create_ddg_all_sccs (ddg_ptr g)
        ddg_node_ptr dest = backarc->dest;

        /* If the backarc already belongs to an SCC, continue.  */
-      if (backarc->aux.count == IN_SCC)
+      if (backarc->in_scc)
  	continue;

        bitmap_clear (scc_nodes);
@@ -1049,10 +1052,52 @@ create_ddg_all_sccs (ddg_ptr g)

        if (find_nodes_on_paths (scc_nodes, g, from, to))
  	{
-	  scc = create_scc (g, scc_nodes);
+	  scc = create_scc (g, scc_nodes, sccs->num_sccs);
  	  add_scc_to_ddg (sccs, scc);
  	}
      }
+
+  /* Init max_dist arrays for Floyd–Warshall-like
+     longest path calculation algorithm.  */
+  for (k = 0; k < num_nodes; k++)
+    {
+      ddg_edge_ptr e;
+      ddg_node_ptr n = &g->nodes[k];
+
+      if (n->aux.count == -1)
+        continue;
+
+      n->max_dist[k] = 0;
+      for (e = n->out; e; e = e->next_out)
+        if (e->distance == 0 && g->nodes[e->dest->cuid].aux.count == n->aux.count)
+          n->max_dist[e->dest->cuid] = e->latency;
+    }
+
+  /* Run main Floyd-Warshall loop.  We use only non-backarc edges
+     inside each scc.  */
+  for (k = 0; k < num_nodes; k++)
+    {
+      scc = g->nodes[k].aux.count;
+      if (scc != -1)
+        {
+          for (i = 0; i < num_nodes; i++)
+            if (g->nodes[i].aux.count == scc)
+              for (j = 0; j < num_nodes; j++)
+                if (g->nodes[j].aux.count == scc
+                    && g->nodes[i].max_dist[k] >= 0
+                    && g->nodes[k].max_dist[j] >= 0)
+                  {
+                    way = g->nodes[i].max_dist[k] + g->nodes[k].max_dist[j];
+                    if (g->nodes[i].max_dist[j] < way)
+                      g->nodes[i].max_dist[j] = way;
+                  }
+        }
+    }
+
+  /* Calculate recurrence_length using max_dist info.  */
+  for (i = 0; i < sccs->num_sccs; i++)
+    set_recurrence_length (sccs->sccs[i]);
+
    order_sccs (sccs);

    if (flag_checking)
@@ -1155,72 +1200,4 @@ find_nodes_on_paths (sbitmap result, ddg_ptr g, sbitmap from, sbitmap to)
    return bitmap_and (result, reachable_from, reach_to);
  }

-
-/* Updates the counts of U_NODE's successors (that belong to NODES) to be
-   at-least as large as the count of U_NODE plus the latency between them.
-   Sets a bit in TMP for each successor whose count was changed (increased).
-   Returns nonzero if any count was changed.  */
-static int
-update_dist_to_successors (ddg_node_ptr u_node, sbitmap nodes, sbitmap tmp)
-{
-  ddg_edge_ptr e;
-  int result = 0;
-
-  for (e = u_node->out; e; e = e->next_out)
-    {
-      ddg_node_ptr v_node = e->dest;
-      int v = v_node->cuid;
-
-      if (bitmap_bit_p (nodes, v)
-	  && (e->distance == 0)
-	  && (v_node->aux.count < u_node->aux.count + e->latency))
-	{
-	  v_node->aux.count = u_node->aux.count + e->latency;
-	  bitmap_set_bit (tmp, v);
-	  result = 1;
-	}
-    }
-  return result;
-}
-
-
-/* Find the length of a longest path from SRC to DEST in G,
-   going only through NODES, and disregarding backarcs.  */
-int
-longest_simple_path (struct ddg * g, int src, int dest, sbitmap nodes)
-{
-  int i;
-  unsigned int u = 0;
-  int change = 1;
-  int num_nodes = g->num_nodes;
-  auto_sbitmap workset (num_nodes);
-  auto_sbitmap tmp (num_nodes);
-  for (i = 0; i < g->num_nodes; i++)
-    g->nodes[i].aux.count = -1;
-  g->nodes[src].aux.count = 0;
-
-  bitmap_clear (tmp);
-  bitmap_set_bit (tmp, src);
-
-  while (change)
-    {
-      sbitmap_iterator sbi;
-
-      change = 0;
-      bitmap_copy (workset, tmp);
-      bitmap_clear (tmp);
-      EXECUTE_IF_SET_IN_BITMAP (workset, 0, u, sbi)
-	{
-	  ddg_node_ptr u_node = &g->nodes[u];
-
-	  change |= update_dist_to_successors (u_node, nodes, tmp);
-	}
-    }
-  return g->nodes[dest].aux.count;
-}
-
  #endif /* INSN_SCHEDULING */
diff --git a/gcc/ddg.h b/gcc/ddg.h
--- a/gcc/ddg.h
+++ b/gcc/ddg.h
@@ -64,6 +64,10 @@ struct ddg_node
    sbitmap successors;
    sbitmap predecessors;

+  /* Temporary array used for Floyd-Warshall algorithm to find
+     scc recurrence length.  */
+  int *max_dist;
+
    /* For general use by algorithms manipulating the ddg.  */
    union {
      int count;
@@ -95,11 +99,8 @@ struct ddg_edge
    ddg_edge_ptr next_in;
    ddg_edge_ptr next_out;

-  /* For general use by algorithms manipulating the ddg.  */
-  union {
-    int count;
-    void *info;
-  } aux;
+  /* Is true when edge is already in scc.  */
+  bool in_scc;
  };

  /* This structure holds the Data Dependence Graph for a basic block.  */
@@ -178,7 +179,6 @@ ddg_all_sccs_ptr create_ddg_all_sccs (ddg_ptr);
  void free_ddg_all_sccs (ddg_all_sccs_ptr);

  int find_nodes_on_paths (sbitmap result, ddg_ptr, sbitmap from, sbitmap to);
-int longest_simple_path (ddg_ptr, int from, int to, sbitmap via);

  bool autoinc_var_is_used_p (rtx_insn *, rtx_insn *);



* [4/4][PATCH] Discussing PR83507
  2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
                   ` (2 preceding siblings ...)
  2019-04-16 12:08 ` [3/4][PATCH] Fix PR90001 Roman Zhuykov
@ 2019-04-16 12:26 ` Roman Zhuykov
  2019-04-16 15:10   ` Segher Boessenkool
  2019-04-22 19:16 ` [0/4] Addressing modulo-scheduling bugs Roman Zhuykov
  4 siblings, 1 reply; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-16 12:26 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek

Unfortunately, this issue was not solved correctly.  In that example we 
don’t have -fmodulo-sched-allow-regmoves enabled, so we should not 
create any register moves at all.

I’ll describe here everything I found while looking into the issue.  As 
a first step, I patched the trunk compiler like this:

diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -735,6 +735,7 @@ schedule_reg_moves (partial_schedule_ptr ps)
         continue;

        /* Create NREG_MOVES register moves.  */
+      gcc_assert (flag_modulo_sched_allow_regmoves);
        first_move = ps->reg_moves.length ();
        ps->reg_moves.safe_grow_cleared (first_move + nreg_moves);
        extend_node_sched_params (ps);
@@ -744,8 +745,7 @@ schedule_reg_moves (partial_schedule_ptr ps)

        /* Generate each move.  */
        old_reg = prev_reg = SET_DEST (set);
-      if (HARD_REGISTER_P (old_reg))
-       return false;
+      gcc_assert (!HARD_REGISTER_P (old_reg));

        for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
         {

On current trunk this patch doesn’t produce any failures on powerpc, but 
I’ve checked other platforms and found gcc.c-torture/execute/pr84524.c 
ICEing with -O1 -fmodulo-sched on aarch64.  As for sms-13.c, on current 
trunk it no longer contains a loop suitable for modulo scheduling.  
Changes in other passes caused this, so that file actually tests nothing 
right now.

Then I added the 8 branch to my testing, and decided first to find a 
proper way to fix sms-13.c on the 8 branch.

Without -fmodulo-sched-allow-regmoves, each TRUE_DEP edge should be 
paired with another (sometimes ANTI_DEP or OUTPUT_DEP) edge in the 
opposite direction.  My assertion failure means that the DDG was 
malformed.

As Alexander mentioned here
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00028.html
“SMS uses sched-deps for intra-loop deps, and then separately uses DF 
for cross-iteration deps, which means that it should be ready for 
surprises when the two scanners are not 100% in sync.”

The situation here is one of those surprises, and debugging shows that 
we lose one of the dependencies here.

Consider these two instructions:
(insn 30 29 31 8 (parallel [
             (set (reg:SI 142)
                 (minus:SI (reg/v:SI 133 [ x+-2 ])
                     (reg/v:SI 130 [ b+-2 ])))
             (set (reg:SI 76 ca)
                 (leu:SI (reg/v:SI 130 [ b+-2 ])
                     (reg/v:SI 133 [ x+-2 ])))
         ]) "test.c":22 107 {subfsi3_carry}
      (expr_list:REG_UNUSED (reg:SI 142)
         (nil)))

(insn 31 30 32 8 (parallel [
             (set (reg:SI 143)
                 (plus:SI (reg:SI 76 ca)
                     (const_int -1 [0xffffffffffffffff])))
             (clobber (reg:SI 76 ca))
         ]) "test.c":22 116 {subfsi3_carry_in_xx}
      (expr_list:REG_DEAD (reg:SI 76 ca)
         (expr_list:REG_UNUSED (reg:SI 76 ca)
             (nil))))

Sched-deps creates the obvious 30->31 true dependency, and we want DF to 
help us with the 31->30 output dependency here.  CA is clobbered in insn 
31, and if we move insn 31 forward we have to stop at insn 30 of the 
next iteration.

build_inter_loop_deps doesn’t create such an output dependency because 
the reaching-definitions set “rd_bb_info->gen” doesn’t contain the CA 
register, maybe because it is a clobber, not a set.  So we need to 
create that dependency somehow, and I’ve found that the df_live set can 
be used here instead.

I proceeded with the following change:

diff --git a/gcc/ddg.c b/gcc/ddg.c
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -365,16 +365,14 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
  static void
  build_inter_loop_deps (ddg_ptr g)
  {
-  unsigned rd_num;
-  struct df_rd_bb_info *rd_bb_info;
+  unsigned regno;
    bitmap_iterator bi;
-
-  rd_bb_info = DF_RD_BB_INFO (g->bb);
+  struct df_live_bb_info *live_bb_info = DF_LIVE_BB_INFO (g->bb);

    /* Find inter-loop register output, true and anti deps.  */
-  EXECUTE_IF_SET_IN_BITMAP (&rd_bb_info->gen, 0, rd_num, bi)
+  EXECUTE_IF_SET_IN_BITMAP (&live_bb_info->gen, 0, regno, bi)
    {
-    df_ref rd = DF_DEFS_GET (rd_num);
+    df_ref rd = df_bb_regno_last_def_find (g->bb, regno);

      add_cross_iteration_register_deps (g, rd);
    }

(Here I skipped minor hunks which properly add/remove the df_live 
problem in modulo-sched.c when optimize == 1.)

This fix corrected sms-13.c on the 8 branch, but pr84524.c mentioned 
above was still ICEing on both trunk and the 8 branch.  No other 
failures were introduced while testing this intermediate patch on 
different branches and platforms.  The tested patch also temporarily 
contained additional checks that the only difference between the 
“rd_bb_info->gen” and “live_bb_info->gen” sets may be additional hard 
registers in the live set.

In pr84524.c we got a loop with an extended inline asm:
asm volatile ("" : "+r" (v))
which also gives us the kind of “surprising” situation Alexander 
predicted.

For the sched-deps scanner such a volatile asm is a “true barrier” and 
we create dependencies to almost all other instructions, while the DF 
scanners don’t give us that information.

Maybe it would be a good idea to somehow reimplement the modulo 
scheduler using only one scanner instead of two, but at the moment the 
best thing to do is to detect the situation earlier and skip such loops 
(an example loop of that shape is shown after the hunks below).

So, the last two hunks are:

@@ -1473,6 +1478,8 @@ sms_schedule (void)

          if (CALL_P (insn)
              || BARRIER_P (insn)
+            || (NONDEBUG_INSN_P (insn) && extract_asm_operands (PATTERN (insn))
+                && MEM_VOLATILE_P (extract_asm_operands (PATTERN (insn))))
             || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                 && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
                 && !reg_mentioned_p (count_reg, insn))
@@ -1489,6 +1496,11 @@ sms_schedule (void)
                 fprintf (dump_file, "SMS loop-with-call\n");
               else if (BARRIER_P (insn))
                 fprintf (dump_file, "SMS loop-with-barrier\n");
+             else if (NONDEBUG_INSN_P (insn)
+                      && extract_asm_operands (PATTERN (insn))
+                      && MEM_VOLATILE_P (extract_asm_operands
+                                               (PATTERN (insn))))
+               fprintf (dump_file, "SMS loop-with-volatile-asm\n");
               else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                 && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
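
For reference, a hypothetical loop of the shape that now gets skipped
(modeled on pr84524.c, not the actual test):

/* The volatile asm is a scheduling barrier for sched-deps, so with the
   patch SMS rejects the loop up front instead of building a DDG that
   disagrees with the DF-based cross-iteration analysis.  */
void
f (int *a, int n, int v)
{
  for (int i = 0; i < n; i++)
    {
      asm volatile ("" : "+r" (v));
      a[i] += v;
    }
}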

Patch testing is addressed in my cover letter.  OK for trunk?

gcc/ChangeLog:

2019-04-10  Roman Zhuykov  <zhroma@ispras.ru>

	PR target/83507
	* ddg.c (build_inter_loop_deps): Use live_bb_info instead of
	rd_bb_info to search last register definitions.
	* modulo-sched.c (schedule_reg_moves): Add an assertion which
	prevents creating register moves when it is not allowed.
	(sms_schedule): Add/remove df_live problem when optimize == 1.
	Bail out when there is a volatile asm inside the loop.


diff --git a/gcc/ddg.c b/gcc/ddg.c
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -365,16 +365,14 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
  static void
  build_inter_loop_deps (ddg_ptr g)
  {
-  unsigned rd_num;
-  struct df_rd_bb_info *rd_bb_info;
+  unsigned regno;
    bitmap_iterator bi;
-
-  rd_bb_info = DF_RD_BB_INFO (g->bb);
+  struct df_live_bb_info *live_bb_info = DF_LIVE_BB_INFO (g->bb);

    /* Find inter-loop register output, true and anti deps.  */
-  EXECUTE_IF_SET_IN_BITMAP (&rd_bb_info->gen, 0, rd_num, bi)
+  EXECUTE_IF_SET_IN_BITMAP (&live_bb_info->gen, 0, regno, bi)
    {
-    df_ref rd = DF_DEFS_GET (rd_num);
+    df_ref rd = df_bb_regno_last_def_find (g->bb, regno);

      add_cross_iteration_register_deps (g, rd);
    }
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -735,6 +735,7 @@ schedule_reg_moves (partial_schedule_ptr ps)
        /* Create NREG_MOVES register moves.  */
+      gcc_assert (flag_modulo_sched_allow_regmoves);
        first_move = ps->reg_moves.length ();
        ps->reg_moves.safe_grow_cleared (first_move + nreg_moves);
        extend_node_sched_params (ps);
@@ -744,8 +745,7 @@ schedule_reg_moves (partial_schedule_ptr ps)

        /* Generate each move.  */
        old_reg = prev_reg = SET_DEST (set);
-      if (HARD_REGISTER_P (old_reg))
-	return false;
+      gcc_assert (!HARD_REGISTER_P (old_reg));

        for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
  	{
@@ -1371,7 +1371,12 @@ sms_schedule (void)
    else
      issue_rate = 1;

-  /* Initialize the scheduler.  */
+  /* Initialize the scheduler; we need live info even at O1.  */
+  if (optimize == 1)
+    {
+      df_live_add_problem ();
+      df_live_set_all_dirty ();
+    }
    setup_sched_infos ();
    haifa_sched_init ();

@@ -1473,6 +1478,8 @@ sms_schedule (void)

          if (CALL_P (insn)
              || BARRIER_P (insn)
+            || (NONDEBUG_INSN_P (insn) && extract_asm_operands (PATTERN (insn))
+                && MEM_VOLATILE_P (extract_asm_operands (PATTERN (insn))))
             || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                 && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
                 && !reg_mentioned_p (count_reg, insn))
@@ -1489,6 +1496,11 @@ sms_schedule (void)
  		fprintf (dump_file, "SMS loop-with-call\n");
  	      else if (BARRIER_P (insn))
  		fprintf (dump_file, "SMS loop-with-barrier\n");
+	      else if (NONDEBUG_INSN_P (insn)
+		       && extract_asm_operands (PATTERN (insn))
+		       && MEM_VOLATILE_P (extract_asm_operands
+						(PATTERN (insn))))
+		fprintf (dump_file, "SMS loop-with-volatile-asm\n");
               else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                 && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
                  fprintf (dump_file, "SMS loop-with-not-single-set\n");
@@ -1752,6 +1764,9 @@ sms_schedule (void)
    /* Release scheduler data, needed until now because of DFA.  */
    haifa_sched_finish ();
    loop_optimizer_finalize ();
+
+  if (optimize == 1)
+    df_remove_problem (df_live);
  }

  /* The SMS scheduling algorithm itself


* Re: [4/4][PATCH] Discussing PR83507
  2019-04-16 12:26 ` [4/4][PATCH] Discussing PR83507 Roman Zhuykov
@ 2019-04-16 15:10   ` Segher Boessenkool
  2019-04-22 16:45     ` Roman Zhuykov
  0 siblings, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2019-04-16 15:10 UTC (permalink / raw)
  To: Roman Zhuykov; +Cc: gcc-patches, Jakub Jelinek

On Tue, Apr 16, 2019 at 03:08:23PM +0300, Roman Zhuykov wrote:
> Unfortunately, this issue was not solved correctly.  In that example we 
> don’t have -fmodulo-sched-allow-regmoves enabled, so we should not 
> create any register moves at all.

Yes, but if we do for whatever reason, we should never create register
moves of hard registers.  Because that is incorrect (there may not be
insns to do it).  It's a separate issue.

You're extending Jakub's patch here, not replacing it, so that's fine.

> In pr84524.c we got a loop with an extended inline asm:
> asm volatile ("" : "+r" (v))
> which also gives us the kind of “surprising” situation Alexander predicted.
> 
> For the sched-deps scanner such a volatile asm is a “true barrier” and we 
> create dependencies to almost all other instructions, while the DF scanners 
> don’t give us that information.

There is no such thing as a "true barrier" in extended asm.  The only thing
volatile asm means is that the asm has a side effect unknown to the compiler;
this can *not* be a modification of a register or of memory contents, such
things are known by the compiler, that's what clobbers and "memory" clobber
are about.
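
For instance (a made-up example), these all have some unknown side
effect, but what the compiler knows the asm can touch differs:

static void
example (int v)
{
  asm volatile ("" : "+r" (v));               /* only v is known to change */
  asm volatile ("" : "+r" (v) : : "cc");      /* ... and the condition codes */
  asm volatile ("" : "+r" (v) : : "memory");  /* ... and any memory contents */
}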

> Maybe it would be a good idea to somehow reimplement the modulo scheduler 
> using only one scanner instead of two, but at the moment the best thing to 
> do is to detect the situation earlier and skip such loops.

Or fix the broken code...


Segher


* Re: [4/4][PATCH] Discussing PR83507
  2019-04-16 15:10   ` Segher Boessenkool
@ 2019-04-22 16:45     ` Roman Zhuykov
  2019-04-23 15:55       ` Segher Boessenkool
  0 siblings, 1 reply; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-22 16:45 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, Jakub Jelinek

> > Unfortunately, this issue was not solved correctly.  In that example we
> > don’t have -fmodulo-sched-allow-regmoves enabled, so we should not
> > create any register moves at all.
>
> Yes, but if we do for whatever reason, we should never create register
> moves of hard registers.  Because that is incorrect (there may not be
> insns to do it).  It's a separate issue.
>
> You're extending Jakub's patch here, not replacing it, so that's fine.

Certainly, the patch contains assertions to catch both situations: when
we try to create a reg_move although it wasn’t allowed at all, and when
we try to create a reg_move for a hard register.  In theory, preventing
both of them is easily achieved once the SMS algorithm gets an
absolutely correct DDG as its input.

> > In pr84524.c we got a loop with an extended inline asm:
> > asm volatile ("" : "+r" (v))
> > which also gives us the kind of “surprising” situation Alexander predicted.
> >
> > For the sched-deps scanner such a volatile asm is a “true barrier” and we
> > create dependencies to almost all other instructions, while the DF scanners
> > don’t give us that information.
>
> There is no such thing as a "true barrier" in extended asm.  The only thing
> volatile asm means is that the asm has a side effect unknown to the compiler;
> this can *not* be a modification of a register or of memory contents, such
> things are known by the compiler, that's what clobbers and "memory" clobber
> are about.

In sched-deps.c we got:
    case ASM_OPERANDS:
    case ASM_INPUT:
      {
       /* Traditional and volatile asm instructions must be considered to use
          and clobber all hard registers, all pseudo-registers and all of
          memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.

          Consider for instance a volatile asm that changes the fpu rounding
          mode.  An insn should not be moved across this even if it only uses
          pseudo-regs because it might give an incorrectly rounded result.  */
       if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
           && !DEBUG_INSN_P (insn))
         reg_pending_barrier = TRUE_BARRIER;
       ...

I understand that the mentioned “changing fpu rounding mode” example may
not be relevant in the current compiler.  But we still get TRUE_BARRIER
for all volatile asms.

> > Maybe it would be a good idea to somehow reimplement the modulo scheduler
> > using only one scanner instead of two, but at the moment the best thing to
> > do is to detect the situation earlier and skip such loops.
>
> Or fix the broken code...

I’m not sure what you mean here, but the only alternative way to build a
correct DDG is to change parts of sched-deps.c to make its analysis
consistent with DF, and that change will affect all schedulers.  And
we’ll have two rather big problems with such a change:
1) Testing the correctness and quality of generated code in all
imaginable situations in all schedulers.
2) Potential compilation slowdown.  The sched-deps analysis algorithm
implements heuristic ideas, while DF aims to be a more accurate but
slower analysis.  Trying to keep them in sync could probably increase
sched-deps analysis time.

A year ago there was a relevant discussion about some unnecessary
inline asm dependencies in DF:
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg01056.html

Roman


* Re: [0/4] Addressing modulo-scheduling bugs
  2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
                   ` (3 preceding siblings ...)
  2019-04-16 12:26 ` [4/4][PATCH] Discussing PR83507 Roman Zhuykov
@ 2019-04-22 19:16 ` Roman Zhuykov
  2019-04-26 15:45   ` Jeff Law
  4 siblings, 1 reply; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-22 19:16 UTC (permalink / raw)
  To: gcc-patches

As a freshly appointed maintainer I’m ready to commit my own 
modulo-sched patches, but decided to ask here whether there are any 
objections.  Maybe I should ask for additional approval at this stage?  
If not, I’ll start tomorrow by committing patches 1/4 and 2/4, which are 
well-formed regression fixes.  Patch 3/4 doesn’t include a test example, 
and I don’t know how to add one there, so I am ready to postpone it to 
stage 1.  Patch 4/4 is technically not fixing a regression.

The first two patches can also be committed into the 8 branch.  If no 
discussion occurs, I’ll commit them later this week, e.g. on Friday.

Roman


* Re: [4/4][PATCH] Discussing PR83507
  2019-04-22 16:45     ` Roman Zhuykov
@ 2019-04-23 15:55       ` Segher Boessenkool
  2019-04-29 16:56         ` Roman Zhuykov
  0 siblings, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2019-04-23 15:55 UTC (permalink / raw)
  To: Roman Zhuykov; +Cc: gcc-patches, Jakub Jelinek

Hi Roman,

On Mon, Apr 22, 2019 at 07:36:40PM +0300, Roman Zhuykov wrote:
> > > In pr84524.c we got a loop with an extended inline asm:
> > > asm volatile ("" : "+r" (v))
> > > which also gives us the kind of “surprising” situation Alexander predicted.
> > >
> > > For the sched-deps scanner such a volatile asm is a “true barrier” and we
> > > create dependencies to almost all other instructions, while the DF scanners
> > > don’t give us that information.
> >
> > There is no such thing as a "true barrier" in extended asm.  The only thing
> > volatile asm means is that the asm has a side effect unknown to the compiler;
> > this can *not* be a modification of a register or of memory contents, such
> > things are known by the compiler, that's what clobbers and "memory" clobber
> > are about.
> 
> In sched-deps.c we got:
>     case ASM_OPERANDS:
>     case ASM_INPUT:
>       {
>        /* Traditional and volatile asm instructions must be considered to use
>           and clobber all hard registers, all pseudo-registers and all of
>           memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
> 
>           Consider for instance a volatile asm that changes the fpu rounding
>           mode.  An insn should not be moved across this even if it only uses
>           pseudo-regs because it might give an incorrectly rounded result.  */
>        if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
>            && !DEBUG_INSN_P (insn))
>          reg_pending_barrier = TRUE_BARRIER;
>        ...

This code was added in 1997 (r14770).  In 2004 the documentation was
changed to clarify how things really work (r88999):

"Note that even a volatile @code{asm} instruction can be moved relative to
other code, including across jump instructions."

(followed by an example exactly about what this means for FPU control).

> I understand that the mentioned “changing fpu rounding mode” example may
> not be relevant in the current compiler.  But we still get TRUE_BARRIER
> for all volatile asms.

> > > Maybe it would be a good idea to somehow reimplement the modulo scheduler
> > > using only one scanner instead of two, but at the moment the best thing to
> > > do is to detect the situation earlier and skip such loops.
> >
> > Or fix the broken code...
> 
> I’m not sure what you mean here,

I mean have the modulo scheduler implement the correct asm semantics, not
some more restrictive thing that gets it into conflicts with DF, etc.

I don't think this will turn out to be a problem in any way.  Some invalid
asm will break, sure.


Segher


* Re: [3/4][PATCH] Fix PR90001
  2019-04-16 12:08 ` [3/4][PATCH] Fix PR90001 Roman Zhuykov
@ 2019-04-26 15:42   ` Jeff Law
  2019-04-26 17:41     ` Roman Zhuykov
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Law @ 2019-04-26 15:42 UTC (permalink / raw)
  To: Roman Zhuykov, gcc-patches

On 4/16/19 6:05 AM, Roman Zhuykov wrote:
> The current algorithm, which finds recurrence_length for all DDG strongly
> connected components, works in roughly O(N^6) time, where N is the number
> of nodes in the DDG.  The time gets that bad mostly for graphs with lots
> of edges, close to N^2 of them.  My proposed algorithm works in O(N^3).
> The algorithm for finding the SCCs themselves is also not optimal (maybe
> up to O(N^4)), but it is left untouched here.
> 
> In some situations, when the number of edges is smaller (say, close to N),
> the new algorithm can unfortunately be slower than the old one.  But I
> think the better remedy there is to add some bail-out when we get more
> than, for example, 1000 nodes.
> 
> Before creating this patch, I tested a special version of it where both
> approaches were active and asserts were inserted to check that the
> algorithms' results (the longest_simple_path values) are exactly the same.
> I can publish this special version if needed.
> 
> I’ve described the patch testing in the cover letter.  OK for trunk?
> 
> gcc/ChangeLog:
> 
> 2019-04-08  Roman Zhuykov  <zhroma@ispras.ru>
> 
>     PR rtl-optimization/90001
>     * ddg.c (create_ddg): Init max_dist array for each node.
>     (free_ddg): Free max_dist array.
>     (create_ddg_edge): Use bool field instead of aux union.
>     (set_recurrence_length): Use prepared max_dist information instead
>     of calling longest_simple_path.
>     (create_scc): Remove graph argument, fill node's aux.count with
>     SCC id, and move set_recurrence_length call to...
>     (create_ddg_all_sccs): ...here, after filling all max_dist arrays
>     using Floyd–Warshall-like algorithm.
>     (update_dist_to_successors): Remove the whole function.
>     (longest_simple_path): Likewise.
>     * ddg.h (struct ddg_node): Add max_dist pointer.
>     (struct ddg_edge): Use bool field instead of unused aux union.
So just an FYI.  If ddg.c is only used by the modulo scheduler, then it
falls under your maintainership and you can decide when to install this
patch.

We generally try to avoid major changes right after the branch is cut,
but I doubt we'll be doing a lot of backporting in this code, so I think
you can go whenever you're ready.

jeff
> 


* Re: [0/4] Addressing modulo-scheduling bugs
  2019-04-22 19:16 ` [0/4] Addressing modulo-scheduling bugs Roman Zhuykov
@ 2019-04-26 15:45   ` Jeff Law
  2019-04-26 16:41     ` Roman Zhuykov
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Law @ 2019-04-26 15:45 UTC (permalink / raw)
  To: Roman Zhuykov, gcc-patches

On 4/22/19 10:45 AM, Roman Zhuykov wrote:
> As a freshly appointed maintainer I’m ready to commit my own
> modulo-sched patches, but decided to ask here whether there are any
> objections.  Maybe I should ask for additional approval at this stage?
> If not, I’ll start tomorrow by committing patches 1/4 and 2/4, which are
> well-formed regression fixes.  Patch 3/4 doesn’t include a test example,
> and I don’t know how to add one there, so I am ready to postpone it to
> stage 1.  Patch 4/4 is technically not fixing a regression.
> 
> The first two patches can also be committed into the 8 branch.  If no
> discussion occurs, I’ll commit them later this week, e.g. on Friday.
As a maintainer we trust you to make these decisions.  Though if you
want guidance, mine would have been to go with #1 and #2 immediately and
postpone #3 and #4 for gcc-10.  That appears to be precisely what you've
done.

Jeff


* Re: [0/4] Addressing modulo-scheduling bugs
  2019-04-26 15:45   ` Jeff Law
@ 2019-04-26 16:41     ` Roman Zhuykov
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-26 16:41 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

Hello, Jeff

> > As a freshly appointed maintainer I’m ready to commit my own
> > modulo-sched patches, but decided to ask here whether there are any
> > objections.  Maybe I should ask for additional approval at this stage?
> > If not, I’ll start tomorrow by committing patches 1/4 and 2/4, which are
> > well-formed regression fixes.  Patch 3/4 doesn’t include a test example,
> > and I don’t know how to add one there, so I am ready to postpone it to
> > stage 1.  Patch 4/4 is technically not fixing a regression.
> >
> > The first two patches can also be committed into the 8 branch.  If no
> > discussion occurs, I’ll commit them later this week, e.g. on Friday.
> As a maintainer we trust you to make these decisions.  Though if you
> want guidance, mine would have been to go with #1 and #2 immediately and
> postpone #3 and #4 for gcc-10.  That appears to be precisely what you've
> done.

Thanks! I was a bit afraid of breaking something in soon-to-be-frozen
code.  Certainly Alex and Andrey are helping me with my first maintainer
steps :)
A few minutes ago I backported #1 and #2 into the 8 branch as r270609.


Roman


* Re: [3/4][PATCH] Fix PR90001
  2019-04-26 15:42   ` Jeff Law
@ 2019-04-26 17:41     ` Roman Zhuykov
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-26 17:41 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

> So just an FYI.  If ddg.c is only used by the modulo scheduler, then it
> falls under your maintainership and you can decide when to install this
> patch.

Yes, I understand about ddg.c and ddg.h.  I have also studied everything
about loop-doloop.c because it is connected.  I will try to participate
in doloop discussions, and will soon present a (mostly refactoring)
doloop patch for approval.

> We generally try to avoid major changes right after the branch is cut,
> but I doubt we'll be doing a lot of backporting in this code, so I think
> you can go whenever you're ready.

I'll now wait for the "not disrupting code" RC phase to finish, and
then will commit it together with posting my next patchset.


* Re: [4/4][PATCH] Discussing PR83507
  2019-04-23 15:55       ` Segher Boessenkool
@ 2019-04-29 16:56         ` Roman Zhuykov
  2019-04-29 17:51           ` Segher Boessenkool
  2019-04-30 15:14           ` Jeff Law
  0 siblings, 2 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-29 16:56 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, Jakub Jelinek

Hi, Segher

> > > > In pr84524.c we got a loop with an extended inline asm:
> > > > asm volatile ("" : "+r" (v))
> > > > which also gives us the kind of “surprising” situation Alexander predicted.
> > > >
> > > > For the sched-deps scanner such a volatile asm is a “true barrier” and we
> > > > create dependencies to almost all other instructions, while the DF scanners
> > > > don’t give us that information.
> > >
> > > There is no such thing as a "true barrier" in extended asm.  The only thing
> > > volatile asm means is that the asm has a side effect unknown to the compiler;
> > > this can *not* be a modification of a register or of memory contents, such
> > > things are known by the compiler, that's what clobbers and "memory" clobber
> > > are about.
> >
> > In sched-deps.c we got:
> >     case ASM_OPERANDS:
> >     case ASM_INPUT:
> >       {
> >        /* Traditional and volatile asm instructions must be considered to use
> >           and clobber all hard registers, all pseudo-registers and all of
> >           memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
> >
> >           Consider for instance a volatile asm that changes the fpu rounding
> >           mode.  An insn should not be moved across this even if it only uses
> >           pseudo-regs because it might give an incorrectly rounded result.  */
> >        if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> >            && !DEBUG_INSN_P (insn))
> >          reg_pending_barrier = TRUE_BARRIER;
> >        ...
>
> This code was added in 1997 (r14770).  In 2004 the documentation was
> changed to clarify how things really work (r88999):
>
> "Note that even a volatile @code{asm} instruction can be moved relative to
> other code, including across jump instructions."
>
> (followed by an example exactly about what this means for FPU control).

Thanks for pointing out those changes!  Unfortunately, sched-deps.c has
been more conservative for these 15 years...
Let’s try to fix it.

> > > > Maybe it would be a good idea to somehow reimplement the modulo scheduler
> > > > using only one scanner instead of two, but at the moment the best thing to
> > > > do is to detect the situation earlier and skip such loops.
> > >
> > > Or fix the broken code...
> >
> > I’m not sure what you mean here,
>
> I mean have the modulo scheduler implement the correct asm semantics, not
> some more restrictive thing that gets it into conflicts with DF, etc.
>
> I don't think this will turn out to be a problem in any way.  Some invalid
> asm will break, sure.

I have started by applying this without any SMS changes:

diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
--- a/gcc/sched-deps.c
+++ b/gcc/sched-deps.c
@@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx x, rtx_insn *insn)

     case UNSPEC_VOLATILE:
       flush_pending_lists (deps, insn, true, true);
+
+      if (!DEBUG_INSN_P (insn))
+       reg_pending_barrier = TRUE_BARRIER;
       /* FALLTHRU */

     case ASM_OPERANDS:
     case ASM_INPUT:
       {
-       /* Traditional and volatile asm instructions must be considered to use
-          and clobber all hard registers, all pseudo-registers and all of
-          memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
-
-          Consider for instance a volatile asm that changes the fpu rounding
-          mode.  An insn should not be moved across this even if it only uses
-          pseudo-regs because it might give an incorrectly rounded result.  */
-       if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
-           && !DEBUG_INSN_P (insn))
-         reg_pending_barrier = TRUE_BARRIER;
-
        /* For all ASM_OPERANDS, we must traverse the vector of input operands.
           We cannot just fall through here since then we would be confused
           by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate

Regstrapping it on x86-64 shows some failures.  The first is with the
ms-sysv ABI test and can be solved like this:
diff --git a/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc b/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
--- a/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
+++ b/gcc/testsuite/gcc.target/x86_64/abi/ms-sysv/gen.cc
@@ -511,7 +511,7 @@ void make_do_test (const vector<class arg> &args,
            /* End if init_test call.  */

            if (f.get_realign () && unaligned == 1)
-             out << "  __asm__ __volatile__ (\"subq $8,%%rsp\":::\"cc\");"
+             out << "  __asm__ __volatile__ (\"subq
$8,%%rsp\":::\"cc\", \"memory\");"
                  << endl;

            out << "  ret = do_test_"
@@ -525,7 +525,7 @@ void make_do_test (const vector<class arg> &args,
            out << ");" << endl;

            if (f.get_realign () && unaligned == 1)
-             out << "  __asm__ __volatile__ (\"addq $8,%%rsp\":::\"cc\");"
+             out << "  __asm__ __volatile__ (\"addq
$8,%%rsp\":::\"cc\", \"memory\");"
                  << endl;

            out << "  check_results (ret);" << endl;

Here we have some asms which manipulate the stack pointer (sigh!).  That
may certainly break at any moment, but right now a “memory” clobber
fixes everything for me.

Another failure is:

  FAIL: gcc.dg/guality/pr58791-4.c   -O2  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -O2  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i2 == 487
  FAIL: gcc.dg/guality/pr58791-4.c   -O3 -g  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -O3 -g  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i2 == 487
  FAIL: gcc.dg/guality/pr58791-4.c   -Os  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -Os  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i2 == 487
  FAIL: gcc.dg/guality/pr58791-4.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487
  FAIL: gcc.dg/guality/pr58791-4.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i == 486
  FAIL: gcc.dg/guality/pr58791-4.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  -DPREVENT_OPTIMIZATION line pr58791-4.c:32 i2 == 487

It is connected to debug info, and I cannot solve it myself.  I am not
sure how this should work when we try to print a dead-code variable (i2)
while debugging an executable compiled at -O2 (-O3/-Os).  Jakub created
that test; he is in CC already.

I will also look at an aarch64 regstrap a bit later.  But that job should
also be done for all other targets.  Segher, could you test it on
rs6000?

> > > > In pr84524.c we got a loop with an extended inline asm:
> > > > asm volatile ("" : "+r" (v))

It seems a good idea to add a “memory” clobber to the asm and recheck the
DDG in this test on aarch64 with both the SMS and sched-deps patches
applied.  I'll check it.


Roman


* Re: [4/4][PATCH] Discussing PR83507
  2019-04-29 16:56         ` Roman Zhuykov
@ 2019-04-29 17:51           ` Segher Boessenkool
  2019-04-30 11:35             ` Roman Zhuykov
  2019-04-30 15:14           ` Jeff Law
  1 sibling, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2019-04-29 17:51 UTC (permalink / raw)
  To: Roman Zhuykov; +Cc: gcc-patches, Jakub Jelinek

Hi!

On Mon, Apr 29, 2019 at 07:28:12PM +0300, Roman Zhuykov wrote:
> > This code was added in 1997 (r14770).  In 2004 the documentation was
> > changed to clarify how things really work (r88999):
> >
> > "Note that even a volatile @code{asm} instruction can be moved relative to
> > other code, including across jump instructions."
> >
> > (followed by an example exactly about what this means for FPU control).
> 
> Thanks for pointing out those changes!  Unfortunately, sched-deps.c has
> been more conservative for these 15 years...
> Let’s try to fix it.

If it causes problems now, that would be a good idea yes :-)

> > I mean have the modulo scheduler implement the correct asm semantics, not
> > some more restrictive thing that gets it into conflicts with DF, etc.
> >
> > I don't think this will turn out to be a problem in any way.  Some invalid
> > asm will break, sure.
> 
> I have started by applying this without any SMS changes:
> 
> diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
> --- a/gcc/sched-deps.c
> +++ b/gcc/sched-deps.c
> @@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx x, rtx_insn *insn)
> 
>      case UNSPEC_VOLATILE:
>        flush_pending_lists (deps, insn, true, true);
> +
> +      if (!DEBUG_INSN_P (insn))
> +       reg_pending_barrier = TRUE_BARRIER;
>        /* FALLTHRU */
> 
>      case ASM_OPERANDS:
>      case ASM_INPUT:
>        {
> -       /* Traditional and volatile asm instructions must be considered to use
> -          and clobber all hard registers, all pseudo-registers and all of
> -          memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
> -
> -          Consider for instance a volatile asm that changes the fpu rounding
> -          mode.  An insn should not be moved across this even if it only uses
> -          pseudo-regs because it might give an incorrectly rounded result.  */
> -       if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> -           && !DEBUG_INSN_P (insn))
> -         reg_pending_barrier = TRUE_BARRIER;
> -
>         /* For all ASM_OPERANDS, we must traverse the vector of input operands.
>            We cannot just fall through here since then we would be confused
>            by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate

UNSPEC_VOLATILE and volatile asm should have the same semantics, ideally.
One step at a time I guess :-)

> Regstrapping it on x86-64 shows some failures.  The first is with the
> ms-sysv ABI test and can be solved like this:

[ snip ]

> Here we have some asms which manipulate the stack pointer (sigh!).  That
> may certainly break at any moment, but right now a “memory” clobber
> fixes everything for me.

Makes sense.

> Another failure is:
> 
>   FAIL: gcc.dg/guality/pr58791-4.c   -O2  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i == 486

[ snip ]

> It is connected to debug info, and I cannot solve it myself.  I am not
> sure how this should work when we try to print a dead-code variable (i2)
> while debugging an executable compiled at -O2 (-O3/-Os).  Jakub created
> that test; he is in CC already.

What does PREVENT_OPTIMIZATION do?  It probably needs to be made a bit
stronger.

(It seems to just add some "used"?)

> I will also look at an aarch64 regstrap a bit later.  But that job should
> also be done for all other targets.  Segher, could you test it on
> rs6000?

Do you have an account on the compile farm?  It has POWER7 (BE, 32/64),
as well as POWER8 and POWER9 machines (both LE).
https://cfarm.tetaneutral.net/

But I'll do it if you want.  Just let me know.

> > > > > In pr84524.c we got a loop with an extended inline asm:
> > > > > asm volatile ("" : "+r" (v))
> 
> It seems a good idea to add a “memory” clobber to the asm and recheck the
> DDG in this test on aarch64 with both the SMS and sched-deps patches
> applied.  I'll check it.


Segher


* Re: [4/4][PATCH] Discussing PR83507
  2019-04-29 17:51           ` Segher Boessenkool
@ 2019-04-30 11:35             ` Roman Zhuykov
  0 siblings, 0 replies; 17+ messages in thread
From: Roman Zhuykov @ 2019-04-30 11:35 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, Jakub Jelinek

> > > This code was added in 1997 (r14770).  In 2004 the documentation was
> > > changed to clarify how things really work (r88999):
> > >
> > > "Note that even a volatile @code{asm} instruction can be moved relative to
> > > other code, including across jump instructions."
> > >
> > > (followed by an example exactly about what this means for FPU control).
> >
> > Thanks for pointing out those changes!  Unfortunately, sched-deps.c has
> > been more conservative for these 15 years...
> > Let’s try to fix it.
>
> If it causes problems now, that would be a good idea yes :-)
>
> > > I mean have the modulo scheduler implement the correct asm semantics, not
> > > some more restrictive thing that gets it into conflicts with DF, etc.
> > >
> > > I don't think this will turn out to be a problem in any way.  Some invalid
> > > asm will break, sure.
> >
> > I have started by applying this without any SMS changes:
> >
> > diff --git a/gcc/sched-deps.c b/gcc/sched-deps.c
> > --- a/gcc/sched-deps.c
> > +++ b/gcc/sched-deps.c
> > @@ -2753,22 +2753,14 @@ sched_analyze_2 (struct deps_desc *deps, rtx x, rtx_insn *insn)
> >
> >      case UNSPEC_VOLATILE:
> >        flush_pending_lists (deps, insn, true, true);
> > +
> > +      if (!DEBUG_INSN_P (insn))
> > +       reg_pending_barrier = TRUE_BARRIER;
> >        /* FALLTHRU */
> >
> >      case ASM_OPERANDS:
> >      case ASM_INPUT:
> >        {
> > -       /* Traditional and volatile asm instructions must be considered to use
> > -          and clobber all hard registers, all pseudo-registers and all of
> > -          memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
> > -
> > -          Consider for instance a volatile asm that changes the fpu rounding
> > -          mode.  An insn should not be moved across this even if it only uses
> > -          pseudo-regs because it might give an incorrectly rounded result.  */
> > -       if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
> > -           && !DEBUG_INSN_P (insn))
> > -         reg_pending_barrier = TRUE_BARRIER;
> > -
> >         /* For all ASM_OPERANDS, we must traverse the vector of input operands.
> >            We cannot just fall through here since then we would be confused
> >            by the ASM_INPUT rtx inside ASM_OPERANDS, which do not indicate
>
> UNSPEC_VOLATILE and volatile asm should have the same semantics, ideally.
> One step at a time I guess :-)

When the barrier for UNSPEC_VOLATILE is also dropped, there are more
issues on x86_64:

FAIL: gcc.dg/vect/pr62021.c -flto -ffat-lto-objects execution test
FAIL: gcc.dg/vect/pr62021.c execution test
FAIL: gcc.target/i386/avx2-vect-aggressive.c execution test

I haven’t looked at those vectorization tests.

> > Regstrapping it on x86-64 shows some failures.  The first is with the
> > ms-sysv ABI test and can be solved like this:
>
> [ snip ]
>
> > Here we have some asms which control the stack pointer (sigh!).  They
> > certainly may break at any moment, but right now a “memory” clobber
> > fixes everything for me.
>
> Makes sense.
>
> > Another failure is:
> >
> >   FAIL: gcc.dg/guality/pr58791-4.c   -O2  -DPREVENT_OPTIMIZATION  line pr58791-4.c:32 i == 486
>
> [ snip ]
>
> > It is connected to debug info, and I cannot solve it myself.  I am not
> > sure how this should work when we try to print a dead-code variable
> > (i2) while debugging an executable compiled with -O2 (-O3/-Os).  Jakub
> > created that test; he is in CC already.
>
> What does PREVENT_OPTIMIZATION do?  It probably needs to be made a bit
> stronger.

It seems the PREVENT_OPTIMIZATION stuff affects only one test –
gcc.dg/guality/pr45882.c.  The other tests do not contain the
ATTRIBUTE_USED macro.
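If I read the harness right, that machinery boils down to the “used”
attribute, roughly (a sketch from memory, not the exact test sources):

  #define ATTRIBUTE_USED __attribute__ ((used))
  static int i ATTRIBUTE_USED;

which only keeps the object alive; it does not pin its debug-location
info, so it would indeed need strengthening to help here.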

> (It seems to just add some "used"?)
>
> > I will also look at the aarch64 regstrap a bit later.  But that job
> > should also be done for all the other targets.  Segher, could you test
> > it on rs6000?
>
> Do you have an account on the compile farm?  It has POWER7 (BE, 32/64),
> as well as POWER8 and POWER9 machines (both LE).
> https://cfarm.tetaneutral.net/

Ok, I’ll use the farm; I have already filled in the “new user” form.
The aarch64 regtest is still running, maybe next time I will use the
farm instead of qemu-system.

I forgot to mention in the previous letter that with this patch we
also drop dependencies between volatile asms and volatile mems.  We
had an offline discussion with Alex, and it seems that since 2004 the
docs have never guaranteed such a dependency: the user should add the
relevant constraints to the asm in all cases.  But somebody may have
another opinion about this tricky point.
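To illustrate what “relevant constraints” means here (a sketch, with x
a hypothetical volatile object):

  extern volatile int x;
  /* Order the asm against accesses to x only ...  */
  asm volatile ("" : "+m" (x));
  /* ... or conservatively against all memory accesses.  */
  asm volatile ("" : : : "memory");

With either form the dependence is visible to sched-deps through the
operands rather than through the old implicit barrier.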


Roman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [4/4][PATCH] Discussing PR83507
  2019-04-29 16:56         ` Roman Zhuykov
  2019-04-29 17:51           ` Segher Boessenkool
@ 2019-04-30 15:14           ` Jeff Law
  1 sibling, 0 replies; 17+ messages in thread
From: Jeff Law @ 2019-04-30 15:14 UTC (permalink / raw)
  To: Roman Zhuykov, Segher Boessenkool; +Cc: gcc-patches, Jakub Jelinek

On 4/29/19 10:28 AM, Roman Zhuykov wrote:
> Hi, Segher
> 
>>>>> In pr84524.c we got a loop with an extended inline asm:
>>>>> asm volatile ("" : "+r" (v))
>>>>> which also gives us the “surprising” situation Alexander predicted.
>>>>>
>>>>> For the sched-deps scanner such a volatile asm is a “true barrier” and
>>>>> we create dependencies to almost all other instructions, while the DF
>>>>> scanners don’t give us such information.
>>>>
>>>> There is no such thing as a "true barrier" in extended asm.  The only thing
>>>> volatile asm means is that the asm has a side effect unknown to the compiler;
>>>> this can *not* be a modification of a register or of memory contents, since
>>>> such things are known to the compiler; that's what clobbers and the "memory"
>>>> clobber are about.
>>>
>>> In sched-deps.c we got:
>>>     case ASM_OPERANDS:
>>>     case ASM_INPUT:
>>>       {
>>>        /* Traditional and volatile asm instructions must be considered to use
>>>           and clobber all hard registers, all pseudo-registers and all of
>>>           memory.  So must TRAP_IF and UNSPEC_VOLATILE operations.
>>>
>>>           Consider for instance a volatile asm that changes the fpu rounding
>>>           mode.  An insn should not be moved across this even if it only uses
>>>           pseudo-regs because it might give an incorrectly rounded result.  */
>>>        if ((code != ASM_OPERANDS || MEM_VOLATILE_P (x))
>>>            && !DEBUG_INSN_P (insn))
>>>          reg_pending_barrier = TRUE_BARRIER;
>>>        ...
>>
>> This code was added in 1997 (r14770).  In 2004 the documentation was
>> changed to clarify how things really work (r88999):
>>
>> "Note that even a volatile @code{asm} instruction can be moved relative to
>> other code, including across jump instructions."
>>
>> (followed by an example exactly about what this means for FPU control).
> 
> Thanks for pointing to those changes!  Unfortunately, sched-deps.c
> has been more conservative for these 15 years...
> Let’s try to fix it.
Also note that the integration of the Haifa scheduler was a large
change.  A small detail like this could have been missed.

Jeff

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2019-04-16 11:59 [0/4] Addressing modulo-scheduling bugs zhroma
2019-04-16 12:03 ` [1/4][PATCH] Fix PR84032 Roman Zhuykov
2019-04-16 12:05 ` [2/4][PATCH] Fix PR87979 Roman Zhuykov
2019-04-16 12:08 ` [3/4][PATCH] Fix PR90001 Roman Zhuykov
2019-04-26 15:42   ` Jeff Law
2019-04-26 17:41     ` Roman Zhuykov
2019-04-16 12:26 ` [4/4][PATCH] Discussing PR83507 Roman Zhuykov
2019-04-16 15:10   ` Segher Boessenkool
2019-04-22 16:45     ` Roman Zhuykov
2019-04-23 15:55       ` Segher Boessenkool
2019-04-29 16:56         ` Roman Zhuykov
2019-04-29 17:51           ` Segher Boessenkool
2019-04-30 11:35             ` Roman Zhuykov
2019-04-30 15:14           ` Jeff Law
2019-04-22 19:16 ` [0/4] Addressing modulo-scheduling bugs Roman Zhuykov
2019-04-26 15:45   ` Jeff Law
2019-04-26 16:41     ` Roman Zhuykov
