public inbox for gcc-patches@gcc.gnu.org
* New optimization for reload_combine
@ 2010-07-16 10:35 Bernd Schmidt
  2010-07-16 18:32 ` Jeff Law
                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-16 10:35 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1147 bytes --]

This fixes one subproblem of PR42235.  In reload_combine, we recognize a
new pattern:

 REG1 = REG2 + CONST
 ...
 something MEM (REG1)

and try to use the addition in the MEM directly, moving the add insn
downwards if REG1 and REG2 are identical.  This has two benefits: it may
be possible to eliminate the add, and the load no longer depends on the
add instruction and may schedule sooner.
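
In RTL terms, a minimal sketch of the transformation (modes and register
numbers purely illustrative) is to turn

  (set (reg:SI r3) (plus:SI (reg:SI r2) (const_int 2)))
  ...
  (set (mem:QI (reg:SI r3)) (reg:QI r5))

into

  (set (mem:QI (plus:SI (reg:SI r2) (const_int 2))) (reg:QI r5))

after which the add is dead (or, when REG1 and REG2 are the same register,
it is moved down past the last use we managed to rewrite).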

This can make the generated code quite a lot nicer:

-       adds    r3, r2, #2
        strb    r5, [r2, #0]
-       strb    r5, [r3, #0]
-       adds    r3, r2, #3
+       strb    r5, [r2, #2]
        strb    r5, [r2, #1]
-       strb    r5, [r3, #0]
-       adds    r3, r2, #4
+       strb    r5, [r2, #3]
        lsrs    r1, r1, #11
-       strb    r5, [r3, #0]
-       adds    r3, r2, #5
+       strb    r5, [r2, #4]
        add     r1, r1, r1, lsl #1
-       strb    r5, [r3, #0]
+       strb    r5, [r2, #5]

Bootstrapped and regression tested on i686-linux.  Also tested on
    qemu-system-armv7/arch=armv7-a/thumb
    qemu-system-armv7/thumb
    qemu-system-armv7

Does postreload.c fall under reload?  If not, would someone approve this?


Bernd

[-- Attachment #2: rc-newpattern.diff --]
[-- Type: text/plain, Size: 27527 bytes --]

	PR target/42235
	* postreload.c (reload_cse_move2add): Return bool, true if anything
	changed.  All callers changed.
	(move2add_use_add2_insn): Likewise.
	(move2add_use_add3_insn): Likewise.
	(reload_cse_regs): If reload_cse_move2add changed anything, rerun
	reload_combine.
	(RELOAD_COMBINE_MAX_USES): Bump to 16.
	(last_jump_ruid): New static variable.
	(struct reg_use): New members CONTAINING_MEM and RUID.
	(reg_state): New members ALL_OFFSETS_MATCH and REAL_STORE_RUID.
	(reload_combine_split_one_ruid, reload_combine_split_ruids,
	reload_combine_purge_insn_uses, reload_combine_closest_single_use,
	reload_combine_purge_reg_uses_after_ruid,
	reload_combine_recognize_const_pattern): New static functions.
	(reload_combine_recognize_pattern): Verify that ALL_OFFSETS_MATCH
	is true for our reg and that we have available index regs.
	(reload_combine_note_use): New args RUID and CONTAINING_MEM.  All
	callers changed.  Use them to initialize fields in struct reg_use.
	(reload_combine): Initialize last_jump_ruid.  Be careful when to
	take PREV_INSN of the scanned insn.  Update REAL_STORE_RUID fields.
	Call reload_combine_recognize_const_pattern.
	(reload_combine_note_store): Update REAL_STORE_RUID field.

	* gcc.target/arm/pr42235.c: New test.

Index: postreload.c
===================================================================
--- postreload.c	(revision 162228)
+++ postreload.c	(working copy)
@@ -56,10 +56,10 @@ static int reload_cse_simplify_set (rtx,
 static int reload_cse_simplify_operands (rtx, rtx);
 
 static void reload_combine (void);
-static void reload_combine_note_use (rtx *, rtx);
+static void reload_combine_note_use (rtx *, rtx, int, rtx);
 static void reload_combine_note_store (rtx, const_rtx, void *);
 
-static void reload_cse_move2add (rtx);
+static bool reload_cse_move2add (rtx);
 static void move2add_note_store (rtx, const_rtx, void *);
 
 /* Call cse / combine like post-reload optimization phases.
@@ -67,11 +67,16 @@ static void move2add_note_store (rtx, co
 void
 reload_cse_regs (rtx first ATTRIBUTE_UNUSED)
 {
+  bool moves_converted;
   reload_cse_regs_1 (first);
   reload_combine ();
-  reload_cse_move2add (first);
+  moves_converted = reload_cse_move2add (first);
   if (flag_expensive_optimizations)
-    reload_cse_regs_1 (first);
+    {
+      if (moves_converted)
+	reload_combine ();
+      reload_cse_regs_1 (first);
+    }
 }
 
 /* See whether a single set SET is a noop.  */
@@ -660,29 +665,42 @@ reload_cse_simplify_operands (rtx insn, 
 
 /* The maximum number of uses of a register we can keep track of to
    replace them with reg+reg addressing.  */
-#define RELOAD_COMBINE_MAX_USES 6
+#define RELOAD_COMBINE_MAX_USES 16
 
-/* INSN is the insn where a register has been used, and USEP points to the
-   location of the register within the rtl.  */
-struct reg_use { rtx insn, *usep; };
+/* Describes a recorded use of a register.  */
+struct reg_use
+{
+  /* The insn where a register has been used.  */
+  rtx insn;
+  /* The reverse uid of the insn.  */
+  int ruid;
+  /* Points to the memory reference enclosing the use, if any, NULL_RTX
+     otherwise.  */
+  rtx containing_mem;
+  /* Location of the register within INSN.  */
+  rtx *usep;
+};
 
 /* If the register is used in some unknown fashion, USE_INDEX is negative.
    If it is dead, USE_INDEX is RELOAD_COMBINE_MAX_USES, and STORE_RUID
-   indicates where it becomes live again.
+   indicates where it is first set or clobbered.
    Otherwise, USE_INDEX is the index of the last encountered use of the
-   register (which is first among these we have seen since we scan backwards),
-   OFFSET contains the constant offset that is added to the register in
-   all encountered uses, and USE_RUID indicates the first encountered, i.e.
-   last, of these uses.
+   register (which is first among these we have seen since we scan backwards).
+   USE_RUID indicates the first encountered, i.e. last, of these uses.
+   If ALL_OFFSETS_MATCH is true, all encountered uses were inside a PLUS
+   with a constant offset; OFFSET contains this constant in that case.
    STORE_RUID is always meaningful if we only want to use a value in a
    register in a different place: it denotes the next insn in the insn
-   stream (i.e. the last encountered) that sets or clobbers the register.  */
+   stream (i.e. the last encountered) that sets or clobbers the register.
+   REAL_STORE_RUID is similar, but clobbers are ignored when updating it.  */
 static struct
   {
     struct reg_use reg_use[RELOAD_COMBINE_MAX_USES];
     int use_index;
     rtx offset;
+    bool all_offsets_match;
     int store_ruid;
+    int real_store_ruid;
     int use_ruid;
   } reg_state[FIRST_PSEUDO_REGISTER];
 
@@ -694,6 +712,9 @@ static int reload_combine_ruid;
 /* The RUID of the last label we encountered in reload_combine.  */
 static int last_label_ruid;
 
+/* The RUID of the last jump we encountered in reload_combine.  */
+static int last_jump_ruid;
+
 /* The register numbers of the first and last index register.  A value of
    -1 in LAST_INDEX_REG indicates that we've previously computed these
    values and found no suitable index registers.  */
@@ -703,6 +724,310 @@ static int last_index_reg;
 #define LABEL_LIVE(LABEL) \
   (label_live[CODE_LABEL_NUMBER (LABEL) - min_labelno])
 
+/* Subroutine of reload_combine_split_ruids, called to fix up a single
+   ruid pointed to by *PRUID if it is higher than SPLIT_RUID.  */
+
+static inline void
+reload_combine_split_one_ruid (int *pruid, int split_ruid)
+{
+  if (*pruid > split_ruid)
+    (*pruid)++;
+}
+
+/* Called when we insert a new insn in a position we've already passed in
+   the scan.  Examine all our state, increasing all ruids that are higher
+   than SPLIT_RUID by one in order to make room for a new insn.  */
+
+static void
+reload_combine_split_ruids (int split_ruid)
+{
+  unsigned i;
+
+  reload_combine_split_one_ruid (&reload_combine_ruid, split_ruid);
+  reload_combine_split_one_ruid (&last_label_ruid, split_ruid);
+  reload_combine_split_one_ruid (&last_jump_ruid, split_ruid);
+
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+    {
+      int j, idx = reg_state[i].use_index;
+      reload_combine_split_one_ruid (&reg_state[i].use_ruid, split_ruid);
+      reload_combine_split_one_ruid (&reg_state[i].store_ruid, split_ruid);
+      reload_combine_split_one_ruid (&reg_state[i].real_store_ruid,
+				     split_ruid);
+      if (idx < 0)
+	continue;
+      for (j = idx; j < RELOAD_COMBINE_MAX_USES; j++)
+	{
+	  reload_combine_split_one_ruid (&reg_state[i].reg_use[j].ruid,
+					 split_ruid);
+	}
+    }
+}
+
+/* Called when we are about to rescan a previously encountered insn with
+   reload_combine_note_use after modifying some part of it.  This clears all
+   information about uses in that particular insn.  */
+
+static void
+reload_combine_purge_insn_uses (rtx insn)
+{
+  unsigned i;
+
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+    {
+      int j, k, idx = reg_state[i].use_index;
+      if (idx < 0)
+	continue;
+      j = k = RELOAD_COMBINE_MAX_USES;
+      while (j-- > idx)
+	{
+	  if (reg_state[i].reg_use[j].insn != insn)
+	    {
+	      k--;
+	      if (k != j)
+		reg_state[i].reg_use[k] = reg_state[i].reg_use[j];
+	    }
+	}
+      reg_state[i].use_index = k;
+    }
+}
+
+/* Called when we need to forget about all uses of REGNO after an insn
+   which is identified by RUID.  */
+
+static void
+reload_combine_purge_reg_uses_after_ruid (unsigned regno, int ruid)
+{
+  int j, k, idx = reg_state[regno].use_index;
+  if (idx < 0)
+    return;
+  j = k = RELOAD_COMBINE_MAX_USES;
+  while (j-- > idx)
+    {
+      if (reg_state[regno].reg_use[j].ruid >= ruid)
+	{
+	  k--;
+	  if (k != j)
+	    reg_state[regno].reg_use[k] = reg_state[regno].reg_use[j];
+	}
+    }
+  reg_state[regno].use_index = k;
+}
+
+/* Find the use of REGNO with the ruid that is highest among those
+   lower than RUID_LIMIT, and return it if it is the only use of this
+   reg in the insn.  Return NULL otherwise.  */
+
+static struct reg_use *
+reload_combine_closest_single_use (unsigned regno, int ruid_limit)
+{
+  int i, best_ruid = 0;
+  int use_idx = reg_state[regno].use_index;
+  struct reg_use *retval;
+
+  if (use_idx < 0)
+    return NULL;
+  retval = NULL;
+  for (i = use_idx; i < RELOAD_COMBINE_MAX_USES; i++)
+    {
+      int this_ruid = reg_state[regno].reg_use[i].ruid;
+      if (this_ruid >= ruid_limit)
+	continue;
+      if (this_ruid > best_ruid)
+	{
+	  best_ruid = this_ruid;
+	  retval = reg_state[regno].reg_use + i;
+	}
+      else if (this_ruid == best_ruid)
+	retval = NULL;
+    }
+  if (last_label_ruid >= best_ruid)
+    return NULL;
+  return retval;
+}
+
+/* Called by reload_combine when scanning INSN.  This function tries to detect
+   patterns where a constant is added to a register, and the result is used
+   in an address.
+   Return true if no further processing is needed on INSN; false if it wasn't
+   recognized and should be handled normally.  */
+
+static bool
+reload_combine_recognize_const_pattern (rtx insn)
+{
+  int from_ruid = reload_combine_ruid;
+  rtx set, pat, reg, src, addreg;
+  unsigned int regno;
+  struct reg_use *use;
+  bool must_move_add;
+  rtx add_moved_after_insn = NULL_RTX;
+  int add_moved_after_ruid = 0;
+  int clobbered_regno = -1;
+
+  set = single_set (insn);
+  if (set == NULL_RTX)
+    return false;
+
+  reg = SET_DEST (set);
+  src = SET_SRC (set);
+  if (!REG_P (reg)
+      || hard_regno_nregs[REGNO (reg)][GET_MODE (reg)] != 1
+      || GET_MODE (reg) != Pmode)
+    return false;
+
+  regno = REGNO (reg);
+
+  /* We look for a REG1 = REG2 + CONSTANT insn, followed by either
+     uses of REG1 inside an address, or inside another add insn.  If
+     possible and profitable, merge the addition into subsequent
+     uses.  */
+  if (GET_CODE (src) != PLUS
+      || !REG_P (XEXP (src, 0))
+      || !CONSTANT_P (XEXP (src, 1)))
+    return false;
+
+  addreg = XEXP (src, 0);
+  must_move_add = rtx_equal_p (reg, addreg);
+
+  pat = PATTERN (insn);
+  if (must_move_add && set != pat)
+    {
+      /* We have to be careful when moving the add; apart from the
+	 single_set there may also be clobbers.  Recognize one special
+	 case, that of one clobber alongside the set (likely a clobber
+	 of the CC register).  */
+      gcc_assert (GET_CODE (PATTERN (insn)) == PARALLEL);
+      if (XVECLEN (pat, 0) != 2 || XVECEXP (pat, 0, 0) != set
+	  || GET_CODE (XVECEXP (pat, 0, 1)) != CLOBBER
+	  || !REG_P (XEXP (XVECEXP (pat, 0, 1), 0)))
+	return false;
+      clobbered_regno = REGNO (XEXP (XVECEXP (pat, 0, 1), 0));
+    }
+
+  do
+    {
+      use = reload_combine_closest_single_use (regno, from_ruid);
+
+      if (use)
+	/* Start the search for the next use from here.  */
+	from_ruid = use->ruid;
+
+      if (use && GET_MODE (*use->usep) == Pmode)
+	{
+	  rtx use_insn = use->insn;
+	  int use_ruid = use->ruid;
+	  rtx mem = use->containing_mem;
+	  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn));
+
+	  /* Avoid moving the add insn past a jump.  */
+	  if (must_move_add && use_ruid < last_jump_ruid)
+	    break;
+
+	  /* If the add clobbers another hard reg in parallel, don't move
+	     it past a real set of this hard reg.  */
+	  if (must_move_add && clobbered_regno >= 0
+	      && reg_state[clobbered_regno].real_store_ruid >= use_ruid)
+	    break;
+
+	  /* Avoid moving a use of ADDREG past a point where it
+	     is stored.  */
+	  if (reg_state[REGNO (addreg)].store_ruid >= use_ruid)
+	    break;
+
+	  if (mem != NULL_RTX)
+	    {
+	      rtx oldaddr = XEXP (mem, 0);
+	      rtx newaddr = NULL_RTX;
+	      int old_cost = rtx_cost (mem, SET, speed);
+	      int new_cost;
+
+	      newaddr = simplify_replace_rtx (oldaddr, reg, copy_rtx (src));
+	      if (memory_address_addr_space_p (GET_MODE (mem), newaddr,
+					       MEM_ADDR_SPACE (mem)))
+		{
+		  XEXP (mem, 0) = newaddr;
+		  new_cost = rtx_cost (mem, SET, speed);
+		  XEXP (mem, 0) = oldaddr;
+		  if (new_cost <= old_cost
+		      && validate_change (use_insn,
+					  &XEXP (mem, 0), newaddr, 0))
+		    {
+		      reload_combine_purge_insn_uses (use_insn);
+		      reload_combine_note_use (&PATTERN (use_insn), use_insn,
+					       use_ruid, NULL_RTX);
+
+		      if (must_move_add)
+			{
+			  add_moved_after_insn = use_insn;
+			  add_moved_after_ruid = use_ruid;
+			}
+		      continue;
+		    }
+		}
+	    }
+	  else
+	    {
+	      rtx new_set = single_set (use_insn);
+	      if (new_set
+		  && REG_P (SET_DEST (new_set))
+		  && GET_CODE (SET_SRC (new_set)) == PLUS
+		  && REG_P (XEXP (SET_SRC (new_set), 0))
+		  && CONSTANT_P (XEXP (SET_SRC (new_set), 1)))
+		{
+		  rtx new_src;
+		  int old_cost = rtx_cost (SET_SRC (new_set), SET, speed);
+
+		  gcc_assert (rtx_equal_p (XEXP (SET_SRC (new_set), 0), reg));
+		  new_src = simplify_replace_rtx (SET_SRC (new_set), reg,
+						  copy_rtx (src));
+
+		  if (rtx_cost (new_src, SET, speed) <= old_cost
+		      && validate_change (use_insn, &SET_SRC (new_set),
+					  new_src, 0))
+		    {
+		      reload_combine_purge_insn_uses (use_insn);
+		      reload_combine_note_use (&SET_SRC (new_set), use_insn,
+					       use_ruid, NULL_RTX);
+
+		      if (must_move_add)
+			{
+			  /* See if that took care of the add insn.  */
+			  if (rtx_equal_p (SET_DEST (new_set), reg))
+			    {
+			      delete_insn (insn);
+			      return true;
+			    }
+			  else
+			    {
+			      add_moved_after_insn = use_insn;
+			      add_moved_after_ruid = use_ruid;
+			    }
+			}
+		      continue;
+		    }
+		}
+	    }
+	  /* If we get here, we couldn't handle this use.  */
+	  if (must_move_add)
+	    break;
+	}
+    }
+  while (use);
+
+  if (!must_move_add || add_moved_after_insn == NULL_RTX)
+    /* Process the add normally.  */
+    return false;
+
+  reorder_insns (insn, insn, add_moved_after_insn);
+  reload_combine_purge_reg_uses_after_ruid (regno, add_moved_after_ruid);
+  reload_combine_split_ruids (add_moved_after_ruid - 1);
+  reload_combine_note_use (&PATTERN (insn), insn,
+			   add_moved_after_ruid, NULL_RTX);
+  reg_state[regno].store_ruid = add_moved_after_ruid;
+
+  return true;
+}
+
 /* Called by reload_combine when scanning INSN.  Try to detect a pattern we
    can handle and improve.  Return true if no further processing is needed on
    INSN; false if it wasn't recognized and should be handled normally.  */
@@ -713,6 +1038,18 @@ reload_combine_recognize_pattern (rtx in
   rtx set, reg, src;
   unsigned int regno;
 
+  set = single_set (insn);
+  if (set == NULL_RTX)
+    return false;
+
+  reg = SET_DEST (set);
+  src = SET_SRC (set);
+  if (!REG_P (reg)
+      || hard_regno_nregs[REGNO (reg)][GET_MODE (reg)] != 1)
+    return false;
+
+  regno = REGNO (reg);
+
   /* Look for (set (REGX) (CONST_INT))
      (set (REGX) (PLUS (REGX) (REGY)))
      ...
@@ -726,19 +1063,10 @@ reload_combine_recognize_pattern (rtx in
      and that we know all uses of REGX before it dies.
      Also, explicitly check that REGX != REGY; our life information
      does not yet show whether REGY changes in this insn.  */
-  set = single_set (insn);
-  if (set == NULL_RTX)
-    return false;
-
-  reg = SET_DEST (set);
-  src = SET_SRC (set);
-  if (!REG_P (reg)
-      || hard_regno_nregs[REGNO (reg)][GET_MODE (reg)] != 1)
-    return false;
-
-  regno = REGNO (reg);
 
   if (GET_CODE (src) == PLUS
+      && reg_state[regno].all_offsets_match
+      && last_index_reg != -1
       && REG_P (XEXP (src, 1))
       && rtx_equal_p (XEXP (src, 0), reg)
       && !rtx_equal_p (XEXP (src, 1), reg)
@@ -822,7 +1150,9 @@ reload_combine_recognize_pattern (rtx in
 		   i < RELOAD_COMBINE_MAX_USES; i++)
 		reload_combine_note_use
 		  (&XEXP (*reg_state[regno].reg_use[i].usep, 1),
-		   reg_state[regno].reg_use[i].insn);
+		   reg_state[regno].reg_use[i].insn,
+		   reg_state[regno].reg_use[i].ruid,
+		   reg_state[regno].reg_use[i].containing_mem);
 
 	      if (reg_state[REGNO (base)].use_ruid
 		  > reg_state[regno].use_ruid)
@@ -850,7 +1180,7 @@ reload_combine_recognize_pattern (rtx in
 static void
 reload_combine (void)
 {
-  rtx insn;
+  rtx insn, prev;
   int i;
   basic_block bb;
   unsigned int r;
@@ -912,20 +1244,23 @@ reload_combine (void)
     }
 
   /* Initialize last_label_ruid, reload_combine_ruid and reg_state.  */
-  last_label_ruid = reload_combine_ruid = 0;
+  last_label_ruid = last_jump_ruid = reload_combine_ruid = 0;
   for (r = 0; r < FIRST_PSEUDO_REGISTER; r++)
     {
-      reg_state[r].store_ruid = reload_combine_ruid;
+      reg_state[r].store_ruid = 0;
+      reg_state[r].real_store_ruid = 0;
       if (fixed_regs[r])
 	reg_state[r].use_index = -1;
       else
 	reg_state[r].use_index = RELOAD_COMBINE_MAX_USES;
     }
 
-  for (insn = get_last_insn (); insn; insn = PREV_INSN (insn))
+  for (insn = get_last_insn (); insn; insn = prev)
     {
       rtx note;
 
+      prev = PREV_INSN (insn);
+
       /* We cannot do our optimization across labels.  Invalidating all the use
 	 information we have would be costly, so we just note where the label
 	 is and then later disable any optimization that would cross it.  */
@@ -941,7 +1276,11 @@ reload_combine (void)
 
       reload_combine_ruid++;
 
-      if (reload_combine_recognize_pattern (insn))
+      if (JUMP_P (insn))
+	last_jump_ruid = reload_combine_ruid;
+
+      if (reload_combine_recognize_const_pattern (insn)
+	  || reload_combine_recognize_pattern (insn))
 	continue;
 
       note_stores (PATTERN (insn), reload_combine_note_store, NULL);
@@ -998,7 +1337,8 @@ reload_combine (void)
 	      reg_state[i].use_index = -1;
 	}
 
-      reload_combine_note_use (&PATTERN (insn), insn);
+      reload_combine_note_use (&PATTERN (insn), insn,
+			       reload_combine_ruid, NULL_RTX);
       for (note = REG_NOTES (insn); note; note = XEXP (note, 1))
 	{
 	  if (REG_NOTE_KIND (note) == REG_INC
@@ -1007,6 +1347,7 @@ reload_combine (void)
 	      int regno = REGNO (XEXP (note, 0));
 
 	      reg_state[regno].store_ruid = reload_combine_ruid;
+	      reg_state[regno].real_store_ruid = reload_combine_ruid;
 	      reg_state[regno].use_index = -1;
 	    }
 	}
@@ -1016,8 +1357,8 @@ reload_combine (void)
 }
 
 /* Check if DST is a register or a subreg of a register; if it is,
-   update reg_state[regno].store_ruid and reg_state[regno].use_index
-   accordingly.  Called via note_stores from reload_combine.  */
+   update store_ruid, real_store_ruid and use_index in the reg_state
+   structure accordingly.  Called via note_stores from reload_combine.  */
 
 static void
 reload_combine_note_store (rtx dst, const_rtx set, void *data ATTRIBUTE_UNUSED)
@@ -1041,14 +1382,14 @@ reload_combine_note_store (rtx dst, cons
   /* note_stores might have stripped a STRICT_LOW_PART, so we have to be
      careful with registers / register parts that are not full words.
      Similarly for ZERO_EXTRACT.  */
-  if (GET_CODE (set) != SET
-      || GET_CODE (SET_DEST (set)) == ZERO_EXTRACT
+  if (GET_CODE (SET_DEST (set)) == ZERO_EXTRACT
       || GET_CODE (SET_DEST (set)) == STRICT_LOW_PART)
     {
       for (i = hard_regno_nregs[regno][mode] - 1 + regno; i >= regno; i--)
 	{
 	  reg_state[i].use_index = -1;
 	  reg_state[i].store_ruid = reload_combine_ruid;
+	  reg_state[i].real_store_ruid = reload_combine_ruid;
 	}
     }
   else
@@ -1056,6 +1397,8 @@ reload_combine_note_store (rtx dst, cons
       for (i = hard_regno_nregs[regno][mode] - 1 + regno; i >= regno; i--)
 	{
 	  reg_state[i].store_ruid = reload_combine_ruid;
+	  if (GET_CODE (set) == SET)
+	    reg_state[i].real_store_ruid = reload_combine_ruid;
 	  reg_state[i].use_index = RELOAD_COMBINE_MAX_USES;
 	}
     }
@@ -1066,7 +1409,7 @@ reload_combine_note_store (rtx dst, cons
    *XP is the pattern of INSN, or a part of it.
    Called from reload_combine, and recursively by itself.  */
 static void
-reload_combine_note_use (rtx *xp, rtx insn)
+reload_combine_note_use (rtx *xp, rtx insn, int ruid, rtx containing_mem)
 {
   rtx x = *xp;
   enum rtx_code code = x->code;
@@ -1079,7 +1422,7 @@ reload_combine_note_use (rtx *xp, rtx in
     case SET:
       if (REG_P (SET_DEST (x)))
 	{
-	  reload_combine_note_use (&SET_SRC (x), insn);
+	  reload_combine_note_use (&SET_SRC (x), insn, ruid, NULL_RTX);
 	  return;
 	}
       break;
@@ -1143,29 +1486,28 @@ reload_combine_note_use (rtx *xp, rtx in
 	if (use_index < 0)
 	  return;
 
-	if (use_index != RELOAD_COMBINE_MAX_USES - 1)
-	  {
-	    /* We have found another use for a register that is already
-	       used later.  Check if the offsets match; if not, mark the
-	       register as used in an unknown fashion.  */
-	    if (! rtx_equal_p (offset, reg_state[regno].offset))
-	      {
-		reg_state[regno].use_index = -1;
-		return;
-	      }
-	  }
-	else
+	if (use_index == RELOAD_COMBINE_MAX_USES - 1)
 	  {
 	    /* This is the first use of this register we have seen since we
 	       marked it as dead.  */
 	    reg_state[regno].offset = offset;
-	    reg_state[regno].use_ruid = reload_combine_ruid;
+	    reg_state[regno].all_offsets_match = true;
+	    reg_state[regno].use_ruid = ruid;
 	  }
+	else if (! rtx_equal_p (offset, reg_state[regno].offset))
+	  reg_state[regno].all_offsets_match = false;
+
 	reg_state[regno].reg_use[use_index].insn = insn;
+	reg_state[regno].reg_use[use_index].ruid = ruid;
+	reg_state[regno].reg_use[use_index].containing_mem = containing_mem;
 	reg_state[regno].reg_use[use_index].usep = xp;
 	return;
       }
 
+    case MEM:
+      containing_mem = x;
+      break;
+
     default:
       break;
     }
@@ -1175,11 +1517,12 @@ reload_combine_note_use (rtx *xp, rtx in
   for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
     {
       if (fmt[i] == 'e')
-	reload_combine_note_use (&XEXP (x, i), insn);
+	reload_combine_note_use (&XEXP (x, i), insn, ruid, containing_mem);
       else if (fmt[i] == 'E')
 	{
 	  for (j = XVECLEN (x, i) - 1; j >= 0; j--)
-	    reload_combine_note_use (&XVECEXP (x, i, j), insn);
+	    reload_combine_note_use (&XVECEXP (x, i, j), insn, ruid,
+				     containing_mem);
 	}
     }
 }
@@ -1227,9 +1570,10 @@ static int move2add_last_label_luid;
    while REG is known to already have value (SYM + offset).
    This function tries to change INSN into an add instruction
    (set (REG) (plus (REG) (OFF - offset))) using the known value.
-   It also updates the information about REG's known value.  */
+   It also updates the information about REG's known value.
+   Return true if we made a change.  */
 
-static void
+static bool
 move2add_use_add2_insn (rtx reg, rtx sym, rtx off, rtx insn)
 {
   rtx pat = PATTERN (insn);
@@ -1238,6 +1582,7 @@ move2add_use_add2_insn (rtx reg, rtx sym
   rtx new_src = gen_int_mode (INTVAL (off) - reg_offset[regno],
 			      GET_MODE (reg));
   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
+  bool changed = false;
 
   /* (set (reg) (plus (reg) (const_int 0))) is not canonical;
      use (set (reg) (reg)) instead.
@@ -1252,13 +1597,13 @@ move2add_use_add2_insn (rtx reg, rtx sym
 	 (reg)), would be discarded.  Maybe we should
 	 try a truncMN pattern?  */
       if (INTVAL (off) == reg_offset [regno])
-	validate_change (insn, &SET_SRC (pat), reg, 0);
+	changed = validate_change (insn, &SET_SRC (pat), reg, 0);
     }
   else if (rtx_cost (new_src, PLUS, speed) < rtx_cost (src, SET, speed)
 	   && have_add2_insn (reg, new_src))
     {
       rtx tem = gen_rtx_PLUS (GET_MODE (reg), reg, new_src);
-      validate_change (insn, &SET_SRC (pat), tem, 0);
+      changed = validate_change (insn, &SET_SRC (pat), tem, 0);
     }
   else if (sym == NULL_RTX && GET_MODE (reg) != BImode)
     {
@@ -1283,8 +1628,9 @@ move2add_use_add2_insn (rtx reg, rtx sym
 			     gen_rtx_STRICT_LOW_PART (VOIDmode,
 						      narrow_reg),
 			     narrow_src);
-	      if (validate_change (insn, &PATTERN (insn),
-				   new_set, 0))
+	      changed = validate_change (insn, &PATTERN (insn),
+					 new_set, 0);
+	      if (changed)
 		break;
 	    }
 	}
@@ -1294,6 +1640,7 @@ move2add_use_add2_insn (rtx reg, rtx sym
   reg_mode[regno] = GET_MODE (reg);
   reg_symbol_ref[regno] = sym;
   reg_offset[regno] = INTVAL (off);
+  return changed;
 }
 
 
@@ -1303,9 +1650,10 @@ move2add_use_add2_insn (rtx reg, rtx sym
    value (SYM + offset) and change INSN into an add instruction
    (set (REG) (plus (the found register) (OFF - offset))) if such
    a register is found.  It also updates the information about
-   REG's known value.  */
+   REG's known value.
+   Return true iff we made a change.  */
 
-static void
+static bool
 move2add_use_add3_insn (rtx reg, rtx sym, rtx off, rtx insn)
 {
   rtx pat = PATTERN (insn);
@@ -1315,6 +1663,7 @@ move2add_use_add3_insn (rtx reg, rtx sym
   int min_regno = 0;
   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
   int i;
+  bool changed = false;
 
   for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
     if (reg_set_luid[i] > move2add_last_label_luid
@@ -1359,20 +1708,25 @@ move2add_use_add3_insn (rtx reg, rtx sym
 				      GET_MODE (reg));
 	  tem = gen_rtx_PLUS (GET_MODE (reg), tem, new_src);
 	}
-      validate_change (insn, &SET_SRC (pat), tem, 0);
+      if (validate_change (insn, &SET_SRC (pat), tem, 0))
+	changed = true;
     }
   reg_set_luid[regno] = move2add_luid;
   reg_base_reg[regno] = -1;
   reg_mode[regno] = GET_MODE (reg);
   reg_symbol_ref[regno] = sym;
   reg_offset[regno] = INTVAL (off);
+  return changed;
 }
 
-static void
+/* Convert move insns with constant inputs to additions if they are cheaper.
+   Return true if any changes were made.  */
+static bool
 reload_cse_move2add (rtx first)
 {
   int i;
   rtx insn;
+  bool changed = false;
 
   for (i = FIRST_PSEUDO_REGISTER - 1; i >= 0; i--)
     {
@@ -1433,7 +1787,7 @@ reload_cse_move2add (rtx first)
 		  && reg_base_reg[regno] < 0
 		  && reg_symbol_ref[regno] == NULL_RTX)
 		{
-		  move2add_use_add2_insn (reg, NULL_RTX, src, insn);
+		  changed |= move2add_use_add2_insn (reg, NULL_RTX, src, insn);
 		  continue;
 		}
 
@@ -1494,6 +1848,7 @@ reload_cse_move2add (rtx first)
 			}
 		      if (success)
 			delete_insn (insn);
+		      changed |= success;
 		      insn = next;
 		      reg_mode[regno] = GET_MODE (reg);
 		      reg_offset[regno] =
@@ -1539,12 +1894,12 @@ reload_cse_move2add (rtx first)
 		  && reg_base_reg[regno] < 0
 		  && reg_symbol_ref[regno] != NULL_RTX
 		  && rtx_equal_p (sym, reg_symbol_ref[regno]))
-		move2add_use_add2_insn (reg, sym, off, insn);
+		changed |= move2add_use_add2_insn (reg, sym, off, insn);
 
 	      /* Otherwise, we have to find a register whose value is sum
 		 of sym and some constant value.  */
 	      else
-		move2add_use_add3_insn (reg, sym, off, insn);
+		changed |= move2add_use_add3_insn (reg, sym, off, insn);
 
 	      continue;
 	    }
@@ -1599,6 +1954,7 @@ reload_cse_move2add (rtx first)
 	    }
 	}
     }
+  return changed;
 }
 
 /* SET is a SET or CLOBBER that sets DST.  DATA is the insn which
Index: testsuite/gcc.target/arm/pr42235.c
===================================================================
--- testsuite/gcc.target/arm/pr42235.c	(revision 0)
+++ testsuite/gcc.target/arm/pr42235.c	(revision 0)
@@ -0,0 +1,11 @@
+/* { dg-options "-mthumb -O2 -march=armv5te" }  */
+/* { dg-require-effective-target arm_thumb1_ok } */
+/* { dg-final { scan-assembler-not "add\[\\t \]*r.,\[\\t \]*r.,\[\\t \]*\#1" } } */
+/* { dg-final { scan-assembler-not "add\[\\t \]*r.,\[\\t \]*\#1" } } */
+
+#include <string.h>
+
+int foo (char *x)
+{
+  memset (x, 0, 6);
+}


* Re: New optimization for reload_combine
  2010-07-16 10:35 New optimization for reload_combine Bernd Schmidt
@ 2010-07-16 18:32 ` Jeff Law
  2010-07-16 23:50   ` Bernd Schmidt
  2010-07-16 19:45 ` Paolo Bonzini
  2010-07-26 10:13 ` IainS
  2 siblings, 1 reply; 49+ messages in thread
From: Jeff Law @ 2010-07-16 18:32 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On 07/16/10 04:34, Bernd Schmidt wrote:
> This fixes one subproblem of PR42235.  In reload_combine, we recognize a
> new pattern:
>
>   REG1 = REG2 + CONST
>   ...
>   something MEM (REG1)
>
> and try to use the addition in the MEM directly, moving the add insn
> downwards if REG1 and REG2 are identical.  This has two benefits: it may
> be possible to eliminate the add, and the load no longer depends on the
> add instruction and may schedule sooner.
>
> This can make the generated code quite a lot nicer:
>
> -       adds    r3, r2, #2
>          strb    r5, [r2, #0]
> -       strb    r5, [r3, #0]
> -       adds    r3, r2, #3
> +       strb    r5, [r2, #2]
>          strb    r5, [r2, #1]
> -       strb    r5, [r3, #0]
> -       adds    r3, r2, #4
> +       strb    r5, [r2, #3]
>          lsrs    r1, r1, #11
> -       strb    r5, [r3, #0]
> -       adds    r3, r2, #5
> +       strb    r5, [r2, #4]
>          add     r1, r1, r1, lsl #1
> -       strb    r5, [r3, #0]
> +       strb    r5, [r2, #5]
>
> Bootstrapped and regression tested on i686-linux.  Also tested on
>      qemu-system-armv7/arch=armv7-a/thumb
>      qemu-system-armv7/thumb
>      qemu-system-armv7
>
> Does postreload.c fall under reload?  If not, would someone approve this?
>    
I'm not sure if it falls under reload or not; regardless, I'll make a 
few comments and approve :-)

+/* Describes a recorded use of a register.  */
+struct reg_use
+{
+  /* The insn where a register has been used.  */
+  rtx insn;
+  /* The reverse uid of the insn.  */
+  int ruid;
+  /* Points to the memory reference enclosing the use, if any, NULL_RTX
+     otherwise.  */
+  rtx containing_mem;
+  /* Location of the register within INSN.  */
+  rtx *usep;
+};
Probably worth moving the int field to the end to avoid potentially 
unnecessary padding if someone adds other int-sized fields in the future.
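
Something like this would do:

  struct reg_use
  {
    /* The insn where a register has been used.  */
    rtx insn;
    /* Points to the memory reference enclosing the use, if any, NULL_RTX
       otherwise.  */
    rtx containing_mem;
    /* Location of the register within INSN.  */
    rtx *usep;
    /* The reverse uid of the insn.  */
    int ruid;
  };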

  static struct
    {
      struct reg_use reg_use[RELOAD_COMBINE_MAX_USES];
      int use_index;
      rtx offset;
+    bool all_offsets_match;
      int store_ruid;
+    int real_store_ruid;
      int use_ruid;
    } reg_state[FIRST_PSEUDO_REGISTER];

Similarly, but in this case I think moving the offset field up and the 
all_offsets_match field down is the way to go.
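
I.e. something like:

  static struct
    {
      struct reg_use reg_use[RELOAD_COMBINE_MAX_USES];
      rtx offset;
      int use_index;
      int store_ruid;
      int real_store_ruid;
      int use_ruid;
      bool all_offsets_match;
    } reg_state[FIRST_PSEUDO_REGISTER];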

We're probably not talking about a huge issue, but we might as well take 
the more compact packing when we can get it.

Otherwise it looks fine.  Pre-approved after juggling the structure orders.

jeff


* Re: New optimization for reload_combine
  2010-07-16 10:35 New optimization for reload_combine Bernd Schmidt
  2010-07-16 18:32 ` Jeff Law
@ 2010-07-16 19:45 ` Paolo Bonzini
  2010-07-16 19:55   ` Bernd Schmidt
  2010-07-19 16:11   ` Mark Mitchell
  2010-07-26 10:13 ` IainS
  2 siblings, 2 replies; 49+ messages in thread
From: Paolo Bonzini @ 2010-07-16 19:45 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On 07/16/2010 12:34 PM, Bernd Schmidt wrote:
> This fixes one subproblem of PR42235.  In reload_combine, we recognize a
> new pattern:
>
>   REG1 = REG2 + CONST
>   ...
>   something MEM (REG1)
>
> and try to use the addition in the MEM directly, moving the add insn
> downwards if REG1 and REG2 are identical.  This has two benefits: it may
> be possible to eliminate the add, and the load no longer depends on the
> add instruction and may schedule sooner.
>
> This can make the generated code quite a lot nicer:
>
> -       adds    r3, r2, #2
>          strb    r5, [r2, #0]
> -       strb    r5, [r3, #0]
> -       adds    r3, r2, #3
> +       strb    r5, [r2, #2]
>          strb    r5, [r2, #1]
> -       strb    r5, [r3, #0]
> -       adds    r3, r2, #4
> +       strb    r5, [r2, #3]
>          lsrs    r1, r1, #11
> -       strb    r5, [r3, #0]
> -       adds    r3, r2, #5
> +       strb    r5, [r2, #4]
>          add     r1, r1, r1, lsl #1
> -       strb    r5, [r3, #0]
> +       strb    r5, [r2, #5]

Nice. :)  I suppose fwprop doesn't do it because the memory accesses are 
not present before reload?

Can you make the change dependent on

   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));

   /* Prefer the new address if it is less expensive.  */
   gain = (address_cost (old_rtx, mode, as, speed)
           - address_cost (new_rtx, mode, as, speed));

   /* If the addresses have equivalent cost, prefer the new address
      if it has the highest `rtx_cost'.  That has the potential of
      eliminating the most insns without additional costs, and it
      is the same that cse.c used to do.  */
   if (gain == 0)
     gain = rtx_cost (new_rtx, SET, speed) - rtx_cost (old_rtx, SET, speed);

   return (gain > 0);

as in fwprop (in turn taken from CSE)?

> Does postreload.c fall under reload?  If not, would someone approve this?

Well, it was part of reload1.c until a while ago...

Paolo


* Re: New optimization for reload_combine
  2010-07-16 19:45 ` Paolo Bonzini
@ 2010-07-16 19:55   ` Bernd Schmidt
  2010-07-16 20:24     ` Paolo Bonzini
  2010-07-19 16:11   ` Mark Mitchell
  1 sibling, 1 reply; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-16 19:55 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: GCC Patches

On 07/16/2010 09:45 PM, Paolo Bonzini wrote:
> On 07/16/2010 12:34 PM, Bernd Schmidt wrote:

>> This can make the generated code quite a lot nicer:
>>
>> -       adds    r3, r2, #2
>>          strb    r5, [r2, #0]
>> -       strb    r5, [r3, #0]
>> -       adds    r3, r2, #3
>> +       strb    r5, [r2, #2]
>>          strb    r5, [r2, #1]
>> -       strb    r5, [r3, #0]
>> -       adds    r3, r2, #4
>> +       strb    r5, [r2, #3]
>>          lsrs    r1, r1, #11
>> -       strb    r5, [r3, #0]
>> -       adds    r3, r2, #5
>> +       strb    r5, [r2, #4]
>>          add     r1, r1, r1, lsl #1
>> -       strb    r5, [r3, #0]
>> +       strb    r5, [r2, #5]
> 
> Nice. :)  I suppose fwprop doesn't do it because the memory accesses are
> not present before reload?

I haven't checked, but I guess in many cases (i.e. anything
stack-relative) that's the case.  Also, some of the adds are produced by
reload_cse_move2add.

> Can you make the change dependent on
> 
>   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
> 
>   /* Prefer the new address if it is less expensive.  */
>   gain = (address_cost (old_rtx, mode, as, speed)
>           - address_cost (new_rtx, mode, as, speed));
> 
>   /* If the addresses have equivalent cost, prefer the new address
>      if it has the highest `rtx_cost'.  That has the potential of
>      eliminating the most insns without additional costs, and it
>      is the same that cse.c used to do.  */
>   if (gain == 0)
>     gain = rtx_cost (new_rtx, SET, speed) - rtx_cost (old_rtx, SET, speed);
> 
>   return (gain > 0);
> 
> as in fwprop (in turn taken from CSE)?

I guess I can use address_cost instead of rtx_cost.  I'll make the
change if it still gives me good results.  The extra rtx_cost check you
quoted doesn't seem worthwhile here, as the goal mentioned in the
comment (i.e., eliminating the add) is achievable only if we do the
replacement every time.
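
Roughly, the comparison for the MEM case would then become something like

  addr_space_t as = MEM_ADDR_SPACE (mem);
  int old_cost = address_cost (oldaddr, GET_MODE (mem), as, speed);
  ...
  new_cost = address_cost (newaddr, GET_MODE (mem), as, speed);
  if (new_cost <= old_cost
      && validate_change (use_insn, &XEXP (mem, 0), newaddr, 0))
    ...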

As a followup, it would be nice to compute all the costs before
replacing anything, and taking into account whether the add is dead.


Bernd


* Re: New optimization for reload_combine
  2010-07-16 19:55   ` Bernd Schmidt
@ 2010-07-16 20:24     ` Paolo Bonzini
  0 siblings, 0 replies; 49+ messages in thread
From: Paolo Bonzini @ 2010-07-16 20:24 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

>> Can you make the change dependent on
>>
>>    bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
>>
>>    /* Prefer the new address if it is less expensive.  */
>>    gain = (address_cost (old_rtx, mode, as, speed)
>>            - address_cost (new_rtx, mode, as, speed));
>>
>>    /* If the addresses have equivalent cost, prefer the new address
>>       if it has the highest `rtx_cost'.  That has the potential of
>>       eliminating the most insns without additional costs, and it
>>       is the same that cse.c used to do.  */
>>    if (gain == 0)
>>      gain = rtx_cost (new_rtx, SET, speed) - rtx_cost (old_rtx, SET, speed);
>>
>>    return (gain>  0);
>>
>> as in fwprop (in turn taken from CSE)?
>
> I guess I can use address_cost instead of rtx_cost.  I'll make the
> change if it still gives me good results.  The extra rtx_cost check you
> quoted doesn't seem worthwhile here, as the goal mentioned in the
> comment (i.e, eliminating the add) is achievable only if we do the
> replacement every time.

That check also covers the default definition of address_cost, which 
gives everything the same cost.  You can of course change "gain > 0" to 
test for >= instead.

Paolo


* Re: New optimization for reload_combine
  2010-07-16 18:32 ` Jeff Law
@ 2010-07-16 23:50   ` Bernd Schmidt
  2010-07-17  2:38     ` H.J. Lu
  0 siblings, 1 reply; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-16 23:50 UTC (permalink / raw)
  To: Jeff Law; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 437 bytes --]

On 07/16/2010 08:32 PM, Jeff Law wrote:
> We're probably not talking about a huge issue, but we might as well take
> the more compact packing when we can get it.
> 
> Otherwise it looks fine.  Pre-approved after juggling the structure orders.

Yay.  Thanks!

Here's the variant that I committed.  It's a good thing Paolo made a
suggestion, as that exposed a new bootstrap failure - we really mustn't
move a stack pointer adjust.
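
(In the attached version this shows up as an extra early-out in
reload_combine_recognize_const_pattern:

  if (!REG_P (reg)
      || hard_regno_nregs[REGNO (reg)][GET_MODE (reg)] != 1
      || GET_MODE (reg) != Pmode
      || reg == stack_pointer_rtx)
    return false;

so stack pointer adjustments are never considered for moving.)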


Bernd

[-- Attachment #2: rc-newpattern2.diff --]
[-- Type: text/plain, Size: 26370 bytes --]

Index: postreload.c
===================================================================
--- postreload.c	(revision 162228)
+++ postreload.c	(working copy)
@@ -56,10 +56,10 @@ static int reload_cse_simplify_set (rtx,
 static int reload_cse_simplify_operands (rtx, rtx);
 
 static void reload_combine (void);
-static void reload_combine_note_use (rtx *, rtx);
+static void reload_combine_note_use (rtx *, rtx, int, rtx);
 static void reload_combine_note_store (rtx, const_rtx, void *);
 
-static void reload_cse_move2add (rtx);
+static bool reload_cse_move2add (rtx);
 static void move2add_note_store (rtx, const_rtx, void *);
 
 /* Call cse / combine like post-reload optimization phases.
@@ -67,11 +67,16 @@ static void move2add_note_store (rtx, co
 void
 reload_cse_regs (rtx first ATTRIBUTE_UNUSED)
 {
+  bool moves_converted;
   reload_cse_regs_1 (first);
   reload_combine ();
-  reload_cse_move2add (first);
+  moves_converted = reload_cse_move2add (first);
   if (flag_expensive_optimizations)
-    reload_cse_regs_1 (first);
+    {
+      if (moves_converted)
+	reload_combine ();
+      reload_cse_regs_1 (first);
+    }
 }
 
 /* See whether a single set SET is a noop.  */
@@ -660,30 +665,43 @@ reload_cse_simplify_operands (rtx insn, 
 
 /* The maximum number of uses of a register we can keep track of to
    replace them with reg+reg addressing.  */
-#define RELOAD_COMBINE_MAX_USES 6
+#define RELOAD_COMBINE_MAX_USES 16
 
-/* INSN is the insn where a register has been used, and USEP points to the
-   location of the register within the rtl.  */
-struct reg_use { rtx insn, *usep; };
+/* Describes a recorded use of a register.  */
+struct reg_use
+{
+  /* The insn where a register has been used.  */
+  rtx insn;
+  /* Points to the memory reference enclosing the use, if any, NULL_RTX
+     otherwise.  */
+  rtx containing_mem;
+  /* Location of the register within INSN.  */
+  rtx *usep;
+  /* The reverse uid of the insn.  */
+  int ruid;
+};
 
 /* If the register is used in some unknown fashion, USE_INDEX is negative.
    If it is dead, USE_INDEX is RELOAD_COMBINE_MAX_USES, and STORE_RUID
-   indicates where it becomes live again.
+   indicates where it is first set or clobbered.
    Otherwise, USE_INDEX is the index of the last encountered use of the
-   register (which is first among these we have seen since we scan backwards),
-   OFFSET contains the constant offset that is added to the register in
-   all encountered uses, and USE_RUID indicates the first encountered, i.e.
-   last, of these uses.
+   register (which is first among these we have seen since we scan backwards).
+   USE_RUID indicates the first encountered, i.e. last, of these uses.
+   If ALL_OFFSETS_MATCH is true, all encountered uses were inside a PLUS
+   with a constant offset; OFFSET contains this constant in that case.
    STORE_RUID is always meaningful if we only want to use a value in a
    register in a different place: it denotes the next insn in the insn
-   stream (i.e. the last encountered) that sets or clobbers the register.  */
+   stream (i.e. the last encountered) that sets or clobbers the register.
+   REAL_STORE_RUID is similar, but clobbers are ignored when updating it.  */
 static struct
   {
     struct reg_use reg_use[RELOAD_COMBINE_MAX_USES];
-    int use_index;
     rtx offset;
+    int use_index;
     int store_ruid;
+    int real_store_ruid;
     int use_ruid;
+    bool all_offsets_match;
   } reg_state[FIRST_PSEUDO_REGISTER];
 
 /* Reverse linear uid.  This is increased in reload_combine while scanning
@@ -694,6 +712,9 @@ static int reload_combine_ruid;
 /* The RUID of the last label we encountered in reload_combine.  */
 static int last_label_ruid;
 
+/* The RUID of the last jump we encountered in reload_combine.  */
+static int last_jump_ruid;
+
 /* The register numbers of the first and last index register.  A value of
    -1 in LAST_INDEX_REG indicates that we've previously computed these
    values and found no suitable index registers.  */
@@ -703,6 +724,311 @@ static int last_index_reg;
 #define LABEL_LIVE(LABEL) \
   (label_live[CODE_LABEL_NUMBER (LABEL) - min_labelno])
 
+/* Subroutine of reload_combine_split_ruids, called to fix up a single
+   ruid pointed to by *PRUID if it is higher than SPLIT_RUID.  */
+
+static inline void
+reload_combine_split_one_ruid (int *pruid, int split_ruid)
+{
+  if (*pruid > split_ruid)
+    (*pruid)++;
+}
+
+/* Called when we insert a new insn in a position we've already passed in
+   the scan.  Examine all our state, increasing all ruids that are higher
+   than SPLIT_RUID by one in order to make room for a new insn.  */
+
+static void
+reload_combine_split_ruids (int split_ruid)
+{
+  unsigned i;
+
+  reload_combine_split_one_ruid (&reload_combine_ruid, split_ruid);
+  reload_combine_split_one_ruid (&last_label_ruid, split_ruid);
+  reload_combine_split_one_ruid (&last_jump_ruid, split_ruid);
+
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+    {
+      int j, idx = reg_state[i].use_index;
+      reload_combine_split_one_ruid (&reg_state[i].use_ruid, split_ruid);
+      reload_combine_split_one_ruid (&reg_state[i].store_ruid, split_ruid);
+      reload_combine_split_one_ruid (&reg_state[i].real_store_ruid,
+				     split_ruid);
+      if (idx < 0)
+	continue;
+      for (j = idx; j < RELOAD_COMBINE_MAX_USES; j++)
+	{
+	  reload_combine_split_one_ruid (&reg_state[i].reg_use[j].ruid,
+					 split_ruid);
+	}
+    }
+}
+
+/* Called when we are about to rescan a previously encountered insn with
+   reload_combine_note_use after modifying some part of it.  This clears all
+   information about uses in that particular insn.  */
+
+static void
+reload_combine_purge_insn_uses (rtx insn)
+{
+  unsigned i;
+
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+    {
+      int j, k, idx = reg_state[i].use_index;
+      if (idx < 0)
+	continue;
+      j = k = RELOAD_COMBINE_MAX_USES;
+      while (j-- > idx)
+	{
+	  if (reg_state[i].reg_use[j].insn != insn)
+	    {
+	      k--;
+	      if (k != j)
+		reg_state[i].reg_use[k] = reg_state[i].reg_use[j];
+	    }
+	}
+      reg_state[i].use_index = k;
+    }
+}
+
+/* Called when we need to forget about all uses of REGNO after an insn
+   which is identified by RUID.  */
+
+static void
+reload_combine_purge_reg_uses_after_ruid (unsigned regno, int ruid)
+{
+  int j, k, idx = reg_state[regno].use_index;
+  if (idx < 0)
+    return;
+  j = k = RELOAD_COMBINE_MAX_USES;
+  while (j-- > idx)
+    {
+      if (reg_state[regno].reg_use[j].ruid >= ruid)
+	{
+	  k--;
+	  if (k != j)
+	    reg_state[regno].reg_use[k] = reg_state[regno].reg_use[j];
+	}
+    }
+  reg_state[regno].use_index = k;
+}
+
+/* Find the use of REGNO with the ruid that is highest among those
+   lower than RUID_LIMIT, and return it if it is the only use of this
+   reg in the insn.  Return NULL otherwise.  */
+
+static struct reg_use *
+reload_combine_closest_single_use (unsigned regno, int ruid_limit)
+{
+  int i, best_ruid = 0;
+  int use_idx = reg_state[regno].use_index;
+  struct reg_use *retval;
+
+  if (use_idx < 0)
+    return NULL;
+  retval = NULL;
+  for (i = use_idx; i < RELOAD_COMBINE_MAX_USES; i++)
+    {
+      int this_ruid = reg_state[regno].reg_use[i].ruid;
+      if (this_ruid >= ruid_limit)
+	continue;
+      if (this_ruid > best_ruid)
+	{
+	  best_ruid = this_ruid;
+	  retval = reg_state[regno].reg_use + i;
+	}
+      else if (this_ruid == best_ruid)
+	retval = NULL;
+    }
+  if (last_label_ruid >= best_ruid)
+    return NULL;
+  return retval;
+}
+
+/* Called by reload_combine when scanning INSN.  This function tries to detect
+   patterns where a constant is added to a register, and the result is used
+   in an address.
+   Return true if no further processing is needed on INSN; false if it wasn't
+   recognized and should be handled normally.  */
+
+static bool
+reload_combine_recognize_const_pattern (rtx insn)
+{
+  int from_ruid = reload_combine_ruid;
+  rtx set, pat, reg, src, addreg;
+  unsigned int regno;
+  struct reg_use *use;
+  bool must_move_add;
+  rtx add_moved_after_insn = NULL_RTX;
+  int add_moved_after_ruid = 0;
+  int clobbered_regno = -1;
+
+  set = single_set (insn);
+  if (set == NULL_RTX)
+    return false;
+
+  reg = SET_DEST (set);
+  src = SET_SRC (set);
+  if (!REG_P (reg)
+      || hard_regno_nregs[REGNO (reg)][GET_MODE (reg)] != 1
+      || GET_MODE (reg) != Pmode
+      || reg == stack_pointer_rtx)
+    return false;
+
+  regno = REGNO (reg);
+
+  /* We look for a REG1 = REG2 + CONSTANT insn, followed by either
+     uses of REG1 inside an address, or inside another add insn.  If
+     possible and profitable, merge the addition into subsequent
+     uses.  */
+  if (GET_CODE (src) != PLUS
+      || !REG_P (XEXP (src, 0))
+      || !CONSTANT_P (XEXP (src, 1)))
+    return false;
+
+  addreg = XEXP (src, 0);
+  must_move_add = rtx_equal_p (reg, addreg);
+
+  pat = PATTERN (insn);
+  if (must_move_add && set != pat)
+    {
+      /* We have to be careful when moving the add; apart from the
+	 single_set there may also be clobbers.  Recognize one special
+	 case, that of one clobber alongside the set (likely a clobber
+	 of the CC register).  */
+      gcc_assert (GET_CODE (PATTERN (insn)) == PARALLEL);
+      if (XVECLEN (pat, 0) != 2 || XVECEXP (pat, 0, 0) != set
+	  || GET_CODE (XVECEXP (pat, 0, 1)) != CLOBBER
+	  || !REG_P (XEXP (XVECEXP (pat, 0, 1), 0)))
+	return false;
+      clobbered_regno = REGNO (XEXP (XVECEXP (pat, 0, 1), 0));
+    }
+
+  do
+    {
+      use = reload_combine_closest_single_use (regno, from_ruid);
+
+      if (use)
+	/* Start the search for the next use from here.  */
+	from_ruid = use->ruid;
+
+      if (use && GET_MODE (*use->usep) == Pmode)
+	{
+	  rtx use_insn = use->insn;
+	  int use_ruid = use->ruid;
+	  rtx mem = use->containing_mem;
+	  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn));
+
+	  /* Avoid moving the add insn past a jump.  */
+	  if (must_move_add && use_ruid < last_jump_ruid)
+	    break;
+
+	  /* If the add clobbers another hard reg in parallel, don't move
+	     it past a real set of this hard reg.  */
+	  if (must_move_add && clobbered_regno >= 0
+	      && reg_state[clobbered_regno].real_store_ruid >= use_ruid)
+	    break;
+
+	  /* Avoid moving a use of ADDREG past a point where it
+	     is stored.  */
+	  if (reg_state[REGNO (addreg)].store_ruid >= use_ruid)
+	    break;
+
+	  if (mem != NULL_RTX)
+	    {
+	      addr_space_t as = MEM_ADDR_SPACE (mem);
+	      rtx oldaddr = XEXP (mem, 0);
+	      rtx newaddr = NULL_RTX;
+	      int old_cost = address_cost (oldaddr, GET_MODE (mem), as, speed);
+	      int new_cost;
+
+	      newaddr = simplify_replace_rtx (oldaddr, reg, copy_rtx (src));
+	      if (memory_address_addr_space_p (GET_MODE (mem), newaddr, as))
+		{
+		  XEXP (mem, 0) = newaddr;
+		  new_cost = address_cost (newaddr, GET_MODE (mem), as, speed);
+		  XEXP (mem, 0) = oldaddr;
+		  if (new_cost <= old_cost
+		      && validate_change (use_insn,
+					  &XEXP (mem, 0), newaddr, 0))
+		    {
+		      reload_combine_purge_insn_uses (use_insn);
+		      reload_combine_note_use (&PATTERN (use_insn), use_insn,
+					       use_ruid, NULL_RTX);
+
+		      if (must_move_add)
+			{
+			  add_moved_after_insn = use_insn;
+			  add_moved_after_ruid = use_ruid;
+			}
+		      continue;
+		    }
+		}
+	    }
+	  else
+	    {
+	      rtx new_set = single_set (use_insn);
+	      if (new_set
+		  && REG_P (SET_DEST (new_set))
+		  && GET_CODE (SET_SRC (new_set)) == PLUS
+		  && REG_P (XEXP (SET_SRC (new_set), 0))
+		  && CONSTANT_P (XEXP (SET_SRC (new_set), 1)))
+		{
+		  rtx new_src;
+		  int old_cost = rtx_cost (SET_SRC (new_set), SET, speed);
+
+		  gcc_assert (rtx_equal_p (XEXP (SET_SRC (new_set), 0), reg));
+		  new_src = simplify_replace_rtx (SET_SRC (new_set), reg,
+						  copy_rtx (src));
+
+		  if (rtx_cost (new_src, SET, speed) <= old_cost
+		      && validate_change (use_insn, &SET_SRC (new_set),
+					  new_src, 0))
+		    {
+		      reload_combine_purge_insn_uses (use_insn);
+		      reload_combine_note_use (&SET_SRC (new_set), use_insn,
+					       use_ruid, NULL_RTX);
+
+		      if (must_move_add)
+			{
+			  /* See if that took care of the add insn.  */
+			  if (rtx_equal_p (SET_DEST (new_set), reg))
+			    {
+			      delete_insn (insn);
+			      return true;
+			    }
+			  else
+			    {
+			      add_moved_after_insn = use_insn;
+			      add_moved_after_ruid = use_ruid;
+			    }
+			}
+		      continue;
+		    }
+		}
+	    }
+	  /* If we get here, we couldn't handle this use.  */
+	  if (must_move_add)
+	    break;
+	}
+    }
+  while (use);
+
+  if (!must_move_add || add_moved_after_insn == NULL_RTX)
+    /* Process the add normally.  */
+    return false;
+
+  reorder_insns (insn, insn, add_moved_after_insn);
+  reload_combine_purge_reg_uses_after_ruid (regno, add_moved_after_ruid);
+  reload_combine_split_ruids (add_moved_after_ruid - 1);
+  reload_combine_note_use (&PATTERN (insn), insn,
+			   add_moved_after_ruid, NULL_RTX);
+  reg_state[regno].store_ruid = add_moved_after_ruid;
+
+  return true;
+}
+
 /* Called by reload_combine when scanning INSN.  Try to detect a pattern we
    can handle and improve.  Return true if no further processing is needed on
    INSN; false if it wasn't recognized and should be handled normally.  */
@@ -713,6 +1039,18 @@ reload_combine_recognize_pattern (rtx in
   rtx set, reg, src;
   unsigned int regno;
 
+  set = single_set (insn);
+  if (set == NULL_RTX)
+    return false;
+
+  reg = SET_DEST (set);
+  src = SET_SRC (set);
+  if (!REG_P (reg)
+      || hard_regno_nregs[REGNO (reg)][GET_MODE (reg)] != 1)
+    return false;
+
+  regno = REGNO (reg);
+
   /* Look for (set (REGX) (CONST_INT))
      (set (REGX) (PLUS (REGX) (REGY)))
      ...
@@ -726,19 +1064,10 @@ reload_combine_recognize_pattern (rtx in
      and that we know all uses of REGX before it dies.
      Also, explicitly check that REGX != REGY; our life information
      does not yet show whether REGY changes in this insn.  */
-  set = single_set (insn);
-  if (set == NULL_RTX)
-    return false;
-
-  reg = SET_DEST (set);
-  src = SET_SRC (set);
-  if (!REG_P (reg)
-      || hard_regno_nregs[REGNO (reg)][GET_MODE (reg)] != 1)
-    return false;
-
-  regno = REGNO (reg);
 
   if (GET_CODE (src) == PLUS
+      && reg_state[regno].all_offsets_match
+      && last_index_reg != -1
       && REG_P (XEXP (src, 1))
       && rtx_equal_p (XEXP (src, 0), reg)
       && !rtx_equal_p (XEXP (src, 1), reg)
@@ -822,7 +1151,9 @@ reload_combine_recognize_pattern (rtx in
 		   i < RELOAD_COMBINE_MAX_USES; i++)
 		reload_combine_note_use
 		  (&XEXP (*reg_state[regno].reg_use[i].usep, 1),
-		   reg_state[regno].reg_use[i].insn);
+		   reg_state[regno].reg_use[i].insn,
+		   reg_state[regno].reg_use[i].ruid,
+		   reg_state[regno].reg_use[i].containing_mem);
 
 	      if (reg_state[REGNO (base)].use_ruid
 		  > reg_state[regno].use_ruid)
@@ -850,7 +1181,7 @@ reload_combine_recognize_pattern (rtx in
 static void
 reload_combine (void)
 {
-  rtx insn;
+  rtx insn, prev;
   int i;
   basic_block bb;
   unsigned int r;
@@ -881,11 +1212,13 @@ reload_combine (void)
 	}
     }
 
+#if 0
   /* If reg+reg can be used in offsetable memory addresses, the main chunk of
      reload has already used it where appropriate, so there is no use in
      trying to generate it now.  */
   if (double_reg_address_ok || last_index_reg == -1)
     return;
+#endif
 
   /* Set up LABEL_LIVE and EVER_LIVE_AT_START.  The register lifetime
      information is a bit fuzzy immediately after reload, but it's
@@ -912,20 +1245,23 @@ reload_combine (void)
     }
 
   /* Initialize last_label_ruid, reload_combine_ruid and reg_state.  */
-  last_label_ruid = reload_combine_ruid = 0;
+  last_label_ruid = last_jump_ruid = reload_combine_ruid = 0;
   for (r = 0; r < FIRST_PSEUDO_REGISTER; r++)
     {
-      reg_state[r].store_ruid = reload_combine_ruid;
+      reg_state[r].store_ruid = 0;
+      reg_state[r].real_store_ruid = 0;
       if (fixed_regs[r])
 	reg_state[r].use_index = -1;
       else
 	reg_state[r].use_index = RELOAD_COMBINE_MAX_USES;
     }
 
-  for (insn = get_last_insn (); insn; insn = PREV_INSN (insn))
+  for (insn = get_last_insn (); insn; insn = prev)
     {
       rtx note;
 
+      prev = PREV_INSN (insn);
+
       /* We cannot do our optimization across labels.  Invalidating all the use
 	 information we have would be costly, so we just note where the label
 	 is and then later disable any optimization that would cross it.  */
@@ -941,7 +1277,11 @@ reload_combine (void)
 
       reload_combine_ruid++;
 
-      if (reload_combine_recognize_pattern (insn))
+      if (JUMP_P (insn))
+	last_jump_ruid = reload_combine_ruid;
+
+      if (reload_combine_recognize_const_pattern (insn)
+	  || reload_combine_recognize_pattern (insn))
 	continue;
 
       note_stores (PATTERN (insn), reload_combine_note_store, NULL);
@@ -998,7 +1338,8 @@ reload_combine (void)
 	      reg_state[i].use_index = -1;
 	}
 
-      reload_combine_note_use (&PATTERN (insn), insn);
+      reload_combine_note_use (&PATTERN (insn), insn,
+			       reload_combine_ruid, NULL_RTX);
       for (note = REG_NOTES (insn); note; note = XEXP (note, 1))
 	{
 	  if (REG_NOTE_KIND (note) == REG_INC
@@ -1007,6 +1348,7 @@ reload_combine (void)
 	      int regno = REGNO (XEXP (note, 0));
 
 	      reg_state[regno].store_ruid = reload_combine_ruid;
+	      reg_state[regno].real_store_ruid = reload_combine_ruid;
 	      reg_state[regno].use_index = -1;
 	    }
 	}
@@ -1016,8 +1358,8 @@ reload_combine (void)
 }
 
 /* Check if DST is a register or a subreg of a register; if it is,
-   update reg_state[regno].store_ruid and reg_state[regno].use_index
-   accordingly.  Called via note_stores from reload_combine.  */
+   update store_ruid, real_store_ruid and use_index in the reg_state
+   structure accordingly.  Called via note_stores from reload_combine.  */
 
 static void
 reload_combine_note_store (rtx dst, const_rtx set, void *data ATTRIBUTE_UNUSED)
@@ -1041,14 +1383,14 @@ reload_combine_note_store (rtx dst, cons
   /* note_stores might have stripped a STRICT_LOW_PART, so we have to be
      careful with registers / register parts that are not full words.
      Similarly for ZERO_EXTRACT.  */
-  if (GET_CODE (set) != SET
-      || GET_CODE (SET_DEST (set)) == ZERO_EXTRACT
+  if (GET_CODE (SET_DEST (set)) == ZERO_EXTRACT
       || GET_CODE (SET_DEST (set)) == STRICT_LOW_PART)
     {
       for (i = hard_regno_nregs[regno][mode] - 1 + regno; i >= regno; i--)
 	{
 	  reg_state[i].use_index = -1;
 	  reg_state[i].store_ruid = reload_combine_ruid;
+	  reg_state[i].real_store_ruid = reload_combine_ruid;
 	}
     }
   else
@@ -1056,6 +1398,8 @@ reload_combine_note_store (rtx dst, cons
       for (i = hard_regno_nregs[regno][mode] - 1 + regno; i >= regno; i--)
 	{
 	  reg_state[i].store_ruid = reload_combine_ruid;
+	  if (GET_CODE (set) == SET)
+	    reg_state[i].real_store_ruid = reload_combine_ruid;
 	  reg_state[i].use_index = RELOAD_COMBINE_MAX_USES;
 	}
     }
@@ -1066,7 +1410,7 @@ reload_combine_note_store (rtx dst, cons
    *XP is the pattern of INSN, or a part of it.
    Called from reload_combine, and recursively by itself.  */
 static void
-reload_combine_note_use (rtx *xp, rtx insn)
+reload_combine_note_use (rtx *xp, rtx insn, int ruid, rtx containing_mem)
 {
   rtx x = *xp;
   enum rtx_code code = x->code;
@@ -1079,7 +1423,7 @@ reload_combine_note_use (rtx *xp, rtx in
     case SET:
       if (REG_P (SET_DEST (x)))
 	{
-	  reload_combine_note_use (&SET_SRC (x), insn);
+	  reload_combine_note_use (&SET_SRC (x), insn, ruid, NULL_RTX);
 	  return;
 	}
       break;
@@ -1143,29 +1487,28 @@ reload_combine_note_use (rtx *xp, rtx in
 	if (use_index < 0)
 	  return;
 
-	if (use_index != RELOAD_COMBINE_MAX_USES - 1)
-	  {
-	    /* We have found another use for a register that is already
-	       used later.  Check if the offsets match; if not, mark the
-	       register as used in an unknown fashion.  */
-	    if (! rtx_equal_p (offset, reg_state[regno].offset))
-	      {
-		reg_state[regno].use_index = -1;
-		return;
-	      }
-	  }
-	else
+	if (use_index == RELOAD_COMBINE_MAX_USES - 1)
 	  {
 	    /* This is the first use of this register we have seen since we
 	       marked it as dead.  */
 	    reg_state[regno].offset = offset;
-	    reg_state[regno].use_ruid = reload_combine_ruid;
+	    reg_state[regno].all_offsets_match = true;
+	    reg_state[regno].use_ruid = ruid;
 	  }
+	else if (! rtx_equal_p (offset, reg_state[regno].offset))
+	  reg_state[regno].all_offsets_match = false;
+
 	reg_state[regno].reg_use[use_index].insn = insn;
+	reg_state[regno].reg_use[use_index].ruid = ruid;
+	reg_state[regno].reg_use[use_index].containing_mem = containing_mem;
 	reg_state[regno].reg_use[use_index].usep = xp;
 	return;
       }
 
+    case MEM:
+      containing_mem = x;
+      break;
+
     default:
       break;
     }
@@ -1175,11 +1518,12 @@ reload_combine_note_use (rtx *xp, rtx in
   for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
     {
       if (fmt[i] == 'e')
-	reload_combine_note_use (&XEXP (x, i), insn);
+	reload_combine_note_use (&XEXP (x, i), insn, ruid, containing_mem);
       else if (fmt[i] == 'E')
 	{
 	  for (j = XVECLEN (x, i) - 1; j >= 0; j--)
-	    reload_combine_note_use (&XVECEXP (x, i, j), insn);
+	    reload_combine_note_use (&XVECEXP (x, i, j), insn, ruid,
+				     containing_mem);
 	}
     }
 }
@@ -1227,9 +1571,10 @@ static int move2add_last_label_luid;
    while REG is known to already have value (SYM + offset).
    This function tries to change INSN into an add instruction
    (set (REG) (plus (REG) (OFF - offset))) using the known value.
-   It also updates the information about REG's known value.  */
+   It also updates the information about REG's known value.
+   Return true if we made a change.  */
 
-static void
+static bool
 move2add_use_add2_insn (rtx reg, rtx sym, rtx off, rtx insn)
 {
   rtx pat = PATTERN (insn);
@@ -1238,6 +1583,7 @@ move2add_use_add2_insn (rtx reg, rtx sym
   rtx new_src = gen_int_mode (INTVAL (off) - reg_offset[regno],
 			      GET_MODE (reg));
   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
+  bool changed = false;
 
   /* (set (reg) (plus (reg) (const_int 0))) is not canonical;
      use (set (reg) (reg)) instead.
@@ -1252,13 +1598,13 @@ move2add_use_add2_insn (rtx reg, rtx sym
 	 (reg)), would be discarded.  Maybe we should
 	 try a truncMN pattern?  */
       if (INTVAL (off) == reg_offset [regno])
-	validate_change (insn, &SET_SRC (pat), reg, 0);
+	changed = validate_change (insn, &SET_SRC (pat), reg, 0);
     }
   else if (rtx_cost (new_src, PLUS, speed) < rtx_cost (src, SET, speed)
 	   && have_add2_insn (reg, new_src))
     {
       rtx tem = gen_rtx_PLUS (GET_MODE (reg), reg, new_src);
-      validate_change (insn, &SET_SRC (pat), tem, 0);
+      changed = validate_change (insn, &SET_SRC (pat), tem, 0);
     }
   else if (sym == NULL_RTX && GET_MODE (reg) != BImode)
     {
@@ -1283,8 +1629,9 @@ move2add_use_add2_insn (rtx reg, rtx sym
 			     gen_rtx_STRICT_LOW_PART (VOIDmode,
 						      narrow_reg),
 			     narrow_src);
-	      if (validate_change (insn, &PATTERN (insn),
-				   new_set, 0))
+	      changed = validate_change (insn, &PATTERN (insn),
+					 new_set, 0);
+	      if (changed)
 		break;
 	    }
 	}
@@ -1294,6 +1641,7 @@ move2add_use_add2_insn (rtx reg, rtx sym
   reg_mode[regno] = GET_MODE (reg);
   reg_symbol_ref[regno] = sym;
   reg_offset[regno] = INTVAL (off);
+  return changed;
 }
 
 
@@ -1303,9 +1651,10 @@ move2add_use_add2_insn (rtx reg, rtx sym
    value (SYM + offset) and change INSN into an add instruction
    (set (REG) (plus (the found register) (OFF - offset))) if such
    a register is found.  It also updates the information about
-   REG's known value.  */
+   REG's known value.
+   Return true iff we made a change.  */
 
-static void
+static bool
 move2add_use_add3_insn (rtx reg, rtx sym, rtx off, rtx insn)
 {
   rtx pat = PATTERN (insn);
@@ -1315,6 +1664,7 @@ move2add_use_add3_insn (rtx reg, rtx sym
   int min_regno = 0;
   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
   int i;
+  bool changed = false;
 
   for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
     if (reg_set_luid[i] > move2add_last_label_luid
@@ -1359,20 +1709,25 @@ move2add_use_add3_insn (rtx reg, rtx sym
 				      GET_MODE (reg));
 	  tem = gen_rtx_PLUS (GET_MODE (reg), tem, new_src);
 	}
-      validate_change (insn, &SET_SRC (pat), tem, 0);
+      if (validate_change (insn, &SET_SRC (pat), tem, 0))
+	changed = true;
     }
   reg_set_luid[regno] = move2add_luid;
   reg_base_reg[regno] = -1;
   reg_mode[regno] = GET_MODE (reg);
   reg_symbol_ref[regno] = sym;
   reg_offset[regno] = INTVAL (off);
+  return changed;
 }
 
-static void
+/* Convert move insns with constant inputs to additions if they are cheaper.
+   Return true if any changes were made.  */
+static bool
 reload_cse_move2add (rtx first)
 {
   int i;
   rtx insn;
+  bool changed = false;
 
   for (i = FIRST_PSEUDO_REGISTER - 1; i >= 0; i--)
     {
@@ -1433,7 +1788,7 @@ reload_cse_move2add (rtx first)
 		  && reg_base_reg[regno] < 0
 		  && reg_symbol_ref[regno] == NULL_RTX)
 		{
-		  move2add_use_add2_insn (reg, NULL_RTX, src, insn);
+		  changed |= move2add_use_add2_insn (reg, NULL_RTX, src, insn);
 		  continue;
 		}
 
@@ -1494,6 +1849,7 @@ reload_cse_move2add (rtx first)
 			}
 		      if (success)
 			delete_insn (insn);
+		      changed |= success;
 		      insn = next;
 		      reg_mode[regno] = GET_MODE (reg);
 		      reg_offset[regno] =
@@ -1539,12 +1895,12 @@ reload_cse_move2add (rtx first)
 		  && reg_base_reg[regno] < 0
 		  && reg_symbol_ref[regno] != NULL_RTX
 		  && rtx_equal_p (sym, reg_symbol_ref[regno]))
-		move2add_use_add2_insn (reg, sym, off, insn);
+		changed |= move2add_use_add2_insn (reg, sym, off, insn);
 
 	      /* Otherwise, we have to find a register whose value is sum
 		 of sym and some constant value.  */
 	      else
-		move2add_use_add3_insn (reg, sym, off, insn);
+		changed |= move2add_use_add3_insn (reg, sym, off, insn);
 
 	      continue;
 	    }
@@ -1599,6 +1955,7 @@ reload_cse_move2add (rtx first)
 	    }
 	}
     }
+  return changed;
 }
 
 /* SET is a SET or CLOBBER that sets DST.  DATA is the insn which

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-16 23:50   ` Bernd Schmidt
@ 2010-07-17  2:38     ` H.J. Lu
  2010-07-17 14:25       ` Bernd Schmidt
  2010-11-04  4:30       ` H.J. Lu
  0 siblings, 2 replies; 49+ messages in thread
From: H.J. Lu @ 2010-07-17  2:38 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jeff Law, GCC Patches

On Fri, Jul 16, 2010 at 4:49 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 07/16/2010 08:32 PM, Jeff Law wrote:
>> We're probably not talking about a huge issue, but we might as well take
>> the more compact packing when we can get it.
>>
>> Otherwise it looks fine.  Pre-approved after juggling the structure orders.
>
> Yay.  Thanks!
>
> Here's the variant that I committed.  It's a good thing Paolo made a
> suggestion, as that exposed a new bootstrap failure - we really mustn't
> move a stack pointer adjust.
>

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44970


-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17  2:38     ` H.J. Lu
@ 2010-07-17 14:25       ` Bernd Schmidt
  2010-07-17 15:03         ` H.J. Lu
                           ` (2 more replies)
  2010-11-04  4:30       ` H.J. Lu
  1 sibling, 3 replies; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-17 14:25 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jeff Law, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1186 bytes --]

On 07/17/2010 04:38 AM, H.J. Lu wrote:
> This caused:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44970

Apparently, the sse_prologue_save_insn is broken.

diff -dru old/nest-stdar-1.s new/nest-stdar-1.s
--- old/nest-stdar-1.s	2010-07-17 14:10:40.308605357 +0000
+++ new/nest-stdar-1.s	2010-07-17 14:00:30.592312121 +0000
@@ -9,25 +9,24 @@
 	subq	$48, %rsp
 	.cfi_def_cfa_offset 56
 	leaq	0(,%rax,4), %rcx
-	leaq	39(%rsp), %rdx
 	movl	$.L2, %eax
 	subq	%rcx, %rax
 	jmp	*%rax
-	movaps	%xmm7, -15(%rdx)
-	movaps	%xmm6, -31(%rdx)
-	movaps	%xmm5, -47(%rdx)
-	movaps	%xmm4, -63(%rdx)
-	movaps	%xmm3, -79(%rdx)
-	movaps	%xmm2, -95(%rdx)
-	movaps	%xmm1, -111(%rdx)
-	movaps	%xmm0, -127(%rdx)
+	movaps	%xmm7, 24(%rsp)
+	movaps	%xmm6, 8(%rsp)
+	movaps	%xmm5, -8(%rsp)
+	movaps	%xmm4, -24(%rsp)
+	movaps	%xmm3, -40(%rsp)
+	movaps	%xmm2, -56(%rsp)
+	movaps	%xmm1, -72(%rsp)
+	movaps	%xmm0, -88(%rsp)

It's implementing a crazy jump table, which requires that all insns have
the same length, which in turn requires that no one modifies the address
in the pattern.
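
To make the length constraint concrete, here is a minimal sketch (made-up
names, not GCC code) of the address arithmetic the computed jump relies on:

/* The prologue jumps to ".L2 - 4 * nregs", entering the run of movaps
   stores exactly nregs instructions before the label.  That only works
   if every store has the same known length (4 bytes here, 5 with AVX),
   so nothing may rewrite the addresses in the pattern.  Illustrative
   sketch only; SAVE_INSN_LEN and jump_target are invented names.  */
enum { SAVE_INSN_LEN = 4 };

static const char *
jump_target (const char *label, unsigned nregs)
{
  return label - SAVE_INSN_LEN * nregs;
}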

I can fix this testcase with the patch below, but I'll leave it for the
x86 maintainers to choose this fix or another.


Bernd

[-- Attachment #2: crazy.diff --]
[-- Type: text/plain, Size: 2702 bytes --]

Index: i386.md
===================================================================
--- i386.md	(revision 162146)
+++ i386.md	(working copy)
@@ -17944,7 +17944,7 @@ (define_split
 
 (define_insn "sse_prologue_save_insn"
   [(set (mem:BLK (plus:DI (match_operand:DI 0 "register_operand" "R")
-			  (match_operand:DI 4 "const_int_operand" "n")))
+			  (const_int -127)))
 	(unspec:BLK [(reg:DI XMM0_REG)
 		     (reg:DI XMM1_REG)
 		     (reg:DI XMM2_REG)
@@ -17956,14 +17956,12 @@ (define_insn "sse_prologue_save_insn"
    (use (match_operand:DI 1 "register_operand" "r"))
    (use (match_operand:DI 2 "const_int_operand" "i"))
    (use (label_ref:DI (match_operand 3 "" "X")))
-   (use (match_operand:DI 5 "const_int_operand" "i"))]
-  "TARGET_64BIT
-   && INTVAL (operands[4]) + X86_64_SSE_REGPARM_MAX * 16 - 16 < 128
-   && INTVAL (operands[4]) + INTVAL (operands[2]) * 16 >= -128"
+   (use (match_operand:DI 4 "const_int_operand" "i"))]
+  "TARGET_64BIT"
 {
   int i;
   operands[0] = gen_rtx_MEM (Pmode,
-			     gen_rtx_PLUS (Pmode, operands[0], operands[4]));
+			     gen_rtx_PLUS (Pmode, operands[0], GEN_INT (-127)));
   /* VEX instruction with a REX prefix will #UD.  */
   if (TARGET_AVX && GET_CODE (XEXP (operands[0], 0)) != PLUS)
     gcc_unreachable ();
@@ -17971,15 +17969,16 @@ (define_insn "sse_prologue_save_insn"
   output_asm_insn ("jmp\t%A1", operands);
   for (i = X86_64_SSE_REGPARM_MAX - 1; i >= INTVAL (operands[2]); i--)
     {
-      operands[4] = adjust_address (operands[0], DImode, i*16);
-      operands[5] = gen_rtx_REG (TImode, SSE_REGNO (i));
-      PUT_MODE (operands[4], TImode);
+      rtx xops[2];
+      xops[0] = adjust_address (operands[0], DImode, i*16);
+      xops[1] = gen_rtx_REG (TImode, SSE_REGNO (i));
+      PUT_MODE (operands[0], TImode);
       if (GET_CODE (XEXP (operands[0], 0)) != PLUS)
         output_asm_insn ("rex", operands);
       if (crtl->stack_alignment_needed < 128)
-        output_asm_insn ("%vmovsd\t{%5, %4|%4, %5}", operands);
+        output_asm_insn ("%vmovsd\t{%1, %0|%0, %1}", xops);
       else
-        output_asm_insn ("%vmovaps\t{%5, %4|%4, %5}", operands);
+        output_asm_insn ("%vmovaps\t{%1, %0|%0, %1}", xops);
     }
   targetm.asm_out.internal_label (asm_out_file, "L",
 				  CODE_LABEL_NUMBER (operands[3]));
@@ -17991,7 +17990,7 @@ (define_insn "sse_prologue_save_insn"
    ;; 2 bytes for jump and opernds[4] bytes for each save.
    (set (attr "length")
      (plus (const_int 2)
-	   (mult (symbol_ref ("INTVAL (operands[5])"))
+	   (mult (symbol_ref ("INTVAL (operands[4])"))
 		 (symbol_ref ("X86_64_SSE_REGPARM_MAX - INTVAL (operands[2])")))))
    (set_attr "memory" "store")
    (set_attr "modrm" "0")

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 14:25       ` Bernd Schmidt
@ 2010-07-17 15:03         ` H.J. Lu
  2010-07-17 15:10           ` IainS
                             ` (2 more replies)
  2010-07-19 15:41         ` x86_64 varargs setup jump table Richard Henderson
  2010-07-19 16:19         ` New optimization for reload_combine Jeff Law
  2 siblings, 3 replies; 49+ messages in thread
From: H.J. Lu @ 2010-07-17 15:03 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jeff Law, GCC Patches

On Sat, Jul 17, 2010 at 7:25 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 07/17/2010 04:38 AM, H.J. Lu wrote:
>> This caused:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44970
>
> Apparently, the sse_prologue_save_insn is broken.
>

It is more than that. It failed to bootstrap on Linux/ia32 when
configured with

--enable-clocale=gnu --with-system-zlib --enable-shared
--with-demangler-in-ld  --with-fpmath=sse

-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 15:03         ` H.J. Lu
@ 2010-07-17 15:10           ` IainS
  2010-07-17 16:00           ` Bernd Schmidt
  2010-07-19 10:01           ` Bernd Schmidt
  2 siblings, 0 replies; 49+ messages in thread
From: IainS @ 2010-07-17 15:10 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Bernd Schmidt, Jeff Law, GCC Patches


On 17 Jul 2010, at 16:03, H.J. Lu wrote:

> On Sat, Jul 17, 2010 at 7:25 AM, Bernd Schmidt <bernds@codesourcery.com 
> > wrote:
>> On 07/17/2010 04:38 AM, H.J. Lu wrote:
>>> This caused:
>>>
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44970
>>
>> Apparently, the sse_prologue_save_insn is broken.
>>
>
> It is more than that. It failed to boostrap on Linux/ia32 when
> configured with
>
> --enable-clocale=gnu --with-system-zlib --enable-shared
> --with-demangler-in-ld  --with-fpmath=sse

I think it breaks powerpc bootstrap too.
Iain


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 15:03         ` H.J. Lu
  2010-07-17 15:10           ` IainS
@ 2010-07-17 16:00           ` Bernd Schmidt
  2010-07-17 16:15             ` H.J. Lu
  2010-07-18 18:15             ` H.J. Lu
  2010-07-19 10:01           ` Bernd Schmidt
  2 siblings, 2 replies; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-17 16:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jeff Law, GCC Patches

On 07/17/2010 05:03 PM, H.J. Lu wrote:
> It is more than that. It failed to boostrap on Linux/ia32 when
> configured with
> 
> --enable-clocale=gnu --with-system-zlib --enable-shared
> --with-demangler-in-ld  --with-fpmath=sse

I can't seem to reproduce this.  Is that the full command line?


Bernd


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 16:00           ` Bernd Schmidt
@ 2010-07-17 16:15             ` H.J. Lu
  2010-07-17 17:18               ` Bernd Schmidt
  2010-07-18 18:15             ` H.J. Lu
  1 sibling, 1 reply; 49+ messages in thread
From: H.J. Lu @ 2010-07-17 16:15 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jeff Law, GCC Patches

On Sat, Jul 17, 2010 at 8:59 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 07/17/2010 05:03 PM, H.J. Lu wrote:
>> It is more than that. It failed to boostrap on Linux/ia32 when
>> configured with
>>
>> --enable-clocale=gnu --with-system-zlib --enable-shared
>> --with-demangler-in-ld  --with-fpmath=sse
>
> I can't seem to reproduce this.  Is that the full command line?
>
>

I used:

../src-trunk/configure \
		 --enable-clocale=gnu --with-system-zlib --enable-shared --with-
demangler-in-ld -with-plugin-ld=ld.gold --enable-gold --with-fpmath=sse

on Fedora 12/ia32.


-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 16:15             ` H.J. Lu
@ 2010-07-17 17:18               ` Bernd Schmidt
  2010-07-17 17:27                 ` H.J. Lu
  2010-07-17 17:29                 ` IainS
  0 siblings, 2 replies; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-17 17:18 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jeff Law, GCC Patches

On 07/17/2010 06:14 PM, H.J. Lu wrote:
> On Sat, Jul 17, 2010 at 8:59 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>> On 07/17/2010 05:03 PM, H.J. Lu wrote:
>>> It is more than that. It failed to boostrap on Linux/ia32 when
>>> configured with
>>>
>>> --enable-clocale=gnu --with-system-zlib --enable-shared
>>> --with-demangler-in-ld  --with-fpmath=sse
>>
>> I can't seem to reproduce this.  Is that the full command line?
>>
>>
> 
> I used:
> 
> ../src-trunk/configure \
> 		 --enable-clocale=gnu --with-system-zlib --enable-shared --with-
> demangler-in-ld -with-plugin-ld=ld.gold --enable-gold --with-fpmath=sse
> 
> on Fedora 12/ia32.

I'm on Gentoo, without gold - not sure whether that made a difference,
but I'm not seeing these failures.  I don't have access to SPEC2k6
either.  Can you isolate any testcases?


Bernd

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 17:18               ` Bernd Schmidt
@ 2010-07-17 17:27                 ` H.J. Lu
  2010-07-17 17:29                 ` IainS
  1 sibling, 0 replies; 49+ messages in thread
From: H.J. Lu @ 2010-07-17 17:27 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jeff Law, GCC Patches

On Sat, Jul 17, 2010 at 10:17 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 07/17/2010 06:14 PM, H.J. Lu wrote:
>> On Sat, Jul 17, 2010 at 8:59 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>>> On 07/17/2010 05:03 PM, H.J. Lu wrote:
>>>> It is more than that. It failed to boostrap on Linux/ia32 when
>>>> configured with
>>>>
>>>> --enable-clocale=gnu --with-system-zlib --enable-shared
>>>> --with-demangler-in-ld  --with-fpmath=sse
>>>
>>> I can't seem to reproduce this.  Is that the full command line?
>>>
>>>
>>
>> I used:
>>
>> ../src-trunk/configure \
>>                --enable-clocale=gnu --with-system-zlib --enable-shared --with-
>> demangler-in-ld -with-plugin-ld=ld.gold --enable-gold --with-fpmath=sse
>>
>> on Fedora 12/ia32.
>
> I'm on Gentoo, without gold - not sure whether that made a difference,

Is it possible for you to install Fedora 12/13?  64-bit Fedora is fine;
you can bootstrap a 32-bit gcc on 64-bit Fedora.

> but I'm not seeing these failures.  I don't have access to SPEC2k6
> either.  Can you isolate any testcases?
>

The SPEC CPU 2006 failure is a run-time comparison failure, so it
won't be easy to find a small testcase.  GCC bootstrap comparison
failures and testsuite regressions are much easier to debug.

Thanks.


-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 17:18               ` Bernd Schmidt
  2010-07-17 17:27                 ` H.J. Lu
@ 2010-07-17 17:29                 ` IainS
  1 sibling, 0 replies; 49+ messages in thread
From: IainS @ 2010-07-17 17:29 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: H.J. Lu, Jeff Law, GCC Patches


On 17 Jul 2010, at 18:17, Bernd Schmidt wrote:

> On 07/17/2010 06:14 PM, H.J. Lu wrote:
>> On Sat, Jul 17, 2010 at 8:59 AM, Bernd Schmidt <bernds@codesourcery.com 
>> > wrote:
>>> On 07/17/2010 05:03 PM, H.J. Lu wrote:
>>>> It is more than that. It failed to boostrap on Linux/ia32 when
>>>> configured with
>>>>
>>>> --enable-clocale=gnu --with-system-zlib --enable-shared
>>>> --with-demangler-in-ld  --with-fpmath=sse
>>>
>>> I can't seem to reproduce this.  Is that the full command line?
>>>
>>>
>>
>> I used:
>>
>> ../src-trunk/configure \
>> 		 --enable-clocale=gnu --with-system-zlib --enable-shared --with-
>> demangler-in-ld -with-plugin-ld=ld.gold --enable-gold --with- 
>> fpmath=sse
>>
>> on Fedora 12/ia32.
>
> I'm on Gentoo, without gold - not sure whether that made a difference,
> but I'm not seeing these failures.  I don't have access to SPEC2k6
> either.  Can you isolate any testcases?

I don't have gold either - on darwin.

on i686-apple-darwin:
recog.o  and
reg-stack.o fail stage2/3 compare

here is recog.o (stage2 vs stage3)
stripped binaries => otool -tv (basically turns the text section into
disassembly).

It doesn't look like a debug-related diff to me - but more like what  
you posted earlier.
cheers
Iain


--- r2-code.s   2010-07-17 18:25:11.000000000 +0100
+++ r3-code.s   2010-07-17 18:25:19.000000000 +0100
@@ -1,4 +1,4 @@
-r2.o.stripped:
+r3.o.stripped:
  (__TEXT,__text) section
  _memory_address_addr_space_p:
  00000000       pushl   %ebx
@@ -6400,8 +6400,8 @@
  0000596f       leal    0x40(%esp),%edi
  00005973       nopw    0x00(%eax,%eax)
  00005979       nopl    0x00000000(%eax)
-00005980       leal    0xfc(%esi),%eax
-00005983       addl    $0x01,%esi
+00005980       addl    $0x01,%esi
+00005983       leal    0xfb(%esi),%eax
  00005986       cmpl    $0x05,%esi
  00005989       cmovgel %eax,%esi
  0000598c       movl    0x00(%ebp,%esi,8),%eax
  

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 16:00           ` Bernd Schmidt
  2010-07-17 16:15             ` H.J. Lu
@ 2010-07-18 18:15             ` H.J. Lu
  1 sibling, 0 replies; 49+ messages in thread
From: H.J. Lu @ 2010-07-18 18:15 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jeff Law, GCC Patches

On Sat, Jul 17, 2010 at 8:59 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 07/17/2010 05:03 PM, H.J. Lu wrote:
>> It is more than that. It failed to boostrap on Linux/ia32 when
>> configured with
>>
>> --enable-clocale=gnu --with-system-zlib --enable-shared
>> --with-demangler-in-ld  --with-fpmath=sse
>
> I can't seem to reproduce this.  Is that the full command line?
>

Many people have reported bootstrap failure on Linux/ia32. May
I suggest you reproduce it on a different Linux/ia32 OS? I know
it can be reproduced on Fedora 12/13.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 15:03         ` H.J. Lu
  2010-07-17 15:10           ` IainS
  2010-07-17 16:00           ` Bernd Schmidt
@ 2010-07-19 10:01           ` Bernd Schmidt
  2010-07-19 13:27             ` H.J. Lu
  2010-07-20 11:46             ` Bernd Schmidt
  2 siblings, 2 replies; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-19 10:01 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jeff Law, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1319 bytes --]

On 07/17/2010 05:03 PM, H.J. Lu wrote:
> On Sat, Jul 17, 2010 at 7:25 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>> On 07/17/2010 04:38 AM, H.J. Lu wrote:
>>> This caused:
>>>
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44970
>>
>> Apparently, the sse_prologue_save_insn is broken.
>>
> 
> It is more than that. It failed to boostrap on Linux/ia32 when
> configured with
> 
> --enable-clocale=gnu --with-system-zlib --enable-shared
> --with-demangler-in-ld  --with-fpmath=sse

I've committed the following fix for the bootstrap comparison failures.
 DEBUG_INSNs got in the way, as they do.  This also fixes a problem
observed with sh libjava: we need to avoid moving something after an
insn that can throw.  Another fix, which I found by inspection, is for the
use_ruid bookkeeping; the interaction between the new and the old code
could have caused problems there.

This leaves the x86_64 bootstrap failure, which needs to be fixed by an
x86 maintainer (I don't know enough about what's required), and a
spec2k6 regression reported by HJ.  HJ, have you verified that that was
actually caused by the same revision, and not just a range of revisions?
 If it was indeed my patch, can you either extract a test case, or
compile the failing program to assembly files, then send me a diff of
before/after?


Bernd

[-- Attachment #2: rc-fix4.diff --]
[-- Type: text/plain, Size: 5498 bytes --]

	* postreload.c (reload_combine_closest_single_use): Ignore the
	number of uses for DEBUG_INSNs.
	(fixup_debug_insns): New static function.
	(reload_combine_recognize_const_pattern): Use it.  Don't let the
	main loop be affected by DEBUG_INSNs.
	Really disallow moving adds past a jump insn.
	(reload_combine_recognize_pattern): Don't update use_ruid here.
	(reload_combine_note_use): Do it here.
	(reload_combine): Use control_flow_insn_p rather than JUMP_P.
	
	
Index: postreload.c
===================================================================
--- postreload.c	(revision 162277)
+++ postreload.c	(working copy)
@@ -816,7 +816,8 @@ reload_combine_purge_reg_uses_after_ruid
 
 /* Find the use of REGNO with the ruid that is highest among those
    lower than RUID_LIMIT, and return it if it is the only use of this
-   reg in the insn.  Return NULL otherwise.  */
+   reg in the insn (or if the insn is a debug insn).  Return NULL
+   otherwise.  */
 
 static struct reg_use *
 reload_combine_closest_single_use (unsigned regno, int ruid_limit)
@@ -830,7 +831,8 @@ reload_combine_closest_single_use (unsig
   retval = NULL;
   for (i = use_idx; i < RELOAD_COMBINE_MAX_USES; i++)
     {
-      int this_ruid = reg_state[regno].reg_use[i].ruid;
+      struct reg_use *use = reg_state[regno].reg_use + i; 
+      int this_ruid = use->ruid;
       if (this_ruid >= ruid_limit)
 	continue;
       if (this_ruid > best_ruid)
@@ -838,7 +840,7 @@ reload_combine_closest_single_use (unsig
 	  best_ruid = this_ruid;
 	  retval = reg_state[regno].reg_use + i;
 	}
-      else if (this_ruid == best_ruid)
+      else if (this_ruid == best_ruid && !DEBUG_INSN_P (use->insn))
 	retval = NULL;
     }
   if (last_label_ruid >= best_ruid)
@@ -846,6 +848,44 @@ reload_combine_closest_single_use (unsig
   return retval;
 }
 
+/* After we've moved an add insn, fix up any debug insns that occur between
+   the old location of the add and the new location.  REGNO is the destination
+   register of the add insn; REG is the corresponding RTX.  REPLACEMENT is
+   the SET_SRC of the add.  MIN_RUID specifies the ruid of the insn after
+   which we've placed the add, we ignore any debug insns after it.  */
+
+static void
+fixup_debug_insns (unsigned regno, rtx reg, rtx replacement, int min_ruid)
+{
+  struct reg_use *use;
+  int from = reload_combine_ruid;
+  for (;;)
+    {
+      rtx t;
+      rtx use_insn = NULL_RTX;
+      if (from < min_ruid)
+	break;
+      use = reload_combine_closest_single_use (regno, from);
+      if (use)
+	{
+	  from = use->ruid;
+	  use_insn = use->insn;
+	}
+      else
+	break;
+      
+      if (NONDEBUG_INSN_P (use->insn))
+	continue;
+      t = INSN_VAR_LOCATION_LOC (use_insn);
+      t = simplify_replace_rtx (t, reg, copy_rtx (replacement));
+      validate_change (use->insn,
+		       &INSN_VAR_LOCATION_LOC (use->insn), t, 0);
+      reload_combine_purge_insn_uses (use_insn);
+      reload_combine_note_use (&PATTERN (use_insn), use_insn,
+			       use->ruid, NULL_RTX);
+    }
+}
+
 /* Called by reload_combine when scanning INSN.  This function tries to detect
    patterns where a constant is added to a register, and the result is used
    in an address.
@@ -913,6 +953,10 @@ reload_combine_recognize_const_pattern (
 	/* Start the search for the next use from here.  */
 	from_ruid = use->ruid;
 
+      /* We'll fix up DEBUG_INSNs after we're done.  */
+      if (use && DEBUG_INSN_P (use->insn))
+	continue;
+
       if (use && GET_MODE (*use->usep) == Pmode)
 	{
 	  rtx use_insn = use->insn;
@@ -921,7 +965,7 @@ reload_combine_recognize_const_pattern (
 	  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn));
 
 	  /* Avoid moving the add insn past a jump.  */
-	  if (must_move_add && use_ruid < last_jump_ruid)
+	  if (must_move_add && use_ruid <= last_jump_ruid)
 	    break;
 
 	  /* If the add clobbers another hard reg in parallel, don't move
@@ -1019,6 +1063,8 @@ reload_combine_recognize_const_pattern (
     /* Process the add normally.  */
     return false;
 
+  fixup_debug_insns (regno, reg, src, add_moved_after_ruid);
+  
   reorder_insns (insn, insn, add_moved_after_insn);
   reload_combine_purge_reg_uses_after_ruid (regno, add_moved_after_ruid);
   reload_combine_split_ruids (add_moved_after_ruid - 1);
@@ -1155,11 +1201,6 @@ reload_combine_recognize_pattern (rtx in
 		   reg_state[regno].reg_use[i].ruid,
 		   reg_state[regno].reg_use[i].containing_mem);
 
-	      if (reg_state[REGNO (base)].use_ruid
-		  > reg_state[regno].use_ruid)
-		reg_state[REGNO (base)].use_ruid
-		  = reg_state[regno].use_ruid;
-
 	      /* Delete the reg-reg addition.  */
 	      delete_insn (insn);
 
@@ -1277,7 +1318,7 @@ reload_combine (void)
 
       reload_combine_ruid++;
 
-      if (JUMP_P (insn))
+      if (control_flow_insn_p (insn))
 	last_jump_ruid = reload_combine_ruid;
 
       if (reload_combine_recognize_const_pattern (insn)
@@ -1495,8 +1536,14 @@ reload_combine_note_use (rtx *xp, rtx in
 	    reg_state[regno].all_offsets_match = true;
 	    reg_state[regno].use_ruid = ruid;
 	  }
-	else if (! rtx_equal_p (offset, reg_state[regno].offset))
-	  reg_state[regno].all_offsets_match = false;
+	else
+	  {
+	    if (reg_state[regno].use_ruid > ruid)
+	      reg_state[regno].use_ruid = ruid;
+
+	    if (! rtx_equal_p (offset, reg_state[regno].offset))
+	      reg_state[regno].all_offsets_match = false;
+	  }
 
 	reg_state[regno].reg_use[use_index].insn = insn;
 	reg_state[regno].reg_use[use_index].ruid = ruid;

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-19 10:01           ` Bernd Schmidt
@ 2010-07-19 13:27             ` H.J. Lu
  2010-07-20 11:46             ` Bernd Schmidt
  1 sibling, 0 replies; 49+ messages in thread
From: H.J. Lu @ 2010-07-19 13:27 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jeff Law, GCC Patches

On Mon, Jul 19, 2010 at 3:00 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 07/17/2010 05:03 PM, H.J. Lu wrote:
>> On Sat, Jul 17, 2010 at 7:25 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>>> On 07/17/2010 04:38 AM, H.J. Lu wrote:
>>>> This caused:
>>>>
>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44970
>>>
>>> Apparently, the sse_prologue_save_insn is broken.
>>>
>>
>> It is more than that. It failed to boostrap on Linux/ia32 when
>> configured with
>>
>> --enable-clocale=gnu --with-system-zlib --enable-shared
>> --with-demangler-in-ld  --with-fpmath=sse
>
> I've committed the following fix for the bootstrap comparison failures.
>  DEBUG_INSNs got in the way, as they do.  This also fixes a problem
> observed with sh libjava, we need to avoid moving something after an
> insn that can throw.  Another fix which I found by inspection is for the
> use_ruid bookkeeping, the interaction between the new and the old code
> could have caused problems there.
>
> This leaves the x86_64 bootstrap failure, which needs to be fixed by an
> x86 maintainer (I don't know enough about what's required), and a
> spec2k6 regression reported by HJ.  HJ, have you verified that that was
> actually caused by the same revision, and not just a range of revisions?

I verified that revision 162270 is the cause.

>  If it was indeed my patch, can you either extract a test case, or
> compile the failing program to assembly files, then send me a diff of
> before/after?

I will do that. It will take a while.

-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* x86_64 varargs setup jump table
  2010-07-17 14:25       ` Bernd Schmidt
  2010-07-17 15:03         ` H.J. Lu
@ 2010-07-19 15:41         ` Richard Henderson
  2010-07-19 15:56           ` H.J. Lu
                             ` (2 more replies)
  2010-07-19 16:19         ` New optimization for reload_combine Jeff Law
  2 siblings, 3 replies; 49+ messages in thread
From: Richard Henderson @ 2010-07-19 15:41 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: H.J. Lu, GCC Patches, ubizjak

On 07/17/2010 07:25 AM, Bernd Schmidt wrote:
>  	leaq	0(,%rax,4), %rcx
>  	movl	$.L2, %eax
>  	subq	%rcx, %rax
>  	jmp	*%rax

I've often thought this was over-engineering in the x86_64 abi.
This jump table is trading memory bandwidth for unpredictability
in the branch target.

I've often wondered if we'd get better performance if we changed
to a simple comparison against zero.  I.e.

	test	%al,%al
	jz	1f
	// 8 xmm stores
1:

H.J., do you think you'd be able to measure performance on this?



r~

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86_64 varargs setup jump table
  2010-07-19 15:41         ` x86_64 varargs setup jump table Richard Henderson
@ 2010-07-19 15:56           ` H.J. Lu
  2010-07-19 20:57             ` H.J. Lu
  2010-07-19 16:09           ` x86_64 varargs setup jump table Andi Kleen
  2010-07-19 16:23           ` Jan Hubicka
  2 siblings, 1 reply; 49+ messages in thread
From: H.J. Lu @ 2010-07-19 15:56 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bernd Schmidt, GCC Patches, ubizjak

On Mon, Jul 19, 2010 at 8:41 AM, Richard Henderson <rth@redhat.com> wrote:
> On 07/17/2010 07:25 AM, Bernd Schmidt wrote:
>>       leaq    0(,%rax,4), %rcx
>>       movl    $.L2, %eax
>>       subq    %rcx, %rax
>>       jmp     *%rax
>
> I've often thought this was over-engineering in the x86_64 abi.
> This jump table is trading memory bandwidth for unpredictability
> in the branch target.
>
> I've often wondered if we'd get better performance if we changed
> to a simple comparison against zero.  I.e.
>
>        test    %al,%al
>        jz      1f
>        // 8 xmm stores
> 1:
>
> H.J., do you think you'd be able to measure performance on this?
>

Sure.


-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86_64 varargs setup jump table
  2010-07-19 15:41         ` x86_64 varargs setup jump table Richard Henderson
  2010-07-19 15:56           ` H.J. Lu
@ 2010-07-19 16:09           ` Andi Kleen
  2010-07-19 16:16             ` H.J. Lu
  2010-07-19 16:23           ` Jan Hubicka
  2 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2010-07-19 16:09 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bernd Schmidt, H.J. Lu, GCC Patches, ubizjak

Richard Henderson <rth@redhat.com> writes:

> On 07/17/2010 07:25 AM, Bernd Schmidt wrote:
>>  	leaq	0(,%rax,4), %rcx
>>  	movl	$.L2, %eax
>>  	subq	%rcx, %rax
>>  	jmp	*%rax
>
> I've often thought this was over-engineering in the x86_64 abi.
> This jump table is trading memory bandwidth for unpredictability
> in the branch target.

The other problem with the jump is that if you misdeclare
the prototype and nothing correct is passed in %eax,
you get a random jump somewhere, which tends to be hard
to debug.  This has happened in the past when porting
existing code.

Getting rid of that alone would be worth changing it;
varargs shouldn't be that time critical anyway.
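
A minimal sketch of that failure mode, with hypothetical names (the real
definition of log_it would be variadic; the caller below sees a
non-variadic declaration, so it never sets %al and the computed jump in
the callee's prologue uses garbage):

/* Hypothetical example only; log_it is a made-up name.  The real
   definition (elsewhere) would be:  int log_it (const char *fmt, ...);
   Declaring it non-variadic here means this caller does not load %al
   with the number of SSE registers used.  */
extern int log_it (const char *fmt, double x);

int
main (void)
{
  return log_it ("%f", 3.14);	/* %al left uninitialized by this call */
}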

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-16 19:45 ` Paolo Bonzini
  2010-07-16 19:55   ` Bernd Schmidt
@ 2010-07-19 16:11   ` Mark Mitchell
  2010-07-19 16:13     ` Jeff Law
  2010-07-19 16:31     ` Paolo Bonzini
  1 sibling, 2 replies; 49+ messages in thread
From: Mark Mitchell @ 2010-07-19 16:11 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Bernd Schmidt, GCC Patches

Paolo Bonzini wrote:

>> Does postreload.c fall under reload?  If not, would someone approve this?
> 
> Well, it was part of reload1.c until a while ago...

Without commenting on the patch (which Jeff has already reviewed), I
think that postreload should fall in the scope of "reload maintainer".
Does anyone object?

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-19 16:11   ` Mark Mitchell
@ 2010-07-19 16:13     ` Jeff Law
  2010-07-19 16:31     ` Paolo Bonzini
  1 sibling, 0 replies; 49+ messages in thread
From: Jeff Law @ 2010-07-19 16:13 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Paolo Bonzini, Bernd Schmidt, GCC Patches

On 07/19/10 10:11, Mark Mitchell wrote:
> Paolo Bonzini wrote:
>
>    
>>> Does postreload.c fall under reload?  If not, would someone approve this?
>>>        
>> Well, it was part of reload1.c until a while ago...
>>      
> Without commenting on the patch (which Jeff has already reviewed), I
> think that postreload should fall in the scope of "reload maintainer".
> Does anyone object?
>    
Nope.  Sounds quite reasonable to me.

jeff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86_64 varargs setup jump table
  2010-07-19 16:09           ` x86_64 varargs setup jump table Andi Kleen
@ 2010-07-19 16:16             ` H.J. Lu
  2010-07-19 16:26               ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: H.J. Lu @ 2010-07-19 16:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Richard Henderson, Bernd Schmidt, GCC Patches, ubizjak

On Mon, Jul 19, 2010 at 9:09 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Richard Henderson <rth@redhat.com> writes:
>
>> On 07/17/2010 07:25 AM, Bernd Schmidt wrote:
>>>      leaq    0(,%rax,4), %rcx
>>>      movl    $.L2, %eax
>>>      subq    %rcx, %rax
>>>      jmp     *%rax
>>
>> I've often thought this was over-engineering in the x86_64 abi.
>> This jump table is trading memory bandwidth for unpredictability
>> in the branch target.
>
> The other problem with the jump is that if you misdeclare
> the prototype and nothing correct is passed in %eax

FWIW, we DON'T support varargs without correct prototype
on x86-64. See AVX support in x86-64 psABI for details.


-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-17 14:25       ` Bernd Schmidt
  2010-07-17 15:03         ` H.J. Lu
  2010-07-19 15:41         ` x86_64 varargs setup jump table Richard Henderson
@ 2010-07-19 16:19         ` Jeff Law
  2 siblings, 0 replies; 49+ messages in thread
From: Jeff Law @ 2010-07-19 16:19 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: H.J. Lu, GCC Patches

On 07/17/10 08:25, Bernd Schmidt wrote:
> On 07/17/2010 04:38 AM, H.J. Lu wrote:
>    
>> This caused:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44970
>>      
> Apparently, the sse_prologue_save_insn is broken.
>
> diff -dru old/nest-stdar-1.s new/nest-stdar-1.s
> --- old/nest-stdar-1.s	2010-07-17 14:10:40.308605357 +0000
> +++ new/nest-stdar-1.s	2010-07-17 14:00:30.592312121 +0000
> @@ -9,25 +9,24 @@
>   	subq	$48, %rsp
>   	.cfi_def_cfa_offset 56
>   	leaq	0(,%rax,4), %rcx
> -	leaq	39(%rsp), %rdx
>   	movl	$.L2, %eax
>   	subq	%rcx, %rax
>   	jmp	*%rax
> -	movaps	%xmm7, -15(%rdx)
> -	movaps	%xmm6, -31(%rdx)
> -	movaps	%xmm5, -47(%rdx)
> -	movaps	%xmm4, -63(%rdx)
> -	movaps	%xmm3, -79(%rdx)
> -	movaps	%xmm2, -95(%rdx)
> -	movaps	%xmm1, -111(%rdx)
> -	movaps	%xmm0, -127(%rdx)
> +	movaps	%xmm7, 24(%rsp)
> +	movaps	%xmm6, 8(%rsp)
> +	movaps	%xmm5, -8(%rsp)
> +	movaps	%xmm4, -24(%rsp)
> +	movaps	%xmm3, -40(%rsp)
> +	movaps	%xmm2, -56(%rsp)
> +	movaps	%xmm1, -72(%rsp)
> +	movaps	%xmm0, -88(%rsp)
>
> It's implementing a crazy jump table, which requires that all insns have
> the same length, which in turn requires that no one modifies the address
> in the pattern.
>    
Unreal.  Anytime I see such fragile code I want to cry -- then I find 
out it's for varargs and I want to scream.

> I can fix this testcase with the patch below, but I'll leave it for the
> x86 maintainers to choose this fix or another.
>    
Yea, let's let the x86 maintainers fix this -- presumably they'll find 
something that doesn't rely upon exact instruction lengths.

jeff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86_64 varargs setup jump table
  2010-07-19 15:41         ` x86_64 varargs setup jump table Richard Henderson
  2010-07-19 15:56           ` H.J. Lu
  2010-07-19 16:09           ` x86_64 varargs setup jump table Andi Kleen
@ 2010-07-19 16:23           ` Jan Hubicka
  2 siblings, 0 replies; 49+ messages in thread
From: Jan Hubicka @ 2010-07-19 16:23 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bernd Schmidt, H.J. Lu, GCC Patches, ubizjak

> On 07/17/2010 07:25 AM, Bernd Schmidt wrote:
> >  	leaq	0(,%rax,4), %rcx
> >  	movl	$.L2, %eax
> >  	subq	%rcx, %rax
> >  	jmp	*%rax
> 
> I've often thought this was over-engineering in the x86_64 abi.
> This jump table is trading memory bandwidth for unpredictability
> in the branch target.
> 
> I've often wondered if we'd get better performance if we changed
> to a simple comparison against zero.  I.e.
> 
> 	test	%al,%al
> 	jz	1f
> 	// 8 xmm stores
> 1:
> 
> H.J., do you think you'd be able to measure performance on this?

The original problem was that early K8 chips had no way of efficiently
storing an SSE register to memory without knowing its type, so the stores
in the prologue executed very slowly when reformatting happened.  The same
reason was behind not having callee saved/restored SSE regs.

On current chips this is not a big issue, so I do not care which way we
output it.  In fact I used to have a patch for doing the jz but lost it.
I think we might keep supporting both to get some checking that the ABI is
not terribly broken (i.e. that no other compiler just feeds rax with a
random value, but always with the number of args).

Honza
> 
> 
> 
> r~

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86_64 varargs setup jump table
  2010-07-19 16:16             ` H.J. Lu
@ 2010-07-19 16:26               ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2010-07-19 16:26 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Andi Kleen, Richard Henderson, Bernd Schmidt, GCC Patches, ubizjak

On Mon, Jul 19, 2010 at 09:15:47AM -0700, H.J. Lu wrote:
> On Mon, Jul 19, 2010 at 9:09 AM, Andi Kleen <andi@firstfloor.org> wrote:
> > Richard Henderson <rth@redhat.com> writes:
> >
> >> On 07/17/2010 07:25 AM, Bernd Schmidt wrote:
> >>>      leaq    0(,%rax,4), %rcx
> >>>      movl    $.L2, %eax
> >>>      subq    %rcx, %rax
> >>>      jmp     *%rax
> >>
> >> I've often thought this was over-engineering in the x86_64 abi.
> >> This jump table is trading memory bandwidth for unpredictability
> >> in the branch target.
> >
> > The other problem with the jump is that if you misdeclare
> > the prototype and nothing correct is passed in %eax
> 
> FWIW, we DON'T support varargs without correct prototype
> on x86-64. See AVX support in x86-64 psABI for details.

Yes, it's the same on non-AVX.  But still, if it goes wrong
(and in practice it sometimes does) it's better if it's a smooth landing
than if it jumps randomly over the address space.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-19 16:11   ` Mark Mitchell
  2010-07-19 16:13     ` Jeff Law
@ 2010-07-19 16:31     ` Paolo Bonzini
  1 sibling, 0 replies; 49+ messages in thread
From: Paolo Bonzini @ 2010-07-19 16:31 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Bernd Schmidt, GCC Patches

On 07/19/2010 06:11 PM, Mark Mitchell wrote:
> Paolo Bonzini wrote:
>
>>> Does postreload.c fall under reload?  If not, would someone approve this?
>>
>> Well, it was part of reload1.c until a while ago...
>
> Without commenting on the patch (which Jeff has already reviewed), I
> think that postreload should fall in the scope of "reload maintainer".
> Does anyone object?

It was even part of reload1.c some years ago.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86_64 varargs setup jump table
  2010-07-19 15:56           ` H.J. Lu
@ 2010-07-19 20:57             ` H.J. Lu
  2010-07-19 21:14               ` Richard Henderson
  0 siblings, 1 reply; 49+ messages in thread
From: H.J. Lu @ 2010-07-19 20:57 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bernd Schmidt, GCC Patches, ubizjak

On Mon, Jul 19, 2010 at 8:56 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Jul 19, 2010 at 8:41 AM, Richard Henderson <rth@redhat.com> wrote:
>> On 07/17/2010 07:25 AM, Bernd Schmidt wrote:
>>>       leaq    0(,%rax,4), %rcx
>>>       movl    $.L2, %eax
>>>       subq    %rcx, %rax
>>>       jmp     *%rax
>>
>> I've often thought this was over-engineering in the x86_64 abi.
>> This jump table is trading memory bandwidth for unpredictability
>> in the branch target.
>>
>> I've often wondered if we'd get better performance if we changed
>> to a simple comparison against zero.  I.e.
>>
>>        test    %al,%al
>>        jz      1f
>>        // 8 xmm stores
>> 1:
>>
>> H.J., do you think you'd be able to measure performance on this?
>>
>
> Sure.

Just to be clear: I can test the performance impact if there is a patch,
but I don't have time to create a patch in the near future.


-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86_64 varargs setup jump table
  2010-07-19 20:57             ` H.J. Lu
@ 2010-07-19 21:14               ` Richard Henderson
  2010-07-20 16:32                 ` Richard Henderson
  0 siblings, 1 reply; 49+ messages in thread
From: Richard Henderson @ 2010-07-19 21:14 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Bernd Schmidt, GCC Patches, ubizjak

[-- Attachment #1: Type: text/plain, Size: 287 bytes --]

On 07/19/2010 01:57 PM, H.J. Lu wrote:
>>> H.J., do you think you'd be able to measure performance on this?
>>>
>>
>> Sure.

This bootstraps; regression test starting now.

Obviously there are some patterns in i386.md that should be removed
along with this, were this patch to go in.


r~

[-- Attachment #2: z --]
[-- Type: text/plain, Size: 3427 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index bb0b890..7b03c6d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -7062,11 +7062,8 @@ static void
 setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
 {
   rtx save_area, mem;
-  rtx label;
-  rtx tmp_reg;
-  rtx nsse_reg;
   alias_set_type set;
-  int i;
+  int i, max;
 
   /* GPR size of varargs save area.  */
   if (cfun->va_list_gpr_size)
@@ -7087,10 +7084,11 @@ setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
   save_area = frame_pointer_rtx;
   set = get_varargs_alias_set ();
 
-  for (i = cum->regno;
-       i < X86_64_REGPARM_MAX
-       && i < cum->regno + cfun->va_list_gpr_size / UNITS_PER_WORD;
-       i++)
+  max = cum->regno + cfun->va_list_gpr_size / UNITS_PER_WORD;
+  if (max > X86_64_REGPARM_MAX)
+    max = X86_64_REGPARM_MAX;
+
+  for (i = cum->regno; i < max; i++)
     {
       mem = gen_rtx_MEM (Pmode,
 			 plus_constant (save_area, i * UNITS_PER_WORD));
@@ -7102,33 +7100,42 @@ setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
 
   if (ix86_varargs_fpr_size)
     {
+      enum machine_mode smode;
+      rtx label, test;
+
       /* Now emit code to save SSE registers.  The AX parameter contains number
-	 of SSE parameter registers used to call this function.  We use
-	 sse_prologue_save insn template that produces computed jump across
-	 SSE saves.  We need some preparation work to get this working.  */
+	 of SSE parameter registers used to call this function, though all we
+	 actually check here is the zero/non-zero status.  */
 
       label = gen_label_rtx ();
+      test = gen_rtx_EQ (VOIDmode, gen_rtx_REG (QImode, AX_REG), const0_rtx);
+      emit_jump_insn (gen_cbranchqi4 (test, XEXP (test, 0), XEXP (test, 1),
+				      label));
+
+      /* If we've determined that we're only loading scalars (and not
+	 vector data) then we can store doubles instead.  */
+      /* ??? This is too early to determine this, it would seem.  */
+      if (0 && crtl->stack_alignment_needed < 128)
+	smode = DFmode;
+      else
+	smode = V4SFmode;
 
-      nsse_reg = gen_reg_rtx (Pmode);
-      emit_insn (gen_zero_extendqidi2 (nsse_reg, gen_rtx_REG (QImode, AX_REG)));
-
-      /* Compute address of memory block we save into.  We always use pointer
-	 pointing 127 bytes after first byte to store - this is needed to keep
-	 instruction size limited by 4 bytes (5 bytes for AVX) with one
-	 byte displacement.  */
-      tmp_reg = gen_reg_rtx (Pmode);
-      emit_insn (gen_rtx_SET (VOIDmode, tmp_reg,
-			      plus_constant (save_area,
-					     ix86_varargs_gpr_size + 127)));
-      mem = gen_rtx_MEM (BLKmode, plus_constant (tmp_reg, -127));
-      MEM_NOTRAP_P (mem) = 1;
-      set_mem_alias_set (mem, set);
-      set_mem_align (mem, 64);
+      max = cum->sse_regno + cfun->va_list_fpr_size / 16;
+      if (max > X86_64_SSE_REGPARM_MAX)
+	max = X86_64_SSE_REGPARM_MAX;
 
-      /* And finally do the dirty job!  */
-      emit_insn (gen_sse_prologue_save (mem, nsse_reg,
-					GEN_INT (cum->sse_regno), label,
-					gen_reg_rtx (Pmode)));
+      for (i = cum->sse_regno; i < max; ++i)
+	{
+	  mem = plus_constant (save_area, i * 16 + ix86_varargs_gpr_size);
+	  mem = gen_rtx_MEM (smode, mem);
+	  MEM_NOTRAP_P (mem) = 1;
+	  set_mem_alias_set (mem, set);
+	  set_mem_align (mem, 128);
+
+	  emit_move_insn (mem, gen_rtx_REG (smode, SSE_REGNO (i)));
+	}
+
+      emit_label (label);
     }
 }
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-19 10:01           ` Bernd Schmidt
  2010-07-19 13:27             ` H.J. Lu
@ 2010-07-20 11:46             ` Bernd Schmidt
  2010-07-20 15:33               ` Jeff Law
  2010-07-20 15:34               ` Bernd Schmidt
  1 sibling, 2 replies; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-20 11:46 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jeff Law, GCC Patches, Jakub Jelinek, Alexandre Oliva

On 07/19/2010 12:00 PM, Bernd Schmidt wrote:
> I've committed the following fix for the bootstrap comparison failures.
>  DEBUG_INSNs got in the way, as they do.  This also fixes a problem
> observed with sh libjava, we need to avoid moving something after an
> insn that can throw.  Another fix which I found by inspection is for the
> use_ruid bookkeeping, the interaction between the new and the old code
> could have caused problems there.

It's still producing different code for -g vs. no-debug on powerpc.  It
turns out that we only store a limited number of uses for a reg, and
somehow the powerpc compiler produces a huge number of debug_insns,
which fill the buffer and prevent us from doing the optimization on real
insns.

Not sure yet how to fix it, probably by not tracking DEBUG_INSNs at all
and letting them get out of date.
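
A simplified sketch of the failure mode, with invented names rather than
the real reg_state code:

/* Each hard register tracks at most MAX_USES uses in a fixed-size
   array; use_index counts down and -1 means "give up on this reg".
   Because DEBUG_INSN uses also consume slots, a -g build can exhaust
   the array where a non-debug build would not, and the optimization
   then triggers differently.  Invented names, simplified logic.  */
#define MAX_USES 16

struct fake_reg_state
{
  int use_index;
};

static void
init_reg_state (struct fake_reg_state *rs)
{
  rs->use_index = MAX_USES;
}

static void
note_use (struct fake_reg_state *rs, int is_debug_insn)
{
  if (rs->use_index < 0)
    return;			/* already given up */
  if (--rs->use_index < 0)
    rs->use_index = -1;		/* array full: stop optimizing this reg */
  (void) is_debug_insn;		/* pre-fix behavior: debug uses count too */
}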

Do we have any data on whether this whole DEBUG_INSN stuff is actually
useful?  It was still controversial when it was committed, and I've
personally not seen any improvement in trying to debug a cc1 compiled at
-O2.  Is there any evidence that it has actually made debugging easier,
or are we carrying around all this complexity for no gain?


Bernd

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-20 11:46             ` Bernd Schmidt
@ 2010-07-20 15:33               ` Jeff Law
  2010-07-20 15:34               ` Bernd Schmidt
  1 sibling, 0 replies; 49+ messages in thread
From: Jeff Law @ 2010-07-20 15:33 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: H.J. Lu, GCC Patches, Jakub Jelinek, Alexandre Oliva

On 07/20/10 05:45, Bernd Schmidt wrote:
> On 07/19/2010 12:00 PM, Bernd Schmidt wrote:
>    
>> I've committed the following fix for the bootstrap comparison failures.
>>   DEBUG_INSNs got in the way, as they do.  This also fixes a problem
>> observed with sh libjava, we need to avoid moving something after an
>> insn that can throw.  Another fix which I found by inspection is for the
>> use_ruid bookkeeping, the interaction between the new and the old code
>> could have caused problems there.
>>      
> It's still producing different code for -g vs. no-debug on powerpc.  It
> turns out that we only store a limited amount of uses for a reg, and
> somehow the powerpc compiler produces a huge amount of debug_insns,
> which fill the buffer and prevent us from doing the optimization on real
> insns.
>
> Not sure yet how to fix it, probably by not tracking DEBUG_INSNs at all
> and letting them get out of date.
>    
The only middle ground I see would be to track them, but if the buffer 
fills, you go through and purge all the DEBUG_INSN entries from the 
buffer and associated data structures.  Otherwise I think you have to 
track them without bound or ignore them completely.
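
A rough sketch of that middle ground, with invented types rather than
the real reg_state layout:

/* When the fixed-size use array fills up, first reclaim the slots held
   by DEBUG_INSN uses and give up only if the real uses alone overflow
   it.  Invented types; illustration only.  */
struct fake_use
{
  void *insn;
  int is_debug;
};

static int
purge_debug_uses (struct fake_use *uses, int n)
{
  int i, kept = 0;

  for (i = 0; i < n; i++)
    if (!uses[i].is_debug)
      uses[kept++] = uses[i];	/* compact real uses to the front */
  return kept;			/* caller retries with the freed slots */
}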


Jeff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: New optimization for reload_combine
  2010-07-20 11:46             ` Bernd Schmidt
  2010-07-20 15:33               ` Jeff Law
@ 2010-07-20 15:34               ` Bernd Schmidt
  2010-07-20 15:40                 ` Jakub Jelinek
  1 sibling, 1 reply; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-20 15:34 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jeff Law, GCC Patches, Jakub Jelinek, Alexandre Oliva

[-- Attachment #1: Type: text/plain, Size: 756 bytes --]

On 07/20/2010 01:45 PM, Bernd Schmidt wrote:
> It's still producing different code for -g vs. no-debug on powerpc.  It
> turns out that we only store a limited amount of uses for a reg, and
> somehow the powerpc compiler produces a huge amount of debug_insns,
> which fill the buffer and prevent us from doing the optimization on real
> insns.
> 
> Not sure yet how to fix it, probably by not tracking DEBUG_INSNs at all
> and letting them get out of date.

Here's a fix, which I've committed.  Bootstrapped and regression tested
on i686-linux, and also verified that I can no longer reproduce output
differences with a powerpc cross compiler.  It even tries to fix up the
debug insns in the range of insns we're changing so that should be OK too.


Bernd

[-- Attachment #2: more-debug-insns.diff --]
[-- Type: text/plain, Size: 4316 bytes --]

Index: ChangeLog
===================================================================
--- ChangeLog	(revision 162341)
+++ ChangeLog	(working copy)
@@ -1,3 +1,11 @@
+2010-07-20  Bernd Schmidt  <bernds@codesourcery.com>
+
+	* postreload.c (fixup_debug_insns): Remove arg REGNO.  New args
+	FROM and TO.  All callers changed.  Don't look for tracked uses,
+	just scan the RTL for DEBUG_INSNs and substitute.
+	(reload_combine_recognize_pattern): Call fixup_debug_insns.
+	(reload_combine): Ignore DEBUG_INSNs.
+
 2010-07-20  Jakub Jelinek  <jakub@redhat.com>
 
 	* var-tracking.c (vt_expand_loc, vt_expand_loc_dummy): Bump maximum
Index: postreload.c
===================================================================
--- postreload.c	(revision 162301)
+++ postreload.c	(working copy)
@@ -848,41 +848,26 @@ reload_combine_closest_single_use (unsig
   return retval;
 }
 
-/* After we've moved an add insn, fix up any debug insns that occur between
-   the old location of the add and the new location.  REGNO is the destination
-   register of the add insn; REG is the corresponding RTX.  REPLACEMENT is
-   the SET_SRC of the add.  MIN_RUID specifies the ruid of the insn after
-   which we've placed the add, we ignore any debug insns after it.  */
+/* After we've moved an add insn, fix up any debug insns that occur
+   between the old location of the add and the new location.  REG is
+   the destination register of the add insn; REPLACEMENT is the
+   SET_SRC of the add.  FROM and TO specify the range in which we
+   should make this change on debug insns.  */
 
 static void
-fixup_debug_insns (unsigned regno, rtx reg, rtx replacement, int min_ruid)
+fixup_debug_insns (rtx reg, rtx replacement, rtx from, rtx to)
 {
-  struct reg_use *use;
-  int from = reload_combine_ruid;
-  for (;;)
+  rtx insn;
+  for (insn = from; insn != to; insn = NEXT_INSN (insn))
     {
       rtx t;
-      rtx use_insn = NULL_RTX;
-      if (from < min_ruid)
-	break;
-      use = reload_combine_closest_single_use (regno, from);
-      if (use)
-	{
-	  from = use->ruid;
-	  use_insn = use->insn;
-	}
-      else
-	break;
-      
-      if (NONDEBUG_INSN_P (use->insn))
+
+      if (!DEBUG_INSN_P (insn))
 	continue;
-      t = INSN_VAR_LOCATION_LOC (use_insn);
+      
+      t = INSN_VAR_LOCATION_LOC (insn);
       t = simplify_replace_rtx (t, reg, copy_rtx (replacement));
-      validate_change (use->insn,
-		       &INSN_VAR_LOCATION_LOC (use->insn), t, 0);
-      reload_combine_purge_insn_uses (use_insn);
-      reload_combine_note_use (&PATTERN (use_insn), use_insn,
-			       use->ruid, NULL_RTX);
+      validate_change (insn, &INSN_VAR_LOCATION_LOC (insn), t, 0);
     }
 }
 
@@ -1063,8 +1048,8 @@ reload_combine_recognize_const_pattern (
     /* Process the add normally.  */
     return false;
 
-  fixup_debug_insns (regno, reg, src, add_moved_after_ruid);
-  
+  fixup_debug_insns (reg, src, insn, add_moved_after_insn);
+
   reorder_insns (insn, insn, add_moved_after_insn);
   reload_combine_purge_reg_uses_after_ruid (regno, add_moved_after_ruid);
   reload_combine_split_ruids (add_moved_after_ruid - 1);
@@ -1191,15 +1176,21 @@ reload_combine_recognize_pattern (rtx in
 
 	  if (apply_change_group ())
 	    {
+	      struct reg_use *lowest_ruid = NULL;
+
 	      /* For every new use of REG_SUM, we have to record the use
 		 of BASE therein, i.e. operand 1.  */
 	      for (i = reg_state[regno].use_index;
 		   i < RELOAD_COMBINE_MAX_USES; i++)
-		reload_combine_note_use
-		  (&XEXP (*reg_state[regno].reg_use[i].usep, 1),
-		   reg_state[regno].reg_use[i].insn,
-		   reg_state[regno].reg_use[i].ruid,
-		   reg_state[regno].reg_use[i].containing_mem);
+		{
+		  struct reg_use *use = reg_state[regno].reg_use + i;
+		  reload_combine_note_use (&XEXP (*use->usep, 1), use->insn,
+					   use->ruid, use->containing_mem);
+		  if (lowest_ruid == NULL || use->ruid < lowest_ruid->ruid)
+		    lowest_ruid = use;
+		}
+
+	      fixup_debug_insns (reg, reg_sum, insn, lowest_ruid->insn);
 
 	      /* Delete the reg-reg addition.  */
 	      delete_insn (insn);
@@ -1313,7 +1304,7 @@ reload_combine (void)
 	  if (! fixed_regs[r])
 	      reg_state[r].use_index = RELOAD_COMBINE_MAX_USES;
 
-      if (! INSN_P (insn))
+      if (! NONDEBUG_INSN_P (insn))
 	continue;
 
       reload_combine_ruid++;


* Re: New optimization for reload_combine
  2010-07-20 15:34               ` Bernd Schmidt
@ 2010-07-20 15:40                 ` Jakub Jelinek
  0 siblings, 0 replies; 49+ messages in thread
From: Jakub Jelinek @ 2010-07-20 15:40 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: H.J. Lu, Jeff Law, GCC Patches, Alexandre Oliva

On Tue, Jul 20, 2010 at 05:33:51PM +0200, Bernd Schmidt wrote:
>        t = simplify_replace_rtx (t, reg, copy_rtx (replacement));

The copy_rtx doesn't make sense here.  simplify_replace_rtx already calls
copy_rtx itself whenever it performs a replacement (when fn is NULL):

  if (__builtin_expect (fn != NULL, 0))
    {
      newx = fn (x, old_rtx, data);
      if (newx)
        return newx;
    }
  else if (rtx_equal_p (x, old_rtx))
    return copy_rtx ((rtx) data);
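
So the call in fixup_debug_insns could presumably be reduced to this (a sketch
of the implied simplification, not a tested change):

  t = simplify_replace_rtx (t, reg, replacement);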

	Jakub


* Re: x86_64 varargs setup jump table
  2010-07-19 21:14               ` Richard Henderson
@ 2010-07-20 16:32                 ` Richard Henderson
  2010-07-20 23:05                   ` Richard Henderson
  2010-07-21 23:15                   ` H.J. Lu
  0 siblings, 2 replies; 49+ messages in thread
From: Richard Henderson @ 2010-07-20 16:32 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Bernd Schmidt, GCC Patches, ubizjak

[-- Attachment #1: Type: text/plain, Size: 327 bytes --]

On 07/19/2010 02:13 PM, Richard Henderson wrote:
> This bootstraps; regression test starting now.
> 
> Obviously there's some patterns in i386.md that should be removed
> along with this, were this patch to go in.

A slightly different patch that passes regression testing.
This also vanishes the patterns that should go.


r~

[-- Attachment #2: z --]
[-- Type: text/plain, Size: 10354 bytes --]

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index bb0b890..d9dc571 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -7062,11 +7062,8 @@ static void
 setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
 {
   rtx save_area, mem;
-  rtx label;
-  rtx tmp_reg;
-  rtx nsse_reg;
   alias_set_type set;
-  int i;
+  int i, max;
 
   /* GPR size of varargs save area.  */
   if (cfun->va_list_gpr_size)
@@ -7087,10 +7084,11 @@ setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
   save_area = frame_pointer_rtx;
   set = get_varargs_alias_set ();
 
-  for (i = cum->regno;
-       i < X86_64_REGPARM_MAX
-       && i < cum->regno + cfun->va_list_gpr_size / UNITS_PER_WORD;
-       i++)
+  max = cum->regno + cfun->va_list_gpr_size / UNITS_PER_WORD;
+  if (max > X86_64_REGPARM_MAX)
+    max = X86_64_REGPARM_MAX;
+
+  for (i = cum->regno; i < max; i++)
     {
       mem = gen_rtx_MEM (Pmode,
 			 plus_constant (save_area, i * UNITS_PER_WORD));
@@ -7102,33 +7100,41 @@ setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
 
   if (ix86_varargs_fpr_size)
     {
+      enum machine_mode smode;
+      rtx label, test;
+
       /* Now emit code to save SSE registers.  The AX parameter contains number
-	 of SSE parameter registers used to call this function.  We use
-	 sse_prologue_save insn template that produces computed jump across
-	 SSE saves.  We need some preparation work to get this working.  */
+	 of SSE parameter registers used to call this function, though all we
+	 actually check here is the zero/non-zero status.  */
 
       label = gen_label_rtx ();
+      test = gen_rtx_EQ (VOIDmode, gen_rtx_REG (QImode, AX_REG), const0_rtx);
+      emit_jump_insn (gen_cbranchqi4 (test, XEXP (test, 0), XEXP (test, 1),
+				      label));
+
+      /* If we've determined that we're only loading scalars (and not
+	 vector data) then we can store doubles instead.  */
+      if (crtl->stack_alignment_needed < 128)
+	smode = DFmode;
+      else
+	smode = V4SFmode;
 
-      nsse_reg = gen_reg_rtx (Pmode);
-      emit_insn (gen_zero_extendqidi2 (nsse_reg, gen_rtx_REG (QImode, AX_REG)));
-
-      /* Compute address of memory block we save into.  We always use pointer
-	 pointing 127 bytes after first byte to store - this is needed to keep
-	 instruction size limited by 4 bytes (5 bytes for AVX) with one
-	 byte displacement.  */
-      tmp_reg = gen_reg_rtx (Pmode);
-      emit_insn (gen_rtx_SET (VOIDmode, tmp_reg,
-			      plus_constant (save_area,
-					     ix86_varargs_gpr_size + 127)));
-      mem = gen_rtx_MEM (BLKmode, plus_constant (tmp_reg, -127));
-      MEM_NOTRAP_P (mem) = 1;
-      set_mem_alias_set (mem, set);
-      set_mem_align (mem, 64);
+      max = cum->sse_regno + cfun->va_list_fpr_size / 16;
+      if (max > X86_64_SSE_REGPARM_MAX)
+	max = X86_64_SSE_REGPARM_MAX;
 
-      /* And finally do the dirty job!  */
-      emit_insn (gen_sse_prologue_save (mem, nsse_reg,
-					GEN_INT (cum->sse_regno), label,
-					gen_reg_rtx (Pmode)));
+      for (i = cum->sse_regno; i < max; ++i)
+	{
+	  mem = plus_constant (save_area, i * 16 + ix86_varargs_gpr_size);
+	  mem = gen_rtx_MEM (smode, mem);
+	  MEM_NOTRAP_P (mem) = 1;
+	  set_mem_alias_set (mem, set);
+	  set_mem_align (mem, GET_MODE_ALIGNMENT (smode));
+
+	  emit_move_insn (mem, gen_rtx_REG (smode, SSE_REGNO (i)));
+	}
+
+      emit_label (label);
     }
 }
 
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 88b4029..6616da2 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -79,13 +79,11 @@
   ;; Prologue support
   UNSPEC_STACK_ALLOC
   UNSPEC_SET_GOT
-  UNSPEC_SSE_PROLOGUE_SAVE
   UNSPEC_REG_SAVE
   UNSPEC_DEF_CFA
   UNSPEC_SET_RIP
   UNSPEC_SET_GOT_OFFSET
   UNSPEC_MEMORY_BLOCKAGE
-  UNSPEC_SSE_PROLOGUE_SAVE_LOW
 
   ;; TLS support
   UNSPEC_TP
@@ -17825,179 +17823,6 @@
   { return ASM_SHORT "0x0b0f"; }
   [(set_attr "length" "2")])
 
-(define_expand "sse_prologue_save"
-  [(parallel [(set (match_operand:BLK 0 "" "")
-		   (unspec:BLK [(reg:DI XMM0_REG)
-				(reg:DI XMM1_REG)
-				(reg:DI XMM2_REG)
-				(reg:DI XMM3_REG)
-				(reg:DI XMM4_REG)
-				(reg:DI XMM5_REG)
-				(reg:DI XMM6_REG)
-				(reg:DI XMM7_REG)] UNSPEC_SSE_PROLOGUE_SAVE))
-	      (clobber (reg:CC FLAGS_REG))
-	      (clobber (match_operand:DI 1 "register_operand" ""))
-	      (use (match_operand:DI 2 "immediate_operand" ""))
-	      (use (label_ref:DI (match_operand 3 "" "")))
-	      (clobber (match_operand:DI 4 "register_operand" ""))
-	      (use (match_dup 1))])]
-  "TARGET_64BIT"
-  "")
-
-;; Pre-reload version of prologue save.  Until after prologue generation we don't know
-;; what the size of save instruction will be.
-;; Operand 0+operand 6 is the memory save area
-;; Operand 1 is number of registers to save (will get overwritten to operand 5)
-;; Operand 2 is number of non-vaargs SSE arguments
-;; Operand 3 is label starting the save block
-;; Operand 4 is used for temporary computation of jump address
-(define_insn "*sse_prologue_save_insn1"
-  [(set (mem:BLK (plus:DI (match_operand:DI 0 "register_operand" "R")
-			  (match_operand:DI 6 "const_int_operand" "n")))
-	(unspec:BLK [(reg:DI XMM0_REG)
-		     (reg:DI XMM1_REG)
-		     (reg:DI XMM2_REG)
-		     (reg:DI XMM3_REG)
-		     (reg:DI XMM4_REG)
-		     (reg:DI XMM5_REG)
-		     (reg:DI XMM6_REG)
-		     (reg:DI XMM7_REG)] UNSPEC_SSE_PROLOGUE_SAVE))
-   (clobber (reg:CC FLAGS_REG))
-   (clobber (match_operand:DI 1 "register_operand" "=r"))
-   (use (match_operand:DI 2 "const_int_operand" "i"))
-   (use (label_ref:DI (match_operand 3 "" "X")))
-   (clobber (match_operand:DI 4 "register_operand" "=&r"))
-   (use (match_operand:DI 5 "register_operand" "1"))]
-  "TARGET_64BIT
-   && INTVAL (operands[6]) + X86_64_SSE_REGPARM_MAX * 16 - 16 < 128
-   && INTVAL (operands[6]) + INTVAL (operands[2]) * 16 >= -128"
-  "#"
-  [(set_attr "type" "other")
-   (set_attr "memory" "store")
-   (set_attr "mode" "DI")])
-
-;; We know size of save instruction; expand the computation of jump address
-;; in the jumptable.
-(define_split
-  [(parallel [(set (match_operand:BLK 0 "" "")
-		    (unspec:BLK [(reg:DI XMM0_REG)
-				 (reg:DI XMM1_REG)
-				 (reg:DI XMM2_REG)
-				 (reg:DI XMM3_REG)
-				 (reg:DI XMM4_REG)
-				 (reg:DI XMM5_REG)
-				 (reg:DI XMM6_REG)
-				 (reg:DI XMM7_REG)] UNSPEC_SSE_PROLOGUE_SAVE))
-	       (clobber (reg:CC FLAGS_REG))
-	       (clobber (match_operand:DI 1 "register_operand" ""))
-	       (use (match_operand:DI 2 "const_int_operand" ""))
-	       (use (match_operand 3 "" ""))
-	       (clobber (match_operand:DI 4 "register_operand" ""))
-	       (use (match_operand:DI 5 "register_operand" ""))])]
-  "reload_completed"
-  [(parallel [(set (match_dup 0)
-		   (unspec:BLK [(reg:DI XMM0_REG)
-				(reg:DI XMM1_REG)
-				(reg:DI XMM2_REG)
-				(reg:DI XMM3_REG)
-				(reg:DI XMM4_REG)
-				(reg:DI XMM5_REG)
-				(reg:DI XMM6_REG)
-				(reg:DI XMM7_REG)]
-			       UNSPEC_SSE_PROLOGUE_SAVE_LOW))
-	      (use (match_dup 1))
-	      (use (match_dup 2))
-	      (use (match_dup 3))
-	      (use (match_dup 5))])]
-{
-  /* Movaps is 4 bytes, AVX and movsd is 5 bytes.  */
-  int size = 4 + (TARGET_AVX || crtl->stack_alignment_needed < 128);
-
-  /* Compute address to jump to:
-     label - eax*size + nnamed_sse_arguments*size. */
-  if (size == 5)
-    emit_insn (gen_rtx_SET (VOIDmode, operands[4],
-			    gen_rtx_PLUS
-			      (Pmode,
-			       gen_rtx_MULT (Pmode, operands[1],
-					     GEN_INT (4)),
-			       operands[1])));
-  else  if (size == 4)
-    emit_insn (gen_rtx_SET (VOIDmode, operands[4],
-			    gen_rtx_MULT (Pmode, operands[1],
-					  GEN_INT (4))));
-  else
-    gcc_unreachable ();
-  if (INTVAL (operands[2]))
-    emit_move_insn
-      (operands[1],
-       gen_rtx_CONST (DImode,
-		      gen_rtx_PLUS (DImode,
-				    operands[3],
-				    GEN_INT (INTVAL (operands[2])
-					     * size))));
-  else
-    emit_move_insn (operands[1], operands[3]);
-  emit_insn (gen_subdi3 (operands[1], operands[1], operands[4]));
-  operands[5] = GEN_INT (size);
-})
-
-(define_insn "sse_prologue_save_insn"
-  [(set (mem:BLK (plus:DI (match_operand:DI 0 "register_operand" "R")
-			  (match_operand:DI 4 "const_int_operand" "n")))
-	(unspec:BLK [(reg:DI XMM0_REG)
-		     (reg:DI XMM1_REG)
-		     (reg:DI XMM2_REG)
-		     (reg:DI XMM3_REG)
-		     (reg:DI XMM4_REG)
-		     (reg:DI XMM5_REG)
-		     (reg:DI XMM6_REG)
-		     (reg:DI XMM7_REG)] UNSPEC_SSE_PROLOGUE_SAVE_LOW))
-   (use (match_operand:DI 1 "register_operand" "r"))
-   (use (match_operand:DI 2 "const_int_operand" "i"))
-   (use (label_ref:DI (match_operand 3 "" "X")))
-   (use (match_operand:DI 5 "const_int_operand" "i"))]
-  "TARGET_64BIT
-   && INTVAL (operands[4]) + X86_64_SSE_REGPARM_MAX * 16 - 16 < 128
-   && INTVAL (operands[4]) + INTVAL (operands[2]) * 16 >= -128"
-{
-  int i;
-  operands[0] = gen_rtx_MEM (Pmode,
-			     gen_rtx_PLUS (Pmode, operands[0], operands[4]));
-  /* VEX instruction with a REX prefix will #UD.  */
-  if (TARGET_AVX && GET_CODE (XEXP (operands[0], 0)) != PLUS)
-    gcc_unreachable ();
-
-  output_asm_insn ("jmp\t%A1", operands);
-  for (i = X86_64_SSE_REGPARM_MAX - 1; i >= INTVAL (operands[2]); i--)
-    {
-      operands[4] = adjust_address (operands[0], DImode, i*16);
-      operands[5] = gen_rtx_REG (TImode, SSE_REGNO (i));
-      PUT_MODE (operands[4], TImode);
-      if (GET_CODE (XEXP (operands[0], 0)) != PLUS)
-        output_asm_insn ("rex", operands);
-      if (crtl->stack_alignment_needed < 128)
-        output_asm_insn ("%vmovsd\t{%5, %4|%4, %5}", operands);
-      else
-        output_asm_insn ("%vmovaps\t{%5, %4|%4, %5}", operands);
-    }
-  targetm.asm_out.internal_label (asm_out_file, "L",
-				  CODE_LABEL_NUMBER (operands[3]));
-  return "";
-}
-  [(set_attr "type" "other")
-   (set_attr "length_immediate" "0")
-   (set_attr "length_address" "0")
-   ;; 2 bytes for jump and opernds[4] bytes for each save.
-   (set (attr "length")
-     (plus (const_int 2)
-	   (mult (symbol_ref ("INTVAL (operands[5])"))
-		 (symbol_ref ("X86_64_SSE_REGPARM_MAX - INTVAL (operands[2])")))))
-   (set_attr "memory" "store")
-   (set_attr "modrm" "0")
-   (set_attr "prefix" "maybe_vex")
-   (set_attr "mode" "DI")])
-
 (define_expand "prefetch"
   [(prefetch (match_operand 0 "address_operand" "")
 	     (match_operand:SI 1 "const_int_operand" "")


* Re: x86_64 varargs setup jump table
  2010-07-20 16:32                 ` Richard Henderson
@ 2010-07-20 23:05                   ` Richard Henderson
  2010-07-20 23:21                     ` Sebastian Pop
  2010-07-22 13:52                     ` H.J. Lu
  2010-07-21 23:15                   ` H.J. Lu
  1 sibling, 2 replies; 49+ messages in thread
From: Richard Henderson @ 2010-07-20 23:05 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Bernd Schmidt, GCC Patches, ubizjak

On 07/20/2010 09:32 AM, Richard Henderson wrote:
> On 07/19/2010 02:13 PM, Richard Henderson wrote:
>> This bootstraps; regression test starting now.
>>
>> Obviously there's some patterns in i386.md that should be removed
>> along with this, were this patch to go in.
> 
> A slightly different patch that passes regression testing.
> This also vanishes the patterns that should go.

It was pointed out to me on IRC that x86_64 bootstrap is still broken.
So unless there are objections to this patch, I'll commit it tomorrow
morning (GMT+7).  We can debate the performance impact afterward.


r~


* Re: x86_64 varargs setup jump table
  2010-07-20 23:05                   ` Richard Henderson
@ 2010-07-20 23:21                     ` Sebastian Pop
  2010-07-22 13:52                     ` H.J. Lu
  1 sibling, 0 replies; 49+ messages in thread
From: Sebastian Pop @ 2010-07-20 23:21 UTC (permalink / raw)
  To: Richard Henderson; +Cc: H.J. Lu, Bernd Schmidt, GCC Patches, ubizjak

On Tue, Jul 20, 2010 at 18:05, Richard Henderson <rth@redhat.com> wrote:
> On 07/20/2010 09:32 AM, Richard Henderson wrote:
>> On 07/19/2010 02:13 PM, Richard Henderson wrote:
>>> This bootstraps; regression test starting now.
>>>
>>> Obviously there's some patterns in i386.md that should be removed
>>> along with this, were this patch to go in.
>>
>> A slightly different patch that passes regression testing.
>> This also vanishes the patterns that should go.
>
> It was pointed out to me on IRC that x86_64 bootstrap is still broken.
> So unless there are objections to this patch, I'll commit it tomorrow
> morning (GMT+7).  We can debate the performance impact afterward.

I am also testing SPEC 2006 with your last patch on amd64-linux.

Sebastian


* Re: x86_64 varargs setup jump table
  2010-07-20 16:32                 ` Richard Henderson
  2010-07-20 23:05                   ` Richard Henderson
@ 2010-07-21 23:15                   ` H.J. Lu
  2010-07-22 21:53                     ` Fix target/45027 [was: x86_64 varargs setup jump table] Richard Henderson
  1 sibling, 1 reply; 49+ messages in thread
From: H.J. Lu @ 2010-07-21 23:15 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bernd Schmidt, GCC Patches, ubizjak

On Tue, Jul 20, 2010 at 9:32 AM, Richard Henderson <rth@redhat.com> wrote:
> On 07/19/2010 02:13 PM, Richard Henderson wrote:
>> This bootstraps; regression test starting now.
>>
>> Obviously there's some patterns in i386.md that should be removed
>> along with this, were this patch to go in.
>
> A slightly different patch that passes regression testing.
> This also vanishes the patterns that should go.
>

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45027


-- 
H.J.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: x86_64 varargs setup jump table
  2010-07-20 23:05                   ` Richard Henderson
  2010-07-20 23:21                     ` Sebastian Pop
@ 2010-07-22 13:52                     ` H.J. Lu
  2010-07-22 18:03                       ` Sebastian Pop
  1 sibling, 1 reply; 49+ messages in thread
From: H.J. Lu @ 2010-07-22 13:52 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bernd Schmidt, GCC Patches, ubizjak

On Tue, Jul 20, 2010 at 4:05 PM, Richard Henderson <rth@redhat.com> wrote:
> On 07/20/2010 09:32 AM, Richard Henderson wrote:
>> On 07/19/2010 02:13 PM, Richard Henderson wrote:
>>> This bootstraps; regression test starting now.
>>>
>>> Obviously there's some patterns in i386.md that should be removed
>>> along with this, were this patch to go in.
>>
>> A slightly different patch that passes regression testing.
>> This also vanishes the patterns that should go.
>
> It was pointed out to me on IRC that x86_64 bootstrap is still broken.
> So unless there are objections to this patch, I'll commit it tomorrow
> morning (GMT+7).  We can debate the performance impact afterward.
>

Here are results on Intel Core i7:

Old: Gcc 4.6.0 revision 162269
New: Gcc 4.6.0 revision 162269 + this patch.

                                 New/Old
-O2:
164.gzip 			 0.595829%
175.vpr 			 0.123508%
176.gcc 			 0.580271%
181.mcf 			 -0.724263%
186.crafty 			 0.222717%
197.parser 			 0.100806%
252.eon 			 -0.159744%
253.perlbmk 			 -6.45709%
254.gap 			 0.417202%
255.vortex 			 0.127226%
256.bzip2 			 -0.167926%
300.twolf 			 0%
SPECint_base2000 			 -0.461082%

168.wupwise 			 -0.0897666%
171.swim 			 -0.210859%
172.mgrid 			 0.304298%
173.applu 			 1.39241%
177.mesa 			 -0.0538938%
178.galgel 			 0.0946074%
179.art 			 -0.244662%
183.equake 			 -1.24064%
187.facerec 			 0.50797%
188.ammp 			 0.532092%
189.lucas 			 -0.149785%
191.fma3d 			 0.841751%
200.sixtrack 			 -0.0752445%
301.apsi 			 -0.209018%
SPECfp_base2000 			 0.0985068%

400.perlbench 			 3.35821%
401.bzip2 			 1.11732%
403.gcc 			 -0.413223%
429.mcf 			 0.425532%
445.gobmk 			 0.41841%
456.hmmer 			 0%
458.sjeng 			 0%
462.libquantum 			 -1.81818%
464.h264ref 			 0%
471.omnetpp 			 0%
473.astar 			 0%
483.xalancbmk 			 0.3861%
SPECint(R)_base2006 			 0.41841%

410.bwaves 			 -0.287356%
416.gamess 			 0%
433.milc 			 1.08696%
434.zeusmp 			 0%
435.gromacs 			 -0.490196%
436.cactusADM 			 -4.67836%
437.leslie3d 			 -0.502513%
444.namd 			 0%
447.dealII 			 0%
450.soplex 			 -1.14504%
453.povray 			 -0.358423%
454.calculix 			 0%
459.GemsFDTD 			 -1.32159%
465.tonto 			 0.44843%
470.lbm 			 -10%
481.wrf 			 0%
482.sphinx3 			 0.621118%
SPECfp(R)_base2006 			 -0.865801%

-O3:

164.gzip 			 -0.0487567%
175.vpr 			 0.204499%
176.gcc 			 -0.389509%
181.mcf 			 -0.721092%
186.crafty 			 0%
197.parser 			 0.328638%
252.eon 			 -0.021796%
253.perlbmk 			 0.107817%
254.gap 			 0.607806%
255.vortex 			 0.176278%
256.bzip2 			 0%
300.twolf 			 0%
SPECint_base2000 			 0.018909%

168.wupwise 			 0.404722%
171.swim 			 -0.289321%
172.mgrid 			 0.662495%
173.applu 			 0%
177.mesa 			 0.0272109%
178.galgel 			 0.0126374%
179.art 			 0.155699%
183.equake 			 3.73021%
187.facerec 			 0.27248%
188.ammp 			 0.064%
189.lucas 			 0.432331%
191.fma3d 			 0.658436%
200.sixtrack 			 -0.142653%
301.apsi 			 -0.536572%
SPECfp_base2000 			 0.385806%

400.perlbench 			 2.22222%
401.bzip2 			 6.89655%
403.gcc 			 -0.809717%
429.mcf 			 0%
445.gobmk 			 0%
456.hmmer 			 0%
458.sjeng 			 0%
462.libquantum 			 -2.86396%
464.h264ref 			 0%
471.omnetpp 			 0%
473.astar 			 0%
483.xalancbmk 			 -0.352113%
SPECint(R)_base2006 			 0.403226%

410.bwaves 			 -7.14286%
416.gamess 			 0%
433.milc 			 -1.02564%
434.zeusmp 			 -0.865801%
435.gromacs 			 0%
436.cactusADM 			 0%
437.leslie3d 			 -1.66667%
444.namd 			 0%
447.dealII 			 -0.595238%
450.soplex 			 -0.754717%
453.povray 			 0.373134%
454.calculix 			 0.480769%
459.GemsFDTD 			 0.444444%
465.tonto 			 0.44843%
470.lbm 			 -0.438596%
481.wrf 			 0%
482.sphinx3 			 -1.08696%
SPECfp(R)_base2006 			 -0.78125%

Results look OK.  I can't tell why there are such big changes in 410.bwaves,
400.perlbench and 401.bzip2.

-- 
H.J.


* Re: x86_64 varargs setup jump table
  2010-07-22 13:52                     ` H.J. Lu
@ 2010-07-22 18:03                       ` Sebastian Pop
  2010-07-22 18:15                         ` Richard Henderson
  0 siblings, 1 reply; 49+ messages in thread
From: Sebastian Pop @ 2010-07-22 18:03 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Richard Henderson, Bernd Schmidt, GCC Patches, ubizjak

Here are the results on AMD Phenom(tm) 9950 Quad-Core.

Old: Gcc 4.6.0 revision 162355
New: Gcc 4.6.0 revision 162355 + this patch.
Flags: -O3 -funroll-loops -fpeel-loops -ffast-math -march=native

The number is the run time percentage: (old - new) / old * 100
(positive is better)
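As a purely illustrative example (made-up run times, not taken from this run):
if a benchmark ran in 500 s with the old compiler and 450 s with the new one,
its entry would be (500 - 450) / 500 * 100 = 10%, i.e. a 10% improvement.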

483.xalancbmk  Error
403.gcc        Error

410.bwaves      	0.00%
416.gamess      	0.00%
433.milc        	0.00%
434.zeusmp      	-0.40%
435.gromacs     	-0.18%
436.cactusADM   	0.92%
437.leslie3d    	0.00%
444.namd        	0.00%
447.dealII      	-0.16%
450.soplex      	-0.15%
453.povray      	-0.33%
454.calculix    	-0.16%
459.GemsFDTD    	0.11%
465.tonto       	-0.38%
470.lbm         	-0.12%
481.wrf         	-0.11%
482.sphinx3     	0.85%
400.perlbench   	0.33%
401.bzip2       	0.33%
429.mcf         	6.11%
445.gobmk       	0.13%
456.hmmer       	9.95%
458.sjeng       	0.23%
462.libquantum  	-0.38%
464.h264ref     	-0.91%
471.omnetpp     	0.27%
473.astar       	0.00%


* Re: x86_64 varargs setup jump table
  2010-07-22 18:03                       ` Sebastian Pop
@ 2010-07-22 18:15                         ` Richard Henderson
  2010-07-22 18:17                           ` Richard Henderson
  0 siblings, 1 reply; 49+ messages in thread
From: Richard Henderson @ 2010-07-22 18:15 UTC (permalink / raw)
  To: Sebastian Pop; +Cc: H.J. Lu, Bernd Schmidt, GCC Patches, ubizjak

On 07/22/2010 11:02 AM, Sebastian Pop wrote:
> Here are the results on AMD Phenom(tm) 9950 Quad-Core.
> 
> Old: Gcc 4.6.0 revision 162355
> New: Gcc 4.6.0 revision 162355 + this patch.
> Flags: -O3 -funroll-loops -fpeel-loops -ffast-math -march=native
> 
> The number is the run time percentage: (old - new) / old * 100
> (positive is better)
> 
> [ no positive results ]

Hmm.  At least HJ had some positive results.  I'm surprised
that there are none on the AMD box.

Does movaps have reformatting stalls that perhaps movdqa doesn't
with that particular micro-architecture?  Or are stores exempt
from reformatting stalls now?

Otherwise the only thing I can think of is that the computed jump
was in practice very predictable (i.e. lots of calls containing
the same sequence of types), and that performing a few fewer
stores makes the difference.


r~


* Re: x86_64 varargs setup jump table
  2010-07-22 18:15                         ` Richard Henderson
@ 2010-07-22 18:17                           ` Richard Henderson
  2010-07-22 18:20                             ` Sebastian Pop
  0 siblings, 1 reply; 49+ messages in thread
From: Richard Henderson @ 2010-07-22 18:17 UTC (permalink / raw)
  To: Sebastian Pop; +Cc: H.J. Lu, Bernd Schmidt, GCC Patches, ubizjak

On 07/22/2010 11:15 AM, Richard Henderson wrote:
> Hmm.  At least HJ had some positive results.  I'm surprised
> that there are none on the AMD box.

Clearly I can't read.  There were improvements, some large.


r~


* Re: x86_64 varargs setup jump table
  2010-07-22 18:17                           ` Richard Henderson
@ 2010-07-22 18:20                             ` Sebastian Pop
  0 siblings, 0 replies; 49+ messages in thread
From: Sebastian Pop @ 2010-07-22 18:20 UTC (permalink / raw)
  To: Richard Henderson; +Cc: H.J. Lu, Bernd Schmidt, GCC Patches, ubizjak

On Thu, Jul 22, 2010 at 13:17, Richard Henderson <rth@redhat.com> wrote:
> On 07/22/2010 11:15 AM, Richard Henderson wrote:
>> Hmm.  At least HJ had some positive results.  I'm surprised
>> that there are none on the AMD box.
>
> Clearly I can't read.  There were improvements, some large.

Right.  The negative results are minor, less than 1%.

Sebastian


* Fix target/45027 [was: x86_64 varargs setup jump table]
  2010-07-21 23:15                   ` H.J. Lu
@ 2010-07-22 21:53                     ` Richard Henderson
  2010-07-23  9:07                       ` Richard Guenther
  0 siblings, 1 reply; 49+ messages in thread
From: Richard Henderson @ 2010-07-22 21:53 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Bernd Schmidt, GCC Patches, ubizjak

[-- Attachment #1: Type: text/plain, Size: 866 bytes --]

On 07/21/2010 04:15 PM, H.J. Lu wrote:
> This caused:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45027

Fixed thus.

Honza and I had a discussion about this on IRC.  His original patch, 

  http://gcc.gnu.org/ml/gcc-patches/2010-04/msg01120.html

had some problems in it, but worked anyway for a complicated set of
reasons.

The following patch removes the optimization that attempts to leave
the stack alignment at 64-bits, rather than storing the varargs SSE
registers in full into a 128-bit aligned slot.

The only optimization that seems interesting to me would be to properly
track the true types of the data at each varargs slot.  In a restricted
set of circumstances we might be able to copy-propagate the incoming 
SSE register to the use.  This would require a significant amount of
work in pass_stdarg, and really doesn't seem worth the effort.
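
For concreteness, the kind of restricted case meant here might look like the
following C fragment (illustrative only; neither this patch nor pass_stdarg
currently performs such a propagation):

  #include <stdarg.h>

  /* The only va_arg is a double, so in principle its value could be
     copy-propagated straight from the incoming SSE argument register
     instead of being reloaded from the register save area.  */
  double
  first_double (int count, ...)
  {
    va_list ap;
    double d;

    va_start (ap, count);
    d = va_arg (ap, double);
    va_end (ap);
    return d;
  }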


r~

[-- Attachment #2: z --]
[-- Type: text/plain, Size: 2402 bytes --]

        * config/i386/i386.c (setup_incoming_varargs_64): Force the use
        of V4SFmode for the SSE saves; increase stack alignment if needed.
        (ix86_gimplify_va_arg): Don't increase stack alignment here.



diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index d9dc571..596a6db 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -7073,7 +7073,7 @@ setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
 
   /* FPR size of varargs save area.  We don't need it if we don't pass
      anything in SSE registers.  */
-  if (cum->sse_nregs && cfun->va_list_fpr_size)
+  if (TARGET_SSE && cfun->va_list_fpr_size)
     ix86_varargs_fpr_size = X86_64_SSE_REGPARM_MAX * 16;
   else
     ix86_varargs_fpr_size = 0;
@@ -7112,12 +7112,13 @@ setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
       emit_jump_insn (gen_cbranchqi4 (test, XEXP (test, 0), XEXP (test, 1),
 				      label));
 
-      /* If we've determined that we're only loading scalars (and not
-	 vector data) then we can store doubles instead.  */
-      if (crtl->stack_alignment_needed < 128)
-	smode = DFmode;
-      else
-	smode = V4SFmode;
+      /* ??? If !TARGET_SSE_TYPELESS_STORES, would we perform better if
+	 we used movdqa (i.e. TImode) instead?  Perhaps even better would
+	 be if we could determine the real mode of the data, via a hook
+	 into pass_stdarg.  Ignore all that for now.  */
+      smode = V4SFmode;
+      if (crtl->stack_alignment_needed < GET_MODE_ALIGNMENT (smode))
+	crtl->stack_alignment_needed = GET_MODE_ALIGNMENT (smode);
 
       max = cum->sse_regno + cfun->va_list_fpr_size / 16;
       if (max > X86_64_SSE_REGPARM_MAX)
@@ -7549,8 +7550,7 @@ ix86_gimplify_va_arg (tree valist, tree type, gimple_seq *pre_p,
     arg_boundary = MAX_SUPPORTED_STACK_ALIGNMENT;
 
   /* Care for on-stack alignment if needed.  */
-  if (arg_boundary <= 64
-      || integer_zerop (TYPE_SIZE (type)))
+  if (arg_boundary <= 64 || size == 0)
     t = ovf;
  else
     {
@@ -7561,9 +7561,8 @@ ix86_gimplify_va_arg (tree valist, tree type, gimple_seq *pre_p,
       t = build2 (BIT_AND_EXPR, TREE_TYPE (t), t,
 		  size_int (-align));
       t = fold_convert (TREE_TYPE (ovf), t);
-      if (crtl->stack_alignment_needed < arg_boundary)
-	crtl->stack_alignment_needed = arg_boundary;
     }
+
   gimplify_expr (&t, pre_p, NULL, is_gimple_val, fb_rvalue);
   gimplify_assign (addr, t, pre_p);
 


* Re: Fix target/45027 [was: x86_64 varargs setup jump table]
  2010-07-22 21:53                     ` Fix target/45027 [was: x86_64 varargs setup jump table] Richard Henderson
@ 2010-07-23  9:07                       ` Richard Guenther
  0 siblings, 0 replies; 49+ messages in thread
From: Richard Guenther @ 2010-07-23  9:07 UTC (permalink / raw)
  To: Richard Henderson; +Cc: H.J. Lu, Bernd Schmidt, GCC Patches, ubizjak

On Thu, Jul 22, 2010 at 11:52 PM, Richard Henderson <rth@redhat.com> wrote:
> On 07/21/2010 04:15 PM, H.J. Lu wrote:
>> This caused:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45027
>
> Fixed thus.
>
> Honza and I had a discussion about this on IRC.  His original patch,
>
>  http://gcc.gnu.org/ml/gcc-patches/2010-04/msg01120.html
>
> had some problems in it, but worked anyway for a complicated set of
> reasons.
>
> The following patch removes the optimization that attempts to leave
> the stack alignment at 64-bits, rather than storing the varargs SSE
> registers in full into a 128-bit aligned slot.
>
> The only optimization that seems interesting to me would be to properly
> track the true types of the data at each varargs slot.  In a restricted
> set of circumstances we might be able to copy-propagate the incoming
> SSE register to the use.  This would require a significant amount of
> work in pass_stdarg, and really doesn't seem worth the effort.

Well, IPA-SRA could be extended to handle varargs ...

Richard.

>
> r~
>


* Re: New optimization for reload_combine
  2010-07-16 10:35 New optimization for reload_combine Bernd Schmidt
  2010-07-16 18:32 ` Jeff Law
  2010-07-16 19:45 ` Paolo Bonzini
@ 2010-07-26 10:13 ` IainS
  2010-07-26 10:46   ` Bernd Schmidt
  2010-07-27  0:53   ` Bernd Schmidt
  2 siblings, 2 replies; 49+ messages in thread
From: IainS @ 2010-07-26 10:13 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc patches, David Edelsohn

Hi Bernd,

Apparently, one more problem remains...

On 16 Jul 2010, at 11:34, Bernd Schmidt wrote:
> <rc-newpattern.diff>

This causes a bootstrap failure on powerpc64-darwin9 (when building
GMP/MPFR/MPC in tree); libgfortran fails to build because of a bad
header, which is in turn caused by an MPFR failure.

to cut a long story short ...

MPFR built with powerpc*-darwin9 @ m64 at 162270 fails its built-in
check [12 failures out of 148 tests] (it passes at 162269).
It still fails at head with the same problem (despite the other fixes).

I have no other powerpc64 execution target, so I can't tell if it's
platform-related or CPU-wide.
Having said that, ISTM that most people do not build mp* in-tree, so
this issue might be missed by them; hence the copy to David.

I've checked {i686,x86_64}-darwin9 and they do _not_ show this problem.

This is independent of my darwin64 PPC ABI fixes (i.e. applying those  
does not alter the problem).

It fails on the stage1 compiler, so that should help debug (and I  
guess it's not a gtoggle issue).
and I've kept the builds of MPFR (162269) and (162270), although
they are normal, optimized objects.

cheers,
Iain


* Re: New optimization for reload_combine
  2010-07-26 10:13 ` IainS
@ 2010-07-26 10:46   ` Bernd Schmidt
  2010-07-27  0:53   ` Bernd Schmidt
  1 sibling, 0 replies; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-26 10:46 UTC (permalink / raw)
  To: IainS; +Cc: gcc patches, David Edelsohn

[-- Attachment #1: Type: text/plain, Size: 569 bytes --]

On 07/26/2010 12:12 PM, IainS wrote:
> It fails on the stage1 compiler, so that should help debug (and I guess
> it's not a gtoggle issue).
> and I've kept the builds of MPFR(162269) and (162270) .. (although they
> are normal, optimized objects).

Can you, as before, find a miscompiled file and send me .i and
before/after .s files?  First see if stage1 gcc has any testsuite
failures; that's probably easier than finding the bootstrap problem.

Also, try with the attached patch, which adds a gcc_assert that the
situation fixed by the previous patch can't happen.


Bernd

[-- Attachment #2: assert.diff --]
[-- Type: text/plain, Size: 664 bytes --]

Index: postreload.c
===================================================================
--- postreload.c	(revision 162421)
+++ postreload.c	(working copy)
@@ -955,9 +955,9 @@ reload_combine_recognize_const_pattern (
 	      && reg_state[clobbered_regno].real_store_ruid >= use_ruid)
 	    break;
 
-	  /* Avoid moving a use of ADDREG past a point where it
-	     is stored.  */
-	  if (reg_state[REGNO (addreg)].store_ruid >= use_ruid)
+	  gcc_assert (reg_state[regno].store_ruid <= use_ruid);
+	  /* Avoid moving a use of ADDREG past a point where it is stored.  */
+	  if (reg_state[REGNO (addreg)].store_ruid > use_ruid)
 	    break;
 
 	  if (mem != NULL_RTX)


* Re: New optimization for reload_combine
  2010-07-26 10:13 ` IainS
  2010-07-26 10:46   ` Bernd Schmidt
@ 2010-07-27  0:53   ` Bernd Schmidt
  1 sibling, 0 replies; 49+ messages in thread
From: Bernd Schmidt @ 2010-07-27  0:53 UTC (permalink / raw)
  To: IainS; +Cc: gcc patches, David Edelsohn

[-- Attachment #1: Type: text/plain, Size: 361 bytes --]

On 07/26/2010 12:12 PM, IainS wrote:
> Apparently, one more problem remains...

Thanks again for providing a testcase.  I've committed the following fix
after bootstrap and regression test on i686-linux.  A use with a wrong
mode could cause us to move the addition past another set of the
register, as the necessary test was at the wrong nesting level.


Bernd

[-- Attachment #2: pr-movefix.diff --]
[-- Type: text/plain, Size: 1295 bytes --]

Index: ChangeLog
===================================================================
--- ChangeLog	(revision 162549)
+++ ChangeLog	(working copy)
@@ -1,3 +1,8 @@
+2010-07-27  Bernd Schmidt  <bernds@codesourcery.com>
+
+	* postreload.c (reload_combine_recognize_const_pattern): Move test
+	for limiting the insn movement to the right scope.
+
 2010-07-26  Richard Henderson  <rth@redhat.com>
 
 	PR target/44132
Index: postreload.c
===================================================================
--- postreload.c	(revision 162421)
+++ postreload.c	(working copy)
@@ -955,8 +955,8 @@ reload_combine_recognize_const_pattern (
 	      && reg_state[clobbered_regno].real_store_ruid >= use_ruid)
 	    break;
 
-	  /* Avoid moving a use of ADDREG past a point where it
-	     is stored.  */
+	  gcc_assert (reg_state[regno].store_ruid <= use_ruid);
+	  /* Avoid moving a use of ADDREG past a point where it is stored.  */
 	  if (reg_state[REGNO (addreg)].store_ruid >= use_ruid)
 	    break;
 
@@ -1033,10 +1033,10 @@ reload_combine_recognize_const_pattern (
 		    }
 		}
 	    }
-	  /* If we get here, we couldn't handle this use.  */
-	  if (must_move_add)
-	    break;
 	}
+      /* If we get here, we couldn't handle this use.  */
+      if (must_move_add)
+	break;
     }
   while (use);
 


* Re: New optimization for reload_combine
  2010-07-17  2:38     ` H.J. Lu
  2010-07-17 14:25       ` Bernd Schmidt
@ 2010-11-04  4:30       ` H.J. Lu
  1 sibling, 0 replies; 49+ messages in thread
From: H.J. Lu @ 2010-11-04  4:30 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jeff Law, GCC Patches

On Fri, Jul 16, 2010 at 7:38 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jul 16, 2010 at 4:49 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>> On 07/16/2010 08:32 PM, Jeff Law wrote:
>>> We're probably not talking about a huge issue, but we might as well take
>>> the more compact packing when we can get it.
>>>
>>> Otherwise it looks fine.  Pre-approved after juggling the structure orders.
>>
>> Yay.  Thanks!
>>
>> Here's the variant that I committed.  It's a good thing Paolo made a
>> suggestion, as that exposed a new bootstrap failure - we really mustn't
>> move a stack pointer adjust.
>>
>
> This caused:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44970
>

This also caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46297


-- 
H.J.


Thread overview: 49+ messages
2010-07-16 10:35 New optimization for reload_combine Bernd Schmidt
2010-07-16 18:32 ` Jeff Law
2010-07-16 23:50   ` Bernd Schmidt
2010-07-17  2:38     ` H.J. Lu
2010-07-17 14:25       ` Bernd Schmidt
2010-07-17 15:03         ` H.J. Lu
2010-07-17 15:10           ` IainS
2010-07-17 16:00           ` Bernd Schmidt
2010-07-17 16:15             ` H.J. Lu
2010-07-17 17:18               ` Bernd Schmidt
2010-07-17 17:27                 ` H.J. Lu
2010-07-17 17:29                 ` IainS
2010-07-18 18:15             ` H.J. Lu
2010-07-19 10:01           ` Bernd Schmidt
2010-07-19 13:27             ` H.J. Lu
2010-07-20 11:46             ` Bernd Schmidt
2010-07-20 15:33               ` Jeff Law
2010-07-20 15:34               ` Bernd Schmidt
2010-07-20 15:40                 ` Jakub Jelinek
2010-07-19 15:41         ` x86_64 varargs setup jump table Richard Henderson
2010-07-19 15:56           ` H.J. Lu
2010-07-19 20:57             ` H.J. Lu
2010-07-19 21:14               ` Richard Henderson
2010-07-20 16:32                 ` Richard Henderson
2010-07-20 23:05                   ` Richard Henderson
2010-07-20 23:21                     ` Sebastian Pop
2010-07-22 13:52                     ` H.J. Lu
2010-07-22 18:03                       ` Sebastian Pop
2010-07-22 18:15                         ` Richard Henderson
2010-07-22 18:17                           ` Richard Henderson
2010-07-22 18:20                             ` Sebastian Pop
2010-07-21 23:15                   ` H.J. Lu
2010-07-22 21:53                     ` Fix target/45027 [was: x86_64 varargs setup jump table] Richard Henderson
2010-07-23  9:07                       ` Richard Guenther
2010-07-19 16:09           ` x86_64 varargs setup jump table Andi Kleen
2010-07-19 16:16             ` H.J. Lu
2010-07-19 16:26               ` Andi Kleen
2010-07-19 16:23           ` Jan Hubicka
2010-07-19 16:19         ` New optimization for reload_combine Jeff Law
2010-11-04  4:30       ` H.J. Lu
2010-07-16 19:45 ` Paolo Bonzini
2010-07-16 19:55   ` Bernd Schmidt
2010-07-16 20:24     ` Paolo Bonzini
2010-07-19 16:11   ` Mark Mitchell
2010-07-19 16:13     ` Jeff Law
2010-07-19 16:31     ` Paolo Bonzini
2010-07-26 10:13 ` IainS
2010-07-26 10:46   ` Bernd Schmidt
2010-07-27  0:53   ` Bernd Schmidt
