public inbox for gcc-patches@gcc.gnu.org
* Combine four insns
@ 2010-08-06 14:49 Bernd Schmidt
  2010-08-06 15:04 ` Richard Guenther
  2010-08-06 15:08 ` Steven Bosscher
  0 siblings, 2 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-06 14:49 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1695 bytes --]

I was slightly bored while waiting for some SPEC runs, so I played with
the combiner a little.  The following extends it to do four-insn
combinations.

Conceptually, the following is one motivation for the change: consider a
RISC target (probably not very prevalent when the combiner was written),
where arithmetic ops don't allow constants, only registers.  Then, to
combine multiple such operations into one (or rather, two), you need a
four-instruction window.  This is what happens e.g. on Thumb-1; PR42172
is such an example.  We have

	ldrb	r3, [r0]
	mov	r2, #7
	bic	r3, r2
	add	r2, r2, #49
	bic	r3, r2
	sub	r2, r2, #48
	orr	r3, r2
	add	r2, r2, #56
	bic	r3, r2
	add	r2, r2, #63
	and	r3, r2
	strb	r3, [r0]

which can be optimized into

	mov	r3, #8
	strb	r3, [r0]

by the patch below.  I'm attaching a file with a few more examples I
found.  The same patterns occur quite frequently - several times e.g. in
Linux TCP code.
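
To give an idea of the source code this comes from, here is a sketch (a
hypothetical example in the spirit of the PR testcase, not copied from
it): several narrow bitfield stores to the same byte, each of which
expands to a load/mask/or/store sequence on Thumb-1.

struct flags
{
  unsigned char a : 3;	/* bits 0-2 */
  unsigned char b : 1;	/* bit 3 */
  unsigned char c : 2;	/* bits 4-5 */
  unsigned char d : 2;	/* bits 6-7 */
};

void
set_flags (struct flags *p)
{
  /* Each assignment is a read-modify-write of the same byte; with
     four-insn combinations available, the whole sequence can collapse
     into storing the constant 8 (only bit 3 ends up set).  */
  p->a = 0;
  p->b = 1;
  p->c = 0;
  p->d = 0;
}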

The downside is a compile-time cost, which appears to be about 1% user
time on a full bootstrap.  To put that in perspective, it's 12s of real
time.

real 16m13.446s user 103m3.607s sys 3m2.235s
real 16m25.534s user 104m0.686s sys 3m4.158s

I'd argue that compile-time shouldn't be our top priority, as it's one
of the few things that still benefits from Moore's Law, while the same
may not be true for the programs we compile.  I expect people will argue
a 1% slowdown is unacceptable, but in that case I think we need to
discuss whether users are complaining about slow compiles more often
than they complain about missed optimizations - in my experience the
reverse is true.

Bootstrapped on i686-linux; a slightly earlier version was also
regression-tested.


Bernd

[-- Attachment #2: combine-examples --]
[-- Type: text/plain, Size: 2597 bytes --]

-       andl    $4, %eax
-       cmpl    $1, %eax
-       sbbl    %eax, %eax
-       notl    %eax
+       sall    $5, %eax
        andl    $128, %eax
===========
-       movl    8(%eax), %edx
-       xorb    %dl, %dl
-       orl     $11, %edx
-       movl    %edx, 8(%eax)
+       movb    $11, 8(%eax)
==========
-       movzbl  %al, %edx
-       sall    $16, %edx
-       movl    92(%edi), %ecx
-       andl    $-16711681, %ecx
-       orl     %ecx, %edx
-       movl    %edx, 92(%edi)
+       movb    %al, 94(%edi)
==========
-       andl    $2, %eax
-       cmpl    $1, %eax
-       sbbl    %eax, %eax
-       notl    %eax
+       sall    $30, %eax
+       sarl    $31, %eax
===========
-       movl    %ecx, %edx
-       sall    $8, %edx
-       shrw    $8, %ax
-       orl     %edx, %eax
+       rolw    $8, %ax
===========
-       movl    %edx, %eax
-       sall    $5, %eax
-       imull   $39, %edx, %edx
-       subl    %edx, %ecx
-       leal    (%eax,%ecx), %edx
+       imull   $-7, %edx, %edx
+       addl    %ecx, %edx
==========
        movl    24(%esp), %eax
-       movzbl  792(%eax), %ebx
-       movl    %ebx, %eax
-       andl    $-80, %eax
-       orl     $64, %eax
-       andl    $15, %ebx
-       orl     %eax, %ebx
-       movl    24(%esp), %eax
-       movb    %bl, 792(%eax)
+       orb     $64, 792(%eax)
==========
-       movzbl  1(%eax), %ecx
-       movl    %ecx, %edx
-       shrb    $3, %dl
-       andl    $3, %edx
-       orl     $1, %edx
-       leal    0(,%edx,8), %ebp
-       movl    %ecx, %edx
-       andl    $-25, %edx
-       orl     %ebp, %edx
-       movb    %dl, 1(%eax)
+       orb     $8, 1(%eax)
==========
-       mov     r3, #2
-       bic     r0, r3
-       lsl     r0, r0, #24
-       lsr     r0, r0, #24
+       mov     r3, #253
+       and     r0, r3
==========
-       ldr     r3, [r0, r3]
-       mov     r0, #128
-       lsl     r0, r0, #17
-       and     r0, r3
-       neg     r3, r0
-       adc     r0, r0, r3
-       sub     r0, r0, #1
+       ldr     r0, [r0, r3]
+       lsl     r0, r0, #7
+       asr     r0, r0, #31
==========
-       bic     r2, r1, #-134217728
+       bic     r1, r1, #-134217728
-       rsb     r3, r3, r3, lsl #9
-       add     r3, r3, r3, lsl #18
-       negs    r1, r3
-       orr     r1, r2, r1, lsl #27
+       orr     r1, r1, r3, lsl #27
=========
 xfs_dir2_sf_toino8.isra.8:
-       @ args = 0, pretend = 0, frame = 72
+       @ args = 0, pretend = 0, frame = 40
==========
-       mvn     r2, fp
-       add     r3, fp, #1
-       adds    r2, r2, r7
-       adds    r4, r3, r2
+       mov     r4, r7

[-- Attachment #3: combine4b.diff --]
[-- Type: text/plain, Size: 42787 bytes --]

	PR target/42172
	* combine.c (combine_validate_cost): New arg I0.  All callers changed.
	Take its cost into account if nonnull.
	(insn_a_feeds_b): New static function.
	(combine_instructions): Look for four-insn combinations.
	(can_combine_p): New args PRED2, SUCC2.  All callers changed.  Take
	them into account when computing all_adjacent and looking for other
	uses.
	(combinable_i3pat): New args I0DEST, I0_NOT_IN_SRC.  All callers
	changed.  Treat them like I1DEST and I1_NOT_IN_SRC.
	(try_combine): New arg I0.  Handle four-insn combinations.
	(distribute_notes): New arg ELIM_I0.  All callers changed.  Treat it
	like ELIM_I1.

Index: combine.c
===================================================================
--- combine.c	(revision 162821)
+++ combine.c	(working copy)
@@ -385,10 +385,10 @@ static void init_reg_last (void);
 static void setup_incoming_promotions (rtx);
 static void set_nonzero_bits_and_sign_copies (rtx, const_rtx, void *);
 static int cant_combine_insn_p (rtx);
-static int can_combine_p (rtx, rtx, rtx, rtx, rtx *, rtx *);
-static int combinable_i3pat (rtx, rtx *, rtx, rtx, int, rtx *);
+static int can_combine_p (rtx, rtx, rtx, rtx, rtx, rtx, rtx *, rtx *);
+static int combinable_i3pat (rtx, rtx *, rtx, rtx, rtx, int, int, rtx *);
 static int contains_muldiv (rtx);
-static rtx try_combine (rtx, rtx, rtx, int *);
+static rtx try_combine (rtx, rtx, rtx, rtx, int *);
 static void undo_all (void);
 static void undo_commit (void);
 static rtx *find_split_point (rtx *, rtx, bool);
@@ -438,7 +438,7 @@ static void reg_dead_at_p_1 (rtx, const_
 static int reg_dead_at_p (rtx, rtx);
 static void move_deaths (rtx, rtx, int, rtx, rtx *);
 static int reg_bitfield_target_p (rtx, rtx);
-static void distribute_notes (rtx, rtx, rtx, rtx, rtx, rtx);
+static void distribute_notes (rtx, rtx, rtx, rtx, rtx, rtx, rtx);
 static void distribute_links (rtx);
 static void mark_used_regs_combine (rtx);
 static void record_promoted_value (rtx, rtx);
@@ -766,7 +766,7 @@ do_SUBST_MODE (rtx *into, enum machine_m
 \f
 /* Subroutine of try_combine.  Determine whether the combine replacement
    patterns NEWPAT, NEWI2PAT and NEWOTHERPAT are cheaper according to
-   insn_rtx_cost that the original instruction sequence I1, I2, I3 and
+   insn_rtx_cost than the original instruction sequence I0, I1, I2, I3 and
    undobuf.other_insn.  Note that I1 and/or NEWI2PAT may be NULL_RTX.
    NEWOTHERPAT and undobuf.other_insn may also both be NULL_RTX.  This
    function returns false, if the costs of all instructions can be
@@ -774,10 +774,10 @@ do_SUBST_MODE (rtx *into, enum machine_m
    sequence.  */
 
 static bool
-combine_validate_cost (rtx i1, rtx i2, rtx i3, rtx newpat, rtx newi2pat,
-		       rtx newotherpat)
+combine_validate_cost (rtx i0, rtx i1, rtx i2, rtx i3, rtx newpat,
+		       rtx newi2pat, rtx newotherpat)
 {
-  int i1_cost, i2_cost, i3_cost;
+  int i0_cost, i1_cost, i2_cost, i3_cost;
   int new_i2_cost, new_i3_cost;
   int old_cost, new_cost;
 
@@ -788,13 +788,23 @@ combine_validate_cost (rtx i1, rtx i2, r
   if (i1)
     {
       i1_cost = INSN_COST (i1);
-      old_cost = (i1_cost > 0 && i2_cost > 0 && i3_cost > 0)
-		 ? i1_cost + i2_cost + i3_cost : 0;
+      if (i0)
+	{
+	  i0_cost = INSN_COST (i0);
+	  old_cost = (i0_cost > 0 && i1_cost > 0 && i2_cost > 0 && i3_cost > 0
+		      ? i0_cost + i1_cost + i2_cost + i3_cost : 0);
+	}
+      else
+	{
+	  old_cost = (i1_cost > 0 && i2_cost > 0 && i3_cost > 0
+		      ? i1_cost + i2_cost + i3_cost : 0);
+	  i0_cost = 0;
+	}
     }
   else
     {
       old_cost = (i2_cost > 0 && i3_cost > 0) ? i2_cost + i3_cost : 0;
-      i1_cost = 0;
+      i1_cost = i0_cost = 0;
     }
 
   /* Calculate the replacement insn_rtx_costs.  */
@@ -833,7 +843,16 @@ combine_validate_cost (rtx i1, rtx i2, r
     {
       if (dump_file)
 	{
-	  if (i1)
+	  if (i0)
+	    {
+	      fprintf (dump_file,
+		       "rejecting combination of insns %d, %d, %d and %d\n",
+		       INSN_UID (i0), INSN_UID (i1), INSN_UID (i2),
+		       INSN_UID (i3));
+	      fprintf (dump_file, "original costs %d + %d + %d + %d = %d\n",
+		       i0_cost, i1_cost, i2_cost, i3_cost, old_cost);
+	    }
+	  else if (i1)
 	    {
 	      fprintf (dump_file,
 		       "rejecting combination of insns %d, %d and %d\n",
@@ -1010,6 +1029,19 @@ clear_log_links (void)
     if (INSN_P (insn))
       free_INSN_LIST_list (&LOG_LINKS (insn));
 }
+
+/* Walk the LOG_LINKS of insn B to see if we find a reference to A.  Return
+   true if we found a LOG_LINK that proves that A feeds B.  */
+
+static bool
+insn_a_feeds_b (rtx a, rtx b)
+{
+  rtx links;
+  for (links = LOG_LINKS (b); links; links = XEXP (links, 1))
+    if (XEXP (links, 0) == a)
+      return true;
+  return false;
+}
 \f
 /* Main entry point for combiner.  F is the first insn of the function.
    NREGS is the first unused pseudo-reg number.
@@ -1150,7 +1182,7 @@ combine_instructions (rtx f, unsigned in
 	      /* Try this insn with each insn it links back to.  */
 
 	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
-		if ((next = try_combine (insn, XEXP (links, 0),
+		if ((next = try_combine (insn, XEXP (links, 0), NULL_RTX,
 					 NULL_RTX, &new_direct_jump_p)) != 0)
 		  goto retry;
 
@@ -1168,8 +1200,8 @@ combine_instructions (rtx f, unsigned in
 		  for (nextlinks = LOG_LINKS (link);
 		       nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, link,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, link, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1187,14 +1219,14 @@ combine_instructions (rtx f, unsigned in
 		  && NONJUMP_INSN_P (prev)
 		  && sets_cc0_p (PATTERN (prev)))
 		{
-		  if ((next = try_combine (insn, prev,
-					   NULL_RTX, &new_direct_jump_p)) != 0)
+		  if ((next = try_combine (insn, prev, NULL_RTX, NULL_RTX,
+					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
 		  for (nextlinks = LOG_LINKS (prev); nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, prev,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, prev, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1207,14 +1239,14 @@ combine_instructions (rtx f, unsigned in
 		  && GET_CODE (PATTERN (insn)) == SET
 		  && reg_mentioned_p (cc0_rtx, SET_SRC (PATTERN (insn))))
 		{
-		  if ((next = try_combine (insn, prev,
-					   NULL_RTX, &new_direct_jump_p)) != 0)
+		  if ((next = try_combine (insn, prev, NULL_RTX, NULL_RTX,
+					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
 		  for (nextlinks = LOG_LINKS (prev); nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, prev,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, prev, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1230,7 +1262,8 @@ combine_instructions (rtx f, unsigned in
 		    && NONJUMP_INSN_P (prev)
 		    && sets_cc0_p (PATTERN (prev))
 		    && (next = try_combine (insn, XEXP (links, 0),
-					    prev, &new_direct_jump_p)) != 0)
+					    prev, NULL_RTX,
+					    &new_direct_jump_p)) != 0)
 		  goto retry;
 #endif
 
@@ -1240,10 +1273,64 @@ combine_instructions (rtx f, unsigned in
 		for (nextlinks = XEXP (links, 1); nextlinks;
 		     nextlinks = XEXP (nextlinks, 1))
 		  if ((next = try_combine (insn, XEXP (links, 0),
-					   XEXP (nextlinks, 0),
+					   XEXP (nextlinks, 0), NULL_RTX,
 					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
+	      /* Try four-instruction combinations.  */
+	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
+		{
+		  rtx next1;
+		  rtx link = XEXP (links, 0);
+
+		  /* If the linked insn has been replaced by a note, then there
+		     is no point in pursuing this chain any further.  */
+		  if (NOTE_P (link))
+		    continue;
+
+		  for (next1 = LOG_LINKS (link); next1; next1 = XEXP (next1, 1))
+		    {
+		      rtx link1 = XEXP (next1, 0);
+		      if (NOTE_P (link1))
+			continue;
+		      /* I0 -> I1 -> I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		      /* I0, I1 -> I2, I2 -> I3.  */
+		      for (nextlinks = XEXP (next1, 1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		    }
+
+		  for (next1 = XEXP (links, 1); next1; next1 = XEXP (next1, 1))
+		    {
+		      rtx link1 = XEXP (next1, 0);
+		      if (NOTE_P (link1))
+			continue;
+		      /* I0 -> I2; I1, I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		      /* I0 -> I1; I1, I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		    }
+		}
+
 	      /* Try this insn with each REG_EQUAL note it links back to.  */
 	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
 		{
@@ -1267,7 +1354,7 @@ combine_instructions (rtx f, unsigned in
 		      i2mod = temp;
 		      i2mod_old_rhs = copy_rtx (orig);
 		      i2mod_new_rhs = copy_rtx (note);
-		      next = try_combine (insn, i2mod, NULL_RTX,
+		      next = try_combine (insn, i2mod, NULL_RTX, NULL_RTX,
 					  &new_direct_jump_p);
 		      i2mod = NULL_RTX;
 		      if (next)
@@ -1529,9 +1616,10 @@ set_nonzero_bits_and_sign_copies (rtx x,
     }
 }
 \f
-/* See if INSN can be combined into I3.  PRED and SUCC are optionally
-   insns that were previously combined into I3 or that will be combined
-   into the merger of INSN and I3.
+/* See if INSN can be combined into I3.  PRED, PRED2, SUCC and SUCC2 are
+   optionally insns that were previously combined into I3 or that will be
+   combined into the merger of INSN and I3.  The order is PRED, PRED2,
+   INSN, SUCC, SUCC2, I3.
 
    Return 0 if the combination is not allowed for any reason.
 
@@ -1540,7 +1628,8 @@ set_nonzero_bits_and_sign_copies (rtx x,
    will return 1.  */
 
 static int
-can_combine_p (rtx insn, rtx i3, rtx pred ATTRIBUTE_UNUSED, rtx succ,
+can_combine_p (rtx insn, rtx i3, rtx pred ATTRIBUTE_UNUSED,
+	       rtx pred2 ATTRIBUTE_UNUSED, rtx succ, rtx succ2,
 	       rtx *pdest, rtx *psrc)
 {
   int i;
@@ -1550,10 +1639,25 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 #ifdef AUTO_INC_DEC
   rtx link;
 #endif
-  int all_adjacent = (succ ? (next_active_insn (insn) == succ
-			      && next_active_insn (succ) == i3)
-		      : next_active_insn (insn) == i3);
+  bool all_adjacent = true;
 
+  if (succ)
+    {
+      if (succ2)
+	{
+	  if (next_active_insn (succ2) != i3)
+	    all_adjacent = false;
+	  if (next_active_insn (succ) != succ2)
+	    all_adjacent = false;
+	}
+      else if (next_active_insn (succ) != i3)
+	all_adjacent = false;
+      if (next_active_insn (insn) != succ)
+	all_adjacent = false;
+    }
+  else if (next_active_insn (insn) != i3)
+    all_adjacent = false;
+
   /* Can combine only if previous insn is a SET of a REG, a SUBREG or CC0.
      or a PARALLEL consisting of such a SET and CLOBBERs.
 
@@ -1678,11 +1782,15 @@ can_combine_p (rtx insn, rtx i3, rtx pre
       /* Don't substitute into an incremented register.  */
       || FIND_REG_INC_NOTE (i3, dest)
       || (succ && FIND_REG_INC_NOTE (succ, dest))
+      || (succ2 && FIND_REG_INC_NOTE (succ2, dest))
       /* Don't substitute into a non-local goto, this confuses CFG.  */
       || (JUMP_P (i3) && find_reg_note (i3, REG_NON_LOCAL_GOTO, NULL_RTX))
       /* Make sure that DEST is not used after SUCC but before I3.  */
-      || (succ && ! all_adjacent
-	  && reg_used_between_p (dest, succ, i3))
+      || (!all_adjacent
+	  && ((succ2
+	       && (reg_used_between_p (dest, succ2, i3)
+		   || reg_used_between_p (dest, succ, succ2)))
+	      || (!succ2 && succ && reg_used_between_p (dest, succ, i3))))
       /* Make sure that the value that is to be substituted for the register
 	 does not use any registers whose values alter in between.  However,
 	 If the insns are adjacent, a use can't cross a set even though we
@@ -1765,13 +1873,12 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 
   if (GET_CODE (src) == ASM_OPERANDS || volatile_refs_p (src))
     {
-      /* Make sure succ doesn't contain a volatile reference.  */
+      /* Make sure neither succ nor succ2 contains a volatile reference.  */
+      if (succ2 != 0 && volatile_refs_p (PATTERN (succ2)))
+	return 0;
       if (succ != 0 && volatile_refs_p (PATTERN (succ)))
 	return 0;
-
-      for (p = NEXT_INSN (insn); p != i3; p = NEXT_INSN (p))
-	if (INSN_P (p) && p != succ && volatile_refs_p (PATTERN (p)))
-	  return 0;
+      /* We'll check insns between INSN and I3 below.  */
     }
 
   /* If INSN is an asm, and DEST is a hard register, reject, since it has
@@ -1785,7 +1892,7 @@ can_combine_p (rtx insn, rtx i3, rtx pre
      they might affect machine state.  */
 
   for (p = NEXT_INSN (insn); p != i3; p = NEXT_INSN (p))
-    if (INSN_P (p) && p != succ && volatile_insn_p (PATTERN (p)))
+    if (INSN_P (p) && p != succ && p != succ2 && volatile_insn_p (PATTERN (p)))
       return 0;
 
   /* If INSN contains an autoincrement or autodecrement, make sure that
@@ -1801,8 +1908,12 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 	    || reg_used_between_p (XEXP (link, 0), insn, i3)
 	    || (pred != NULL_RTX
 		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (pred)))
+	    || (pred2 != NULL_RTX
+		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (pred2)))
 	    || (succ != NULL_RTX
 		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (succ)))
+	    || (succ2 != NULL_RTX
+		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (succ2)))
 	    || reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (i3))))
       return 0;
 #endif
@@ -1836,8 +1947,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    of a PARALLEL of the pattern.  We validate that it is valid for combining.
 
    One problem is if I3 modifies its output, as opposed to replacing it
-   entirely, we can't allow the output to contain I2DEST or I1DEST as doing
-   so would produce an insn that is not equivalent to the original insns.
+   entirely, we can't allow the output to contain I2DEST, I1DEST or I0DEST as
+   doing so would produce an insn that is not equivalent to the original insns.
 
    Consider:
 
@@ -1858,7 +1969,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    must reject the combination.  This case occurs when I2 and I1 both
    feed into I3, rather than when I1 feeds into I2, which feeds into I3.
    If I1_NOT_IN_SRC is nonzero, it means that finding I1 in the source
-   of a SET must prevent combination from occurring.
+   of a SET must prevent combination from occurring.  The same situation
+   can occur for I0, in which case I0_NOT_IN_SRC is set.
 
    Before doing the above check, we first try to expand a field assignment
    into a set of logical operations.
@@ -1870,8 +1982,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    Return 1 if the combination is valid, zero otherwise.  */
 
 static int
-combinable_i3pat (rtx i3, rtx *loc, rtx i2dest, rtx i1dest,
-		  int i1_not_in_src, rtx *pi3dest_killed)
+combinable_i3pat (rtx i3, rtx *loc, rtx i2dest, rtx i1dest, rtx i0dest,
+		  int i1_not_in_src, int i0_not_in_src, rtx *pi3dest_killed)
 {
   rtx x = *loc;
 
@@ -1895,9 +2007,11 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
       if ((inner_dest != dest &&
 	   (!MEM_P (inner_dest)
 	    || rtx_equal_p (i2dest, inner_dest)
-	    || (i1dest && rtx_equal_p (i1dest, inner_dest)))
+	    || (i1dest && rtx_equal_p (i1dest, inner_dest))
+	    || (i0dest && rtx_equal_p (i0dest, inner_dest)))
 	   && (reg_overlap_mentioned_p (i2dest, inner_dest)
-	       || (i1dest && reg_overlap_mentioned_p (i1dest, inner_dest))))
+	       || (i1dest && reg_overlap_mentioned_p (i1dest, inner_dest))
+	       || (i0dest && reg_overlap_mentioned_p (i0dest, inner_dest))))
 
 	  /* This is the same test done in can_combine_p except we can't test
 	     all_adjacent; we don't have to, since this instruction will stay
@@ -1913,7 +2027,8 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
 	      && REGNO (inner_dest) < FIRST_PSEUDO_REGISTER
 	      && (! HARD_REGNO_MODE_OK (REGNO (inner_dest),
 					GET_MODE (inner_dest))))
-	  || (i1_not_in_src && reg_overlap_mentioned_p (i1dest, src)))
+	  || (i1_not_in_src && reg_overlap_mentioned_p (i1dest, src))
+	  || (i0_not_in_src && reg_overlap_mentioned_p (i0dest, src)))
 	return 0;
 
       /* If DEST is used in I3, it is being killed in this insn, so
@@ -1953,8 +2068,8 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
       int i;
 
       for (i = 0; i < XVECLEN (x, 0); i++)
-	if (! combinable_i3pat (i3, &XVECEXP (x, 0, i), i2dest, i1dest,
-				i1_not_in_src, pi3dest_killed))
+	if (! combinable_i3pat (i3, &XVECEXP (x, 0, i), i2dest, i1dest, i0dest,
+				i1_not_in_src, i0_not_in_src, pi3dest_killed))
 	  return 0;
     }
 
@@ -2364,15 +2479,15 @@ update_cfg_for_uncondjump (rtx insn)
     single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
 }
 
+/* Try to combine the insns I0, I1 and I2 into I3.
+   Here I0, I1 and I2 appear earlier than I3.
+   I0 and I1 can be zero; then we combine just I2 into I3, or I1 and I2 into
+   I3.
 
-/* Try to combine the insns I1 and I2 into I3.
-   Here I1 and I2 appear earlier than I3.
-   I1 can be zero; then we combine just I2 into I3.
-
-   If we are combining three insns and the resulting insn is not recognized,
-   try splitting it into two insns.  If that happens, I2 and I3 are retained
-   and I1 is pseudo-deleted by turning it into a NOTE.  Otherwise, I1 and I2
-   are pseudo-deleted.
+   If we are combining more than two insns and the resulting insn is not
+   recognized, try splitting it into two insns.  If that happens, I2 and I3
+   are retained and I1/I0 are pseudo-deleted by turning them into a NOTE.
+   Otherwise, I0, I1 and I2 are pseudo-deleted.
 
    Return 0 if the combination does not work.  Then nothing is changed.
    If we did the combination, return the insn at which combine should
@@ -2382,34 +2497,38 @@ update_cfg_for_uncondjump (rtx insn)
    new direct jump instruction.  */
 
 static rtx
-try_combine (rtx i3, rtx i2, rtx i1, int *new_direct_jump_p)
+try_combine (rtx i3, rtx i2, rtx i1, rtx i0, int *new_direct_jump_p)
 {
   /* New patterns for I3 and I2, respectively.  */
   rtx newpat, newi2pat = 0;
   rtvec newpat_vec_with_clobbers = 0;
-  int substed_i2 = 0, substed_i1 = 0;
-  /* Indicates need to preserve SET in I1 or I2 in I3 if it is not dead.  */
-  int added_sets_1, added_sets_2;
+  int substed_i2 = 0, substed_i1 = 0, substed_i0 = 0;
+  /* Indicates need to preserve SET in I0, I1 or I2 in I3 if it is not
+     dead.  */
+  int added_sets_0, added_sets_1, added_sets_2;
   /* Total number of SETs to put into I3.  */
   int total_sets;
-  /* Nonzero if I2's body now appears in I3.  */
-  int i2_is_used;
+  /* Nonzero if I2's or I1's body now appears in I3.  */
+  int i2_is_used, i1_is_used;
   /* INSN_CODEs for new I3, new I2, and user of condition code.  */
   int insn_code_number, i2_code_number = 0, other_code_number = 0;
   /* Contains I3 if the destination of I3 is used in its source, which means
      that the old life of I3 is being killed.  If that usage is placed into
      I2 and not in I3, a REG_DEAD note must be made.  */
   rtx i3dest_killed = 0;
-  /* SET_DEST and SET_SRC of I2 and I1.  */
-  rtx i2dest = 0, i2src = 0, i1dest = 0, i1src = 0;
+  /* SET_DEST and SET_SRC of I2, I1 and I0.  */
+  rtx i2dest = 0, i2src = 0, i1dest = 0, i1src = 0, i0dest = 0, i0src = 0;
   /* Set if I2DEST was reused as a scratch register.  */
   bool i2scratch = false;
-  /* PATTERN (I1) and PATTERN (I2), or a copy of it in certain cases.  */
-  rtx i1pat = 0, i2pat = 0;
+  /* The PATTERNs of I0, I1, and I2, or a copy of them in certain cases.  */
+  rtx i0pat = 0, i1pat = 0, i2pat = 0;
   /* Indicates if I2DEST or I1DEST is in I2SRC or I1_SRC.  */
   int i2dest_in_i2src = 0, i1dest_in_i1src = 0, i2dest_in_i1src = 0;
-  int i2dest_killed = 0, i1dest_killed = 0;
+  int i0dest_in_i0src = 0, i1dest_in_i0src = 0, i2dest_in_i0src = 0;
+  int i2dest_killed = 0, i1dest_killed = 0, i0dest_killed;
   int i1_feeds_i3 = 0;
+  int i1_feeds_i3_n = 0, i1_feeds_i2_n = 0, i0_feeds_i3_n = 0;
+  int i0_feeds_i2_n = 0, i0_feeds_i1_n = 0;
   /* Notes that must be added to REG_NOTES in I3 and I2.  */
   rtx new_i3_notes, new_i2_notes;
   /* Notes that we substituted I3 into I2 instead of the normal case.  */
@@ -2431,6 +2550,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   if (cant_combine_insn_p (i3)
       || cant_combine_insn_p (i2)
       || (i1 && cant_combine_insn_p (i1))
+      || (i0 && cant_combine_insn_p (i0))
       || likely_spilled_retval_p (i3))
     return 0;
 
@@ -2442,7 +2562,10 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
-      if (i1)
+      if (i0)
+	fprintf (dump_file, "\nTrying %d, %d, %d -> %d:\n",
+		 INSN_UID (i0), INSN_UID (i1), INSN_UID (i2), INSN_UID (i3));
+      else if (i1)
 	fprintf (dump_file, "\nTrying %d, %d -> %d:\n",
 		 INSN_UID (i1), INSN_UID (i2), INSN_UID (i3));
       else
@@ -2450,8 +2573,12 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 		 INSN_UID (i2), INSN_UID (i3));
     }
 
-  /* If I1 and I2 both feed I3, they can be in any order.  To simplify the
-     code below, set I1 to be the earlier of the two insns.  */
+  /* If multiple insns feed into one of I2 or I3, they can be in any
+     order.  To simplify the code below, reorder them in sequence.  */
+  if (i0 && DF_INSN_LUID (i0) > DF_INSN_LUID (i2))
+    temp = i2, i2 = i0, i0 = temp;
+  if (i0 && DF_INSN_LUID (i0) > DF_INSN_LUID (i1))
+    temp = i1, i1 = i0, i0 = temp;
   if (i1 && DF_INSN_LUID (i1) > DF_INSN_LUID (i2))
     temp = i1, i1 = i2, i2 = temp;
 
@@ -2673,8 +2800,11 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 #endif
 
   /* Verify that I2 and I1 are valid for combining.  */
-  if (! can_combine_p (i2, i3, i1, NULL_RTX, &i2dest, &i2src)
-      || (i1 && ! can_combine_p (i1, i3, NULL_RTX, i2, &i1dest, &i1src)))
+  if (! can_combine_p (i2, i3, i0, i1, NULL_RTX, NULL_RTX, &i2dest, &i2src)
+      || (i1 && ! can_combine_p (i1, i3, i0, NULL_RTX, i2, NULL_RTX,
+				 &i1dest, &i1src))
+      || (i0 && ! can_combine_p (i0, i3, NULL_RTX, NULL_RTX, i1, i2,
+				 &i0dest, &i0src)))
     {
       undo_all ();
       return 0;
@@ -2685,16 +2815,27 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   i2dest_in_i2src = reg_overlap_mentioned_p (i2dest, i2src);
   i1dest_in_i1src = i1 && reg_overlap_mentioned_p (i1dest, i1src);
   i2dest_in_i1src = i1 && reg_overlap_mentioned_p (i2dest, i1src);
+  i0dest_in_i0src = i0 && reg_overlap_mentioned_p (i0dest, i0src);
+  i1dest_in_i0src = i0 && reg_overlap_mentioned_p (i1dest, i0src);
+  i2dest_in_i0src = i0 && reg_overlap_mentioned_p (i2dest, i0src);
   i2dest_killed = dead_or_set_p (i2, i2dest);
   i1dest_killed = i1 && dead_or_set_p (i1, i1dest);
+  i0dest_killed = i0 && dead_or_set_p (i0, i0dest);
 
   /* See if I1 directly feeds into I3.  It does if I1DEST is not used
      in I2SRC.  */
   i1_feeds_i3 = i1 && ! reg_overlap_mentioned_p (i1dest, i2src);
+  i1_feeds_i2_n = i1 && insn_a_feeds_b (i1, i2);
+  i1_feeds_i3_n = i1 && insn_a_feeds_b (i1, i3);
+  i0_feeds_i3_n = i0 && insn_a_feeds_b (i0, i3);
+  i0_feeds_i2_n = i0 && insn_a_feeds_b (i0, i2);
+  i0_feeds_i1_n = i0 && insn_a_feeds_b (i0, i1);
 
   /* Ensure that I3's pattern can be the destination of combines.  */
-  if (! combinable_i3pat (i3, &PATTERN (i3), i2dest, i1dest,
+  if (! combinable_i3pat (i3, &PATTERN (i3), i2dest, i1dest, i0dest,
 			  i1 && i2dest_in_i1src && i1_feeds_i3,
+			  i0 && ((i2dest_in_i0src && i0_feeds_i3_n)
+				 || (i1dest_in_i0src && !i0_feeds_i1_n)),
 			  &i3dest_killed))
     {
       undo_all ();
@@ -2706,6 +2847,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      here.  */
   if (GET_CODE (i2src) == MULT
       || (i1 != 0 && GET_CODE (i1src) == MULT)
+      || (i0 != 0 && GET_CODE (i0src) == MULT)
       || (GET_CODE (PATTERN (i3)) == SET
 	  && GET_CODE (SET_SRC (PATTERN (i3))) == MULT))
     have_mult = 1;
@@ -2745,14 +2887,22 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      feed into I3, the set in I1 needs to be kept around if I1DEST dies
      or is set in I3.  Otherwise (if I1 feeds I2 which feeds I3), the set
      in I1 needs to be kept around unless I1DEST dies or is set in either
-     I2 or I3.  We can distinguish these cases by seeing if I2SRC mentions
-     I1DEST.  If so, we know I1 feeds into I2.  */
+     I2 or I3.  The same consideration applies to I0.  */
 
-  added_sets_2 = ! dead_or_set_p (i3, i2dest);
+  added_sets_2 = !dead_or_set_p (i3, i2dest);
 
-  added_sets_1
-    = i1 && ! (i1_feeds_i3 ? dead_or_set_p (i3, i1dest)
-	       : (dead_or_set_p (i3, i1dest) || dead_or_set_p (i2, i1dest)));
+  if (i1)
+    added_sets_1 = !((i1_feeds_i3_n && dead_or_set_p (i3, i1dest))
+		     || (i1_feeds_i2_n && dead_or_set_p (i2, i1dest)));
+  else
+    added_sets_1 = 0;
+
+  if (i0)
+    added_sets_0 =  !((i0_feeds_i3_n && dead_or_set_p (i3, i0dest))
+		      || (i0_feeds_i2_n && dead_or_set_p (i2, i0dest))
+		      || (i0_feeds_i1_n && dead_or_set_p (i1, i0dest)));
+  else
+    added_sets_0 = 0;
 
   /* If the set in I2 needs to be kept around, we must make a copy of
      PATTERN (I2), so that when we substitute I1SRC for I1DEST in
@@ -2777,6 +2927,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	i1pat = copy_rtx (PATTERN (i1));
     }
 
+  if (added_sets_0)
+    {
+      if (GET_CODE (PATTERN (i0)) == PARALLEL)
+	i0pat = gen_rtx_SET (VOIDmode, i0dest, copy_rtx (i0src));
+      else
+	i0pat = copy_rtx (PATTERN (i0));
+    }
+
   combine_merges++;
 
   /* Substitute in the latest insn for the regs set by the earlier ones.  */
@@ -2825,8 +2983,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 					      i2src, const0_rtx))
 	      != GET_MODE (SET_DEST (newpat))))
 	{
-	  if (can_change_dest_mode(SET_DEST (newpat), added_sets_2,
-				   compare_mode))
+	  if (can_change_dest_mode (SET_DEST (newpat), added_sets_2,
+				    compare_mode))
 	    {
 	      unsigned int regno = REGNO (SET_DEST (newpat));
 	      rtx new_dest;
@@ -2889,13 +3047,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
       n_occurrences = 0;		/* `subst' counts here */
 
-      /* If I1 feeds into I2 (not into I3) and I1DEST is in I1SRC, we
-	 need to make a unique copy of I2SRC each time we substitute it
-	 to avoid self-referential rtl.  */
+      /* If I1 feeds into I2 and I1DEST is in I1SRC, we need to make a
+	 unique copy of I2SRC each time we substitute it to avoid
+	 self-referential rtl.  */
 
       subst_low_luid = DF_INSN_LUID (i2);
       newpat = subst (PATTERN (i3), i2dest, i2src, 0,
-		      ! i1_feeds_i3 && i1dest_in_i1src);
+		      ((i1_feeds_i2_n && i1dest_in_i1src)
+		       || (i0_feeds_i2_n && i0dest_in_i0src)));
       substed_i2 = 1;
 
       /* Record whether i2's body now appears within i3's body.  */
@@ -2911,13 +3070,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	 This happens if I1DEST is mentioned in I2 and dies there, and
 	 has disappeared from the new pattern.  */
       if ((FIND_REG_INC_NOTE (i1, NULL_RTX) != 0
-	   && !i1_feeds_i3
+	   && i1_feeds_i2_n
 	   && dead_or_set_p (i2, i1dest)
 	   && !reg_overlap_mentioned_p (i1dest, newpat))
 	  /* Before we can do this substitution, we must redo the test done
 	     above (see detailed comments there) that ensures  that I1DEST
 	     isn't mentioned in any SETs in NEWPAT that are field assignments.  */
-          || !combinable_i3pat (NULL_RTX, &newpat, i1dest, NULL_RTX, 0, 0))
+          || !combinable_i3pat (NULL_RTX, &newpat, i1dest, NULL_RTX, NULL_RTX,
+				0, 0, 0))
 	{
 	  undo_all ();
 	  return 0;
@@ -2925,8 +3085,29 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
       n_occurrences = 0;
       subst_low_luid = DF_INSN_LUID (i1);
-      newpat = subst (newpat, i1dest, i1src, 0, 0);
+      newpat = subst (newpat, i1dest, i1src, 0,
+		      i0_feeds_i1_n && i0dest_in_i0src);
       substed_i1 = 1;
+      i1_is_used = n_occurrences;
+    }
+  if (i0 && GET_CODE (newpat) != CLOBBER)
+    {
+      if ((FIND_REG_INC_NOTE (i0, NULL_RTX) != 0
+	   && ((i0_feeds_i2_n && dead_or_set_p (i2, i0dest))
+	       || (i0_feeds_i1_n && dead_or_set_p (i1, i0dest)))
+	   && !reg_overlap_mentioned_p (i0dest, newpat))
+          || !combinable_i3pat (NULL_RTX, &newpat, i0dest, NULL_RTX, NULL_RTX,
+				0, 0, 0))
+	{
+	  undo_all ();
+	  return 0;
+	}
+
+      n_occurrences = 0;
+      subst_low_luid = DF_INSN_LUID (i1);
+      newpat = subst (newpat, i0dest, i0src, 0,
+		      i0_feeds_i1_n && i0dest_in_i0src);
+      substed_i0 = 1;
     }
 
   /* Fail if an autoincrement side-effect has been duplicated.  Be careful
@@ -2934,7 +3115,12 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   if ((FIND_REG_INC_NOTE (i2, NULL_RTX) != 0
        && i2_is_used + added_sets_2 > 1)
       || (i1 != 0 && FIND_REG_INC_NOTE (i1, NULL_RTX) != 0
-	  && (n_occurrences + added_sets_1 + (added_sets_2 && ! i1_feeds_i3)
+	  && (i1_is_used + added_sets_1 + (added_sets_2 && i1_feeds_i2_n)
+	      > 1))
+      || (i0 != 0 && FIND_REG_INC_NOTE (i0, NULL_RTX) != 0
+	  && (n_occurrences + added_sets_0
+	      + (added_sets_1 && i0_feeds_i1_n)
+	      + (added_sets_2 && i0_feeds_i2_n)
 	      > 1))
       /* Fail if we tried to make a new register.  */
       || max_reg_num () != maxreg
@@ -2954,14 +3140,15 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      we must make a new PARALLEL for the latest insn
      to hold additional the SETs.  */
 
-  if (added_sets_1 || added_sets_2)
+  if (added_sets_0 || added_sets_1 || added_sets_2)
     {
+      int extra_sets = added_sets_0 + added_sets_1 + added_sets_2;
       combine_extras++;
 
       if (GET_CODE (newpat) == PARALLEL)
 	{
 	  rtvec old = XVEC (newpat, 0);
-	  total_sets = XVECLEN (newpat, 0) + added_sets_1 + added_sets_2;
+	  total_sets = XVECLEN (newpat, 0) + extra_sets;
 	  newpat = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (total_sets));
 	  memcpy (XVEC (newpat, 0)->elem, &old->elem[0],
 		  sizeof (old->elem[0]) * old->num_elem);
@@ -2969,25 +3156,31 @@ try_combine (rtx i3, rtx i2, rtx i1, int
       else
 	{
 	  rtx old = newpat;
-	  total_sets = 1 + added_sets_1 + added_sets_2;
+	  total_sets = 1 + extra_sets;
 	  newpat = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (total_sets));
 	  XVECEXP (newpat, 0, 0) = old;
 	}
 
+      if (added_sets_0)
+	XVECEXP (newpat, 0, --total_sets) = i0pat;
+
       if (added_sets_1)
-	XVECEXP (newpat, 0, --total_sets) = i1pat;
+	{
+	  rtx t = i1pat;
+	  if (i0_feeds_i1_n)
+	    t = subst (t, i0dest, i0src, 0, 0);
 
+	  XVECEXP (newpat, 0, --total_sets) = t;
+	}
       if (added_sets_2)
 	{
-	  /* If there is no I1, use I2's body as is.  We used to also not do
-	     the subst call below if I2 was substituted into I3,
-	     but that could lose a simplification.  */
-	  if (i1 == 0)
-	    XVECEXP (newpat, 0, --total_sets) = i2pat;
-	  else
-	    /* See comment where i2pat is assigned.  */
-	    XVECEXP (newpat, 0, --total_sets)
-	      = subst (i2pat, i1dest, i1src, 0, 0);
+	  rtx t = i2pat;
+	  if (i0_feeds_i2_n)
+	    t = subst (t, i0dest, i0src, 0, 0);
+	  if (i1_feeds_i2_n)
+	    t = subst (t, i1dest, i1src, 0, 0);
+
+	  XVECEXP (newpat, 0, --total_sets) = t;
 	}
     }
 
@@ -3543,7 +3736,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
   /* Only allow this combination if insn_rtx_costs reports that the
      replacement instructions are cheaper than the originals.  */
-  if (!combine_validate_cost (i1, i2, i3, newpat, newi2pat, other_pat))
+  if (!combine_validate_cost (i0, i1, i2, i3, newpat, newi2pat, other_pat))
     {
       undo_all ();
       return 0;
@@ -3642,7 +3835,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	}
 
       distribute_notes (new_other_notes, undobuf.other_insn,
-			undobuf.other_insn, NULL_RTX, NULL_RTX, NULL_RTX);
+			undobuf.other_insn, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
     }
 
   if (swap_i2i3)
@@ -3689,21 +3883,26 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     }
 
   {
-    rtx i3notes, i2notes, i1notes = 0;
-    rtx i3links, i2links, i1links = 0;
+    rtx i3notes, i2notes, i1notes = 0, i0notes = 0;
+    rtx i3links, i2links, i1links = 0, i0links = 0;
     rtx midnotes = 0;
+    int from_luid;
     unsigned int regno;
     /* Compute which registers we expect to eliminate.  newi2pat may be setting
        either i3dest or i2dest, so we must check it.  Also, i1dest may be the
        same as i3dest, in which case newi2pat may be setting i1dest.  */
     rtx elim_i2 = ((newi2pat && reg_set_p (i2dest, newi2pat))
-		   || i2dest_in_i2src || i2dest_in_i1src
+		   || i2dest_in_i2src || i2dest_in_i1src || i2dest_in_i0src
 		   || !i2dest_killed
 		   ? 0 : i2dest);
-    rtx elim_i1 = (i1 == 0 || i1dest_in_i1src
+    rtx elim_i1 = (i1 == 0 || i1dest_in_i1src || i1dest_in_i0src
 		   || (newi2pat && reg_set_p (i1dest, newi2pat))
 		   || !i1dest_killed
 		   ? 0 : i1dest);
+    rtx elim_i0 = (i0 == 0 || i0dest_in_i0src
+		   || (newi2pat && reg_set_p (i0dest, newi2pat))
+		   || !i0dest_killed
+		   ? 0 : i0dest);
 
     /* Get the old REG_NOTES and LOG_LINKS from all our insns and
        clear them.  */
@@ -3711,6 +3910,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     i2notes = REG_NOTES (i2), i2links = LOG_LINKS (i2);
     if (i1)
       i1notes = REG_NOTES (i1), i1links = LOG_LINKS (i1);
+    if (i0)
+      i0notes = REG_NOTES (i0), i0links = LOG_LINKS (i0);
 
     /* Ensure that we do not have something that should not be shared but
        occurs multiple times in the new insns.  Check this by first
@@ -3719,6 +3920,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     reset_used_flags (i3notes);
     reset_used_flags (i2notes);
     reset_used_flags (i1notes);
+    reset_used_flags (i0notes);
     reset_used_flags (newpat);
     reset_used_flags (newi2pat);
     if (undobuf.other_insn)
@@ -3727,6 +3929,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     i3notes = copy_rtx_if_shared (i3notes);
     i2notes = copy_rtx_if_shared (i2notes);
     i1notes = copy_rtx_if_shared (i1notes);
+    i0notes = copy_rtx_if_shared (i0notes);
     newpat = copy_rtx_if_shared (newpat);
     newi2pat = copy_rtx_if_shared (newi2pat);
     if (undobuf.other_insn)
@@ -3753,6 +3956,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
 	if (substed_i1)
 	  replace_rtx (call_usage, i1dest, i1src);
+	if (substed_i0)
+	  replace_rtx (call_usage, i0dest, i0src);
 
 	CALL_INSN_FUNCTION_USAGE (i3) = call_usage;
       }
@@ -3827,43 +4032,58 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	SET_INSN_DELETED (i1);
       }
 
+    if (i0)
+      {
+	LOG_LINKS (i0) = 0;
+	REG_NOTES (i0) = 0;
+	if (MAY_HAVE_DEBUG_INSNS)
+	  propagate_for_debug (i0, i3, i0dest, i0src, false);
+	SET_INSN_DELETED (i0);
+      }
+
     /* Get death notes for everything that is now used in either I3 or
        I2 and used to die in a previous insn.  If we built two new
        patterns, move from I1 to I2 then I2 to I3 so that we get the
        proper movement on registers that I2 modifies.  */
 
-    if (newi2pat)
-      {
-	move_deaths (newi2pat, NULL_RTX, DF_INSN_LUID (i1), i2, &midnotes);
-	move_deaths (newpat, newi2pat, DF_INSN_LUID (i1), i3, &midnotes);
-      }
+    if (i0)
+      from_luid = DF_INSN_LUID (i0);
+    else if (i1)
+      from_luid = DF_INSN_LUID (i1);
     else
-      move_deaths (newpat, NULL_RTX, i1 ? DF_INSN_LUID (i1) : DF_INSN_LUID (i2),
-		   i3, &midnotes);
+      from_luid = DF_INSN_LUID (i2);
+    if (newi2pat)
+      move_deaths (newi2pat, NULL_RTX, from_luid, i2, &midnotes);
+    move_deaths (newpat, newi2pat, from_luid, i3, &midnotes);
 
     /* Distribute all the LOG_LINKS and REG_NOTES from I1, I2, and I3.  */
     if (i3notes)
       distribute_notes (i3notes, i3, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
     if (i2notes)
       distribute_notes (i2notes, i2, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
     if (i1notes)
       distribute_notes (i1notes, i1, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
+    if (i0notes)
+      distribute_notes (i0notes, i0, i3, newi2pat ? i2 : NULL_RTX,
+			elim_i2, elim_i1, elim_i0);
     if (midnotes)
       distribute_notes (midnotes, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
 
     /* Distribute any notes added to I2 or I3 by recog_for_combine.  We
        know these are REG_UNUSED and want them to go to the desired insn,
        so we always pass it as i3.  */
 
     if (newi2pat && new_i2_notes)
-      distribute_notes (new_i2_notes, i2, i2, NULL_RTX, NULL_RTX, NULL_RTX);
+      distribute_notes (new_i2_notes, i2, i2, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
 
     if (new_i3_notes)
-      distribute_notes (new_i3_notes, i3, i3, NULL_RTX, NULL_RTX, NULL_RTX);
+      distribute_notes (new_i3_notes, i3, i3, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
 
     /* If I3DEST was used in I3SRC, it really died in I3.  We may need to
        put a REG_DEAD note for it somewhere.  If NEWI2PAT exists and sets
@@ -3877,39 +4097,51 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	if (newi2pat && reg_set_p (i3dest_killed, newi2pat))
 	  distribute_notes (alloc_reg_note (REG_DEAD, i3dest_killed,
 					    NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, elim_i2, elim_i1);
+			    NULL_RTX, i2, NULL_RTX, elim_i2, elim_i1, elim_i0);
 	else
 	  distribute_notes (alloc_reg_note (REG_DEAD, i3dest_killed,
 					    NULL_RTX),
 			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
-			    elim_i2, elim_i1);
+			    elim_i2, elim_i1, elim_i0);
       }
 
     if (i2dest_in_i2src)
       {
+	rtx new_note = alloc_reg_note (REG_DEAD, i2dest, NULL_RTX);
 	if (newi2pat && reg_set_p (i2dest, newi2pat))
-	  distribute_notes (alloc_reg_note (REG_DEAD, i2dest, NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, NULL_RTX, NULL_RTX);
-	else
-	  distribute_notes (alloc_reg_note (REG_DEAD, i2dest, NULL_RTX),
-			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+	  distribute_notes (new_note,  NULL_RTX, i2, NULL_RTX, NULL_RTX,
 			    NULL_RTX, NULL_RTX);
+	else
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
       }
 
     if (i1dest_in_i1src)
       {
+	rtx new_note = alloc_reg_note (REG_DEAD, i1dest, NULL_RTX);
 	if (newi2pat && reg_set_p (i1dest, newi2pat))
-	  distribute_notes (alloc_reg_note (REG_DEAD, i1dest, NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, NULL_RTX, NULL_RTX);
+	  distribute_notes (new_note, NULL_RTX, i2, NULL_RTX, NULL_RTX,
+			    NULL_RTX, NULL_RTX);
 	else
-	  distribute_notes (alloc_reg_note (REG_DEAD, i1dest, NULL_RTX),
-			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
+      }
+
+    if (i0dest_in_i0src)
+      {
+	rtx new_note = alloc_reg_note (REG_DEAD, i0dest, NULL_RTX);
+	if (newi2pat && reg_set_p (i0dest, newi2pat))
+	  distribute_notes (new_note, NULL_RTX, i2, NULL_RTX, NULL_RTX,
 			    NULL_RTX, NULL_RTX);
+	else
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
       }
 
     distribute_links (i3links);
     distribute_links (i2links);
     distribute_links (i1links);
+    distribute_links (i0links);
 
     if (REG_P (i2dest))
       {
@@ -3959,6 +4191,23 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	  INC_REG_N_SETS (regno, -1);
       }
 
+    if (i0 && REG_P (i0dest))
+      {
+	rtx link;
+	rtx i0_insn = 0, i0_val = 0, set;
+
+	for (link = LOG_LINKS (i3); link; link = XEXP (link, 1))
+	  if ((set = single_set (XEXP (link, 0))) != 0
+	      && rtx_equal_p (i0dest, SET_DEST (set)))
+	    i0_insn = XEXP (link, 0), i0_val = SET_SRC (set);
+
+	record_value_for_reg (i0dest, i0_insn, i0_val);
+
+	regno = REGNO (i0dest);
+	if (! added_sets_0 && ! i0dest_in_i0src)
+	  INC_REG_N_SETS (regno, -1);
+      }
+
     /* Update reg_stat[].nonzero_bits et al for any changes that may have
        been made to this insn.  The order of
        set_nonzero_bits_and_sign_copies() is important.  Because newi2pat
@@ -3978,6 +4227,16 @@ try_combine (rtx i3, rtx i2, rtx i1, int
       df_insn_rescan (undobuf.other_insn);
     }
 
+  if (i0 && !(NOTE_P(i0) && (NOTE_KIND (i0) == NOTE_INSN_DELETED)))
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "modifying insn i0 ");
+	  dump_insn_slim (dump_file, i0);
+	}
+      df_insn_rescan (i0);
+    }
+
   if (i1 && !(NOTE_P(i1) && (NOTE_KIND (i1) == NOTE_INSN_DELETED)))
     {
       if (dump_file)
@@ -12668,7 +12927,7 @@ reg_bitfield_target_p (rtx x, rtx body)
 
 static void
 distribute_notes (rtx notes, rtx from_insn, rtx i3, rtx i2, rtx elim_i2,
-		  rtx elim_i1)
+		  rtx elim_i1, rtx elim_i0)
 {
   rtx note, next_note;
   rtx tem;
@@ -12914,7 +13173,8 @@ distribute_notes (rtx notes, rtx from_in
 			&& !(i2mod
 			     && reg_overlap_mentioned_p (XEXP (note, 0),
 							 i2mod_old_rhs)))
-		       || rtx_equal_p (XEXP (note, 0), elim_i1))
+		       || rtx_equal_p (XEXP (note, 0), elim_i1)
+		       || rtx_equal_p (XEXP (note, 0), elim_i0))
 		break;
 	      tem = i3;
 	    }
@@ -12981,7 +13241,7 @@ distribute_notes (rtx notes, rtx from_in
 			  REG_NOTES (tem) = NULL;
 
 			  distribute_notes (old_notes, tem, tem, NULL_RTX,
-					    NULL_RTX, NULL_RTX);
+					    NULL_RTX, NULL_RTX, NULL_RTX);
 			  distribute_links (LOG_LINKS (tem));
 
 			  SET_INSN_DELETED (tem);
@@ -12998,7 +13258,7 @@ distribute_notes (rtx notes, rtx from_in
 
 			      distribute_notes (old_notes, cc0_setter,
 						cc0_setter, NULL_RTX,
-						NULL_RTX, NULL_RTX);
+						NULL_RTX, NULL_RTX, NULL_RTX);
 			      distribute_links (LOG_LINKS (cc0_setter));
 
 			      SET_INSN_DELETED (cc0_setter);
@@ -13118,7 +13378,8 @@ distribute_notes (rtx notes, rtx from_in
 							     NULL_RTX);
 
 			      distribute_notes (new_note, place, place,
-						NULL_RTX, NULL_RTX, NULL_RTX);
+						NULL_RTX, NULL_RTX, NULL_RTX,
+						NULL_RTX);
 			    }
 			  else if (! refers_to_regno_p (i, i + 1,
 							PATTERN (place), 0)


* Re: Combine four insns
  2010-08-06 14:49 Combine four insns Bernd Schmidt
@ 2010-08-06 15:04 ` Richard Guenther
  2010-08-06 20:08   ` Bernd Schmidt
  2010-08-06 15:08 ` Steven Bosscher
  1 sibling, 1 reply; 129+ messages in thread
From: Richard Guenther @ 2010-08-06 15:04 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On Fri, Aug 6, 2010 at 4:48 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> I was slightly bored while waiting for some SPEC runs, so I played with
> the combiner a little.  The following extends it to do four-insn
> combinations.
>
> Conceptually, the following is one motivation for the change: consider a
> RISC target (probably not very prevalent when the combiner was written),
> where arithmetic ops don't allow constants, only registers.  Then, to
> combine multiple such operations into one (or rather, two), you need a
> four-instruction window.  This is what happens e.g. on Thumb-1; PR42172
> is such an example.  We have
>
>        ldrb    r3, [r0]
>        mov     r2, #7
>        bic     r3, r2
>        add     r2, r2, #49
>        bic     r3, r2
>        sub     r2, r2, #48
>        orr     r3, r2
>        add     r2, r2, #56
>        bic     r3, r2
>        add     r2, r2, #63
>        and     r3, r2
>        strb    r3, [r0]
>
> which can be optimized into
>
>        mov     r3, #8
>        strb    r3, [r0]
>
> by the patch below.  I'm attaching a file with a few more examples I
> found.  The same patterns occur quite frequently - several times e.g. in
> Linux TCP code.
>
> The downside is a compile-time cost, which appears to be about 1% user
> time on a full bootstrap.  To put that in perspective, it's 12s of real
> time.
>
> real 16m13.446s user 103m3.607s sys 3m2.235s
> real 16m25.534s user 104m0.686s sys 3m4.158s
>
> I'd argue that compile-time shouldn't be our top priority, as it's one
> of the few things that still benefits from Moore's Law, while the same
> may not be true for the programs we compile.  I expect people will argue
> a 1% slowdown is unacceptable, but in that case I think we need to
> discuss whether users are complaining about slow compiles more often
> than they complain about missed optimizations - in my experience the
> reverse is true.
>
> Bootstrapped on i686-linux; a slightly earlier version was also
> regression-tested.

Do you have statistics on how many two-, three- and four-insn
combinations a) are tried, b) can be validated, and c) are used in the
end, for example during a GCC bootstrap?

It might make sense to restrict four-insn combinations to
-fexpensive-optimizations (thus not enabling them at -O1).
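
Concretely, that could just be a gate around the new loop in
combine_instructions, e.g. (a sketch only, reusing the existing
flag_expensive_optimizations variable, not tested against the patch):

	      /* Sketch: only look for four-insn combinations when
		 expensive optimizations are enabled, i.e. at -O2 and
		 above.  */
	      if (flag_expensive_optimizations)
		for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
		  {
		    /* ... the four-insn combination attempts from the
		       patch go here ... */
		  }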

Richard.

>
> Bernd
>


* Re: Combine four insns
  2010-08-06 14:49 Combine four insns Bernd Schmidt
  2010-08-06 15:04 ` Richard Guenther
@ 2010-08-06 15:08 ` Steven Bosscher
  2010-08-06 16:45   ` Paolo Bonzini
  2010-08-06 19:20   ` Bernd Schmidt
  1 sibling, 2 replies; 129+ messages in thread
From: Steven Bosscher @ 2010-08-06 15:08 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On Fri, Aug 6, 2010 at 4:48 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> I'd argue that compile-time shouldn't be our top priority, as it's one
> of the few things that still benefits from Moore's Law, while the same
> may not be true for the programs we compile.  I expect people will argue
> a 1% slowdown is unacceptable, but in that case I think we need to
> discuss whether users are complaining about slow compiles more often
> than they complain about missed optimizations - in my experience the
> reverse is true.

It depends on where you look. In the free software community, the
majority of complaints are about compile time, but perhaps for your
customers the missed optimizations are more important.

But perhaps the optimization can be performed somewhere other than
combine? If it's just a relatively small set of common patterns, a
quick GIMPLE pass may be preferable.

(I know, it's the old argument we've had before: You have a real fix
now for a real problem, others (incl. me) believe that to fix this
problem in combine is a step away from sane instruction selection that
GCC may not have for years to come...)

What does the code that is optimized so much better with four insns to
combine look like? It looks like code that could come from bitfield
manipulations, an area where it's well known that GCC does not always
optimize very well.

Ciao!
Steven


* Re: Combine four insns
  2010-08-06 15:08 ` Steven Bosscher
@ 2010-08-06 16:45   ` Paolo Bonzini
  2010-08-06 17:22     ` Steven Bosscher
  2010-08-06 19:20   ` Bernd Schmidt
  1 sibling, 1 reply; 129+ messages in thread
From: Paolo Bonzini @ 2010-08-06 16:45 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Bernd Schmidt, GCC Patches

On 08/06/2010 05:07 PM, Steven Bosscher wrote:
> On Fri, Aug 6, 2010 at 4:48 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>> I'd argue that compile-time shouldn't be our top priority, as it's one
>> of the few things that still benefits from Moore's Law, while the same
>> may not be true for the programs we compile.  I expect people will argue
>> a 1% slowdown is unacceptable, but in that case I think we need to
>> discuss whether users are complaining about slow compiles more often
>> than they complain about missed optimizations - in my experience the
>> reverse is true.
>
> It depends on where you look. In the free software community, the
> majority of complaints are about compile time, but perhaps for your
> customers the missed optimizations are more important.
>
> But perhaps the optimization can be performed somewhere other than
> combine? If it's just a relatively small set of common patterns, a
> quick GIMPLE pass may be preferable.
>
> (I know, it's the old argument we've had before: You have a real fix
> now for a real problem, others (incl. me) believe that to fix this
> problem in combine is a step away from sane instruction selection that
> GCC may not have for years to come...)

In this case I actually think this argument is not going to work, since
we don't even have a prototype of a sane instruction selection pass, nor
anyone thinking about working on it.

Paolo


* Re: Combine four insns
  2010-08-06 16:45   ` Paolo Bonzini
@ 2010-08-06 17:22     ` Steven Bosscher
  2010-08-06 18:02       ` Jeff Law
  2010-08-06 18:56       ` Vladimir N. Makarov
  0 siblings, 2 replies; 129+ messages in thread
From: Steven Bosscher @ 2010-08-06 17:22 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Bernd Schmidt, GCC Patches

On Fri, Aug 6, 2010 at 6:45 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
>> But perhaps the optimization can be performed somewhere other than
>> combine? If it's just a relatively small set of common patterns, a
>> quick GIMPLE pass may be preferable.

I think this part of my message was much more interesting to respond to.


>> (I know, it's the old argument we've had before: You have a real fix
>> now for a real problem, others (incl. me) believe that to fix this
>> problem in combine is a step away from sane instruction selection that
>> GCC may not have for years to come...)
>
> In this case I actually think this argument is not going to work, since
> we don't even have a prototype of a sane instruction selection pass, nor
> anyone thinking about working on it.

Agreed. There is the work from Preston Briggs, but that appears to
have gone nowhere, unfortunately.

Ciao!
Steven


* Re: Combine four insns
  2010-08-06 17:22     ` Steven Bosscher
@ 2010-08-06 18:02       ` Jeff Law
  2010-08-06 20:44         ` Steven Bosscher
  2010-08-06 18:56       ` Vladimir N. Makarov
  1 sibling, 1 reply; 129+ messages in thread
From: Jeff Law @ 2010-08-06 18:02 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Paolo Bonzini, Bernd Schmidt, GCC Patches

  On 08/06/10 11:22, Steven Bosscher wrote:
> On Fri, Aug 6, 2010 at 6:45 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
>>> But perhaps the optimization can be performed somewhere other than
>>> combine? If it's just a relatively small set of common patterns, a
>>> quick GIMPLE pass may be preferable.
> I think this part of my message was much more interesting to respond to.
This code naturally belongs in combine; putting it anywhere else is
just, umm, silly.

jeff


* Re: Combine four insns
  2010-08-06 17:22     ` Steven Bosscher
  2010-08-06 18:02       ` Jeff Law
@ 2010-08-06 18:56       ` Vladimir N. Makarov
  2010-08-06 19:02         ` Steven Bosscher
                           ` (3 more replies)
  1 sibling, 4 replies; 129+ messages in thread
From: Vladimir N. Makarov @ 2010-08-06 18:56 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Paolo Bonzini, Bernd Schmidt, GCC Patches

On 08/06/2010 01:22 PM, Steven Bosscher wrote:
> On Fri, Aug 6, 2010 at 6:45 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
>>> (I know, it's the old argument we've had before: You have a real fix
>>> now for a real problem, others (incl. me) believe that to fix this
>>> problem in combine is a step away from sane instruction selection that
>>> GCC may not have for years to come...)
>>>
>> In this case I actually think this argument is not going to work, since
>> we don't even have a prototype of a sane instruction selection pass, nor
>> anyone thinking about working on it.
>>
To be honest, I am periodically thinking about working on it, but I have
not made a final decision yet and don't know when I could start.
> Agreed. There is the work from Preston Briggs, but that appears to
> have gone nowhere, unfortunately.
>
IMHO the code would have to become public if we want to see progress on
solving this problem.  But it is Google property, and they probably
decided not to give it to the gcc community.  Although I heard that Ken
Zadeck has access to it.  We will see what the final result will be.

I can only guess, from the bits of info I have, that Preston's code is
probably not so valuable for GCC, because as I understand it the backend
was specialized only for x86/x86_64.

The real problem is not implementing modern pattern matching itself
(there is a lot of free pattern matching code, and tools for code
selection).  The real problem is how to feed all the md files (into
which a lot of gcc community effort has gone) to the pattern matcher
(in any case we most probably would not need define_split in md files
if we used a modern pattern matcher).

Another problem: most modern pattern matchers work on trees, not on
DAGs.  There are some solutions (heuristic, or expensive but optimal)
for this problem.  The combiner already solves it in some way.
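
Just to illustrate the tree case, here is a minimal sketch of bottom-up
tiling (made-up types and costs, not code from GCC or any existing
matcher); every node is labeled with the cheapest pattern covering it,
and a larger pattern such as multiply-accumulate simply competes on
cost:

enum op { REG, PLUS, MULT };

struct node
{
  enum op op;                /* binary ops only, for brevity */
  struct node *left, *right;
  int cost;                  /* cheapest cost of covering this subtree */
  const char *tile;          /* pattern chosen for this subtree */
};

static void
label (struct node *n)
{
  if (n->op == REG)
    {
      n->cost = 0;
      n->tile = "reg";
      return;
    }

  label (n->left);
  label (n->right);
  n->cost = n->left->cost + n->right->cost + 1;
  n->tile = n->op == PLUS ? "add" : "mul";

  /* A larger pattern can cover several nodes at once:
     (plus (mult a b) c) -> one "mac" instruction.  */
  if (n->op == PLUS && n->left->op == MULT)
    {
      int mac = n->left->left->cost + n->left->right->cost
                + n->right->cost + 1;
      if (mac < n->cost)
        {
          n->cost = mac;
          n->tile = "mac";
        }
    }
}

On a DAG, a shared subexpression is reached along more than one path,
which is exactly where this simple recursion stops being optimal.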

In any case I'd not expect big code or compile time improvements from a
new code selection pass (e.g. code selection in LLVM is very expensive
and takes more time than the GCC combiner).  But a new code selection
pass could be more readable and easier to maintain.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 18:56       ` Vladimir N. Makarov
@ 2010-08-06 19:02         ` Steven Bosscher
  2010-08-06 21:11         ` Chris Lattner
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 129+ messages in thread
From: Steven Bosscher @ 2010-08-06 19:02 UTC (permalink / raw)
  To: Vladimir N. Makarov; +Cc: Paolo Bonzini, Bernd Schmidt, GCC Patches

On Fri, Aug 6, 2010 at 8:56 PM, Vladimir N. Makarov <vmakarov@redhat.com> wrote:
> IMHO the code would have to become public if we want to see progress on
> solving this problem.  But it is Google property, and they probably decided
> not to give it to the gcc community.  Although I heard that Ken Zadeck has
> access to it.  We will see what the final result will be.

http://gcc.gnu.org/viewcvs/branches/gimple-back-end/

But at that point, everyone stopped working on this. Perhaps Diego or
Ian can explain why.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 15:08 ` Steven Bosscher
  2010-08-06 16:45   ` Paolo Bonzini
@ 2010-08-06 19:20   ` Bernd Schmidt
  2010-08-06 19:37     ` Jeff Law
  1 sibling, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-06 19:20 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches

On 08/06/2010 05:07 PM, Steven Bosscher wrote:

> It depends where you look. In the free software community, the
> majority of complaints is about compile time, but perhaps for your
> customers the missed optimizations are more important.

The latter I believe is certainly true.  But it also seems to happen
every now and then that I stumble on discussions on the Web where
someone claims gcc code generation is "crazy bad" for their code, or
where they compare gcc output to other compilers.

> But perhaps the optimization can be performed in another place than
> combine? If it's just a relatively small set of common patterns, a
> quick GIMPLE pass may be preferable.
> 
> (I know, it's the old argument we've had before: You have a real fix
> now for a real problem, others (incl. me) believe that to fix this
> problem in combine is a step away from sane instruction selection that
> GCC may not have for years to come...)

I don't think making an existing pass more powerful can be viewed as a
step away from a sane implementation.  It may set the bar higher for
comparisons, but that's as it should be - an alternative solution must
prove its superiority.  The combiner is one way of solving the problem,
and one that works in the context of gcc - alternative implementations
would first have to exist and then prove that they achieve better results.

> What does the code look like, that is optimized so much better with 4
> insns to combine? It looks like code that could come from bitfield
> manipulations, which is an area where it's well-known that GCC does
> not always optimize too well.

Here are some typical examples.  Some of them are bitfield operations;
PR42172 is one.  Also, from gimplify.s:

gimple_set_plf (gimple stmt, enum plf_mask plf, unsigned char val_p)
{
  if (val_p)
    stmt->gsbase.plf |= (unsigned int) plf;
  else
    stmt->gsbase.plf &= ~((unsigned int) plf);
}

(inlined)
        call    gimple_build_goto
-       movzbl  1(%eax), %ecx
-       movl    %ecx, %edx
-       shrb    $3, %dl
-       andl    $3, %edx
-       orl     $1, %edx
-       leal    0(,%edx,8), %ebp
-       movl    %ecx, %edx
-       andl    $-25, %edx
-       orl     %ebp, %edx
-       movb    %dl, 1(%eax)
+       orb     $8, 1(%eax)
=============
Sometimes bitfield operations are written manually as masks and shifts.
From one of my own programs:
            ciabalarm = (ciabalarm & ~0xff) | val;

-       movzbl  %dl, %ebx
-       movl    ciabalarm, %eax
-       xorb    %al, %al
-       orl     %eax, %ebx
-       movl    %ebx, ciabalarm
+       movb    %dl, ciabalarm
IIRC gcc actually used to be better at optimizing this; I seem to recall
relying on it with gcc-2.7.2.  I think the x86 backend has changed and
now allows fewer complex insns, making it harder for the combiner to do
its thing.  So this is probably a regression.
=============
From attr.i (kernel):

  if (((inode)->i_flags & 256))
   return -26;

-       andl    $256, %eax
-       cmpl    $1, %eax
-       sbbl    %eax, %eax
-       notl    %eax
+       sall    $23, %eax
+       sarl    $31, %eax
        andl    $-26, %eax
=============
Elsewhere in the kernel:
static inline __attribute__((always_inline)) __u8 rol8(__u8 word,
unsigned int shift)
{
 return (word << shift) | (word >> (8 - shift));
}

-       movl    %ecx, %edx
-       sall    $8, %edx
-       shrw    $8, %ax
-       orl     %edx, %eax
+       rolw    $8, %ax

These are all things which in gcc have always been combine's job to deal
with.


Bernd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 19:20   ` Bernd Schmidt
@ 2010-08-06 19:37     ` Jeff Law
  2010-08-06 19:43       ` Bernd Schmidt
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Law @ 2010-08-06 19:37 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Steven Bosscher, GCC Patches

  On 08/06/10 13:19, Bernd Schmidt wrote:
>> What does the code look like, that is optimized so much better with 4
>> insns to combine? It looks like code that could come from bitfield
>> manipulations, which is an area where it's well-known that GCC does
>> not always optimize too well.
> Here are some typical examples.  Some of it is bitfield operations.
> PR42172 is.  Also, from gimplify.s:
It's also worth noting that some ports have hacks to encourage 4->1 or 
4->2 combinations.  Basically they have patterns which represent an 
intermediate step in a 4->1 or 4->2 combination even if there is no 
machine instruction which implements the insn appearing in the 
intermediate step.  I've used (and recommended) this trick numerous 
times through the years, so I suspect these patterns probably exist in 
many ports.


Here's an example from h8300.md:


;; This is a "bridge" instruction.  Combine can't cram enough insns
;; together to create a MAC instruction directly, but it can create
;; this instruction, which then allows combine to create the real
;; MAC insn.
;;
;; Unfortunately, if combine doesn't create a MAC instruction, this
;; insn must generate reasonably correct code.  Egad.
(define_insn ""
   [(set (match_operand:SI 0 "register_operand" "=a")
         (mult:SI
           (sign_extend:SI
             (mem:HI (post_inc:SI (match_operand:SI 1 "register_operand" 
"r"))))
           (sign_extend:SI
             (mem:HI (post_inc:SI (match_operand:SI 2 "register_operand" 
"r"))))))]
   "TARGET_MAC"
   "clrmac\;mac  @%2+,@%1+"
   [(set_attr "length" "6")
    (set_attr "cc" "none_0hit")])

(define_insn ""
   [(set (match_operand:SI 0 "register_operand" "=a")
         (plus:SI (mult:SI
           (sign_extend:SI (mem:HI
             (post_inc:SI (match_operand:SI 1 "register_operand" "r"))))
           (sign_extend:SI (mem:HI
             (post_inc:SI (match_operand:SI 2 "register_operand" "r")))))
               (match_operand:SI 3 "register_operand" "0")))]
   "TARGET_MAC"
   "mac  @%2+,@%1+"
   [(set_attr "length" "4")
    (set_attr "cc" "none_0hit")])


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 19:37     ` Jeff Law
@ 2010-08-06 19:43       ` Bernd Schmidt
  2010-08-06 21:46         ` Jeff Law
  0 siblings, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-06 19:43 UTC (permalink / raw)
  To: Jeff Law; +Cc: Steven Bosscher, GCC Patches

On 08/06/2010 09:37 PM, Jeff Law wrote:
> It's also worth noting that some ports have hacks to encourage 4->1 or
> 4->2 combinations.  Basically they have patterns which represent an
> intermediate step in a 4->1 or 4->2 combination even if there is no
> machine instruction which implements the insn appearing in the
> intermediate step.  I've used (and recommended) this trick numerous
> times through the years, so I suspect these patterns probably exist in
> many ports.

Yes.  Such combiner bridges may still be useful to help with five-insn
combinations :)

However, as I found when trying to fix PR42172 for the Thumb, describing
insns that don't actually exist on the machine can degrade code quality.
For that PR, it seemed like it would be easy to try to just use
nonmemory_operand for the second input, and add a few alternatives with
constraints for certain constants to output the right sequence later on.
However, some and operations are expanded into

rn = rn << 10;
rn = rn >> 20;

and if the shifts aren't present in the RTL, naturally they can't be
combined with anything else, which can be a problem.
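
To spell out what that shift pair computes (a throwaway C sanity check,
values made up; assumes the usual 32-bit registers):

#include <assert.h>

static unsigned int
and_by_shifts (unsigned int x)
{
  x <<= 10;                     /* discard the original top 10 bits */
  x >>= 20;                     /* ... and the original bottom 10 bits */
  return x;
}

int
main (void)
{
  unsigned int x = 0xdeadbeefu;
  /* Same as masking bits 10..21 and shifting them into place.  */
  assert (and_by_shifts (x) == ((x >> 10) & 0xfffu));
  return 0;
}

With a pattern that only expands to the shifts at output time, those
shift insns never exist in the RTL stream, so combine has nothing to
merge them with.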

So, half a dozen preliminary Thumb-1 patches later, I decided to do it
in the combiner, which turned out to be significantly easier (unless a
reviewer finds a big thinko...)


Bernd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 15:04 ` Richard Guenther
@ 2010-08-06 20:08   ` Bernd Schmidt
  2010-08-06 20:37     ` Richard Guenther
  0 siblings, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-06 20:08 UTC (permalink / raw)
  To: Richard Guenther; +Cc: GCC Patches

On 08/06/2010 05:04 PM, Richard Guenther wrote:
> Do you have statistics how many two, three and four insn combinations
> a) are tried, b) can be validated, c) are used in the end, for example
> during a GCC bootstrap?

Here are some low-tech statistics for my collection of .i files.  This
includes gcc, Linux kernel (double counted here since there are two
versions for different architectures), SPECint2000, Toshi's stress
testsuite, a popular embedded benchmark and a few other things.  Let me
know if you want something else.  I only did a) and c) because I
realized too late you probably asked about which ones were discarded due
to costs.

$ grep Trying.four log |wc -l
307743
$ grep Trying.three log |wc -l
592776
$ grep Trying.two log |wc -l
1643112
$ grep Succeeded.two log |wc -l
204808
$ grep Succeeded.three.into.two log |wc -l
2976
$ grep Succeeded.three.into.one log |wc -l
12473
$ grep Succeeded.four.into.two log |wc -l
244
$ grep Succeeded.four.into.one log |wc -l
140

> It might make sense to restrict 4 insn combinations to
> -fexpensive-optimizations (thus, not enable it at -O1).

This, certainly.


Bernd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 20:08   ` Bernd Schmidt
@ 2010-08-06 20:37     ` Richard Guenther
  2010-08-06 21:53       ` Jeff Law
                         ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Richard Guenther @ 2010-08-06 20:37 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On Fri, Aug 6, 2010 at 10:08 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 08/06/2010 05:04 PM, Richard Guenther wrote:
>> Do you have statistics how many two, three and four insn combinations
>> a) are tried, b) can be validated, c) are used in the end, for example
>> during a GCC bootstrap?
>
> Here are some low-tech statistics for my collection of .i files.  This
> includes gcc, Linux kernel (double counted here since there are two
> versions for different architectures), SPECint2000, Toshi's stress
> testsuite, a popular embedded benchmark and a few other things.  Let me
> know if you want something else.  I only did a) and c) because I
> realized too late you probably asked about which ones were discarded due
> to costs.

Right.

> $ grep Trying.four log |wc -l
> 307743
> $ grep Trying.three log |wc -l
> 592776
> $ grep Trying.two log |wc -l
> 1643112
> $ grep Succeeded.two log |wc -l
> 204808
> $ grep Succeeded.three.into.two log |wc -l
> 2976
> $ grep Succeeded.three.into.one log |wc -l
> 12473
> $ grep Succeeded.four.into.two log |wc -l
> 244
> $ grep Succeeded.four.into.one log |wc -l
> 140

No four into three?  So overall there are one order of magnitude fewer
successful three-insn combinations than two-insn ones, and two orders of
magnitude fewer four than three.
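Spelled out, the success rates are roughly 204808/1643112 = 12.5% for
two-insn combinations, (2976+12473)/592776 = 2.6% for three, and
(244+140)/307743 = 0.12% for four.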

I think it's still reasonable (and I agree that combine is a proper
place to do this).  For the list of examples you found where it helps,
can you file enhancement bug reports so that we might do something here
on the tree level or do better initial expansion?

Thanks,
Richard.

>> It might make sense to restrict 4 insn combinations to
>> -fexpensive-optimizations (thus, not enable it at -O1).
>
> This, certainly.
>
>
> Bernd
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 18:02       ` Jeff Law
@ 2010-08-06 20:44         ` Steven Bosscher
  2010-08-06 20:48           ` Richard Guenther
  2010-08-06 21:49           ` Jeff Law
  0 siblings, 2 replies; 129+ messages in thread
From: Steven Bosscher @ 2010-08-06 20:44 UTC (permalink / raw)
  To: Jeff Law; +Cc: Paolo Bonzini, Bernd Schmidt, GCC Patches

On Fri, Aug 6, 2010 at 8:02 PM, Jeff Law <law@redhat.com> wrote:
>  On 08/06/10 11:22, Steven Bosscher wrote:
>>
>> On Fri, Aug 6, 2010 at 6:45 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
>>>>
>>>> But perhaps the optimization can be performed in another place than
>>>> combine? If it's just a relatively small set of common patterns, a
>>>> quick GIMPLE pass may be preferable.
>>
>> I think this part of my message was much more interesting to respond to.
>
> This code naturally belongs in combine, putting it anywhere else is just,
> umm, silly.

I suppose you meant to say you are of the opinion that it's silly, or
did you actually intend to put that as a fact? *That* would be silly
(fact), thank you very much.

Anyway.

If it's patterns we could combine at the GIMPLE level, it wouldn't be
so silly to handle it there. Weren't you once one of the people
talking about a tree-combine pass?

Or if this is a case the compiler could already expand from GIMPLE to
good initial RTL (i.e. TER), that wouldn't be a silly place to put it
either.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 20:44         ` Steven Bosscher
@ 2010-08-06 20:48           ` Richard Guenther
  2010-08-06 21:49           ` Jeff Law
  1 sibling, 0 replies; 129+ messages in thread
From: Richard Guenther @ 2010-08-06 20:48 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Jeff Law, Paolo Bonzini, Bernd Schmidt, GCC Patches

On Fri, Aug 6, 2010 at 10:43 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Fri, Aug 6, 2010 at 8:02 PM, Jeff Law <law@redhat.com> wrote:
>>  On 08/06/10 11:22, Steven Bosscher wrote:
>>>
>>>> On Fri, Aug 6, 2010 at 6:45 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
>>>>>
>>>>> But perhaps the optimization can be performed in another place than
>>>>> combine? If it's just a relatively small set of common patterns, a
>>>>> quick GIMPLE pass may be preferable.
>>>
>>> I think this part of my message was much more interesting to respond to.
>>
>> This code naturally belongs in combine, putting it anywhere else is just,
>> umm, silly.
>
> I suppose you meant to say you are of the opinion that it's silly, or
> did you actually intend to put that as a fact? *That* would be silly
> (fact), thank you very much.
>
> Anyway.
>
> If it's patterns we could combine at the GIMPLE level, it wouldn't be
> so silly to handle it there. Weren't you once one of the people
> talking about a tree-combine pass?

Combining at the GIMPLE level is difficult because complex instructions
span multiple GIMPLE statements (unless we invent more complex
statements, of course).  You also face the problem of computing costs in
a target-independent way.

> Or if this is a case the compiler could already expand from GIMPLE to
> good initial RTL (i.e. TER), that wouldn't be a silly place to put it
> either.

That's true.  I think we should really differentiate between combine
doing things because simplify-rtx simplifies something and combine
matching a more complex instruction.  The former should be handled at
the GIMPLE level, and I'd like to see enhancement bug reports with the
cases that we currently miss.

Richard.

> Ciao!
> Steven
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 18:56       ` Vladimir N. Makarov
  2010-08-06 19:02         ` Steven Bosscher
@ 2010-08-06 21:11         ` Chris Lattner
  2010-08-08 11:40         ` Paolo Bonzini
  2010-08-12  5:53         ` Ian Lance Taylor
  3 siblings, 0 replies; 129+ messages in thread
From: Chris Lattner @ 2010-08-06 21:11 UTC (permalink / raw)
  To: Vladimir N. Makarov
  Cc: Steven Bosscher, Paolo Bonzini, Bernd Schmidt, GCC Patches


On Aug 6, 2010, at 11:56 AM, Vladimir N. Makarov wrote:

> 
> In any case I'd not expect big code or compile time improvements from a new code selection pass (e.g. code selection in LLVM is very expensive and takes more time than the GCC combiner).  But a new code selection pass could be more readable and easier to maintain.

I agree with you that it is reasonably expensive, but FYI, the
selection dag time in LLVM doesn't correspond to combine times.  The
SelectionDAG phases in LLVM (which do dag-based, not tree-based,
pattern matching) perform lowering, matching, pre-regalloc scheduling,
and optimization.  It's fairer to compare them to expand, combine,
pre-ra scheduling, and whatever optimization passes gcc runs early in
the rtl pipeline.

-Chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 19:43       ` Bernd Schmidt
@ 2010-08-06 21:46         ` Jeff Law
  2010-08-09 14:54           ` Mark Mitchell
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff Law @ 2010-08-06 21:46 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Steven Bosscher, GCC Patches

  On 08/06/10 13:43, Bernd Schmidt wrote:
> On 08/06/2010 09:37 PM, Jeff Law wrote:
>> It's also worth noting that some ports have hacks to encourage 4->1 or
>> 4->2 combinations.  Basically they have patterns which represent an
>> intermediate step in a 4->1 or 4->2 combination even if there is no
>> machine instruction which implements the insn appearing in the
>> intermediate step.  I've used (and recommended) this trick numerous
>> times through the years, so I suspect these patterns probably exist in
>> many ports.
> Yes.  Such combiner bridges may still be useful to help with five-insn
> combinations :)
Yea, though I expect there to be an ever-decreasing payoff for allowing 
larger bundles of instructions to be combined.
> However, as I found when trying to fix PR42172 for the Thumb, describing
> insns that don't actually exist on the machine can degrade code quality.
Precisely.  Creating those funky bridge instructions was never a
solution; it was a hack, and sometimes the hack created worse code.  Far
better to actually fix combine.
Jeff

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 20:44         ` Steven Bosscher
  2010-08-06 20:48           ` Richard Guenther
@ 2010-08-06 21:49           ` Jeff Law
  1 sibling, 0 replies; 129+ messages in thread
From: Jeff Law @ 2010-08-06 21:49 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Paolo Bonzini, Bernd Schmidt, GCC Patches

  On 08/06/10 14:43, Steven Bosscher wrote:
>
> If it's patterns we could combine at the GIMPLE level, it wouldn't be
> so silly to handle it there. Weren't you once one of the people
> talking about a tree-combine pass?
The specific cases I've seen won't combine at the gimple level.  For
example, if you look at the H8, you need to have a post-inc memory
operand, which I really doubt we're ever going to expose at the gimple
level.

I still think a tree combiner would be a great thing to have, as it
would replace tree-ssa-forwprop with a more generic engine.

> Or if this is a case the compiler could already expand from GIMPLE to
> good initial RTL (i.e. TER), that wouldn't be a silly place to put it
> either.
Again, I don't think the majority of the cases where 4->1 or 4->2 
combinations occur are going to be anything that we'd expose that early 
in the pipeline.

Jeff

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 20:37     ` Richard Guenther
@ 2010-08-06 21:53       ` Jeff Law
  2010-08-06 22:41       ` Bernd Schmidt
  2010-08-10 14:37       ` Bernd Schmidt
  2 siblings, 0 replies; 129+ messages in thread
From: Jeff Law @ 2010-08-06 21:53 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Bernd Schmidt, GCC Patches

  On 08/06/10 14:37, Richard Guenther wrote:
>>> It might make sense to restrict 4 insn combinations to
>>> -fexpensive-optimizations (thus, not enable it at -O1).
>> This, certainly.
Agreed.

Jeff

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 20:37     ` Richard Guenther
  2010-08-06 21:53       ` Jeff Law
@ 2010-08-06 22:41       ` Bernd Schmidt
  2010-08-06 23:47         ` Richard Guenther
  2010-08-10 14:37       ` Bernd Schmidt
  2 siblings, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-06 22:41 UTC (permalink / raw)
  To: Richard Guenther; +Cc: GCC Patches

On 08/06/2010 10:37 PM, Richard Guenther wrote:
> No four into three?

I did not implement this.  It's probably possible, but I'm not convinced
it would be worthwhile.  There are a few other things we could try with
the combiner, such as 2->2 combinations, which is similar to what I did
with reload_combine recently.

> I think it's still reasonable (and I agree that combine is a proper
> place to do this).  For the list of examples you found where it helps,
> can you file enhancement bug reports so that we might do something here
> on the tree level or do better initial expansion?

Filed PRs 45214 through 45218, which should give reasonably good
coverage.  I'll add a note to 42172.


Bernd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 22:41       ` Bernd Schmidt
@ 2010-08-06 23:47         ` Richard Guenther
  2010-08-07  8:11           ` Eric Botcazou
  0 siblings, 1 reply; 129+ messages in thread
From: Richard Guenther @ 2010-08-06 23:47 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On Sat, Aug 7, 2010 at 12:41 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 08/06/2010 10:37 PM, Richard Guenther wrote:
>> No four into three?
>
> I did not implement this.  It's probably possible, but I'm not convinced
> it would be worthwhile.  There are a few other things we could try with
> the combiner, such as 2->2 combinations, which is similar to what I did
> with reload_combine recently.
>
>> I think it's still reasonable (and I agree that combine is a proper
>> place to do this).  For the list of examples you found where it helps,
>> can you file enhancement bug reports so that we might do something here
>> on the tree level or do better initial expansion?
>
> Filed PRs 45214 through 45218, which should give reasonably good
> coverage.  I'll add a note to 42172.

Thanks for doing this.  I think the patch is ok if you guard the
4-way combines by flag_expensive_optimizations.

Thanks,
Richard.

>
> Bernd
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 23:47         ` Richard Guenther
@ 2010-08-07  8:11           ` Eric Botcazou
  2010-08-09 12:29             ` Bernd Schmidt
  0 siblings, 1 reply; 129+ messages in thread
From: Eric Botcazou @ 2010-08-07  8:11 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches, Bernd Schmidt

> Thanks for doing this.  I think the patch is ok if you guard the
> 4-way combines by flag_expensive_optimizations.

Combining Steven and Bernd's figures, 1% of bootstrap time is 37% of the
combiner's time.  The result is 0.18% more combined insns.  It seems to me
that we are already very far in the direction of diminishing returns.

Needless to say, by fixing the problems earlier, we would not only save
1% of compilation time but also speed up the compiler by not generating
the dreadful RTL in the first place and carrying it over to the combiner.

Bernd is essentially of the opinion that compilation time doesn't matter.  It 
seems to me that, even if we were to adopt this position, this shouldn't mean 
wasting compilation time, which I think is the case here.

So I think that the patch shouldn't go in at this point.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 18:56       ` Vladimir N. Makarov
  2010-08-06 19:02         ` Steven Bosscher
  2010-08-06 21:11         ` Chris Lattner
@ 2010-08-08 11:40         ` Paolo Bonzini
  2010-08-12  5:53         ` Ian Lance Taylor
  3 siblings, 0 replies; 129+ messages in thread
From: Paolo Bonzini @ 2010-08-08 11:40 UTC (permalink / raw)
  To: Vladimir N. Makarov; +Cc: Steven Bosscher, Bernd Schmidt, GCC Patches

On Fri, Aug 6, 2010 at 14:56, Vladimir N. Makarov <vmakarov@redhat.com> wrote:
> Another problem: most modern pattern matchers work on trees, not on DAGs.
> There are some solutions (heuristic, or expensive but optimal) for this
> problem.  The combiner already solves it in some way.

One idea I had was to separate combination into multi-output
instructions from combination into single-output instructions.  The
former is probably needed only in very special cases (condition codes,
divmod, etc.); the latter could be done using tree-based pattern
selection.
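
For the multi-output case, divmod is the obvious example (a hypothetical
snippet, not from any testcase):

unsigned int q, r;

void
divmod (unsigned int b, unsigned int c)
{
  q = b / c;                    /* on x86, a single divl produces ...  */
  r = b % c;                    /* ... quotient and remainder together */
}

A tree-based single-output matcher selects each statement fine in
isolation; fusing the two into one instruction needs the multi-output
machinery.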

Paolo

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-07  8:11           ` Eric Botcazou
@ 2010-08-09 12:29             ` Bernd Schmidt
  2010-08-09 12:39               ` Steven Bosscher
                                 ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-09 12:29 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Richard Guenther, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2744 bytes --]

On 08/07/2010 10:10 AM, Eric Botcazou wrote:
> So I think that the patch shouldn't go in at this point.

Richard has approved it.  I'll wait a few more days to see if anyone
else agrees with your position.

> Combining Steven and Bernd's figures, 1% of bootstrap time is 37% of the
> combiner's time.  The result is 0.18% more combined insns.  It seems to me
> that we are already very far in the direction of diminishing returns.

Better to look at actual code generation results IMO.  Do you have an
opinion on the examples I included with the patch?

> Bernd is essentially of the opinion that compilation time doesn't matter.

In a sense, the fact that a single CPU can bootstrap gcc in under 15
minutes is evidence that it doesn't matter.
However, what I'm actually saying is that we shouldn't prioritize
compile time over producing good code, based on what I think users want
more.

> It  seems to me that, even if we were to adopt this position, this shouldn't
> mean wasting compilation time, which I think is the case here.

Compile time is wasted only when it's spent on something that has no
user-visible impact.  For all the talk about how important it is, no one
seems to have made an effort to eliminate some fairly obvious sources of
waste, such as excessive use of ggc.  I suspect that some of the time
lost in combine is simply due to inefficient allocation and collection
of all the patterns it creates.

The following crude proof-of-concept patch moves rtl generation back to
obstacks.  (You may need --disable-werror which I just noticed I have in
the build tree).
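
The trick the patch uses in several places (the fwprop.c hunks below are
typical) is the classic obstack high-water mark: take a zero-size
allocation to mark the current top, build scratch RTL freely, and roll
everything back with a single obstack_free if the change is not kept.  A
minimal standalone sketch of the idiom (plain libc obstacks, not GCC
code):

#include <obstack.h>
#include <stdlib.h>

#define obstack_chunk_alloc xmalloc
#define obstack_chunk_free free

static void *
xmalloc (size_t n)
{
  void *p = malloc (n);
  if (!p)
    abort ();
  return p;
}

int
main (void)
{
  struct obstack ob;
  char *mark;

  obstack_init (&ob);

  /* A zero-size allocation marks the current top, like
     XOBNEWVAR (rtl_obstack, char, 0) in the patch.  */
  mark = (char *) obstack_alloc (&ob, 0);

  /* Speculatively build some scratch objects...  */
  obstack_alloc (&ob, 128);
  obstack_alloc (&ob, 64);

  /* ...and release everything past the mark in one call.  */
  obstack_free (&ob, mark);

  obstack_free (&ob, NULL);     /* tear down the whole obstack */
  return 0;
}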

Three runs with ggc:
real 14m8.202s  user 99m23.408s  sys 3m4.175s
real 14m25.045s user 100m14.608s sys 3m7.654s
real 14m2.115s  user 99m9.492s sys 3m4.461s

Three runs with obstacks:
real 13m49.718s user 97m10.766s sys 3m4.311s
real 13m42.406s user 96m39.082s sys 3m3.908s
real 13m49.806s user 97m1.344s sys 3m2.731s

Combiner patch on top of the obstacks patch:
real 13m51.508s user 97m25.865s sys 3m5.938s
real 13m47.367s user 97m28.612s sys 3m7.298s

(The numbers are not comparable to the ones included with the combiner
patch last week, as that tree contained some i386 backend changes as
well which I've removed for this test.)

Even if you take the 96m39s outlier, I think it shows that the overhead
of the combine-4 patch is somewhat reduced when RTL allocation is
restored to sanity.

Since I didn't know what kinds of problems to expect, I've only tried to
find some kind of fix for whatever showed up, not necessarily the best
possible one.  A second pass over everything would be necessary to clean
it up a little.  I'm somewhat disinclined to spend much more than the
one weekend on this; after all I don't care about compile time.


Bernd

[-- Attachment #2: rtlobst4.diff --]
[-- Type: text/plain, Size: 99264 bytes --]

Index: fwprop.c
===================================================================
--- fwprop.c	(revision 162821)
+++ fwprop.c	(working copy)
@@ -116,6 +116,8 @@ along with GCC; see the file COPYING3.  
    all that is needed by fwprop.  */
 
 
+static char *fwprop_firstobj;
+
 static int num_changes;
 
 DEF_VEC_P(df_ref);
@@ -1002,6 +1004,7 @@ try_fwprop_subst (df_ref use, rtx *loc, 
     {
       confirm_change_group ();
       num_changes++;
+      fwprop_firstobj = XOBNEWVAR (rtl_obstack, char, 0);
 
       df_ref_remove (use);
       if (!CONSTANT_P (new_rtx))
@@ -1023,6 +1026,7 @@ try_fwprop_subst (df_ref use, rtx *loc, 
 	    fprintf (dump_file, " Setting REG_EQUAL note\n");
 
 	  set_unique_reg_note (insn, REG_EQUAL, copy_rtx (new_rtx));
+	  fwprop_firstobj = XOBNEWVAR (rtl_obstack, char, 0);
 
 	  /* ??? Is this still necessary if we add the note through
 	     set_unique_reg_note?  */
@@ -1035,6 +1039,7 @@ try_fwprop_subst (df_ref use, rtx *loc, 
 			 type, DF_REF_IN_NOTE);
 	    }
 	}
+      obstack_free (rtl_obstack, fwprop_firstobj);
     }
 
   return ok;
@@ -1212,8 +1217,11 @@ forward_propagate_asm (df_ref use, rtx d
     }
 
   if (num_changes_pending () == 0 || !apply_change_group ())
-    return false;
-
+    {
+      obstack_free (rtl_obstack, fwprop_firstobj);
+      return false;
+    }
+  fwprop_firstobj = XOBNEWVAR (rtl_obstack, char, 0);
   num_changes++;
   return true;
 }
@@ -1392,6 +1400,7 @@ fwprop_init (void)
 
   build_single_def_use_links ();
   df_set_flags (DF_DEFER_INSN_RESCAN);
+  fwprop_firstobj = XOBNEWVAR (rtl_obstack, char, 0);
 }
 
 static void
Index: cgraph.h
===================================================================
--- cgraph.h	(revision 162821)
+++ cgraph.h	(working copy)
@@ -871,7 +871,7 @@ varpool_node_set_size (varpool_node_set 
 
 struct GTY(()) constant_descriptor_tree {
   /* A MEM for the constant.  */
-  rtx rtl;
+  rtx GTY((skip)) rtl;
 
   /* The value of the constant.  */
   tree value;
Index: libfuncs.h
===================================================================
--- libfuncs.h	(revision 162821)
+++ libfuncs.h	(working copy)
@@ -50,23 +50,23 @@ enum libfunc_index
 /* Information about an optab-related libfunc.  We use the same hashtable
    for normal optabs and conversion optabs.  In the first case mode2
    is unused.  */
-struct GTY(()) libfunc_entry {
+struct libfunc_entry {
   size_t optab;
   enum machine_mode mode1, mode2;
   rtx libfunc;
 };
 
 /* Target-dependent globals.  */
-struct GTY(()) target_libfuncs {
+struct target_libfuncs {
   /* SYMBOL_REF rtx's for the library functions that are called
      implicitly and not via optabs.  */
   rtx x_libfunc_table[LTI_MAX];
 
   /* Hash table used to convert declarations into nodes.  */
-  htab_t GTY((param_is (struct libfunc_entry))) x_libfunc_hash;
+  htab_t x_libfunc_hash;
 };
 
-extern GTY(()) struct target_libfuncs default_target_libfuncs;
+extern struct target_libfuncs default_target_libfuncs;
 #if SWITCHABLE_TARGET
 extern struct target_libfuncs *this_target_libfuncs;
 #else
Index: optabs.c
===================================================================
--- optabs.c	(revision 162821)
+++ optabs.c	(working copy)
@@ -6075,7 +6075,7 @@ set_optab_libfunc (optab optable, enum m
     val = 0;
   slot = (struct libfunc_entry **) htab_find_slot (libfunc_hash, &e, INSERT);
   if (*slot == NULL)
-    *slot = ggc_alloc_libfunc_entry ();
+    *slot = XNEW (struct libfunc_entry);
   (*slot)->optab = (size_t) (optable - &optab_table[0]);
   (*slot)->mode1 = mode;
   (*slot)->mode2 = VOIDmode;
@@ -6102,7 +6102,7 @@ set_conv_libfunc (convert_optab optable,
     val = 0;
   slot = (struct libfunc_entry **) htab_find_slot (libfunc_hash, &e, INSERT);
   if (*slot == NULL)
-    *slot = ggc_alloc_libfunc_entry ();
+    *slot = XNEW (struct libfunc_entry);
   (*slot)->optab = (size_t) (optable - &convert_optab_table[0]);
   (*slot)->mode1 = tmode;
   (*slot)->mode2 = fmode;
@@ -6123,7 +6123,7 @@ init_optabs (void)
       init_insn_codes ();
     }
   else
-    libfunc_hash = htab_create_ggc (10, hash_libfunc, eq_libfunc, NULL);
+    libfunc_hash = htab_create (10, hash_libfunc, eq_libfunc, NULL);
 
   init_optab (add_optab, PLUS);
   init_optabv (addv_optab, PLUS);
Index: gengenrtl.c
===================================================================
--- gengenrtl.c	(revision 162821)
+++ gengenrtl.c	(working copy)
@@ -125,6 +125,7 @@ static int
 special_rtx (int idx)
 {
   return (strcmp (defs[idx].enumname, "CONST_INT") == 0
+	  || strcmp (defs[idx].enumname, "CONST") == 0
 	  || strcmp (defs[idx].enumname, "REG") == 0
 	  || strcmp (defs[idx].enumname, "SUBREG") == 0
 	  || strcmp (defs[idx].enumname, "MEM") == 0
Index: tree.h
===================================================================
--- tree.h	(revision 162821)
+++ tree.h	(working copy)
@@ -2816,7 +2816,7 @@ extern void decl_value_expr_insert (tree
 
 struct GTY(()) tree_decl_with_rtl {
   struct tree_decl_common common;
-  rtx rtl;
+  rtx GTY((skip)) rtl;
 };
 
 /* In a FIELD_DECL, this is the field position, counting in bytes, of the
@@ -2937,7 +2937,7 @@ struct GTY(()) tree_const_decl {
 
 struct GTY(()) tree_parm_decl {
   struct tree_decl_with_rtl common;
-  rtx incoming_rtl;
+  rtx GTY((skip)) incoming_rtl;
   struct var_ann_d *ann;
 };
 
Index: reload.h
===================================================================
--- reload.h	(revision 162821)
+++ reload.h	(working copy)
@@ -206,7 +206,7 @@ extern struct target_reload *this_target
 #define caller_save_initialized_p \
   (this_target_reload->x_caller_save_initialized_p)
 
-extern GTY (()) VEC(rtx,gc) *reg_equiv_memory_loc_vec;
+extern VEC(rtx,gc) *reg_equiv_memory_loc_vec;
 extern rtx *reg_equiv_constant;
 extern rtx *reg_equiv_invariant;
 extern rtx *reg_equiv_memory_loc;
@@ -216,10 +216,10 @@ extern rtx *reg_equiv_alt_mem_list;
 
 /* Element N is the list of insns that initialized reg N from its equivalent
    constant or memory slot.  */
-extern GTY((length("reg_equiv_init_size"))) rtx *reg_equiv_init;
+extern rtx *reg_equiv_init;
 
 /* The size of the previous array, for GC purposes.  */
-extern GTY(()) int reg_equiv_init_size;
+extern int reg_equiv_init_size;
 
 /* All the "earlyclobber" operands of the current insn
    are recorded here.  */
Index: final.c
===================================================================
--- final.c	(revision 162821)
+++ final.c	(working copy)
@@ -4467,6 +4467,8 @@ rest_of_clean_state (void)
   /* We're done with this function.  Free up memory if we can.  */
   free_after_parsing (cfun);
   free_after_compilation (cfun);
+  free_function_rtl ();
+  discard_rtx_lists ();
   return 0;
 }
 
Index: builtins.c
===================================================================
--- builtins.c	(revision 162821)
+++ builtins.c	(working copy)
@@ -1885,11 +1885,12 @@ expand_errno_check (tree exp, rtx target
   /* If this built-in doesn't throw an exception, set errno directly.  */
   if (TREE_NOTHROW (TREE_OPERAND (CALL_EXPR_FN (exp), 0)))
     {
+      rtx errno_rtx;
 #ifdef GEN_ERRNO_RTX
-      rtx errno_rtx = GEN_ERRNO_RTX;
+      errno_rtx = GEN_ERRNO_RTX;
 #else
-      rtx errno_rtx
-	  = gen_rtx_MEM (word_mode, gen_rtx_SYMBOL_REF (Pmode, "errno"));
+      errno_rtx
+	= gen_rtx_MEM (word_mode, gen_rtx_SYMBOL_REF (Pmode, "errno"));
 #endif
       emit_move_insn (errno_rtx, GEN_INT (TARGET_EDOM));
       emit_label (lab);
Index: lists.c
===================================================================
--- lists.c	(revision 162821)
+++ lists.c	(working copy)
@@ -26,17 +26,16 @@ along with GCC; see the file COPYING3.  
 #include "diagnostic-core.h"
 #include "toplev.h"
 #include "rtl.h"
-#include "ggc.h"
 
 static void free_list (rtx *, rtx *);
 
 /* Functions for maintaining cache-able lists of EXPR_LIST and INSN_LISTs.  */
 
 /* An INSN_LIST containing all INSN_LISTs allocated but currently unused.  */
-static GTY ((deletable)) rtx unused_insn_list;
+static rtx unused_insn_list;
 
 /* An EXPR_LIST containing all EXPR_LISTs allocated but currently unused.  */
-static GTY ((deletable)) rtx unused_expr_list;
+static rtx unused_expr_list;
 
 /* This function will free an entire list of either EXPR_LIST, INSN_LIST
    or DEPS_LIST nodes.  This is to be used only on lists that consist
@@ -216,4 +215,9 @@ remove_free_EXPR_LIST_node (rtx *listp)
   return elem;
 }
 
-#include "gt-lists.h"
+void
+discard_rtx_lists (void)
+{
+  unused_insn_list = unused_expr_list = 0;
+}
+
Index: gensupport.c
===================================================================
--- gensupport.c	(revision 162821)
+++ gensupport.c	(working copy)
@@ -35,9 +35,6 @@ int target_flags;
 
 int insn_elision = 1;
 
-static struct obstack obstack;
-struct obstack *rtl_obstack = &obstack;
-
 static int sequence_num;
 
 static int predicable_default;
@@ -788,10 +785,10 @@ bool
 init_rtx_reader_args_cb (int argc, char **argv,
 			 bool (*parse_opt) (const char *))
 {
+  init_rtl ();
   /* Prepare to read input.  */
   condition_table = htab_create (500, hash_c_test, cmp_c_test, NULL);
   init_predicate_table ();
-  obstack_init (rtl_obstack);
   sequence_num = 0;
 
   read_md_files (argc, argv, parse_opt, rtx_handle_directive);
Index: toplev.c
===================================================================
--- toplev.c	(revision 162821)
+++ toplev.c	(working copy)
@@ -2080,6 +2080,7 @@ backend_init_target (void)
 static void
 backend_init (void)
 {
+  init_rtl ();
   init_emit_once ();
 
   init_rtlanal ();
@@ -2090,6 +2091,7 @@ backend_init (void)
   /* Initialize the target-specific back end pieces.  */
   ira_init_once ();
   backend_init_target ();
+  preserve_rtl ();
 }
 
 /* Initialize excess precision settings.  */
Index: dojump.c
===================================================================
--- dojump.c	(revision 162821)
+++ dojump.c	(working copy)
@@ -33,7 +33,6 @@ along with GCC; see the file COPYING3.  
 #include "expr.h"
 #include "optabs.h"
 #include "langhooks.h"
-#include "ggc.h"
 #include "basic-block.h"
 #include "output.h"
 
@@ -132,9 +131,9 @@ jumpif_1 (enum tree_code code, tree op0,
 
 /* Used internally by prefer_and_bit_test.  */
 
-static GTY(()) rtx and_reg;
-static GTY(()) rtx and_test;
-static GTY(()) rtx shift_test;
+static rtx and_reg;
+static rtx and_test;
+static rtx shift_test;
 
 /* Compare the relative costs of "(X & (1 << BITNUM))" and "(X >> BITNUM) & 1",
    where X is an arbitrary register of mode MODE.  Return true if the former
@@ -145,12 +144,14 @@ prefer_and_bit_test (enum machine_mode m
 {
   if (and_test == 0)
     {
+      rtl_on_permanent_obstack ();
       /* Set up rtxes for the two variations.  Use NULL as a placeholder
 	 for the BITNUM-based constants.  */
       and_reg = gen_rtx_REG (mode, FIRST_PSEUDO_REGISTER);
       and_test = gen_rtx_AND (mode, and_reg, NULL);
       shift_test = gen_rtx_AND (mode, gen_rtx_ASHIFTRT (mode, and_reg, NULL),
 				const1_rtx);
+      rtl_pop_obstack ();
     }
   else
     {
@@ -1185,5 +1186,3 @@ do_compare_and_jump (tree treeop0, tree 
                             ? expr_size (treeop0) : NULL_RTX),
 			   if_false_label, if_true_label, prob);
 }
-
-#include "gt-dojump.h"
Index: caller-save.c
===================================================================
--- caller-save.c	(revision 162821)
+++ caller-save.c	(working copy)
@@ -39,7 +39,6 @@ along with GCC; see the file COPYING3.  
 #include "tm_p.h"
 #include "addresses.h"
 #include "output.h"
-#include "ggc.h"
 
 #define MOVE_MAX_WORDS (MOVE_MAX / UNITS_PER_WORD)
 
@@ -101,12 +100,9 @@ static void add_stored_regs (rtx, const_
 
 \f
 
-static GTY(()) rtx savepat;
-static GTY(()) rtx restpat;
-static GTY(()) rtx test_reg;
-static GTY(()) rtx test_mem;
-static GTY(()) rtx saveinsn;
-static GTY(()) rtx restinsn;
+static rtx savepat, saveinsn;
+static rtx restpat, restinsn;
+static rtx test_reg, test_mem;
 
 /* Return the INSN_CODE used to save register REG in mode MODE.  */
 static int
@@ -190,7 +186,10 @@ init_caller_save (void)
 
   caller_save_initialized_p = true;
 
+  rtl_on_permanent_obstack ();
+
   CLEAR_HARD_REG_SET (no_caller_save_reg_set);
+  
   /* First find all the registers that we need to deal with and all
      the modes that they can have.  If we can't find a mode to use,
      we can't have the register live over calls.  */
@@ -278,6 +277,7 @@ init_caller_save (void)
 		SET_HARD_REG_BIT (no_caller_save_reg_set, i);
 	    }
 	}
+  rtl_pop_obstack ();
 }
 
 \f
@@ -1405,4 +1405,3 @@ insert_one_insn (struct insn_chain *chai
   INSN_CODE (new_chain->insn) = code;
   return new_chain;
 }
-#include "gt-caller-save.h"
Index: dwarf2out.c
===================================================================
--- dwarf2out.c	(revision 162821)
+++ dwarf2out.c	(working copy)
@@ -199,10 +199,6 @@ dwarf2out_do_cfi_asm (void)
 #define PTR_SIZE (POINTER_SIZE / BITS_PER_UNIT)
 #endif
 
-/* Array of RTXes referenced by the debugging information, which therefore
-   must be kept around forever.  */
-static GTY(()) VEC(rtx,gc) *used_rtx_array;
-
 /* A pointer to the base of a list of incomplete types which might be
    completed at some later time.  incomplete_types_list needs to be a
    VEC(tree,gc) because we want to tell the garbage collector about
@@ -233,7 +229,7 @@ static GTY(()) section *debug_frame_sect
 
 /* Personality decl of current unit.  Used only when assembler does not support
    personality CFI.  */
-static GTY(()) rtx current_unit_personality;
+static rtx current_unit_personality;
 
 /* How to start an assembler comment.  */
 #ifndef ASM_COMMENT_START
@@ -1662,17 +1658,17 @@ dwarf2out_notice_stack_adjust (rtx insn,
    of the prologue or (b) the register is clobbered.  This clusters
    register saves so that there are fewer pc advances.  */
 
-struct GTY(()) queued_reg_save {
+struct queued_reg_save {
   struct queued_reg_save *next;
   rtx reg;
   HOST_WIDE_INT cfa_offset;
   rtx saved_reg;
 };
 
-static GTY(()) struct queued_reg_save *queued_reg_saves;
+static struct queued_reg_save *queued_reg_saves;
 
 /* The caller's ORIG_REG is saved in SAVED_IN_REG.  */
-struct GTY(()) reg_saved_in_data {
+struct reg_saved_in_data {
   rtx orig_reg;
   rtx saved_in_reg;
 };
@@ -1681,8 +1677,8 @@ struct GTY(()) reg_saved_in_data {
    The list intentionally has a small maximum capacity of 4; if your
    port needs more than that, you might consider implementing a
    more efficient data structure.  */
-static GTY(()) struct reg_saved_in_data regs_saved_in_regs[4];
-static GTY(()) size_t num_regs_saved_in_regs;
+static struct reg_saved_in_data regs_saved_in_regs[4];
+static size_t num_regs_saved_in_regs;
 
 #if defined (DWARF2_DEBUGGING_INFO) || defined (DWARF2_UNWIND_INFO)
 static const char *last_reg_save_label;
@@ -1704,7 +1700,7 @@ queue_reg_save (const char *label, rtx r
 
   if (q == NULL)
     {
-      q = ggc_alloc_queued_reg_save ();
+      q = XNEW (struct queued_reg_save);
       q->next = queued_reg_saves;
       queued_reg_saves = q;
     }
@@ -1721,13 +1717,14 @@ queue_reg_save (const char *label, rtx r
 static void
 flush_queued_reg_saves (void)
 {
-  struct queued_reg_save *q;
+  struct queued_reg_save *q, *next;
 
-  for (q = queued_reg_saves; q; q = q->next)
+  for (q = queued_reg_saves; q; q = next)
     {
       size_t i;
       unsigned int reg, sreg;
 
+      next = q->next;
       for (i = 0; i < num_regs_saved_in_regs; i++)
 	if (REGNO (regs_saved_in_regs[i].orig_reg) == REGNO (q->reg))
 	  break;
@@ -1748,6 +1745,8 @@ flush_queued_reg_saves (void)
       else
 	sreg = INVALID_REGNUM;
       reg_save (last_reg_save_label, reg, sreg, q->cfa_offset);
+
+      free (q);
     }
 
   queued_reg_saves = NULL;
@@ -4278,7 +4277,7 @@ typedef struct GTY(()) dw_val_struct {
   enum dw_val_class val_class;
   union dw_val_struct_union
     {
-      rtx GTY ((tag ("dw_val_class_addr"))) val_addr;
+      rtx GTY ((skip,tag ("dw_val_class_addr"))) val_addr;
       unsigned HOST_WIDE_INT GTY ((tag ("dw_val_class_offset"))) val_offset;
       dw_loc_list_ref GTY ((tag ("dw_val_class_loc_list"))) val_loc_list;
       dw_loc_descr_ref GTY ((tag ("dw_val_class_loc"))) val_loc;
@@ -5859,7 +5858,7 @@ struct GTY ((chain_next ("%h.next"))) va
      mode is 0 and first operand is a CONCAT with bitsize
      as first CONCAT operand and NOTE_INSN_VAR_LOCATION resp.
      NULL as second operand.  */
-  rtx GTY (()) loc;
+  rtx GTY ((skip)) loc;
   const char * GTY (()) label;
   struct var_loc_node * GTY (()) next;
 };
@@ -7378,9 +7377,12 @@ add_AT_addr (dw_die_ref die, enum dwarf_
 {
   dw_attr_node attr;
 
+  rtl_on_permanent_obstack ();
+  addr = copy_rtx (addr);
+  rtl_pop_obstack ();
   attr.dw_attr = attr_kind;
   attr.dw_attr_val.val_class = dw_val_class_addr;
-  attr.dw_attr_val.v.val_addr = addr;
+  attr.dw_attr_val.v.val_addr = permanent_copy_rtx (addr);
   add_dwarf_attr (die, &attr);
 }
 
@@ -13620,7 +13622,7 @@ mem_loc_descriptor (rtx rtl, enum machin
 	  temp = new_loc_descr (DWARF2_ADDR_SIZE == 4
 				? DW_OP_const4u : DW_OP_const8u, 0, 0);
 	  temp->dw_loc_oprnd1.val_class = dw_val_class_addr;
-	  temp->dw_loc_oprnd1.v.val_addr = rtl;
+	  temp->dw_loc_oprnd1.v.val_addr = permanent_copy_rtx (rtl);
 	  temp->dtprel = true;
 
 	  mem_loc_result = new_loc_descr (DW_OP_GNU_push_tls_address, 0, 0);
@@ -13633,10 +13635,12 @@ mem_loc_descriptor (rtx rtl, enum machin
 	break;
 
     symref:
+      rtl_on_permanent_obstack ();
+      rtl = copy_rtx (rtl);
+      rtl_pop_obstack ();
       mem_loc_result = new_loc_descr (DW_OP_addr, 0, 0);
       mem_loc_result->dw_loc_oprnd1.val_class = dw_val_class_addr;
-      mem_loc_result->dw_loc_oprnd1.v.val_addr = rtl;
-      VEC_safe_push (rtx, gc, used_rtx_array, rtl);
+      mem_loc_result->dw_loc_oprnd1.v.val_addr = permanent_copy_rtx (rtl);
       break;
 
     case CONCAT:
@@ -14442,9 +14446,11 @@ loc_descriptor (rtx rtl, enum machine_mo
 	{
 	  loc_result = new_loc_descr (DW_OP_addr, 0, 0);
 	  loc_result->dw_loc_oprnd1.val_class = dw_val_class_addr;
-	  loc_result->dw_loc_oprnd1.v.val_addr = rtl;
+	  loc_result->dw_loc_oprnd1.v.val_addr = permanent_copy_rtx (rtl);
 	  add_loc_descr (&loc_result, new_loc_descr (DW_OP_stack_value, 0, 0));
-	  VEC_safe_push (rtx, gc, used_rtx_array, rtl);
+	  rtl_on_permanent_obstack ();
+	  rtl = copy_rtx (rtl);
+	  rtl_pop_obstack ();
 	}
       break;
 
@@ -15134,7 +15140,7 @@ loc_list_from_tree (tree loc, int want_a
 
 	  ret = new_loc_descr (first_op, 0, 0);
 	  ret->dw_loc_oprnd1.val_class = dw_val_class_addr;
-	  ret->dw_loc_oprnd1.v.val_addr = rtl;
+	  ret->dw_loc_oprnd1.v.val_addr = permanent_copy_rtx (rtl);
 	  ret->dtprel = dtprel;
 
 	  ret1 = new_loc_descr (second_op, 0, 0);
@@ -15185,7 +15191,7 @@ loc_list_from_tree (tree loc, int want_a
 	  {
 	    ret = new_loc_descr (DW_OP_addr, 0, 0);
 	    ret->dw_loc_oprnd1.val_class = dw_val_class_addr;
-	    ret->dw_loc_oprnd1.v.val_addr = rtl;
+	    ret->dw_loc_oprnd1.v.val_addr = permanent_copy_rtx (rtl);
 	  }
 	else
 	  {
@@ -16111,10 +16117,12 @@ add_const_value_attribute (dw_die_ref di
 	rtl_addr:
 	  loc_result = new_loc_descr (DW_OP_addr, 0, 0);
 	  loc_result->dw_loc_oprnd1.val_class = dw_val_class_addr;
-	  loc_result->dw_loc_oprnd1.v.val_addr = rtl;
+	  loc_result->dw_loc_oprnd1.v.val_addr = permanent_copy_rtx (rtl);
 	  add_loc_descr (&loc_result, new_loc_descr (DW_OP_stack_value, 0, 0));
 	  add_AT_loc (die, DW_AT_location, loc_result);
-	  VEC_safe_push (rtx, gc, used_rtx_array, rtl);
+	  rtl_on_permanent_obstack ();
+	  rtl = copy_rtx (rtl);
+	  rtl_pop_obstack ();
 	  return true;
 	}
       return false;
@@ -17498,9 +17506,11 @@ add_name_and_src_coords_attributes (dw_d
      from the DECL_NAME name used in the source file.  */
   if (TREE_CODE (decl) == FUNCTION_DECL && TREE_ASM_WRITTEN (decl))
     {
-      add_AT_addr (die, DW_AT_VMS_rtnbeg_pd_address,
-		   XEXP (DECL_RTL (decl), 0));
-      VEC_safe_push (rtx, gc, used_rtx_array, XEXP (DECL_RTL (decl), 0));
+      rtx rtl = XEXP (DECL_RTL (decl), 0);
+      rtl_on_permanent_obstack ();
+      rtl = copy_rtx (rtl);
+      rtl_pop_obstack ();
+      add_AT_addr (die, DW_AT_VMS_rtnbeg_pd_address, rtl);
     }
 #endif
 }
@@ -19036,9 +19046,13 @@ gen_variable_die (tree decl, tree origin
 			  && loc->expr->dw_loc_next == NULL
 			  && GET_CODE (loc->expr->dw_loc_oprnd1.v.val_addr)
 			     == SYMBOL_REF)
-			loc->expr->dw_loc_oprnd1.v.val_addr
-			  = plus_constant (loc->expr->dw_loc_oprnd1.v.val_addr, off);
-			else
+			{
+			  rtx t = plus_constant (loc->expr->dw_loc_oprnd1.v.val_addr,
+						 off);
+			  t = permanent_copy_rtx (t);
+			  loc->expr->dw_loc_oprnd1.v.val_addr = t;
+			}
+		      else
 			  loc_list_plus_const (loc, off);
 		    }
 		  add_AT_location_description (var_die, DW_AT_location, loc);
@@ -19099,8 +19113,12 @@ gen_variable_die (tree decl, tree origin
 		  && loc->expr->dw_loc_opc == DW_OP_addr
 		  && loc->expr->dw_loc_next == NULL
 		  && GET_CODE (loc->expr->dw_loc_oprnd1.v.val_addr) == SYMBOL_REF)
-		loc->expr->dw_loc_oprnd1.v.val_addr
-		  = plus_constant (loc->expr->dw_loc_oprnd1.v.val_addr, off);
+		{
+		  rtx t = plus_constant (loc->expr->dw_loc_oprnd1.v.val_addr,
+					 off);
+		  t = permanent_copy_rtx (t);
+		  loc->expr->dw_loc_oprnd1.v.val_addr = t;
+		}
 	      else
 		loc_list_plus_const (loc, off);
 	    }
@@ -21582,8 +21600,6 @@ dwarf2out_init (const char *filename ATT
 
   incomplete_types = VEC_alloc (tree, gc, 64);
 
-  used_rtx_array = VEC_alloc (rtx, gc, 32);
-
   debug_info_section = get_section (DEBUG_INFO_SECTION,
 				    SECTION_DEBUG, NULL);
   debug_abbrev_section = get_section (DEBUG_ABBREV_SECTION,
@@ -22126,8 +22142,9 @@ resolve_one_addr (rtx *addr, void *data 
       rtl = lookup_constant_def (t);
       if (!rtl || !MEM_P (rtl))
 	return 1;
-      rtl = XEXP (rtl, 0);
-      VEC_safe_push (rtx, gc, used_rtx_array, rtl);
+      rtl_on_permanent_obstack ();
+      rtl = copy_rtx (XEXP (rtl, 0));
+      rtl_pop_obstack ();
       *addr = rtl;
       return 0;
     }
Index: tree-ssa-address.c
===================================================================
--- tree-ssa-address.c	(revision 162821)
+++ tree-ssa-address.c	(working copy)
@@ -43,7 +43,6 @@ along with GCC; see the file COPYING3.  
 #include "rtl.h"
 #include "recog.h"
 #include "expr.h"
-#include "ggc.h"
 #include "target.h"
 
 /* TODO -- handling of symbols (according to Richard Hendersons
@@ -73,22 +72,22 @@ along with GCC; see the file COPYING3.  
 /* A "template" for memory address, used to determine whether the address is
    valid for mode.  */
 
-typedef struct GTY (()) mem_addr_template {
+typedef struct mem_addr_template {
   rtx ref;			/* The template.  */
-  rtx * GTY ((skip)) step_p;	/* The point in template where the step should be
-				   filled in.  */
-  rtx * GTY ((skip)) off_p;	/* The point in template where the offset should
-				   be filled in.  */
+  rtx *step_p;	/* The point in template where the step should be
+		   filled in.  */
+  rtx *off_p;	/* The point in template where the offset should
+		   be filled in.  */
 } mem_addr_template;
 
 DEF_VEC_O (mem_addr_template);
-DEF_VEC_ALLOC_O (mem_addr_template, gc);
+DEF_VEC_ALLOC_O (mem_addr_template, heap);
 
 /* The templates.  Each of the low five bits of the index corresponds to one
    component of TARGET_MEM_REF being present, while the high bits identify
    the address space.  See TEMPL_IDX.  */
 
-static GTY(()) VEC (mem_addr_template, gc) *mem_addr_template_list;
+static VEC (mem_addr_template, heap) *mem_addr_template_list;
 
 #define TEMPL_IDX(AS, SYMBOL, BASE, INDEX, STEP, OFFSET) \
   (((int) (AS) << 5) \
@@ -115,6 +114,8 @@ gen_addr_rtx (enum machine_mode address_
   if (offset_p)
     *offset_p = NULL;
 
+  rtl_on_permanent_obstack ();
+
   if (index)
     {
       act_elem = index;
@@ -176,6 +177,8 @@ gen_addr_rtx (enum machine_mode address_
 
   if (!*addr)
     *addr = const0_rtx;
+
+  rtl_pop_obstack ();
 }
 
 /* Returns address for TARGET_MEM_REF with parameters given by ADDR
@@ -209,13 +212,14 @@ addr_for_mem_ref (struct mem_address *ad
 
       if (templ_index
 	  >= VEC_length (mem_addr_template, mem_addr_template_list))
-	VEC_safe_grow_cleared (mem_addr_template, gc, mem_addr_template_list,
+	VEC_safe_grow_cleared (mem_addr_template, heap, mem_addr_template_list,
 			       templ_index + 1);
 
       /* Reuse the templates for addresses, so that we do not waste memory.  */
       templ = VEC_index (mem_addr_template, mem_addr_template_list, templ_index);
       if (!templ->ref)
 	{
+	  rtl_on_permanent_obstack ();
 	  sym = (addr->symbol ?
 		 gen_rtx_SYMBOL_REF (address_mode, ggc_strdup ("test_symbol"))
 		 : NULL_RTX);
@@ -232,6 +236,7 @@ addr_for_mem_ref (struct mem_address *ad
 			&templ->ref,
 			&templ->step_p,
 			&templ->off_p);
+	  rtl_pop_obstack ();
 	}
 
       if (st)
@@ -921,4 +926,3 @@ dump_mem_address (FILE *file, struct mem
     }
 }
 
-#include "gt-tree-ssa-address.h"
Index: function.c
===================================================================
--- function.c	(revision 162821)
+++ function.c	(working copy)
@@ -124,10 +124,8 @@ struct machine_function * (*init_machine
 struct function *cfun = 0;
 
 /* These hashes record the prologue and epilogue insns.  */
-static GTY((if_marked ("ggc_marked_p"), param_is (struct rtx_def)))
-  htab_t prologue_insn_hash;
-static GTY((if_marked ("ggc_marked_p"), param_is (struct rtx_def)))
-  htab_t epilogue_insn_hash;
+static htab_t prologue_insn_hash;
+static htab_t epilogue_insn_hash;
 \f
 
 htab_t types_used_by_vars_hash = NULL;
@@ -208,6 +206,10 @@ free_after_parsing (struct function *f)
 void
 free_after_compilation (struct function *f)
 {
+  if (prologue_insn_hash)
+    htab_delete (prologue_insn_hash);
+  if (epilogue_insn_hash)
+    htab_delete (epilogue_insn_hash);
   prologue_insn_hash = NULL;
   epilogue_insn_hash = NULL;
 
@@ -219,6 +221,8 @@ free_after_compilation (struct function 
   f->machine = NULL;
   f->cfg = NULL;
 
+  if (regno_reg_rtx)
+    free (regno_reg_rtx);
   regno_reg_rtx = NULL;
   insn_locators_free ();
 }
@@ -339,7 +343,8 @@ try_fit_stack_local (HOST_WIDE_INT start
 static void
 add_frame_space (HOST_WIDE_INT start, HOST_WIDE_INT end)
 {
-  struct frame_space *space = ggc_alloc_frame_space ();
+  struct frame_space *space;
+  space = (struct frame_space *)obstack_alloc (rtl_obstack, sizeof *space);
   space->next = crtl->frame_space_list;
   crtl->frame_space_list = space;
   space->start = start;
@@ -516,7 +521,6 @@ assign_stack_local (enum machine_mode mo
   return assign_stack_local_1 (mode, size, align, false);
 }
 \f
-\f
 /* In order to evaluate some expressions, such as function calls returning
    structures in memory, we need to temporarily allocate stack locations.
    We record each allocated temporary in the following structure.
@@ -535,13 +539,13 @@ assign_stack_local (enum machine_mode mo
    level where they are defined.  They are marked as "kept" so that
    free_temp_slots will not free them.  */
 
-struct GTY(()) temp_slot {
+struct temp_slot {
   /* Points to next temporary slot.  */
   struct temp_slot *next;
   /* Points to previous temporary slot.  */
   struct temp_slot *prev;
   /* The rtx used to reference the slot.  */
-  rtx slot;
+  rtx GTY(()) slot;
   /* The size, in units, of the slot.  */
   HOST_WIDE_INT size;
   /* The type of the object in the slot, or zero if it doesn't correspond
@@ -569,10 +573,10 @@ struct GTY(()) temp_slot {
 
 /* A table of addresses that represent a stack slot.  The table is a mapping
    from address RTXen to a temp slot.  */
-static GTY((param_is(struct temp_slot_address_entry))) htab_t temp_slot_address_table;
+static htab_t temp_slot_address_table;
 
 /* Entry for the above hash table.  */
-struct GTY(()) temp_slot_address_entry {
+struct temp_slot_address_entry {
   hashval_t hash;
   rtx address;
   struct temp_slot *temp_slot;
@@ -682,7 +686,8 @@ static void
 insert_temp_slot_address (rtx address, struct temp_slot *temp_slot)
 {
   void **slot;
-  struct temp_slot_address_entry *t = ggc_alloc_temp_slot_address_entry ();
+  struct temp_slot_address_entry *t;
+  t = (struct temp_slot_address_entry *)obstack_alloc (rtl_obstack, sizeof *t);
   t->address = address;
   t->temp_slot = temp_slot;
   t->hash = temp_slot_address_compute_hash (t);
@@ -834,7 +839,7 @@ assign_stack_temp_for_type (enum machine
 
 	  if (best_p->size - rounded_size >= alignment)
 	    {
-	      p = ggc_alloc_temp_slot ();
+	      p = (struct temp_slot *)obstack_alloc (rtl_obstack, sizeof *p);
 	      p->in_use = p->addr_taken = 0;
 	      p->size = best_p->size - rounded_size;
 	      p->base_offset = best_p->base_offset + rounded_size;
@@ -858,7 +863,7 @@ assign_stack_temp_for_type (enum machine
     {
       HOST_WIDE_INT frame_offset_old = frame_offset;
 
-      p = ggc_alloc_temp_slot ();
+      p = (struct temp_slot *)obstack_alloc (rtl_obstack, sizeof *p);
 
       /* We are passing an explicit alignment request to assign_stack_local.
 	 One side effect of that is that assign_stack_local will not round SIZE
@@ -1311,10 +1316,10 @@ init_temp_slots (void)
 
   /* Set up the table to map addresses to temp slots.  */
   if (! temp_slot_address_table)
-    temp_slot_address_table = htab_create_ggc (32,
-					       temp_slot_address_hash,
-					       temp_slot_address_eq,
-					       NULL);
+    temp_slot_address_table = htab_create (32,
+					   temp_slot_address_hash,
+					   temp_slot_address_eq,
+					   NULL);
   else
     htab_empty (temp_slot_address_table);
 }
@@ -1836,7 +1841,9 @@ instantiate_decls (tree fndecl)
   FOR_EACH_LOCAL_DECL (cfun, ix, decl)
     if (DECL_RTL_SET_P (decl))
       instantiate_decl_rtl (DECL_RTL (decl));
+#if 0
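+  /* Do not free local_decls here; the trees may still be referenced from
+     RTL, which the garbage collector no longer scans.  */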
   VEC_free (tree, gc, cfun->local_decls);
+#endif
 }
 
 /* Pass through the INSNS of function FNDECL and convert virtual register
@@ -4779,8 +4786,6 @@ do_warn_unused_parameter (tree fn)
       warning (OPT_Wunused_parameter, "unused parameter %q+D", decl);
 }
 
-static GTY(()) rtx initial_trampoline;
-
 /* Generate RTL for the end of the current function.  */
 
 void
@@ -5068,7 +5073,7 @@ record_insns (rtx insns, rtx end, htab_t
 
   if (hash == NULL)
     *hashp = hash
-      = htab_create_ggc (17, htab_hash_pointer, htab_eq_pointer, NULL);
+      = htab_create (17, htab_hash_pointer, htab_eq_pointer, NULL);
 
   for (tmp = insns; tmp != end; tmp = NEXT_INSN (tmp))
     {
Index: function.h
===================================================================
--- function.h	(revision 162821)
+++ function.h	(working copy)
@@ -33,14 +33,14 @@ along with GCC; see the file COPYING3.  
    The main insn-chain is saved in the last element of the chain,
    unless the chain is empty.  */
 
-struct GTY(()) sequence_stack {
+struct sequence_stack {
   /* First and last insns in the chain of the saved sequence.  */
   rtx first;
   rtx last;
   struct sequence_stack *next;
 };
 \f
-struct GTY(()) emit_status {
+struct emit_status {
   /* This is reset to LAST_VIRTUAL_REGISTER + 1 at the start of each function.
      After rtl generation, it is 1 plus the largest register number used.  */
   int x_reg_rtx_no;
@@ -61,6 +61,7 @@ struct GTY(()) emit_status {
      The main insn-chain is saved in the last element of the chain,
      unless the chain is empty.  */
   struct sequence_stack *sequence_stack;
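+
+  /* Chain of sequence_stack entries available for reuse; previously a
+     file-scope GC root in emit-rtl.c.  */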
+  struct sequence_stack *free_sequence_stack;
 
   /* INSN_UID for next insn emitted.
      Reset to 1 for each function compiled.  */
@@ -83,7 +84,7 @@ struct GTY(()) emit_status {
   /* Indexed by pseudo register number, if nonzero gives the known alignment
      for that pseudo (if REG_POINTER is set in x_regno_reg_rtx).
      Allocated in parallel with x_regno_reg_rtx.  */
-  unsigned char * GTY((skip)) regno_pointer_align;
+  unsigned char *regno_pointer_align;
 };
 
 
@@ -92,7 +93,7 @@ struct GTY(()) emit_status {
    FIXME: We could put it into the emit_status struct, but gengtype is not
    able to deal with the length attribute nested in top-level structures.  */
 
-extern GTY ((length ("crtl->emit.x_reg_rtx_no"))) rtx * regno_reg_rtx;
+extern rtx *regno_reg_rtx;
 
 /* For backward compatibility... eventually these should all go away.  */
 #define reg_rtx_no (crtl->emit.x_reg_rtx_no)
@@ -100,7 +101,7 @@ extern GTY ((length ("crtl->emit.x_reg_r
 
 #define REGNO_POINTER_ALIGN(REGNO) (crtl->emit.regno_pointer_align[REGNO])
 
-struct GTY(()) expr_status {
+struct expr_status {
   /* Number of units that we should eventually pop off the stack.
      These are the arguments to function calls that have already returned.  */
   int x_pending_stack_adjust;
@@ -145,7 +146,7 @@ DEF_VEC_P(call_site_record);
 DEF_VEC_ALLOC_P(call_site_record, gc);
 
 /* RTL representation of exception handling.  */
-struct GTY(()) rtl_eh {
+struct rtl_eh {
   rtx ehr_stackadj;
   rtx ehr_handler;
   rtx ehr_label;
@@ -178,7 +179,7 @@ typedef struct ipa_opt_pass_d *ipa_opt_p
 DEF_VEC_P(ipa_opt_pass);
 DEF_VEC_ALLOC_P(ipa_opt_pass,heap);
 
-struct GTY(()) varasm_status {
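+  /* RTL for the shared constant pool is reused across functions, so it
+     must outlive the current one.  */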
+struct varasm_status {
   /* If we're using a per-function constant pool, this is it.  */
   struct rtx_constant_pool *pool;
 
@@ -188,7 +189,7 @@ struct GTY(()) varasm_status {
 };
 
 /* Information maintained about RTL representation of incoming arguments.  */
-struct GTY(()) incoming_args {
+struct incoming_args {
   /* Number of bytes of args popped by function being compiled on its return.
      Zero if no bytes are to be popped.
      May affect compilation of return insn or of function epilogue.  */
@@ -217,7 +218,7 @@ struct GTY(()) incoming_args {
 };
 
 /* Data for function partitioning.  */
-struct GTY(()) function_subsections {
+struct function_subsections {
   /* Assembly labels for the hot and cold text sections, to
      be used by debugger functions for determining the size of text
      sections.  */
@@ -236,7 +237,7 @@ struct GTY(()) function_subsections {
 /* Describe an empty area of space in the stack frame.  These can be chained
    into a list; this is used to keep track of space wasted for alignment
    reasons.  */
-struct GTY(()) frame_space
+struct frame_space
 {
   struct frame_space *next;
 
@@ -245,7 +246,7 @@ struct GTY(()) frame_space
 };
 
 /* Datastructures maintained for currently processed function in RTL form.  */
-struct GTY(()) rtl_data {
+struct rtl_data {
   struct expr_status expr;
   struct emit_status emit;
   struct varasm_status varasm;
@@ -461,7 +462,7 @@ struct GTY(()) rtl_data {
 #define stack_realign_fp (crtl->stack_realign_needed && !crtl->need_drap)
 #define stack_realign_drap (crtl->stack_realign_needed && crtl->need_drap)
 
-extern GTY(()) struct rtl_data x_rtl;
+extern struct rtl_data x_rtl;
 
 /* Accessor to RTL datastructures.  We keep them statically allocated now since
    we never keep multiple functions.  For threaded compiler we might however
@@ -505,11 +506,15 @@ struct GTY(()) function {
 
   /* Vector of function local variables, functions, types and constants.  */
   VEC(tree,gc) *local_decls;
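+
+  /* Trees referenced from this function's RTL, recorded so the GC keeps
+     them alive now that the RTL itself is not scanned.  */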
+  VEC(tree,gc) *stack_vars;
+
+  /* A fake decl that is used as the MEM_EXPR of spill slots.  */
+  tree spill_slot_decl;
 
   /* For md files.  */
 
   /* tm.h can use this to store whatever it likes.  */
-  struct machine_function * GTY ((maybe_undef)) machine;
+  struct machine_function * GTY ((skip)) machine;
 
   /* Language-specific code can use this to store whatever it likes.  */
   struct language_function * language;
Index: gcse.c
===================================================================
--- gcse.c	(revision 162821)
+++ gcse.c	(working copy)
@@ -159,7 +159,6 @@ along with GCC; see the file COPYING3.  
 #include "function.h"
 #include "expr.h"
 #include "except.h"
-#include "ggc.h"
 #include "params.h"
 #include "cselib.h"
 #include "intl.h"
@@ -849,7 +848,7 @@ want_to_gcse_p (rtx x, int *max_distance
 
 /* Used internally by can_assign_to_reg_without_clobbers_p.  */
 
-static GTY(()) rtx test_insn;
+static rtx test_insn;
 
 /* Return true if we can assign X to a pseudo register such that the
    resulting insn does not result in clobbering a hard register as a
@@ -879,12 +878,14 @@ can_assign_to_reg_without_clobbers_p (rt
      our test insn if we haven't already.  */
   if (test_insn == 0)
     {
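+      /* test_insn is reused across functions, so keep it off the
+         function obstack.  */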
+      rtl_on_permanent_obstack ();
       test_insn
 	= make_insn_raw (gen_rtx_SET (VOIDmode,
 				      gen_rtx_REG (word_mode,
 						   FIRST_PSEUDO_REGISTER * 2),
 				      const0_rtx));
       NEXT_INSN (test_insn) = PREV_INSN (test_insn) = 0;
+      rtl_pop_obstack ();
     }
 
   /* Now make an insn like the one we would make when GCSE'ing and see if
@@ -5361,4 +5362,3 @@ struct rtl_opt_pass pass_rtl_hoist =
  }
 };
 
-#include "gt-gcse.h"
Index: alias.c
===================================================================
--- alias.c	(revision 162821)
+++ alias.c	(working copy)
@@ -204,13 +204,13 @@ static void memory_modified_1 (rtx, cons
    current function performs nonlocal memory references for the
    purposes of marking the function as a constant function.  */
 
-static GTY(()) VEC(rtx,gc) *reg_base_value;
+static VEC(rtx,heap) *reg_base_value;
 static rtx *new_reg_base_value;
 
 /* We preserve the copy of old array around to avoid amount of garbage
    produced.  About 8% of garbage produced were attributed to this
    array.  */
-static GTY((deletable)) VEC(rtx,gc) *old_reg_base_value;
+static VEC(rtx,heap) *old_reg_base_value;
 
 #define static_reg_base_value \
   (this_target_rtl->x_static_reg_base_value)
@@ -222,10 +222,10 @@ static GTY((deletable)) VEC(rtx,gc) *old
 /* Vector indexed by N giving the initial (unchanging) value known for
    pseudo-register N.  This array is initialized in init_alias_analysis,
    and does not change until end_alias_analysis is called.  */
-static GTY((length("reg_known_value_size"))) rtx *reg_known_value;
+static rtx *reg_known_value;
 
 /* Indicates number of valid entries in reg_known_value.  */
-static GTY(()) unsigned int reg_known_value_size;
+static unsigned int reg_known_value_size;
 
 /* Vector recording for each reg_known_value whether it is due to a
    REG_EQUIV note.  Future passes (viz., reload) may replace the
@@ -2625,7 +2625,7 @@ init_alias_analysis (void)
   timevar_push (TV_ALIAS_ANALYSIS);
 
   reg_known_value_size = maxreg - FIRST_PSEUDO_REGISTER;
-  reg_known_value = ggc_alloc_cleared_vec_rtx (reg_known_value_size);
+  reg_known_value = XCNEWVEC (rtx, reg_known_value_size);
   reg_known_equiv_p = XCNEWVEC (bool, reg_known_value_size);
 
   /* If we have memory allocated from the previous run, use it.  */
@@ -2635,7 +2635,7 @@ init_alias_analysis (void)
   if (reg_base_value)
     VEC_truncate (rtx, reg_base_value, 0);
 
-  VEC_safe_grow_cleared (rtx, gc, reg_base_value, maxreg);
+  VEC_safe_grow_cleared (rtx, heap, reg_base_value, maxreg);
 
   new_reg_base_value = XNEWVEC (rtx, maxreg);
   reg_seen = XNEWVEC (char, maxreg);
@@ -2800,7 +2800,7 @@ void
 end_alias_analysis (void)
 {
   old_reg_base_value = reg_base_value;
-  ggc_free (reg_known_value);
+  free (reg_known_value);
   reg_known_value = 0;
   reg_known_value_size = 0;
   free (reg_known_equiv_p);
Index: except.c
===================================================================
--- except.c	(revision 162821)
+++ except.c	(working copy)
@@ -2338,7 +2338,7 @@ add_call_site (rtx landing_pad, int acti
 {
   call_site_record record;
 
-  record = ggc_alloc_call_site_record_d ();
+  record = (call_site_record)obstack_alloc (rtl_obstack, sizeof *record);
   record->landing_pad = landing_pad;
   record->action = action;
 
Index: coverage.c
===================================================================
--- coverage.c	(revision 162821)
+++ coverage.c	(working copy)
@@ -106,7 +106,7 @@ static GTY(()) tree tree_ctr_tables[GCOV
 
 /* The names of the counter tables.  Not used if we're
    generating counters at tree level.  */
-static GTY(()) rtx ctr_labels[GCOV_COUNTERS];
+static rtx ctr_labels[GCOV_COUNTERS];
 
 /* The names of merge functions for counters.  */
 static const char *const ctr_merge_functions[GCOV_COUNTERS] = GCOV_MERGE_FUNCTIONS;
Index: except.h
===================================================================
--- except.h	(revision 162821)
+++ except.h	(working copy)
@@ -90,7 +90,7 @@ struct GTY(()) eh_landing_pad_d
      EXCEPTION_RECEIVER pattern will be expanded here, as well as other
      bookkeeping specific to exceptions.  There must not be normal edges
      into the block containing the landing-pad label.  */
-  rtx landing_pad;
+  rtx GTY((skip)) landing_pad;
 
   /* The index of this landing pad within fun->eh->lp_array.  */
   int index;
@@ -178,7 +178,8 @@ struct GTY(()) eh_region_d
   /* EXC_PTR and FILTER values copied from the runtime for this region.
      Each region gets its own pseudos so that if there are nested exceptions
      we do not overwrite the values of the first exception.  */
-  rtx exc_ptr_reg, filter_reg;
+  rtx GTY((skip)) exc_ptr_reg;
+  rtx GTY((skip)) filter_reg;
 
   /* True if this region should use __cxa_end_cleanup instead
      of _Unwind_Resume.  */
Index: emit-rtl.c
===================================================================
--- emit-rtl.c	(revision 162821)
+++ emit-rtl.c	(working copy)
@@ -119,24 +119,19 @@ rtx const_int_rtx[MAX_SAVED_CONST_INT * 
 /* A hash table storing CONST_INTs whose absolute value is greater
    than MAX_SAVED_CONST_INT.  */
 
-static GTY ((if_marked ("ggc_marked_p"), param_is (struct rtx_def)))
-     htab_t const_int_htab;
+static htab_t const_int_htab;
 
 /* A hash table storing memory attribute structures.  */
-static GTY ((if_marked ("ggc_marked_p"), param_is (struct mem_attrs)))
-     htab_t mem_attrs_htab;
+static htab_t mem_attrs_htab;
 
 /* A hash table storing register attribute structures.  */
-static GTY ((if_marked ("ggc_marked_p"), param_is (struct reg_attrs)))
-     htab_t reg_attrs_htab;
+static htab_t reg_attrs_htab;
 
 /* A hash table storing all CONST_DOUBLEs.  */
-static GTY ((if_marked ("ggc_marked_p"), param_is (struct rtx_def)))
-     htab_t const_double_htab;
+static htab_t const_double_htab;
 
 /* A hash table storing all CONST_FIXEDs.  */
-static GTY ((if_marked ("ggc_marked_p"), param_is (struct rtx_def)))
-     htab_t const_fixed_htab;
+static htab_t const_fixed_htab;
 
 #define cur_insn_uid (crtl->emit.x_cur_insn_uid)
 #define cur_debug_insn_uid (crtl->emit.x_cur_debug_insn_uid)
@@ -281,6 +276,8 @@ mem_attrs_htab_eq (const void *x, const 
 		  && operand_equal_p (p->expr, q->expr, 0))));
 }
 
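+/* Trees referenced from shared RTL structures such as mem_attrs; recorded
+   here so the GC keeps them alive now that those structures live on the
+   heap.  */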
+static GTY(()) VEC(tree,gc) *saved_decls;
+
 /* Allocate a new mem_attrs structure and insert it into the hash table if
    one identical to it is not already in the table.  We are doing this for
    MEM of mode MODE.  */
@@ -302,6 +299,7 @@ get_mem_attrs (alias_set_type alias, tre
 	  ? align == GET_MODE_ALIGNMENT (mode) : align == BITS_PER_UNIT))
     return 0;
 
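+  /* mem_attrs structures are no longer scanned by the GC; record EXPR so
+     that the tree stays alive.  */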
+  VEC_safe_push (tree, gc, saved_decls, expr);
   attrs.alias = alias;
   attrs.expr = expr;
   attrs.offset = offset;
@@ -312,7 +310,7 @@ get_mem_attrs (alias_set_type alias, tre
   slot = htab_find_slot (mem_attrs_htab, &attrs, INSERT);
   if (*slot == 0)
     {
-      *slot = ggc_alloc_mem_attrs ();
+      *slot = XNEW (mem_attrs);
       memcpy (*slot, &attrs, sizeof (mem_attrs));
     }
 
@@ -355,13 +353,18 @@ get_reg_attrs (tree decl, int offset)
   if (decl == 0 && offset == 0)
     return 0;
 
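+  /* reg_attrs are heap-allocated now; record DECL so that the GC does
+     not collect it.  */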
+  if (cfun)
+    VEC_safe_push (tree, gc, cfun->stack_vars, decl);
+  else
+    VEC_safe_push (tree, gc, saved_decls, decl);
+
   attrs.decl = decl;
   attrs.offset = offset;
 
   slot = htab_find_slot (reg_attrs_htab, &attrs, INSERT);
   if (*slot == 0)
     {
-      *slot = ggc_alloc_reg_attrs ();
+      *slot = XNEW (reg_attrs);
       memcpy (*slot, &attrs, sizeof (reg_attrs));
     }
 
@@ -395,6 +398,25 @@ gen_raw_REG (enum machine_mode mode, int
   return x;
 }
 
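+/* Return a CONST of mode MODE wrapping ARG.  A shareable CONST may be
+   reused after the current function's RTL is freed, so it is built on
+   the permanent obstack.  */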
+rtx
+gen_rtx_CONST (enum machine_mode mode, rtx arg)
+{
+  rtx x;
+  /* CONST can be shared if it contains a SYMBOL_REF.  If it contains
+     a LABEL_REF, it isn't sharable.  */
+  bool shared = (GET_CODE (arg) == PLUS
+		 && GET_CODE (XEXP (arg, 0)) == SYMBOL_REF
+		 && CONST_INT_P (XEXP (arg, 1)));
+  if (shared)
+    {
+      rtl_on_permanent_obstack ();
+      arg = copy_rtx (arg);
+    }
+  x = gen_rtx_raw_CONST (mode, arg);
+  if (shared)
+    rtl_pop_obstack ();
+  return x;
+}
+
 /* There are some RTL codes that require special attention; the generation
    functions do the raw handling.  If you add to this list, modify
    special_rtx in gengenrtl.c as well.  */
@@ -416,7 +438,11 @@ gen_rtx_CONST_INT (enum machine_mode mod
   slot = htab_find_slot_with_hash (const_int_htab, &arg,
 				   (hashval_t) arg, INSERT);
   if (*slot == 0)
-    *slot = gen_rtx_raw_CONST_INT (VOIDmode, arg);
+    {
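+      /* CONST_INTs are interned and shared across all functions, so they
+         must live on the permanent obstack.  */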
+      rtl_on_permanent_obstack ();
+      *slot = gen_rtx_raw_CONST_INT (VOIDmode, arg);
+      rtl_pop_obstack ();
+    }
 
   return (rtx) *slot;
 }
@@ -449,8 +475,12 @@ lookup_const_double (rtx real)
 rtx
 const_double_from_real_value (REAL_VALUE_TYPE value, enum machine_mode mode)
 {
-  rtx real = rtx_alloc (CONST_DOUBLE);
+  rtx real;
+
+  rtl_on_permanent_obstack ();
+  real = rtx_alloc (CONST_DOUBLE);
   PUT_MODE (real, mode);
+  rtl_pop_obstack ();
 
   real->u.rv = value;
 
@@ -477,8 +507,12 @@ lookup_const_fixed (rtx fixed)
 rtx
 const_fixed_from_fixed_value (FIXED_VALUE_TYPE value, enum machine_mode mode)
 {
-  rtx fixed = rtx_alloc (CONST_FIXED);
+  rtx fixed;
+
+  rtl_on_permanent_obstack ();
+  fixed = rtx_alloc (CONST_FIXED);
   PUT_MODE (fixed, mode);
+  rtl_pop_obstack ();
 
   fixed->u.fv = value;
 
@@ -555,8 +589,10 @@ immed_double_const (HOST_WIDE_INT i0, HO
     return GEN_INT (i0);
 
   /* We use VOIDmode for integers.  */
+  rtl_on_permanent_obstack ();
   value = rtx_alloc (CONST_DOUBLE);
   PUT_MODE (value, VOIDmode);
+  rtl_pop_obstack ();
 
   CONST_DOUBLE_LOW (value) = i0;
   CONST_DOUBLE_HIGH (value) = i1;
@@ -902,7 +938,7 @@ gen_reg_rtx (enum machine_mode mode)
       memset (tmp + old_size, 0, old_size);
       crtl->emit.regno_pointer_align = (unsigned char *) tmp;
 
-      new1 = GGC_RESIZEVEC (rtx, regno_reg_rtx, old_size * 2);
+      new1 = XRESIZEVEC (rtx, regno_reg_rtx, old_size * 2);
       memset (new1 + old_size, 0, old_size * sizeof (rtx));
       regno_reg_rtx = new1;
 
@@ -921,7 +957,7 @@ static void
 update_reg_offset (rtx new_rtx, rtx reg, int offset)
 {
   REG_ATTRS (new_rtx) = get_reg_attrs (REG_EXPR (reg),
-				   REG_OFFSET (reg) + offset);
+				       REG_OFFSET (reg) + offset);
 }
 
 /* Generate a register with same attributes as REG, but with OFFSET
@@ -1814,7 +1850,7 @@ set_mem_attributes_minus_bitpos (rtx ref
   MEM_ATTRS (ref)
     = get_mem_attrs (alias, expr, offset, size, align,
 		     TYPE_ADDR_SPACE (type), GET_MODE (ref));
-
+  
   /* If this is already known to be a scalar or aggregate, we are done.  */
   if (MEM_IN_STRUCT_P (ref) || MEM_SCALAR_P (ref))
     return;
@@ -2222,13 +2258,10 @@ widen_memory_access (rtx memref, enum ma
   return new_rtx;
 }
 \f
-/* A fake decl that is used as the MEM_EXPR of spill slots.  */
-static GTY(()) tree spill_slot_decl;
-
 tree
 get_spill_slot_decl (bool force_build_p)
 {
-  tree d = spill_slot_decl;
+  tree d = cfun->spill_slot_decl;
   rtx rd;
 
   if (d || !force_build_p)
@@ -2240,7 +2273,7 @@ get_spill_slot_decl (bool force_build_p)
   DECL_IGNORED_P (d) = 1;
   TREE_USED (d) = 1;
   TREE_THIS_NOTRAP (d) = 1;
-  spill_slot_decl = d;
+  cfun->spill_slot_decl = d;
 
   rd = gen_rtx_MEM (BLKmode, frame_pointer_rtx);
   MEM_NOTRAP_P (rd) = 1;
@@ -2248,6 +2281,7 @@ get_spill_slot_decl (bool force_build_p)
 				  NULL_RTX, 0, ADDR_SPACE_GENERIC, BLKmode);
   SET_DECL_RTL (d, rd);
 
+  VEC_safe_push (tree, gc, cfun->stack_vars, d);
   return d;
 }
 
@@ -5238,9 +5272,6 @@ emit (rtx x)
     }
 }
 \f
-/* Space for free sequence stack entries.  */
-static GTY ((deletable)) struct sequence_stack *free_sequence_stack;
-
 /* Begin emitting insns to a sequence.  If this sequence will contain
    something that might cause the compiler to pop arguments to function
    calls (because those pops have previously been deferred; see
@@ -5253,13 +5284,13 @@ start_sequence (void)
 {
   struct sequence_stack *tem;
 
-  if (free_sequence_stack != NULL)
+  if (crtl->emit.free_sequence_stack != NULL)
     {
-      tem = free_sequence_stack;
-      free_sequence_stack = tem->next;
+      tem = crtl->emit.free_sequence_stack;
+      crtl->emit.free_sequence_stack = tem->next;
     }
   else
-    tem = ggc_alloc_sequence_stack ();
+    tem = XOBNEW (rtl_obstack, struct sequence_stack);
 
   tem->next = seq_stack;
   tem->first = get_insns ();
@@ -5357,8 +5388,8 @@ end_sequence (void)
   seq_stack = tem->next;
 
   memset (tem, 0, sizeof (*tem));
-  tem->next = free_sequence_stack;
-  free_sequence_stack = tem;
+  tem->next = crtl->emit.free_sequence_stack;
+  crtl->emit.free_sequence_stack = tem;
 }
 
 /* Return 1 if currently emitting into a sequence.  */
@@ -5567,6 +5598,8 @@ init_emit (void)
   first_label_num = label_num;
   seq_stack = NULL;
 
+  crtl->emit.free_sequence_stack = NULL;
+
   /* Init the tables that describe all the pseudo regs.  */
 
   crtl->emit.regno_pointer_align_length = LAST_VIRTUAL_REGISTER + 101;
@@ -5574,7 +5607,7 @@ init_emit (void)
   crtl->emit.regno_pointer_align
     = XCNEWVEC (unsigned char, crtl->emit.regno_pointer_align_length);
 
-  regno_reg_rtx = ggc_alloc_vec_rtx (crtl->emit.regno_pointer_align_length);
+  regno_reg_rtx = XNEWVEC (rtx, crtl->emit.regno_pointer_align_length);
 
   /* Put copies of all the hard registers into regno_reg_rtx.  */
   memcpy (regno_reg_rtx,
@@ -5669,7 +5702,9 @@ gen_rtx_CONST_VECTOR (enum machine_mode 
 	return CONST1_RTX (mode);
     }
 
-  return gen_rtx_raw_CONST_VECTOR (mode, v);
+  x = gen_rtx_raw_CONST_VECTOR (mode, v);
+
+  return x;
 }
 
 /* Initialise global register information required by all functions.  */
@@ -5727,21 +5762,23 @@ init_emit_once (void)
   enum machine_mode mode;
   enum machine_mode double_mode;
 
+  rtl_on_permanent_obstack ();
+
   /* Initialize the CONST_INT, CONST_DOUBLE, CONST_FIXED, and memory attribute
      hash tables.  */
-  const_int_htab = htab_create_ggc (37, const_int_htab_hash,
-				    const_int_htab_eq, NULL);
+  const_int_htab = htab_create (37, const_int_htab_hash,
+				const_int_htab_eq, NULL);
 
-  const_double_htab = htab_create_ggc (37, const_double_htab_hash,
-				       const_double_htab_eq, NULL);
+  const_double_htab = htab_create (37, const_double_htab_hash,
+				   const_double_htab_eq, NULL);
 
-  const_fixed_htab = htab_create_ggc (37, const_fixed_htab_hash,
-				      const_fixed_htab_eq, NULL);
+  const_fixed_htab = htab_create (37, const_fixed_htab_hash,
+				  const_fixed_htab_eq, NULL);
 
-  mem_attrs_htab = htab_create_ggc (37, mem_attrs_htab_hash,
-				    mem_attrs_htab_eq, NULL);
-  reg_attrs_htab = htab_create_ggc (37, reg_attrs_htab_hash,
-				    reg_attrs_htab_eq, NULL);
+  mem_attrs_htab = htab_create (37, mem_attrs_htab_hash,
+				mem_attrs_htab_eq, NULL);
+  reg_attrs_htab = htab_create (37, reg_attrs_htab_hash,
+				reg_attrs_htab_eq, NULL);
 
   /* Compute the word and byte modes.  */
 
@@ -5972,6 +6009,7 @@ init_emit_once (void)
   const_tiny_rtx[0][(int) BImode] = const0_rtx;
   if (STORE_FLAG_VALUE == 1)
     const_tiny_rtx[1][(int) BImode] = const1_rtx;
+  rtl_pop_obstack ();
 }
 \f
 /* Produce exact duplicate of insn INSN after AFTER.
@@ -6039,15 +6077,18 @@ emit_copy_of_insn_after (rtx insn, rtx a
   return new_rtx;
 }
 
-static GTY((deletable)) rtx hard_reg_clobbers [NUM_MACHINE_MODES][FIRST_PSEUDO_REGISTER];
+static rtx hard_reg_clobbers [NUM_MACHINE_MODES][FIRST_PSEUDO_REGISTER];
 rtx
 gen_hard_reg_clobber (enum machine_mode mode, unsigned int regno)
 {
-  if (hard_reg_clobbers[mode][regno])
-    return hard_reg_clobbers[mode][regno];
-  else
-    return (hard_reg_clobbers[mode][regno] =
-	    gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (mode, regno)));
+  if (!hard_reg_clobbers[mode][regno])
+    {
+      rtl_on_permanent_obstack ();
+      hard_reg_clobbers[mode][regno]
+	= gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (mode, regno));
+      rtl_pop_obstack ();
+    }
+  return hard_reg_clobbers[mode][regno];
 }
 
 #include "gt-emit-rtl.h"
Index: cfgexpand.c
===================================================================
--- cfgexpand.c	(revision 162821)
+++ cfgexpand.c	(working copy)
@@ -1234,6 +1234,7 @@ init_vars_expansion (void)
 {
   tree t;
   unsigned ix;
+
   /* Set TREE_USED on all variables in the local_decls.  */
   FOR_EACH_LOCAL_DECL (cfun, ix, t)
     TREE_USED (t) = 1;
@@ -1252,8 +1253,11 @@ fini_vars_expansion (void)
 {
   size_t i, n = stack_vars_num;
   for (i = 0; i < n; i++)
-    BITMAP_FREE (stack_vars[i].conflicts);
-  XDELETEVEC (stack_vars);
+    {
+      VEC_safe_push (tree, gc, cfun->stack_vars, stack_vars[i].decl);
+      BITMAP_FREE (stack_vars[i].conflicts);
+    }
+
   XDELETEVEC (stack_vars_sorted);
   stack_vars = NULL;
   stack_vars_alloc = stack_vars_num = 0;
@@ -2337,9 +2341,11 @@ expand_debug_expr (tree exp)
       if (op0)
 	return op0;
 
+      rtl_on_permanent_obstack ();
       op0 = gen_rtx_DEBUG_EXPR (mode);
       DEBUG_EXPR_TREE_DECL (op0) = exp;
       SET_DECL_RTL (exp, op0);
+      rtl_pop_obstack ();
 
       return op0;
 
@@ -3108,6 +3114,7 @@ expand_debug_locations (void)
 	  val = NULL_RTX;
 	else
 	  {
+	    VEC_safe_push (tree, gc, cfun->stack_vars, value);
 	    val = expand_debug_expr (value);
 	    gcc_assert (last == get_last_insn ());
 	  }
@@ -3287,6 +3294,9 @@ expand_gimple_basic_block (basic_block b
 		    rtx val;
 		    enum machine_mode mode;
 
+		    VEC_safe_push (tree, gc, cfun->stack_vars, vexpr);
+		    VEC_safe_push (tree, gc, cfun->stack_vars, value);
+
 		    set_curr_insn_source_location (gimple_location (def));
 		    set_curr_insn_block (gimple_block (def));
 
@@ -3303,6 +3313,7 @@ expand_gimple_basic_block (basic_block b
 
 		    val = emit_debug_insn (val);
 
+		    VEC_safe_push (tree, gc, cfun->stack_vars, vexpr);
 		    FOR_EACH_IMM_USE_STMT (debugstmt, imm_iter, op)
 		      {
 			if (!gimple_debug_bind_p (debugstmt))
@@ -3357,6 +3368,7 @@ expand_gimple_basic_block (basic_block b
 	      else
 		mode = TYPE_MODE (TREE_TYPE (var));
 
+	      VEC_safe_push (tree, gc, cfun->stack_vars, value);
 	      val = gen_rtx_VAR_LOCATION
 		(mode, var, (rtx)value, VAR_INIT_STATUS_INITIALIZED);
 
Index: cselib.c
===================================================================
--- cselib.c	(revision 162821)
+++ cselib.c	(working copy)
@@ -36,7 +36,6 @@ along with GCC; see the file COPYING3.  
 #include "diagnostic-core.h"
 #include "toplev.h"
 #include "output.h"
-#include "ggc.h"
 #include "hashtab.h"
 #include "tree-pass.h"
 #include "cselib.h"
@@ -165,7 +164,7 @@ static unsigned int n_used_regs;
 
 /* We pass this to cselib_invalidate_mem to invalidate all of
    memory for a non-const call instruction.  */
-static GTY(()) rtx callmem;
+static rtx callmem;
 
 /* Set by discard_useless_locs if it deleted the last location of any
    value.  */
@@ -2241,8 +2240,10 @@ cselib_init (int record_what)
 
   /* (mem:BLK (scratch)) is a special mechanism to conflict with everything,
      see canon_true_dependence.  This is only created once.  */
+  rtl_on_permanent_obstack ();
   if (! callmem)
     callmem = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (VOIDmode));
+  rtl_pop_obstack ();
 
   cselib_nregs = max_reg_num ();
 
@@ -2377,4 +2378,3 @@ dump_cselib_table (FILE *out)
   fprintf (out, "next uid %i\n", next_uid);
 }
 
-#include "gt-cselib.h"
Index: explow.c
===================================================================
--- explow.c	(revision 162821)
+++ explow.c	(working copy)
@@ -37,7 +37,6 @@ along with GCC; see the file COPYING3.  
 #include "libfuncs.h"
 #include "hard-reg-set.h"
 #include "insn-config.h"
-#include "ggc.h"
 #include "recog.h"
 #include "langhooks.h"
 #include "target.h"
@@ -1344,13 +1343,15 @@ allocate_dynamic_stack_space (rtx size, 
    run-time routine to call to check the stack, so provide a mechanism for
    calling that routine.  */
 
-static GTY(()) rtx stack_check_libfunc;
+static rtx stack_check_libfunc;
 
 void
 set_stack_check_libfunc (const char *libfunc_name)
 {
   gcc_assert (stack_check_libfunc == NULL_RTX);
+  rtl_on_permanent_obstack ();
   stack_check_libfunc = gen_rtx_SYMBOL_REF (Pmode, libfunc_name);
+  rtl_pop_obstack ();
 }
 \f
 /* Emit one stack probe at ADDRESS, an address within the stack.  */
@@ -1759,4 +1760,3 @@ rtx_to_tree_code (enum rtx_code code)
   return ((int) tcode);
 }
 
-#include "gt-explow.h"
Index: cfglayout.h
===================================================================
--- cfglayout.h	(revision 162821)
+++ cfglayout.h	(working copy)
@@ -22,8 +22,8 @@
 
 #include "basic-block.h"
 
-extern GTY(()) rtx cfg_layout_function_footer;
-extern GTY(()) rtx cfg_layout_function_header;
+extern rtx cfg_layout_function_footer;
+extern rtx cfg_layout_function_header;
 
 extern void cfg_layout_initialize (unsigned int);
 extern void cfg_layout_finalize (void);
Index: varasm.c
===================================================================
--- varasm.c	(revision 162821)
+++ varasm.c	(working copy)
@@ -178,13 +178,13 @@ static GTY(()) section *unnamed_sections
 static GTY((param_is (section))) htab_t section_htab;
 
 /* A table of object_blocks, indexed by section.  */
-static GTY((param_is (struct object_block))) htab_t object_block_htab;
+static htab_t object_block_htab;
 
 /* The next number to use for internal anchor labels.  */
 static GTY(()) int anchor_labelno;
 
 /* A pool of constants that can be shared between functions.  */
-static GTY(()) struct rtx_constant_pool *shared_constant_pool;
+static struct rtx_constant_pool *shared_constant_pool;
 
 /* Helper routines for maintaining section_htab.  */
 
@@ -280,7 +280,7 @@ get_section (const char *name, unsigned 
     {
       sect = ggc_alloc_section ();
       sect->named.common.flags = flags;
-      sect->named.name = ggc_strdup (name);
+      sect->named.name = rtl_strdup (name);
       sect->named.decl = decl;
       *slot = sect;
     }
@@ -327,7 +327,7 @@ get_block_for_section (section *sect)
   block = (struct object_block *) *slot;
   if (block == NULL)
     {
-      block = ggc_alloc_cleared_object_block ();
+      block = XCNEW (struct object_block);
       block->sect = sect;
       *slot = block;
     }
@@ -347,7 +347,7 @@ create_block_symbol (const char *label, 
 
   /* Create the extended SYMBOL_REF.  */
   size = RTX_HDR_SIZE + sizeof (struct block_symbol);
-  symbol = ggc_alloc_zone_rtx_def (size, &rtl_zone);
+  symbol = (rtx)obstack_alloc (rtl_obstack, size);
 
   /* Initialize the normal SYMBOL_REF fields.  */
   memset (symbol, 0, size);
@@ -383,7 +383,7 @@ initialize_cold_section_name (void)
       stripped_name = targetm.strip_name_encoding (name);
 
       buffer = ACONCAT ((stripped_name, "_unlikely", NULL));
-      crtl->subsections.unlikely_text_section_name = ggc_strdup (buffer);
+      crtl->subsections.unlikely_text_section_name = rtl_strdup (buffer);
     }
   else
     crtl->subsections.unlikely_text_section_name =  UNLIKELY_EXECUTED_TEXT_SECTION_NAME;
@@ -1038,7 +1038,14 @@ make_decl_rtl (tree decl)
       /* If the old RTL had the wrong mode, fix the mode.  */
       x = DECL_RTL (decl);
       if (GET_MODE (x) != DECL_MODE (decl))
-	SET_DECL_RTL (decl, adjust_address_nv (x, DECL_MODE (decl), 0));
+	{
+	  rtx newx;
+
+	  rtl_on_permanent_obstack ();
+	  newx = adjust_address_nv (x, DECL_MODE (decl), 0);
+	  rtl_pop_obstack ();
+	  SET_DECL_RTL (decl, newx);
+	}
 
       if (TREE_CODE (decl) != FUNCTION_DECL && DECL_REGISTER (decl))
 	return;
@@ -1112,6 +1119,9 @@ make_decl_rtl (tree decl)
 		     "optimization may eliminate reads and/or "
 		     "writes to register variables");
 
+	  if (TREE_STATIC (decl))
+	    rtl_on_permanent_obstack ();
+
 	  /* If the user specified one of the eliminables registers here,
 	     e.g., FRAME_POINTER_REGNUM, we don't want to get this variable
 	     confused with that register and be eliminated.  This usage is
@@ -1122,6 +1132,9 @@ make_decl_rtl (tree decl)
 	  REG_USERVAR_P (DECL_RTL (decl)) = 1;
 
 	  if (TREE_STATIC (decl))
+	    rtl_pop_obstack ();
+
+	  if (TREE_STATIC (decl))
 	    {
 	      /* Make this register global, so not usable for anything
 		 else.  */
@@ -1168,6 +1181,8 @@ make_decl_rtl (tree decl)
   if (TREE_CODE (decl) == VAR_DECL && DECL_WEAK (decl))
     DECL_COMMON (decl) = 0;
 
+  rtl_on_permanent_obstack ();
+
   if (use_object_blocks_p () && use_blocks_for_decl_p (decl))
     x = create_block_symbol (name, get_block_for_decl (decl), -1);
   else
@@ -1188,6 +1203,8 @@ make_decl_rtl (tree decl)
     set_mem_attributes (x, decl, 1);
   SET_DECL_RTL (decl, x);
 
+  rtl_pop_obstack ();
+
   /* Optionally set flags or add text to the name to record information
      such as that it is a function name.
      If the name is changed, the macro ASM_OUTPUT_LABELREF
@@ -1424,13 +1441,13 @@ assemble_start_function (tree decl, cons
   if (flag_reorder_blocks_and_partition)
     {
       ASM_GENERATE_INTERNAL_LABEL (tmp_label, "LHOTB", const_labelno);
-      crtl->subsections.hot_section_label = ggc_strdup (tmp_label);
+      crtl->subsections.hot_section_label = rtl_strdup (tmp_label);
       ASM_GENERATE_INTERNAL_LABEL (tmp_label, "LCOLDB", const_labelno);
-      crtl->subsections.cold_section_label = ggc_strdup (tmp_label);
+      crtl->subsections.cold_section_label = rtl_strdup (tmp_label);
       ASM_GENERATE_INTERNAL_LABEL (tmp_label, "LHOTE", const_labelno);
-      crtl->subsections.hot_section_end_label = ggc_strdup (tmp_label);
+      crtl->subsections.hot_section_end_label = rtl_strdup (tmp_label);
       ASM_GENERATE_INTERNAL_LABEL (tmp_label, "LCOLDE", const_labelno);
-      crtl->subsections.cold_section_end_label = ggc_strdup (tmp_label);
+      crtl->subsections.cold_section_end_label = rtl_strdup (tmp_label);
       const_labelno++;
     }
   else
@@ -2199,7 +2216,7 @@ assemble_static_space (unsigned HOST_WID
 
   ASM_GENERATE_INTERNAL_LABEL (name, "LF", const_labelno);
   ++const_labelno;
-  namestring = ggc_strdup (name);
+  namestring = rtl_strdup (name);
 
   x = gen_rtx_SYMBOL_REF (Pmode, namestring);
   SYMBOL_REF_FLAGS (x) = SYMBOL_FLAG_LOCAL;
@@ -2230,7 +2247,7 @@ assemble_static_space (unsigned HOST_WID
    This is done at most once per compilation.
    Returns an RTX for the address of the template.  */
 
-static GTY(()) rtx initial_trampoline;
+static rtx initial_trampoline;
 
 rtx
 assemble_trampoline_template (void)
@@ -2245,6 +2262,8 @@ assemble_trampoline_template (void)
   if (initial_trampoline)
     return initial_trampoline;
 
+  rtl_on_permanent_obstack ();
+
   /* By default, put trampoline templates in read-only data section.  */
 
 #ifdef TRAMPOLINE_SECTION
@@ -2263,7 +2282,7 @@ assemble_trampoline_template (void)
 
   /* Record the rtl to refer to it.  */
   ASM_GENERATE_INTERNAL_LABEL (label, "LTRAMP", 0);
-  name = ggc_strdup (label);
+  name = rtl_strdup (label);
   symbol = gen_rtx_SYMBOL_REF (Pmode, name);
   SYMBOL_REF_FLAGS (symbol) = SYMBOL_FLAG_LOCAL;
 
@@ -2271,6 +2290,8 @@ assemble_trampoline_template (void)
   set_mem_align (initial_trampoline, TRAMPOLINE_ALIGNMENT);
   set_mem_size (initial_trampoline, GEN_INT (TRAMPOLINE_SIZE));
 
+  rtl_pop_obstack ();
+
   return initial_trampoline;
 }
 \f
@@ -2443,7 +2464,7 @@ assemble_real (REAL_VALUE_TYPE d, enum m
    Store them both in the structure *VALUE.
    EXP must be reducible.  */
 
-struct GTY(()) addr_const {
+struct addr_const {
   rtx base;
   HOST_WIDE_INT offset;
 };
@@ -2903,6 +2924,8 @@ copy_constant (tree exp)
     }
 }
 \f
+static GTY(()) VEC(tree,gc) *saved_constant_decls;
+
 /* Return the section into which constant EXP should be placed.  */
 
 static section *
@@ -2977,18 +3000,21 @@ build_constant_desc (tree exp)
   else
     align_variable (decl, 0);
 
+  rtl_on_permanent_obstack ();
+
   /* Now construct the SYMBOL_REF and the MEM.  */
   if (use_object_blocks_p ())
     {
       section *sect = get_constant_section (exp, DECL_ALIGN (decl));
-      symbol = create_block_symbol (ggc_strdup (label),
+      symbol = create_block_symbol (rtl_strdup (label),
 				    get_block_for_section (sect), -1);
     }
   else
-    symbol = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup (label));
+    symbol = gen_rtx_SYMBOL_REF (Pmode, rtl_strdup (label));
   SYMBOL_REF_FLAGS (symbol) |= SYMBOL_FLAG_LOCAL;
   SET_SYMBOL_REF_DECL (symbol, decl);
   TREE_CONSTANT_POOL_ADDRESS_P (symbol) = 1;
+  VEC_safe_push (tree, gc, saved_constant_decls, decl);
 
   rtl = gen_rtx_MEM (TYPE_MODE (TREE_TYPE (exp)), symbol);
   set_mem_attributes (rtl, exp, 1);
@@ -3008,6 +3034,8 @@ build_constant_desc (tree exp)
 
   desc->rtl = rtl;
 
+  rtl_pop_obstack ();
+
   return desc;
 }
 
@@ -3188,7 +3216,7 @@ tree_output_constant_def (tree exp)
    can use one per-file pool.  Should add a targetm bit to tell the
    difference.  */
 
-struct GTY(()) rtx_constant_pool {
+struct rtx_constant_pool {
   /* Pointers to first and last constant in pool, as ordered by offset.  */
   struct constant_descriptor_rtx *first;
   struct constant_descriptor_rtx *last;
@@ -3197,14 +3225,14 @@ struct GTY(()) rtx_constant_pool {
      It is used on RISC machines where immediate integer arguments and
      constant addresses are restricted so that such constants must be stored
      in memory.  */
-  htab_t GTY((param_is (struct constant_descriptor_rtx))) const_rtx_htab;
+  htab_t const_rtx_htab;
 
   /* Current offset in constant pool (does not include any
      machine-specific header).  */
   HOST_WIDE_INT offset;
 };
 
-struct GTY((chain_next ("%h.next"))) constant_descriptor_rtx {
+struct constant_descriptor_rtx {
   struct constant_descriptor_rtx *next;
   rtx mem;
   rtx sym;
@@ -3337,8 +3365,8 @@ create_constant_pool (void)
 {
   struct rtx_constant_pool *pool;
 
-  pool = ggc_alloc_rtx_constant_pool ();
-  pool->const_rtx_htab = htab_create_ggc (31, const_desc_rtx_hash,
+  pool = XNEW (struct rtx_constant_pool);
+  pool->const_rtx_htab = htab_create (31, const_desc_rtx_hash,
-					  const_desc_rtx_eq, NULL);
+				      const_desc_rtx_eq, NULL);
   pool->first = NULL;
   pool->last = NULL;
@@ -3403,7 +3431,7 @@ force_const_mem (enum machine_mode mode,
     return copy_rtx (desc->mem);
 
   /* Otherwise, create a new descriptor.  */
-  desc = ggc_alloc_constant_descriptor_rtx ();
+  desc = XNEW (struct constant_descriptor_rtx);
   *slot = desc;
 
   /* Align the location counter as required by EXP's data type.  */
@@ -3435,6 +3463,9 @@ force_const_mem (enum machine_mode mode,
     pool->first = pool->last = desc;
   pool->last = desc;
 
+  if (pool == shared_constant_pool)
+    rtl_on_permanent_obstack ();
+
   /* Create a string containing the label name, in LABEL.  */
   ASM_GENERATE_INTERNAL_LABEL (label, "LC", const_labelno);
   ++const_labelno;
@@ -3444,11 +3475,11 @@ force_const_mem (enum machine_mode mode,
   if (use_object_blocks_p () && targetm.use_blocks_for_constant_p (mode, x))
     {
       section *sect = targetm.asm_out.select_rtx_section (mode, x, align);
-      symbol = create_block_symbol (ggc_strdup (label),
+      symbol = create_block_symbol (rtl_strdup (label),
 				    get_block_for_section (sect), -1);
     }
   else
-    symbol = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup (label));
+    symbol = gen_rtx_SYMBOL_REF (Pmode, rtl_strdup (label));
   desc->sym = symbol;
   SYMBOL_REF_FLAGS (symbol) |= SYMBOL_FLAG_LOCAL;
   CONSTANT_POOL_ADDRESS_P (symbol) = 1;
@@ -3464,7 +3495,11 @@ force_const_mem (enum machine_mode mode,
   if (GET_CODE (x) == LABEL_REF)
     LABEL_PRESERVE_P (XEXP (x, 0)) = 1;
 
-  return copy_rtx (def);
+  def = copy_rtx (def);
+  if (pool == shared_constant_pool)
+    rtl_pop_obstack ();
+
+  return def;
 }
 \f
 /* Given a constant pool SYMBOL_REF, return the corresponding constant.  */
@@ -5640,7 +5675,7 @@ init_varasm_once (void)
 {
   section_htab = htab_create_ggc (31, section_entry_hash,
 				  section_entry_eq, NULL);
-  object_block_htab = htab_create_ggc (31, object_block_entry_hash,
+  object_block_htab = htab_create (31, object_block_entry_hash,
-				       object_block_entry_eq, NULL);
+				   object_block_entry_eq, NULL);
   const_desc_htab = htab_create_ggc (1009, const_desc_hash,
 				     const_desc_eq, NULL);
@@ -6685,7 +6720,7 @@ get_section_anchor (struct object_block 
 
   /* Create a new anchor with a unique label.  */
   ASM_GENERATE_INTERNAL_LABEL (label, "LANCHOR", anchor_labelno++);
-  anchor = create_block_symbol (ggc_strdup (label), block, offset);
+  anchor = create_block_symbol (rtl_strdup (label), block, offset);
   SYMBOL_REF_FLAGS (anchor) |= SYMBOL_FLAG_LOCAL | SYMBOL_FLAG_ANCHOR;
   SYMBOL_REF_FLAGS (anchor) |= model << SYMBOL_FLAG_TLS_SHIFT;
 
@@ -6884,6 +6919,7 @@ make_debug_expr_from_rtl (const_rtx exp)
   enum machine_mode mode = GET_MODE (exp);
   rtx dval;
 
+  VEC_safe_push (tree, gc, cfun->stack_vars, ddecl);
   DECL_ARTIFICIAL (ddecl) = 1;
   if (REG_P (exp) && REG_EXPR (exp))
     type = TREE_TYPE (REG_EXPR (exp));
Index: ira.c
===================================================================
--- ira.c	(revision 162821)
+++ ira.c	(working copy)
@@ -1693,7 +1693,7 @@ fix_reg_equiv_init (void)
 
   if (reg_equiv_init_size < max_regno)
     {
-      reg_equiv_init = GGC_RESIZEVEC (rtx, reg_equiv_init, max_regno);
+      reg_equiv_init = XRESIZEVEC (rtx, reg_equiv_init, max_regno);
       while (reg_equiv_init_size < max_regno)
 	reg_equiv_init[reg_equiv_init_size++] = NULL_RTX;
       for (i = FIRST_PSEUDO_REGISTER; i < reg_equiv_init_size; i++)
@@ -2272,7 +2272,7 @@ update_equiv_regs (void)
   recorded_label_ref = 0;
 
   reg_equiv = XCNEWVEC (struct equivalence, max_regno);
-  reg_equiv_init = ggc_alloc_cleared_vec_rtx (max_regno);
+  reg_equiv_init = XCNEWVEC (rtx, max_regno);
   reg_equiv_init_size = max_regno;
 
   init_alias_analysis ();
Index: rtl.c
===================================================================
--- rtl.c	(revision 162821)
+++ rtl.c	(working copy)
@@ -31,12 +31,19 @@ along with GCC; see the file COPYING3.  
 #include "tm.h"
 #include "rtl.h"
 #include "ggc.h"
+#include "obstack.h"
 #ifdef GENERATOR_FILE
 # include "errors.h"
 #else
 # include "diagnostic-core.h"
 #endif
 
+/* Obstacks used for allocating RTL objects: FUNCTION_OBSTACK holds
+   per-function RTL, PERMANENT_OBSTACK holds RTL that must survive the
+   whole compilation.  */
+
+static struct obstack function_obstack, permanent_obstack;
+struct obstack *rtl_obstack = &function_obstack;
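+
+/* The low-water mark on function_obstack; objects allocated below this
+   point are preserved by free_function_rtl.  */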
+static char *rtl_firstobj;
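+/* Depth of rtl_on_permanent_obstack nesting.  */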
+static int permanent_nesting;
 \f
 /* Indexed by rtx code, gives number of operands for an rtx with that code.
    Does NOT include rtx header data (code and links).  */
@@ -139,7 +146,6 @@ static int rtx_alloc_sizes[(int) LAST_AN
 static int rtvec_alloc_counts;
 static int rtvec_alloc_sizes;
 #endif
-
 \f
 /* Allocate an rtx vector of N elements.
    Store the length, and initialize all elements to zero.  */
@@ -149,7 +155,9 @@ rtvec_alloc (int n)
 {
   rtvec rt;
 
-  rt = ggc_alloc_rtvec_sized (n);
+  rt = (rtvec) obstack_alloc (rtl_obstack,
+			      sizeof (struct rtvec_def)
+			      + ((n - 1) * sizeof (rtunion)));
   /* Clear out the vector.  */
   memset (&rt->elem[0], 0, n * sizeof (rtx));
 
@@ -193,8 +201,24 @@ rtx_size (const_rtx x)
 rtx
 rtx_alloc_stat (RTX_CODE code MEM_STAT_DECL)
 {
-  rtx rt = ggc_alloc_zone_rtx_def_stat (&rtl_zone, RTX_CODE_SIZE (code)
-                                        PASS_MEM_STAT);
+  rtx rt;
+  struct obstack *ob = rtl_obstack;
+  register int length = RTX_CODE_SIZE (code);
+
+  /* This function is called very frequently, so we manipulate the
+     obstack directly.
+
+     Even though rtx objects are word aligned, we may be sharing an obstack
+     with tree nodes, which may have to be double-word aligned.  So align
+     our length to the alignment mask in the obstack.  */
+
+  length = (length + ob->alignment_mask) & ~ ob->alignment_mask;
+
+  if (ob->chunk_limit - ob->next_free < length)
+    _obstack_newchunk (ob, length);
+  rt = (rtx)ob->object_base;
+  ob->next_free += length;
+  ob->object_base = ob->next_free;
 
   /* We want to clear everything up to the FLD array.  Normally, this
      is one int, but we don't want to assume that and it isn't very
@@ -333,10 +357,10 @@ copy_rtx (rtx orig)
 /* Create a new copy of an rtx.  Only copy just one level.  */
 
 rtx
-shallow_copy_rtx_stat (const_rtx orig MEM_STAT_DECL)
+shallow_copy_rtx (const_rtx orig)
 {
   const unsigned int size = rtx_size (orig);
-  rtx const copy = ggc_alloc_zone_rtx_def_stat (&rtl_zone, size PASS_MEM_STAT);
+  rtx const copy = rtx_alloc (GET_CODE (orig));
   return (rtx) memcpy (copy, orig, size);
 }
 \f
@@ -721,3 +745,199 @@ rtl_check_failed_flag (const char *name,
      name, GET_RTX_NAME (GET_CODE (r)), func, trim_filename (file), line);
 }
 #endif /* ENABLE_RTL_FLAG_CHECKING */
+\f
+#if 0
+/* Allocate an rtx vector of N elements.
+   Store the length, and initialize all elements to zero.  */
+
+rtvec
+ggc_rtvec_alloc (int n)
+{
+  rtvec rt;
+
+  rt = ggc_alloc_rtvec_sized (n);
+  /* Clear out the vector.  */
+  memset (&rt->elem[0], 0, n * sizeof (rtx));
+
+  PUT_NUM_ELEM (rt, n);
+
+#ifdef GATHER_STATISTICS
+  rtvec_alloc_counts++;
+  rtvec_alloc_sizes += n * sizeof (rtx);
+#endif
+
+  return rt;
+}
+
+/* Allocate an rtx of code CODE.  The CODE is stored in the rtx;
+   all the rest is initialized to zero.  */
+
+rtx
+ggc_rtx_alloc_stat (RTX_CODE code MEM_STAT_DECL)
+{
+  rtx rt = ggc_alloc_zone_rtx_def_stat (&rtl_zone, RTX_CODE_SIZE (code)
+                                        PASS_MEM_STAT);
+
+  /* We want to clear everything up to the FLD array.  Normally, this
+     is one int, but we don't want to assume that and it isn't very
+     portable anyway; this is.  */
+
+  memset (rt, 0, RTX_HDR_SIZE);
+  PUT_CODE (rt, code);
+
+#ifdef GATHER_STATISTICS
+  rtx_alloc_counts[code]++;
+  rtx_alloc_sizes[code] += RTX_CODE_SIZE (code);
+#endif
+
+  return rt;
+}
+
+/* Create a new copy of an rtx.  Only copy just one level.  */
+
+static rtx
+ggc_shallow_copy_rtx_stat (const_rtx orig MEM_STAT_DECL)
+{
+  const unsigned int size = rtx_size (orig);
+  rtx const copy = ggc_alloc_zone_rtx_def_stat (&rtl_zone, size PASS_MEM_STAT);
+  return (rtx) memcpy (copy, orig, size);
+}
+
+#define ggc_shallow_copy_rtx(a) ggc_shallow_copy_rtx_stat (a MEM_STAT_INFO)
+
+rtx
+copy_rtx_to_ggc (rtx orig)
+{
+  rtx copy;
+  int i, j;
+  RTX_CODE code;
+  const char *format_ptr;
+
+  code = GET_CODE (orig);
+
+  switch (code)
+    {
+    case REG:
+    case CC0:
+    case SCRATCH:
+    case INSN:
+    case SET:
+    case CLOBBER:
+    case JUMP_INSN:
+    case CALL_INSN:
+    case DEBUG_INSN:
+    case DEBUG_EXPR:
+      gcc_unreachable ();
+
+    case CONST_INT:
+    case CONST_DOUBLE:
+    case CONST_FIXED:
+    case CONST_VECTOR:
+      /* These are shared and in ggc memory already.  */
+      return orig;
+
+    default:
+      break;
+    }
+
+  /* Copy the various flags, fields, and other information.  We assume
+     that all fields need copying, and then clear the fields that should
+     not be copied.  That is the sensible default behavior, and forces
+     us to explicitly document why we are *not* copying a flag.  */
+  copy = ggc_shallow_copy_rtx (orig);
+
+  /* We do not copy the USED flag, which is used as a mark bit during
+     walks over the RTL.  */
+  RTX_FLAG (copy, used) = 0;
+
+  RTX_FLAG (copy, jump) = RTX_FLAG (orig, jump);
+  RTX_FLAG (copy, call) = RTX_FLAG (orig, call);
+
+  format_ptr = GET_RTX_FORMAT (GET_CODE (copy));
+
+  for (i = 0; i < GET_RTX_LENGTH (GET_CODE (copy)); i++)
+    switch (*format_ptr++)
+      {
+      case 'e':
+	if (XEXP (orig, i) != NULL)
+	  XEXP (copy, i) = copy_rtx_to_ggc (XEXP (orig, i));
+	break;
+
+      case 'E':
+      case 'V':
+	if (XVEC (orig, i) != NULL)
+	  {
+	    XVEC (copy, i) = ggc_rtvec_alloc (XVECLEN (orig, i));
+	    for (j = 0; j < XVECLEN (copy, i); j++)
+	      XVECEXP (copy, i, j) = copy_rtx_to_ggc (XVECEXP (orig, i, j));
+	  }
+	break;
+
+      case 't':
+      case 'w':
+      case 'i':
+      case 's':
+      case 'S':
+      case 'T':
+      case 'u':
+      case 'B':
+      case '0':
+	/* These are left unchanged.  */
+	break;
+
+      default:
+	gcc_unreachable ();
+      }
+  return copy;
+}
+#endif
+
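+/* Return a copy of X allocated on the permanent obstack.  */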
+rtx
+permanent_copy_rtx (rtx x)
+{
+  rtl_on_permanent_obstack ();
+  x = copy_rtx (x);
+  rtl_pop_obstack ();
+  return x;
+}
+
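+/* Like xstrdup, but place the copy on the permanent obstack.  */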
+char *
+rtl_strdup (const char *s)
+{
+  size_t len = strlen (s);
+  char *t = XOBNEWVAR (&permanent_obstack, char, len + 1);
+  memcpy (t, s, len + 1);
+  return t;
+}
+
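+/* Mark all RTL currently on the function obstack as preserved;
+   free_function_rtl will only release objects allocated after this
+   point.  */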
+void
+preserve_rtl (void)
+{
+  rtl_firstobj = (char *) obstack_alloc (&function_obstack, 0);
+}
+
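+/* Initialize the RTL obstacks and the preserved-RTL low-water mark.  */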
+void
+init_rtl (void)
+{
+  gcc_obstack_init (&permanent_obstack);
+  gcc_obstack_init (&function_obstack);
+  rtl_firstobj = (char *) obstack_alloc (rtl_obstack, 0);
+}
+
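+/* Release all function-local RTL allocated since the last call to
+   init_rtl or preserve_rtl.  */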
+void
+free_function_rtl (void)
+{
+  gcc_assert (permanent_nesting == 0);
+  obstack_free (&function_obstack, rtl_firstobj);
+}
+
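+/* Switch RTL allocation to the permanent obstack.  Calls nest; the
+   matching rtl_pop_obstack restores the function obstack once the
+   outermost level is popped.  */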
+void
+rtl_on_permanent_obstack (void)
+{
+  permanent_nesting++;
+  rtl_obstack = &permanent_obstack;
+}
+
+void
+rtl_pop_obstack (void)
+{
+  gcc_assert (permanent_nesting > 0);
+  if (--permanent_nesting == 0)
+    rtl_obstack = &function_obstack;
+}
Index: rtl.h
===================================================================
--- rtl.h	(revision 162821)
+++ rtl.h	(working copy)
@@ -143,8 +143,8 @@ typedef struct
 typedef struct GTY(()) mem_attrs
 {
   tree expr;			/* expr corresponding to MEM.  */
-  rtx offset;			/* Offset from start of DECL, as CONST_INT.  */
-  rtx size;			/* Size in bytes, as a CONST_INT.  */
+  rtx GTY((skip)) offset;	/* Offset from start of DECL, as CONST_INT.  */
+  rtx GTY((skip)) size;		/* Size in bytes, as a CONST_INT.  */
   alias_set_type alias;		/* Memory alias set.  */
   unsigned int align;		/* Alignment of MEM in bits.  */
   unsigned char addrspace;	/* Address space (0 for generic).  */
@@ -185,9 +185,9 @@ typedef union rtunion_def rtunion;
 /* This structure remembers the position of a SYMBOL_REF within an
    object_block structure.  A SYMBOL_REF only provides this information
    if SYMBOL_REF_HAS_BLOCK_INFO_P is true.  */
-struct GTY(()) block_symbol {
+struct block_symbol {
   /* The usual SYMBOL_REF fields.  */
-  rtunion GTY ((skip)) fld[3];
+  rtunion fld[3];
 
   /* The block that contains this object.  */
   struct object_block *block;
@@ -199,7 +199,7 @@ struct GTY(()) block_symbol {
 
 /* Describes a group of objects that are to be placed together in such
    a way that their relative positions are known.  */
-struct GTY(()) object_block {
+struct object_block {
   /* The section in which these objects should be placed.  */
   section *sect;
 
@@ -232,8 +232,7 @@ struct GTY(()) object_block {
 
 /* RTL expression ("rtx").  */
 
-struct GTY((chain_next ("RTX_NEXT (&%h)"),
-	    chain_prev ("RTX_PREV (&%h)"), variable_size)) rtx_def {
+struct rtx_def {
   /* The kind of expression this is.  */
   ENUM_BITFIELD(rtx_code) code: 16;
 
@@ -314,7 +313,7 @@ struct GTY((chain_next ("RTX_NEXT (&%h)"
     struct block_symbol block_sym;
     struct real_value rv;
     struct fixed_value fv;
-  } GTY ((special ("rtx_def"), desc ("GET_CODE (&%0)"))) u;
+  } u;
 };
 
 /* The size in bytes of an rtx header (code, mode and flags).  */
@@ -352,7 +351,7 @@ struct GTY((chain_next ("RTX_NEXT (&%h)"
    for a variable number of things.  The principal use is inside
    PARALLEL expressions.  */
 
-struct GTY((variable_size)) rtvec_def {
+struct rtvec_def {
   int num_elem;		/* number of elements */
   rtx GTY ((length ("%h.num_elem"))) elem[1];
 };
@@ -1543,6 +1542,12 @@ extern int generating_concat_p;
 /* Nonzero when we are expanding trees to RTL.  */
 extern int currently_expanding_to_rtl;
 
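+/* In rtl.c.  The obstack on which RTL is currently being allocated, and
+   routines to switch between the function and permanent obstacks.  */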
+extern struct obstack *rtl_obstack;
+
+extern void rtl_on_permanent_obstack (void);
+extern void rtl_pop_obstack (void);
+extern void free_function_rtl (void);
+
 /* Generally useful functions.  */
 
 /* In expmed.c */
@@ -1555,10 +1560,15 @@ extern rtx plus_constant (rtx, HOST_WIDE
 /* In rtl.c */
 extern rtx rtx_alloc_stat (RTX_CODE MEM_STAT_DECL);
 #define rtx_alloc(c) rtx_alloc_stat (c MEM_STAT_INFO)
+extern rtx ggc_rtx_alloc_stat (RTX_CODE MEM_STAT_DECL);
+#define ggc_rtx_alloc(c) ggc_rtx_alloc_stat (c MEM_STAT_INFO)
+extern char *rtl_strdup (const char *);
 
+extern rtvec ggc_rtvec_alloc (int);
 extern rtvec rtvec_alloc (int);
 extern rtvec shallow_copy_rtvec (rtvec);
 extern bool shared_const_p (const_rtx);
+extern rtx permanent_copy_rtx (rtx);
 extern rtx copy_rtx (rtx);
 extern void dump_rtx_statistics (void);
 
@@ -1566,9 +1576,11 @@ extern void dump_rtx_statistics (void);
 extern rtx copy_rtx_if_shared (rtx);
 
 /* In rtl.c */
+extern void init_rtl (void);
+extern void preserve_rtl (void);
+
 extern unsigned int rtx_size (const_rtx);
-extern rtx shallow_copy_rtx_stat (const_rtx MEM_STAT_DECL);
-#define shallow_copy_rtx(a) shallow_copy_rtx_stat (a MEM_STAT_INFO)
+extern rtx shallow_copy_rtx (const_rtx);
 extern int rtx_equal_p (const_rtx, const_rtx);
 
 /* In emit-rtl.c */
@@ -1917,7 +1929,7 @@ extern void remove_free_INSN_LIST_elem (
 extern rtx remove_list_elem (rtx, rtx *);
 extern rtx remove_free_INSN_LIST_node (rtx *);
 extern rtx remove_free_EXPR_LIST_node (rtx *);
-
+extern void discard_rtx_lists (void);
 
 /* reginfo.c */
 
@@ -1946,15 +1958,15 @@ extern void split_all_insns (void);
 extern unsigned int split_all_insns_noflow (void);
 
 #define MAX_SAVED_CONST_INT 64
-extern GTY(()) rtx const_int_rtx[MAX_SAVED_CONST_INT * 2 + 1];
+extern rtx const_int_rtx[MAX_SAVED_CONST_INT * 2 + 1];
 
 #define const0_rtx	(const_int_rtx[MAX_SAVED_CONST_INT])
 #define const1_rtx	(const_int_rtx[MAX_SAVED_CONST_INT+1])
 #define const2_rtx	(const_int_rtx[MAX_SAVED_CONST_INT+2])
 #define constm1_rtx	(const_int_rtx[MAX_SAVED_CONST_INT-1])
-extern GTY(()) rtx const_true_rtx;
+extern rtx const_true_rtx;
 
-extern GTY(()) rtx const_tiny_rtx[3][(int) MAX_MACHINE_MODE];
+extern rtx const_tiny_rtx[3][(int) MAX_MACHINE_MODE];
 
 /* Returns a constant 0 rtx in mode MODE.  Integer modes are treated the
    same as VOIDmode.  */
@@ -2011,7 +2023,7 @@ enum global_rtl_index
 };
 
 /* Target-dependent globals.  */
-struct GTY(()) target_rtl {
+struct target_rtl {
   /* All references to the hard registers in global_rtl_index go through
      these unique rtl objects.  On machines where the frame-pointer and
      arg-pointer are the same register, they use the same unique object.
@@ -2051,7 +2063,7 @@ struct GTY(()) target_rtl {
   rtx x_static_reg_base_value[FIRST_PSEUDO_REGISTER];
 };
 
-extern GTY(()) struct target_rtl default_target_rtl;
+extern struct target_rtl default_target_rtl;
 #if SWITCHABLE_TARGET
 extern struct target_rtl *this_target_rtl;
 #else
@@ -2437,7 +2449,7 @@ extern int stack_regs_mentioned (const_r
 #endif
 
 /* In toplev.c */
-extern GTY(()) rtx stack_limit_rtx;
+extern rtx stack_limit_rtx;
 
 /* In predict.c */
 extern void invert_br_probabilities (rtx);
Index: integrate.c
===================================================================
--- integrate.c	(revision 162821)
+++ integrate.c	(working copy)
@@ -42,7 +42,6 @@ along with GCC; see the file COPYING3.  
 #include "toplev.h"
 #include "intl.h"
 #include "params.h"
-#include "ggc.h"
 #include "target.h"
 #include "langhooks.h"
 #include "tree-pass.h"
@@ -53,14 +52,14 @@ along with GCC; see the file COPYING3.  
 \f
 
 /* Private type used by {get/has}_hard_reg_initial_val.  */
-typedef struct GTY(()) initial_value_pair {
+typedef struct initial_value_pair {
   rtx hard_reg;
   rtx pseudo;
 } initial_value_pair;
 typedef struct GTY(()) initial_value_struct {
   int num_entries;
   int max_entries;
-  initial_value_pair * GTY ((length ("%h.num_entries"))) entries;
+  initial_value_pair *entries;
 } initial_value_struct;
 
 static void set_block_origin_self (tree);
@@ -247,18 +246,28 @@ get_hard_reg_initial_val (enum machine_m
   ivs = crtl->hard_reg_initial_vals;
   if (ivs == 0)
     {
-      ivs = ggc_alloc_initial_value_struct ();
+      ivs
+	= ((struct initial_value_struct *)
+	   obstack_alloc (rtl_obstack, sizeof (struct initial_value_struct)));
       ivs->num_entries = 0;
       ivs->max_entries = 5;
-      ivs->entries = ggc_alloc_vec_initial_value_pair (5);
+      ivs->entries
+	= ((struct initial_value_pair *)obstack_alloc (rtl_obstack,
+						       5 * sizeof (struct initial_value_pair)));
       crtl->hard_reg_initial_vals = ivs;
     }
 
   if (ivs->num_entries >= ivs->max_entries)
     {
+      struct initial_value_pair *newvec;
       ivs->max_entries += 5;
-      ivs->entries = GGC_RESIZEVEC (initial_value_pair, ivs->entries,
-				    ivs->max_entries);
+      newvec
+	= ((struct initial_value_pair *)obstack_alloc (rtl_obstack,
+						       ivs->max_entries
+						       * sizeof (struct initial_value_pair)));
+      memcpy (newvec, ivs->entries,
+	      sizeof (struct initial_value_pair) * (ivs->max_entries - 5));
+      ivs->entries = newvec;
     }
 
   ivs->entries[ivs->num_entries].hard_reg = gen_rtx_REG (mode, regno);
@@ -372,5 +381,3 @@ allocate_initial_values (rtx *reg_equiv_
 	}
     }
 }
-
-#include "gt-integrate.h"
Index: combine.c
===================================================================
--- combine.c	(revision 162821)
+++ combine.c	(working copy)
@@ -367,6 +367,8 @@ struct undobuf
 
 static struct undobuf undobuf;
 
+static char *combine_firstobj;
+
 /* Number of times the pseudo being substituted for
    was found and replaced.  */
 
@@ -1147,6 +1149,8 @@ combine_instructions (rtx f, unsigned in
 		 into SUBREGs.  */
 	      note_uses (&PATTERN (insn), record_truncated_values, NULL);
 
+	      combine_firstobj = (char *) obstack_alloc (rtl_obstack, 0);
+
 	      /* Try this insn with each insn it links back to.  */
 
 	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
@@ -3508,6 +3512,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
        && (! check_asm_operands (newpat) || added_sets_1 || added_sets_2)))
     {
       undo_all ();
+      obstack_free (rtl_obstack, combine_firstobj);
       return 0;
     }
 
@@ -3523,6 +3528,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
       if (other_code_number < 0 && ! check_asm_operands (other_pat))
 	{
 	  undo_all ();
+	  obstack_free (rtl_obstack, combine_firstobj);
 	  return 0;
 	}
     }
@@ -3536,6 +3542,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	&& sets_cc0_p (newi2pat))
       {
 	undo_all ();
+	obstack_free (rtl_obstack, combine_firstobj);
 	return 0;
       }
   }
@@ -3546,6 +3553,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   if (!combine_validate_cost (i1, i2, i3, newpat, newi2pat, other_pat))
     {
       undo_all ();
+      obstack_free (rtl_obstack, combine_firstobj);
       return 0;
     }
 
@@ -4039,6 +4047,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   combine_successes++;
   undo_commit ();
 
+  combine_firstobj = (char *) obstack_alloc (rtl_obstack, 0);
+
   if (added_links_insn
       && (newi2pat == 0 || DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (i2))
       && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (i3))
Index: Makefile.in
===================================================================
--- Makefile.in	(revision 162821)
+++ Makefile.in	(working copy)
@@ -2540,8 +2540,7 @@ tree-ssa-address.o : tree-ssa-address.c 
    $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) \
    output.h $(DIAGNOSTIC_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
    $(TREE_PASS_H) $(FLAGS_H) $(TREE_INLINE_H) $(RECOG_H) insn-config.h \
-   $(EXPR_H) gt-tree-ssa-address.h $(GGC_H) tree-affine.h $(TARGET_H) \
-   tree-pretty-print.h
+   $(EXPR_H) tree-affine.h $(TARGET_H) tree-pretty-print.h
 tree-ssa-loop-niter.o : tree-ssa-loop-niter.c $(TREE_FLOW_H) $(CONFIG_H) \
    $(SYSTEM_H) $(TREE_H) $(TM_P_H) $(CFGLOOP_H) $(PARAMS_H) \
    $(TREE_INLINE_H) output.h $(DIAGNOSTIC_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
@@ -2907,7 +2906,7 @@ expr.o : expr.c $(CONFIG_H) $(SYSTEM_H) 
    $(TREE_PASS_H) $(DF_H) $(DIAGNOSTIC_H) vecprim.h $(SSAEXPAND_H)
 dojump.o : dojump.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(TREE_H) \
    $(FLAGS_H) $(FUNCTION_H) $(EXPR_H) $(OPTABS_H) $(INSN_ATTR_H) insn-config.h \
-   langhooks.h $(GGC_H) gt-dojump.h vecprim.h $(BASIC_BLOCK_H) output.h
+   langhooks.h vecprim.h $(BASIC_BLOCK_H) output.h
 builtins.o : builtins.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    $(TREE_H) $(GIMPLE_H) $(FLAGS_H) $(TARGET_H) $(FUNCTION_H) $(REGS_H) \
    $(EXPR_H) $(OPTABS_H) insn-config.h $(RECOG_H) output.h typeclass.h \
@@ -2926,7 +2925,7 @@ expmed.o : expmed.c $(CONFIG_H) $(SYSTEM
    expmed.h
 explow.o : explow.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(TREE_H) \
    $(FLAGS_H) hard-reg-set.h insn-config.h $(EXPR_H) $(OPTABS_H) $(RECOG_H) \
-   $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) $(EXCEPT_H) $(FUNCTION_H) $(GGC_H) $(TM_P_H) langhooks.h gt-explow.h \
+   $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) $(EXCEPT_H) $(FUNCTION_H) $(TM_P_H) langhooks.h \
    $(TARGET_H) output.h
 optabs.o : optabs.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    $(TREE_H) $(FLAGS_H) insn-config.h $(EXPR_H) $(OPTABS_H) $(LIBFUNCS_H) \
@@ -2972,7 +2971,7 @@ integrate.o : integrate.c $(CONFIG_H) $(
    $(RTL_H) $(TREE_H) $(FLAGS_H) debug.h $(INTEGRATE_H) insn-config.h \
    $(EXPR_H) $(REGS_H) intl.h $(FUNCTION_H) output.h $(RECOG_H) \
    $(EXCEPT_H) $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) $(PARAMS_H) $(TM_P_H) $(TARGET_H) langhooks.h \
-   gt-integrate.h $(GGC_H) $(TREE_PASS_H) $(DF_H)
+   $(TREE_PASS_H) $(DF_H)
 jump.o : jump.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    $(FLAGS_H) hard-reg-set.h $(REGS_H) insn-config.h $(RECOG_H) $(EXPR_H) \
    $(EXCEPT_H) $(FUNCTION_H) $(BASIC_BLOCK_H) $(TREE_PASS_H) \
@@ -3067,7 +3066,7 @@ coverage.o : coverage.c $(GCOV_IO_H) $(C
 cselib.o : cselib.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    $(REGS_H) hard-reg-set.h $(FLAGS_H) insn-config.h $(RECOG_H) \
    $(EMIT_RTL_H) $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) output.h $(FUNCTION_H) $(TREE_PASS_H) \
-   cselib.h gt-cselib.h $(GGC_H) $(TM_P_H) $(PARAMS_H) alloc-pool.h \
+   cselib.h $(TM_P_H) $(PARAMS_H) alloc-pool.h \
    $(HASHTAB_H) $(TARGET_H) $(BITMAP_H)
 cse.o : cse.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(REGS_H) \
    hard-reg-set.h $(FLAGS_H) insn-config.h $(RECOG_H) $(EXPR_H) $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) \
@@ -3096,9 +3095,9 @@ implicit-zee.o : implicit-zee.c $(CONFIG
    $(REGS_H) $(TREE_H) $(TM_P_H) insn-config.h $(INSN_ATTR_H) $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) \
    $(TARGET_H) $(OPTABS_H) insn-codes.h rtlhooks-def.h $(PARAMS_H) $(CGRAPH_H)
 gcse.o : gcse.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
-   $(REGS_H) hard-reg-set.h $(FLAGS_H) insn-config.h $(GGC_H) \
+   $(REGS_H) hard-reg-set.h $(FLAGS_H) insn-config.h \
    $(RECOG_H) $(EXPR_H) $(BASIC_BLOCK_H) $(FUNCTION_H) output.h $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) \
-   $(TM_P_H) $(PARAMS_H) cselib.h $(EXCEPT_H) gt-gcse.h $(TREE_H) $(TIMEVAR_H) \
+   $(TM_P_H) $(PARAMS_H) cselib.h $(EXCEPT_H) $(TREE_H) $(TIMEVAR_H) \
    intl.h $(OBSTACK_H) $(TREE_PASS_H) $(DF_H) $(DBGCNT_H) $(TARGET_H) \
    $(DF_H) gcse.h
 store-motion.o : store-motion.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
@@ -3301,7 +3300,7 @@ postreload-gcse.o : postreload-gcse.c $(
 caller-save.o : caller-save.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    $(FLAGS_H) $(REGS_H) hard-reg-set.h insn-config.h $(BASIC_BLOCK_H) $(FUNCTION_H) \
    addresses.h $(RECOG_H) reload.h $(EXPR_H) $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) $(TM_P_H) $(DF_H) \
-   output.h gt-caller-save.h $(GGC_H)
+   output.h
 bt-load.o : bt-load.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(EXCEPT_H) \
    $(RTL_H) hard-reg-set.h $(REGS_H) $(TM_P_H) $(FIBHEAP_H) output.h $(EXPR_H) \
    $(TARGET_H) $(FLAGS_H) $(INSN_ATTR_H) $(FUNCTION_H) $(TREE_PASS_H) \
@@ -3434,7 +3433,7 @@ predict.o: predict.c $(CONFIG_H) $(SYSTE
    $(COVERAGE_H) $(SCEV_H) $(GGC_H) predict.def $(TIMEVAR_H) $(TREE_DUMP_H) \
    $(TREE_FLOW_H) $(TREE_PASS_H) $(EXPR_H) pointer-set.h
 lists.o: lists.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(TOPLEV_H) $(DIAGNOSTIC_CORE_H) \
-   $(RTL_H) $(GGC_H) gt-lists.h
+   $(RTL_H)
 bb-reorder.o : bb-reorder.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
    $(RTL_H) $(FLAGS_H) $(TIMEVAR_H) output.h $(CFGLAYOUT_H) $(FIBHEAP_H) \
    $(TARGET_H) $(FUNCTION_H) $(TM_P_H) $(OBSTACK_H) $(EXPR_H) $(REGS_H) \
Index: basic-block.h
===================================================================
--- basic-block.h	(revision 162821)
+++ basic-block.h	(working copy)
@@ -41,7 +41,7 @@ struct GTY(()) edge_def {
   /* Instructions queued on the edge.  */
   union edge_def_insns {
     gimple_seq GTY ((tag ("true"))) g;
-    rtx GTY ((tag ("false"))) r;
+    rtx GTY ((skip,tag ("false"))) r;
   } GTY ((desc ("current_ir_type () == IR_GIMPLE"))) insns;
 
   /* Auxiliary info specific to a pass.  */
@@ -147,9 +147,9 @@ struct GTY((chain_next ("%h.next_bb"), c
   struct basic_block_def *next_bb;
 
   union basic_block_il_dependent {
-      struct gimple_bb_info * GTY ((tag ("0"))) gimple;
-      struct rtl_bb_info * GTY ((tag ("1"))) rtl;
-    } GTY ((desc ("((%1.flags & BB_RTL) != 0)"))) il;
+    struct gimple_bb_info * GTY ((tag ("0"))) gimple;
+    struct rtl_bb_info * GTY ((skip,tag ("1"))) rtl;
+  } GTY ((desc ("((%1.flags & BB_RTL) != 0)"))) il;
 
   /* Expected number of executions: calculated in profile.c.  */
   gcov_type count;
Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 162821)
+++ config/i386/i386.h	(working copy)
@@ -2298,13 +2298,13 @@ enum ix86_stack_slot
 /* Machine specific CFA tracking during prologue/epilogue generation.  */
 
 #ifndef USED_FOR_TARGET
-struct GTY(()) machine_cfa_state
+struct machine_cfa_state
 {
   rtx reg;
   HOST_WIDE_INT offset;
 };
 
-struct GTY(()) machine_function {
+struct machine_function {
   struct stack_local_entry *stack_locals;
   const char *some_ld_name;
   int varargs_gpr_size;
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 162821)
+++ config/i386/i386.c	(working copy)
@@ -8537,7 +8537,7 @@ ix86_emit_save_sse_regs_using_mov (rtx p
       }
 }
 
-static GTY(()) rtx queued_cfa_restores;
+static rtx queued_cfa_restores;
 
 /* Add a REG_CFA_RESTORE REG note to INSN or queue them until next stack
    manipulation insn.  Don't add it if the previously
@@ -20130,7 +20130,8 @@ ix86_init_machine_status (void)
 {
   struct machine_function *f;
 
-  f = ggc_alloc_cleared_machine_function ();
+  f = XNEW (struct machine_function);
+  memset (f, 0, sizeof *f);
   f->use_fast_prologue_epilogue_nregs = -1;
   f->tls_descriptor_call_expanded_p = 0;
   f->call_abi = ix86_abi;
@@ -20158,7 +20159,7 @@ assign_386_stack_local (enum machine_mod
     if (s->mode == mode && s->n == n)
       return copy_rtx (s->rtl);
 
-  s = ggc_alloc_stack_local_entry ();
+  s = XOBNEW (rtl_obstack, struct stack_local_entry);
   s->n = n;
   s->mode = mode;
   s->rtl = assign_stack_local (mode, GET_MODE_SIZE (mode), 0);
@@ -20170,18 +20171,20 @@ assign_386_stack_local (enum machine_mod
 
 /* Construct the SYMBOL_REF for the tls_get_addr function.  */
 
-static GTY(()) rtx ix86_tls_symbol;
+static rtx ix86_tls_symbol;
 rtx
 ix86_tls_get_addr (void)
 {
 
   if (!ix86_tls_symbol)
     {
+      rtl_on_permanent_obstack ();
       ix86_tls_symbol = gen_rtx_SYMBOL_REF (Pmode,
 					    (TARGET_ANY_GNU_TLS
 					     && !TARGET_64BIT)
 					    ? "___tls_get_addr"
 					    : "__tls_get_addr");
+      rtl_pop_obstack ();
     }
 
   return ix86_tls_symbol;
@@ -20189,17 +20192,19 @@ ix86_tls_get_addr (void)
 
 /* Construct the SYMBOL_REF for the _TLS_MODULE_BASE_ symbol.  */
 
-static GTY(()) rtx ix86_tls_module_base_symbol;
+static rtx ix86_tls_module_base_symbol;
 rtx
 ix86_tls_module_base (void)
 {
 
   if (!ix86_tls_module_base_symbol)
     {
+      rtl_on_permanent_obstack ();
       ix86_tls_module_base_symbol = gen_rtx_SYMBOL_REF (Pmode,
 							"_TLS_MODULE_BASE_");
       SYMBOL_REF_FLAGS (ix86_tls_module_base_symbol)
 	|= TLS_MODEL_GLOBAL_DYNAMIC << SYMBOL_FLAG_TLS_SHIFT;
+      rtl_pop_obstack ();
     }
 
   return ix86_tls_module_base_symbol;
@@ -20982,6 +20987,7 @@ static rtx
 ix86_static_chain (const_tree fndecl, bool incoming_p)
 {
   unsigned regno;
+  rtx t;
 
   if (!DECL_STATIC_CHAIN (fndecl))
     return NULL;
@@ -21031,7 +21037,10 @@ ix86_static_chain (const_tree fndecl, bo
 	}
     }
 
-  return gen_rtx_REG (Pmode, regno);
+  rtl_on_permanent_obstack ();
+  t = gen_rtx_REG (Pmode, regno);
+  rtl_pop_obstack ();
+  return t;
 }
 
 /* Emit RTL insns to initialize the variable parts of a trampoline.
Index: cfgrtl.c
===================================================================
--- cfgrtl.c	(revision 162821)
+++ cfgrtl.c	(working copy)
@@ -3079,7 +3079,8 @@ void
 init_rtl_bb_info (basic_block bb)
 {
   gcc_assert (!bb->il.rtl);
-  bb->il.rtl = ggc_alloc_cleared_rtl_bb_info ();
+  bb->il.rtl = XOBNEW (rtl_obstack, struct rtl_bb_info);
+  memset (bb->il.rtl, 0, sizeof (struct rtl_bb_info));
 }
 
 
Index: stmt.c
===================================================================
--- stmt.c	(revision 162821)
+++ stmt.c	(working copy)
@@ -140,7 +140,10 @@ label_rtx (tree label)
 
   if (!DECL_RTL_SET_P (label))
     {
-      rtx r = gen_label_rtx ();
+      rtx r;
+      rtl_on_permanent_obstack ();
+      r = gen_label_rtx ();
+      rtl_pop_obstack ();
       SET_DECL_RTL (label, r);
       if (FORCED_LABEL (label) || DECL_NONLOCAL (label))
 	LABEL_PRESERVE_P (r) = 1;
Index: reload1.c
===================================================================
--- reload1.c	(revision 162821)
+++ reload1.c	(working copy)
@@ -1457,7 +1457,9 @@ calculate_needs_all_insns (int global)
 {
   struct insn_chain **pprev_reload = &insns_need_reload;
   struct insn_chain *chain, *next = 0;
-
+#if 0
+  char *reload_rtl_firstobj = XOBNEWVAR (rtl_obstack, char, 0);
+#endif
   something_needs_elimination = 0;
 
   reload_insn_firstobj = XOBNEWVAR (&reload_obstack, char, 0);
@@ -1560,7 +1562,12 @@ calculate_needs_all_insns (int global)
 	  /* Discard any register replacements done.  */
 	  if (did_elimination)
 	    {
-	      obstack_free (&reload_obstack, reload_insn_firstobj);
+#if 0
+	      if (n_reloads != 0)
+		reload_rtl_firstobj = XOBNEWVAR (rtl_obstack, char, 0);
+	      else
+		obstack_free (rtl_obstack, reload_rtl_firstobj);
+#endif
 	      PATTERN (insn) = old_body;
 	      INSN_CODE (insn) = old_code;
 	      REG_NOTES (insn) = old_notes;
@@ -3221,9 +3228,13 @@ eliminate_regs_in_insn (rtx insn, int re
 		  || GET_CODE (PATTERN (insn)) == ASM_INPUT
 		  || DEBUG_INSN_P (insn));
       if (DEBUG_INSN_P (insn))
-	INSN_VAR_LOCATION_LOC (insn)
-	  = eliminate_regs (INSN_VAR_LOCATION_LOC (insn), VOIDmode, insn);
-      return 0;
+	{
+	  new_body = eliminate_regs (INSN_VAR_LOCATION_LOC (insn), VOIDmode, insn);
+	  if (new_body != INSN_VAR_LOCATION_LOC (insn))
+	    val = 1;
+	  INSN_VAR_LOCATION_LOC (insn) = new_body;
+	}
+      return val;
     }
 
   if (old_set != 0 && REG_P (SET_DEST (old_set))


* Re: Combine four insns
  2010-08-09 12:29             ` Bernd Schmidt
@ 2010-08-09 12:39               ` Steven Bosscher
  2010-08-09 13:48                 ` Bernd Schmidt
  2010-08-10  2:51                 ` Laurynas Biveinis
  2010-08-09 12:41               ` Michael Matz
  2010-08-09 14:39               ` Toon Moene
  2 siblings, 2 replies; 129+ messages in thread
From: Steven Bosscher @ 2010-08-09 12:39 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Eric Botcazou, Richard Guenther, gcc-patches, Laurynas Biveinis

On Mon, Aug 9, 2010 at 2:05 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> The following crude proof-of-concept patch moves rtl generation back to
> obstacks.  (You may need --disable-werror which I just noticed I have in
> the build tree).

This is interesting. Years ago (7 years?) Zack Weinberg suggested that
GCC should move RTL back onto obstacks. The overhead should be
relatively small compared to keeping entire functions-as-trees in
memory. This is even more true today, now we keep entire translation
units (and more) in memory as GIMPLE (with SSA). Memory spent on RTL
is marginal compared to that.

I passed this idea to Laurynas last year
(http://gcc.gnu.org/ml/gcc/2009-08/msg00386.html). I don't know if he
played with the idea or not.

I've starred your prototype patch in gmail ;-)

Ciao!
Steven


* Re: Combine four insns
  2010-08-09 12:29             ` Bernd Schmidt
  2010-08-09 12:39               ` Steven Bosscher
@ 2010-08-09 12:41               ` Michael Matz
  2010-08-09 14:34                 ` Bernd Schmidt
  2010-08-09 14:39               ` Toon Moene
  2 siblings, 1 reply; 129+ messages in thread
From: Michael Matz @ 2010-08-09 12:41 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Eric Botcazou, Richard Guenther, gcc-patches

Hi,

On Mon, 9 Aug 2010, Bernd Schmidt wrote:

> On 08/07/2010 10:10 AM, Eric Botcazou wrote:
> > So I think that the patch shouldn't go in at this point.
> 
> Richard has approved it.  I'll wait a few more days to see if anyone
> else agrees with your position.

[I'm not a reviewer, but FWIW:]

I'm also not too thrilled about using 1% more compilation time for the 
result.  It's not so much a matter of which sequences are actually 
replaced (your examples showed quite abysmal initial sequences, no doubt), 
but rather a matter of how often they occur in practice, and there it 
doesn't look too bright:

> $ grep Trying.four log |wc -l
> 307743
> $ grep Succeeded.four.into.two log |wc -l
> 244
> $ grep Succeeded.four.into.one log |wc -l
> 140

So out of 1230972 insns it was able to remove 908.  That's 0.07%.  Not too 
exciting, no matter how bad those 384 initial sequences looked.

For the bitmap manipulation stuff especially, Richi once had something at 
the tree level that would enable better initial code generation.  Apart 
from that, I unfortunately have no better idea for getting the results you 
got without paying so high a price.

Perhaps you can limit the cases in which it tries to match four 
instructions to some simpler ones?  For instance, require that at least 
two of the insns have "leaves" (expressions not depending on pseudos) as 
one of their operands?


Ciao,
Michael.


* Re: Combine four insns
  2010-08-09 12:39               ` Steven Bosscher
@ 2010-08-09 13:48                 ` Bernd Schmidt
  2010-08-10  2:51                 ` Laurynas Biveinis
  1 sibling, 0 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-09 13:48 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Eric Botcazou, Richard Guenther, gcc-patches, Laurynas Biveinis

On 08/09/2010 02:33 PM, Steven Bosscher wrote:
> This is interesting. Years ago (7 years?) Zack Weinberg suggested that
> GCC should move RTL back onto obstacks. The overhead should be
> relatively small compared to keeping entire functions-as-trees in
> memory. This is even more true today, now we keep entire translation
> units (and more) in memory as GIMPLE (with SSA). Memory spent on RTL
> is marginal compared to that.

I think the idea is quite well-known.  The reason we moved to garbage
collection was just the fact that our previous obstack-based memory
management was unmaintainable, and we wanted to be able to introduce
things like tree-ssa.  Now that we've managed that transition, the
original reason to use garbage collection is lost, at least for RTL.  We
no longer do crazy things like starting to expand a nested function in
the middle of compiling its containing function.

The combine.c portion of the patch shows that obstacks can give us more
control over memory access patterns: we can create a pattern and then
immediately and cheaply discard the memory if we don't use it; the next
one we create reuses the same memory.  This should be better for locality.
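
To make that concrete, the allocate/try/discard idiom from the combine.c
hunks above looks roughly like this (a sketch only;
build_and_recog_pattern is a stand-in for the real substitution and
recognition code, not a function in the patch):

/* Mark the obstack, build a candidate pattern, and release the
   memory wholesale if recognition fails.  */
static rtx
try_candidate (void)
{
  /* Remember the current top of the RTL obstack.  */
  char *firstobj = XOBNEWVAR (rtl_obstack, char, 0);
  rtx newpat = build_and_recog_pattern ();  /* hypothetical helper */

  if (newpat == NULL_RTX)
    /* Failure: free every rtx allocated since the mark; the next
       attempt reuses the same, still cache-hot, memory.  */
    obstack_free (rtl_obstack, firstobj);

  return newpat;
}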


Bernd


* Re: Combine four insns
  2010-08-09 12:41               ` Michael Matz
@ 2010-08-09 14:34                 ` Bernd Schmidt
  0 siblings, 0 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-09 14:34 UTC (permalink / raw)
  To: Michael Matz; +Cc: Eric Botcazou, Richard Guenther, gcc-patches

On 08/09/2010 02:38 PM, Michael Matz wrote:
> Perhaps you can limit the cases in which it tries to match four 
> instructions to some simpler ones?  For instance, require that at least 
> two of the insns have "leaves" (expressions not depending on pseudos) as 
> one of their operands?

That might be workable.  I could require that one of them sets a
register to a constant; I expect that would still catch some interesting
cases and eliminate most of the ones that couldn't be combined anyway.
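
A first cut at such a gate might look like the following (just a sketch;
the name is made up, and whether CONSTANT_P or a stricter CONST_INT test
is the right filter would need measuring):

/* Hypothetical filter for the four-insn case: true if INSN merely
   loads a constant into a register.  */
static bool
sets_reg_to_const_p (rtx insn)
{
  rtx set = single_set (insn);
  return (set != NULL_RTX
	  && REG_P (SET_DEST (set))
	  && CONSTANT_P (SET_SRC (set)));
}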


Bernd


* Re: Combine four insns
  2010-08-09 12:29             ` Bernd Schmidt
  2010-08-09 12:39               ` Steven Bosscher
  2010-08-09 12:41               ` Michael Matz
@ 2010-08-09 14:39               ` Toon Moene
  2010-08-09 14:50                 ` Steven Bosscher
  2 siblings, 1 reply; 129+ messages in thread
From: Toon Moene @ 2010-08-09 14:39 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Eric Botcazou, Richard Guenther, gcc-patches

Bernd Schmidt wrote:

> On 08/07/2010 10:10 AM, Eric Botcazou wrote:

>> Combining Steven and Bernd's figures, 1% of a bootstrap time is 37% of the 
>> combiner's time.  The result is 0.18% more combined insns.  It seems to me 
>> that we are already very far in the direction of diminishing returns.
> 
> Better to look at actual code generation results IMO.  Do you have an
> opinion on the examples I included with the patch?

Well, one of the limitations of this analysis is that it is static - it 
doesn't say what the run-time influence of the simplification is.

If it is inside a heavily used loop, it might be far more important 
than it looks at first ...

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html#Fortran


* Re: Combine four insns
  2010-08-09 14:39               ` Toon Moene
@ 2010-08-09 14:50                 ` Steven Bosscher
  2010-08-09 14:58                   ` The speed of the compiler, was: " Toon Moene
  0 siblings, 1 reply; 129+ messages in thread
From: Steven Bosscher @ 2010-08-09 14:50 UTC (permalink / raw)
  To: Toon Moene; +Cc: Bernd Schmidt, Eric Botcazou, Richard Guenther, gcc-patches

On Mon, Aug 9, 2010 at 4:34 PM, Toon Moene <toon@moene.org> wrote:
> Bernd Schmidt wrote:
>
>> On 08/07/2010 10:10 AM, Eric Botcazou wrote:
>
>>> Combining Steven and Bernd's figures, 1% of a bootstrap time is 37% of
>>> the combiner's time.  The result is 0.18% more combined insns.  It seems to
>>> me that we are already very far in the direction of diminishing returns.
>>
>> Better to look at actual code generation results IMO.  Do you have an
>> opinion on the examples I included with the patch?
>
> Well, one of the limitations of this analysis is that it is static - it
> doesn't say what the run-time influence of the simplification is.
>
> If it is inside a heavily used loop, it might be far more important
> than it looks at first ...

With that argument alone, you can justify a superoptimizer at -O2 :-P

There has to be a trade-off at some point.

Ciao!
Steven


* Re: Combine four insns
  2010-08-06 21:46         ` Jeff Law
@ 2010-08-09 14:54           ` Mark Mitchell
  2010-08-09 15:04             ` Bernd Schmidt
  2010-08-09 16:02             ` Chris Lattner
  0 siblings, 2 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-09 14:54 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, Steven Bosscher, GCC Patches

Jeff Law wrote:

>> Yes.  Such combiner bridges may still be useful to help with five-insn
>> combinations :)

> Yea, though I expect there to be an ever-decreasing payoff for allowing
> larger bundles of instructions to be combined.

We all agree that faster compilers are better, all other things equal.
There was a suggestion for a heuristic that might speed things up
considerably without losing many opportunities, and Bernd thought that
was a good idea.  Perhaps that will get back most of the time and this
discussion will be semi-moot.

Also, I think that the number of instructions combine should consider
might as well be a --param.  As far as I can tell, it would be
reasonably easy to modify the pass to pass around an array of
instructions to combine, rather than i0, i1, i2, i3.  Then, distributors
could set whatever default (2, 3, 4, 5, 100, whatever) they think
appropriate for their users.  Bernd, is this something that you could do
readily?

We still have to have the argument about FSF GCC, but it would be
trivial for a distributor (or even an end user building from source) to
adjust the default as they like.  And, furthermore, it would allow
people to make this trade-off at run-time, independent of the default.
I think it would also make the algorithm clearer.
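
For illustration, such a knob could be a single params.def entry along
these lines (hypothetical; the name, help string and bounds are made up,
and no such param exists today):

/* Hypothetical params.def entry; a maximum of 0 means no upper bound.  */
DEFPARAM (PARAM_MAX_COMBINE_INSNS,
	  "max-combine-insns",
	  "The maximum number of insns combine tries to combine into one",
	  4, 2, 0)

Users could then override the default per invocation with
--param max-combine-insns=2 instead of rebuilding the compiler.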

As to the more general question of compile-time vs. optimization, I
think it's clear that to CodeSourcery's customer base the latter matters
more than the former -- but only up to a point.  Few people want 100%
more compilation time in order to get 0.01% more performance.  But, many
people would happily take 10% more compilation time to get 1% more
performance, and almost all would accept 1% more compilation time to get
1% more performance.

Like Bernd, I doubt that compilation time with optimization enabled per
se is the dominant concern for most people.  GCC's own build times are
increasingly dominated by configure scripts, make overhead, and linking.
 Incremental linking would probably do more to improve the typical
programmer's compile-edit-debug cycle (which is probably done with -g,
not -O2!) than anything else.  Compile-time improvements at -O2 benefit
distributors or others who are doing builds of massive amounts of
software, but I'm skeptical that GCC is getting slower faster than
hardware is getting cheaper.

So, I think that combining four instructions by default is plausible.
But, defaults are always arguable.  I think that for FSF GCC we
shouldn't spend too much time worrying about it.  The userbase is too
diverse to make it easy for us to please everybody.  Make it easy for
distributors and let them pick.  They have a better chance of getting
things set up correctly for their users, since their users are a
narrower set of people.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


* The speed of the compiler, was: Re: Combine four insns
  2010-08-09 14:50                 ` Steven Bosscher
@ 2010-08-09 14:58                   ` Toon Moene
  2010-08-09 15:00                     ` Paul Koning
  2010-08-09 15:33                     ` Diego Novillo
  0 siblings, 2 replies; 129+ messages in thread
From: Toon Moene @ 2010-08-09 14:58 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Bernd Schmidt, Eric Botcazou, Richard Guenther, gcc-patches

Steven Bosscher wrote:

> On Mon, Aug 9, 2010 at 4:34 PM, Toon Moene <toon@moene.org> wrote:

>> In case it is inside a heavily used loop, it might be far more important
>> than it at first looks ...
> 
> With that argument alone, you can justify a superoptimizer at -O2 :-P
> 
> There has to be a trade-off at some point.

OK, let's get serious then about (improving) the speed of compilation.

In the vast majority of examples I have seen over the last decade, speed 
of compilation is a concern when developers are in the compile->fix 
bug->compile cycle.

Most of these "compiles" are -O0 -g compiles, for obvious reasons (why 
spend time on optimization when you don't even know the code is correct?)

So all of these complaints fall outside the "this optimization adds 1% 
of compile time when compiling with -O2" argument.

When compilation time with -On (n > 1) becomes a *real* problem, I'd 
like to hear about it again (and then, Bernd's analysis of where time 
goes in compilation still holds).

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html#Fortran


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 14:58                   ` The speed of the compiler, was: " Toon Moene
@ 2010-08-09 15:00                     ` Paul Koning
  2010-08-09 15:33                     ` Diego Novillo
  1 sibling, 0 replies; 129+ messages in thread
From: Paul Koning @ 2010-08-09 15:00 UTC (permalink / raw)
  To: Toon Moene; +Cc: gcc-patches

>> ...
> 
> OK, let's get serious then about (improving) the speed of compilation.
> 
> In the vast majority of examples I have seen over the last decade, speed of compilation is a concern when developers are in the compile->fix bug->compile cycle.
> 
> Most of these "compiles" are -O0 -g compiles, for obvious reasons (why spend time on optimization when you don't even know the code is correct?)

Actually, I tend to use -O1 for debug compiles, for two reasons: (1) a number of useful gcc warnings don't appear unless optimization is used; (2) if I have to single-step instructions it's a whole lot more efficient.

	paul


* Re: Combine four insns
  2010-08-09 14:54           ` Mark Mitchell
@ 2010-08-09 15:04             ` Bernd Schmidt
  2010-08-09 16:02             ` Chris Lattner
  1 sibling, 0 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-09 15:04 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Jeff Law, Steven Bosscher, GCC Patches

On 08/09/2010 04:50 PM, Mark Mitchell wrote:

> Also, I think that the number of instructions combine should consider
> might as well be a --param.  As far as I can tell, it would be
> reasonably easy to modify the pass to pass around an array of
> instructions to combine, rather than i0, i1, i2, i3.  Then, distributors
> could set whatever default (2, 3, 4, 5, 100, whatever) they think
> appropriate for their users.  Bernd, is this something that you could do
> readily?

Not readily, since it would require rather more surgery to make sure all
the lifetimes, register deaths, etc. are tracked.  Also, the way combine
keeps track of LOG_LINKS (i.e. single basic block only, one user for
each reg only) makes larger windows of dubious value.

What I do want to try is to do something similar with fwprop (i.e. if
substitution fails or is unprofitable, try substituting more uses).  If
that turns out to be feasible, it should be possible to have it
controlled by a param.

> Like Bernd, I doubt that compilation time with optimization enabled per
> se is the dominant concern for most people.  GCC's own build times are
> increasingly dominated by configure scripts, make overhead, and linking.

I've been noticing fixincludes too.


Bernd


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 14:58                   ` The speed of the compiler, was: " Toon Moene
  2010-08-09 15:00                     ` Paul Koning
@ 2010-08-09 15:33                     ` Diego Novillo
  2010-08-09 15:53                       ` Mark Mitchell
                                         ` (5 more replies)
  1 sibling, 6 replies; 129+ messages in thread
From: Diego Novillo @ 2010-08-09 15:33 UTC (permalink / raw)
  To: Toon Moene
  Cc: Steven Bosscher, Bernd Schmidt, Eric Botcazou, Richard Guenther,
	gcc-patches

On 10-08-09 10:54 , Toon Moene wrote:

> Most of these "compiles" are -O0 -g compiles, for obvious reasons (why
> spend time on optimization when you don't even know the code is correct ?)

Internally, we have been working on build time improvements for a few 
months.  We are not looking at just the compiler, but the entire toolchain.

As I've posted in other threads, our main consumer of compilation time 
is the C++ front end.  Hands down.  It consistently takes between 70% 
and 80% of compilation time across the board.  Furthermore, this is 
independent from the optimization level.  The optimizers never take more 
than 10-15% of total compilation time, on average.

We are also looking at incremental linking and addressing performance 
problems in the build system itself.  But for the compiler, our focus is 
just the C++ front end.  We are trying to incorporate more caching into 
the system.  We already have ccache style caches, so we are trying finer 
grained approaches.

We tried using pre-compiled headers earlier this year, but it just 
didn't work.  PCH caches the transitive closure of the translation unit, 
so if just one file changes, you need to re-compile the whole TU again. 
It was also space-prohibitive, given all the different versions that had 
to be cached.

Currently, we are prototyping front end changes for caching tokens and 
parsing results.  We are not yet up to the point where we can tell 
whether it will work, though.

The one interesting result we found so far is that in looking at the 
whole toolchain, it may be worth *slowing* down some components to 
speed up the whole process.  In particular, we have found that using -Os 
as our default build setting, actually decreases build time by 15% to 
20%.  The compiler spends a bit more time optimizing (almost 
unnoticeable in general), but the resulting smaller objects and binaries 
more than compensate for it.  It speeds up linking, I/O, transmission 
times, etc.

Additionally, the very worst offender in terms of compile time is -g. 
The size of debugging information is such that I/O and communication 
times increase significantly.


Diego.


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 15:33                     ` Diego Novillo
@ 2010-08-09 15:53                       ` Mark Mitchell
  2010-08-09 17:15                       ` Toon Moene
                                         ` (4 subsequent siblings)
  5 siblings, 0 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-09 15:53 UTC (permalink / raw)
  To: Diego Novillo
  Cc: Toon Moene, Steven Bosscher, Bernd Schmidt, Eric Botcazou,
	Richard Guenther, gcc-patches

Diego Novillo wrote:

> Additionally, the very worst offender in terms of compile time is -g.
> The size of debugging information is such that I/O and communication
> times increase significantly.

And, in the case of C++, this is especially true, since you end up with
information about the same classes in multiple places.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


* Re: Combine four insns
  2010-08-09 14:54           ` Mark Mitchell
  2010-08-09 15:04             ` Bernd Schmidt
@ 2010-08-09 16:02             ` Chris Lattner
  2010-08-09 16:07               ` Richard Guenther
  2010-08-09 16:19               ` Mark Mitchell
  1 sibling, 2 replies; 129+ messages in thread
From: Chris Lattner @ 2010-08-09 16:02 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Jeff Law, Bernd Schmidt, Steven Bosscher, GCC Patches


On Aug 9, 2010, at 7:50 AM, Mark Mitchell wrote:

> Also, I think that the number of instructions combine should consider
> might as well be a --param. ...  Then, distributors
> could set whatever default (2, 3, 4, 5, 100, whatever) they think
> appropriate for their users.
> 
> We still have to have the argument about FSF GCC, but it would be
> trivial for a distributor (or even an end user building from source) to
> adjust the default as they like.
...
> But, defaults are always arguable.  I think that for FSF GCC we
> shouldn't spend too much time worrying about it.  The userbase is too
> diverse to make it easy for us to please everybody.  Make it easy for
> distributors and let them pick.  They have a better chance of getting
> things set up correctly for their users, since their users are a
> narrower set of people.

I find this attitude to be really interesting, as the approach was similar in the -fomit-frame-pointer thread.  While it's always easy to add "yet another knob" and push the onus/responsibility back onto distributors, this is problematic in other ways.  In particular, this neutralizes one of the major advantages that GCC has brought to the FOSS world over the last couple decades: it provides a standardized interface to the C compiler which makefiles and programmers know.

I realize that this specific example isn't a big deal, and GCC has more params than normal distributors would actually care about in practice, but the general attitude is interesting.  If nothing else, if the distributor wanted to change something, they could always hack up the source before they ship it.

Is "throw in a param and let the distributors decide" really a great solution to issues like these?

-Chris


* Re: Combine four insns
  2010-08-09 16:02             ` Chris Lattner
@ 2010-08-09 16:07               ` Richard Guenther
  2010-08-09 17:28                 ` Joseph S. Myers
  2010-08-09 16:19               ` Mark Mitchell
  1 sibling, 1 reply; 129+ messages in thread
From: Richard Guenther @ 2010-08-09 16:07 UTC (permalink / raw)
  To: Chris Lattner
  Cc: Mark Mitchell, Jeff Law, Bernd Schmidt, Steven Bosscher, GCC Patches

On Mon, Aug 9, 2010 at 6:00 PM, Chris Lattner <clattner@apple.com> wrote:
>
> On Aug 9, 2010, at 7:50 AM, Mark Mitchell wrote:
>
>> Also, I think that the number of instructions combine should consider
>> might as well be a --param. ...  Then, distributors
>> could set whatever default (2, 3, 4, 5, 100, whatever) they think
>> appropriate for their users.
>>
>> We still have to have the argument about FSF GCC, but it would be
>> trivial for a distributor (or even an end user building from source) to
>> adjust the default as they like.
> ...
>> But, defaults are always arguable.  I think that for FSF GCC we
>> shouldn't spend too much time worrying about it.  The userbase is too
>> diverse to make it easy for us to please everybody.  Make it easy for
>> distributors and let them pick.  They have a better chance of getting
>> things set up correctly for their users, since their users are a
>> narrower set of people.
>
> I find this attitude to be really interesting, as the approach was similar in the -fomit-frame-pointer thread.  While it's always easy to add "yet another knob" and push the onus/responsibility back onto distributors, this is problematic in other ways.  In particular, this neutralizes one of the major advantages that GCC has brought to the FOSS world over the last couple decades: it provides a standardized interface to the C compiler which makefiles and programmers know.
>
> I realize that this specific example isn't a big deal, and GCC has more params than normal distributors would actually care about in practice, but the general attitude is interesting.  If nothing else, if the distributor wanted to change something, they could always hack up the source before they ship it.
>
> Is "throw in a param and let the distributors decide" really a great solution to issues like these?

No, it is not.  It also makes reproducing bugs harder and explodes
the testing matrix.

So I'd like to have _less_ knobs, not more.

Please.

Richard.

> -Chris
>
>


* Re: Combine four insns
  2010-08-09 16:02             ` Chris Lattner
  2010-08-09 16:07               ` Richard Guenther
@ 2010-08-09 16:19               ` Mark Mitchell
  2010-08-09 17:02                 ` Chris Lattner
  1 sibling, 1 reply; 129+ messages in thread
From: Mark Mitchell @ 2010-08-09 16:19 UTC (permalink / raw)
  To: Chris Lattner; +Cc: Jeff Law, Bernd Schmidt, Steven Bosscher, GCC Patches

Chris Lattner wrote:

> Is "throw in a param and let the distributors decide" really a great solution to issues like these?

Do you have a better one?

I see no inherent reason to expect that the right answer for small ARM
microcontrollers is the same as for MIPS control-plane processors or x86
server processors.  And, even for a given processor, I'm not sure that
all users want the same thing.  As you say, since it's open-source
software, users/distributors can always modify it anyhow.  A --param
just makes that easier, and exposes the choice to users.

I'm all for setting things up so that they work well by default.  But,
I'm skeptical that there's a "right answer" to this particular question.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


* Re: Combine four insns
  2010-08-09 16:19               ` Mark Mitchell
@ 2010-08-09 17:02                 ` Chris Lattner
  2010-08-10  2:50                   ` Mark Mitchell
  0 siblings, 1 reply; 129+ messages in thread
From: Chris Lattner @ 2010-08-09 17:02 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Jeff Law, Bernd Schmidt, Steven Bosscher, GCC Patches


On Aug 9, 2010, at 9:06 AM, Mark Mitchell wrote:

> Chris Lattner wrote:
> 
>> Is "throw in a param and let the distributors decide" really a great solution to issues like these?
> 
> Do you have a better one?

Yes, pick an answer.  Either one is better than a new param IMO.

> I see no inherent reason to expect that the right answer for small ARM
> microcontrollers is the same as for MIPS control-plane processors or x86
> server processors.

So make the setting be a target property?

> I'm all for setting things up so that they work well by default.  But,
> I'm skeptical that there's a "right answer" to this particular question.

It's even better to just rip out combine of course, but I understand that's not a short-term solution. :)

-Chris


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 15:33                     ` Diego Novillo
  2010-08-09 15:53                       ` Mark Mitchell
@ 2010-08-09 17:15                       ` Toon Moene
  2010-08-09 17:19                         ` Diego Novillo
  2010-08-09 17:27                       ` The speed of the compiler, was: Re: Combine four insns Joseph S. Myers
                                         ` (3 subsequent siblings)
  5 siblings, 1 reply; 129+ messages in thread
From: Toon Moene @ 2010-08-09 17:15 UTC (permalink / raw)
  To: Diego Novillo
  Cc: Steven Bosscher, Bernd Schmidt, Eric Botcazou, Richard Guenther,
	gcc-patches

Diego Novillo wrote:

> On 10-08-09 10:54 , Toon Moene wrote:
> 
>> Most of these "compiles" are -O0 -g compiles, for obvious reasons (why
>> spend time on optimization when you don't even know the code is 
>> correct ?)
> 
> Internally, we have been working on build time improvements for a few 
> months.  We are not looking at just the compiler, but the entire toolchain.
> 
> As I've posted in other threads, our main consumer of compilation time 
> is the C++ front end.  Hands down.  It consistently takes between 70% 
> and 80% of compilation time across the board.  Furthermore, this is 
> independent from the optimization level.  The optimizers never take more 
> than 10-15% of total compilation time, on average.

As I pointed out in my 2007 GCC Summit talk, the Fortran Front End 
*already* (i.e., before anyone approved it) performs optimization on the 
Front End internal representation of a Fortran program / subroutine.

Is this also true for C++ ?  In that case it might be useful to curb 
Front End optimizations when -O0 is given ...

Or is there a reason why the C++ Front End has to do so much work (even 
when not bothering with language specific optimizations) ?

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html#Fortran


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 17:15                       ` Toon Moene
@ 2010-08-09 17:19                         ` Diego Novillo
  2010-08-09 17:29                           ` Toon Moene
  0 siblings, 1 reply; 129+ messages in thread
From: Diego Novillo @ 2010-08-09 17:19 UTC (permalink / raw)
  To: Toon Moene
  Cc: Steven Bosscher, Bernd Schmidt, Eric Botcazou, Richard Guenther,
	gcc-patches

On 10-08-09 13:07 , Toon Moene wrote:

> Is this also true for C++ ? In that case it might be useful to curb
> Front End optimizations when -O0 is given ...

Not really, the amount of optimization is quite minimal to non-existent.

Much of the slowness is due to the inherent nature of C++ parsing. 
There is some performance to be gained by tweaking the various data 
structures and algorithms, but no order-of-magnitude opportunities seem 
to exist.


Diego.


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 15:33                     ` Diego Novillo
  2010-08-09 15:53                       ` Mark Mitchell
  2010-08-09 17:15                       ` Toon Moene
@ 2010-08-09 17:27                       ` Joseph S. Myers
  2010-08-09 18:23                         ` Diego Novillo
  2010-08-10  6:20                         ` Chiheng Xu
  2010-08-09 17:34                       ` Steven Bosscher
                                         ` (2 subsequent siblings)
  5 siblings, 2 replies; 129+ messages in thread
From: Joseph S. Myers @ 2010-08-09 17:27 UTC (permalink / raw)
  To: Diego Novillo
  Cc: Toon Moene, Steven Bosscher, Bernd Schmidt, Eric Botcazou,
	Richard Guenther, gcc-patches

On Mon, 9 Aug 2010, Diego Novillo wrote:

> Additionally, the very worst offender in terms of compile time is -g. The size
> of debugging information is such that I/O and communication times increase
> significantly.

If communication between the compiler and assembler is an important part 
of the cost there, it's possible that a binary interface between them as 
suggested by Ian at <http://www.airs.com/blog/archives/268> would help.  
I would imagine it should be possible to get the assembler to accept some 
form of mixed text/binary input so you could just transmit debug info that 
way and transition to a more efficient interface incrementally (assembler 
input goes through a rather complicated sequence of preprocessing / 
processing steps, but cleaning them up to work with such input should be 
possible).
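
(Purely to illustrate the idea -- the directive is invented, gas has
nothing like it today -- such mixed input might look like:

	.text
	...ordinary text assembly...
	.raw_bytes 1234		# hypothetical directive
	<1234 bytes of pre-encoded .debug_info follow>

with the compiler emitting the DWARF blob directly instead of spelling
it out as text directives.)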

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: Combine four insns
  2010-08-09 16:07               ` Richard Guenther
@ 2010-08-09 17:28                 ` Joseph S. Myers
  0 siblings, 0 replies; 129+ messages in thread
From: Joseph S. Myers @ 2010-08-09 17:28 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Chris Lattner, Mark Mitchell, Jeff Law, Bernd Schmidt,
	Steven Bosscher, GCC Patches

On Mon, 9 Aug 2010, Richard Guenther wrote:

> > Is "throw in a param and let the distributors decide" really a great 
> > solution to issues like these?
> 
> No, it is not.  It also makes reproducing bugs harder and explodes
> the testing matrix.
> 
> So I'd like to have _less_ knobs, not more.

Knobs for users are at least better than configure-time knobs (and params 
are to a large extent knobs for GCC developers rather than for normal 
users).  For almost any piece of free software - not just GCC, or even 
mainly GCC - it's a pain for those using the software on multiple 
GNU/Linux distributions, or on GNU/Linux and on other systems, when the 
same software gratuitously behaves differently on different systems 
because the respective distributors / builders decided to configure it 
differently.  For the multi-platform user it's better not to have all the 
configure-time knobs, so they only need to set up their personal 
configuration to override cases where their taste differs from one fixed 
default, and not override every case where one distributor has decided to 
change the default.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 17:19                         ` Diego Novillo
@ 2010-08-09 17:29                           ` Toon Moene
  2010-08-09 23:24                             ` Chris Lattner
  0 siblings, 1 reply; 129+ messages in thread
From: Toon Moene @ 2010-08-09 17:29 UTC (permalink / raw)
  To: Diego Novillo
  Cc: Steven Bosscher, Bernd Schmidt, Eric Botcazou, Richard Guenther,
	gcc-patches, Chris Lattner

Diego Novillo wrote:

> On 10-08-09 13:07 , Toon Moene wrote:
> 
>> Is this also true for C++ ? In that case it might be useful to curb
>> Front End optimizations when -O0 is given ...
> 
> Not really, the amount of optimization is quite minimal to non-existent.
> 
> Much of the slowness is due to the inherent nature of C++ parsing. There 
> is some performance to be gained by tweaking the various data structures 
> and algorithms, but no order-of-magnitude opportunities seem to exist.

Perhaps Chris can add something to this discussion - after all, LLVM is 
written mostly in C++, no ?

Certainly, that must have provided him (and his team) with boatloads of 
performance data ....

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html#Fortran


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 15:33                     ` Diego Novillo
                                         ` (2 preceding siblings ...)
  2010-08-09 17:27                       ` The speed of the compiler, was: Re: Combine four insns Joseph S. Myers
@ 2010-08-09 17:34                       ` Steven Bosscher
  2010-08-09 17:36                         ` Diego Novillo
  2010-08-09 18:59                       ` The speed of the compiler Ralf Wildenhues
  2010-08-09 21:12                       ` The speed of the compiler, was: Re: Combine four insns Mike Stump
  5 siblings, 1 reply; 129+ messages in thread
From: Steven Bosscher @ 2010-08-09 17:34 UTC (permalink / raw)
  To: Diego Novillo
  Cc: Toon Moene, Bernd Schmidt, Eric Botcazou, Richard Guenther, gcc-patches

On Mon, Aug 9, 2010 at 5:26 PM, Diego Novillo <dnovillo@google.com> wrote:
> Additionally, the very worst offender in terms of compile time is -g. The
> size of debugging information is such that I/O and communication times
> increase significantly.

I assume you already made -pipe the default, and verified that the
piping to the assembler works properly?

It'd be interesting to know if / how much this helps...
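
(For reference, enabling it is just a matter of adding the flag to the
usual invocation, e.g.

  $ gcc -pipe -O2 -g -c foo.c

which makes the driver connect the compilation stages with pipes instead
of temporary files.)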

Ciao!
Steven


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 17:34                       ` Steven Bosscher
@ 2010-08-09 17:36                         ` Diego Novillo
  2010-08-09 23:13                           ` Cary Coutant
  0 siblings, 1 reply; 129+ messages in thread
From: Diego Novillo @ 2010-08-09 17:36 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Toon Moene, Bernd Schmidt, Eric Botcazou, Richard Guenther,
	gcc-patches, Cary Coutant

On 10-08-09 13:29 , Steven Bosscher wrote:
> On Mon, Aug 9, 2010 at 5:26 PM, Diego Novillo<dnovillo@google.com>  wrote:
>> Additionally, the very worst offender in terms of compile time is -g. The
>> size of debugging information is such that I/O and communication times
>> increase significantly.
>
> I assume you already made -pipe the default, and verified that the
> piping to the assembler works properly?

Yes.  Cary (CC'd) can provide more details, but the core of the issue is 
the massive size of the debug info.  This causes machines to run out of 
memory, increases transmission times, etc.  Builds already occur in 
tmpfs, so I/O is not an issue.  Transmission costs are, however.


Diego.


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 17:27                       ` The speed of the compiler, was: Re: Combine four insns Joseph S. Myers
@ 2010-08-09 18:23                         ` Diego Novillo
  2010-08-10  6:20                         ` Chiheng Xu
  1 sibling, 0 replies; 129+ messages in thread
From: Diego Novillo @ 2010-08-09 18:23 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Toon Moene, Steven Bosscher, Bernd Schmidt, Eric Botcazou,
	Richard Guenther, gcc-patches

On 10-08-09 13:19 , Joseph S. Myers wrote:
> On Mon, 9 Aug 2010, Diego Novillo wrote:
>
>> Additionally, the very worst offender in terms of compile time is -g. The size
>> of debugging information is such that I/O and communication times increase
>> significantly.
>
> If communication between the compiler and assembler is an important part
> of the cost there, it's possible that a binary interface between them as
> suggested by Ian at<http://www.airs.com/blog/archives/268>  would help.

Well, builds are already done in tmpfs, and we have not seen assembly 
times show up on the radar.  But as other pieces of the toolchain 
speed up, perhaps they will start being noticed.


Diego.


* Re: The speed of the compiler
  2010-08-09 15:33                     ` Diego Novillo
                                         ` (3 preceding siblings ...)
  2010-08-09 17:34                       ` Steven Bosscher
@ 2010-08-09 18:59                       ` Ralf Wildenhues
  2010-08-09 19:04                         ` Diego Novillo
  2010-08-09 21:12                       ` The speed of the compiler, was: Re: Combine four insns Mike Stump
  5 siblings, 1 reply; 129+ messages in thread
From: Ralf Wildenhues @ 2010-08-09 18:59 UTC (permalink / raw)
  To: Diego Novillo; +Cc: gcc-patches

* Diego Novillo wrote on Mon, Aug 09, 2010 at 05:26:00PM CEST:
> We are also looking at incremental linking and addressing
> performance problems in the build system itself.

Do you have numbers for configure, make, and libtool overhead in GCC?

There is still some room for improvement from using better shell features
in configure before the lack of parallelism becomes its bottleneck in GCC.
Adding parallelism is a bit harder.

Cheers,
Ralf


* Re: The speed of the compiler
  2010-08-09 18:59                       ` The speed of the compiler Ralf Wildenhues
@ 2010-08-09 19:04                         ` Diego Novillo
  0 siblings, 0 replies; 129+ messages in thread
From: Diego Novillo @ 2010-08-09 19:04 UTC (permalink / raw)
  To: Ralf Wildenhues, gcc-patches

On 10-08-09 14:55 , Ralf Wildenhues wrote:
> * Diego Novillo wrote on Mon, Aug 09, 2010 at 05:26:00PM CEST:
>> We are also looking at incremental linking and addressing
>> performance problems in the build system itself.
>
> Do you have numbers for configure, make, and libtool overhead in GCC?

No, sorry.  All the measurements I have were taken within our build 
environment.  In previous threads, the profiles I've seen posted were 
very different, since most of the code in GCC is C, not C++.


Diego.


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 15:33                     ` Diego Novillo
                                         ` (4 preceding siblings ...)
  2010-08-09 18:59                       ` The speed of the compiler Ralf Wildenhues
@ 2010-08-09 21:12                       ` Mike Stump
  2010-08-09 23:48                         ` Cary Coutant
  5 siblings, 1 reply; 129+ messages in thread
From: Mike Stump @ 2010-08-09 21:12 UTC (permalink / raw)
  To: Diego Novillo
  Cc: Toon Moene, Steven Bosscher, Bernd Schmidt, Eric Botcazou,
	Richard Guenther, gcc-patches


On Aug 9, 2010, at 8:26 AM, Diego Novillo wrote:
> Additionally, the very worst offender in terms of compile time is -g. The size of debugging information is such that I/O and communication times increase significantly.

Well, if one uses a technology to engineer out the possibility of creating/moving/copying/assembling that information...  Apple found it beneficial to leave it behind in the .o files and has the debugger go look in the .o files...  more can be done.

For example, the front-end could tap into a live database directly and avoid much of the cost.  Instead of writing out the same information 100 times for 100 translation units, the first one writes it and the rest just refer to the first.  Of course, you'd have to be willing to sign up for the downside of this sort of scheme.  Another possibility would be to create the data very lazily (little of the debug information ever created is ever used)...


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 17:36                         ` Diego Novillo
@ 2010-08-09 23:13                           ` Cary Coutant
  0 siblings, 0 replies; 129+ messages in thread
From: Cary Coutant @ 2010-08-09 23:13 UTC (permalink / raw)
  To: Diego Novillo
  Cc: Steven Bosscher, Toon Moene, Bernd Schmidt, Eric Botcazou,
	Richard Guenther, gcc-patches

>>> Additionally, the very worst offender in terms of compile time is -g. The
>>> size of debugging information is such that I/O and communication times
>>> increase significantly.
>>
>> I assume you already made -pipe the default, and verified that the
>> piping to the assembler works properly?
>
> Yes.  Cary (CC'd) can provide more details, but the core of the issue is the
> massive size of the debug info.  This causes machines to run out of memory,
> increases transmission times, etc.  Builds already occur in tmpfs, so I/O is
> not an issue.  Transmission costs are, however.

As Diego mentioned in a follow-up, we haven't found the cost of the
assembler intermediate to be a problem (at least not yet). What hurts
is the size of the object files themselves, with all of the duplicate
debug info that hasn't yet been eliminated by the linker (we're using
-gdwarf-4 and its ability to put debug type info into comdat
sections). Adding the --compress-debug-sections option to gas helped
quite a bit there.

I'd rephrase Diego's last two sentences, however, as "I/O is the
issue, but with tmpfs, the I/O happens outside the compiler and
linker."

-cary


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 17:29                           ` Toon Moene
@ 2010-08-09 23:24                             ` Chris Lattner
  2010-08-10 13:02                               ` Toon Moene
  2010-08-10 14:58                               ` Andi Kleen
  0 siblings, 2 replies; 129+ messages in thread
From: Chris Lattner @ 2010-08-09 23:24 UTC (permalink / raw)
  To: Toon Moene
  Cc: Diego Novillo, Steven Bosscher, Bernd Schmidt, Eric Botcazou,
	Richard Guenther, gcc-patches


On Aug 9, 2010, at 10:28 AM, Toon Moene wrote:

> Diego Novillo wrote:
> 
>> On 10-08-09 13:07 , Toon Moene wrote:
>>> Is this also true for C++ ? In that case it might be useful to curb
>>> Front End optimizations when -O0 is given ...
>> Not really, the amount of optimization is quite minimal to non-existent.
>> Much of the slowness is due to the inherent nature of C++ parsing. There is some performance to be gained by tweaking the various data structures and algorithms, but no order-of-magnitude opportunities seem to exist.
> 
> Perhaps Chris can add something to this discussion - after all, LLVM is written mostly in C++, no ?
> 
> Certainly, that must have provided him (and his team) with boatloads of performance data ....

I'm not sure what you mean here.  The single biggest win I've got in my personal development was switching from llvm-g++ to clang++.  It is substantially faster, uses much less memory and has better QoI than G++.  I assume that's not the option that you're suggesting though. :-)


If you want to speed up GCC, I don't have any particular special sauce.  I'd recommend the obvious approach:

1. Decide what sort of builds you care about.  Different scenarios require completely different approaches to improve:
  a. "Few core" laptops, "many core" workstations, massive distributed builds
  b. -O0 -g or -O3 builds.
  c. Memory constrained or memory unconstrained
  d. C, ObjC, C++, other?

2. Measure and evaluate.  You can see some of the (really old by now) measurements that we did for Clang here:
  http://clang.llvm.org/performance.html

3. Act on something, depending on what you think is important and what deficiency is most impactful to that scenario.  For example, we've found and tackled:
  a. Memory use.  On memory constrained systems (e.g. Apple just started shipping shiny new 24-thread machines that default to 6G of ram), this is the single biggest thing you can do to speed up builds.

  b. Debug info: As others have pointed out, in the ELF world, this is a huge sink for link times.  Apple defined this away long ago by changing the debug info model.

  c. PCH: For Objective-C and C++ apps that were built for it, PCH is an amazing win.  If you care about these use cases, it might be worthwhile to reevaluate GCC's PCH model; it "lacks optimality".

  d. Integrated assembler: For C apps, we've got a 10-20% speedup at -O0 -g.  The rest of GCC is probably not fast enough yet for this to start to matter much. See http://blog.llvm.org/2010/04/intro-to-llvm-mc-project.html

  e. General speedups: Clang's preprocessor is roughly 2x faster than GCC's and the frontend is generally much faster.  For example, it uses hash tables instead of lists where appropriate, so it doesn't get N^2 cases in silly situations as often.  I don't know what else GCC is doing wrong; I haven't looked at its frontends much.

  f. Optimizer/backend problems.  LLVM was designed to not have many of the issues GCC has suffered from, but we've made substantial improvements to the memory use of debug info (within the compiler) for example.   GCC seems to continue suffering from a host of historic issues with the RTL backend, and (my impression is) the GIMPLE optimizers are starting to slow down substantially as new stuff gets bolted in without a larger architectural view of the problem.

I'm also interested in looking at more aggressive techniques going forward, such as using a compiler server approach to get sharing across invocations of the compiler etc.  This would speed up apps using PCH in particular.

At root, if you're interested in speeding up GCC, you need to decide what's important and stop the 1% performance regressions.  It may not be obvious at first, but a continuous series of 1% slow-downs is actually an exponential regression.

-Chris


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 21:12                       ` The speed of the compiler, was: Re: Combine four insns Mike Stump
@ 2010-08-09 23:48                         ` Cary Coutant
  0 siblings, 0 replies; 129+ messages in thread
From: Cary Coutant @ 2010-08-09 23:48 UTC (permalink / raw)
  To: Mike Stump
  Cc: Diego Novillo, Toon Moene, Steven Bosscher, Bernd Schmidt,
	Eric Botcazou, Richard Guenther, gcc-patches

> Well, if one uses a technology to engineer out the possibility of
> creating/moving/copying/assembling that information...  Apple found it
> beneficial to leave it behind in the .o files and has the debugger go look in
> the .o files...  more can be done.
>
> For example, the front-end could tap into a live database directly and avoid
> much of the cost.  Instead of writing out the same information 100 times for
> 100 translation units, the first one writes it and the rest just refer to the first.
>  Of course, you'd have to be willing to sign up for the downside of this sort
> of scheme.  Another possibility would be to create the data very lazily
> (little of the debug information ever created is ever used)...

Leaving the debug info in the original .o files is a possibility, but
it has some serious drawbacks. Keeping the .o files around while
you're debugging doesn't fit every workflow, and if you've got
thousands of .o files, gdb performance is going to suffer.

With the DWARF-4 feature of type information in comdat sections, I'm
hoping to achieve a hybrid solution, where the non-type debug
information (which typically contains lots of relocatable content) is
processed normally and gets copied into the linker output, while the
type information (which typically requires little or no relocation)
can be served by that live database you mentioned. With DWARF-4 today,
for each type that the compiler generates debug info for, it forms a
signature and builds a comdat section with that signature as the key.
With a repository, the compiler would offer that signature to the
repository. Nine times out of ten, the repository would already have
the type info; when it doesn't, the compiler could store the type info
there instead of writing it into the object file. The debugger would
then just have one additional place to look when trying to resolve a
reference to a type signature (DW_FORM_ref_sig8).

-cary


* Re: Combine four insns
  2010-08-09 17:02                 ` Chris Lattner
@ 2010-08-10  2:50                   ` Mark Mitchell
  2010-08-10 15:35                     ` Chris Lattner
  0 siblings, 1 reply; 129+ messages in thread
From: Mark Mitchell @ 2010-08-10  2:50 UTC (permalink / raw)
  To: Chris Lattner; +Cc: Jeff Law, Bernd Schmidt, Steven Bosscher, GCC Patches

Chris Lattner wrote:

>>> Is "throw in a param and let the distributors decide" really a great solution to issues like these?
>> Do you have a better one?
> 
> Yes, pick an answer.  Either one is better than a new param IMO.

All I can say is that I disagree.

Even if we didn't want to expose this to users, it would be better if
the code were structured that way.  There's no good justification for
hard-coding constants into the compiler, and that's essentially what
we've done for combine.  We've just done it by coding the algorithm that
way instead of by having "3" or "4" somewhere in the code.

And, the general rule in GCC is that magic constants should be exposed
as --params.  If we want to change that rule, of course, we can do so,
but it's certainly been useful to people.

I don't recommend messing about with --params for most people.  In fact,
CodeSourcery has a knowledgebase entry recommending that people stick
with -Os, -O1, -O2, or -O3!  But, that doesn't mean they aren't useful.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


* Re: Combine four insns
  2010-08-09 12:39               ` Steven Bosscher
  2010-08-09 13:48                 ` Bernd Schmidt
@ 2010-08-10  2:51                 ` Laurynas Biveinis
  1 sibling, 0 replies; 129+ messages in thread
From: Laurynas Biveinis @ 2010-08-10  2:51 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Bernd Schmidt, Eric Botcazou, Richard Guenther, gcc-patches

2010/8/9 Steven Bosscher <stevenb.gcc@gmail.com>:
> This is interesting. Years ago (7 years?) Zack Weinberg suggested that
> GCC should move RTL back onto obstacks. The overhead should be
> relatively small compared to keeping entire functions-as-trees in
> memory. This is even more true today, now we keep entire translation
> units (and more) in memory as GIMPLE (with SSA). Memory spent on RTL
> is marginal compared to that.
>
> I passed this idea to Laurynas a year ago
> (http://gcc.gnu.org/ml/gcc/2009-08/msg00386.html). I don't know if he
> played with the idea or not.

IMHO it's a very good idea and it is now at the top of my GCC TODO
list.  I don't know when I'll get around to doing this (I'm busy in the
final months of grad school), or whether somebody will beat me to it.
Anyway, I also made a note of this patch :)

-- 
Laurynas


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 17:27                       ` The speed of the compiler, was: Re: Combine four insns Joseph S. Myers
  2010-08-09 18:23                         ` Diego Novillo
@ 2010-08-10  6:20                         ` Chiheng Xu
  2010-08-10  7:22                           ` Chiheng Xu
  1 sibling, 1 reply; 129+ messages in thread
From: Chiheng Xu @ 2010-08-10  6:20 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Diego Novillo, Toon Moene, Steven Bosscher, Bernd Schmidt,
	Eric Botcazou, Richard Guenther, gcc-patches

On Tue, Aug 10, 2010 at 1:19 AM, Joseph S. Myers
<joseph@codesourcery.com> wrote:
> On Mon, 9 Aug 2010, Diego Novillo wrote:
>
>> Additionally, the very worst offender in terms of compile time is -g. The size
>> of debugging information is such, that I/O and communication times increase
>> significantly.
>
> If communication between the compiler and assembler is an important part
> of the cost there, it's possible that a binary interface between them as
> suggested by Ian at <http://www.airs.com/blog/archives/268> would help.
> I would imagine it should be possible to get the assembler to accept some
> form of mixed text/binary input so you could just transmit debug info that
> way and transition to a more efficient interface incrementally (assembler
> input goes through a rather complicated sequence of preprocessing /
> processing steps, but cleaning them up to work with such input should be
> possible).
>
Ian Lance Taylor wrote:
"What does make sense is using a structured data format, rather than
text, to communicate between the compiler and the assembler. In gcc’s
terms, the compiler should generate insn patterns with associated
operands. The assembler should piece those together into its own
internal data structures. In fact, of course, ideally gcc and the
assembler would use the same internal data structure. It would be
interesting to design such a structure so that it could be transmitted
in a file, or over a pipe, or in shared memory."
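
As an illustration only, such a structured record might look like the
following sketch; the layout and the name insn_record are made up here,
not an existing GCC or gas interface:

#include <stdint.h>

/* One machine insn as structured data, ready to be written to a file,
   a pipe, or shared memory.  */
struct insn_record
{
  uint32_t pattern;		/* index of the insn pattern in the
				   target's pattern table */
  uint8_t n_operands;
  struct
  {
    uint8_t kind;		/* 0 = register, 1 = immediate,
				   2 = symbol + offset */
    uint64_t value;		/* regno, constant, or symbol index */
    int64_t offset;		/* used when kind == 2 */
  } operands[4];
};

The assembler would map such records straight into its own internal
data structures instead of re-lexing and re-parsing text.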


Using shared memory is by far the most efficient way to transmit a
large amount of data between processes.  Its I/O and communication cost
is roughly zero if you have enough physical memory.

Using a temp file or a pipe is less efficient, because syscalls like
read() and write() do a large amount of copying between user space and
kernel space.  Although a pipe consumes less memory, the pipe buffer
ring in the Linux kernel is only 4k bytes (isn't it?), so using a pipe
to transmit large data also costs a large amount of CPU time scheduling
processes, besides the cost of read() and write().


Using a structured data format, rather than text, to communicate
between the compiler and the assembler may require big changes to the
current compiler architecture, rendering the compiler even harder to
maintain.  For a given machine instruction, its GCC representation is
simple RTL, but its assembly representation is complex and varying.
And, after all, you must provide a text "dump" of the assembly for
debugging purposes.  Separating the compiler and the assembler conforms
to modern software engineering practice.

-- 
Chiheng Xu
Wuhan,China


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10  6:20                         ` Chiheng Xu
@ 2010-08-10  7:22                           ` Chiheng Xu
  0 siblings, 0 replies; 129+ messages in thread
From: Chiheng Xu @ 2010-08-10  7:22 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Diego Novillo, Toon Moene, Steven Bosscher, Bernd Schmidt,
	Eric Botcazou, Richard Guenther, gcc-patches

On Tue, Aug 10, 2010 at 1:57 PM, Chiheng Xu <chiheng.xu@gmail.com> wrote:
>
> Using shared memory is by far the most efficient way to transmit large
> amount data between processes.  It's I/O and communication cost is
> roughly zero if you have enough physical memory.

I mean using a memory-mapped temp file to transmit the data.  This is
essentially the same as shared memory, except that it does not consume
swap space.
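
A minimal sketch of that transport, assuming POSIX mmap on a temp file;
error handling is mostly elided and the function names are made up for
illustration:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Producer: size the temp file to the payload, map it, and fill it.
   In a real compiler the output would be generated directly into the
   mapping, avoiding even this one memcpy.  */
int
send_blob (const char *path, const void *data, size_t len)
{
  int fd = open (path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0 || ftruncate (fd, len) < 0)
    return -1;
  void *p = mmap (NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED)
    return -1;
  memcpy (p, data, len);
  munmap (p, len);
  close (fd);
  return 0;
}

/* Consumer: map the same file read-only and use the data in place --
   no read() copies into a private buffer.  */
const void *
recv_blob (const char *path, size_t len)
{
  int fd = open (path, O_RDONLY);
  void *p = mmap (NULL, len, PROT_READ, MAP_SHARED, fd, 0);
  close (fd);			/* the mapping survives the close */
  return p == MAP_FAILED ? NULL : p;
}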



-- 
Chiheng Xu
Wuhan,China


* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 23:24                             ` Chris Lattner
@ 2010-08-10 13:02                               ` Toon Moene
  2010-08-10 15:36                                 ` Chris Lattner
  2010-08-10 14:58                               ` Andi Kleen
  1 sibling, 1 reply; 129+ messages in thread
From: Toon Moene @ 2010-08-10 13:02 UTC (permalink / raw)
  To: Chris Lattner
  Cc: Diego Novillo, Steven Bosscher, Bernd Schmidt, Eric Botcazou,
	Richard Guenther, gcc-patches

Chris Lattner wrote:

> On Aug 9, 2010, at 10:28 AM, Toon Moene wrote:
> 
>> Diego Novillo wrote:
>>
>>> On 10-08-09 13:07 , Toon Moene wrote:
>>>> Is this also true for C++ ? In that case it might be useful to curb
>>>> Front End optimizations when -O0 is given ...
>>> Not really, the amount of optimization is quite minimal to non-existent.
>>> Much of the slowness is due to the inherent nature of C++ parsing. There is some performance to be gained by tweaking the various data structures and algorithms, but no order-of-magnitude opportunities seem to exist.
>> Perhaps Chris can add something to this discussion - after all, LLVM is written mostly in C++, no ?
>>
>> Certainly, that must have provided him (and his team) with boatloads of performance data ....
> 
> I'm not sure what you mean here.  The single biggest win I've got in my personal development was
> switching from llvm-g++ to clang++.  It is substantially faster, uses much less memory and
> has better QoI than G++.  I assume that's not the option that you're suggesting though. :-)

Well, I just hoped for a list of things where clang++ was faster than 
llvm-g++ and why, but the issues you addressed are probably just as useful ...

Thanks,

[ It would probably also help if we started to build GCC with C++ by
   default, although I imagine that the code isn't C++-like enough
   to guide us through all the issues ]

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html#Fortran


* Re: Combine four insns
  2010-08-06 20:37     ` Richard Guenther
  2010-08-06 21:53       ` Jeff Law
  2010-08-06 22:41       ` Bernd Schmidt
@ 2010-08-10 14:37       ` Bernd Schmidt
  2010-08-10 14:40         ` Richard Guenther
  2010-08-11 12:32         ` Michael Matz
  2 siblings, 2 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-10 14:37 UTC (permalink / raw)
  To: Richard Guenther; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 2213 bytes --]

>> $ grep Trying.four log |wc -l
>> 307743
>> $ grep Trying.three log |wc -l
>> 592776
>> $ grep Trying.two log |wc -l
>> 1643112
>> $ grep Succeeded.two log |wc -l
>> 204808
>> $ grep Succeeded.three.into.two log |wc -l
>> 2976
>> $ grep Succeeded.three.into.one log |wc -l
>> 12473
>> $ grep Succeeded.four.into.two log |wc -l
>> 244
>> $ grep Succeeded.four.into.one log |wc -l
>> 140
> 
> No four into three?  So overall there are one order of magnitude fewer
> threes than twos, and two orders of magnitude fewer fours than threes.

I redid the numbers for Thumb-1, over the same set of input files.  I
left out the two-insn combinations because the greps take ages and the
comparison 3 vs 4 should provide enough information.

$ grep Trying.three log |wc -l
842213
$ grep Trying.four log |wc -l
488682
$ grep Succeeded.four.to.two log|wc -l
759
$ grep Succeeded.four.to.one log|wc -l
163
$ grep Succeeded.three.to.two log|wc -l
6230
$ grep Succeeded.three.to.one log|wc -l
3178

So the patch seems somewhat more effective for this target: 922 of
488682 four-insn attempts succeed (about 0.19%), versus 384 of 307743
(about 0.12%) in the numbers quoted above.  That is unsurprising, as
this target has simpler instructions.

With the following heuristic in try_combine:

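  /* Only try four-insn combinations when there's high likelihood of
     success.  As a heuristic, allow it only if one of the insns loads
     a constant or if there are two shifts.  */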
  if (i0)
    {
      int i;
      int ncst = 0;
      int nshift = 0;
      for (i = 0; i < 3; i++)
	{
	  rtx insn = i == 0 ? i0 : i == 1 ? i1 : i2;
	  rtx set = single_set (insn);

	  if (set && CONSTANT_P (SET_SRC (set)))
	    ncst++;
	  else if (set && (GET_CODE (SET_SRC (set)) == ASHIFT
			   || GET_CODE (SET_SRC (set)) == ASHIFTRT
			   || GET_CODE (SET_SRC (set)) == LSHIFTRT))
	    nshift++;
	}
      if (ncst == 0 && nshift < 2)
	return 0;
    }

$ grep Trying.four log2 |wc -l
187120
$ grep Succeeded.four log2|wc -l
828
$ grep Succeeded.three.to.one log2|wc -l
3161
$ grep Succeeded.three.to.two log2|wc -l
6218

With the heuristic, we still catch the majority of interesting cases on
Thumb-1, with a reduced number of attempts, but we also miss some
optimizations like these:

-       add     r2, r2, #1
        lsl     r2, r2, #5
-       add     r3, r3, r2
-       sub     r3, r3, #32
+       add     r3, r2, r3
====
-       mvn     r3, r3
        lsr     r3, r3, #16
-       mvn     r3, r3
====

In summary, should I check in the patch with the above heuristic?


Bernd

[-- Attachment #2: combine4d.diff --]
[-- Type: text/plain, Size: 43004 bytes --]

Index: combine.c
===================================================================
--- combine.c	(revision 162821)
+++ combine.c	(working copy)
@@ -385,10 +385,10 @@ static void init_reg_last (void);
 static void setup_incoming_promotions (rtx);
 static void set_nonzero_bits_and_sign_copies (rtx, const_rtx, void *);
 static int cant_combine_insn_p (rtx);
-static int can_combine_p (rtx, rtx, rtx, rtx, rtx *, rtx *);
-static int combinable_i3pat (rtx, rtx *, rtx, rtx, int, rtx *);
+static int can_combine_p (rtx, rtx, rtx, rtx, rtx, rtx, rtx *, rtx *);
+static int combinable_i3pat (rtx, rtx *, rtx, rtx, rtx, int, int, rtx *);
 static int contains_muldiv (rtx);
-static rtx try_combine (rtx, rtx, rtx, int *);
+static rtx try_combine (rtx, rtx, rtx, rtx, int *);
 static void undo_all (void);
 static void undo_commit (void);
 static rtx *find_split_point (rtx *, rtx, bool);
@@ -438,7 +438,7 @@ static void reg_dead_at_p_1 (rtx, const_
 static int reg_dead_at_p (rtx, rtx);
 static void move_deaths (rtx, rtx, int, rtx, rtx *);
 static int reg_bitfield_target_p (rtx, rtx);
-static void distribute_notes (rtx, rtx, rtx, rtx, rtx, rtx);
+static void distribute_notes (rtx, rtx, rtx, rtx, rtx, rtx, rtx);
 static void distribute_links (rtx);
 static void mark_used_regs_combine (rtx);
 static void record_promoted_value (rtx, rtx);
@@ -766,7 +766,7 @@ do_SUBST_MODE (rtx *into, enum machine_m
 \f
 /* Subroutine of try_combine.  Determine whether the combine replacement
    patterns NEWPAT, NEWI2PAT and NEWOTHERPAT are cheaper according to
-   insn_rtx_cost that the original instruction sequence I1, I2, I3 and
+   insn_rtx_cost that the original instruction sequence I0, I1, I2, I3 and
    undobuf.other_insn.  Note that I1 and/or NEWI2PAT may be NULL_RTX.
    NEWOTHERPAT and undobuf.other_insn may also both be NULL_RTX.  This
    function returns false, if the costs of all instructions can be
@@ -774,10 +774,10 @@ do_SUBST_MODE (rtx *into, enum machine_m
    sequence.  */
 
 static bool
-combine_validate_cost (rtx i1, rtx i2, rtx i3, rtx newpat, rtx newi2pat,
-		       rtx newotherpat)
+combine_validate_cost (rtx i0, rtx i1, rtx i2, rtx i3, rtx newpat,
+		       rtx newi2pat, rtx newotherpat)
 {
-  int i1_cost, i2_cost, i3_cost;
+  int i0_cost, i1_cost, i2_cost, i3_cost;
   int new_i2_cost, new_i3_cost;
   int old_cost, new_cost;
 
@@ -788,13 +788,23 @@ combine_validate_cost (rtx i1, rtx i2, r
   if (i1)
     {
       i1_cost = INSN_COST (i1);
-      old_cost = (i1_cost > 0 && i2_cost > 0 && i3_cost > 0)
-		 ? i1_cost + i2_cost + i3_cost : 0;
+      if (i0)
+	{
+	  i0_cost = INSN_COST (i0);
+	  old_cost = (i0_cost > 0 && i1_cost > 0 && i2_cost > 0 && i3_cost > 0
+		      ? i0_cost + i1_cost + i2_cost + i3_cost : 0);
+	}
+      else
+	{
+	  old_cost = (i1_cost > 0 && i2_cost > 0 && i3_cost > 0
+		      ? i1_cost + i2_cost + i3_cost : 0);
+	  i0_cost = 0;
+	}
     }
   else
     {
       old_cost = (i2_cost > 0 && i3_cost > 0) ? i2_cost + i3_cost : 0;
-      i1_cost = 0;
+      i1_cost = i0_cost = 0;
     }
 
   /* Calculate the replacement insn_rtx_costs.  */
@@ -833,7 +843,16 @@ combine_validate_cost (rtx i1, rtx i2, r
     {
       if (dump_file)
 	{
-	  if (i1)
+	  if (i0)
+	    {
+	      fprintf (dump_file,
+		       "rejecting combination of insns %d, %d, %d and %d\n",
+		       INSN_UID (i0), INSN_UID (i1), INSN_UID (i2),
+		       INSN_UID (i3));
+	      fprintf (dump_file, "original costs %d + %d + %d + %d = %d\n",
+		       i0_cost, i1_cost, i2_cost, i3_cost, old_cost);
+	    }
+	  else if (i1)
 	    {
 	      fprintf (dump_file,
 		       "rejecting combination of insns %d, %d and %d\n",
@@ -1010,6 +1029,19 @@ clear_log_links (void)
     if (INSN_P (insn))
       free_INSN_LIST_list (&LOG_LINKS (insn));
 }
+
+/* Walk the LOG_LINKS of insn B to see if we find a reference to A.  Return
+   true if we found a LOG_LINK that proves that A feeds B.  */
+
+static bool
+insn_a_feeds_b (rtx a, rtx b)
+{
+  rtx links;
+  for (links = LOG_LINKS (b); links; links = XEXP (links, 1))
+    if (XEXP (links, 0) == a)
+      return true;
+  return false;
+}
 \f
 /* Main entry point for combiner.  F is the first insn of the function.
    NREGS is the first unused pseudo-reg number.
@@ -1150,7 +1182,7 @@ combine_instructions (rtx f, unsigned in
 	      /* Try this insn with each insn it links back to.  */
 
 	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
-		if ((next = try_combine (insn, XEXP (links, 0),
+		if ((next = try_combine (insn, XEXP (links, 0), NULL_RTX,
 					 NULL_RTX, &new_direct_jump_p)) != 0)
 		  goto retry;
 
@@ -1168,8 +1200,8 @@ combine_instructions (rtx f, unsigned in
 		  for (nextlinks = LOG_LINKS (link);
 		       nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, link,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, link, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1187,14 +1219,14 @@ combine_instructions (rtx f, unsigned in
 		  && NONJUMP_INSN_P (prev)
 		  && sets_cc0_p (PATTERN (prev)))
 		{
-		  if ((next = try_combine (insn, prev,
-					   NULL_RTX, &new_direct_jump_p)) != 0)
+		  if ((next = try_combine (insn, prev, NULL_RTX, NULL_RTX,
+					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
 		  for (nextlinks = LOG_LINKS (prev); nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, prev,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, prev, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1207,14 +1239,14 @@ combine_instructions (rtx f, unsigned in
 		  && GET_CODE (PATTERN (insn)) == SET
 		  && reg_mentioned_p (cc0_rtx, SET_SRC (PATTERN (insn))))
 		{
-		  if ((next = try_combine (insn, prev,
-					   NULL_RTX, &new_direct_jump_p)) != 0)
+		  if ((next = try_combine (insn, prev, NULL_RTX, NULL_RTX,
+					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
 		  for (nextlinks = LOG_LINKS (prev); nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, prev,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, prev, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1230,7 +1262,8 @@ combine_instructions (rtx f, unsigned in
 		    && NONJUMP_INSN_P (prev)
 		    && sets_cc0_p (PATTERN (prev))
 		    && (next = try_combine (insn, XEXP (links, 0),
-					    prev, &new_direct_jump_p)) != 0)
+					    prev, NULL_RTX,
+					    &new_direct_jump_p)) != 0)
 		  goto retry;
 #endif
 
@@ -1240,10 +1273,64 @@ combine_instructions (rtx f, unsigned in
 		for (nextlinks = XEXP (links, 1); nextlinks;
 		     nextlinks = XEXP (nextlinks, 1))
 		  if ((next = try_combine (insn, XEXP (links, 0),
-					   XEXP (nextlinks, 0),
+					   XEXP (nextlinks, 0), NULL_RTX,
 					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
+	      /* Try four-instruction combinations.  */
+	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
+		{
+		  rtx next1;
+		  rtx link = XEXP (links, 0);
+
+		  /* If the linked insn has been replaced by a note, then there
+		     is no point in pursuing this chain any further.  */
+		  if (NOTE_P (link))
+		    continue;
+
+		  for (next1 = LOG_LINKS (link); next1; next1 = XEXP (next1, 1))
+		    {
+		      rtx link1 = XEXP (next1, 0);
+		      if (NOTE_P (link1))
+			continue;
+		      /* I0 -> I1 -> I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		      /* I0, I1 -> I2, I2 -> I3.  */
+		      for (nextlinks = XEXP (next1, 1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		    }
+
+		  for (next1 = XEXP (links, 1); next1; next1 = XEXP (next1, 1))
+		    {
+		      rtx link1 = XEXP (next1, 0);
+		      if (NOTE_P (link1))
+			continue;
+		      /* I0 -> I2; I1, I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		      /* I0 -> I1; I1, I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		    }
+		}
+
 	      /* Try this insn with each REG_EQUAL note it links back to.  */
 	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
 		{
@@ -1267,7 +1354,7 @@ combine_instructions (rtx f, unsigned in
 		      i2mod = temp;
 		      i2mod_old_rhs = copy_rtx (orig);
 		      i2mod_new_rhs = copy_rtx (note);
-		      next = try_combine (insn, i2mod, NULL_RTX,
+		      next = try_combine (insn, i2mod, NULL_RTX, NULL_RTX,
 					  &new_direct_jump_p);
 		      i2mod = NULL_RTX;
 		      if (next)
@@ -1529,9 +1616,10 @@ set_nonzero_bits_and_sign_copies (rtx x,
     }
 }
 \f
-/* See if INSN can be combined into I3.  PRED and SUCC are optionally
-   insns that were previously combined into I3 or that will be combined
-   into the merger of INSN and I3.
+/* See if INSN can be combined into I3.  PRED, PRED2, SUCC and SUCC2 are
+   optionally insns that were previously combined into I3 or that will be
+   combined into the merger of INSN and I3.  The order is PRED, PRED2,
+   INSN, SUCC, SUCC2, I3.
 
    Return 0 if the combination is not allowed for any reason.
 
@@ -1540,7 +1628,8 @@ set_nonzero_bits_and_sign_copies (rtx x,
    will return 1.  */
 
 static int
-can_combine_p (rtx insn, rtx i3, rtx pred ATTRIBUTE_UNUSED, rtx succ,
+can_combine_p (rtx insn, rtx i3, rtx pred ATTRIBUTE_UNUSED,
+	       rtx pred2 ATTRIBUTE_UNUSED, rtx succ, rtx succ2,
 	       rtx *pdest, rtx *psrc)
 {
   int i;
@@ -1550,10 +1639,25 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 #ifdef AUTO_INC_DEC
   rtx link;
 #endif
-  int all_adjacent = (succ ? (next_active_insn (insn) == succ
-			      && next_active_insn (succ) == i3)
-		      : next_active_insn (insn) == i3);
+  bool all_adjacent = true;
 
+  if (succ)
+    {
+      if (succ2)
+	{
+	  if (next_active_insn (succ2) != i3)
+	    all_adjacent = false;
+	  if (next_active_insn (succ) != succ2)
+	    all_adjacent = false;
+	}
+      else if (next_active_insn (succ) != i3)
+	all_adjacent = false;
+      if (next_active_insn (insn) != succ)
+	all_adjacent = false;
+    }
+  else if (next_active_insn (insn) != i3)
+    all_adjacent = false;
+    
   /* Can combine only if previous insn is a SET of a REG, a SUBREG or CC0.
      or a PARALLEL consisting of such a SET and CLOBBERs.
 
@@ -1678,11 +1782,15 @@ can_combine_p (rtx insn, rtx i3, rtx pre
       /* Don't substitute into an incremented register.  */
       || FIND_REG_INC_NOTE (i3, dest)
       || (succ && FIND_REG_INC_NOTE (succ, dest))
+      || (succ2 && FIND_REG_INC_NOTE (succ2, dest))
       /* Don't substitute into a non-local goto, this confuses CFG.  */
       || (JUMP_P (i3) && find_reg_note (i3, REG_NON_LOCAL_GOTO, NULL_RTX))
       /* Make sure that DEST is not used after SUCC but before I3.  */
-      || (succ && ! all_adjacent
-	  && reg_used_between_p (dest, succ, i3))
+      || (!all_adjacent
+	  && ((succ2
+	       && (reg_used_between_p (dest, succ2, i3)
+		   || reg_used_between_p (dest, succ, succ2)))
+	      || (!succ2 && succ && reg_used_between_p (dest, succ, i3))))
       /* Make sure that the value that is to be substituted for the register
 	 does not use any registers whose values alter in between.  However,
 	 If the insns are adjacent, a use can't cross a set even though we
@@ -1765,13 +1873,12 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 
   if (GET_CODE (src) == ASM_OPERANDS || volatile_refs_p (src))
     {
-      /* Make sure succ doesn't contain a volatile reference.  */
+      /* Make sure neither succ nor succ2 contains a volatile reference.  */
+      if (succ2 != 0 && volatile_refs_p (PATTERN (succ2)))
+	return 0;
       if (succ != 0 && volatile_refs_p (PATTERN (succ)))
 	return 0;
-
-      for (p = NEXT_INSN (insn); p != i3; p = NEXT_INSN (p))
-	if (INSN_P (p) && p != succ && volatile_refs_p (PATTERN (p)))
-	  return 0;
+      /* We'll check insns between INSN and I3 below.  */
     }
 
   /* If INSN is an asm, and DEST is a hard register, reject, since it has
@@ -1785,7 +1892,7 @@ can_combine_p (rtx insn, rtx i3, rtx pre
      they might affect machine state.  */
 
   for (p = NEXT_INSN (insn); p != i3; p = NEXT_INSN (p))
-    if (INSN_P (p) && p != succ && volatile_insn_p (PATTERN (p)))
+    if (INSN_P (p) && p != succ && p != succ2 && volatile_insn_p (PATTERN (p)))
       return 0;
 
   /* If INSN contains an autoincrement or autodecrement, make sure that
@@ -1801,8 +1908,12 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 	    || reg_used_between_p (XEXP (link, 0), insn, i3)
 	    || (pred != NULL_RTX
 		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (pred)))
+	    || (pred2 != NULL_RTX
+		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (pred2)))
 	    || (succ != NULL_RTX
 		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (succ)))
+	    || (succ2 != NULL_RTX
+		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (succ2)))
 	    || reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (i3))))
       return 0;
 #endif
@@ -1836,8 +1947,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    of a PARALLEL of the pattern.  We validate that it is valid for combining.
 
    One problem is if I3 modifies its output, as opposed to replacing it
-   entirely, we can't allow the output to contain I2DEST or I1DEST as doing
-   so would produce an insn that is not equivalent to the original insns.
+   entirely, we can't allow the output to contain I2DEST, I1DEST or I0DEST as
+   doing so would produce an insn that is not equivalent to the original insns.
 
    Consider:
 
@@ -1858,7 +1969,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    must reject the combination.  This case occurs when I2 and I1 both
    feed into I3, rather than when I1 feeds into I2, which feeds into I3.
    If I1_NOT_IN_SRC is nonzero, it means that finding I1 in the source
-   of a SET must prevent combination from occurring.
+   of a SET must prevent combination from occurring.  The same situation
+   can occur for I0, in which case I0_NOT_IN_SRC is set.
 
    Before doing the above check, we first try to expand a field assignment
    into a set of logical operations.
@@ -1870,8 +1982,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    Return 1 if the combination is valid, zero otherwise.  */
 
 static int
-combinable_i3pat (rtx i3, rtx *loc, rtx i2dest, rtx i1dest,
-		  int i1_not_in_src, rtx *pi3dest_killed)
+combinable_i3pat (rtx i3, rtx *loc, rtx i2dest, rtx i1dest, rtx i0dest,
+		  int i1_not_in_src, int i0_not_in_src, rtx *pi3dest_killed)
 {
   rtx x = *loc;
 
@@ -1895,9 +2007,11 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
       if ((inner_dest != dest &&
 	   (!MEM_P (inner_dest)
 	    || rtx_equal_p (i2dest, inner_dest)
-	    || (i1dest && rtx_equal_p (i1dest, inner_dest)))
+	    || (i1dest && rtx_equal_p (i1dest, inner_dest))
+	    || (i0dest && rtx_equal_p (i0dest, inner_dest)))
 	   && (reg_overlap_mentioned_p (i2dest, inner_dest)
-	       || (i1dest && reg_overlap_mentioned_p (i1dest, inner_dest))))
+	       || (i1dest && reg_overlap_mentioned_p (i1dest, inner_dest))
+	       || (i0dest && reg_overlap_mentioned_p (i0dest, inner_dest))))
 
 	  /* This is the same test done in can_combine_p except we can't test
 	     all_adjacent; we don't have to, since this instruction will stay
@@ -1913,7 +2027,8 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
 	      && REGNO (inner_dest) < FIRST_PSEUDO_REGISTER
 	      && (! HARD_REGNO_MODE_OK (REGNO (inner_dest),
 					GET_MODE (inner_dest))))
-	  || (i1_not_in_src && reg_overlap_mentioned_p (i1dest, src)))
+	  || (i1_not_in_src && reg_overlap_mentioned_p (i1dest, src))
+	  || (i0_not_in_src && reg_overlap_mentioned_p (i0dest, src)))
 	return 0;
 
       /* If DEST is used in I3, it is being killed in this insn, so
@@ -1953,8 +2068,8 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
       int i;
 
       for (i = 0; i < XVECLEN (x, 0); i++)
-	if (! combinable_i3pat (i3, &XVECEXP (x, 0, i), i2dest, i1dest,
-				i1_not_in_src, pi3dest_killed))
+	if (! combinable_i3pat (i3, &XVECEXP (x, 0, i), i2dest, i1dest, i0dest,
+				i1_not_in_src, i0_not_in_src, pi3dest_killed))
 	  return 0;
     }
 
@@ -2364,15 +2479,15 @@ update_cfg_for_uncondjump (rtx insn)
     single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
 }
 
+/* Try to combine the insns I0, I1 and I2 into I3.
+   Here I0, I1 and I2 appear earlier than I3.
+   I0 and I1 can be zero; then we combine just I2 into I3, or I1 and I2 into
+   I3.
 
-/* Try to combine the insns I1 and I2 into I3.
-   Here I1 and I2 appear earlier than I3.
-   I1 can be zero; then we combine just I2 into I3.
-
-   If we are combining three insns and the resulting insn is not recognized,
-   try splitting it into two insns.  If that happens, I2 and I3 are retained
-   and I1 is pseudo-deleted by turning it into a NOTE.  Otherwise, I1 and I2
-   are pseudo-deleted.
+   If we are combining more than two insns and the resulting insn is not
+   recognized, try splitting it into two insns.  If that happens, I2 and I3
+   are retained and I1/I0 are pseudo-deleted by turning them into a NOTE.
+   Otherwise, I0, I1 and I2 are pseudo-deleted.
 
    Return 0 if the combination does not work.  Then nothing is changed.
    If we did the combination, return the insn at which combine should
@@ -2382,34 +2497,38 @@ update_cfg_for_uncondjump (rtx insn)
    new direct jump instruction.  */
 
 static rtx
-try_combine (rtx i3, rtx i2, rtx i1, int *new_direct_jump_p)
+try_combine (rtx i3, rtx i2, rtx i1, rtx i0, int *new_direct_jump_p)
 {
   /* New patterns for I3 and I2, respectively.  */
   rtx newpat, newi2pat = 0;
   rtvec newpat_vec_with_clobbers = 0;
-  int substed_i2 = 0, substed_i1 = 0;
-  /* Indicates need to preserve SET in I1 or I2 in I3 if it is not dead.  */
-  int added_sets_1, added_sets_2;
+  int substed_i2 = 0, substed_i1 = 0, substed_i0 = 0;
+  /* Indicates need to preserve SET in I0, I1 or I2 in I3 if it is not
+     dead.  */
+  int added_sets_0, added_sets_1, added_sets_2;
   /* Total number of SETs to put into I3.  */
   int total_sets;
-  /* Nonzero if I2's body now appears in I3.  */
-  int i2_is_used;
+  /* Nonzero if I2's or I1's body now appears in I3.  */
+  int i2_is_used, i1_is_used;
   /* INSN_CODEs for new I3, new I2, and user of condition code.  */
   int insn_code_number, i2_code_number = 0, other_code_number = 0;
   /* Contains I3 if the destination of I3 is used in its source, which means
      that the old life of I3 is being killed.  If that usage is placed into
      I2 and not in I3, a REG_DEAD note must be made.  */
   rtx i3dest_killed = 0;
-  /* SET_DEST and SET_SRC of I2 and I1.  */
-  rtx i2dest = 0, i2src = 0, i1dest = 0, i1src = 0;
+  /* SET_DEST and SET_SRC of I2, I1 and I0.  */
+  rtx i2dest = 0, i2src = 0, i1dest = 0, i1src = 0, i0dest = 0, i0src = 0;
   /* Set if I2DEST was reused as a scratch register.  */
   bool i2scratch = false;
-  /* PATTERN (I1) and PATTERN (I2), or a copy of it in certain cases.  */
-  rtx i1pat = 0, i2pat = 0;
+  /* The PATTERNs of I0, I1, and I2, or a copy of them in certain cases.  */
+  rtx i0pat = 0, i1pat = 0, i2pat = 0;
   /* Indicates if I2DEST or I1DEST is in I2SRC or I1_SRC.  */
   int i2dest_in_i2src = 0, i1dest_in_i1src = 0, i2dest_in_i1src = 0;
-  int i2dest_killed = 0, i1dest_killed = 0;
+  int i0dest_in_i0src = 0, i1dest_in_i0src = 0, i2dest_in_i0src = 0;
+  int i2dest_killed = 0, i1dest_killed = 0, i0dest_killed;
   int i1_feeds_i3 = 0;
+  int i1_feeds_i3_n = 0, i1_feeds_i2_n = 0, i0_feeds_i3_n = 0;
+  int i0_feeds_i2_n = 0, i0_feeds_i1_n = 0;
   /* Notes that must be added to REG_NOTES in I3 and I2.  */
   rtx new_i3_notes, new_i2_notes;
   /* Notes that we substituted I3 into I2 instead of the normal case.  */
@@ -2426,11 +2545,40 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   rtx new_other_notes;
   int i;
 
+  /* Only try four-insn combinations when there's high likelihood of
+     success.  As a heuristic, allow it only if one of the insns loads
+     a constant or if there are two shifts.  */
+  if (i0)
+    {
+      int i;
+      int ncst = 0;
+      int nshift = 0;
+
+      if (!flag_expensive_optimizations)
+	return 0;
+
+      for (i = 0; i < 3; i++)
+	{
+	  rtx insn = i == 0 ? i0 : i == 1 ? i1 : i2;
+	  rtx set = single_set (insn);
+
+	  if (set && CONSTANT_P (SET_SRC (set)))
+	    ncst++;
+	  else if (set && (GET_CODE (SET_SRC (set)) == ASHIFT
+			   || GET_CODE (SET_SRC (set)) == ASHIFTRT
+			   || GET_CODE (SET_SRC (set)) == LSHIFTRT))
+	    nshift++;
+	}
+      if (ncst == 0 && nshift < 2)
+	return 0;
+    }
+
   /* Exit early if one of the insns involved can't be used for
      combinations.  */
   if (cant_combine_insn_p (i3)
       || cant_combine_insn_p (i2)
       || (i1 && cant_combine_insn_p (i1))
+      || (i0 && cant_combine_insn_p (i0))
       || likely_spilled_retval_p (i3))
     return 0;
 
@@ -2442,7 +2590,10 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
-      if (i1)
+      if (i0)
+	fprintf (dump_file, "\nTrying %d, %d, %d -> %d:\n",
+		 INSN_UID (i0), INSN_UID (i1), INSN_UID (i2), INSN_UID (i3));
+      else if (i1)
 	fprintf (dump_file, "\nTrying %d, %d -> %d:\n",
 		 INSN_UID (i1), INSN_UID (i2), INSN_UID (i3));
       else
@@ -2450,8 +2601,12 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 		 INSN_UID (i2), INSN_UID (i3));
     }
 
-  /* If I1 and I2 both feed I3, they can be in any order.  To simplify the
-     code below, set I1 to be the earlier of the two insns.  */
+  /* If multiple insns feed into one of I2 or I3, they can be in any
+     order.  To simplify the code below, reorder them in sequence.  */
+  if (i0 && DF_INSN_LUID (i0) > DF_INSN_LUID (i2))
+    temp = i2, i2 = i0, i0 = temp;
+  if (i0 && DF_INSN_LUID (i0) > DF_INSN_LUID (i1))
+    temp = i1, i1 = i0, i0 = temp;
   if (i1 && DF_INSN_LUID (i1) > DF_INSN_LUID (i2))
     temp = i1, i1 = i2, i2 = temp;
 
@@ -2673,8 +2828,11 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 #endif
 
   /* Verify that I2 and I1 are valid for combining.  */
-  if (! can_combine_p (i2, i3, i1, NULL_RTX, &i2dest, &i2src)
-      || (i1 && ! can_combine_p (i1, i3, NULL_RTX, i2, &i1dest, &i1src)))
+  if (! can_combine_p (i2, i3, i0, i1, NULL_RTX, NULL_RTX, &i2dest, &i2src)
+      || (i1 && ! can_combine_p (i1, i3, i0, NULL_RTX, i2, NULL_RTX,
+				 &i1dest, &i1src))
+      || (i0 && ! can_combine_p (i0, i3, NULL_RTX, NULL_RTX, i1, i2,
+				 &i0dest, &i0src)))
     {
       undo_all ();
       return 0;
@@ -2685,16 +2843,27 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   i2dest_in_i2src = reg_overlap_mentioned_p (i2dest, i2src);
   i1dest_in_i1src = i1 && reg_overlap_mentioned_p (i1dest, i1src);
   i2dest_in_i1src = i1 && reg_overlap_mentioned_p (i2dest, i1src);
+  i0dest_in_i0src = i0 && reg_overlap_mentioned_p (i0dest, i0src);
+  i1dest_in_i0src = i0 && reg_overlap_mentioned_p (i1dest, i0src);
+  i2dest_in_i0src = i0 && reg_overlap_mentioned_p (i2dest, i0src);
   i2dest_killed = dead_or_set_p (i2, i2dest);
   i1dest_killed = i1 && dead_or_set_p (i1, i1dest);
+  i0dest_killed = i0 && dead_or_set_p (i0, i0dest);
 
   /* See if I1 directly feeds into I3.  It does if I1DEST is not used
      in I2SRC.  */
   i1_feeds_i3 = i1 && ! reg_overlap_mentioned_p (i1dest, i2src);
+  i1_feeds_i2_n = i1 && insn_a_feeds_b (i1, i2);
+  i1_feeds_i3_n = i1 && insn_a_feeds_b (i1, i3);
+  i0_feeds_i3_n = i0 && insn_a_feeds_b (i0, i3);
+  i0_feeds_i2_n = i0 && insn_a_feeds_b (i0, i2);
+  i0_feeds_i1_n = i0 && insn_a_feeds_b (i0, i1);
 
   /* Ensure that I3's pattern can be the destination of combines.  */
-  if (! combinable_i3pat (i3, &PATTERN (i3), i2dest, i1dest,
+  if (! combinable_i3pat (i3, &PATTERN (i3), i2dest, i1dest, i0dest,
 			  i1 && i2dest_in_i1src && i1_feeds_i3,
+			  i0 && ((i2dest_in_i0src && i0_feeds_i3_n)
+				 || (i1dest_in_i0src && !i0_feeds_i1_n)),
 			  &i3dest_killed))
     {
       undo_all ();
@@ -2706,6 +2875,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      here.  */
   if (GET_CODE (i2src) == MULT
       || (i1 != 0 && GET_CODE (i1src) == MULT)
+      || (i0 != 0 && GET_CODE (i0src) == MULT)
       || (GET_CODE (PATTERN (i3)) == SET
 	  && GET_CODE (SET_SRC (PATTERN (i3))) == MULT))
     have_mult = 1;
@@ -2745,14 +2915,22 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      feed into I3, the set in I1 needs to be kept around if I1DEST dies
      or is set in I3.  Otherwise (if I1 feeds I2 which feeds I3), the set
      in I1 needs to be kept around unless I1DEST dies or is set in either
-     I2 or I3.  We can distinguish these cases by seeing if I2SRC mentions
-     I1DEST.  If so, we know I1 feeds into I2.  */
+     I2 or I3.  The same consideration applies to I0.  */
 
-  added_sets_2 = ! dead_or_set_p (i3, i2dest);
+  added_sets_2 = !dead_or_set_p (i3, i2dest);
 
-  added_sets_1
-    = i1 && ! (i1_feeds_i3 ? dead_or_set_p (i3, i1dest)
-	       : (dead_or_set_p (i3, i1dest) || dead_or_set_p (i2, i1dest)));
+  if (i1)
+    added_sets_1 = !((i1_feeds_i3_n && dead_or_set_p (i3, i1dest))
+		     || (i1_feeds_i2_n && dead_or_set_p (i2, i1dest)));
+  else
+    added_sets_1 = 0;
+
+  if (i0)
+    added_sets_0 =  !((i0_feeds_i3_n && dead_or_set_p (i3, i0dest))
+		      || (i0_feeds_i2_n && dead_or_set_p (i2, i0dest))
+		      || (i0_feeds_i1_n && dead_or_set_p (i1, i0dest)));
+  else
+    added_sets_0 = 0;
 
   /* If the set in I2 needs to be kept around, we must make a copy of
      PATTERN (I2), so that when we substitute I1SRC for I1DEST in
@@ -2777,6 +2955,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	i1pat = copy_rtx (PATTERN (i1));
     }
 
+  if (added_sets_0)
+    {
+      if (GET_CODE (PATTERN (i0)) == PARALLEL)
+	i0pat = gen_rtx_SET (VOIDmode, i0dest, copy_rtx (i0src));
+      else
+	i0pat = copy_rtx (PATTERN (i0));
+    }
+
   combine_merges++;
 
   /* Substitute in the latest insn for the regs set by the earlier ones.  */
@@ -2825,8 +3011,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 					      i2src, const0_rtx))
 	      != GET_MODE (SET_DEST (newpat))))
 	{
-	  if (can_change_dest_mode(SET_DEST (newpat), added_sets_2,
-				   compare_mode))
+	  if (can_change_dest_mode (SET_DEST (newpat), added_sets_2,
+				    compare_mode))
 	    {
 	      unsigned int regno = REGNO (SET_DEST (newpat));
 	      rtx new_dest;
@@ -2889,13 +3075,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
       n_occurrences = 0;		/* `subst' counts here */
 
-      /* If I1 feeds into I2 (not into I3) and I1DEST is in I1SRC, we
-	 need to make a unique copy of I2SRC each time we substitute it
-	 to avoid self-referential rtl.  */
+      /* If I1 feeds into I2 and I1DEST is in I1SRC, we need to make a
+	 unique copy of I2SRC each time we substitute it to avoid
+	 self-referential rtl.  */
 
       subst_low_luid = DF_INSN_LUID (i2);
       newpat = subst (PATTERN (i3), i2dest, i2src, 0,
-		      ! i1_feeds_i3 && i1dest_in_i1src);
+		      ((i1_feeds_i2_n && i1dest_in_i1src)
+		       || (i0_feeds_i2_n && i0dest_in_i0src)));
       substed_i2 = 1;
 
       /* Record whether i2's body now appears within i3's body.  */
@@ -2911,13 +3098,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	 This happens if I1DEST is mentioned in I2 and dies there, and
 	 has disappeared from the new pattern.  */
       if ((FIND_REG_INC_NOTE (i1, NULL_RTX) != 0
-	   && !i1_feeds_i3
+	   && i1_feeds_i2_n
 	   && dead_or_set_p (i2, i1dest)
 	   && !reg_overlap_mentioned_p (i1dest, newpat))
 	  /* Before we can do this substitution, we must redo the test done
 	     above (see detailed comments there) that ensures  that I1DEST
 	     isn't mentioned in any SETs in NEWPAT that are field assignments.  */
-          || !combinable_i3pat (NULL_RTX, &newpat, i1dest, NULL_RTX, 0, 0))
+          || !combinable_i3pat (NULL_RTX, &newpat, i1dest, NULL_RTX, NULL_RTX,
+				0, 0, 0))
 	{
 	  undo_all ();
 	  return 0;
@@ -2925,8 +3113,29 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
       n_occurrences = 0;
       subst_low_luid = DF_INSN_LUID (i1);
-      newpat = subst (newpat, i1dest, i1src, 0, 0);
+      newpat = subst (newpat, i1dest, i1src, 0,
+		      i0_feeds_i1_n && i0dest_in_i0src);
       substed_i1 = 1;
+      i1_is_used = n_occurrences;
+    }
+  if (i0 && GET_CODE (newpat) != CLOBBER)
+    {
+      if ((FIND_REG_INC_NOTE (i0, NULL_RTX) != 0
+	   && ((i0_feeds_i2_n && dead_or_set_p (i2, i0dest))
+	       || (i0_feeds_i1_n && dead_or_set_p (i1, i0dest)))
+	   && !reg_overlap_mentioned_p (i0dest, newpat))
+          || !combinable_i3pat (NULL_RTX, &newpat, i0dest, NULL_RTX, NULL_RTX,
+				0, 0, 0))
+	{
+	  undo_all ();
+	  return 0;
+	}
+
+      n_occurrences = 0;
+      subst_low_luid = DF_INSN_LUID (i1);
+      newpat = subst (newpat, i0dest, i0src, 0,
+		      i0_feeds_i1_n && i0dest_in_i0src);
+      substed_i0 = 1;
     }
 
   /* Fail if an autoincrement side-effect has been duplicated.  Be careful
@@ -2934,7 +3143,12 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   if ((FIND_REG_INC_NOTE (i2, NULL_RTX) != 0
        && i2_is_used + added_sets_2 > 1)
       || (i1 != 0 && FIND_REG_INC_NOTE (i1, NULL_RTX) != 0
-	  && (n_occurrences + added_sets_1 + (added_sets_2 && ! i1_feeds_i3)
+	  && (i1_is_used + added_sets_1 + (added_sets_2 && i1_feeds_i2_n)
+	      > 1))
+      || (i0 != 0 && FIND_REG_INC_NOTE (i0, NULL_RTX) != 0
+	  && (n_occurrences + added_sets_0
+	      + (added_sets_1 && i0_feeds_i1_n)
+	      + (added_sets_2 && i0_feeds_i2_n)
 	      > 1))
       /* Fail if we tried to make a new register.  */
       || max_reg_num () != maxreg
@@ -2954,14 +3168,15 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      we must make a new PARALLEL for the latest insn
      to hold additional the SETs.  */
 
-  if (added_sets_1 || added_sets_2)
+  if (added_sets_0 || added_sets_1 || added_sets_2)
     {
+      int extra_sets = added_sets_0 + added_sets_1 + added_sets_2;
       combine_extras++;
 
       if (GET_CODE (newpat) == PARALLEL)
 	{
 	  rtvec old = XVEC (newpat, 0);
-	  total_sets = XVECLEN (newpat, 0) + added_sets_1 + added_sets_2;
+	  total_sets = XVECLEN (newpat, 0) + extra_sets;
 	  newpat = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (total_sets));
 	  memcpy (XVEC (newpat, 0)->elem, &old->elem[0],
 		  sizeof (old->elem[0]) * old->num_elem);
@@ -2969,25 +3184,31 @@ try_combine (rtx i3, rtx i2, rtx i1, int
       else
 	{
 	  rtx old = newpat;
-	  total_sets = 1 + added_sets_1 + added_sets_2;
+	  total_sets = 1 + extra_sets;
 	  newpat = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (total_sets));
 	  XVECEXP (newpat, 0, 0) = old;
 	}
 
+      if (added_sets_0)
+	XVECEXP (newpat, 0, --total_sets) = i0pat;
+
       if (added_sets_1)
-	XVECEXP (newpat, 0, --total_sets) = i1pat;
+	{
+	  rtx t = i1pat;
+	  if (i0_feeds_i1_n)
+	    t = subst (t, i0dest, i0src, 0, 0);
 
+	  XVECEXP (newpat, 0, --total_sets) = t;
+	}
       if (added_sets_2)
 	{
-	  /* If there is no I1, use I2's body as is.  We used to also not do
-	     the subst call below if I2 was substituted into I3,
-	     but that could lose a simplification.  */
-	  if (i1 == 0)
-	    XVECEXP (newpat, 0, --total_sets) = i2pat;
-	  else
-	    /* See comment where i2pat is assigned.  */
-	    XVECEXP (newpat, 0, --total_sets)
-	      = subst (i2pat, i1dest, i1src, 0, 0);
+	  rtx t = i2pat;
+	  if (i0_feeds_i2_n)
+	    t = subst (t, i0dest, i0src, 0, 0);
+	  if (i1_feeds_i2_n)
+	    t = subst (t, i1dest, i1src, 0, 0);
+
+	  XVECEXP (newpat, 0, --total_sets) = t;
 	}
     }
 
@@ -3543,7 +3764,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
   /* Only allow this combination if insn_rtx_costs reports that the
      replacement instructions are cheaper than the originals.  */
-  if (!combine_validate_cost (i1, i2, i3, newpat, newi2pat, other_pat))
+  if (!combine_validate_cost (i0, i1, i2, i3, newpat, newi2pat, other_pat))
     {
       undo_all ();
       return 0;
@@ -3642,7 +3863,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	}
 
       distribute_notes (new_other_notes, undobuf.other_insn,
-			undobuf.other_insn, NULL_RTX, NULL_RTX, NULL_RTX);
+			undobuf.other_insn, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
     }
 
   if (swap_i2i3)
@@ -3689,21 +3911,26 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     }
 
   {
-    rtx i3notes, i2notes, i1notes = 0;
-    rtx i3links, i2links, i1links = 0;
+    rtx i3notes, i2notes, i1notes = 0, i0notes = 0;
+    rtx i3links, i2links, i1links = 0, i0links = 0;
     rtx midnotes = 0;
+    int from_luid;
     unsigned int regno;
     /* Compute which registers we expect to eliminate.  newi2pat may be setting
        either i3dest or i2dest, so we must check it.  Also, i1dest may be the
        same as i3dest, in which case newi2pat may be setting i1dest.  */
     rtx elim_i2 = ((newi2pat && reg_set_p (i2dest, newi2pat))
-		   || i2dest_in_i2src || i2dest_in_i1src
+		   || i2dest_in_i2src || i2dest_in_i1src || i2dest_in_i0src
 		   || !i2dest_killed
 		   ? 0 : i2dest);
-    rtx elim_i1 = (i1 == 0 || i1dest_in_i1src
+    rtx elim_i1 = (i1 == 0 || i1dest_in_i1src || i1dest_in_i0src
 		   || (newi2pat && reg_set_p (i1dest, newi2pat))
 		   || !i1dest_killed
 		   ? 0 : i1dest);
+    rtx elim_i0 = (i0 == 0 || i0dest_in_i0src
+		   || (newi2pat && reg_set_p (i0dest, newi2pat))
+		   || !i0dest_killed
+		   ? 0 : i0dest);
 
     /* Get the old REG_NOTES and LOG_LINKS from all our insns and
        clear them.  */
@@ -3711,6 +3938,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     i2notes = REG_NOTES (i2), i2links = LOG_LINKS (i2);
     if (i1)
       i1notes = REG_NOTES (i1), i1links = LOG_LINKS (i1);
+    if (i0)
+      i0notes = REG_NOTES (i0), i0links = LOG_LINKS (i0);
 
     /* Ensure that we do not have something that should not be shared but
        occurs multiple times in the new insns.  Check this by first
@@ -3719,6 +3948,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     reset_used_flags (i3notes);
     reset_used_flags (i2notes);
     reset_used_flags (i1notes);
+    reset_used_flags (i0notes);
     reset_used_flags (newpat);
     reset_used_flags (newi2pat);
     if (undobuf.other_insn)
@@ -3727,6 +3957,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     i3notes = copy_rtx_if_shared (i3notes);
     i2notes = copy_rtx_if_shared (i2notes);
     i1notes = copy_rtx_if_shared (i1notes);
+    i0notes = copy_rtx_if_shared (i0notes);
     newpat = copy_rtx_if_shared (newpat);
     newi2pat = copy_rtx_if_shared (newi2pat);
     if (undobuf.other_insn)
@@ -3753,6 +3984,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
 	if (substed_i1)
 	  replace_rtx (call_usage, i1dest, i1src);
+	if (substed_i0)
+	  replace_rtx (call_usage, i0dest, i0src);
 
 	CALL_INSN_FUNCTION_USAGE (i3) = call_usage;
       }
@@ -3827,43 +4060,58 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	SET_INSN_DELETED (i1);
       }
 
+    if (i0)
+      {
+	LOG_LINKS (i0) = 0;
+	REG_NOTES (i0) = 0;
+	if (MAY_HAVE_DEBUG_INSNS)
+	  propagate_for_debug (i0, i3, i0dest, i0src, false);
+	SET_INSN_DELETED (i0);
+      }
+
     /* Get death notes for everything that is now used in either I3 or
        I2 and used to die in a previous insn.  If we built two new
        patterns, move from I1 to I2 then I2 to I3 so that we get the
        proper movement on registers that I2 modifies.  */
 
-    if (newi2pat)
-      {
-	move_deaths (newi2pat, NULL_RTX, DF_INSN_LUID (i1), i2, &midnotes);
-	move_deaths (newpat, newi2pat, DF_INSN_LUID (i1), i3, &midnotes);
-      }
+    if (i0)
+      from_luid = DF_INSN_LUID (i0);
+    else if (i1)
+      from_luid = DF_INSN_LUID (i1);
     else
-      move_deaths (newpat, NULL_RTX, i1 ? DF_INSN_LUID (i1) : DF_INSN_LUID (i2),
-		   i3, &midnotes);
+      from_luid = DF_INSN_LUID (i2);
+    if (newi2pat)
+      move_deaths (newi2pat, NULL_RTX, from_luid, i2, &midnotes);
+    move_deaths (newpat, newi2pat, from_luid, i3, &midnotes);
 
     /* Distribute all the LOG_LINKS and REG_NOTES from I1, I2, and I3.  */
     if (i3notes)
       distribute_notes (i3notes, i3, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
     if (i2notes)
       distribute_notes (i2notes, i2, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
     if (i1notes)
       distribute_notes (i1notes, i1, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
+    if (i0notes)
+      distribute_notes (i0notes, i0, i3, newi2pat ? i2 : NULL_RTX,
+			elim_i2, elim_i1, elim_i0);
     if (midnotes)
       distribute_notes (midnotes, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
 
     /* Distribute any notes added to I2 or I3 by recog_for_combine.  We
        know these are REG_UNUSED and want them to go to the desired insn,
        so we always pass it as i3.  */
 
     if (newi2pat && new_i2_notes)
-      distribute_notes (new_i2_notes, i2, i2, NULL_RTX, NULL_RTX, NULL_RTX);
+      distribute_notes (new_i2_notes, i2, i2, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
 
     if (new_i3_notes)
-      distribute_notes (new_i3_notes, i3, i3, NULL_RTX, NULL_RTX, NULL_RTX);
+      distribute_notes (new_i3_notes, i3, i3, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
 
     /* If I3DEST was used in I3SRC, it really died in I3.  We may need to
        put a REG_DEAD note for it somewhere.  If NEWI2PAT exists and sets
@@ -3877,39 +4125,51 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	if (newi2pat && reg_set_p (i3dest_killed, newi2pat))
 	  distribute_notes (alloc_reg_note (REG_DEAD, i3dest_killed,
 					    NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, elim_i2, elim_i1);
+			    NULL_RTX, i2, NULL_RTX, elim_i2, elim_i1, elim_i0);
 	else
 	  distribute_notes (alloc_reg_note (REG_DEAD, i3dest_killed,
 					    NULL_RTX),
 			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
-			    elim_i2, elim_i1);
+			    elim_i2, elim_i1, elim_i0);
       }
 
     if (i2dest_in_i2src)
       {
+	rtx new_note = alloc_reg_note (REG_DEAD, i2dest, NULL_RTX);
 	if (newi2pat && reg_set_p (i2dest, newi2pat))
-	  distribute_notes (alloc_reg_note (REG_DEAD, i2dest, NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, NULL_RTX, NULL_RTX);
-	else
-	  distribute_notes (alloc_reg_note (REG_DEAD, i2dest, NULL_RTX),
-			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+	  distribute_notes (new_note,  NULL_RTX, i2, NULL_RTX, NULL_RTX,
 			    NULL_RTX, NULL_RTX);
+	else
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
       }
 
     if (i1dest_in_i1src)
       {
+	rtx new_note = alloc_reg_note (REG_DEAD, i1dest, NULL_RTX);
 	if (newi2pat && reg_set_p (i1dest, newi2pat))
-	  distribute_notes (alloc_reg_note (REG_DEAD, i1dest, NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, NULL_RTX, NULL_RTX);
+	  distribute_notes (new_note, NULL_RTX, i2, NULL_RTX, NULL_RTX,
+			    NULL_RTX, NULL_RTX);
 	else
-	  distribute_notes (alloc_reg_note (REG_DEAD, i1dest, NULL_RTX),
-			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
+      }
+
+    if (i0dest_in_i0src)
+      {
+	rtx new_note = alloc_reg_note (REG_DEAD, i0dest, NULL_RTX);
+	if (newi2pat && reg_set_p (i0dest, newi2pat))
+	  distribute_notes (new_note, NULL_RTX, i2, NULL_RTX, NULL_RTX,
 			    NULL_RTX, NULL_RTX);
+	else
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
       }
 
     distribute_links (i3links);
     distribute_links (i2links);
     distribute_links (i1links);
+    distribute_links (i0links);
 
     if (REG_P (i2dest))
       {
@@ -3959,6 +4219,23 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	  INC_REG_N_SETS (regno, -1);
       }
 
+    if (i0 && REG_P (i0dest))
+      {
+	rtx link;
+	rtx i0_insn = 0, i0_val = 0, set;
+
+	for (link = LOG_LINKS (i3); link; link = XEXP (link, 1))
+	  if ((set = single_set (XEXP (link, 0))) != 0
+	      && rtx_equal_p (i0dest, SET_DEST (set)))
+	    i0_insn = XEXP (link, 0), i0_val = SET_SRC (set);
+
+	record_value_for_reg (i0dest, i0_insn, i0_val);
+
+	regno = REGNO (i0dest);
+	if (! added_sets_0 && ! i0dest_in_i0src)
+	  INC_REG_N_SETS (regno, -1);
+      }
+
     /* Update reg_stat[].nonzero_bits et al for any changes that may have
        been made to this insn.  The order of
        set_nonzero_bits_and_sign_copies() is important.  Because newi2pat
@@ -3978,6 +4255,16 @@ try_combine (rtx i3, rtx i2, rtx i1, int
       df_insn_rescan (undobuf.other_insn);
     }
 
+  if (i0 && !(NOTE_P(i0) && (NOTE_KIND (i0) == NOTE_INSN_DELETED)))
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "modifying insn i0 ");
+	  dump_insn_slim (dump_file, i0);
+	}
+      df_insn_rescan (i0);
+    }
+
   if (i1 && !(NOTE_P(i1) && (NOTE_KIND (i1) == NOTE_INSN_DELETED)))
     {
       if (dump_file)
@@ -12668,7 +12955,7 @@ reg_bitfield_target_p (rtx x, rtx body)
 
 static void
 distribute_notes (rtx notes, rtx from_insn, rtx i3, rtx i2, rtx elim_i2,
-		  rtx elim_i1)
+		  rtx elim_i1, rtx elim_i0)
 {
   rtx note, next_note;
   rtx tem;
@@ -12914,7 +13201,8 @@ distribute_notes (rtx notes, rtx from_in
 			&& !(i2mod
 			     && reg_overlap_mentioned_p (XEXP (note, 0),
 							 i2mod_old_rhs)))
-		       || rtx_equal_p (XEXP (note, 0), elim_i1))
+		       || rtx_equal_p (XEXP (note, 0), elim_i1)
+		       || rtx_equal_p (XEXP (note, 0), elim_i0))
 		break;
 	      tem = i3;
 	    }
@@ -12981,7 +13269,7 @@ distribute_notes (rtx notes, rtx from_in
 			  REG_NOTES (tem) = NULL;
 
 			  distribute_notes (old_notes, tem, tem, NULL_RTX,
-					    NULL_RTX, NULL_RTX);
+					    NULL_RTX, NULL_RTX, NULL_RTX);
 			  distribute_links (LOG_LINKS (tem));
 
 			  SET_INSN_DELETED (tem);
@@ -12998,7 +13286,7 @@ distribute_notes (rtx notes, rtx from_in
 
 			      distribute_notes (old_notes, cc0_setter,
 						cc0_setter, NULL_RTX,
-						NULL_RTX, NULL_RTX);
+						NULL_RTX, NULL_RTX, NULL_RTX);
 			      distribute_links (LOG_LINKS (cc0_setter));
 
 			      SET_INSN_DELETED (cc0_setter);
@@ -13118,7 +13406,8 @@ distribute_notes (rtx notes, rtx from_in
 							     NULL_RTX);
 
 			      distribute_notes (new_note, place, place,
-						NULL_RTX, NULL_RTX, NULL_RTX);
+						NULL_RTX, NULL_RTX, NULL_RTX,
+						NULL_RTX);
 			    }
 			  else if (! refers_to_regno_p (i, i + 1,
 							PATTERN (place), 0)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 14:37       ` Bernd Schmidt
@ 2010-08-10 14:40         ` Richard Guenther
  2010-08-10 14:49           ` Bernd Schmidt
  2010-08-10 15:06           ` Steven Bosscher
  2010-08-11 12:32         ` Michael Matz
  1 sibling, 2 replies; 129+ messages in thread
From: Richard Guenther @ 2010-08-10 14:40 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On Tue, Aug 10, 2010 at 4:31 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>>> $ grep Trying.four log |wc -l
>>> 307743
>>> $ grep Trying.three log |wc -l
>>> 592776
>>> $ grep Trying.two log |wc -l
>>> 1643112
>>> $ grep Succeeded.two log |wc -l
>>> 204808
>>> $ grep Succeeded.three.into.two log |wc -l
>>> 2976
>>> $ grep Succeeded.three.into.one log |wc -l
>>> 12473
>>> $ grep Succeeded.four.into.two log |wc -l
>>> 244
>>> $ grep Succeeded.four.into.one log |wc -l
>>> 140
>>
>> No four into three?  So overall that's one order of magnitude fewer
>> three-insn combinations than two, and two orders of magnitude fewer
>> four than three.
>
> I redid the numbers for Thumb-1, over the same set of input files.  I
> left out the two-insn combinations because the greps take ages and the
> comparison 3 vs 4 should provide enough information.
>
> $ grep Trying.three log |wc -l
> 842213
> $ grep Trying.four log |wc -l
> 488682
> $ grep Succeeded.four.to.two log|wc -l
> 759
> $ grep Succeeded.four.to.one log|wc -l
> 163
> $ grep Succeeded.three.to.two log|wc -l
> 6230
> $ grep Succeeded.three.to.one log|wc -l
> 3178
>
> So the patch seems somewhat more effective for this target
> (unsurprisingly as it has simpler instructions).
>
> With the following heuristic in try_combine:
>
>  if (i0)
>    {
>      int i;
>      int ncst = 0;
>      int nshift = 0;
>      for (i = 0; i < 3; i++)
>        {
>          rtx insn = i == 0 ? i0 : i == 1 ? i1 : i2;
>          rtx set = single_set (insn);
>
>          if (set && CONSTANT_P (SET_SRC (set)))
>            ncst++;
>          else if (set && (GET_CODE (SET_SRC (set)) == ASHIFT
>                           || GET_CODE (SET_SRC (set)) == ASHIFTRT
>                           || GET_CODE (SET_SRC (set)) == LSHIFTRT))
>            nshift++;
>        }
>      if (ncst == 0 && nshift < 2)
>        return 0;
>    }
>
> $ grep Trying.four log2 |wc -l
> 187120
> $ grep Succeeded.four log2|wc -l
> 828
> $ grep Succeeded.three.to.one log2|wc -l
> 3161
> $ grep Succeeded.three.to.two log2|wc -l
> 6218
>
> With the heuristic, we still catch the majority of interesting cases on
> Thumb-1, with a reduced number of attempts, but we also miss some
> optimizations like these:
>
> -       add     r2, r2, #1
>        lsl     r2, r2, #5
> -       add     r3, r3, r2
> -       sub     r3, r3, #32
> +       add     r3, r2, r3
> ====
> -       mvn     r3, r3
>        lsr     r3, r3, #16
> -       mvn     r3, r3
> ====
>
> In summary, should I check in the patch with the above heuristic?

You could enable the heuristic with optimize < 3 && !optimize_size
(thus keep combining everything at -O3 and -Os).
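
Concretely, the guard could look like this (just a sketch; the counting
loop is yours, unchanged):

  if (i0 && optimize < 3 && !optimize_size)
    {
      int i;
      int ncst = 0;
      int nshift = 0;
      for (i = 0; i < 3; i++)
        {
          rtx insn = i == 0 ? i0 : i == 1 ? i1 : i2;
          rtx set = single_set (insn);

          if (set && CONSTANT_P (SET_SRC (set)))
            ncst++;
          else if (set && (GET_CODE (SET_SRC (set)) == ASHIFT
                           || GET_CODE (SET_SRC (set)) == ASHIFTRT
                           || GET_CODE (SET_SRC (set)) == LSHIFTRT))
            nshift++;
        }
      /* Reject the four-insn attempt unless the window looks promising;
         at -O3 and -Os the guard above keeps combining everything.  */
      if (ncst == 0 && nshift < 2)
        return 0;
    }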

Did you try just using the ncst == 0 heuristic?

Richard.

>
> Bernd
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 14:40         ` Richard Guenther
@ 2010-08-10 14:49           ` Bernd Schmidt
  2010-08-10 15:06             ` Steven Bosscher
  2010-08-10 15:06           ` Steven Bosscher
  1 sibling, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-10 14:49 UTC (permalink / raw)
  To: Richard Guenther; +Cc: GCC Patches

On 08/10/2010 04:39 PM, Richard Guenther wrote:
> You could enable the heuristic with optimize < 3 && !optimize_size
> (thus keep combining everything at -O3 and -Os).

I could do that.  I just need to know what the consensus opinion is.

> Did you try just using the ncst == 0 heuristic?

Yes.  That left so many instances of opportunities like
-       lsl     r3, r1, #31
-       lsr     r3, r3, #31
-       lsl     r3, r3, #24
-       lsr     r3, r3, #24
+       mov     r3, #1
+       and     r3, r1

that I thought it better to add a special case.


Bernd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-09 23:24                             ` Chris Lattner
  2010-08-10 13:02                               ` Toon Moene
@ 2010-08-10 14:58                               ` Andi Kleen
  2010-08-10 15:03                                 ` Richard Guenther
                                                   ` (2 more replies)
  1 sibling, 3 replies; 129+ messages in thread
From: Andi Kleen @ 2010-08-10 14:58 UTC (permalink / raw)
  To: Chris Lattner
  Cc: Toon Moene, Diego Novillo, Steven Bosscher, Bernd Schmidt,
	Eric Botcazou, Richard Guenther, gcc-patches

Chris Lattner <clattner@apple.com> writes:
>
>   e. General speedups: Clang's preprocessor is roughly 2x faster than GCC's and the frontend is generally much faster.  For example, it uses hash tables instead of lists where appropriate, so it doesn't get N^2 cases in silly situations as often.  I don't know what else GCC is doing wrong; I haven't looked at its frontends much.

I looked at this a weekend or two ago. The two hot functions in the
preprocessor are cpp_clean_line and the lexer.

At least cpp_clean_line was pretty easy to speed up using SSE 4.2
string instructions and vectorizing it. 

That change made it drop from the top 10 in an unoptimized build to the
lower top 40 or so. I suspect with that change the clang advantage
is much less than 2x.

Drawback: the patch broke some of the PCH test cases in the test
suite and I never quite figured out why (that's why I didn't post
the patch)

Other drawback: the optimization only helps on x86 systems
that support SSE 4.2 (but presumably that's a popular build system)

Here's the patch if anyone is interested.

Vectorizing the lexer might be possible too, but it's somewhat
harder.

The other problem I found is that cpplib is not using profile
feedback, that is likely giving some performance away too.

-Andi


diff --git a/libcpp/init.c b/libcpp/init.c
index c5b8c28..769aa50 100644
--- a/libcpp/init.c
+++ b/libcpp/init.c
@@ -137,6 +137,8 @@ init_library (void)
 #ifdef ENABLE_NLS
        (void) bindtextdomain (PACKAGE, LOCALEDIR);
 #endif
+
+       init_vectorized_lexer ();
     }
 }
 
diff --git a/libcpp/internal.h b/libcpp/internal.h
index 9209b55..10ed033 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -725,6 +725,8 @@ ufputs (const unsigned char *s, FILE *f)
   return fputs ((const char *)s, f);
 }
 
+extern void init_vectorized_lexer (void);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/libcpp/lex.c b/libcpp/lex.c
index f628272..589fa64 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -96,6 +96,82 @@ add_line_note (cpp_buffer *buffer, const uchar *pos, unsigned int type)
   buffer->notes_used++;
 }
 
+#if (GCC_VERSION >= 4005) && (defined(__i386__) || defined(__x86_64__))
+
+#define HAVE_SSE42 1
+
+#include <stdint.h>
+#include "../gcc/config/i386/cpuid.h"
+
+bool cpu_has_sse42;
+
+/* Check if CPU supports vectorized string instructions. */
+
+void 
+init_vectorized_lexer (void)
+{
+  unsigned dummy, ecx;
+
+  if (__get_cpuid (1, &dummy, &dummy, &ecx, &dummy))
+      cpu_has_sse42 = !!(ecx & (1 << 20));
+}
+
+/* Fast path to find line special characters using SSE 4.2 vectorized string 
+   instructions. Anything complicated falls back to the slow path below. 
+   Since this loop is very hot it's worth doing these kinds of
+   optimizations. Returns true if stopper character found. 
+
+   We should be using the _mm intrinsics, but the xxxintr headers do things
+   not allowed in gcc. So instead use direct builtins. */
+
+static bool __attribute__((__target__("sse4.2")))
+search_line_sse42 (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef char m128i __attribute__ ((__vector_size__ (16)));
+  int left;
+  int index;
+  static char searchstr[16] __attribute__ ((aligned(16))) = "\n\r?\\";
+  m128i search = *(m128i *)searchstr;
+  m128i data;
+		
+  for (left = end - (uchar *)s; left > 0; left -= 16) 
+    { 
+      if (((uintptr_t)s & 0xfff) > 0xff0) 
+	{
+	  /* Too near page boundary. Use slow path. This could be
+	     avoided if we ensure suitable padding or alignment in
+	     the input buffer. */
+	  *out = s;
+	  return false;
+	}
+
+      /* Use vectorized string comparison, looking for the 4 stoppers. */
+      data = (m128i) __builtin_ia32_loaddqu((const char *)s);
+      index = __builtin_ia32_pcmpestri128 (search, 4, data, left, 0);
+      if (index < 16) 
+	{
+	  *out = s + index;
+	  return true;
+	}
+      s += 16;
+    }
+
+  /* Ran out of buffer. Should not happen? */
+  *out = end;
+  return false;
+}
+
+#else
+
+/* Dummy */
+
+void 
+init_vectorized_lexer (void)
+{
+}
+
+#endif
+
 /* Returns with a logical line that contains no escaped newlines or
    trigraphs.  This is a time-critical inner loop.  */
 void
@@ -109,12 +185,41 @@ _cpp_clean_line (cpp_reader *pfile)
   buffer->cur_note = buffer->notes_used = 0;
   buffer->cur = buffer->line_base = buffer->next_line;
   buffer->need_line = false;
-  s = buffer->next_line - 1;
+  s = buffer->next_line;
 
   if (!buffer->from_stage3)
     {
       const uchar *pbackslash = NULL;
 
+#ifdef HAVE_SSE42
+      if (cpu_has_sse42)
+	{
+	  for (;;) 
+	    {
+	      /* Drop into slow path if ? or nothing is found. */
+	      if (search_line_sse42 (s, buffer->rlimit, &s) == false
+		  || *s == '?')
+		break;
+
+	      c = *s;
+	      
+	      /* Special case for backslash which is reasonably common.
+		 Continue searching using the fast path */
+	      if (c == '\\') 
+		{
+		  pbackslash = s;
+		  s++;
+		  continue;
+		}
+
+	      /* \n or \r here. Process it below. */
+	      goto found;
+	    }
+	}
+#endif
+
+      s--;
+
       /* Short circuit for the common case of an un-escaped line with
 	 no trigraphs.  The primary win here is by not writing any
 	 data back to memory until we have to.  */
@@ -124,6 +229,9 @@ _cpp_clean_line (cpp_reader *pfile)
 	  if (__builtin_expect (c == '\n', false)
 	      || __builtin_expect (c == '\r', false))
 	    {
+#ifdef HAVE_SSE42
+	    found:
+#endif
 	      d = (uchar *) s;
 
 	      if (__builtin_expect (s == buffer->rlimit, false))



-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 14:58                               ` Andi Kleen
@ 2010-08-10 15:03                                 ` Richard Guenther
  2010-08-10 15:32                                   ` Andi Kleen
  2010-08-10 20:15                                 ` H.J. Lu
  2010-08-12 21:38                                 ` Vectorized _cpp_clean_line Richard Henderson
  2 siblings, 1 reply; 129+ messages in thread
From: Richard Guenther @ 2010-08-10 15:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Lattner, Toon Moene, Diego Novillo, Steven Bosscher,
	Bernd Schmidt, Eric Botcazou, gcc-patches

On Tue, Aug 10, 2010 at 4:48 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Chris Lattner <clattner@apple.com> writes:
>>
>>   e. General speedups: Clang's preprocessor is roughly 2x faster than GCC's and the frontend is generally much faster.  For example, it uses hash tables instead of lists where appropriate, so it doesn't get N^2 cases in silly situations as often.  I don't know what else GCC is doing wrong; I haven't looked at its frontends much.
>
> I looked at this a weekend or two ago. The two hot functions in the
> preprocessor are cpp_clean_line and the lexer.
>
> At least cpp_clean_line was pretty easy to speed up using SSE 4.2
> string instructions and vectorizing it.
>
> That change made it drop from the top 10 in an unoptimized build to
> the lower top 40 or so. I suspect with that change the clang advantage
> is much less than 2x.
>
> Drawback: the patch broke some of the PCH test cases in the test
> suite and I never quite figured out why (that's why I didn't post
> the patch)
>
> Other drawback: the optimization only helps on x86 systems
> that support SSE 4.2 (but presumably that's a popular build system)
>
> Here's the patch if anyone is interested.
>
> Vectorizing the lexer might be possible too, but it's somewhat
> harder.
>
> The other problem I found is that cpplib is not using profile
> feedback, that is likely giving some performance away too.

I'm sure there is a way to open-code this using integer math.
Likely the performance issue is both that we use byte loads
and 4 comparisons per char.  Maybe 4 parallel strchr optimized
searches are comparably fast?
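
For reference, the usual zero-byte trick per candidate character would
look something like this (a sketch only; bytes_matching is a made-up
helper, not anything in Andi's patch):

  /* Return non-zero if some byte of VAL (probably) equals C.  Exact
     for the lowest matching byte; borrows can flag false positives in
     higher bytes, so the index found must be verified afterwards.  */
  static inline unsigned long
  bytes_matching (unsigned long val, unsigned char c)
  {
    unsigned long ones = (unsigned long) -1 / 0xff;  /* 0x0101...01 */
    unsigned long high = ones << 7;                  /* 0x8080...80 */
    unsigned long t = val ^ (ones * c);              /* match bytes -> 0 */
    return (t - ones) & ~t & high;
  }

OR-ing the results for '\n', '\r', '\\' and '?' gives one test and
branch per word instead of four compares per byte.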

Richard.

> -Andi
>
>
> [Andi's patch quoted in full above; snipped]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 14:40         ` Richard Guenther
  2010-08-10 14:49           ` Bernd Schmidt
@ 2010-08-10 15:06           ` Steven Bosscher
  2010-08-10 16:27             ` Andi Kleen
  1 sibling, 1 reply; 129+ messages in thread
From: Steven Bosscher @ 2010-08-10 15:06 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Bernd Schmidt, GCC Patches

On Tue, Aug 10, 2010 at 4:39 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> You could enable the heuristic with optimize < 3 && !optimize_size
> (thus keep combining everything at -O3 and -Os).

Better s/optimize_size/optimize_bb_for_size(BLOCK_FOR_INSN (i0))/
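
I.e. something like this (a sketch; I'm assuming the predicate is
optimize_bb_for_size_p from predict.h):

  if (i0
      && optimize < 3
      && !optimize_bb_for_size_p (BLOCK_FOR_INSN (i0)))
    {
      /* ... apply the ncst/nshift rejection from Bernd's mail ... */
    }

so that cold blocks and blocks compiled for size keep the full
four-insn combining.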


Ciao!
Steven

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 14:49           ` Bernd Schmidt
@ 2010-08-10 15:06             ` Steven Bosscher
  0 siblings, 0 replies; 129+ messages in thread
From: Steven Bosscher @ 2010-08-10 15:06 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Guenther, GCC Patches

On Tue, Aug 10, 2010 at 4:45 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 08/10/2010 04:39 PM, Richard Guenther wrote:
>> You could enable the heuristic with optimize < 3 && !optimize_size
>> (thus keep combining everything at -O3 and -Os).
>
> I could do that.  I just need to know what the consensus opinion is.

Seems reasonable to me fwiw.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 15:03                                 ` Richard Guenther
@ 2010-08-10 15:32                                   ` Andi Kleen
  2010-08-10 20:09                                     ` Tom Tromey
  0 siblings, 1 reply; 129+ messages in thread
From: Andi Kleen @ 2010-08-10 15:32 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Andi Kleen, Chris Lattner, Toon Moene, Diego Novillo,
	Steven Bosscher, Bernd Schmidt, Eric Botcazou, gcc-patches

> I'm sure there is a way to open-code this using integer math.

I don't think so. Take a look at what PCMPESTRI does.  There's no easy 
replacement, even if you use all the Hacker's Delight tricks 
(it's really a cool instruction, but also very complicated :-)

> Likely the performance issue is both that we use byte loads
> and 4 comparisons per char.  Maybe 4 parallel strchr optimized
> searches are comparable fast?

and various other overhead.

-Andi

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10  2:50                   ` Mark Mitchell
@ 2010-08-10 15:35                     ` Chris Lattner
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Lattner @ 2010-08-10 15:35 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Jeff Law, Bernd Schmidt, Steven Bosscher, GCC Patches


On Aug 9, 2010, at 9:27 AM, Mark Mitchell wrote:

> Chris Lattner wrote:
> 
>>>> Is "throw in a param and let the distributors decide" really a great solution to issues like these?
>>> Do you have a better one?
>> 
>> Yes, pick an answer.  Either one is better than a new param IMO.
> 
> All I can say is that I disagree.
> 
> Even if we didn't want to expose this to users, it would be better if
> the code were structured that way.  There's no good justification for
> hard-coding constants into the compiler, and that's essentially what
> we've done for combine.  We've just done it by coding the algorithm that
> way instead of by having "3" or "4" somewhere in the code.

Sure, cleaning up and generalizing the code is different from suggesting to distributors that they make diverging variants of GCC.

> And, the general rule in GCC is that magic constants should be exposed
> as --params.  If we want to change that rule, of course, we can do so,
> but it's certainly been useful to people.
> 
> I don't recommend messing about with --params for most people.  In fact,
> CodeSourcery has a knowledgebase entry recommending that people stick
> with -Os, -O1, -O2, or -O3!  But, that doesn't mean they aren't useful.

I think that suggesting people stick to -OX is a great idea :)

-Chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 13:02                               ` Toon Moene
@ 2010-08-10 15:36                                 ` Chris Lattner
  0 siblings, 0 replies; 129+ messages in thread
From: Chris Lattner @ 2010-08-10 15:36 UTC (permalink / raw)
  To: Toon Moene
  Cc: Diego Novillo, Steven Bosscher, Bernd Schmidt, Eric Botcazou,
	Richard Guenther, gcc-patches


On Aug 10, 2010, at 5:30 AM, Toon Moene wrote:

> Chris Lattner wrote:
> 
>> On Aug 9, 2010, at 10:28 AM, Toon Moene wrote:
>>> Diego Novillo wrote:
>>> 
>>>> On 10-08-09 13:07 , Toon Moene wrote:
>>>>> Is this also true for C++ ? In that case it might be useful to curb
>>>>> Front End optimizations when -O0 is given ...
>>>> Not really, the amount of optimization is quite minimal to non-existent.
>>>> Much of the slowness is due to the inherent nature of C++ parsing. There is some performance to be gained by tweaking the various data structures and algorithms, but no order-of-magnitude opportunities seem to exist.
>>> Perhaps Chris can add something to this discussion - after all, LLVM is written mostly in C++, no ?
>>> 
>>> Certainly, that must have provided him (and his team) with boatloads of performance data ....
>> I'm not sure what you mean here.  The single biggest win I've got in my personal development was
>> switching from llvm-g++ to clang++.  It is substantially faster, uses much less memory and
>> has better QoI than G++.  I assume that's not the option that you're suggesting though. :-)
> 
> Well, I just hoped for a list of things where clang++ was faster than llvm-g++ and why, but the issues you addressed are probably just as well ...

Ah ok.  We haven't started performance tuning clang++ yet.  Only C/ObjC have seen a focus on compile time so far.

-Chris

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 15:06           ` Steven Bosscher
@ 2010-08-10 16:27             ` Andi Kleen
  2010-08-10 16:47               ` Steven Bosscher
  0 siblings, 1 reply; 129+ messages in thread
From: Andi Kleen @ 2010-08-10 16:27 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Richard Guenther, Bernd Schmidt, GCC Patches

Steven Bosscher <stevenb.gcc@gmail.com> writes:

> On Tue, Aug 10, 2010 at 4:39 PM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> You could enable the heuristic with optimize < 3 && !optimize_size
>> (thus keep combining everything at -O3 and -Os).
>
> Better s/optimize_size/optimize_bb_for_size(BLOCK_FOR_INSN (i0))/

That would mean that a -O2 build where all BBs are cold for some
reason would be 1% slower, right?

To be honest I didn't fully understand why it's ok for -Os 
to be 1% slower. A lot of people use -Os.

It seems more like an -O3-only thing to me.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 16:27             ` Andi Kleen
@ 2010-08-10 16:47               ` Steven Bosscher
  2010-08-10 16:55                 ` Andi Kleen
  0 siblings, 1 reply; 129+ messages in thread
From: Steven Bosscher @ 2010-08-10 16:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Richard Guenther, Bernd Schmidt, GCC Patches

On Tue, Aug 10, 2010 at 6:13 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Steven Bosscher <stevenb.gcc@gmail.com> writes:
>
>> On Tue, Aug 10, 2010 at 4:39 PM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>> You could enable the heuristic with optimize < 3 && !optimize_size
>>> (thus keep combining everything at -O3 and -Os).
>>
>> Better s/optimize_size/optimize_bb_for_size(BLOCK_FOR_INSN (i0))/
>
> That would mean that a -O2 build where all BBs are cold for some
> reason would be 1% slower, right?

It's not 1% anymore with Bernd's heuristic limiting which four-insn
combinations get tried.
Before that (from Bernd's numbers):

$ grep Trying.four log |wc -l
307743

and after:

$ grep Trying.four log2 |wc -l
187120

So only ~60% of the attempts and probably also less time simplifying.
Slowdown should be less than 0.5% after that.


> To be honest I didn't fully understand why it's ok for -Os
> to be 1% slower. A lot of people use -Os.

Yes, and they expect the smallest code possible. This patch does seem
to help for that.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 16:47               ` Steven Bosscher
@ 2010-08-10 16:55                 ` Andi Kleen
  2010-08-10 17:03                   ` David Daney
  0 siblings, 1 reply; 129+ messages in thread
From: Andi Kleen @ 2010-08-10 16:55 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Andi Kleen, Richard Guenther, Bernd Schmidt, GCC Patches

> > To be honest I didn't fully understand why it's ok for -Os
> > to be 1% slower. A lot of people use -Os.
> 
> Yes, and they expect the smallest code possible. This patch does seem
> to help for that.

They also expect reasonably fast compilation.

-Andi

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 16:55                 ` Andi Kleen
@ 2010-08-10 17:03                   ` David Daney
  2010-08-11  8:53                     ` Richard Guenther
  0 siblings, 1 reply; 129+ messages in thread
From: David Daney @ 2010-08-10 17:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Steven Bosscher, Richard Guenther, Bernd Schmidt, GCC Patches

On 08/10/2010 09:47 AM, Andi Kleen wrote:
>>> To be honest I didn't fully understand why it's ok for -Os
>>> to be 1% slower. A lot of people use -Os.
>>
>> Yes, and they expect the smallest code possible. This patch does seem
>> to help for that.
>
> They also expect reasonably fast compilation.
>

If they did, their expectation wouldn't have come from having read the 
documentation.  As far as I can see, the *only* thing one should expect 
from -Os is smaller code.

We would all like fast compilation, but if you specify -Os, it seems to 
me that you are expressing a preference for smaller code.

David Daney

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 15:32                                   ` Andi Kleen
@ 2010-08-10 20:09                                     ` Tom Tromey
  2010-08-10 20:23                                       ` Andi Kleen
                                                         ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Tom Tromey @ 2010-08-10 20:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Richard Guenther, gcc-patches

>>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes:

Richard> I'm sure there is a way to open-code this using integer math.

Andi> I don't think so. Take a look at what PCMPESTRI does.  There's no easy 
Andi> replacement, even if you use all the Hacker's Delight tricks 
Andi> (it's really a cool instruction, but also very complicated :-)

This is still pending:

http://gcc.gnu.org/ml/gcc-patches/2010-03/msg00526.html

I think any sort of hackery in this area is fine, if it speeds up the
compiler.  All that is needed is someone with the time and motivation to
do the testing.

Tom

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 14:58                               ` Andi Kleen
  2010-08-10 15:03                                 ` Richard Guenther
@ 2010-08-10 20:15                                 ` H.J. Lu
  2010-08-12 21:38                                 ` Vectorized _cpp_clean_line Richard Henderson
  2 siblings, 0 replies; 129+ messages in thread
From: H.J. Lu @ 2010-08-10 20:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Lattner, Toon Moene, Diego Novillo, Steven Bosscher,
	Bernd Schmidt, Eric Botcazou, Richard Guenther, gcc-patches

On Tue, Aug 10, 2010 at 7:48 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Chris Lattner <clattner@apple.com> writes:
>>
>>   e. General speedups: Clang's preprocessor is roughly 2x faster than GCC's and the frontend is generally much faster.  For example, it uses hash tables instead of lists where appropriate, so it doesn't get N^2 cases in silly situations as often.  I don't know what else GCC is doing wrong; I haven't looked at its frontends much.
>
> I looked at this a weekend or two ago. The two hot functions in the
> preprocessor are cpp_clean_line and the lexer.
>
> At least cpp_clean_line was pretty easy to speed up using SSE 4.2
> string instructions and vectorizing it.
>
> That change made it drop from the top 10 in an unoptimized build to
> the lower top 40 or so. I suspect with that change the clang advantage
> is much less than 2x.
>
> Drawback: the patch broke some of the PCH test cases in the test
> suite and I never quite figured out why (that's why I didn't post
> the patch)
>
> Other drawback: the optimization only helps on x86 systems
> that support SSE 4.2 (but presumably that's a popular build system)
>
> Here's the patch if anyone is interested.
>
> Vectorizing the lexer might be possible too, but it's somewhat
> harder.
>
> The other problem I found is that cpplib is not using profile
> feedback, that is likely giving some performance away too.
>
> -Andi
>
>

You can use IFUNC to automatically enable it on Linux.
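
E.g. something like this (a sketch only; the fallback name and the
wiring are made up, and it relies on the GNU ifunc attribute plus an
IFUNC-aware glibc):

  typedef bool (*search_line_fn) (const uchar *, const uchar *,
                                  const uchar **);

  /* The resolver runs once at startup, so the per-call cpu_has_sse42
     test disappears.  */
  static search_line_fn
  resolve_search_line (void)
  {
    unsigned dummy, ecx;
    if (__get_cpuid (1, &dummy, &dummy, &ecx, &dummy)
        && (ecx & (1 << 20)))
      return search_line_sse42;         /* from your patch */
    return search_line_fallback;        /* hypothetical scalar version */
  }

  bool search_line (const uchar *s, const uchar *end, const uchar **out)
    __attribute__ ((ifunc ("resolve_search_line")));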


-- 
H.J.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 20:09                                     ` Tom Tromey
@ 2010-08-10 20:23                                       ` Andi Kleen
  2010-08-10 22:40                                         ` Mike Stump
  2010-08-10 23:16                                         ` Tom Tromey
  2010-08-12 21:09                                       ` Nathan Froyd
  2010-08-17 15:14                                       ` Mark Mitchell
  2 siblings, 2 replies; 129+ messages in thread
From: Andi Kleen @ 2010-08-10 20:23 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Andi Kleen, Richard Guenther, gcc-patches

On Tue, Aug 10, 2010 at 01:51:46PM -0600, Tom Tromey wrote:
> >>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes:
> 
> Richard> I'm sure there is a way to open-code this using integer math.
> 
> Andi> I don't think so. Take a look at what PCMPESTRI does.  There's no easy 
> Andi> replacement, even if you use all the Hacker's Delight tricks 
> Andi> (it's really a cool instruction, but also very complicated :-)
> 
> This is still pending:
> 
> http://gcc.gnu.org/ml/gcc-patches/2010-03/msg00526.html

I see. I suspect my version is faster though :-) (if you have a suitable
CPU)

Also I think David got the "buffer near end of page before hole" case wrong
(although I'm not fully sure that really happens in gcc).

> 
> I think any sort of hackery in this area is fine, if it speeds up the
> compiler.  All that is needed is someone with the time and motivation to
> do the testing.

My version passes testing, except for a few PCH test cases. I actually
ran it in parallel with the original _cpp_clean_line and it gave the same
results. Any ideas why it could break PCH?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 20:23                                       ` Andi Kleen
@ 2010-08-10 22:40                                         ` Mike Stump
  2010-08-10 23:16                                         ` Tom Tromey
  1 sibling, 0 replies; 129+ messages in thread
From: Mike Stump @ 2010-08-10 22:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tom Tromey, Richard Guenther, gcc-patches

On Aug 10, 2010, at 1:15 PM, Andi Kleen wrote:
> My version passes testing, except for a few PCH test cases. I actually
> ran it in parallel with the original cleanline and it gave the same
> results. Any ideas why it could break PCH?

:-)  Yes.  I'd first run it again and see if it always fails.  Sometimes
for me the pch testcases shimmer.  After that, you have to just look at
the output.  The output can't depend upon things like rtx addresses.  If
you have multiple hosts, try the testcase (no pch) on different hosts and
ensure you get the same output (via cross compilation).  Any differences
tend to be the sorts of things that pch isn't going to like.  Sorts on
hashes that involve addresses are typical; instead, one should use UID or
other data to sort on.
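
E.g. a qsort comparator keyed on stable data rather than pointers (a
made-up example, not pointing at any specific sort in gcc):

  /* Host addresses differ from run to run, so sorting by them makes
     the PCH image unreproducible; DECL_UID is stable.  */
  static int
  cmp_decls_by_uid (const void *a, const void *b)
  {
    const_tree da = *(const const_tree *) a;
    const_tree db = *(const const_tree *) b;
    return DECL_UID (da) - DECL_UID (db);
  }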

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 20:23                                       ` Andi Kleen
  2010-08-10 22:40                                         ` Mike Stump
@ 2010-08-10 23:16                                         ` Tom Tromey
  1 sibling, 0 replies; 129+ messages in thread
From: Tom Tromey @ 2010-08-10 23:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Richard Guenther, gcc-patches

>>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes:

Andi> My version passes testing, except for a few PCH test cases. I actually
Andi> ran it in parallel with the original cleanline and it gave the same
Andi> results. Any ideas why it could break PCH?

I don't have any theory.  It doesn't seem like it could affect PCH.

Tom

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 17:03                   ` David Daney
@ 2010-08-11  8:53                     ` Richard Guenther
  2010-08-16 20:42                       ` Mark Mitchell
  0 siblings, 1 reply; 129+ messages in thread
From: Richard Guenther @ 2010-08-11  8:53 UTC (permalink / raw)
  To: David Daney; +Cc: Andi Kleen, Steven Bosscher, Bernd Schmidt, GCC Patches

On Tue, Aug 10, 2010 at 7:03 PM, David Daney <ddaney@caviumnetworks.com> wrote:
> On 08/10/2010 09:47 AM, Andi Kleen wrote:
>>>>
>>>> To be honest I didn't fully understand why it's ok for -Os
>>>> to be 1% slower. A lot of people use -Os.
>>>
>>> Yes, and they expect the smallest code possible. This patch does seem
>>> to help for that.
>>
>> They also expect reasonable fast compilation.
>>
>
> If they did, their expectation wouldn't have come from having read the
> documentation.  As far as I can see, the *only* thing one should expect from
> -Os is smaller code.
>
> We would all like fast compilation, but if you specify -Os, it seems to me
> that you are expressing a preference for smaller code.

Yes indeed.

Richard.

> David Daney
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-10 14:37       ` Bernd Schmidt
  2010-08-10 14:40         ` Richard Guenther
@ 2010-08-11 12:32         ` Michael Matz
  1 sibling, 0 replies; 129+ messages in thread
From: Michael Matz @ 2010-08-11 12:32 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Guenther, GCC Patches

Hi,

On Tue, 10 Aug 2010, Bernd Schmidt wrote:

>   if (i0)
>     {
>       int i;
>       int ncst = 0;
>       int nshift = 0;
>       for (i = 0; i < 3; i++)
> 	{
> 	  rtx insn = i == 0 ? i0 : i == 1 ? i1 : i2;
> 	  rtx set = single_set (insn);
> 
> 	  if (set && CONSTANT_P (SET_SRC (set)))
> 	    ncst++;
> 	  else if (set && (GET_CODE (SET_SRC (set)) == ASHIFT
> 			   || GET_CODE (SET_SRC (set)) == ASHIFTRT
> 			   || GET_CODE (SET_SRC (set)) == LSHIFTRT))
> 	    nshift++;
> 	}
>       if (ncst == 0 && nshift < 2)
> 	return 0;
>     }

What follows will degenerate into a 'try this, try that, try something 
else' discussion, but ... well :)  What I had in mind was rather to test 
if at least two of the insns have a constant among their leaves.  A la:

src = SET_SRC (set);
if (BINARY_P (src)
    && (CONSTANT_P (XEXP (src, 0)) || CONSTANT_P (XEXP (src, 1))))
  ncstleaf++;

And then only do something if ncstleaf >= 2, or possibly in combination 
with the above heuristics.  The idea is that if those constant leaves 
are operands of arithmetic chains, then they most probably can be combined 
somehow.
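
Spelled out, following the shape of the counting loop in Bernd's mail
(a sketch):

  if (i0)
    {
      int i;
      int ncstleaf = 0;
      for (i = 0; i < 3; i++)
        {
          rtx insn = i == 0 ? i0 : i == 1 ? i1 : i2;
          rtx set = single_set (insn);
          rtx src;

          if (!set)
            continue;
          src = SET_SRC (set);
          if (BINARY_P (src)
              && (CONSTANT_P (XEXP (src, 0)) || CONSTANT_P (XEXP (src, 1))))
            ncstleaf++;
        }
      if (ncstleaf < 2)
        return 0;
    }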

> With the heuristic, we still catch the majority of interesting cases on
> Thumb-1, with a reduced number of attempts, but we also miss some
> optimizations like these:
> 
> -       add     r2, r2, #1
>         lsl     r2, r2, #5
> -       add     r3, r3, r2
> -       sub     r3, r3, #32
> +       add     r3, r2, r3

I believe this case should then be included; it actually has three 
constant leaves in four instructions.

> ====
> -       mvn     r3, r3
>         lsr     r3, r3, #16
> -       mvn     r3, r3
> ====

Hmm, aren't these originally three RTL insns?  Why does the four-combine 
heuristic affect this sequence?


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-06 18:56       ` Vladimir N. Makarov
                           ` (2 preceding siblings ...)
  2010-08-08 11:40         ` Paolo Bonzini
@ 2010-08-12  5:53         ` Ian Lance Taylor
  3 siblings, 0 replies; 129+ messages in thread
From: Ian Lance Taylor @ 2010-08-12  5:53 UTC (permalink / raw)
  To: Vladimir N. Makarov
  Cc: Steven Bosscher, Paolo Bonzini, Bernd Schmidt, GCC Patches

On Fri, Aug 6, 2010 at 11:56 AM, Vladimir N. Makarov
<vmakarov@redhat.com> wrote:
> On 08/06/2010 01:22 PM, Steven Bosscher wrote:

>> Agreed. There is the work from Preston Briggs, but that appears to
>> have gone no-where, unfortunately.
>>
> IMHO the code should have been made public if we want to see progress on
> solving the problem.  But it is Google property and they probably decided not
> to give it to the gcc community.  Although I heard that Ken Zadeck has access
> to it.  We will see what the final result will be.
>
> I can only guess, from speculation on the info I have, that Preston's code
> is probably not so valuable for GCC because, as I understand it, the backend
> was only specialized for x86/x86_64.

The work by Preston and others was incomplete, was x86 specific, did
not work with existing MD files, and the tests that worked showed that
it didn't do any better than IRA.  So the project was cancelled.

There is nothing secret about the code, but since it essentially means
replacing the RTL backend I'm not sure it's useful in the context of
gcc.  It could only ever have been used if it were significantly
better than what we have today.

Ian

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 20:09                                     ` Tom Tromey
  2010-08-10 20:23                                       ` Andi Kleen
@ 2010-08-12 21:09                                       ` Nathan Froyd
  2010-08-17 15:14                                       ` Mark Mitchell
  2 siblings, 0 replies; 129+ messages in thread
From: Nathan Froyd @ 2010-08-12 21:09 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Andi Kleen, Richard Guenther, gcc-patches

On Tue, Aug 10, 2010 at 01:51:46PM -0600, Tom Tromey wrote:
> >>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes:
> 
> Richard> I'm sure there is a way to open-code this using integer math.
> 
> Andi> I don't think so. Take a look at what PCMPESTRI does.  There's no easy 
> Andi> replacement, even if you use all the Hacker's Delight tricks 
> Andi> (it's really a cool instruction, but also very complicated :-)
> 
> This is still pending:
> 
> http://gcc.gnu.org/ml/gcc-patches/2010-03/msg00526.html
> 
> I think any sort of hackery in this area is fine, if it speeds up the
> compiler.  All that is needed is someone with the time and motivation to
> do the testing.

FWIW, testing it on x86-64 on two source files lying about (admittedly a
rather small sample size) reduced the # of instructions executed cc1plus
-E by ~6% in one case and 0% (yes, exactly 0%) in the other.  Perhaps
there's a bug lying about (the patch wouldn't compile on a little-endian
machine, for one...).

I like Andi's PCMPESTRI idea, but that does nothing for the many people
without Intel's latest generation of chips, let alone people on other
platforms.

-Nathan

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Vectorized _cpp_clean_line
  2010-08-10 14:58                               ` Andi Kleen
  2010-08-10 15:03                                 ` Richard Guenther
  2010-08-10 20:15                                 ` H.J. Lu
@ 2010-08-12 21:38                                 ` Richard Henderson
  2010-08-12 22:18                                   ` Andi Kleen
  2 siblings, 1 reply; 129+ messages in thread
From: Richard Henderson @ 2010-08-12 21:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 785 bytes --]

On 08/10/2010 07:48 AM, Andi Kleen wrote:
> @@ -109,12 +185,41 @@ _cpp_clean_line (cpp_reader *pfile)
>    buffer->cur_note = buffer->notes_used = 0;
>    buffer->cur = buffer->line_base = buffer->next_line;
>    buffer->need_line = false;
> -  s = buffer->next_line - 1;
> +  s = buffer->next_line;

I believe I've found the extra failures.  You adjusted this path:
>  
>    if (!buffer->from_stage3)
>      {
...
> +
> +      s--;

but not the from_stage3 path, like so:

   else
     {
-      do
+      while (*s != '\n' && *s != '\r')
        s++;
-      while (*s != '\n' && *s != '\r');
       d = (uchar *) s;


I'm currently testing this patch that merges your patch with
David Miller's bit-twiddling version, plus extra stuff for
Alpha, IA-64, and SSE2 (without SSE4.2).


r~

[-- Attachment #2: searchline-1 --]
[-- Type: text/plain, Size: 16359 bytes --]

diff --git a/libcpp/init.c b/libcpp/init.c
index c5b8c28..769aa50 100644
--- a/libcpp/init.c
+++ b/libcpp/init.c
@@ -137,6 +137,8 @@ init_library (void)
 #ifdef ENABLE_NLS
        (void) bindtextdomain (PACKAGE, LOCALEDIR);
 #endif
+
+       init_vectorized_lexer ();
     }
 }
 
diff --git a/libcpp/internal.h b/libcpp/internal.h
index 9209b55..10ed033 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -725,6 +725,8 @@ ufputs (const unsigned char *s, FILE *f)
   return fputs ((const char *)s, f);
 }
 
+extern void init_vectorized_lexer (void);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/libcpp/lex.c b/libcpp/lex.c
index f628272..6ea4667 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -96,6 +96,408 @@ add_line_note (cpp_buffer *buffer, const uchar *pos, unsigned int type)
   buffer->notes_used++;
 }
 
+\f
+/* Fast path to find line special characters using optimized character
+   scanning algorithms.  Anything complicated falls back to the slow
+   path below.  Since this loop is very hot it's worth doing these kinds
+   of optimizations.
+
+   One of the paths through the ifdefs should provide 
+
+     bool search_line_fast (const uchar *s, const uchar *end,
+                            const uchar **out)
+
+   Between S and END, search for \n, \r, \\, ?.  Return true if found.
+   Always update *OUT to the last character scanned, even if not found.  */
+
+/* ??? Should be in configure.ac.  */
+#define WORDS_BIGENDIAN 0
+
+/* We'd like the largest integer that fits into a register.  There's nothing
+   in <stdint.h> that gives us that.  For most hosts this is unsigned long,
+   but MS decided on an LLP64 model.  Thankfully when building with GCC we
+   can get the "real" word size.  */
+#ifdef __GNUC__
+typedef unsigned int word_type __attribute__((__mode__(__word__)));
+#else
+typedef unsigned long word_type;
+#endif
+
+/* Return X with the first N bytes forced to values that won't match one
+   of the interesting characters.  Note that NUL is not interesting.  */
+
+static inline word_type
+acc_char_mask_misalign (word_type val, unsigned int n)
+{
+  word_type mask = -1;
+  if (WORDS_BIGENDIAN)
+    mask >>= n * 8;
+  else
+    mask <<= n * 8;
+  return val & mask;
+}
+
+/* Return X replicated to all byte positions within WORD_TYPE.  */
+
+static inline word_type
+acc_char_replicate (uchar x)
+{
+  word_type ret;
+
+  ret = (x << 24) | (x << 16) | (x << 8) | x;
+  switch (sizeof (ret))
+    {
+    case 8:
+      ret = (ret << 16 << 16) | ret;
+      break;
+    case 4:
+      break;
+    default:
+      abort ();
+    }
+  return ret;
+}
+
+/* Return non-zero if some byte of VAL is (probably) C.  */
+
+static inline word_type
+acc_char_cmp (word_type val, word_type c)
+{
+#if defined(__GNUC__) && defined(__alpha__)
+  /* We can get exact results using a compare-bytes instruction.  Since
+     comparison is always >=, we could either do ((v >= c) & (c >= v))
+     for 3 operations or force matching bytes to zero and compare vs
+     zero with (0 >= (val ^ c)) for 2 operations.  */
+  return __builtin_alpha_cmpbge (0, val ^ c);
+#elif defined(__GNUC__) && defined(__ia64__)
+  /* ??? Ideally we'd have some sort of builtin for this, so that we
+     can pack the 4 instances into fewer bundles.  */
+  word_type ret;
+  __asm__("pcmp1.eq %0 = %1, %2" : "=r"(ret) : "r"(val), "r"(c));
+  return ret;
+#else
+  word_type magic = 0x7efefefeU;
+  switch (sizeof(word_type))
+    {
+    case 8:
+      magic = (magic << 16 << 16) | 0xfefefefeU;
+    case 4:
+      break;
+    default:
+      abort ();
+    }
+  magic |= 1;
+
+  val ^= c;
+  return ((val + magic) ^ ~val) & ~magic;
+#endif
+}
+
+/* Given the result of acc_char_cmp is non-zero, return the index of
+   the found character.  If this was a false positive, return -1.  */
+
+static inline int
+acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
+		word_type val ATTRIBUTE_UNUSED)
+{
+#if defined(__GNUC__) && defined(__alpha__)
+  /* The cmpbge instruction sets *bits* of the result corresponding to
+     matches in the bytes with no false positives.  This means that clz/ctx
+     produces a correct unscaled result.  */
+  return (WORDS_BIGENDIAN ? __builtin_clzl (cmp) : __builtin_ctzl (cmp));
+#elif defined(__GNUC__) && defined(__ia64__)
+  /* The pcmp1 instruction sets matching bytes to 0xff with no false
+     positives.  This means that clz/ctz produces a correct scaled result.  */
+  return (WORDS_BIGENDIAN ? __builtin_clzl (cmp) : __builtin_ctzl (cmp)) >> 3;
+#else
+  unsigned int i;
+
+  /* ??? It would be nice to force unrolling here,
+     and have all of these constants folded.  */
+  for (i = 0; i < sizeof(word_type); ++i)
+    {
+      uchar c;
+      if (WORDS_BIGENDIAN)
+	c = (val >> (sizeof(word_type) - i - 1) * 8) & 0xff;
+      else
+	c = (val >> i * 8) & 0xff;
+
+      if (c == '\n' || c == '\r' || c == '\\' || c == '?')
+	return i;
+    }
+
+  return -1;
+#endif
+}
+
+/* A version of the fast scanner using bit fiddling techniques.
+ 
+   For 32-bit words, one would normally perform 16 comparisons and
+   16 branches.  With this algorithm one performs 24 arithmetic
+   operations and one branch.  Whether this is faster with a 32-bit
+   word size is going to be somewhat system dependent.
+
+   For 64-bit words, we eliminate twice the number of comparisons
+   and branches without increasing the number of arithmetic operations.
+   It's almost certainly going to be a win with 64-bit word size.  */
+
+static bool
+search_line_acc_char (const uchar *s, const uchar *end, const uchar **out)
+{
+  const word_type repl_nl = acc_char_replicate ('\n');
+  const word_type repl_cr = acc_char_replicate ('\r');
+  const word_type repl_bs = acc_char_replicate ('\\');
+  const word_type repl_qm = acc_char_replicate ('?');
+
+  unsigned int misalign;
+  ptrdiff_t left;
+  const word_type *p;
+  word_type val;
+  
+  /* Don't bother optimizing very short lines; too much masking to do.  */
+  left = end - s;
+  if (left < (ptrdiff_t) sizeof(word_type))
+    {
+      *out = s;
+      return false;
+    }
+
+  /* Align the buffer.  Mask out any bytes from before the beginning.  */
+  p = (word_type *)((uintptr_t)s & -sizeof(word_type));
+  val = *p;
+  misalign = (uintptr_t)s & (sizeof(word_type) - 1);
+  if (misalign)
+    {
+      val = acc_char_mask_misalign (val, misalign);
+      left += misalign;
+    }
+
+  /* Main loop.  */
+  while (1)
+    {
+      word_type m_nl, m_cr, m_bs, m_qm, t;
+
+      m_nl = acc_char_cmp (val, repl_nl);
+      m_cr = acc_char_cmp (val, repl_cr);
+      m_bs = acc_char_cmp (val, repl_bs);
+      m_qm = acc_char_cmp (val, repl_qm);
+      t = (m_nl | m_cr) | (m_bs | m_qm);
+
+      if (__builtin_expect (t != 0, 0))
+	{
+	  int i = acc_char_index (t, val);
+	  if (i >= 0)
+	    {
+	      *out = (const uchar *)p + i;
+	      return true;
+	    }
+	}
+
+      left -= sizeof(word_type);
+      ++p;
+      if (left < (ptrdiff_t) sizeof(word_type))
+	{
+	  /* Ran out of complete words.  */
+	  *out = (const uchar *)p;
+	  return false;
+	}
+      val = *p;
+    }
+}
+
+#if (GCC_VERSION >= 4005) && (defined(__i386__) || defined(__x86_64__))
+/* We'd like to share this table between the sse2 and sse4.2 functions
+   below, but we can't define an SSE vector type outside of a context
+   of a function that forces SSE enabled.  So we must define this table
+   in terms of characters.  */
+static const uchar sse_mask_align[16][16] __attribute__((aligned(16))) =
+{
+  {  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0 },
+  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0 }
+};
+
+/* Fast path to find line special characters using SSE 4.2 vectorized string 
+   instructions. Anything complicated falls back to the slow path below. 
+   Since this loop is very hot it's worth doing these kinds of
+   optimizations.  Returns true if a stopper character is found.
+
+   We should be using the _mm intrinsics, but the xxxintr headers do things
+   not allowed in gcc.  So instead use direct builtins.  */
+
+static bool __attribute__((__target__("sse4.2")))
+search_line_sse42 (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef char m128i __attribute__ ((__vector_size__ (16)));
+  static const m128i search
+    = { '\n', '\r', '?', '\\', 0,0,0,0,0,0,0,0,0,0,0,0 };
+
+  ptrdiff_t left;
+  unsigned int misalign;
+  const m128i *p;
+  m128i data;
+
+  left = end - s;
+  if (left <= 0)
+    goto no_match;
+
+  /* Align the pointer and mask out the bytes before the start.  */
+  misalign = (uintptr_t)s & 15;
+  p = (m128i *)((uintptr_t)s & -16);
+  data = *p;
+  if (misalign)
+    {
+      data |= ((const m128i *)sse_mask_align)[misalign];
+      left += misalign;
+    }
+
+  while (1)
+    {
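+      /* Immediate operand 0 selects unsigned bytes, "equal any"
+	 aggregation, and the index of the least significant match.
+	 The explicit lengths declare 4 valid bytes in SEARCH and LEFT
+	 (capped at 16 by the instruction) in DATA; bytes beyond
+	 either length never match.  */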
+      int index = __builtin_ia32_pcmpestri128 (search, 4, data, left, 0);
+      if (__builtin_expect (index < 16, 0)) 
+	{
+	  *out = (const uchar *)p + index;
+	  return true;
+	}
+
+      left -= 16;
+      if (left <= 0)
+	goto no_match;
+
+      data = *++p;
+    }
+
+ no_match:
+  /* No match within buffer.  */
+  *out = end;
+  return false;
+}
+
+static bool __attribute__((__target__("sse2")))
+search_line_sse2 (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef char m128i __attribute__ ((__vector_size__ (16)));
+
+  static const m128i repl_nl = {
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', 
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n'
+  };
+  static const m128i repl_cr = {
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r', 
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r'
+  };
+  static const m128i repl_bs = {
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\', 
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\'
+  };
+  static const m128i repl_qm = {
+    '?', '?', '?', '?', '?', '?', '?', '?', 
+    '?', '?', '?', '?', '?', '?', '?', '?', 
+  };
+
+  ptrdiff_t left;
+  unsigned int misalign, mask;
+  const m128i *p;
+  m128i data, t;
+
+  left = end - s;
+  if (left <= 0)
+    goto no_match;
+
+  /* Align the pointer and mask out the bytes before the start.  */
+  misalign = (uintptr_t)s & 15;
+  p = (m128i *)((uintptr_t)s & -16);
+  data = *p;
+  if (misalign)
+    {
+      data |= ((const m128i *)sse_mask_align)[misalign];
+      left += misalign;
+    }
+
+  while (1)
+    {
+      t  = __builtin_ia32_pcmpeqb128(data, repl_nl);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_cr);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_bs);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_qm);
+      mask = __builtin_ia32_pmovmskb128 (t);
+
+      if (__builtin_expect (mask, 0))
+	break;
+
+      left -= 16;
+      if (left <= 0)
+	goto no_match;
+      data = *++p;
+    }
+
+  /* If there were fewer than 16 bytes left in the buffer, then there
+     will be comparisons beyond the end of the buffer.  Mask them out.  */
+  if (left < 16)
+    {
+      mask &= (2u << left) - 1;
+      if (mask == 0)
+	goto no_match;
+    }
+
+  *out = (const uchar *)p + __builtin_ctz (mask);
+  return true;
+
+ no_match:
+  /* No match within buffer.  */
+  *out = end;
+  return false;
+}
+
+/* Check if the CPU supports vectorized string instructions.  */
+
+#include "../gcc/config/i386/cpuid.h"
+
+static bool (*search_line_fast)(const uchar *, const uchar *, const uchar **)
+  = search_line_acc_char;
+
+void 
+init_vectorized_lexer (void)
+{
+  unsigned dummy, ecx, edx;
+
+  if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx))
+    {
+      if (ecx & bit_SSE4_2)
+	search_line_fast = search_line_sse42;
+      else if (edx & bit_SSE2)
+	search_line_fast = search_line_sse2;
+    }
+}
+
+#else
+
+/* We only have one accelerated alternative.  Use a direct call so that
+   we encourage inlining.  We must still provide the init_vectorized_lexer
+   entry point, even though it does nothing.  */
+
+#define search_line_fast  search_line_acc_char
+
+void 
+init_vectorized_lexer (void)
+{
+}
+
+#endif
+
 /* Returns with a logical line that contains no escaped newlines or
    trigraphs.  This is a time-critical inner loop.  */
 void
@@ -109,12 +511,34 @@ _cpp_clean_line (cpp_reader *pfile)
   buffer->cur_note = buffer->notes_used = 0;
   buffer->cur = buffer->line_base = buffer->next_line;
   buffer->need_line = false;
-  s = buffer->next_line - 1;
+  s = buffer->next_line;
 
   if (!buffer->from_stage3)
     {
       const uchar *pbackslash = NULL;
 
+    found_bs:
+      /* Perform an optimized search for \n, \r, \\, ?.  */
+      if (search_line_fast (s, buffer->rlimit, &s))
+	{
+	  c = *s;
+	      
+	  /* Special case for backslash which is reasonably common.
+	     Continue searching using the fast path.  */
+	  if (c == '\\') 
+	    {
+	      pbackslash = s;
+	      s++;
+	      goto found_bs;
+	    }
+	  if (__builtin_expect (c == '?', false))
+	    goto found_qm;
+	  else
+	    goto found_nl_cr;
+	}
+
+      s--;
+
       /* Short circuit for the common case of an un-escaped line with
 	 no trigraphs.  The primary win here is by not writing any
 	 data back to memory until we have to.  */
@@ -124,6 +548,7 @@ _cpp_clean_line (cpp_reader *pfile)
 	  if (__builtin_expect (c == '\n', false)
 	      || __builtin_expect (c == '\r', false))
 	    {
+	    found_nl_cr:
 	      d = (uchar *) s;
 
 	      if (__builtin_expect (s == buffer->rlimit, false))
@@ -157,26 +582,28 @@ _cpp_clean_line (cpp_reader *pfile)
 	    }
 	  if (__builtin_expect (c == '\\', false))
 	    pbackslash = s;
-	  else if (__builtin_expect (c == '?', false)
-		   && __builtin_expect (s[1] == '?', false)
-		   && _cpp_trigraph_map[s[2]])
+	  else if (__builtin_expect (c == '?', false))
 	    {
-	      /* Have a trigraph.  We may or may not have to convert
-		 it.  Add a line note regardless, for -Wtrigraphs.  */
-	      add_line_note (buffer, s, s[2]);
-	      if (CPP_OPTION (pfile, trigraphs))
+	    found_qm:
+	      if (__builtin_expect (s[1] == '?', false)
+		   && _cpp_trigraph_map[s[2]])
 		{
-		  /* We do, and that means we have to switch to the
-		     slow path.  */
-		  d = (uchar *) s;
-		  *d = _cpp_trigraph_map[s[2]];
-		  s += 2;
-		  break;
+		  /* Have a trigraph.  We may or may not have to convert
+		     it.  Add a line note regardless, for -Wtrigraphs.  */
+		  add_line_note (buffer, s, s[2]);
+		  if (CPP_OPTION (pfile, trigraphs))
+		    {
+		      /* We do, and that means we have to switch to the
+		         slow path.  */
+		      d = (uchar *) s;
+		      *d = _cpp_trigraph_map[s[2]];
+		      s += 2;
+		      break;
+		    }
 		}
 	    }
 	}
 
-
       for (;;)
 	{
 	  c = *++s;
@@ -184,7 +611,7 @@ _cpp_clean_line (cpp_reader *pfile)
 
 	  if (c == '\n' || c == '\r')
 	    {
-		  /* Handle DOS line endings.  */
+	      /* Handle DOS line endings.  */
 	      if (c == '\r' && s != buffer->rlimit && s[1] == '\n')
 		s++;
 	      if (s == buffer->rlimit)
@@ -215,9 +642,8 @@ _cpp_clean_line (cpp_reader *pfile)
     }
   else
     {
-      do
+      while (*s != '\n' && *s != '\r')
 	s++;
-      while (*s != '\n' && *s != '\r');
       d = (uchar *) s;
 
       /* Handle DOS line endings.  */
diff --git a/libcpp/system.h b/libcpp/system.h
index 2472799..1a74734 100644
--- a/libcpp/system.h
+++ b/libcpp/system.h
@@ -29,6 +29,9 @@ along with GCC; see the file COPYING3.  If not see
 #ifdef HAVE_STDDEF_H
 # include <stddef.h>
 #endif
+#ifdef HAVE_STDINT_H
+# include <stdint.h>
+#endif
 
 #include <stdio.h>
 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Vectorized _cpp_clean_line
  2010-08-12 21:38                                 ` Vectorized _cpp_clean_line Richard Henderson
@ 2010-08-12 22:18                                   ` Andi Kleen
  2010-08-12 22:32                                     ` Richard Henderson
  0 siblings, 1 reply; 129+ messages in thread
From: Andi Kleen @ 2010-08-12 22:18 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Andi Kleen, gcc-patches

> 
> I believe I've found the extra failures.  You adjusted this path:

Thanks.

> I'm currently testing this patch that merges your patch with
> David Miller's bit-twiddling version, plus extra stuff for
> Alpha, IA-64, and SSE2 (without SSE4.2).

At least for sse 4.2 I'm not sure the table lookup
for alignment is worth it. The unaligned loads are quite
cheap on current micro architectures with sse 4.2 
and the page end test is also not that expensive.
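
Roughly what I mean, as an untested sketch on top of your patch
(same builtins, 4K page assumption):

static bool __attribute__((__target__("sse4.2")))
search_line_sse42 (const uchar *s, const uchar *end, const uchar **out)
{
  typedef char m128i __attribute__ ((__vector_size__ (16)));
  static const m128i search
    = { '\n', '\r', '?', '\\', 0,0,0,0,0,0,0,0,0,0,0,0 };

  ptrdiff_t left;
  m128i data;
  int index;

  for (left = end - s; left > 0; left -= 16, s += 16)
    {
      /* The length operand keeps pcmpestri from matching past END, but
	 the unaligned load itself must not cross into a possibly
	 unmapped page; one cheap test guards the final partial chunk.  */
      if (left < 16 && ((uintptr_t)s & 0xfff) > 0xff0)
	break;

      data = __builtin_ia32_loaddqu ((const char *)s);
      index = __builtin_ia32_pcmpestri128 (search, 4, data, left, 0);
      if (__builtin_expect (index < 16, 0))
	{
	  *out = s + index;
	  return true;
	}
    }

  /* No match; report how far the fast scan got.  */
  *out = s < end ? s : end;
  return false;
}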

I originally avoided the indirect call because I was worried
about the effect on CPUs with indirect branch predictor.

-Andi

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Vectorized _cpp_clean_line
  2010-08-12 22:18                                   ` Andi Kleen
@ 2010-08-12 22:32                                     ` Richard Henderson
  2010-08-12 23:10                                       ` Richard Henderson
  2010-08-13  7:26                                       ` Andi Kleen
  0 siblings, 2 replies; 129+ messages in thread
From: Richard Henderson @ 2010-08-12 22:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gcc-patches

On 08/12/2010 03:07 PM, Andi Kleen wrote:
> At least for sse 4.2 I'm not sure the table lookup
> for alignment is worth it. The unaligned loads are quite
> cheap on current micro architectures with sse 4.2 
> and the page end test is also not that expensive.

Perhaps.  That's something else that will want testing, as
it's all of a dozen instructions.

At minimum the page end test should not be performed inside
the loop.  We can adjust END before beginning the loop so
that we never cross a page.
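
E.g. (sketch, assuming 4K pages; END points one past the last
buffer byte):

  /* Full 16-byte chunks lie entirely within the buffer and cannot
     fault; only the final partial load can cross into the next page.
     If the last buffer byte is within 15 bytes of its page end, drop
     the partial chunk from the vectorized region and let the
     byte-wise path in the caller finish it.  */
  if (((uintptr_t)(end - 1) & 0xfff) > 0xff0)
    end -= (end - s) & 15;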

> I originally avoided the indirect call because I was worried
> about the effect on CPUs with indirect branch predictor.

WithOUT the indirect branch predictor, you mean?  Which ones
don't have that?  Surely we have to be going back pretty far...

Since the call is the same destination every time, that matches
up well with the indirect branch predictor, AFAIK.  If we're
worried about the indirect branch predictor, we could write

static inline bool
search_line_fast (s, end, out)
{
  if (fast_impl == 0)
    return search_line_sse42 (s, end, out);
  else if (fast_impl == 1)
    return search_line_sse2 (s, end, out);
  else
    return search_line_acc_char (s, end, out);
}

where FAST_IMPL is set up appropriately by init_vectorized_lexer.
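
For instance (sketch, reusing the cpuid check already in the patch):

static int fast_impl = 2;

void
init_vectorized_lexer (void)
{
  unsigned dummy, ecx, edx;

  if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx))
    {
      if (ecx & bit_SSE4_2)
	fast_impl = 0;
      else if (edx & bit_SSE2)
	fast_impl = 1;
    }
}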

The question being, are three predicted jumps faster than one
indirect jump on a processor without the proper predictor?


r~

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Vectorized _cpp_clean_line
  2010-08-12 22:32                                     ` Richard Henderson
@ 2010-08-12 23:10                                       ` Richard Henderson
  2010-08-12 23:13                                         ` Richard Henderson
  2010-08-13  8:33                                         ` Andi Kleen
  2010-08-13  7:26                                       ` Andi Kleen
  1 sibling, 2 replies; 129+ messages in thread
From: Richard Henderson @ 2010-08-12 23:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 779 bytes --]

On 08/12/2010 03:23 PM, Richard Henderson wrote:
> On 08/12/2010 03:07 PM, Andi Kleen wrote:
>> At least for sse 4.2 I'm not sure the table lookup
>> for alignment is worth it. The unaligned loads are quite
>> cheap on current micro architectures with sse 4.2 
>> and the page end test is also not that expensive.
> 
> Perhaps.  That's something else that will want testing, as
> it's all of a dozen instructions.

Alternatively, we can check for the page end only at the end
of the buffer, because obviously all the middle bits
of the buffer are present.

I've done nothing with the indirect branch yet, but here's
a version that bootstraps and checks ok for x86_64 (w/sse4.2).
It also bootstraps on i686 (w/sse2 but not sse4.2); I've just
started the test run for that though.


r~

[-- Attachment #2: searchline-2 --]
[-- Type: text/plain, Size: 16400 bytes --]

diff --git a/libcpp/init.c b/libcpp/init.c
index c5b8c28..769aa50 100644
--- a/libcpp/init.c
+++ b/libcpp/init.c
@@ -137,6 +137,8 @@ init_library (void)
 #ifdef ENABLE_NLS
        (void) bindtextdomain (PACKAGE, LOCALEDIR);
 #endif
+
+       init_vectorized_lexer ();
     }
 }
 
diff --git a/libcpp/internal.h b/libcpp/internal.h
index 9209b55..10ed033 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -725,6 +725,8 @@ ufputs (const unsigned char *s, FILE *f)
   return fputs ((const char *)s, f);
 }
 
+extern void init_vectorized_lexer (void);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/libcpp/lex.c b/libcpp/lex.c
index f628272..96fb2dd 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -96,6 +96,399 @@ add_line_note (cpp_buffer *buffer, const uchar *pos, unsigned int type)
   buffer->notes_used++;
 }
 
+\f
+/* Fast path to find line special characters using optimized character
+   scanning algorithms.  Anything complicated falls back to the slow
+   path below.  Since this loop is very hot it's worth doing these kinds
+   of optimizations.
+
+   One of the paths through the ifdefs should provide 
+
+     bool search_line_fast (const uchar *s, const uchar *end,
+                            const uchar **out)
+
+   Between S and END, search for \n, \r, \\, ?.  Return true if found.
+   Always update *OUT to the last character scanned, even if not found.  */
+
+/* ??? Should be in configure.ac.  */
+#define WORDS_BIGENDIAN 0
+
+/* We'd like the largest integer that fits into a register.  There's nothing
+   in <stdint.h> that gives us that.  For most hosts this is unsigned long,
+   but MS decided on an LLP64 model.  Thankfully when building with GCC we
+   can get the "real" word size.  */
+#ifdef __GNUC__
+typedef unsigned int word_type __attribute__((__mode__(__word__)));
+#else
+typedef unsigned long word_type;
+#endif
+
+/* Return VAL with the first N bytes forced to values that won't match one
+   of the interesting characters.  Note that NUL is not interesting.  */
+
+static inline word_type
+acc_char_mask_misalign (word_type val, unsigned int n)
+{
+  word_type mask = -1;
+  if (WORDS_BIGENDIAN)
+    mask >>= n * 8;
+  else
+    mask <<= n * 8;
+  return val & mask;
+}
+
+/* Return X replicated to all byte positions within WORD_TYPE.  */
+
+static inline word_type
+acc_char_replicate (uchar x)
+{
+  word_type ret;
+
+  ret = (x << 24) | (x << 16) | (x << 8) | x;
+  switch (sizeof (ret))
+    {
+    case 8:
+      ret = (ret << 16 << 16) | ret;
+      break;
+    case 4:
+      break;
+    default:
+      abort ();
+    }
+  return ret;
+}
+
+/* Return non-zero if some byte of VAL is (probably) C.  */
+
+static inline word_type
+acc_char_cmp (word_type val, word_type c)
+{
+#if defined(__GNUC__) && defined(__alpha__)
+  /* We can get exact results using a compare-bytes instruction.  Since
+     comparison is always >=, we could either do ((v >= c) & (c >= v))
+     for 3 operations or force matching bytes to zero and compare vs
+     zero with (0 >= (val ^ c)) for 2 operations.  */
+  return __builtin_alpha_cmpbge (0, val ^ c);
+#elif defined(__GNUC__) && defined(__ia64__)
+  /* ??? Ideally we'd have some sort of builtin for this, so that we
+     can pack the 4 instances into fewer bundles.  */
+  word_type ret;
+  __asm__("pcmp1.eq %0 = %1, %2" : "=r"(ret) : "r"(val), "r"(c));
+  return ret;
+#else
+  word_type magic = 0x7efefefeU;
+  switch (sizeof(word_type))
+    {
+    case 8:
+      magic = (magic << 16 << 16) | 0xfefefefeU;
+    case 4:
+      break;
+    default:
+      abort ();
+    }
+  magic |= 1;
+
+  val ^= c;
+  return ((val + magic) ^ ~val) & ~magic;
+#endif
+}
+
+/* Given the result of acc_char_cmp is non-zero, return the index of
+   the found character.  If this was a false positive, return -1.  */
+
+static inline int
+acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
+		word_type val ATTRIBUTE_UNUSED)
+{
+#if defined(__GNUC__) && defined(__alpha__)
+  /* The cmpbge instruction sets *bits* of the result corresponding to
+     matches in the bytes with no false positives.  This means that clz/ctz
+     produces a correct unscaled result.  */
+  return (WORDS_BIGENDIAN ? __builtin_clzl (cmp) : __builtin_ctzl (cmp));
+#elif defined(__GNUC__) && defined(__ia64__)
+  /* The pcmp1 instruction sets matching bytes to 0xff with no false
+     positives.  This means that clz/ctz produces a correct scaled result.  */
+  return (WORDS_BIGENDIAN ? __builtin_clzl (cmp) : __builtin_ctzl (cmp)) >> 3;
+#else
+  unsigned int i;
+
+  /* ??? It would be nice to force unrolling here,
+     and have all of these constants folded.  */
+  for (i = 0; i < sizeof(word_type); ++i)
+    {
+      uchar c;
+      if (WORDS_BIGENDIAN)
+	c = (val >> (sizeof(word_type) - i - 1) * 8) & 0xff;
+      else
+	c = (val >> i * 8) & 0xff;
+
+      if (c == '\n' || c == '\r' || c == '\\' || c == '?')
+	return i;
+    }
+
+  return -1;
+#endif
+}
+
+/* A version of the fast scanner using bit fiddling techniques.
+ 
+   For 32-bit words, one would normally perform 16 comparisons and
+   16 branches.  With this algorithm one performs 24 arithmetic
+   operations and one branch.  Whether this is faster with a 32-bit
+   word size is going to be somewhat system dependent.
+
+   For 64-bit words, we eliminate twice the number of comparisons
+   and branches without increasing the number of arithmetic operations.
+   It's almost certainly going to be a win with 64-bit word size.  */
+
+static bool
+search_line_acc_char (const uchar *s, const uchar *end, const uchar **out)
+{
+  const word_type repl_nl = acc_char_replicate ('\n');
+  const word_type repl_cr = acc_char_replicate ('\r');
+  const word_type repl_bs = acc_char_replicate ('\\');
+  const word_type repl_qm = acc_char_replicate ('?');
+
+  unsigned int misalign;
+  ptrdiff_t left;
+  const word_type *p;
+  word_type val;
+  
+  /* Don't bother optimizing very short lines; too much masking to do.  */
+  left = end - s;
+  if (left < (ptrdiff_t) sizeof(word_type))
+    {
+      *out = s;
+      return false;
+    }
+
+  /* Align the buffer.  Mask out any bytes from before the beginning.  */
+  p = (word_type *)((uintptr_t)s & -sizeof(word_type));
+  val = *p;
+  misalign = (uintptr_t)s & (sizeof(word_type) - 1);
+  if (misalign)
+    {
+      val = acc_char_mask_misalign (val, misalign);
+      left += misalign;
+    }
+
+  /* Main loop.  */
+  while (1)
+    {
+      word_type m_nl, m_cr, m_bs, m_qm, t;
+
+      m_nl = acc_char_cmp (val, repl_nl);
+      m_cr = acc_char_cmp (val, repl_cr);
+      m_bs = acc_char_cmp (val, repl_bs);
+      m_qm = acc_char_cmp (val, repl_qm);
+      t = (m_nl | m_cr) | (m_bs | m_qm);
+
+      if (__builtin_expect (t != 0, 0))
+	{
+	  int i = acc_char_index (t, val);
+	  if (i >= 0)
+	    {
+	      *out = (const uchar *)p + i;
+	      return true;
+	    }
+	}
+
+      left -= sizeof(word_type);
+      ++p;
+      if (left < (ptrdiff_t) sizeof(word_type))
+	{
+	  /* Ran out of complete words.  */
+	  *out = (const uchar *)p;
+	  return false;
+	}
+      val = *p;
+    }
+}
+
+#if (GCC_VERSION >= 4005) && (defined(__i386__) || defined(__x86_64__))
+/* Fast path to find line special characters using SSE 4.2 vectorized string 
+   instructions. Anything complicated falls back to the slow path below. 
+   Since this loop is very hot it's worth doing these kinds of
+   optimizations.  Returns true if a stopper character is found.
+
+   We should be using the _mm intrinsics, but the xxxintr headers do things
+   not allowed in gcc.  So instead use direct builtins.  */
+
+static bool __attribute__((__target__("sse4.2")))
+search_line_sse42 (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef char m128i __attribute__ ((__vector_size__ (16)));
+  static const m128i search
+    = { '\n', '\r', '?', '\\', 0,0,0,0,0,0,0,0,0,0,0,0 };
+
+  ptrdiff_t left;
+  m128i data;
+  int index;
+
+  /* Main loop, processing 16 bytes at a time.  */
+  for (left = end - s; left >= 16; left -= 16, s += 16)
+    {
+      data = __builtin_ia32_loaddqu((const char *)s);
+      index = __builtin_ia32_pcmpestri128 (search, 4, data, 16, 0);
+      if (__builtin_expect (index < 16, 0)) 
+	{
+	  *out = (const uchar *)s + index;
+	  return true;
+	}
+    }
+
+  /* There are fewer than 16 bytes remaining.  If we can read those bytes
+     without reading from a possibly unmapped next page, then go ahead and
+     do so.  If we are near the end of the page then don't bother; returning
+     will scan the balance of the buffer byte-by-byte.  */
+  if (left > 0 && ((uintptr_t)s & 0xfff) <= 0xff0)
+    {
+      data = __builtin_ia32_loaddqu((const char *)s);
+      index = __builtin_ia32_pcmpestri128 (search, 4, data, left, 0);
+      if (__builtin_expect (index < 16, 0))
+	{
+	  *out = (const uchar *)s + index;
+	  return true;
+	}
+      s = end;
+    }
+
+  /* No match within buffer.  */
+  *out = s;
+  return false;
+}
+
+static bool __attribute__((__target__("sse2")))
+search_line_sse2 (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef char m128i __attribute__ ((__vector_size__ (16)));
+
+  static const m128i mask_align[16] = {
+    {  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0 }
+  };
+
+  static const m128i repl_nl = {
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', 
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n'
+  };
+  static const m128i repl_cr = {
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r', 
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r'
+  };
+  static const m128i repl_bs = {
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\', 
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\'
+  };
+  static const m128i repl_qm = {
+    '?', '?', '?', '?', '?', '?', '?', '?', 
+    '?', '?', '?', '?', '?', '?', '?', '?', 
+  };
+
+  ptrdiff_t left;
+  unsigned int misalign, mask;
+  const m128i *p;
+  m128i data, t;
+
+  left = end - s;
+  if (left <= 0)
+    goto no_match;
+
+  /* Align the pointer and mask out the bytes before the start.  */
+  misalign = (uintptr_t)s & 15;
+  p = (const m128i *)((uintptr_t)s & -16);
+  data = *p;
+  if (misalign)
+    {
+      data |= mask_align[misalign];
+      left += misalign;
+    }
+
+  while (1)
+    {
+      t  = __builtin_ia32_pcmpeqb128(data, repl_nl);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_cr);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_bs);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_qm);
+      mask = __builtin_ia32_pmovmskb128 (t);
+
+      if (__builtin_expect (mask, 0))
+	{
+	  /* If there were fewer than 16 bytes left in the buffer, then there
+	     will be comparisons beyond the end of the buffer.  Mask them.  */
+	  if (left < 16)
+	    {
+	      mask &= (2u << left) - 1;
+	      if (mask == 0)
+		goto no_match;
+	    }
+
+	  *out = (const uchar *)p + __builtin_ctz (mask);
+	  return true;
+	}
+
+      left -= 16;
+      if (left <= 0)
+	goto no_match;
+      data = *++p;
+    }
+
+ no_match:
+  /* No match within buffer.  */
+  *out = end;
+  return false;
+}
+
+/* Check if the CPU supports vectorized string instructions.  */
+
+#include "../gcc/config/i386/cpuid.h"
+
+static bool (*search_line_fast)(const uchar *, const uchar *, const uchar **)
+  = search_line_acc_char;
+
+void 
+init_vectorized_lexer (void)
+{
+  unsigned dummy, ecx, edx;
+
+  if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx))
+    {
+      if (ecx & bit_SSE4_2)
+	search_line_fast = search_line_sse42;
+      else if (edx & bit_SSE2)
+	search_line_fast = search_line_sse2;
+    }
+}
+
+#else
+
+/* We only have one accelerated alternative.  Use a direct call so that
+   we encourage inlining.  We must still provide the init_vectorized_lexer
+   entry point, even though it does nothing.  */
+
+#define search_line_fast  search_line_acc_char
+
+void 
+init_vectorized_lexer (void)
+{
+}
+
+#endif
+
 /* Returns with a logical line that contains no escaped newlines or
    trigraphs.  This is a time-critical inner loop.  */
 void
@@ -109,12 +502,34 @@ _cpp_clean_line (cpp_reader *pfile)
   buffer->cur_note = buffer->notes_used = 0;
   buffer->cur = buffer->line_base = buffer->next_line;
   buffer->need_line = false;
-  s = buffer->next_line - 1;
+  s = buffer->next_line;
 
   if (!buffer->from_stage3)
     {
       const uchar *pbackslash = NULL;
 
+    found_bs:
+      /* Perform an optimized search for \n, \r, \\, ?.  */
+      if (search_line_fast (s, buffer->rlimit, &s))
+	{
+	  c = *s;
+	      
+	  /* Special case for backslash which is reasonably common.
+	     Continue searching using the fast path.  */
+	  if (c == '\\') 
+	    {
+	      pbackslash = s;
+	      s++;
+	      goto found_bs;
+	    }
+	  if (__builtin_expect (c == '?', false))
+	    goto found_qm;
+	  else
+	    goto found_nl_cr;
+	}
+
+      s--;
+
       /* Short circuit for the common case of an un-escaped line with
 	 no trigraphs.  The primary win here is by not writing any
 	 data back to memory until we have to.  */
@@ -124,6 +539,7 @@ _cpp_clean_line (cpp_reader *pfile)
 	  if (__builtin_expect (c == '\n', false)
 	      || __builtin_expect (c == '\r', false))
 	    {
+	    found_nl_cr:
 	      d = (uchar *) s;
 
 	      if (__builtin_expect (s == buffer->rlimit, false))
@@ -157,26 +573,28 @@ _cpp_clean_line (cpp_reader *pfile)
 	    }
 	  if (__builtin_expect (c == '\\', false))
 	    pbackslash = s;
-	  else if (__builtin_expect (c == '?', false)
-		   && __builtin_expect (s[1] == '?', false)
-		   && _cpp_trigraph_map[s[2]])
+	  else if (__builtin_expect (c == '?', false))
 	    {
-	      /* Have a trigraph.  We may or may not have to convert
-		 it.  Add a line note regardless, for -Wtrigraphs.  */
-	      add_line_note (buffer, s, s[2]);
-	      if (CPP_OPTION (pfile, trigraphs))
+	    found_qm:
+	      if (__builtin_expect (s[1] == '?', false)
+		   && _cpp_trigraph_map[s[2]])
 		{
-		  /* We do, and that means we have to switch to the
-		     slow path.  */
-		  d = (uchar *) s;
-		  *d = _cpp_trigraph_map[s[2]];
-		  s += 2;
-		  break;
+		  /* Have a trigraph.  We may or may not have to convert
+		     it.  Add a line note regardless, for -Wtrigraphs.  */
+		  add_line_note (buffer, s, s[2]);
+		  if (CPP_OPTION (pfile, trigraphs))
+		    {
+		      /* We do, and that means we have to switch to the
+		         slow path.  */
+		      d = (uchar *) s;
+		      *d = _cpp_trigraph_map[s[2]];
+		      s += 2;
+		      break;
+		    }
 		}
 	    }
 	}
 
-
       for (;;)
 	{
 	  c = *++s;
@@ -184,7 +602,7 @@ _cpp_clean_line (cpp_reader *pfile)
 
 	  if (c == '\n' || c == '\r')
 	    {
-		  /* Handle DOS line endings.  */
+	      /* Handle DOS line endings.  */
 	      if (c == '\r' && s != buffer->rlimit && s[1] == '\n')
 		s++;
 	      if (s == buffer->rlimit)
@@ -215,9 +633,8 @@ _cpp_clean_line (cpp_reader *pfile)
     }
   else
     {
-      do
+      while (*s != '\n' && *s != '\r')
 	s++;
-      while (*s != '\n' && *s != '\r');
       d = (uchar *) s;
 
       /* Handle DOS line endings.  */
diff --git a/libcpp/system.h b/libcpp/system.h
index 2472799..1a74734 100644
--- a/libcpp/system.h
+++ b/libcpp/system.h
@@ -29,6 +29,9 @@ along with GCC; see the file COPYING3.  If not see
 #ifdef HAVE_STDDEF_H
 # include <stddef.h>
 #endif
+#ifdef HAVE_STDINT_H
+# include <stdint.h>
+#endif
 
 #include <stdio.h>
 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Vectorized _cpp_clean_line
  2010-08-12 23:10                                       ` Richard Henderson
@ 2010-08-12 23:13                                         ` Richard Henderson
  2010-08-13  8:33                                         ` Andi Kleen
  1 sibling, 0 replies; 129+ messages in thread
From: Richard Henderson @ 2010-08-12 23:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gcc-patches

On 08/12/2010 04:07 PM, Richard Henderson wrote:
> I've done nothing with the indirect branch yet, but here's
> a version that bootstraps and checks ok for x86_64 (w/sse4.2).
> It also bootstraps on i686 (w/sse2 but not sse4.2); I've just
> started the test run for that though.

Gah.  Hit send too soon.  Oprofile samples go from 

Before:

 16.31    518.00   518.00        7    74.00    74.00  _cpp_lex_direct
 12.25    907.00   389.00                             ht_lookup_with_hash
  9.83   1219.00   312.00        6    52.00    52.00  _cpp_clean_line

After:

 17.27    402.00   402.00        2   201.00   201.00  _cpp_lex_direct
 12.67    697.00   295.00                             ht_lookup_with_hash
 12.16    980.00   283.00        1   283.00   775.00  cpp_get_token
  9.41   1199.00   219.00                             cpp_output_token
  6.14   1342.00   143.00                             lex_identifier
  5.84   1478.00   136.00                             preprocess_file
  5.54   1607.00   129.00                             enter_macro_context
  3.39   1686.00    79.00        2    39.50    39.50  _cpp_lex_token
  2.88   1753.00    67.00        2    33.50    33.50  _cpp_clean_line
  2.28   1806.00    53.00                             search_line_sse42



r~

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Vectorized _cpp_clean_line
  2010-08-12 22:32                                     ` Richard Henderson
  2010-08-12 23:10                                       ` Richard Henderson
@ 2010-08-13  7:26                                       ` Andi Kleen
  2010-08-14 17:14                                         ` [CFT, v4] " Richard Henderson
  1 sibling, 1 reply; 129+ messages in thread
From: Andi Kleen @ 2010-08-13  7:26 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Andi Kleen, gcc-patches

On Thu, Aug 12, 2010 at 03:23:04PM -0700, Richard Henderson wrote:
> On 08/12/2010 03:07 PM, Andi Kleen wrote:
> > At least for sse 4.2 I'm not sure the table lookup
> > for alignment is worth it. The unaligned loads are quite
> > cheap on current micro architectures with sse 4.2 
> > and the page end test is also not that expensive.
> 
> Perhaps.  That's something else that will want testing, as
> it's all of a dozen instructions.
> 
> At minimum the page end test should not be performed inside
> the loop.  We can adjust END before beginning the loop so
> that we never cross a page.

The test runs in parallel with the match on an OOO CPU.  It would
only be a problem if you were decoder limited.

Moving it out would require special-case tail code.  glibc used a lot
of switches for that in its code; I didn't like this.

The best would probably be to ensure there is always a tail pad
in the caller, but that is presumably difficult if you mmap()
the input file.
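
For a read()-based loader the pad is trivial; something like this
(BUF, SIZE and COUNT being whatever the loader already uses):

  /* Over-allocate so the scanner may read one full vector past the
     logical end of the buffer without faulting.  */
  buf = XNEWVEC (uchar, size + 16);
  count = read (fd, buf, size);
  memset (buf + count, '\n', 16);	/* benign sentinel bytes */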

> > I originally avoided the indirect call because I was worried
> > about the effect on CPUs with indirect branch predictor.
> 
> WithOUT the indirect branch predictor, you mean?  Which ones

Yes without.

> don't have that?  Surely we have to be going back pretty far...

Nope.  They're a relatively recent invention: a lot of x86 CPUs still
being used don't have them.

> 
> Since the call is the same destination every time, that matches
> up well with the indirect branch predictor, AFAIK.  If we're
> worried about the indirect branch predictor, we could write

Yes, if you have an indirect branch predictor you're fine, assuming
the rest of the compiler didn't thrash the buffers.

Or maybe profile feedback will fix it and do the necessary inlining
(but you have to fix PR45227 first :-)  Also, when I tested this last
time it didn't seem to work very well.

And it would only help if you run it on the same type of system as the end 
host.

Or maybe it's in the wash because it's only once per line.

> 
> static inline bool
> search_line_fast (s, end, out)
> {
>   if (fast_impl == 0)
>     return search_line_sse42 (s, end, out);
>   else if (fast_impl == 1)
>     return search_line_sse2 (s, end, out);
>   else
>     return search_line_acc_char (s, end, out);
> }
> 
> where FAST_IMPL is set up appropriately by init_vectorized_lexer.
> 
> The question being, are three predicted jumps faster than one
> indirect jump on a processor without the proper predictor?

Yes usually, especially if you don't have to go through all three
on average.

-Andi

P.S.: I wonder if there's more to be gotten from larger changes
in cpplib.  The clang preprocessor doesn't use vectorization and it
still seems to be faster?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Vectorized _cpp_clean_line
  2010-08-12 23:10                                       ` Richard Henderson
  2010-08-12 23:13                                         ` Richard Henderson
@ 2010-08-13  8:33                                         ` Andi Kleen
  1 sibling, 0 replies; 129+ messages in thread
From: Andi Kleen @ 2010-08-13  8:33 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Andi Kleen, gcc-patches

On Thu, Aug 12, 2010 at 04:07:51PM -0700, Richard Henderson wrote:
> On 08/12/2010 03:23 PM, Richard Henderson wrote:
> > On 08/12/2010 03:07 PM, Andi Kleen wrote:
> >> At least for sse 4.2 I'm not sure the table lookup
> >> for alignment is worth it. The unaligned loads are quite
> >> cheap on current micro architectures with sse 4.2 
> >> and the page end test is also not that expensive.
> > 
> > Perhaps.  That's something else that will want testing, as
> > it's all of a dozen instructions.
> 
> Alternately, we can check for page end only at the end
> of the buffer.  Because obviously all the middle bits
> of the buffer are present.

Looks good.

-Andi

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [CFT, v4] Vectorized _cpp_clean_line
  2010-08-13  7:26                                       ` Andi Kleen
@ 2010-08-14 17:14                                         ` Richard Henderson
  2010-08-17 16:59                                           ` Steve Ellcey
                                                             ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Richard Henderson @ 2010-08-14 17:14 UTC (permalink / raw)
  To: gcc-patches; +Cc: Andi Kleen, sje, edelsohn

[-- Attachment #1: Type: text/plain, Size: 1132 bytes --]

On 08/13/2010 12:28 AM, Andi Kleen wrote:
>> static inline bool
>> search_line_fast (s, end, out)
>> {
>>   if (fast_impl == 0)
>>     return search_line_sse42 (s, end, out);
>>   else if (fast_impl == 1)
>>     return search_line_sse2 (s, end, out);
>>   else
>>     return search_line_acc_char (s, end, out);
>> }
>>
>> where FAST_IMPL is set up appropriately by init_vectorized_lexer.
>>
>> The question being, are three predicted jumps faster than one
>> indirect jump on a processor without the proper predictor?
> 
> Yes usually, especially if you don't have to go through all three
> on average.

This is the version I plan to commit Monday or Tuesday, 
barring further feedback.  It uses direct branches as above,
with tweaks to allow inlining when possible (e.g. on 64-bit,
which always has SSE2).
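
I.e. the dispatch is shaped roughly like this (sketch; FAST_IMPL as
in the snippet quoted above):

static inline bool
search_line_fast (const uchar *s, const uchar *end, const uchar **out)
{
#ifdef __x86_64__
  /* Every 64-bit x86 processor has SSE2, so only the SSE4.2 upgrade
     needs a runtime test; the SSE2 path can be called directly.  */
  if (fast_impl == 0)
    return search_line_sse42 (s, end, out);
  return search_line_sse2 (s, end, out);
#else
  if (fast_impl == 0)
    return search_line_sse42 (s, end, out);
  else if (fast_impl == 1)
    return search_line_sse2 (s, end, out);
  else
    return search_line_acc_char (s, end, out);
#endif
}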

I've also bootstrapped on powerpc64-linux and ia64-linux.
Those test machines are loaded, so testing is proceeding
rather slowly.  I'd appreciate it if dje and sje could
give it a go on aix and ia64-hpux to check that (1) it works
on big-endian, ILP32 hpux, and (2) if at all possible,
report some performance results.


r~

[-- Attachment #2: searchline-4 --]
[-- Type: text/plain, Size: 30430 bytes --]

diff --git a/libcpp/config.in b/libcpp/config.in
index 9969934..95606c1 100644
--- a/libcpp/config.in
+++ b/libcpp/config.in
@@ -1,5 +1,8 @@
 /* config.in.  Generated from configure.ac by autoheader.  */
 
+/* Define if building universal (internal helper macro) */
+#undef AC_APPLE_UNIVERSAL_BUILD
+
 /* Define to one of `_getb67', `GETB67', `getb67' for Cray-2 and Cray-YMP
    systems. This function is required for `alloca.c' support on those systems.
    */
@@ -209,6 +212,9 @@
 /* Define if <sys/types.h> defines \`uchar'. */
 #undef HAVE_UCHAR
 
+/* Define to 1 if the system has the type `uintptr_t'. */
+#undef HAVE_UINTPTR_T
+
 /* Define to 1 if you have the <unistd.h> header file. */
 #undef HAVE_UNISTD_H
 
@@ -266,6 +272,18 @@
 /* Define to 1 if your <sys/time.h> declares `struct tm'. */
 #undef TM_IN_SYS_TIME
 
+/* Define WORDS_BIGENDIAN to 1 if your processor stores words with the most
+   significant byte first (like Motorola and SPARC, unlike Intel). */
+#if defined AC_APPLE_UNIVERSAL_BUILD
+# if defined __BIG_ENDIAN__
+#  define WORDS_BIGENDIAN 1
+# endif
+#else
+# ifndef WORDS_BIGENDIAN
+#  undef WORDS_BIGENDIAN
+# endif
+#endif
+
 /* Define to empty if `const' does not conform to ANSI C. */
 #undef const
 
@@ -278,8 +296,15 @@
 /* Define to `long int' if <sys/types.h> does not define. */
 #undef off_t
 
+/* Define to `int' if <sys/types.h> does not define. */
+#undef ptrdiff_t
+
 /* Define to `unsigned int' if <sys/types.h> does not define. */
 #undef size_t
 
 /* Define to `int' if <sys/types.h> does not define. */
 #undef ssize_t
+
+/* Define to the type of an unsigned integer type wide enough to hold a
+   pointer, if such a type exists, and if the system does not define it. */
+#undef uintptr_t
diff --git a/libcpp/configure b/libcpp/configure
index a4700e6..a2ce1c3 100755
--- a/libcpp/configure
+++ b/libcpp/configure
@@ -1846,6 +1846,48 @@ fi
 
 } # ac_fn_cxx_check_header_mongrel
 
+# ac_fn_cxx_try_run LINENO
+# ------------------------
+# Try to link conftest.$ac_ext, and return whether this succeeded. Assumes
+# that executables *can* be run.
+ac_fn_cxx_try_run ()
+{
+  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
+  if { { ac_try="$ac_link"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\""
+$as_echo "$ac_try_echo"; } >&5
+  (eval "$ac_link") 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; } && { ac_try='./conftest$ac_exeext'
+  { { case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\""
+$as_echo "$ac_try_echo"; } >&5
+  (eval "$ac_try") 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then :
+  ac_retval=0
+else
+  $as_echo "$as_me: program exited with status $ac_status" >&5
+       $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+       ac_retval=$ac_status
+fi
+  rm -rf conftest.dSYM conftest_ipa8_conftest.oo
+  eval $as_lineno_stack; test "x$as_lineno_stack" = x && { as_lineno=; unset as_lineno;}
+  return $ac_retval
+
+} # ac_fn_cxx_try_run
+
 # ac_fn_cxx_try_link LINENO
 # -------------------------
 # Try to link conftest.$ac_ext, and return whether this succeeded.
@@ -1946,48 +1988,6 @@ $as_echo "$ac_res" >&6; }
 
 } # ac_fn_cxx_check_type
 
-# ac_fn_cxx_try_run LINENO
-# ------------------------
-# Try to link conftest.$ac_ext, and return whether this succeeded. Assumes
-# that executables *can* be run.
-ac_fn_cxx_try_run ()
-{
-  as_lineno=${as_lineno-"$1"} as_lineno_stack=as_lineno_stack=$as_lineno_stack
-  if { { ac_try="$ac_link"
-case "(($ac_try" in
-  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
-  *) ac_try_echo=$ac_try;;
-esac
-eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\""
-$as_echo "$ac_try_echo"; } >&5
-  (eval "$ac_link") 2>&5
-  ac_status=$?
-  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
-  test $ac_status = 0; } && { ac_try='./conftest$ac_exeext'
-  { { case "(($ac_try" in
-  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
-  *) ac_try_echo=$ac_try;;
-esac
-eval ac_try_echo="\"\$as_me:${as_lineno-$LINENO}: $ac_try_echo\""
-$as_echo "$ac_try_echo"; } >&5
-  (eval "$ac_try") 2>&5
-  ac_status=$?
-  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
-  test $ac_status = 0; }; }; then :
-  ac_retval=0
-else
-  $as_echo "$as_me: program exited with status $ac_status" >&5
-       $as_echo "$as_me: failed program was:" >&5
-sed 's/^/| /' conftest.$ac_ext >&5
-
-       ac_retval=$ac_status
-fi
-  rm -rf conftest.dSYM conftest_ipa8_conftest.oo
-  eval $as_lineno_stack; test "x$as_lineno_stack" = x && { as_lineno=; unset as_lineno;}
-  return $ac_retval
-
-} # ac_fn_cxx_try_run
-
 # ac_fn_cxx_compute_int LINENO EXPR VAR INCLUDES
 # ----------------------------------------------
 # Tries to find the compile-time value of EXPR in a program that includes
@@ -5172,6 +5172,230 @@ done
 fi
 
 # Checks for typedefs, structures, and compiler characteristics.
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether byte ordering is bigendian" >&5
+$as_echo_n "checking whether byte ordering is bigendian... " >&6; }
+if test "${ac_cv_c_bigendian+set}" = set; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_cv_c_bigendian=unknown
+    # See if we're dealing with a universal compiler.
+    cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#ifndef __APPLE_CC__
+	       not a universal capable compiler
+	     #endif
+	     typedef int dummy;
+
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+
+	# Check for potential -arch flags.  It is not universal unless
+	# there are at least two -arch flags with different values.
+	ac_arch=
+	ac_prev=
+	for ac_word in $CC $CFLAGS $CPPFLAGS $LDFLAGS; do
+	 if test -n "$ac_prev"; then
+	   case $ac_word in
+	     i?86 | x86_64 | ppc | ppc64)
+	       if test -z "$ac_arch" || test "$ac_arch" = "$ac_word"; then
+		 ac_arch=$ac_word
+	       else
+		 ac_cv_c_bigendian=universal
+		 break
+	       fi
+	       ;;
+	   esac
+	   ac_prev=
+	 elif test "x$ac_word" = "x-arch"; then
+	   ac_prev=arch
+	 fi
+       done
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+    if test $ac_cv_c_bigendian = unknown; then
+      # See if sys/param.h defines the BYTE_ORDER macro.
+      cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <sys/types.h>
+	     #include <sys/param.h>
+
+int
+main ()
+{
+#if ! (defined BYTE_ORDER && defined BIG_ENDIAN \
+		     && defined LITTLE_ENDIAN && BYTE_ORDER && BIG_ENDIAN \
+		     && LITTLE_ENDIAN)
+	      bogus endian macros
+	     #endif
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+  # It does; now see whether it defined to BIG_ENDIAN or not.
+	 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <sys/types.h>
+		#include <sys/param.h>
+
+int
+main ()
+{
+#if BYTE_ORDER != BIG_ENDIAN
+		 not big endian
+		#endif
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+  ac_cv_c_bigendian=yes
+else
+  ac_cv_c_bigendian=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+    fi
+    if test $ac_cv_c_bigendian = unknown; then
+      # See if <limits.h> defines _LITTLE_ENDIAN or _BIG_ENDIAN (e.g., Solaris).
+      cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <limits.h>
+
+int
+main ()
+{
+#if ! (defined _LITTLE_ENDIAN || defined _BIG_ENDIAN)
+	      bogus endian macros
+	     #endif
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+  # It does; now see whether it defined to _BIG_ENDIAN or not.
+	 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <limits.h>
+
+int
+main ()
+{
+#ifndef _BIG_ENDIAN
+		 not big endian
+		#endif
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+  ac_cv_c_bigendian=yes
+else
+  ac_cv_c_bigendian=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+    fi
+    if test $ac_cv_c_bigendian = unknown; then
+      # Compile a test program.
+      if test "$cross_compiling" = yes; then :
+  # Try to guess by grepping values from an object file.
+	 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+short int ascii_mm[] =
+		  { 0x4249, 0x4765, 0x6E44, 0x6961, 0x6E53, 0x7953, 0 };
+		short int ascii_ii[] =
+		  { 0x694C, 0x5454, 0x656C, 0x6E45, 0x6944, 0x6E61, 0 };
+		int use_ascii (int i) {
+		  return ascii_mm[i] + ascii_ii[i];
+		}
+		short int ebcdic_ii[] =
+		  { 0x89D3, 0xE3E3, 0x8593, 0x95C5, 0x89C4, 0x9581, 0 };
+		short int ebcdic_mm[] =
+		  { 0xC2C9, 0xC785, 0x95C4, 0x8981, 0x95E2, 0xA8E2, 0 };
+		int use_ebcdic (int i) {
+		  return ebcdic_mm[i] + ebcdic_ii[i];
+		}
+		extern int foo;
+
+int
+main ()
+{
+return use_ascii (foo) == use_ebcdic (foo);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+  if grep BIGenDianSyS conftest.$ac_objext >/dev/null; then
+	      ac_cv_c_bigendian=yes
+	    fi
+	    if grep LiTTleEnDian conftest.$ac_objext >/dev/null ; then
+	      if test "$ac_cv_c_bigendian" = unknown; then
+		ac_cv_c_bigendian=no
+	      else
+		# finding both strings is unlikely to happen, but who knows?
+		ac_cv_c_bigendian=unknown
+	      fi
+	    fi
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+$ac_includes_default
+int
+main ()
+{
+
+	     /* Are we little or big endian?  From Harbison&Steele.  */
+	     union
+	     {
+	       long int l;
+	       char c[sizeof (long int)];
+	     } u;
+	     u.l = 1;
+	     return u.c[sizeof (long int) - 1] == 1;
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_run "$LINENO"; then :
+  ac_cv_c_bigendian=no
+else
+  ac_cv_c_bigendian=yes
+fi
+rm -f core *.core core.conftest.* gmon.out bb.out conftest$ac_exeext \
+  conftest.$ac_objext conftest.beam conftest.$ac_ext
+fi
+
+    fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_c_bigendian" >&5
+$as_echo "$ac_cv_c_bigendian" >&6; }
+ case $ac_cv_c_bigendian in #(
+   yes)
+     $as_echo "#define WORDS_BIGENDIAN 1" >>confdefs.h
+;; #(
+   no)
+      ;; #(
+   universal)
+
+$as_echo "#define AC_APPLE_UNIVERSAL_BUILD 1" >>confdefs.h
+
+     ;; #(
+   *)
+     as_fn_error "unknown endianness
+ presetting ac_cv_c_bigendian=no (or yes) will help" "$LINENO" 5 ;;
+ esac
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for an ANSI C-conforming const" >&5
 $as_echo_n "checking for an ANSI C-conforming const... " >&6; }
 if test "${ac_cv_c_const+set}" = set; then :
@@ -5371,6 +5595,53 @@ _ACEOF
 
 fi
 
+
+  ac_fn_cxx_check_type "$LINENO" "uintptr_t" "ac_cv_type_uintptr_t" "$ac_includes_default"
+if test "x$ac_cv_type_uintptr_t" = x""yes; then :
+
+$as_echo "#define HAVE_UINTPTR_T 1" >>confdefs.h
+
+else
+  for ac_type in 'unsigned int' 'unsigned long int' \
+	'unsigned long long int'; do
+       cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+$ac_includes_default
+int
+main ()
+{
+static int test_array [1 - 2 * !(sizeof (void *) <= sizeof ($ac_type))];
+test_array [0] = 0
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+
+cat >>confdefs.h <<_ACEOF
+#define uintptr_t $ac_type
+_ACEOF
+
+	  ac_type=
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+       test -z "$ac_type" && break
+     done
+fi
+
+
+ac_fn_cxx_check_type "$LINENO" "ptrdiff_t" "ac_cv_type_ptrdiff_t" "$ac_includes_default"
+if test "x$ac_cv_type_ptrdiff_t" = x""yes; then :
+
+else
+
+cat >>confdefs.h <<_ACEOF
+#define ptrdiff_t int
+_ACEOF
+
+fi
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether struct tm is in sys/time.h or time.h" >&5
 $as_echo_n "checking whether struct tm is in sys/time.h or time.h... " >&6; }
 if test "${ac_cv_struct_tm+set}" = set; then :
@@ -7042,6 +7313,7 @@ LTLIBOBJS=$ac_ltlibobjs
 
 
 
+
 : ${CONFIG_STATUS=./config.status}
 ac_write_fail=0
 ac_clean_files_save=$ac_clean_files
diff --git a/libcpp/configure.ac b/libcpp/configure.ac
index ceea29c..1250f49 100644
--- a/libcpp/configure.ac
+++ b/libcpp/configure.ac
@@ -70,12 +70,15 @@ else
 fi
 
 # Checks for typedefs, structures, and compiler characteristics.
+AC_C_BIGENDIAN
 AC_C_CONST
 AC_C_INLINE
 AC_FUNC_OBSTACK
 AC_TYPE_OFF_T
 AC_TYPE_SIZE_T
-AC_CHECK_TYPE(ssize_t, int)
+AC_TYPE_SSIZE_T
+AC_TYPE_UINTPTR_T
+AC_CHECK_TYPE(ptrdiff_t, int)
 AC_STRUCT_TM
 AC_CHECK_SIZEOF(int)
 AC_CHECK_SIZEOF(long)
diff --git a/libcpp/init.c b/libcpp/init.c
index c5b8c28..769aa50 100644
--- a/libcpp/init.c
+++ b/libcpp/init.c
@@ -137,6 +137,8 @@ init_library (void)
 #ifdef ENABLE_NLS
        (void) bindtextdomain (PACKAGE, LOCALEDIR);
 #endif
+
+       init_vectorized_lexer ();
     }
 }
 
diff --git a/libcpp/internal.h b/libcpp/internal.h
index 9209b55..10ed033 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -725,6 +725,8 @@ ufputs (const unsigned char *s, FILE *f)
   return fputs ((const char *)s, f);
 }
 
+extern void init_vectorized_lexer (void);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/libcpp/lex.c b/libcpp/lex.c
index f628272..82572c6 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -96,6 +96,442 @@ add_line_note (cpp_buffer *buffer, const uchar *pos, unsigned int type)
   buffer->notes_used++;
 }
 
+\f
+/* Fast path to find line special characters using optimized character
+   scanning algorithms.  Anything complicated falls back to the slow
+   path below.  Since this loop is very hot it's worth doing these kinds
+   of optimizations.
+
+   One of the paths through the ifdefs should provide 
+
+     bool search_line_fast (const uchar *s, const uchar *end,
+                            const uchar **out)
+
+   Between S and END, search for \n, \r, \\, ?.  Return true if found.
+   Always update *OUT to the last character scanned, even if not found.  */
+
+/* Configure gives us an ifdef test.  */
+#ifndef WORDS_BIGENDIAN
+#define WORDS_BIGENDIAN 0
+#endif
+
+/* We'd like the largest integer that fits into a register.  There's nothing
+   in <stdint.h> that gives us that.  For most hosts this is unsigned long,
+   but MS decided on an LLP64 model.  Thankfully when building with GCC we
+   can get the "real" word size.  */
+#ifdef __GNUC__
+typedef unsigned int word_type __attribute__((__mode__(__word__)));
+#else
+typedef unsigned long word_type;
+#endif
+
+/* The code below is only expecting sizes 4 or 8.
+   Die at compile-time if this expectation is violated.  */
+typedef char check_word_type_size
+  [(sizeof(word_type) == 8 || sizeof(word_type) == 4) * 2 - 1];
+
+/* Return VAL with the first N bytes forced to values that won't match one
+   of the interesting characters.  Note that NUL is not interesting.  */
+
+static inline word_type
+acc_char_mask_misalign (word_type val, unsigned int n)
+{
+  word_type mask = -1;
+  if (WORDS_BIGENDIAN)
+    mask >>= n * 8;
+  else
+    mask <<= n * 8;
+  return val & mask;
+}
+
+/* Return X replicated to all byte positions within WORD_TYPE.  */
+
+static inline word_type
+acc_char_replicate (uchar x)
+{
+  word_type ret;
+
+  ret = (x << 24) | (x << 16) | (x << 8) | x;
+  if (sizeof(word_type) == 8)
+    ret = (ret << 16 << 16) | ret;
+  return ret;
+}
+
+/* Return non-zero if some byte of VAL is (probably) C.  */
+
+static inline word_type
+acc_char_cmp (word_type val, word_type c)
+{
+#if defined(__GNUC__) && defined(__alpha__)
+  /* We can get exact results using a compare-bytes instruction.  Since
+     comparison is always >=, we could either do ((v >= c) & (c >= v))
+     for 3 operations or force matching bytes to zero and compare vs
+     zero with (0 >= (val ^ c)) for 2 operations.  */
+  return __builtin_alpha_cmpbge (0, val ^ c);
+#elif defined(__GNUC__) && defined(__ia64__)
+  /* ??? Ideally we'd have some sort of builtin for this, so that we
+     can pack the 4 instances into fewer bundles.  */
+  word_type ret;
+  __asm__("pcmp1.eq %0 = %1, %2" : "=r"(ret) : "r"(val), "r"(c));
+  return ret;
+#else
+  word_type magic = 0x7efefefeU;
+  if (sizeof(word_type) == 8)
+    magic = (magic << 16 << 16) | 0xfefefefeU;
+  magic |= 1;
+
+  val ^= c;
+  return ((val + magic) ^ ~val) & ~magic;
+#endif
+}
+
+/* Given the result of acc_char_cmp is non-zero, return the index of
+   the found character.  If this was a false positive, return -1.  */
+
+static inline int
+acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
+		word_type val ATTRIBUTE_UNUSED)
+{
+#if defined(__GNUC__) && defined(__alpha__)
+  /* The cmpbge instruction sets *bits* of the result corresponding to
+     matches in the bytes with no false positives.  This means that ctz
+     produces a correct unscaled result.  If big-endian (aka Cray), the
+     clz will include the 7 bytes before the least significant.  */
+  if (WORDS_BIGENDIAN)
+    return __builtin_clzl (cmp) - 56;
+  else
+    return __builtin_ctzl (cmp);
+#elif defined(__GNUC__) && defined(__ia64__)
+  /* The pcmp1 instruction sets matching bytes to 0xff with no false
+     positives.  This means that clz/ctz produces a correct scaled result.  */
+  unsigned int i;
+  i = (WORDS_BIGENDIAN ? __builtin_clzl (cmp) : __builtin_ctzl (cmp));
+  return i / 8;
+#else
+  unsigned int i;
+
+  /* ??? It would be nice to force unrolling here,
+     and have all of these constants folded.  */
+  for (i = 0; i < sizeof(word_type); ++i)
+    {
+      uchar c;
+      if (WORDS_BIGENDIAN)
+	c = (val >> (sizeof(word_type) - i - 1) * 8) & 0xff;
+      else
+	c = (val >> i * 8) & 0xff;
+
+      if (c == '\n' || c == '\r' || c == '\\' || c == '?')
+	return i;
+    }
+
+  return -1;
+#endif
+}
+
+/* A version of the fast scanner using bit fiddling techniques.
+ 
+   For 32-bit words, one would normally perform 16 comparisons and
+   16 branches.  With this algorithm one performs 24 arithmetic
+   operations and one branch.  Whether this is faster with a 32-bit
+   word size is going to be somewhat system dependent.
+
+   For 64-bit words, we eliminate twice the number of comparisons
+   and branches without increasing the number of arithmetic operations.
+   It's almost certainly going to be a win with 64-bit word size.  */
+
+static bool
+search_line_acc_char (const uchar *s, const uchar *end, const uchar **out)
+{
+  const word_type repl_nl = acc_char_replicate ('\n');
+  const word_type repl_cr = acc_char_replicate ('\r');
+  const word_type repl_bs = acc_char_replicate ('\\');
+  const word_type repl_qm = acc_char_replicate ('?');
+
+  unsigned int misalign;
+  ptrdiff_t left;
+  const word_type *p;
+  word_type val;
+  
+  /* Don't bother optimizing very short lines; too much masking to do.  */
+  left = end - s;
+  if (left < (ptrdiff_t) sizeof(word_type))
+    {
+      *out = s;
+      return false;
+    }
+
+  /* Align the buffer.  Mask out any bytes from before the beginning.  */
+  p = (word_type *)((uintptr_t)s & -sizeof(word_type));
+  val = *p;
+  misalign = (uintptr_t)s & (sizeof(word_type) - 1);
+  if (misalign)
+    {
+      val = acc_char_mask_misalign (val, misalign);
+      left += misalign;
+    }
+
+  /* Main loop.  */
+  while (1)
+    {
+      word_type m_nl, m_cr, m_bs, m_qm, t;
+
+      m_nl = acc_char_cmp (val, repl_nl);
+      m_cr = acc_char_cmp (val, repl_cr);
+      m_bs = acc_char_cmp (val, repl_bs);
+      m_qm = acc_char_cmp (val, repl_qm);
+      t = (m_nl | m_cr) | (m_bs | m_qm);
+
+      if (__builtin_expect (t != 0, 0))
+	{
+	  int i = acc_char_index (t, val);
+	  if (i >= 0)
+	    {
+	      *out = (const uchar *)p + i;
+	      return true;
+	    }
+	}
+
+      left -= sizeof(word_type);
+      ++p;
+      if (left < (ptrdiff_t) sizeof(word_type))
+	{
+	  /* Ran out of complete words.  */
+	  *out = (const uchar *)p;
+	  return false;
+	}
+      val = *p;
+    }
+}
+
+#if (GCC_VERSION >= 4005) && (defined(__i386__) || defined(__x86_64__))
+/* A version of the fast scanner using SSE 4.2 vectorized string insns.
+
+   We should be using the _mm intrinsics, but the xxxintr headers do things
+   not allowed in gcc.  So instead use direct builtins.  */
+
+static bool
+#ifndef __SSE4_2__
+__attribute__((__target__("sse4.2")))
+#endif
+search_line_sse42 (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef char m128i __attribute__ ((__vector_size__ (16)));
+  static const m128i search
+    = { '\n', '\r', '?', '\\', 0,0,0,0,0,0,0,0,0,0,0,0 };
+
+  ptrdiff_t left;
+  m128i data;
+  int index;
+
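+  /* The pcmpestri immediate 0 below selects unsigned bytes, "equal
+     any" aggregation and the least-significant match index; the
+     explicit lengths (4 valid chars in SEARCH, 16 or LEFT valid bytes
+     in DATA) bound the comparison, and the insn returns 16 when there
+     is no match.  */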
+  /* Main loop, processing 16 bytes at a time.  */
+  for (left = end - s; left >= 16; left -= 16, s += 16)
+    {
+      data = __builtin_ia32_loaddqu((const char *)s);
+      index = __builtin_ia32_pcmpestri128 (search, 4, data, 16, 0);
+      if (__builtin_expect (index < 16, 0)) 
+	{
+	  *out = (const uchar *)s + index;
+	  return true;
+	}
+    }
+
+  /* There are fewer than 16 bytes remaining.  If we can read those bytes
+     without reading from a possibly unmapped next page, then go ahead and
+     do so.  If we are near the end of the page then don't bother; returning
+     will scan the balance of the buffer byte-by-byte.  */
+  if (left > 0 && ((uintptr_t)s & 0xfff) <= 0xff0)
+    {
+      data = __builtin_ia32_loaddqu((const char *)s);
+      index = __builtin_ia32_pcmpestri128 (search, 4, data, left, 0);
+      if (__builtin_expect (index < 16, 0))
+	{
+	  *out = (const uchar *)s + index;
+	  return true;
+	}
+      s = end;
+    }
+
+  /* No match within buffer.  */
+  *out = s;
+  return false;
+}
+
+/* A version of the fast scanner using SSE2 vectorized byte compare insns.  */
+
+static bool
+#ifndef __SSE2__
+__attribute__((__target__("sse2")))
+#endif
+search_line_sse2 (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef char m128i __attribute__ ((__vector_size__ (16)));
+
+  static const m128i mask_align[16] = {
+    {  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0 },
+    { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0 }
+  };
+
+  static const m128i repl_nl = {
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', 
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n'
+  };
+  static const m128i repl_cr = {
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r', 
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r'
+  };
+  static const m128i repl_bs = {
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\', 
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\'
+  };
+  static const m128i repl_qm = {
+    '?', '?', '?', '?', '?', '?', '?', '?', 
+    '?', '?', '?', '?', '?', '?', '?', '?', 
+  };
+
+  ptrdiff_t left;
+  unsigned int misalign, mask;
+  const m128i *p;
+  m128i data, t;
+
+  left = end - s;
+  if (left <= 0)
+    goto no_match;
+
+  /* Align the pointer and mask out the bytes before the start.  */
+  misalign = (uintptr_t)s & 15;
+  p = (const m128i *)((uintptr_t)s & -16);
+  data = *p;
+  if (misalign)
+    {
+      data |= mask_align[misalign];
+      left += misalign;
+    }
+
+  while (1)
+    {
+      t  = __builtin_ia32_pcmpeqb128(data, repl_nl);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_cr);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_bs);
+      t |= __builtin_ia32_pcmpeqb128(data, repl_qm);
+      mask = __builtin_ia32_pmovmskb128 (t);
+
+      if (__builtin_expect (mask, 0))
+	{
+	  /* Match found.  If there were fewer than 16 bytes left in the
+	     buffer, then we may have found a match beyond the end of
+	     the buffer.  Mask them out.  */
+	  if (left < 16)
+	    {
+	      mask &= (2u << left) - 1;
+	      if (mask == 0)
+		goto no_match;
+	    }
+
+	  *out = (const uchar *)p + __builtin_ctz (mask);
+	  return true;
+	}
+
+      left -= 16;
+      if (left <= 0)
+	goto no_match;
+      data = *++p;
+    }
+
+ no_match:
+  /* No match within buffer.  */
+  *out = end;
+  return false;
+}
+
+/* Check the CPU capabilities.  */
+
+#include "../gcc/config/i386/cpuid.h"
+
+/* 0 = sse4.2, 1 = sse2, 2 = integer only.  */
+static int fast_impl;
+
+void 
+init_vectorized_lexer (void)
+{
+  unsigned dummy, ecx, edx;
+  int impl = 2;
+
+  if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx))
+    {
+      if (ecx & bit_SSE4_2)
+	impl = 0;
+      else if (edx & bit_SSE2)
+	impl = 1;
+    }
+
+  fast_impl = impl;
+}
+
+/* Dispatch to one of the vectorized implementations based on the 
+   capabilities of the cpu as detected above in init_vectorized_lexer
+   and as told to us by the compiler.
+
+   Not all x86 cpus have a branch predictor that can handle indirect
+   branches.  It is reportedly better to have two very well predicted
+   direct branches instead.
+
+   Further, this may allow the vectorized implementations to be
+   inlined into _cpp_clean_line.  */
+
+static inline bool
+search_line_fast (const uchar *s, const uchar *end, const uchar **out)
+{
+  int impl = fast_impl;
+
+  /* If the compiler tells us that a given ISA extension is available,
+     force the following branch to be taken.  This allows the compiler
+     to elide the rest as unreachable code while avoiding more ifdefs.  */
+#ifdef __SSE4_2__
+  impl = 0;
+#endif
+
+  if (impl == 0)
+    return search_line_sse42 (s, end, out);
+
+#ifdef __SSE2__
+  impl = 1;
+#endif
+
+  if (impl == 1)
+    return search_line_sse2 (s, end, out);
+  else
+    return search_line_acc_char (s, end, out);
+}
+
+#else
+
+/* We only have one accelerated alternative.  Use a direct call so that
+   we encourage inlining.  We must still provide the init_vectorized_lexer
+   entry point, even though it does nothing.  */
+
+#define search_line_fast  search_line_acc_char
+
+void 
+init_vectorized_lexer (void)
+{
+}
+
+#endif
+
 /* Returns with a logical line that contains no escaped newlines or
    trigraphs.  This is a time-critical inner loop.  */
 void
@@ -109,12 +545,34 @@ _cpp_clean_line (cpp_reader *pfile)
   buffer->cur_note = buffer->notes_used = 0;
   buffer->cur = buffer->line_base = buffer->next_line;
   buffer->need_line = false;
-  s = buffer->next_line - 1;
+  s = buffer->next_line;
 
   if (!buffer->from_stage3)
     {
       const uchar *pbackslash = NULL;
 
+    found_bs:
+      /* Perform an optimized search for \n, \r, \\, ?.  */
+      if (search_line_fast (s, buffer->rlimit, &s))
+	{
+	  c = *s;
+	      
+	  /* Special case for backslash which is reasonably common.
+	     Continue searching using the fast path.  */
+	  if (c == '\\') 
+	    {
+	      pbackslash = s;
+	      s++;
+	      goto found_bs;
+	    }
+	  if (__builtin_expect (c == '?', false))
+	    goto found_qm;
+	  else
+	    goto found_nl_cr;
+	}
+
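+      /* search_line_fast left S at the first byte it did not examine;
+	  step back one because the loops below pre-increment S.  */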
+      s--;
+
       /* Short circuit for the common case of an un-escaped line with
 	 no trigraphs.  The primary win here is by not writing any
 	 data back to memory until we have to.  */
@@ -124,6 +582,7 @@ _cpp_clean_line (cpp_reader *pfile)
 	  if (__builtin_expect (c == '\n', false)
 	      || __builtin_expect (c == '\r', false))
 	    {
+	    found_nl_cr:
 	      d = (uchar *) s;
 
 	      if (__builtin_expect (s == buffer->rlimit, false))
@@ -157,26 +616,28 @@ _cpp_clean_line (cpp_reader *pfile)
 	    }
 	  if (__builtin_expect (c == '\\', false))
 	    pbackslash = s;
-	  else if (__builtin_expect (c == '?', false)
-		   && __builtin_expect (s[1] == '?', false)
-		   && _cpp_trigraph_map[s[2]])
+	  else if (__builtin_expect (c == '?', false))
 	    {
-	      /* Have a trigraph.  We may or may not have to convert
-		 it.  Add a line note regardless, for -Wtrigraphs.  */
-	      add_line_note (buffer, s, s[2]);
-	      if (CPP_OPTION (pfile, trigraphs))
+	    found_qm:
+	      if (__builtin_expect (s[1] == '?', false)
+		   && _cpp_trigraph_map[s[2]])
 		{
-		  /* We do, and that means we have to switch to the
-		     slow path.  */
-		  d = (uchar *) s;
-		  *d = _cpp_trigraph_map[s[2]];
-		  s += 2;
-		  break;
+		  /* Have a trigraph.  We may or may not have to convert
+		     it.  Add a line note regardless, for -Wtrigraphs.  */
+		  add_line_note (buffer, s, s[2]);
+		  if (CPP_OPTION (pfile, trigraphs))
+		    {
+		      /* We do, and that means we have to switch to the
+		         slow path.  */
+		      d = (uchar *) s;
+		      *d = _cpp_trigraph_map[s[2]];
+		      s += 2;
+		      break;
+		    }
 		}
 	    }
 	}
 
-
       for (;;)
 	{
 	  c = *++s;
@@ -184,7 +645,7 @@ _cpp_clean_line (cpp_reader *pfile)
 
 	  if (c == '\n' || c == '\r')
 	    {
-		  /* Handle DOS line endings.  */
+	      /* Handle DOS line endings.  */
 	      if (c == '\r' && s != buffer->rlimit && s[1] == '\n')
 		s++;
 	      if (s == buffer->rlimit)
@@ -215,9 +676,8 @@ _cpp_clean_line (cpp_reader *pfile)
     }
   else
     {
-      do
+      while (*s != '\n' && *s != '\r')
 	s++;
-      while (*s != '\n' && *s != '\r');
       d = (uchar *) s;
 
       /* Handle DOS line endings.  */
diff --git a/libcpp/system.h b/libcpp/system.h
index 2472799..1a74734 100644
--- a/libcpp/system.h
+++ b/libcpp/system.h
@@ -29,6 +29,9 @@ along with GCC; see the file COPYING3.  If not see
 #ifdef HAVE_STDDEF_H
 # include <stddef.h>
 #endif
+#ifdef HAVE_STDINT_H
+# include <stdint.h>
+#endif
 
 #include <stdio.h>
 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-11  8:53                     ` Richard Guenther
@ 2010-08-16 20:42                       ` Mark Mitchell
  2010-08-16 20:45                         ` Bernd Schmidt
  0 siblings, 1 reply; 129+ messages in thread
From: Mark Mitchell @ 2010-08-16 20:42 UTC (permalink / raw)
  To: Richard Guenther
  Cc: David Daney, Andi Kleen, Steven Bosscher, Bernd Schmidt, GCC Patches

Richard Guenther wrote:

>> We would all like fast compilation, but if you specify -Os, it seems to me
>> that you are expressing a preference for smaller code.
> 
> Yes indeed.

I agree; it's OK for -Os to take more compilation time just as it's OK
for -O2 to take more compilation time.  It's just optimization on a
different axis.

Where are we with Bernd's patch to combine four instructions at this
point?  Does it need review, or are we still not sure it's a good idea?

Thanks,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-16 20:42                       ` Mark Mitchell
@ 2010-08-16 20:45                         ` Bernd Schmidt
  2010-08-16 21:03                           ` Mark Mitchell
  2010-08-18 20:50                           ` Eric Botcazou
  0 siblings, 2 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-16 20:45 UTC (permalink / raw)
  To: Mark Mitchell
  Cc: Richard Guenther, David Daney, Andi Kleen, Steven Bosscher, GCC Patches

On 08/16/2010 10:37 PM, Mark Mitchell wrote:
> Richard Guenther wrote:
> 
>>> We would all like fast compilation, but if you specify -Os, it seems to me
>>> that you are expressing a preference for smaller code.
>>
>> Yes indeed.
> 
> I agree; it's OK for -Os to take more compilation time just as it's OK
> for -O2 to take more compilation time.  It's just optimization on a
> different axis.
> 
> Where are we with Bernd's patch to combine four instructions at this
> point?  Does it need review, or are we still not sure it's a good idea?

I was going to use Richard's earlier approval and check in the revised
version some time this week after a bit more testing.  I experimented
with Michael's heuristic last week, without getting useful results, so
I'll use the one I previously posted.  My plan was to disable the
heuristic (i.e. try all possible four-insn combinations) only at -O3
since we don't have -Os3, but I could of course do the same for -Os as well.


Bernd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-16 20:45                         ` Bernd Schmidt
@ 2010-08-16 21:03                           ` Mark Mitchell
  2010-08-18 20:50                           ` Eric Botcazou
  1 sibling, 0 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-16 21:03 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Richard Guenther, David Daney, Andi Kleen, Steven Bosscher, GCC Patches

Bernd Schmidt wrote:

>> Where are we with Bernd's patch to combine four instructions at this
>> point?  Does it need review, or are we still not sure it's a good idea?
> 
> I was going to use Richard's earlier approval and check in the revised
> version some time this week after a bit more testing.

I think your plan makes sense, and I think it's worth enabling this at -Os.

Thanks,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: The speed of the compiler, was: Re: Combine four insns
  2010-08-10 20:09                                     ` Tom Tromey
  2010-08-10 20:23                                       ` Andi Kleen
  2010-08-12 21:09                                       ` Nathan Froyd
@ 2010-08-17 15:14                                       ` Mark Mitchell
  2 siblings, 0 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-17 15:14 UTC (permalink / raw)
  To: Tom Tromey; +Cc: Andi Kleen, Richard Guenther, gcc-patches

Tom Tromey wrote:

> http://gcc.gnu.org/ml/gcc-patches/2010-03/msg00526.html
> 
> I think any sort of hackery in this area is fine, if it speeds up the
> compiler.

I completely agree.  It's entirely acceptable to use machine-specific
code in a very hot loop if that's what it takes.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [CFT, v4] Vectorized _cpp_clean_line
  2010-08-14 17:14                                         ` [CFT, v4] " Richard Henderson
@ 2010-08-17 16:59                                           ` Steve Ellcey
  2010-08-17 17:21                                             ` Richard Henderson
  2010-08-17 17:32                                             ` Jakub Jelinek
  2010-08-18  3:23                                           ` Tom Tromey
       [not found]                                           ` <1281998097.3725.3.camel@gargoyle>
  2 siblings, 2 replies; 129+ messages in thread
From: Steve Ellcey @ 2010-08-17 16:59 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc-patches, Andi Kleen, edelsohn

On Sat, 2010-08-14 at 10:00 -0700, Richard Henderson wrote:

> I've also bootstrapped on powerpc64-linux and ia64-linux.
> Those test machines are loaded, so testing is proceeding
> rather slowly.  I'd appreciate it if dje and sje could
> give it a go on aix and ia64-hpux and see that (1) it works
> with the big-endian, ilp32 hpux, and (2) if at all possible
> report some performance results.
> 
> 
> r~

The patch did not work for me on IA64 HP-UX.  I am aborting at line 782
of libcpp/lex.c.

(gdb) p *note
$6 = {
  pos = 0x403db319 "nc.\n   Contributed by James E. Wilson
<wilson@cygnus.com>.\n\n   This file is part of GCC.\n\n   GCC is free
software; you can redistribute it and/or modify\n   it under the terms
of the GNU General Public"..., 
  type = 10}


The type value of 10 in the note seems to be invalid, but I don't know
how we got it or where it came from.

Steve Ellcey
sje@cup.hp.com

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [CFT, v4] Vectorized _cpp_clean_line
  2010-08-17 16:59                                           ` Steve Ellcey
@ 2010-08-17 17:21                                             ` Richard Henderson
  2010-08-17 20:32                                               ` Steve Ellcey
  2010-08-17 17:32                                             ` Jakub Jelinek
  1 sibling, 1 reply; 129+ messages in thread
From: Richard Henderson @ 2010-08-17 17:21 UTC (permalink / raw)
  To: sje; +Cc: gcc-patches, Andi Kleen, edelsohn

On 08/17/2010 09:30 AM, Steve Ellcey wrote:
> On Sat, 2010-08-14 at 10:00 -0700, Richard Henderson wrote:
> 
>> I've also bootstrapped on powerpc64-linux and ia64-linux.
>> Those test machines are loaded, so testing is proceeding
>> rather slowly.  I'd appreciate it if dje and sje could
>> give it a go on aix and ia64-hpux and see that (1) it works
>> with the big-endian, ilp32 hpux, and (2) if at all possible
>> report some performance results.
>>
>>
>> r~
> 
> The patch did not work for me on IA64 HP-UX.  I am aborting at line 782
> of libcpp/lex.c.

Hum.  Does it work if you remove the ia64 ifdefs?  We could ignore
using pcmp1.eq for the moment and figure that out later.


r~

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [CFT, v4] Vectorized _cpp_clean_line
  2010-08-17 16:59                                           ` Steve Ellcey
  2010-08-17 17:21                                             ` Richard Henderson
@ 2010-08-17 17:32                                             ` Jakub Jelinek
  1 sibling, 0 replies; 129+ messages in thread
From: Jakub Jelinek @ 2010-08-17 17:32 UTC (permalink / raw)
  To: Steve Ellcey; +Cc: Richard Henderson, gcc-patches, Andi Kleen, edelsohn

On Tue, Aug 17, 2010 at 09:30:36AM -0700, Steve Ellcey wrote:
> On Sat, 2010-08-14 at 10:00 -0700, Richard Henderson wrote:
> 
> > I've also bootstrapped on powerpc64-linux and ia64-linux.
> > Those test machines are loaded, so testing is proceeding
> > rather slowly.  I'd appreciate it if dje and sje could
> > give it a go on aix and ia64-hpux and see that (1) it works
> > with the big-endian, ilp32 hpux, and (2) if at all possible
> > report some performance results.
> > 
> > 
> > r~
> 
> The patch did not work for me on IA64 HP-UX.  I am aborting at line 782
> of libcpp/lex.c.
> 
> (gdb) p *note
> $6 = {
>   pos = 0x403db319 "nc.\n   Contributed by James E. Wilson
> <wilson@cygnus.com>.\n\n   This file is part of GCC.\n\n   GCC is free
> software; you can redistribute it and/or modify\n   it under the terms
> of the GNU General Public"..., 
>   type = 10}
> 
> 
> The type value of 10 in note seems to be invalid but I don't know how we
> got it or where it came from.

Just to answer where type value 10 comes from:

 done:
  *d = '\n';
  /* A sentinel note that should never be processed.  */
  add_line_note (buffer, d + 1, '\n');
  buffer->next_line = s + 1;

(The note's type is the character itself, and '\n' is ASCII 10, which
is why the sentinel note shows up as type = 10.)  As for why you are
seeing an abort with it, I don't have an answer...

	Jakub

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [CFT, v4] Vectorized _cpp_clean_line
  2010-08-17 17:21                                             ` Richard Henderson
@ 2010-08-17 20:32                                               ` Steve Ellcey
  2010-08-18 17:14                                                 ` Steve Ellcey
  0 siblings, 1 reply; 129+ messages in thread
From: Steve Ellcey @ 2010-08-17 20:32 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc-patches, Andi Kleen, edelsohn

On Tue, 2010-08-17 at 10:18 -0700, Richard Henderson wrote:
> On 08/17/2010 09:30 AM, Steve Ellcey wrote:
> > On Sat, 2010-08-14 at 10:00 -0700, Richard Henderson wrote:
> > 
> >> I've also bootstrapped on powerpc64-linux and ia64-linux.
> >> Those test machines are loaded, so testing is proceeding
> >> rather slowly.  I'd appreciate it if dje and sje could
> >> give it a go on aix and ia64-hpux and see that (1) it works
> >> with the big-endian, ilp32 hpux, and (2) if at all possible
> >> report some performance results.
> >>
> >>
> >> r~
> > 
> > The patch did not work for me on IA64 HP-UX.  I am aborting at line 782
> > of libcpp/lex.c.
> 
> Hum.  Does it work if you remove the ia64 ifdefs?  We could ignore
> using pcmp1.eq for the moment and figure that out later.
> 
> 
> r~

The bootstrap works on HP-UX if I remove those ifdefs.  I haven't done
a full test run yet, but I will do that overnight.

Steve Ellcey
sje@cup.hp.com

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [CFT, v4] Vectorized _cpp_clean_line
  2010-08-14 17:14                                         ` [CFT, v4] " Richard Henderson
  2010-08-17 16:59                                           ` Steve Ellcey
@ 2010-08-18  3:23                                           ` Tom Tromey
       [not found]                                           ` <1281998097.3725.3.camel@gargoyle>
  2 siblings, 0 replies; 129+ messages in thread
From: Tom Tromey @ 2010-08-18  3:23 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc-patches, Andi Kleen, sje, edelsohn

>>>>> "rth" == Richard Henderson <rth@redhat.com> writes:

rth> This is the version I plan to commit Monday or Tuesday, 
rth> barring further feedback.

Thank you for doing this.

Tom

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [CFT, v4] Vectorized _cpp_clean_line
  2010-08-17 20:32                                               ` Steve Ellcey
@ 2010-08-18 17:14                                                 ` Steve Ellcey
  0 siblings, 0 replies; 129+ messages in thread
From: Steve Ellcey @ 2010-08-18 17:14 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc-patches

On Tue, 2010-08-17 at 12:55 -0700, Steve Ellcey wrote:

> > 
> > Hum.  Does it work if you remove the ia64 ifdefs?  We could ignore
> > using pcmp1.eq for the moment and figure that out later.
> > 
> > 
> > r~
> 
> The bootstrap works on HP-UX if I remove those ifdef's.  I haven't done
> a full test run yet, but I will do that overnight.
> 
> Steve Ellcey
> sje@cup.hp.com

Following up to my own email, the nightly testing did not show any
regressions.

Steve Ellcey
sje@cup.hp.com

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-16 20:45                         ` Bernd Schmidt
  2010-08-16 21:03                           ` Mark Mitchell
@ 2010-08-18 20:50                           ` Eric Botcazou
  2010-08-18 22:03                             ` Bernd Schmidt
  1 sibling, 1 reply; 129+ messages in thread
From: Eric Botcazou @ 2010-08-18 20:50 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: gcc-patches, Mark Mitchell, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

> I was going to use Richard's earlier approval and check in the revised
> version some time this week after a bit more testing.

Richard's approval is not stronger than my refusal when it comes to the
RTL optimizers, so you need a second approval to break the tie.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [CFT, v4] Vectorized _cpp_clean_line
       [not found]                                                             ` <4C6C39DB.8070409@redhat.com>
@ 2010-08-18 21:50                                                               ` Richard Henderson
  2010-08-19 14:12                                                                 ` Luis Machado
  0 siblings, 1 reply; 129+ messages in thread
From: Richard Henderson @ 2010-08-18 21:50 UTC (permalink / raw)
  To: luisgpm; +Cc: GCC Patches, meissner

[-- Attachment #1: Type: text/plain, Size: 3191 bytes --]

On 08/18/2010 12:51 PM, Richard Henderson wrote:
> Before:
> 
>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls  ms/call  ms/call  name    
>  29.41      0.05     0.05   168709     0.00     0.00  ._cpp_clean_line
>  17.65      0.08     0.03   606267     0.00     0.00  ._cpp_lex_direct
>  11.76      0.10     0.02   168081     0.00     0.00  .linemap_line_start
>  11.76      0.12     0.02                             .variably_modified_type_p
>   5.88      0.13     0.01   503900     0.00     0.00  ._cpp_lex_token
> 
> After:
> 
>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls  ms/call  ms/call  name    
>  20.83      0.05     0.05   606267     0.00     0.00  ._cpp_lex_direct
>  20.83      0.10     0.05   228345     0.00     0.00  .ht_lookup_with_hash
>  16.67      0.14     0.04                             .variably_modified_type_p
>   4.17      0.15     0.01   503900     0.00     0.00  ._cpp_lex_token
>   4.17      0.16     0.01   304199     0.00     0.00  ._cpp_lex_identifier
>   4.17      0.17     0.01   304199     0.00     0.00  .cpp_token_as_text
>   4.17      0.18     0.01   168709     0.00     0.00  ._cpp_clean_line
> 
> Note that ._cpp_clean_line is about 5 times faster.
> 
> Is the cpu you're testing on (power7 or what?) just that much
> better with the original integer code?

Not to overload you too much, but here's an incremental patch
to test as well.

You do have to add "-maltivec" to {BOOT,STAGE}_CFLAGS at the moment,
since the ppc backend does not yet support the kind of target mixing
and matching that the i386 port does.  On that same G5 I get:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 37.50      0.06     0.06   606267     0.00     0.00  ._cpp_lex_direct
 12.50      0.08     0.02   504424     0.00     0.00  .cpp_get_token_with_location
 12.50      0.10     0.02   503900     0.00     0.00  ._cpp_lex_token
 12.50      0.12     0.02                             .variably_modified_type_p
  6.25      0.13     0.01   457580     0.00     0.00  .cpp_output_token
  6.25      0.14     0.01   217473     0.00     0.00  .init_pp_output
  6.25      0.15     0.01    40842     0.00     0.00  ._cpp_test_assertion
  6.25      0.16     0.01        1    10.00   140.00  .preprocess_file
  0.00      0.16     0.00   633800     0.00     0.00  .cpp_get_token
  0.00      0.16     0.00   505969     0.00     0.00  .linemap_lookup
  0.00      0.16     0.00   304199     0.00     0.00  ._cpp_lex_identifier
  0.00      0.16     0.00   304199     0.00     0.00  .cpp_token_as_text
  0.00      0.16     0.00   228345     0.00     0.00  .ht_lookup_with_hash
  0.00      0.16     0.00   180752     0.00     0.00  ._cpp_get_fresh_line
  0.00      0.16     0.00   173870     0.00     0.00  ._cpp_init_tokenrun
  0.00      0.16     0.00   168709     0.00     0.00  ._cpp_clean_line

I.e. _cpp_clean_line has essentially vanished off the radar.  Running
oprofile across an entire bootstrap stage would probably give a better
picture of how much time it really takes.


r~

[-- Attachment #2: searchline-5-6 --]
[-- Type: text/plain, Size: 3112 bytes --]

diff --git a/libcpp/lex.c b/libcpp/lex.c
index 8e56784..1e8e847 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -220,7 +220,7 @@ acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
    and branches without increasing the number of arithmetic operations.
    It's almost certainly going to be a win with 64-bit word size.  */
 
-static bool
+static bool ATTRIBUTE_UNUSED
 search_line_acc_char (const uchar *s, const uchar *end, const uchar **out)
 {
   const word_type repl_nl = acc_char_replicate ('\n');
@@ -497,6 +497,116 @@ search_line_fast (const uchar *s, const uchar *end, const uchar **out)
     return search_line_acc_char (s, end, out);
 }
 
+#elif defined(__ALTIVEC__)
+
+static bool
+search_line_fast (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef __vector unsigned char vc;
+
+  const vc repl_nl = {
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', 
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n'
+  };
+  const vc repl_cr = {
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r', 
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r'
+  };
+  const vc repl_bs = {
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\', 
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\'
+  };
+  const vc repl_qm = {
+    '?', '?', '?', '?', '?', '?', '?', '?', 
+    '?', '?', '?', '?', '?', '?', '?', '?', 
+  };
+  const vc ones = {
+    -1, -1, -1, -1, -1, -1, -1, -1,
+    -1, -1, -1, -1, -1, -1, -1, -1,
+  };
+  const vc zero = { 0 };
+
+  ptrdiff_t left;
+  vc data, vsl, t;
+
+  left = end - s;
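+  /* Align the load: vec_lvsr on the unaligned address yields a
+     permute control from which we build a vector that is 0x00 for the
+     bytes before S and 0xff from S onward; ANDing it into DATA clears
+     the leading bytes, much as mask_align[] does in the SSE2 version.  */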
+  data = __builtin_vec_ld(0, (const vc *)s);
+  vsl = __builtin_vec_lvsr(0, s);
+  t = __builtin_vec_perm(zero, ones, vsl);
+  data &= t;
+
+  left += (uintptr_t)s & 15;
+  s = (const uchar *)((uintptr_t)s & -16);
+  goto start;
+
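+  /* The first vector was already loaded and masked above, so enter
+     the loop at the comparison; later iterations advance S and reload
+     at the top of the body.  */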
+  do
+    {
+      vc m_nl, m_cr, m_bs, m_qm;
+
+      left -= 16;
+      s += 16;
+      if (__builtin_expect (left <= 0, 0))
+	{
+	  *out = s;
+	  return false;
+	}
+      data = __builtin_vec_ld(0, (const vc *)s);
+
+    start:
+      m_nl = (vc) __builtin_vec_cmpeq(data, repl_nl);
+      m_cr = (vc) __builtin_vec_cmpeq(data, repl_cr);
+      m_bs = (vc) __builtin_vec_cmpeq(data, repl_bs);
+      m_qm = (vc) __builtin_vec_cmpeq(data, repl_qm);
+      t = (m_nl | m_cr) | (m_bs | m_qm);
+    }
+  while (!__builtin_vec_vcmpeq_p(/*__CR6_LT_REV*/3, t, zero));
+
+  /* A match somewhere.  Scan T for the match.  */
+  {
+#define N  (sizeof(vc) / sizeof(long))
+
+    union {
+      vc v;
+      unsigned long l[N];
+    } u;
+    typedef char check_count[(N == 2 || N == 4) * 2 - 1];
+
+    unsigned long l, i = 0;
+
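+    /* Scan the long-sized pieces of T most-significant-first (the
+       vector sits in memory big-endian here), advancing S past each
+       all-zero piece; the case fall-through below is deliberate.
+       __builtin_clzl then locates the matching 0xff byte within the
+       first non-zero piece.  */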
+    u.v = t;
+    switch (N)
+      {
+      case 4:
+	l = u.l[i++];
+	if (l != 0)
+	  break;
+	s += sizeof(unsigned long);
+	l = u.l[i++];
+	if (l != 0)
+	  break;
+	s += sizeof(unsigned long);
+      case 2:
+	l = u.l[i++];
+	if (l != 0)
+	  break;
+	s += sizeof(unsigned long);
+	l = u.l[i];
+      }
+
+    l = __builtin_clzl(l);
+    l /= 8;
+    *out = s + l;
+    return true;
+
+#undef N
+  }
+}
+
+void 
+init_vectorized_lexer (void)
+{
+}
+
 #else
 
 /* We only have one accelerated alternative.  Use a direct call so that

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-18 20:50                           ` Eric Botcazou
@ 2010-08-18 22:03                             ` Bernd Schmidt
  2010-08-19  8:04                               ` Eric Botcazou
  0 siblings, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-18 22:03 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: gcc-patches, Mark Mitchell, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

On 08/18/2010 10:41 PM, Eric Botcazou wrote:
>> I was going to use Richard's earlier approval and check in the revised
>> version some time this week after a bit more testing.
> 
> Richard's approval is not stronger than my refusal when it comes to the RTL 
> optimizers so you need a 2nd approval to break the tie.

Mark said the plan was sensible, so I think there is no tie.


Bernd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-18 22:03                             ` Bernd Schmidt
@ 2010-08-19  8:04                               ` Eric Botcazou
  2010-08-19 15:44                                 ` Mark Mitchell
  2010-08-19 18:13                                 ` Bernd Schmidt
  0 siblings, 2 replies; 129+ messages in thread
From: Eric Botcazou @ 2010-08-19  8:04 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: gcc-patches, Mark Mitchell, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

> Mark said the plan was sensible, so I think there is no tie.

Sorry, this is such a bad decision in my opinion, as it will set a precedent 
for one-percent-slowdown-for-very-little-benefit patches, that I think an 
explicit OK is in order.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [CFT, v4] Vectorized _cpp_clean_line
  2010-08-18 21:50                                                               ` Richard Henderson
@ 2010-08-19 14:12                                                                 ` Luis Machado
  0 siblings, 0 replies; 129+ messages in thread
From: Luis Machado @ 2010-08-19 14:12 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, meissner

On Wed, 2010-08-18 at 14:49 -0700, Richard Henderson wrote:
> On 08/18/2010 12:51 PM, Richard Henderson wrote:
> > Before:
> > 
> >   %   cumulative   self              self     total           
> >  time   seconds   seconds    calls  ms/call  ms/call  name    
> >  29.41      0.05     0.05   168709     0.00     0.00  ._cpp_clean_line
> >  17.65      0.08     0.03   606267     0.00     0.00  ._cpp_lex_direct
> >  11.76      0.10     0.02   168081     0.00     0.00  .linemap_line_start
> >  11.76      0.12     0.02                             .variably_modified_type_p
> >   5.88      0.13     0.01   503900     0.00     0.00  ._cpp_lex_token
> > 
> > After:
> > 
> >   %   cumulative   self              self     total           
> >  time   seconds   seconds    calls  ms/call  ms/call  name    
> >  20.83      0.05     0.05   606267     0.00     0.00  ._cpp_lex_direct
> >  20.83      0.10     0.05   228345     0.00     0.00  .ht_lookup_with_hash
> >  16.67      0.14     0.04                             .variably_modified_type_p
> >   4.17      0.15     0.01   503900     0.00     0.00  ._cpp_lex_token
> >   4.17      0.16     0.01   304199     0.00     0.00  ._cpp_lex_identifier
> >   4.17      0.17     0.01   304199     0.00     0.00  .cpp_token_as_text
> >   4.17      0.18     0.01   168709     0.00     0.00  ._cpp_clean_line
> > 
> > Note that ._cpp_clean_line is about 5 times faster.
> > 
> > Is the cpu you're testing on (power7 or what?) just that much
> > better with the original integer code?
> 
> Not to overload you too much, but here's an incremental patch
> to test as well.
> 
> You do have to add "-maltivec" to {BOOT,STAGE}_CFLAGS at the moment,
> since the ppc backend does not yet support the kind of target mixing
> and matching that the i386 port does.  On that same G5 I get:
> 
>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls  ms/call  ms/call  name    
>  37.50      0.06     0.06   606267     0.00     0.00  ._cpp_lex_direct
>  12.50      0.08     0.02   504424     0.00     0.00  .cpp_get_token_with_location
>  12.50      0.10     0.02   503900     0.00     0.00  ._cpp_lex_token
>  12.50      0.12     0.02                             .variably_modified_type_p
>   6.25      0.13     0.01   457580     0.00     0.00  .cpp_output_token
>   6.25      0.14     0.01   217473     0.00     0.00  .init_pp_output
>   6.25      0.15     0.01    40842     0.00     0.00  ._cpp_test_assertion
>   6.25      0.16     0.01        1    10.00   140.00  .preprocess_file
>   0.00      0.16     0.00   633800     0.00     0.00  .cpp_get_token
>   0.00      0.16     0.00   505969     0.00     0.00  .linemap_lookup
>   0.00      0.16     0.00   304199     0.00     0.00  ._cpp_lex_identifier
>   0.00      0.16     0.00   304199     0.00     0.00  .cpp_token_as_text
>   0.00      0.16     0.00   228345     0.00     0.00  .ht_lookup_with_hash
>   0.00      0.16     0.00   180752     0.00     0.00  ._cpp_get_fresh_line
>   0.00      0.16     0.00   173870     0.00     0.00  ._cpp_init_tokenrun
>   0.00      0.16     0.00   168709     0.00     0.00  ._cpp_clean_line
> 
> I.e. _cpp_clean_line has essentially vanished off the radar.  Running
> oprofile across an entire bootstrap stage would probably give a better
> picture of how much time it really takes.
> 
> 
> r~

I'll get it tested.

Luis

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-19  8:04                               ` Eric Botcazou
@ 2010-08-19 15:44                                 ` Mark Mitchell
  2010-08-19 18:13                                 ` Bernd Schmidt
  1 sibling, 0 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-19 15:44 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: Bernd Schmidt, gcc-patches, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

Eric Botcazou wrote:

> Sorry, this is such a bad decision in my opinion, as it will set a precedent 
> for one-percent-slowdown-for-very-little-benefit patches, that I think an 
> explicit OK is in order.

Bernd has another version of the patch coming that has even less
compile-time cost than the latest version he posted.  There's no reason
to check in the version Bernd's posted at this point, because the new
version is better and almost ready.

But, I'm going to pre-approve that patch.  Richard, Jeff, and I all
think the patch is a good idea (if not enabled at -O1).  At -O2 and
above, GCC should generate good code, even if it's a little slower.  On
RISC machines in particular, these optimizations are significant.  Bernd
has been responsive to your concern about compile-time, and has
significantly reduced the compile-time impact of the patch.

Thanks,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: Combine four insns
  2010-08-19  8:04                               ` Eric Botcazou
  2010-08-19 15:44                                 ` Mark Mitchell
@ 2010-08-19 18:13                                 ` Bernd Schmidt
  2010-08-19 18:25                                   ` Mark Mitchell
                                                     ` (3 more replies)
  1 sibling, 4 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-19 18:13 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: gcc-patches, Mark Mitchell, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

[-- Attachment #1: Type: text/plain, Size: 2316 bytes --]

On 08/19/2010 09:38 AM, Eric Botcazou wrote:
>> Mark said the plan was sensible, so I think there is no tie.
> 
> Sorry, this is such a bad decision in my opinion, as it will set a precedent 
> for one-percent-slowdown-for-very-little-benefit patches, that I think an 
> explicit OK is in order.

We're no longer discussing the 1% slower patch.  I'll even agree that
that's a bit excessive, and the approval for it surprised me, but it
served to get a discussion going.  Several people provided datapoints
indicating that time spent in the optimizers at -O2 or higher is
something that just isn't on the radar as a valid concern, based both
on usage patterns and on profiling results which show that most time is
spent elsewhere.

As for the patch itself, Michael Matz provided constructive feedback
which led to a heuristic that eliminated a large number of combine-4
attempts.  I conclude that either you didn't read the thread before
attempting once again to block one of my patches, or the above is more
than a little disingenuous.

The following is a slightly updated variant of the previous patch I
posted.  I fixed a bug and slightly cleaned up the in_feeds_im logic
(insn_a_feeds_b isn't valid for insns that aren't consecutive in the
combination), and I found a way to slightly relax the heuristic in order
to use Michael's suggestion of allowing combinations if there are two or
more binary operations with constant operand.

On i686, the heuristic reduces the combine-4 attempts to slightly over a
third of the ones tried by the first patch (and presumably disallows
them in most cases where we'd generate overly large RTL).  Hence, I
would have expected this patch to cause slowdowns in the 0.4% range, but
when I ran a few bootstraps today, I had a few results near 99m5s user
time with the patch, and the best run without the patch came in at
99m10s.  I don't have enough data to be sure, but some test runs gave me
the impression that there is one change, using insn_a_feeds_b instead of
reg_overlap_mentioned_p, which provided some speedup, and may explain
this result.  It still seems a little odd.

Allowing unary ops in the heuristic as well made hardly a difference in
output, and appeared to cost around 0.3%, so I left it out.

Bootstrapped and regression tested on i686-linux.  Committed.


Bernd

[-- Attachment #2: combine4f.diff --]
[-- Type: text/plain, Size: 44698 bytes --]

	PR target/42172
	* combine.c (combine_validate_cost): New arg I0.  All callers changed.
	Take its cost into account if nonnull.
	(insn_a_feeds_b): New static function.
	(combine_instructions): Look for four-insn combinations.
	(can_combine_p): New args PRED2, SUCC2.  All callers changed.  Take
	them into account when computing all_adjacent and looking for other
	uses.
	(combinable_i3pat): New args I0DEST, I0_NOT_IN_SRC.  All callers
	changed.  Treat them like I1DEST and I1_NOT_IN_SRC.
	(try_combine): New arg I0.  Handle four-insn combinations.
	(distribute_notes): New arg ELIM_I0.  All callers changed.  Treat it
	like ELIM_I1.

Index: combine.c
===================================================================
--- combine.c	(revision 162821)
+++ combine.c	(working copy)
@@ -385,10 +385,10 @@ static void init_reg_last (void);
 static void setup_incoming_promotions (rtx);
 static void set_nonzero_bits_and_sign_copies (rtx, const_rtx, void *);
 static int cant_combine_insn_p (rtx);
-static int can_combine_p (rtx, rtx, rtx, rtx, rtx *, rtx *);
-static int combinable_i3pat (rtx, rtx *, rtx, rtx, int, rtx *);
+static int can_combine_p (rtx, rtx, rtx, rtx, rtx, rtx, rtx *, rtx *);
+static int combinable_i3pat (rtx, rtx *, rtx, rtx, rtx, int, int, rtx *);
 static int contains_muldiv (rtx);
-static rtx try_combine (rtx, rtx, rtx, int *);
+static rtx try_combine (rtx, rtx, rtx, rtx, int *);
 static void undo_all (void);
 static void undo_commit (void);
 static rtx *find_split_point (rtx *, rtx, bool);
@@ -438,7 +438,7 @@ static void reg_dead_at_p_1 (rtx, const_
 static int reg_dead_at_p (rtx, rtx);
 static void move_deaths (rtx, rtx, int, rtx, rtx *);
 static int reg_bitfield_target_p (rtx, rtx);
-static void distribute_notes (rtx, rtx, rtx, rtx, rtx, rtx);
+static void distribute_notes (rtx, rtx, rtx, rtx, rtx, rtx, rtx);
 static void distribute_links (rtx);
 static void mark_used_regs_combine (rtx);
 static void record_promoted_value (rtx, rtx);
@@ -766,7 +766,7 @@ do_SUBST_MODE (rtx *into, enum machine_m
 \f
 /* Subroutine of try_combine.  Determine whether the combine replacement
    patterns NEWPAT, NEWI2PAT and NEWOTHERPAT are cheaper according to
-   insn_rtx_cost that the original instruction sequence I1, I2, I3 and
+   insn_rtx_cost than the original instruction sequence I0, I1, I2, I3 and
    undobuf.other_insn.  Note that I1 and/or NEWI2PAT may be NULL_RTX.
    NEWOTHERPAT and undobuf.other_insn may also both be NULL_RTX.  This
    function returns false, if the costs of all instructions can be
@@ -774,10 +774,10 @@ do_SUBST_MODE (rtx *into, enum machine_m
    sequence.  */
 
 static bool
-combine_validate_cost (rtx i1, rtx i2, rtx i3, rtx newpat, rtx newi2pat,
-		       rtx newotherpat)
+combine_validate_cost (rtx i0, rtx i1, rtx i2, rtx i3, rtx newpat,
+		       rtx newi2pat, rtx newotherpat)
 {
-  int i1_cost, i2_cost, i3_cost;
+  int i0_cost, i1_cost, i2_cost, i3_cost;
   int new_i2_cost, new_i3_cost;
   int old_cost, new_cost;
 
@@ -788,13 +788,23 @@ combine_validate_cost (rtx i1, rtx i2, r
   if (i1)
     {
       i1_cost = INSN_COST (i1);
-      old_cost = (i1_cost > 0 && i2_cost > 0 && i3_cost > 0)
-		 ? i1_cost + i2_cost + i3_cost : 0;
+      if (i0)
+	{
+	  i0_cost = INSN_COST (i0);
+	  old_cost = (i0_cost > 0 && i1_cost > 0 && i2_cost > 0 && i3_cost > 0
+		      ? i0_cost + i1_cost + i2_cost + i3_cost : 0);
+	}
+      else
+	{
+	  old_cost = (i1_cost > 0 && i2_cost > 0 && i3_cost > 0
+		      ? i1_cost + i2_cost + i3_cost : 0);
+	  i0_cost = 0;
+	}
     }
   else
     {
       old_cost = (i2_cost > 0 && i3_cost > 0) ? i2_cost + i3_cost : 0;
-      i1_cost = 0;
+      i1_cost = i0_cost = 0;
     }
 
   /* Calculate the replacement insn_rtx_costs.  */
@@ -833,7 +843,16 @@ combine_validate_cost (rtx i1, rtx i2, r
     {
       if (dump_file)
 	{
-	  if (i1)
+	  if (i0)
+	    {
+	      fprintf (dump_file,
+		       "rejecting combination of insns %d, %d, %d and %d\n",
+		       INSN_UID (i0), INSN_UID (i1), INSN_UID (i2),
+		       INSN_UID (i3));
+	      fprintf (dump_file, "original costs %d + %d + %d + %d = %d\n",
+		       i0_cost, i1_cost, i2_cost, i3_cost, old_cost);
+	    }
+	  else if (i1)
 	    {
 	      fprintf (dump_file,
 		       "rejecting combination of insns %d, %d and %d\n",
@@ -1010,6 +1029,21 @@ clear_log_links (void)
     if (INSN_P (insn))
       free_INSN_LIST_list (&LOG_LINKS (insn));
 }
+
+/* Walk the LOG_LINKS of insn B to see if we find a reference to A.  Return
+   true if we found a LOG_LINK that proves that A feeds B.  This only works
+   if there are no instructions between A and B which could have a link
+   depending on A, since in that case we would not record a link for B.  */
+
+static bool
+insn_a_feeds_b (rtx a, rtx b)
+{
+  rtx links;
+  for (links = LOG_LINKS (b); links; links = XEXP (links, 1))
+    if (XEXP (links, 0) == a)
+      return true;
+  return false;
+}
 \f
 /* Main entry point for combiner.  F is the first insn of the function.
    NREGS is the first unused pseudo-reg number.
@@ -1150,7 +1184,7 @@ combine_instructions (rtx f, unsigned in
 	      /* Try this insn with each insn it links back to.  */
 
 	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
-		if ((next = try_combine (insn, XEXP (links, 0),
+		if ((next = try_combine (insn, XEXP (links, 0), NULL_RTX,
 					 NULL_RTX, &new_direct_jump_p)) != 0)
 		  goto retry;
 
@@ -1168,8 +1202,8 @@ combine_instructions (rtx f, unsigned in
 		  for (nextlinks = LOG_LINKS (link);
 		       nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, link,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, link, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1187,14 +1221,14 @@ combine_instructions (rtx f, unsigned in
 		  && NONJUMP_INSN_P (prev)
 		  && sets_cc0_p (PATTERN (prev)))
 		{
-		  if ((next = try_combine (insn, prev,
-					   NULL_RTX, &new_direct_jump_p)) != 0)
+		  if ((next = try_combine (insn, prev, NULL_RTX, NULL_RTX,
+					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
 		  for (nextlinks = LOG_LINKS (prev); nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, prev,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, prev, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1207,14 +1241,14 @@ combine_instructions (rtx f, unsigned in
 		  && GET_CODE (PATTERN (insn)) == SET
 		  && reg_mentioned_p (cc0_rtx, SET_SRC (PATTERN (insn))))
 		{
-		  if ((next = try_combine (insn, prev,
-					   NULL_RTX, &new_direct_jump_p)) != 0)
+		  if ((next = try_combine (insn, prev, NULL_RTX, NULL_RTX,
+					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
 		  for (nextlinks = LOG_LINKS (prev); nextlinks;
 		       nextlinks = XEXP (nextlinks, 1))
-		    if ((next = try_combine (insn, prev,
-					     XEXP (nextlinks, 0),
+		    if ((next = try_combine (insn, prev, XEXP (nextlinks, 0),
+					     NULL_RTX,
 					     &new_direct_jump_p)) != 0)
 		      goto retry;
 		}
@@ -1230,7 +1264,8 @@ combine_instructions (rtx f, unsigned in
 		    && NONJUMP_INSN_P (prev)
 		    && sets_cc0_p (PATTERN (prev))
 		    && (next = try_combine (insn, XEXP (links, 0),
-					    prev, &new_direct_jump_p)) != 0)
+					    prev, NULL_RTX,
+					    &new_direct_jump_p)) != 0)
 		  goto retry;
 #endif
 
@@ -1240,10 +1275,64 @@ combine_instructions (rtx f, unsigned in
 		for (nextlinks = XEXP (links, 1); nextlinks;
 		     nextlinks = XEXP (nextlinks, 1))
 		  if ((next = try_combine (insn, XEXP (links, 0),
-					   XEXP (nextlinks, 0),
+					   XEXP (nextlinks, 0), NULL_RTX,
 					   &new_direct_jump_p)) != 0)
 		    goto retry;
 
+	      /* Try four-instruction combinations.  */
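+	      /* Three feeder insns can reach I3 through the LOG_LINKS
+		 in several distinct shapes; each loop nest below
+		 enumerates one shape, named in the comment above the
+		 innermost loop.  */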
+	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
+		{
+		  rtx next1;
+		  rtx link = XEXP (links, 0);
+
+		  /* If the linked insn has been replaced by a note, then there
+		     is no point in pursuing this chain any further.  */
+		  if (NOTE_P (link))
+		    continue;
+
+		  for (next1 = LOG_LINKS (link); next1; next1 = XEXP (next1, 1))
+		    {
+		      rtx link1 = XEXP (next1, 0);
+		      if (NOTE_P (link1))
+			continue;
+		      /* I0 -> I1 -> I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		      /* I0, I1 -> I2, I2 -> I3.  */
+		      for (nextlinks = XEXP (next1, 1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		    }
+
+		  for (next1 = XEXP (links, 1); next1; next1 = XEXP (next1, 1))
+		    {
+		      rtx link1 = XEXP (next1, 0);
+		      if (NOTE_P (link1))
+			continue;
+		      /* I0 -> I2; I1, I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		      /* I0 -> I1; I1, I2 -> I3.  */
+		      for (nextlinks = LOG_LINKS (link1); nextlinks;
+			   nextlinks = XEXP (nextlinks, 1))
+			if ((next = try_combine (insn, link, link1,
+						 XEXP (nextlinks, 0),
+						 &new_direct_jump_p)) != 0)
+			  goto retry;
+		    }
+		}
+
 	      /* Try this insn with each REG_EQUAL note it links back to.  */
 	      for (links = LOG_LINKS (insn); links; links = XEXP (links, 1))
 		{
@@ -1267,7 +1356,7 @@ combine_instructions (rtx f, unsigned in
 		      i2mod = temp;
 		      i2mod_old_rhs = copy_rtx (orig);
 		      i2mod_new_rhs = copy_rtx (note);
-		      next = try_combine (insn, i2mod, NULL_RTX,
+		      next = try_combine (insn, i2mod, NULL_RTX, NULL_RTX,
 					  &new_direct_jump_p);
 		      i2mod = NULL_RTX;
 		      if (next)
@@ -1529,9 +1618,10 @@ set_nonzero_bits_and_sign_copies (rtx x,
     }
 }
 \f
-/* See if INSN can be combined into I3.  PRED and SUCC are optionally
-   insns that were previously combined into I3 or that will be combined
-   into the merger of INSN and I3.
+/* See if INSN can be combined into I3.  PRED, PRED2, SUCC and SUCC2 are
+   optionally insns that were previously combined into I3 or that will be
+   combined into the merger of INSN and I3.  The order is PRED, PRED2,
+   INSN, SUCC, SUCC2, I3.
 
    Return 0 if the combination is not allowed for any reason.
 
@@ -1540,7 +1630,8 @@ set_nonzero_bits_and_sign_copies (rtx x,
    will return 1.  */
 
 static int
-can_combine_p (rtx insn, rtx i3, rtx pred ATTRIBUTE_UNUSED, rtx succ,
+can_combine_p (rtx insn, rtx i3, rtx pred ATTRIBUTE_UNUSED,
+	       rtx pred2 ATTRIBUTE_UNUSED, rtx succ, rtx succ2,
 	       rtx *pdest, rtx *psrc)
 {
   int i;
@@ -1550,10 +1641,25 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 #ifdef AUTO_INC_DEC
   rtx link;
 #endif
-  int all_adjacent = (succ ? (next_active_insn (insn) == succ
-			      && next_active_insn (succ) == i3)
-		      : next_active_insn (insn) == i3);
+  bool all_adjacent = true;
 
+  if (succ)
+    {
+      if (succ2)
+	{
+	  if (next_active_insn (succ2) != i3)
+	    all_adjacent = false;
+	  if (next_active_insn (succ) != succ2)
+	    all_adjacent = false;
+	}
+      else if (next_active_insn (succ) != i3)
+	all_adjacent = false;
+      if (next_active_insn (insn) != succ)
+	all_adjacent = false;
+    }
+  else if (next_active_insn (insn) != i3)
+    all_adjacent = false;
+    
   /* Can combine only if previous insn is a SET of a REG, a SUBREG or CC0.
      or a PARALLEL consisting of such a SET and CLOBBERs.
 
@@ -1678,11 +1784,15 @@ can_combine_p (rtx insn, rtx i3, rtx pre
       /* Don't substitute into an incremented register.  */
       || FIND_REG_INC_NOTE (i3, dest)
       || (succ && FIND_REG_INC_NOTE (succ, dest))
+      || (succ2 && FIND_REG_INC_NOTE (succ2, dest))
       /* Don't substitute into a non-local goto, this confuses CFG.  */
       || (JUMP_P (i3) && find_reg_note (i3, REG_NON_LOCAL_GOTO, NULL_RTX))
       /* Make sure that DEST is not used after SUCC but before I3.  */
-      || (succ && ! all_adjacent
-	  && reg_used_between_p (dest, succ, i3))
+      || (!all_adjacent
+	  && ((succ2
+	       && (reg_used_between_p (dest, succ2, i3)
+		   || reg_used_between_p (dest, succ, succ2)))
+	      || (!succ2 && succ && reg_used_between_p (dest, succ, i3))))
       /* Make sure that the value that is to be substituted for the register
 	 does not use any registers whose values alter in between.  However,
 	 if the insns are adjacent, a use can't cross a set even though we
@@ -1765,13 +1875,12 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 
   if (GET_CODE (src) == ASM_OPERANDS || volatile_refs_p (src))
     {
-      /* Make sure succ doesn't contain a volatile reference.  */
+      /* Make sure neither succ nor succ2 contains a volatile reference.  */
+      if (succ2 != 0 && volatile_refs_p (PATTERN (succ2)))
+	return 0;
       if (succ != 0 && volatile_refs_p (PATTERN (succ)))
 	return 0;
-
-      for (p = NEXT_INSN (insn); p != i3; p = NEXT_INSN (p))
-	if (INSN_P (p) && p != succ && volatile_refs_p (PATTERN (p)))
-	  return 0;
+      /* We'll check insns between INSN and I3 below.  */
     }
 
   /* If INSN is an asm, and DEST is a hard register, reject, since it has
@@ -1785,7 +1894,7 @@ can_combine_p (rtx insn, rtx i3, rtx pre
      they might affect machine state.  */
 
   for (p = NEXT_INSN (insn); p != i3; p = NEXT_INSN (p))
-    if (INSN_P (p) && p != succ && volatile_insn_p (PATTERN (p)))
+    if (INSN_P (p) && p != succ && p != succ2 && volatile_insn_p (PATTERN (p)))
       return 0;
 
   /* If INSN contains an autoincrement or autodecrement, make sure that
@@ -1801,8 +1910,12 @@ can_combine_p (rtx insn, rtx i3, rtx pre
 	    || reg_used_between_p (XEXP (link, 0), insn, i3)
 	    || (pred != NULL_RTX
 		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (pred)))
+	    || (pred2 != NULL_RTX
+		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (pred2)))
 	    || (succ != NULL_RTX
 		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (succ)))
+	    || (succ2 != NULL_RTX
+		&& reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (succ2)))
 	    || reg_overlap_mentioned_p (XEXP (link, 0), PATTERN (i3))))
       return 0;
 #endif
@@ -1836,8 +1949,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    of a PARALLEL of the pattern.  We validate that it is valid for combining.
 
    One problem is if I3 modifies its output, as opposed to replacing it
-   entirely, we can't allow the output to contain I2DEST or I1DEST as doing
-   so would produce an insn that is not equivalent to the original insns.
+   entirely, we can't allow the output to contain I2DEST, I1DEST or I0DEST as
+   doing so would produce an insn that is not equivalent to the original insns.
 
    Consider:
 
@@ -1858,7 +1971,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    must reject the combination.  This case occurs when I2 and I1 both
    feed into I3, rather than when I1 feeds into I2, which feeds into I3.
    If I1_NOT_IN_SRC is nonzero, it means that finding I1 in the source
-   of a SET must prevent combination from occurring.
+   of a SET must prevent combination from occurring.  The same situation
+   can occur for I0, in which case I0_NOT_IN_SRC is set.
 
    Before doing the above check, we first try to expand a field assignment
    into a set of logical operations.
@@ -1870,8 +1984,8 @@ can_combine_p (rtx insn, rtx i3, rtx pre
    Return 1 if the combination is valid, zero otherwise.  */
 
 static int
-combinable_i3pat (rtx i3, rtx *loc, rtx i2dest, rtx i1dest,
-		  int i1_not_in_src, rtx *pi3dest_killed)
+combinable_i3pat (rtx i3, rtx *loc, rtx i2dest, rtx i1dest, rtx i0dest,
+		  int i1_not_in_src, int i0_not_in_src, rtx *pi3dest_killed)
 {
   rtx x = *loc;
 
@@ -1895,9 +2009,11 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
       if ((inner_dest != dest &&
 	   (!MEM_P (inner_dest)
 	    || rtx_equal_p (i2dest, inner_dest)
-	    || (i1dest && rtx_equal_p (i1dest, inner_dest)))
+	    || (i1dest && rtx_equal_p (i1dest, inner_dest))
+	    || (i0dest && rtx_equal_p (i0dest, inner_dest)))
 	   && (reg_overlap_mentioned_p (i2dest, inner_dest)
-	       || (i1dest && reg_overlap_mentioned_p (i1dest, inner_dest))))
+	       || (i1dest && reg_overlap_mentioned_p (i1dest, inner_dest))
+	       || (i0dest && reg_overlap_mentioned_p (i0dest, inner_dest))))
 
 	  /* This is the same test done in can_combine_p except we can't test
 	     all_adjacent; we don't have to, since this instruction will stay
@@ -1913,7 +2029,8 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
 	      && REGNO (inner_dest) < FIRST_PSEUDO_REGISTER
 	      && (! HARD_REGNO_MODE_OK (REGNO (inner_dest),
 					GET_MODE (inner_dest))))
-	  || (i1_not_in_src && reg_overlap_mentioned_p (i1dest, src)))
+	  || (i1_not_in_src && reg_overlap_mentioned_p (i1dest, src))
+	  || (i0_not_in_src && reg_overlap_mentioned_p (i0dest, src)))
 	return 0;
 
       /* If DEST is used in I3, it is being killed in this insn, so
@@ -1953,8 +2070,8 @@ combinable_i3pat (rtx i3, rtx *loc, rtx 
       int i;
 
       for (i = 0; i < XVECLEN (x, 0); i++)
-	if (! combinable_i3pat (i3, &XVECEXP (x, 0, i), i2dest, i1dest,
-				i1_not_in_src, pi3dest_killed))
+	if (! combinable_i3pat (i3, &XVECEXP (x, 0, i), i2dest, i1dest, i0dest,
+				i1_not_in_src, i0_not_in_src, pi3dest_killed))
 	  return 0;
     }
 
@@ -2364,15 +2481,15 @@ update_cfg_for_uncondjump (rtx insn)
     single_succ_edge (bb)->flags |= EDGE_FALLTHRU;
 }
 
+/* Try to combine the insns I0, I1 and I2 into I3.
+   Here I0, I1 and I2 appear earlier than I3.
+   I0 and I1 can be zero; then we combine just I2 into I3, or I1 and I2 into
+   I3.
 
-/* Try to combine the insns I1 and I2 into I3.
-   Here I1 and I2 appear earlier than I3.
-   I1 can be zero; then we combine just I2 into I3.
-
-   If we are combining three insns and the resulting insn is not recognized,
-   try splitting it into two insns.  If that happens, I2 and I3 are retained
-   and I1 is pseudo-deleted by turning it into a NOTE.  Otherwise, I1 and I2
-   are pseudo-deleted.
+   If we are combining more than two insns and the resulting insn is not
+   recognized, try splitting it into two insns.  If that happens, I2 and I3
+   are retained and I1/I0 are pseudo-deleted by turning them into a NOTE.
+   Otherwise, I0, I1 and I2 are pseudo-deleted.
 
    Return 0 if the combination does not work.  Then nothing is changed.
    If we did the combination, return the insn at which combine should
@@ -2382,34 +2499,36 @@ update_cfg_for_uncondjump (rtx insn)
    new direct jump instruction.  */
 
 static rtx
-try_combine (rtx i3, rtx i2, rtx i1, int *new_direct_jump_p)
+try_combine (rtx i3, rtx i2, rtx i1, rtx i0, int *new_direct_jump_p)
 {
   /* New patterns for I3 and I2, respectively.  */
   rtx newpat, newi2pat = 0;
   rtvec newpat_vec_with_clobbers = 0;
-  int substed_i2 = 0, substed_i1 = 0;
-  /* Indicates need to preserve SET in I1 or I2 in I3 if it is not dead.  */
-  int added_sets_1, added_sets_2;
+  int substed_i2 = 0, substed_i1 = 0, substed_i0 = 0;
+  /* Indicates need to preserve SET in I0, I1 or I2 in I3 if it is not
+     dead.  */
+  int added_sets_0, added_sets_1, added_sets_2;
   /* Total number of SETs to put into I3.  */
   int total_sets;
-  /* Nonzero if I2's body now appears in I3.  */
-  int i2_is_used;
+  /* Nonzero if I2's or I1's body now appears in I3.  */
+  int i2_is_used, i1_is_used;
   /* INSN_CODEs for new I3, new I2, and user of condition code.  */
   int insn_code_number, i2_code_number = 0, other_code_number = 0;
   /* Contains I3 if the destination of I3 is used in its source, which means
      that the old life of I3 is being killed.  If that usage is placed into
      I2 and not in I3, a REG_DEAD note must be made.  */
   rtx i3dest_killed = 0;
-  /* SET_DEST and SET_SRC of I2 and I1.  */
-  rtx i2dest = 0, i2src = 0, i1dest = 0, i1src = 0;
+  /* SET_DEST and SET_SRC of I2, I1 and I0.  */
+  rtx i2dest = 0, i2src = 0, i1dest = 0, i1src = 0, i0dest = 0, i0src = 0;
   /* Set if I2DEST was reused as a scratch register.  */
   bool i2scratch = false;
-  /* PATTERN (I1) and PATTERN (I2), or a copy of it in certain cases.  */
-  rtx i1pat = 0, i2pat = 0;
+  /* The PATTERNs of I0, I1, and I2, or a copy of them in certain cases.  */
+  rtx i0pat = 0, i1pat = 0, i2pat = 0;
   /* Indicates if I2DEST or I1DEST is in I2SRC or I1_SRC.  */
   int i2dest_in_i2src = 0, i1dest_in_i1src = 0, i2dest_in_i1src = 0;
-  int i2dest_killed = 0, i1dest_killed = 0;
-  int i1_feeds_i3 = 0;
+  int i0dest_in_i0src = 0, i1dest_in_i0src = 0, i2dest_in_i0src = 0;
+  int i2dest_killed = 0, i1dest_killed = 0, i0dest_killed = 0;
+  int i1_feeds_i2_n = 0, i0_feeds_i2_n = 0, i0_feeds_i1_n = 0;
   /* Notes that must be added to REG_NOTES in I3 and I2.  */
   rtx new_i3_notes, new_i2_notes;
   /* Notes that we substituted I3 into I2 instead of the normal case.  */
@@ -2426,11 +2545,47 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   rtx new_other_notes;
   int i;
 
+  /* Only try four-insn combinations when there's high likelihood of
+     success.  Look for simple insns, such as loads of constants, unary
+     operations, or binary operations involving a constant.  */
+  if (i0)
+    {
+      int i;
+      int ngood = 0;
+      int nshift = 0;
+
+      if (!flag_expensive_optimizations)
+	return 0;
+
+      for (i = 0; i < 4; i++)
+	{
+	  rtx insn = i == 0 ? i0 : i == 1 ? i1 : i == 2 ? i2 : i3;
+	  rtx set = single_set (insn);
+	  rtx src;
+	  if (!set)
+	    continue;
+	  src = SET_SRC (set);
+	  if (CONSTANT_P (src))
+	    {
+	      ngood += 2;
+	      break;
+	    }
+	  else if (BINARY_P (src) && CONSTANT_P (XEXP (src, 1)))
+	    ngood++;
+	  else if (GET_CODE (src) == ASHIFT || GET_CODE (src) == ASHIFTRT
+		   || GET_CODE (src) == LSHIFTRT)
+	    nshift++;
+	}
+      if (ngood < 2 && nshift < 2)
+	return 0;
+    }
+
   /* Exit early if one of the insns involved can't be used for
      combinations.  */
   if (cant_combine_insn_p (i3)
       || cant_combine_insn_p (i2)
       || (i1 && cant_combine_insn_p (i1))
+      || (i0 && cant_combine_insn_p (i0))
       || likely_spilled_retval_p (i3))
     return 0;
 
@@ -2442,7 +2597,10 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
-      if (i1)
+      if (i0)
+	fprintf (dump_file, "\nTrying %d, %d, %d -> %d:\n",
+		 INSN_UID (i0), INSN_UID (i1), INSN_UID (i2), INSN_UID (i3));
+      else if (i1)
 	fprintf (dump_file, "\nTrying %d, %d -> %d:\n",
 		 INSN_UID (i1), INSN_UID (i2), INSN_UID (i3));
       else
@@ -2450,8 +2608,12 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 		 INSN_UID (i2), INSN_UID (i3));
     }
 
-  /* If I1 and I2 both feed I3, they can be in any order.  To simplify the
-     code below, set I1 to be the earlier of the two insns.  */
+  /* If multiple insns feed into one of I2 or I3, they can be in any
+     order.  To simplify the code below, reorder them in sequence.  */
+  if (i0 && DF_INSN_LUID (i0) > DF_INSN_LUID (i2))
+    temp = i2, i2 = i0, i0 = temp;
+  if (i0 && DF_INSN_LUID (i0) > DF_INSN_LUID (i1))
+    temp = i1, i1 = i0, i0 = temp;
   if (i1 && DF_INSN_LUID (i1) > DF_INSN_LUID (i2))
     temp = i1, i1 = i2, i2 = temp;
 
@@ -2519,7 +2681,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	      subst_insn = i3;
 	      subst_low_luid = DF_INSN_LUID (i2);
 
-	      added_sets_2 = added_sets_1 = 0;
+	      added_sets_2 = added_sets_1 = added_sets_0 = 0;
 	      i2src = SET_DEST (PATTERN (i3));
 	      i2dest = SET_SRC (PATTERN (i3));
 	      i2dest_killed = dead_or_set_p (i2, i2dest);
@@ -2606,7 +2768,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	  combine_merges++;
 	  subst_insn = i3;
 	  subst_low_luid = DF_INSN_LUID (i2);
-	  added_sets_2 = added_sets_1 = 0;
+	  added_sets_2 = added_sets_1 = added_sets_0 = 0;
 	  i2dest = SET_DEST (temp);
 	  i2dest_killed = dead_or_set_p (i2, i2dest);
 
@@ -2673,8 +2835,11 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 #endif
 
   /* Verify that I2 and I1 are valid for combining.  */
-  if (! can_combine_p (i2, i3, i1, NULL_RTX, &i2dest, &i2src)
-      || (i1 && ! can_combine_p (i1, i3, NULL_RTX, i2, &i1dest, &i1src)))
+  if (! can_combine_p (i2, i3, i0, i1, NULL_RTX, NULL_RTX, &i2dest, &i2src)
+      || (i1 && ! can_combine_p (i1, i3, i0, NULL_RTX, i2, NULL_RTX,
+				 &i1dest, &i1src))
+      || (i0 && ! can_combine_p (i0, i3, NULL_RTX, NULL_RTX, i1, i2,
+				 &i0dest, &i0src)))
     {
       undo_all ();
       return 0;
@@ -2685,16 +2850,26 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   i2dest_in_i2src = reg_overlap_mentioned_p (i2dest, i2src);
   i1dest_in_i1src = i1 && reg_overlap_mentioned_p (i1dest, i1src);
   i2dest_in_i1src = i1 && reg_overlap_mentioned_p (i2dest, i1src);
+  i0dest_in_i0src = i0 && reg_overlap_mentioned_p (i0dest, i0src);
+  i1dest_in_i0src = i0 && reg_overlap_mentioned_p (i1dest, i0src);
+  i2dest_in_i0src = i0 && reg_overlap_mentioned_p (i2dest, i0src);
   i2dest_killed = dead_or_set_p (i2, i2dest);
   i1dest_killed = i1 && dead_or_set_p (i1, i1dest);
+  i0dest_killed = i0 && dead_or_set_p (i0, i0dest);
 
-  /* See if I1 directly feeds into I3.  It does if I1DEST is not used
-     in I2SRC.  */
-  i1_feeds_i3 = i1 && ! reg_overlap_mentioned_p (i1dest, i2src);
+  /* For the earlier insns, determine which of the subsequent ones they
+     feed.  */
+  i1_feeds_i2_n = i1 && insn_a_feeds_b (i1, i2);
+  i0_feeds_i1_n = i0 && insn_a_feeds_b (i0, i1);
+  i0_feeds_i2_n = (i0 && (!i0_feeds_i1_n ? insn_a_feeds_b (i0, i2)
+			  : (!dead_or_set_p (i1, i0dest)
+			     && reg_overlap_mentioned_p (i0dest, i2src))));
 
   /* Ensure that I3's pattern can be the destination of combines.  */
-  if (! combinable_i3pat (i3, &PATTERN (i3), i2dest, i1dest,
-			  i1 && i2dest_in_i1src && i1_feeds_i3,
+  if (! combinable_i3pat (i3, &PATTERN (i3), i2dest, i1dest, i0dest,
+			  i1 && i2dest_in_i1src && !i1_feeds_i2_n,
+			  i0 && ((i2dest_in_i0src && !i0_feeds_i2_n)
+				 || (i1dest_in_i0src && !i0_feeds_i1_n)),
 			  &i3dest_killed))
     {
       undo_all ();
@@ -2706,6 +2881,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      here.  */
   if (GET_CODE (i2src) == MULT
       || (i1 != 0 && GET_CODE (i1src) == MULT)
+      || (i0 != 0 && GET_CODE (i0src) == MULT)
       || (GET_CODE (PATTERN (i3)) == SET
 	  && GET_CODE (SET_SRC (PATTERN (i3))) == MULT))
     have_mult = 1;
@@ -2745,14 +2921,22 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      feed into I3, the set in I1 needs to be kept around if I1DEST dies
      or is set in I3.  Otherwise (if I1 feeds I2 which feeds I3), the set
      in I1 needs to be kept around unless I1DEST dies or is set in either
-     I2 or I3.  We can distinguish these cases by seeing if I2SRC mentions
-     I1DEST.  If so, we know I1 feeds into I2.  */
+     I2 or I3.  The same consideration applies to I0.  */
 
-  added_sets_2 = ! dead_or_set_p (i3, i2dest);
+  added_sets_2 = !dead_or_set_p (i3, i2dest);
 
-  added_sets_1
-    = i1 && ! (i1_feeds_i3 ? dead_or_set_p (i3, i1dest)
-	       : (dead_or_set_p (i3, i1dest) || dead_or_set_p (i2, i1dest)));
+  if (i1)
+    added_sets_1 = !(dead_or_set_p (i3, i1dest)
+		     || (i1_feeds_i2_n && dead_or_set_p (i2, i1dest)));
+  else
+    added_sets_1 = 0;
+
+  if (i0)
+    added_sets_0 =  !(dead_or_set_p (i3, i0dest)
+		      || (i0_feeds_i2_n && dead_or_set_p (i2, i0dest))
+		      || (i0_feeds_i1_n && dead_or_set_p (i1, i0dest)));
+  else
+    added_sets_0 = 0;
 
   /* If the set in I2 needs to be kept around, we must make a copy of
      PATTERN (I2), so that when we substitute I1SRC for I1DEST in
@@ -2777,6 +2961,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	i1pat = copy_rtx (PATTERN (i1));
     }
 
+  if (added_sets_0)
+    {
+      if (GET_CODE (PATTERN (i0)) == PARALLEL)
+	i0pat = gen_rtx_SET (VOIDmode, i0dest, copy_rtx (i0src));
+      else
+	i0pat = copy_rtx (PATTERN (i0));
+    }
+
   combine_merges++;
 
   /* Substitute in the latest insn for the regs set by the earlier ones.  */
@@ -2825,8 +3017,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 					      i2src, const0_rtx))
 	      != GET_MODE (SET_DEST (newpat))))
 	{
-	  if (can_change_dest_mode(SET_DEST (newpat), added_sets_2,
-				   compare_mode))
+	  if (can_change_dest_mode (SET_DEST (newpat), added_sets_2,
+				    compare_mode))
 	    {
 	      unsigned int regno = REGNO (SET_DEST (newpat));
 	      rtx new_dest;
@@ -2889,13 +3081,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
       n_occurrences = 0;		/* `subst' counts here */
 
-      /* If I1 feeds into I2 (not into I3) and I1DEST is in I1SRC, we
-	 need to make a unique copy of I2SRC each time we substitute it
-	 to avoid self-referential rtl.  */
+      /* If I1 feeds into I2 and I1DEST is in I1SRC, we need to make a
+	 unique copy of I2SRC each time we substitute it to avoid
+	 self-referential rtl.  */
 
       subst_low_luid = DF_INSN_LUID (i2);
       newpat = subst (PATTERN (i3), i2dest, i2src, 0,
-		      ! i1_feeds_i3 && i1dest_in_i1src);
+		      ((i1_feeds_i2_n && i1dest_in_i1src)
+		       || (i0_feeds_i2_n && i0dest_in_i0src)));
       substed_i2 = 1;
 
       /* Record whether i2's body now appears within i3's body.  */
@@ -2911,13 +3104,14 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	 This happens if I1DEST is mentioned in I2 and dies there, and
 	 has disappeared from the new pattern.  */
       if ((FIND_REG_INC_NOTE (i1, NULL_RTX) != 0
-	   && !i1_feeds_i3
+	   && i1_feeds_i2_n
 	   && dead_or_set_p (i2, i1dest)
 	   && !reg_overlap_mentioned_p (i1dest, newpat))
 	  /* Before we can do this substitution, we must redo the test done
 	     above (see detailed comments there) that ensures  that I1DEST
 	     isn't mentioned in any SETs in NEWPAT that are field assignments.  */
-          || !combinable_i3pat (NULL_RTX, &newpat, i1dest, NULL_RTX, 0, 0))
+          || !combinable_i3pat (NULL_RTX, &newpat, i1dest, NULL_RTX, NULL_RTX,
+				0, 0, 0))
 	{
 	  undo_all ();
 	  return 0;
@@ -2925,8 +3119,29 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
       n_occurrences = 0;
       subst_low_luid = DF_INSN_LUID (i1);
-      newpat = subst (newpat, i1dest, i1src, 0, 0);
+      newpat = subst (newpat, i1dest, i1src, 0,
+		      i0_feeds_i1_n && i0dest_in_i0src);
       substed_i1 = 1;
+      i1_is_used = n_occurrences;
+    }
+  if (i0 && GET_CODE (newpat) != CLOBBER)
+    {
+      if ((FIND_REG_INC_NOTE (i0, NULL_RTX) != 0
+	   && ((i0_feeds_i2_n && dead_or_set_p (i2, i0dest))
+	       || (i0_feeds_i1_n && dead_or_set_p (i1, i0dest)))
+	   && !reg_overlap_mentioned_p (i0dest, newpat))
+          || !combinable_i3pat (NULL_RTX, &newpat, i0dest, NULL_RTX, NULL_RTX,
+				0, 0, 0))
+	{
+	  undo_all ();
+	  return 0;
+	}
+
+      n_occurrences = 0;
+      subst_low_luid = DF_INSN_LUID (i0);
+      newpat = subst (newpat, i0dest, i0src, 0,
+		      i0_feeds_i1_n && i0dest_in_i0src);
+      substed_i0 = 1;
     }
 
   /* Fail if an autoincrement side-effect has been duplicated.  Be careful
@@ -2934,7 +3149,12 @@ try_combine (rtx i3, rtx i2, rtx i1, int
   if ((FIND_REG_INC_NOTE (i2, NULL_RTX) != 0
        && i2_is_used + added_sets_2 > 1)
       || (i1 != 0 && FIND_REG_INC_NOTE (i1, NULL_RTX) != 0
-	  && (n_occurrences + added_sets_1 + (added_sets_2 && ! i1_feeds_i3)
+	  && (i1_is_used + added_sets_1 + (added_sets_2 && i1_feeds_i2_n)
+	      > 1))
+      || (i0 != 0 && FIND_REG_INC_NOTE (i0, NULL_RTX) != 0
+	  && (n_occurrences + added_sets_0
+	      + (added_sets_1 && i0_feeds_i1_n)
+	      + (added_sets_2 && i0_feeds_i2_n)
 	      > 1))
       /* Fail if we tried to make a new register.  */
       || max_reg_num () != maxreg
@@ -2954,14 +3174,15 @@ try_combine (rtx i3, rtx i2, rtx i1, int
      we must make a new PARALLEL for the latest insn
     to hold the additional SETs.  */
 
-  if (added_sets_1 || added_sets_2)
+  if (added_sets_0 || added_sets_1 || added_sets_2)
     {
+      int extra_sets = added_sets_0 + added_sets_1 + added_sets_2;
       combine_extras++;
 
       if (GET_CODE (newpat) == PARALLEL)
 	{
 	  rtvec old = XVEC (newpat, 0);
-	  total_sets = XVECLEN (newpat, 0) + added_sets_1 + added_sets_2;
+	  total_sets = XVECLEN (newpat, 0) + extra_sets;
 	  newpat = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (total_sets));
 	  memcpy (XVEC (newpat, 0)->elem, &old->elem[0],
 		  sizeof (old->elem[0]) * old->num_elem);
@@ -2969,25 +3190,31 @@ try_combine (rtx i3, rtx i2, rtx i1, int
       else
 	{
 	  rtx old = newpat;
-	  total_sets = 1 + added_sets_1 + added_sets_2;
+	  total_sets = 1 + extra_sets;
 	  newpat = gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (total_sets));
 	  XVECEXP (newpat, 0, 0) = old;
 	}
 
+      if (added_sets_0)
+	XVECEXP (newpat, 0, --total_sets) = i0pat;
+
       if (added_sets_1)
-	XVECEXP (newpat, 0, --total_sets) = i1pat;
+	{
+	  rtx t = i1pat;
+	  if (i0_feeds_i1_n)
+	    t = subst (t, i0dest, i0src, 0, 0);
 
+	  XVECEXP (newpat, 0, --total_sets) = t;
+	}
       if (added_sets_2)
 	{
-	  /* If there is no I1, use I2's body as is.  We used to also not do
-	     the subst call below if I2 was substituted into I3,
-	     but that could lose a simplification.  */
-	  if (i1 == 0)
-	    XVECEXP (newpat, 0, --total_sets) = i2pat;
-	  else
-	    /* See comment where i2pat is assigned.  */
-	    XVECEXP (newpat, 0, --total_sets)
-	      = subst (i2pat, i1dest, i1src, 0, 0);
+	  rtx t = i2pat;
+	  if (i0_feeds_i2_n)
+	    t = subst (t, i0dest, i0src, 0, 0);
+	  if (i1_feeds_i2_n)
+	    t = subst (t, i1dest, i1src, 0, 0);
+
+	  XVECEXP (newpat, 0, --total_sets) = t;
 	}
     }
 
@@ -3543,7 +3770,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
   /* Only allow this combination if insn_rtx_costs reports that the
      replacement instructions are cheaper than the originals.  */
-  if (!combine_validate_cost (i1, i2, i3, newpat, newi2pat, other_pat))
+  if (!combine_validate_cost (i0, i1, i2, i3, newpat, newi2pat, other_pat))
     {
       undo_all ();
       return 0;
@@ -3642,7 +3869,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	}
 
       distribute_notes (new_other_notes, undobuf.other_insn,
-			undobuf.other_insn, NULL_RTX, NULL_RTX, NULL_RTX);
+			undobuf.other_insn, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
     }
 
   if (swap_i2i3)
@@ -3689,21 +3917,26 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     }
 
   {
-    rtx i3notes, i2notes, i1notes = 0;
-    rtx i3links, i2links, i1links = 0;
+    rtx i3notes, i2notes, i1notes = 0, i0notes = 0;
+    rtx i3links, i2links, i1links = 0, i0links = 0;
     rtx midnotes = 0;
+    int from_luid;
     unsigned int regno;
     /* Compute which registers we expect to eliminate.  newi2pat may be setting
        either i3dest or i2dest, so we must check it.  Also, i1dest may be the
        same as i3dest, in which case newi2pat may be setting i1dest.  */
     rtx elim_i2 = ((newi2pat && reg_set_p (i2dest, newi2pat))
-		   || i2dest_in_i2src || i2dest_in_i1src
+		   || i2dest_in_i2src || i2dest_in_i1src || i2dest_in_i0src
 		   || !i2dest_killed
 		   ? 0 : i2dest);
-    rtx elim_i1 = (i1 == 0 || i1dest_in_i1src
+    rtx elim_i1 = (i1 == 0 || i1dest_in_i1src || i1dest_in_i0src
 		   || (newi2pat && reg_set_p (i1dest, newi2pat))
 		   || !i1dest_killed
 		   ? 0 : i1dest);
+    rtx elim_i0 = (i0 == 0 || i0dest_in_i0src
+		   || (newi2pat && reg_set_p (i0dest, newi2pat))
+		   || !i0dest_killed
+		   ? 0 : i0dest);
 
     /* Get the old REG_NOTES and LOG_LINKS from all our insns and
        clear them.  */
@@ -3711,6 +3944,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     i2notes = REG_NOTES (i2), i2links = LOG_LINKS (i2);
     if (i1)
       i1notes = REG_NOTES (i1), i1links = LOG_LINKS (i1);
+    if (i0)
+      i0notes = REG_NOTES (i0), i0links = LOG_LINKS (i0);
 
     /* Ensure that we do not have something that should not be shared but
        occurs multiple times in the new insns.  Check this by first
@@ -3719,6 +3954,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     reset_used_flags (i3notes);
     reset_used_flags (i2notes);
     reset_used_flags (i1notes);
+    reset_used_flags (i0notes);
     reset_used_flags (newpat);
     reset_used_flags (newi2pat);
     if (undobuf.other_insn)
@@ -3727,6 +3963,7 @@ try_combine (rtx i3, rtx i2, rtx i1, int
     i3notes = copy_rtx_if_shared (i3notes);
     i2notes = copy_rtx_if_shared (i2notes);
     i1notes = copy_rtx_if_shared (i1notes);
+    i0notes = copy_rtx_if_shared (i0notes);
     newpat = copy_rtx_if_shared (newpat);
     newi2pat = copy_rtx_if_shared (newi2pat);
     if (undobuf.other_insn)
@@ -3753,6 +3990,8 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 
 	if (substed_i1)
 	  replace_rtx (call_usage, i1dest, i1src);
+	if (substed_i0)
+	  replace_rtx (call_usage, i0dest, i0src);
 
 	CALL_INSN_FUNCTION_USAGE (i3) = call_usage;
       }
@@ -3827,43 +4066,58 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	SET_INSN_DELETED (i1);
       }
 
+    if (i0)
+      {
+	LOG_LINKS (i0) = 0;
+	REG_NOTES (i0) = 0;
+	if (MAY_HAVE_DEBUG_INSNS)
+	  propagate_for_debug (i0, i3, i0dest, i0src, false);
+	SET_INSN_DELETED (i0);
+      }
+
     /* Get death notes for everything that is now used in either I3 or
        I2 and used to die in a previous insn.  If we built two new
        patterns, move from I1 to I2 then I2 to I3 so that we get the
        proper movement on registers that I2 modifies.  */
 
-    if (newi2pat)
-      {
-	move_deaths (newi2pat, NULL_RTX, DF_INSN_LUID (i1), i2, &midnotes);
-	move_deaths (newpat, newi2pat, DF_INSN_LUID (i1), i3, &midnotes);
-      }
+    if (i0)
+      from_luid = DF_INSN_LUID (i0);
+    else if (i1)
+      from_luid = DF_INSN_LUID (i1);
     else
-      move_deaths (newpat, NULL_RTX, i1 ? DF_INSN_LUID (i1) : DF_INSN_LUID (i2),
-		   i3, &midnotes);
+      from_luid = DF_INSN_LUID (i2);
+    if (newi2pat)
+      move_deaths (newi2pat, NULL_RTX, from_luid, i2, &midnotes);
+    move_deaths (newpat, newi2pat, from_luid, i3, &midnotes);
 
     /* Distribute all the LOG_LINKS and REG_NOTES from I1, I2, and I3.  */
     if (i3notes)
       distribute_notes (i3notes, i3, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
     if (i2notes)
       distribute_notes (i2notes, i2, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
     if (i1notes)
       distribute_notes (i1notes, i1, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
+    if (i0notes)
+      distribute_notes (i0notes, i0, i3, newi2pat ? i2 : NULL_RTX,
+			elim_i2, elim_i1, elim_i0);
     if (midnotes)
       distribute_notes (midnotes, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
-			elim_i2, elim_i1);
+			elim_i2, elim_i1, elim_i0);
 
     /* Distribute any notes added to I2 or I3 by recog_for_combine.  We
        know these are REG_UNUSED and want them to go to the desired insn,
        so we always pass it as i3.  */
 
     if (newi2pat && new_i2_notes)
-      distribute_notes (new_i2_notes, i2, i2, NULL_RTX, NULL_RTX, NULL_RTX);
+      distribute_notes (new_i2_notes, i2, i2, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
 
     if (new_i3_notes)
-      distribute_notes (new_i3_notes, i3, i3, NULL_RTX, NULL_RTX, NULL_RTX);
+      distribute_notes (new_i3_notes, i3, i3, NULL_RTX, NULL_RTX, NULL_RTX,
+			NULL_RTX);
 
     /* If I3DEST was used in I3SRC, it really died in I3.  We may need to
        put a REG_DEAD note for it somewhere.  If NEWI2PAT exists and sets
@@ -3877,39 +4131,51 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	if (newi2pat && reg_set_p (i3dest_killed, newi2pat))
 	  distribute_notes (alloc_reg_note (REG_DEAD, i3dest_killed,
 					    NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, elim_i2, elim_i1);
+			    NULL_RTX, i2, NULL_RTX, elim_i2, elim_i1, elim_i0);
 	else
 	  distribute_notes (alloc_reg_note (REG_DEAD, i3dest_killed,
 					    NULL_RTX),
 			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
-			    elim_i2, elim_i1);
+			    elim_i2, elim_i1, elim_i0);
       }
 
     if (i2dest_in_i2src)
       {
+	rtx new_note = alloc_reg_note (REG_DEAD, i2dest, NULL_RTX);
 	if (newi2pat && reg_set_p (i2dest, newi2pat))
-	  distribute_notes (alloc_reg_note (REG_DEAD, i2dest, NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, NULL_RTX, NULL_RTX);
-	else
-	  distribute_notes (alloc_reg_note (REG_DEAD, i2dest, NULL_RTX),
-			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+	  distribute_notes (new_note,  NULL_RTX, i2, NULL_RTX, NULL_RTX,
 			    NULL_RTX, NULL_RTX);
+	else
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
       }
 
     if (i1dest_in_i1src)
       {
+	rtx new_note = alloc_reg_note (REG_DEAD, i1dest, NULL_RTX);
 	if (newi2pat && reg_set_p (i1dest, newi2pat))
-	  distribute_notes (alloc_reg_note (REG_DEAD, i1dest, NULL_RTX),
-			    NULL_RTX, i2, NULL_RTX, NULL_RTX, NULL_RTX);
+	  distribute_notes (new_note, NULL_RTX, i2, NULL_RTX, NULL_RTX,
+			    NULL_RTX, NULL_RTX);
 	else
-	  distribute_notes (alloc_reg_note (REG_DEAD, i1dest, NULL_RTX),
-			    NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
+      }
+
+    if (i0dest_in_i0src)
+      {
+	rtx new_note = alloc_reg_note (REG_DEAD, i0dest, NULL_RTX);
+	if (newi2pat && reg_set_p (i0dest, newi2pat))
+	  distribute_notes (new_note, NULL_RTX, i2, NULL_RTX, NULL_RTX,
 			    NULL_RTX, NULL_RTX);
+	else
+	  distribute_notes (new_note, NULL_RTX, i3, newi2pat ? i2 : NULL_RTX,
+			    NULL_RTX, NULL_RTX, NULL_RTX);
       }
 
     distribute_links (i3links);
     distribute_links (i2links);
     distribute_links (i1links);
+    distribute_links (i0links);
 
     if (REG_P (i2dest))
       {
@@ -3959,6 +4225,23 @@ try_combine (rtx i3, rtx i2, rtx i1, int
 	  INC_REG_N_SETS (regno, -1);
       }
 
+    if (i0 && REG_P (i0dest))
+      {
+	rtx link;
+	rtx i0_insn = 0, i0_val = 0, set;
+
+	for (link = LOG_LINKS (i3); link; link = XEXP (link, 1))
+	  if ((set = single_set (XEXP (link, 0))) != 0
+	      && rtx_equal_p (i0dest, SET_DEST (set)))
+	    i0_insn = XEXP (link, 0), i0_val = SET_SRC (set);
+
+	record_value_for_reg (i0dest, i0_insn, i0_val);
+
+	regno = REGNO (i0dest);
+	if (! added_sets_0 && ! i0dest_in_i0src)
+	  INC_REG_N_SETS (regno, -1);
+      }
+
     /* Update reg_stat[].nonzero_bits et al for any changes that may have
        been made to this insn.  The order of
        set_nonzero_bits_and_sign_copies() is important.  Because newi2pat
@@ -3978,6 +4261,16 @@ try_combine (rtx i3, rtx i2, rtx i1, int
       df_insn_rescan (undobuf.other_insn);
     }
 
+  if (i0 && !(NOTE_P(i0) && (NOTE_KIND (i0) == NOTE_INSN_DELETED)))
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "modifying insn i0 ");
+	  dump_insn_slim (dump_file, i0);
+	}
+      df_insn_rescan (i0);
+    }
+
   if (i1 && !(NOTE_P(i1) && (NOTE_KIND (i1) == NOTE_INSN_DELETED)))
     {
       if (dump_file)
@@ -12668,7 +12961,7 @@ reg_bitfield_target_p (rtx x, rtx body)
 
 static void
 distribute_notes (rtx notes, rtx from_insn, rtx i3, rtx i2, rtx elim_i2,
-		  rtx elim_i1)
+		  rtx elim_i1, rtx elim_i0)
 {
   rtx note, next_note;
   rtx tem;
@@ -12914,7 +13207,8 @@ distribute_notes (rtx notes, rtx from_in
 			&& !(i2mod
 			     && reg_overlap_mentioned_p (XEXP (note, 0),
 							 i2mod_old_rhs)))
-		       || rtx_equal_p (XEXP (note, 0), elim_i1))
+		       || rtx_equal_p (XEXP (note, 0), elim_i1)
+		       || rtx_equal_p (XEXP (note, 0), elim_i0))
 		break;
 	      tem = i3;
 	    }
@@ -12981,7 +13275,7 @@ distribute_notes (rtx notes, rtx from_in
 			  REG_NOTES (tem) = NULL;
 
 			  distribute_notes (old_notes, tem, tem, NULL_RTX,
-					    NULL_RTX, NULL_RTX);
+					    NULL_RTX, NULL_RTX, NULL_RTX);
 			  distribute_links (LOG_LINKS (tem));
 
 			  SET_INSN_DELETED (tem);
@@ -12998,7 +13292,7 @@ distribute_notes (rtx notes, rtx from_in
 
 			      distribute_notes (old_notes, cc0_setter,
 						cc0_setter, NULL_RTX,
-						NULL_RTX, NULL_RTX);
+						NULL_RTX, NULL_RTX, NULL_RTX);
 			      distribute_links (LOG_LINKS (cc0_setter));
 
 			      SET_INSN_DELETED (cc0_setter);
@@ -13118,7 +13412,8 @@ distribute_notes (rtx notes, rtx from_in
 							     NULL_RTX);
 
 			      distribute_notes (new_note, place, place,
-						NULL_RTX, NULL_RTX, NULL_RTX);
+						NULL_RTX, NULL_RTX, NULL_RTX,
+						NULL_RTX);
 			    }
 			  else if (! refers_to_regno_p (i, i + 1,
 							PATTERN (place), 0)

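A note on the gating heuristic added earlier in this patch: its
ngood/nshift scoring can be exercised in isolation.  The following
self-contained C mock is hypothetical -- a toy enum stands in for the
SET_SRC shapes and every name is invented for this sketch -- but the
scoring and the acceptance threshold mirror the hunk above.

  #include <stdbool.h>
  #include <stdio.h>

  /* Toy stand-ins for the SET_SRC shapes the heuristic looks at.  */
  enum toy_src_kind { CONST_LOAD, BINARY_WITH_CONST, TOY_SHIFT, OTHER };

  /* Accept a four-insn window only if it loads a constant, or contains
     at least two binary operations with a constant operand, or at
     least two shifts -- the same ngood/nshift rule as in try_combine.  */
  static bool
  worth_trying_combine4 (const enum toy_src_kind kinds[4])
  {
    int ngood = 0, nshift = 0;
    int i;

    for (i = 0; i < 4; i++)
      {
        if (kinds[i] == CONST_LOAD)
          {
            ngood += 2;
            break;
          }
        else if (kinds[i] == BINARY_WITH_CONST)
          ngood++;
        else if (kinds[i] == TOY_SHIFT)
          nshift++;
      }
    return ngood >= 2 || nshift >= 2;
  }

  int
  main (void)
  {
    enum toy_src_kind accepted[4]
      = { BINARY_WITH_CONST, OTHER, BINARY_WITH_CONST, OTHER };
    enum toy_src_kind rejected[4] = { OTHER, TOY_SHIFT, OTHER, OTHER };

    printf ("%s\n", worth_trying_combine4 (accepted) ? "try" : "skip");
    printf ("%s\n", worth_trying_combine4 (rejected) ? "try" : "skip");
    return 0;
  }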

* Re: Combine four insns
  2010-08-19 18:13                                 ` Bernd Schmidt
@ 2010-08-19 18:25                                   ` Mark Mitchell
  2010-08-19 18:42                                   ` Richard Henderson
                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-19 18:25 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Eric Botcazou, gcc-patches, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

Bernd Schmidt wrote:

> On i686, the heuristic reduces the combine-4 attempts to slightly over a
> third of the ones tried by the first patch (and presumably disallows
> them in most cases where we'd generate overly large RTL).  Hence, I
> would have expected this patch to cause slowdowns in the 0.4% range, but
> when I ran a few bootstraps today, I had a few results near 99m5s user
> time with the patch, and the best run without the patch came in at
> 99m10s.

OK, I think we can all conclude that the compile-time cost is now nearly
noise.

Bernd, thanks for working on the patch to improve the performance; Eric,
thanks for expressing your concerns and thus encouraging Bernd to come
up with a better patch.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


* Re: Combine four insns
  2010-08-19 18:13                                 ` Bernd Schmidt
  2010-08-19 18:25                                   ` Mark Mitchell
@ 2010-08-19 18:42                                   ` Richard Henderson
  2010-08-19 22:14                                   ` Eric Botcazou
  2010-08-20  0:21                                   ` H.J. Lu
  3 siblings, 0 replies; 129+ messages in thread
From: Richard Henderson @ 2010-08-19 18:42 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Eric Botcazou, gcc-patches, Mark Mitchell, Richard Guenther,
	David Daney, Andi Kleen, Steven Bosscher

On 08/19/2010 10:37 AM, Bernd Schmidt wrote:
> Allowing unary ops in the heuristic as well made hardly a difference in
> output, and appeared to cost around 0.3%, so I left it out.

References to unary ops remain in the heuristic commentary.


r~


* Re: Combine four insns
  2010-08-19 18:13                                 ` Bernd Schmidt
  2010-08-19 18:25                                   ` Mark Mitchell
  2010-08-19 18:42                                   ` Richard Henderson
@ 2010-08-19 22:14                                   ` Eric Botcazou
  2010-08-19 22:19                                     ` Bernd Schmidt
  2010-08-20  0:21                                   ` H.J. Lu
  3 siblings, 1 reply; 129+ messages in thread
From: Eric Botcazou @ 2010-08-19 22:14 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: gcc-patches, Mark Mitchell, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

> As for the patch itself, Michael Matz provided constructive feedback
> which led to a heuristic that eliminated a large number of combine-4
> attempts.  I conclude that either you didn't read the thread before
> attempting once again to block one of my patches, or the above is more
> than a little disingenuous.

It isn't.  I replied to your message saying "I experimented with Michael's
heuristic last week, without getting useful results, so I'll use the one I
previously posted", so I genuinely thought you were discarding the heuristic
altogether.  Glad to hear this isn't the case in the end.

As to again blocking one of your patches, there is nothing personal; you
happened to post three patches in a row that I think aren't the right
approach to solving problems in the part of the compiler I'm responsible
for.  For the first one, I agreed to step down; for the second one, you
checked in something without approval, but the end result was sensible;
and for the third one, you were about to set a precedent that wasn't
acceptable to me.

-- 
Eric Botcazou


* Re: Combine four insns
  2010-08-19 22:14                                   ` Eric Botcazou
@ 2010-08-19 22:19                                     ` Bernd Schmidt
  2010-08-19 22:21                                       ` Eric Botcazou
  0 siblings, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-19 22:19 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: gcc-patches, Mark Mitchell, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

On 08/19/2010 11:59 PM, Eric Botcazou wrote:
>> As for the patch itself, Michael Matz provided constructive feedback
>> which led to a heuristic that eliminated a large number of combine-4
>> attempts.  I conclude that either you didn't read the thread before
>> attempting once again to block one of my patches, or the above is more
>> than a little disingenuous.
> 
> It isn't.  I replied to your message saying "I experimented with Michael's
> heuristic last week, without getting useful results, so I'll use the one I
> previously posted", so I genuinely thought you were discarding the heuristic
> altogether.  Glad to hear this isn't the case in the end.

That was, however, after I'd already posted a new patch with a
heuristic.  That mail also contained additional data.

>for the second one you checked in something without approval

I don't believe this is the case.  Where, specifically?


Bernd


* Re: Combine four insns
  2010-08-19 22:19                                     ` Bernd Schmidt
@ 2010-08-19 22:21                                       ` Eric Botcazou
  2010-08-19 22:37                                         ` Bernd Schmidt
  2010-08-19 23:34                                         ` Andrew Pinski
  0 siblings, 2 replies; 129+ messages in thread
From: Eric Botcazou @ 2010-08-19 22:21 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: gcc-patches, Mark Mitchell, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher

> >for the second one you checked in something without approval
>
> I don't believe this is the case.  Where, specifically?

Your message:
  http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02214.html

Paolo's reply:
  http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02226.html

-- 
Eric Botcazou


* Re: Combine four insns
  2010-08-19 22:21                                       ` Eric Botcazou
@ 2010-08-19 22:37                                         ` Bernd Schmidt
  2010-08-19 22:53                                           ` Steven Bosscher
  2010-08-19 23:34                                         ` Andrew Pinski
  1 sibling, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-19 22:37 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: gcc-patches, Mark Mitchell, Richard Guenther, David Daney,
	Andi Kleen, Steven Bosscher, Paolo Bonzini

On 08/20/2010 12:14 AM, Eric Botcazou wrote:
>>> for the second one you checked in something without approval
>>
>> I don't believe this is the case.  Where, specifically?
> 
> Your message:
>   http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02214.html
> 
> Paolo's reply:
>   http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02226.html
> 

Sounds like an approval to me, unless you wish to quibble about whether
I need extra approval for the obvious change to actually reenable the
code after the fixes to it were approved.

Is there anyone who thinks I was out of line for checking in the code
after Paolo's message?


Bernd


* Re: Combine four insns
  2010-08-19 22:37                                         ` Bernd Schmidt
@ 2010-08-19 22:53                                           ` Steven Bosscher
  0 siblings, 0 replies; 129+ messages in thread
From: Steven Bosscher @ 2010-08-19 22:53 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Eric Botcazou, gcc-patches, Mark Mitchell, Richard Guenther,
	David Daney, Andi Kleen, Paolo Bonzini

On Fri, Aug 20, 2010 at 12:19 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> Is there anyone who thinks I was out of line for checking in the code
> after Paolo's message?

For those two lines in lower-subreg.c, absolutely! :-)

Ciao!
Steven


* Re: Combine four insns
  2010-08-19 22:21                                       ` Eric Botcazou
  2010-08-19 22:37                                         ` Bernd Schmidt
@ 2010-08-19 23:34                                         ` Andrew Pinski
  2010-08-19 23:40                                           ` Bernd Schmidt
  2010-08-20 10:21                                           ` Paolo Bonzini
  1 sibling, 2 replies; 129+ messages in thread
From: Andrew Pinski @ 2010-08-19 23:34 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Bernd Schmidt, gcc-patches, Mark Mitchell

On Thu, Aug 19, 2010 at 3:14 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> >for the second one you checked in something without approval
>>
>> I don't believe this is the case.  Where, specifically?
>
> Your message:
>  http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02214.html
>
> Paolo's reply:
>  http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02226.html

That reply is a bit weird because it does seem like an approval for
almost all of the patch.  Paolo did mention he could not approve
those two lines, but it seems like he was saying to go ahead and
apply them anyway.  Maybe I would have waited a few more days before
applying it, or asked for clarification to make sure people would not
disagree with those two lines.  It is a tough call in my mind about
this patch and those two lines, though, since they increased compile
time by enabling a whole new pass which was not there before.

-- Pinski


* Re: Combine four insns
  2010-08-19 23:34                                         ` Andrew Pinski
@ 2010-08-19 23:40                                           ` Bernd Schmidt
  2010-08-20  1:59                                             ` Mark Mitchell
  2010-08-20 10:21                                           ` Paolo Bonzini
  1 sibling, 1 reply; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-19 23:40 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Eric Botcazou, gcc-patches, Mark Mitchell

On 08/20/2010 12:26 AM, Andrew Pinski wrote:
> On Thu, Aug 19, 2010 at 3:14 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>>>> for the second one you checked in something without approval
>>>
>>> I don't believe this is the case.  Where, specifically?
>>
>> Your message:
>>  http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02214.html
>>
>> Paolo's reply:
>>  http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02226.html
> 
> That reply is a bit weird because it does seem like an approval for
> almost all of the patch.  Paolo did mention he could not approve
> those two lines, but it seems like he was saying to go ahead and
> apply them anyway.  Maybe I would have waited a few more days before
> applying it, or asked for clarification to make sure people would not
> disagree with those two lines.

I did not expect that reasonable people would disagree with these two
lines.  I still think they count as obvious if the rest is approved.
Who disagrees?

>  It is a tough call in my
> mind about this patch and those two lines, though, since they
> increased compile time by enabling a whole new pass which was not
> there before.

Not really: it's only run if there are DImode regs, and I showed in the
thread that bootstrap times are unaffected.  And the pass was there
before; it had been enabled previously and was only disabled due to
bugs.  I fixed the bugs, made it faster, and reenabled it in the
specific case where it can be beneficial.  I find it hard to believe
that if the first two parts are approved, the last part counts as
"checking something in without approval".


Bernd


* Re: Combine four insns
  2010-08-19 18:13                                 ` Bernd Schmidt
                                                     ` (2 preceding siblings ...)
  2010-08-19 22:14                                   ` Eric Botcazou
@ 2010-08-20  0:21                                   ` H.J. Lu
  2010-08-20  0:36                                     ` Bernd Schmidt
  2010-08-20 14:07                                     ` H.J. Lu
  3 siblings, 2 replies; 129+ messages in thread
From: H.J. Lu @ 2010-08-20  0:21 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Eric Botcazou, gcc-patches, Mark Mitchell, Richard Guenther,
	David Daney, Andi Kleen, Steven Bosscher

On Thu, Aug 19, 2010 at 10:37 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 08/19/2010 09:38 AM, Eric Botcazou wrote:
>>> Mark said the plan was sensible, so I think there is no tie.
>>
>> Sorry, this is such a bad decision in my opinion, as it will set a precedent
>> for one-percent-slowdown-for-very-little-benefit patches, that I think an
>> explicit OK is in order.
>
> We're no longer discussing the 1% slower patch.  I'll even agree that
> that's a bit excessive, and the approval for it surprised me, but it
> served to get a discussion going.  Several people provided datapoints
> indicating that time spent in the optimizers at -O2 or higher is
> something that just isn't on the radar as a valid concern based both on
> usage patterns and profiling results which show that most time is spent
> elsewhere.
>
> As for the patch itself, Michael Matz provided constructive feedback
> which led to a heuristic that eliminated a large number of combine-4
> attempts.  I conclude that either you didn't read the thread before
> attempting once again to block one of my patches, or the above is more
> than a little disingenuous.
>
> The following is a slightly updated variant of the previous patch I
> posted.  I fixed a bug and slightly cleaned up the in_feeds_im logic
> (insn_a_feeds_b isn't valid for insns that aren't consecutive in the
> combination), and I found a way to slightly relax the heuristic in order
> to use Michael's suggestion of allowing combinations if there are two or
> more binary operations with constant operand.
>
> On i686, the heuristic reduces the combine-4 attempts to slightly over a
> third of the ones tried by the first patch (and presumably disallows
> them in most cases where we'd generate overly large RTL).  Hence, I
> would have expected this patch to cause slowdowns in the 0.4% range, but
> when I ran a few bootstraps today, I had a few results near 99m5s user
> time with the patch, and the best run without the patch came in at
> 99m10s.  I don't have enough data to be sure, but some test runs gave me
> the impression that there is one change, using insn_a_feeds_b instead of
> reg_overlap_mentioned_p, which provided some speedup, and may explain
> this result.  It still seems a little odd.
>
> Allowing unary ops in the heuristic as well made hardly a difference in
> output, and appeared to cost around 0.3%, so I left it out.
>
> Bootstrapped and regression tested on i686-linux.  Committed.
>

This may have caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45350


-- 
H.J.


* Re: Combine four insns
  2010-08-20  0:21                                   ` H.J. Lu
@ 2010-08-20  0:36                                     ` Bernd Schmidt
  2010-08-20 14:07                                     ` H.J. Lu
  1 sibling, 0 replies; 129+ messages in thread
From: Bernd Schmidt @ 2010-08-20  0:36 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Eric Botcazou, gcc-patches, Mark Mitchell, Richard Guenther,
	David Daney, Andi Kleen, Steven Bosscher

[-- Attachment #1: Type: text/plain, Size: 317 bytes --]

On 08/20/2010 12:53 AM, H.J. Lu wrote:
> This may have caused:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45350

Hmm.  FIND_REG_INC_NOTE expands to 0 on i686, so it didn't show up
there.  Anyway, both the set and the use are conditional on i1 != NULL,
so I've committed the following after a bootstrap.


Bernd

[-- Attachment #2: fix-warning.diff --]
[-- Type: text/plain, Size: 1501 bytes --]

Index: ChangeLog
===================================================================
--- ChangeLog	(revision 163388)
+++ ChangeLog	(working copy)
@@ -1,3 +1,9 @@
+2010-08-19  Bernd Schmidt  <bernds@codesourcery.com>
+
+	PR bootstrap/45350
+	* combine.c (try_combine): Initialize i1_is_used and i2_is_used.  Fix
+	a comment.
+
 2010-08-19  Nathan Froyd  <froydnj@codesourcery.com>
 
 	* target.def (function_arg, function_incoming_arg): Remove const
Index: combine.c
===================================================================
--- combine.c	(revision 163383)
+++ combine.c	(working copy)
@@ -2511,7 +2511,7 @@ try_combine (rtx i3, rtx i2, rtx i1, rtx
   /* Total number of SETs to put into I3.  */
   int total_sets;
   /* Nonzero if I2's or I1's body now appears in I3.  */
-  int i2_is_used, i1_is_used;
+  int i2_is_used = 0, i1_is_used = 0;
   /* INSN_CODEs for new I3, new I2, and user of condition code.  */
   int insn_code_number, i2_code_number = 0, other_code_number = 0;
   /* Contains I3 if the destination of I3 is used in its source, which means
@@ -2546,8 +2546,8 @@ try_combine (rtx i3, rtx i2, rtx i1, rtx
   int i;
 
   /* Only try four-insn combinations when there's high likelihood of
-     success.  Look for simple insns, such as loads of constants, unary
-     operations, or binary operations involving a constant.  */
+     success.  Look for simple insns, such as loads of constants or
+     binary operations involving a constant.  */
   if (i0)
     {
       int i;

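For reference, the warning pattern behind PR45350 reduces to roughly
the following (a hypothetical standalone example, not GCC code,
assuming the PR is the usual may-be-used-uninitialized failure under
-Werror):

  /* The write and the read of USED are guarded by the same condition,
     but the compiler cannot always prove that, so -Wuninitialized can
     fire.  On i686 the read folded away because FIND_REG_INC_NOTE
     expands to 0 there.  Initializing the variable, as the patch does
     for i1_is_used and i2_is_used, silences the warning.  */
  int
  f (int *p)
  {
    int used;
    if (p)
      used = *p;		/* Set only when P is nonnull...  */
    return p ? used : 0;	/* ...and read only when P is nonnull.  */
  }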

* Re: Combine four insns
  2010-08-19 23:40                                           ` Bernd Schmidt
@ 2010-08-20  1:59                                             ` Mark Mitchell
  2010-08-20  3:56                                               ` Diego Novillo
  2010-08-20 14:14                                               ` Eric Botcazou
  0 siblings, 2 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-20  1:59 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Andrew Pinski, Eric Botcazou, gcc-patches

Bernd Schmidt wrote:

>>> Paolo's reply:
>>>  http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02226.html

> I did not expect that reasonable people would disagree with these two
> lines.  I still think they count as obvious if the rest is approved.
> Who disagrees?

I suppose Paolo could comment as to what he intended.  It seems pretty
reasonable to me, though, to say that if you can approve dataflow
changes, you can approve insertion of a call to the dataflow machinery.

I don't want to see it get too procedural to get something checked in.
I'd rather have a culture where it's not too hard to get things in, but
where we are responsive to problems raised after the patch is checked
in.  I'm less concerned about something going in than about people being
unwilling to take something back out, or to fix something that's broken.
 You have my assurance that if Bernd breaks something, and refuses to
fix it, his CodeSourcery management will poke him with a sharp stick.

Eric, did you object after Bernd checked in his patch?  He posted a
commit message in:

http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02249.html

and the mail archives don't show a follow-up from you.  That could be
because it flowed into the next month and I can't find it, though.  In any
case, I think that if you object to an interpretation of an approval
message it's appropriate for you to speak up and we can resolve it at
that point.

In any case, are you objecting to the change now?  If so, what's your
concern?

Thanks,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


* Re: Combine four insns
  2010-08-20  1:59                                             ` Mark Mitchell
@ 2010-08-20  3:56                                               ` Diego Novillo
  2010-08-20  4:13                                                 ` Mark Mitchell
  2010-08-20 14:14                                               ` Eric Botcazou
  1 sibling, 1 reply; 129+ messages in thread
From: Diego Novillo @ 2010-08-20  3:56 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Bernd Schmidt, Andrew Pinski, Eric Botcazou, gcc-patches

On 10-08-19 20:22 , Mark Mitchell wrote:

> I don't want to see it get too procedural to get something checked in.
> I'd rather have a culture where it's not too hard to get things in, but
> where we are responsive to problems raised after the patch is checked
> in.

Agreed.  I'm not very concerned about getting patches in that may break 
something, even if it may not be immediately obvious at the time.  I 
would much rather roll patches back than agonize over whether every last
bit of a given patch meets some impossible standard of perfection.

I do not have an opinion on this particular patch as I have not followed 
this thread closely enough.  I just wanted to offer my vote for making
patch acceptance as painless as possible.  There is sufficient testing
spread around that detecting bad commits is rarely the problem.


Diego.


* Re: Combine four insns
  2010-08-20  3:56                                               ` Diego Novillo
@ 2010-08-20  4:13                                                 ` Mark Mitchell
  0 siblings, 0 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-20  4:13 UTC (permalink / raw)
  To: Diego Novillo; +Cc: Bernd Schmidt, Andrew Pinski, Eric Botcazou, gcc-patches

Diego Novillo wrote:

>> I don't want to see it get too procedural to get something checked in.
>> I'd rather have a culture where it's not too hard to get things in, but
>> where we are responsive to problems raised after the patch is checked
>> in.
> 
> Agreed.  I'm not very concerned about getting patches in that may break
> something, even if it may not be immediately obvious at the time.  I
> would much rather roll patches back than agonize over whether every last
> bit of a given patch meets some impossible standard of perfection.

Right.  I think that in the past we've tried to use patch approval as a
way of dealing with the problem of people who check something in and
then disappear.  We try to make the patch perfect up front so that we
don't have to deal with a patch that's been abandoned.

I'm not saying that's not a problem; sometimes people *have* abandoned
patches and that's bad.  But, we know that in many cases we can trust
people to clean up after themselves.  Many contributors have a strong
track record of doing that, and for contributors affiliated with a
corporation it's reasonable to ask others at the corporation to fix
problems, even if the original contributor is MIA.

So, I think we should use reasonable judgment.  I'm all for review, but
I think it's more important that people be willing to deal with the
inevitable post-commit problem than that we strictly follow any
particular pre-commit procedure.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


* Re: Combine four insns
  2010-08-19 23:34                                         ` Andrew Pinski
  2010-08-19 23:40                                           ` Bernd Schmidt
@ 2010-08-20 10:21                                           ` Paolo Bonzini
  1 sibling, 0 replies; 129+ messages in thread
From: Paolo Bonzini @ 2010-08-20 10:21 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Eric Botcazou, Bernd Schmidt, gcc-patches, Mark Mitchell

On 08/20/2010 12:26 AM, Andrew Pinski wrote:
>> >  Your message:
>> >    http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02214.html
>> >
>> >  Paolo's reply:
>> >    http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02226.html
> That reply is a bit weird because it does seem like an approval for
> almost all of the patch.  Paolo did mention he could not approve
> those two lines, but it seems like he was saying to go ahead and
> apply them anyway.

Yes, that's why I said "I could not... but hey" instead of "I cannot".

> It seems pretty
> reasonable to me, though, to say that if you can approve dataflow
> changes, you can approve insertion of a call to the dataflow machinery.

It's really not clear what dataflow changes represent.  90% of the RTL 
pipeline uses dataflow, but that's obviously not making me a 90%-of-RTL 
maintainer.

However, in this specific case I thought it was clear that we _did_ want 
some kind of sub-register-granularity DCE.  The old byte-level DCE pass 
was approved and then disabled, but the reason for disabling it was 
regressions rather than compile time.  Bernd's word-level DCE does the 
same thing in the cases that matter, while being simpler and faster, so 
it seemed like a win-win situation.

A comment regarding compile time: I agree that constructive discussion 
of the different approaches to optimization benefits the compiler. 
However, I'd focus more on entire passes that have quadratic bottlenecks, 
such as ZEE (which I fear is going to become another SEE...).

Paolo


* Re: Combine four insns
  2010-08-20  0:21                                   ` H.J. Lu
  2010-08-20  0:36                                     ` Bernd Schmidt
@ 2010-08-20 14:07                                     ` H.J. Lu
  1 sibling, 0 replies; 129+ messages in thread
From: H.J. Lu @ 2010-08-20 14:07 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Eric Botcazou, gcc-patches, Mark Mitchell, Richard Guenther,
	David Daney, Andi Kleen, Steven Bosscher

On Thu, Aug 19, 2010 at 3:53 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Thu, Aug 19, 2010 at 10:37 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>> On 08/19/2010 09:38 AM, Eric Botcazou wrote:
>>>> Mark said the plan was sensible, so I think there is no tie.
>>>
>>> Sorry, this is such a bad decision in my opinion, as it will set a precedent
>>> for one-percent-slowdown-for-very-little-benefit patches, that I think an
>>> explicit OK is in order.
>>
>> We're no longer discussing the 1% slower patch.  I'll even agree that
>> that's a bit excessive, and the approval for it surprised me, but it
>> served to get a discussion going.  Several people provided datapoints
>> indicating that time spent in the optimizers at -O2 or higher just
>> isn't on the radar as a valid concern, based both on usage patterns
>> and on profiling results showing that most time is spent elsewhere.
>>
>> As for the patch itself, Michael Matz provided constructive feedback
>> which led to a heuristic that eliminated a large number of combine-4
>> attempts.  I conclude that either you didn't read the thread before
>> attempting once again to block one of my patches, or the above is more
>> than a little disingenuous.
>>
>> The following is a slightly updated variant of the previous patch I
>> posted.  I fixed a bug and slightly cleaned up the in_feeds_im logic
>> (insn_a_feeds_b isn't valid for insns that aren't consecutive in the
>> combination), and I found a way to slightly relax the heuristic in order
>> to use Michael's suggestion of allowing combinations if there are two or
>> more binary operations with a constant operand (see the sketch below).
>>
>> On i686, the heuristic reduces the combine-4 attempts to slightly over a
>> third of the ones tried by the first patch (and presumably disallows
>> them in most cases where we'd generate overly large RTL).  Hence, I
>> would have expected this patch to cause slowdowns in the 0.4% range, but
>> when I ran a few bootstraps today, I had a few results near 99m5s user
>> time with the patch, and the best run without the patch came in at
>> 99m10s.  I don't have enough data to be sure, but some test runs gave me
>> the impression that there is one change, using insn_a_feeds_b instead of
>> reg_overlap_mentioned_p, which provided some speedup, and may explain
>> this result.  It still seems a little odd.
>>
>> Allowing unary ops in the heuristic as well hardly made any difference
>> in the output, and appeared to cost around 0.3%, so I left it out.
>>
>> Bootstrapped and regression tested on i686-linux.  Committed.
>>
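
For reference, a minimal sketch of the gating heuristic described above 
(hypothetical code, not the committed combine.c change; only the 
standard rtl.h accessors single_set, SET_SRC, BINARY_P, CONSTANT_P and 
XEXP are assumed): attempt the four-insn combination only when at least 
two of the candidate insns compute a binary operation with a constant 
operand.

    /* Sketch only -- not the committed code.  Count candidate insns
       whose SET source is a binary rtx with a constant operand;
       those are the chains that a four-insn combination tends to
       simplify into one or two insns.  */
    static bool
    combine4_looks_profitable (rtx i0, rtx i1, rtx i2, rtx i3)
    {
      rtx insns[4] = { i0, i1, i2, i3 };
      int i, ngood = 0;

      for (i = 0; i < 4; i++)
        {
          rtx set = single_set (insns[i]);
          if (set != NULL_RTX)
            {
              rtx src = SET_SRC (set);
              if (BINARY_P (src)
                  && (CONSTANT_P (XEXP (src, 0))
                      || CONSTANT_P (XEXP (src, 1))))
                ngood++;
            }
        }
      return ngood >= 2;
    }
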
>
> This may have caused:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45350
>

It may also have caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45355

-- 
H.J.


* Re: Combine four insns
  2010-08-20  1:59                                             ` Mark Mitchell
  2010-08-20  3:56                                               ` Diego Novillo
@ 2010-08-20 14:14                                               ` Eric Botcazou
  2010-08-20 14:36                                                 ` Paolo Bonzini
  2010-08-20 14:44                                                 ` Mark Mitchell
  1 sibling, 2 replies; 129+ messages in thread
From: Eric Botcazou @ 2010-08-20 14:14 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Bernd Schmidt, Andrew Pinski, gcc-patches, Paolo Bonzini

> I suppose Paolo could comment as to what he intended.  It seems pretty
> reasonable to me, though, to say that if you can approve dataflow
> changes, you can approve insertion of a call to the dataflow machinery.

That's arguably questionable, given that Paolo himself said that he actually 
could not approve the two lines.

> I don't want to see it get too procedural to get something checked in.
> I'd rather have a culture where it's not too hard to get things in, but
> where we are responsive to problems raised after the patch is checked
> in.  I'm less concerned about something going in than about people being
> unwilling to take something back out, or to fix something that's broken.
>  You have my assurance that if Bernd breaks something, and refuses to
> fix it, his CodeSourcery management will poke him with a sharp stick.

Thanks, but that wasn't really necessary given Bernd's responsiveness.

> Eric, did you object after Bernd checked in his patch?  He posted a
> commit message in:
>
> http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02249.html
>
> and the mail archives don't show a follow-up from you.  It could be
> that it flowed into the next month and I can't find it, though.  In any
> case, I think that if you object to an interpretation of an approval
> message it's appropriate for you to speak up and we can resolve it at
> that point.

No, I didn't; I only replied to the first message in the thread:
  http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02214.html
and stated again my disagreement with Bernd's approach.  But I'm not a fanatic 
either; the end result was reasonable, so I stopped there.

> In any case, are you objecting to the change now?

No, it's in, let's keep it in.

> If so, what's your concern?

I was replying to Bernd's accusation of "attempting once again to block one of 
[his] patches".  Of the three problematic patches, I backed down on the first 
one and didn't say anything when the second one was installed without 
(indisputable) approval.  Great attempts at blocking something.  The third 
one, yes, I tried to block in its original form, for the reasons already 
stated.

I think it's just responsible maintainership.  I'm often at the other end of 
the review table; I sometimes try to argue with the maintainer, and in some 
cases I was frustrated because I didn't understand at all why one of my 
patches was rejected.  I never called the maintainer's position absurd or 
declared that the review process had failed.

I reviewed several patches from Bernd over the past year but, after the last 
two, I think I'm done with that for a few years to come. :-)  Therefore I'd 
suggest, if he agrees of course, that the SC promote Paolo to whatever 
position it deems appropriate for him to be able to review patches touching 
all the RTL optimizers.

-- 
Eric Botcazou


* Re: Combine four insns
  2010-08-20 14:14                                               ` Eric Botcazou
@ 2010-08-20 14:36                                                 ` Paolo Bonzini
  2010-08-21 13:46                                                   ` Eric Botcazou
  2010-08-20 14:44                                                 ` Mark Mitchell
  1 sibling, 1 reply; 129+ messages in thread
From: Paolo Bonzini @ 2010-08-20 14:36 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Mark Mitchell, Bernd Schmidt, Andrew Pinski, gcc-patches

On 08/20/2010 04:06 PM, Eric Botcazou wrote:
>> I suppose Paolo could comment as to what he intended.  It seems pretty
>> reasonable to me, though, to say that if you can approve dataflow
>> changes, you can approve insertion of a call to the dataflow machinery.
>
> That's arguably questionable, given that Paolo himself said that he actually
> could not approve the two lines.

I think that's a matter of common sense, and I explained this in my 
other email.

BTW, English makes things a bit more complicated here, because my 
"could" was a present conditional, not a past simple.  So Paolo never 
said he "could not approve the two lines"; he said he "could not have 
approved the two lines", but did so nevertheless. :)

> I reviewed several patches from Bernd over the past year but, after the last
> two, I think I'm done with that for a few years to come. :-)  Therefore I'd
> suggest, if he agrees of course, that the SC promote Paolo to whatever
> position it deems appropriate for him to be able to review patches touching
> all the RTL optimizers.

I don't think this is useful, for two reasons.  First, because I think 
everybody appreciates your attention to detail and your attempts to 
suggest alternative approaches (which usually even came with patches; 
that is beyond what most maintainers do!).  It's understandable that 
this can lead to some frustration, but I'm sure that Bernd also 
appreciates the quality of your contributions.

Second, because in any case I don't think I would be a good RTL 
maintainer.  In fact, I tend to criticize patches in the same way, but I 
hardly have time to write the code... :)

Paolo


* Re: Combine four insns
  2010-08-20 14:14                                               ` Eric Botcazou
  2010-08-20 14:36                                                 ` Paolo Bonzini
@ 2010-08-20 14:44                                                 ` Mark Mitchell
  1 sibling, 0 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-20 14:44 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Bernd Schmidt, Andrew Pinski, gcc-patches, Paolo Bonzini

Eric Botcazou wrote:

> That's arguably questionable, given that Paolo himself said that he actually 
> could not approve the two lines.

Paolo's now confirmed that he meant his comment as an approval (which
was how I read it, FWIW).  I guess we could criticize Paolo for
approving something he couldn't approve, and/or criticize Bernd for
accepting an approval by someone not allowed to issue the approval, but
that seems somewhat legalistic.  Anyhow, given that we all agree that
the end result was reasonable, I think it's water under the bridge.

> I reviewed several patches from Bernd over the past year but, after the last 
> two, I think I'm done with that for a few years to come. :-)  

I think that would be unfortunate.  I can certainly see the personal
friction, but both you and Bernd are reasonable people and I think that
you can find a way to work together effectively.

I think it will help if we make sure that criticisms are clear and
detailed ("I don't like this because it does X, which has consequence Y,
whereas I think it better to do Z so that Y does not occur") and avoid
accusing anyone of bad behavior ("ignoring" things, "misrepresenting"
things, etc.).  These are good general rules for all of us.

> Therefore I'd
> suggest, if he agrees of course, that the SC promote Paolo to whatever 
> position it deems appropriate for him to be able to review patches touching 
> all the RTL optimizers.

I'd be happy to take that suggestion to the SC, but it looks like Paolo
doesn't want the job.

Thanks,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


* Re: Combine four insns
  2010-08-20 14:36                                                 ` Paolo Bonzini
@ 2010-08-21 13:46                                                   ` Eric Botcazou
  2010-08-23 17:03                                                     ` Richard Guenther
  0 siblings, 1 reply; 129+ messages in thread
From: Eric Botcazou @ 2010-08-21 13:46 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Mark Mitchell, Bernd Schmidt, Andrew Pinski, gcc-patches

> I don't think this is useful, for two reasons.  First, because I think
> everybody appreciates your attention to detail and your attempts to
> suggest alternative approaches (which usually even came with patches;
> that is beyond what most maintainers do!).  It's understandable that
> this can lead to some frustration, but I'm sure that Bernd also
> appreciates the quality of your contributions.

I'm not resigning either, but I think that it could be useful to have a second 
RTL maintainer in the current situation.

> Second, because in any case I don't think I would be a good RTL
> maintainer.  In fact, I tend to criticize patches in the same way, but I
> hardly have time to write the code... :)

The number of RTL optimizers not covered by a specific maintainership isn't 
very large, so I don't think you would have much more work than with DF alone.
And there are the global reviewers.  But, in the end, it's your decision, of 
course.

-- 
Eric Botcazou


* Re: Combine four insns
  2010-08-21 13:46                                                   ` Eric Botcazou
@ 2010-08-23 17:03                                                     ` Richard Guenther
  2010-08-23 17:08                                                       ` Mark Mitchell
  0 siblings, 1 reply; 129+ messages in thread
From: Richard Guenther @ 2010-08-23 17:03 UTC (permalink / raw)
  To: Eric Botcazou
  Cc: Paolo Bonzini, Mark Mitchell, Bernd Schmidt, Andrew Pinski, gcc-patches

On Sat, Aug 21, 2010 at 2:56 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> I don't think this is useful, for two reasons.  First, because I think
>> everybody appreciates your attention to detail and your attempts to
>> suggest alternative approaches (which usually even came with patches;
>> that is beyond what most maintainers do!).  It's understandable that
>> this can lead to some frustration, but I'm sure that Bernd also
>> appreciates the quality of your contributions.
>
> I'm not resigning either, but I think that it could be useful to have a second
> RTL maintainer in the current situation.

We do have capable people in the area who are global reviewers (like
rth or Jeff or Bernd himself).  And I wouldn't know whom to appoint - maybe
Richard Sandiford.

Richard.


* Re: Combine four insns
  2010-08-23 17:03                                                     ` Richard Guenther
@ 2010-08-23 17:08                                                       ` Mark Mitchell
  0 siblings, 0 replies; 129+ messages in thread
From: Mark Mitchell @ 2010-08-23 17:08 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Eric Botcazou, Paolo Bonzini, Bernd Schmidt, Andrew Pinski, gcc-patches

Richard Guenther wrote:

> We do have capable people in the area that are global reviewers (like
> rth or Jeff or Bernd himself).  And I wouldn't know whom to appoint - maybe
> Richard Sandiford.

I would be happy to add Richard S. or Paolo B. as RTL reviewers.

I think the problem here is not that there are too few people with
approval rights (after all, RTH or Jeff could certainly have reviewed
the patch, as they have global review privileges), but simply a lack of
time to do so.  So I think it's more useful to add a reviewer who
doesn't already have global review privileges.

Thanks,

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713


end of thread

Thread overview: 129+ messages
2010-08-06 14:49 Combine four insns Bernd Schmidt
2010-08-06 15:04 ` Richard Guenther
2010-08-06 20:08   ` Bernd Schmidt
2010-08-06 20:37     ` Richard Guenther
2010-08-06 21:53       ` Jeff Law
2010-08-06 22:41       ` Bernd Schmidt
2010-08-06 23:47         ` Richard Guenther
2010-08-07  8:11           ` Eric Botcazou
2010-08-09 12:29             ` Bernd Schmidt
2010-08-09 12:39               ` Steven Bosscher
2010-08-09 13:48                 ` Bernd Schmidt
2010-08-10  2:51                 ` Laurynas Biveinis
2010-08-09 12:41               ` Michael Matz
2010-08-09 14:34                 ` Bernd Schmidt
2010-08-09 14:39               ` Toon Moene
2010-08-09 14:50                 ` Steven Bosscher
2010-08-09 14:58                   ` The speed of the compiler, was: " Toon Moene
2010-08-09 15:00                     ` Paul Koning
2010-08-09 15:33                     ` Diego Novillo
2010-08-09 15:53                       ` Mark Mitchell
2010-08-09 17:15                       ` Toon Moene
2010-08-09 17:19                         ` Diego Novillo
2010-08-09 17:29                           ` Toon Moene
2010-08-09 23:24                             ` Chris Lattner
2010-08-10 13:02                               ` Toon Moene
2010-08-10 15:36                                 ` Chris Lattner
2010-08-10 14:58                               ` Andi Kleen
2010-08-10 15:03                                 ` Richard Guenther
2010-08-10 15:32                                   ` Andi Kleen
2010-08-10 20:09                                     ` Tom Tromey
2010-08-10 20:23                                       ` Andi Kleen
2010-08-10 22:40                                         ` Mike Stump
2010-08-10 23:16                                         ` Tom Tromey
2010-08-12 21:09                                       ` Nathan Froyd
2010-08-17 15:14                                       ` Mark Mitchell
2010-08-10 20:15                                 ` H.J. Lu
2010-08-12 21:38                                 ` Vectorized _cpp_clean_line Richard Henderson
2010-08-12 22:18                                   ` Andi Kleen
2010-08-12 22:32                                     ` Richard Henderson
2010-08-12 23:10                                       ` Richard Henderson
2010-08-12 23:13                                         ` Richard Henderson
2010-08-13  8:33                                         ` Andi Kleen
2010-08-13  7:26                                       ` Andi Kleen
2010-08-14 17:14                                         ` [CFT, v4] " Richard Henderson
2010-08-17 16:59                                           ` Steve Ellcey
2010-08-17 17:21                                             ` Richard Henderson
2010-08-17 20:32                                               ` Steve Ellcey
2010-08-18 17:14                                                 ` Steve Ellcey
2010-08-17 17:32                                             ` Jakub Jelinek
2010-08-18  3:23                                           ` Tom Tromey
     [not found]                                           ` <1281998097.3725.3.camel@gargoyle>
     [not found]                                             ` <4C69C317.2080207@redhat.com>
     [not found]                                               ` <1282142212.3725.6.camel@gargoyle>
     [not found]                                                 ` <4C6BF5F7.7040100@redhat.com>
     [not found]                                                   ` <1282149264.3725.15.camel@gargoyle>
     [not found]                                                     ` <4C6C0D92.7080100@redhat.com>
     [not found]                                                       ` <1282151361.3725.19.camel@gargoyle>
     [not found]                                                         ` <4C6C166A.90306@redhat.com>
     [not found]                                                           ` <1282152938.3725.27.camel@gargoyle>
     [not found]                                                             ` <4C6C39DB.8070409@redhat.com>
2010-08-18 21:50                                                               ` Richard Henderson
2010-08-19 14:12                                                                 ` Luis Machado
2010-08-09 17:27                       ` The speed of the compiler, was: Re: Combine four insns Joseph S. Myers
2010-08-09 18:23                         ` Diego Novillo
2010-08-10  6:20                         ` Chiheng Xu
2010-08-10  7:22                           ` Chiheng Xu
2010-08-09 17:34                       ` Steven Bosscher
2010-08-09 17:36                         ` Diego Novillo
2010-08-09 23:13                           ` Cary Coutant
2010-08-09 18:59                       ` The speed of the compiler Ralf Wildenhues
2010-08-09 19:04                         ` Diego Novillo
2010-08-09 21:12                       ` The speed of the compiler, was: Re: Combine four insns Mike Stump
2010-08-09 23:48                         ` Cary Coutant
2010-08-10 14:37       ` Bernd Schmidt
2010-08-10 14:40         ` Richard Guenther
2010-08-10 14:49           ` Bernd Schmidt
2010-08-10 15:06             ` Steven Bosscher
2010-08-10 15:06           ` Steven Bosscher
2010-08-10 16:27             ` Andi Kleen
2010-08-10 16:47               ` Steven Bosscher
2010-08-10 16:55                 ` Andi Kleen
2010-08-10 17:03                   ` David Daney
2010-08-11  8:53                     ` Richard Guenther
2010-08-16 20:42                       ` Mark Mitchell
2010-08-16 20:45                         ` Bernd Schmidt
2010-08-16 21:03                           ` Mark Mitchell
2010-08-18 20:50                           ` Eric Botcazou
2010-08-18 22:03                             ` Bernd Schmidt
2010-08-19  8:04                               ` Eric Botcazou
2010-08-19 15:44                                 ` Mark Mitchell
2010-08-19 18:13                                 ` Bernd Schmidt
2010-08-19 18:25                                   ` Mark Mitchell
2010-08-19 18:42                                   ` Richard Henderson
2010-08-19 22:14                                   ` Eric Botcazou
2010-08-19 22:19                                     ` Bernd Schmidt
2010-08-19 22:21                                       ` Eric Botcazou
2010-08-19 22:37                                         ` Bernd Schmidt
2010-08-19 22:53                                           ` Steven Bosscher
2010-08-19 23:34                                         ` Andrew Pinski
2010-08-19 23:40                                           ` Bernd Schmidt
2010-08-20  1:59                                             ` Mark Mitchell
2010-08-20  3:56                                               ` Diego Novillo
2010-08-20  4:13                                                 ` Mark Mitchell
2010-08-20 14:14                                               ` Eric Botcazou
2010-08-20 14:36                                                 ` Paolo Bonzini
2010-08-21 13:46                                                   ` Eric Botcazou
2010-08-23 17:03                                                     ` Richard Guenther
2010-08-23 17:08                                                       ` Mark Mitchell
2010-08-20 14:44                                                 ` Mark Mitchell
2010-08-20 10:21                                           ` Paolo Bonzini
2010-08-20  0:21                                   ` H.J. Lu
2010-08-20  0:36                                     ` Bernd Schmidt
2010-08-20 14:07                                     ` H.J. Lu
2010-08-11 12:32         ` Michael Matz
2010-08-06 15:08 ` Steven Bosscher
2010-08-06 16:45   ` Paolo Bonzini
2010-08-06 17:22     ` Steven Bosscher
2010-08-06 18:02       ` Jeff Law
2010-08-06 20:44         ` Steven Bosscher
2010-08-06 20:48           ` Richard Guenther
2010-08-06 21:49           ` Jeff Law
2010-08-06 18:56       ` Vladimir N. Makarov
2010-08-06 19:02         ` Steven Bosscher
2010-08-06 21:11         ` Chris Lattner
2010-08-08 11:40         ` Paolo Bonzini
2010-08-12  5:53         ` Ian Lance Taylor
2010-08-06 19:20   ` Bernd Schmidt
2010-08-06 19:37     ` Jeff Law
2010-08-06 19:43       ` Bernd Schmidt
2010-08-06 21:46         ` Jeff Law
2010-08-09 14:54           ` Mark Mitchell
2010-08-09 15:04             ` Bernd Schmidt
2010-08-09 16:02             ` Chris Lattner
2010-08-09 16:07               ` Richard Guenther
2010-08-09 17:28                 ` Joseph S. Myers
2010-08-09 16:19               ` Mark Mitchell
2010-08-09 17:02                 ` Chris Lattner
2010-08-10  2:50                   ` Mark Mitchell
2010-08-10 15:35                     ` Chris Lattner
