public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Another performance regression
@ 2002-09-26 12:16 Dale Johannesen
  2002-09-26 16:07 ` Richard Henderson
  0 siblings, 1 reply; 3+ messages in thread
From: Dale Johannesen @ 2002-09-26 12:16 UTC (permalink / raw)
  To: gcc-patches, gcc; +Cc: Dale Johannesen

Try the program at the bottom with -O2 -funroll-loops.  Don't worry 
about the body
of the loops; that's only important insofar as it has the right amount 
of code
to cause the inner loop to be unrolled the right number of times, 
namely 2, with
1 left over.  The unroller generates some rather stupid code here:

           /* Calculate the difference between the final and initial 
values.
              Final value may be a (plus (reg x) (const_int 1)) rtx.
              Let the following cse pass simplify this if initial value 
is
              a constant.

(there's more to it besides the expression described above)
with the expectation that cse will clean it up.  However, the second 
pass of
loop optimization pulls some, but not all, of this code out of the
outer loop, with the effect that cse can't eliminate it.  On ppc, for 
example,
the beginning of the function looks like this:

         bge- cr0,L18        ; zero-trip check for outer loop
         li r0,1	            ; unnecessary
         cmpwi cr1,r0,0      ; unnecessary
         cmpwi cr6,r0,25     ; unnecessary
L16:                        ; top of outer loop
         slwi r0,r6,2
         li r8,0
         add r7,r0,r28
         mr r10,r29
         bge+ cr6,L22        ; always false
         beq- cr1,L15        ; always false
L22:
         ... single copy of inner loop body...
L15:
         ... two copies of inner loop body, executed 12 times...
         ble L15
         ...
         blt L16
L18:

I'm not entirely sure, but I think this patch was the culprit:

2002-07-21  Richard Henderson  <rth@redhat.com>

         * loop.h (LOOP_AUTO_UNROLL): Rename from LOOP_FIRST_PASS.
         * loop.c (strength_reduce): Update.
         * toplev.c (rest_of_compilation): Do unrolling in the first
         loop pass, not the second.

This didn't happen when unrolling was done last.

So should I fix this by making the unrolling code smarter, in effect
doing cse's job?  It seems likely Roger Sayle's approach of running
gcse after loop opts would Just Work.  Is that going to go in?


int foo(char *abcd00, int abcd01, char *abcd02, int *abcd03, 
int*abcd04) {
   int abcd05, abcd06, abcd07=0, abcd08=0, abcd09, abcd10, abcd11=0;
   for (abcd05=0;abcd05<abcd01;abcd05++) {
     for(abcd06=0;abcd06<25;abcd06++) {
       if(abcd00[abcd05]==abcd06 && abcd07<2) {
         if (abcd07==0) {
           abcd09=26*abcd03[abcd06]; abcd02[abcd08++]=abcd00[abcd05]; 
abcd07=1;
         } else if (abcd07==1) {
           abcd10=abcd09+abcd03[abcd06]; 
abcd02[abcd08++]=abcd00[abcd05]; abcd07=2;
         }
       }
       if(abcd07==2) {
         abcd04[abcd11++]=abcd10; abcd07=0;
         break;
       }
     }
   }
   return abcd07;
}

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Another performance regression
  2002-09-26 12:16 Another performance regression Dale Johannesen
@ 2002-09-26 16:07 ` Richard Henderson
  2002-09-27 16:09   ` Dale Johannesen
  0 siblings, 1 reply; 3+ messages in thread
From: Richard Henderson @ 2002-09-26 16:07 UTC (permalink / raw)
  To: Dale Johannesen; +Cc: gcc-patches, gcc

On Thu, Sep 26, 2002 at 12:01:38PM -0700, Dale Johannesen wrote:
> So should I fix this by making the unrolling code smarter, in effect
> doing cse's job?

I think so, since the change to loop unrolling order was 
also done to get around a performance regression.  :-/

> It seems likely Roger Sayle's approach of running
> gcse after loop opts would Just Work.  Is that going to go in?

Not for 3.3.

The following appears to do the job.  Would you see if this
causes any other problems?


r~


	* unroll.c (simplify_cmp_and_jump_insns): New.
	(unroll_loop): Use it.  Use simplify_gen_foo+force_operand
	instead of expand_simple_foo.

Index: unroll.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/unroll.c,v
retrieving revision 1.176
diff -c -p -d -r1.176 unroll.c
*** unroll.c	8 Sep 2002 18:32:31 -0000	1.176
--- unroll.c	26 Sep 2002 22:13:46 -0000
*************** static int *splittable_regs_updates;
*** 193,198 ****
--- 193,201 ----
  
  /* Forward declarations.  */
  
+ static rtx simplify_cmp_and_jump_insns PARAMS ((enum rtx_code,
+ 						enum machine_mode,
+ 						rtx, rtx, rtx));
  static void init_reg_map PARAMS ((struct inline_remap *, int));
  static rtx calculate_giv_inc PARAMS ((rtx, rtx, unsigned int));
  static rtx initial_reg_note_copy PARAMS ((rtx, struct inline_remap *));
*************** unroll_loop (loop, insn_count, strength_
*** 843,849 ****
  			       &initial_value, &final_value, &increment,
  			       &mode))
  	{
! 	  rtx diff;
  	  rtx *labels;
  	  int abs_inc, neg_inc;
  	  enum rtx_code cc = loop_info->comparison_code;
--- 846,852 ----
  			       &initial_value, &final_value, &increment,
  			       &mode))
  	{
! 	  rtx diff, insn;
  	  rtx *labels;
  	  int abs_inc, neg_inc;
  	  enum rtx_code cc = loop_info->comparison_code;
*************** unroll_loop (loop, insn_count, strength_
*** 875,900 ****
  
  	  start_sequence ();
  
  	  /* Final value may have form of (PLUS val1 const1_rtx).  We need
  	     to convert it into general operand, so compute the real value.  */
  
! 	  if (GET_CODE (final_value) == PLUS)
! 	    {
! 	      final_value = expand_simple_binop (mode, PLUS,
! 						 copy_rtx (XEXP (final_value, 0)),
! 						 copy_rtx (XEXP (final_value, 1)),
! 						 NULL_RTX, 0, OPTAB_LIB_WIDEN);
! 	    }
  	  if (!nonmemory_operand (final_value, VOIDmode))
! 	    final_value = force_reg (mode, copy_rtx (final_value));
  
  	  /* Calculate the difference between the final and initial values.
  	     Final value may be a (plus (reg x) (const_int 1)) rtx.
- 	     Let the following cse pass simplify this if initial value is
- 	     a constant.
- 
- 	     We must copy the final and initial values here to avoid
- 	     improperly shared rtl.
  
  	     We have to deal with for (i = 0; --i < 6;) type loops.
  	     For such loops the real final value is the first time the
--- 878,897 ----
  
  	  start_sequence ();
  
+ 	  /* We must copy the final and initial values here to avoid
+ 	     improperly shared rtl.  */
+ 	  final_value = copy_rtx (final_value);
+ 	  initial_value = copy_rtx (initial_value);
+ 
  	  /* Final value may have form of (PLUS val1 const1_rtx).  We need
  	     to convert it into general operand, so compute the real value.  */
  
! 	  final_value = force_operand (final_value, NULL_RTX);
  	  if (!nonmemory_operand (final_value, VOIDmode))
! 	    final_value = force_reg (mode, final_value);
  
  	  /* Calculate the difference between the final and initial values.
  	     Final value may be a (plus (reg x) (const_int 1)) rtx.
  
  	     We have to deal with for (i = 0; --i < 6;) type loops.
  	     For such loops the real final value is the first time the
*************** unroll_loop (loop, insn_count, strength_
*** 907,924 ****
  	     so we can pretend that the overflow value is 0/~0.  */
  
  	  if (cc == NE || less_p != neg_inc)
! 	    diff = expand_simple_binop (mode, MINUS, final_value,
! 					copy_rtx (initial_value), NULL_RTX, 0,
! 					OPTAB_LIB_WIDEN);
  	  else
! 	    diff = expand_simple_unop (mode, neg_inc ? NOT : NEG,
! 				       copy_rtx (initial_value), NULL_RTX, 0);
  
  	  /* Now calculate (diff % (unroll * abs (increment))) by using an
  	     and instruction.  */
! 	  diff = expand_simple_binop (GET_MODE (diff), AND, diff,
! 				      GEN_INT (unroll_number * abs_inc - 1),
! 				      NULL_RTX, 0, OPTAB_LIB_WIDEN);
  
  	  /* Now emit a sequence of branches to jump to the proper precond
  	     loop entry point.  */
--- 904,921 ----
  	     so we can pretend that the overflow value is 0/~0.  */
  
  	  if (cc == NE || less_p != neg_inc)
! 	    diff = simplify_gen_binary (MINUS, mode, final_value,
! 					initial_value);
  	  else
! 	    diff = simplify_gen_unary (neg_inc ? NOT : NEG, mode,
! 				       initial_value, mode);
! 	  diff = force_operand (diff, NULL_RTX);
  
  	  /* Now calculate (diff % (unroll * abs (increment))) by using an
  	     and instruction.  */
! 	  diff = simplify_gen_binary (AND, mode, diff,
! 				      GEN_INT (unroll_number*abs_inc - 1));
! 	  diff = force_operand (diff, NULL_RTX);
  
  	  /* Now emit a sequence of branches to jump to the proper precond
  	     loop entry point.  */
*************** unroll_loop (loop, insn_count, strength_
*** 936,953 ****
  	  if (cc != NE)
  	    {
  	      rtx incremented_initval;
! 	      incremented_initval = expand_simple_binop (mode, PLUS,
! 							 initial_value,
! 							 increment,
! 							 NULL_RTX, 0,
! 							 OPTAB_LIB_WIDEN);
! 	      emit_cmp_and_jump_insns (incremented_initval, final_value,
! 				       less_p ? GE : LE, NULL_RTX,
! 				       mode, unsigned_p, labels[1]);
! 	      predict_insn_def (get_last_insn (), PRED_LOOP_CONDITION,
! 				TAKEN);
! 	      JUMP_LABEL (get_last_insn ()) = labels[1];
! 	      LABEL_NUSES (labels[1])++;
  	    }
  
  	  /* Assuming the unroll_number is 4, and the increment is 2, then
--- 933,954 ----
  	  if (cc != NE)
  	    {
  	      rtx incremented_initval;
! 	      enum rtx_code cmp_code;
! 
! 	      incremented_initval
! 		= simplify_gen_binary (PLUS, mode, initial_value, increment);
! 	      incremented_initval
! 		= force_operand (incremented_initval, NULL_RTX);
! 
! 	      cmp_code = (less_p
! 			  ? (unsigned_p ? GEU : GE)
! 			  : (unsigned_p ? LEU : LE));
! 
! 	      insn = simplify_cmp_and_jump_insns (cmp_code, mode,
! 						  incremented_initval,
! 						  final_value, labels[1]);
! 	      if (insn)
! 	        predict_insn_def (insn, PRED_LOOP_CONDITION, TAKEN);
  	    }
  
  	  /* Assuming the unroll_number is 4, and the increment is 2, then
*************** unroll_loop (loop, insn_count, strength_
*** 986,997 ****
  		  cmp_code = LE;
  		}
  
! 	      emit_cmp_and_jump_insns (diff, GEN_INT (abs_inc * cmp_const),
! 				       cmp_code, NULL_RTX, mode, 0, labels[i]);
! 	      JUMP_LABEL (get_last_insn ()) = labels[i];
! 	      LABEL_NUSES (labels[i])++;
! 	      predict_insn (get_last_insn (), PRED_LOOP_PRECONDITIONING,
! 			    REG_BR_PROB_BASE / (unroll_number - i));
  	    }
  
  	  /* If the increment is greater than one, then we need another branch,
--- 987,998 ----
  		  cmp_code = LE;
  		}
  
! 	      insn = simplify_cmp_and_jump_insns (cmp_code, mode, diff,
! 						  GEN_INT (abs_inc*cmp_const),
! 						  labels[i]);
! 	      if (insn)
! 	        predict_insn (insn, PRED_LOOP_PRECONDITIONING,
! 			      REG_BR_PROB_BASE / (unroll_number - i));
  	    }
  
  	  /* If the increment is greater than one, then we need another branch,
*************** unroll_loop (loop, insn_count, strength_
*** 1019,1028 ****
  		  cmp_code = GE;
  		}
  
! 	      emit_cmp_and_jump_insns (diff, GEN_INT (cmp_const), cmp_code,
! 				       NULL_RTX, mode, 0, labels[0]);
! 	      JUMP_LABEL (get_last_insn ()) = labels[0];
! 	      LABEL_NUSES (labels[0])++;
  	    }
  
  	  sequence = get_insns ();
--- 1020,1027 ----
  		  cmp_code = GE;
  		}
  
! 	      simplify_cmp_and_jump_insns (cmp_code, mode, diff,
! 					   GEN_INT (cmp_const), labels[0]);
  	    }
  
  	  sequence = get_insns ();
*************** unroll_loop (loop, insn_count, strength_
*** 1323,1328 ****
--- 1322,1364 ----
    if (map->reg_map)
      free (map->reg_map);
    free (map);
+ }
+ 
+ /* A helper function for unroll_loop.  Emit a compare and branch to 
+    satisfy (CMP OP1 OP2), but pass this through the simplifier first.
+    If the branch turned out to be conditional, return it, otherwise
+    return NULL.  */
+ 
+ static rtx
+ simplify_cmp_and_jump_insns (code, mode, op0, op1, label)
+      enum rtx_code code;
+      enum machine_mode mode;
+      rtx op0, op1, label;
+ {
+   rtx t, insn;
+ 
+   t = simplify_relational_operation (code, mode, op0, op1);
+   if (!t)
+     {
+       enum rtx_code scode = signed_condition (code);
+       emit_cmp_and_jump_insns (op0, op1, scode, NULL_RTX, mode,
+ 			       code != scode, label);
+       insn = get_last_insn ();
+ 
+       JUMP_LABEL (insn) = label;
+       LABEL_NUSES (label) += 1;
+ 
+       return insn;
+     }
+   else if (t == const_true_rtx)
+     {
+       insn = emit_jump_insn (gen_jump (label));
+       emit_barrier ();
+       JUMP_LABEL (insn) = label;
+       LABEL_NUSES (label) += 1;
+     }
+ 
+   return NULL_RTX;
  }
  \f
  /* Return true if the loop can be safely, and profitably, preconditioned

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Another performance regression
  2002-09-26 16:07 ` Richard Henderson
@ 2002-09-27 16:09   ` Dale Johannesen
  0 siblings, 0 replies; 3+ messages in thread
From: Dale Johannesen @ 2002-09-27 16:09 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Dale Johannesen, gcc-patches, gcc


On Thursday, September 26, 2002, at 03:26 PM, Richard Henderson wrote:

> On Thu, Sep 26, 2002 at 12:01:38PM -0700, Dale Johannesen wrote:
>> So should I fix this by making the unrolling code smarter, in effect
>> doing cse's job?
>
> I think so, since the change to loop unrolling order was
> also done to get around a performance regression.  :-/
>
>> It seems likely Roger Sayle's approach of running
>> gcse after loop opts would Just Work.  Is that going to go in?
>
> Not for 3.3.
>
> The following appears to do the job.  Would you see if this
> causes any other problems?

Thanks!
Bootstrapped and tested on Darwin (C, C++, ObjC, f77), no regressions.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2002-09-27 22:11 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-26 12:16 Another performance regression Dale Johannesen
2002-09-26 16:07 ` Richard Henderson
2002-09-27 16:09   ` Dale Johannesen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).