* [patch] Partial fix for PR 18048
@ 2004-10-24 13:16 Zdenek Dvorak
2004-10-24 17:30 ` Andrew Pinski
2004-10-25 18:15 ` Zack Weinberg
0 siblings, 2 replies; 14+ messages in thread
From: Zdenek Dvorak @ 2004-10-24 13:16 UTC (permalink / raw)
To: gcc-patches
Hello,
this patch contains several improvements to ivopts and related
optimizations that improve the code produced for the mgrid benchmark
quite a bit (although even with the patch, ivopts still spoils the code).
The improvements are:
1) Ivopts produces expressions like
*(&a[start] + 8 * index)
and believes that the memory access created for it will include the
multiplication by 8. This fails if there are several similar
accesses, since dom will CSE the "8 * index" part (thus adding an
unnecessary multiplication instruction and increasing register
pressure).
The part of the patch in fold-const.c ensures that such an
expression is folded to a[start + index], thus achieving what ivopts
intended. This solution is unfortunately not generic enough to catch
all cases where something like this may happen, and it does not work
on LP64 platforms where this optimization might be incorrect, but it
seems to be the only possibility within the current framework. I
will start a discussion about a more appropriate solution for 4.1 in
a separate thread.
2) The algorithm used for finding the best set of bivs for the loop is
quite primitive (and cannot be much better due to compile-time
constraints), and as such it is likely to get stuck in local
minima, which happens in mgrid, causing ivopts to create about
30 bivs...
The patch to try_add_cand_for makes the initial set of candidates
much smaller (based primarily on the important candidates).
The idea is that getting stuck in a local minimum with too many bivs
is a disaster, but if it happens with a few bivs, it just means that
the iv uses cannot be expressed much more cheaply than they are, so we
are not that far from the optimum anyway.
3) A small improvement in number_of_iterations_cond.
build_int_cst (type, -1) is not suitable for producing an unsigned
constant with all bits one, since the bits outside of the precision
of the type are set to 1 as well, which makes fold fail to
perform some of the optimizations with such constants.
Bootstrapped & regtested on ppc, i686 and x86_64. Benchmark results
from i686 (-O2, base without the patch, peak with it) (quick summary --
minor variations except for the huge gain on mgrid; on average the
results improve even if mgrid is not taken into account):
Benchmarks Ref Time Run Time Ratio Ref Time Run Time Ratio
------------ -------- -------- -------- -------- -------- --------
164.gzip 1400 209 670 * 1400 211 663 *
175.vpr 1400 390 359 * 1400 395 354 *
176.gcc 1100 -- X 1100 -- X
181.mcf 1800 733 245 * 1800 728 247 *
186.crafty 1000 124 804 * 1000 123 814 *
197.parser 1800 401 449 * 1800 400 450 *
252.eon 1300 -- X 1300 -- X
253.perlbmk 1800 208 865 * 1800 205 879 *
254.gap 1100 179 616 * 1100 176 627 *
255.vortex 1900 231 822 * 1900 236 805 *
256.bzip2 1500 351 427 * 1500 351 427 *
300.twolf 3000 799 376 * 3000 800 375 *
168.wupwise 1600 295 543 * 1600 286 559 *
171.swim 3100 720 430 * 3100 725 428 *
172.mgrid 1800 574 314 * 1800 504 357 *
173.applu 2100 528 398 * 2100 520 404 *
177.mesa 1400 207 675 * 1400 207 675 *
178.galgel 2900 -- X 2900 -- X
179.art 2600 1458 178 * 2600 1463 178 *
183.equake 1300 294 442 * 1300 294 442 *
187.facerec 1900 473 401 * 1900 467 406 *
188.ammp 2200 615 357 * 2200 609 361 *
189.lucas 2000 436 458 * 2000 437 458 *
191.fma3d 2100 379 554 * 2100 382 549 *
200.sixtrack 1100 270 408 * 1100 265 415 *
301.apsi 2600 671 387 * 2600 665 391 *
Zdenek
PR tree-optimization/18048
* fold-const.c (try_move_mult_to_index): New function.
(fold): Use try_move_mult_to_index.
* tree-ssa-loop-ivopts.c (try_add_cand_for): Prefer common candidates.
* tree-ssa-loop-niter.c (number_of_iterations_cond): Produce
an all-ones unsigned constant without extra bits.
Index: fold-const.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/fold-const.c,v
retrieving revision 1.468
diff -c -3 -p -r1.468 fold-const.c
*** fold-const.c 11 Oct 2004 03:42:00 -0000 1.468
--- fold-const.c 23 Oct 2004 15:13:53 -0000
*************** tree_swap_operands_p (tree arg0, tree ar
*** 5967,5972 ****
--- 5967,6050 ----
return 0;
}
+ /* Tries to replace &a[idx] CODE s * delta with &a[idx CODE delta], if s is
+ step of the array. TYPE is the type of the expression. ADDR is the address.
+ MULT is the multiplicative expression. If the function succeeds, the new
+ address expression is returned. Otherwise NULL_TREE is returned. */
+
+ static tree
+ try_move_mult_to_index (tree type, enum tree_code code, tree addr, tree mult)
+ {
+ tree s, delta, step;
+ tree arg0 = TREE_OPERAND (mult, 0), arg1 = TREE_OPERAND (mult, 1);
+ tree ref = TREE_OPERAND (addr, 0), pref;
+ tree ret, pos;
+ tree itype;
+
+ STRIP_NOPS (arg0);
+ STRIP_NOPS (arg1);
+
+ if (TREE_CODE (arg0) == INTEGER_CST)
+ {
+ s = arg0;
+ delta = arg1;
+ }
+ else if (TREE_CODE (arg1) == INTEGER_CST)
+ {
+ s = arg1;
+ delta = arg0;
+ }
+ else
+ return NULL_TREE;
+
+ for (;; ref = TREE_OPERAND (ref, 0))
+ {
+ if (TREE_CODE (ref) == ARRAY_REF)
+ {
+ step = array_ref_element_size (ref);
+
+ if (TREE_CODE (step) != INTEGER_CST)
+ continue;
+
+ itype = TREE_TYPE (step);
+
+ /* If the type sizes do not match, we might run into problems
+ when one of them would overflow. */
+ if (TYPE_PRECISION (itype) != TYPE_PRECISION (type))
+ continue;
+
+ if (!operand_equal_p (step, fold_convert (itype, s), 0))
+ continue;
+
+ delta = fold_convert (itype, delta);
+ break;
+ }
+
+ if (!handled_component_p (ref))
+ return NULL_TREE;
+ }
+
+ /* We found the suitable array reference. So copy everything up to it,
+ and replace the index. */
+
+ pref = TREE_OPERAND (addr, 0);
+ ret = copy_node (pref);
+ pos = ret;
+
+ while (pref != ref)
+ {
+ pref = TREE_OPERAND (pref, 0);
+ TREE_OPERAND (pos, 0) = copy_node (pref);
+ pos = TREE_OPERAND (pos, 0);
+ }
+
+ TREE_OPERAND (pos, 1) = fold (build2 (code, itype,
+ TREE_OPERAND (pos, 1),
+ delta));
+
+ return build1 (ADDR_EXPR, type, ret);
+ }
+
/* Perform constant folding and related simplification of EXPR.
The related simplifications include x*1 => x, x*0 => 0, etc.,
and application of the associative law.
*************** fold (tree expr)
*** 6602,6607 ****
--- 6680,6703 ----
alt0, alt1)),
same));
}
+
+ /* Try replacing &a[i1] + c * i2 with &a[i1 + i2], if c is the step
+ of the array. The loop optimizer sometimes produces this type of
+ expression. */
+ if (TREE_CODE (arg0) == ADDR_EXPR
+ && TREE_CODE (arg1) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, PLUS_EXPR, arg0, arg1);
+ if (tem)
+ return fold (tem);
+ }
+ else if (TREE_CODE (arg1) == ADDR_EXPR
+ && TREE_CODE (arg0) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, PLUS_EXPR, arg1, arg0);
+ if (tem)
+ return fold (tem);
+ }
}
else
{
*************** fold (tree expr)
*** 6974,6979 ****
--- 7070,7086 ----
&diff))
return build_int_cst_type (type, diff);
}
+
+ /* Try replacing &a[i1] - c * i2 with &a[i1 - i2], if c is the step
+ of the array. The loop optimizer sometimes produces this type of
+ expression. */
+ if (TREE_CODE (arg0) == ADDR_EXPR
+ && TREE_CODE (arg1) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, MINUS_EXPR, arg0, arg1);
+ if (tem)
+ return fold (tem);
+ }
if (TREE_CODE (arg0) == MULT_EXPR
&& TREE_CODE (arg1) == MULT_EXPR
Index: tree-ssa-loop-ivopts.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-ivopts.c,v
retrieving revision 2.20
diff -c -3 -p -r2.20 tree-ssa-loop-ivopts.c
*** tree-ssa-loop-ivopts.c 18 Oct 2004 18:01:09 -0000 2.20
--- tree-ssa-loop-ivopts.c 23 Oct 2004 15:13:53 -0000
*************** try_add_cand_for (struct ivopts_data *da
*** 3603,3622 ****
bitmap act_inv = BITMAP_XMALLOC ();
unsigned i;
struct cost_pair *cp;
bitmap_copy (best_ivs, ivs);
bitmap_copy (best_inv, inv);
! for (i = 0; i < use->n_map_members; i++)
{
! cp = use->cost_map + i;
! if (cp->cost == INFTY)
continue;
bitmap_copy (act_ivs, ivs);
! bitmap_set_bit (act_ivs, cp->cand->id);
! if (cp->depends_on)
! bitmap_a_or_b (act_inv, inv, cp->depends_on);
else
bitmap_copy (act_inv, inv);
act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
--- 3603,3634 ----
bitmap act_inv = BITMAP_XMALLOC ();
unsigned i;
struct cost_pair *cp;
+ bitmap_iterator bi;
+ struct iv_cand *cand;
+ bitmap depends_on;
bitmap_copy (best_ivs, ivs);
bitmap_copy (best_inv, inv);
! /* First try important candidates. Only if it fails, try the specific ones.
! Rationale -- in loops with many variables the best choice often is to use
! just one generic biv. If we added here many ivs specific to the uses,
! the optimization algorithm later would be likely to get stuck in a local
! minimum, thus causing us to create too many ivs. The approach from
! few ivs to more seems more likely to be successful -- starting from few
! ivs, replacing an expensive use by a specific iv should always be a
! win. */
! EXECUTE_IF_SET_IN_BITMAP (data->important_candidates, 0, i, bi)
{
! cand = iv_cand (data, i);
!
! if (get_use_iv_cost (data, use, cand, &depends_on) == INFTY)
continue;
bitmap_copy (act_ivs, ivs);
! bitmap_set_bit (act_ivs, cand->id);
! if (depends_on)
! bitmap_a_or_b (act_inv, inv, depends_on);
else
bitmap_copy (act_inv, inv);
act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
*************** try_add_cand_for (struct ivopts_data *da
*** 3629,3634 ****
--- 3641,3675 ----
}
}
+ if (best_cost == INFTY)
+ {
+ for (i = 0; i < use->n_map_members; i++)
+ {
+ cp = use->cost_map + i;
+ if (cp->cost == INFTY)
+ continue;
+
+ /* Already tried this. */
+ if (cp->cand->important)
+ continue;
+
+ bitmap_copy (act_ivs, ivs);
+ bitmap_set_bit (act_ivs, cp->cand->id);
+ if (cp->depends_on)
+ bitmap_a_or_b (act_inv, inv, cp->depends_on);
+ else
+ bitmap_copy (act_inv, inv);
+ act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
+
+ if (act_cost < best_cost)
+ {
+ best_cost = act_cost;
+ bitmap_copy (best_ivs, act_ivs);
+ bitmap_copy (best_inv, act_inv);
+ }
+ }
+ }
+
bitmap_copy (ivs, best_ivs);
bitmap_copy (inv, best_inv);
Index: tree-ssa-loop-niter.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-niter.c,v
retrieving revision 2.13
diff -c -3 -p -r2.13 tree-ssa-loop-niter.c
*** tree-ssa-loop-niter.c 22 Oct 2004 17:05:08 -0000 2.13
--- tree-ssa-loop-niter.c 23 Oct 2004 15:13:53 -0000
*************** number_of_iterations_cond (tree type, tr
*** 419,427 ****
d = EXEC_BINARY (LSHIFT_EXPR, niter_type,
build_int_cst_type (niter_type, 1), bits);
s = EXEC_BINARY (RSHIFT_EXPR, niter_type, step0, bits);
! bound = EXEC_BINARY (RSHIFT_EXPR, niter_type,
! build_int_cst (niter_type, -1),
! bits);
assumption = fold (build2 (FLOOR_MOD_EXPR, niter_type, base1, d));
assumption = fold (build2 (EQ_EXPR, boolean_type_node,
--- 419,427 ----
d = EXEC_BINARY (LSHIFT_EXPR, niter_type,
build_int_cst_type (niter_type, 1), bits);
s = EXEC_BINARY (RSHIFT_EXPR, niter_type, step0, bits);
! bound = EXEC_UNARY (BIT_NOT_EXPR, niter_type,
! build_int_cst_type (niter_type, 0));
! bound = EXEC_BINARY (RSHIFT_EXPR, niter_type, bound, bits);
assumption = fold (build2 (FLOOR_MOD_EXPR, niter_type, base1, d));
assumption = fold (build2 (EQ_EXPR, boolean_type_node,
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-24 13:16 [patch] Partial fix for PR 18048 Zdenek Dvorak
@ 2004-10-24 17:30 ` Andrew Pinski
2004-10-24 17:57 ` Zdenek Dvorak
2004-10-25 18:15 ` Zack Weinberg
1 sibling, 1 reply; 14+ messages in thread
From: Andrew Pinski @ 2004-10-24 17:30 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: gcc-patches
On Oct 24, 2004, at 5:41 AM, Zdenek Dvorak wrote:
> Hello,
>
> this patch contains several improvements to ivopts and related
> optimizations that improve the code produced for the mgrid benchmark
> quite a bit (although even with the patch, ivopts still spoils the
> code).
>
> The improvements are:
>
> 1) Ivopts produces expressions like
>
> *(&a[start] + 8 * index)
>
> and believes that the memory access created for it will include the
> multiplication by 8. This fails if there are several similar
> accesses, since dom will CSE the "8 * index" part (thus adding an
> unnecessary multiplication instruction and increasing register
> pressure).
Isn't this already done by fold_stmt?
-- Pinski
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-24 17:30 ` Andrew Pinski
@ 2004-10-24 17:57 ` Zdenek Dvorak
0 siblings, 0 replies; 14+ messages in thread
From: Zdenek Dvorak @ 2004-10-24 17:57 UTC (permalink / raw)
To: Andrew Pinski; +Cc: gcc-patches
Hello,
> >this patch contains several improvements to ivopts and related
> >optimizations that improve the code produced for the mgrid benchmark
> >quite a bit (although even with the patch, ivopts still spoils the
> >code).
> >
> >The improvements are:
> >
> >1) Ivopts produces expressions like
> >
> > *(&a[start] + 8 * index)
> >
> > and believes that the memory access created for it will include the
> > multiplication by 8. This fails if there are several similar
> > accesses, since dom will CSE the "8 * index" part (thus adding an
> > unnecessary multiplication instruction and increasing register
> > pressure).
>
> Isn't this already done by fold_stmt?
no, as far as I can tell. The only thing that seems relevant
is maybe_fold_offset_to_array_ref, which however only handles
constant offsets (and for some reason I do not quite understand,
these optimizations are not included in fold, so they cannot be used
easily anyway).
Zdenek
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-24 13:16 [patch] Partial fix for PR 18048 Zdenek Dvorak
2004-10-24 17:30 ` Andrew Pinski
@ 2004-10-25 18:15 ` Zack Weinberg
2004-10-25 22:55 ` Zdenek Dvorak
2004-10-27 15:38 ` Zdenek Dvorak
1 sibling, 2 replies; 14+ messages in thread
From: Zack Weinberg @ 2004-10-25 18:15 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: gcc-patches
Zdenek Dvorak <rakdver@atrey.karlin.mff.cuni.cz> writes:
> 3) A small improvement in number_of_iterations_cond.
> build_int_cst (type, -1) is not suitable for producing an unsigned
> constant with all bits one, since the bits outside of the precision
> of the type are set to 1 as well, which makes fold fail to
> perform some of the optimizations with such constants.
...
> ! bound = EXEC_UNARY (BIT_NOT_EXPR, niter_type,
> ! build_int_cst_type (niter_type, 0));
> ! bound = EXEC_BINARY (RSHIFT_EXPR, niter_type, bound, bits);
While this is a good conceptual change, this is not the appropriate
way to do it. You are creating trees just to get fold() to do
arithmetic for you, which wastes memory, and you risk having overflow
bits set on the constants, which will also inhibit optimization.
It would be nice if force_fit_type could be used for this purpose, but
it also creates junk trees and sets overflow bits. The most efficient
way to code this logic is like so:
/* Construct an INTEGER_CST with type niter_type, whose low
(TYPE_PRECISION (niter_type) - bits) bits are all 1 and whose
remaining bits are all 0. */
{
unsigned int precision = TYPE_PRECISION (niter_type) - bits;
unsigned HOST_WIDE_INT blow;
HOST_WIDE_INT bhigh;
if (precision < HOST_BITS_PER_WIDE_INT)
{
bhigh = 0;
blow = (((unsigned HOST_WIDE_INT) 1) << precision) - 1;
}
else
{
precision -= HOST_BITS_PER_WIDE_INT;
blow = ~ (unsigned HOST_WIDE_INT) 0;
bhigh = (((unsigned HOST_WIDE_INT) 1) << precision) - 1;
}
bound = build_int_cst_wide (niter_type, blow, bhigh);
}
This replaces both the statements quoted from your patch. It would be
a good idea to do similar things to all calculations where you know
that a call to fold() will always spit out an INTEGER_CST. The
*_double functions defined in fold-const.c are frequently useful for
this.
(There may be an off-by-one error in the above, I'm not sure.)
(Needing to construct a CONST_INT for a mask value is a common thing,
and the code is tricky. Maybe we should have
extern tree tree_int_cst_{low,high}_mask (tree type, unsigned int bits);
primitives which produce masks filled from the low or the high end of
the constant.)
zw
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-25 18:15 ` Zack Weinberg
@ 2004-10-25 22:55 ` Zdenek Dvorak
2004-10-25 23:01 ` Zack Weinberg
2004-10-27 15:38 ` Zdenek Dvorak
1 sibling, 1 reply; 14+ messages in thread
From: Zdenek Dvorak @ 2004-10-25 22:55 UTC (permalink / raw)
To: Zack Weinberg; +Cc: gcc-patches
Hello,
> > 3) A small improvement in number_of_iterations_cond.
> > build_int_cst (type, -1) is not suitable for producing an unsigned
> > constant with all bits one, since the bits outside of the precision
> > of the type are set to 1 as well, which makes fold fail to
> > perform some of the optimizations with such constants.
> ...
>
> > ! bound = EXEC_UNARY (BIT_NOT_EXPR, niter_type,
> > ! build_int_cst_type (niter_type, 0));
> > ! bound = EXEC_BINARY (RSHIFT_EXPR, niter_type, bound, bits);
>
> While this is a good conceptual change, this is not the appropriate
> way to do it. You are creating trees just to get fold() to do
> arithmetic for you, which wastes memory,
no, I am not. 0 is a shared constant, so there is no allocation for
build_int_cst_type (niter_type, 0), and EXEC_UNARY is
nondestructive_fold_unary_to_constant, which also does not build
any trees (except for the result, of course).
> and you risk having overflow bits set on the constants, which will
> also inhibit optimization.
Evaluation of BIT_NOT_EXPR should not set any overflow bits (if it does,
it should be fixed).
Zdenek
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-25 22:55 ` Zdenek Dvorak
@ 2004-10-25 23:01 ` Zack Weinberg
2004-10-25 23:02 ` Zdenek Dvorak
2004-10-25 23:10 ` Zdenek Dvorak
0 siblings, 2 replies; 14+ messages in thread
From: Zack Weinberg @ 2004-10-25 23:01 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: gcc-patches
Zdenek Dvorak <rakdver@atrey.karlin.mff.cuni.cz> writes:
>> While this is a good conceptual change, this is not the appropriate
>> way to do it. You are creating trees just to get fold() to do
>> arithmetic for you, which wastes memory,
>
> no, I am not. 0 is a shared constant, so there is no allocation for
> build_int_cst_type (niter_type, 0), and EXEC_UNARY is
> nondestructive_fold_unary_to_constant, which also does not build any
> trees (except for the result, of course).
That's nice. You're still doing unnecessary work.
>> and you risk having overflow bits set on the constants, which will
>> also inhibit optimization.
>
> Evaluation of BIT_NOT_EXPR should not set any overflow bits (if it does,
> it should be fixed).
Wouldn't you rather not have to deal with the problem at all?
zw
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-25 23:01 ` Zack Weinberg
@ 2004-10-25 23:02 ` Zdenek Dvorak
2004-10-25 23:10 ` Zdenek Dvorak
1 sibling, 0 replies; 14+ messages in thread
From: Zdenek Dvorak @ 2004-10-25 23:02 UTC (permalink / raw)
To: Zack Weinberg; +Cc: gcc-patches
Hello,
> >> While this is a good conceptual change, this is not the appropriate
> >> way to do it. You are creating trees just to get fold() to do
> >> arithmetic for you, which wastes memory,
> >
> > no, I am not. 0 is a shared constant, so there is no allocation for
> > build_int_cst_type (niter_type, 0), and EXEC_UNARY is
> > nondestructive_fold_unary_to_constant, which also does not build any
> > trees (except for the result, of course).
>
> That's nice. You're still doing unnecessary work.
yes, however creating duplicated code just for the purpose of saving a
very small amount of time in a place that clearly is not critical does
not seem very useful to me. I.e., I would rather stick with the simple
two-line solution that is a bit suboptimal than implement your
20-line one.
Of course, if you want to make the change, you won't spoil anything by it.
Zdenek
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-25 23:01 ` Zack Weinberg
2004-10-25 23:02 ` Zdenek Dvorak
@ 2004-10-25 23:10 ` Zdenek Dvorak
1 sibling, 0 replies; 14+ messages in thread
From: Zdenek Dvorak @ 2004-10-25 23:10 UTC (permalink / raw)
To: Zack Weinberg; +Cc: gcc-patches
Hello,
> >> and you risk having overflow bits set on the constants, which will
> >> also inhibit optimization.
> >
> > Evaluation of BIT_NOT_EXPR should not set any overflow bits (if it does,
> > it should be fixed).
>
> Wouldn't you rather not have to deal with the problem at all?
no. If the code as it is sets overflow flags, it is clearly buggy and
has to be fixed anyway.
Zdenek
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-25 18:15 ` Zack Weinberg
2004-10-25 22:55 ` Zdenek Dvorak
@ 2004-10-27 15:38 ` Zdenek Dvorak
2004-10-27 16:22 ` Andrew Pinski
2004-10-27 19:00 ` Richard Henderson
1 sibling, 2 replies; 14+ messages in thread
From: Zdenek Dvorak @ 2004-10-27 15:38 UTC (permalink / raw)
To: Zack Weinberg; +Cc: gcc-patches
Hello,
here is the patch with Zack's comments reflected. Bootstrapped &
regtested on x86_64.
Zdenek
PR tree-optimization/18048
* fold-const.c (try_move_mult_to_index): New function.
(fold): Use try_move_mult_to_index.
* tree-ssa-loop-ivopts.c (try_add_cand_for): Prefer common candidates.
* tree-ssa-loop-niter.c (number_of_iterations_cond): Produce
an all-ones unsigned constant without extra bits.
* tree.c (build_low_bits_mask): New function.
* tree.h (build_low_bits_mask): Declare.
Index: fold-const.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/fold-const.c,v
retrieving revision 1.468
diff -c -3 -p -r1.468 fold-const.c
*** fold-const.c 11 Oct 2004 03:42:00 -0000 1.468
--- fold-const.c 27 Oct 2004 11:36:47 -0000
*************** tree_swap_operands_p (tree arg0, tree ar
*** 5967,5972 ****
--- 5967,6050 ----
return 0;
}
+ /* Tries to replace &a[idx] CODE s * delta with &a[idx CODE delta], if s is
+ step of the array. TYPE is the type of the expression. ADDR is the address.
+ MULT is the multiplicative expression. If the function succeeds, the new
+ address expression is returned. Otherwise NULL_TREE is returned. */
+
+ static tree
+ try_move_mult_to_index (tree type, enum tree_code code, tree addr, tree mult)
+ {
+ tree s, delta, step;
+ tree arg0 = TREE_OPERAND (mult, 0), arg1 = TREE_OPERAND (mult, 1);
+ tree ref = TREE_OPERAND (addr, 0), pref;
+ tree ret, pos;
+ tree itype;
+
+ STRIP_NOPS (arg0);
+ STRIP_NOPS (arg1);
+
+ if (TREE_CODE (arg0) == INTEGER_CST)
+ {
+ s = arg0;
+ delta = arg1;
+ }
+ else if (TREE_CODE (arg1) == INTEGER_CST)
+ {
+ s = arg1;
+ delta = arg0;
+ }
+ else
+ return NULL_TREE;
+
+ for (;; ref = TREE_OPERAND (ref, 0))
+ {
+ if (TREE_CODE (ref) == ARRAY_REF)
+ {
+ step = array_ref_element_size (ref);
+
+ if (TREE_CODE (step) != INTEGER_CST)
+ continue;
+
+ itype = TREE_TYPE (step);
+
+ /* If the type sizes do not match, we might run into problems
+ when one of them would overflow. */
+ if (TYPE_PRECISION (itype) != TYPE_PRECISION (type))
+ continue;
+
+ if (!operand_equal_p (step, fold_convert (itype, s), 0))
+ continue;
+
+ delta = fold_convert (itype, delta);
+ break;
+ }
+
+ if (!handled_component_p (ref))
+ return NULL_TREE;
+ }
+
+ /* We found the suitable array reference. So copy everything up to it,
+ and replace the index. */
+
+ pref = TREE_OPERAND (addr, 0);
+ ret = copy_node (pref);
+ pos = ret;
+
+ while (pref != ref)
+ {
+ pref = TREE_OPERAND (pref, 0);
+ TREE_OPERAND (pos, 0) = copy_node (pref);
+ pos = TREE_OPERAND (pos, 0);
+ }
+
+ TREE_OPERAND (pos, 1) = fold (build2 (code, itype,
+ TREE_OPERAND (pos, 1),
+ delta));
+
+ return build1 (ADDR_EXPR, type, ret);
+ }
+
/* Perform constant folding and related simplification of EXPR.
The related simplifications include x*1 => x, x*0 => 0, etc.,
and application of the associative law.
*************** fold (tree expr)
*** 6602,6607 ****
--- 6680,6703 ----
alt0, alt1)),
same));
}
+
+ /* Try replacing &a[i1] + c * i2 with &a[i1 + i2], if c is the step
+ of the array. The loop optimizer sometimes produces this type of
+ expression. */
+ if (TREE_CODE (arg0) == ADDR_EXPR
+ && TREE_CODE (arg1) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, PLUS_EXPR, arg0, arg1);
+ if (tem)
+ return fold (tem);
+ }
+ else if (TREE_CODE (arg1) == ADDR_EXPR
+ && TREE_CODE (arg0) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, PLUS_EXPR, arg1, arg0);
+ if (tem)
+ return fold (tem);
+ }
}
else
{
*************** fold (tree expr)
*** 6974,6979 ****
--- 7070,7086 ----
&diff))
return build_int_cst_type (type, diff);
}
+
+ /* Try replacing &a[i1] - c * i2 with &a[i1 - i2], if c is the step
+ of the array. The loop optimizer sometimes produces this type of
+ expression. */
+ if (TREE_CODE (arg0) == ADDR_EXPR
+ && TREE_CODE (arg1) == MULT_EXPR)
+ {
+ tem = try_move_mult_to_index (type, MINUS_EXPR, arg0, arg1);
+ if (tem)
+ return fold (tem);
+ }
if (TREE_CODE (arg0) == MULT_EXPR
&& TREE_CODE (arg1) == MULT_EXPR
Index: tree-ssa-loop-ivopts.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-ivopts.c,v
retrieving revision 2.20
diff -c -3 -p -r2.20 tree-ssa-loop-ivopts.c
*** tree-ssa-loop-ivopts.c 18 Oct 2004 18:01:09 -0000 2.20
--- tree-ssa-loop-ivopts.c 27 Oct 2004 11:36:47 -0000
*************** try_add_cand_for (struct ivopts_data *da
*** 3603,3622 ****
bitmap act_inv = BITMAP_XMALLOC ();
unsigned i;
struct cost_pair *cp;
bitmap_copy (best_ivs, ivs);
bitmap_copy (best_inv, inv);
! for (i = 0; i < use->n_map_members; i++)
{
! cp = use->cost_map + i;
! if (cp->cost == INFTY)
continue;
bitmap_copy (act_ivs, ivs);
! bitmap_set_bit (act_ivs, cp->cand->id);
! if (cp->depends_on)
! bitmap_a_or_b (act_inv, inv, cp->depends_on);
else
bitmap_copy (act_inv, inv);
act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
--- 3603,3634 ----
bitmap act_inv = BITMAP_XMALLOC ();
unsigned i;
struct cost_pair *cp;
+ bitmap_iterator bi;
+ struct iv_cand *cand;
+ bitmap depends_on;
bitmap_copy (best_ivs, ivs);
bitmap_copy (best_inv, inv);
! /* First try important candidates. Only if it fails, try the specific ones.
! Rationale -- in loops with many variables the best choice often is to use
! just one generic biv. If we added here many ivs specific to the uses,
! the optimization algorithm later would be likely to get stuck in a local
! minimum, thus causing us to create too many ivs. The approach from
! few ivs to more seems more likely to be successful -- starting from few
! ivs, replacing an expensive use by a specific iv should always be a
! win. */
! EXECUTE_IF_SET_IN_BITMAP (data->important_candidates, 0, i, bi)
{
! cand = iv_cand (data, i);
!
! if (get_use_iv_cost (data, use, cand, &depends_on) == INFTY)
continue;
bitmap_copy (act_ivs, ivs);
! bitmap_set_bit (act_ivs, cand->id);
! if (depends_on)
! bitmap_a_or_b (act_inv, inv, depends_on);
else
bitmap_copy (act_inv, inv);
act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
*************** try_add_cand_for (struct ivopts_data *da
*** 3629,3634 ****
--- 3641,3675 ----
}
}
+ if (best_cost == INFTY)
+ {
+ for (i = 0; i < use->n_map_members; i++)
+ {
+ cp = use->cost_map + i;
+ if (cp->cost == INFTY)
+ continue;
+
+ /* Already tried this. */
+ if (cp->cand->important)
+ continue;
+
+ bitmap_copy (act_ivs, ivs);
+ bitmap_set_bit (act_ivs, cp->cand->id);
+ if (cp->depends_on)
+ bitmap_a_or_b (act_inv, inv, cp->depends_on);
+ else
+ bitmap_copy (act_inv, inv);
+ act_cost = set_cost_up_to (data, act_ivs, act_inv, use->id + 1);
+
+ if (act_cost < best_cost)
+ {
+ best_cost = act_cost;
+ bitmap_copy (best_ivs, act_ivs);
+ bitmap_copy (best_inv, act_inv);
+ }
+ }
+ }
+
bitmap_copy (ivs, best_ivs);
bitmap_copy (inv, best_inv);
Index: tree-ssa-loop-niter.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree-ssa-loop-niter.c,v
retrieving revision 2.13
diff -c -3 -p -r2.13 tree-ssa-loop-niter.c
*** tree-ssa-loop-niter.c 22 Oct 2004 17:05:08 -0000 2.13
--- tree-ssa-loop-niter.c 27 Oct 2004 11:36:47 -0000
*************** number_of_iterations_cond (tree type, tr
*** 419,427 ****
d = EXEC_BINARY (LSHIFT_EXPR, niter_type,
build_int_cst_type (niter_type, 1), bits);
s = EXEC_BINARY (RSHIFT_EXPR, niter_type, step0, bits);
! bound = EXEC_BINARY (RSHIFT_EXPR, niter_type,
! build_int_cst (niter_type, -1),
! bits);
assumption = fold (build2 (FLOOR_MOD_EXPR, niter_type, base1, d));
assumption = fold (build2 (EQ_EXPR, boolean_type_node,
--- 419,428 ----
d = EXEC_BINARY (LSHIFT_EXPR, niter_type,
build_int_cst_type (niter_type, 1), bits);
s = EXEC_BINARY (RSHIFT_EXPR, niter_type, step0, bits);
!
! bound = build_low_bits_mask (niter_type,
! (TYPE_PRECISION (niter_type)
! - tree_low_cst (bits, 1)));
assumption = fold (build2 (FLOOR_MOD_EXPR, niter_type, base1, d));
assumption = fold (build2 (EQ_EXPR, boolean_type_node,
Index: tree.c
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree.c,v
retrieving revision 1.440
diff -c -3 -p -r1.440 tree.c
*** tree.c 25 Oct 2004 13:27:25 -0000 1.440
--- tree.c 27 Oct 2004 11:36:47 -0000
*************** build_int_cst_wide (tree type, unsigned
*** 609,614 ****
--- 609,648 ----
return t;
}
+ /* Builds an integer constant in TYPE such that lowest BITS bits are ones
+ and the rest are zeros. */
+
+ tree
+ build_low_bits_mask (tree type, unsigned bits)
+ {
+ unsigned HOST_WIDE_INT low;
+ HOST_WIDE_INT high;
+ unsigned HOST_WIDE_INT all_ones = ~(unsigned HOST_WIDE_INT) 0;
+
+ gcc_assert (bits <= TYPE_PRECISION (type));
+
+ if (bits == TYPE_PRECISION (type)
+ && !TYPE_UNSIGNED (type))
+ {
+ /* Sign extended all-ones mask. */
+ low = all_ones;
+ high = -1;
+ }
+ else if (bits <= HOST_BITS_PER_WIDE_INT)
+ {
+ low = all_ones >> (HOST_BITS_PER_WIDE_INT - bits);
+ high = 0;
+ }
+ else
+ {
+ bits -= HOST_BITS_PER_WIDE_INT;
+ low = all_ones;
+ high = all_ones >> (HOST_BITS_PER_WIDE_INT - bits);
+ }
+
+ return build_int_cst_wide (type, low, high);
+ }
+
/* Checks that X is integer constant that can be expressed in (unsigned)
HOST_WIDE_INT without loss of precision. */
Index: tree.h
===================================================================
RCS file: /cvs/gcc/gcc/gcc/tree.h,v
retrieving revision 1.640
diff -c -3 -p -r1.640 tree.h
*** tree.h 22 Oct 2004 18:52:19 -0000 1.640
--- tree.h 27 Oct 2004 11:36:47 -0000
*************** tree fold_build_cleanup_point_expr (tree
*** 3543,3548 ****
--- 3543,3549 ----
extern tree build_fold_addr_expr_with_type (tree, tree);
extern tree build_fold_indirect_ref (tree);
extern tree constant_boolean_node (int, tree);
+ extern tree build_low_bits_mask (tree, unsigned);
extern bool tree_swap_operands_p (tree, tree, bool);
extern enum tree_code swap_tree_comparison (enum tree_code);
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [patch] Partial fix for PR 18048
2004-10-27 15:38 ` Zdenek Dvorak
@ 2004-10-27 16:22 ` Andrew Pinski
2004-10-27 16:24 ` Zdenek Dvorak
2004-10-27 19:00 ` Richard Henderson
1 sibling, 1 reply; 14+ messages in thread
From: Andrew Pinski @ 2004-10-27 16:22 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: gcc-patches, Zack Weinberg
On Oct 27, 2004, at 11:22 AM, Zdenek Dvorak wrote:
> Hello,
>
> here is the patch with Zack's comments reflected. Bootstrapped &
> regtested on x86_64.
This should also help Java code where we get:
<L1>:;
i.3 = (unsigned int) i;
if ((unsigned int) D.376 > i.3) goto <L5>; else goto <L2>;
<L2>:;
D.385 = _Jv_ThrowBadArrayIndex (i);
<L5>:;
*(&a->data[0] + (int *) i.3 * 4B) = 0;
;
i = (int) (i.3 + 1);
D.376 = a->length;
if (i >= D.376) goto <L8>; else goto <L1>;
<L8>:;
But the real question is why is &a->data[0] not pulled out of the loop
in the first place.
-- Pinski
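As an aside, the shape of the access in the dump above matches the pattern item 1 of the original patch targets. The following C sketch (a hypothetical reconstruction for illustration, not the actual Java source) shows the two forms: before folding, the scaling by the element size is explicit pointer arithmetic that dom can CSE out of the access; after the fold, it is an array reference whose scaling stays inside the addressing mode:

```c
#include <stddef.h>

/* Form ivopts produce: base plus explicitly scaled index.  The
   "i * sizeof (int)" multiplication is a separate expression that
   dom may CSE, adding an instruction and register pressure.  */
void
clear_scaled (int *data, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    *(int *) ((char *) data + (size_t) i * sizeof (int)) = 0;
}

/* Form after the fold: a plain array reference, so the scaling is
   folded into the memory access itself.  */
void
clear_indexed (int *data, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    data[i] = 0;
}
```

Both functions store zeros to the same elements; the difference is only in how the address arithmetic is expressed to later passes.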
* Re: [patch] Partial fix for PR 18048
2004-10-27 16:22 ` Andrew Pinski
@ 2004-10-27 16:24 ` Zdenek Dvorak
2004-10-27 17:13 ` Dale Johannesen
0 siblings, 1 reply; 14+ messages in thread
From: Zdenek Dvorak @ 2004-10-27 16:24 UTC (permalink / raw)
To: Andrew Pinski; +Cc: gcc-patches, Zack Weinberg
Hello,
> >here is the patch with Zack's comment reflected. Bootstrapped &
> >regtested on x86_64.
>
> This should also help Java code where we get:
> <L1>:;
> i.3 = (unsigned int) i;
> if ((unsigned int) D.376 > i.3) goto <L5>; else goto <L2>;
>
> <L2>:;
> D.385 = _Jv_ThrowBadArrayIndex (i);
>
> <L5>:;
> *(&a->data[0] + (int *) i.3 * 4B) = 0;
>
> ;
> i = (int) (i.3 + 1);
> D.376 = a->length;
> if (i >= D.376) goto <L8>; else goto <L1>;
>
> <L8>:;
>
> But the real question is why is &a->data[0] not pulled out of the loop
> in the first place.
&a->data[0] is a constant (symbol reference + offset), so moving it out of
the loop would very likely just increase register pressure.
Zdenek
* Re: [patch] Partial fix for PR 18048
2004-10-27 16:24 ` Zdenek Dvorak
@ 2004-10-27 17:13 ` Dale Johannesen
2004-10-27 20:31 ` Zdenek Dvorak
0 siblings, 1 reply; 14+ messages in thread
From: Dale Johannesen @ 2004-10-27 17:13 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: gcc-patches, Dale Johannesen, Andrew Pinski, Zack Weinberg
On Oct 27, 2004, at 9:03 AM, Zdenek Dvorak wrote:
>> But the real question is why is &a->data[0] not pulled out of the loop
>> in the first place.
>
> &a->data[0] is a constant (symbol reference + offset), so moving it
> out of the loop would very likely just increase register pressure.
Of course it increases the register pressure, but it is still beneficial
to hoist in some cases. It should be considered as a hoisting candidate.
(mgrid is not the only specmark where not enough candidates are being
hoisted on ppc; just the easiest to analyze, as it spends 99% of its
time in one loop. I could report others as PRs, but my expectation is
that fixing mgrid will fix most of the rest as well.)
* Re: [patch] Partial fix for PR 18048
2004-10-27 17:13 ` Dale Johannesen
@ 2004-10-27 20:31 ` Zdenek Dvorak
0 siblings, 0 replies; 14+ messages in thread
From: Zdenek Dvorak @ 2004-10-27 20:31 UTC (permalink / raw)
To: Dale Johannesen; +Cc: gcc-patches, Andrew Pinski, Zack Weinberg
Hello,
> >>But the real question is why is &a->data[0] not pulled out of the loop
> >>in the first place.
> >
> >&a->data[0] is a constant (symbol reference + offset), so moving it
> >out of the loop would very likely just increase register pressure.
>
> Of course it increases the register pressure, but it is still
> beneficial to hoist in some cases. It should be considered as a
> hoisting candidate. (mgrid is not the only specmark where not enough
> candidates are being hoisted on ppc; just the easiest to analyze, as
> it spends 99% of its time in one loop. I could report others as PRs,
> but my expectation is that fixing mgrid will fix most of the rest as
> well.)
There are definitely cases where hoisting them is useful; I am not
really sure why the RTL loop optimizer does not catch this.
Zdenek
* Re: [patch] Partial fix for PR 18048
2004-10-27 15:38 ` Zdenek Dvorak
2004-10-27 16:22 ` Andrew Pinski
@ 2004-10-27 19:00 ` Richard Henderson
1 sibling, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2004-10-27 19:00 UTC (permalink / raw)
To: Zdenek Dvorak; +Cc: Zack Weinberg, gcc-patches
On Wed, Oct 27, 2004 at 05:22:25PM +0200, Zdenek Dvorak wrote:
> PR tree-optimization/18048
> * fold-const.c (try_move_mult_to_index): New function.
> (fold): Use try_move_mult_to_index.
> * tree-ssa-loop-ivopts.c (try_add_cand_for): Prefer common candidates.
> * tree-ssa-loop-niter.c (number_of_iterations_cond): Produce
> an all-ones unsigned constant without extra bits.
> * tree.c (build_low_bits_mask): New function.
> * tree.h (build_low_bits_mask): Declare.
Ok.
r~
Thread overview: 14+ messages
2004-10-24 13:16 [patch] Partial fix for PR 18048 Zdenek Dvorak
2004-10-24 17:30 ` Andrew Pinski
2004-10-24 17:57 ` Zdenek Dvorak
2004-10-25 18:15 ` Zack Weinberg
2004-10-25 22:55 ` Zdenek Dvorak
2004-10-25 23:01 ` Zack Weinberg
2004-10-25 23:02 ` Zdenek Dvorak
2004-10-25 23:10 ` Zdenek Dvorak
2004-10-27 15:38 ` Zdenek Dvorak
2004-10-27 16:22 ` Andrew Pinski
2004-10-27 16:24 ` Zdenek Dvorak
2004-10-27 17:13 ` Dale Johannesen
2004-10-27 20:31 ` Zdenek Dvorak
2004-10-27 19:00 ` Richard Henderson