public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [C Patch]: pr52543
       [not found]         ` <mcrr4wnl6lb.fsf@dhcp-172-18-216-180.mtv.corp.google.com>
@ 2012-03-29 21:10           ` Kenneth Zadeck
  2012-03-30  8:34             ` Ramana Radhakrishnan
                               ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Kenneth Zadeck @ 2012-03-29 21:10 UTC (permalink / raw)
  To: gcc-patches
  Cc: Ian Lance Taylor, Richard Sandiford, Mike Stump, Kenneth Zadeck, avr

[-- Attachment #1: Type: text/plain, Size: 3623 bytes --]

This patch takes a different approach to fixing PR52543 than does the 
patch in

http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html

This patch transforms the lower-subreg pass(es) from unconditionally 
splitting wide moves, zero extensions, and shifts, so that it now takes 
into account the target specific costs and only does the transformations 
if it is profitable.

Unconditional splitting is a problem that not only occurs on the AVR but 
is also a problem on the ARM NEON and my private port.  Furthermore, it 
is a problem that is likely to occur on most modern larger machines 
since these machines are more likely to have fast instructions for 
moving things that are larger than word mode.

At compiler initialization time, each mode that is larger that a word 
mode is examined to determine if the cost of moving a value of that mode 
is less expensive that inserting the proper number of word sided 
moves.   If it is cheaper to split it up, a bit is set to allow moves of 
that mode to be lowered.

A similar analysis is made for the zero extensions and shifts except 
that lower subreg had been (and is still limited to only breaking up 
these operations if the target size was twice the size of word mode.)  
Also, if the analysis determines that there are no profitable 
transformations, the pass exits quickly without doing any analysis.

It is quite likely that most ports will have to be adjusted after this 
patch is accepted.   For instance, the analysis discovers that there are 
no profitable transformations to be performed on the x86-64.    Since 
this is not my platform, I have no idea if these are the correct 
settings.   But the pass uses the standard insn_rtx_cost interface and 
it is the port maintainers responsibility to not lie to the optimization 
passes so this extra work in stage one should be acceptable.

I do know from a private conversation with Richard Sandiford, that mips 
patches are likely forthcoming.

There is preprocessor controlled code that prints out the cost 
analysis.  Only a summary of this can go in the subregs dump file 
because the analysis is called from backend_init_target and so the dump 
file is not available.      But it is very useful to define LOG_COSTS 
when adjusting your port.

There is also preprocessor code that forces all of the lowering 
operations to marked as profitable.  This is useful in debugging the new 
logic.

Both of these preprocessor symbols are documented at the top of the pass.

I have tested this on an x86_64 with both the force lowering on and off 
and neither cause any regressions as well as extensive testing on my port.

Ok to commit?

Kenny

2012-03-29 Kenneth Zadeck <zadeck@naturalbridge.com>

     * toplev.c (backend_init_target): Call initializer for lower-subreg 
pass.

     * lower-subreg.c (move_modes_to_split, splitting_ashift, 
splitting_lshiftrt)
     splitting_zext, splitting_some_shifts, twice_word_mode,    
something_to_do,
     word_mode_move_cost, move_zero_cost): New static vars.
     (compute_move_cost, profitable_shift_p, init_lower_subreg): New
     functions.
     (find_pseudo_copy, resolve_simple_move): Added code to only split 
based on costs.
     (find_decomposable_subregs): Added code to mark as decomposable
     moves that are not profitable.
     (find_decomposable_shift_zext): Added code to only decompose
     shifts and zext if profitable.
     (resolve_shift_zext): Added comment.
     (decompose_multiword_subregs): Dump list of profitable
     transformations.  Add code to skip non profitable transformations.

     *rtl.h(init_lower_subreg): Added declaration.



[-- Attachment #2: lower.diff --]
[-- Type: text/x-patch, Size: 14744 bytes --]

Index: toplev.c
===================================================================
--- toplev.c	(revision 185969)
+++ toplev.c	(working copy)
@@ -1660,6 +1660,7 @@ backend_init_target (void)
   /* rtx_cost is mode-dependent, so cached values need to be recomputed
      on a mode change.  */
   init_expmed ();
+  init_lower_subreg ();
 
   /* We may need to recompute regno_save_code[] and regno_restore_code[]
      after a mode change as well.  */
Index: lower-subreg.c
===================================================================
--- lower-subreg.c	(revision 185969)
+++ lower-subreg.c	(working copy)
@@ -52,10 +52,34 @@ DEF_VEC_P (bitmap);
 DEF_VEC_ALLOC_P (bitmap,heap);
 
 /* Decompose multi-word pseudo-registers into individual
-   pseudo-registers when possible.  This is possible when all the uses
-   of a multi-word register are via SUBREG, or are copies of the
-   register to another location.  Breaking apart the register permits
-   more CSE and permits better register allocation.  */
+   pseudo-registers when possible and profitable.  This is possible
+   when all the uses of a multi-word register are via SUBREG, or are
+   copies of the register to another location.  Breaking apart the
+   register permits more CSE and permits better register allocation.
+   This is profitable if the machine does not have move instructions
+   to do this.  
+
+   This pass only splits moves with modes are wider than word_mode and
+   ashifts, lshiftrt and zero_extensions with integer modes that are
+   twice the width of word_mode.  The latter could be generalized if
+   there was a need to do this, but the trend in architectures is to
+   not need this.  
+
+   There are two useful preprocessor defines for use by maintainers:  
+
+   #define LOG_COSTS
+
+   if you wish to see the actual cost estimates that are being used
+   for each mode wider than word mode and the cost estimates for zero
+   extension and the shifts.   This can be useful when port maintainers 
+   are tuning insn rtx costs.
+
+   #define FORCE_LOWERING
+
+   if you wish to test the pass with all the transformation forced on.
+   This can be useful for finding bugs in the transformations.
+
+   Both of these should not be enabled at the same time. */
 
 /* Bit N in this bitmap is set if regno N is used in a context in
    which we can decompose it.  */
@@ -71,12 +95,179 @@ static bitmap non_decomposable_context;
    avoid generating accesses to its subwords in integer modes.  */
 static bitmap subreg_context;
 
+/* This pass can transform 4 different operations: move, ashift,
+   lshiftrt, and zero_extend.  There is a boolean vector for move
+   splitting that is indexed by mode and is true for each mode that is
+   to have its copies split.  The other three operations are only done
+   for one mode so they are only controlled by a single boolean  .*/
+static bool move_modes_to_split[MAX_MACHINE_MODE];
+
+/* Other splitting operations to be done on the this platform.  */
+static bool splitting_ashift;
+static bool splitting_lshiftrt;
+static bool splitting_zext;
+/* The or of the previous three values.  */
+static bool splitting_some_shifts;
+
+/* The integer mode that is twice the width of word_mode.  */
+static enum machine_mode twice_word_mode;
+
 /* Bit N in the bitmap in element M of this array is set if there is a
    copy from reg M to reg N.  */
 static VEC(bitmap,heap) *reg_copy_graph;
 
-/* Return whether X is a simple object which we can take a word_mode
-   subreg of.  */
+static bool something_to_do;
+
+/* Precomputed cost values used to determine if lowering shift
+   operations is profitable.  */ 
+static int word_mode_move_cost;
+static int move_zero_cost;
+
+/* See what the move cost is.  */
+static int 
+compute_move_cost (enum machine_mode mode)
+{
+  rtx src = gen_rtx_REG (mode, FIRST_PSEUDO_REGISTER);
+  rtx trg = gen_rtx_REG (mode, FIRST_PSEUDO_REGISTER);
+  rtx pat = gen_rtx_SET (VOIDmode, trg, src);
+
+  return insn_rtx_cost (pat, true);
+}
+
+
+/* Return true if it is profitable to lower and shift by SHIFT_AMT.
+   CODE can be either ASHIFT or LSHIFTRT.  */
+static bool
+profitable_shift_p (enum rtx_code code, int shift_amt)
+{
+  rtx trg = gen_rtx_REG (twice_word_mode, FIRST_PSEUDO_REGISTER);
+  rtx src = gen_rtx_REG (twice_word_mode, FIRST_PSEUDO_REGISTER);
+  int word_mode_size = GET_MODE_BITSIZE (word_mode);
+  int wide_cost = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, 
+					      gen_rtx_fmt_ee (code, twice_word_mode, 
+							      src, GEN_INT (shift_amt))),
+				 true);
+#ifdef FORCE_LOWERING
+  return true;
+#endif
+
+#ifdef LOG_COSTS
+  fprintf (stderr, "shift code %d, shift amt %d, wide_cost %d\n", code, shift_amt, wide_cost);
+#endif
+  if (shift_amt == word_mode_size)
+    return wide_cost > word_mode_move_cost + move_zero_cost;
+  else
+    {
+      int narrow_cost;
+
+      trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+      src = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+      narrow_cost = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, 
+						gen_rtx_fmt_ee (code, word_mode, 
+								src, 
+								GEN_INT (shift_amt - word_mode_size))),
+				   true);
+
+#ifdef LOG_COSTS
+      fprintf (stderr, "narrow_cost %d\n", narrow_cost);
+#endif
+      return wide_cost > narrow_cost + move_zero_cost;
+    }
+}
+
+
+/* Initialize pass once per execution.  This envolves determining
+   which operations on the machine are profitable.  If none are found,
+   then the pass just returns when called.  */
+
+void
+init_lower_subreg (void)
+{
+  int word_mode_size = GET_MODE_BITSIZE (word_mode);
+  rtx trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+  rtx pat;
+  unsigned int i;
+  int factor;
+
+  word_mode_move_cost = compute_move_cost (word_mode);
+  move_zero_cost = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, CONST0_RTX (word_mode)), true);
+
+#ifdef LOG_COSTS
+  fprintf (stderr, "word mode move cost %d\n", word_mode_move_cost);
+  fprintf (stderr, "move 0 cost %d\n", move_zero_cost);
+#endif
+  for (i = 0; i < MAX_MACHINE_MODE; i++) 
+    {
+      int t;
+      factor = GET_MODE_SIZE (i) / UNITS_PER_WORD;
+
+      if (factor <= 1) 
+	continue;
+
+      t = compute_move_cost ((enum machine_mode) i);
+
+#ifdef LOG_COSTS
+      fprintf (stderr, "mode %s, move cost %d, factor %d\n", mode_name[i], t, factor);
+#endif
+      if (t > word_mode_move_cost * factor)
+	{
+	  move_modes_to_split[i] = true;
+	  something_to_do = true;
+	}
+#ifdef FORCE_LOWERING
+      move_modes_to_split[i] = true;
+      something_to_do = true;
+#endif
+    }
+
+  /* For the moves and shifts, the only case that is checked is one
+     where the mode of the target is an integer mode twice the width
+     of the word_mode.  */
+
+  twice_word_mode = GET_MODE_WIDER_MODE (word_mode);
+
+  /* The only case here to check to see if moving the upper part with a
+     zero is cheaper than doing the zext itself.  There is some theory
+     that any well implemented port would just have the pattern.  */
+  pat = gen_rtx_SET (VOIDmode, trg, 
+		     gen_rtx_ZERO_EXTEND (twice_word_mode, gen_reg_rtx (word_mode)));
+  
+  if (insn_rtx_cost (pat, true) > word_mode_move_cost + move_zero_cost) 
+    {
+      splitting_zext = true;
+      splitting_some_shifts = true;
+      something_to_do = true;
+    }
+
+#ifdef FORCE_LOWERING
+  splitting_zext = true;
+  splitting_some_shifts = true;
+  something_to_do = true;
+#endif
+  /* For the shifts there are three shift amounts that need to be
+     checked: word_mode, word_mode + 1 and word_mode * 1.5.  The first
+     of these can be done without a shift.  The second and third case
+     are the same on large machines but may be different on small
+     machines that special case shift 1 and 2.  If any of these are
+     found to be useful, then we set the flags to consider those cases
+     and when the actual shift amount is known, redo the cost
+     calculation.  */
+  
+  splitting_ashift = profitable_shift_p (ASHIFT, word_mode_size)
+    || profitable_shift_p (ASHIFT, word_mode_size + 1)
+    || profitable_shift_p (ASHIFT, word_mode_size + (word_mode_size >> 1));
+
+  splitting_some_shifts |= splitting_ashift;
+  something_to_do |= splitting_ashift;
+
+  splitting_lshiftrt = profitable_shift_p (LSHIFTRT, word_mode_size)
+    || profitable_shift_p (LSHIFTRT, word_mode_size + 1)
+    || profitable_shift_p (LSHIFTRT, word_mode_size + (word_mode_size >> 1));
+  
+  splitting_some_shifts |= splitting_lshiftrt;
+  something_to_do |= splitting_lshiftrt;
+}
+
 
 static bool
 simple_move_operand (rtx x)
@@ -173,7 +364,7 @@ find_pseudo_copy (rtx set)
   if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
     return false;
 
-  if (GET_MODE_SIZE (GET_MODE (dest)) <= UNITS_PER_WORD)
+  if (!move_modes_to_split[(int)GET_MODE (dest)])
     return false;
 
   b = VEC_index (bitmap, reg_copy_graph, rs);
@@ -335,6 +526,11 @@ find_decomposable_subregs (rtx *px, void
 		bitmap_set_bit (decomposable_context, regno);
 	      break;
 	    case SIMPLE_MOVE:
+	      /* If this is marked as a simple move with a mode this
+		 large, it is because find_pseudo_copy returned false
+		 because this was not a mode we wanted to split.  */
+	      if (!move_modes_to_split[GET_MODE (x)])
+		bitmap_set_bit (non_decomposable_context, regno);
 	      break;
 	    default:
 	      gcc_unreachable ();
@@ -668,7 +864,7 @@ resolve_simple_move (rtx set, rtx insn)
   orig_mode = GET_MODE (dest);
 
   words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
-  if (words <= 1)
+  if (!move_modes_to_split[(int)GET_MODE (dest)])
     return insn;
 
   start_sequence ();
@@ -941,16 +1137,37 @@ find_decomposable_shift_zext (rtx insn)
   rtx set;
   rtx op;
   rtx op_operand;
+  enum machine_mode mode;
 
   set = single_set (insn);
   if (!set)
     return 0;
 
   op = SET_SRC (set);
-  if (GET_CODE (op) != ASHIFT
-      && GET_CODE (op) != LSHIFTRT
-      && GET_CODE (op) != ZERO_EXTEND)
+  mode = GET_MODE (op);
+
+  if (mode != twice_word_mode)
+    return 0;
+
+  switch (GET_CODE (op)) {
+  case ASHIFT:
+    if (splitting_ashift)
+      break;
+    return 0;
+    
+  case LSHIFTRT:
+    if (splitting_lshiftrt)
+      break;
+    return 0;
+    
+  case ZERO_EXTEND:
+    if (splitting_zext)
+      break;
+    return 0;
+
+  default:
     return 0;
+  }
 
   op_operand = XEXP (op, 0);
   if (!REG_P (SET_DEST (set)) || !REG_P (op_operand)
@@ -961,23 +1178,26 @@ find_decomposable_shift_zext (rtx insn)
 
   if (GET_CODE (op) == ZERO_EXTEND)
     {
-      if (GET_MODE (op_operand) != word_mode
-	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
+      if (GET_MODE (op_operand) != word_mode)
 	return 0;
     }
   else /* left or right shift */
     {
+      int shift_val;
+
       if (!CONST_INT_P (XEXP (op, 1))
-	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
-	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
+	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD)
+	return 0;
+
+      shift_val = INTVAL (XEXP (op, 1));
+      if (!profitable_shift_p (GET_CODE (op), shift_val))
 	return 0;
+
+      bitmap_set_bit (decomposable_context, REGNO (op_operand));
     }
 
   bitmap_set_bit (decomposable_context, REGNO (SET_DEST (set)));
 
-  if (GET_CODE (op) != ZERO_EXTEND)
-    bitmap_set_bit (decomposable_context, REGNO (op_operand));
-
   return 1;
 }
 
@@ -1008,6 +1228,8 @@ resolve_shift_zext (rtx insn)
 
   op_operand = XEXP (op, 0);
 
+  /* We can tear this operation apart only if the regs were already
+     torn apart.  */ 
   if (!resolve_reg_p (SET_DEST (set)) && !resolve_reg_p (op_operand))
     return NULL_RTX;
 
@@ -1083,8 +1305,34 @@ decompose_multiword_subregs (void)
   unsigned int max;
   basic_block bb;
 
-  if (df)
-    df_set_flags (DF_DEFER_INSN_RESCAN);
+  if (dump_file)
+    {
+      unsigned int i;
+
+      for (i = 0; i < MAX_MACHINE_MODE; i++) 
+	{
+	  if (GET_MODE_SIZE (i) > UNITS_PER_WORD)
+	    fprintf (dump_file, "%s mode %s for copy lowering.\n", 
+		     move_modes_to_split[i] ? "Splitting" : "Skipping", mode_name[i]);
+	}
+
+      fprintf (dump_file, "%s mode %s for zero_extend lowering.\n", 
+	       splitting_zext ? "Splitting" : "Skipping", mode_name[twice_word_mode]);
+
+      fprintf (dump_file, "%s mode %s for ashift lowering.\n", 
+	       splitting_ashift ? "Splitting" : "Skipping", mode_name[twice_word_mode]);
+
+      fprintf (dump_file, "%s mode %s for lshiftrt lowering.\n", 
+	       splitting_lshiftrt ? "Splitting" : "Skipping", mode_name[twice_word_mode]);
+
+      if (!something_to_do)
+	fprintf (dump_file, "\nNothing to lower for port.\n\n");
+    }
+
+
+  /* Check if this target even has any modes to consider lowering.   */
+  if (!something_to_do)
+    return;
 
   max = max_reg_num ();
 
@@ -1094,24 +1342,41 @@ decompose_multiword_subregs (void)
      all the insns.  */
   {
     unsigned int i;
+    bool useful_modes_seen = false;
 
     for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
       {
-	if (regno_reg_rtx[i] != NULL
-	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
-	  break;
+	if (regno_reg_rtx[i] != NULL)
+	  {
+	    enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
+	    if (regno_reg_rtx[i] != NULL 
+		&& (move_modes_to_split [mode] 
+		    || (splitting_some_shifts && mode == twice_word_mode)))
+	      {
+		useful_modes_seen = true;
+		break;
+	      }
+	  }
+      }
+
+    if (!useful_modes_seen)
+      {
+	if (dump_file)
+	  fprintf (dump_file, "Nothing to lower in this function.\n");
+	return;
       }
-    if (i == max)
-      return;
   }
 
-  if (df)
-    run_word_dce ();
+  if (df) 
+    {
+      df_set_flags (DF_DEFER_INSN_RESCAN);
+      run_word_dce ();
+    }
 
-  /* FIXME: When the dataflow branch is merged, we can change this
-     code to look for each multi-word pseudo-register and to find each
-     insn which sets or uses that register.  That should be faster
-     than scanning all the insns.  */
+  /* FIXME: It may be possible to change this code to look for each
+     multi-word pseudo-register and to find each insn which sets or
+     uses that register.  That should be faster than scanning all the
+     insns.  */
 
   decomposable_context = BITMAP_ALLOC (NULL);
   non_decomposable_context = BITMAP_ALLOC (NULL);
Index: rtl.h
===================================================================
--- rtl.h	(revision 185969)
+++ rtl.h	(working copy)
@@ -2523,6 +2523,9 @@ extern void init_expmed (void);
 extern void expand_inc (rtx, rtx);
 extern void expand_dec (rtx, rtx);
 
+/* In lower-subreg.c */
+extern void init_lower_subreg (void);
+
 /* In gcse.c */
 extern bool can_copy_p (enum machine_mode);
 extern bool can_assign_to_reg_without_clobbers_p (rtx);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-29 21:10           ` [C Patch]: pr52543 Kenneth Zadeck
@ 2012-03-30  8:34             ` Ramana Radhakrishnan
  2012-03-30 10:09             ` Richard Sandiford
       [not found]             ` <CACUk7=XVO=yHrPBFtAVfPxMtViYthtGGfuQSVGHNqHE7ibER0g@mail.gmail.com>
  2 siblings, 0 replies; 38+ messages in thread
From: Ramana Radhakrishnan @ 2012-03-30  8:34 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: gcc-patches, Ian Lance Taylor, Richard Sandiford, Mike Stump,
	Kenneth Zadeck, avr

[-- Attachment #1: Type: text/plain, Size: 1704 bytes --]

On 29 March 2012 22:10, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
> This patch takes a different approach to fixing PR52543 than does the patch
> in
>
> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>
> This patch transforms the lower-subreg pass(es) from unconditionally
> splitting wide moves, zero extensions, and shifts, so that it now takes into
> account the target specific costs and only does the transformations if it is
> profitable.
>
> Unconditional splitting is a problem that not only occurs on the AVR but is
> also a problem on the ARM NEON and my private port.  Furthermore, it is a
> problem that is likely to occur on most modern larger machines since these
> machines are more likely to have fast instructions for moving things that
> are larger than word mode.

Nice - this means that atleast one pending patches for subreg
style operations for neon intrinsics can go in after appropriate tweaking.
of costs. It probably requires some tweaking and benchmarking on ARM, but
the case where we saw such spills to the stack with subreg style operations is
now much improved , indicating that the existing costs infrastructure
manages to get this right atleast for this case.

Richard(S) - If you remember your PR48941 patch - after applying the
lower-subreg patch I now see far better code and what one gets out of
-fno-split-wide-types but a lot of that gratuitous spillng has gone away.

There are still too many moves between Neon registers but there are
far less moves
to the integer side and the gratuitous spilling is now gone.

old on left - new on right ( i.e. Kenneth's patch + Richard's PR48941 patch
http://patchwork.ozlabs.org/patch/130429/)

regards
Ramana

[-- Attachment #2: diff.txt --]
[-- Type: text/plain, Size: 3736 bytes --]

cross:                                                          cross:
        @ args = 0, pretend = 0, frame = 16                   |         @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 1, uses_anonymous_args = 0           |         @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.                                @ link register save eliminated.
        str     fp, [sp, #-4]!                                |         vldmia  r0, {d26-d27}
        add     fp, sp, #0                                    |         vldmia  r1, {d24-d25}
        sub     sp, sp, #20                                   |         vmov    q10, q13  @ v4sf
        vldmia  r0, {d16-d17}                                 |         vmov    q11, q13  @ v4sf
        vmov    q10, q8  @ v4sf                               |         vmov    q8, q12  @ v4sf
        sub     sp, sp, #48                                   |         vmov    q9, q12  @ v4sf
                                                              >         vzip.32 q10, q11
                                                              >         vzip.32 q8, q9
                                                              >         vmov    q14, q10  @ v4sf
        vmov    q12, q8  @ v4sf                                         vmov    q12, q8  @ v4sf
        add     r3, sp, #15                                   |         vmov    d21, d22  @ v2sf
        bic     r3, r3, #15                                   |         vmul.f32        d16, d29, d18
        vzip.32 q10, q12                                      |         vmul.f32        d17, d21, d24
        vstmia  r3, {d20-d21}                                 |         vmov    d19, d18  @ v2sf
        vstr    d24, [r3, #16]                                |         vmul.f32        d18, d28, d25
        vstr    d25, [r3, #24]                                |         vmls.f32        d16, d21, d25
        vldmia  r1, {d16-d17}                                 |         vmls.f32        d17, d28, d19
        vmov    q9, q8  @ v4sf                                |         vmls.f32        d18, d29, d24
        vmov    q11, q8  @ v4sf                               |         vmov    d26, d16  @ v2sf
        vzip.32 q9, q11                                       |         vmov    d27, d17  @ v2sf
        vstmia  r3, {d18-d19}                                 |         vmov    d17, d18  @ v2sf
        vstr    d22, [r3, #16]                                |         vuzp.32 d26, d27
        vstr    d23, [r3, #24]                                |         vmov    d16, d26  @ v2sf
        vmov    d25, d18  @ v2sf                              |         vmov    r0, r1, d16  @ v4sf
        vmul.f32        d17, d21, d22                         |         vmov    r2, r3, d17
        vmul.f32        d18, d24, d18                         <
        vmov    d16, d19  @ v2sf                              <
        vmul.f32        d19, d20, d19                         <
        vmls.f32        d17, d24, d16                         <
        vmls.f32        d18, d20, d22                         <
        vmls.f32        d19, d21, d25                         <
        vuzp.32 d17, d18                                      <
        vmov    d20, d17  @ v2sf                              <
        vmov    d21, d19  @ v2sf                              <
        vmov    r0, r1, d20  @ v4sf                           <
        vmov    r2, r3, d21                                   <
        add     sp, fp, #0                                    <
        ldmfd   sp!, {fp}                                     <
        bx      lr                                                      bx      lr





^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-29 21:10           ` [C Patch]: pr52543 Kenneth Zadeck
  2012-03-30  8:34             ` Ramana Radhakrishnan
@ 2012-03-30 10:09             ` Richard Sandiford
  2012-03-30 15:25               ` Kenneth Zadeck
  2012-05-10  6:45               ` Paolo Bonzini
       [not found]             ` <CACUk7=XVO=yHrPBFtAVfPxMtViYthtGGfuQSVGHNqHE7ibER0g@mail.gmail.com>
  2 siblings, 2 replies; 38+ messages in thread
From: Richard Sandiford @ 2012-03-30 10:09 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: gcc-patches, Ian Lance Taylor, Mike Stump, Kenneth Zadeck, avr

Kenneth Zadeck <zadeck@naturalbridge.com> writes:
> This patch takes a different approach to fixing PR52543 than does the 
> patch in
>
> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>
> This patch transforms the lower-subreg pass(es) from unconditionally 
> splitting wide moves, zero extensions, and shifts, so that it now takes 
> into account the target specific costs and only does the transformations 
> if it is profitable.

Thanks for doing this.

> +   This pass only splits moves with modes are wider than word_mode and

...modes that are wider...

> +   ashifts, lshiftrt and zero_extensions with integer modes that are

Nitlet: lshiftrts to be consistent.  Or "ASHIFTs, LSHIFTRTs and ZERO_EXTENDs".

> +   There are two useful preprocessor defines for use by maintainers:  
> +
> +   #define LOG_COSTS
> +
> +   if you wish to see the actual cost estimates that are being used
> +   for each mode wider than word mode and the cost estimates for zero
> +   extension and the shifts.   This can be useful when port maintainers 
> +   are tuning insn rtx costs.
> +
> +   #define FORCE_LOWERING
> +
> +   if you wish to test the pass with all the transformation forced on.
> +   This can be useful for finding bugs in the transformations.

Must admit I'm not keen on these kinds of macro, but it's Ian's call.

Idea for the future (i.e. not this patch) is to have a dump file for
target initialisation.

> +/* This pass can transform 4 different operations: move, ashift,
> +   lshiftrt, and zero_extend.  There is a boolean vector for move
> +   splitting that is indexed by mode and is true for each mode that is
> +   to have its copies split.  The other three operations are only done
> +   for one mode so they are only controlled by a single boolean  .*/

As mentioned privately, whether this is profitable for shifts depends
to some extent on the shift amount.  GCC already supports targets where
this transformation would be OK for some shift amounts but not others.
So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
booleans rather than just one.

More comments below about how this filters through your other changes.

> +/* This pass can transform 4 different operations: move, ashift,
> +   lshiftrt, and zero_extend.  There is a boolean vector for move
> +   splitting that is indexed by mode and is true for each mode that is
> +   to have its copies split.  The other three operations are only done
> +   for one mode so they are only controlled by a single boolean  .*/
> +static bool move_modes_to_split[MAX_MACHINE_MODE];
> +
> +/* Other splitting operations to be done on the this platform.  */
> +static bool splitting_ashift;
> +static bool splitting_lshiftrt;
> +static bool splitting_zext;
> +/* The or of the previous three values.  */
> +static bool splitting_some_shifts;
> +
> +/* The integer mode that is twice the width of word_mode.  */
> +static enum machine_mode twice_word_mode;
> +
>  /* Bit N in the bitmap in element M of this array is set if there is a
>     copy from reg M to reg N.  */
>  static VEC(bitmap,heap) *reg_copy_graph;
>  
> -/* Return whether X is a simple object which we can take a word_mode
> -   subreg of.  */
> +static bool something_to_do;
> +
> +/* Precomputed cost values used to determine if lowering shift
> +   operations is profitable.  */ 
> +static int word_mode_move_cost;
> +static int move_zero_cost;
> +

All this new data should be wrapped in a target structure.
See expmed.[hc] for an example.

> +    return wide_cost > word_mode_move_cost + move_zero_cost;

Should be >=.  The purpose of this transformation isn't AIUI to improve
shifts in themselves.  It's just to stop shifts from being an artificial
barrier to lower-subreg's main "optimisation" -- which in some ways is more
of an IR simplification -- i.e., replacing subregs with regs.  We want
to do it if the costs of the two sequences are equal.  More justification
at the end.

> +  else
> +    {
> +      int narrow_cost;
> +
> +      trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
> +      src = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);

Probably best to use FIRST_PSEUDO_REGISTER + 1 for the second (here and
elsewhere) so that we get the cost of a general 3-operand shift rather
than a 2-operand one.

> +      narrow_cost = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, 
> +						gen_rtx_fmt_ee (code, word_mode, 
> +								src, 
> +								GEN_INT (shift_amt - word_mode_size))),

Watch the long lines.  Everything should be less than 80 chars.

> +      return wide_cost > narrow_cost + move_zero_cost;

">=" here too.

> +/* Initialize pass once per execution.  This envolves determining

Once per target.  And since it can be called more than once, the
function needs to zero out the new data rather than rely on load-time
zero-initialisation.

> +      t = compute_move_cost ((enum machine_mode) i);
> +
> +#ifdef LOG_COSTS
> +      fprintf (stderr, "mode %s, move cost %d, factor %d\n", mode_name[i], t, factor);
> +#endif
> +      if (t > word_mode_move_cost * factor)

">=" here too.

> +  /* The only case here to check to see if moving the upper part with a
> +     zero is cheaper than doing the zext itself.  There is some theory
> +     that any well implemented port would just have the pattern.  */

Don't really get this comment.  These days, I don't think ports are
really expected to define double-word patterns that simply decompose
to the usual (target-independent) word-mode sequence.  Ports used to do
that because having a unified operation helped the earlier rtl optimisers,
but those cases are now handled at the gimple level.  These days it seems
better to let optabs lower the thing immediately and allow the individual
word ops (which weren't exposed at the gimple level) to get maximum
optimisation.

Maybe easiest to remove the second sentence rather than argue about it.

> +  if (insn_rtx_cost (pat, true) > word_mode_move_cost + move_zero_cost) 

">=" :-)

> @@ -173,7 +364,7 @@ find_pseudo_copy (rtx set)
>    if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
>      return false;
>  
> -  if (GET_MODE_SIZE (GET_MODE (dest)) <= UNITS_PER_WORD)
> +  if (!move_modes_to_split[(int)GET_MODE (dest)])
>      return false;

Space after "(int)".

> @@ -335,6 +526,11 @@ find_decomposable_subregs (rtx *px, void
>  		bitmap_set_bit (decomposable_context, regno);
>  	      break;
>  	    case SIMPLE_MOVE:
> +	      /* If this is marked as a simple move with a mode this
> +		 large, it is because find_pseudo_copy returned false
> +		 because this was not a mode we wanted to split.  */
> +	      if (!move_modes_to_split[GET_MODE (x)])

I think you need the (int) here too.  (Should have been caught by boostrap
if so.)

> @@ -668,7 +864,7 @@ resolve_simple_move (rtx set, rtx insn)
>    orig_mode = GET_MODE (dest);
>  
>    words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
> -  if (words <= 1)
> +  if (!move_modes_to_split[(int)GET_MODE (dest)])
>      return insn;
>  
>    start_sequence ();

Space after "(int)".

>    op = SET_SRC (set);
> -  if (GET_CODE (op) != ASHIFT
> -      && GET_CODE (op) != LSHIFTRT
> -      && GET_CODE (op) != ZERO_EXTEND)
> +  mode = GET_MODE (op);
> +
> +  if (mode != twice_word_mode)
> +    return 0;
> +
> +  switch (GET_CODE (op)) {
> +  case ASHIFT:
> +    if (splitting_ashift)
> +      break;
> +    return 0;
> +    
> +  case LSHIFTRT:
> +    if (splitting_lshiftrt)
> +      break;
> +    return 0;
> +    
> +  case ZERO_EXTEND:
> +    if (splitting_zext)
> +      break;
> +    return 0;

With the boolean array change suggested above, we'd not bother with this
bit and check the cached cost information:

> @@ -961,23 +1178,26 @@ find_decomposable_shift_zext (rtx insn)
>  
>    if (GET_CODE (op) == ZERO_EXTEND)
>      {
> -      if (GET_MODE (op_operand) != word_mode
> -	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
> +      if (GET_MODE (op_operand) != word_mode)
>  	return 0;
>      }

...here and:

>    else /* left or right shift */
>      {
> +      int shift_val;
> +
>        if (!CONST_INT_P (XEXP (op, 1))
> -	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
> -	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
> +	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD)
> +	return 0;
> +
> +      shift_val = INTVAL (XEXP (op, 1));
> +      if (!profitable_shift_p (GET_CODE (op), shift_val))
>  	return 0;
> +
> +      bitmap_set_bit (decomposable_context, REGNO (op_operand));
>      }

...here (using the boolean array instead of profitable_shift_p).

> @@ -1094,24 +1342,41 @@ decompose_multiword_subregs (void)
>       all the insns.  */
>    {
>      unsigned int i;
> +    bool useful_modes_seen = false;
>  
>      for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
>        {
> -	if (regno_reg_rtx[i] != NULL
> -	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
> -	  break;
> +	if (regno_reg_rtx[i] != NULL)
> +	  {
> +	    enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
> +	    if (regno_reg_rtx[i] != NULL 
> +		&& (move_modes_to_split [mode] 
> +		    || (splitting_some_shifts && mode == twice_word_mode)))

Here I think we should just check move_modes_to_split.  We should
not split a mode for which moves get more expensive, regardless of
whether shifts are cheaper.  lower-subreg is designed to help later
passes produce better code.  Splitting a shift when the associated
move is more expensive is likely to encourage those passes to generate
more expensive move sequences instead of cheaper ones.  (Includes RA
and reload.)

Besides, I claim that it makes no sense to define a double-word
shift pattern that is inherently more expensive than what optabs would
generate without the double-word shift pattern.  So I think the ">="s in
the (modified) shift cost calculations should act as "=="s in practice
(but should still be coded as ">="s for consistency).  That is, we'll
allow the shift to be split if the cost of doing so is the same as the
original shift, but not otherwise.

Or the short version: if splitting double-word moves is expensive,
skip all the zext and shift stuff too.

Looks good to me otherwise, but it's Ian's pass.

Richard

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-30 10:09             ` Richard Sandiford
@ 2012-03-30 15:25               ` Kenneth Zadeck
  2012-03-30 15:41                 ` Richard Sandiford
  2012-05-10  6:45               ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Kenneth Zadeck @ 2012-03-30 15:25 UTC (permalink / raw)
  To: gcc-patches, Ian Lance Taylor, Mike Stump, Kenneth Zadeck, avr,
	rdsandiford, ramana.radhakrishnan




>> +   There are two useful preprocessor defines for use by maintainers:
>> +
>> +   #define LOG_COSTS
>> +
>> +   if you wish to see the actual cost estimates that are being used
>> +   for each mode wider than word mode and the cost estimates for zero
>> +   extension and the shifts.   This can be useful when port maintainers
>> +   are tuning insn rtx costs.
>> +
>> +   #define FORCE_LOWERING
>> +
>> +   if you wish to test the pass with all the transformation forced on.
>> +   This can be useful for finding bugs in the transformations.
> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
>
> Idea for the future (i.e. not this patch) is to have a dump file for
> target initialisation.
Imagine my horror when i did all of this as you had privately suggested 
and discovered that there was no way to log what i was doing.  This is 
good enough until someone wants to fix the general problem.

>> +/* This pass can transform 4 different operations: move, ashift,
>> +   lshiftrt, and zero_extend.  There is a boolean vector for move
>> +   splitting that is indexed by mode and is true for each mode that is
>> +   to have its copies split.  The other three operations are only done
>> +   for one mode so they are only controlled by a single boolean  .*/
> As mentioned privately, whether this is profitable for shifts depends
> to some extent on the shift amount.  GCC already supports targets where
> this transformation would be OK for some shift amounts but not others.
> So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
> booleans rather than just one.
>
> More comments below about how this filters through your other changes.
I think that you actually are missing what i am doing with this.   I 
look at 3 representative values that "should" discover any non 
uniformities.   If any of them are profitable, i set this bit.  Then at 
the point where i really have to pull the trigger on a real instance, i 
check the shift amount used at that spot to see if the individual shift 
is profitable.

I did this for two reasons.   One of them was that i was a little 
concerned that HOST_BITS_PER_WIDE_INT on the smallest host was not as 
big as the bitsize of word_word mode on the largest target (it could be 
but this knowledge is above my pay grade).   The other reason was did 
not see this as a common operation and checking it on demand seemed like 
the winner.



I will do everything else you mention and resubmit after i fix ramana's ice.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-30 15:25               ` Kenneth Zadeck
@ 2012-03-30 15:41                 ` Richard Sandiford
  2012-03-31 16:21                   ` Kenneth Zadeck
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Sandiford @ 2012-03-30 15:41 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: gcc-patches, Ian Lance Taylor, Mike Stump, Kenneth Zadeck, avr,
	ramana.radhakrishnan

Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>>> +   There are two useful preprocessor defines for use by maintainers:
>>> +
>>> +   #define LOG_COSTS
>>> +
>>> +   if you wish to see the actual cost estimates that are being used
>>> +   for each mode wider than word mode and the cost estimates for zero
>>> +   extension and the shifts.   This can be useful when port maintainers
>>> +   are tuning insn rtx costs.
>>> +
>>> +   #define FORCE_LOWERING
>>> +
>>> +   if you wish to test the pass with all the transformation forced on.
>>> +   This can be useful for finding bugs in the transformations.
>> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
>>
>> Idea for the future (i.e. not this patch) is to have a dump file for
>> target initialisation.
> Imagine my horror when i did all of this as you had privately suggested 
> and discovered that there was no way to log what i was doing.  This is 
> good enough until someone wants to fix the general problem.
>
>>> +/* This pass can transform 4 different operations: move, ashift,
>>> +   lshiftrt, and zero_extend.  There is a boolean vector for move
>>> +   splitting that is indexed by mode and is true for each mode that is
>>> +   to have its copies split.  The other three operations are only done
>>> +   for one mode so they are only controlled by a single boolean  .*/
>> As mentioned privately, whether this is profitable for shifts depends
>> to some extent on the shift amount.  GCC already supports targets where
>> this transformation would be OK for some shift amounts but not others.
>> So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
>> booleans rather than just one.
>>
>> More comments below about how this filters through your other changes.
> I think that you actually are missing what i am doing with this.   I 
> look at 3 representative values that "should" discover any non 
> uniformities.   If any of them are profitable, i set this bit.  Then at 
> the point where i really have to pull the trigger on a real instance, i 
> check the shift amount used at that spot to see if the individual shift 
> is profitable.

No, I got that.  I just think it's an unnecessary complication.

> I did this for two reasons.   One of them was that i was a little 
> concerned that HOST_BITS_PER_WIDE_INT on the smallest host was not as 
> big as the bitsize of word_word mode on the largest target (it could be 
> but this knowledge is above my pay grade).

Ah, yes, sorry, I meant an array of BITS_PER_WORD booleans.  I had
HOST_WIDE_INT on the brain after Mike's patch.

> The other reason was did not see this as a common operation and
> checking it on demand seemed like the winner.

But (at least after the other changes I mentioned) these overall
booleans cut out only a very small portion of find_decomposable_shift_zext.
I.e.:

  op = SET_SRC (set);
  if (GET_CODE (op) != ASHIFT
      && GET_CODE (op) != LSHIFTRT
      && GET_CODE (op) != ZERO_EXTEND)  <-- unified booleans checked here
    return 0;

  op_operand = XEXP (op, 0);
  if (!REG_P (SET_DEST (set)) || !REG_P (op_operand)
      || HARD_REGISTER_NUM_P (REGNO (SET_DEST (set)))
      || HARD_REGISTER_NUM_P (REGNO (op_operand))
      || !SCALAR_INT_MODE_P (GET_MODE (op)))
    return 0;

  if (GET_CODE (op) == ZERO_EXTEND)
    {
      if (GET_MODE (op_operand) != word_mode
	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
	return 0;
    }
  else /* left or right shift */
    {
      <--- specific booleans checked here
      if (!CONST_INT_P (XEXP (op, 1))
	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
	return 0;
    }

It seems better (and simpler) not to prejudge which shift amounts are
interesting and instead cache the "win or no win" flag for each value.

As I say, this is all in the context of this pass not being interesting
for modes where the split move is strictly more expensive than the
unified move, regardless of shift & zext costs.

Richard

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
       [not found]             ` <CACUk7=XVO=yHrPBFtAVfPxMtViYthtGGfuQSVGHNqHE7ibER0g@mail.gmail.com>
@ 2012-03-30 19:29               ` Kenneth Zadeck
  2012-03-30 22:38                 ` Ramana Radhakrishnan
  0 siblings, 1 reply; 38+ messages in thread
From: Kenneth Zadeck @ 2012-03-30 19:29 UTC (permalink / raw)
  To: Ramana Radhakrishnan, gcc-patches

ramana

i get the same failure on the trunk without my patch.

kenny

On 03/30/2012 07:36 AM, Ramana Radhakrishnan wrote:
> Hi
>
>
>> I have tested this on an x86_64 with both the force lowering on and off and
>> neither cause any regressions as well as extensive testing on my port.
>>
> So, just out of curiosity,  I decided to run this through a
> cross-build and noticed the following ICE with eglibc. I haven't had
> the time to debug this further but it does appear as though it could
> do with some more testing on some more ports and this probably needs
> some tuning as you say.
>
> $>  /work/cross-build/fsf/arm-none-linux-gnueabi/tools-lowersubregchanges-patched/bin/arm-none-linux-gnueabi-gcc
> -c -O2 ./besttry.c  -mfloat-abi=soft -march=armv5te
> ./besttry.c: In function ‘_IO_new_file_write’:
> ./besttry.c:36:1: internal compiler error: in get_loop_body, at cfgloop.c:831
>
> $>  cat besttry.c
> __extension__ typedef int __ssize_t;
> extern __thread int __libc_errno __attribute__ ((tls_model ("initial-exec")));
> struct _IO_FILE {
>    int _fileno;
>    int _flags2;
> };
> typedef struct _IO_FILE _IO_FILE;
> _IO_new_file_write (f,
>        data,
>        n)
>       _IO_FILE *f;
> {
>    __ssize_t to_do = n;
>    while (to_do>  0)
>      {
>        __ssize_t count =
>   (__builtin_expect (f->_flags2&  2, 0) ?
>    ({ unsigned int _sys_result = ({ register int _a1 asm ("r0"), _nr asm ("r7");
>          int _a3tmp = (int) ((to_do));
>          int _a2tmp = (int) ((data));
>          register int _a2 asm ("a2") = _a2tmp;
>          register int _a3 asm ("a3") = _a3tmp; _nr = ((0 + 4));
>          asm volatile ("swi	0x0	@ syscall " "SYS_ify(write)" : "=r"
> (_a1) : "r" (_nr) , "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _a1;
> });
>      if (__builtin_expect (((unsigned int) (_sys_result)>= 0xfffff001u), 0))
>        { (__libc_errno = ((-(_sys_result))));
>          _sys_result = (unsigned int) -1; }
>      (int) _sys_result; })
>    : __write (f->_fileno, data, to_do));
>        if (count<  0)
>   {
>     break;
>          }
>        to_do -= count;
>      }
> }
>
>
> Ramana
>
> On 29 March 2012 22:10, Kenneth Zadeck<zadeck@naturalbridge.com>  wrote:
>> This patch takes a different approach to fixing PR52543 than does the patch
>> in
>>
>> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>>
>> This patch transforms the lower-subreg pass(es) from unconditionally
>> splitting wide moves, zero extensions, and shifts, so that it now takes into
>> account the target specific costs and only does the transformations if it is
>> profitable.
>>
>> Unconditional splitting is a problem that not only occurs on the AVR but is
>> also a problem on the ARM NEON and my private port.  Furthermore, it is a
>> problem that is likely to occur on most modern larger machines since these
>> machines are more likely to have fast instructions for moving things that
>> are larger than word mode.
>>
>> At compiler initialization time, each mode that is larger that a word mode
>> is examined to determine if the cost of moving a value of that mode is less
>> expensive that inserting the proper number of word sided moves.   If it is
>> cheaper to split it up, a bit is set to allow moves of that mode to be
>> lowered.
>>
>> A similar analysis is made for the zero extensions and shifts except that
>> lower subreg had been (and is still limited to only breaking up these
>> operations if the target size was twice the size of word mode.)  Also, if
>> the analysis determines that there are no profitable transformations, the
>> pass exits quickly without doing any analysis.
>>
>> It is quite likely that most ports will have to be adjusted after this patch
>> is accepted.   For instance, the analysis discovers that there are no
>> profitable transformations to be performed on the x86-64.    Since this is
>> not my platform, I have no idea if these are the correct settings.   But the
>> pass uses the standard insn_rtx_cost interface and it is the port
>> maintainers responsibility to not lie to the optimization passes so this
>> extra work in stage one should be acceptable.
>>
>> I do know from a private conversation with Richard Sandiford, that mips
>> patches are likely forthcoming.
>>
>> There is preprocessor controlled code that prints out the cost analysis.
>>   Only a summary of this can go in the subregs dump file because the analysis
>> is called from backend_init_target and so the dump file is not available.
>>     But it is very useful to define LOG_COSTS when adjusting your port.
>>
>> There is also preprocessor code that forces all of the lowering operations
>> to marked as profitable.  This is useful in debugging the new logic.
>>
>> Both of these preprocessor symbols are documented at the top of the pass.
>>
>> I have tested this on an x86_64 with both the force lowering on and off and
>> neither cause any regressions as well as extensive testing on my port.
>>
>> Ok to commit?
>>
>> Kenny
>>
>> 2012-03-29 Kenneth Zadeck<zadeck@naturalbridge.com>
>>
>>     * toplev.c (backend_init_target): Call initializer for lower-subreg pass.
>>
>>     * lower-subreg.c (move_modes_to_split, splitting_ashift,
>> splitting_lshiftrt)
>>     splitting_zext, splitting_some_shifts, twice_word_mode,
>>   something_to_do,
>>     word_mode_move_cost, move_zero_cost): New static vars.
>>     (compute_move_cost, profitable_shift_p, init_lower_subreg): New
>>     functions.
>>     (find_pseudo_copy, resolve_simple_move): Added code to only split based
>> on costs.
>>     (find_decomposable_subregs): Added code to mark as decomposable
>>     moves that are not profitable.
>>     (find_decomposable_shift_zext): Added code to only decompose
>>     shifts and zext if profitable.
>>     (resolve_shift_zext): Added comment.
>>     (decompose_multiword_subregs): Dump list of profitable
>>     transformations.  Add code to skip non profitable transformations.
>>
>>     *rtl.h(init_lower_subreg): Added declaration.
>>
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-30 19:29               ` Kenneth Zadeck
@ 2012-03-30 22:38                 ` Ramana Radhakrishnan
  0 siblings, 0 replies; 38+ messages in thread
From: Ramana Radhakrishnan @ 2012-03-30 22:38 UTC (permalink / raw)
  To: Kenneth Zadeck; +Cc: gcc-patches

On 30 March 2012 20:29, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
> ramana
>
> i get the same failure on the trunk without my patch.
>

In which case I apologise and will file a bug report separately. I
should really have checked :( .

Ramana

>
> kenny
>
> On 03/30/2012 07:36 AM, Ramana Radhakrishnan wrote:
>>
>> Hi
>>
>>
>>> I have tested this on an x86_64 with both the force lowering on and off
>>> and
>>> neither cause any regressions as well as extensive testing on my port.
>>>
>> So, just out of curiosity,  I decided to run this through a
>> cross-build and noticed the following ICE with eglibc. I haven't had
>> the time to debug this further but it does appear as though it could
>> do with some more testing on some more ports and this probably needs
>> some tuning as you say.
>>
>> $>
>>  /work/cross-build/fsf/arm-none-linux-gnueabi/tools-lowersubregchanges-patched/bin/arm-none-linux-gnueabi-gcc
>> -c -O2 ./besttry.c  -mfloat-abi=soft -march=armv5te
>> ./besttry.c: In function ‘_IO_new_file_write’:
>> ./besttry.c:36:1: internal compiler error: in get_loop_body, at
>> cfgloop.c:831
>>
>> $>  cat besttry.c
>> __extension__ typedef int __ssize_t;
>> extern __thread int __libc_errno __attribute__ ((tls_model
>> ("initial-exec")));
>> struct _IO_FILE {
>>   int _fileno;
>>   int _flags2;
>> };
>> typedef struct _IO_FILE _IO_FILE;
>> _IO_new_file_write (f,
>>       data,
>>       n)
>>      _IO_FILE *f;
>> {
>>   __ssize_t to_do = n;
>>   while (to_do>  0)
>>     {
>>       __ssize_t count =
>>  (__builtin_expect (f->_flags2&  2, 0) ?
>>
>>   ({ unsigned int _sys_result = ({ register int _a1 asm ("r0"), _nr asm
>> ("r7");
>>         int _a3tmp = (int) ((to_do));
>>         int _a2tmp = (int) ((data));
>>         register int _a2 asm ("a2") = _a2tmp;
>>         register int _a3 asm ("a3") = _a3tmp; _nr = ((0 + 4));
>>         asm volatile ("swi     0x0     @ syscall " "SYS_ify(write)" : "=r"
>> (_a1) : "r" (_nr) , "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _a1;
>> });
>>     if (__builtin_expect (((unsigned int) (_sys_result)>= 0xfffff001u),
>> 0))
>>       { (__libc_errno = ((-(_sys_result))));
>>         _sys_result = (unsigned int) -1; }
>>     (int) _sys_result; })
>>   : __write (f->_fileno, data, to_do));
>>       if (count<  0)
>>  {
>>    break;
>>         }
>>       to_do -= count;
>>     }
>> }
>>
>>
>> Ramana
>>
>> On 29 March 2012 22:10, Kenneth Zadeck<zadeck@naturalbridge.com>  wrote:
>>>
>>> This patch takes a different approach to fixing PR52543 than does the
>>> patch
>>> in
>>>
>>> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>>>
>>> This patch transforms the lower-subreg pass(es) from unconditionally
>>> splitting wide moves, zero extensions, and shifts, so that it now takes
>>> into
>>> account the target specific costs and only does the transformations if it
>>> is
>>> profitable.
>>>
>>> Unconditional splitting is a problem that not only occurs on the AVR but
>>> is
>>> also a problem on the ARM NEON and my private port.  Furthermore, it is a
>>> problem that is likely to occur on most modern larger machines since
>>> these
>>> machines are more likely to have fast instructions for moving things that
>>> are larger than word mode.
>>>
>>> At compiler initialization time, each mode that is larger that a word
>>> mode
>>> is examined to determine if the cost of moving a value of that mode is
>>> less
>>> expensive that inserting the proper number of word sided moves.   If it
>>> is
>>> cheaper to split it up, a bit is set to allow moves of that mode to be
>>> lowered.
>>>
>>> A similar analysis is made for the zero extensions and shifts except that
>>> lower subreg had been (and is still limited to only breaking up these
>>> operations if the target size was twice the size of word mode.)  Also, if
>>> the analysis determines that there are no profitable transformations, the
>>> pass exits quickly without doing any analysis.
>>>
>>> It is quite likely that most ports will have to be adjusted after this
>>> patch
>>> is accepted.   For instance, the analysis discovers that there are no
>>> profitable transformations to be performed on the x86-64.    Since this
>>> is
>>> not my platform, I have no idea if these are the correct settings.   But
>>> the
>>> pass uses the standard insn_rtx_cost interface and it is the port
>>> maintainers responsibility to not lie to the optimization passes so this
>>> extra work in stage one should be acceptable.
>>>
>>> I do know from a private conversation with Richard Sandiford, that mips
>>> patches are likely forthcoming.
>>>
>>> There is preprocessor controlled code that prints out the cost analysis.
>>>  Only a summary of this can go in the subregs dump file because the
>>> analysis
>>> is called from backend_init_target and so the dump file is not available.
>>>    But it is very useful to define LOG_COSTS when adjusting your port.
>>>
>>> There is also preprocessor code that forces all of the lowering
>>> operations
>>> to marked as profitable.  This is useful in debugging the new logic.
>>>
>>> Both of these preprocessor symbols are documented at the top of the pass.
>>>
>>> I have tested this on an x86_64 with both the force lowering on and off
>>> and
>>> neither cause any regressions as well as extensive testing on my port.
>>>
>>> Ok to commit?
>>>
>>> Kenny
>>>
>>> 2012-03-29 Kenneth Zadeck<zadeck@naturalbridge.com>
>>>
>>>    * toplev.c (backend_init_target): Call initializer for lower-subreg
>>> pass.
>>>
>>>    * lower-subreg.c (move_modes_to_split, splitting_ashift,
>>> splitting_lshiftrt)
>>>    splitting_zext, splitting_some_shifts, twice_word_mode,
>>>  something_to_do,
>>>    word_mode_move_cost, move_zero_cost): New static vars.
>>>    (compute_move_cost, profitable_shift_p, init_lower_subreg): New
>>>    functions.
>>>    (find_pseudo_copy, resolve_simple_move): Added code to only split
>>> based
>>> on costs.
>>>    (find_decomposable_subregs): Added code to mark as decomposable
>>>    moves that are not profitable.
>>>    (find_decomposable_shift_zext): Added code to only decompose
>>>    shifts and zext if profitable.
>>>    (resolve_shift_zext): Added comment.
>>>    (decompose_multiword_subregs): Dump list of profitable
>>>    transformations.  Add code to skip non profitable transformations.
>>>
>>>    *rtl.h(init_lower_subreg): Added declaration.
>>>
>>>
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-30 15:41                 ` Richard Sandiford
@ 2012-03-31 16:21                   ` Kenneth Zadeck
  2012-04-03 13:53                     ` Richard Sandiford
  0 siblings, 1 reply; 38+ messages in thread
From: Kenneth Zadeck @ 2012-03-31 16:21 UTC (permalink / raw)
  To: gcc-patches, Ian Lance Taylor, Mike Stump, Kenneth Zadeck, avr,
	ramana.radhakrishnan, rdsandiford

[-- Attachment #1: Type: text/plain, Size: 909 bytes --]

New version of the patch, with all of Richard Sandiford's comments 
applied and retested.

Ok for commit?

Kenny

2012-03-31 Kenneth Zadeck <zadeck@naturalbridge.com>

     * toplev.c (backend_init_target): Call initializer for lower-subreg 
pass.

     * lower-subreg.c (target_info): New static var.
     (compute_move_cost, profitable_shift_p, init_lower_subreg): New
     functions.
     (find_pseudo_copy, resolve_simple_move): Added code to only split 
based on costs.
     (find_decomposable_subregs): Added code to mark as decomposable
     moves that are not profitable.
     (find_decomposable_shift_zext): Added code to only decompose
     shifts and zext if profitable.
     (resolve_shift_zext): Added comment.
     (decompose_multiword_subregs): Dump list of profitable
     transformations.  Add code to skip non profitable transformations.

     *rtl.h(init_lower_subreg): Added declaration.


[-- Attachment #2: lower2.diff --]
[-- Type: text/x-patch, Size: 15976 bytes --]

Index: toplev.c
===================================================================
--- toplev.c	(revision 186034)
+++ toplev.c	(working copy)
@@ -1660,6 +1660,7 @@ backend_init_target (void)
   /* rtx_cost is mode-dependent, so cached values need to be recomputed
      on a mode change.  */
   init_expmed ();
+  init_lower_subreg ();
 
   /* We may need to recompute regno_save_code[] and regno_restore_code[]
      after a mode change as well.  */
Index: lower-subreg.c
===================================================================
--- lower-subreg.c	(revision 186034)
+++ lower-subreg.c	(working copy)
@@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.
 #include "tree-pass.h"
 #include "df.h"
 
+
 #ifdef STACK_GROWS_DOWNWARD
 # undef STACK_GROWS_DOWNWARD
 # define STACK_GROWS_DOWNWARD 1
@@ -52,10 +53,36 @@ DEF_VEC_P (bitmap);
 DEF_VEC_ALLOC_P (bitmap,heap);
 
 /* Decompose multi-word pseudo-registers into individual
-   pseudo-registers when possible.  This is possible when all the uses
-   of a multi-word register are via SUBREG, or are copies of the
-   register to another location.  Breaking apart the register permits
-   more CSE and permits better register allocation.  */
+   pseudo-registers when possible and profitable.  This is possible
+   when all the uses of a multi-word register are via SUBREG, or are
+   copies of the register to another location.  Breaking apart the
+   register permits more CSE and permits better register allocation.
+   This is profitable if the machine does not have move instructions
+   to do this.  
+
+   This pass only splits moves with modes that are wider than
+   word_mode and ASHIFTs, LSHIFTRTs and ZERO_EXTENDs with integer
+   modes that are twice the width of word_mode.  The latter could be
+   generalized if there was a need to do this, but the trend in
+   architectures is to not need this.
+
+   There are two useful preprocessor defines for use by maintainers:  
+
+   #define LOG_COSTS
+
+   if you wish to see the actual cost estimates that are being used
+   for each mode wider than word mode and the cost estimates for zero
+   extension and the shifts.   This can be useful when port maintainers 
+   are tuning insn rtx costs.
+
+   #define FORCE_LOWERING
+
+   if you wish to test the pass with all the transformation forced on.
+   This can be useful for finding bugs in the transformations.
+
+   Both of these should not be enabled at the same time. */
+
+#define FORCE_LOWERING
 
 /* Bit N in this bitmap is set if regno N is used in a context in
    which we can decompose it.  */
@@ -75,8 +102,188 @@ static bitmap subreg_context;
    copy from reg M to reg N.  */
 static VEC(bitmap,heap) *reg_copy_graph;
 
-/* Return whether X is a simple object which we can take a word_mode
-   subreg of.  */
+static struct {
+
+  /* This pass can transform 4 different operations: move, ashift,
+     lshiftrt, and zero_extend.  There is a boolean vector for move
+     splitting that is indexed by mode and is true for each mode that is
+     to have its copies split.  The other three operations are only done
+     for one mode so they are only controlled by a single boolean  .*/
+  bool move_modes_to_split[MAX_MACHINE_MODE];
+  
+  /* Other splitting operations to be done on the this platform.  */
+  bool splitting_ashift[MAX_BITS_PER_WORD];
+  bool splitting_lshiftrt[MAX_BITS_PER_WORD];
+  bool splitting_zext;
+
+  bool something_to_do;
+
+  /* Precomputed cost values used to determine if lowering shift
+     operations is profitable.  */ 
+  int word_mode_move_cost;
+  int move_zero_cost;
+} target_info;
+
+/* See what the move cost is.  */
+static int 
+compute_move_cost (enum machine_mode mode)
+{
+  rtx src = gen_rtx_REG (mode, FIRST_PSEUDO_REGISTER);
+  rtx trg = gen_rtx_REG (mode, FIRST_PSEUDO_REGISTER + 1);
+  rtx pat = gen_rtx_SET (VOIDmode, trg, src);
+
+  return insn_rtx_cost (pat, true);
+}
+
+
+/* Return true if it is profitable to lower and shift by SHIFT_AMT.
+   CODE can be either ASHIFT or LSHIFTRT.  */
+static bool
+profitable_shift_p (enum rtx_code code, int shift_amt)
+{
+  rtx trg = gen_rtx_REG (GET_MODE_WIDER_MODE (word_mode), FIRST_PSEUDO_REGISTER);
+  rtx src = gen_rtx_REG (GET_MODE_WIDER_MODE (word_mode), FIRST_PSEUDO_REGISTER + 1);
+  int wide_cost 
+    = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, 
+				  gen_rtx_fmt_ee (code, GET_MODE_WIDER_MODE (word_mode), 
+						  src, GEN_INT (shift_amt))),
+		     true);
+#ifdef FORCE_LOWERING
+  return true;
+#endif
+
+#ifdef LOG_COSTS
+  fprintf (stderr, "shift code %d, shift amt %d, wide_cost %d\n", 
+	   code, shift_amt, wide_cost);
+#endif
+  if (shift_amt == BITS_PER_WORD)
+    return wide_cost 
+      >= target_info.word_mode_move_cost + target_info.move_zero_cost;
+  else
+    {
+      int narrow_cost;
+
+      trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+      src = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER + 1);
+      narrow_cost 
+	= insn_rtx_cost (gen_rtx_SET 
+			 (VOIDmode, trg, 
+			  gen_rtx_fmt_ee (code, word_mode, 
+					  src, 
+					  GEN_INT (shift_amt - BITS_PER_WORD))),
+				   true);
+
+#ifdef LOG_COSTS
+      fprintf (stderr, "narrow_cost %d\n", narrow_cost);
+#endif
+      return wide_cost > narrow_cost + target_info.move_zero_cost;
+    }
+}
+
+
+/* Initialize pass once per execution.  This envolves determining
+   which operations on the machine are profitable.  If none are found,
+   then the pass just returns when called.  */
+
+void
+init_lower_subreg (void)
+{
+  rtx trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+  rtx pat;
+  unsigned int i;
+  int factor;
+
+  memset (&target_info, 0, sizeof target_info);
+
+  target_info.word_mode_move_cost = compute_move_cost (word_mode);
+  target_info.move_zero_cost 
+    = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, 
+				  CONST0_RTX (word_mode)), true);
+
+#ifdef LOG_COSTS
+  fprintf (stderr, "word mode move cost %d\n", target_info.word_mode_move_cost);
+  fprintf (stderr, "move 0 cost %d\n", target_info.move_zero_cost);
+#endif
+  for (i = 0; i < MAX_MACHINE_MODE; i++) 
+    {
+      int t;
+      factor = GET_MODE_SIZE (i) / UNITS_PER_WORD;
+
+      if (factor <= 1) 
+	continue;
+
+      t = compute_move_cost ((enum machine_mode) i);
+
+#ifdef LOG_COSTS
+      fprintf (stderr, "mode %s, move cost %d, factor %d\n", 
+	       mode_name[i], t, factor);
+#endif
+      if (t >= target_info.word_mode_move_cost * factor)
+	{
+	  target_info.move_modes_to_split[i] = true;
+	  target_info.something_to_do = true;
+	}
+#ifdef FORCE_LOWERING
+      target_info.move_modes_to_split[i] = true;
+      target_info.something_to_do = true;
+#endif
+    }
+
+  /* For the moves and shifts, the only case that is checked is one
+     where the mode of the target is an integer mode twice the width
+     of the word_mode. 
+
+     If it is not profitable to split a double word move then do not
+     even consider the shifts or the zero extension.  */
+  if (target_info.move_modes_to_split[(int) GET_MODE_WIDER_MODE (word_mode)])
+    {
+      /* The only case here to check to see if moving the upper part with a
+	 zero is cheaper than doing the zext itself.  */
+      pat = gen_rtx_SET (VOIDmode, trg, 
+			 gen_rtx_ZERO_EXTEND (GET_MODE_WIDER_MODE (word_mode), 
+					      gen_reg_rtx (word_mode)));
+      
+      if (insn_rtx_cost (pat, true) 
+	  >= target_info.word_mode_move_cost + target_info.move_zero_cost) 
+	{
+	  target_info.splitting_zext = true;
+	  target_info.something_to_do = true;
+	}
+      
+#ifdef FORCE_LOWERING
+      target_info.splitting_zext = true;
+      target_info.something_to_do = true;
+#endif
+      /* For the shifts there are three shift amounts that need to be
+	 checked: word_mode, word_mode + 1 and word_mode * 1.5.  The first
+	 of these can be done without a shift.  The second and third case
+	 are the same on large machines but may be different on small
+	 machines that special case shift 1 and 2.  If any of these are
+	 found to be useful, then we set the flags to consider those cases
+	 and when the actual shift amount is known, redo the cost
+	 calculation.  */
+      
+      for (i = 0; i < BITS_PER_WORD; i++)
+	{
+	  bool p = profitable_shift_p (ASHIFT, BITS_PER_WORD + i);
+
+	  target_info.splitting_ashift[i] = p;
+      	  target_info.something_to_do |= p;
+      
+	  p = profitable_shift_p (LSHIFTRT, BITS_PER_WORD + i);
+
+	  target_info.splitting_lshiftrt[i] = p;
+      	  target_info.something_to_do |= p;
+
+#ifdef FORCE_LOWERING
+	  target_info.splitting_ashift[i] = true;
+	  target_info.splitting_lshiftrt[i] = true;
+	  target_info.something_to_do = true;
+#endif
+	}
+    }
+}
+
 
 static bool
 simple_move_operand (rtx x)
@@ -173,7 +380,7 @@ find_pseudo_copy (rtx set)
   if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
     return false;
 
-  if (GET_MODE_SIZE (GET_MODE (dest)) <= UNITS_PER_WORD)
+  if (!target_info.move_modes_to_split[(int) GET_MODE (dest)])
     return false;
 
   b = VEC_index (bitmap, reg_copy_graph, rs);
@@ -335,6 +542,11 @@ find_decomposable_subregs (rtx *px, void
 		bitmap_set_bit (decomposable_context, regno);
 	      break;
 	    case SIMPLE_MOVE:
+	      /* If this is marked as a simple move with a mode this
+		 large, it is because find_pseudo_copy returned false
+		 because this was not a mode we wanted to split.  */
+	      if (!target_info.move_modes_to_split[(int) GET_MODE (x)])
+		bitmap_set_bit (non_decomposable_context, regno);
 	      break;
 	    default:
 	      gcc_unreachable ();
@@ -668,7 +887,7 @@ resolve_simple_move (rtx set, rtx insn)
   orig_mode = GET_MODE (dest);
 
   words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
-  if (words <= 1)
+  if (!target_info.move_modes_to_split[(int) GET_MODE (dest)])
     return insn;
 
   start_sequence ();
@@ -941,17 +1160,29 @@ find_decomposable_shift_zext (rtx insn)
   rtx set;
   rtx op;
   rtx op_operand;
+  enum machine_mode mode;
 
   set = single_set (insn);
   if (!set)
     return 0;
 
   op = SET_SRC (set);
-  if (GET_CODE (op) != ASHIFT
-      && GET_CODE (op) != LSHIFTRT
-      && GET_CODE (op) != ZERO_EXTEND)
+  mode = GET_MODE (op);
+
+  if (mode != GET_MODE_WIDER_MODE (word_mode))
     return 0;
 
+  switch (GET_CODE (op)) 
+    {
+    case ASHIFT:
+    case LSHIFTRT:
+    case ZERO_EXTEND:
+      break;
+      
+    default:
+      return 0;
+    }
+
   op_operand = XEXP (op, 0);
   if (!REG_P (SET_DEST (set)) || !REG_P (op_operand)
       || HARD_REGISTER_NUM_P (REGNO (SET_DEST (set)))
@@ -959,25 +1190,51 @@ find_decomposable_shift_zext (rtx insn)
       || !SCALAR_INT_MODE_P (GET_MODE (op)))
     return 0;
 
-  if (GET_CODE (op) == ZERO_EXTEND)
-    {
-      if (GET_MODE (op_operand) != word_mode
-	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
-	return 0;
-    }
-  else /* left or right shift */
+  switch (GET_CODE (op)) 
     {
-      if (!CONST_INT_P (XEXP (op, 1))
-	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
-	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
+    case ASHIFT:
+      {
+	int shift_val;
+	
+	if (!CONST_INT_P (XEXP (op, 1))
+	    || INTVAL (XEXP (op, 1)) < BITS_PER_WORD)
+	  return 0;
+	
+	shift_val = INTVAL (XEXP (op, 1));
+	if (target_info.splitting_ashift[shift_val - BITS_PER_WORD])
+	  return 0;
+	
+	bitmap_set_bit (decomposable_context, REGNO (op_operand));
+	break;
+      }
+
+    case LSHIFTRT:
+      {
+	int shift_val;
+	
+	if (!CONST_INT_P (XEXP (op, 1))
+	    || INTVAL (XEXP (op, 1)) < BITS_PER_WORD)
+	  return 0;
+	
+	shift_val = INTVAL (XEXP (op, 1));
+	if (target_info.splitting_lshiftrt[shift_val - BITS_PER_WORD])
+	  return 0;
+	
+	bitmap_set_bit (decomposable_context, REGNO (op_operand));
+	break;
+      }
+
+    case ZERO_EXTEND:
+      if (GET_MODE (op_operand) != word_mode || !target_info.splitting_zext)
 	return 0;
+      break;
+
+    default:
+      gcc_unreachable ();
     }
 
   bitmap_set_bit (decomposable_context, REGNO (SET_DEST (set)));
 
-  if (GET_CODE (op) != ZERO_EXTEND)
-    bitmap_set_bit (decomposable_context, REGNO (op_operand));
-
   return 1;
 }
 
@@ -1008,6 +1265,8 @@ resolve_shift_zext (rtx insn)
 
   op_operand = XEXP (op, 0);
 
+  /* We can tear this operation apart only if the regs were already
+     torn apart.  */ 
   if (!resolve_reg_p (SET_DEST (set)) && !resolve_reg_p (op_operand))
     return NULL_RTX;
 
@@ -1083,8 +1351,57 @@ decompose_multiword_subregs (void)
   unsigned int max;
   basic_block bb;
 
-  if (df)
-    df_set_flags (DF_DEFER_INSN_RESCAN);
+  if (dump_file)
+    {
+      unsigned int i;
+      const char *sep;
+
+      for (i = 0; i < MAX_MACHINE_MODE; i++) 
+	{
+	  if (GET_MODE_SIZE (i) > UNITS_PER_WORD)
+	    fprintf (dump_file, "%s mode %s for copy lowering.\n", 
+		     target_info.move_modes_to_split[i] ? "Splitting" : "Skipping", mode_name[i]);
+	}
+
+      fprintf (dump_file, "%s mode %s for zero_extend lowering.\n", 
+	       target_info.splitting_zext ? "Splitting" : "Skipping", 
+	       mode_name[GET_MODE_WIDER_MODE (word_mode)]);
+
+      
+      fprintf (dump_file, "Splitting mode %s for ashift lowering with shift amounts = ", 
+	       mode_name[GET_MODE_WIDER_MODE (word_mode)]);
+      sep = "";
+      for (i = 0; i < BITS_PER_WORD; i++)
+	{
+	  if (target_info.splitting_ashift[i])
+	    {
+	      fprintf (dump_file, "%s%d", sep, i + BITS_PER_WORD);
+	      sep = ",";
+	    }
+	}
+
+      fprintf (dump_file, "\nSplitting mode %s for lshiftrt lowering with shift amounts = ", 
+	       mode_name[GET_MODE_WIDER_MODE (word_mode)]);
+      sep = "";
+      for (i = 0; i < BITS_PER_WORD; i++)
+	{
+	  if (target_info.splitting_lshiftrt[i])
+	    {
+	      fprintf (dump_file, "%s%d", sep, i + BITS_PER_WORD);
+	      sep = ",";
+	    }
+	}
+
+      fprintf (dump_file, "\n");
+
+      if (!target_info.something_to_do)
+	fprintf (dump_file, "Nothing to lower for port.\n\n");
+    }
+
+
+  /* Check if this target even has any modes to consider lowering.   */
+  if (!target_info.something_to_do)
+    return;
 
   max = max_reg_num ();
 
@@ -1094,24 +1411,39 @@ decompose_multiword_subregs (void)
      all the insns.  */
   {
     unsigned int i;
+    bool useful_modes_seen = false;
 
     for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
       {
-	if (regno_reg_rtx[i] != NULL
-	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
-	  break;
+	if (regno_reg_rtx[i] != NULL)
+	  {
+	    enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
+	    if (regno_reg_rtx[i] != NULL && target_info.move_modes_to_split [mode])
+	      {
+		useful_modes_seen = true;
+		break;
+	      }
+	  }
+      }
+
+    if (!useful_modes_seen)
+      {
+	if (dump_file)
+	  fprintf (dump_file, "Nothing to lower in this function.\n");
+	return;
       }
-    if (i == max)
-      return;
   }
 
-  if (df)
-    run_word_dce ();
+  if (df) 
+    {
+      df_set_flags (DF_DEFER_INSN_RESCAN);
+      run_word_dce ();
+    }
 
-  /* FIXME: When the dataflow branch is merged, we can change this
-     code to look for each multi-word pseudo-register and to find each
-     insn which sets or uses that register.  That should be faster
-     than scanning all the insns.  */
+  /* FIXME: It may be possible to change this code to look for each
+     multi-word pseudo-register and to find each insn which sets or
+     uses that register.  That should be faster than scanning all the
+     insns.  */
 
   decomposable_context = BITMAP_ALLOC (NULL);
   non_decomposable_context = BITMAP_ALLOC (NULL);
Index: rtl.h
===================================================================
--- rtl.h	(revision 186034)
+++ rtl.h	(working copy)
@@ -2523,6 +2523,9 @@ extern void init_expmed (void);
 extern void expand_inc (rtx, rtx);
 extern void expand_dec (rtx, rtx);
 
+/* In lower-subreg.c */
+extern void init_lower_subreg (void);
+
 /* In gcse.c */
 extern bool can_copy_p (enum machine_mode);
 extern bool can_assign_to_reg_without_clobbers_p (rtx);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-31 16:21                   ` Kenneth Zadeck
@ 2012-04-03 13:53                     ` Richard Sandiford
  2012-04-03 15:33                       ` Kenneth Zadeck
  2012-04-03 16:22                       ` [C Patch]: pr52543 Ian Lance Taylor
  0 siblings, 2 replies; 38+ messages in thread
From: Richard Sandiford @ 2012-04-03 13:53 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: gcc-patches, Ian Lance Taylor, Mike Stump, Kenneth Zadeck, avr,
	ramana.radhakrishnan

Kenneth Zadeck <zadeck@naturalbridge.com> writes:
> +#define FORCE_LOWERING

Don't think you meant to keep this.

> -/* Return whether X is a simple object which we can take a word_mode
> -   subreg of.  */
> +static struct {
> +
> +  /* This pass can transform 4 different operations: move, ashift,
> +     lshiftrt, and zero_extend.  There is a boolean vector for move
> +     splitting that is indexed by mode and is true for each mode that is
> +     to have its copies split.  The other three operations are only done
> +     for one mode so they are only controlled by a single boolean  .*/

Comment out of date.

> +  bool move_modes_to_split[MAX_MACHINE_MODE];
> +  
> +  /* Other splitting operations to be done on the this platform.  */
> +  bool splitting_ashift[MAX_BITS_PER_WORD];
> +  bool splitting_lshiftrt[MAX_BITS_PER_WORD];
> +  bool splitting_zext;
> +
> +  bool something_to_do;
> +
> +  /* Precomputed cost values used to determine if lowering shift
> +     operations is profitable.  */ 
> +  int word_mode_move_cost;
> +  int move_zero_cost;
> +} target_info;

This isn't right.  The point is that there is supposed to be one instance
of this structure per target, along the lines of target_expmed.

Yes, it's clunky.  Hopefully the move to C++ will allow a more elegant
approach.

> +/* Return true if it is profitable to lower and shift by SHIFT_AMT.
> +   CODE can be either ASHIFT or LSHIFTRT.  */
> +static bool
> +profitable_shift_p (enum rtx_code code, int shift_amt)
> +{
> +  rtx trg = gen_rtx_REG (GET_MODE_WIDER_MODE (word_mode), FIRST_PSEUDO_REGISTER);
> +  rtx src = gen_rtx_REG (GET_MODE_WIDER_MODE (word_mode), FIRST_PSEUDO_REGISTER + 1);
> +  int wide_cost 
> +    = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, 
> +				  gen_rtx_fmt_ee (code, GET_MODE_WIDER_MODE (word_mode), 
> +						  src, GEN_INT (shift_amt))),
> +		     true);

Lines still too long.  (You really need an 80-char-wide editor.
It's the future.)

More importantly, I'd not noticed last time that you're glossing
over the size/speed choice.  That's important, and in general
is made on a per-bb basis.  So I think we need to cache both
the size and speed costs.

> +#ifdef FORCE_LOWERING
> +  return true;
> +#endif
> +
> +#ifdef LOG_COSTS
> +  fprintf (stderr, "shift code %d, shift amt %d, wide_cost %d\n", 
> +	   code, shift_amt, wide_cost);
> +#endif
> +  if (shift_amt == BITS_PER_WORD)
> +    return wide_cost 
> +      >= target_info.word_mode_move_cost + target_info.move_zero_cost;
> +  else
> +    {
> +      int narrow_cost;
> +
> +      trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
> +      src = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER + 1);
> +      narrow_cost 
> +	= insn_rtx_cost (gen_rtx_SET 
> +			 (VOIDmode, trg, 
> +			  gen_rtx_fmt_ee (code, word_mode, 
> +					  src, 
> +					  GEN_INT (shift_amt - BITS_PER_WORD))),
> +				   true);
> +
> +#ifdef LOG_COSTS
> +      fprintf (stderr, "narrow_cost %d\n", narrow_cost);
> +#endif
> +      return wide_cost > narrow_cost + target_info.move_zero_cost;

Should be ">=".

I think it'd be better to only create the rtx once and simply replace
the shift amount for each test.

> +/* Initialize pass once per execution.  This envolves determining

once per target

> +   which operations on the machine are profitable.  If none are found,
> +   then the pass just returns when called.  */
> +
> +void
> +init_lower_subreg (void)
> +{
> +  rtx trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
> +  rtx pat;
> +  unsigned int i;
> +  int factor;
> +
> +  memset (&target_info, 0, sizeof target_info);
> +
> +  target_info.word_mode_move_cost = compute_move_cost (word_mode);
> +  target_info.move_zero_cost 
> +    = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg, 
> +				  CONST0_RTX (word_mode)), true);
> +
> +#ifdef LOG_COSTS
> +  fprintf (stderr, "word mode move cost %d\n", target_info.word_mode_move_cost);
> +  fprintf (stderr, "move 0 cost %d\n", target_info.move_zero_cost);
> +#endif
> +  for (i = 0; i < MAX_MACHINE_MODE; i++) 
> +    {
> +      int t;
> +      factor = GET_MODE_SIZE (i) / UNITS_PER_WORD;
> +
> +      if (factor <= 1) 
> +	continue;
> +
> +      t = compute_move_cost ((enum machine_mode) i);
> +
> +#ifdef LOG_COSTS
> +      fprintf (stderr, "mode %s, move cost %d, factor %d\n", 
> +	       mode_name[i], t, factor);
> +#endif
> +      if (t >= target_info.word_mode_move_cost * factor)
> +	{
> +	  target_info.move_modes_to_split[i] = true;
> +	  target_info.something_to_do = true;
> +	}
> +#ifdef FORCE_LOWERING
> +      target_info.move_modes_to_split[i] = true;
> +      target_info.something_to_do = true;
> +#endif
> +    }

The code is a bit inconsistent about whether the costs are calculated
when FORCE_LOWERING is true.  Here you calculate anyway, but in the
shift case you skip.

> +
> +  /* For the moves and shifts, the only case that is checked is one
> +     where the mode of the target is an integer mode twice the width
> +     of the word_mode. 
> +
> +     If it is not profitable to split a double word move then do not
> +     even consider the shifts or the zero extension.  */
> +  if (target_info.move_modes_to_split[(int) GET_MODE_WIDER_MODE (word_mode)])

This should be GET_MODE_2XWIDER_MODE.  Same elsewhere.  (Didn't notice
that last time either, sorry.)

> +    {
> +      /* The only case here to check to see if moving the upper part with a
> +	 zero is cheaper than doing the zext itself.  */
> +      pat = gen_rtx_SET (VOIDmode, trg, 
> +			 gen_rtx_ZERO_EXTEND (GET_MODE_WIDER_MODE (word_mode), 
> +					      gen_reg_rtx (word_mode)));
> +
> +      if (insn_rtx_cost (pat, true) 
> +	  >= target_info.word_mode_move_cost + target_info.move_zero_cost) 
> +	{
> +	  target_info.splitting_zext = true;
> +	  target_info.something_to_do = true;
> +	}

No costs logged here.  Here and in the shifting code, the someting_to_do
assignments should now be redundant.

> +      
> +#ifdef FORCE_LOWERING
> +      target_info.splitting_zext = true;
> +      target_info.something_to_do = true;
> +#endif
> +      /* For the shifts there are three shift amounts that need to be
> +	 checked: word_mode, word_mode + 1 and word_mode * 1.5.  The first
> +	 of these can be done without a shift.  The second and third case
> +	 are the same on large machines but may be different on small
> +	 machines that special case shift 1 and 2.  If any of these are
> +	 found to be useful, then we set the flags to consider those cases
> +	 and when the actual shift amount is known, redo the cost
> +	 calculation.  */

This comment no longer applies.

> @@ -1094,24 +1411,39 @@ decompose_multiword_subregs (void)
>       all the insns.  */
>    {
>      unsigned int i;
> +    bool useful_modes_seen = false;
>  
>      for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
>        {
> -	if (regno_reg_rtx[i] != NULL
> -	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
> -	  break;
> +	if (regno_reg_rtx[i] != NULL)
> +	  {
> +	    enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
> +	    if (regno_reg_rtx[i] != NULL && target_info.move_modes_to_split [mode])

redundant test.

You're probably heartily sick of this patch by now, so here's my
attempt to do the above.  I also made the macros 0/1.  I think
that's generally preferred, because then the conditional code
gets compile testing even when the features are off.  It also
means that LOG_COSTS and FORCE_LOWERING can be turned on at
the same time.

Also, I still think simple_move is the right place to check for
profitable moves.  The comment change I had to make there seems
to reinforce this (the comment was no longer correct either way).

What do you think?  The patch looks OK to me with these changes,
but I'd like to leave it for 48 hours to see if Ian has any comments.

Bootstrapped & regression-tested on x86_64-linux-gnu.  Also tested by
building cc1 on mips64-elf.

Thanks,
Richard


gcc/
2012-04-03  Kenneth Zadeck  <zadeck@naturalbridge.com>
	    Richard Sandiford  <r.sandiford@uk.ibm.com>

	* Makefile.in (lower-subreg.o, target-globals.o): Depend on
	lower-subreg.h.
	* lower-subreg.h: New file.
	* target-globals.h (this_target_lower_subreg): Declare.
	(target_globals): Add lower_subreg;
	(restore_target_globals): Restore this_target_lower_subreg.
	* target-globals.c: Include it.
	(default_target_globals): Add default_target_lower_subreg.
	(save_target_globals): Initialize target_lower_subreg.
	* rtl.h (init_lower_subreg): Added declaration.
	* toplev.c (backend_init_target): Call initializer for lower-subreg
	pass.
	* lower-subreg.c (LOG_COSTS, FORCE_LOWERING): New macros.
	(default_target_lower_subreg): New variable.
	(this_target_lower_subreg): Likewise.
	(twice_word_mode, choices): New macros.
	(shift_cost, compute_splitting_shift, compute_costs)
	(init_lower_subreg): New functions.
	(resolve_simple_move): Add speed_p argument.  Check choices.
	(find_pseudo_copy): Don't check the mode size here.
	(resolve_simple_move): Assert the mode size.
	(find_decomposable_shift_zext): Add speed_p argument and return
	a bool.  Check choices.
	(resolve_shift_zext): Add comment.
	(dump_shift_choices, dump_choices): New functions.
	(decompose_multiword_subregs): Dump list of profitable
	transformations.  Add code to skip non profitable transformations.
	Update calls to simple_move and find_decomposable_shift_zext.

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2012-04-02 11:56:03.000000000 +0100
+++ gcc/Makefile.in	2012-04-02 12:06:04.129301479 +0100
@@ -3423,11 +3423,13 @@ dbgcnt.o: dbgcnt.c $(CONFIG_H) $(SYSTEM_
 lower-subreg.o : lower-subreg.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
    $(MACHMODE_H) $(TM_H) $(RTL_H) $(TM_P_H) $(TIMEVAR_H) $(FLAGS_H) \
    insn-config.h $(BASIC_BLOCK_H) $(RECOG_H) $(OBSTACK_H) $(BITMAP_H) \
-   $(EXPR_H) $(EXCEPT_H) $(REGS_H) $(TREE_PASS_H) $(DF_H) dce.h
+   $(EXPR_H) $(EXCEPT_H) $(REGS_H) $(TREE_PASS_H) $(DF_H) dce.h \
+   lower-subreg.h
 target-globals.o : target-globals.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
    $(TM_H) insn-config.h $(MACHMODE_H) $(GGC_H) toplev.h target-globals.h \
    $(FLAGS_H) $(REGS_H) $(RTL_H) reload.h expmed.h $(EXPR_H) $(OPTABS_H) \
-   $(LIBFUNCS_H) $(CFGLOOP_H) $(IRA_INT_H) builtins.h gcse.h bb-reorder.h
+   $(LIBFUNCS_H) $(CFGLOOP_H) $(IRA_INT_H) builtins.h gcse.h bb-reorder.h \
+   lower-subreg.h
 hw-doloop.o : hw-doloop.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
    $(RTL_H) $(FLAGS_H) $(EXPR_H) hard-reg-set.h $(BASIC_BLOCK_H) $(TM_P_H) \
    $(DF_H) $(CFGLAYOUT_H) $(CFGLOOP_H) output.h $(RECOG_H) $(TARGET_H) \
Index: gcc/lower-subreg.h
===================================================================
--- /dev/null	2012-01-03 10:09:21.739576622 +0000
+++ gcc/lower-subreg.h	2012-04-03 14:38:35.397879009 +0100
@@ -0,0 +1,59 @@
+/* Target-dependent costs for lower-subreg.c.
+   Copyright (C) 2012 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option; any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef LOWER_SUBREG_H
+#define LOWER_SUBREG_H 1
+
+/* Information about whether, and where, lower-subreg should be applied.  */
+struct lower_subreg_choices {
+  /* A boolean vector for move splitting that is indexed by mode and is
+     true for each mode that is to have its copies split.  */
+  bool move_modes_to_split[MAX_MACHINE_MODE];
+
+  /* True if zero-extensions from word_mode to twice_word_mode should
+     be split.  */
+  bool splitting_zext;
+
+  /* Index X is true if twice_word_mode shifts by X + BITS_PER_WORD
+     should be split.  */
+  bool splitting_ashift[MAX_BITS_PER_WORD];
+  bool splitting_lshiftrt[MAX_BITS_PER_WORD];
+
+  /* True if there is at least one mode that is worth splitting.  */
+  bool something_to_do;
+};
+
+/* Target-specific information for the subreg lowering pass.  */
+struct target_lower_subreg {
+  /* An integer mode that is twice as wide as word_mode.  */
+  enum machine_mode x_twice_word_mode;
+
+  /* What we have decided to do when optimizing for size (index 0)
+     and speed (index 1).  */
+  struct lower_subreg_choices x_choices[2];
+};
+
+extern struct target_lower_subreg default_target_lower_subreg;
+#if SWITCHABLE_TARGET
+extern struct target_lower_subreg *this_target_lower_subreg;
+#else
+#define this_target_lower_subreg (&default_target_lower_subreg)
+#endif
+
+#endif
Index: gcc/target-globals.h
===================================================================
--- gcc/target-globals.h	2010-07-29 13:10:13.000000000 +0100
+++ gcc/target-globals.h	2012-04-02 12:07:41.323231941 +0100
@@ -35,6 +35,7 @@ #define TARGET_GLOBALS_H 1
 extern struct target_builtins *this_target_builtins;
 extern struct target_gcse *this_target_gcse;
 extern struct target_bb_reorder *this_target_bb_reorder;
+extern struct target_lower_subreg *this_target_lower_subreg;
 
 struct GTY(()) target_globals {
   struct target_flag_state *GTY((skip)) flag_state;
@@ -51,6 +52,7 @@ struct GTY(()) target_globals {
   struct target_builtins *GTY((skip)) builtins;
   struct target_gcse *GTY((skip)) gcse;
   struct target_bb_reorder *GTY((skip)) bb_reorder;
+  struct target_lower_subreg *GTY((skip)) lower_subreg;
 };
 
 extern struct target_globals default_target_globals;
@@ -74,6 +76,7 @@ restore_target_globals (struct target_gl
   this_target_builtins = g->builtins;
   this_target_gcse = g->gcse;
   this_target_bb_reorder = g->bb_reorder;
+  this_target_lower_subreg = g->lower_subreg;
 }
 #endif
 
Index: gcc/target-globals.c
===================================================================
--- gcc/target-globals.c	2011-03-18 13:09:16.000000000 +0000
+++ gcc/target-globals.c	2012-04-02 12:06:44.026273230 +0100
@@ -40,6 +40,7 @@ Software Foundation; either version 3, o
 #include "builtins.h"
 #include "gcse.h"
 #include "bb-reorder.h"
+#include "lower-subreg.h"
 
 #if SWITCHABLE_TARGET
 struct target_globals default_target_globals = {
@@ -56,7 +57,8 @@ struct target_globals default_target_glo
   &default_target_ira_int,
   &default_target_builtins,
   &default_target_gcse,
-  &default_target_bb_reorder
+  &default_target_bb_reorder,
+  &default_target_lower_subreg
 };
 
 struct target_globals *
@@ -79,6 +81,7 @@ save_target_globals (void)
   g->builtins = XCNEW (struct target_builtins);
   g->gcse = XCNEW (struct target_gcse);
   g->bb_reorder = XCNEW (struct target_bb_reorder);
+  g->lower_subreg = XCNEW (struct target_lower_subreg);
   restore_target_globals (g);
   init_reg_sets ();
   target_reinit ();
Index: gcc/rtl.h
===================================================================
--- gcc/rtl.h	2012-04-02 11:56:03.000000000 +0100
+++ gcc/rtl.h	2012-04-02 11:57:45.012687781 +0100
@@ -2524,6 +2524,9 @@ extern void init_expmed (void);
 extern void expand_inc (rtx, rtx);
 extern void expand_dec (rtx, rtx);
 
+/* In lower-subreg.c */
+extern void init_lower_subreg (void);
+
 /* In gcse.c */
 extern bool can_copy_p (enum machine_mode);
 extern bool can_assign_to_reg_without_clobbers_p (rtx);
Index: gcc/toplev.c
===================================================================
--- gcc/toplev.c	2012-04-02 11:56:03.000000000 +0100
+++ gcc/toplev.c	2012-04-02 11:57:45.007687786 +0100
@@ -1660,6 +1660,7 @@ backend_init_target (void)
   /* rtx_cost is mode-dependent, so cached values need to be recomputed
      on a mode change.  */
   init_expmed ();
+  init_lower_subreg ();
 
   /* We may need to recompute regno_save_code[] and regno_restore_code[]
      after a mode change as well.  */
Index: gcc/lower-subreg.c
===================================================================
--- gcc/lower-subreg.c	2012-03-07 09:31:40.000000000 +0000
+++ gcc/lower-subreg.c	2012-04-03 14:41:25.464639306 +0100
@@ -40,6 +40,7 @@ Software Foundation; either version 3, o
 #include "regs.h"
 #include "tree-pass.h"
 #include "df.h"
+#include "lower-subreg.h"
 
 #ifdef STACK_GROWS_DOWNWARD
 # undef STACK_GROWS_DOWNWARD
@@ -52,10 +53,35 @@ DEF_VEC_P (bitmap);
 DEF_VEC_ALLOC_P (bitmap,heap);
 
 /* Decompose multi-word pseudo-registers into individual
-   pseudo-registers when possible.  This is possible when all the uses
-   of a multi-word register are via SUBREG, or are copies of the
-   register to another location.  Breaking apart the register permits
-   more CSE and permits better register allocation.  */
+   pseudo-registers when possible and profitable.  This is possible
+   when all the uses of a multi-word register are via SUBREG, or are
+   copies of the register to another location.  Breaking apart the
+   register permits more CSE and permits better register allocation.
+   This is profitable if the machine does not have move instructions
+   to do this.
+
+   This pass only splits moves with modes that are wider than
+   word_mode and ASHIFTs, LSHIFTRTs and ZERO_EXTENDs with integer
+   modes that are twice the width of word_mode.  The latter could be
+   generalized if there was a need to do this, but the trend in
+   architectures is to not need this.
+
+   There are two useful preprocessor defines for use by maintainers:
+
+   #define LOG_COSTS 1
+
+   if you wish to see the actual cost estimates that are being used
+   for each mode wider than word mode and the cost estimates for zero
+   extension and the shifts.   This can be useful when port maintainers
+   are tuning insn rtx costs.
+
+   #define FORCE_LOWERING 1
+
+   if you wish to test the pass with all the transformation forced on.
+   This can be useful for finding bugs in the transformations.  */
+
+#define LOG_COSTS 0
+#define FORCE_LOWERING 0
 
 /* Bit N in this bitmap is set if regno N is used in a context in
    which we can decompose it.  */
@@ -75,8 +101,190 @@ DEF_VEC_ALLOC_P (bitmap,heap);
    copy from reg M to reg N.  */
 static VEC(bitmap,heap) *reg_copy_graph;
 
-/* Return whether X is a simple object which we can take a word_mode
-   subreg of.  */
+struct target_lower_subreg default_target_lower_subreg;
+#if SWITCHABLE_TARGET
+struct target_lower_subreg *this_target_lower_subreg
+  = &default_target_lower_subreg;
+#endif
+
+#define twice_word_mode \
+  this_target_lower_subreg->x_twice_word_mode
+#define choices \
+  this_target_lower_subreg->x_choices
+
+/* RTXes used while computing costs.  */
+struct cost_rtxes {
+  /* Source and target registers.  */
+  rtx source;
+  rtx target;
+
+  /* A twice_word_mode ZERO_EXTEND of SOURCE.  */
+  rtx zext;
+
+  /* A shift of SOURCE.  */
+  rtx shift;
+
+  /* A SET of TARGET.  */
+  rtx set;
+};
+
+/* Return the cost of a CODE shift in mode MODE by OP1 bits, using the
+   rtxes in RTXES.  SPEED_P selects between the speed and size cost.  */
+
+static int
+shift_cost (bool speed_p, struct cost_rtxes *rtxes, enum rtx_code code,
+	    enum machine_mode mode, int op1)
+{
+  PUT_MODE (rtxes->target, mode);
+  PUT_CODE (rtxes->shift, code);
+  PUT_MODE (rtxes->shift, mode);
+  PUT_MODE (rtxes->source, mode);
+  XEXP (rtxes->shift, 1) = GEN_INT (op1);
+  SET_SRC (rtxes->set) = rtxes->shift;
+  return insn_rtx_cost (rtxes->set, speed_p);
+}
+
+/* For each X in the range [0, BITS_PER_WORD), set SPLITTING[X]
+   to true if it is profitable to split a double-word CODE shift
+   of X + BITS_PER_WORD bits.  SPEED_P says whether we are testing
+   for speed or size profitability.
+
+   Use the rtxes in RTXES to calculate costs.  WORD_MOVE_ZERO_COST is
+   the cost of moving zero into a word-mode register.  WORD_MOVE_COST
+   is the cost of moving between word registers.  */
+
+static void
+compute_splitting_shift (bool speed_p, struct cost_rtxes *rtxes,
+			 bool *splitting, enum rtx_code code,
+			 int word_move_zero_cost, int word_move_cost)
+{
+  int wide_cost, narrow_cost, i;
+
+  for (i = 0; i < BITS_PER_WORD; i++)
+    {
+      wide_cost = shift_cost (speed_p, rtxes, code, twice_word_mode,
+			      i + BITS_PER_WORD);
+      if (i == 0)
+	narrow_cost = word_move_cost;
+      else
+	narrow_cost = shift_cost (speed_p, rtxes, code, word_mode, i);
+
+      if (LOG_COSTS)
+	fprintf (stderr, "%s %s by %d: original cost %d, split cost %d + %d\n",
+		 GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (code),
+		 i + BITS_PER_WORD, wide_cost, narrow_cost,
+		 word_move_zero_cost);
+
+      if (FORCE_LOWERING || wide_cost >= narrow_cost + word_move_zero_cost)
+	splitting[i] = true;
+    }
+}
+
+/* Compute what we should do when optimizing for speed or size; SPEED_P
+   selects which.  Use RTXES for computing costs.  */
+
+static void
+compute_costs (bool speed_p, struct cost_rtxes *rtxes)
+{
+  unsigned int i;
+  int word_move_zero_cost, word_move_cost;
+
+  SET_SRC (rtxes->set) = CONST0_RTX (word_mode);
+  word_move_zero_cost = insn_rtx_cost (rtxes->set, speed_p);
+
+  SET_SRC (rtxes->set) = rtxes->source;
+  word_move_cost = insn_rtx_cost (rtxes->set, speed_p);
+
+  if (LOG_COSTS)
+    fprintf (stderr, "%s move: from zero cost %d, from reg cost %d\n",
+	     GET_MODE_NAME (word_mode), word_move_zero_cost, word_move_cost);
+
+  for (i = 0; i < MAX_MACHINE_MODE; i++)
+    {
+      enum machine_mode mode = (enum machine_mode) i;
+      int factor = GET_MODE_SIZE (mode) / UNITS_PER_WORD;
+      if (factor > 1)
+	{
+	  int mode_move_cost;
+
+	  PUT_MODE (rtxes->target, mode);
+	  PUT_MODE (rtxes->source, mode);
+	  mode_move_cost = insn_rtx_cost (rtxes->set, speed_p);
+
+	  if (LOG_COSTS)
+	    fprintf (stderr, "%s move: original cost %d, split cost %d * %d\n",
+		     GET_MODE_NAME (mode), mode_move_cost,
+		     word_move_cost, factor);
+
+	  if (FORCE_LOWERING || mode_move_cost >= word_move_cost * factor)
+	    {
+	      choices[speed_p].move_modes_to_split[i] = true;
+	      choices[speed_p].something_to_do = true;
+	    }
+	}
+    }
+
+  /* For the moves and shifts, the only case that is checked is one
+     where the mode of the target is an integer mode twice the width
+     of the word_mode.
+
+     If it is not profitable to split a double word move then do not
+     even consider the shifts or the zero extension.  */
+  if (choices[speed_p].move_modes_to_split[(int) twice_word_mode])
+    {
+      int zext_cost;
+
+      /* The only case here to check to see if moving the upper part with a
+	 zero is cheaper than doing the zext itself.  */
+      PUT_MODE (rtxes->target, twice_word_mode);
+      PUT_MODE (rtxes->source, word_mode);
+      SET_SRC (rtxes->set) = rtxes->zext;
+      zext_cost = insn_rtx_cost (rtxes->set, speed_p);
+
+      if (LOG_COSTS)
+	fprintf (stderr, "%s %s: original cost %d, split cost %d + %d\n",
+		 GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (ZERO_EXTEND),
+		 zext_cost, word_move_cost, word_move_zero_cost);
+
+      if (FORCE_LOWERING || zext_cost >= word_move_cost + word_move_zero_cost)
+	choices[speed_p].splitting_zext = true;
+
+      compute_splitting_shift (speed_p, rtxes,
+			       choices[speed_p].splitting_ashift, ASHIFT,
+			       word_move_zero_cost, word_move_cost);
+      compute_splitting_shift (speed_p, rtxes,
+			       choices[speed_p].splitting_lshiftrt, LSHIFTRT,
+			       word_move_zero_cost, word_move_cost);
+    }
+}
+
+/* Do one-per-target initialisation.  This involves determining
+   which operations on the machine are profitable.  If none are found,
+   then the pass just returns when called.  */
+
+void
+init_lower_subreg (void)
+{
+  struct cost_rtxes rtxes;
+
+  memset (this_target_lower_subreg, 0, sizeof (*this_target_lower_subreg));
+
+  twice_word_mode = GET_MODE_2XWIDER_MODE (word_mode);
+
+  rtxes.target = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+  rtxes.source = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER + 1);
+  rtxes.set = gen_rtx_SET (VOIDmode, rtxes.target, rtxes.source);
+  rtxes.zext = gen_rtx_ZERO_EXTEND (twice_word_mode, rtxes.source);
+  rtxes.shift = gen_rtx_ASHIFT (twice_word_mode, rtxes.source, const0_rtx);
+
+  if (LOG_COSTS)
+    fprintf (stderr, "\nSize costs\n==========\n\n");
+  compute_costs (false, &rtxes);
+
+  if (LOG_COSTS)
+    fprintf (stderr, "\nSpeed costs\n===========\n\n");
+  compute_costs (true, &rtxes);
+}
 
 static bool
 simple_move_operand (rtx x)
@@ -101,12 +309,15 @@ simple_move_operand (rtx x)
   return true;
 }
 
-/* If INSN is a single set between two objects, return the single set.
-   Such an insn can always be decomposed.  INSN should have been
-   passed to recog and extract_insn before this is called.  */
+/* If INSN is a single set between two objects that we want to split,
+   return the single set.  SPEED_P says whether we are optimizing
+   INSN for speed or size.
+
+   INSN should have been passed to recog and extract_insn before this
+   is called.  */
 
 static rtx
-simple_move (rtx insn)
+simple_move (rtx insn, bool speed_p)
 {
   rtx x;
   rtx set;
@@ -150,6 +361,9 @@ simple_move (rtx insn)
   if (GET_MODE_CLASS (mode) == MODE_PARTIAL_INT)
     return NULL_RTX;
 
+  if (!choices[speed_p].move_modes_to_split[(int) mode])
+    return NULL_RTX;
+
   return set;
 }
 
@@ -173,9 +387,6 @@ find_pseudo_copy (rtx set)
   if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
     return false;
 
-  if (GET_MODE_SIZE (GET_MODE (dest)) <= UNITS_PER_WORD)
-    return false;
-
   b = VEC_index (bitmap, reg_copy_graph, rs);
   if (b == NULL)
     {
@@ -668,8 +879,7 @@ resolve_simple_move (rtx set, rtx insn)
   orig_mode = GET_MODE (dest);
 
   words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
-  if (words <= 1)
-    return insn;
+  gcc_assert (words > 1);
 
   start_sequence ();
 
@@ -931,12 +1141,13 @@ resolve_debug (rtx insn)
   resolve_reg_notes (insn);
 }
 
-/* Checks if INSN is a decomposable multiword-shift or zero-extend and
-   sets the decomposable_context bitmap accordingly.  A non-zero value
-   is returned if a decomposable insn has been found.  */
+/* Check if INSN is a decomposable multiword-shift or zero-extend and
+   set the decomposable_context bitmap accordingly.  SPEED_P is true
+   if we are optimizing INSN for speed rather than size.  Return true
+   if INSN is decomposable.  */
 
-static int
-find_decomposable_shift_zext (rtx insn)
+static bool
+find_decomposable_shift_zext (rtx insn, bool speed_p)
 {
   rtx set;
   rtx op;
@@ -944,41 +1155,44 @@ find_decomposable_shift_zext (rtx insn)
 
   set = single_set (insn);
   if (!set)
-    return 0;
+    return false;
 
   op = SET_SRC (set);
   if (GET_CODE (op) != ASHIFT
       && GET_CODE (op) != LSHIFTRT
       && GET_CODE (op) != ZERO_EXTEND)
-    return 0;
+    return false;
 
   op_operand = XEXP (op, 0);
   if (!REG_P (SET_DEST (set)) || !REG_P (op_operand)
       || HARD_REGISTER_NUM_P (REGNO (SET_DEST (set)))
       || HARD_REGISTER_NUM_P (REGNO (op_operand))
-      || !SCALAR_INT_MODE_P (GET_MODE (op)))
-    return 0;
+      || GET_MODE (op) != twice_word_mode)
+    return false;
 
   if (GET_CODE (op) == ZERO_EXTEND)
     {
       if (GET_MODE (op_operand) != word_mode
-	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
-	return 0;
+	  || !choices[speed_p].splitting_zext)
+	return false;
     }
   else /* left or right shift */
     {
+      bool *splitting = (GET_CODE (op) == ASHIFT
+			 ? choices[speed_p].splitting_ashift
+			 : choices[speed_p].splitting_lshiftrt);
       if (!CONST_INT_P (XEXP (op, 1))
-	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
-	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
-	return 0;
+	  || !IN_RANGE (INTVAL (XEXP (op, 1)), BITS_PER_WORD,
+			2 * BITS_PER_WORD - 1)
+	  || !splitting[INTVAL (XEXP (op, 1)) - BITS_PER_WORD])
+	return false;
+
+      bitmap_set_bit (decomposable_context, REGNO (op_operand));
     }
 
   bitmap_set_bit (decomposable_context, REGNO (SET_DEST (set)));
 
-  if (GET_CODE (op) != ZERO_EXTEND)
-    bitmap_set_bit (decomposable_context, REGNO (op_operand));
-
-  return 1;
+  return true;
 }
 
 /* Decompose a more than word wide shift (in INSN) of a multiword
@@ -1008,6 +1222,8 @@ resolve_shift_zext (rtx insn)
 
   op_operand = XEXP (op, 0);
 
+  /* We can tear this operation apart only if the regs were already
+     torn apart.  */
   if (!resolve_reg_p (SET_DEST (set)) && !resolve_reg_p (op_operand))
     return NULL_RTX;
 
@@ -1073,6 +1289,56 @@ resolve_shift_zext (rtx insn)
   return insns;
 }
 
+/* Print to dump_file a description of what we're doing with shift code CODE.
+   SPLITTING[X] is true if we are splitting shifts by X + BITS_PER_WORD.  */
+
+static void
+dump_shift_choices (enum rtx_code code, bool *splitting)
+{
+  int i;
+  const char *sep;
+
+  fprintf (dump_file,
+	   "  Splitting mode %s for %s lowering with shift amounts = ",
+	   GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (code));
+  sep = "";
+  for (i = 0; i < BITS_PER_WORD; i++)
+    if (splitting[i])
+      {
+	fprintf (dump_file, "%s%d", sep, i + BITS_PER_WORD);
+	sep = ",";
+      }
+  fprintf (dump_file, "\n");
+}
+
+/* Print to dump_file a description of what we're doing when optimizing
+   for speed or size; SPEED_P says which.  DESCRIPTION is a description
+   of the SPEED_P choice.  */
+
+static void
+dump_choices (bool speed_p, const char *description)
+{
+  unsigned int i;
+
+  fprintf (dump_file, "Choices when optimizing for %s:\n", description);
+
+  for (i = 0; i < MAX_MACHINE_MODE; i++)
+    if (GET_MODE_SIZE (i) > UNITS_PER_WORD)
+      fprintf (dump_file, "  %s mode %s for copy lowering.\n",
+	       choices[speed_p].move_modes_to_split[i]
+	       ? "Splitting"
+	       : "Skipping",
+	       GET_MODE_NAME ((enum machine_mode) i));
+
+  fprintf (dump_file, "  %s mode %s for zero_extend lowering.\n",
+	   choices[speed_p].splitting_zext ? "Splitting" : "Skipping",
+	   GET_MODE_NAME (twice_word_mode));
+
+  dump_shift_choices (ASHIFT, choices[speed_p].splitting_ashift);
+  dump_shift_choices (LSHIFTRT, choices[speed_p].splitting_ashift);
+  fprintf (dump_file, "\n");
+}
+
 /* Look for registers which are always accessed via word-sized SUBREGs
    or via copies.  Decompose these registers into several word-sized
    pseudo-registers.  */
@@ -1083,8 +1349,19 @@ decompose_multiword_subregs (void)
   unsigned int max;
   basic_block bb;
 
-  if (df)
-    df_set_flags (DF_DEFER_INSN_RESCAN);
+  if (dump_file)
+    {
+      dump_choices (false, "size");
+      dump_choices (true, "speed");
+    }
+
+  /* Check if this target even has any modes to consider lowering.   */
+  if (!choices[false].something_to_do && !choices[true].something_to_do)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Nothing to do!\n");
+      return;
+    }
 
   max = max_reg_num ();
 
@@ -1094,24 +1371,38 @@ decompose_multiword_subregs (void)
      all the insns.  */
   {
     unsigned int i;
+    bool useful_modes_seen = false;
 
     for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
+      if (regno_reg_rtx[i] != NULL)
+	{
+	  enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
+	  if (choices[false].move_modes_to_split[(int) mode]
+	      || choices[true].move_modes_to_split[(int) mode])
+	    {
+	      useful_modes_seen = true;
+	      break;
+	    }
+	}
+
+    if (!useful_modes_seen)
       {
-	if (regno_reg_rtx[i] != NULL
-	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
-	  break;
+	if (dump_file)
+	  fprintf (dump_file, "Nothing to lower in this function.\n");
+	return;
       }
-    if (i == max)
-      return;
   }
 
   if (df)
-    run_word_dce ();
+    {
+      df_set_flags (DF_DEFER_INSN_RESCAN);
+      run_word_dce ();
+    }
 
-  /* FIXME: When the dataflow branch is merged, we can change this
-     code to look for each multi-word pseudo-register and to find each
-     insn which sets or uses that register.  That should be faster
-     than scanning all the insns.  */
+  /* FIXME: It may be possible to change this code to look for each
+     multi-word pseudo-register and to find each insn which sets or
+     uses that register.  That should be faster than scanning all the
+     insns.  */
 
   decomposable_context = BITMAP_ALLOC (NULL);
   non_decomposable_context = BITMAP_ALLOC (NULL);
@@ -1124,7 +1415,9 @@ decompose_multiword_subregs (void)
   FOR_EACH_BB (bb)
     {
       rtx insn;
+      bool speed_p;
 
+      speed_p = optimize_bb_for_speed_p (bb);
       FOR_BB_INSNS (bb, insn)
 	{
 	  rtx set;
@@ -1138,12 +1431,12 @@ decompose_multiword_subregs (void)
 
 	  recog_memoized (insn);
 
-	  if (find_decomposable_shift_zext (insn))
+	  if (find_decomposable_shift_zext (insn, speed_p))
 	    continue;
 
 	  extract_insn (insn);
 
-	  set = simple_move (insn);
+	  set = simple_move (insn, speed_p);
 
 	  if (!set)
 	    cmi = NOT_SIMPLE_MOVE;
@@ -1197,7 +1490,9 @@ decompose_multiword_subregs (void)
       FOR_EACH_BB (bb)
 	{
 	  rtx insn;
+	  bool speed_p;
 
+	  speed_p = optimize_bb_for_speed_p (bb);
 	  FOR_BB_INSNS (bb, insn)
 	    {
 	      rtx pat;
@@ -1220,7 +1515,7 @@ decompose_multiword_subregs (void)
 		  recog_memoized (insn);
 		  extract_insn (insn);
 
-		  set = simple_move (insn);
+		  set = simple_move (insn, speed_p);
 		  if (set)
 		    {
 		      rtx orig_insn = insn;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-04-03 13:53                     ` Richard Sandiford
@ 2012-04-03 15:33                       ` Kenneth Zadeck
  2012-04-03 19:20                         ` Richard Sandiford
  2012-04-03 16:22                       ` [C Patch]: pr52543 Ian Lance Taylor
  1 sibling, 1 reply; 38+ messages in thread
From: Kenneth Zadeck @ 2012-04-03 15:33 UTC (permalink / raw)
  To: gcc-patches, Ian Lance Taylor, Mike Stump, Kenneth Zadeck, avr,
	ramana.radhakrishnan, rdsandiford

Richard,

thanks, for doing the changes.    In particular, i did not really 
understand how the target stuff was supposed to work.

I have one issue with the changes that you made.

I had actually decided that the speed/size decision was not relevant to 
this patch.    The problem is that since this is a global optimization 
which propagates the info across the entire web of moves, you really 
want/need to use the same cost metric everywhere.  (of course, making 
the speed/size decision on the global optimization level would be fine; 
my issue is with the choice being at the bb level.)    So now you are 
going to have some moves in a web saying yes while others saying no.    
The ones that say no will dominate.   It is unlikely you will get what 
you want (the important nodes really running fast) out of this, assuming 
you have a target where the decision could go either way.

Aside from this, everything looks very good.

Kenny

On 04/03/2012 09:53 AM, Richard Sandiford wrote:
> Index: gcc/lower-subreg.h
> ===================================================================
> --- /dev/null	2012-01-03 10:09:21.739576622 +0000
> +++ gcc/lower-subreg.h	2012-04-03 14:38:35.397879009 +0100
> @@ -0,0 +1,59 @@
> +/* Target-dependent costs for lower-subreg.c.
> +   Copyright (C) 2012 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option; any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#ifndef LOWER_SUBREG_H
> +#define LOWER_SUBREG_H 1
> +
> +/* Information about whether, and where, lower-subreg should be applied.  */
> +struct lower_subreg_choices {
> +  /* A boolean vector for move splitting that is indexed by mode and is
> +     true for each mode that is to have its copies split.  */
> +  bool move_modes_to_split[MAX_MACHINE_MODE];
> +
> +  /* True if zero-extensions from word_mode to twice_word_mode should
> +     be split.  */
> +  bool splitting_zext;
> +
> +  /* Index X is true if twice_word_mode shifts by X + BITS_PER_WORD
> +     should be split.  */
> +  bool splitting_ashift[MAX_BITS_PER_WORD];
> +  bool splitting_lshiftrt[MAX_BITS_PER_WORD];
> +
> +  /* True if there is at least one mode that is worth splitting.  */
> +  bool something_to_do;
> +};
> +
> +/* Target-specific information for the subreg lowering pass.  */
> +struct target_lower_subreg {
> +  /* An integer mode that is twice as wide as word_mode.  */
> +  enum machine_mode x_twice_word_mode;
> +
> +  /* What we have decided to do when optimizing for size (index 0)
> +     and speed (index 1).  */
> +  struct lower_subreg_choices x_choices[2];
> +};
> +
> +extern struct target_lower_subreg default_target_lower_subreg;
> +#if SWITCHABLE_TARGET
> +extern struct target_lower_subreg *this_target_lower_subreg;
> +#else
> +#define this_target_lower_subreg (&default_target_lower_subreg)
> +#endif
> +
> +#endif
> Index: gcc/target-globals.h
> ===================================================================
> --- gcc/target-globals.h	2010-07-29 13:10:13.000000000 +0100
> +++ gcc/target-globals.h	2012-04-02 12:07:41.323231941 +0100
> @@ -35,6 +35,7 @@ #define TARGET_GLOBALS_H 1
>   extern struct target_builtins *this_target_builtins;
>   extern struct target_gcse *this_target_gcse;
>   extern struct target_bb_reorder *this_target_bb_reorder;
> +extern struct target_lower_subreg *this_target_lower_subreg;
>
>   struct GTY(()) target_globals {
>     struct target_flag_state *GTY((skip)) flag_state;
> @@ -51,6 +52,7 @@ struct GTY(()) target_globals {
>     struct target_builtins *GTY((skip)) builtins;
>     struct target_gcse *GTY((skip)) gcse;
>     struct target_bb_reorder *GTY((skip)) bb_reorder;
> +  struct target_lower_subreg *GTY((skip)) lower_subreg;
>   };
>
>   extern struct target_globals default_target_globals;
> @@ -74,6 +76,7 @@ restore_target_globals (struct target_gl
>     this_target_builtins = g->builtins;
>     this_target_gcse = g->gcse;
>     this_target_bb_reorder = g->bb_reorder;
> +  this_target_lower_subreg = g->lower_subreg;
>   }
>   #endif
>
> Index: gcc/target-globals.c
> ===================================================================
> --- gcc/target-globals.c	2011-03-18 13:09:16.000000000 +0000
> +++ gcc/target-globals.c	2012-04-02 12:06:44.026273230 +0100
> @@ -40,6 +40,7 @@ Software Foundation; either version 3, o
>   #include "builtins.h"
>   #include "gcse.h"
>   #include "bb-reorder.h"
> +#include "lower-subreg.h"
>
>   #if SWITCHABLE_TARGET
>   struct target_globals default_target_globals = {
> @@ -56,7 +57,8 @@ struct target_globals default_target_glo
>     &default_target_ira_int,
>     &default_target_builtins,
>     &default_target_gcse,
> -&default_target_bb_reorder
> +&default_target_bb_reorder,
> +&default_target_lower_subreg
>   };
>
>   struct target_globals *
> @@ -79,6 +81,7 @@ save_target_globals (void)
>     g->builtins = XCNEW (struct target_builtins);
>     g->gcse = XCNEW (struct target_gcse);
>     g->bb_reorder = XCNEW (struct target_bb_reorder);
> +  g->lower_subreg = XCNEW (struct target_lower_subreg);
>     restore_target_globals (g);
>     init_reg_sets ();
>     target_reinit ();
> Index: gcc/rtl.h
> ===================================================================
> --- gcc/rtl.h	2012-04-02 11:56:03.000000000 +0100
> +++ gcc/rtl.h	2012-04-02 11:57:45.012687781 +0100
> @@ -2524,6 +2524,9 @@ extern void init_expmed (void);
>   extern void expand_inc (rtx, rtx);
>   extern void expand_dec (rtx, rtx);
>
> +/* In lower-subreg.c */
> +extern void init_lower_subreg (void);
> +
>   /* In gcse.c */
>   extern bool can_copy_p (enum machine_mode);
>   extern bool can_assign_to_reg_without_clobbers_p (rtx);
> Index: gcc/toplev.c
> ===================================================================
> --- gcc/toplev.c	2012-04-02 11:56:03.000000000 +0100
> +++ gcc/toplev.c	2012-04-02 11:57:45.007687786 +0100
> @@ -1660,6 +1660,7 @@ backend_init_target (void)
>     /* rtx_cost is mode-dependent, so cached values need to be recomputed
>        on a mode change.  */
>     init_expmed ();
> +  init_lower_subreg ();
>
>     /* We may need to recompute regno_save_code[] and regno_restore_code[]
>        after a mode change as well.  */
> Index: gcc/lower-subreg.c
> ===================================================================
> --- gcc/lower-subreg.c	2012-03-07 09:31:40.000000000 +0000
> +++ gcc/lower-subreg.c	2012-04-03 14:41:25.464639306 +0100
> @@ -40,6 +40,7 @@ Software Foundation; either version 3, o
>   #include "regs.h"
>   #include "tree-pass.h"
>   #include "df.h"
> +#include "lower-subreg.h"
>
>   #ifdef STACK_GROWS_DOWNWARD
>   # undef STACK_GROWS_DOWNWARD
> @@ -52,10 +53,35 @@ DEF_VEC_P (bitmap);
>   DEF_VEC_ALLOC_P (bitmap,heap);
>
>   /* Decompose multi-word pseudo-registers into individual
> -   pseudo-registers when possible.  This is possible when all the uses
> -   of a multi-word register are via SUBREG, or are copies of the
> -   register to another location.  Breaking apart the register permits
> -   more CSE and permits better register allocation.  */
> +   pseudo-registers when possible and profitable.  This is possible
> +   when all the uses of a multi-word register are via SUBREG, or are
> +   copies of the register to another location.  Breaking apart the
> +   register permits more CSE and permits better register allocation.
> +   This is profitable if the machine does not have move instructions
> +   to do this.
> +
> +   This pass only splits moves with modes that are wider than
> +   word_mode and ASHIFTs, LSHIFTRTs and ZERO_EXTENDs with integer
> +   modes that are twice the width of word_mode.  The latter could be
> +   generalized if there was a need to do this, but the trend in
> +   architectures is to not need this.
> +
> +   There are two useful preprocessor defines for use by maintainers:
> +
> +   #define LOG_COSTS 1
> +
> +   if you wish to see the actual cost estimates that are being used
> +   for each mode wider than word mode and the cost estimates for zero
> +   extension and the shifts.   This can be useful when port maintainers
> +   are tuning insn rtx costs.
> +
> +   #define FORCE_LOWERING 1
> +
> +   if you wish to test the pass with all the transformation forced on.
> +   This can be useful for finding bugs in the transformations.  */
> +
> +#define LOG_COSTS 0
> +#define FORCE_LOWERING 0
>
>   /* Bit N in this bitmap is set if regno N is used in a context in
>      which we can decompose it.  */
> @@ -75,8 +101,190 @@ DEF_VEC_ALLOC_P (bitmap,heap);
>      copy from reg M to reg N.  */
>   static VEC(bitmap,heap) *reg_copy_graph;
>
> -/* Return whether X is a simple object which we can take a word_mode
> -   subreg of.  */
> +struct target_lower_subreg default_target_lower_subreg;
> +#if SWITCHABLE_TARGET
> +struct target_lower_subreg *this_target_lower_subreg
> +  =&default_target_lower_subreg;
> +#endif
> +
> +#define twice_word_mode \
> +  this_target_lower_subreg->x_twice_word_mode
> +#define choices \
> +  this_target_lower_subreg->x_choices
> +
> +/* RTXes used while computing costs.  */
> +struct cost_rtxes {
> +  /* Source and target registers.  */
> +  rtx source;
> +  rtx target;
> +
> +  /* A twice_word_mode ZERO_EXTEND of SOURCE.  */
> +  rtx zext;
> +
> +  /* A shift of SOURCE.  */
> +  rtx shift;
> +
> +  /* A SET of TARGET.  */
> +  rtx set;
> +};
> +
> +/* Return the cost of a CODE shift in mode MODE by OP1 bits, using the
> +   rtxes in RTXES.  SPEED_P selects between the speed and size cost.  */
> +
> +static int
> +shift_cost (bool speed_p, struct cost_rtxes *rtxes, enum rtx_code code,
> +	    enum machine_mode mode, int op1)
> +{
> +  PUT_MODE (rtxes->target, mode);
> +  PUT_CODE (rtxes->shift, code);
> +  PUT_MODE (rtxes->shift, mode);
> +  PUT_MODE (rtxes->source, mode);
> +  XEXP (rtxes->shift, 1) = GEN_INT (op1);
> +  SET_SRC (rtxes->set) = rtxes->shift;
> +  return insn_rtx_cost (rtxes->set, speed_p);
> +}
> +
> +/* For each X in the range [0, BITS_PER_WORD), set SPLITTING[X]
> +   to true if it is profitable to split a double-word CODE shift
> +   of X + BITS_PER_WORD bits.  SPEED_P says whether we are testing
> +   for speed or size profitability.
> +
> +   Use the rtxes in RTXES to calculate costs.  WORD_MOVE_ZERO_COST is
> +   the cost of moving zero into a word-mode register.  WORD_MOVE_COST
> +   is the cost of moving between word registers.  */
> +
> +static void
> +compute_splitting_shift (bool speed_p, struct cost_rtxes *rtxes,
> +			 bool *splitting, enum rtx_code code,
> +			 int word_move_zero_cost, int word_move_cost)
> +{
> +  int wide_cost, narrow_cost, i;
> +
> +  for (i = 0; i<  BITS_PER_WORD; i++)
> +    {
> +      wide_cost = shift_cost (speed_p, rtxes, code, twice_word_mode,
> +			      i + BITS_PER_WORD);
> +      if (i == 0)
> +	narrow_cost = word_move_cost;
> +      else
> +	narrow_cost = shift_cost (speed_p, rtxes, code, word_mode, i);
> +
> +      if (LOG_COSTS)
> +	fprintf (stderr, "%s %s by %d: original cost %d, split cost %d + %d\n",
> +		 GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (code),
> +		 i + BITS_PER_WORD, wide_cost, narrow_cost,
> +		 word_move_zero_cost);
> +
> +      if (FORCE_LOWERING || wide_cost>= narrow_cost + word_move_zero_cost)
> +	splitting[i] = true;
> +    }
> +}
> +
> +/* Compute what we should do when optimizing for speed or size; SPEED_P
> +   selects which.  Use RTXES for computing costs.  */
> +
> +static void
> +compute_costs (bool speed_p, struct cost_rtxes *rtxes)
> +{
> +  unsigned int i;
> +  int word_move_zero_cost, word_move_cost;
> +
> +  SET_SRC (rtxes->set) = CONST0_RTX (word_mode);
> +  word_move_zero_cost = insn_rtx_cost (rtxes->set, speed_p);
> +
> +  SET_SRC (rtxes->set) = rtxes->source;
> +  word_move_cost = insn_rtx_cost (rtxes->set, speed_p);
> +
> +  if (LOG_COSTS)
> +    fprintf (stderr, "%s move: from zero cost %d, from reg cost %d\n",
> +	     GET_MODE_NAME (word_mode), word_move_zero_cost, word_move_cost);
> +
> +  for (i = 0; i<  MAX_MACHINE_MODE; i++)
> +    {
> +      enum machine_mode mode = (enum machine_mode) i;
> +      int factor = GET_MODE_SIZE (mode) / UNITS_PER_WORD;
> +      if (factor>  1)
> +	{
> +	  int mode_move_cost;
> +
> +	  PUT_MODE (rtxes->target, mode);
> +	  PUT_MODE (rtxes->source, mode);
> +	  mode_move_cost = insn_rtx_cost (rtxes->set, speed_p);
> +
> +	  if (LOG_COSTS)
> +	    fprintf (stderr, "%s move: original cost %d, split cost %d * %d\n",
> +		     GET_MODE_NAME (mode), mode_move_cost,
> +		     word_move_cost, factor);
> +
> +	  if (FORCE_LOWERING || mode_move_cost>= word_move_cost * factor)
> +	    {
> +	      choices[speed_p].move_modes_to_split[i] = true;
> +	      choices[speed_p].something_to_do = true;
> +	    }
> +	}
> +    }
> +
> +  /* For the moves and shifts, the only case that is checked is one
> +     where the mode of the target is an integer mode twice the width
> +     of the word_mode.
> +
> +     If it is not profitable to split a double word move then do not
> +     even consider the shifts or the zero extension.  */
> +  if (choices[speed_p].move_modes_to_split[(int) twice_word_mode])
> +    {
> +      int zext_cost;
> +
> +      /* The only case here to check to see if moving the upper part with a
> +	 zero is cheaper than doing the zext itself.  */
> +      PUT_MODE (rtxes->target, twice_word_mode);
> +      PUT_MODE (rtxes->source, word_mode);
> +      SET_SRC (rtxes->set) = rtxes->zext;
> +      zext_cost = insn_rtx_cost (rtxes->set, speed_p);
> +
> +      if (LOG_COSTS)
> +	fprintf (stderr, "%s %s: original cost %d, split cost %d + %d\n",
> +		 GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (ZERO_EXTEND),
> +		 zext_cost, word_move_cost, word_move_zero_cost);
> +
> +      if (FORCE_LOWERING || zext_cost>= word_move_cost + word_move_zero_cost)
> +	choices[speed_p].splitting_zext = true;
> +
> +      compute_splitting_shift (speed_p, rtxes,
> +			       choices[speed_p].splitting_ashift, ASHIFT,
> +			       word_move_zero_cost, word_move_cost);
> +      compute_splitting_shift (speed_p, rtxes,
> +			       choices[speed_p].splitting_lshiftrt, LSHIFTRT,
> +			       word_move_zero_cost, word_move_cost);
> +    }
> +}
> +
> +/* Do one-per-target initialisation.  This involves determining
> +   which operations on the machine are profitable.  If none are found,
> +   then the pass just returns when called.  */
> +
> +void
> +init_lower_subreg (void)
> +{
> +  struct cost_rtxes rtxes;
> +
> +  memset (this_target_lower_subreg, 0, sizeof (*this_target_lower_subreg));
> +
> +  twice_word_mode = GET_MODE_2XWIDER_MODE (word_mode);
> +
> +  rtxes.target = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
> +  rtxes.source = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER + 1);
> +  rtxes.set = gen_rtx_SET (VOIDmode, rtxes.target, rtxes.source);
> +  rtxes.zext = gen_rtx_ZERO_EXTEND (twice_word_mode, rtxes.source);
> +  rtxes.shift = gen_rtx_ASHIFT (twice_word_mode, rtxes.source, const0_rtx);
> +
> +  if (LOG_COSTS)
> +    fprintf (stderr, "\nSize costs\n==========\n\n");
> +  compute_costs (false,&rtxes);
> +
> +  if (LOG_COSTS)
> +    fprintf (stderr, "\nSpeed costs\n===========\n\n");
> +  compute_costs (true,&rtxes);
> +}
>
>   static bool
>   simple_move_operand (rtx x)
> @@ -101,12 +309,15 @@ simple_move_operand (rtx x)
>     return true;
>   }
>
> -/* If INSN is a single set between two objects, return the single set.
> -   Such an insn can always be decomposed.  INSN should have been
> -   passed to recog and extract_insn before this is called.  */
> +/* If INSN is a single set between two objects that we want to split,
> +   return the single set.  SPEED_P says whether we are optimizing
> +   INSN for speed or size.
> +
> +   INSN should have been passed to recog and extract_insn before this
> +   is called.  */
>
>   static rtx
> -simple_move (rtx insn)
> +simple_move (rtx insn, bool speed_p)
>   {
>     rtx x;
>     rtx set;
> @@ -150,6 +361,9 @@ simple_move (rtx insn)
>     if (GET_MODE_CLASS (mode) == MODE_PARTIAL_INT)
>       return NULL_RTX;
>
> +  if (!choices[speed_p].move_modes_to_split[(int) mode])
> +    return NULL_RTX;
> +
>     return set;
>   }
>
> @@ -173,9 +387,6 @@ find_pseudo_copy (rtx set)
>     if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
>       return false;
>
> -  if (GET_MODE_SIZE (GET_MODE (dest))<= UNITS_PER_WORD)
> -    return false;
> -
>     b = VEC_index (bitmap, reg_copy_graph, rs);
>     if (b == NULL)
>       {
> @@ -668,8 +879,7 @@ resolve_simple_move (rtx set, rtx insn)
>     orig_mode = GET_MODE (dest);
>
>     words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
> -  if (words<= 1)
> -    return insn;
> +  gcc_assert (words>  1);
>
>     start_sequence ();
>
> @@ -931,12 +1141,13 @@ resolve_debug (rtx insn)
>     resolve_reg_notes (insn);
>   }
>
> -/* Checks if INSN is a decomposable multiword-shift or zero-extend and
> -   sets the decomposable_context bitmap accordingly.  A non-zero value
> -   is returned if a decomposable insn has been found.  */
> +/* Check if INSN is a decomposable multiword-shift or zero-extend and
> +   set the decomposable_context bitmap accordingly.  SPEED_P is true
> +   if we are optimizing INSN for speed rather than size.  Return true
> +   if INSN is decomposable.  */
>
> -static int
> -find_decomposable_shift_zext (rtx insn)
> +static bool
> +find_decomposable_shift_zext (rtx insn, bool speed_p)
>   {
>     rtx set;
>     rtx op;
> @@ -944,41 +1155,44 @@ find_decomposable_shift_zext (rtx insn)
>
>     set = single_set (insn);
>     if (!set)
> -    return 0;
> +    return false;
>
>     op = SET_SRC (set);
>     if (GET_CODE (op) != ASHIFT
>         &&  GET_CODE (op) != LSHIFTRT
>         &&  GET_CODE (op) != ZERO_EXTEND)
> -    return 0;
> +    return false;
>
>     op_operand = XEXP (op, 0);
>     if (!REG_P (SET_DEST (set)) || !REG_P (op_operand)
>         || HARD_REGISTER_NUM_P (REGNO (SET_DEST (set)))
>         || HARD_REGISTER_NUM_P (REGNO (op_operand))
> -      || !SCALAR_INT_MODE_P (GET_MODE (op)))
> -    return 0;
> +      || GET_MODE (op) != twice_word_mode)
> +    return false;
>
>     if (GET_CODE (op) == ZERO_EXTEND)
>       {
>         if (GET_MODE (op_operand) != word_mode
> -	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
> -	return 0;
> +	  || !choices[speed_p].splitting_zext)
> +	return false;
>       }
>     else /* left or right shift */
>       {
> +      bool *splitting = (GET_CODE (op) == ASHIFT
> +			 ? choices[speed_p].splitting_ashift
> +			 : choices[speed_p].splitting_lshiftrt);
>         if (!CONST_INT_P (XEXP (op, 1))
> -	  || INTVAL (XEXP (op, 1))<  BITS_PER_WORD
> -	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
> -	return 0;
> +	  || !IN_RANGE (INTVAL (XEXP (op, 1)), BITS_PER_WORD,
> +			2 * BITS_PER_WORD - 1)
> +	  || !splitting[INTVAL (XEXP (op, 1)) - BITS_PER_WORD])
> +	return false;
> +
> +      bitmap_set_bit (decomposable_context, REGNO (op_operand));
>       }
>
>     bitmap_set_bit (decomposable_context, REGNO (SET_DEST (set)));
>
> -  if (GET_CODE (op) != ZERO_EXTEND)
> -    bitmap_set_bit (decomposable_context, REGNO (op_operand));
> -
> -  return 1;
> +  return true;
>   }
>
>   /* Decompose a more than word wide shift (in INSN) of a multiword
> @@ -1008,6 +1222,8 @@ resolve_shift_zext (rtx insn)
>
>     op_operand = XEXP (op, 0);
>
> +  /* We can tear this operation apart only if the regs were already
> +     torn apart.  */
>     if (!resolve_reg_p (SET_DEST (set))&&  !resolve_reg_p (op_operand))
>       return NULL_RTX;
>
> @@ -1073,6 +1289,56 @@ resolve_shift_zext (rtx insn)
>     return insns;
>   }
>
> +/* Print to dump_file a description of what we're doing with shift code CODE.
> +   SPLITTING[X] is true if we are splitting shifts by X + BITS_PER_WORD.  */
> +
> +static void
> +dump_shift_choices (enum rtx_code code, bool *splitting)
> +{
> +  int i;
> +  const char *sep;
> +
> +  fprintf (dump_file,
> +	   "  Splitting mode %s for %s lowering with shift amounts = ",
> +	   GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (code));
> +  sep = "";
> +  for (i = 0; i<  BITS_PER_WORD; i++)
> +    if (splitting[i])
> +      {
> +	fprintf (dump_file, "%s%d", sep, i + BITS_PER_WORD);
> +	sep = ",";
> +      }
> +  fprintf (dump_file, "\n");
> +}
> +
> +/* Print to dump_file a description of what we're doing when optimizing
> +   for speed or size; SPEED_P says which.  DESCRIPTION is a description
> +   of the SPEED_P choice.  */
> +
> +static void
> +dump_choices (bool speed_p, const char *description)
> +{
> +  unsigned int i;
> +
> +  fprintf (dump_file, "Choices when optimizing for %s:\n", description);
> +
> +  for (i = 0; i<  MAX_MACHINE_MODE; i++)
> +    if (GET_MODE_SIZE (i)>  UNITS_PER_WORD)
> +      fprintf (dump_file, "  %s mode %s for copy lowering.\n",
> +	       choices[speed_p].move_modes_to_split[i]
> +	       ? "Splitting"
> +	       : "Skipping",
> +	       GET_MODE_NAME ((enum machine_mode) i));
> +
> +  fprintf (dump_file, "  %s mode %s for zero_extend lowering.\n",
> +	   choices[speed_p].splitting_zext ? "Splitting" : "Skipping",
> +	   GET_MODE_NAME (twice_word_mode));
> +
> +  dump_shift_choices (ASHIFT, choices[speed_p].splitting_ashift);
> +  dump_shift_choices (LSHIFTRT, choices[speed_p].splitting_ashift);
> +  fprintf (dump_file, "\n");
> +}
> +
>   /* Look for registers which are always accessed via word-sized SUBREGs
>      or via copies.  Decompose these registers into several word-sized
>      pseudo-registers.  */
> @@ -1083,8 +1349,19 @@ decompose_multiword_subregs (void)
>     unsigned int max;
>     basic_block bb;
>
> -  if (df)
> -    df_set_flags (DF_DEFER_INSN_RESCAN);
> +  if (dump_file)
> +    {
> +      dump_choices (false, "size");
> +      dump_choices (true, "speed");
> +    }
> +
> +  /* Check if this target even has any modes to consider lowering.   */
> +  if (!choices[false].something_to_do&&  !choices[true].something_to_do)
> +    {
> +      if (dump_file)
> +	fprintf (dump_file, "Nothing to do!\n");
> +      return;
> +    }
>
>     max = max_reg_num ();
>
> @@ -1094,24 +1371,38 @@ decompose_multiword_subregs (void)
>        all the insns.  */
>     {
>       unsigned int i;
> +    bool useful_modes_seen = false;
>
>       for (i = FIRST_PSEUDO_REGISTER; i<  max; ++i)
> +      if (regno_reg_rtx[i] != NULL)
> +	{
> +	  enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
> +	  if (choices[false].move_modes_to_split[(int) mode]
> +	      || choices[true].move_modes_to_split[(int) mode])
> +	    {
> +	      useful_modes_seen = true;
> +	      break;
> +	    }
> +	}
> +
> +    if (!useful_modes_seen)
>         {
> -	if (regno_reg_rtx[i] != NULL
> -	&&  GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i]))>  UNITS_PER_WORD)
> -	  break;
> +	if (dump_file)
> +	  fprintf (dump_file, "Nothing to lower in this function.\n");
> +	return;
>         }
> -    if (i == max)
> -      return;
>     }
>
>     if (df)
> -    run_word_dce ();
> +    {
> +      df_set_flags (DF_DEFER_INSN_RESCAN);
> +      run_word_dce ();
> +    }
>
> -  /* FIXME: When the dataflow branch is merged, we can change this
> -     code to look for each multi-word pseudo-register and to find each
> -     insn which sets or uses that register.  That should be faster
> -     than scanning all the insns.  */
> +  /* FIXME: It may be possible to change this code to look for each
> +     multi-word pseudo-register and to find each insn which sets or
> +     uses that register.  That should be faster than scanning all the
> +     insns.  */
>
>     decomposable_context = BITMAP_ALLOC (NULL);
>     non_decomposable_context = BITMAP_ALLOC (NULL);
> @@ -1124,7 +1415,9 @@ decompose_multiword_subregs (void)
>     FOR_EACH_BB (bb)
>       {
>         rtx insn;
> +      bool speed_p;
>
> +      speed_p = optimize_bb_for_speed_p (bb);
>         FOR_BB_INSNS (bb, insn)
>   	{
>   	  rtx set;
> @@ -1138,12 +1431,12 @@ decompose_multiword_subregs (void)
>
>   	  recog_memoized (insn);
>
> -	  if (find_decomposable_shift_zext (insn))
> +	  if (find_decomposable_shift_zext (insn, speed_p))
>   	    continue;
>
>   	  extract_insn (insn);
>
> -	  set = simple_move (insn);
> +	  set = simple_move (insn, speed_p);
>
>   	  if (!set)
>   	    cmi = NOT_SIMPLE_MOVE;
> @@ -1197,7 +1490,9 @@ decompose_multiword_subregs (void)
>         FOR_EACH_BB (bb)
>   	{
>   	  rtx insn;
> +	  bool speed_p;
>
> +	  speed_p = optimize_bb_for_speed_p (bb);
>   	  FOR_BB_INSNS (bb, insn)
>   	    {
>   	      rtx pat;
> @@ -1220,7 +1515,7 @@ decompose_multiword_subregs (void)
>   		  recog_memoized (insn);
>   		  extract_insn (insn);
>
> -		  set = simple_move (insn);
> +		  set = simple_move (insn, speed_p);
>   		  if (set)
>   		    {
>   		      rtx orig_insn = insn;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-04-03 13:53                     ` Richard Sandiford
  2012-04-03 15:33                       ` Kenneth Zadeck
@ 2012-04-03 16:22                       ` Ian Lance Taylor
  1 sibling, 0 replies; 38+ messages in thread
From: Ian Lance Taylor @ 2012-04-03 16:22 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: gcc-patches, Mike Stump, Kenneth Zadeck, avr,
	ramana.radhakrishnan, rdsandiford

Richard Sandiford <rdsandiford@googlemail.com> writes:

> What do you think?  The patch looks OK to me with these changes,
> but I'd like to leave it for 48 hours to see if Ian has any comments.

The patch looks fine to me.

Thanks.

Ian

> 2012-04-03  Kenneth Zadeck  <zadeck@naturalbridge.com>
> 	    Richard Sandiford  <r.sandiford@uk.ibm.com>
>
> 	* Makefile.in (lower-subreg.o, target-globals.o): Depend on
> 	lower-subreg.h.
> 	* lower-subreg.h: New file.
> 	* target-globals.h (this_target_lower_subreg): Declare.
> 	(target_globals): Add lower_subreg;
> 	(restore_target_globals): Restore this_target_lower_subreg.
> 	* target-globals.c: Include it.
> 	(default_target_globals): Add default_target_lower_subreg.
> 	(save_target_globals): Initialize target_lower_subreg.
> 	* rtl.h (init_lower_subreg): Added declaration.
> 	* toplev.c (backend_init_target): Call initializer for lower-subreg
> 	pass.
> 	* lower-subreg.c (LOG_COSTS, FORCE_LOWERING): New macros.
> 	(default_target_lower_subreg): New variable.
> 	(this_target_lower_subreg): Likewise.
> 	(twice_word_mode, choices): New macros.
> 	(shift_cost, compute_splitting_shift, compute_costs)
> 	(init_lower_subreg): New functions.
> 	(resolve_simple_move): Add speed_p argument.  Check choices.
> 	(find_pseudo_copy): Don't check the mode size here.
> 	(resolve_simple_move): Assert the mode size.
> 	(find_decomposable_shift_zext): Add speed_p argument and return
> 	a bool.  Check choices.
> 	(resolve_shift_zext): Add comment.
> 	(dump_shift_choices, dump_choices): New functions.
> 	(decompose_multiword_subregs): Dump list of profitable
> 	transformations.  Add code to skip non profitable transformations.
> 	Update calls to simple_move and find_decomposable_shift_zext.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-04-03 15:33                       ` Kenneth Zadeck
@ 2012-04-03 19:20                         ` Richard Sandiford
  2012-04-03 20:36                           ` Ian Lance Taylor
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Sandiford @ 2012-04-03 19:20 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: gcc-patches, Ian Lance Taylor, Mike Stump, Kenneth Zadeck, avr,
	ramana.radhakrishnan

Kenneth Zadeck <zadeck@naturalbridge.com> writes:
> Richard,
>
> thanks, for doing the changes.    In particular, i did not really 
> understand how the target stuff was supposed to work.
>
> I have one issue with the changes that you made.
>
> I had actually decided that the speed/size decision was not relevant to 
> this patch.    The problem is that since this is a global optimization 
> which propagates the info across the entire web of moves, you really 
> want/need to use the same cost metric everywhere.  (of course, making 
> the speed/size decision on the global optimization level would be fine; 
> my issue is with the choice being at the bb level.)    So now you are 
> going to have some moves in a web saying yes while others saying no.    
> The ones that say no will dominate.   It is unlikely you will get what 
> you want (the important nodes really running fast) out of this, assuming 
> you have a target where the decision could go either way.

Yeah, I suppose that's true.  The problem is that both ways are wrong really.
If we just use the function-level speed setting, there'll be times where
a cold-only web will be optimised as hot.  But as you say, if we apply
the bb-level setting, cold blocks can hold back the hot blocks.
And I don't particularly relish the idea of trying to joust between
the two.

So yeah, applying it on a function-by-function basis is probably better.
Does anyone else have any thoughts before I make that change?

Richard

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-04-03 19:20                         ` Richard Sandiford
@ 2012-04-03 20:36                           ` Ian Lance Taylor
  2012-05-01 14:47                             ` Richard Sandiford
  0 siblings, 1 reply; 38+ messages in thread
From: Ian Lance Taylor @ 2012-04-03 20:36 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: gcc-patches, Mike Stump, Kenneth Zadeck, avr,
	ramana.radhakrishnan, rdsandiford

Richard Sandiford <rdsandiford@googlemail.com> writes:

> Does anyone else have any thoughts before I make that change?

I think that one of you should try to write a test case where it makes a
difference, and add the test case to the testsuite.

Ian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-04-03 20:36                           ` Ian Lance Taylor
@ 2012-05-01 14:47                             ` Richard Sandiford
  2012-05-01 17:51                               ` H.J. Lu
                                                 ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Richard Sandiford @ 2012-05-01 14:47 UTC (permalink / raw)
  To: Ian Lance Taylor
  Cc: Kenneth Zadeck, gcc-patches, Mike Stump, Kenneth Zadeck, avr,
	ramana.radhakrishnan

Ian Lance Taylor <iant@google.com> writes:
> Richard Sandiford <rdsandiford@googlemail.com> writes:
>> Does anyone else have any thoughts before I make that change?
>
> I think that one of you should try to write a test case where it makes a
> difference, and add the test case to the testsuite.

I originally took that to mean a case where function vs. bb speed choices
made a difference.  That isn't really possible as things stand because
I don't know of any in-tree port that assigns different rtx costs to
SETs based on the speed setting.

But now I wonder whether you meant a test case where using rtx costs
makes a difference.  I'm not really in a position to test ARM these days,
but it sounds like any testcase for the VUNZP patch would cover this too,
since it was this patch that prevented the VUNZP one from going in.
I'll also try to come up with a MIPS testcase when I look at that
(this weekend hopefully, but maybe not on recent form).

Anyway, I modified the patch to use a per-function speed setting.
After the off-list discussion between you and Kenny, I went ahead
and applied it after retesting on x86_64-linux-gnu.

To repeat: as things stand, very few targets define proper rtx costs
for SET.  This patch is therefore expected to prevent lower-subreg
from running in cases where it's actually benefical.  If you see that
happening, please check whether the rtx_costs are defined properly.

Of course, if the costs are defined properly and lower-subreg still
makes the wrong choice, we need to look at why.

Richard


gcc/
2012-03-31  Kenneth Zadeck  <zadeck@naturalbridge.com>
	    Richard Sandiford  <r.sandiford@uk.ibm.com>

	* Makefile.in (lower-subreg.o, target-globals.o): Depend on
	lower-subreg.h.
	* lower-subreg.h: New file.
	* target-globals.h (this_target_lower_subreg): Declare.
	(target_globals): Add lower_subreg;
	(restore_target_globals): Restore this_target_lower_subreg.
	* target-globals.c: Include it.
	(default_target_globals): Add default_target_lower_subreg.
	(save_target_globals): Initialize target_lower_subreg.
	* rtl.h (init_lower_subreg): Added declaration.
	* toplev.c (backend_init_target): Call initializer for lower-subreg
	pass.
	* lower-subreg.c (LOG_COSTS, FORCE_LOWERING): New macros.
	(default_target_lower_subreg): New variable.
	(this_target_lower_subreg): Likewise.
	(twice_word_mode, choices): New macros.
	(shift_cost, compute_splitting_shift, compute_costs)
	(init_lower_subreg): New functions.
	(resolve_simple_move): Add speed_p argument.  Check choices.
	(find_pseudo_copy): Don't check the mode size here.
	(resolve_simple_move): Assert the mode size.
	(find_decomposable_shift_zext): Add speed_p argument and return
	a bool.  Check choices.
	(resolve_shift_zext): Add comment.
	(dump_shift_choices, dump_choices): New functions.
	(decompose_multiword_subregs): Dump list of profitable
	transformations.  Add code to skip non profitable transformations.
	Update calls to simple_move and find_decomposable_shift_zext.

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	2012-05-01 09:42:28.612732994 +0100
+++ gcc/Makefile.in	2012-05-01 09:42:51.845652552 +0100
@@ -3428,11 +3428,13 @@ dbgcnt.o: dbgcnt.c $(CONFIG_H) $(SYSTEM_
 lower-subreg.o : lower-subreg.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
    $(MACHMODE_H) $(TM_H) $(RTL_H) $(TM_P_H) $(TIMEVAR_H) $(FLAGS_H) \
    insn-config.h $(BASIC_BLOCK_H) $(RECOG_H) $(OBSTACK_H) $(BITMAP_H) \
-   $(EXPR_H) $(EXCEPT_H) $(REGS_H) $(TREE_PASS_H) $(DF_H) dce.h
+   $(EXPR_H) $(EXCEPT_H) $(REGS_H) $(TREE_PASS_H) $(DF_H) dce.h \
+   lower-subreg.h
 target-globals.o : target-globals.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
    $(TM_H) insn-config.h $(MACHMODE_H) $(GGC_H) toplev.h target-globals.h \
    $(FLAGS_H) $(REGS_H) $(RTL_H) reload.h expmed.h $(EXPR_H) $(OPTABS_H) \
-   $(LIBFUNCS_H) $(CFGLOOP_H) $(IRA_INT_H) builtins.h gcse.h bb-reorder.h
+   $(LIBFUNCS_H) $(CFGLOOP_H) $(IRA_INT_H) builtins.h gcse.h bb-reorder.h \
+   lower-subreg.h
 hw-doloop.o : hw-doloop.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
    $(RTL_H) $(FLAGS_H) $(EXPR_H) hard-reg-set.h $(BASIC_BLOCK_H) $(TM_P_H) \
    $(DF_H) $(CFGLAYOUT_H) $(CFGLOOP_H) output.h $(RECOG_H) $(TARGET_H) \
Index: gcc/lower-subreg.h
===================================================================
--- /dev/null	2012-04-30 12:38:22.655823157 +0100
+++ gcc/lower-subreg.h	2012-05-01 09:42:51.849652538 +0100
@@ -0,0 +1,59 @@
+/* Target-dependent costs for lower-subreg.c.
+   Copyright (C) 2012 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option; any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef LOWER_SUBREG_H
+#define LOWER_SUBREG_H 1
+
+/* Information about whether, and where, lower-subreg should be applied.  */
+struct lower_subreg_choices {
+  /* A boolean vector for move splitting that is indexed by mode and is
+     true for each mode that is to have its copies split.  */
+  bool move_modes_to_split[MAX_MACHINE_MODE];
+
+  /* True if zero-extensions from word_mode to twice_word_mode should
+     be split.  */
+  bool splitting_zext;
+
+  /* Index X is true if twice_word_mode shifts by X + BITS_PER_WORD
+     should be split.  */
+  bool splitting_ashift[MAX_BITS_PER_WORD];
+  bool splitting_lshiftrt[MAX_BITS_PER_WORD];
+
+  /* True if there is at least one mode that is worth splitting.  */
+  bool something_to_do;
+};
+
+/* Target-specific information for the subreg lowering pass.  */
+struct target_lower_subreg {
+  /* An integer mode that is twice as wide as word_mode.  */
+  enum machine_mode x_twice_word_mode;
+
+  /* What we have decided to do when optimizing for size (index 0)
+     and speed (index 1).  */
+  struct lower_subreg_choices x_choices[2];
+};
+
+extern struct target_lower_subreg default_target_lower_subreg;
+#if SWITCHABLE_TARGET
+extern struct target_lower_subreg *this_target_lower_subreg;
+#else
+#define this_target_lower_subreg (&default_target_lower_subreg)
+#endif
+
+#endif
Index: gcc/target-globals.h
===================================================================
--- gcc/target-globals.h	2012-05-01 09:42:28.612732994 +0100
+++ gcc/target-globals.h	2012-05-01 09:42:51.849652538 +0100
@@ -35,6 +35,7 @@ #define TARGET_GLOBALS_H 1
 extern struct target_builtins *this_target_builtins;
 extern struct target_gcse *this_target_gcse;
 extern struct target_bb_reorder *this_target_bb_reorder;
+extern struct target_lower_subreg *this_target_lower_subreg;
 
 struct GTY(()) target_globals {
   struct target_flag_state *GTY((skip)) flag_state;
@@ -51,6 +52,7 @@ struct GTY(()) target_globals {
   struct target_builtins *GTY((skip)) builtins;
   struct target_gcse *GTY((skip)) gcse;
   struct target_bb_reorder *GTY((skip)) bb_reorder;
+  struct target_lower_subreg *GTY((skip)) lower_subreg;
 };
 
 extern struct target_globals default_target_globals;
@@ -74,6 +76,7 @@ restore_target_globals (struct target_gl
   this_target_builtins = g->builtins;
   this_target_gcse = g->gcse;
   this_target_bb_reorder = g->bb_reorder;
+  this_target_lower_subreg = g->lower_subreg;
 }
 #endif
 
Index: gcc/target-globals.c
===================================================================
--- gcc/target-globals.c	2012-05-01 09:42:28.612732994 +0100
+++ gcc/target-globals.c	2012-05-01 09:42:51.849652537 +0100
@@ -40,6 +40,7 @@ Software Foundation; either version 3, o
 #include "builtins.h"
 #include "gcse.h"
 #include "bb-reorder.h"
+#include "lower-subreg.h"
 
 #if SWITCHABLE_TARGET
 struct target_globals default_target_globals = {
@@ -56,7 +57,8 @@ struct target_globals default_target_glo
   &default_target_ira_int,
   &default_target_builtins,
   &default_target_gcse,
-  &default_target_bb_reorder
+  &default_target_bb_reorder,
+  &default_target_lower_subreg
 };
 
 struct target_globals *
@@ -79,6 +81,7 @@ save_target_globals (void)
   g->builtins = XCNEW (struct target_builtins);
   g->gcse = XCNEW (struct target_gcse);
   g->bb_reorder = XCNEW (struct target_bb_reorder);
+  g->lower_subreg = XCNEW (struct target_lower_subreg);
   restore_target_globals (g);
   init_reg_sets ();
   target_reinit ();
Index: gcc/rtl.h
===================================================================
--- gcc/rtl.h	2012-05-01 09:42:28.612732994 +0100
+++ gcc/rtl.h	2012-05-01 09:42:51.840652567 +0100
@@ -2526,6 +2526,9 @@ extern void init_expmed (void);
 extern void expand_inc (rtx, rtx);
 extern void expand_dec (rtx, rtx);
 
+/* In lower-subreg.c */
+extern void init_lower_subreg (void);
+
 /* In gcse.c */
 extern bool can_copy_p (enum machine_mode);
 extern bool can_assign_to_reg_without_clobbers_p (rtx);
Index: gcc/toplev.c
===================================================================
--- gcc/toplev.c	2012-05-01 09:42:28.612732994 +0100
+++ gcc/toplev.c	2012-05-01 09:42:51.842652561 +0100
@@ -1601,6 +1601,7 @@ backend_init_target (void)
   /* rtx_cost is mode-dependent, so cached values need to be recomputed
      on a mode change.  */
   init_expmed ();
+  init_lower_subreg ();
 
   /* We may need to recompute regno_save_code[] and regno_restore_code[]
      after a mode change as well.  */
Index: gcc/lower-subreg.c
===================================================================
--- gcc/lower-subreg.c	2012-05-01 09:42:28.612732994 +0100
+++ gcc/lower-subreg.c	2012-05-01 09:46:48.473830772 +0100
@@ -40,6 +40,7 @@ Software Foundation; either version 3, o
 #include "regs.h"
 #include "tree-pass.h"
 #include "df.h"
+#include "lower-subreg.h"
 
 #ifdef STACK_GROWS_DOWNWARD
 # undef STACK_GROWS_DOWNWARD
@@ -52,10 +53,35 @@ DEF_VEC_P (bitmap);
 DEF_VEC_ALLOC_P (bitmap,heap);
 
 /* Decompose multi-word pseudo-registers into individual
-   pseudo-registers when possible.  This is possible when all the uses
-   of a multi-word register are via SUBREG, or are copies of the
-   register to another location.  Breaking apart the register permits
-   more CSE and permits better register allocation.  */
+   pseudo-registers when possible and profitable.  This is possible
+   when all the uses of a multi-word register are via SUBREG, or are
+   copies of the register to another location.  Breaking apart the
+   register permits more CSE and permits better register allocation.
+   This is profitable if the machine does not have move instructions
+   to do this.
+
+   This pass only splits moves with modes that are wider than
+   word_mode and ASHIFTs, LSHIFTRTs and ZERO_EXTENDs with integer
+   modes that are twice the width of word_mode.  The latter could be
+   generalized if there was a need to do this, but the trend in
+   architectures is to not need this.
+
+   There are two useful preprocessor defines for use by maintainers:
+
+   #define LOG_COSTS 1
+
+   if you wish to see the actual cost estimates that are being used
+   for each mode wider than word mode and the cost estimates for zero
+   extension and the shifts.   This can be useful when port maintainers
+   are tuning insn rtx costs.
+
+   #define FORCE_LOWERING 1
+
+   if you wish to test the pass with all the transformation forced on.
+   This can be useful for finding bugs in the transformations.  */
+
+#define LOG_COSTS 0
+#define FORCE_LOWERING 0
 
 /* Bit N in this bitmap is set if regno N is used in a context in
    which we can decompose it.  */
@@ -75,8 +101,190 @@ DEF_VEC_ALLOC_P (bitmap,heap);
    copy from reg M to reg N.  */
 static VEC(bitmap,heap) *reg_copy_graph;
 
-/* Return whether X is a simple object which we can take a word_mode
-   subreg of.  */
+struct target_lower_subreg default_target_lower_subreg;
+#if SWITCHABLE_TARGET
+struct target_lower_subreg *this_target_lower_subreg
+  = &default_target_lower_subreg;
+#endif
+
+#define twice_word_mode \
+  this_target_lower_subreg->x_twice_word_mode
+#define choices \
+  this_target_lower_subreg->x_choices
+
+/* RTXes used while computing costs.  */
+struct cost_rtxes {
+  /* Source and target registers.  */
+  rtx source;
+  rtx target;
+
+  /* A twice_word_mode ZERO_EXTEND of SOURCE.  */
+  rtx zext;
+
+  /* A shift of SOURCE.  */
+  rtx shift;
+
+  /* A SET of TARGET.  */
+  rtx set;
+};
+
+/* Return the cost of a CODE shift in mode MODE by OP1 bits, using the
+   rtxes in RTXES.  SPEED_P selects between the speed and size cost.  */
+
+static int
+shift_cost (bool speed_p, struct cost_rtxes *rtxes, enum rtx_code code,
+	    enum machine_mode mode, int op1)
+{
+  PUT_MODE (rtxes->target, mode);
+  PUT_CODE (rtxes->shift, code);
+  PUT_MODE (rtxes->shift, mode);
+  PUT_MODE (rtxes->source, mode);
+  XEXP (rtxes->shift, 1) = GEN_INT (op1);
+  SET_SRC (rtxes->set) = rtxes->shift;
+  return insn_rtx_cost (rtxes->set, speed_p);
+}
+
+/* For each X in the range [0, BITS_PER_WORD), set SPLITTING[X]
+   to true if it is profitable to split a double-word CODE shift
+   of X + BITS_PER_WORD bits.  SPEED_P says whether we are testing
+   for speed or size profitability.
+
+   Use the rtxes in RTXES to calculate costs.  WORD_MOVE_ZERO_COST is
+   the cost of moving zero into a word-mode register.  WORD_MOVE_COST
+   is the cost of moving between word registers.  */
+
+static void
+compute_splitting_shift (bool speed_p, struct cost_rtxes *rtxes,
+			 bool *splitting, enum rtx_code code,
+			 int word_move_zero_cost, int word_move_cost)
+{
+  int wide_cost, narrow_cost, i;
+
+  for (i = 0; i < BITS_PER_WORD; i++)
+    {
+      wide_cost = shift_cost (speed_p, rtxes, code, twice_word_mode,
+			      i + BITS_PER_WORD);
+      if (i == 0)
+	narrow_cost = word_move_cost;
+      else
+	narrow_cost = shift_cost (speed_p, rtxes, code, word_mode, i);
+
+      if (LOG_COSTS)
+	fprintf (stderr, "%s %s by %d: original cost %d, split cost %d + %d\n",
+		 GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (code),
+		 i + BITS_PER_WORD, wide_cost, narrow_cost,
+		 word_move_zero_cost);
+
+      if (FORCE_LOWERING || wide_cost >= narrow_cost + word_move_zero_cost)
+	splitting[i] = true;
+    }
+}
+
+/* Compute what we should do when optimizing for speed or size; SPEED_P
+   selects which.  Use RTXES for computing costs.  */
+
+static void
+compute_costs (bool speed_p, struct cost_rtxes *rtxes)
+{
+  unsigned int i;
+  int word_move_zero_cost, word_move_cost;
+
+  SET_SRC (rtxes->set) = CONST0_RTX (word_mode);
+  word_move_zero_cost = insn_rtx_cost (rtxes->set, speed_p);
+
+  SET_SRC (rtxes->set) = rtxes->source;
+  word_move_cost = insn_rtx_cost (rtxes->set, speed_p);
+
+  if (LOG_COSTS)
+    fprintf (stderr, "%s move: from zero cost %d, from reg cost %d\n",
+	     GET_MODE_NAME (word_mode), word_move_zero_cost, word_move_cost);
+
+  for (i = 0; i < MAX_MACHINE_MODE; i++)
+    {
+      enum machine_mode mode = (enum machine_mode) i;
+      int factor = GET_MODE_SIZE (mode) / UNITS_PER_WORD;
+      if (factor > 1)
+	{
+	  int mode_move_cost;
+
+	  PUT_MODE (rtxes->target, mode);
+	  PUT_MODE (rtxes->source, mode);
+	  mode_move_cost = insn_rtx_cost (rtxes->set, speed_p);
+
+	  if (LOG_COSTS)
+	    fprintf (stderr, "%s move: original cost %d, split cost %d * %d\n",
+		     GET_MODE_NAME (mode), mode_move_cost,
+		     word_move_cost, factor);
+
+	  if (FORCE_LOWERING || mode_move_cost >= word_move_cost * factor)
+	    {
+	      choices[speed_p].move_modes_to_split[i] = true;
+	      choices[speed_p].something_to_do = true;
+	    }
+	}
+    }
+
+  /* For the moves and shifts, the only case that is checked is one
+     where the mode of the target is an integer mode twice the width
+     of the word_mode.
+
+     If it is not profitable to split a double word move then do not
+     even consider the shifts or the zero extension.  */
+  if (choices[speed_p].move_modes_to_split[(int) twice_word_mode])
+    {
+      int zext_cost;
+
+      /* The only case here to check to see if moving the upper part with a
+	 zero is cheaper than doing the zext itself.  */
+      PUT_MODE (rtxes->target, twice_word_mode);
+      PUT_MODE (rtxes->source, word_mode);
+      SET_SRC (rtxes->set) = rtxes->zext;
+      zext_cost = insn_rtx_cost (rtxes->set, speed_p);
+
+      if (LOG_COSTS)
+	fprintf (stderr, "%s %s: original cost %d, split cost %d + %d\n",
+		 GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (ZERO_EXTEND),
+		 zext_cost, word_move_cost, word_move_zero_cost);
+
+      if (FORCE_LOWERING || zext_cost >= word_move_cost + word_move_zero_cost)
+	choices[speed_p].splitting_zext = true;
+
+      compute_splitting_shift (speed_p, rtxes,
+			       choices[speed_p].splitting_ashift, ASHIFT,
+			       word_move_zero_cost, word_move_cost);
+      compute_splitting_shift (speed_p, rtxes,
+			       choices[speed_p].splitting_lshiftrt, LSHIFTRT,
+			       word_move_zero_cost, word_move_cost);
+    }
+}
+
+/* Do one-per-target initialisation.  This involves determining
+   which operations on the machine are profitable.  If none are found,
+   then the pass just returns when called.  */
+
+void
+init_lower_subreg (void)
+{
+  struct cost_rtxes rtxes;
+
+  memset (this_target_lower_subreg, 0, sizeof (*this_target_lower_subreg));
+
+  twice_word_mode = GET_MODE_2XWIDER_MODE (word_mode);
+
+  rtxes.target = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+  rtxes.source = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER + 1);
+  rtxes.set = gen_rtx_SET (VOIDmode, rtxes.target, rtxes.source);
+  rtxes.zext = gen_rtx_ZERO_EXTEND (twice_word_mode, rtxes.source);
+  rtxes.shift = gen_rtx_ASHIFT (twice_word_mode, rtxes.source, const0_rtx);
+
+  if (LOG_COSTS)
+    fprintf (stderr, "\nSize costs\n==========\n\n");
+  compute_costs (false, &rtxes);
+
+  if (LOG_COSTS)
+    fprintf (stderr, "\nSpeed costs\n===========\n\n");
+  compute_costs (true, &rtxes);
+}
 
 static bool
 simple_move_operand (rtx x)
@@ -101,12 +309,15 @@ simple_move_operand (rtx x)
   return true;
 }
 
-/* If INSN is a single set between two objects, return the single set.
-   Such an insn can always be decomposed.  INSN should have been
-   passed to recog and extract_insn before this is called.  */
+/* If INSN is a single set between two objects that we want to split,
+   return the single set.  SPEED_P says whether we are optimizing
+   INSN for speed or size.
+
+   INSN should have been passed to recog and extract_insn before this
+   is called.  */
 
 static rtx
-simple_move (rtx insn)
+simple_move (rtx insn, bool speed_p)
 {
   rtx x;
   rtx set;
@@ -150,6 +361,9 @@ simple_move (rtx insn)
   if (GET_MODE_CLASS (mode) == MODE_PARTIAL_INT)
     return NULL_RTX;
 
+  if (!choices[speed_p].move_modes_to_split[(int) mode])
+    return NULL_RTX;
+
   return set;
 }
 
@@ -173,9 +387,6 @@ find_pseudo_copy (rtx set)
   if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
     return false;
 
-  if (GET_MODE_SIZE (GET_MODE (dest)) <= UNITS_PER_WORD)
-    return false;
-
   b = VEC_index (bitmap, reg_copy_graph, rs);
   if (b == NULL)
     {
@@ -668,8 +879,7 @@ resolve_simple_move (rtx set, rtx insn)
   orig_mode = GET_MODE (dest);
 
   words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
-  if (words <= 1)
-    return insn;
+  gcc_assert (words > 1);
 
   start_sequence ();
 
@@ -931,12 +1141,13 @@ resolve_debug (rtx insn)
   resolve_reg_notes (insn);
 }
 
-/* Checks if INSN is a decomposable multiword-shift or zero-extend and
-   sets the decomposable_context bitmap accordingly.  A non-zero value
-   is returned if a decomposable insn has been found.  */
+/* Check if INSN is a decomposable multiword-shift or zero-extend and
+   set the decomposable_context bitmap accordingly.  SPEED_P is true
+   if we are optimizing INSN for speed rather than size.  Return true
+   if INSN is decomposable.  */
 
-static int
-find_decomposable_shift_zext (rtx insn)
+static bool
+find_decomposable_shift_zext (rtx insn, bool speed_p)
 {
   rtx set;
   rtx op;
@@ -944,41 +1155,44 @@ find_decomposable_shift_zext (rtx insn)
 
   set = single_set (insn);
   if (!set)
-    return 0;
+    return false;
 
   op = SET_SRC (set);
   if (GET_CODE (op) != ASHIFT
       && GET_CODE (op) != LSHIFTRT
       && GET_CODE (op) != ZERO_EXTEND)
-    return 0;
+    return false;
 
   op_operand = XEXP (op, 0);
   if (!REG_P (SET_DEST (set)) || !REG_P (op_operand)
       || HARD_REGISTER_NUM_P (REGNO (SET_DEST (set)))
       || HARD_REGISTER_NUM_P (REGNO (op_operand))
-      || !SCALAR_INT_MODE_P (GET_MODE (op)))
-    return 0;
+      || GET_MODE (op) != twice_word_mode)
+    return false;
 
   if (GET_CODE (op) == ZERO_EXTEND)
     {
       if (GET_MODE (op_operand) != word_mode
-	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
-	return 0;
+	  || !choices[speed_p].splitting_zext)
+	return false;
     }
   else /* left or right shift */
     {
+      bool *splitting = (GET_CODE (op) == ASHIFT
+			 ? choices[speed_p].splitting_ashift
+			 : choices[speed_p].splitting_lshiftrt);
       if (!CONST_INT_P (XEXP (op, 1))
-	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
-	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
-	return 0;
+	  || !IN_RANGE (INTVAL (XEXP (op, 1)), BITS_PER_WORD,
+			2 * BITS_PER_WORD - 1)
+	  || !splitting[INTVAL (XEXP (op, 1)) - BITS_PER_WORD])
+	return false;
+
+      bitmap_set_bit (decomposable_context, REGNO (op_operand));
     }
 
   bitmap_set_bit (decomposable_context, REGNO (SET_DEST (set)));
 
-  if (GET_CODE (op) != ZERO_EXTEND)
-    bitmap_set_bit (decomposable_context, REGNO (op_operand));
-
-  return 1;
+  return true;
 }
 
 /* Decompose a more than word wide shift (in INSN) of a multiword
@@ -1008,6 +1222,8 @@ resolve_shift_zext (rtx insn)
 
   op_operand = XEXP (op, 0);
 
+  /* We can tear this operation apart only if the regs were already
+     torn apart.  */
   if (!resolve_reg_p (SET_DEST (set)) && !resolve_reg_p (op_operand))
     return NULL_RTX;
 
@@ -1073,6 +1289,56 @@ resolve_shift_zext (rtx insn)
   return insns;
 }
 
+/* Print to dump_file a description of what we're doing with shift code CODE.
+   SPLITTING[X] is true if we are splitting shifts by X + BITS_PER_WORD.  */
+
+static void
+dump_shift_choices (enum rtx_code code, bool *splitting)
+{
+  int i;
+  const char *sep;
+
+  fprintf (dump_file,
+	   "  Splitting mode %s for %s lowering with shift amounts = ",
+	   GET_MODE_NAME (twice_word_mode), GET_RTX_NAME (code));
+  sep = "";
+  for (i = 0; i < BITS_PER_WORD; i++)
+    if (splitting[i])
+      {
+	fprintf (dump_file, "%s%d", sep, i + BITS_PER_WORD);
+	sep = ",";
+      }
+  fprintf (dump_file, "\n");
+}
+
+/* Print to dump_file a description of what we're doing when optimizing
+   for speed or size; SPEED_P says which.  DESCRIPTION is a description
+   of the SPEED_P choice.  */
+
+static void
+dump_choices (bool speed_p, const char *description)
+{
+  unsigned int i;
+
+  fprintf (dump_file, "Choices when optimizing for %s:\n", description);
+
+  for (i = 0; i < MAX_MACHINE_MODE; i++)
+    if (GET_MODE_SIZE (i) > UNITS_PER_WORD)
+      fprintf (dump_file, "  %s mode %s for copy lowering.\n",
+	       choices[speed_p].move_modes_to_split[i]
+	       ? "Splitting"
+	       : "Skipping",
+	       GET_MODE_NAME ((enum machine_mode) i));
+
+  fprintf (dump_file, "  %s mode %s for zero_extend lowering.\n",
+	   choices[speed_p].splitting_zext ? "Splitting" : "Skipping",
+	   GET_MODE_NAME (twice_word_mode));
+
+  dump_shift_choices (ASHIFT, choices[speed_p].splitting_ashift);
+  dump_shift_choices (LSHIFTRT, choices[speed_p].splitting_ashift);
+  fprintf (dump_file, "\n");
+}
+
 /* Look for registers which are always accessed via word-sized SUBREGs
    or via copies.  Decompose these registers into several word-sized
    pseudo-registers.  */
@@ -1082,9 +1348,21 @@ decompose_multiword_subregs (void)
 {
   unsigned int max;
   basic_block bb;
+  bool speed_p;
 
-  if (df)
-    df_set_flags (DF_DEFER_INSN_RESCAN);
+  if (dump_file)
+    {
+      dump_choices (false, "size");
+      dump_choices (true, "speed");
+    }
+
+  /* Check if this target even has any modes to consider lowering.   */
+  if (!choices[false].something_to_do && !choices[true].something_to_do)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Nothing to do!\n");
+      return;
+    }
 
   max = max_reg_num ();
 
@@ -1094,24 +1372,38 @@ decompose_multiword_subregs (void)
      all the insns.  */
   {
     unsigned int i;
+    bool useful_modes_seen = false;
 
     for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
+      if (regno_reg_rtx[i] != NULL)
+	{
+	  enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
+	  if (choices[false].move_modes_to_split[(int) mode]
+	      || choices[true].move_modes_to_split[(int) mode])
+	    {
+	      useful_modes_seen = true;
+	      break;
+	    }
+	}
+
+    if (!useful_modes_seen)
       {
-	if (regno_reg_rtx[i] != NULL
-	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
-	  break;
+	if (dump_file)
+	  fprintf (dump_file, "Nothing to lower in this function.\n");
+	return;
       }
-    if (i == max)
-      return;
   }
 
   if (df)
-    run_word_dce ();
+    {
+      df_set_flags (DF_DEFER_INSN_RESCAN);
+      run_word_dce ();
+    }
 
-  /* FIXME: When the dataflow branch is merged, we can change this
-     code to look for each multi-word pseudo-register and to find each
-     insn which sets or uses that register.  That should be faster
-     than scanning all the insns.  */
+  /* FIXME: It may be possible to change this code to look for each
+     multi-word pseudo-register and to find each insn which sets or
+     uses that register.  That should be faster than scanning all the
+     insns.  */
 
   decomposable_context = BITMAP_ALLOC (NULL);
   non_decomposable_context = BITMAP_ALLOC (NULL);
@@ -1121,6 +1413,7 @@ decompose_multiword_subregs (void)
   VEC_safe_grow (bitmap, heap, reg_copy_graph, max);
   memset (VEC_address (bitmap, reg_copy_graph), 0, sizeof (bitmap) * max);
 
+  speed_p = optimize_function_for_speed_p (cfun);
   FOR_EACH_BB (bb)
     {
       rtx insn;
@@ -1138,12 +1431,12 @@ decompose_multiword_subregs (void)
 
 	  recog_memoized (insn);
 
-	  if (find_decomposable_shift_zext (insn))
+	  if (find_decomposable_shift_zext (insn, speed_p))
 	    continue;
 
 	  extract_insn (insn);
 
-	  set = simple_move (insn);
+	  set = simple_move (insn, speed_p);
 
 	  if (!set)
 	    cmi = NOT_SIMPLE_MOVE;
@@ -1197,7 +1490,9 @@ decompose_multiword_subregs (void)
       FOR_EACH_BB (bb)
 	{
 	  rtx insn;
+	  bool speed_p;
 
+	  speed_p = optimize_bb_for_speed_p (bb);
 	  FOR_BB_INSNS (bb, insn)
 	    {
 	      rtx pat;
@@ -1220,7 +1515,7 @@ decompose_multiword_subregs (void)
 		  recog_memoized (insn);
 		  extract_insn (insn);
 
-		  set = simple_move (insn);
+		  set = simple_move (insn, speed_p);
 		  if (set)
 		    {
 		      rtx orig_insn = insn;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-05-01 14:47                             ` Richard Sandiford
@ 2012-05-01 17:51                               ` H.J. Lu
  2012-05-03 19:52                               ` Georg-Johann Lay
  2012-05-09  6:02                               ` Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
  2 siblings, 0 replies; 38+ messages in thread
From: H.J. Lu @ 2012-05-01 17:51 UTC (permalink / raw)
  To: Ian Lance Taylor, Kenneth Zadeck, gcc-patches, Mike Stump,
	Kenneth Zadeck, avr, ramana.radhakrishnan, rdsandiford

On Tue, May 1, 2012 at 7:46 AM, Richard Sandiford
<rdsandiford@googlemail.com> wrote:
> Ian Lance Taylor <iant@google.com> writes:
>> Richard Sandiford <rdsandiford@googlemail.com> writes:
>>> Does anyone else have any thoughts before I make that change?
>>
>> I think that one of you should try to write a test case where it makes a
>> difference, and add the test case to the testsuite.
>
> I originally took that to mean a case where function vs. bb speed choices
> made a difference.  That isn't really possible as things stand because
> I don't know of any in-tree port that assigns different rtx costs to
> SETs based on the speed setting.
>
> But now I wonder whether you meant a test case where using rtx costs
> makes a difference.  I'm not really in a position to test ARM these days,
> but it sounds like any testcase for the VUNZP patch would cover this too,
> since it was this patch that prevented the VUNZP one from going in.
> I'll also try to come up with a MIPS testcase when I look at that
> (this weekend hopefully, but maybe not on recent form).
>
> Anyway, I modified the patch to use a per-function speed setting.
> After the off-list discussion between you and Kenny, I went ahead
> and applied it after retesting on x86_64-linux-gnu.
>
> To repeat: as things stand, very few targets define proper rtx costs
> for SET.  This patch is therefore expected to prevent lower-subreg
> from running in cases where it's actually benefical.  If you see that
> happening, please check whether the rtx_costs are defined properly.
>
> Of course, if the costs are defined properly and lower-subreg still
> makes the wrong choice, we need to look at why.
>
> Richard
>
>
> gcc/
> 2012-03-31  Kenneth Zadeck  <zadeck@naturalbridge.com>
>            Richard Sandiford  <r.sandiford@uk.ibm.com>
>
>        * Makefile.in (lower-subreg.o, target-globals.o): Depend on
>        lower-subreg.h.
>        * lower-subreg.h: New file.
>        * target-globals.h (this_target_lower_subreg): Declare.
>        (target_globals): Add lower_subreg;
>        (restore_target_globals): Restore this_target_lower_subreg.
>        * target-globals.c: Include it.
>        (default_target_globals): Add default_target_lower_subreg.
>        (save_target_globals): Initialize target_lower_subreg.
>        * rtl.h (init_lower_subreg): Added declaration.
>        * toplev.c (backend_init_target): Call initializer for lower-subreg
>        pass.
>        * lower-subreg.c (LOG_COSTS, FORCE_LOWERING): New macros.
>        (default_target_lower_subreg): New variable.
>        (this_target_lower_subreg): Likewise.
>        (twice_word_mode, choices): New macros.
>        (shift_cost, compute_splitting_shift, compute_costs)
>        (init_lower_subreg): New functions.
>        (resolve_simple_move): Add speed_p argument.  Check choices.
>        (find_pseudo_copy): Don't check the mode size here.
>        (resolve_simple_move): Assert the mode size.
>        (find_decomposable_shift_zext): Add speed_p argument and return
>        a bool.  Check choices.
>        (resolve_shift_zext): Add comment.
>        (dump_shift_choices, dump_choices): New functions.
>        (decompose_multiword_subregs): Dump list of profitable
>        transformations.  Add code to skip non profitable transformations.
>        Update calls to simple_move and find_decomposable_shift_zext.
>

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53176

-- 
H.J.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-05-01 14:47                             ` Richard Sandiford
  2012-05-01 17:51                               ` H.J. Lu
@ 2012-05-03 19:52                               ` Georg-Johann Lay
  2012-05-03 22:14                                 ` Mike Stump
  2012-05-09  6:02                               ` Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
  2 siblings, 1 reply; 38+ messages in thread
From: Georg-Johann Lay @ 2012-05-03 19:52 UTC (permalink / raw)
  To: gcc-patches, avr
  Cc: Ian Lance Taylor, Kenneth Zadeck, Mike Stump, Kenneth Zadeck,
	ramana.radhakrishnan, rdsandiford

Richard Sandiford wrote:
> Ian Lance Taylor <iant@google.com> writes:
>> Richard Sandiford <rdsandiford@googlemail.com> writes:
>>> Does anyone else have any thoughts before I make that change?
>> I think that one of you should try to write a test case where it makes a
>> difference, and add the test case to the testsuite.
> 
> I originally took that to mean a case where function vs. bb speed choices
> made a difference.  That isn't really possible as things stand because
> I don't know of any in-tree port that assigns different rtx costs to
> SETs based on the speed setting.
> 
> But now I wonder whether you meant a test case where using rtx costs
> makes a difference.  I'm not really in a position to test ARM these days,
> but it sounds like any testcase for the VUNZP patch would cover this too,
> since it was this patch that prevented the VUNZP one from going in.
> I'll also try to come up with a MIPS testcase when I look at that
> (this weekend hopefully, but maybe not on recent form).
> 
> Anyway, I modified the patch to use a per-function speed setting.
> After the off-list discussion between you and Kenny, I went ahead
> and applied it after retesting on x86_64-linux-gnu.
> 
> To repeat: as things stand, very few targets define proper rtx costs
> for SET.  This patch is therefore expected to prevent lower-subreg
> from running in cases where it's actually benefical.  If you see that
> happening, please check whether the rtx_costs are defined properly.

It's hardly possible to write proper rtx_costs for SET:

1) What should be the cost of (const_int 1) if you don't see the
machine mode? Is it QI, is it HI, is it SI or whatever?

There are platforms where this matters, for example the platform this
PR was initially reported for.

2) If the target will be a REG, what is the register class for the
assignment? rtx_costs are called after reload, so it would be good to
know. It would be good to know if it is a pseudo or hard reg.

And in many places the backend does not know where it is standing.
Is it upon expanding? Prior or after combine? Or split1?

3) Likewise, the costs of MEM are peeled of MEM and pass just
the address without any information on the MEM like it's address
space. Cost might highly depend on the address space involved.

The original PR is because split of mem:HI is fine -- if it reads
from generic. And splitting mem:HI is complete bloat for other
address spaces. Likewise for wider modes like PSI, SI, ...

ad 3) I wonder if the patch helps with the avr backend at all?
Does it improve the situation in any way? And is it worth to clean up
the avr backend and remove the FIXMEs there? I.e expand reads from MEM
as they are instead of hiding all inside UNSPECs?

Johann

> Of course, if the costs are defined properly and lower-subreg still
> makes the wrong choice, we need to look at why.
> 
> Richard

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-05-03 19:52                               ` Georg-Johann Lay
@ 2012-05-03 22:14                                 ` Mike Stump
  2012-05-04 23:02                                   ` Georg-Johann Lay
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Stump @ 2012-05-03 22:14 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: gcc-patches, Ian Lance Taylor, Kenneth Zadeck, Kenneth Zadeck,
	ramana.radhakrishnan, rdsandiford

On May 3, 2012, at 12:50 PM, Georg-Johann Lay wrote:
> It's hardly possible to write proper rtx_costs for SET:
> 
> 1) What should be the cost of (const_int 1) if you don't see the
> machine mode? Is it QI, is it HI, is it SI or whatever?

You can choose to see the complete expression in its entirety and form the cost for it.

  (set (reg:DI 1) (const_int 1))

is a DI mode set of register 1, with the value 1.  This can have a different cost, from the perspective of the cost function from:

  (set (reg:SI 1 (const_int 1))

or even

  (set (reg:SI 1 (const_int 2))

or

  (set (reg:SI 2 (const_int 1))

though, little else in the compiler would help mitigate such differences.  Just return a non-zero value from the cost function.  If you return zero, you are saying that you don't care, or that the costs compose in a simplistic manner.  Try returning 1, and figuring out the total cost of the entire expression yourself, if the simplistic answer is wrong.

> 2) If the target will be a REG, what is the register class for the
> assignment?

Hard register has a class associated with it that can be found with REGNO_REG_CLASS (REGNO (x)).  If the register isn't a hard register, then, no class has been chosen for it yet, and the register allocator can choose any valid class it wants for the mode.  I suspect you're better off explaining the average cost (or maybe even the best case cost), when multiple choices exist.  In general, someone else should check later the true cost of the operation based upon the class and should be willing to influence the code generated (the class picked), so what you return for rtx_costs shouldn't matter too much.

> rtx_costs are called after reload, so it would be good to
> know. It would be good to know if it is a pseudo or hard reg.

HARD_REGISTER_NUM_P (REGNO (x)) will tell you if it is hard.  The inverse of this will tell you if is isn't hard, aka, a pseudo.

> And in many places the backend does not know where it is standing.
> Is it upon expanding? Prior or after combine? Or split1?

I think the idea is to give the cost of the rtl it asks for.

  (set (reg:SI 1) (const_int 0))

should have the same cost, before reload, after reload, during optimization, just before final, or at expand time.

> 3) Likewise, the costs of MEM are peeled of MEM and pass just
> the address without any information on the MEM like it's address
> space. Cost might highly depend on the address space involved.

Yes, that is why on my machine:

  (set (mem) (reg))

has one set of costs, and

  (set (reg) (mem))

has a completely different set of costs.  The address space, if it influences the cost, can be had with:

  MEM_ADDR_SPACE (XEXP (x, 0))

or a store, and:

  MEM_ADDR_SPACE (XEXP (x, 1))

for a load.

> The original PR is because split of mem:HI is fine -- if it reads
> from generic. And splitting mem:HI is complete bloat for other
> address spaces. Likewise for wider modes like PSI, SI, ...

The you will want to check the mode of the MEM, and address space involved.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-05-03 22:14                                 ` Mike Stump
@ 2012-05-04 23:02                                   ` Georg-Johann Lay
  2012-05-06 18:55                                     ` [committed] Fix lower-subreg cost calculation Richard Sandiford
  2012-05-07 19:01                                     ` [C Patch]: pr52543 Mike Stump
  0 siblings, 2 replies; 38+ messages in thread
From: Georg-Johann Lay @ 2012-05-04 23:02 UTC (permalink / raw)
  To: Mike Stump
  Cc: gcc-patches, Ian Lance Taylor, Kenneth Zadeck, Kenneth Zadeck,
	ramana.radhakrishnan, rdsandiford

Mike Stump schrieb:
> On May 3, 2012, at 12:50 PM, Georg-Johann Lay wrote:
>> It's hardly possible to write proper rtx_costs for SET:
>> 
>> 1) What should be the cost of (const_int 1) if you don't see the 
>> machine mode? Is it QI, is it HI, is it SI or whatever?
> 
> You can choose to see the complete expression in its entirety and
> form the cost for it.
> 
> (set (reg:DI 1) (const_int 1))

Sorry, for the dumb question, but I still don't get it.

TARGET_RTX_COSTS gets called with x = (const_int 1) and outer = SET
for example. How do I get SET_DEST from that information?

I don't now if lower-subreg.s ever emits such cost requests, but several
passes definitely do.

> is a DI mode set of register 1, with the value 1.  This can have a
> different cost, from the perspective of the cost function from:
> 
> (set (reg:SI 1 (const_int 1))
> 
> or even
> 
> (set (reg:SI 1 (const_int 2))
> 
> or
> 
> (set (reg:SI 2 (const_int 1))
> 
> though, little else in the compiler would help mitigate such
> differences.  Just return a non-zero value from the cost function.
> If you return zero, you are saying that you don't care, or that the
> costs compose in a simplistic manner.  Try returning 1, and figuring
> out the total cost of the entire expression yourself, if the
> simplistic answer is wrong.

TARGET_RTX_COSTS does not mention the "0 = don't know" in any way.
There are post-reload passes that mention it in source comments, yes.

>> 2) If the target will be a REG, what is the register class for the 
>> assignment?
> 
> Hard register has a class associated with it that can be found with
> REGNO_REG_CLASS (REGNO (x)).  If the register [...]

Thanks for the detailed explanation.

But again the question was for the case when TARGET_RTX_COSTS is called
with outer = SET and x = (const_int ?).

> then, no class has been chosen for it yet, and the register allocator
> can choose any valid class it wants for the mode.  I suspect you're
> better off explaining the average cost (or maybe even the best case
> cost), when multiple choices exist.  In general, someone else should
> check later the true cost of the operation based upon the class and
> should be willing to influence the code generated (the class picked),
> so what you return for rtx_costs shouldn't matter too much.
> 
>> rtx_costs are called after reload, so it would be good to know. It
>> would be good to know if it is a pseudo or hard reg.
> 
> HARD_REGISTER_NUM_P (REGNO (x)) will tell you if it is hard.  The
> inverse of this will tell you if is isn't hard, aka, a pseudo.
> 
>> And in many places the backend does not know where it is standing. 
>> Is it upon expanding? Prior or after combine? Or split1?
> 
> I think the idea is to give the cost of the rtl it asks for.
> 
> (set (reg:SI 1) (const_int 0))
> 
> should have the same cost, before reload, after reload, during
> optimization, just before final, or at expand time.

Just grepped a log from avr-gcc -mlog=rtx_costs but there was
not a single line with outer=pattern.

In some cases like insn-combine where there is a complete pattern,
synthesized from several insns, but just SET_DEST is passed to
rtx_costs hiding informations from the backends in an unnecessary
way; presumably for historical reasons.

There are machines with complex instructions sets like 4-operand
OR and combined OR and SHIFT or similar. Instead of writing
hundreds or thousands of lines in rtx_costs and XEXP TARGET_RTX_COSTS
(effectively rewriting insn-recog.c) it would be straight forward
to attach costs to insns and use recog + insn_attr to get the costs.

I tried that approach (write costs as insn attribute and use recog
to get the costs) in a backend and it works smooth and really helped
to keep the backend clean and maintainable.

>> 3) Likewise, the costs of MEM are peeled of MEM and pass just the
>> address without any information on the MEM like it's address space.
>> Cost might highly depend on the address space involved.
> 
> Yes, that is why on my machine:
> 
> (set (mem) (reg))
> 
> has one set of costs, and
> 
> (set (reg) (mem))

What hook are we talking about?

TARGET_RTX_COSTS? (not called with outer=PATTERN)
TARGET_MEMORY_MOVE_COST? (uses register classes)
TARGET_ADDRESS_COST (no address space available as MEM was peeled)

> has a completely different set of costs.  The address space, if it
> influences the cost, can be had with:
> 
> MEM_ADDR_SPACE (XEXP (x, 0))
> 
> or a store, and:
> 
> MEM_ADDR_SPACE (XEXP (x, 1))
> 
> for a load.
> 
>> The original PR is because split of mem:HI is fine -- if it reads 
>> from generic. And splitting mem:HI is complete bloat for other 
>> address spaces. Likewise for wider modes like PSI, SI, ...
> 
> The you will want to check the mode of the MEM, and address space
> involved.

The issue is to *get* the MEM.

To hack around PR52543 for PSImode (there is just one AS that uses
PSImode addresses), I used TARGET_MODE_DEPENDENT_ADDRESS_P, see
respective FIXME in avr.c.

But again no avail to get the AS. AS is attached to the MEM, not
to the address, i.e. XEXP (mem, 0) which is passed to that hook.

Thanks for all your detailed descriptions.

Johann

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [committed] Fix lower-subreg cost calculation
  2012-05-04 23:02                                   ` Georg-Johann Lay
@ 2012-05-06 18:55                                     ` Richard Sandiford
  2012-05-08 15:26                                       ` Richard Earnshaw
  2012-05-07 19:01                                     ` [C Patch]: pr52543 Mike Stump
  1 sibling, 1 reply; 38+ messages in thread
From: Richard Sandiford @ 2012-05-06 18:55 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: Mike Stump, gcc-patches, Ian Lance Taylor, Kenneth Zadeck,
	Kenneth Zadeck, ramana.radhakrishnan

Georg-Johann Lay <avr@gjlay.de> writes:
> TARGET_RTX_COSTS gets called with x = (const_int 1) and outer = SET
> for example. How do I get SET_DEST from that information?
>
> I don't now if lower-subreg.s ever emits such cost requests, but several
> passes definitely do.

Gah!  I really should have remembered that insn_rtx_cost happily ignores
both SETs and SET_DESTs, and skips straight to the SET_SRC.  This caught
me out when looking at the auto-inc-dec rewrite last year too.  (The problem
in that case was that insn_rtx_cost ignored the cost of MEMs in stores,
and only took into account the cost of MEMs in loads.)

While that probably ought to change, I felt like I was going down a
rathole last time I looked at it, so this patch does what I should
have done originally.

For the record: I wondered whether rtlanal.c should base the default
register-to-register copy cost for mode M on the lowest move_cost[M][c][c].
The problem is that move_cost has traditionally been used to choose
between difference classes in the same mode, rather than between modes,
with 2 as the base cost.  So I don't think it's suitable.

Tested on x86_64-linux-gnu and with the upcoming MIPS costs.  Installed.

Sorry for the breakage.

Richard


gcc/
	* lower-subreg.c (shift_cost): Use set_src_cost, avoiding the SET.
	(compute_costs): Likewise for the zero extension.  Use set_rtx_cost
	to compute the cost of moves.  Set the mode of the target register.

Index: gcc/lower-subreg.c
===================================================================
--- gcc/lower-subreg.c	2012-05-06 13:47:49.000000000 +0100
+++ gcc/lower-subreg.c	2012-05-06 14:56:47.851024108 +0100
@@ -135,13 +135,11 @@ struct cost_rtxes {
 shift_cost (bool speed_p, struct cost_rtxes *rtxes, enum rtx_code code,
 	    enum machine_mode mode, int op1)
 {
-  PUT_MODE (rtxes->target, mode);
   PUT_CODE (rtxes->shift, code);
   PUT_MODE (rtxes->shift, mode);
   PUT_MODE (rtxes->source, mode);
   XEXP (rtxes->shift, 1) = GEN_INT (op1);
-  SET_SRC (rtxes->set) = rtxes->shift;
-  return insn_rtx_cost (rtxes->set, speed_p);
+  return set_src_cost (rtxes->shift, speed_p);
 }
 
 /* For each X in the range [0, BITS_PER_WORD), set SPLITTING[X]
@@ -189,11 +187,12 @@ compute_costs (bool speed_p, struct cost
   unsigned int i;
   int word_move_zero_cost, word_move_cost;
 
+  PUT_MODE (rtxes->target, word_mode);
   SET_SRC (rtxes->set) = CONST0_RTX (word_mode);
-  word_move_zero_cost = insn_rtx_cost (rtxes->set, speed_p);
+  word_move_zero_cost = set_rtx_cost (rtxes->set, speed_p);
 
   SET_SRC (rtxes->set) = rtxes->source;
-  word_move_cost = insn_rtx_cost (rtxes->set, speed_p);
+  word_move_cost = set_rtx_cost (rtxes->set, speed_p);
 
   if (LOG_COSTS)
     fprintf (stderr, "%s move: from zero cost %d, from reg cost %d\n",
@@ -209,7 +208,7 @@ compute_costs (bool speed_p, struct cost
 
 	  PUT_MODE (rtxes->target, mode);
 	  PUT_MODE (rtxes->source, mode);
-	  mode_move_cost = insn_rtx_cost (rtxes->set, speed_p);
+	  mode_move_cost = set_rtx_cost (rtxes->set, speed_p);
 
 	  if (LOG_COSTS)
 	    fprintf (stderr, "%s move: original cost %d, split cost %d * %d\n",
@@ -236,10 +235,8 @@ compute_costs (bool speed_p, struct cost
 
       /* The only case here to check to see if moving the upper part with a
 	 zero is cheaper than doing the zext itself.  */
-      PUT_MODE (rtxes->target, twice_word_mode);
       PUT_MODE (rtxes->source, word_mode);
-      SET_SRC (rtxes->set) = rtxes->zext;
-      zext_cost = insn_rtx_cost (rtxes->set, speed_p);
+      zext_cost = set_src_cost (rtxes->zext, speed_p);
 
       if (LOG_COSTS)
 	fprintf (stderr, "%s %s: original cost %d, split cost %d + %d\n",

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-05-04 23:02                                   ` Georg-Johann Lay
  2012-05-06 18:55                                     ` [committed] Fix lower-subreg cost calculation Richard Sandiford
@ 2012-05-07 19:01                                     ` Mike Stump
  1 sibling, 0 replies; 38+ messages in thread
From: Mike Stump @ 2012-05-07 19:01 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: gcc-patches, Ian Lance Taylor, Kenneth Zadeck, Kenneth Zadeck,
	ramana.radhakrishnan, rdsandiford

On May 4, 2012, at 4:01 PM, Georg-Johann Lay wrote:
> Mike Stump schrieb:
>> On May 3, 2012, at 12:50 PM, Georg-Johann Lay wrote:
>>> It's hardly possible to write proper rtx_costs for SET:
>>> 1) What should be the cost of (const_int 1) if you don't see the machine mode? Is it QI, is it HI, is it SI or whatever?
>> You can choose to see the complete expression in its entirety and
>> form the cost for it.
>> (set (reg:DI 1) (const_int 1))
> 
> Sorry, for the dumb question, but I still don't get it.

Ah, so, I answered the question, if there are no machine independent bugs, how would you do it...  Unstated in my email, is that I think that anybody (the machine independent code) that wants a better cost, needs to ask a more complete question of the port.  I view TARGET_RTX_COSTS as the answer for that question.

Stated differently, if an optimization pass has the information, a mode, a memory space, the containing expression, and those details matter, then you should merely submit bug fix requests for each instance to have them include those details into the question, as those details matter.  If you only get the question x = (const_int 1) outer = SET, and this comes from other than rtx_costs, than this would be such an instance where the machine independent code should be changed.  If it comes from rtx_costs in the recursive case, then previous to this question _was_ a question for the more complete case that the port ignored.

> TARGET_RTX_COSTS gets called with x = (const_int 1) and outer = SET
> for example. How do I get SET_DEST from that information?
> 
> I don't now if lower-subreg.s ever emits such cost requests, but several
> passes definitely do.

They are wrong (overly simplistic).

> There are machines with complex instructions sets like 4-operand
> OR and combined OR and SHIFT or similar. Instead of writing
> hundreds or thousands of lines in rtx_costs and XEXP TARGET_RTX_COSTS
> (effectively rewriting insn-recog.c) it would be straight forward
> to attach costs to insns and use recog + insn_attr to get the costs.
> 
> I tried that approach (write costs as insn attribute and use recog
> to get the costs) in a backend and it works smooth and really helped
> to keep the backend clean and maintainable.

Wow, cool.  I have costs, and I'd like a solution that feels cleaner to me, that certainly feels cleaner.  Calling recog seems scary to me...  the problem is that your supposed to get the costs of arbitrarily large code snippets, if you need multiple insns, you're supposed to add of all of them up.  Does this solution work when multiple instructions are needed?  If recog says it isn't a valid instruction, what do you do?  I shudder at the thought of replicating combine or reload...

>>> 3) Likewise, the costs of MEM are peeled of MEM and pass just the
>>> address without any information on the MEM like it's address space.
>>> Cost might highly depend on the address space involved.
>> Yes, that is why on my machine:
>> (set (mem) (reg))
>> has one set of costs, and
>> (set (reg) (mem))
> 
> What hook are we talking about?
> 
> TARGET_RTX_COSTS? (not called with outer=PATTERN)

This one...  I merely saw an instance where the machine independent code asked the right question...

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [committed] Fix lower-subreg cost calculation
  2012-05-06 18:55                                     ` [committed] Fix lower-subreg cost calculation Richard Sandiford
@ 2012-05-08 15:26                                       ` Richard Earnshaw
  2012-05-08 21:42                                         ` Richard Sandiford
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Earnshaw @ 2012-05-08 15:26 UTC (permalink / raw)
  To: Georg-Johann Lay, Mike Stump, gcc-patches, Ian Lance Taylor,
	Kenneth Zadeck, Kenneth Zadeck, ramana.radhakrishnan,
	rdsandiford

On 06/05/12 19:55, Richard Sandiford wrote:
> Georg-Johann Lay <avr@gjlay.de> writes:
>> TARGET_RTX_COSTS gets called with x = (const_int 1) and outer = SET
>> for example. How do I get SET_DEST from that information?
>>
>> I don't now if lower-subreg.s ever emits such cost requests, but several
>> passes definitely do.
> 
> Gah!  I really should have remembered that insn_rtx_cost happily ignores
> both SETs and SET_DESTs, and skips straight to the SET_SRC.  This caught
> me out when looking at the auto-inc-dec rewrite last year too.  (The problem
> in that case was that insn_rtx_cost ignored the cost of MEMs in stores,
> and only took into account the cost of MEMs in loads.)
> 
> While that probably ought to change, I felt like I was going down a
> rathole last time I looked at it, so this patch does what I should
> have done originally.
> 
> For the record: I wondered whether rtlanal.c should base the default
> register-to-register copy cost for mode M on the lowest move_cost[M][c][c].
> The problem is that move_cost has traditionally been used to choose
> between difference classes in the same mode, rather than between modes,
> with 2 as the base cost.  So I don't think it's suitable.
> 
> Tested on x86_64-linux-gnu and with the upcoming MIPS costs.  Installed.
> 
> Sorry for the breakage.
> 
> Richard
> 
> 
> gcc/
> 	* lower-subreg.c (shift_cost): Use set_src_cost, avoiding the SET.
> 	(compute_costs): Likewise for the zero extension.  Use set_rtx_cost
> 	to compute the cost of moves.  Set the mode of the target register.
> 

FTR, this caused
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53278

R.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [committed] Fix lower-subreg cost calculation
  2012-05-08 15:26                                       ` Richard Earnshaw
@ 2012-05-08 21:42                                         ` Richard Sandiford
  2012-05-09  9:48                                           ` Richard Earnshaw
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Sandiford @ 2012-05-08 21:42 UTC (permalink / raw)
  To: Richard Earnshaw
  Cc: Georg-Johann Lay, Mike Stump, gcc-patches, Ian Lance Taylor,
	Kenneth Zadeck, Kenneth Zadeck, ramana.radhakrishnan

Richard Earnshaw <rearnsha@arm.com> writes:
> FTR, this caused
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53278

Well, this really has been a brown-paper-bag patch.  Fixed as below.
Tested on x86_64-linux-gnu and applied as obvious.

Richard


gcc/
	PR rtl-optimization/53278
	* lower-subreg.c (decompose_multiword_subregs): Remove left-over
	speed_p code from earlier patch.

Index: gcc/lower-subreg.c
===================================================================
--- gcc/lower-subreg.c	2012-05-08 19:45:31.000000000 +0100
+++ gcc/lower-subreg.c	2012-05-08 19:45:31.793855523 +0100
@@ -1487,9 +1487,7 @@ decompose_multiword_subregs (void)
       FOR_EACH_BB (bb)
 	{
 	  rtx insn;
-	  bool speed_p;
 
-	  speed_p = optimize_bb_for_speed_p (bb);
 	  FOR_BB_INSNS (bb, insn)
 	    {
 	      rtx pat;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543)
  2012-05-01 14:47                             ` Richard Sandiford
  2012-05-01 17:51                               ` H.J. Lu
  2012-05-03 19:52                               ` Georg-Johann Lay
@ 2012-05-09  6:02                               ` Hans-Peter Nilsson
  2012-05-09  9:15                                 ` Fix gcc.dg/lower-subreg-1.c failure Richard Sandiford
  2012-05-16  6:25                                 ` ping: Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
  2 siblings, 2 replies; 38+ messages in thread
From: Hans-Peter Nilsson @ 2012-05-09  6:02 UTC (permalink / raw)
  To: rdsandiford
  Cc: iant, zadeck, gcc-patches, mikestump, Kenneth.Zadeck, avr,
	ramana.radhakrishnan

> From: Richard Sandiford <rdsandiford@googlemail.com>
> Date: Tue, 1 May 2012 16:46:38 +0200

> To repeat: as things stand, very few targets define proper rtx costs
> for SET.

IMHO it's wrong to start blaming targets when rtx_cost doesn't
take the mode in account in the first place, for the default
cost.  (Well, except for the modes-tieable subreg special-case.)
The targets where an operation in N * word_mode costs no more
than one in word_mode, if there even is one, is a minority,
let's adjust the defaults to that.

>  This patch is therefore expected to prevent lower-subreg
> from running in cases where it's actually benefical.  If you see that
> happening, please check whether the rtx_costs are defined properly.

Well, for CRIS (one of the targets of the PR53176 fallout) they
are sane, basically.  Where cris_rtx_costs returns true, it
returns mostly(*) ballparkly-correct costs *where it's passed an
rtx for which there's a corresponding insn*, otherwise falling
back to the defaults.  It shouldn't have to check for validity
of the rtx asked about; core GCC already knows which insns there
are and can gate that in rtx_cost or its callers.

(*) I see a bug in that cris_rtx_costs doesn't check the mode
for extendsidi2, to return COSTS_N_INSNS (3) instead of 0
(because a sign-extending move to SImode doesn't cost more than
a move; a sign- or zero-extension is also free in an operand for
addition and multiplication).  But this isn't on the path that
lower-subreg.c takes, so only an incidental observation...

> Of course, if the costs are defined properly and lower-subreg still
> makes the wrong choice, we need to look at why.

By the way, regarding validity of rtx_cost calls:

> +++ gcc/lower-subreg.c  2012-05-01 09:46:48.473830772 +0100

> +/* Return the cost of a CODE shift in mode MODE by OP1 bits, using the
> +   rtxes in RTXES.  SPEED_P selects between the speed and size cost.  */
> +
> +static int
> +shift_cost (bool speed_p, struct cost_rtxes *rtxes, enum rtx_code code,
> +           enum machine_mode mode, int op1)
> +{
> +  PUT_MODE (rtxes->target, mode);
> +  PUT_CODE (rtxes->shift, code);
> +  PUT_MODE (rtxes->shift, mode);
> +  PUT_MODE (rtxes->source, mode);
> +  XEXP (rtxes->shift, 1) = GEN_INT (op1);
> +  SET_SRC (rtxes->set) = rtxes->shift;
> +  return insn_rtx_cost (rtxes->set, speed_p);
> +}
> +
> +/* For each X in the range [0, BITS_PER_WORD), set SPLITTING[X]
> +   to true if it is profitable to split a double-word CODE shift
> +   of X + BITS_PER_WORD bits.  SPEED_P says whether we are testing
> +   for speed or size profitability.
> +
> +   Use the rtxes in RTXES to calculate costs.  WORD_MOVE_ZERO_COST is
> +   the cost of moving zero into a word-mode register.  WORD_MOVE_COST
> +   is the cost of moving between word registers.  */
> +
> +static void
> +compute_splitting_shift (bool speed_p, struct cost_rtxes *rtxes,
> +                        bool *splitting, enum rtx_code code,
> +                        int word_move_zero_cost, int word_move_cost)
> +{

I think there should be a gating check whether the target
implements that kind of shift in that mode at all, before
checking the cost.  Not sure whether it's generally best to put
that test here, or to make the rtx_cost function return the cost
of a libcall for that mode when that happens.  Similar for the
other insns.

Isn't the below better than doing virtually the same in each
target's rtx_costs?  Not tested yet besides "make cc1" and
checking that lower-subreg.c yields sane costs and that
gcc.dg/lower-subreg-1.c passes for cris-elf.  Note that
untieable SUBREGs still get a higher cost than tieable ones.

I'll test this for cris-elf, please tell me if/what other tests
and targets are required (simulator or compilefarm targets only,
please).

	* rtlanal.c (rtx_cost): Adjust default cost for X with a
	UNITS_PER_WORD factor for all X according to the size of
	its mode, not just for SUBREGs with untieable modes.

Index: gcc/rtlanal.c
===================================================================
--- gcc/rtlanal.c	(revision 187308)
+++ gcc/rtlanal.c	(working copy)
@@ -3755,10 +3755,17 @@ rtx_cost (rtx x, enum rtx_code outer_cod
   enum rtx_code code;
   const char *fmt;
   int total;
+  int factor;
 
   if (x == 0)
     return 0;
 
+  /* A size N times larger than UNITS_PER_WORD likely needs N times as
+     many insns, taking N times as long.  */
+  factor = GET_MODE_SIZE (GET_MODE (x)) / UNITS_PER_WORD;
+  if (factor == 0)
+    factor = 1;
+
   /* Compute the default costs of certain things.
      Note that targetm.rtx_costs can override the defaults.  */
 
@@ -3766,20 +3773,27 @@ rtx_cost (rtx x, enum rtx_code outer_cod
   switch (code)
     {
     case MULT:
-      total = COSTS_N_INSNS (5);
+      total = factor * COSTS_N_INSNS (5);
       break;
     case DIV:
     case UDIV:
     case MOD:
     case UMOD:
-      total = COSTS_N_INSNS (7);
+      total = factor * COSTS_N_INSNS (7);
       break;
     case USE:
       /* Used in combine.c as a marker.  */
       total = 0;
       break;
+    case SET:
+      /* A SET doesn't have a mode, so let's look at the SET_DEST to get
+	 the mode for the factor.  */
+      factor = GET_MODE_SIZE (GET_MODE (SET_DEST (x))) / UNITS_PER_WORD;
+      if (factor == 0)
+	factor = 1;
+      /* Pass through.  */
     default:
-      total = COSTS_N_INSNS (1);
+      total = factor * COSTS_N_INSNS (1);
     }
 
   switch (code)
@@ -3792,8 +3806,7 @@ rtx_cost (rtx x, enum rtx_code outer_cod
       /* If we can't tie these modes, make this expensive.  The larger
 	 the mode, the more expensive it is.  */
       if (! MODES_TIEABLE_P (GET_MODE (x), GET_MODE (SUBREG_REG (x))))
-	return COSTS_N_INSNS (2
-			      + GET_MODE_SIZE (GET_MODE (x)) / UNITS_PER_WORD);
+	return COSTS_N_INSNS (2 + factor);
       break;
 
     default:

brgds, H-P

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix gcc.dg/lower-subreg-1.c failure
  2012-05-09  6:02                               ` Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
@ 2012-05-09  9:15                                 ` Richard Sandiford
  2012-05-09 16:37                                   ` Hans-Peter Nilsson
  2012-07-07 23:03                                   ` Fix gcc.dg/lower-subreg-1.c failure, revisited Hans-Peter Nilsson
  2012-05-16  6:25                                 ` ping: Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
  1 sibling, 2 replies; 38+ messages in thread
From: Richard Sandiford @ 2012-05-09  9:15 UTC (permalink / raw)
  To: Hans-Peter Nilsson
  Cc: iant, zadeck, gcc-patches, mikestump, Kenneth.Zadeck, avr,
	ramana.radhakrishnan

Hans-Peter Nilsson <hans-peter.nilsson@axis.com> writes:
>> From: Richard Sandiford <rdsandiford@googlemail.com>
>> Date: Tue, 1 May 2012 16:46:38 +0200
>
>> To repeat: as things stand, very few targets define proper rtx costs
>> for SET.
>
> IMHO it's wrong to start blaming targets when rtx_cost doesn't
> take the mode in account in the first place, for the default
> cost.  (Well, except for the modes-tieable subreg special-case.)
> The targets where an operation in N * word_mode costs no more
> than one in word_mode, if there even is one, is a minority,
> let's adjust the defaults to that.

I'll pass on approving or disapproving this patch, but for the record:
a factor of word_mode seems a bit too simplistic.  It's OK for moves
and logic ops, but addition of multiword modes is a bit more complicated.
Multiplication and division by multiword modes is even more so, of course.

>> Of course, if the costs are defined properly and lower-subreg still
>> makes the wrong choice, we need to look at why.
>
> By the way, regarding validity of rtx_cost calls:
>
>> +++ gcc/lower-subreg.c  2012-05-01 09:46:48.473830772 +0100
>
>> +/* Return the cost of a CODE shift in mode MODE by OP1 bits, using the
>> +   rtxes in RTXES.  SPEED_P selects between the speed and size cost.  */
>> +
>> +static int
>> +shift_cost (bool speed_p, struct cost_rtxes *rtxes, enum rtx_code code,
>> +           enum machine_mode mode, int op1)
>> +{
>> +  PUT_MODE (rtxes->target, mode);
>> +  PUT_CODE (rtxes->shift, code);
>> +  PUT_MODE (rtxes->shift, mode);
>> +  PUT_MODE (rtxes->source, mode);
>> +  XEXP (rtxes->shift, 1) = GEN_INT (op1);
>> +  SET_SRC (rtxes->set) = rtxes->shift;
>> +  return insn_rtx_cost (rtxes->set, speed_p);
>> +}
>> +
>> +/* For each X in the range [0, BITS_PER_WORD), set SPLITTING[X]
>> +   to true if it is profitable to split a double-word CODE shift
>> +   of X + BITS_PER_WORD bits.  SPEED_P says whether we are testing
>> +   for speed or size profitability.
>> +
>> +   Use the rtxes in RTXES to calculate costs.  WORD_MOVE_ZERO_COST is
>> +   the cost of moving zero into a word-mode register.  WORD_MOVE_COST
>> +   is the cost of moving between word registers.  */
>> +
>> +static void
>> +compute_splitting_shift (bool speed_p, struct cost_rtxes *rtxes,
>> +                        bool *splitting, enum rtx_code code,
>> +                        int word_move_zero_cost, int word_move_cost)
>> +{
>
> I think there should be a gating check whether the target
> implements that kind of shift in that mode at all, before
> checking the cost.  Not sure whether it's generally best to put
> that test here, or to make the rtx_cost function return the cost
> of a libcall for that mode when that happens.  Similar for the
> other insns.

This has come up a few times in past discussions about rtx_cost
(as I'm sure you remember :-)).  On the one hand, I agree it might
be nice to shield the backend from invalid insns.  That would
effectively mean calling expand on each insn though, which would be
rather expensive.  Especially in a GC world.  It would also prevent
the target from being able to take things like pipeline characteristics
into account.  E.g. if you know that something takes 2 operations
that must be issued sequentially, you might want to add in an extra
(sub-COSTS_N_INSN (1)) cost compared to 2 operations that can be issued
together, even if the individual operations take the same number of cycles.

The same goes for multiplication in general.  On some targets
the speed of a multiplication can depend on the second operand,
so knowing its value is useful even if the target doesn't have
a multiplication instruction that takes constant operands.

As things stand, rtx_cost is intentionally used for more than
just valid target insns.  One of the main uses has always been
to decide whether it is better to implement multiplications by a
shift-and-add sequence, or whether to use multiplication instructions.
The associated expmed code has never tried to decide which shifts are
actually representable as matching rtl insns and which aren't.
The same goes for division-using-multiplication.

So I think this patch is using rtx_cost according to its current
interface.  If someone wants to change or restrict that interface,
than that's a separate change IMO.  And it should be done consistently
rather than in this one place.

In this case it doesn't matter anyway.  If we never see a shift
in mode M by amount X, we'll never need to make a decision about
whether to split it.

> Isn't the below better than doing virtually the same in each
> target's rtx_costs?

FWIW, MIPS, SH and x86 all used something slightly different (and more
complicated).  I imagine PowerPC and SPARC will too.  So "each" seems
a bit strong.

That's not an objection to the patch though.  I realise some ports do
have very regular architectures where every register is the same width
and has the same move cost.

Richard

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [committed] Fix lower-subreg cost calculation
  2012-05-08 21:42                                         ` Richard Sandiford
@ 2012-05-09  9:48                                           ` Richard Earnshaw
  0 siblings, 0 replies; 38+ messages in thread
From: Richard Earnshaw @ 2012-05-09  9:48 UTC (permalink / raw)
  To: Richard Earnshaw, Georg-Johann Lay, Mike Stump, gcc-patches,
	Ian Lance Taylor, Kenneth Zadeck, Kenneth Zadeck,
	ramana.radhakrishnan, rdsandiford

On 08/05/12 22:42, Richard Sandiford wrote:
> Richard Earnshaw <rearnsha@arm.com> writes:
>> FTR, this caused
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53278
> 
> Well, this really has been a brown-paper-bag patch.  Fixed as below.
> Tested on x86_64-linux-gnu and applied as obvious.
> 
> Richard
> 
> 
> gcc/
> 	PR rtl-optimization/53278
> 	* lower-subreg.c (decompose_multiword_subregs): Remove left-over
> 	speed_p code from earlier patch.
> 
> Index: gcc/lower-subreg.c
> ===================================================================
> --- gcc/lower-subreg.c	2012-05-08 19:45:31.000000000 +0100
> +++ gcc/lower-subreg.c	2012-05-08 19:45:31.793855523 +0100
> @@ -1487,9 +1487,7 @@ decompose_multiword_subregs (void)
>        FOR_EACH_BB (bb)
>  	{
>  	  rtx insn;
> -	  bool speed_p;
>  
> -	  speed_p = optimize_bb_for_speed_p (bb);
>  	  FOR_BB_INSNS (bb, insn)
>  	    {
>  	      rtx pat;
> 


Which begs the question as to why -wshadow didn't pick this up...

R.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix gcc.dg/lower-subreg-1.c failure
  2012-05-09  9:15                                 ` Fix gcc.dg/lower-subreg-1.c failure Richard Sandiford
@ 2012-05-09 16:37                                   ` Hans-Peter Nilsson
  2012-07-07 23:03                                   ` Fix gcc.dg/lower-subreg-1.c failure, revisited Hans-Peter Nilsson
  1 sibling, 0 replies; 38+ messages in thread
From: Hans-Peter Nilsson @ 2012-05-09 16:37 UTC (permalink / raw)
  To: rdsandiford
  Cc: iant, zadeck, gcc-patches, mikestump, Kenneth.Zadeck, avr,
	ramana.radhakrishnan

> From: Richard Sandiford <rdsandiford@googlemail.com>
> Date: Wed, 9 May 2012 11:14:49 +0200

> Hans-Peter Nilsson <hans-peter.nilsson@axis.com> writes:
> >> From: Richard Sandiford <rdsandiford@googlemail.com>
> >> Date: Tue, 1 May 2012 16:46:38 +0200
> >
> >> To repeat: as things stand, very few targets define proper rtx costs
> >> for SET.
> >
> > IMHO it's wrong to start blaming targets when rtx_cost doesn't
> > take the mode in account in the first place, for the default
> > cost.  (Well, except for the modes-tieable subreg special-case.)
> > The targets where an operation in N * word_mode costs no more
> > than one in word_mode, if there even is one, is a minority,
> > let's adjust the defaults to that.
> 
> I'll pass on approving or disapproving this patch, but for the record:
> a factor of word_mode seems a bit too simplistic.

I'd say it's the right level: simplistic enough for the default,
not to mention now linear, without being plainly ignorant as now.

> It's OK for moves
> and logic ops, but addition of multiword modes is a bit more complicated.

How about (same factor) factor*COSTS_N_INSNS (1)*3/2 to account
for carry?  Or is 2*factor a better default?  Further
improvements are welcome, but I see the patch as a strict
improvement and I hope it will not be shot down by requests for
perfection - at least not without detailing said perfection.

> Multiplication and division by multiword modes is even more
> so, of course.

Suggestions are welcome, but in the absence of that, I'd say any
factor larger than one is is a good start, like in the patch.

> > I think there should be a gating check whether the target
> > implements that kind of shift in that mode at all, before
> > checking the cost.  Not sure whether it's generally best to put
> > that test here, or to make the rtx_cost function return the cost
> > of a libcall for that mode when that happens.  Similar for the
> > other insns.
> 
> This has come up a few times in past discussions about rtx_cost
> (as I'm sure you remember :-)).  On the one hand, I agree it might
> be nice to shield the backend from invalid insns.  That would
> effectively mean calling expand on each insn though, which would be
> rather expensive.

No, nothing that complicated.  I'm thinking of just basically
checking that there's an operation in that mode, like:

if (direct_optab_handler (code_to_optab [GET_CODE (x)], mode)
    == CODE_FOR_nothing)
  {
    ... return tabled default cost; for libcall or open-code ...
  }

Restricting the validity-gating to checking that the mode is
valid for the operation wouldn't interfere with fancy pipeline
speculative use.

> So I think this patch is using rtx_cost according to its current
> interface.

The "interface" use previously ignored the mode for most uses
(QED), so that's not completely correct. ;)

> If someone wants to change or restrict that interface,
> than that's a separate change IMO.  And it should be done consistently
> rather than in this one place.
> 
> In this case it doesn't matter anyway.  If we never see a shift
> in mode M by amount X, we'll never need to make a decision about
> whether to split it.

If it's never used, then I don't mind it being wrong if that
simplifies the computation. :)

> > Isn't the below better than doing virtually the same in each
> > target's rtx_costs?
> 
> FWIW, MIPS, SH and x86 all used something slightly different (and more
> complicated).  I imagine PowerPC and SPARC will too.  So "each" seems
> a bit strong.
> 
> That's not an objection to the patch though.  I realise some ports do
> have very regular architectures where every register is the same width
> and has the same move cost.

Hence the default should follow a very regular model...

brgds, H-P

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-30 10:09             ` Richard Sandiford
  2012-03-30 15:25               ` Kenneth Zadeck
@ 2012-05-10  6:45               ` Paolo Bonzini
  2012-05-10  6:54                 ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-10  6:45 UTC (permalink / raw)
  To: Kenneth Zadeck, gcc-patches, Ian Lance Taylor, Mike Stump,
	Kenneth Zadeck, avr, rdsandiford

Il 30/03/2012 12:08, Richard Sandiford ha scritto:
>> > +   There are two useful preprocessor defines for use by maintainers:  
>> > +
>> > +   #define LOG_COSTS
>> > +
>> > +   if you wish to see the actual cost estimates that are being used
>> > +   for each mode wider than word mode and the cost estimates for zero
>> > +   extension and the shifts.   This can be useful when port maintainers 
>> > +   are tuning insn rtx costs.
>> > +
>> > +   #define FORCE_LOWERING
>> > +
>> > +   if you wish to test the pass with all the transformation forced on.
>> > +   This can be useful for finding bugs in the transformations.
> Must admit I'm not keen on these kinds of macro, but it's Ian's call.

Indeed, LOG_COSTS should be (dump_flags & TDF_DETAILS) != 0, and perhaps
FORCE_LOWERING should be a -f flag (like -flower-all-subregs) or a --param.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-05-10  6:45               ` Paolo Bonzini
@ 2012-05-10  6:54                 ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2012-05-10  6:54 UTC (permalink / raw)
  Cc: Kenneth Zadeck, gcc-patches, Ian Lance Taylor, Mike Stump,
	Kenneth Zadeck, avr, rdsandiford

Il 10/05/2012 08:45, Paolo Bonzini ha scritto:
> Il 30/03/2012 12:08, Richard Sandiford ha scritto:
>>>> +   There are two useful preprocessor defines for use by maintainers:  
>>>> +
>>>> +   #define LOG_COSTS
>>>> +
>>>> +   if you wish to see the actual cost estimates that are being used
>>>> +   for each mode wider than word mode and the cost estimates for zero
>>>> +   extension and the shifts.   This can be useful when port maintainers 
>>>> +   are tuning insn rtx costs.
>>>> +
>>>> +   #define FORCE_LOWERING
>>>> +
>>>> +   if you wish to test the pass with all the transformation forced on.
>>>> +   This can be useful for finding bugs in the transformations.
>> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
> 
> Indeed, LOG_COSTS should be (dump_flags & TDF_DETAILS) != 0, and perhaps
> FORCE_LOWERING should be a -f flag (like -flower-all-subregs) or a --param.

Not sure how this got sent a month after I wrote it (and decided not to
send it). :)

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* ping: Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543)
  2012-05-09  6:02                               ` Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
  2012-05-09  9:15                                 ` Fix gcc.dg/lower-subreg-1.c failure Richard Sandiford
@ 2012-05-16  6:25                                 ` Hans-Peter Nilsson
  2012-05-23  4:42                                   ` ping*2: " Hans-Peter Nilsson
  1 sibling, 1 reply; 38+ messages in thread
From: Hans-Peter Nilsson @ 2012-05-16  6:25 UTC (permalink / raw)
  To: gcc-patches

> From: Hans-Peter Nilsson <hp@axis.com>
> Date: Wed, 9 May 2012 08:02:25 +0200

Ping.  I missed the PR number decoration on the ChangeLog entry:

	PR rtl-optimization/53176
> 	* rtlanal.c (rtx_cost): Adjust default cost for X with a
> 	UNITS_PER_WORD factor for all X according to the size of
> 	its mode, not just for SUBREGs with untieable modes.

<http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00609.html>

brgds, H-P

^ permalink raw reply	[flat|nested] 38+ messages in thread

* ping*2: Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543)
  2012-05-16  6:25                                 ` ping: Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
@ 2012-05-23  4:42                                   ` Hans-Peter Nilsson
  2012-05-30  2:49                                     ` ping*3: " Hans-Peter Nilsson
  0 siblings, 1 reply; 38+ messages in thread
From: Hans-Peter Nilsson @ 2012-05-23  4:42 UTC (permalink / raw)
  To: hp; +Cc: gcc-patches

> From: Hans-Peter Nilsson <hp@axis.com>
> Date: Wed, 16 May 2012 08:24:41 +0200

> > From: Hans-Peter Nilsson <hp@axis.com>
> > Date: Wed, 9 May 2012 08:02:25 +0200
> 
> Ping.  I missed the PR number decoration on the ChangeLog entry:
> 
> 	PR rtl-optimization/53176
> > 	* rtlanal.c (rtx_cost): Adjust default cost for X with a
> > 	UNITS_PER_WORD factor for all X according to the size of
> > 	its mode, not just for SUBREGs with untieable modes.
> 
> <http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00609.html>

Ping again...

brgds, H-P

^ permalink raw reply	[flat|nested] 38+ messages in thread

* ping*3: Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543)
  2012-05-23  4:42                                   ` ping*2: " Hans-Peter Nilsson
@ 2012-05-30  2:49                                     ` Hans-Peter Nilsson
  2012-06-07  5:39                                       ` ping*4: " Hans-Peter Nilsson
  0 siblings, 1 reply; 38+ messages in thread
From: Hans-Peter Nilsson @ 2012-05-30  2:49 UTC (permalink / raw)
  To: gcc-patches

> From: Hans-Peter Nilsson <hp@axis.com>
> Date: Wed, 23 May 2012 06:41:58 +0200
> > From: Hans-Peter Nilsson <hp@axis.com>
> > Date: Wed, 16 May 2012 08:24:41 +0200
> > > From: Hans-Peter Nilsson <hp@axis.com>
> > > Date: Wed, 9 May 2012 08:02:25 +0200
> > 
> > Ping.  I missed the PR number decoration on the ChangeLog entry:
> > 
> > 	PR rtl-optimization/53176
> > > 	* rtlanal.c (rtx_cost): Adjust default cost for X with a
> > > 	UNITS_PER_WORD factor for all X according to the size of
> > > 	its mode, not just for SUBREGs with untieable modes.
> > 
> > <http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00609.html>
> 
> Ping again...

Yet another ping.

brgds, H-P

^ permalink raw reply	[flat|nested] 38+ messages in thread

* ping*4: Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543)
  2012-05-30  2:49                                     ` ping*3: " Hans-Peter Nilsson
@ 2012-06-07  5:39                                       ` Hans-Peter Nilsson
  0 siblings, 0 replies; 38+ messages in thread
From: Hans-Peter Nilsson @ 2012-06-07  5:39 UTC (permalink / raw)
  To: gcc-patches

> From: Hans-Peter Nilsson <hp@axis.com>
> Date: Wed, 30 May 2012 04:49:27 +0200

> > From: Hans-Peter Nilsson <hp@axis.com>
> > Date: Wed, 23 May 2012 06:41:58 +0200
> > > From: Hans-Peter Nilsson <hp@axis.com>
> > > Date: Wed, 16 May 2012 08:24:41 +0200
> > > > From: Hans-Peter Nilsson <hp@axis.com>
> > > > Date: Wed, 9 May 2012 08:02:25 +0200
> > > 
> > > Ping.  I missed the PR number decoration on the ChangeLog entry:
> > > 
> > > 	PR rtl-optimization/53176
> > > > 	* rtlanal.c (rtx_cost): Adjust default cost for X with a
> > > > 	UNITS_PER_WORD factor for all X according to the size of
> > > > 	its mode, not just for SUBREGs with untieable modes.
> > > 
> > > <http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00609.html>
> > 
> > Ping again...
> 
> Yet another ping.

And one more.  FWIW, there was discussion, but Richard S had
"not an objection to the patch" he just abstained reviewing it.
It's quite simplistic and minimal, really, so don't be shy. :)

brgds, H-P

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Fix gcc.dg/lower-subreg-1.c failure, revisited.
  2012-05-09  9:15                                 ` Fix gcc.dg/lower-subreg-1.c failure Richard Sandiford
  2012-05-09 16:37                                   ` Hans-Peter Nilsson
@ 2012-07-07 23:03                                   ` Hans-Peter Nilsson
  2012-07-08 12:14                                     ` Richard Sandiford
  2012-07-12 21:14                                     ` Hans-Peter Nilsson
  1 sibling, 2 replies; 38+ messages in thread
From: Hans-Peter Nilsson @ 2012-07-07 23:03 UTC (permalink / raw)
  To: rdsandiford; +Cc: gcc-patches

> From: Richard Sandiford <rdsandiford@googlemail.com>
> Date: Wed, 9 May 2012 11:14:49 +0200

> Hans-Peter Nilsson <hans-peter.nilsson@axis.com> writes:
> >> From: Richard Sandiford <rdsandiford@googlemail.com>
> >> Date: Tue, 1 May 2012 16:46:38 +0200
> >
> >> To repeat: as things stand, very few targets define proper rtx costs
> >> for SET.
> >
> > IMHO it's wrong to start blaming targets when rtx_cost doesn't
> > take the mode in account in the first place, for the default
> > cost.  (Well, except for the modes-tieable subreg special-case.)
> > The targets where an operation in N * word_mode costs no more
> > than one in word_mode, if there even is one, is a minority,
> > let's adjust the defaults to that.
> 
> I'll pass on approving or disapproving this patch,

Here's an update.  Could you please reconsider?  I have to
appeal to your sense of fairness: after all you were involved in
the breaking patch.

> but for the record:
> a factor of word_mode seems a bit too simplistic.  It's OK for moves
> and logic ops, but addition of multiword modes is a bit more complicated.
> Multiplication and division by multiword modes is even more so, of course.

I adjusted multiplication and division to the more realistic
O(n*n).  That's just for show of course; we're comparing 2**2 to
instead of 2*2 as the focus is on the double size.  I did not
add specific adjustments for arithmetic codes not already there,
as I did not think such an adjustment to be appropriate.  The
intent is just to adapt the existing default costs to the change
of focus due to the lower-subreg change.  But if it's necessary
for approval, I'm willing to fix that.

> As things stand, rtx_cost is intentionally used for more than
> just valid target insns.  One of the main uses has always been
> to decide whether it is better to implement multiplications by a
> shift-and-add sequence, or whether to use multiplication instructions.
> The associated expmed code has never tried to decide which shifts are
> actually representable as matching rtl insns and which aren't.
> The same goes for division-using-multiplication.
> 
> So I think this patch

(referring to the commit that caused the major
gcc.dg/lower-subreg-1.c regression in May, not the predecessor
to *this* patch)

> is using rtx_cost according to its current
> interface.

Regardless of how "interface" is defined, that commit put much
more focus on the mode and made it about necessary to handle
SET, something that the default costs don't do without the patch
below.

> > Isn't the below better than doing virtually the same in each
> > target's rtx_costs?
> 
> FWIW, MIPS, SH and x86 all used something slightly different (and more
> complicated).  I imagine PowerPC and SPARC will too.  So "each" seems
> a bit strong.

Just strong enough: this patch indeed would have fixed the
gcc.dg/lower-subreg-1.c regression for MIPS and x86 (no
regressions) that was present at revision 187212, and no doubt
the predecessor patch too.  (I had to add commits 187299 and
187582 for MIPS and I couldn't get a baseline with java or
fortran enabled, thus --enable-languages=c,c++,lto,objc).  Also
similarly tested arm-eabi, sh-elf (though I couldn't repeat the
regression for those at the baseline) and cris-elf; no
regressions.  Fixes PR53176 for cris-elf.

Ok to commit?

	* rtlanal.c (rtx_cost): Adjust default cost for X with a
	UNITS_PER_WORD factor for all X according to the size of
	its mode, not just for SUBREGs with untieable modes.
	Handle SET.  Use factor * factor for MULT, DIV, UDIV,
	MOD, UMOD.

Index: gcc/rtlanal.c
===================================================================
--- gcc/rtlanal.c	(revision 188403)
+++ gcc/rtlanal.c	(working copy)
@@ -3755,10 +3755,17 @@ rtx_cost (rtx x, enum rtx_code outer_cod
   enum rtx_code code;
   const char *fmt;
   int total;
+  int factor;
 
   if (x == 0)
     return 0;
 
+  /* A size N times larger than UNITS_PER_WORD likely needs N times as
+     many insns, taking N times as long.  */
+  factor = GET_MODE_SIZE (GET_MODE (x)) / UNITS_PER_WORD;
+  if (factor == 0)
+    factor = 1;
+
   /* Compute the default costs of certain things.
      Note that targetm.rtx_costs can override the defaults.  */
 
@@ -3766,20 +3773,31 @@ rtx_cost (rtx x, enum rtx_code outer_cod
   switch (code)
     {
     case MULT:
-      total = COSTS_N_INSNS (5);
+      /* Multiplication has time-complexity O(N*N), where N is the
+	 number of units (translated from digits) when using
+	 schoolbook long multiplication.  */
+      total = factor * factor * COSTS_N_INSNS (5);
       break;
     case DIV:
     case UDIV:
     case MOD:
     case UMOD:
-      total = COSTS_N_INSNS (7);
+      /* Similarly, complexity for schoolbook long division.  */
+      total = factor * factor * COSTS_N_INSNS (7);
       break;
     case USE:
       /* Used in combine.c as a marker.  */
       total = 0;
       break;
+    case SET:
+      /* A SET doesn't have a mode, so let's look at the SET_DEST to get
+	 the mode for the factor.  */
+      factor = GET_MODE_SIZE (GET_MODE (SET_DEST (x))) / UNITS_PER_WORD;
+      if (factor == 0)
+	factor = 1;
+      /* Pass through.  */
     default:
-      total = COSTS_N_INSNS (1);
+      total = factor * COSTS_N_INSNS (1);
     }
 
   switch (code)
@@ -3792,8 +3810,7 @@ rtx_cost (rtx x, enum rtx_code outer_cod
       /* If we can't tie these modes, make this expensive.  The larger
 	 the mode, the more expensive it is.  */
       if (! MODES_TIEABLE_P (GET_MODE (x), GET_MODE (SUBREG_REG (x))))
-	return COSTS_N_INSNS (2
-			      + GET_MODE_SIZE (GET_MODE (x)) / UNITS_PER_WORD);
+	return COSTS_N_INSNS (2 + factor);
       break;
 
     default:


brgds, H-P

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix gcc.dg/lower-subreg-1.c failure, revisited.
  2012-07-07 23:03                                   ` Fix gcc.dg/lower-subreg-1.c failure, revisited Hans-Peter Nilsson
@ 2012-07-08 12:14                                     ` Richard Sandiford
  2012-07-12 21:14                                     ` Hans-Peter Nilsson
  1 sibling, 0 replies; 38+ messages in thread
From: Richard Sandiford @ 2012-07-08 12:14 UTC (permalink / raw)
  To: Hans-Peter Nilsson; +Cc: gcc-patches

Hans-Peter Nilsson <hans-peter.nilsson@axis.com> writes:
>> From: Richard Sandiford <rdsandiford@googlemail.com>
>> Date: Wed, 9 May 2012 11:14:49 +0200
>
>> Hans-Peter Nilsson <hans-peter.nilsson@axis.com> writes:
>> >> From: Richard Sandiford <rdsandiford@googlemail.com>
>> >> Date: Tue, 1 May 2012 16:46:38 +0200
>> >
>> >> To repeat: as things stand, very few targets define proper rtx costs
>> >> for SET.
>> >
>> > IMHO it's wrong to start blaming targets when rtx_cost doesn't
>> > take the mode in account in the first place, for the default
>> > cost.  (Well, except for the modes-tieable subreg special-case.)
>> > The targets where an operation in N * word_mode costs no more
>> > than one in word_mode, if there even is one, is a minority,
>> > let's adjust the defaults to that.
>> 
>> I'll pass on approving or disapproving this patch,
>
> Here's an update.  Could you please reconsider?  I have to
> appeal to your sense of fairness: after all you were involved in
> the breaking patch.

Sorry, but no.  The new version still seems rather arbitrary to me.
As before though, I won't try to stop anyone else from approving it.

Richard

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Fix gcc.dg/lower-subreg-1.c failure, revisited.
  2012-07-07 23:03                                   ` Fix gcc.dg/lower-subreg-1.c failure, revisited Hans-Peter Nilsson
  2012-07-08 12:14                                     ` Richard Sandiford
@ 2012-07-12 21:14                                     ` Hans-Peter Nilsson
  1 sibling, 0 replies; 38+ messages in thread
From: Hans-Peter Nilsson @ 2012-07-12 21:14 UTC (permalink / raw)
  To: gcc-patches

> From: Hans-Peter Nilsson <hp@axis.com>
> Date: Sun, 8 Jul 2012 01:02:34 +0200

	PR rtl-optimization/53176
> 	* rtlanal.c (rtx_cost): Adjust default cost for X with a
> 	UNITS_PER_WORD factor for all X according to the size of
> 	its mode, not just for SUBREGs with untieable modes.
> 	Handle SET.  Use factor * factor for MULT, DIV, UDIV,
> 	MOD, UMOD.

Approved (IRL by RTH), committed.

brgds, H-P

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-30 14:40 Georg-Johann Lay
  2012-03-30 15:36 ` Kenneth Zadeck
@ 2012-03-30 15:46 ` Richard Sandiford
  1 sibling, 0 replies; 38+ messages in thread
From: Richard Sandiford @ 2012-03-30 15:46 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: Kenneth Zadeck, gcc-patches, Ian Lance Taylor, Mike Stump,
	Kenneth Zadeck

"Georg-Johann Lay" <avr@gjlay.de> writes:
>> This patch takes a different approach to fixing PR52543 than does the 
>> patch in
>> 
>> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>> 
>> This patch transforms the lower-subreg pass(es) from unconditionally 
>> splitting wide moves, zero extensions, and shifts, so that it now takes 
>> into account the target specific costs and only does the transformations 
>> if it is profitable.
>
> As far as I understand the pass, it's not only about splitting these instructions
> but also to the additional benefits of the split, i.e.  AND 0xfffffffe will only need
> one QI operation instead of 1 SI operation that costs 4 QI.
>
> And in fact, the positive benefit of subreg-lowering occurs with bit-wise operations
> like AND, IOR, EOR etc.

Hmm, but in that case, why define andsi3 at all?  optabs applies exactly
that decomposition if you don't:

  /* These can be done a word at a time.  */
  if ((binoptab == and_optab || binoptab == ior_optab || binoptab == xor_optab)
      && mclass == MODE_INT
      && GET_MODE_SIZE (mode) > UNITS_PER_WORD
      && optab_handler (binoptab, word_mode) != CODE_FOR_nothing)

Richard

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
  2012-03-30 14:40 Georg-Johann Lay
@ 2012-03-30 15:36 ` Kenneth Zadeck
  2012-03-30 15:46 ` Richard Sandiford
  1 sibling, 0 replies; 38+ messages in thread
From: Kenneth Zadeck @ 2012-03-30 15:36 UTC (permalink / raw)
  To: Georg-Johann Lay
  Cc: gcc-patches, Ian Lance Taylor, Richard Sandiford, Mike Stump,
	Kenneth Zadeck



On 03/30/2012 10:39 AM, Georg-Johann Lay wrote:
>> This patch takes a different approach to fixing PR52543 than does the
>> patch in
>>
>> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>>
>> This patch transforms the lower-subreg pass(es) from unconditionally
>> splitting wide moves, zero extensions, and shifts, so that it now takes
>> into account the target specific costs and only does the transformations
>> if it is profitable.
> As far as I understand the pass, it's not only about splitting these instructions
> but also to the additional benefits of the split, i.e.  AND 0xfffffffe will only need
> one QI operation instead of 1 SI operation that costs 4 QI.
>
> And in fact, the positive benefit of subreg-lowering occurs with bit-wise operations
> like AND, IOR, EOR etc.
>
> And one problem is that the pass is not sensitive to address spaces.
> For example, HI splits for generic space are profitable, for non-generic
> they are not.
>
> Thus, a patch should also address address-space sensivity.
>
No, this pass only splits operations that are wider than word mode into 
word mode sized chunks.
On a machine where word mode is SI, it will split DI shifts and zero 
extends and any moves wider that SI mode into a series of SI operations. 
It does nothing for things in QI, or HI mode.   The pass was written 
before there were machines with fast vector move operations.

It might be that there are issues where the address space considerations 
may need to be taken into consideration.    Someone who has a port with 
memory operation like this may want to consider making that 
enhancement.   I think that once this patch is in place, that kind of 
change will be easier to incorporate.
>> Unconditional splitting is a problem that not only occurs on the AVR but
>> is also a problem on the ARM NEON and my private port.  Furthermore, it
>> is a problem that is likely to occur on most modern larger machines
>> since these machines are more likely to have fast instructions for
>> moving things that are larger than word mode.
>>
>> At compiler initialization time, each mode that is larger that a word
>> mode is examined to determine if the cost of moving a value of that mode
>> is less expensive that inserting the proper number of word sided
>> moves.   If it is cheaper to split it up, a bit is set to allow moves of
>> that mode to be lowered.
> As written above, the mode is *not* enough. For MEM there are is also
> address space involved.
>
> Johann
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [C Patch]: pr52543
@ 2012-03-30 14:40 Georg-Johann Lay
  2012-03-30 15:36 ` Kenneth Zadeck
  2012-03-30 15:46 ` Richard Sandiford
  0 siblings, 2 replies; 38+ messages in thread
From: Georg-Johann Lay @ 2012-03-30 14:40 UTC (permalink / raw)
  To: Kenneth Zadeck, gcc-patches
  Cc: Ian Lance Taylor, Richard Sandiford, Mike Stump, Kenneth Zadeck, avr

> This patch takes a different approach to fixing PR52543 than does the 
> patch in
> 
> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
> 
> This patch transforms the lower-subreg pass(es) from unconditionally 
> splitting wide moves, zero extensions, and shifts, so that it now takes 
> into account the target specific costs and only does the transformations 
> if it is profitable.

As far as I understand the pass, it's not only about splitting these instructions
but also to the additional benefits of the split, i.e.  AND 0xfffffffe will only need
one QI operation instead of 1 SI operation that costs 4 QI.

And in fact, the positive benefit of subreg-lowering occurs with bit-wise operations
like AND, IOR, EOR etc.

And one problem is that the pass is not sensitive to address spaces.
For example, HI splits for generic space are profitable, for non-generic
they are not. 

Thus, a patch should also address address-space sensivity.

> Unconditional splitting is a problem that not only occurs on the AVR but 
> is also a problem on the ARM NEON and my private port.  Furthermore, it 
> is a problem that is likely to occur on most modern larger machines 
> since these machines are more likely to have fast instructions for 
> moving things that are larger than word mode.
> 
> At compiler initialization time, each mode that is larger that a word 
> mode is examined to determine if the cost of moving a value of that mode 
> is less expensive that inserting the proper number of word sided 
> moves.   If it is cheaper to split it up, a bit is set to allow moves of 
> that mode to be lowered.

As written above, the mode is *not* enough. For MEM there are is also
address space involved.

Johann

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2012-07-12 21:14 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4F67D595.9030700@naturalbridge.com>
     [not found] ` <mcrd387n9bf.fsf@dhcp-172-18-216-180.mtv.corp.google.com>
     [not found]   ` <4F6881EA.9080306@naturalbridge.com>
     [not found]     ` <mcrvclzl7b7.fsf@dhcp-172-18-216-180.mtv.corp.google.com>
     [not found]       ` <4F6889CC.8080302@naturalbridge.com>
     [not found]         ` <mcrr4wnl6lb.fsf@dhcp-172-18-216-180.mtv.corp.google.com>
2012-03-29 21:10           ` [C Patch]: pr52543 Kenneth Zadeck
2012-03-30  8:34             ` Ramana Radhakrishnan
2012-03-30 10:09             ` Richard Sandiford
2012-03-30 15:25               ` Kenneth Zadeck
2012-03-30 15:41                 ` Richard Sandiford
2012-03-31 16:21                   ` Kenneth Zadeck
2012-04-03 13:53                     ` Richard Sandiford
2012-04-03 15:33                       ` Kenneth Zadeck
2012-04-03 19:20                         ` Richard Sandiford
2012-04-03 20:36                           ` Ian Lance Taylor
2012-05-01 14:47                             ` Richard Sandiford
2012-05-01 17:51                               ` H.J. Lu
2012-05-03 19:52                               ` Georg-Johann Lay
2012-05-03 22:14                                 ` Mike Stump
2012-05-04 23:02                                   ` Georg-Johann Lay
2012-05-06 18:55                                     ` [committed] Fix lower-subreg cost calculation Richard Sandiford
2012-05-08 15:26                                       ` Richard Earnshaw
2012-05-08 21:42                                         ` Richard Sandiford
2012-05-09  9:48                                           ` Richard Earnshaw
2012-05-07 19:01                                     ` [C Patch]: pr52543 Mike Stump
2012-05-09  6:02                               ` Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
2012-05-09  9:15                                 ` Fix gcc.dg/lower-subreg-1.c failure Richard Sandiford
2012-05-09 16:37                                   ` Hans-Peter Nilsson
2012-07-07 23:03                                   ` Fix gcc.dg/lower-subreg-1.c failure, revisited Hans-Peter Nilsson
2012-07-08 12:14                                     ` Richard Sandiford
2012-07-12 21:14                                     ` Hans-Peter Nilsson
2012-05-16  6:25                                 ` ping: Fix gcc.dg/lower-subreg-1.c failure (was: [C Patch]: pr52543) Hans-Peter Nilsson
2012-05-23  4:42                                   ` ping*2: " Hans-Peter Nilsson
2012-05-30  2:49                                     ` ping*3: " Hans-Peter Nilsson
2012-06-07  5:39                                       ` ping*4: " Hans-Peter Nilsson
2012-04-03 16:22                       ` [C Patch]: pr52543 Ian Lance Taylor
2012-05-10  6:45               ` Paolo Bonzini
2012-05-10  6:54                 ` Paolo Bonzini
     [not found]             ` <CACUk7=XVO=yHrPBFtAVfPxMtViYthtGGfuQSVGHNqHE7ibER0g@mail.gmail.com>
2012-03-30 19:29               ` Kenneth Zadeck
2012-03-30 22:38                 ` Ramana Radhakrishnan
2012-03-30 14:40 Georg-Johann Lay
2012-03-30 15:36 ` Kenneth Zadeck
2012-03-30 15:46 ` Richard Sandiford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).