public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH, SPU] generated better code for loads and stores
@ 2008-08-29  0:22 Trevor_Smigiel
  2008-09-05 21:51 ` Trevor_Smigiel
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Trevor_Smigiel @ 2008-08-29  0:22 UTC (permalink / raw)
  To: gcc-patches; +Cc: Ulrich Weigand, andrew_pinski

[-- Attachment #1: Type: text/plain, Size: 2868 bytes --]

This patch generates better code for loads and stores on SPU.

The SPU can only do 16-byte, aligned loads and stores.  To load something
smaller with a smaller alignment requires a load and a rotate.  To store
something smaller requires a load, insert, and store.

Currently, there are two obvious ways to generate rtl for loads and
stores.  Generate the multiple instructions at expand time, or split
them at some later phase.   When expanded early we lose alias
information (because that 16-byte load could contain anything), and in
general do worse optimization on memory.   When we split late, the
compiler has no opportunity to combine loads/stores of the same 16
bytes.

This patch introduces an additional split pass, split0, right before the
CSE2 pass.  Before this pass, loads and stores are modeled as a single
rtl instruction, and can be optimized well.  This pass splits them into
multiple instructions, allowing CSE2 and combine to optimize the 16 byte
loads and stores.  The pass is only enabled when a target defines
SPLIT_BEFORE_CSE2.

The test case is an example which is improved by the earlier split pass.

This patch also makes other small improvements to the code generated for
loads and stores on SPU.

Ok for mainline?  In particular, the new split pass.

Trevor

2008-08-27  Trevor Smigiel <Trevor_Smigiel@playstation.sony.com>
	
	Improve code generated for loads and stores on SPU.

	* doc/tm.texi (SPLIT_BEFORE_CSE2) : Document.
	* tree-pass.h (pass_split_before_cse2) : Declare.
	* final.c (rest_of_clean_state) : Initialize split0_completed.
	* recog.c (split0_completed) : Define.
	(gate_handle_split_before_cse2, rest_of_handle_split_before_cse2) :
	New functions.
	(pass_split_before_cse2) : New pass.
	* rtl.h (split0_completed) : Declare.
        * passes.c (init_optimization_passes) : Add pass_split_before_cse2
        before pass_cse2 .
	* config/spu/spu-protos.h (spu_legitimate_address) : Add
	for_split argument.
	(aligned_mem_p, spu_valid_move) : Remove prototypes.
	(spu_split_load, spu_split_store) : Change return type to int.
	* config/spu/predicates.md (spu_mem_operand) : Remove.
	(spu_dest_operand) : Add.
	* config/spu/spu-builtins.md (spu_lqd, spu_lqx, spu_lqa,
	spu_lqr, spu_stqd, spu_stqx, spu_stqa, spu_stqr) : Remove AND
	operation.
	* config/spu/spu.c (regno_aligned_for_load) : Remove.
	(reg_aligned_for_addr, address_needs_split) : New functions.
	(spu_legitimate_address, spu_expand_mov, spu_split_load,
	spu_split_store) : Update.
	(spu_init_expanders) : Pregenerate a couple of pseudo-registers.
	* config/spu/spu.h (REG_ALIGN, SPLIT_BEFORE_CSE2) : Define.
	(GO_IF_LEGITIMATE_ADDRESS) : Update for spu_legitimate_address.
	* config/spu/spu.md ("_mov<mode>", "_movdi", "_movti") : Update
	predicates.
	("load", "store") : Change to define_split.

testsuite/
	* testsuite/gcc.target/spu/split0-1.c : Add test.

[-- Attachment #2: split0.patch --]
[-- Type: text/x-diff, Size: 46770 bytes --]

Index: gcc/gcc/doc/tm.texi
===================================================================
*** gcc/gcc/doc/tm.texi	(revision 139677)
--- gcc/gcc/doc/tm.texi	(working copy)
*************** cannot safely move arguments from the re
*** 10575,10577 ****
--- 10575,10590 ----
  to the stack.  Therefore, this hook should return true in general, but
  false for naked functions.  The default implementation always returns true.
  @end deftypefn
+ 
+ @defmac SPLIT_BEFORE_CSE2
+ This macro determines whether to use an additional split pass before the
+ second CSE pass.  @code{split0_completed} will be set after this pass is
+ completed. 
+ 
+ For example, the Cell SPU target uses this for better optimization of
+ the multiple instructions required to do simple loads and stores.  The
+ optimizations before this pass work better on simple memory
+ instructions, and the optimizations right after this pass (e.g., CSE and
+ combine) are be able to optimize the split instructions.
+ @end defmac
+ 
Index: gcc/gcc/tree-pass.h
===================================================================
*** gcc/gcc/tree-pass.h	(revision 139677)
--- gcc/gcc/tree-pass.h	(working copy)
*************** extern struct rtl_opt_pass pass_rtl_dolo
*** 446,451 ****
--- 446,452 ----
  extern struct rtl_opt_pass pass_rtl_loop_done;
  
  extern struct rtl_opt_pass pass_web;
+ extern struct rtl_opt_pass pass_split_before_cse2;
  extern struct rtl_opt_pass pass_cse2;
  extern struct rtl_opt_pass pass_df_initialize_opt;
  extern struct rtl_opt_pass pass_df_initialize_no_opt;
Index: gcc/gcc/final.c
===================================================================
*** gcc/gcc/final.c	(revision 139677)
--- gcc/gcc/final.c	(working copy)
*************** rest_of_clean_state (void)
*** 4252,4257 ****
--- 4252,4260 ----
  #ifdef STACK_REGS
    regstack_completed = 0;
  #endif
+ #ifdef SPLIT_BEFORE_CSE2
+   split0_completed = 0;
+ #endif
  
    /* Clear out the insn_length contents now that they are no
       longer valid.  */
Index: gcc/gcc/recog.c
===================================================================
*** gcc/gcc/recog.c	(revision 139677)
--- gcc/gcc/recog.c	(working copy)
*************** int reload_completed;
*** 103,108 ****
--- 103,113 ----
  /* Nonzero after thread_prologue_and_epilogue_insns has run.  */
  int epilogue_completed;
  
+ #ifdef SPLIT_BEFORE_CSE2
+ /* Nonzero after split0 pass has run.  */
+ int split0_completed;
+ #endif
+ 
  /* Initialize data used by the function `recog'.
     This must be called once in the compilation of a function
     before any insn recognition may be done in the function.  */
*************** struct rtl_opt_pass pass_split_for_short
*** 3547,3550 ****
--- 3552,3594 ----
   }
  };
  
+ static bool
+ gate_handle_split_before_cse2 (void)
+ {
+ #ifdef SPLIT_BEFORE_CSE2
+   return SPLIT_BEFORE_CSE2;
+ #else
+   return 0;
+ #endif
+ }
+ 
+ static unsigned int
+ rest_of_handle_split_before_cse2 (void)
+ {
+ #ifdef SPLIT_BEFORE_CSE2
+   split_all_insns_noflow ();
+   split0_completed = 1;
+ #endif
+   return 0;
+ }
+ 
+ struct rtl_opt_pass pass_split_before_cse2 =
+ {
+  {
+   RTL_PASS,
+   "split0",                             /* name */
+   gate_handle_split_before_cse2,        /* gate */
+   rest_of_handle_split_before_cse2,     /* execute */
+   NULL,                                 /* sub */
+   NULL,                                 /* next */
+   0,                                    /* static_pass_number */
+   0,                                    /* tv_id */
+   0,                                    /* properties_required */
+   0,                                    /* properties_provided */
+   0,                                    /* properties_destroyed */
+   0,                                    /* todo_flags_start */
+   TODO_dump_func,                       /* todo_flags_finish */
+  }
+ };
+ 
  
Index: gcc/gcc/rtl.h
===================================================================
*** gcc/gcc/rtl.h	(revision 139677)
--- gcc/gcc/rtl.h	(working copy)
*************** extern int reload_completed;
*** 2010,2015 ****
--- 2010,2020 ----
  /* Nonzero after thread_prologue_and_epilogue_insns has run.  */
  extern int epilogue_completed;
  
+ #ifdef SPLIT_BEFORE_CSE2
+ /* Nonzero after the split0 pass has completed. */
+ extern int split0_completed;
+ #endif
+ 
  /* Set to 1 while reload_as_needed is operating.
     Required by some machines to handle any generated moves differently.  */
  
Index: gcc/gcc/passes.c
===================================================================
*** gcc/gcc/passes.c	(revision 139677)
--- gcc/gcc/passes.c	(working copy)
*************** init_optimization_passes (void)
*** 743,748 ****
--- 743,749 ----
  	}
        NEXT_PASS (pass_web);
        NEXT_PASS (pass_jump_bypass);
+       NEXT_PASS (pass_split_before_cse2);
        NEXT_PASS (pass_cse2);
        NEXT_PASS (pass_rtl_dse1);
        NEXT_PASS (pass_rtl_fwprop_addr);
Index: gcc/gcc/config/spu/spu-protos.h
===================================================================
*** gcc/gcc/config/spu/spu-protos.h	(revision 139677)
--- gcc/gcc/config/spu/spu-protos.h	(working copy)
*************** extern int arith_immediate_p (rtx op, en
*** 54,60 ****
  extern int spu_constant_address_p (rtx x);
  extern int spu_legitimate_constant_p (rtx x);
  extern int spu_legitimate_address (enum machine_mode mode, rtx x,
! 				   int reg_ok_strict);
  extern rtx spu_legitimize_address (rtx x, rtx oldx, enum machine_mode mode);
  extern int spu_initial_elimination_offset (int from, int to);
  extern rtx spu_function_value (const_tree type, const_tree func);
--- 54,60 ----
  extern int spu_constant_address_p (rtx x);
  extern int spu_legitimate_constant_p (rtx x);
  extern int spu_legitimate_address (enum machine_mode mode, rtx x,
! 				   int reg_ok_strict, int for_split);
  extern rtx spu_legitimize_address (rtx x, rtx oldx, enum machine_mode mode);
  extern int spu_initial_elimination_offset (int from, int to);
  extern rtx spu_function_value (const_tree type, const_tree func);
*************** extern void spu_setup_incoming_varargs (
*** 64,74 ****
  					tree type, int *pretend_size,
  					int no_rtl);
  extern void spu_conditional_register_usage (void);
- extern int aligned_mem_p (rtx mem);
  extern int spu_expand_mov (rtx * ops, enum machine_mode mode);
! extern void spu_split_load (rtx * ops);
! extern void spu_split_store (rtx * ops);
! extern int spu_valid_move (rtx * ops);
  extern int fsmbi_const_p (rtx x);
  extern int cpat_const_p (rtx x, enum machine_mode mode);
  extern rtx gen_cpat_const (rtx * ops);
--- 64,72 ----
  					tree type, int *pretend_size,
  					int no_rtl);
  extern void spu_conditional_register_usage (void);
  extern int spu_expand_mov (rtx * ops, enum machine_mode mode);
! extern int spu_split_load (rtx * ops);
! extern int spu_split_store (rtx * ops);
  extern int fsmbi_const_p (rtx x);
  extern int cpat_const_p (rtx x, enum machine_mode mode);
  extern rtx gen_cpat_const (rtx * ops);
Index: gcc/gcc/config/spu/predicates.md
===================================================================
*** gcc/gcc/config/spu/predicates.md	(revision 139677)
--- gcc/gcc/config/spu/predicates.md	(working copy)
*************** (define_predicate "spu_nonmem_operand"
*** 39,52 ****
         (ior (not (match_code "subreg"))
              (match_test "valid_subreg (op)"))))
  
- (define_predicate "spu_mem_operand"
-   (and (match_operand 0 "memory_operand")
-        (match_test "reload_in_progress || reload_completed || aligned_mem_p (op)")))
- 
  (define_predicate "spu_mov_operand"
!   (ior (match_operand 0 "spu_mem_operand")
         (match_operand 0 "spu_nonmem_operand")))
  
  (define_predicate "call_operand"
    (and (match_code "mem")
         (match_test "(!TARGET_LARGE_MEM && satisfies_constraint_S (op))
--- 39,52 ----
         (ior (not (match_code "subreg"))
              (match_test "valid_subreg (op)"))))
  
  (define_predicate "spu_mov_operand"
!   (ior (match_operand 0 "memory_operand")
         (match_operand 0 "spu_nonmem_operand")))
  
+ (define_predicate "spu_dest_operand"
+   (ior (match_operand 0 "memory_operand")
+        (match_operand 0 "spu_reg_operand")))
+ 
  (define_predicate "call_operand"
    (and (match_code "mem")
         (match_test "(!TARGET_LARGE_MEM && satisfies_constraint_S (op))
Index: gcc/gcc/config/spu/spu.c
===================================================================
*** gcc/gcc/config/spu/spu.c	(revision 139677)
--- gcc/gcc/config/spu/spu.c	(working copy)
*************** static tree spu_build_builtin_va_list (v
*** 120,128 ****
  static void spu_va_start (tree, rtx);
  static tree spu_gimplify_va_arg_expr (tree valist, tree type,
  				      gimple_seq * pre_p, gimple_seq * post_p);
- static int regno_aligned_for_load (int regno);
  static int store_with_one_insn_p (rtx mem);
  static int mem_is_padded_component_ref (rtx x);
  static bool spu_assemble_integer (rtx x, unsigned int size, int aligned_p);
  static void spu_asm_globalize_label (FILE * file, const char *name);
  static unsigned char spu_rtx_costs (rtx x, int code, int outer_code,
--- 120,128 ----
  static void spu_va_start (tree, rtx);
  static tree spu_gimplify_va_arg_expr (tree valist, tree type,
  				      gimple_seq * pre_p, gimple_seq * post_p);
  static int store_with_one_insn_p (rtx mem);
  static int mem_is_padded_component_ref (rtx x);
+ static int reg_aligned_for_addr (rtx x, int aligned);
  static bool spu_assemble_integer (rtx x, unsigned int size, int aligned_p);
  static void spu_asm_globalize_label (FILE * file, const char *name);
  static unsigned char spu_rtx_costs (rtx x, int code, int outer_code,
*************** spu_legitimate_constant_p (rtx x)
*** 2856,2879 ****
  /* Valid address are:
     - symbol_ref, label_ref, const
     - reg
!    - reg + const, where either reg or const is 16 byte aligned
     - reg + reg, alignment doesn't matter
    The alignment matters in the reg+const case because lqd and stqd
!   ignore the 4 least significant bits of the const.  (TODO: It might be
!   preferable to allow any alignment and fix it up when splitting.) */
  int
! spu_legitimate_address (enum machine_mode mode ATTRIBUTE_UNUSED,
! 			rtx x, int reg_ok_strict)
  {
!   if (mode == TImode && GET_CODE (x) == AND
!       && GET_CODE (XEXP (x, 1)) == CONST_INT
!       && INTVAL (XEXP (x, 1)) == (HOST_WIDE_INT) -16)
      x = XEXP (x, 0);
    switch (GET_CODE (x))
      {
-     case SYMBOL_REF:
      case LABEL_REF:
!       return !TARGET_LARGE_MEM;
  
      case CONST:
        if (!TARGET_LARGE_MEM && GET_CODE (XEXP (x, 0)) == PLUS)
--- 2856,2907 ----
  /* Valid address are:
     - symbol_ref, label_ref, const
     - reg
!    - reg + const, where const is 16 byte aligned
     - reg + reg, alignment doesn't matter
    The alignment matters in the reg+const case because lqd and stqd
!   ignore the 4 least significant bits of the const.  
! 
!   Addresses are handled in 4 phases. 
!   1) from the beginning of rtl expansion until the split0 pass.  Any
!      address is acceptable.  
!   2) The split0 pass. It is responsible for making every load and store
!      valid.  It calls legitimate_address with FOR_SPLIT set to 1.  This
!      is where non-16-byte aligned loads/stores are split into multiple
!      instructions to extract or insert just the part we care about.
!   3) From the split0 pass to the beginning of reload.  During this
!      phase the constant part of an address must be 16 byte aligned, and
!      we don't allow any loads/store of less than 4 bytes.  We also
!      allow a mask of -16 to be part of the address as an optimization.
!   4) From reload until the end.  Reload can change the modes of loads
!      and stores to something smaller than 4-bytes which we need to allow
!      now, and it also adjusts the address to match.  So in this phase we
!      allow that special case.  Still allow addresses with a mask of -16.
! 
!   FOR_SPLIT is only set to 1 for phase 2, otherwise it is 0.  */
  int
! spu_legitimate_address (enum machine_mode mode, rtx x, int reg_ok_strict,
! 			int for_split)
  {
!   int aligned = (split0_completed || for_split)
!     && !reload_in_progress && !reload_completed;
!   int const_aligned = split0_completed || for_split;
!   if (GET_MODE_SIZE (mode) >= 16)
!     aligned = 0;
!   else if (aligned && GET_MODE_SIZE (mode) < 4)
!     return 0;
!   if (split0_completed
!       && (GET_CODE (x) == AND
! 	  && GET_CODE (XEXP (x, 1)) == CONST_INT
! 	  && INTVAL (XEXP (x, 1)) == (HOST_WIDE_INT) - 16
! 	  && !CONSTANT_P (XEXP (x, 0))))
      x = XEXP (x, 0);
    switch (GET_CODE (x))
      {
      case LABEL_REF:
!       return !TARGET_LARGE_MEM && !aligned;
! 
!     case SYMBOL_REF:
!       return !TARGET_LARGE_MEM && (!aligned || ALIGNED_SYMBOL_REF_P (x));
  
      case CONST:
        if (!TARGET_LARGE_MEM && GET_CODE (XEXP (x, 0)) == PLUS)
*************** spu_legitimate_address (enum machine_mod
*** 2881,2902 ****
  	  rtx sym = XEXP (XEXP (x, 0), 0);
  	  rtx cst = XEXP (XEXP (x, 0), 1);
  
- 	  /* Accept any symbol_ref + constant, assuming it does not
- 	     wrap around the local store addressability limit.  */
  	  if (GET_CODE (sym) == SYMBOL_REF && GET_CODE (cst) == CONST_INT)
! 	    return 1;
  	}
        return 0;
  
      case CONST_INT:
        return INTVAL (x) >= 0 && INTVAL (x) <= 0x3ffff;
  
      case SUBREG:
        x = XEXP (x, 0);
!       gcc_assert (GET_CODE (x) == REG);
  
      case REG:
!       return INT_REG_OK_FOR_BASE_P (x, reg_ok_strict);
  
      case PLUS:
      case LO_SUM:
--- 2909,2938 ----
  	  rtx sym = XEXP (XEXP (x, 0), 0);
  	  rtx cst = XEXP (XEXP (x, 0), 1);
  
  	  if (GET_CODE (sym) == SYMBOL_REF && GET_CODE (cst) == CONST_INT)
! 	    {
! 	      /* Check for alignment if required.  */
! 	      if (!aligned)
! 		return 1;
! 	      if ((INTVAL (cst) & 15) == 0 && ALIGNED_SYMBOL_REF_P (sym))
! 		return 1;
! 	    }
  	}
        return 0;
  
      case CONST_INT:
+       /* We don't test alignement here.  For an absolute address we
+          assume the user knows what they are doing. */
        return INTVAL (x) >= 0 && INTVAL (x) <= 0x3ffff;
  
      case SUBREG:
        x = XEXP (x, 0);
!       if (GET_CODE (x) != REG)
! 	return 0;
  
      case REG:
!       return INT_REG_OK_FOR_BASE_P (x, reg_ok_strict)
! 	&& reg_aligned_for_addr (x, 0);
  
      case PLUS:
      case LO_SUM:
*************** spu_legitimate_address (enum machine_mod
*** 2907,2927 ****
  	  op0 = XEXP (op0, 0);
  	if (GET_CODE (op1) == SUBREG)
  	  op1 = XEXP (op1, 0);
- 	/* We can't just accept any aligned register because CSE can
- 	   change it to a register that is not marked aligned and then
- 	   recog will fail.   So we only accept frame registers because
- 	   they will only be changed to other frame registers. */
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == CONST_INT
  	    && INTVAL (op1) >= -0x2000
  	    && INTVAL (op1) <= 0x1fff
! 	    && (regno_aligned_for_load (REGNO (op0)) || (INTVAL (op1) & 15) == 0))
  	  return 1;
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == REG
! 	    && INT_REG_OK_FOR_INDEX_P (op1, reg_ok_strict))
  	  return 1;
        }
        break;
--- 2943,2971 ----
  	  op0 = XEXP (op0, 0);
  	if (GET_CODE (op1) == SUBREG)
  	  op1 = XEXP (op1, 0);
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == CONST_INT
  	    && INTVAL (op1) >= -0x2000
  	    && INTVAL (op1) <= 0x1fff
! 	    && reg_aligned_for_addr (op0, 0)
! 	    && (!const_aligned
! 		|| (INTVAL (op1) & 15) == 0
! 		|| ((reload_in_progress || reload_completed)
! 		    && GET_MODE_SIZE (mode) < 4
! 		    && (INTVAL (op1) & 15) == 4 - GET_MODE_SIZE (mode))
! 		/* Some passes create a fake register for testing valid
! 		   addresses, be more lenient when we see those.  ivopts
! 		   and reload do it. */
! 		|| REGNO (op0) == LAST_VIRTUAL_REGISTER + 1
! 		|| REGNO (op0) == LAST_VIRTUAL_REGISTER + 2))
  	  return 1;
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
+ 	    && reg_aligned_for_addr (op0, 0)
  	    && GET_CODE (op1) == REG
! 	    && INT_REG_OK_FOR_INDEX_P (op1, reg_ok_strict)
! 	    && reg_aligned_for_addr (op1, 0))
  	  return 1;
        }
        break;
*************** spu_legitimize_address (rtx x, rtx oldx 
*** 2959,2965 ****
        else if (GET_CODE (op1) != REG)
  	op1 = force_reg (Pmode, op1);
        x = gen_rtx_PLUS (Pmode, op0, op1);
!       if (spu_legitimate_address (mode, x, 0))
  	return x;
      }
    return NULL_RTX;
--- 3003,3009 ----
        else if (GET_CODE (op1) != REG)
  	op1 = force_reg (Pmode, op1);
        x = gen_rtx_PLUS (Pmode, op0, op1);
!       if (spu_legitimate_address (mode, x, 0, 0))
  	return x;
      }
    return NULL_RTX;
*************** spu_conditional_register_usage (void)
*** 3383,3442 ****
      }
  }
  
! /* This is called to decide when we can simplify a load instruction.  We
!    must only return true for registers which we know will always be
!    aligned.  Taking into account that CSE might replace this reg with
!    another one that has not been marked aligned.  
!    So this is really only true for frame, stack and virtual registers,
!    which we know are always aligned and should not be adversely effected
!    by CSE.  */
  static int
! regno_aligned_for_load (int regno)
! {
!   return regno == FRAME_POINTER_REGNUM
!     || (frame_pointer_needed && regno == HARD_FRAME_POINTER_REGNUM)
!     || regno == ARG_POINTER_REGNUM
!     || regno == STACK_POINTER_REGNUM
!     || (regno >= FIRST_VIRTUAL_REGISTER 
! 	&& regno <= LAST_VIRTUAL_REGISTER);
! }
! 
! /* Return TRUE when mem is known to be 16-byte aligned. */
! int
! aligned_mem_p (rtx mem)
  {
!   if (MEM_ALIGN (mem) >= 128)
      return 1;
!   if (GET_MODE_SIZE (GET_MODE (mem)) >= 16)
!     return 1;
!   if (GET_CODE (XEXP (mem, 0)) == PLUS)
!     {
!       rtx p0 = XEXP (XEXP (mem, 0), 0);
!       rtx p1 = XEXP (XEXP (mem, 0), 1);
!       if (regno_aligned_for_load (REGNO (p0)))
! 	{
! 	  if (GET_CODE (p1) == REG && regno_aligned_for_load (REGNO (p1)))
! 	    return 1;
! 	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15) == 0)
! 	    return 1;
! 	}
!     }
!   else if (GET_CODE (XEXP (mem, 0)) == REG)
!     {
!       if (regno_aligned_for_load (REGNO (XEXP (mem, 0))))
! 	return 1;
!     }
!   else if (ALIGNED_SYMBOL_REF_P (XEXP (mem, 0)))
!     return 1;
!   else if (GET_CODE (XEXP (mem, 0)) == CONST)
!     {
!       rtx p0 = XEXP (XEXP (XEXP (mem, 0), 0), 0);
!       rtx p1 = XEXP (XEXP (XEXP (mem, 0), 0), 1);
!       if (GET_CODE (p0) == SYMBOL_REF
! 	  && GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15) == 0)
! 	return 1;
!     }
!   return 0;
  }
  
  /* Encode symbol attributes (local vs. global, tls model) of a SYMBOL_REF
--- 3427,3442 ----
      }
  }
  
! /* This is called any time we inspect the alignment of a register for
!    addresses.  */
  static int
! reg_aligned_for_addr (rtx x, int aligned)
  {
!   int regno =
!     REGNO (x) < FIRST_PSEUDO_REGISTER ? ORIGINAL_REGNO (x) : REGNO (x);
!   if (!aligned)
      return 1;
!   return REGNO_POINTER_ALIGN (regno) >= 128;
  }
  
  /* Encode symbol attributes (local vs. global, tls model) of a SYMBOL_REF
*************** spu_encode_section_info (tree decl, rtx 
*** 3465,3473 ****
  static int
  store_with_one_insn_p (rtx mem)
  {
    rtx addr = XEXP (mem, 0);
!   if (GET_MODE (mem) == BLKmode)
      return 0;
    /* Only static objects. */
    if (GET_CODE (addr) == SYMBOL_REF)
      {
--- 3465,3476 ----
  static int
  store_with_one_insn_p (rtx mem)
  {
+   enum machine_mode mode = GET_MODE (mem);
    rtx addr = XEXP (mem, 0);
!   if (mode == BLKmode)
      return 0;
+   if (GET_MODE_SIZE (mode) >= 16)
+     return 1;
    /* Only static objects. */
    if (GET_CODE (addr) == SYMBOL_REF)
      {
*************** store_with_one_insn_p (rtx mem)
*** 3491,3496 ****
--- 3494,3515 ----
    return 0;
  }
  
+ /* Return 1 when the address is not valid for a simple load and store as
+    required by the '_mov*' patterns.   We could make this less strict
+    for loads, but we prefer mem's to look the same so they are more
+    likely to be merged.  */
+ static int
+ address_needs_split (rtx mem)
+ {
+   if (GET_MODE_SIZE (GET_MODE (mem)) < 16
+       && (GET_MODE_SIZE (GET_MODE (mem)) < 4
+ 	  || !(store_with_one_insn_p (mem)
+ 	       || mem_is_padded_component_ref (mem))))
+     return 1;
+ 
+   return 0;
+ }
+ 
  int
  spu_expand_mov (rtx * ops, enum machine_mode mode)
  {
*************** spu_expand_mov (rtx * ops, enum machine_
*** 3538,3562 ****
      }
    else
      {
-       if (GET_CODE (ops[0]) == MEM)
- 	{
- 	  if (!spu_valid_move (ops))
- 	    {
- 	      emit_insn (gen_store (ops[0], ops[1], gen_reg_rtx (TImode),
- 				    gen_reg_rtx (TImode)));
- 	      return 1;
- 	    }
- 	}
-       else if (GET_CODE (ops[1]) == MEM)
- 	{
- 	  if (!spu_valid_move (ops))
- 	    {
- 	      emit_insn (gen_load
- 			 (ops[0], ops[1], gen_reg_rtx (TImode),
- 			  gen_reg_rtx (SImode)));
- 	      return 1;
- 	    }
- 	}
        /* Catch the SImode immediates greater than 0x7fffffff, and sign
           extend them. */
        if (GET_CODE (ops[1]) == CONST_INT)
--- 3557,3562 ----
*************** spu_expand_mov (rtx * ops, enum machine_
*** 3572,3578 ****
    return 0;
  }
  
! void
  spu_split_load (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
--- 3572,3578 ----
    return 0;
  }
  
! int
  spu_split_load (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
*************** spu_split_load (rtx * ops)
*** 3580,3585 ****
--- 3580,3596 ----
    int rot_amt;
  
    addr = XEXP (ops[1], 0);
+   gcc_assert (GET_CODE (addr) != AND);
+ 
+   if (!address_needs_split (ops[1]))
+     {
+       addr = XEXP (ops[1], 0);
+       if (spu_legitimate_address (mode, addr, 0, 1))
+ 	return 0;
+       ops[1] = change_address (ops[1], VOIDmode, force_reg (Pmode, addr));
+       emit_move_insn (ops[0], ops[1]);
+       return 1;
+     }
  
    rot = 0;
    rot_amt = 0;
*************** spu_split_load (rtx * ops)
*** 3597,3608 ****
         */
        p0 = XEXP (addr, 0);
        p1 = XEXP (addr, 1);
!       if (REG_P (p0) && !regno_aligned_for_load (REGNO (p0)))
  	{
! 	  if (REG_P (p1) && !regno_aligned_for_load (REGNO (p1)))
  	    {
! 	      emit_insn (gen_addsi3 (ops[3], p0, p1));
! 	      rot = ops[3];
  	    }
  	  else
  	    rot = p0;
--- 3608,3639 ----
         */
        p0 = XEXP (addr, 0);
        p1 = XEXP (addr, 1);
!       if (!reg_aligned_for_addr (p0, 1))
  	{
! 	  if (GET_CODE (p1) == REG && !reg_aligned_for_addr (p1, 1))
  	    {
! 	      rot = gen_reg_rtx (SImode);
! 	      emit_insn (gen_addsi3 (rot, p0, p1));
! 	    }
! 	  else if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
! 	    {
! 	      if (INTVAL (p1) > 0
! 		  && INTVAL (p1) * BITS_PER_UNIT < REG_ALIGN (p0))
! 		{
! 		  rot = gen_reg_rtx (SImode);
! 		  emit_insn (gen_addsi3 (rot, p0, p1));
! 		  addr = p0;
! 		}
! 	      else
! 		{
! 		  rtx x = gen_reg_rtx (SImode);
! 		  emit_move_insn (x, p1);
! 		  if (!spu_arith_operand (p1, SImode))
! 		    p1 = x;
! 		  rot = gen_reg_rtx (SImode);
! 		  emit_insn (gen_addsi3 (rot, p0, p1));
! 		  addr = gen_rtx_PLUS (Pmode, p0, x);
! 		}
  	    }
  	  else
  	    rot = p0;
*************** spu_split_load (rtx * ops)
*** 3612,3627 ****
  	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
  	    {
  	      rot_amt = INTVAL (p1) & 15;
! 	      p1 = GEN_INT (INTVAL (p1) & -16);
! 	      addr = gen_rtx_PLUS (SImode, p0, p1);
  	    }
! 	  else if (REG_P (p1) && !regno_aligned_for_load (REGNO (p1)))
  	    rot = p1;
  	}
      }
    else if (GET_CODE (addr) == REG)
      {
!       if (!regno_aligned_for_load (REGNO (addr)))
  	rot = addr;
      }
    else if (GET_CODE (addr) == CONST)
--- 3643,3663 ----
  	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
  	    {
  	      rot_amt = INTVAL (p1) & 15;
! 	      if (INTVAL (p1) & -16)
! 		{
! 		  p1 = GEN_INT (INTVAL (p1) & -16);
! 		  addr = gen_rtx_PLUS (SImode, p0, p1);
! 		}
! 	      else
! 		addr = p0;
  	    }
! 	  else if (GET_CODE (p1) == REG && !reg_aligned_for_addr (p1, 1))
  	    rot = p1;
  	}
      }
    else if (GET_CODE (addr) == REG)
      {
!       if (!reg_aligned_for_addr (addr, 1))
  	rot = addr;
      }
    else if (GET_CODE (addr) == CONST)
*************** spu_split_load (rtx * ops)
*** 3640,3646 ****
  	    addr = XEXP (XEXP (addr, 0), 0);
  	}
        else
! 	rot = addr;
      }
    else if (GET_CODE (addr) == CONST_INT)
      {
--- 3676,3685 ----
  	    addr = XEXP (XEXP (addr, 0), 0);
  	}
        else
! 	{
! 	  rot = gen_reg_rtx (Pmode);
! 	  emit_move_insn (rot, addr);
! 	}
      }
    else if (GET_CODE (addr) == CONST_INT)
      {
*************** spu_split_load (rtx * ops)
*** 3648,3654 ****
        addr = GEN_INT (rot_amt & -16);
      }
    else if (!ALIGNED_SYMBOL_REF_P (addr))
!     rot = addr;
  
    if (GET_MODE_SIZE (mode) < 4)
      rot_amt += GET_MODE_SIZE (mode) - 4;
--- 3687,3696 ----
        addr = GEN_INT (rot_amt & -16);
      }
    else if (!ALIGNED_SYMBOL_REF_P (addr))
!     {
!       rot = gen_reg_rtx (Pmode);
!       emit_move_insn (rot, addr);
!     }
  
    if (GET_MODE_SIZE (mode) < 4)
      rot_amt += GET_MODE_SIZE (mode) - 4;
*************** spu_split_load (rtx * ops)
*** 3657,3671 ****
  
    if (rot && rot_amt)
      {
!       emit_insn (gen_addsi3 (ops[3], rot, GEN_INT (rot_amt)));
!       rot = ops[3];
        rot_amt = 0;
      }
  
!   load = ops[2];
  
!   addr = gen_rtx_AND (SImode, copy_rtx (addr), GEN_INT (-16));
!   mem = change_address (ops[1], TImode, addr);
  
    emit_insn (gen_movti (load, mem));
  
--- 3699,3713 ----
  
    if (rot && rot_amt)
      {
!       rtx x = gen_reg_rtx (SImode);
!       emit_insn (gen_addsi3 (x, rot, GEN_INT (rot_amt)));
!       rot = x;
        rot_amt = 0;
      }
  
!   load = gen_reg_rtx (TImode);
  
!   mem = change_address (ops[1], TImode, copy_rtx (addr));
  
    emit_insn (gen_movti (load, mem));
  
*************** spu_split_load (rtx * ops)
*** 3674,3696 ****
    else if (rot_amt)
      emit_insn (gen_rotlti3 (load, load, GEN_INT (rot_amt * 8)));
  
!   if (reload_completed)
!     emit_move_insn (ops[0], gen_rtx_REG (GET_MODE (ops[0]), REGNO (load)));
!   else
!     emit_insn (gen_spu_convert (ops[0], load));
  }
  
! void
  spu_split_store (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
!   rtx pat = ops[2];
!   rtx reg = ops[3];
    rtx addr, p0, p1, p1_lo, smem;
    int aform;
    int scalar;
  
    addr = XEXP (ops[0], 0);
  
    if (GET_CODE (addr) == PLUS)
      {
--- 3716,3746 ----
    else if (rot_amt)
      emit_insn (gen_rotlti3 (load, load, GEN_INT (rot_amt * 8)));
  
!   emit_insn (gen_spu_convert (ops[0], load));
!   return 1;
  }
  
! int
  spu_split_store (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
!   rtx reg;
    rtx addr, p0, p1, p1_lo, smem;
    int aform;
    int scalar;
  
+   if (!address_needs_split (ops[0]))
+     {
+       addr = XEXP (ops[0], 0);
+       if (spu_legitimate_address (mode, addr, 0, 1))
+ 	return 0;
+       ops[0] = change_address (ops[0], VOIDmode, force_reg (Pmode, addr));
+       emit_move_insn (ops[0], ops[1]);
+       return 1;
+     }
+ 
    addr = XEXP (ops[0], 0);
+   gcc_assert (GET_CODE (addr) != AND);
  
    if (GET_CODE (addr) == PLUS)
      {
*************** spu_split_store (rtx * ops)
*** 3702,3708 ****
           unaligned reg + aligned reg     => lqx, c?x, shuf, stqx
           unaligned reg + unaligned reg   => lqx, c?x, shuf, stqx
           unaligned reg + aligned const   => lqd, c?d, shuf, stqx
!          unaligned reg + unaligned const -> not allowed by legitimate address
         */
        aform = 0;
        p0 = XEXP (addr, 0);
--- 3752,3758 ----
           unaligned reg + aligned reg     => lqx, c?x, shuf, stqx
           unaligned reg + unaligned reg   => lqx, c?x, shuf, stqx
           unaligned reg + aligned const   => lqd, c?d, shuf, stqx
!          unaligned reg + unaligned const -> lqx, c?d, shuf, stqx
         */
        aform = 0;
        p0 = XEXP (addr, 0);
*************** spu_split_store (rtx * ops)
*** 3710,3717 ****
        if (GET_CODE (p0) == REG && GET_CODE (p1) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (p1) & 15);
! 	  p1 = GEN_INT (INTVAL (p1) & -16);
! 	  addr = gen_rtx_PLUS (SImode, p0, p1);
  	}
      }
    else if (GET_CODE (addr) == REG)
--- 3760,3779 ----
        if (GET_CODE (p0) == REG && GET_CODE (p1) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (p1) & 15);
! 	  if (reg_aligned_for_addr (p0, 1))
! 	    {
! 	      p1 = GEN_INT (INTVAL (p1) & -16);
! 	      if (p1 == const0_rtx)
! 		addr = p0;
! 	      else
! 		addr = gen_rtx_PLUS (SImode, p0, p1);
! 	    }
! 	  else
! 	    {
! 	      rtx x = gen_reg_rtx (SImode);
! 	      emit_move_insn (x, p1);
! 	      addr = gen_rtx_PLUS (SImode, p0, x);
! 	    }
  	}
      }
    else if (GET_CODE (addr) == REG)
*************** spu_split_store (rtx * ops)
*** 3728,3758 ****
        p1_lo = addr;
        if (ALIGNED_SYMBOL_REF_P (addr))
  	p1_lo = const0_rtx;
!       else if (GET_CODE (addr) == CONST)
  	{
! 	  if (GET_CODE (XEXP (addr, 0)) == PLUS
! 	      && ALIGNED_SYMBOL_REF_P (XEXP (XEXP (addr, 0), 0))
! 	      && GET_CODE (XEXP (XEXP (addr, 0), 1)) == CONST_INT)
! 	    {
! 	      HOST_WIDE_INT v = INTVAL (XEXP (XEXP (addr, 0), 1));
! 	      if ((v & -16) != 0)
! 		addr = gen_rtx_CONST (Pmode,
! 				      gen_rtx_PLUS (Pmode,
! 						    XEXP (XEXP (addr, 0), 0),
! 						    GEN_INT (v & -16)));
! 	      else
! 		addr = XEXP (XEXP (addr, 0), 0);
! 	      p1_lo = GEN_INT (v & 15);
! 	    }
  	}
        else if (GET_CODE (addr) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (addr) & 15);
  	  addr = GEN_INT (INTVAL (addr) & -16);
  	}
      }
  
!   addr = gen_rtx_AND (SImode, copy_rtx (addr), GEN_INT (-16));
  
    scalar = store_with_one_insn_p (ops[0]);
    if (!scalar)
--- 3790,3823 ----
        p1_lo = addr;
        if (ALIGNED_SYMBOL_REF_P (addr))
  	p1_lo = const0_rtx;
!       else if (GET_CODE (addr) == CONST
! 	       && GET_CODE (XEXP (addr, 0)) == PLUS
! 	       && ALIGNED_SYMBOL_REF_P (XEXP (XEXP (addr, 0), 0))
! 	       && GET_CODE (XEXP (XEXP (addr, 0), 1)) == CONST_INT)
  	{
! 	  HOST_WIDE_INT v = INTVAL (XEXP (XEXP (addr, 0), 1));
! 	  if ((v & -16) != 0)
! 	    addr = gen_rtx_CONST (Pmode,
! 				  gen_rtx_PLUS (Pmode,
! 						XEXP (XEXP (addr, 0), 0),
! 						GEN_INT (v & -16)));
! 	  else
! 	    addr = XEXP (XEXP (addr, 0), 0);
! 	  p1_lo = GEN_INT (v & 15);
  	}
        else if (GET_CODE (addr) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (addr) & 15);
  	  addr = GEN_INT (INTVAL (addr) & -16);
  	}
+       else
+ 	{
+ 	  p1_lo = gen_reg_rtx (SImode);
+ 	  emit_move_insn (p1_lo, addr);
+ 	}
      }
  
!   reg = gen_reg_rtx (TImode);
  
    scalar = store_with_one_insn_p (ops[0]);
    if (!scalar)
*************** spu_split_store (rtx * ops)
*** 3762,3772 ****
           possible, and copying the flags will prevent that in certain
           cases, e.g. consider the volatile flag. */
  
        rtx lmem = change_address (ops[0], TImode, copy_rtx (addr));
        set_mem_alias_set (lmem, 0);
        emit_insn (gen_movti (reg, lmem));
  
!       if (!p0 || regno_aligned_for_load (REGNO (p0)))
  	p0 = stack_pointer_rtx;
        if (!p1_lo)
  	p1_lo = const0_rtx;
--- 3827,3838 ----
           possible, and copying the flags will prevent that in certain
           cases, e.g. consider the volatile flag. */
  
+       rtx pat = gen_reg_rtx (TImode);
        rtx lmem = change_address (ops[0], TImode, copy_rtx (addr));
        set_mem_alias_set (lmem, 0);
        emit_insn (gen_movti (reg, lmem));
  
!       if (!p0 || reg_aligned_for_addr (p0, 1))
  	p0 = stack_pointer_rtx;
        if (!p1_lo)
  	p1_lo = const0_rtx;
*************** spu_split_store (rtx * ops)
*** 3774,3790 ****
        emit_insn (gen_cpat (pat, p0, p1_lo, GEN_INT (GET_MODE_SIZE (mode))));
        emit_insn (gen_shufb (reg, ops[1], reg, pat));
      }
-   else if (reload_completed)
-     {
-       if (GET_CODE (ops[1]) == REG)
- 	emit_move_insn (reg, gen_rtx_REG (GET_MODE (reg), REGNO (ops[1])));
-       else if (GET_CODE (ops[1]) == SUBREG)
- 	emit_move_insn (reg,
- 			gen_rtx_REG (GET_MODE (reg),
- 				     REGNO (SUBREG_REG (ops[1]))));
-       else
- 	abort ();
-     }
    else
      {
        if (GET_CODE (ops[1]) == REG)
--- 3840,3845 ----
*************** spu_split_store (rtx * ops)
*** 3796,3810 ****
      }
  
    if (GET_MODE_SIZE (mode) < 4 && scalar)
!     emit_insn (gen_shlqby_ti
! 	       (reg, reg, GEN_INT (4 - GET_MODE_SIZE (mode))));
  
!   smem = change_address (ops[0], TImode, addr);
    /* We can't use the previous alias set because the memory has changed
       size and can potentially overlap objects of other types.  */
    set_mem_alias_set (smem, 0);
  
    emit_insn (gen_movti (smem, reg));
  }
  
  /* Return TRUE if X is MEM which is a struct member reference
--- 3851,3866 ----
      }
  
    if (GET_MODE_SIZE (mode) < 4 && scalar)
!     emit_insn (gen_ashlti3
! 	       (reg, reg, GEN_INT (32 - GET_MODE_BITSIZE (mode))));
  
!   smem = change_address (ops[0], TImode, copy_rtx (addr));
    /* We can't use the previous alias set because the memory has changed
       size and can potentially overlap objects of other types.  */
    set_mem_alias_set (smem, 0);
  
    emit_insn (gen_movti (smem, reg));
+   return 1;
  }
  
  /* Return TRUE if X is MEM which is a struct member reference
*************** fix_range (const char *const_str)
*** 3903,3939 ****
      }
  }
  
- int
- spu_valid_move (rtx * ops)
- {
-   enum machine_mode mode = GET_MODE (ops[0]);
-   if (!register_operand (ops[0], mode) && !register_operand (ops[1], mode))
-     return 0;
- 
-   /* init_expr_once tries to recog against load and store insns to set
-      the direct_load[] and direct_store[] arrays.  We always want to
-      consider those loads and stores valid.  init_expr_once is called in
-      the context of a dummy function which does not have a decl. */
-   if (cfun->decl == 0)
-     return 1;
- 
-   /* Don't allows loads/stores which would require more than 1 insn.
-      During and after reload we assume loads and stores only take 1
-      insn. */
-   if (GET_MODE_SIZE (mode) < 16 && !reload_in_progress && !reload_completed)
-     {
-       if (GET_CODE (ops[0]) == MEM
- 	  && (GET_MODE_SIZE (mode) < 4
- 	      || !(store_with_one_insn_p (ops[0])
- 		   || mem_is_padded_component_ref (ops[0]))))
- 	return 0;
-       if (GET_CODE (ops[1]) == MEM
- 	  && (GET_MODE_SIZE (mode) < 4 || !aligned_mem_p (ops[1])))
- 	return 0;
-     }
-   return 1;
- }
- 
  /* Return TRUE if x is a CONST_INT, CONST_DOUBLE or CONST_VECTOR that
     can be generated using the fsmbi instruction. */
  int
--- 3959,3964 ----
*************** spu_sms_res_mii (struct ddg *g)
*** 5577,5588 ****
  
  void
  spu_init_expanders (void)
! {   
!   /* HARD_FRAME_REGISTER is only 128 bit aligned when
!    * frame_pointer_needed is true.  We don't know that until we're
!    * expanding the prologue. */
    if (cfun)
!     REGNO_POINTER_ALIGN (HARD_FRAME_POINTER_REGNUM) = 8;
  }
  
  static enum machine_mode
--- 5602,5627 ----
  
  void
  spu_init_expanders (void)
! {
    if (cfun)
!     {
!       rtx r0, r1;
!       /* HARD_FRAME_REGISTER is only 128 bit aligned when
!          frame_pointer_needed is true.  We don't know that until we're
!          expanding the prologue. */
!       REGNO_POINTER_ALIGN (HARD_FRAME_POINTER_REGNUM) = 8;
! 
!       /* A number of passes use LAST_VIRTUAL_REGISTER+1 and
!          LAST_VIRTUAL_REGISTER+2 to test the back-end.  We want to
!          handle those cases specially, so we reserve those two registers
!          here by generating them. */
!       r0 = gen_reg_rtx (SImode);
!       r1 = gen_reg_rtx (SImode);
!       mark_reg_pointer (r0, 128);
!       mark_reg_pointer (r1, 128);
!       gcc_assert (REGNO (r0) == LAST_VIRTUAL_REGISTER + 1
! 		  && REGNO (r1) == LAST_VIRTUAL_REGISTER + 2);
!     }
  }
  
  static enum machine_mode
Index: gcc/gcc/config/spu/spu-builtins.md
===================================================================
*** gcc/gcc/config/spu/spu-builtins.md	(revision 139677)
--- gcc/gcc/config/spu/spu-builtins.md	(working copy)
***************
*** 23,31 ****
  
  (define_expand "spu_lqd"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (and:SI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 				 (match_operand:SI 2 "spu_nonmem_operand" ""))
! 		        (const_int -16))))]
    ""
    {
      if (GET_CODE (operands[2]) == CONST_INT
--- 23,30 ----
  
  (define_expand "spu_lqd"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 			 (match_operand:SI 2 "spu_nonmem_operand" ""))))]
    ""
    {
      if (GET_CODE (operands[2]) == CONST_INT
*************** (define_expand "spu_lqd"
*** 42,57 ****
  
  (define_expand "spu_lqx"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (and:SI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
!                                  (match_operand:SI 2 "spu_reg_operand" ""))
!                         (const_int -16))))]
    ""
    "")
  
  (define_expand "spu_lqa"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (and:SI (match_operand:SI 1 "immediate_operand" "")
!                         (const_int -16))))]
    ""
    {
      if (GET_CODE (operands[1]) == CONST_INT
--- 41,54 ----
  
  (define_expand "spu_lqx"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 			 (match_operand:SI 2 "spu_reg_operand" ""))))]
    ""
    "")
  
  (define_expand "spu_lqa"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (match_operand:SI 1 "immediate_operand" "")))]
    ""
    {
      if (GET_CODE (operands[1]) == CONST_INT
*************** (define_expand "spu_lqa"
*** 61,75 ****
  
  (define_expand "spu_lqr"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
! 	(mem:TI (and:SI (match_operand:SI 1 "address_operand" "")
! 			(const_int -16))))]
    ""
    "")
  
  (define_expand "spu_stqd"
!   [(set (mem:TI (and:SI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 				 (match_operand:SI 2 "spu_nonmem_operand" ""))
! 		        (const_int -16)))
          (match_operand:TI 0 "spu_reg_operand" "r,r"))]
    ""
    {
--- 58,70 ----
  
  (define_expand "spu_lqr"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
! 	(mem:TI (match_operand:SI 1 "address_operand" "")))]
    ""
    "")
  
  (define_expand "spu_stqd"
!   [(set (mem:TI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 			 (match_operand:SI 2 "spu_nonmem_operand" "")))
          (match_operand:TI 0 "spu_reg_operand" "r,r"))]
    ""
    {
*************** (define_expand "spu_stqd"
*** 86,101 ****
    })
  
  (define_expand "spu_stqx"
!   [(set (mem:TI (and:SI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 				 (match_operand:SI 2 "spu_reg_operand" ""))
! 		        (const_int -16)))
          (match_operand:TI 0 "spu_reg_operand" "r"))]
    ""
    "")
  
  (define_expand "spu_stqa"
!   [(set (mem:TI (and:SI (match_operand:SI 1 "immediate_operand" "")
! 			(const_int -16)))
          (match_operand:TI 0 "spu_reg_operand" "r"))]
    ""
    {
--- 81,94 ----
    })
  
  (define_expand "spu_stqx"
!   [(set (mem:TI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 			 (match_operand:SI 2 "spu_reg_operand" "")))
          (match_operand:TI 0 "spu_reg_operand" "r"))]
    ""
    "")
  
  (define_expand "spu_stqa"
!   [(set (mem:TI (match_operand:SI 1 "immediate_operand" ""))
          (match_operand:TI 0 "spu_reg_operand" "r"))]
    ""
    {
*************** (define_expand "spu_stqa"
*** 105,112 ****
    })
  
  (define_expand "spu_stqr"
!     [(set (mem:TI (and:SI (match_operand:SI 1 "address_operand" "")
! 			  (const_int -16)))
  	  (match_operand:TI 0 "spu_reg_operand" ""))]
    ""
    "")
--- 98,104 ----
    })
  
  (define_expand "spu_stqr"
!     [(set (mem:TI (match_operand:SI 1 "address_operand" ""))
  	  (match_operand:TI 0 "spu_reg_operand" ""))]
    ""
    "")
Index: gcc/gcc/config/spu/spu.h
===================================================================
*** gcc/gcc/config/spu/spu.h	(revision 139677)
--- gcc/gcc/config/spu/spu.h	(working copy)
*************** enum reg_class { 
*** 229,234 ****
--- 229,239 ----
  #define INT_REG_OK_FOR_BASE_P(X,STRICT) \
  	((!(STRICT) || REGNO_OK_FOR_BASE_P (REGNO (X))))
  
+ #define REG_ALIGN(X) \
+ 	(REG_POINTER(X) \
+ 	 	? REGNO_POINTER_ALIGN (ORIGINAL_REGNO (X)) \
+ 		: 0)
+ 
  #define PREFERRED_RELOAD_CLASS(X,CLASS)  (CLASS)
  
  #define CLASS_MAX_NREGS(CLASS, MODE)	\
*************** targetm.resolve_overloaded_builtin = spu
*** 414,420 ****
  #endif
  
  #define GO_IF_LEGITIMATE_ADDRESS(MODE, X, ADDR)			\
!     { if (spu_legitimate_address (MODE, X, REG_OK_STRICT_FLAG))	\
  	goto ADDR;						\
      }
  
--- 419,425 ----
  #endif
  
  #define GO_IF_LEGITIMATE_ADDRESS(MODE, X, ADDR)			\
!     { if (spu_legitimate_address (MODE, X, REG_OK_STRICT_FLAG, 0))	\
  	goto ADDR;						\
      }
  
*************** targetm.resolve_overloaded_builtin = spu
*** 608,610 ****
--- 613,617 ----
  extern GTY(()) rtx spu_compare_op0;
  extern GTY(()) rtx spu_compare_op1;
  
+ #define SPLIT_BEFORE_CSE2 1
+ 
Index: gcc/gcc/config/spu/spu.md
===================================================================
*** gcc/gcc/config/spu/spu.md	(revision 139677)
--- gcc/gcc/config/spu/spu.md	(working copy)
*************** (define_expand "mov<mode>"
*** 276,283 ****
  (define_split 
    [(set (match_operand 0 "spu_reg_operand")
  	(match_operand 1 "immediate_operand"))]
! 
!   ""
    [(set (match_dup 0)
  	(high (match_dup 1)))
     (set (match_dup 0)
--- 276,282 ----
  (define_split 
    [(set (match_operand 0 "spu_reg_operand")
  	(match_operand 1 "immediate_operand"))]
!   "split0_completed"
    [(set (match_dup 0)
  	(high (match_dup 1)))
     (set (match_dup 0)
*************** (define_insn "load_pic_offset"
*** 314,322 ****
  ;; move internal
  
  (define_insn "_mov<mode>"
!   [(set (match_operand:MOV 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:MOV 1 "spu_mov_operand" "r,A,f,j,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%s1\t%0,%S1
--- 313,322 ----
  ;; move internal
  
  (define_insn "_mov<mode>"
!   [(set (match_operand:MOV 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:MOV 1 "spu_mov_operand" "r,A,f,j,m,r"))]
!   "register_operand(operands[0], <MODE>mode)
!    || register_operand(operands[1], <MODE>mode)"
    "@
     ori\t%0,%1,0
     il%s1\t%0,%S1
*************** (define_insn "low_<mode>"
*** 334,342 ****
    "iohl\t%0,%2@l")
  
  (define_insn "_movdi"
!   [(set (match_operand:DI 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:DI 1 "spu_mov_operand" "r,a,f,k,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%d1\t%0,%D1
--- 334,343 ----
    "iohl\t%0,%2@l")
  
  (define_insn "_movdi"
!   [(set (match_operand:DI 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:DI 1 "spu_mov_operand" "r,a,f,k,m,r"))]
!   "register_operand(operands[0], DImode)
!    || register_operand(operands[1], DImode)"
    "@
     ori\t%0,%1,0
     il%d1\t%0,%D1
*************** (define_insn "_movdi"
*** 347,355 ****
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
  (define_insn "_movti"
!   [(set (match_operand:TI 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:TI 1 "spu_mov_operand" "r,U,f,l,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%t1\t%0,%T1
--- 348,357 ----
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
  (define_insn "_movti"
!   [(set (match_operand:TI 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:TI 1 "spu_mov_operand" "r,U,f,l,m,r"))]
!   "register_operand(operands[0], TImode)
!    || register_operand(operands[1], TImode)"
    "@
     ori\t%0,%1,0
     il%t1\t%0,%T1
*************** (define_insn "_movti"
*** 359,387 ****
     stq%p0\t%1,%0"
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
! (define_insn_and_split "load"
!   [(set (match_operand 0 "spu_reg_operand" "=r")
! 	(match_operand 1 "memory_operand" "m"))
!    (clobber (match_operand:TI 2 "spu_reg_operand" "=&r"))
!    (clobber (match_operand:SI 3 "spu_reg_operand" "=&r"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1])"
!   "#"
!   ""
    [(set (match_dup 0)
  	(match_dup 1))]
!   { spu_split_load(operands); DONE; })
  
! (define_insn_and_split "store"
!   [(set (match_operand 0 "memory_operand" "=m")
! 	(match_operand 1 "spu_reg_operand" "r"))
!    (clobber (match_operand:TI 2 "spu_reg_operand" "=&r"))
!    (clobber (match_operand:TI 3 "spu_reg_operand" "=&r"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1])"
!   "#"
!   ""
    [(set (match_dup 0)
  	(match_dup 1))]
!   { spu_split_store(operands); DONE; })
  
  ;; Operand 3 is the number of bytes. 1:b 2:h 4:w 8:d
  
--- 361,385 ----
     stq%p0\t%1,%0"
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
! (define_split
!   [(set (match_operand 0 "spu_reg_operand")
! 	(match_operand 1 "memory_operand"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1]) && !split0_completed"
    [(set (match_dup 0)
  	(match_dup 1))]
!   { if (spu_split_load(operands))
!       DONE;
!   })
  
! (define_split
!   [(set (match_operand 0 "memory_operand")
! 	(match_operand 1 "spu_reg_operand"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1]) && !split0_completed"
    [(set (match_dup 0)
  	(match_dup 1))]
!   { if (spu_split_store(operands))
!       DONE;
!   })
  
  ;; Operand 3 is the number of bytes. 1:b 2:h 4:w 8:d
  
Index: gcc/gcc/testsuite/gcc.target/spu/split0-1.c
===================================================================
*** gcc/gcc/testsuite/gcc.target/spu/split0-1.c	(revision 0)
--- gcc/gcc/testsuite/gcc.target/spu/split0-1.c	(revision 0)
***************
*** 0 ****
--- 1,17 ----
+ /* Make sure there are only 2 loads. */
+ /* { dg-do compile { target spu-*-* } } */
+ /* { dg-options "-O2" } */
+ /* { dg-final { scan-assembler-times "lqd	\\$\[0-9\]+,0\\(\\$\[0-9\]+\\)" 1 } } */
+ /* { dg-final { scan-assembler-times "lqd	\\$\[0-9\]+,16\\(\\$\[0-9\]+\\)" 1 } } */
+ /* { dg-final { scan-assembler-times "lq\[dx\]" 2 } } */
+   
+ struct __attribute__ ((__aligned__(16))) S {
+   int a, b, c, d;
+   int e, f, g, h;
+ };
+   
+ int
+ f(struct S *s)
+ { 
+   return s->a + s->b + s->c + s->d + s->e + s->f + s->g + s->h;
+ } 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH, SPU] generated better code for loads and stores
  2008-08-29  0:22 [PATCH, SPU] generated better code for loads and stores Trevor_Smigiel
@ 2008-09-05 21:51 ` Trevor_Smigiel
  2008-09-11 11:50 ` Ulrich Weigand
  2009-05-04 21:15 ` [PATCH] add optional split pass before CSE2 Trevor_Smigiel
  2 siblings, 0 replies; 7+ messages in thread
From: Trevor_Smigiel @ 2008-09-05 21:51 UTC (permalink / raw)
  To: gcc-patches; +Cc: Ulrich Weigand, andrew_pinski

Ping?

I'm a maintainer for the SPU part. 

Is it ok to add this new split pass which is only enabled when defining
a new target macro?  

Trevor

* Trevor Smigiel <Trevor_Smigiel@playstation.sony.com> [2008-08-27 19:10]:
> This patch generates better code for loads and stores on SPU.
> 
> The SPU can only do 16-byte, aligned loads and stores.  To load something
> smaller with a smaller alignment requires a load and a rotate.  To store
> something smaller requires a load, insert, and store.
> 
> Currently, there are two obvious ways to generate rtl for loads and
> stores.  Generate the multiple instructions at expand time, or split
> them at some later phase.   When expanded early we lose alias
> information (because that 16-byte load could contain anything), and in
> general do worse optimization on memory.   When we split late, the
> compiler has no opportunity to combine loads/stores of the same 16
> bytes.
> 
> This patch introduces an additional split pass, split0, right before the
> CSE2 pass.  Before this pass, loads and stores are modeled as a single
> rtl instruction, and can be optimized well.  This pass splits them into
> multiple instructions, allowing CSE2 and combine to optimize the 16 byte
> loads and stores.  The pass is only enabled when a target defines
> SPLIT_BEFORE_CSE2.
> 
> The test case is an example which is improved by the earlier split pass.
> 
> This patch also makes other small improvements to the code generated for
> loads and stores on SPU.
> 
> Ok for mainline?  In particular, the new split pass.
> 
> Trevor
> 
> 2008-08-27  Trevor Smigiel <Trevor_Smigiel@playstation.sony.com>
> 	
> 	Improve code generated for loads and stores on SPU.
> 
> 	* doc/tm.texi (SPLIT_BEFORE_CSE2) : Document.
> 	* tree-pass.h (pass_split_before_cse2) : Declare.
> 	* final.c (rest_of_clean_state) : Initialize split0_completed.
> 	* recog.c (split0_completed) : Define.
> 	(gate_handle_split_before_cse2, rest_of_handle_split_before_cse2) :
> 	New functions.
> 	(pass_split_before_cse2) : New pass.
> 	* rtl.h (split0_completed) : Declare.
>         * passes.c (init_optimization_passes) : Add pass_split_before_cse2
>         before pass_cse2 .
> 	* config/spu/spu-protos.h (spu_legitimate_address) : Add
> 	for_split argument.
> 	(aligned_mem_p, spu_valid_move) : Remove prototypes.
> 	(spu_split_load, spu_split_store) : Change return type to int.
> 	* config/spu/predicates.md (spu_mem_operand) : Remove.
> 	(spu_dest_operand) : Add.
> 	* config/spu/spu-builtins.md (spu_lqd, spu_lqx, spu_lqa,
> 	spu_lqr, spu_stqd, spu_stqx, spu_stqa, spu_stqr) : Remove AND
> 	operation.
> 	* config/spu/spu.c (regno_aligned_for_load) : Remove.
> 	(reg_aligned_for_addr, address_needs_split) : New functions.
> 	(spu_legitimate_address, spu_expand_mov, spu_split_load,
> 	spu_split_store) : Update.
> 	(spu_init_expanders) : Pregenerate a couple of pseudo-registers.
> 	* config/spu/spu.h (REG_ALIGN, SPLIT_BEFORE_CSE2) : Define.
> 	(GO_IF_LEGITIMATE_ADDRESS) : Update for spu_legitimate_address.
> 	* config/spu/spu.md ("_mov<mode>", "_movdi", "_movti") : Update
> 	predicates.
> 	("load", "store") : Change to define_split.
> 
> testsuite/
> 	* testsuite/gcc.target/spu/split0-1.c : Add test.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH, SPU] generated better code for loads and stores
  2008-08-29  0:22 [PATCH, SPU] generated better code for loads and stores Trevor_Smigiel
  2008-09-05 21:51 ` Trevor_Smigiel
@ 2008-09-11 11:50 ` Ulrich Weigand
  2008-09-12  3:37   ` Trevor_Smigiel
  2009-05-04 21:15 ` [PATCH] add optional split pass before CSE2 Trevor_Smigiel
  2 siblings, 1 reply; 7+ messages in thread
From: Ulrich Weigand @ 2008-09-11 11:50 UTC (permalink / raw)
  To: Trevor_Smigiel; +Cc: gcc-patches, andrew_pinski, rguenther

Trevor Smigiel wrote:

> This patch generates better code for loads and stores on SPU.

Unfortunately, in the current version this patch introduces a number
of regressions for me:
- wrong code generation due to exposed REGNO_POINTER_ALIGN bug
- ICE with -fnon-call-exceptions 
- code size regression with -O0
- -march=celledp code quality regression

I've seen those both on mainline and with a 4.3 backport of your
patch.


1. Wrong code generation due to exposed REGNO_POINTER_ALIGN bug

On mainline (only) the following test case now fails:
FAIL: gcc.c-torture/execute/960521-1.c execution,  -O1

This is because when compiling this code:

int *b;

foo ()
{
  int i;
  for (i = 0; i < BLOCK_SIZE - 1; i++)
    b[i] = -1;
}

GCC assumes (incorrectly) that the *value* of b must be 16-byte aligned,
because it has (correctly) infered that the *address* of b is 16-byte
aligned!

This happens because of an independent bug in computing REGNO_POINTER_ALIGN
which is present both in force_reg and set_reg_attrs_from_value:

      if (MEM_POINTER (x))
        mark_reg_pointer (reg, MEM_ALIGN (x));

This is broken, because the *value* of the MEM x was just copied into reg.
MEM_ALIGN is the alignment of the memory address, not the alignment of
the pointer that is stored there.

(Note that in GCC 4.3, REGNO_POINTER_ALIGN is incorrect just the same, but
the problem still does not show because the broken register is placed as
the second operand of the PLUS used for address generation -- therefore
the optimization in spu_split_load does not trigger.)


2. ICE with -fnon-call-exceptions

A couple of test cases now fail on both mainline and 4.3:

FAIL: g++.dg/eh/subreg-1.C (internal compiler error)
FAIL: g++.dg/opt/cfg5.C (internal compiler error)
FAIL: g++.dg/opt/pr34036.C (internal compiler error)
FAIL: g++.dg/opt/reg-stack2.C (internal compiler error)
FAIL: g++.dg/other/profile1.C (internal compiler error)

all with an ICE along the following lines:

/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: error: in basic block 5:
/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: error: flow control insn inside a basic block
(insn 141 21 142 5 /home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:35 (set (reg:TI 185)
        (mem/s:TI (reg/v/f:SI 138 [ sp.19 ]) [0 S16 A128])) -1 (expr_list:REG_EH_REGION (const_int 1 [0x1])
        (nil)))
/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: internal compiler error: in rtl_verify_flow_info_1, at cfgrtl.c:1923

The reason for this is that with -fnon-call-exceptions, the memory store
gets tagged with a REG_EH_REGION note.  The split0 pass generates a memory
load and a memory store insn from this, and the splitter logic copies that
note to *both* instructions.  This is a problem because the load is now
considered a "flow control" insn as well (since it supposedly may throw an
exception), and such insns are not allowed within a basic block.

I'm not sure how to fix this.  However, in actual fact, memory accesses on
the SPU *never* trap anyway -- it's just that the may_trap_p logic is not
aware of that fact.  In the SDK compiler there is a target macro
ADDRESSES_NEVER_TRAP that is set on the SPU to cause rtx_addr_can_trap_p_1
to always return 0.  If we port that feature to mainline, this will fix
this ICE as well.


3. Code size regression with -O0

I'm seeing one more test suite regression on 4.3 only:
FAIL: g++.dg/opt/longbranch1.C (test for excess errors)

This is caused by the resulting code slightly exceeding the local store size,
whereas it just fit into local store before this patch.  The reason for
this code size regression with -O0 is that the new code will always
generate a rotate instruction to perform the following load:

(insn 17 16 18 4 longbranch1.ii:6 (set (reg:SI 145)
        (const_int 29936 [0x74f0])) 6 {_movsi} (nil))

(insn 18 17 19 4 longbranch1.ii:6 (set (reg:SI 146)
        (mem/c/i:SI (plus:SI (reg/f:SI 128 $vfp)
                (reg:SI 145)) [0 i+0 S4 A128])) 6 {_movsi} (nil))

even though the memory access is clearly 16-byte aligned.

The code in spu_split_load doesn't recognize this, however, because 
REGNO_POINTER_ALIGN of register 145 returns false -- this register is
in fact not a pointer, but holds a plain integral value.

However, even so, the memory access is itself marked as 16-byte
aligned via the MEM_ALIGN flag.  Unfortunately, spu_split_load
never actually looks at that flag.  In the old code, spu_valid_move
used to check aligned_mem_p, which did look at MEM_ALIGN.

It would appear that simply checking MEM_ALIGN and omitting the
rotate in spu_split_load if possible should be sufficient to fix
this regression.

Unfortunately, the resulting code is still bigger, because of the
additional _spu_convert insn that is being generated.  With -O0,
local-alloc does not attempt to use the same register for both
operands of the _spu_convert, and thus it becomes an explicit
copy in the final output.  This still causes the test case to fail ...

Now, we don't actually *need* to split the insn into a TImode load
plus a _spu_convert.  However, as your comment before address_needs_split
says, you do it deliberately to enable more merging of accesses.  Without
optimization, that merging won't happen anyway ... so I think we should
not perform the split in this case.

With this change in place as well, the code size regression is fixed.


4. -march=celledp code quality regression

The PowerXCell 8i processor has double-precision magnitude compare
instructions.  These used to be matched by combine against patterns like:

(define_insn "cmeq_<mode>_celledp"
  [(set (match_operand:<DF2I> 0 "spu_reg_operand" "=r")
        (eq:<DF2I> (abs:VDF (match_operand:VDF 1 "spu_reg_operand" "r"))
                   (abs:VDF (match_operand:VDF 2 "spu_reg_operand" "r"))))]
  "spu_arch == PROCESSOR_CELLEDP"
  "dfcmeq\t%0,%1,%2"
  [(set_attr "type" "fpd")])

However, after your patch these instructions never match, because there
is no ABS RTX any more.  Those have been split into AND by:

(define_insn_and_split "_abs<mode>2"
  [(set (match_operand:VSDF 0 "spu_reg_operand" "=r")
        (abs:VSDF (match_operand:VSDF 1 "spu_reg_operand" "r")))
   (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
  ""
  "#"
  ""
  [(set (match_dup:<F2I> 3)
        (and:<F2I> (match_dup:<F2I> 4)
                   (match_dup:<F2I> 2)))]

which now already runs in the split0 pass (before combine).

I think the proper fix is to simply add "split0_completed" to the 
condition of this splitter -- there doesn't seem to be any gain from
running this splitter early.  (I suspect there are some other splitters
that probably should be treated likewise, but I don't have a specific
regression except the ABS case.)


The following patch contains my suggested fixes (as discussed above),
except for the REGNO_POINTER_ALIGN breakage.  Tested on mainline and
4.3 with no regressions; it fixes all regressions of the original patch
(except the mentioned REGNO_POINTER_ALIGN breakage).  What do you think?

Bye,
Ulrich


ChangeLog:

	* config/spu/spu.h (ADDRESSES_NEVER_TRAP): Define.
	* rtlanal.c (rtx_addr_can_trap_p_1): Respect ADDRESSES_NEVER_TRAP macro.
	* doc/tm.texi (ADDRESSES_NEVER_TRAP): Document.

	* config/spu/spu.c (spu_split_load): Trust MEM_ALIGN.  When not
	optimizing, do not split load unless necessary.

	* config/spu/spu.md ("_abs<mode>2"): Do not split in split0 pass.


diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.c gcc-4_3/gcc/config/spu/spu.c
*** gcc-4_3-orig/gcc/config/spu/spu.c	2008-09-10 22:09:24.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.c	2008-09-11 00:40:35.000000000 +0200
*************** spu_split_load (rtx * ops)
*** 3596,3602 ****
  
    rot = 0;
    rot_amt = 0;
!   if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
           aligned reg   + aligned reg     => lqx
--- 3596,3605 ----
  
    rot = 0;
    rot_amt = 0;
! 
!   if (MEM_ALIGN (ops[1]) >= 128)
!     /* Address is already aligned; simply perform a TImode load.  */;
!   else if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
           aligned reg   + aligned reg     => lqx
*************** spu_split_load (rtx * ops)
*** 3707,3712 ****
--- 3710,3723 ----
        rot_amt = 0;
      }
  
+   /* If the source is properly aligned, we don't need to split this insn into
+      a TImode load plus a _spu_convert.  However, we want to perform the split
+      anyway when optimizing to make the MEMs look the same as those used for
+      stores so they are more easily merged.  When *not* optimizing, that will
+      not happen anyway, so we prefer to avoid generating the _spu_convert.  */
+   if (!rot && !rot_amt && !optimize)
+     return 0;
+ 
    load = gen_reg_rtx (TImode);
  
    mem = change_address (ops[1], TImode, copy_rtx (addr));
diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.h gcc-4_3/gcc/config/spu/spu.h
*** gcc-4_3-orig/gcc/config/spu/spu.h	2008-09-10 22:09:24.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.h	2008-09-10 21:19:30.000000000 +0200
*************** extern GTY(()) rtx spu_compare_op1;
*** 640,642 ****
--- 640,644 ----
  
  #define SPLIT_BEFORE_CSE2 1
  
+ #define ADDRESSES_NEVER_TRAP 1
+ 
diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.md gcc-4_3/gcc/config/spu/spu.md
*** gcc-4_3-orig/gcc/config/spu/spu.md	2008-09-10 22:09:32.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.md	2008-09-10 20:09:59.000000000 +0200
***************
*** 1246,1252 ****
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   ""
    [(set (match_dup:<F2I> 3)
  	(and:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
--- 1246,1252 ----
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   "split0_completed"
    [(set (match_dup:<F2I> 3)
  	(and:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
diff -crNp -x .svn gcc-4_3-orig/gcc/doc/tm.texi gcc-4_3/gcc/doc/tm.texi
*** gcc-4_3-orig/gcc/doc/tm.texi	2008-09-10 22:09:25.000000000 +0200
--- gcc-4_3/gcc/doc/tm.texi	2008-09-10 21:43:46.000000000 +0200
*************** optimizations before this pass work bett
*** 10384,10386 ****
--- 10384,10392 ----
  instructions, and the optimizations right after this pass (e.g., CSE and
  combine) are be able to optimize the split instructions.
  @end defmac
+ 
+ @defmac ADDRESSES_NEVER_TRAP
+ Define this macro if memory accesses will never cause a trap.
+ This is the case for example on the Cell SPU processor.
+ @end defmac
+ 
diff -crNp -x .svn gcc-4_3-orig/gcc/rtlanal.c gcc-4_3/gcc/rtlanal.c
*** gcc-4_3-orig/gcc/rtlanal.c	2008-03-05 19:44:55.000000000 +0100
--- gcc-4_3/gcc/rtlanal.c	2008-09-10 21:18:53.000000000 +0200
*************** rtx_varies_p (const_rtx x, bool for_alia
*** 265,270 ****
--- 265,274 ----
  static int
  rtx_addr_can_trap_p_1 (const_rtx x, enum machine_mode mode, bool unaligned_mems)
  {
+ #ifdef ADDRESSES_NEVER_TRAP
+   /* On some processors, like the SPU, memory accesses never trap.  */
+   return 0;
+ #else
    enum rtx_code code = GET_CODE (x);
  
    switch (code)
*************** rtx_addr_can_trap_p_1 (const_rtx x, enum
*** 344,349 ****
--- 348,354 ----
  
    /* If it isn't one of the case above, it can cause a trap.  */
    return 1;
+ #endif
  }
  
  /* Return nonzero if the use of X as an address in a MEM can cause a trap.  */

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  Ulrich.Weigand@de.ibm.com


* Re: [PATCH, SPU] generated better code for loads and stores
  2008-09-11 11:50 ` Ulrich Weigand
@ 2008-09-12  3:37   ` Trevor_Smigiel
  2008-09-12 14:24     ` Ulrich Weigand
  0 siblings, 1 reply; 7+ messages in thread
From: Trevor_Smigiel @ 2008-09-12  3:37 UTC (permalink / raw)
  To: Ulrich Weigand; +Cc: gcc-patches, andrew_pinski, rguenther

Ulrich,

Thanks for the great test and analysis.  The patch looks good except for
one change:

> +   /* If the source is properly aligned, we don't need to split this insn into
> +      a TImode load plus a _spu_convert.  However, we want to perform the split
> +      anyway when optimizing to make the MEMs look the same as those used for
> +      stores so they are more easily merged.  When *not* optimizing, that will
> +      not happen anyway, so we prefer to avoid generating the _spu_convert.  */
> +   if (!rot && !rot_amt && !optimize
          && spu_legitimate_address (mode, addr, 0, 1))
> +     return 0;

After split0 every address must satisfy LEGITIMATE_ADDRESS.  I can't
think of a test case off the top of my head, but it is better to be safe
and make sure the address is valid.  The code after this that changes
the load to TImode forces it to be valid.

Also, the floating point neg patterns might benefit from being split
later.

I'll merge your changes into my patch, retest and resubmit.

Trevor



* Re: [PATCH, SPU] generated better code for loads and stores
  2008-09-12  3:37   ` Trevor_Smigiel
@ 2008-09-12 14:24     ` Ulrich Weigand
  0 siblings, 0 replies; 7+ messages in thread
From: Ulrich Weigand @ 2008-09-12 14:24 UTC (permalink / raw)
  To: Trevor_Smigiel; +Cc: gcc-patches, andrew_pinski, rguenther

Trevor,

> After split0 every address must satisfy LEGITIMATE_ADDRESS.  I can't
> think of a test case off the top of my head, but it is better to be safe
> and make sure the address is valid.  The code after this that changes
> the load to TImode forces it to be valid.

I had assumed that if the address passed the tests in spu_split_load
up to this point, and no rotate was required, the address *must* be
legitimate ...   But in any case, adding a check certainly cannot hurt.

> Also, the floating point neg patterns might benefit from being split
> later.
> 
> I'll merge your changes into my patch, retest and resubmit.

Thanks,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  Ulrich.Weigand@de.ibm.com


* [PATCH] add optional split pass before CSE2
  2008-08-29  0:22 [PATCH, SPU] generated better code for loads and stores Trevor_Smigiel
  2008-09-05 21:51 ` Trevor_Smigiel
  2008-09-11 11:50 ` Ulrich Weigand
@ 2009-05-04 21:15 ` Trevor_Smigiel
  2009-05-23  3:37   ` [SPU, PATCH] Split load and store instructions during expand Trevor_Smigiel
  2 siblings, 1 reply; 7+ messages in thread
From: Trevor_Smigiel @ 2009-05-04 21:15 UTC (permalink / raw)
  To: gcc-patches; +Cc: Ulrich Weigand, andrew_pinski

[-- Attachment #1: Type: text/plain, Size: 3158 bytes --]

Hi,

On SPU we can get better code generation for loads and stores when we
add an extra split pass just before cse2.  It is better because we can
provide the earlier RTL passes with a simplified RTL description and then
split it into something closer to the actual machine instructions, which the
later RTL passes (especially cse2 and combine) can improve.  For loads
and stores on SPU it means the early passes have correct alias
information and later passes are able to merge common loads and stores.
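
For example (an illustrative sketch, not the new split0-1.c test), two
scalar stores that land in the same quadword each expand to a
load-insert-store sequence after split0, and cse2/combine can then share
the quadword load and merge the stores:

/* Illustrative only: with the struct 16-byte aligned, both fields live
   in one quadword, so both read-modify-write sequences touch the same
   TImode memory.  */
struct pair { int a; int b; } __attribute__ ((aligned (16)));

void
set_pair (struct pair *p)
{
  p->a = 1;
  p->b = 2;
}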

This can potentially help any target that has complicated operations
where creating multiple instructions at expand time is too early and
splitting at split1 is too late.

To make it optional I added a new target macro,
  #define SPLIT_BEFORE_CSE2 

To allow backends to control which patterns are split I added
  int split0_completed;

I originally submitted this patch in August 2008.  This message contains
more details of how it benefits SPU.
  http://gcc.gnu.org/ml/gcc-patches/2008-08/msg02110.html

This new patch includes related fixes to abs and neg from Ulrich Weigand.

Bootstrapped and tested on x86 with no new failures. 

On SPU, there is one new failure, and a few new passes.  A loop in
sms-3.c is not modulo scheduled because the splitting changes the code
and aliasing information.  The new passes appear because the predicates of
stack_protect_set are updated and no longer cause an ICE.

Ok for mainline?

Thanks,
Trevor

2009-05-04  Trevor Smigiel <Trevor_Smigiel@playstation.sony.com>
            Ulrich Weigand <uweigand@de.ibm.com>
	
	* doc/tm.texi (SPLIT_BEFORE_CSE2): Document.
	* tree-pass.h (pass_split_before_cse2): Declare.
	* final.c (rest_of_clean_state): Initialize split0_completed.
	* recog.c (split0_completed): Define.
	(gate_handle_split_before_cse2, rest_of_handle_split_before_cse2):
	New functions.
	(pass_split_before_cse2): New pass.
	* rtl.h (split0_completed): Declare.
        * passes.c (init_optimization_passes): Add pass_split_before_cse2
        before pass_cse2.
	* config/spu/spu-protos.h (spu_legitimate_address): Add
	for_split argument.
	(aligned_mem_p, spu_valid_move): Remove prototypes.
	(spu_split_load, spu_split_store): Change return type to int.
	* config/spu/predicates.md (spu_mem_operand): Remove.
	(spu_dest_operand): Add.
	* config/spu/spu-builtins.md (spu_lqd, spu_lqx, spu_lqa,
	spu_lqr, spu_stqd, spu_stqx, spu_stqa, spu_stqr): Remove AND
	operation.
	* config/spu/spu.c (regno_aligned_for_load): Remove.
	(reg_aligned_for_addr, address_needs_split): New functions.
	(spu_legitimate_address, spu_expand_mov, spu_split_load,
	spu_split_store): Update.
	(spu_init_expanders): Pregenerate a couple of pseudo-registers.
	* config/spu/spu.h (REG_ALIGN, SPLIT_BEFORE_CSE2): Define.
	(GO_IF_LEGITIMATE_ADDRESS): Update for spu_legitimate_address.
	* config/spu/spu.md (_mov<mode>, _movdi, _movti): Update
	predicates.
	(load, store): Change to define_split.
        (_neg<mode>2, _abs<mode>2): Do not split early.
        (stack_protect_set, stack_protect_test, stack_protect_test_si):
        Change spu_mem_operand to memory_operand.

testsuite/
	* testsuite/gcc.target/spu/split0-1.c: Add test.



[-- Attachment #2: split0.patch --]
[-- Type: text/x-diff, Size: 50664 bytes --]

Index: gcc/gcc/doc/tm.texi
===================================================================
*** gcc/gcc/doc/tm.texi	(revision 146906)
--- gcc/gcc/doc/tm.texi	(working copy)
*************** cannot safely move arguments from the re
*** 10731,10733 ****
--- 10731,10745 ----
  to the stack.  Therefore, this hook should return true in general, but
  false for naked functions.  The default implementation always returns true.
  @end deftypefn
+ 
+ @defmac SPLIT_BEFORE_CSE2
+ If defined, the value of this macro determines whether to use an
+ additional split pass before the second CSE pass.
+ @code{split0_completed} will be set after this pass is completed.  
+ 
+ For example, the Cell SPU target uses this for better optimization of
+ the multiple instructions required to do simple loads and stores.  The
+ optimizations before this pass work better on simple memory
+ instructions, and the optimizations after this pass (e.g., CSE and
+ combine) are able to optimize the split instructions.
+ @end defmac
Index: gcc/gcc/tree-pass.h
===================================================================
*** gcc/gcc/tree-pass.h	(revision 146906)
--- gcc/gcc/tree-pass.h	(working copy)
*************** extern struct rtl_opt_pass pass_rtl_dolo
*** 453,458 ****
--- 453,459 ----
  extern struct rtl_opt_pass pass_rtl_loop_done;
  
  extern struct rtl_opt_pass pass_web;
+ extern struct rtl_opt_pass pass_split_before_cse2;
  extern struct rtl_opt_pass pass_cse2;
  extern struct rtl_opt_pass pass_df_initialize_opt;
  extern struct rtl_opt_pass pass_df_initialize_no_opt;
Index: gcc/gcc/final.c
===================================================================
*** gcc/gcc/final.c	(revision 146906)
--- gcc/gcc/final.c	(working copy)
*************** rest_of_clean_state (void)
*** 4304,4309 ****
--- 4304,4312 ----
  #ifdef STACK_REGS
    regstack_completed = 0;
  #endif
+ #ifdef SPLIT_BEFORE_CSE2
+   split0_completed = 0;
+ #endif
  
    /* Clear out the insn_length contents now that they are no
       longer valid.  */
Index: gcc/gcc/recog.c
===================================================================
*** gcc/gcc/recog.c	(revision 146906)
--- gcc/gcc/recog.c	(working copy)
*************** int reload_completed;
*** 103,108 ****
--- 103,113 ----
  /* Nonzero after thread_prologue_and_epilogue_insns has run.  */
  int epilogue_completed;
  
+ #ifdef SPLIT_BEFORE_CSE2
+ /* Nonzero after split0 pass has run.  */
+ int split0_completed;
+ #endif
+ 
  /* Initialize data used by the function `recog'.
     This must be called once in the compilation of a function
     before any insn recognition may be done in the function.  */
*************** struct rtl_opt_pass pass_split_for_short
*** 3652,3654 ****
--- 3657,3698 ----
    TODO_dump_func | TODO_verify_rtl_sharing /* todo_flags_finish */
   }
  };
+ 
+ static bool
+ gate_handle_split_before_cse2 (void)
+ {
+ #ifdef SPLIT_BEFORE_CSE2
+   return SPLIT_BEFORE_CSE2;
+ #else
+   return 0;
+ #endif
+ }
+ 
+ static unsigned int
+ rest_of_handle_split_before_cse2 (void)
+ {
+ #ifdef SPLIT_BEFORE_CSE2
+   split_all_insns_noflow ();
+   split0_completed = 1;
+ #endif
+   return 0;
+ }
+ 
+ struct rtl_opt_pass pass_split_before_cse2 =
+ {
+  {
+   RTL_PASS,
+   "split0",                             /* name */
+   gate_handle_split_before_cse2,        /* gate */
+   rest_of_handle_split_before_cse2,     /* execute */
+   NULL,                                 /* sub */
+   NULL,                                 /* next */
+   0,                                    /* static_pass_number */
+   TV_NONE,                              /* tv_id */
+   0,                                    /* properties_required */
+   0,                                    /* properties_provided */
+   0,                                    /* properties_destroyed */
+   0,                                    /* todo_flags_start */
+   TODO_dump_func,                       /* todo_flags_finish */
+  }
+ };
Index: gcc/gcc/rtl.h
===================================================================
*** gcc/gcc/rtl.h	(revision 146906)
--- gcc/gcc/rtl.h	(working copy)
*************** extern int reload_completed;
*** 2041,2046 ****
--- 2041,2051 ----
  /* Nonzero after thread_prologue_and_epilogue_insns has run.  */
  extern int epilogue_completed;
  
+ #ifdef SPLIT_BEFORE_CSE2
+ /* Nonzero after the split0 pass has completed. */
+ extern int split0_completed;
+ #endif
+ 
  /* Set to 1 while reload_as_needed is operating.
     Required by some machines to handle any generated moves differently.  */
  
Index: gcc/gcc/passes.c
===================================================================
*** gcc/gcc/passes.c	(revision 146906)
--- gcc/gcc/passes.c	(working copy)
*************** init_optimization_passes (void)
*** 752,757 ****
--- 752,758 ----
  	}
        NEXT_PASS (pass_web);
        NEXT_PASS (pass_rtl_cprop);
+       NEXT_PASS (pass_split_before_cse2);
        NEXT_PASS (pass_cse2);
        NEXT_PASS (pass_rtl_dse1);
        NEXT_PASS (pass_rtl_fwprop_addr);
Index: gcc/gcc/config/spu/spu-protos.h
===================================================================
*** gcc/gcc/config/spu/spu-protos.h	(revision 146906)
--- gcc/gcc/config/spu/spu-protos.h	(working copy)
*************** extern bool exp2_immediate_p (rtx op, en
*** 56,62 ****
  extern int spu_constant_address_p (rtx x);
  extern int spu_legitimate_constant_p (rtx x);
  extern int spu_legitimate_address (enum machine_mode mode, rtx x,
! 				   int reg_ok_strict);
  extern rtx spu_legitimize_address (rtx x, rtx oldx, enum machine_mode mode);
  extern int spu_initial_elimination_offset (int from, int to);
  extern rtx spu_function_value (const_tree type, const_tree func);
--- 56,62 ----
  extern int spu_constant_address_p (rtx x);
  extern int spu_legitimate_constant_p (rtx x);
  extern int spu_legitimate_address (enum machine_mode mode, rtx x,
! 				   int reg_ok_strict, int for_split);
  extern rtx spu_legitimize_address (rtx x, rtx oldx, enum machine_mode mode);
  extern int spu_initial_elimination_offset (int from, int to);
  extern rtx spu_function_value (const_tree type, const_tree func);
*************** extern void spu_setup_incoming_varargs (
*** 66,76 ****
  					tree type, int *pretend_size,
  					int no_rtl);
  extern void spu_conditional_register_usage (void);
- extern int aligned_mem_p (rtx mem);
  extern int spu_expand_mov (rtx * ops, enum machine_mode mode);
! extern void spu_split_load (rtx * ops);
! extern void spu_split_store (rtx * ops);
! extern int spu_valid_move (rtx * ops);
  extern int fsmbi_const_p (rtx x);
  extern int cpat_const_p (rtx x, enum machine_mode mode);
  extern rtx gen_cpat_const (rtx * ops);
--- 66,74 ----
  					tree type, int *pretend_size,
  					int no_rtl);
  extern void spu_conditional_register_usage (void);
  extern int spu_expand_mov (rtx * ops, enum machine_mode mode);
! extern int spu_split_load (rtx * ops);
! extern int spu_split_store (rtx * ops);
  extern int fsmbi_const_p (rtx x);
  extern int cpat_const_p (rtx x, enum machine_mode mode);
  extern rtx gen_cpat_const (rtx * ops);
Index: gcc/gcc/config/spu/predicates.md
===================================================================
*** gcc/gcc/config/spu/predicates.md	(revision 146906)
--- gcc/gcc/config/spu/predicates.md	(working copy)
*************** (define_predicate "spu_nonmem_operand"
*** 39,52 ****
         (ior (not (match_code "subreg"))
              (match_test "valid_subreg (op)"))))
  
- (define_predicate "spu_mem_operand"
-   (and (match_operand 0 "memory_operand")
-        (match_test "reload_in_progress || reload_completed || aligned_mem_p (op)")))
- 
  (define_predicate "spu_mov_operand"
!   (ior (match_operand 0 "spu_mem_operand")
         (match_operand 0 "spu_nonmem_operand")))
  
  (define_predicate "call_operand"
    (and (match_code "mem")
         (match_test "(!TARGET_LARGE_MEM && satisfies_constraint_S (op))
--- 39,52 ----
         (ior (not (match_code "subreg"))
              (match_test "valid_subreg (op)"))))
  
  (define_predicate "spu_mov_operand"
!   (ior (match_operand 0 "memory_operand")
         (match_operand 0 "spu_nonmem_operand")))
  
+ (define_predicate "spu_dest_operand"
+   (ior (match_operand 0 "memory_operand")
+        (match_operand 0 "spu_reg_operand")))
+ 
  (define_predicate "call_operand"
    (and (match_code "mem")
         (match_test "(!TARGET_LARGE_MEM && satisfies_constraint_S (op))
Index: gcc/gcc/config/spu/spu.c
===================================================================
*** gcc/gcc/config/spu/spu.c	(revision 146906)
--- gcc/gcc/config/spu/spu.c	(working copy)
*************** static tree spu_build_builtin_va_list (v
*** 189,197 ****
  static void spu_va_start (tree, rtx);
  static tree spu_gimplify_va_arg_expr (tree valist, tree type,
  				      gimple_seq * pre_p, gimple_seq * post_p);
- static int regno_aligned_for_load (int regno);
  static int store_with_one_insn_p (rtx mem);
  static int mem_is_padded_component_ref (rtx x);
  static bool spu_assemble_integer (rtx x, unsigned int size, int aligned_p);
  static void spu_asm_globalize_label (FILE * file, const char *name);
  static unsigned char spu_rtx_costs (rtx x, int code, int outer_code,
--- 189,197 ----
  static void spu_va_start (tree, rtx);
  static tree spu_gimplify_va_arg_expr (tree valist, tree type,
  				      gimple_seq * pre_p, gimple_seq * post_p);
  static int store_with_one_insn_p (rtx mem);
  static int mem_is_padded_component_ref (rtx x);
+ static int reg_aligned_for_addr (rtx x, int aligned);
  static bool spu_assemble_integer (rtx x, unsigned int size, int aligned_p);
  static void spu_asm_globalize_label (FILE * file, const char *name);
  static unsigned char spu_rtx_costs (rtx x, int code, int outer_code,
*************** spu_legitimate_constant_p (rtx x)
*** 3603,3626 ****
  /* Valid address are:
     - symbol_ref, label_ref, const
     - reg
!    - reg + const, where either reg or const is 16 byte aligned
     - reg + reg, alignment doesn't matter
    The alignment matters in the reg+const case because lqd and stqd
!   ignore the 4 least significant bits of the const.  (TODO: It might be
!   preferable to allow any alignment and fix it up when splitting.) */
  int
! spu_legitimate_address (enum machine_mode mode ATTRIBUTE_UNUSED,
! 			rtx x, int reg_ok_strict)
  {
!   if (mode == TImode && GET_CODE (x) == AND
!       && GET_CODE (XEXP (x, 1)) == CONST_INT
!       && INTVAL (XEXP (x, 1)) == (HOST_WIDE_INT) -16)
      x = XEXP (x, 0);
    switch (GET_CODE (x))
      {
-     case SYMBOL_REF:
      case LABEL_REF:
!       return !TARGET_LARGE_MEM;
  
      case CONST:
        if (!TARGET_LARGE_MEM && GET_CODE (XEXP (x, 0)) == PLUS)
--- 3603,3654 ----
  /* Valid address are:
     - symbol_ref, label_ref, const
     - reg
!    - reg + const, where const is 16 byte aligned
     - reg + reg, alignment doesn't matter
    The alignment matters in the reg+const case because lqd and stqd
!   ignore the 4 least significant bits of the const.  
! 
!   Addresses are handled in 4 phases. 
!   1) from the beginning of rtl expansion until the split0 pass.  Any
!      address is acceptable.  
!   2) The split0 pass. It is responsible for making every load and store
!      valid.  It calls legitimate_address with FOR_SPLIT set to 1.  This
!      is where non-16-byte aligned loads/stores are split into multiple
!      instructions to extract or insert just the part we care about.
!   3) From the split0 pass to the beginning of reload.  During this
!      phase the constant part of an address must be 16 byte aligned, and
!      we don't allow any loads/stores of less than 4 bytes.  We also
!      allow a mask of -16 to be part of the address as an optimization.
!   4) From reload until the end.  Reload can change the modes of loads
!      and stores to something smaller than 4-bytes which we need to allow
!      now, and it also adjusts the address to match.  So in this phase we
!      allow that special case.  Still allow addresses with a mask of -16.
! 
!   FOR_SPLIT is only set to 1 for phase 2, otherwise it is 0.  */
  int
! spu_legitimate_address (enum machine_mode mode, rtx x, int reg_ok_strict,
! 			int for_split)
  {
!   int aligned = (split0_completed || for_split)
!     && !reload_in_progress && !reload_completed;
!   int const_aligned = split0_completed || for_split;
!   if (GET_MODE_SIZE (mode) >= 16)
!     aligned = 0;
!   else if (aligned && GET_MODE_SIZE (mode) < 4)
!     return 0;
!   if (split0_completed
!       && (GET_CODE (x) == AND
! 	  && GET_CODE (XEXP (x, 1)) == CONST_INT
! 	  && INTVAL (XEXP (x, 1)) == (HOST_WIDE_INT) - 16
! 	  && !CONSTANT_P (XEXP (x, 0))))
      x = XEXP (x, 0);
    switch (GET_CODE (x))
      {
      case LABEL_REF:
!       return !TARGET_LARGE_MEM && !aligned;
! 
!     case SYMBOL_REF:
!       return !TARGET_LARGE_MEM && (!aligned || ALIGNED_SYMBOL_REF_P (x));
  
      case CONST:
        if (!TARGET_LARGE_MEM && GET_CODE (XEXP (x, 0)) == PLUS)
*************** spu_legitimate_address (enum machine_mod
*** 3628,3649 ****
  	  rtx sym = XEXP (XEXP (x, 0), 0);
  	  rtx cst = XEXP (XEXP (x, 0), 1);
  
- 	  /* Accept any symbol_ref + constant, assuming it does not
- 	     wrap around the local store addressability limit.  */
  	  if (GET_CODE (sym) == SYMBOL_REF && GET_CODE (cst) == CONST_INT)
! 	    return 1;
  	}
        return 0;
  
      case CONST_INT:
        return INTVAL (x) >= 0 && INTVAL (x) <= 0x3ffff;
  
      case SUBREG:
        x = XEXP (x, 0);
!       gcc_assert (GET_CODE (x) == REG);
  
      case REG:
!       return INT_REG_OK_FOR_BASE_P (x, reg_ok_strict);
  
      case PLUS:
      case LO_SUM:
--- 3656,3685 ----
  	  rtx sym = XEXP (XEXP (x, 0), 0);
  	  rtx cst = XEXP (XEXP (x, 0), 1);
  
  	  if (GET_CODE (sym) == SYMBOL_REF && GET_CODE (cst) == CONST_INT)
! 	    {
! 	      /* Check for alignment if required.  */
! 	      if (!aligned)
! 		return 1;
! 	      if ((INTVAL (cst) & 15) == 0 && ALIGNED_SYMBOL_REF_P (sym))
! 		return 1;
! 	    }
  	}
        return 0;
  
      case CONST_INT:
+       /* We don't test alignment here.  For an absolute address we
+          assume the user knows what they are doing. */
        return INTVAL (x) >= 0 && INTVAL (x) <= 0x3ffff;
  
      case SUBREG:
        x = XEXP (x, 0);
!       if (GET_CODE (x) != REG)
! 	return 0;
  
      case REG:
!       return INT_REG_OK_FOR_BASE_P (x, reg_ok_strict)
! 	&& reg_aligned_for_addr (x, 0);
  
      case PLUS:
      case LO_SUM:
*************** spu_legitimate_address (enum machine_mod
*** 3654,3674 ****
  	  op0 = XEXP (op0, 0);
  	if (GET_CODE (op1) == SUBREG)
  	  op1 = XEXP (op1, 0);
- 	/* We can't just accept any aligned register because CSE can
- 	   change it to a register that is not marked aligned and then
- 	   recog will fail.   So we only accept frame registers because
- 	   they will only be changed to other frame registers. */
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == CONST_INT
  	    && INTVAL (op1) >= -0x2000
  	    && INTVAL (op1) <= 0x1fff
! 	    && (regno_aligned_for_load (REGNO (op0)) || (INTVAL (op1) & 15) == 0))
  	  return 1;
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == REG
! 	    && INT_REG_OK_FOR_INDEX_P (op1, reg_ok_strict))
  	  return 1;
        }
        break;
--- 3690,3718 ----
  	  op0 = XEXP (op0, 0);
  	if (GET_CODE (op1) == SUBREG)
  	  op1 = XEXP (op1, 0);
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == CONST_INT
  	    && INTVAL (op1) >= -0x2000
  	    && INTVAL (op1) <= 0x1fff
! 	    && reg_aligned_for_addr (op0, 0)
! 	    && (!const_aligned
! 		|| (INTVAL (op1) & 15) == 0
! 		|| ((reload_in_progress || reload_completed)
! 		    && GET_MODE_SIZE (mode) < 4
! 		    && (INTVAL (op1) & 15) == 4 - GET_MODE_SIZE (mode))
! 		/* Some passes create a fake register for testing valid
! 		   addresses, be more lenient when we see those.  ivopts
! 		   and reload do it. */
! 		|| REGNO (op0) == LAST_VIRTUAL_REGISTER + 1
! 		|| REGNO (op0) == LAST_VIRTUAL_REGISTER + 2))
  	  return 1;
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
+ 	    && reg_aligned_for_addr (op0, 0)
  	    && GET_CODE (op1) == REG
! 	    && INT_REG_OK_FOR_INDEX_P (op1, reg_ok_strict)
! 	    && reg_aligned_for_addr (op1, 0))
  	  return 1;
        }
        break;
*************** spu_legitimize_address (rtx x, rtx oldx 
*** 3706,3712 ****
        else if (GET_CODE (op1) != REG)
  	op1 = force_reg (Pmode, op1);
        x = gen_rtx_PLUS (Pmode, op0, op1);
!       if (spu_legitimate_address (mode, x, 0))
  	return x;
      }
    return NULL_RTX;
--- 3750,3756 ----
        else if (GET_CODE (op1) != REG)
  	op1 = force_reg (Pmode, op1);
        x = gen_rtx_PLUS (Pmode, op0, op1);
!       if (spu_legitimate_address (mode, x, 0, 0))
  	return x;
      }
    return NULL_RTX;
*************** spu_conditional_register_usage (void)
*** 4131,4190 ****
      }
  }
  
! /* This is called to decide when we can simplify a load instruction.  We
!    must only return true for registers which we know will always be
!    aligned.  Taking into account that CSE might replace this reg with
!    another one that has not been marked aligned.  
!    So this is really only true for frame, stack and virtual registers,
!    which we know are always aligned and should not be adversely effected
!    by CSE.  */
  static int
! regno_aligned_for_load (int regno)
! {
!   return regno == FRAME_POINTER_REGNUM
!     || (frame_pointer_needed && regno == HARD_FRAME_POINTER_REGNUM)
!     || regno == ARG_POINTER_REGNUM
!     || regno == STACK_POINTER_REGNUM
!     || (regno >= FIRST_VIRTUAL_REGISTER 
! 	&& regno <= LAST_VIRTUAL_REGISTER);
! }
! 
! /* Return TRUE when mem is known to be 16-byte aligned. */
! int
! aligned_mem_p (rtx mem)
  {
!   if (MEM_ALIGN (mem) >= 128)
!     return 1;
!   if (GET_MODE_SIZE (GET_MODE (mem)) >= 16)
!     return 1;
!   if (GET_CODE (XEXP (mem, 0)) == PLUS)
!     {
!       rtx p0 = XEXP (XEXP (mem, 0), 0);
!       rtx p1 = XEXP (XEXP (mem, 0), 1);
!       if (regno_aligned_for_load (REGNO (p0)))
! 	{
! 	  if (GET_CODE (p1) == REG && regno_aligned_for_load (REGNO (p1)))
! 	    return 1;
! 	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15) == 0)
! 	    return 1;
! 	}
!     }
!   else if (GET_CODE (XEXP (mem, 0)) == REG)
!     {
!       if (regno_aligned_for_load (REGNO (XEXP (mem, 0))))
! 	return 1;
!     }
!   else if (ALIGNED_SYMBOL_REF_P (XEXP (mem, 0)))
      return 1;
!   else if (GET_CODE (XEXP (mem, 0)) == CONST)
!     {
!       rtx p0 = XEXP (XEXP (XEXP (mem, 0), 0), 0);
!       rtx p1 = XEXP (XEXP (XEXP (mem, 0), 0), 1);
!       if (GET_CODE (p0) == SYMBOL_REF
! 	  && GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15) == 0)
! 	return 1;
!     }
!   return 0;
  }
  
  /* Encode symbol attributes (local vs. global, tls model) of a SYMBOL_REF
--- 4175,4190 ----
      }
  }
  
! /* This is called any time we inspect the alignment of a register for
!    addresses.  */
  static int
! reg_aligned_for_addr (rtx x, int aligned)
  {
!   int regno =
!     REGNO (x) < FIRST_PSEUDO_REGISTER ? ORIGINAL_REGNO (x) : REGNO (x);
!   if (!aligned)
      return 1;
!   return REGNO_POINTER_ALIGN (regno) >= 128;
  }
  
  /* Encode symbol attributes (local vs. global, tls model) of a SYMBOL_REF
*************** spu_encode_section_info (tree decl, rtx 
*** 4213,4221 ****
  static int
  store_with_one_insn_p (rtx mem)
  {
    rtx addr = XEXP (mem, 0);
!   if (GET_MODE (mem) == BLKmode)
      return 0;
    /* Only static objects. */
    if (GET_CODE (addr) == SYMBOL_REF)
      {
--- 4213,4224 ----
  static int
  store_with_one_insn_p (rtx mem)
  {
+   enum machine_mode mode = GET_MODE (mem);
    rtx addr = XEXP (mem, 0);
!   if (mode == BLKmode)
      return 0;
+   if (GET_MODE_SIZE (mode) >= 16)
+     return 1;
    /* Only static objects. */
    if (GET_CODE (addr) == SYMBOL_REF)
      {
*************** store_with_one_insn_p (rtx mem)
*** 4239,4244 ****
--- 4242,4263 ----
    return 0;
  }
  
+ /* Return 1 when the address is not valid for a simple load and store as
+    required by the '_mov*' patterns.   We could make this less strict
+    for loads, but we prefer mem's to look the same so they are more
+    likely to be merged.  */
+ static int
+ address_needs_split (rtx mem)
+ {
+   if (GET_MODE_SIZE (GET_MODE (mem)) < 16
+       && (GET_MODE_SIZE (GET_MODE (mem)) < 4
+ 	  || !(store_with_one_insn_p (mem)
+ 	       || mem_is_padded_component_ref (mem))))
+     return 1;
+ 
+   return 0;
+ }
+ 
  int
  spu_expand_mov (rtx * ops, enum machine_mode mode)
  {
*************** spu_expand_mov (rtx * ops, enum machine_
*** 4285,4309 ****
      }
    else
      {
-       if (GET_CODE (ops[0]) == MEM)
- 	{
- 	  if (!spu_valid_move (ops))
- 	    {
- 	      emit_insn (gen_store (ops[0], ops[1], gen_reg_rtx (TImode),
- 				    gen_reg_rtx (TImode)));
- 	      return 1;
- 	    }
- 	}
-       else if (GET_CODE (ops[1]) == MEM)
- 	{
- 	  if (!spu_valid_move (ops))
- 	    {
- 	      emit_insn (gen_load
- 			 (ops[0], ops[1], gen_reg_rtx (TImode),
- 			  gen_reg_rtx (SImode)));
- 	      return 1;
- 	    }
- 	}
        /* Catch the SImode immediates greater than 0x7fffffff, and sign
           extend them. */
        if (GET_CODE (ops[1]) == CONST_INT)
--- 4304,4309 ----
*************** spu_expand_mov (rtx * ops, enum machine_
*** 4319,4325 ****
    return 0;
  }
  
! void
  spu_split_load (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
--- 4319,4325 ----
    return 0;
  }
  
! int
  spu_split_load (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
*************** spu_split_load (rtx * ops)
*** 4327,4336 ****
    int rot_amt;
  
    addr = XEXP (ops[1], 0);
  
    rot = 0;
    rot_amt = 0;
!   if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
           aligned reg   + aligned reg     => lqx
--- 4327,4349 ----
    int rot_amt;
  
    addr = XEXP (ops[1], 0);
+   gcc_assert (GET_CODE (addr) != AND);
+ 
+   if (!address_needs_split (ops[1]))
+     {
+       if (spu_legitimate_address (mode, addr, 0, 1))
+ 	return 0;
+       ops[1] = change_address (ops[1], VOIDmode, force_reg (Pmode, addr));
+       emit_move_insn (ops[0], ops[1]);
+       return 1;
+     }
  
    rot = 0;
    rot_amt = 0;
! 
!   if (MEM_ALIGN (ops[1]) >= 128)
!     /* Address is already aligned; simply perform a TImode load.  */;
!   else if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
           aligned reg   + aligned reg     => lqx
*************** spu_split_load (rtx * ops)
*** 4344,4355 ****
         */
        p0 = XEXP (addr, 0);
        p1 = XEXP (addr, 1);
!       if (REG_P (p0) && !regno_aligned_for_load (REGNO (p0)))
  	{
! 	  if (REG_P (p1) && !regno_aligned_for_load (REGNO (p1)))
  	    {
! 	      emit_insn (gen_addsi3 (ops[3], p0, p1));
! 	      rot = ops[3];
  	    }
  	  else
  	    rot = p0;
--- 4357,4388 ----
         */
        p0 = XEXP (addr, 0);
        p1 = XEXP (addr, 1);
!       if (!reg_aligned_for_addr (p0, 1))
  	{
! 	  if (GET_CODE (p1) == REG && !reg_aligned_for_addr (p1, 1))
  	    {
! 	      rot = gen_reg_rtx (SImode);
! 	      emit_insn (gen_addsi3 (rot, p0, p1));
! 	    }
! 	  else if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
! 	    {
! 	      if (INTVAL (p1) > 0
! 		  && INTVAL (p1) * BITS_PER_UNIT < REG_ALIGN (p0))
! 		{
! 		  rot = gen_reg_rtx (SImode);
! 		  emit_insn (gen_addsi3 (rot, p0, p1));
! 		  addr = p0;
! 		}
! 	      else
! 		{
! 		  rtx x = gen_reg_rtx (SImode);
! 		  emit_move_insn (x, p1);
! 		  if (!spu_arith_operand (p1, SImode))
! 		    p1 = x;
! 		  rot = gen_reg_rtx (SImode);
! 		  emit_insn (gen_addsi3 (rot, p0, p1));
! 		  addr = gen_rtx_PLUS (Pmode, p0, x);
! 		}
  	    }
  	  else
  	    rot = p0;
*************** spu_split_load (rtx * ops)
*** 4359,4374 ****
  	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
  	    {
  	      rot_amt = INTVAL (p1) & 15;
! 	      p1 = GEN_INT (INTVAL (p1) & -16);
! 	      addr = gen_rtx_PLUS (SImode, p0, p1);
  	    }
! 	  else if (REG_P (p1) && !regno_aligned_for_load (REGNO (p1)))
  	    rot = p1;
  	}
      }
    else if (GET_CODE (addr) == REG)
      {
!       if (!regno_aligned_for_load (REGNO (addr)))
  	rot = addr;
      }
    else if (GET_CODE (addr) == CONST)
--- 4392,4412 ----
  	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
  	    {
  	      rot_amt = INTVAL (p1) & 15;
! 	      if (INTVAL (p1) & -16)
! 		{
! 		  p1 = GEN_INT (INTVAL (p1) & -16);
! 		  addr = gen_rtx_PLUS (SImode, p0, p1);
! 		}
! 	      else
! 		addr = p0;
  	    }
! 	  else if (GET_CODE (p1) == REG && !reg_aligned_for_addr (p1, 1))
  	    rot = p1;
  	}
      }
    else if (GET_CODE (addr) == REG)
      {
!       if (!reg_aligned_for_addr (addr, 1))
  	rot = addr;
      }
    else if (GET_CODE (addr) == CONST)
*************** spu_split_load (rtx * ops)
*** 4387,4393 ****
  	    addr = XEXP (XEXP (addr, 0), 0);
  	}
        else
! 	rot = addr;
      }
    else if (GET_CODE (addr) == CONST_INT)
      {
--- 4425,4434 ----
  	    addr = XEXP (XEXP (addr, 0), 0);
  	}
        else
! 	{
! 	  rot = gen_reg_rtx (Pmode);
! 	  emit_move_insn (rot, addr);
! 	}
      }
    else if (GET_CODE (addr) == CONST_INT)
      {
*************** spu_split_load (rtx * ops)
*** 4395,4401 ****
        addr = GEN_INT (rot_amt & -16);
      }
    else if (!ALIGNED_SYMBOL_REF_P (addr))
!     rot = addr;
  
    if (GET_MODE_SIZE (mode) < 4)
      rot_amt += GET_MODE_SIZE (mode) - 4;
--- 4436,4445 ----
        addr = GEN_INT (rot_amt & -16);
      }
    else if (!ALIGNED_SYMBOL_REF_P (addr))
!     {
!       rot = gen_reg_rtx (Pmode);
!       emit_move_insn (rot, addr);
!     }
  
    if (GET_MODE_SIZE (mode) < 4)
      rot_amt += GET_MODE_SIZE (mode) - 4;
*************** spu_split_load (rtx * ops)
*** 4404,4418 ****
  
    if (rot && rot_amt)
      {
!       emit_insn (gen_addsi3 (ops[3], rot, GEN_INT (rot_amt)));
!       rot = ops[3];
        rot_amt = 0;
      }
  
!   load = ops[2];
  
!   addr = gen_rtx_AND (SImode, copy_rtx (addr), GEN_INT (-16));
!   mem = change_address (ops[1], TImode, addr);
  
    emit_insn (gen_movti (load, mem));
  
--- 4448,4471 ----
  
    if (rot && rot_amt)
      {
!       rtx x = gen_reg_rtx (SImode);
!       emit_insn (gen_addsi3 (x, rot, GEN_INT (rot_amt)));
!       rot = x;
        rot_amt = 0;
      }
  
!   /* If the source is properly aligned, we don't need to split this insn into
!      a TImode load plus a _spu_convert.  However, we want to perform the split
!      anyway when optimizing to make the MEMs look the same as those used for
!      stores so they are more easily merged.  When *not* optimizing, that will
!      not happen anyway, so we prefer to avoid generating the _spu_convert.  */
!   if (!rot && !rot_amt && !optimize
!       && spu_legitimate_address (mode, addr, 0, 1))
!     return 0;
! 
!   load = gen_reg_rtx (TImode);
  
!   mem = change_address (ops[1], TImode, copy_rtx (addr));
  
    emit_insn (gen_movti (load, mem));
  
*************** spu_split_load (rtx * ops)
*** 4421,4443 ****
    else if (rot_amt)
      emit_insn (gen_rotlti3 (load, load, GEN_INT (rot_amt * 8)));
  
!   if (reload_completed)
!     emit_move_insn (ops[0], gen_rtx_REG (GET_MODE (ops[0]), REGNO (load)));
!   else
!     emit_insn (gen_spu_convert (ops[0], load));
  }
  
! void
  spu_split_store (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
!   rtx pat = ops[2];
!   rtx reg = ops[3];
    rtx addr, p0, p1, p1_lo, smem;
    int aform;
    int scalar;
  
    addr = XEXP (ops[0], 0);
  
    if (GET_CODE (addr) == PLUS)
      {
--- 4474,4504 ----
    else if (rot_amt)
      emit_insn (gen_rotlti3 (load, load, GEN_INT (rot_amt * 8)));
  
!   emit_insn (gen_spu_convert (ops[0], load));
!   return 1;
  }
  
! int
  spu_split_store (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
!   rtx reg;
    rtx addr, p0, p1, p1_lo, smem;
    int aform;
    int scalar;
  
+   if (!address_needs_split (ops[0]))
+     {
+       addr = XEXP (ops[0], 0);
+       if (spu_legitimate_address (mode, addr, 0, 1))
+ 	return 0;
+       ops[0] = change_address (ops[0], VOIDmode, force_reg (Pmode, addr));
+       emit_move_insn (ops[0], ops[1]);
+       return 1;
+     }
+ 
    addr = XEXP (ops[0], 0);
+   gcc_assert (GET_CODE (addr) != AND);
  
    if (GET_CODE (addr) == PLUS)
      {
*************** spu_split_store (rtx * ops)
*** 4449,4455 ****
           unaligned reg + aligned reg     => lqx, c?x, shuf, stqx
           unaligned reg + unaligned reg   => lqx, c?x, shuf, stqx
           unaligned reg + aligned const   => lqd, c?d, shuf, stqx
!          unaligned reg + unaligned const -> not allowed by legitimate address
         */
        aform = 0;
        p0 = XEXP (addr, 0);
--- 4510,4516 ----
           unaligned reg + aligned reg     => lqx, c?x, shuf, stqx
           unaligned reg + unaligned reg   => lqx, c?x, shuf, stqx
           unaligned reg + aligned const   => lqd, c?d, shuf, stqx
!          unaligned reg + unaligned const -> lqx, c?d, shuf, stqx
         */
        aform = 0;
        p0 = XEXP (addr, 0);
*************** spu_split_store (rtx * ops)
*** 4457,4464 ****
        if (GET_CODE (p0) == REG && GET_CODE (p1) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (p1) & 15);
! 	  p1 = GEN_INT (INTVAL (p1) & -16);
! 	  addr = gen_rtx_PLUS (SImode, p0, p1);
  	}
      }
    else if (GET_CODE (addr) == REG)
--- 4518,4537 ----
        if (GET_CODE (p0) == REG && GET_CODE (p1) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (p1) & 15);
! 	  if (reg_aligned_for_addr (p0, 1))
! 	    {
! 	      p1 = GEN_INT (INTVAL (p1) & -16);
! 	      if (p1 == const0_rtx)
! 		addr = p0;
! 	      else
! 		addr = gen_rtx_PLUS (SImode, p0, p1);
! 	    }
! 	  else
! 	    {
! 	      rtx x = gen_reg_rtx (SImode);
! 	      emit_move_insn (x, p1);
! 	      addr = gen_rtx_PLUS (SImode, p0, x);
! 	    }
  	}
      }
    else if (GET_CODE (addr) == REG)
*************** spu_split_store (rtx * ops)
*** 4475,4505 ****
        p1_lo = addr;
        if (ALIGNED_SYMBOL_REF_P (addr))
  	p1_lo = const0_rtx;
!       else if (GET_CODE (addr) == CONST)
  	{
! 	  if (GET_CODE (XEXP (addr, 0)) == PLUS
! 	      && ALIGNED_SYMBOL_REF_P (XEXP (XEXP (addr, 0), 0))
! 	      && GET_CODE (XEXP (XEXP (addr, 0), 1)) == CONST_INT)
! 	    {
! 	      HOST_WIDE_INT v = INTVAL (XEXP (XEXP (addr, 0), 1));
! 	      if ((v & -16) != 0)
! 		addr = gen_rtx_CONST (Pmode,
! 				      gen_rtx_PLUS (Pmode,
! 						    XEXP (XEXP (addr, 0), 0),
! 						    GEN_INT (v & -16)));
! 	      else
! 		addr = XEXP (XEXP (addr, 0), 0);
! 	      p1_lo = GEN_INT (v & 15);
! 	    }
  	}
        else if (GET_CODE (addr) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (addr) & 15);
  	  addr = GEN_INT (INTVAL (addr) & -16);
  	}
      }
  
!   addr = gen_rtx_AND (SImode, copy_rtx (addr), GEN_INT (-16));
  
    scalar = store_with_one_insn_p (ops[0]);
    if (!scalar)
--- 4548,4581 ----
        p1_lo = addr;
        if (ALIGNED_SYMBOL_REF_P (addr))
  	p1_lo = const0_rtx;
!       else if (GET_CODE (addr) == CONST
! 	       && GET_CODE (XEXP (addr, 0)) == PLUS
! 	       && ALIGNED_SYMBOL_REF_P (XEXP (XEXP (addr, 0), 0))
! 	       && GET_CODE (XEXP (XEXP (addr, 0), 1)) == CONST_INT)
  	{
! 	  HOST_WIDE_INT v = INTVAL (XEXP (XEXP (addr, 0), 1));
! 	  if ((v & -16) != 0)
! 	    addr = gen_rtx_CONST (Pmode,
! 				  gen_rtx_PLUS (Pmode,
! 						XEXP (XEXP (addr, 0), 0),
! 						GEN_INT (v & -16)));
! 	  else
! 	    addr = XEXP (XEXP (addr, 0), 0);
! 	  p1_lo = GEN_INT (v & 15);
  	}
        else if (GET_CODE (addr) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (addr) & 15);
  	  addr = GEN_INT (INTVAL (addr) & -16);
  	}
+       else
+ 	{
+ 	  p1_lo = gen_reg_rtx (SImode);
+ 	  emit_move_insn (p1_lo, addr);
+ 	}
      }
  
!   reg = gen_reg_rtx (TImode);
  
    scalar = store_with_one_insn_p (ops[0]);
    if (!scalar)
*************** spu_split_store (rtx * ops)
*** 4509,4519 ****
           possible, and copying the flags will prevent that in certain
           cases, e.g. consider the volatile flag. */
  
        rtx lmem = change_address (ops[0], TImode, copy_rtx (addr));
        set_mem_alias_set (lmem, 0);
        emit_insn (gen_movti (reg, lmem));
  
!       if (!p0 || regno_aligned_for_load (REGNO (p0)))
  	p0 = stack_pointer_rtx;
        if (!p1_lo)
  	p1_lo = const0_rtx;
--- 4585,4596 ----
           possible, and copying the flags will prevent that in certain
           cases, e.g. consider the volatile flag. */
  
+       rtx pat = gen_reg_rtx (TImode);
        rtx lmem = change_address (ops[0], TImode, copy_rtx (addr));
        set_mem_alias_set (lmem, 0);
        emit_insn (gen_movti (reg, lmem));
  
!       if (!p0 || reg_aligned_for_addr (p0, 1))
  	p0 = stack_pointer_rtx;
        if (!p1_lo)
  	p1_lo = const0_rtx;
*************** spu_split_store (rtx * ops)
*** 4521,4537 ****
        emit_insn (gen_cpat (pat, p0, p1_lo, GEN_INT (GET_MODE_SIZE (mode))));
        emit_insn (gen_shufb (reg, ops[1], reg, pat));
      }
-   else if (reload_completed)
-     {
-       if (GET_CODE (ops[1]) == REG)
- 	emit_move_insn (reg, gen_rtx_REG (GET_MODE (reg), REGNO (ops[1])));
-       else if (GET_CODE (ops[1]) == SUBREG)
- 	emit_move_insn (reg,
- 			gen_rtx_REG (GET_MODE (reg),
- 				     REGNO (SUBREG_REG (ops[1]))));
-       else
- 	abort ();
-     }
    else
      {
        if (GET_CODE (ops[1]) == REG)
--- 4598,4603 ----
*************** spu_split_store (rtx * ops)
*** 4543,4557 ****
      }
  
    if (GET_MODE_SIZE (mode) < 4 && scalar)
!     emit_insn (gen_shlqby_ti
! 	       (reg, reg, GEN_INT (4 - GET_MODE_SIZE (mode))));
  
!   smem = change_address (ops[0], TImode, addr);
    /* We can't use the previous alias set because the memory has changed
       size and can potentially overlap objects of other types.  */
    set_mem_alias_set (smem, 0);
  
    emit_insn (gen_movti (smem, reg));
  }
  
  /* Return TRUE if X is MEM which is a struct member reference
--- 4609,4624 ----
      }
  
    if (GET_MODE_SIZE (mode) < 4 && scalar)
!     emit_insn (gen_ashlti3
! 	       (reg, reg, GEN_INT (32 - GET_MODE_BITSIZE (mode))));
  
!   smem = change_address (ops[0], TImode, copy_rtx (addr));
    /* We can't use the previous alias set because the memory has changed
       size and can potentially overlap objects of other types.  */
    set_mem_alias_set (smem, 0);
  
    emit_insn (gen_movti (smem, reg));
+   return 1;
  }
  
  /* Return TRUE if X is MEM which is a struct member reference
*************** fix_range (const char *const_str)
*** 4650,4686 ****
      }
  }
  
- int
- spu_valid_move (rtx * ops)
- {
-   enum machine_mode mode = GET_MODE (ops[0]);
-   if (!register_operand (ops[0], mode) && !register_operand (ops[1], mode))
-     return 0;
- 
-   /* init_expr_once tries to recog against load and store insns to set
-      the direct_load[] and direct_store[] arrays.  We always want to
-      consider those loads and stores valid.  init_expr_once is called in
-      the context of a dummy function which does not have a decl. */
-   if (cfun->decl == 0)
-     return 1;
- 
-   /* Don't allows loads/stores which would require more than 1 insn.
-      During and after reload we assume loads and stores only take 1
-      insn. */
-   if (GET_MODE_SIZE (mode) < 16 && !reload_in_progress && !reload_completed)
-     {
-       if (GET_CODE (ops[0]) == MEM
- 	  && (GET_MODE_SIZE (mode) < 4
- 	      || !(store_with_one_insn_p (ops[0])
- 		   || mem_is_padded_component_ref (ops[0]))))
- 	return 0;
-       if (GET_CODE (ops[1]) == MEM
- 	  && (GET_MODE_SIZE (mode) < 4 || !aligned_mem_p (ops[1])))
- 	return 0;
-     }
-   return 1;
- }
- 
  /* Return TRUE if x is a CONST_INT, CONST_DOUBLE or CONST_VECTOR that
     can be generated using the fsmbi instruction. */
  int
--- 4717,4722 ----
*************** spu_sms_res_mii (struct ddg *g)
*** 6394,6405 ****
  
  void
  spu_init_expanders (void)
! {   
!   /* HARD_FRAME_REGISTER is only 128 bit aligned when
!    * frame_pointer_needed is true.  We don't know that until we're
!    * expanding the prologue. */
    if (cfun)
!     REGNO_POINTER_ALIGN (HARD_FRAME_POINTER_REGNUM) = 8;
  }
  
  static enum machine_mode
--- 6430,6455 ----
  
  void
  spu_init_expanders (void)
! {
    if (cfun)
!     {
!       rtx r0, r1;
!       /* HARD_FRAME_REGISTER is only 128 bit aligned when
!          frame_pointer_needed is true.  We don't know that until we're
!          expanding the prologue. */
!       REGNO_POINTER_ALIGN (HARD_FRAME_POINTER_REGNUM) = 8;
! 
!       /* A number of passes use LAST_VIRTUAL_REGISTER+1 and
!          LAST_VIRTUAL_REGISTER+2 to test the back-end.  We want to
!          handle those cases specially, so we reserve those two registers
!          here by generating them. */
!       r0 = gen_reg_rtx (SImode);
!       r1 = gen_reg_rtx (SImode);
!       mark_reg_pointer (r0, 128);
!       mark_reg_pointer (r1, 128);
!       gcc_assert (REGNO (r0) == LAST_VIRTUAL_REGISTER + 1
! 		  && REGNO (r1) == LAST_VIRTUAL_REGISTER + 2);
!     }
  }
  
  static enum machine_mode
Index: gcc/gcc/config/spu/spu.h
===================================================================
*** gcc/gcc/config/spu/spu.h	(revision 146906)
--- gcc/gcc/config/spu/spu.h	(working copy)
*************** enum reg_class { 
*** 232,237 ****
--- 232,242 ----
  #define INT_REG_OK_FOR_BASE_P(X,STRICT) \
  	((!(STRICT) || REGNO_OK_FOR_BASE_P (REGNO (X))))
  
+ #define REG_ALIGN(X) \
+ 	(REG_POINTER(X) \
+ 	 	? REGNO_POINTER_ALIGN (ORIGINAL_REGNO (X)) \
+ 		: 0)
+ 
  #define PREFERRED_RELOAD_CLASS(X,CLASS)  (CLASS)
  
  #define CLASS_MAX_NREGS(CLASS, MODE)	\
*************** targetm.resolve_overloaded_builtin = spu
*** 425,431 ****
  #endif
  
  #define GO_IF_LEGITIMATE_ADDRESS(MODE, X, ADDR)			\
!     { if (spu_legitimate_address (MODE, X, REG_OK_STRICT_FLAG))	\
  	goto ADDR;						\
      }
  
--- 430,436 ----
  #endif
  
  #define GO_IF_LEGITIMATE_ADDRESS(MODE, X, ADDR)			\
!     { if (spu_legitimate_address (MODE, X, REG_OK_STRICT_FLAG, 0))	\
  	goto ADDR;						\
      }
  
*************** targetm.resolve_overloaded_builtin = spu
*** 605,610 ****
--- 610,617 ----
  
  #define HANDLE_PRAGMA_PACK_PUSH_POP 1
  
+ #define SPLIT_BEFORE_CSE2 1
+ 
  /* Canonicalize a comparison from one we don't have to one we do have.  */
  #define CANONICALIZE_COMPARISON(CODE,OP0,OP1) \
    do {                                                                    \
Index: gcc/gcc/config/spu/spu-builtins.md
===================================================================
*** gcc/gcc/config/spu/spu-builtins.md	(revision 146906)
--- gcc/gcc/config/spu/spu-builtins.md	(working copy)
***************
*** 23,31 ****
  
  (define_expand "spu_lqd"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (and:SI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 				 (match_operand:SI 2 "spu_nonmem_operand" ""))
! 		        (const_int -16))))]
    ""
    {
      if (GET_CODE (operands[2]) == CONST_INT
--- 23,30 ----
  
  (define_expand "spu_lqd"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 			 (match_operand:SI 2 "spu_nonmem_operand" ""))))]
    ""
    {
      if (GET_CODE (operands[2]) == CONST_INT
*************** (define_expand "spu_lqd"
*** 42,57 ****
  
  (define_expand "spu_lqx"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (and:SI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
!                                  (match_operand:SI 2 "spu_reg_operand" ""))
!                         (const_int -16))))]
    ""
    "")
  
  (define_expand "spu_lqa"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (and:SI (match_operand:SI 1 "immediate_operand" "")
!                         (const_int -16))))]
    ""
    {
      if (GET_CODE (operands[1]) == CONST_INT
--- 41,54 ----
  
  (define_expand "spu_lqx"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 			 (match_operand:SI 2 "spu_reg_operand" ""))))]
    ""
    "")
  
  (define_expand "spu_lqa"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
!         (mem:TI (match_operand:SI 1 "immediate_operand" "")))]
    ""
    {
      if (GET_CODE (operands[1]) == CONST_INT
*************** (define_expand "spu_lqa"
*** 61,75 ****
  
  (define_expand "spu_lqr"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
! 	(mem:TI (and:SI (match_operand:SI 1 "address_operand" "")
! 			(const_int -16))))]
    ""
    "")
  
  (define_expand "spu_stqd"
!   [(set (mem:TI (and:SI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 				 (match_operand:SI 2 "spu_nonmem_operand" ""))
! 		        (const_int -16)))
          (match_operand:TI 0 "spu_reg_operand" "r,r"))]
    ""
    {
--- 58,70 ----
  
  (define_expand "spu_lqr"
    [(set (match_operand:TI 0 "spu_reg_operand" "")
! 	(mem:TI (match_operand:SI 1 "address_operand" "")))]
    ""
    "")
  
  (define_expand "spu_stqd"
!   [(set (mem:TI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 			 (match_operand:SI 2 "spu_nonmem_operand" "")))
          (match_operand:TI 0 "spu_reg_operand" "r,r"))]
    ""
    {
*************** (define_expand "spu_stqd"
*** 86,101 ****
    })
  
  (define_expand "spu_stqx"
!   [(set (mem:TI (and:SI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 				 (match_operand:SI 2 "spu_reg_operand" ""))
! 		        (const_int -16)))
          (match_operand:TI 0 "spu_reg_operand" "r"))]
    ""
    "")
  
  (define_expand "spu_stqa"
!   [(set (mem:TI (and:SI (match_operand:SI 1 "immediate_operand" "")
! 			(const_int -16)))
          (match_operand:TI 0 "spu_reg_operand" "r"))]
    ""
    {
--- 81,94 ----
    })
  
  (define_expand "spu_stqx"
!   [(set (mem:TI (plus:SI (match_operand:SI 1 "spu_reg_operand" "")
! 			 (match_operand:SI 2 "spu_reg_operand" "")))
          (match_operand:TI 0 "spu_reg_operand" "r"))]
    ""
    "")
  
  (define_expand "spu_stqa"
!   [(set (mem:TI (match_operand:SI 1 "immediate_operand" ""))
          (match_operand:TI 0 "spu_reg_operand" "r"))]
    ""
    {
*************** (define_expand "spu_stqa"
*** 105,112 ****
    })
  
  (define_expand "spu_stqr"
!     [(set (mem:TI (and:SI (match_operand:SI 1 "address_operand" "")
! 			  (const_int -16)))
  	  (match_operand:TI 0 "spu_reg_operand" ""))]
    ""
    "")
--- 98,104 ----
    })
  
  (define_expand "spu_stqr"
!     [(set (mem:TI (match_operand:SI 1 "address_operand" ""))
  	  (match_operand:TI 0 "spu_reg_operand" ""))]
    ""
    "")
Index: gcc/gcc/config/spu/spu.md
===================================================================
*** gcc/gcc/config/spu/spu.md	(revision 146906)
--- gcc/gcc/config/spu/spu.md	(working copy)
*************** (define_expand "mov<mode>"
*** 278,285 ****
  (define_split 
    [(set (match_operand 0 "spu_reg_operand")
  	(match_operand 1 "immediate_operand"))]
! 
!   ""
    [(set (match_dup 0)
  	(high (match_dup 1)))
     (set (match_dup 0)
--- 278,284 ----
  (define_split 
    [(set (match_operand 0 "spu_reg_operand")
  	(match_operand 1 "immediate_operand"))]
!   "split0_completed"
    [(set (match_dup 0)
  	(high (match_dup 1)))
     (set (match_dup 0)
*************** (define_insn "load_pic_offset"
*** 316,324 ****
  ;; move internal
  
  (define_insn "_mov<mode>"
!   [(set (match_operand:MOV 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:MOV 1 "spu_mov_operand" "r,A,f,j,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%s1\t%0,%S1
--- 315,324 ----
  ;; move internal
  
  (define_insn "_mov<mode>"
!   [(set (match_operand:MOV 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:MOV 1 "spu_mov_operand" "r,A,f,j,m,r"))]
!   "register_operand(operands[0], <MODE>mode)
!    || register_operand(operands[1], <MODE>mode)"
    "@
     ori\t%0,%1,0
     il%s1\t%0,%S1
*************** (define_insn "low_<mode>"
*** 336,344 ****
    "iohl\t%0,%2@l")
  
  (define_insn "_movdi"
!   [(set (match_operand:DI 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:DI 1 "spu_mov_operand" "r,a,f,k,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%d1\t%0,%D1
--- 336,345 ----
    "iohl\t%0,%2@l")
  
  (define_insn "_movdi"
!   [(set (match_operand:DI 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:DI 1 "spu_mov_operand" "r,a,f,k,m,r"))]
!   "register_operand(operands[0], DImode)
!    || register_operand(operands[1], DImode)"
    "@
     ori\t%0,%1,0
     il%d1\t%0,%D1
*************** (define_insn "_movdi"
*** 349,357 ****
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
  (define_insn "_movti"
!   [(set (match_operand:TI 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:TI 1 "spu_mov_operand" "r,U,f,l,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%t1\t%0,%T1
--- 350,359 ----
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
  (define_insn "_movti"
!   [(set (match_operand:TI 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:TI 1 "spu_mov_operand" "r,U,f,l,m,r"))]
!   "register_operand(operands[0], TImode)
!    || register_operand(operands[1], TImode)"
    "@
     ori\t%0,%1,0
     il%t1\t%0,%T1
*************** (define_insn "_movti"
*** 361,389 ****
     stq%p0\t%1,%0"
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
! (define_insn_and_split "load"
!   [(set (match_operand 0 "spu_reg_operand" "=r")
! 	(match_operand 1 "memory_operand" "m"))
!    (clobber (match_operand:TI 2 "spu_reg_operand" "=&r"))
!    (clobber (match_operand:SI 3 "spu_reg_operand" "=&r"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1])"
!   "#"
!   ""
    [(set (match_dup 0)
  	(match_dup 1))]
!   { spu_split_load(operands); DONE; })
  
! (define_insn_and_split "store"
!   [(set (match_operand 0 "memory_operand" "=m")
! 	(match_operand 1 "spu_reg_operand" "r"))
!    (clobber (match_operand:TI 2 "spu_reg_operand" "=&r"))
!    (clobber (match_operand:TI 3 "spu_reg_operand" "=&r"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1])"
!   "#"
!   ""
    [(set (match_dup 0)
  	(match_dup 1))]
!   { spu_split_store(operands); DONE; })
  
  ;; Operand 3 is the number of bytes. 1:b 2:h 4:w 8:d
  
--- 363,387 ----
     stq%p0\t%1,%0"
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
! (define_split
!   [(set (match_operand 0 "spu_reg_operand")
! 	(match_operand 1 "memory_operand"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1]) && !split0_completed"
    [(set (match_dup 0)
  	(match_dup 1))]
!   { if (spu_split_load(operands))
!       DONE;
!   })
  
! (define_split
!   [(set (match_operand 0 "memory_operand")
! 	(match_operand 1 "spu_reg_operand"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1]) && !split0_completed"
    [(set (match_dup 0)
  	(match_dup 1))]
!   { if (spu_split_store(operands))
!       DONE;
!   })
  
  ;; Operand 3 is the number of bytes. 1:b 2:h 4:w 8:d
  
*************** (define_insn_and_split "_neg<mode>2"
*** 1238,1244 ****
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   ""
    [(set (match_dup:<F2I> 3)
  	(xor:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
--- 1236,1242 ----
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   "split0_completed"
    [(set (match_dup:<F2I> 3)
  	(xor:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
*************** (define_insn_and_split "_abs<mode>2"
*** 1274,1280 ****
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   ""
    [(set (match_dup:<F2I> 3)
  	(and:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
--- 1272,1278 ----
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   "split0_completed"
    [(set (match_dup:<F2I> 3)
  	(and:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
*************** (define_expand "vec_pack_trunc_v4si"
*** 5273,5280 ****
  }")
  
  (define_insn "stack_protect_set"
!   [(set (match_operand:SI 0 "spu_mem_operand" "=m")
!         (unspec:SI [(match_operand:SI 1 "spu_mem_operand" "m")] UNSPEC_SP_SET))
     (set (match_scratch:SI 2 "=&r") (const_int 0))]
    ""
    "lq%p1\t%2,%1\;stq%p0\t%2,%0\;xor\t%2,%2,%2"
--- 5271,5278 ----
  }")
  
  (define_insn "stack_protect_set"
!   [(set (match_operand:SI 0 "memory_operand" "=m")
!         (unspec:SI [(match_operand:SI 1 "memory_operand" "m")] UNSPEC_SP_SET))
     (set (match_scratch:SI 2 "=&r") (const_int 0))]
    ""
    "lq%p1\t%2,%1\;stq%p0\t%2,%0\;xor\t%2,%2,%2"
*************** (define_insn "stack_protect_set"
*** 5283,5290 ****
  )
  
  (define_expand "stack_protect_test"
!   [(match_operand 0 "spu_mem_operand" "")
!    (match_operand 1 "spu_mem_operand" "")
     (match_operand 2 "" "")]
    ""
  {
--- 5281,5288 ----
  )
  
  (define_expand "stack_protect_test"
!   [(match_operand 0 "memory_operand" "")
!    (match_operand 1 "memory_operand" "")
     (match_operand 2 "" "")]
    ""
  {
*************** (define_expand "stack_protect_test"
*** 5310,5317 ****
  
  (define_insn "stack_protect_test_si"
    [(set (match_operand:SI 0 "spu_reg_operand" "=&r")
!         (unspec:SI [(match_operand:SI 1 "spu_mem_operand" "m")
!                     (match_operand:SI 2 "spu_mem_operand" "m")]
                     UNSPEC_SP_TEST))
     (set (match_scratch:SI 3 "=&r") (const_int 0))]
    ""
--- 5308,5315 ----
  
  (define_insn "stack_protect_test_si"
    [(set (match_operand:SI 0 "spu_reg_operand" "=&r")
!         (unspec:SI [(match_operand:SI 1 "memory_operand" "m")
!                     (match_operand:SI 2 "memory_operand" "m")]
                     UNSPEC_SP_TEST))
     (set (match_scratch:SI 3 "=&r") (const_int 0))]
    ""
Index: gcc/gcc/testsuite/gcc.target/spu/split0-1.c
===================================================================
*** gcc/gcc/testsuite/gcc.target/spu/split0-1.c	(revision 0)
--- gcc/gcc/testsuite/gcc.target/spu/split0-1.c	(revision 0)
***************
*** 0 ****
--- 1,17 ----
+ /* Make sure there are only 2 loads. */
+ /* { dg-do compile { target spu-*-* } } */
+ /* { dg-options "-O2" } */
+ /* { dg-final { scan-assembler-times "lqd	\\$\[0-9\]+,0\\(\\$\[0-9\]+\\)" 1 } } */
+ /* { dg-final { scan-assembler-times "lqd	\\$\[0-9\]+,16\\(\\$\[0-9\]+\\)" 1 } } */
+ /* { dg-final { scan-assembler-times "lq\[dx\]" 2 } } */
+   
+ struct __attribute__ ((__aligned__(16))) S {
+   int a, b, c, d;
+   int e, f, g, h;
+ };
+   
+ int
+ f(struct S *s)
+ { 
+   return s->a + s->b + s->c + s->d + s->e + s->f + s->g + s->h;
+ } 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [SPU, PATCH] Split load and store instructions during expand
  2009-05-04 21:15 ` [PATCH] add optional split pass before CSE2 Trevor_Smigiel
@ 2009-05-23  3:37   ` Trevor_Smigiel
  0 siblings, 0 replies; 7+ messages in thread
From: Trevor_Smigiel @ 2009-05-23  3:37 UTC (permalink / raw)
  To: gcc-patches; +Cc: Ulrich Weigand, andrew_pinski

[-- Attachment #1: Type: text/plain, Size: 6315 bytes --]

Hi,

The attached patch was applied for SPU.

The SPU has only 16-byte loads and stores.  When loading or storing an
object smaller than 16 bytes, the compiler must split a load into
load/rotate and a store into load/modify/store.  Currently, the SPU
back end does the splitting at split1.  If the splitting is done
sooner, the new loads and stores can be merged more effectively.
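
As a rough illustration (mine, not part of the patch; the example
functions are made up), here is the kind of C access that has to be
split.  The mnemonics are from the SPU ISA, and the exact sequence the
compiler emits may differ:

        int
        get_word (int *p)
        {
          /* A word load is roughly lqd (load the containing quadword)
             followed by rotqby (rotate the word into the preferred
             slot).  */
          return *p;
        }

        void
        set_word (int *p)
        {
          /* A word store is roughly lqd (load the containing
             quadword), cwd (build the word-insertion mask), shufb
             (insert the new value), stqd (store the quadword back). */
          *p = 1;
        }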

In my previous patch I added a new split pass just before cse2.  Based
on feedback, I evaluated an implementation that splits loads and
stores during expand.  The new results were comparable to those of the
previous patch, and better in many cases because I ended up
reimplementing the code for extv/extzv.

There are 2 new failures in the testsuite:

  gcc.dg/pr30643.c
    Some dead code was previously removed during the RTL passes
    because the compiler was able to replace a MEM load with its
    constant value.  It can no longer propagate the constant because
    the MEM has become TImode, which is not the mode of the original
    store.  I experimented with setting REG_EQUAL and it did not help.
    pr29751 was previously submitted against the tree phases and would
    resolve this issue by doing the optimization before RTL.

  gcc.dg/sms-3.c
    SMS actually happens after the first split, so the failure is not
    because MEMs are split into multiple instructions and changed to
    TImode.  By splitting earlier we actually generate smaller code,
    because inspecting alignment at expand time gives better results.
    Specifically, when an address is (plus (reg) (reg)) the splitting
    code is now able to detect one of the registers as having
    sufficient alignment and does not need to generate an add
    instruction (see the sketch below).  So the code before and after
    the patch differs only by a few add instructions.
    This needs to be looked into further.
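
    As a hypothetical illustration (mine, not from the patch), an
    access like the following produces a (plus (reg) (reg)) address
    where the base register is presumed 16-byte aligned from the
    struct's alignment, so only the index register feeds the rotate:

        struct __attribute__ ((__aligned__ (16))) V { int v[32]; };

        int
        get (struct V *s, int i)
        {
          /* The address is s + i*4; with s assumed aligned, the low
             4 bits come entirely from the index register.  */
          return s->v[i];
        }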

There are a couple of other issues in the expand phase which, if
fixed, would allow even better code.  The previous patch is no better
in these cases.

  memory_address() in explow.c:
    It forces addresses into a register during expand because
    cse_not_expected is false.  When the SPU back end can see a more
    detailed address, it will normalize the loads it generates, making
    it more likely they can be merged.  For example, when SPU splits a
    CONST address, the SYMBOL_REF is known to be 16-byte aligned, so
    the SPU code will normalize the CONST_INT part by setting the
    least significant 4 bits to 0.  Consider the following test case:
    
        struct S {
          float a, b, c, d;
          float e, f, g, h;
        };

        struct S S, D;

        void
        g0()
        { 
          D.a = S.e;
          D.b = S.f;
          D.c = S.g;
          D.d = S.h;
          D.e = S.a;
          D.f = S.b;
          D.g = S.c;
          D.h = S.d;
        } 

    If memory_address() did not force the address into a register, the
    SPU back end would normalize the 16 loads and 8 stores, allowing
    the compiler to merge them into just 4 loads and 2 stores.
    Ideally, gcc would even recognize that the 2 loads of D are
    unnecessary in this case.  If memory_address() is changed to not
    put the address in a register, gcc would generate the following
    code for the above test case:
    
        g0:
                lqr     $14,S+16
                lqr     $13,S
                lqr     $21,D
                lqr     $24,D+16
                cwd     $22,0($sp)
                cwd     $17,4($sp)
                ori     $23,$14,0
                rotqbyi $18,$14,4
                ori     $25,$13,0
                rotqbyi $20,$13,4
                shufb   $16,$23,$21,$22
                shufb   $19,$25,$24,$22
                cwd     $15,8($sp)
                rotqbyi $9,$14,8
                shufb   $10,$18,$16,$17
                shufb   $11,$20,$19,$17
                rotqbyi $8,$13,8
                cwd     $12,12($sp)
                shufb   $7,$9,$10,$15
                rotqbyi $3,$14,12
                shufb   $5,$8,$11,$15
                rotqbyi $4,$13,12
                hbr     .L2,$lr
                shufb   $6,$3,$7,$12
                shufb   $2,$4,$5,$12
                stqr    $6,D
                stqr    $2,D+16
        .L2:
                bi      $lr

  force_reg() in explow.c:
    When forcing a CONST into a register it sets the alignment
    incorrectly because it does an exact_log2():
        ca = exact_log2 (INTVAL (c) & -INTVAL (c)) * BITS_PER_UNIT;
    should be
        ca = (INTVAL (c) & -INTVAL (c)) * BITS_PER_UNIT;
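    As a worked example (my arithmetic, not from the patch): for a
    CONST offset of 48, (48 & -48) * BITS_PER_UNIT = 16 * 8 = 128 bits
    of alignment, while the exact_log2 form gives only
    exact_log2 (16) * 8 = 32 bits.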

A patch for the latter 2 issues should not be difficult to create, but
I have not evaluated their benefit in any benchmarks or real code.

Trevor


* Trevor_Smigiel <Trevor_Smigiel@playstation.sony.com> [2009-05-04 14:14]:
> Hi,
> 
> On SPU we can get better code generation for loads and stores when we
> add an extra split pass just before cse2.  It is better because we can
> provide the earlier RTL passes a simplified RTL description and then
> split it to something closer to actual machine instructions to be
> improved by later RTL passes (especially cse2 and combine).  For loads
> and stores on SPU it means the early passes have correct alias
> information and later passes are able to merge common loads and stores.
> 
> This can potentially help any target that has complicated operations
> where creating multiple instructions at expand time is too early and
> splitting at split1 is too late.
> 
> To make it optional I added a new target defined macro,
>   #define SPLIT_BEFORE_CSE2 
> 
> To allow backends to control which patterns are split I added
>   int split0_completed;
> 
> I originally submitted this patch in August 2008.  This message contains
> more details of how it benefits SPU.
>   http://gcc.gnu.org/ml/gcc-patches/2008-08/msg02110.html
> 
> This new patch includes related fixes to abs and neg from Ulrich Weigand.
> 
> Bootstrapped and tested on x86 with no new failures. 
> 
> On SPU, there is one new failure, and a few new passes.  A loop in
> sms-3.c is not modulo scheduled because the splitting changes the code
> and aliasing information.  The new passes are because the predicates of
> stack_protect_set are updated and no longer cause an ICE.
> 
> Ok for mainline?
> 
> Thanks,
> Trevor
> 

[-- Attachment #2: spu_split.patch --]
[-- Type: text/x-diff, Size: 66216 bytes --]

Index: gcc/ChangeLog
===================================================================
*** gcc/ChangeLog	(revision 147813)
--- gcc/ChangeLog	(revision 147814)
***************
*** 1,3 ****
--- 1,49 ----
+ 2009-05-22  Trevor Smigiel <Trevor_Smigiel@playstation.sony.com>
+ 
+ 	* config/spu/spu-protos.h (aligned_mem_p, spu_valid_mov): Remove.
+ 	(spu_split_load, spu_split_store): Change return type to int.
+ 	(spu_split_convert): Declare.
+ 	* config/spu/predicates.md (spu_mem_operand): Remove.
+ 	(spu_mov_operand): Update.
+ 	(spu_dest_operand, shiftrt_operator, extend_operator): Define.
+ 	* config/spu/spu.c (regno_aligned_for_load): Remove.
+ 	(reg_aligned_for_addr, spu_expand_load): Define.
+ 	(spu_expand_extv): Reimplement and handle MEM.
+ 	(spu_expand_insv): Handle MEM.
+ 	(spu_sched_reorder): Handle insn's with length 0.
+ 	(spu_legitimate_address_p): Reimplement.
+ 	(store_with_one_insn_p): Return TRUE for any mode with size
+ 	larger than 16 bytes.
+ 	(address_needs_split): Define.
+ 	(spu_expand_mov): Call spu_split_load and spu_split_store for MEM
+ 	operands.
+ 	(spu_convert_move): Define.
+ 	(spu_split_load): Use spu_expand_load and change all MEM's to
+ 	TImode.
+ 	(spu_split_store): Change all MEM's to TImode.
+ 	(spu_init_expanders): Preallocate registers that correspond to
+ 	LAST_VIRTUAL_REG+1 and LAST_VIRTUAL_REG+2 and set them with
+ 	mark_reg_pointer.
+ 	(spu_split_convert): Define.
+ 	* config/spu/spu.md (QHSI, QHSDI): New mode iterators.
+ 	(_move<mode>, _movdi, _movti): Update predicate and condition.
+ 	(load, store): Change to define_split.
+ 	(extendqiti2, extendhiti2, extendsiti2, extendditi2): Simplify to
+ 	extend<mode>ti2.
+ 	(zero_extendqiti2, zero_extendhiti2, <v>lshr<mode>3_imm): Define.
+ 	(lshr<mode>3, lshr<mode>3_imm, lshr<mode>3_re): Simplify to one
+ 	define_insn_and_split of lshr<mode>3.
+ 	(shrqbybi_<mode>, shrqby_<mode>): Simplify to define_expand.
+ 	(<v>ashr<mode>3_imm): Define.
+ 	(extv, extzv, insv): Allow MEM operands.
+ 	(trunc_shr_ti<mode>, trunc_shr_tidi, shl_ext_<mode>ti,
+ 	shl_ext_diti, sext_trunc_lshr_tiqisi, zext_trunc_lshr_tiqisi,
+ 	sext_trunc_lshr_tihisi, zext_trunc_lshr_tihisi): Define for combine.
+ 	(_spu_convert2): Change to define_insn_and_split and remove the
+ 	corresponding define_peephole2.
+ 	(stack_protect_set, stack_protect_test, stack_protect_test_si):
+ 	Change predicates to memory_operand.
+ 
  2009-05-22  Mark Mitchell  <mark@codesourcery.com>
  
  	* config/arm/thumb2.md: Add 16-bit multiply instructions.
Index: gcc/config/spu/spu-protos.h
===================================================================
*** gcc/config/spu/spu-protos.h	(revision 147813)
--- gcc/config/spu/spu-protos.h	(revision 147814)
*************** extern void spu_setup_incoming_varargs (
*** 62,72 ****
  					tree type, int *pretend_size,
  					int no_rtl);
  extern void spu_conditional_register_usage (void);
- extern int aligned_mem_p (rtx mem);
  extern int spu_expand_mov (rtx * ops, enum machine_mode mode);
! extern void spu_split_load (rtx * ops);
! extern void spu_split_store (rtx * ops);
! extern int spu_valid_move (rtx * ops);
  extern int fsmbi_const_p (rtx x);
  extern int cpat_const_p (rtx x, enum machine_mode mode);
  extern rtx gen_cpat_const (rtx * ops);
--- 62,70 ----
  					tree type, int *pretend_size,
  					int no_rtl);
  extern void spu_conditional_register_usage (void);
  extern int spu_expand_mov (rtx * ops, enum machine_mode mode);
! extern int spu_split_load (rtx * ops);
! extern int spu_split_store (rtx * ops);
  extern int fsmbi_const_p (rtx x);
  extern int cpat_const_p (rtx x, enum machine_mode mode);
  extern rtx gen_cpat_const (rtx * ops);
*************** extern void spu_initialize_trampoline (r
*** 87,92 ****
--- 85,91 ----
  extern void spu_expand_sign_extend (rtx ops[]);
  extern void spu_expand_vector_init (rtx target, rtx vals);
  extern void spu_init_expanders (void);
+ extern void spu_split_convert (rtx *);
  
  /* spu-c.c */
  extern tree spu_resolve_overloaded_builtin (tree fndecl, void *fnargs);
Index: gcc/config/spu/predicates.md
===================================================================
*** gcc/config/spu/predicates.md	(revision 147813)
--- gcc/config/spu/predicates.md	(revision 147814)
*************** (define_predicate "spu_nonmem_operand"
*** 39,52 ****
         (ior (not (match_code "subreg"))
              (match_test "valid_subreg (op)"))))
  
- (define_predicate "spu_mem_operand"
-   (and (match_operand 0 "memory_operand")
-        (match_test "reload_in_progress || reload_completed || aligned_mem_p (op)")))
- 
  (define_predicate "spu_mov_operand"
!   (ior (match_operand 0 "spu_mem_operand")
         (match_operand 0 "spu_nonmem_operand")))
  
  (define_predicate "call_operand"
    (and (match_code "mem")
         (match_test "(!TARGET_LARGE_MEM && satisfies_constraint_S (op))
--- 39,52 ----
         (ior (not (match_code "subreg"))
              (match_test "valid_subreg (op)"))))
  
  (define_predicate "spu_mov_operand"
!   (ior (match_operand 0 "memory_operand")
         (match_operand 0 "spu_nonmem_operand")))
  
+ (define_predicate "spu_dest_operand"
+   (ior (match_operand 0 "memory_operand")
+        (match_operand 0 "spu_reg_operand")))
+ 
  (define_predicate "call_operand"
    (and (match_code "mem")
         (match_test "(!TARGET_LARGE_MEM && satisfies_constraint_S (op))
*************** (define_predicate "spu_exp2_operand"
*** 114,116 ****
--- 114,122 ----
         (and (match_operand 0 "immediate_operand")
  	    (match_test "exp2_immediate_p (op, mode, 0, 127)"))))
  
+ (define_predicate "shiftrt_operator"
+   (match_code "lshiftrt,ashiftrt"))
+ 
+ (define_predicate "extend_operator"
+   (match_code "sign_extend,zero_extend"))
+ 
Index: gcc/config/spu/spu.c
===================================================================
*** gcc/config/spu/spu.c	(revision 147813)
--- gcc/config/spu/spu.c	(revision 147814)
*************** static tree spu_build_builtin_va_list (v
*** 189,197 ****
  static void spu_va_start (tree, rtx);
  static tree spu_gimplify_va_arg_expr (tree valist, tree type,
  				      gimple_seq * pre_p, gimple_seq * post_p);
- static int regno_aligned_for_load (int regno);
  static int store_with_one_insn_p (rtx mem);
  static int mem_is_padded_component_ref (rtx x);
  static bool spu_assemble_integer (rtx x, unsigned int size, int aligned_p);
  static void spu_asm_globalize_label (FILE * file, const char *name);
  static unsigned char spu_rtx_costs (rtx x, int code, int outer_code,
--- 189,197 ----
  static void spu_va_start (tree, rtx);
  static tree spu_gimplify_va_arg_expr (tree valist, tree type,
  				      gimple_seq * pre_p, gimple_seq * post_p);
  static int store_with_one_insn_p (rtx mem);
  static int mem_is_padded_component_ref (rtx x);
+ static int reg_aligned_for_addr (rtx x);
  static bool spu_assemble_integer (rtx x, unsigned int size, int aligned_p);
  static void spu_asm_globalize_label (FILE * file, const char *name);
  static unsigned char spu_rtx_costs (rtx x, int code, int outer_code,
*************** static tree spu_builtin_vec_perm (tree, 
*** 211,216 ****
--- 211,217 ----
  static int spu_sms_res_mii (struct ddg *g);
  static void asm_file_start (void);
  static unsigned int spu_section_type_flags (tree, const char *, int);
+ static rtx spu_expand_load (rtx, rtx, rtx, int);
  
  extern const char *reg_names[];
  
*************** adjust_operand (rtx op, HOST_WIDE_INT * 
*** 582,647 ****
  void
  spu_expand_extv (rtx ops[], int unsignedp)
  {
    HOST_WIDE_INT width = INTVAL (ops[2]);
    HOST_WIDE_INT start = INTVAL (ops[3]);
!   HOST_WIDE_INT src_size, dst_size;
!   enum machine_mode src_mode, dst_mode;
!   rtx dst = ops[0], src = ops[1];
!   rtx s;
  
!   dst = adjust_operand (ops[0], 0);
!   dst_mode = GET_MODE (dst);
!   dst_size = GET_MODE_BITSIZE (GET_MODE (dst));
! 
!   src = adjust_operand (src, &start);
!   src_mode = GET_MODE (src);
!   src_size = GET_MODE_BITSIZE (GET_MODE (src));
  
!   if (start > 0)
      {
!       s = gen_reg_rtx (src_mode);
!       switch (src_mode)
  	{
! 	case SImode:
! 	  emit_insn (gen_ashlsi3 (s, src, GEN_INT (start)));
! 	  break;
! 	case DImode:
! 	  emit_insn (gen_ashldi3 (s, src, GEN_INT (start)));
! 	  break;
! 	case TImode:
! 	  emit_insn (gen_ashlti3 (s, src, GEN_INT (start)));
! 	  break;
! 	default:
! 	  abort ();
  	}
!       src = s;
      }
  
!   if (width < src_size)
      {
!       rtx pat;
!       int icode;
!       switch (src_mode)
! 	{
! 	case SImode:
! 	  icode = unsignedp ? CODE_FOR_lshrsi3 : CODE_FOR_ashrsi3;
! 	  break;
! 	case DImode:
! 	  icode = unsignedp ? CODE_FOR_lshrdi3 : CODE_FOR_ashrdi3;
! 	  break;
! 	case TImode:
! 	  icode = unsignedp ? CODE_FOR_lshrti3 : CODE_FOR_ashrti3;
! 	  break;
! 	default:
! 	  abort ();
! 	}
!       s = gen_reg_rtx (src_mode);
!       pat = GEN_FCN (icode) (s, src, GEN_INT (src_size - width));
!       emit_insn (pat);
!       src = s;
      }
  
!   convert_move (dst, src, unsignedp);
  }
  
  void
--- 583,667 ----
  void
  spu_expand_extv (rtx ops[], int unsignedp)
  {
+   rtx dst = ops[0], src = ops[1];
    HOST_WIDE_INT width = INTVAL (ops[2]);
    HOST_WIDE_INT start = INTVAL (ops[3]);
!   HOST_WIDE_INT align_mask;
!   rtx s0, s1, mask, r0;
  
!   gcc_assert (REG_P (dst) && GET_MODE (dst) == TImode);
  
!   if (MEM_P (src))
      {
!       /* First, determine if we need 1 TImode load or 2.  We need only 1
!          if the bits being extracted do not cross the alignment boundary
!          as determined by the MEM and its address. */
! 
!       align_mask = -MEM_ALIGN (src);
!       if ((start & align_mask) == ((start + width - 1) & align_mask))
! 	{
! 	  /* Alignment is sufficient for 1 load. */
! 	  s0 = gen_reg_rtx (TImode);
! 	  r0 = spu_expand_load (s0, 0, src, start / 8);
! 	  start &= 7;
! 	  if (r0)
! 	    emit_insn (gen_rotqby_ti (s0, s0, r0));
! 	}
!       else
  	{
! 	  /* Need 2 loads. */
! 	  s0 = gen_reg_rtx (TImode);
! 	  s1 = gen_reg_rtx (TImode);
! 	  r0 = spu_expand_load (s0, s1, src, start / 8);
! 	  start &= 7;
! 
! 	  gcc_assert (start + width <= 128);
! 	  if (r0)
! 	    {
! 	      rtx r1 = gen_reg_rtx (SImode);
! 	      mask = gen_reg_rtx (TImode);
! 	      emit_move_insn (mask, GEN_INT (-1));
! 	      emit_insn (gen_rotqby_ti (s0, s0, r0));
! 	      emit_insn (gen_rotqby_ti (s1, s1, r0));
! 	      if (GET_CODE (r0) == CONST_INT)
! 		r1 = GEN_INT (INTVAL (r0) & 15);
! 	      else
! 		emit_insn (gen_andsi3 (r1, r0, GEN_INT (15)));
! 	      emit_insn (gen_shlqby_ti (mask, mask, r1));
! 	      emit_insn (gen_selb (s0, s1, s0, mask));
! 	    }
  	}
! 
!     }
!   else if (GET_CODE (src) == SUBREG)
!     {
!       rtx r = SUBREG_REG (src);
!       gcc_assert (REG_P (r) && SCALAR_INT_MODE_P (GET_MODE (r)));
!       s0 = gen_reg_rtx (TImode);
!       if (GET_MODE_SIZE (GET_MODE (r)) < GET_MODE_SIZE (TImode))
! 	emit_insn (gen_rtx_SET (VOIDmode, s0, gen_rtx_ZERO_EXTEND (TImode, r)));
!       else
! 	emit_move_insn (s0, src);
!     }
!   else 
!     {
!       gcc_assert (REG_P (src) && GET_MODE (src) == TImode);
!       s0 = gen_reg_rtx (TImode);
!       emit_move_insn (s0, src);
      }
  
!   /* Now s0 is TImode and contains the bits to extract at start. */
! 
!   if (start)
!     emit_insn (gen_rotlti3 (s0, s0, GEN_INT (start)));
! 
!   if (128 - width)
      {
!       tree c = build_int_cst (NULL_TREE, 128 - width);
!       s0 = expand_shift (RSHIFT_EXPR, TImode, s0, c, s0, unsignedp);
      }
  
!   emit_move_insn (dst, s0);
  }
  
  void
*************** spu_expand_insv (rtx ops[])
*** 734,771 ****
      }
    if (GET_CODE (ops[0]) == MEM)
      {
-       rtx aligned = gen_reg_rtx (SImode);
        rtx low = gen_reg_rtx (SImode);
-       rtx addr = gen_reg_rtx (SImode);
        rtx rotl = gen_reg_rtx (SImode);
        rtx mask0 = gen_reg_rtx (TImode);
        rtx mem;
  
!       emit_move_insn (addr, XEXP (ops[0], 0));
!       emit_insn (gen_andsi3 (aligned, addr, GEN_INT (-16)));
        emit_insn (gen_andsi3 (low, addr, GEN_INT (15)));
        emit_insn (gen_negsi2 (rotl, low));
        emit_insn (gen_rotqby_ti (shift_reg, shift_reg, rotl));
        emit_insn (gen_rotqmby_ti (mask0, mask, rotl));
!       mem = change_address (ops[0], TImode, aligned);
        set_mem_alias_set (mem, 0);
        emit_move_insn (dst, mem);
        emit_insn (gen_selb (dst, dst, shift_reg, mask0));
-       emit_move_insn (mem, dst);
        if (start + width > MEM_ALIGN (ops[0]))
  	{
  	  rtx shl = gen_reg_rtx (SImode);
  	  rtx mask1 = gen_reg_rtx (TImode);
  	  rtx dst1 = gen_reg_rtx (TImode);
  	  rtx mem1;
  	  emit_insn (gen_subsi3 (shl, GEN_INT (16), low));
  	  emit_insn (gen_shlqby_ti (mask1, mask, shl));
! 	  mem1 = adjust_address (mem, TImode, 16);
  	  set_mem_alias_set (mem1, 0);
  	  emit_move_insn (dst1, mem1);
  	  emit_insn (gen_selb (dst1, dst1, shift_reg, mask1));
  	  emit_move_insn (mem1, dst1);
  	}
      }
    else
      emit_insn (gen_selb (dst, copy_rtx (dst), shift_reg, mask));
--- 754,794 ----
      }
    if (GET_CODE (ops[0]) == MEM)
      {
        rtx low = gen_reg_rtx (SImode);
        rtx rotl = gen_reg_rtx (SImode);
        rtx mask0 = gen_reg_rtx (TImode);
+       rtx addr;
+       rtx addr0;
+       rtx addr1;
        rtx mem;
  
!       addr = force_reg (Pmode, XEXP (ops[0], 0));
!       addr0 = gen_rtx_AND (Pmode, addr, GEN_INT (-16));
        emit_insn (gen_andsi3 (low, addr, GEN_INT (15)));
        emit_insn (gen_negsi2 (rotl, low));
        emit_insn (gen_rotqby_ti (shift_reg, shift_reg, rotl));
        emit_insn (gen_rotqmby_ti (mask0, mask, rotl));
!       mem = change_address (ops[0], TImode, addr0);
        set_mem_alias_set (mem, 0);
        emit_move_insn (dst, mem);
        emit_insn (gen_selb (dst, dst, shift_reg, mask0));
        if (start + width > MEM_ALIGN (ops[0]))
  	{
  	  rtx shl = gen_reg_rtx (SImode);
  	  rtx mask1 = gen_reg_rtx (TImode);
  	  rtx dst1 = gen_reg_rtx (TImode);
  	  rtx mem1;
+ 	  addr1 = plus_constant (addr, 16);
+ 	  addr1 = gen_rtx_AND (Pmode, addr1, GEN_INT (-16));
  	  emit_insn (gen_subsi3 (shl, GEN_INT (16), low));
  	  emit_insn (gen_shlqby_ti (mask1, mask, shl));
! 	  mem1 = change_address (ops[0], TImode, addr1);
  	  set_mem_alias_set (mem1, 0);
  	  emit_move_insn (dst1, mem1);
  	  emit_insn (gen_selb (dst1, dst1, shift_reg, mask1));
  	  emit_move_insn (mem1, dst1);
  	}
+       emit_move_insn (mem, dst);
      }
    else
      emit_insn (gen_selb (dst, copy_rtx (dst), shift_reg, mask));
*************** spu_sched_reorder (FILE *file ATTRIBUTE_
*** 2998,3004 ****
        insn = ready[i];
        if (INSN_CODE (insn) == -1
  	  || INSN_CODE (insn) == CODE_FOR_blockage
! 	  || INSN_CODE (insn) == CODE_FOR__spu_convert)
  	{
  	  ready[i] = ready[nready - 1];
  	  ready[nready - 1] = insn;
--- 3021,3027 ----
        insn = ready[i];
        if (INSN_CODE (insn) == -1
  	  || INSN_CODE (insn) == CODE_FOR_blockage
! 	  || (INSN_P (insn) && get_attr_length (insn) == 0))
  	{
  	  ready[i] = ready[nready - 1];
  	  ready[nready - 1] = insn;
*************** spu_sched_adjust_cost (rtx insn, rtx lin
*** 3129,3136 ****
        || INSN_CODE (dep_insn) == CODE_FOR_blockage)
      return 0;
  
!   if (INSN_CODE (insn) == CODE_FOR__spu_convert
!       || INSN_CODE (dep_insn) == CODE_FOR__spu_convert)
      return 0;
  
    /* Make sure hbrps are spread out. */
--- 3152,3159 ----
        || INSN_CODE (dep_insn) == CODE_FOR_blockage)
      return 0;
  
!   if ((INSN_P (insn) && get_attr_length (insn) == 0)
!       || (INSN_P (dep_insn) && get_attr_length (dep_insn) == 0))
      return 0;
  
    /* Make sure hbrps are spread out. */
*************** spu_legitimate_constant_p (rtx x)
*** 3611,3654 ****
  /* Valid address are:
     - symbol_ref, label_ref, const
     - reg
!    - reg + const, where either reg or const is 16 byte aligned
     - reg + reg, alignment doesn't matter
    The alignment matters in the reg+const case because lqd and stqd
!   ignore the 4 least significant bits of the const.  (TODO: It might be
!   preferable to allow any alignment and fix it up when splitting.) */
! bool
! spu_legitimate_address_p (enum machine_mode mode ATTRIBUTE_UNUSED,
  			  rtx x, bool reg_ok_strict)
  {
!   if (mode == TImode && GET_CODE (x) == AND
        && GET_CODE (XEXP (x, 1)) == CONST_INT
!       && INTVAL (XEXP (x, 1)) == (HOST_WIDE_INT) -16)
      x = XEXP (x, 0);
    switch (GET_CODE (x))
      {
-     case SYMBOL_REF:
      case LABEL_REF:
!       return !TARGET_LARGE_MEM;
! 
      case CONST:
!       if (!TARGET_LARGE_MEM && GET_CODE (XEXP (x, 0)) == PLUS)
! 	{
! 	  rtx sym = XEXP (XEXP (x, 0), 0);
! 	  rtx cst = XEXP (XEXP (x, 0), 1);
! 
! 	  /* Accept any symbol_ref + constant, assuming it does not
! 	     wrap around the local store addressability limit.  */
! 	  if (GET_CODE (sym) == SYMBOL_REF && GET_CODE (cst) == CONST_INT)
! 	    return 1;
! 	}
!       return 0;
  
      case CONST_INT:
        return INTVAL (x) >= 0 && INTVAL (x) <= 0x3ffff;
  
      case SUBREG:
        x = XEXP (x, 0);
!       gcc_assert (GET_CODE (x) == REG);
  
      case REG:
        return INT_REG_OK_FOR_BASE_P (x, reg_ok_strict);
--- 3634,3669 ----
  /* Valid address are:
     - symbol_ref, label_ref, const
     - reg
!    - reg + const_int, where const_int is 16 byte aligned
     - reg + reg, alignment doesn't matter
    The alignment matters in the reg+const case because lqd and stqd
!   ignore the 4 least significant bits of the const.  We only care about
!   16 byte modes because the expand phase will change all smaller MEM
!   references to TImode.  */
! static bool
! spu_legitimate_address_p (enum machine_mode mode,
  			  rtx x, bool reg_ok_strict)
  {
!   int aligned = GET_MODE_SIZE (mode) >= 16;
!   if (aligned
!       && GET_CODE (x) == AND
        && GET_CODE (XEXP (x, 1)) == CONST_INT
!       && INTVAL (XEXP (x, 1)) == (HOST_WIDE_INT) - 16)
      x = XEXP (x, 0);
    switch (GET_CODE (x))
      {
      case LABEL_REF:
!     case SYMBOL_REF:
      case CONST:
!       return !TARGET_LARGE_MEM;
  
      case CONST_INT:
        return INTVAL (x) >= 0 && INTVAL (x) <= 0x3ffff;
  
      case SUBREG:
        x = XEXP (x, 0);
!       if (REG_P (x))
! 	return 0;
  
      case REG:
        return INT_REG_OK_FOR_BASE_P (x, reg_ok_strict);
*************** spu_legitimate_address_p (enum machine_m
*** 3662,3690 ****
  	  op0 = XEXP (op0, 0);
  	if (GET_CODE (op1) == SUBREG)
  	  op1 = XEXP (op1, 0);
- 	/* We can't just accept any aligned register because CSE can
- 	   change it to a register that is not marked aligned and then
- 	   recog will fail.   So we only accept frame registers because
- 	   they will only be changed to other frame registers. */
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == CONST_INT
  	    && INTVAL (op1) >= -0x2000
  	    && INTVAL (op1) <= 0x1fff
! 	    && (regno_aligned_for_load (REGNO (op0)) || (INTVAL (op1) & 15) == 0))
! 	  return 1;
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == REG
  	    && INT_REG_OK_FOR_INDEX_P (op1, reg_ok_strict))
! 	  return 1;
        }
        break;
  
      default:
        break;
      }
!   return 0;
  }
  
  /* When the address is reg + const_int, force the const_int into a
--- 3677,3701 ----
  	  op0 = XEXP (op0, 0);
  	if (GET_CODE (op1) == SUBREG)
  	  op1 = XEXP (op1, 0);
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == CONST_INT
  	    && INTVAL (op1) >= -0x2000
  	    && INTVAL (op1) <= 0x1fff
! 	    && (!aligned || (INTVAL (op1) & 15) == 0))
! 	  return TRUE;
  	if (GET_CODE (op0) == REG
  	    && INT_REG_OK_FOR_BASE_P (op0, reg_ok_strict)
  	    && GET_CODE (op1) == REG
  	    && INT_REG_OK_FOR_INDEX_P (op1, reg_ok_strict))
! 	  return TRUE;
        }
        break;
  
      default:
        break;
      }
!   return FALSE;
  }
  
  /* When the address is reg + const_int, force the const_int into a
*************** spu_conditional_register_usage (void)
*** 4137,4196 ****
      }
  }
  
! /* This is called to decide when we can simplify a load instruction.  We
!    must only return true for registers which we know will always be
!    aligned.  Taking into account that CSE might replace this reg with
!    another one that has not been marked aligned.  
!    So this is really only true for frame, stack and virtual registers,
!    which we know are always aligned and should not be adversely effected
!    by CSE.  */
  static int
! regno_aligned_for_load (int regno)
  {
!   return regno == FRAME_POINTER_REGNUM
!     || (frame_pointer_needed && regno == HARD_FRAME_POINTER_REGNUM)
!     || regno == ARG_POINTER_REGNUM
!     || regno == STACK_POINTER_REGNUM
!     || (regno >= FIRST_VIRTUAL_REGISTER 
! 	&& regno <= LAST_VIRTUAL_REGISTER);
! }
! 
! /* Return TRUE when mem is known to be 16-byte aligned. */
! int
! aligned_mem_p (rtx mem)
! {
!   if (MEM_ALIGN (mem) >= 128)
!     return 1;
!   if (GET_MODE_SIZE (GET_MODE (mem)) >= 16)
!     return 1;
!   if (GET_CODE (XEXP (mem, 0)) == PLUS)
!     {
!       rtx p0 = XEXP (XEXP (mem, 0), 0);
!       rtx p1 = XEXP (XEXP (mem, 0), 1);
!       if (regno_aligned_for_load (REGNO (p0)))
! 	{
! 	  if (GET_CODE (p1) == REG && regno_aligned_for_load (REGNO (p1)))
! 	    return 1;
! 	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15) == 0)
! 	    return 1;
! 	}
!     }
!   else if (GET_CODE (XEXP (mem, 0)) == REG)
!     {
!       if (regno_aligned_for_load (REGNO (XEXP (mem, 0))))
! 	return 1;
!     }
!   else if (ALIGNED_SYMBOL_REF_P (XEXP (mem, 0)))
!     return 1;
!   else if (GET_CODE (XEXP (mem, 0)) == CONST)
!     {
!       rtx p0 = XEXP (XEXP (XEXP (mem, 0), 0), 0);
!       rtx p1 = XEXP (XEXP (XEXP (mem, 0), 0), 1);
!       if (GET_CODE (p0) == SYMBOL_REF
! 	  && GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15) == 0)
! 	return 1;
!     }
!   return 0;
  }
  
  /* Encode symbol attributes (local vs. global, tls model) of a SYMBOL_REF
--- 4148,4161 ----
      }
  }
  
! /* This is called any time we inspect the alignment of a register for
!    addresses.  */
  static int
! reg_aligned_for_addr (rtx x)
  {
!   int regno =
!     REGNO (x) < FIRST_PSEUDO_REGISTER ? ORIGINAL_REGNO (x) : REGNO (x);
!   return REGNO_POINTER_ALIGN (regno) >= 128;
  }
  
  /* Encode symbol attributes (local vs. global, tls model) of a SYMBOL_REF
*************** spu_encode_section_info (tree decl, rtx 
*** 4219,4227 ****
  static int
  store_with_one_insn_p (rtx mem)
  {
    rtx addr = XEXP (mem, 0);
!   if (GET_MODE (mem) == BLKmode)
      return 0;
    /* Only static objects. */
    if (GET_CODE (addr) == SYMBOL_REF)
      {
--- 4184,4195 ----
  static int
  store_with_one_insn_p (rtx mem)
  {
+   enum machine_mode mode = GET_MODE (mem);
    rtx addr = XEXP (mem, 0);
!   if (mode == BLKmode)
      return 0;
+   if (GET_MODE_SIZE (mode) >= 16)
+     return 1;
    /* Only static objects. */
    if (GET_CODE (addr) == SYMBOL_REF)
      {
*************** store_with_one_insn_p (rtx mem)
*** 4245,4250 ****
--- 4213,4234 ----
    return 0;
  }
  
+ /* Return 1 when the address is not valid for a simple load and store as
+    required by the '_mov*' patterns.   We could make this less strict
+    for loads, but we prefer mem's to look the same so they are more
+    likely to be merged.  */
+ static int
+ address_needs_split (rtx mem)
+ {
+   if (GET_MODE_SIZE (GET_MODE (mem)) < 16
+       && (GET_MODE_SIZE (GET_MODE (mem)) < 4
+ 	  || !(store_with_one_insn_p (mem)
+ 	       || mem_is_padded_component_ref (mem))))
+     return 1;
+ 
+   return 0;
+ }
+ 
  int
  spu_expand_mov (rtx * ops, enum machine_mode mode)
  {
*************** spu_expand_mov (rtx * ops, enum machine_
*** 4289,4342 ****
  	return spu_split_immediate (ops);
        return 0;
      }
!   else
      {
!       if (GET_CODE (ops[0]) == MEM)
! 	{
! 	  if (!spu_valid_move (ops))
! 	    {
! 	      emit_insn (gen_store (ops[0], ops[1], gen_reg_rtx (TImode),
! 				    gen_reg_rtx (TImode)));
! 	      return 1;
! 	    }
! 	}
!       else if (GET_CODE (ops[1]) == MEM)
  	{
! 	  if (!spu_valid_move (ops))
! 	    {
! 	      emit_insn (gen_load
! 			 (ops[0], ops[1], gen_reg_rtx (TImode),
! 			  gen_reg_rtx (SImode)));
! 	      return 1;
! 	    }
! 	}
!       /* Catch the SImode immediates greater than 0x7fffffff, and sign
!          extend them. */
!       if (GET_CODE (ops[1]) == CONST_INT)
! 	{
! 	  HOST_WIDE_INT val = trunc_int_for_mode (INTVAL (ops[1]), mode);
! 	  if (val != INTVAL (ops[1]))
! 	    {
! 	      emit_move_insn (ops[0], GEN_INT (val));
! 	      return 1;
! 	    }
  	}
      }
    return 0;
  }
  
! void
! spu_split_load (rtx * ops)
  {
!   enum machine_mode mode = GET_MODE (ops[0]);
!   rtx addr, load, rot, mem, p0, p1;
!   int rot_amt;
  
!   addr = XEXP (ops[1], 0);
  
    rot = 0;
    rot_amt = 0;
!   if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
           aligned reg   + aligned reg     => lqx
--- 4273,4335 ----
  	return spu_split_immediate (ops);
        return 0;
      }
! 
!   /* Catch the SImode immediates greater than 0x7fffffff, and sign
!      extend them. */
!   if (GET_CODE (ops[1]) == CONST_INT)
      {
!       HOST_WIDE_INT val = trunc_int_for_mode (INTVAL (ops[1]), mode);
!       if (val != INTVAL (ops[1]))
  	{
! 	  emit_move_insn (ops[0], GEN_INT (val));
! 	  return 1;
  	}
      }
+   if (MEM_P (ops[0]))
+     return spu_split_store (ops);
+   if (MEM_P (ops[1]))
+     return spu_split_load (ops);
+ 
    return 0;
  }
  
! static void
! spu_convert_move (rtx dst, rtx src)
  {
!   enum machine_mode mode = GET_MODE (dst);
!   enum machine_mode int_mode = mode_for_size (GET_MODE_BITSIZE (mode), MODE_INT, 0);
!   rtx reg;
!   gcc_assert (GET_MODE (src) == TImode);
!   reg = int_mode != mode ? gen_reg_rtx (int_mode) : dst;
!   emit_insn (gen_rtx_SET (VOIDmode, reg,
! 	       gen_rtx_TRUNCATE (int_mode,
! 		 gen_rtx_LSHIFTRT (TImode, src,
! 		   GEN_INT (int_mode == DImode ? 64 : 96)))));
!   if (int_mode != mode)
!     {
!       reg = simplify_gen_subreg (mode, reg, int_mode, 0);
!       emit_move_insn (dst, reg);
!     }
! }
  
! /* Load TImode values into DST0 and DST1 (when it is non-NULL) using
!    the address from SRC and SRC+16.  Return a REG or CONST_INT that 
!    specifies how many bytes to rotate the loaded registers, plus any
!    extra from EXTRA_ROTQBY.  The address and rotate amounts are
!    normalized to improve merging of loads and rotate computations. */
! static rtx
! spu_expand_load (rtx dst0, rtx dst1, rtx src, int extra_rotby)
! {
!   rtx addr = XEXP (src, 0);
!   rtx p0, p1, rot, addr0, addr1;
!   int rot_amt;
  
    rot = 0;
    rot_amt = 0;
! 
!   if (MEM_ALIGN (src) >= 128)
!     /* Address is already aligned; simply perform a TImode load.  */ ;
!   else if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
           aligned reg   + aligned reg     => lqx
*************** spu_split_load (rtx * ops)
*** 4350,4361 ****
         */
        p0 = XEXP (addr, 0);
        p1 = XEXP (addr, 1);
!       if (REG_P (p0) && !regno_aligned_for_load (REGNO (p0)))
  	{
! 	  if (REG_P (p1) && !regno_aligned_for_load (REGNO (p1)))
  	    {
! 	      emit_insn (gen_addsi3 (ops[3], p0, p1));
! 	      rot = ops[3];
  	    }
  	  else
  	    rot = p0;
--- 4343,4376 ----
         */
        p0 = XEXP (addr, 0);
        p1 = XEXP (addr, 1);
!       if (!reg_aligned_for_addr (p0))
  	{
! 	  if (REG_P (p1) && !reg_aligned_for_addr (p1))
  	    {
! 	      rot = gen_reg_rtx (SImode);
! 	      emit_insn (gen_addsi3 (rot, p0, p1));
! 	    }
! 	  else if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
! 	    {
! 	      if (INTVAL (p1) > 0
! 		  && REG_POINTER (p0)
! 		  && INTVAL (p1) * BITS_PER_UNIT
! 		     < REGNO_POINTER_ALIGN (REGNO (p0)))
! 		{
! 		  rot = gen_reg_rtx (SImode);
! 		  emit_insn (gen_addsi3 (rot, p0, p1));
! 		  addr = p0;
! 		}
! 	      else
! 		{
! 		  rtx x = gen_reg_rtx (SImode);
! 		  emit_move_insn (x, p1);
! 		  if (!spu_arith_operand (p1, SImode))
! 		    p1 = x;
! 		  rot = gen_reg_rtx (SImode);
! 		  emit_insn (gen_addsi3 (rot, p0, p1));
! 		  addr = gen_rtx_PLUS (Pmode, p0, x);
! 		}
  	    }
  	  else
  	    rot = p0;
*************** spu_split_load (rtx * ops)
*** 4365,4380 ****
  	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
  	    {
  	      rot_amt = INTVAL (p1) & 15;
! 	      p1 = GEN_INT (INTVAL (p1) & -16);
! 	      addr = gen_rtx_PLUS (SImode, p0, p1);
  	    }
! 	  else if (REG_P (p1) && !regno_aligned_for_load (REGNO (p1)))
  	    rot = p1;
  	}
      }
!   else if (GET_CODE (addr) == REG)
      {
!       if (!regno_aligned_for_load (REGNO (addr)))
  	rot = addr;
      }
    else if (GET_CODE (addr) == CONST)
--- 4380,4400 ----
  	  if (GET_CODE (p1) == CONST_INT && (INTVAL (p1) & 15))
  	    {
  	      rot_amt = INTVAL (p1) & 15;
! 	      if (INTVAL (p1) & -16)
! 		{
! 		  p1 = GEN_INT (INTVAL (p1) & -16);
! 		  addr = gen_rtx_PLUS (SImode, p0, p1);
! 		}
! 	      else
! 		addr = p0;
  	    }
! 	  else if (REG_P (p1) && !reg_aligned_for_addr (p1))
  	    rot = p1;
  	}
      }
!   else if (REG_P (addr))
      {
!       if (!reg_aligned_for_addr (addr))
  	rot = addr;
      }
    else if (GET_CODE (addr) == CONST)
*************** spu_split_load (rtx * ops)
*** 4393,4399 ****
  	    addr = XEXP (XEXP (addr, 0), 0);
  	}
        else
! 	rot = addr;
      }
    else if (GET_CODE (addr) == CONST_INT)
      {
--- 4413,4422 ----
  	    addr = XEXP (XEXP (addr, 0), 0);
  	}
        else
! 	{
! 	  rot = gen_reg_rtx (Pmode);
! 	  emit_move_insn (rot, addr);
! 	}
      }
    else if (GET_CODE (addr) == CONST_INT)
      {
*************** spu_split_load (rtx * ops)
*** 4401,4449 ****
        addr = GEN_INT (rot_amt & -16);
      }
    else if (!ALIGNED_SYMBOL_REF_P (addr))
!     rot = addr;
  
!   if (GET_MODE_SIZE (mode) < 4)
!     rot_amt += GET_MODE_SIZE (mode) - 4;
  
    rot_amt &= 15;
  
    if (rot && rot_amt)
      {
!       emit_insn (gen_addsi3 (ops[3], rot, GEN_INT (rot_amt)));
!       rot = ops[3];
        rot_amt = 0;
      }
  
!   load = ops[2];
  
!   addr = gen_rtx_AND (SImode, copy_rtx (addr), GEN_INT (-16));
!   mem = change_address (ops[1], TImode, addr);
  
!   emit_insn (gen_movti (load, mem));
  
    if (rot)
      emit_insn (gen_rotqby_ti (load, load, rot));
-   else if (rot_amt)
-     emit_insn (gen_rotlti3 (load, load, GEN_INT (rot_amt * 8)));
  
!   if (reload_completed)
!     emit_move_insn (ops[0], gen_rtx_REG (GET_MODE (ops[0]), REGNO (load)));
!   else
!     emit_insn (gen_spu_convert (ops[0], load));
  }
  
! void
  spu_split_store (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
!   rtx pat = ops[2];
!   rtx reg = ops[3];
    rtx addr, p0, p1, p1_lo, smem;
    int aform;
    int scalar;
  
    addr = XEXP (ops[0], 0);
  
    if (GET_CODE (addr) == PLUS)
      {
--- 4424,4519 ----
        addr = GEN_INT (rot_amt & -16);
      }
    else if (!ALIGNED_SYMBOL_REF_P (addr))
!     {
!       rot = gen_reg_rtx (Pmode);
!       emit_move_insn (rot, addr);
!     }
  
!   rot_amt += extra_rotby;
  
    rot_amt &= 15;
  
    if (rot && rot_amt)
      {
!       rtx x = gen_reg_rtx (SImode);
!       emit_insn (gen_addsi3 (x, rot, GEN_INT (rot_amt)));
!       rot = x;
        rot_amt = 0;
      }
+   if (!rot && rot_amt)
+     rot = GEN_INT (rot_amt);
  
!   addr0 = copy_rtx (addr);
!   addr0 = gen_rtx_AND (SImode, copy_rtx (addr), GEN_INT (-16));
!   emit_insn (gen__movti (dst0, change_address (src, TImode, addr0)));
  
!   if (dst1)
!     {
!       addr1 = plus_constant (copy_rtx (addr), 16);
!       addr1 = gen_rtx_AND (SImode, addr1, GEN_INT (-16));
!       emit_insn (gen__movti (dst1, change_address (src, TImode, addr1)));
!     }
  
!   return rot;
! }
! 
! int
! spu_split_load (rtx * ops)
! {
!   enum machine_mode mode = GET_MODE (ops[0]);
!   rtx addr, load, rot;
!   int rot_amt;
! 
!   if (GET_MODE_SIZE (mode) >= 16)
!     return 0;
! 
!   addr = XEXP (ops[1], 0);
!   gcc_assert (GET_CODE (addr) != AND);
! 
!   if (!address_needs_split (ops[1]))
!     {
!       ops[1] = change_address (ops[1], TImode, addr);
!       load = gen_reg_rtx (TImode);
!       emit_insn (gen__movti (load, ops[1]));
!       spu_convert_move (ops[0], load);
!       return 1;
!     }
! 
!   rot_amt = GET_MODE_SIZE (mode) < 4 ? GET_MODE_SIZE (mode) - 4 : 0;
! 
!   load = gen_reg_rtx (TImode);
!   rot = spu_expand_load (load, 0, ops[1], rot_amt);
  
    if (rot)
      emit_insn (gen_rotqby_ti (load, load, rot));
  
!   spu_convert_move (ops[0], load);
!   return 1;
  }
  
! int
  spu_split_store (rtx * ops)
  {
    enum machine_mode mode = GET_MODE (ops[0]);
!   rtx reg;
    rtx addr, p0, p1, p1_lo, smem;
    int aform;
    int scalar;
  
+   if (GET_MODE_SIZE (mode) >= 16)
+     return 0;
+ 
    addr = XEXP (ops[0], 0);
+   gcc_assert (GET_CODE (addr) != AND);
+ 
+   if (!address_needs_split (ops[0]))
+     {
+       reg = gen_reg_rtx (TImode);
+       emit_insn (gen_spu_convert (reg, ops[1]));
+       ops[0] = change_address (ops[0], TImode, addr);
+       emit_move_insn (ops[0], reg);
+       return 1;
+     }
  
    if (GET_CODE (addr) == PLUS)
      {
*************** spu_split_store (rtx * ops)
*** 4455,4473 ****
           unaligned reg + aligned reg     => lqx, c?x, shuf, stqx
           unaligned reg + unaligned reg   => lqx, c?x, shuf, stqx
           unaligned reg + aligned const   => lqd, c?d, shuf, stqx
!          unaligned reg + unaligned const -> not allowed by legitimate address
         */
        aform = 0;
        p0 = XEXP (addr, 0);
        p1 = p1_lo = XEXP (addr, 1);
!       if (GET_CODE (p0) == REG && GET_CODE (p1) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (p1) & 15);
! 	  p1 = GEN_INT (INTVAL (p1) & -16);
! 	  addr = gen_rtx_PLUS (SImode, p0, p1);
  	}
      }
!   else if (GET_CODE (addr) == REG)
      {
        aform = 0;
        p0 = addr;
--- 4525,4555 ----
           unaligned reg + aligned reg     => lqx, c?x, shuf, stqx
           unaligned reg + unaligned reg   => lqx, c?x, shuf, stqx
           unaligned reg + aligned const   => lqd, c?d, shuf, stqx
!          unaligned reg + unaligned const -> lqx, c?d, shuf, stqx
         */
        aform = 0;
        p0 = XEXP (addr, 0);
        p1 = p1_lo = XEXP (addr, 1);
!       if (REG_P (p0) && GET_CODE (p1) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (p1) & 15);
! 	  if (reg_aligned_for_addr (p0))
! 	    {
! 	      p1 = GEN_INT (INTVAL (p1) & -16);
! 	      if (p1 == const0_rtx)
! 		addr = p0;
! 	      else
! 		addr = gen_rtx_PLUS (SImode, p0, p1);
! 	    }
! 	  else
! 	    {
! 	      rtx x = gen_reg_rtx (SImode);
! 	      emit_move_insn (x, p1);
! 	      addr = gen_rtx_PLUS (SImode, p0, x);
! 	    }
  	}
      }
!   else if (REG_P (addr))
      {
        aform = 0;
        p0 = addr;
*************** spu_split_store (rtx * ops)
*** 4481,4511 ****
        p1_lo = addr;
        if (ALIGNED_SYMBOL_REF_P (addr))
  	p1_lo = const0_rtx;
!       else if (GET_CODE (addr) == CONST)
  	{
! 	  if (GET_CODE (XEXP (addr, 0)) == PLUS
! 	      && ALIGNED_SYMBOL_REF_P (XEXP (XEXP (addr, 0), 0))
! 	      && GET_CODE (XEXP (XEXP (addr, 0), 1)) == CONST_INT)
! 	    {
! 	      HOST_WIDE_INT v = INTVAL (XEXP (XEXP (addr, 0), 1));
! 	      if ((v & -16) != 0)
! 		addr = gen_rtx_CONST (Pmode,
! 				      gen_rtx_PLUS (Pmode,
! 						    XEXP (XEXP (addr, 0), 0),
! 						    GEN_INT (v & -16)));
! 	      else
! 		addr = XEXP (XEXP (addr, 0), 0);
! 	      p1_lo = GEN_INT (v & 15);
! 	    }
  	}
        else if (GET_CODE (addr) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (addr) & 15);
  	  addr = GEN_INT (INTVAL (addr) & -16);
  	}
      }
  
!   addr = gen_rtx_AND (SImode, copy_rtx (addr), GEN_INT (-16));
  
    scalar = store_with_one_insn_p (ops[0]);
    if (!scalar)
--- 4563,4596 ----
        p1_lo = addr;
        if (ALIGNED_SYMBOL_REF_P (addr))
  	p1_lo = const0_rtx;
!       else if (GET_CODE (addr) == CONST
! 	       && GET_CODE (XEXP (addr, 0)) == PLUS
! 	       && ALIGNED_SYMBOL_REF_P (XEXP (XEXP (addr, 0), 0))
! 	       && GET_CODE (XEXP (XEXP (addr, 0), 1)) == CONST_INT)
  	{
! 	  HOST_WIDE_INT v = INTVAL (XEXP (XEXP (addr, 0), 1));
! 	  if ((v & -16) != 0)
! 	    addr = gen_rtx_CONST (Pmode,
! 				  gen_rtx_PLUS (Pmode,
! 						XEXP (XEXP (addr, 0), 0),
! 						GEN_INT (v & -16)));
! 	  else
! 	    addr = XEXP (XEXP (addr, 0), 0);
! 	  p1_lo = GEN_INT (v & 15);
  	}
        else if (GET_CODE (addr) == CONST_INT)
  	{
  	  p1_lo = GEN_INT (INTVAL (addr) & 15);
  	  addr = GEN_INT (INTVAL (addr) & -16);
  	}
+       else
+ 	{
+ 	  p1_lo = gen_reg_rtx (SImode);
+ 	  emit_move_insn (p1_lo, addr);
+ 	}
      }
  
!   reg = gen_reg_rtx (TImode);
  
    scalar = store_with_one_insn_p (ops[0]);
    if (!scalar)
*************** spu_split_store (rtx * ops)
*** 4515,4525 ****
           possible, and copying the flags will prevent that in certain
           cases, e.g. consider the volatile flag. */
  
        rtx lmem = change_address (ops[0], TImode, copy_rtx (addr));
        set_mem_alias_set (lmem, 0);
        emit_insn (gen_movti (reg, lmem));
  
!       if (!p0 || regno_aligned_for_load (REGNO (p0)))
  	p0 = stack_pointer_rtx;
        if (!p1_lo)
  	p1_lo = const0_rtx;
--- 4600,4611 ----
           possible, and copying the flags will prevent that in certain
           cases, e.g. consider the volatile flag. */
  
+       rtx pat = gen_reg_rtx (TImode);
        rtx lmem = change_address (ops[0], TImode, copy_rtx (addr));
        set_mem_alias_set (lmem, 0);
        emit_insn (gen_movti (reg, lmem));
  
!       if (!p0 || reg_aligned_for_addr (p0))
  	p0 = stack_pointer_rtx;
        if (!p1_lo)
  	p1_lo = const0_rtx;
*************** spu_split_store (rtx * ops)
*** 4527,4543 ****
        emit_insn (gen_cpat (pat, p0, p1_lo, GEN_INT (GET_MODE_SIZE (mode))));
        emit_insn (gen_shufb (reg, ops[1], reg, pat));
      }
-   else if (reload_completed)
-     {
-       if (GET_CODE (ops[1]) == REG)
- 	emit_move_insn (reg, gen_rtx_REG (GET_MODE (reg), REGNO (ops[1])));
-       else if (GET_CODE (ops[1]) == SUBREG)
- 	emit_move_insn (reg,
- 			gen_rtx_REG (GET_MODE (reg),
- 				     REGNO (SUBREG_REG (ops[1]))));
-       else
- 	abort ();
-     }
    else
      {
        if (GET_CODE (ops[1]) == REG)
--- 4613,4618 ----
*************** spu_split_store (rtx * ops)
*** 4549,4563 ****
      }
  
    if (GET_MODE_SIZE (mode) < 4 && scalar)
!     emit_insn (gen_shlqby_ti
! 	       (reg, reg, GEN_INT (4 - GET_MODE_SIZE (mode))));
  
!   smem = change_address (ops[0], TImode, addr);
    /* We can't use the previous alias set because the memory has changed
       size and can potentially overlap objects of other types.  */
    set_mem_alias_set (smem, 0);
  
    emit_insn (gen_movti (smem, reg));
  }
  
  /* Return TRUE if X is MEM which is a struct member reference
--- 4624,4639 ----
      }
  
    if (GET_MODE_SIZE (mode) < 4 && scalar)
!     emit_insn (gen_ashlti3
! 	       (reg, reg, GEN_INT (32 - GET_MODE_BITSIZE (mode))));
  
!   smem = change_address (ops[0], TImode, copy_rtx (addr));
    /* We can't use the previous alias set because the memory has changed
       size and can potentially overlap objects of other types.  */
    set_mem_alias_set (smem, 0);
  
    emit_insn (gen_movti (smem, reg));
+   return 1;
  }
  
  /* Return TRUE if X is MEM which is a struct member reference
*************** fix_range (const char *const_str)
*** 4656,4692 ****
      }
  }
  
- int
- spu_valid_move (rtx * ops)
- {
-   enum machine_mode mode = GET_MODE (ops[0]);
-   if (!register_operand (ops[0], mode) && !register_operand (ops[1], mode))
-     return 0;
- 
-   /* init_expr_once tries to recog against load and store insns to set
-      the direct_load[] and direct_store[] arrays.  We always want to
-      consider those loads and stores valid.  init_expr_once is called in
-      the context of a dummy function which does not have a decl. */
-   if (cfun->decl == 0)
-     return 1;
- 
-   /* Don't allows loads/stores which would require more than 1 insn.
-      During and after reload we assume loads and stores only take 1
-      insn. */
-   if (GET_MODE_SIZE (mode) < 16 && !reload_in_progress && !reload_completed)
-     {
-       if (GET_CODE (ops[0]) == MEM
- 	  && (GET_MODE_SIZE (mode) < 4
- 	      || !(store_with_one_insn_p (ops[0])
- 		   || mem_is_padded_component_ref (ops[0]))))
- 	return 0;
-       if (GET_CODE (ops[1]) == MEM
- 	  && (GET_MODE_SIZE (mode) < 4 || !aligned_mem_p (ops[1])))
- 	return 0;
-     }
-   return 1;
- }
- 
  /* Return TRUE if x is a CONST_INT, CONST_DOUBLE or CONST_VECTOR that
     can be generated using the fsmbi instruction. */
  int
--- 4732,4737 ----
*************** spu_sms_res_mii (struct ddg *g)
*** 6400,6411 ****
  
  void
  spu_init_expanders (void)
! {   
!   /* HARD_FRAME_REGISTER is only 128 bit aligned when
!    * frame_pointer_needed is true.  We don't know that until we're
!    * expanding the prologue. */
    if (cfun)
!     REGNO_POINTER_ALIGN (HARD_FRAME_POINTER_REGNUM) = 8;
  }
  
  static enum machine_mode
--- 6445,6469 ----
  
  void
  spu_init_expanders (void)
! {
    if (cfun)
!     {
!       rtx r0, r1;
!       /* HARD_FRAME_REGISTER is only 128 bit aligned when
!          frame_pointer_needed is true.  We don't know that until we're
!          expanding the prologue. */
!       REGNO_POINTER_ALIGN (HARD_FRAME_POINTER_REGNUM) = 8;
! 
!       /* A number of passes use LAST_VIRTUAL_REGISTER+1 and
! 	 LAST_VIRTUAL_REGISTER+2 to test the back-end.  We want them
! 	 to be treated as aligned, so generate them here. */
!       r0 = gen_reg_rtx (SImode);
!       r1 = gen_reg_rtx (SImode);
!       mark_reg_pointer (r0, 128);
!       mark_reg_pointer (r1, 128);
!       gcc_assert (REGNO (r0) == LAST_VIRTUAL_REGISTER + 1
! 		  && REGNO (r1) == LAST_VIRTUAL_REGISTER + 2);
!     }
  }
  
  static enum machine_mode
*************** spu_gen_exp2 (enum machine_mode mode, rt
*** 6480,6483 ****
--- 6538,6557 ----
      }
  }
  
+ /* After reload, just change the convert into a move instruction
+    or a dead instruction. */
+ void
+ spu_split_convert (rtx ops[])
+ {
+   if (REGNO (ops[0]) == REGNO (ops[1]))
+     emit_note (NOTE_INSN_DELETED);
+   else
+     {
+       /* Use TImode always as this might help hard reg copyprop.  */
+       rtx op0 = gen_rtx_REG (TImode, REGNO (ops[0]));
+       rtx op1 = gen_rtx_REG (TImode, REGNO (ops[1]));
+       emit_insn (gen_move_insn (op0, op1));
+     }
+ }
+ 
  #include "gt-spu.h"
Index: gcc/config/spu/spu.md
===================================================================
*** gcc/config/spu/spu.md	(revision 147813)
--- gcc/config/spu/spu.md	(revision 147814)
*************** (define_mode_iterator MOV [QI V16QI
*** 178,183 ****
--- 178,185 ----
                          SF V4SF
                          DF V2DF])
  
+ (define_mode_iterator QHSI  [QI HI SI])
+ (define_mode_iterator QHSDI  [QI HI SI DI])
  (define_mode_iterator DTI  [DI TI])
  
  (define_mode_iterator VINT [QI V16QI
*************** (define_insn "load_pic_offset"
*** 316,324 ****
  ;; move internal
  
  (define_insn "_mov<mode>"
!   [(set (match_operand:MOV 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:MOV 1 "spu_mov_operand" "r,A,f,j,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%s1\t%0,%S1
--- 318,327 ----
  ;; move internal
  
  (define_insn "_mov<mode>"
!   [(set (match_operand:MOV 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:MOV 1 "spu_mov_operand" "r,A,f,j,m,r"))]
!   "register_operand(operands[0], <MODE>mode)
!    || register_operand(operands[1], <MODE>mode)"
    "@
     ori\t%0,%1,0
     il%s1\t%0,%S1
*************** (define_insn "low_<mode>"
*** 336,344 ****
    "iohl\t%0,%2@l")
  
  (define_insn "_movdi"
!   [(set (match_operand:DI 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:DI 1 "spu_mov_operand" "r,a,f,k,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%d1\t%0,%D1
--- 339,348 ----
    "iohl\t%0,%2@l")
  
  (define_insn "_movdi"
!   [(set (match_operand:DI 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:DI 1 "spu_mov_operand" "r,a,f,k,m,r"))]
!   "register_operand(operands[0], DImode)
!    || register_operand(operands[1], DImode)"
    "@
     ori\t%0,%1,0
     il%d1\t%0,%D1
*************** (define_insn "_movdi"
*** 349,357 ****
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
  (define_insn "_movti"
!   [(set (match_operand:TI 0 "spu_nonimm_operand" "=r,r,r,r,r,m")
  	(match_operand:TI 1 "spu_mov_operand" "r,U,f,l,m,r"))]
!   "spu_valid_move (operands)"
    "@
     ori\t%0,%1,0
     il%t1\t%0,%T1
--- 353,362 ----
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
  (define_insn "_movti"
!   [(set (match_operand:TI 0 "spu_dest_operand" "=r,r,r,r,r,m")
  	(match_operand:TI 1 "spu_mov_operand" "r,U,f,l,m,r"))]
!   "register_operand(operands[0], TImode)
!    || register_operand(operands[1], TImode)"
    "@
     ori\t%0,%1,0
     il%t1\t%0,%T1
*************** (define_insn "_movti"
*** 361,390 ****
     stq%p0\t%1,%0"
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
! (define_insn_and_split "load"
!   [(set (match_operand 0 "spu_reg_operand" "=r")
! 	(match_operand 1 "memory_operand" "m"))
!    (clobber (match_operand:TI 2 "spu_reg_operand" "=&r"))
!    (clobber (match_operand:SI 3 "spu_reg_operand" "=&r"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1])"
!   "#"
!   ""
    [(set (match_dup 0)
  	(match_dup 1))]
!   { spu_split_load(operands); DONE; })
  
! (define_insn_and_split "store"
!   [(set (match_operand 0 "memory_operand" "=m")
! 	(match_operand 1 "spu_reg_operand" "r"))
!    (clobber (match_operand:TI 2 "spu_reg_operand" "=&r"))
!    (clobber (match_operand:TI 3 "spu_reg_operand" "=&r"))]
!   "GET_MODE(operands[0]) == GET_MODE(operands[1])"
!   "#"
!   ""
    [(set (match_dup 0)
  	(match_dup 1))]
!   { spu_split_store(operands); DONE; })
! 
  ;; Operand 3 is the number of bytes. 1:b 2:h 4:w 8:d
  
  (define_expand "cpat"
--- 366,394 ----
     stq%p0\t%1,%0"
    [(set_attr "type" "fx2,fx2,shuf,shuf,load,store")])
  
! (define_split
!   [(set (match_operand 0 "spu_reg_operand")
! 	(match_operand 1 "memory_operand"))]
!   "GET_MODE_SIZE (GET_MODE (operands[0])) < 16
!    && GET_MODE(operands[0]) == GET_MODE(operands[1])
!    && !reload_in_progress && !reload_completed" 
    [(set (match_dup 0)
  	(match_dup 1))]
!   { if (spu_split_load(operands))
!       DONE;
!   })
  
! (define_split
!   [(set (match_operand 0 "memory_operand")
! 	(match_operand 1 "spu_reg_operand"))]
!   "GET_MODE_SIZE (GET_MODE (operands[0])) < 16
!    && GET_MODE(operands[0]) == GET_MODE(operands[1])
!    && !reload_in_progress && !reload_completed" 
    [(set (match_dup 0)
  	(match_dup 1))]
!   { if (spu_split_store(operands))
!       DONE;
!   })
  ;; Operand 3 is the number of bytes. 1:b 2:h 4:w 8:d
  
  (define_expand "cpat"
*************** (define_insn "xswd"
*** 462,494 ****
    ""
    "xswd\t%0,%1");
  
! (define_expand "extendqiti2"
!   [(set (match_operand:TI 0 "register_operand" "")
! 	(sign_extend:TI (match_operand:QI 1 "register_operand" "")))]
!   ""
!   "spu_expand_sign_extend(operands);
!    DONE;")
! 
! (define_expand "extendhiti2"
    [(set (match_operand:TI 0 "register_operand" "")
! 	(sign_extend:TI (match_operand:HI 1 "register_operand" "")))]
    ""
!   "spu_expand_sign_extend(operands);
!    DONE;")
! 
! (define_expand "extendsiti2"
!   [(set (match_operand:TI 0 "register_operand" "")
! 	(sign_extend:TI (match_operand:SI 1 "register_operand" "")))]
!   ""
!   "spu_expand_sign_extend(operands);
!    DONE;")
! 
! (define_expand "extendditi2"
!   [(set (match_operand:TI 0 "register_operand" "")
! 	(sign_extend:TI (match_operand:DI 1 "register_operand" "")))]
    ""
!   "spu_expand_sign_extend(operands);
!    DONE;")
  
  \f
  ;; zero_extend
--- 466,485 ----
    ""
    "xswd\t%0,%1");
  
! ;; By splitting this late we don't allow much opportunity for sharing of
! ;; constants.  That's ok because this should really be optimized away.
! (define_insn_and_split "extend<mode>ti2"
    [(set (match_operand:TI 0 "register_operand" "")
! 	(sign_extend:TI (match_operand:QHSDI 1 "register_operand" "")))]
    ""
!   "#"
    ""
!   [(set (match_dup:TI 0)
! 	(sign_extend:TI (match_dup:QHSDI 1)))]
!   {
!     spu_expand_sign_extend(operands);
!     DONE;
!   })
  
  \f
  ;; zero_extend
*************** (define_insn "zero_extendsidi2"
*** 525,530 ****
--- 516,537 ----
    "rotqmbyi\t%0,%1,-4"
    [(set_attr "type" "shuf")])
  
+ (define_insn "zero_extendqiti2"
+   [(set (match_operand:TI 0 "spu_reg_operand" "=r")
+ 	(zero_extend:TI (match_operand:QI 1 "spu_reg_operand" "r")))]
+   ""
+   "andi\t%0,%1,0x00ff\;rotqmbyi\t%0,%0,-12"
+   [(set_attr "type" "multi0")
+    (set_attr "length" "8")])
+ 
+ (define_insn "zero_extendhiti2"
+   [(set (match_operand:TI 0 "spu_reg_operand" "=r")
+ 	(zero_extend:TI (match_operand:HI 1 "spu_reg_operand" "r")))]
+   ""
+   "shli\t%0,%1,16\;rotqmbyi\t%0,%0,-14"
+   [(set_attr "type" "multi1")
+    (set_attr "length" "8")])
+ 
  (define_insn "zero_extendsiti2"
    [(set (match_operand:TI 0 "spu_reg_operand" "=r")
  	(zero_extend:TI (match_operand:SI 1 "spu_reg_operand" "r")))]
*************** (define_insn_and_split "<v>lshr<mode>3"
*** 2348,2353 ****
--- 2355,2367 ----
    ""
    [(set_attr "type" "*,fx3")])
    
+ (define_insn "<v>lshr<mode>3_imm"
+   [(set (match_operand:VHSI 0 "spu_reg_operand" "=r")
+ 	(lshiftrt:VHSI (match_operand:VHSI 1 "spu_reg_operand" "r")
+ 		       (match_operand:VHSI 2 "immediate_operand" "W")))]
+   ""
+   "rot<bh>mi\t%0,%1,-%<umask>2"
+   [(set_attr "type" "fx3")])
  
  (define_insn "rotm_<mode>"
    [(set (match_operand:VHSI 0 "spu_reg_operand" "=r,r")
*************** (define_insn "rotm_<mode>"
*** 2359,2447 ****
     rot<bh>mi\t%0,%1,-%<nmask>2"
    [(set_attr "type" "fx3")])
   
! (define_expand "lshr<mode>3"
!   [(parallel [(set (match_operand:DTI 0 "spu_reg_operand" "")
! 		   (lshiftrt:DTI (match_operand:DTI 1 "spu_reg_operand" "")
! 			         (match_operand:SI 2 "spu_nonmem_operand" "")))
! 	      (clobber (match_dup:DTI 3))
! 	      (clobber (match_dup:SI 4))
! 	      (clobber (match_dup:SI 5))])]
!   ""
!   "if (GET_CODE (operands[2]) == CONST_INT)
!     {
!       emit_insn (gen_lshr<mode>3_imm(operands[0], operands[1], operands[2]));
!       DONE;
!     }
!    operands[3] = gen_reg_rtx (<MODE>mode);
!    operands[4] = gen_reg_rtx (SImode);
!    operands[5] = gen_reg_rtx (SImode);")
! 
! (define_insn_and_split "lshr<mode>3_imm"
!   [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
! 	(lshiftrt:DTI (match_operand:DTI 1 "spu_reg_operand" "r,r")
! 		      (match_operand:SI 2 "immediate_operand" "O,P")))]
    ""
    "@
     rotqmbyi\t%0,%1,-%h2
     rotqmbii\t%0,%1,-%e2"
!   "!satisfies_constraint_O (operands[2]) && !satisfies_constraint_P (operands[2])"
!   [(set (match_dup:DTI 0)
  	(lshiftrt:DTI (match_dup:DTI 1)
  		      (match_dup:SI 4)))
     (set (match_dup:DTI 0)
! 	(lshiftrt:DTI (match_dup:DTI 0)
  		      (match_dup:SI 5)))]
    {
!     HOST_WIDE_INT val = INTVAL(operands[2]);
!     operands[4] = GEN_INT (val&7);
!     operands[5] = GEN_INT (val&-8);
    }
!   [(set_attr "type" "shuf,shuf")])
! 
! (define_insn_and_split "lshr<mode>3_reg"
!   [(set (match_operand:DTI 0 "spu_reg_operand" "=r")
! 	(lshiftrt:DTI (match_operand:DTI 1 "spu_reg_operand" "r")
! 		      (match_operand:SI 2 "spu_reg_operand" "r")))
!    (clobber (match_operand:DTI 3 "spu_reg_operand" "=&r"))
!    (clobber (match_operand:SI 4 "spu_reg_operand" "=&r"))
!    (clobber (match_operand:SI 5 "spu_reg_operand" "=&r"))]
!   ""
!   "#"
!   ""
!   [(set (match_dup:DTI 3)
! 	(lshiftrt:DTI (match_dup:DTI 1)
! 		     (and:SI (neg:SI (match_dup:SI 4))
! 			     (const_int 7))))
!    (set (match_dup:DTI 0)
! 	(lshiftrt:DTI (match_dup:DTI 3)
! 		     (and:SI (neg:SI (and:SI (match_dup:SI 5)
! 					     (const_int -8)))
! 			     (const_int -8))))]
!   {
!     emit_insn (gen_subsi3(operands[4], GEN_INT(0), operands[2]));
!     emit_insn (gen_subsi3(operands[5], GEN_INT(7), operands[2]));
!   })
  
! (define_insn_and_split "shrqbybi_<mode>"
    [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
  	(lshiftrt:DTI (match_operand:DTI 1 "spu_reg_operand" "r,r")
! 		      (and:SI (match_operand:SI 2 "spu_nonmem_operand" "r,I")
! 			      (const_int -8))))
!    (clobber (match_scratch:SI 3 "=&r,X"))]
!   ""
!   "#"
!   "reload_completed"
!   [(set (match_dup:DTI 0)
! 	(lshiftrt:DTI (match_dup:DTI 1)
! 		      (and:SI (neg:SI (and:SI (match_dup:SI 3) (const_int -8)))
  			      (const_int -8))))]
    {
      if (GET_CODE (operands[2]) == CONST_INT)
!       operands[3] = GEN_INT (7 - INTVAL (operands[2]));
      else
!       emit_insn (gen_subsi3 (operands[3], GEN_INT (7), operands[2]));
!   }
!   [(set_attr "type" "shuf")])
  
  (define_insn "rotqmbybi_<mode>"
    [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
--- 2373,2431 ----
     rot<bh>mi\t%0,%1,-%<nmask>2"
    [(set_attr "type" "fx3")])
   
! (define_insn_and_split "lshr<mode>3"
!   [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r,r")
! 	(lshiftrt:DTI (match_operand:DTI 1 "spu_reg_operand" "r,r,r")
! 		      (match_operand:SI 2 "spu_nonmem_operand" "r,O,P")))]
    ""
    "@
+    #
     rotqmbyi\t%0,%1,-%h2
     rotqmbii\t%0,%1,-%e2"
!   "REG_P (operands[2]) || (!satisfies_constraint_O (operands[2]) && !satisfies_constraint_P (operands[2]))"
!   [(set (match_dup:DTI 3)
  	(lshiftrt:DTI (match_dup:DTI 1)
  		      (match_dup:SI 4)))
     (set (match_dup:DTI 0)
! 	(lshiftrt:DTI (match_dup:DTI 3)
  		      (match_dup:SI 5)))]
    {
!     operands[3] = gen_reg_rtx (<MODE>mode);
!     if (GET_CODE (operands[2]) == CONST_INT)
!       {
! 	HOST_WIDE_INT val = INTVAL(operands[2]);
! 	operands[4] = GEN_INT (val & 7);
! 	operands[5] = GEN_INT (val & -8);
!       }
!     else
!       {
!         rtx t0 = gen_reg_rtx (SImode);
!         rtx t1 = gen_reg_rtx (SImode);
! 	emit_insn (gen_subsi3(t0, GEN_INT(0), operands[2]));
! 	emit_insn (gen_subsi3(t1, GEN_INT(7), operands[2]));
!         operands[4] = gen_rtx_AND (SImode, gen_rtx_NEG (SImode, t0), GEN_INT (7));
!         operands[5] = gen_rtx_AND (SImode, gen_rtx_NEG (SImode, gen_rtx_AND (SImode, t1, GEN_INT (-8))), GEN_INT (-8));
!       }
    }
!   [(set_attr "type" "*,shuf,shuf")])
  
! (define_expand "shrqbybi_<mode>"
    [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
  	(lshiftrt:DTI (match_operand:DTI 1 "spu_reg_operand" "r,r")
! 		      (and:SI (neg:SI (and:SI (match_operand:SI 2 "spu_nonmem_operand" "r,I")
! 					      (const_int -8)))
  			      (const_int -8))))]
+   ""
    {
      if (GET_CODE (operands[2]) == CONST_INT)
!       operands[2] = GEN_INT (7 - INTVAL (operands[2]));
      else
!       {
!         rtx t0 = gen_reg_rtx (SImode);
! 	emit_insn (gen_subsi3 (t0, GEN_INT (7), operands[2]));
!         operands[2] = t0;
!       }
!   })
  
  (define_insn "rotqmbybi_<mode>"
    [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
*************** (define_insn "rotqmbi_<mode>"
*** 2486,2510 ****
     rotqmbii\t%0,%1,-%E2"
    [(set_attr "type" "shuf")])
  
! (define_insn_and_split "shrqby_<mode>"
    [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
  	(lshiftrt:DTI (match_operand:DTI 1 "spu_reg_operand" "r,r")
! 		      (mult:SI (match_operand:SI 2 "spu_nonmem_operand" "r,I")
! 			       (const_int 8))))
!    (clobber (match_scratch:SI 3 "=&r,X"))]
    ""
-   "#"
-   "reload_completed"
-   [(set (match_dup:DTI 0)
- 	(lshiftrt:DTI (match_dup:DTI 1)
- 		      (mult:SI (neg:SI (match_dup:SI 3)) (const_int 8))))]
    {
      if (GET_CODE (operands[2]) == CONST_INT)
!       operands[3] = GEN_INT (-INTVAL (operands[2]));
      else
!       emit_insn (gen_subsi3 (operands[3], GEN_INT (0), operands[2]));
!   }
!   [(set_attr "type" "shuf")])
  
  (define_insn "rotqmby_<mode>"
    [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
--- 2470,2491 ----
     rotqmbii\t%0,%1,-%E2"
    [(set_attr "type" "shuf")])
  
! (define_expand "shrqby_<mode>"
    [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
  	(lshiftrt:DTI (match_operand:DTI 1 "spu_reg_operand" "r,r")
! 		      (mult:SI (neg:SI (match_operand:SI 2 "spu_nonmem_operand" "r,I"))
! 			       (const_int 8))))]
    ""
    {
      if (GET_CODE (operands[2]) == CONST_INT)
!       operands[2] = GEN_INT (-INTVAL (operands[2]));
      else
!       {
!         rtx t0 = gen_reg_rtx (SImode);
! 	emit_insn (gen_subsi3 (t0, GEN_INT (0), operands[2]));
!         operands[2] = t0;
!       }
!   })
  
  (define_insn "rotqmby_<mode>"
    [(set (match_operand:DTI 0 "spu_reg_operand" "=r,r")
*************** (define_insn_and_split "<v>ashr<mode>3"
*** 2538,2543 ****
--- 2519,2532 ----
    ""
    [(set_attr "type" "*,fx3")])
    
+ (define_insn "<v>ashr<mode>3_imm"
+   [(set (match_operand:VHSI 0 "spu_reg_operand" "=r")
+ 	(ashiftrt:VHSI (match_operand:VHSI 1 "spu_reg_operand" "r")
+ 		       (match_operand:VHSI 2 "immediate_operand" "W")))]
+   ""
+   "rotma<bh>i\t%0,%1,-%<umask>2"
+   [(set_attr "type" "fx3")])
+   
  
  (define_insn "rotma_<mode>"
    [(set (match_operand:VHSI 0 "spu_reg_operand" "=r,r")
*************** (define_insn_and_split "ashrdi3"
*** 2622,2632 ****
    })
  
  
! (define_expand "ashrti3"
!   [(set (match_operand:TI 0 "spu_reg_operand" "")
! 	(ashiftrt:TI (match_operand:TI 1 "spu_reg_operand" "")
! 		     (match_operand:SI 2 "spu_nonmem_operand" "")))]
    ""
    {
      rtx sign_shift = gen_reg_rtx (SImode);
      rtx sign_mask = gen_reg_rtx (TImode);
--- 2611,2626 ----
    })
  
  
! (define_insn_and_split "ashrti3"
!   [(set (match_operand:TI 0 "spu_reg_operand" "=r,r")
! 	(ashiftrt:TI (match_operand:TI 1 "spu_reg_operand" "r,r")
! 		     (match_operand:SI 2 "spu_nonmem_operand" "r,i")))]
!   ""
!   "#"
    ""
+   [(set (match_dup:TI 0)
+ 	(ashiftrt:TI (match_dup:TI 1)
+ 		     (match_dup:SI 2)))]
    {
      rtx sign_shift = gen_reg_rtx (SImode);
      rtx sign_mask = gen_reg_rtx (TImode);
*************** (define_insn "rotqbi_ti"
*** 2711,2743 ****
  
  \f
  ;; struct extract/insert
! ;; We have to handle mem's because GCC will generate invalid SUBREG's
! ;; if it handles them.  We generate better code anyway.
  
  (define_expand "extv"
!   [(set (match_operand 0 "register_operand" "")
! 	(sign_extract (match_operand 1 "register_operand" "")
! 		      (match_operand:SI 2 "const_int_operand" "")
! 		      (match_operand:SI 3 "const_int_operand" "")))]
    ""
!   { spu_expand_extv(operands, 0); DONE; })
  
  (define_expand "extzv"
!   [(set (match_operand 0 "register_operand" "")
! 	(zero_extract (match_operand 1 "register_operand" "")
  			 (match_operand:SI 2 "const_int_operand" "")
  			 (match_operand:SI 3 "const_int_operand" "")))]
    ""
!   { spu_expand_extv(operands, 1); DONE; })
  
  (define_expand "insv"
!   [(set (zero_extract (match_operand 0 "register_operand" "")
  		      (match_operand:SI 1 "const_int_operand" "")
  		      (match_operand:SI 2 "const_int_operand" ""))
  	(match_operand 3 "nonmemory_operand" ""))]
    ""
    { spu_expand_insv(operands); DONE; })
  
  \f
  ;; String/block move insn.
  ;; Argument 0 is the destination
--- 2705,2837 ----
  
  \f
  ;; struct extract/insert
! ;; We handle mem's because GCC will generate invalid SUBREG's
! ;; and inefficient code.
  
  (define_expand "extv"
!   [(set (match_operand:TI 0 "register_operand" "")
! 	(sign_extract:TI (match_operand 1 "nonimmediate_operand" "")
! 			 (match_operand:SI 2 "const_int_operand" "")
! 			 (match_operand:SI 3 "const_int_operand" "")))]
    ""
!   {
!     spu_expand_extv (operands, 0);
!     DONE;
!   })
  
  (define_expand "extzv"
!   [(set (match_operand:TI 0 "register_operand" "")
! 	(zero_extract:TI (match_operand 1 "nonimmediate_operand" "")
  			 (match_operand:SI 2 "const_int_operand" "")
  			 (match_operand:SI 3 "const_int_operand" "")))]
    ""
!   {
!     spu_expand_extv (operands, 1);
!     DONE;
!   })
  
  (define_expand "insv"
!   [(set (zero_extract (match_operand 0 "nonimmediate_operand" "")
  		      (match_operand:SI 1 "const_int_operand" "")
  		      (match_operand:SI 2 "const_int_operand" ""))
  	(match_operand 3 "nonmemory_operand" ""))]
    ""
    { spu_expand_insv(operands); DONE; })
  
+ ;; Simplify a number of patterns that get generated by extv, extzv,
+ ;; insv, and loads.
+ (define_insn_and_split "trunc_shr_ti<mode>"
+   [(set (match_operand:QHSI 0 "spu_reg_operand" "=r")
+         (truncate:QHSI (match_operator:TI 2 "shiftrt_operator" [(match_operand:TI 1 "spu_reg_operand" "0")
+ 								(const_int 96)])))]
+   ""
+   "#"
+   "reload_completed"
+   [(const_int 0)]
+   {
+     spu_split_convert (operands);
+     DONE;
+   }
+   [(set_attr "type" "convert")
+    (set_attr "length" "0")])
+ 
+ (define_insn_and_split "trunc_shr_tidi"
+   [(set (match_operand:DI 0 "spu_reg_operand" "=r")
+         (truncate:DI (match_operator:TI 2 "shiftrt_operator" [(match_operand:TI 1 "spu_reg_operand" "0")
+ 							      (const_int 64)])))]
+   ""
+   "#"
+   "reload_completed"
+   [(const_int 0)]
+   {
+     spu_split_convert (operands);
+     DONE;
+   }
+   [(set_attr "type" "convert")
+    (set_attr "length" "0")])
+ 
+ (define_insn_and_split "shl_ext_<mode>ti"
+   [(set (match_operand:TI 0 "spu_reg_operand" "=r")
+         (ashift:TI (match_operator:TI 2 "extend_operator" [(match_operand:QHSI 1 "spu_reg_operand" "0")])
+ 		   (const_int 96)))]
+   ""
+   "#"
+   "reload_completed"
+   [(const_int 0)]
+   {
+     spu_split_convert (operands);
+     DONE;
+   }
+   [(set_attr "type" "convert")
+    (set_attr "length" "0")])
+ 
+ (define_insn_and_split "shl_ext_diti"
+   [(set (match_operand:TI 0 "spu_reg_operand" "=r")
+         (ashift:TI (match_operator:TI 2 "extend_operator" [(match_operand:DI 1 "spu_reg_operand" "0")])
+ 		   (const_int 64)))]
+   ""
+   "#"
+   "reload_completed"
+   [(const_int 0)]
+   {
+     spu_split_convert (operands);
+     DONE;
+   }
+   [(set_attr "type" "convert")
+    (set_attr "length" "0")])
+ 
+ (define_insn "sext_trunc_lshr_tiqisi"
+   [(set (match_operand:SI 0 "spu_reg_operand" "=r")
+         (sign_extend:SI (truncate:QI (match_operator:TI 2 "shiftrt_operator" [(match_operand:TI 1 "spu_reg_operand" "r")
+ 									      (const_int 120)]))))]
+   ""
+   "rotmai\t%0,%1,-24"
+   [(set_attr "type" "fx3")])
+ 
+ (define_insn "zext_trunc_lshr_tiqisi"
+   [(set (match_operand:SI 0 "spu_reg_operand" "=r")
+         (zero_extend:SI (truncate:QI (match_operator:TI 2 "shiftrt_operator" [(match_operand:TI 1 "spu_reg_operand" "r")
+ 									      (const_int 120)]))))]
+   ""
+   "rotmi\t%0,%1,-24"
+   [(set_attr "type" "fx3")])
+ 
+ (define_insn "sext_trunc_lshr_tihisi"
+   [(set (match_operand:SI 0 "spu_reg_operand" "=r")
+         (sign_extend:SI (truncate:HI (match_operator:TI 2 "shiftrt_operator" [(match_operand:TI 1 "spu_reg_operand" "r")
+ 									      (const_int 112)]))))]
+   ""
+   "rotmai\t%0,%1,-16"
+   [(set_attr "type" "fx3")])
+ 
+ (define_insn "zext_trunc_lshr_tihisi"
+   [(set (match_operand:SI 0 "spu_reg_operand" "=r")
+         (zero_extend:SI (truncate:HI (match_operator:TI 2 "shiftrt_operator" [(match_operand:TI 1 "spu_reg_operand" "r")
+ 									      (const_int 112)]))))]
+   ""
+   "rotmi\t%0,%1,-16"
+   [(set_attr "type" "fx3")])
+ 
  \f
  ;; String/block move insn.
  ;; Argument 0 is the destination
*************** (define_expand "spu_convert"
*** 4303,4323 ****
      DONE;
    })
  
! (define_insn "_spu_convert"
    [(set (match_operand 0 "spu_reg_operand" "=r")
  	(unspec [(match_operand 1 "spu_reg_operand" "0")] UNSPEC_CONVERT))]
-   "operands"
    ""
    [(set_attr "type" "convert")
     (set_attr "length" "0")])
  
- (define_peephole2
-   [(set (match_operand 0 "spu_reg_operand")
- 	(unspec [(match_operand 1 "spu_reg_operand")] UNSPEC_CONVERT))]
-   ""
-   [(use (const_int 0))]
-   "")
- 
  \f
  ;;
  (include "spu-builtins.md")
--- 4397,4416 ----
      DONE;
    })
  
! (define_insn_and_split "_spu_convert"
    [(set (match_operand 0 "spu_reg_operand" "=r")
  	(unspec [(match_operand 1 "spu_reg_operand" "0")] UNSPEC_CONVERT))]
    ""
+   "#"
+   "reload_completed"
+   [(const_int 0)]
+   {
+     spu_split_convert (operands);
+     DONE;
+   }
    [(set_attr "type" "convert")
     (set_attr "length" "0")])
  
  \f
  ;;
  (include "spu-builtins.md")
*************** (define_expand "vec_pack_trunc_v4si"
*** 5186,5193 ****
  }")
  
  (define_insn "stack_protect_set"
!   [(set (match_operand:SI 0 "spu_mem_operand" "=m")
!         (unspec:SI [(match_operand:SI 1 "spu_mem_operand" "m")] UNSPEC_SP_SET))
     (set (match_scratch:SI 2 "=&r") (const_int 0))]
    ""
    "lq%p1\t%2,%1\;stq%p0\t%2,%0\;xor\t%2,%2,%2"
--- 5279,5286 ----
  }")
  
  (define_insn "stack_protect_set"
!   [(set (match_operand:SI 0 "memory_operand" "=m")
!         (unspec:SI [(match_operand:SI 1 "memory_operand" "m")] UNSPEC_SP_SET))
     (set (match_scratch:SI 2 "=&r") (const_int 0))]
    ""
    "lq%p1\t%2,%1\;stq%p0\t%2,%0\;xor\t%2,%2,%2"
*************** (define_insn "stack_protect_set"
*** 5196,5203 ****
  )
  
  (define_expand "stack_protect_test"
!   [(match_operand 0 "spu_mem_operand" "")
!    (match_operand 1 "spu_mem_operand" "")
     (match_operand 2 "" "")]
    ""
  {
--- 5289,5296 ----
  )
  
  (define_expand "stack_protect_test"
!   [(match_operand 0 "memory_operand" "")
!    (match_operand 1 "memory_operand" "")
     (match_operand 2 "" "")]
    ""
  {
*************** (define_expand "stack_protect_test"
*** 5223,5230 ****
  
  (define_insn "stack_protect_test_si"
    [(set (match_operand:SI 0 "spu_reg_operand" "=&r")
!         (unspec:SI [(match_operand:SI 1 "spu_mem_operand" "m")
!                     (match_operand:SI 2 "spu_mem_operand" "m")]
                     UNSPEC_SP_TEST))
     (set (match_scratch:SI 3 "=&r") (const_int 0))]
    ""
--- 5316,5323 ----
  
  (define_insn "stack_protect_test_si"
    [(set (match_operand:SI 0 "spu_reg_operand" "=&r")
!         (unspec:SI [(match_operand:SI 1 "memory_operand" "m")
!                     (match_operand:SI 2 "memory_operand" "m")]
                     UNSPEC_SP_TEST))
     (set (match_scratch:SI 3 "=&r") (const_int 0))]
    ""

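[For readers unfamiliar with the SPU memory model, here is a minimal C sketch, not part of the patch, of what spu_split_store above turns a narrow store into: load the enclosing 16-byte quadword (lq), insert the new bytes (the cpat/shufb pair in the patch), and store the whole quadword back (stq).  The function name and the memcpy-based emulation are made up for exposition; the real code emits rtl, and it also handles cases this sketch ignores.]

#include <stdint.h>
#include <string.h>

/* Illustrative only: emulate the read-modify-write sequence a
   sub-quadword store is split into.  Assumes the stored value does
   not cross a 16-byte boundary.  */
static void
emulated_narrow_store (void *addr, const void *val, size_t size)
{
  uintptr_t base = (uintptr_t) addr & ~(uintptr_t) 15;	/* enclosing qword */
  uint8_t qword[16];

  memcpy (qword, (void *) base, 16);			 /* lq: load 16 bytes   */
  memcpy (qword + ((uintptr_t) addr - base), val, size); /* cpat+shufb: insert  */
  memcpy ((void *) base, qword, 16);			 /* stq: store 16 bytes */
}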
Thread overview: 7+ messages
2008-08-29  0:22 [PATCH, SPU] generated better code for loads and stores Trevor_Smigiel
2008-09-05 21:51 ` Trevor_Smigiel
2008-09-11 11:50 ` Ulrich Weigand
2008-09-12  3:37   ` Trevor_Smigiel
2008-09-12 14:24     ` Ulrich Weigand
2009-05-04 21:15 ` [PATCH] add optional split pass before CSE2 Trevor_Smigiel
2009-05-23  3:37   ` [SPU, PATCH] Split load and store instructions during expand Trevor_Smigiel
