public inbox for gcc-patches@gcc.gnu.org
* [RFA]: Merge stack alignment branch
@ 2008-04-04  6:31 Ye, Joey
From: Ye, Joey @ 2008-04-04  6:31 UTC
  To: GCC Patches; +Cc: Lu, Hongjiu, Guo, Xuepeng, ubizjak

[-- Attachment #1: Type: text/plain, Size: 15386 bytes --]

The STACK branch has been around for a while, and a series of patches
implementing stack alignment for i386/x86_64 has been checked in. The
branch now aligns every stack variable at its required boundary
efficiently, and it introduces zero regressions against current trunk.
Here is the background information and the patch. Comments and feedback
are highly appreciated.

-- BACKGROUND --
Here, we propose a new design to fully support stack alignment while
overcoming the above problems. The new design will
*  Support arbitrary alignment values, including 4, 8, 16, 32...
*  Adjust function stack alignment only when necessary
*  Initially be developed on i386 and x86_64, but be extensible to other
platforms
*  Emit efficient prologue/epilogue code for stack alignment
*  Coexist with special features like dynamic stack allocation (alloca),
nested functions, register parameter passing, PIC code and tail call
optimization, etc.
*  Support debugging and stack unwinding

2.1 Support arbitrary alignment value
Different source code and optimizations require different stack
alignments, as in the following table:
Feature         Alignment (bytes)
i386_ABI        4
x86_64_ABI      16
char            1
short           2
int             4
long            4/8*
long long       8
__m64           8
__m128          16
float           4
double          8
long double     16
user specified  any power of 2

*Note: 4 for i386, 8 for x86_64
The new design will support any alignment value in this table.
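
As a small illustration of the "user specified" row, the sketch below
requests a 32-byte-aligned local buffer and reports whether the address
it received honours that request. The helper names (is_aligned,
local_buffer_is_aligned) are made up for this example and are not part
of the patch; it assumes a compiler that accepts the GCC aligned
attribute.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if P is a multiple of ALIGN bytes.
   ALIGN must be a power of 2.  */
static int
is_aligned (const void *p, uintptr_t align)
{
  return ((uintptr_t) p & (align - 1)) == 0;
}

/* Hypothetical test function: a local that asks for more alignment
   than the i386 psABI guarantees at function entry.  */
static int
local_buffer_is_aligned (void)
{
  char buf[32] __attribute__ ((aligned (32)));

  memset (buf, 0, sizeof buf);	/* Touch the buffer so it is used.  */
  return is_aligned (buf, 32);
}
```

Without stack realignment, a function like this can only get a correctly
aligned buffer if its caller happened to provide a sufficiently aligned
stack.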

2.2 Adjust function stack alignment only when necessary

Current GCC defines the following macros related to stack alignment:
i. STACK_BOUNDARY in bits, which is preferred by hardware: 32 for i386
and 64 for x86_64. It is the minimum stack boundary. It is fixed.
ii. PREFERRED_STACK_BOUNDARY. It sets the stack alignment when calling a
function. It may be set on the command line and has no impact on stack
alignment at function entry. This proposal requires PREFERRED >= STACK,
and by default sets it to ABI_STACK_BOUNDARY.

This design will define a few more macros, or concepts not explicitly
defined in code:
iii. ABI_STACK_BOUNDARY in bits, which is the stack boundary specified
by the psABI: 32 for i386 and 128 for x86_64. ABI_STACK_BOUNDARY >=
STACK_BOUNDARY. It is fixed for a given psABI.
iv. LOCAL_STACK_BOUNDARY in bits. Each function has its own stack
alignment requirement, which depends on the alignment of its stack
variables:
LOCAL_STACK_BOUNDARY = MAX (alignment of each effective stack variable).
v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary at
function entry. If a function is marked with
__attribute__ ((force_align_arg_pointer)) or the -mstackrealign option
is provided, INCOMING == STACK_BOUNDARY. Otherwise,
INCOMING == PREFERRED_STACK_BOUNDARY, because a function is typically
called locally with the same PREFERRED_STACK_BOUNDARY. For functions
whose PREFERRED is larger than ABI, it is the caller's responsibility to
invoke them with the appropriate PREFERRED.
vi. REQUIRED_STACK_ALIGNMENT in bits, which is the stack alignment
required by local variables and by calls to other functions. For a
non-leaf function, REQUIRED_STACK_ALIGNMENT ==
MAX (LOCAL_STACK_BOUNDARY, PREFERRED_STACK_BOUNDARY). For a leaf
function, REQUIRED_STACK_ALIGNMENT ==
MAX (LOCAL_STACK_BOUNDARY, STACK_BOUNDARY).

This proposal won't adjust the stack when INCOMING_STACK_BOUNDARY >=
REQUIRED_STACK_ALIGNMENT. Only when INCOMING_STACK_BOUNDARY <
REQUIRED_STACK_ALIGNMENT, or when the PREFERRED_STACK_BOUNDARY of the
entry function is less than ABI_STACK_BOUNDARY, will it adjust the stack
to REQUIRED_STACK_ALIGNMENT in the prologue.
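
The rule for REQUIRED_STACK_ALIGNMENT and the first realignment
condition can be sketched in C as follows. This is only an illustration
of the definitions above, not code from the patch: the struct and
function names are hypothetical, all values are in bits, and the
additional check against ABI_STACK_BOUNDARY for the entry function is
left out.

```c
/* Hypothetical container for the boundaries defined above, in bits.  */
struct stack_boundaries
{
  unsigned int stack;		/* STACK_BOUNDARY */
  unsigned int preferred;	/* PREFERRED_STACK_BOUNDARY */
  unsigned int incoming;	/* INCOMING_STACK_BOUNDARY */
  unsigned int local;		/* LOCAL_STACK_BOUNDARY */
};

static unsigned int
max_boundary (unsigned int a, unsigned int b)
{
  return a > b ? a : b;
}

/* REQUIRED_STACK_ALIGNMENT as in item vi: a leaf function only has to
   satisfy its own locals and STACK_BOUNDARY, while a non-leaf function
   also has to provide PREFERRED_STACK_BOUNDARY to its callees.  */
static unsigned int
required_stack_alignment (const struct stack_boundaries *b, int is_leaf)
{
  return max_boundary (b->local, is_leaf ? b->stack : b->preferred);
}

/* Nonzero if the prologue has to realign the stack because the
   incoming boundary is smaller than what the function requires.  */
static int
needs_realign (const struct stack_boundaries *b, int is_leaf)
{
  return b->incoming < required_stack_alignment (b, is_leaf);
}
```

For example, with STACK = 32, PREFERRED = 128, INCOMING = 32 and a
128-bit local, the required alignment is 128 and the function needs
realignment; a leaf function whose locals need only 32 bits under the
same settings does not.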

2.3 Initial development on i386 and x86_64
We initially support i386 and x86_64. This document focuses on i386
because its small register file makes it the harder of the two to
implement. Everything we discuss can be easily applied to x86_64.

2.4 Emit more efficient prologue/epilogue
When a function needs to adjust stack alignment and has no dynamic stack
allocation, this design will generate the following example
prologue/epilogue code:
IA32 example Prologue:
        pushl     %ebp
        movl      %esp, %ebp
        andl      $-16, %esp
        subl      $4, %esp ; is $-4 the local stack size?
Epilogue:
        movl      %ebp, %esp
        popl      %ebp
        ret
Locals will be addressed as esp + offset and parameters as ebp + offset.

Add x86_64 example here.

Thus BP points to parameter frame and SP points to local frame.
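
The effect of the andl instruction in the prologue can be modelled in C
as rounding the stack pointer down to a multiple of the alignment. The
function name below is hypothetical and the sketch assumes a 32-bit
stack pointer and a power-of-2 alignment given in bytes.

```c
#include <stdint.h>

/* Round SP down to a multiple of ALIGN bytes, which is what
   "andl $-ALIGN, %esp" does.  ALIGN must be a power of 2.  Rounding
   down is the safe direction because the stack grows downward.  */
static uint32_t
align_stack_pointer (uint32_t sp, uint32_t align)
{
  return sp & ~(align - 1u);
}
```

Because the number of bytes skipped by the rounding is not known at
compile time, parameters can no longer be found at a fixed offset from
SP, which is why the example keeps BP pointing at the parameter frame.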

2.5 Coexist with special features
Stack alignment adjustment will coexist with various GCC features
that have special calling conventions and frame layouts, such as dynamic
stack allocation (alloca), nested functions and parameter passing via
registers to local functions.

I386 hard register usage is the major obstacle to making the proposal
friendly to various GCC features. This design requires an additional
hard register in the prologue/epilogue in case of dynamic stack
allocation. The register is called the Dynamic Realigned Argument
Pointer, or DRAP. Because I386 PIC requires BX as the GOT pointer, I386
may use AX, DX and CX as parameter passing registers, and the register
has to work with setjmp/longjmp, there are limited candidates to choose
from. The current proposal uses CX as DRAP if CX is not used to pass a
parameter. If CX is not available, DI will be used because it is
preserved across setjmp/longjmp since it is callee-saved.

X86_64 is much easier. This proposal just chooses R12 as DRAP, which is
also preserved across setjmp/longjmp since it is callee-saved.
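
The selection rule described in the two paragraphs above can be written
down as a small C function. This is only a sketch of the rule as stated
in this text; the enum, the function name and its parameters are
hypothetical and do not reproduce the find_drap_reg function added by
the patch.

```c
/* Hypothetical register names for this sketch only.  */
enum drap_candidate
{
  DRAP_CX,
  DRAP_DI,
  DRAP_R12
};

/* Pick the DRAP register as described above: R12 on x86_64; on i386,
   CX unless CX already carries a parameter, in which case the
   callee-saved DI is used instead.  */
static enum drap_candidate
choose_drap (int is_64bit, int cx_passes_parameter)
{
  if (is_64bit)
    return DRAP_R12;
  return cx_passes_parameter ? DRAP_DI : DRAP_CX;
}
```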

DRAP will be assigned to a virtual register, or VDRAP, in the prologue
so that the DRAP hard register itself is free for the register allocator
in the function body. Usually VDRAP will be allocated to the same
register as DRAP, so the additional register move instruction is often
removed.

2.5.1 When stack alignment adjustment comes together with alloca, the
following example prologue/epilogue will be emitted:
Prologue:
       pushl     %edi                     // Save callee save reg edi
       leal      8(%esp), %edi            // Save address of parameter frame
       andl      $-16, %esp               // Align local stack

//  Reserve two stack slots and save return address
//  and previous frame pointer into them. By
//  pointing new ebp to them, we build a pseudo
//  stack for unwinding.
       pushl     $4(%edi)                 //  save return address
       pushl     %ebp                     //  save old ebp
       movl      %esp, %ebp               //  point ebp to pseudo frame start

       subl      $24, %esp                // adjust local frame size
       movl      %edi, vreg1

Epilogue:
       movl      vreg1, %edi
       movl      %ebp, %esp               // Restore esp to pseudo frame start
       popl      %ebp
       leal      -8(%edi), %esp           // restore esp to real frame start
       popl      %edi                     // Restore edi
       ret

Locals will be addressed as ebp - offset, parameters as vreg1 + offset

Where DI is used to set up the virtual parameter frame pointer, BP
points to the local frame and SP points to the dynamic allocation frame.

2.5.2 Nested functions will automatically work because they use CX as
the static chain pointer, which won't conflict with any registers used
by stack alignment adjustment, even when nested functions are called via
a function pointer and a function stub on the stack.

2.5.3 GCC may optimize to use registers to pass parameters. At most AX,
DX and CX will be used. Such optimization won't conflict with stack
alignment adjustment, so it should automatically work.

2.5.4 I386 PIC uses an available register or EBX as the GOT pointer.
This design works well under i386 PIC. When picking a register for PIC,
we will avoid using the DRAP register.

For example:
i686 Prologue:
        pushl     %edi
        leal      8(%esp), %edi
        andl      $-16, %esp
        pushl     $4(%edi)
        pushl     %ebp
        movl      %esp, %ebp
        subl      $24,  %esp
        call      .L1
.L1:
        popl      %ebx
        movl      %edi, vreg1

Body:  // code for alloca
        movl      (vreg1), %eax
        subl      %eax, %esp
        andl      $-16, %esp
        movl      %esp, %eax

i686 Epilogue:
        movl      %ebp, %esp
        popl      %ebp
        leal      -8(%edi), %esp
        popl      %edi
        ret

Locals will be addressed as ebp - offset, parameters as vreg1 + offset,
ebx has the GOT pointer.
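
The "code for alloca" in the body above subtracts the requested size
from SP and then realigns SP. A C model of that stack pointer
arithmetic, assuming a 32-bit stack pointer and the 16-byte alignment
used in the example, might look like this; the function name is made up
for illustration.

```c
#include <stdint.h>

/* Model of "subl %eax, %esp; andl $-16, %esp": reserve SIZE bytes
   below SP, then round the result down to a 16-byte boundary.  The
   returned value is both the new SP and the address of the block.  */
static uint32_t
alloca_adjust_sp (uint32_t sp, uint32_t size)
{
  return (sp - size) & ~UINT32_C (15);
}
```

Rounding down after the subtraction may reserve up to 15 bytes more
than requested, but it never reserves fewer.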

2.6 Debug and unwind will work since DWARF2 has the flexibility to
define different frame pointers.

2.7 Some intrinsics rely on stack layout and need to be handled
accordingly. They are __builtin_return_address and
__builtin_frame_address. This proposal will set up a pseudo frame slot
to help the unwinder find the return address and parent frame address by
emitting the following prologue code after adjusting alignment:
        pushl     $4(%edi)
        pushl     %ebp
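
For reference, a minimal use of the two builtins from C could look like
the sketch below. It assumes a compiler that provides the GCC builtins,
and the function name is hypothetical; it only checks that both
builtins return something for the current frame (level 0), which is the
case the pseudo frame slot has to keep working.

```c
#include <stddef.h>

/* Hypothetical check: ask for the return address and frame address of
   the current function.  noinline keeps the function in its own frame
   so that level 0 refers to this function.  */
static int __attribute__ ((noinline))
frame_builtins_are_usable (void)
{
  void *ret = __builtin_return_address (0);
  void *frame = __builtin_frame_address (0);

  return ret != NULL && frame != NULL;
}
```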

ChangeLog:
2008-04-04  Uros Bizjak  <ubizjak@gmail.com>
	    H.J. Lu  <hongjiu.lu@intel.com>

	PR target/12329
	* config/i386/i386.c (ix86_function_regparm): Limit the number of
	register passing arguments to 2 for nested functions.

2008-04-04  Joey Ye  <joey.ye@intel.com>
	    H.J. Lu  <hongjiu.lu@intel.com>
	    Xuepeng Guo  <xuepeng.guo@intel.com>

	* builtins.c (expand_builtin_setjmp_receiver): Replace
	virtual_incoming_args_rtx with
	current_function_internal_arg_pointer.
	(expand_builtin_apply_args_1): Likewise.

	* calls.c (expand_call): Don't calculate preferred stack
	boundary according to incoming stack boundary. Replace 
	virtual_incoming_args_rtx with
	current_function_internal_arg_pointer.

	* cfgexpand.c (get_decl_align_unit): Estimate stack variable
	alignment and store to stack_alignment_estimated and
	stack_alignment_used.
	(expand_one_var): Likewise.
	(gate_stack_realign): Gate new pass pass_collect_stackrealign_info
	and pass_handle_drap.
	(collect_stackrealign_info): Execute new pass
	pass_collect_stackrealign_info.
	(pass_collect_stackrealign_info): Define new pass.
	(handle_drap): Execute new pass pass_handle_drap.
	(pass_handle_drap): Define new pass.

	* defaults.h (MAX_VECTORIZE_STACK_ALIGNMENT): New.

	* dojump.c (clear_pending_stack_adjust): Leave a FIXME in
	comments in case pending stack adjustment is discarded when stack
	realign is needed.

	* flags.h (frame_pointer_needed): Removed.
	* final.c (frame_pointer_needed): Likewise.

	* function.c (assign_stack_local_1): Estimate stack variable 
	alignment and store to stack_alignment_estimated.
	(instantiate_new_reg): Instantiate virtual incoming args rtx to
	vDRAP if stack realignment and DRAP is needed.
	(assign_parms): Collect parameter/return type alignment and 
	contribute to stack_alignment_estimated.
	(locate_and_pad_parm): Likewise.
	(allocate_struct_function): Init stack_alignment_estimated and
	stack_alignment_used.
	(get_arg_pointer_save_area): Replace virtual_incoming_args_rtx
	with current_function_internal_arg_pointer.

	* function.h (function): Add drap_reg, stack_alignment_estimated,
	need_frame_pointer, need_frame_pointer_set, stack_realign_needed,
	stack_realign_really, need_drap, save_param_ptr_reg,
	stack_realign_processed, stack_realign_finalized and
	stack_realign_used.
	(frame_pointer_needed): New.
	(stack_realign_fp): Likewise.
	(stack_realign_drap): Likewise.

	* global.c (compute_regsets): Set frame_pointer_needed cannot_elim
	wrt stack_realign_needed.

	* stmt.c (expand_nl_goto_receiver): Replace 
	virtual_incoming_args_rtx with
	current_function_internal_arg_pointer.

	* passes.c (pass_collect_stackrealign_info): Insert this new pass
	immediately before expand.
	(pass_handle_drap): Insert this new pass immediately after expand.

	* tree-inline.c (expand_call_inline): Estimate stack variable
	alignment and store to stack_alignment_estimated.

	* tree-pass.h (pass_handle_drap): New.
	(pass_collect_stackrealign_info): Likewise.

	* tree-vectorizer.c (vect_can_force_dr_alignment_p): Estimate
	stack variable alignment and store to stack_alignment_estimated.

	* reload1.c (set_label_offsets): Assert that frame pointer must be
	eliminated to stack pointer in case stack realignment is estimated
	to happen without DRAP.
	(elimination_effects): Likewise.
	(eliminate_regs_in_insn): Likewise.
	(mark_not_eliminable): Likewise.
	(update_eliminables): Frame pointer is needed in case of stack
	realignment needed.
	(init_elim_table): Don't set frame_pointer_needed here.

	* dwarf2out.c (CUR_FDE): New.
	(reg_save_with_expression): Likewise.
	(dw_fde_struct): Add drap_regnum, stack_realignment,
	is_stack_realign, is_drap and is_drap_reg_saved.
	(add_cfi): If stack is realigned, call reg_save_with_expression
	to represent the location of stored vars.
	(dwarf2out_frame_debug_expr): Add rules 16-19 to handle stack
	realign.
	(output_cfa_loc): Handle DW_CFA_expression.
	(based_loc_descr): Update assert for stack realign.

	* config/i386/i386.c (ix86_force_align_arg_pointer_string): Break
	long line.
	(ix86_user_incoming_stack_boundary): New.
	(ix86_default_incoming_stack_boundary): Likewise.
	(ix86_incoming_stack_boundary): Likewise.
	(find_drap_reg): Likewise.
	(override_options): Override option value for new options.
	(ix86_function_ok_for_sibcall): Sibcall is OK even if stack needs
	realigning.
	(ix86_handle_cconv_attribute): Stack realign no longer impacts
	number of regparm.
	(ix86_function_regparm): Likewise.
	(setup_incoming_varargs_64): Remove the logic to set
	stack_alignment_needed here.
	(ix86_va_start): Replace virtual_incoming_args_rtx with
	current_function_internal_arg_pointer.
	(ix86_save_reg): Replace force_align_arg_pointer with drap_reg.
	(ix86_compute_frame_layout): Compute frame layout wrt stack
	realignment.
	(ix86_internal_arg_pointer): Estimate if stack realignment is
	needed and returns appropriate arg pointer rtx accordingly.
	(ix86_expand_prologue): Finally decide if stack realignment
	is needed and generate prologue code accordingly.
	(ix86_expand_epilogue): Generate epilogue code wrt stack
	realignment is really needed or not.
	* config/i386/i386.c (ix86_select_alt_pic_regnum): Check
	DRAP register.
	
	* config/i386/i386.h (MAIN_STACK_BOUNDARY): New.
	(ABI_STACK_BOUNDARY): Likewise.
	(PREFERRED_STACK_BOUNDARY_DEFAULT): Likewise.
	(STACK_REALIGN_DEFAULT): Likewise.
	(INCOMING_STACK_BOUNDARY): Likewise.
	(MAX_VECTORIZE_STACK_ALIGNMENT): Likewise.
	(ix86_incoming_stack_boundary): Likewise.
	(REAL_PIC_OFFSET_TABLE_REGNUM): Updated to use BX_REG.
	(CAN_ELIMINATE): Redefine the macro to eliminate frame pointer to
	stack pointer and arg pointer to hard frame pointer in case of
	stack realignment without DRAP.
	(machine_function): Remove force_align_arg_pointer.

	* config/i386/i386.md (BX_REG): New.
	(R13_REG): Likewise.

	* config/i386/i386.opt (mforce_drap): New.
	(mincoming-stack-boundary): Likewise.
	(mstackrealign): Updated.

	* doc/extend.texi: Update force_align_arg_pointer.
	* doc/invoke.texi: Document -mincoming-stack-boundary.  Update
	-mstackrealign.

[-- Attachment #2: merge-stack-0404.patch --]
[-- Type: application/octet-stream, Size: 72420 bytes --]

Index: flags.h
===================================================================
--- flags.h	(.../trunk/gcc)	(revision 133813)
+++ flags.h	(.../branches/stack/gcc)	(revision 133869)
@@ -223,12 +223,6 @@
 \f
 /* Other basic status info about current function.  */
 
-/* Nonzero means current function must be given a frame pointer.
-   Set in stmt.c if anything is allocated on the stack there.
-   Set in reload1.c if anything is allocated on the stack there.  */
-
-extern int frame_pointer_needed;
-
 /* Nonzero if subexpressions must be evaluated from left-to-right.  */
 extern int flag_evaluation_order;
 
Index: defaults.h
===================================================================
--- defaults.h	(.../trunk/gcc)	(revision 133813)
+++ defaults.h	(.../branches/stack/gcc)	(revision 133869)
@@ -940,4 +940,8 @@
 #define OUTGOING_REG_PARM_STACK_SPACE 0
 #endif
 
+#ifndef MAX_VECTORIZE_STACK_ALIGNMENT
+#define MAX_VECTORIZE_STACK_ALIGNMENT 0
+#endif
+
 #endif  /* ! GCC_DEFAULTS_H */
Index: tree-pass.h
===================================================================
--- tree-pass.h	(.../trunk/gcc)	(revision 133813)
+++ tree-pass.h	(.../branches/stack/gcc)	(revision 133869)
@@ -473,6 +473,8 @@
 extern struct gimple_opt_pass pass_apply_inline;
 extern struct gimple_opt_pass pass_all_early_optimizations;
 extern struct gimple_opt_pass pass_update_address_taken;
+extern struct gimple_opt_pass pass_collect_stackrealign_info;
+extern struct gimple_opt_pass pass_handle_drap;
 
 /* The root of the compilation pass tree, once constructed.  */
 extern struct opt_pass *all_passes, *all_ipa_passes, *all_lowering_passes;
Index: builtins.c
===================================================================
--- builtins.c	(.../trunk/gcc)	(revision 133813)
+++ builtins.c	(.../branches/stack/gcc)	(revision 133869)
@@ -740,7 +740,7 @@
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (current_function_internal_arg_pointer,
 			  copy_to_reg (get_arg_pointer_save_area ()));
 	}
     }
@@ -1345,7 +1345,7 @@
       }
 
   /* Save the arg pointer to the block.  */
-  tem = copy_to_reg (virtual_incoming_args_rtx);
+  tem = copy_to_reg (current_function_internal_arg_pointer);
 #ifdef STACK_GROWS_DOWNWARD
   /* We need the pointer as the caller actually passed them to us, not
      as we might have pretended they were passed.  Make sure it's a valid
Index: final.c
===================================================================
--- final.c	(.../trunk/gcc)	(revision 133813)
+++ final.c	(.../branches/stack/gcc)	(revision 133869)
@@ -178,12 +178,6 @@
 CC_STATUS cc_prev_status;
 #endif
 
-/* Nonzero means current function must be given a frame pointer.
-   Initialized in function.c to 0.  Set only in reload1.c as per
-   the needs of the function.  */
-
-int frame_pointer_needed;
-
 /* Number of unmatched NOTE_INSN_BLOCK_BEG notes we have seen.  */
 
 static int block_depth;
Index: dojump.c
===================================================================
--- dojump.c	(.../trunk/gcc)	(revision 133813)
+++ dojump.c	(.../branches/stack/gcc)	(revision 133869)
@@ -64,8 +64,11 @@
    so the adjustment won't get done.
 
    Note, if the current function calls alloca, then it must have a
-   frame pointer regardless of the value of flag_omit_frame_pointer.  */
+   frame pointer regardless of the value of flag_omit_frame_pointer.  
 
+   When stack realign is needed, we can't discard pending stack adjustment,
+   in which stack pointer must be restored in epilogue. */
+
 void
 clear_pending_stack_adjust (void)
 {
Index: global.c
===================================================================
--- global.c	(.../trunk/gcc)	(revision 133813)
+++ global.c	(.../branches/stack/gcc)	(revision 133869)
@@ -247,11 +247,21 @@
   static const struct {const int from, to; } eliminables[] = ELIMINABLE_REGS;
   size_t i;
 #endif
+
+  /* FIXME: If EXIT_IGNORE_STACK is set, we will not save and restore
+     sp for alloca.  So we can't eliminate the frame pointer in that
+     case.  At some point, we should improve this by emitting the
+     sp-adjusting insns for this case.  */
   int need_fp
     = (! flag_omit_frame_pointer
        || (current_function_calls_alloca && EXIT_IGNORE_STACK)
-       || FRAME_POINTER_REQUIRED);
+       || FRAME_POINTER_REQUIRED
+       || current_function_accesses_prior_frames
+       || cfun->stack_realign_needed);
 
+  frame_pointer_needed = need_fp;
+  cfun->need_frame_pointer_set = 1;
+
   max_regno = max_reg_num ();
   compact_blocks ();
 
@@ -271,7 +281,10 @@
     {
       bool cannot_elim
 	= (! CAN_ELIMINATE (eliminables[i].from, eliminables[i].to)
-	   || (eliminables[i].to == STACK_POINTER_REGNUM && need_fp));
+	   || (eliminables[i].to == STACK_POINTER_REGNUM
+	       && need_fp 
+	       && (! MAX_VECTORIZE_STACK_ALIGNMENT
+		   || ! stack_realign_fp)));
 
       if (!regs_asm_clobbered[eliminables[i].from])
 	{
Index: dwarf2out.c
===================================================================
--- dwarf2out.c	(.../trunk/gcc)	(revision 133813)
+++ dwarf2out.c	(.../branches/stack/gcc)	(revision 133869)
@@ -110,6 +110,9 @@
 #define DWARF2_FRAME_REG_OUT(REGNO, FOR_EH) (REGNO)
 #endif
 
+/* Define the current fde_table entry we should use. */
+#define CUR_FDE fde_table[fde_table_in_use - 1]
+
 /* Decide whether we want to emit frame unwind information for the current
    translation unit.  */
 
@@ -239,9 +242,18 @@
   bool dw_fde_switched_sections;
   dw_cfi_ref dw_fde_cfi;
   unsigned funcdef_number;
+  /* If it is drap, which register is employed. */
+  int drap_regnum;
+  HOST_WIDE_INT stack_realignment;
   unsigned all_throwers_are_sibcalls : 1;
   unsigned nothrow : 1;
   unsigned uses_eh_lsda : 1;
+  /* Whether we did stack realign in this call frame.*/
+  unsigned is_stack_realign : 1;
+  /* Whether stack realign is drap. */
+  unsigned is_drap : 1;
+  /* Whether we saved this drap register. */
+  unsigned is_drap_reg_saved : 1;
 }
 dw_fde_node;
 
@@ -381,6 +393,7 @@
 static struct dw_loc_descr_struct *build_cfa_loc
   (dw_cfa_location *, HOST_WIDE_INT);
 static void def_cfa_1 (const char *, dw_cfa_location *);
+static void reg_save_with_expression (dw_cfi_ref);
 
 /* How to start an assembler comment.  */
 #ifndef ASM_COMMENT_START
@@ -618,6 +631,13 @@
   for (p = list_head; (*p) != NULL; p = &(*p)->dw_cfi_next)
     ;
 
+  /* If stack is realigned, accessing the stored register via CFA+offset will
+     be invalid. Here we will use a series of expressions in dwarf2 to simulate
+     the stack realign and represent the location of the stored register. */
+  if (fde_table_in_use && (CUR_FDE.is_stack_realign || CUR_FDE.is_drap) 
+      && cfi->dw_cfi_opc == DW_CFA_offset)
+    reg_save_with_expression (cfi);
+
   *p = cfi;
 }
 
@@ -1435,6 +1455,10 @@
   Rules 10-14: Save a register to the stack.  Define offset as the
 	       difference of the original location and cfa_store's
 	       location (or cfa_temp's location if cfa_temp is used).
+  
+  Rules 16-19: If AND operation happens on sp in prologue, we assume stack is
+               realigned. We will use a group of DW_OP_?? expressions to represent
+               the location of the stored register instead of CFA+offset.
 
   The Rules
 
@@ -1529,8 +1553,33 @@
 
   Rule 15:
   (set <reg> {unspec, unspec_volatile})
-  effects: target-dependent  */
+  effects: target-dependent  
+  
+  Rule 16:
+  (set sp (and: sp <const_int>))
+  effects: CUR_FDE.is_stack_realign = 1
+           cfa_store.offset = 0
 
+           if cfa_store.offset >= UNITS_PER_WORD
+             effects: CUR_FDE.is_drap_reg_saved = 1
+
+  Rule 17:
+  (set (mem ({pre_inc, pre_dec} sp)) (mem (plus (cfa.reg) (const_int))))
+  effects: cfa_store.offset += -/+ mode_size(mem)
+  
+  Rule 18:
+  (set (mem({pre_inc, pre_dec} sp)) fp)
+  constraints: CUR_FDE.is_stack_realign == 1
+  effects: CUR_FDE.is_stack_realign = 0
+           CUR_FDE.is_drap = 1
+           CUR_FDE.drap_regnum = cfa.reg
+
+  Rule 19:
+  (set fp sp)
+  constraints: CUR_FDE.is_drap == 1
+  effects: cfa.reg = fp
+           cfa.offset = cfa_store.offset */
+
 static void
 dwarf2out_frame_debug_expr (rtx expr, const char *label)
 {
@@ -1607,7 +1656,20 @@
 	      cfa_temp.reg = cfa.reg;
 	      cfa_temp.offset = cfa.offset;
 	    }
-	  else
+            /* Rule 19 */
+            /* Eachtime when setting FP to SP under the condition of that the stack
+               is realigned we assume the realign is drap and the drap register is
+               the current cfa's register. We update cfa's register to FP. */
+	  else if (fde_table_in_use && CUR_FDE.is_drap 
+                   && REGNO (src) == STACK_POINTER_REGNUM 
+                   && REGNO (dest) == HARD_FRAME_POINTER_REGNUM)
+            {
+              cfa.reg = REGNO (dest);
+              cfa.offset = cfa_store.offset;
+              cfa_temp.reg = cfa.reg;
+              cfa_temp.offset = cfa.offset;
+            }
+          else
 	    {
 	      /* Saving a register in a register.  */
 	      gcc_assert (!fixed_regs [REGNO (dest)]
@@ -1747,6 +1809,22 @@
 	  targetm.dwarf_handle_frame_unspec (label, expr, XINT (src, 1));
 	  return;
 
+	  /* Rule 16 */
+	case AND:
+          /* If this AND operation happens on stack pointer in prologue, we 
+             assume the stack is realigned and we extract the alignment. */
+          if (XEXP (src, 0) == stack_pointer_rtx && fde_table_in_use)
+            {
+              CUR_FDE.is_stack_realign = 1;
+              CUR_FDE.stack_realignment = INTVAL (XEXP (src, 1));
+              /* If we didn't push anything to stack before stack is realigned,
+                  we assume the drap register isn't saved. */
+              if (cfa_store.offset > UNITS_PER_WORD)
+                CUR_FDE.is_drap_reg_saved = 1;
+              cfa_store.offset = 0;
+            }
+          return;
+
 	default:
 	  gcc_unreachable ();
 	}
@@ -1755,7 +1833,6 @@
       break;
 
     case MEM:
-      gcc_assert (REG_P (src));
 
       /* Saving a register to the stack.  Make sure dest is relative to the
 	 CFA register.  */
@@ -1788,6 +1865,17 @@
 
 	  gcc_assert (REGNO (XEXP (XEXP (dest, 0), 0)) == STACK_POINTER_REGNUM
 		      && cfa_store.reg == STACK_POINTER_REGNUM);
+          
+          /* Rule 18 */
+          /* If we push FP after stack is realigned, we assume this realignment
+             is drap, we will recorde the drap register. */
+          if (fde_table_in_use && CUR_FDE.is_stack_realign
+              && REGNO (src) == HARD_FRAME_POINTER_REGNUM)
+            {
+              CUR_FDE.is_stack_realign = 0;
+              CUR_FDE.is_drap = 1;
+              CUR_FDE.drap_regnum = DWARF_FRAME_REGNUM (cfa.reg);
+            }            
 
 	  cfa_store.offset += offset;
 	  if (cfa.reg == STACK_POINTER_REGNUM)
@@ -1882,6 +1970,12 @@
 	      break;
 	    }
 	}
+        /* Rule 17 */
+        /* If the source operand of this MEM operation is not a register, 
+           basically the source is return address. Here we just care how 
+           much stack grew and ignore to save it. */ 
+      if (!REG_P (src))
+        break;
 
       def_cfa_1 (label, &cfa);
       {
@@ -3548,6 +3642,9 @@
   dw_loc_descr_ref loc;
   unsigned long size;
 
+  if (cfi->dw_cfi_opc == DW_CFA_expression)
+    dw2_asm_output_data (1, cfi->dw_cfi_oprnd2.dw_cfi_reg_num, NULL);
+
   /* Output the size of the block.  */
   loc = cfi->dw_cfi_oprnd1.dw_cfi_loc;
   size = size_of_locs (loc);
@@ -9024,8 +9121,9 @@
 	      offset += INTVAL (XEXP (elim, 1));
 	      elim = XEXP (elim, 0);
 	    }
-	  gcc_assert (elim == (frame_pointer_needed ? hard_frame_pointer_rtx
-		      : stack_pointer_rtx));
+	  gcc_assert (stack_realign_fp
+	              || elim == (frame_pointer_needed ? hard_frame_pointer_rtx
+		                                       : stack_pointer_rtx));
 	  offset += frame_pointer_fb_offset;
 
 	  return new_loc_descr (DW_OP_fbreg, offset, 0);
@@ -11155,9 +11253,10 @@
       offset += INTVAL (XEXP (elim, 1));
       elim = XEXP (elim, 0);
     }
-  gcc_assert (elim == (frame_pointer_needed ? hard_frame_pointer_rtx
+
+  gcc_assert (stack_realign_fp 
+              || elim == (frame_pointer_needed ? hard_frame_pointer_rtx
 		       : stack_pointer_rtx));
-
   frame_pointer_fb_offset = -offset;
 }
 
@@ -15438,6 +15537,63 @@
   if (debug_str_hash)
     htab_traverse (debug_str_hash, output_indirect_string, NULL);
 }
+
+/* In this function we use a series of DW_OP_?? expression which simulates
+   how stack is realigned to represent the location of the stored register.*/
+static void
+reg_save_with_expression (dw_cfi_ref cfi)
+{
+  struct dw_loc_descr_struct *head, *tmp;
+  HOST_WIDE_INT alignment = CUR_FDE.stack_realignment;
+  HOST_WIDE_INT offset = cfi->dw_cfi_oprnd2.dw_cfi_offset * UNITS_PER_WORD;
+  int reg = cfi->dw_cfi_oprnd1.dw_cfi_reg_num;
+  unsigned int dwarf_sp = (unsigned)DWARF_FRAME_REGNUM (STACK_POINTER_REGNUM);
+  
+  if (CUR_FDE.is_stack_realign)
+    {
+      head = tmp = new_loc_descr (DW_OP_const4s, 2 * UNITS_PER_WORD, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, alignment, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_and, 0, 0);
+
+      /* If stack grows upward, the offset will be a negative. */
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);  
+   
+      cfi->dw_cfi_opc = DW_CFA_expression;
+      cfi->dw_cfi_oprnd2.dw_cfi_reg_num = reg; 
+      cfi->dw_cfi_oprnd1.dw_cfi_loc = head;
+    }
+
+  /* We need restore drap register through dereference. If we needn't to restore
+     the drap register we just ignore. */
+  if (CUR_FDE.is_drap && reg == CUR_FDE.drap_regnum)
+    {
+       
+      dw_cfi_ref cfi2 = new_cfi();
+
+      cfi->dw_cfi_opc = DW_CFA_expression;
+      head = tmp = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      if (CUR_FDE.is_drap_reg_saved)
+        {
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_deref, 0, 0);
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, 
+                                                  2 * UNITS_PER_WORD, 0);
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+        }
+      cfi->dw_cfi_oprnd2.dw_cfi_reg_num = reg;
+      cfi->dw_cfi_oprnd1.dw_cfi_loc = head;
+
+      /* We also need restore the sp. */
+      head = tmp = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      cfi2->dw_cfi_opc = DW_CFA_expression;
+      cfi2->dw_cfi_oprnd2.dw_cfi_reg_num = dwarf_sp;
+      cfi2->dw_cfi_oprnd1.dw_cfi_loc = head;
+      cfi->dw_cfi_next = cfi2;
+    }  
+}
 #else
 
 /* This should never be used, but its address is needed for comparisons.  */
Index: function.c
===================================================================
--- function.c	(.../trunk/gcc)	(revision 133813)
+++ function.c	(.../branches/stack/gcc)	(revision 133869)
@@ -376,17 +376,19 @@
 {
   rtx x, addr;
   int bigend_correction = 0;
-  unsigned int alignment;
+  unsigned int alignment, mode_alignment, alignment_in_bits;
   int frame_off, frame_alignment, frame_phase;
 
+  if (mode == BLKmode)
+    mode_alignment = BIGGEST_ALIGNMENT;
+  else
+    mode_alignment = GET_MODE_ALIGNMENT (mode);
+
   if (align == 0)
     {
       tree type;
 
-      if (mode == BLKmode)
-	alignment = BIGGEST_ALIGNMENT;
-      else
-	alignment = GET_MODE_ALIGNMENT (mode);
+      alignment = mode_alignment;
 
       /* Allow the target to (possibly) increase the alignment of this
 	 stack slot.  */
@@ -406,16 +408,46 @@
   else
     alignment = align / BITS_PER_UNIT;
 
+  alignment_in_bits = alignment * BITS_PER_UNIT;
+
   if (FRAME_GROWS_DOWNWARD)
     frame_offset -= size;
 
-  /* Ignore alignment we can't do with expected alignment of the boundary.  */
-  if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
-    alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < alignment_in_bits)
+	{
+          if (!cfun->stack_realign_processed)
+            cfun->stack_alignment_estimated = alignment_in_bits;
+          else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized);
+	      if (!cfun->stack_realign_needed)
+		{
+		  /* It is OK to reduce the alignment as long as the
+		     requested size is 0 or the estimated stack
+		     alignment >= mode alignment.  */
+		  gcc_assert (size == 0
+			      || (cfun->stack_alignment_estimated
+				  >= mode_alignment));
+		  alignment_in_bits = cfun->stack_alignment_estimated;
+		  alignment = alignment_in_bits / BITS_PER_UNIT;
+		}
+	    }
+	}
+    }
+  else
+    {
+      /* Ignore alignment we can't do with expected alignment of the
+	 boundary.  */
+      if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
+	alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+    }
+  if (cfun->stack_alignment_needed < alignment_in_bits)
+    cfun->stack_alignment_needed = alignment_in_bits;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
-  if (cfun->stack_alignment_needed < alignment * BITS_PER_UNIT)
-    cfun->stack_alignment_needed = alignment * BITS_PER_UNIT;
-
   /* Calculate how many bytes the start of local variables is off from
      stack alignment.  */
   frame_alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
@@ -1203,7 +1235,17 @@
   HOST_WIDE_INT offset;
 
   if (x == virtual_incoming_args_rtx)
-    new = arg_pointer_rtx, offset = in_arg_offset;
+    {
+      /* Replace virtual_incoming_args_rtx with the internal arg pointer.  */
+      if (current_function_internal_arg_pointer != virtual_incoming_args_rtx)
+        {
+          gcc_assert (stack_realign_drap);
+          new = current_function_internal_arg_pointer;
+          offset = 0;
+        }
+      else
+        new = arg_pointer_rtx, offset = in_arg_offset;
+    }
   else if (x == virtual_stack_vars_rtx)
     new = frame_pointer_rtx, offset = var_offset;
   else if (x == virtual_stack_dynamic_rtx)
@@ -3002,6 +3044,20 @@
 	  continue;
 	}
 
+      /* Estimate stack alignment from parameter alignment */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT)
+        {
+          unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
+						      data.passed_type);
+	  if (TYPE_ALIGN (data.nominal_type) > align)
+	    align = TYPE_ALIGN (data.passed_type);
+	  if (cfun->stack_alignment_estimated < align)
+	    {
+	      gcc_assert (!cfun->stack_realign_processed);
+	      cfun->stack_alignment_estimated = align;
+	    }
+	}
+	
       if (current_function_stdarg && !TREE_CHAIN (parm))
 	assign_parms_setup_varargs (&all, &data, false);
 
@@ -3039,6 +3095,28 @@
      now that all parameters have been copied out of hard registers.  */
   emit_insn (all.first_conversion_insn);
 
+  /* Estimate reload stack alignment from scalar return mode.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (DECL_RESULT (fndecl))
+	{
+	  tree type = TREE_TYPE (DECL_RESULT (fndecl));
+	  enum machine_mode mode = TYPE_MODE (type);
+
+	  if (mode != BLKmode
+	      && mode != VOIDmode
+	      && !AGGREGATE_TYPE_P (type))
+	    {
+	      unsigned int align = GET_MODE_ALIGNMENT (mode);
+	      if (cfun->stack_alignment_estimated < align)
+		{
+		  gcc_assert (!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	} 
+    }
+
   /* If we are receiving a struct value address as the first argument, set up
      the RTL for the function result. As this might require code to convert
      the transmitted address to Pmode, we do this here to ensure that possible
@@ -3316,12 +3394,32 @@
   locate->where_pad = where_pad;
   locate->boundary = boundary;
 
-  /* Remember if the outgoing parameter requires extra alignment on the
-     calling function side.  */
-  if (boundary > PREFERRED_STACK_BOUNDARY)
-    boundary = PREFERRED_STACK_BOUNDARY;
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      /* stack_alignment_estimated can't change after stack has been
+	 realigned.  */
+      if (cfun->stack_alignment_estimated < boundary)
+        {
+          if (!cfun->stack_realign_processed)
+	    cfun->stack_alignment_estimated = boundary;
+	  else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized
+			  && cfun->stack_realign_needed);
+	    }
+	}
+    }
+  else
+    {
+      /* Remember if the outgoing parameter requires extra alignment on
+         the calling function side.  */
+      if (boundary > PREFERRED_STACK_BOUNDARY)
+        boundary = PREFERRED_STACK_BOUNDARY;
+    }
   if (cfun->stack_alignment_needed < boundary)
     cfun->stack_alignment_needed = boundary;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
 #ifdef ARGS_GROW_DOWNWARD
   locate->slot_offset.constant = -initial_offset_ptr->constant;
@@ -3877,6 +3975,8 @@
   cfun = ggc_alloc_cleared (sizeof (struct function));
 
   cfun->stack_alignment_needed = STACK_BOUNDARY;
+  cfun->stack_alignment_used = STACK_BOUNDARY;
+  cfun->stack_alignment_estimated = STACK_BOUNDARY;
   cfun->preferred_stack_boundary = STACK_BOUNDARY;
 
   current_function_funcdef_no = get_next_funcdef_no ();
@@ -4655,7 +4755,8 @@
 	 generated stack slot may not be a valid memory address, so we
 	 have to check it and fix it if necessary.  */
       start_sequence ();
-      emit_move_insn (validize_mem (ret), virtual_incoming_args_rtx);
+      emit_move_insn (validize_mem (ret),
+                      current_function_internal_arg_pointer);
       seq = get_insns ();
       end_sequence ();
 
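The slot-alignment bookkeeping that the function.c hunks add on the
MAX_VECTORIZE_STACK_ALIGNMENT path can be modelled outside the compiler.
The following is only a sketch under stated assumptions: struct fn_model
and record_slot_alignment are hypothetical names that stand in for the
struct function fields and for the relevant part of the stack-slot
allocation code, and assert stands in for gcc_assert.

```c
#include <assert.h>
#include <stdbool.h>

#define BITS_PER_UNIT 8

/* Hypothetical model of the struct function fields the hunk touches.
   All alignments are in bits.  */
struct fn_model
{
  unsigned int estimated;      /* stack_alignment_estimated */
  unsigned int needed;         /* stack_alignment_needed */
  unsigned int used;           /* stack_alignment_used */
  bool realign_processed;      /* stack_realign_processed */
  bool realign_finalized;      /* stack_realign_finalized */
  bool realign_needed;         /* stack_realign_needed */
};

/* Record a stack slot that asks for ALIGN bytes and return the
   alignment, in bytes, that the slot actually gets.  SIZE and
   MODE_ALIGN_BITS only feed the consistency check.  */
static unsigned int
record_slot_alignment (struct fn_model *f, unsigned int align,
                       unsigned int size, unsigned int mode_align_bits)
{
  unsigned int bits = align * BITS_PER_UNIT;

  if (f->estimated < bits)
    {
      if (!f->realign_processed)
        /* The realign decision is still open: raise the estimate.  */
        f->estimated = bits;
      else
        {
          assert (!f->realign_finalized);
          if (!f->realign_needed)
            {
              /* The decision was "no realignment": fall back to the
                 estimate, which must still satisfy the mode.  */
              assert (size == 0 || f->estimated >= mode_align_bits);
              bits = f->estimated;
              align = bits / BITS_PER_UNIT;
            }
        }
    }

  if (f->needed < bits)
    f->needed = bits;
  if (f->used < f->needed)
    f->used = f->needed;
  return align;
}
```

Before the realign decision, a 16-byte request on a function whose
estimate is 32 bits raises all three fields to 128.  After a "no
realignment" decision, the same request is reduced to the estimate.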
Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(.../trunk/gcc)	(revision 133813)
+++ tree-vectorizer.c	(.../branches/stack/gcc)	(revision 133869)
@@ -1786,9 +1786,19 @@
 
   if (TREE_STATIC (decl))
     return (alignment <= MAX_OFILE_ALIGNMENT);
+  else if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      if (alignment <= MAX_VECTORIZE_STACK_ALIGNMENT)
+	{
+	  if (cfun->stack_alignment_estimated < alignment)
+	    cfun->stack_alignment_estimated = alignment;
+	  return true;
+	}
+      else
+	return false;
+    }
   else
-    /* This used to be PREFERRED_STACK_BOUNDARY, however, that is not 100%
-       correct until someone implements forced stack alignment.  */
     return (alignment <= STACK_BOUNDARY); 
 }
 
Index: function.h
===================================================================
--- function.h	(.../trunk/gcc)	(revision 133813)
+++ function.h	(.../branches/stack/gcc)	(revision 133869)
@@ -302,6 +302,9 @@
   /* The arg pointer hard register, or the pseudo into which it was copied.  */
   rtx internal_arg_pointer;
 
+  /* Dynamic Realign Argument Pointer used for realigning stack.  */
+  rtx drap_reg;
+
   /* Opaque pointer used by get_hard_reg_initial_val and
      has_hard_reg_initial_val (see integrate.[hc]).  */
   struct initial_value_struct *hard_reg_initial_vals;
@@ -323,9 +326,16 @@
   /* tm.h can use this to store whatever it likes.  */
   struct machine_function * GTY ((maybe_undef)) machine;
 
-  /* The largest alignment of slot allocated on the stack.  */
+  /* The largest alignment needed on the stack, including requirement
+     for outgoing stack alignment.  */
   unsigned int stack_alignment_needed;
 
+  /* The largest alignment of slot allocated on the stack.  */
+  unsigned int stack_alignment_used;
+
+  /* The estimated stack alignment.  */
+  unsigned int stack_alignment_estimated;
+
   /* Preferred alignment of the end of stack frame.  */
   unsigned int preferred_stack_boundary;
 
@@ -494,6 +504,38 @@
 
   /* Nonzero if pass_tree_profile was run on this function.  */
   unsigned int after_tree_profile : 1;
+
+  /* Nonzero if current function must be given a frame pointer.
+     Set in global.c if anything is allocated on the stack there.  */
+  unsigned int need_frame_pointer : 1;
+
+  /* Nonzero if need_frame_pointer has been set.  */
+  unsigned int need_frame_pointer_set : 1;
+
+  /* Nonzero if, by estimation, current function stack needs realignment. */
+  unsigned int stack_realign_needed : 1;
+
+  /* Nonzero if function stack realignment is really needed.  This flag
+     is set after reload if the criteria for stack realignment still
+     hold by then.  Its value may contradict stack_realign_needed,
+     since the latter is set before reload.  This flag is more accurate
+     than stack_realign_needed, so prologue/epilogue should be generated
+     according to both flags.  */
+  unsigned int stack_realign_really : 1;
+
+  /* Nonzero if function being compiled needs dynamic realigned
+     argument pointer (drap) if stack needs realigning.  */
+  unsigned int need_drap : 1;
+
+  /* Nonzero if current function needs to save/restore parameter
+     pointer register in prolog, because it is a callee save reg.  */
+  unsigned int save_param_ptr_reg : 1;
+
+  /* Nonzero if function stack realignment estimation is done.  */
+  unsigned int stack_realign_processed : 1;
+
+  /* Nonzero if function stack realignment has been finalized.  */
+  unsigned int stack_realign_finalized : 1;
 };
 
 /* If va_list_[gf]pr_size is set to this, it means we don't know how
@@ -556,6 +598,9 @@
 #define dom_computed (cfun->cfg->x_dom_computed)
 #define n_bbs_in_dom_tree (cfun->cfg->x_n_bbs_in_dom_tree)
 #define VALUE_HISTOGRAMS(fun) (fun)->value_histograms
+#define frame_pointer_needed (cfun->need_frame_pointer)
+#define stack_realign_fp (cfun->stack_realign_needed && !cfun->need_drap)
+#define stack_realign_drap (cfun->stack_realign_needed && cfun->need_drap)
 
 /* Given a function decl for a containing function,
    return the `struct function' for it.  */
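The two macros added to function.h split a realigned function into
exactly one of two strategies, depending on need_drap.  A minimal
sketch of that classification, assuming a hypothetical struct fn_flags
in place of struct function and a hypothetical helper classify_realign:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two struct function bits involved.  */
struct fn_flags
{
  bool stack_realign_needed;
  bool need_drap;
};

enum realign_kind { REALIGN_NONE, REALIGN_FP, REALIGN_DRAP };

/* Mirror stack_realign_fp and stack_realign_drap: both require
   stack_realign_needed, and need_drap selects between them.  */
static enum realign_kind
classify_realign (const struct fn_flags *f)
{
  bool via_fp = f->stack_realign_needed && !f->need_drap;
  bool via_drap = f->stack_realign_needed && f->need_drap;

  /* The two macros can never both be nonzero.  */
  assert (!(via_fp && via_drap));

  if (via_fp)
    return REALIGN_FP;
  if (via_drap)
    return REALIGN_DRAP;
  return REALIGN_NONE;
}
```

Note that need_drap alone selects nothing: without
stack_realign_needed, neither macro is nonzero.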
Index: calls.c
===================================================================
--- calls.c	(.../trunk/gcc)	(revision 133813)
+++ calls.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2099,7 +2099,10 @@
 
   /* Figure out the amount to which the stack should be aligned.  */
   preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
-  if (fndecl)
+
+  /* With automatic stack realignment, we align stack in prologue when
+     needed and there is no need to update preferred_stack_boundary.  */
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT && fndecl)
     {
       struct cgraph_rtl_info *i = cgraph_rtl_info (fndecl);
       if (i && i->preferred_incoming_stack_boundary)
@@ -2401,7 +2404,7 @@
 	 incoming argument block.  */
       if (pass == 0)
 	{
-	  argblock = virtual_incoming_args_rtx;
+	  argblock = current_function_internal_arg_pointer;
 	  argblock
 #ifdef STACK_GROWS_DOWNWARD
 	    = plus_constant (argblock, current_function_pretend_args_size);
Index: cfgexpand.c
===================================================================
--- cfgexpand.c	(.../trunk/gcc)	(revision 133813)
+++ cfgexpand.c	(.../branches/stack/gcc)	(revision 133869)
@@ -161,10 +161,27 @@
 
   align = DECL_ALIGN (decl);
   align = LOCAL_ALIGNMENT (TREE_TYPE (decl), align);
-  if (align > PREFERRED_STACK_BOUNDARY)
-    align = PREFERRED_STACK_BOUNDARY;
+
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < align)
+	{
+	  gcc_assert(!cfun->stack_realign_processed);
+          cfun->stack_alignment_estimated = align;
+	}
+    }
+  else
+    {
+      if (align > PREFERRED_STACK_BOUNDARY)
+	align = PREFERRED_STACK_BOUNDARY;
+    }
+
+  /* stack_alignment_needed > PREFERRED_STACK_BOUNDARY is permitted.
+     So here we only make sure stack_alignment_needed >= align.  */
   if (cfun->stack_alignment_needed < align)
     cfun->stack_alignment_needed = align;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
   return align / BITS_PER_UNIT;
 }
@@ -748,6 +765,29 @@
 static HOST_WIDE_INT
 expand_one_var (tree var, bool toplevel, bool really_expand)
 {
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && TREE_CODE (var) == VAR_DECL)
+    {
+      unsigned int align;
+
+      /* Because we don't know if VAR will be in register or on stack,
+	 we conservatively assume it will be on stack even if VAR is
+	 eventually put into register after RA pass.  For non-automatic
+	 variables, which won't be on stack, we collect alignment of
+	 type and ignore user specified alignment.  */
+      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+	align = TYPE_ALIGN (TREE_TYPE (var));
+      else
+	align = DECL_ALIGN (var);
+
+      if (cfun->stack_alignment_estimated < align)
+        {
+          /* stack_alignment_estimated shouldn't change after the stack
+             realign decision is made.  */
+          gcc_assert(!cfun->stack_realign_processed);
+	  cfun->stack_alignment_estimated = align;
+	}
+    }
+
   if (TREE_CODE (var) != VAR_DECL)
     {
       if (really_expand)
@@ -2005,3 +2045,135 @@
   TODO_dump_func,                       /* todo_flags_finish */
  }
 };
+
+static bool
+gate_stack_realign (void)
+{
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT)
+    return false;
+  else
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      return true;
+    }
+}
+
+/* Collect accurate info for stack realign.  */
+
+static unsigned int
+collect_stackrealign_info (void)
+{
+  basic_block bb;
+  block_stmt_iterator bsi;
+
+  if (cfun->has_nonlocal_label)
+    cfun->need_drap = true;
+
+  FOR_EACH_BB (bb)
+    for (bsi = bsi_start (bb); ! bsi_end_p (bsi); bsi_next (&bsi))
+      {
+	tree stmt = bsi_stmt (bsi);
+	tree call = get_call_expr_in (stmt);
+	tree decl, type;
+	int flags;
+
+	if (!call)
+	  continue;
+
+	flags = call_expr_flags (call);
+	if (flags & ECF_MAY_BE_ALLOCA)
+	  cfun->need_drap = true;
+
+	decl = get_callee_fndecl (call);
+	if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_NORMAL)
+	  switch (DECL_FUNCTION_CODE (decl))
+	    {
+	    case BUILT_IN_NONLOCAL_GOTO:
+	    case BUILT_IN_APPLY:
+	    case BUILT_IN_LONGJMP:
+	      cfun->need_drap = true;
+	      break;
+	    default:
+	      break;
+	    }
+
+	type = TREE_TYPE (call);
+	if (!type || VOID_TYPE_P (type))
+          continue;
+
+	/* FIXME: Do we need DRAP when the result is returned on
+	   stack?  */
+	if (aggregate_value_p (type, decl))
+	  cfun->need_drap = true;
+      }  
+
+  return 0;
+}
+
+struct gimple_opt_pass pass_collect_stackrealign_info =
+{
+ {
+  GIMPLE_PASS,
+  "stack_realign_info",                 /* name */
+  gate_stack_realign,                   /* gate */
+  collect_stackrealign_info,            /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  0,                                    /* tv_id */
+  0,                                    /* properties_required */
+  0,                                    /* properties_provided */
+  0,                                    /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  0,                                    /* todo_flags_finish */
+ }
+};
+
+/* New pass handle_drap.
+   This pass first checks if DRAP is needed.
+   If yes, it sets current_function_internal_arg_pointer to a
+   virtual register.  Later the lregs pass will replace
+   virtual_incoming_args_rtx with that virtual reg.  */
+static unsigned int
+handle_drap (void)
+{
+  /* Call targetm.calls.internal_arg_pointer again.  This time it will
+     return a virtual reg if DRAP is needed.  */
+  rtx internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
+
+  /* Assertion to check internal_arg_pointer is set to the right rtx here */
+  gcc_assert (current_function_internal_arg_pointer == 
+             virtual_incoming_args_rtx);
+
+  /* Do nothing if the virtual incoming arg rtx need not be replaced.  */
+  if (current_function_internal_arg_pointer != internal_arg_rtx)
+    {
+      current_function_internal_arg_pointer = internal_arg_rtx;
+
+      /* Call fixup_tail_calls to clean up the REG_EQUIV note
+         if DRAP is needed.  */
+      fixup_tail_calls ();
+    }
+
+  return 0;
+}
+
+struct gimple_opt_pass pass_handle_drap =
+{
+ {
+  GIMPLE_PASS,
+  "handle_drap",			/* name */
+  gate_stack_realign,                   /* gate */
+  handle_drap,			        /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  0,				        /* tv_id */
+  /* ??? If TER is enabled, we actually receive GENERIC.  */
+  0,                                    /* properties_required */
+  PROP_rtl,                             /* properties_provided */
+  0,				        /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_dump_func,                       /* todo_flags_finish */
+ }
+};
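The conditions under which collect_stackrealign_info sets need_drap can
be restated as a small predicate.  This is a sketch only: struct
call_summary and function_needs_drap are hypothetical, and each field
abbreviates a test that the pass performs on the call statement itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one call statement.  */
struct call_summary
{
  bool may_be_alloca;     /* call flags include ECF_MAY_BE_ALLOCA */
  bool control_builtin;   /* BUILT_IN_NONLOCAL_GOTO, BUILT_IN_APPLY
                             or BUILT_IN_LONGJMP */
  bool returns_value;     /* call type is present and not void */
  bool aggregate_value;   /* aggregate_value_p holds for the result */
};

/* Return true if any condition checked by the pass requests DRAP.  */
static bool
function_needs_drap (bool has_nonlocal_label,
                     const struct call_summary *calls, size_t n_calls)
{
  bool need_drap = has_nonlocal_label;

  assert (calls != NULL || n_calls == 0);
  for (size_t i = 0; i < n_calls; i++)
    {
      const struct call_summary *c = &calls[i];

      if (c->may_be_alloca || c->control_builtin)
        need_drap = true;
      /* The aggregate check is skipped for void calls, as in the pass.  */
      if (c->returns_value && c->aggregate_value)
        need_drap = true;
    }
  return need_drap;
}
```

A function with only ordinary scalar-returning calls and no nonlocal
label yields false, so it would use the frame-pointer-based realignment
when realignment is needed at all.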
Index: tree-inline.c
===================================================================
--- tree-inline.c	(.../trunk/gcc)	(revision 133813)
+++ tree-inline.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2841,8 +2841,26 @@
 	cfun->unexpanded_var_list = tree_cons (NULL_TREE, var,
 					       cfun->unexpanded_var_list);
       else
-	cfun->unexpanded_var_list = tree_cons (NULL_TREE, remap_decl (var, id),
-					       cfun->unexpanded_var_list);
+	{
+	  /* Update stack alignment requirement if needed.  */
+	  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+	    {
+	      unsigned int align;
+
+	      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+		align = TYPE_ALIGN (TREE_TYPE (var));
+	      else
+		align = DECL_ALIGN (var);
+	      if (align  > cfun->stack_alignment_estimated)
+		{
+		  gcc_assert(!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	  cfun->unexpanded_var_list
+	    = tree_cons (NULL_TREE, remap_decl (var, id),
+			 cfun->unexpanded_var_list);
+	}
     }
 
   /* Clean up.  */
Index: passes.c
===================================================================
--- passes.c	(.../trunk/gcc)	(revision 133813)
+++ passes.c	(.../branches/stack/gcc)	(revision 133869)
@@ -686,7 +686,9 @@
   NEXT_PASS (pass_free_datastructures);
   NEXT_PASS (pass_mudflap_2);
   NEXT_PASS (pass_free_cfg_annotations);
+  NEXT_PASS (pass_collect_stackrealign_info);
   NEXT_PASS (pass_expand);
+  NEXT_PASS (pass_handle_drap); 
   NEXT_PASS (pass_rest_of_compilation);
     {
       struct opt_pass **p = &pass_rest_of_compilation.pass.sub;
Index: stmt.c
===================================================================
--- stmt.c	(.../trunk/gcc)	(revision 133813)
+++ stmt.c	(.../branches/stack/gcc)	(revision 133869)
@@ -1819,7 +1819,7 @@
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (current_function_internal_arg_pointer,
 			  copy_to_reg (get_arg_pointer_save_area ()));
 	}
     }
Index: reload1.c
===================================================================
--- reload1.c	(.../trunk/gcc)	(revision 133813)
+++ reload1.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2279,7 +2279,13 @@
 	  if (offsets_at[CODE_LABEL_NUMBER (x) - first_label_num][i]
 	      != (initial_p ? reg_eliminate[i].initial_offset
 		  : reg_eliminate[i].offset))
-	    reg_eliminate[i].can_eliminate = 0;
+            {
+	      /* Must not disable reg eliminate because stack realignment
+	         must eliminate frame pointer to stack pointer.  */
+	      gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			  || ! stack_realign_fp);
+	      reg_eliminate[i].can_eliminate = 0;
+            }
 
       return;
 
@@ -2358,7 +2364,13 @@
 	 offset because we are doing a jump to a variable address.  */
       for (p = reg_eliminate; p < &reg_eliminate[NUM_ELIMINABLE_REGS]; p++)
 	if (p->offset != p->initial_offset)
-	  p->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    p->can_eliminate = 0;
+	  }
       break;
 
     default:
@@ -2849,7 +2861,13 @@
       /* If we modify the source of an elimination rule, disable it.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->from_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       /* If we modify the target of an elimination rule by adding a constant,
 	 update its offset.  If we modify the target in any other way, we'll
@@ -2875,7 +2893,14 @@
 		    && CONST_INT_P (XEXP (XEXP (x, 1), 1)))
 		  ep->offset -= INTVAL (XEXP (XEXP (x, 1), 1));
 		else
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	      }
 	  }
 
@@ -2918,7 +2943,13 @@
 	 know how this register is used.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->from_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       elimination_effects (XEXP (x, 0), mem_mode);
       return;
@@ -2929,7 +2960,13 @@
 	 be performed.  Otherwise, we need not be concerned about it.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->to_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       elimination_effects (XEXP (x, 0), mem_mode);
       return;
@@ -2963,7 +3000,14 @@
 		    && GET_CODE (XEXP (src, 1)) == CONST_INT)
 		  ep->offset -= INTVAL (XEXP (src, 1));
 		else
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	      }
 	}
 
@@ -3292,7 +3336,14 @@
 	      for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
 		   ep++)
 		if (ep->from_rtx == orig_operand[i])
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	    }
 
 	  /* Companion to the above plus substitution, we can allow
@@ -3422,7 +3473,13 @@
   for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
     {
       if (ep->previous_offset != ep->offset && ep->ref_outside_mem)
-	ep->can_eliminate = 0;
+	{
+	  /* Must not disable reg eliminate because stack realignment
+	     must eliminate frame pointer to stack pointer.  */
+	  gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+		      || ! stack_realign_fp);
+	  ep->can_eliminate = 0;
+	}
 
       ep->ref_outside_mem = 0;
 
@@ -3498,6 +3555,11 @@
 	    || XEXP (SET_SRC (x), 0) != dest
 	    || GET_CODE (XEXP (SET_SRC (x), 1)) != CONST_INT))
       {
+	/* Must not disable reg eliminate because stack realignment
+	   must eliminate frame pointer to stack pointer.  */
+	gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+		    || ! stack_realign_fp);
+
 	reg_eliminate[i].can_eliminate_previous
 	  = reg_eliminate[i].can_eliminate = 0;
 	num_eliminable--;
@@ -3668,8 +3730,11 @@
   frame_pointer_needed = 1;
   for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
     {
-      if (ep->can_eliminate && ep->from == FRAME_POINTER_REGNUM
-	  && ep->to != HARD_FRAME_POINTER_REGNUM)
+      if (ep->can_eliminate
+	  && ep->from == FRAME_POINTER_REGNUM
+	  && ep->to != HARD_FRAME_POINTER_REGNUM
+	  && (! MAX_VECTORIZE_STACK_ALIGNMENT
+	      || ! cfun->stack_realign_needed))
 	frame_pointer_needed = 0;
 
       if (! ep->can_eliminate && ep->can_eliminate_previous)
@@ -3713,19 +3778,9 @@
   if (!reg_eliminate)
     reg_eliminate = xcalloc (sizeof (struct elim_table), NUM_ELIMINABLE_REGS);
 
-  /* Does this function require a frame pointer?  */
+  /* frame_pointer_needed should have been set by now.  */
+  gcc_assert (cfun->need_frame_pointer_set);
 
-  frame_pointer_needed = (! flag_omit_frame_pointer
-			  /* ?? If EXIT_IGNORE_STACK is set, we will not save
-			     and restore sp for alloca.  So we can't eliminate
-			     the frame pointer in that case.  At some point,
-			     we should improve this by emitting the
-			     sp-adjusting insns for this case.  */
-			  || (current_function_calls_alloca
-			      && EXIT_IGNORE_STACK)
-			  || current_function_accesses_prior_frames
-			  || FRAME_POINTER_REQUIRED);
-
   num_eliminable = 0;
 
 #ifdef ELIMINABLE_REGS
@@ -3736,7 +3791,10 @@
       ep->to = ep1->to;
       ep->can_eliminate = ep->can_eliminate_previous
 	= (CAN_ELIMINATE (ep->from, ep->to)
-	   && ! (ep->to == STACK_POINTER_REGNUM && frame_pointer_needed));
+	   && ! (ep->to == STACK_POINTER_REGNUM
+		 && frame_pointer_needed 
+		 && (! MAX_VECTORIZE_STACK_ALIGNMENT
+		     || ! stack_realign_fp)));
     }
 #else
   reg_eliminate[0].from = reg_eliminate_1[0].from;
Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(.../trunk/gcc/config/i386)	(revision 133813)
+++ config/i386/i386.h	(.../branches/stack/gcc/config/i386)	(revision 133869)
@@ -800,17 +800,33 @@
 /* Boundary (in *bits*) on which stack pointer should be aligned.  */
 #define STACK_BOUNDARY BITS_PER_WORD
 
+/* Stack boundary of the main function guaranteed by OS.  */
+#define MAIN_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+
+/* Stack boundary guaranteed by ABI.  */
+#define ABI_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+
 /* Boundary (in *bits*) on which the stack pointer prefers to be
    aligned; the compiler cannot rely on having this alignment.  */
 #define PREFERRED_STACK_BOUNDARY ix86_preferred_stack_boundary
 
-/* As of July 2001, many runtimes do not align the stack properly when
-   entering main.  This causes expand_main_function to forcibly align
-   the stack, which results in aligned frames for functions called from
-   main, though it does nothing for the alignment of main itself.  */
-#define FORCE_PREFERRED_STACK_BOUNDARY_IN_MAIN \
-  (ix86_preferred_stack_boundary > STACK_BOUNDARY && !TARGET_64BIT)
+/* It should be ABI_STACK_BOUNDARY.  But we set it to 128 bits for
+   both 32bit and 64bit, to support code that needs 128-bit stack
+   alignment for SSE instructions, but can't realign the stack.  */
+#define PREFERRED_STACK_BOUNDARY_DEFAULT 128
 
+/* 1 if -mstackrealign should be turned on by default.  It will
+   generate an alternate prologue and epilogue that realigns the
+   runtime stack if necessary.  This supports mixing code that keeps a
+   4-byte aligned stack, as specified by i386 psABI, with code that
+   needs a 16-byte aligned stack, as required by SSE instructions.  If
+   STACK_REALIGN_DEFAULT is 1 and PREFERRED_STACK_BOUNDARY_DEFAULT is
+   128, stacks for all functions may be realigned.  */
+#define STACK_REALIGN_DEFAULT 0
+
+/* Boundary (in *bits*) on which the incoming stack is aligned.  */
+#define INCOMING_STACK_BOUNDARY ix86_incoming_stack_boundary
+
 /* Target OS keeps a vector-aligned (128-bit, 16-byte) stack.  This is
    mandatory for the 64-bit ABI, and may or may not be true for other
    operating systems.  */
@@ -836,6 +852,9 @@
 
 #define BIGGEST_ALIGNMENT 128
 
+/* Maximum stack alignment for vectorizer.  */
+#define MAX_VECTORIZE_STACK_ALIGNMENT BIGGEST_ALIGNMENT
+
 /* Decide whether a variable of mode MODE should be 128 bit aligned.  */
 #define ALIGN_MODE_128(MODE) \
  ((MODE) == XFmode || SSE_REG_MODE_P (MODE))
@@ -1245,7 +1264,7 @@
    the pic register when possible.  The change is visible after the
    prologue has been emitted.  */
 
-#define REAL_PIC_OFFSET_TABLE_REGNUM  3
+#define REAL_PIC_OFFSET_TABLE_REGNUM  BX_REG
 
 #define PIC_OFFSET_TABLE_REGNUM				\
   ((TARGET_64BIT && ix86_cmodel == CM_SMALL_PIC)	\
@@ -1786,7 +1805,10 @@
    All other eliminations are valid.  */
 
 #define CAN_ELIMINATE(FROM, TO) \
-  ((TO) == STACK_POINTER_REGNUM ? !frame_pointer_needed : 1)
+  (stack_realign_fp \
+  ? ((FROM) == ARG_POINTER_REGNUM && (TO) == HARD_FRAME_POINTER_REGNUM) \
+    || ((FROM) == FRAME_POINTER_REGNUM && (TO) == STACK_POINTER_REGNUM) \
+  : ((TO) == STACK_POINTER_REGNUM ? !frame_pointer_needed : 1))
 
 /* Define the offset between two registers, one to be eliminated, and the other
    its replacement, at the start of a routine.  */
@@ -2342,6 +2364,7 @@
 
 extern enum asm_dialect ix86_asm_dialect;
 extern unsigned int ix86_preferred_stack_boundary;
+extern unsigned int ix86_incoming_stack_boundary;
 extern int ix86_branch_cost, ix86_section_threshold;
 
 /* Smallest class containing REGNO.  */
@@ -2443,7 +2466,6 @@
 {
   struct stack_local_entry *stack_locals;
   const char *some_ld_name;
-  rtx force_align_arg_pointer;
   int save_varrargs_registers;
   int accesses_prev_frame;
   int optimize_mode_switching[MAX_386_ENTITIES];
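The new CAN_ELIMINATE definition in i386.h reads more easily as a
function.  The sketch below assumes a hypothetical enum reg in place of
the real register numbers and passes the two predicates as explicit
arguments instead of reading them from cfun.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for ARG_POINTER_REGNUM, FRAME_POINTER_REGNUM,
   HARD_FRAME_POINTER_REGNUM and STACK_POINTER_REGNUM.  */
enum reg { ARG_POINTER, FRAME_POINTER, HARD_FRAME_POINTER, STACK_POINTER };

/* With frame-pointer-based realignment, only two eliminations are
   allowed: arg pointer -> hard frame pointer, and
   frame pointer -> stack pointer.  Otherwise the old rule applies.  */
static bool
can_eliminate (enum reg from, enum reg to,
               bool stack_realign_fp, bool frame_pointer_needed)
{
  if (stack_realign_fp)
    return ((from == ARG_POINTER && to == HARD_FRAME_POINTER)
            || (from == FRAME_POINTER && to == STACK_POINTER));

  return to == STACK_POINTER ? !frame_pointer_needed : true;
}
```

In the realign case, locals are reached through the realigned stack
pointer while incoming arguments stay reachable through the hard frame
pointer, which is why frame pointer -> stack pointer is permitted even
when a frame pointer is needed.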
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(.../trunk/gcc/config/i386)	(revision 133813)
+++ config/i386/i386.md	(.../branches/stack/gcc/config/i386)	(revision 133869)
@@ -221,6 +221,7 @@
   [(AX_REG			 0)
    (DX_REG			 1)
    (CX_REG			 2)
+   (BX_REG			 3)
    (SI_REG			 4)
    (DI_REG			 5)
    (BP_REG			 6)
@@ -230,6 +231,7 @@
    (FPCR_REG			19)
    (R10_REG			39)
    (R11_REG			40)
+   (R13_REG			42)
   ])
 
 ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
Index: config/i386/i386.opt
===================================================================
--- config/i386/i386.opt	(.../trunk/gcc/config/i386)	(revision 133813)
+++ config/i386/i386.opt	(.../branches/stack/gcc/config/i386)	(revision 133869)
@@ -78,6 +78,10 @@
 Target RejectNegative Report InverseMask(NO_FANCY_MATH_387, USE_FANCY_MATH_387)
 Generate sin, cos, sqrt for FPU
 
+mforce-drap
+Target Report Var(ix86_force_drap)
+Always use Dynamic Realigned Argument Pointer (DRAP) to realign stack.
+
 mfp-ret-in-387
 Target Report Mask(FLOAT_RETURNS)
 Return values of functions in FPU registers
@@ -134,6 +138,10 @@
 Target RejectNegative Joined Var(ix86_preferred_stack_boundary_string)
 Attempt to keep stack aligned to this power of 2
 
+mincoming-stack-boundary=
+Target RejectNegative Joined Var(ix86_incoming_stack_boundary_string)
+Assume incoming stack aligned to this power of 2
+
 mpush-args
 Target Report InverseMask(NO_PUSH_ARGS, PUSH_ARGS)
 Use push instructions to save outgoing arguments
@@ -159,7 +167,7 @@
 Use SSE register passing conventions for SF and DF mode
 
 mstackrealign
-Target Report Var(ix86_force_align_arg_pointer)
+Target Report Var(ix86_force_align_arg_pointer) Init(-1)
 Realign stack in prologue
 
 mstack-arg-probe
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(.../trunk/gcc/config/i386)	(revision 133813)
+++ config/i386/i386.c	(.../branches/stack/gcc/config/i386)	(revision 133869)
@@ -1693,11 +1693,22 @@
 
 /* -mstackrealign option */
 extern int ix86_force_align_arg_pointer;
-static const char ix86_force_align_arg_pointer_string[] = "force_align_arg_pointer";
+static const char ix86_force_align_arg_pointer_string[]
+  = "force_align_arg_pointer";
 
 /* Preferred alignment for stack boundary in bits.  */
 unsigned int ix86_preferred_stack_boundary;
 
+/* Alignment for incoming stack boundary in bits specified at
+   command line.  */
+static unsigned int ix86_user_incoming_stack_boundary;
+
+/* Default alignment for incoming stack boundary in bits.  */
+static unsigned int ix86_default_incoming_stack_boundary;
+
+/* Alignment for incoming stack boundary in bits.  */
+unsigned int ix86_incoming_stack_boundary;
+
 /* Values 1-5: see jump.c */
 int ix86_branch_cost;
 
@@ -2612,11 +2623,9 @@
   if (TARGET_SSE4_2 || TARGET_ABM)
     x86_popcnt = true;
 
-  /* Validate -mpreferred-stack-boundary= value, or provide default.
-     The default of 128 bits is for Pentium III's SSE __m128.  We can't
-     change it because of optimize_size.  Otherwise, we can't mix object
-     files compiled with -Os and -On.  */
-  ix86_preferred_stack_boundary = 128;
+  /* Validate -mpreferred-stack-boundary= value or default it to
+     PREFERRED_STACK_BOUNDARY_DEFAULT.  */
+  ix86_preferred_stack_boundary = PREFERRED_STACK_BOUNDARY_DEFAULT;
   if (ix86_preferred_stack_boundary_string)
     {
       i = atoi (ix86_preferred_stack_boundary_string);
@@ -2627,6 +2636,31 @@
 	ix86_preferred_stack_boundary = (1 << i) * BITS_PER_UNIT;
     }
 
+  /* Set the default value for -mstackrealign.  */
+  if (ix86_force_align_arg_pointer == -1)
+    ix86_force_align_arg_pointer = STACK_REALIGN_DEFAULT;
+
+  /* Validate -mincoming-stack-boundary= value or default it to
+     ABI_STACK_BOUNDARY/PREFERRED_STACK_BOUNDARY.  */
+  if (ix86_force_align_arg_pointer)
+    ix86_default_incoming_stack_boundary = ABI_STACK_BOUNDARY;
+  else
+    ix86_default_incoming_stack_boundary = PREFERRED_STACK_BOUNDARY;
+  ix86_incoming_stack_boundary = ix86_default_incoming_stack_boundary;
+  if (ix86_incoming_stack_boundary_string)
+    {
+      i = atoi (ix86_incoming_stack_boundary_string);
+      if (i < (TARGET_64BIT ? 4 : 2) || i > 12)
+	error ("-mincoming-stack-boundary=%d is not between %d and 12",
+	       i, TARGET_64BIT ? 4 : 2);
+      else
+	{
+	  ix86_user_incoming_stack_boundary = (1 << i) * BITS_PER_UNIT;
+	  ix86_incoming_stack_boundary
+	    = ix86_user_incoming_stack_boundary;
+	}
+    }
+
   /* Accept -msseregparm only if at least SSE support is enabled.  */
   if (TARGET_SSEREGPARM
       && ! TARGET_SSE)
@@ -3066,11 +3100,6 @@
       && ix86_function_regparm (TREE_TYPE (decl), NULL) >= 3)
     return false;
 
-  /* If we forced aligned the stack, then sibcalling would unalign the
-     stack, which may break the called function.  */
-  if (cfun->machine->force_align_arg_pointer)
-    return false;
-
   /* Otherwise okay.  That also includes certain types of indirect calls.  */
   return true;
 }
@@ -3121,15 +3150,6 @@
 	  *no_add_attrs = true;
 	}
 
-      if (!TARGET_64BIT
-	  && lookup_attribute (ix86_force_align_arg_pointer_string,
-			       TYPE_ATTRIBUTES (*node))
-	  && compare_tree_int (cst, REGPARM_MAX-1))
-	{
-	  error ("%s functions limited to %d register parameters",
-		 ix86_force_align_arg_pointer_string, REGPARM_MAX-1);
-	}
-
       return NULL_TREE;
     }
 
@@ -3241,8 +3261,23 @@
 
   attr = lookup_attribute ("regparm", TYPE_ATTRIBUTES (type));
   if (attr)
-    return TREE_INT_CST_LOW (TREE_VALUE (TREE_VALUE (attr)));
+    {
+      regparm
+	= TREE_INT_CST_LOW (TREE_VALUE (TREE_VALUE (attr)));
 
+      if (decl && TREE_CODE (decl) == FUNCTION_DECL)
+	{
+	  /* We can't use regparm(3) for nested functions as these use
+	     static chain pointer in third argument.  */
+	  if (regparm == 3
+	      && decl_function_context (decl)
+	      && !DECL_NO_STATIC_CHAIN (decl))
+	    regparm = 2;
+	}
+
+      return regparm;
+    }
+
   if (lookup_attribute ("fastcall", TYPE_ATTRIBUTES (type)))
     return 2;
 
@@ -3266,8 +3301,7 @@
 	  /* We can't use regparm(3) for nested functions as these use
 	     static chain pointer in third argument.  */
 	  if (local_regparm == 3
-	      && (decl_function_context (decl)
-                  || ix86_force_align_arg_pointer)
+	      && decl_function_context (decl)
 	      && !DECL_NO_STATIC_CHAIN (decl))
 	    local_regparm = 2;
 
@@ -3276,13 +3310,11 @@
 	     the callee DECL_STRUCT_FUNCTION is gone, so we fall back to
 	     scanning the attributes for the self-realigning property.  */
 	  f = DECL_STRUCT_FUNCTION (decl);
-	  if (local_regparm == 3
-	      && (f ? !!f->machine->force_align_arg_pointer
-		  : !!lookup_attribute (ix86_force_align_arg_pointer_string,
-					TYPE_ATTRIBUTES (TREE_TYPE (decl)))))
-	    local_regparm = 2;
+          /* Since the current internal arg pointer won't conflict
+	     with parameter passing regs, there is no need to change
+	     stack realignment and adjust regparm number.
 
-	  /* Each fixed register usage increases register pressure,
+	     Each fixed register usage increases register pressure,
 	     so less registers should be used for argument passing.
 	     This functionality can be overriden by an explicit
 	     regparm value.  */
@@ -4995,15 +5027,7 @@
 
   /* Indicate to allocate space on the stack for varargs save area.  */
   ix86_save_varrargs_registers = 1;
-  /* We need 16-byte stack alignment to save SSE registers.  If user
-     asked for lower preferred_stack_boundary, lets just hope that he knows
-     what he is doing and won't varargs SSE values.
 
-     We also may end up assuming that only 64bit values are stored in SSE
-     register let some floating point program work.  */
-  if (ix86_preferred_stack_boundary >= BIGGEST_ALIGNMENT)
-    cfun->stack_alignment_needed = BIGGEST_ALIGNMENT;
-
   save_area = frame_pointer_rtx;
   set = get_varargs_alias_set ();
 
@@ -5170,7 +5194,7 @@
 
   /* Find the overflow area.  */
   type = TREE_TYPE (ovf);
-  t = make_tree (type, virtual_incoming_args_rtx);
+  t = make_tree (type, current_function_internal_arg_pointer);
   if (words != 0)
     t = build2 (POINTER_PLUS_EXPR, type, t,
 	        size_int (words * UNITS_PER_WORD));
@@ -5929,9 +5953,14 @@
   if (current_function_is_leaf && !current_function_profile
       && !ix86_current_function_calls_tls_descriptor)
     {
-      int i;
+      int i, drap;
+      /* Can't use the same register for both PIC and DRAP.  */
+      if (cfun->drap_reg)
+	drap = REGNO (cfun->drap_reg);
+      else
+	drap = -1;
       for (i = 2; i >= 0; --i)
-        if (!df_regs_ever_live_p (i))
+        if (i != drap && !df_regs_ever_live_p (i))
 	  return i;
     }
 
@@ -5967,8 +5996,8 @@
 	}
     }
 
-  if (cfun->machine->force_align_arg_pointer
-      && regno == REGNO (cfun->machine->force_align_arg_pointer))
+  if (cfun->drap_reg
+      && regno == REGNO (cfun->drap_reg))
     return 1;
 
   return (df_regs_ever_live_p (regno)
@@ -6034,6 +6063,9 @@
   stack_alignment_needed = cfun->stack_alignment_needed / BITS_PER_UNIT;
   preferred_alignment = cfun->preferred_stack_boundary / BITS_PER_UNIT;
 
+  gcc_assert (!size || stack_alignment_needed);
+  gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
+
   /* During reload iteration the amount of registers saved can change.
      Recompute the value as needed.  Do not recompute when amount of registers
      didn't change as reload does multiple calls to the function and does not
@@ -6076,19 +6108,10 @@
 
   frame->hard_frame_pointer_offset = offset;
 
-  /* Do some sanity checking of stack_alignment_needed and
-     preferred_alignment, since i386 port is the only using those features
-     that may break easily.  */
+  /* Set offset to aligned because the realigned frame starts from here.  */
+  if (stack_realign_fp)
+    offset = (offset + stack_alignment_needed -1) & -stack_alignment_needed;
 
-  gcc_assert (!size || stack_alignment_needed);
-  gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
-  gcc_assert (preferred_alignment <= PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT);
-  gcc_assert (stack_alignment_needed
-	      <= PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT);
-
-  if (stack_alignment_needed < STACK_BOUNDARY / BITS_PER_UNIT)
-    stack_alignment_needed = STACK_BOUNDARY / BITS_PER_UNIT;
-
   /* Register save area */
   offset += frame->nregs * UNITS_PER_WORD;
 
@@ -6253,35 +6276,129 @@
     RTX_FRAME_RELATED_P (insn) = 1;
 }
 
+/* Find an available register to be used as dynamic realign argument
+   pointer register.  Such a register will be written in prologue and
+   used in begin of body, so it must not be
+	1. parameter passing register.
+	2. GOT pointer.
+   For i386, we use CX if it is not used to pass parameter. Otherwise
+   we just pick DI.
+   For x86_64, we just pick R13 directly.
+
+   Return: the regno of chosen register.  */
+
+static unsigned int 
+find_drap_reg (void)
+{
+  int param_reg_num;
+
+  if (TARGET_64BIT)
+    return R13_REG;
+
+  /* Use DI for nested function or function need static chain.  */
+  if (decl_function_context (cfun->decl)
+      && !DECL_NO_STATIC_CHAIN (cfun->decl))
+    return DI_REG;
+
+  if (cfun->tail_call_emit)
+    return DI_REG;
+
+  param_reg_num = ix86_function_regparm (TREE_TYPE (cfun->decl),
+					 cfun->decl);
+
+  if (param_reg_num <= 2
+      && !lookup_attribute ("fastcall",
+			    TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl))))
+    return CX_REG;
+
+  return DI_REG;
+}
+
 /* Handle the TARGET_INTERNAL_ARG_POINTER hook.  */
 
 static rtx
 ix86_internal_arg_pointer (void)
 {
-  bool has_force_align_arg_pointer =
-    (0 != lookup_attribute (ix86_force_align_arg_pointer_string,
-			    TYPE_ATTRIBUTES (TREE_TYPE (current_function_decl))));
-  if ((FORCE_PREFERRED_STACK_BOUNDARY_IN_MAIN
-       && DECL_NAME (current_function_decl)
-       && MAIN_NAME_P (DECL_NAME (current_function_decl))
-       && DECL_FILE_SCOPE_P (current_function_decl))
-      || ix86_force_align_arg_pointer
-      || has_force_align_arg_pointer)
+  /* If called in "expand" pass, currently_expanding_to_rtl will
+     be true */
+  if (currently_expanding_to_rtl) 
+    return virtual_incoming_args_rtx;
+
+  /* Prefer the one specified at command line. */
+  ix86_incoming_stack_boundary 
+    = (ix86_user_incoming_stack_boundary
+       ? ix86_user_incoming_stack_boundary
+       : ix86_default_incoming_stack_boundary);
+
+  /* Current stack realign doesn't support eh_return. Assume
+     function who calls eh_return is aligned. There will be sanity
+     check if stack realign happens together with eh_return later.  */
+  if (current_function_calls_eh_return)
+    ix86_incoming_stack_boundary = PREFERRED_STACK_BOUNDARY;
+
+  /* Incoming stack alignment can be changed on individual functions
+     via force_align_arg_pointer attribute.  We use the smallest
+     incoming stack boundary.  */
+  if (ix86_incoming_stack_boundary > ABI_STACK_BOUNDARY
+      && lookup_attribute (ix86_force_align_arg_pointer_string,
+			   TYPE_ATTRIBUTES (TREE_TYPE (current_function_decl))))
+    ix86_incoming_stack_boundary = ABI_STACK_BOUNDARY;
+
+  /* Stack at entrance of main is aligned by runtime.  We use the
+     smallest incoming stack boundary. */
+  if (ix86_incoming_stack_boundary > MAIN_STACK_BOUNDARY
+      && DECL_NAME (current_function_decl)
+      && MAIN_NAME_P (DECL_NAME (current_function_decl))
+      && DECL_FILE_SCOPE_P (current_function_decl))
+    ix86_incoming_stack_boundary = MAIN_STACK_BOUNDARY;
+
+  gcc_assert (cfun->stack_alignment_needed 
+              <= cfun->stack_alignment_estimated);
+
+  /* x86_64 vararg needs 16byte stack alignment for register save
+     area.  */
+  if (TARGET_64BIT
+      && current_function_stdarg
+      && cfun->stack_alignment_estimated < 128)
+    cfun->stack_alignment_estimated = 128;
+
+  /* Update cfun->stack_alignment_estimated and use it later to align
+     stack.  FIXME: How to optimize for leaf function?  */
+  if (PREFERRED_STACK_BOUNDARY > cfun->stack_alignment_estimated)
+    cfun->stack_alignment_estimated = PREFERRED_STACK_BOUNDARY;
+  if (PREFERRED_STACK_BOUNDARY > cfun->stack_alignment_needed)
+    cfun->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
+
+  cfun->stack_realign_needed
+    = ix86_incoming_stack_boundary < cfun->stack_alignment_estimated;
+
+  cfun->stack_realign_processed = true;
+
+  if (ix86_force_drap
+      || !ACCUMULATE_OUTGOING_ARGS)
+    cfun->need_drap = true;
+
+  if (stack_realign_drap)
     {
-      /* Nested functions can't realign the stack due to a register
-	 conflict.  */
-      if (DECL_CONTEXT (current_function_decl)
-	  && TREE_CODE (DECL_CONTEXT (current_function_decl)) == FUNCTION_DECL)
-	{
-	  if (ix86_force_align_arg_pointer)
-	    warning (0, "-mstackrealign ignored for nested functions");
-	  if (has_force_align_arg_pointer)
-	    error ("%s not supported for nested functions",
-		   ix86_force_align_arg_pointer_string);
-	  return virtual_incoming_args_rtx;
-	}
-      cfun->machine->force_align_arg_pointer = gen_rtx_REG (Pmode, CX_REG);
-      return copy_to_reg (cfun->machine->force_align_arg_pointer);
+      /* Assign DRAP to vDRAP and returns vDRAP */
+      unsigned int regno = find_drap_reg ();
+      rtx drap_vreg;
+      rtx arg_ptr;
+      rtx seq;
+
+      if (regno != CX_REG)
+	cfun->save_param_ptr_reg = true;
+
+      arg_ptr = gen_rtx_REG (Pmode, regno);
+      cfun->drap_reg = arg_ptr;
+
+      start_sequence ();
+      drap_vreg = copy_to_reg(arg_ptr);
+      seq = get_insns ();
+      end_sequence ();
+      
+      emit_insn_before (seq, NEXT_INSN (entry_of_function ()));
+      return drap_vreg;
     }
   else
     return virtual_incoming_args_rtx;
@@ -6320,53 +6437,64 @@
   bool pic_reg_used;
   struct ix86_frame frame;
   HOST_WIDE_INT allocate;
+  rtx (*gen_andsp) (rtx, rtx, rtx);
 
+  /* DRAP should not coexist with stack_realign_fp */
+  gcc_assert (!(cfun->drap_reg && stack_realign_fp));
+
+  /* Check if stack realign is really needed after reload, and 
+     stores result in cfun */
+  cfun->stack_realign_really = (ix86_incoming_stack_boundary
+				< (current_function_is_leaf
+				   ? cfun->stack_alignment_used
+				   : cfun->stack_alignment_needed));
+
+  cfun->stack_realign_finalized = true;
+
   ix86_compute_frame_layout (&frame);
 
-  if (cfun->machine->force_align_arg_pointer)
+  /* Emit prologue code to adjust stack alignment and setup DRAP, in case
+     of DRAP is needed and stack realignment is really needed after reload */
+  if (cfun->drap_reg && cfun->stack_realign_really)
     {
       rtx x, y;
+      int align_bytes = cfun->stack_alignment_needed / BITS_PER_UNIT;
+      int param_ptr_offset = (cfun->save_param_ptr_reg
+			      ?  STACK_BOUNDARY / BITS_PER_UNIT : 0);
 
+      gcc_assert (stack_realign_drap);
+
       /* Grab the argument pointer.  */
-      x = plus_constant (stack_pointer_rtx, 4);
-      y = cfun->machine->force_align_arg_pointer;
+      x = plus_constant (stack_pointer_rtx, 
+                         (STACK_BOUNDARY / BITS_PER_UNIT 
+			  + param_ptr_offset));
+      y = cfun->drap_reg;
+
+      /* Only need to push parameter pointer reg if it is caller
+	 saved reg */
+      if (cfun->save_param_ptr_reg)
+	{
+	  /* Push arg pointer reg */
+	  insn = emit_insn (gen_push (y));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
+
       insn = emit_insn (gen_rtx_SET (VOIDmode, y, x));
-      RTX_FRAME_RELATED_P (insn) = 1;
+      RTX_FRAME_RELATED_P (insn) = 1; 
 
-      /* The unwind info consists of two parts: install the fafp as the cfa,
-	 and record the fafp as the "save register" of the stack pointer.
-	 The later is there in order that the unwinder can see where it
-	 should restore the stack pointer across the and insn.  */
-      x = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, const0_rtx), UNSPEC_DEF_CFA);
-      x = gen_rtx_SET (VOIDmode, y, x);
-      RTX_FRAME_RELATED_P (x) = 1;
-      y = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, stack_pointer_rtx),
-			  UNSPEC_REG_SAVE);
-      y = gen_rtx_SET (VOIDmode, cfun->machine->force_align_arg_pointer, y);
-      RTX_FRAME_RELATED_P (y) = 1;
-      x = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, x, y));
-      x = gen_rtx_EXPR_LIST (REG_FRAME_RELATED_EXPR, x, NULL);
-      REG_NOTES (insn) = x;
-
+      gen_andsp = TARGET_64BIT ? gen_anddi3 : gen_andsi3;
       /* Align the stack.  */
-      emit_insn (gen_andsi3 (stack_pointer_rtx, stack_pointer_rtx,
-			     GEN_INT (-16)));
+      insn = emit_insn ((*gen_andsp) (stack_pointer_rtx,
+				  stack_pointer_rtx,
+				  GEN_INT (-align_bytes)));
+      RTX_FRAME_RELATED_P (insn) = 1;
 
-      /* And here we cheat like madmen with the unwind info.  We force the
-	 cfa register back to sp+4, which is exactly what it was at the
-	 start of the function.  Re-pushing the return address results in
-	 the return at the same spot relative to the cfa, and thus is
-	 correct wrt the unwind info.  */
-      x = cfun->machine->force_align_arg_pointer;
-      x = gen_frame_mem (Pmode, plus_constant (x, -4));
+      x = cfun->drap_reg;
+      x = gen_frame_mem (Pmode,
+                         plus_constant (x,
+					-(STACK_BOUNDARY / BITS_PER_UNIT)));
       insn = emit_insn (gen_push (x));
       RTX_FRAME_RELATED_P (insn) = 1;
-
-      x = GEN_INT (4);
-      x = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, x), UNSPEC_DEF_CFA);
-      x = gen_rtx_SET (VOIDmode, stack_pointer_rtx, x);
-      x = gen_rtx_EXPR_LIST (REG_FRAME_RELATED_EXPR, x, NULL);
-      REG_NOTES (insn) = x;
     }
 
   /* Note: AT&T enter does NOT have reversed args.  Enter is probably
@@ -6381,6 +6509,19 @@
       RTX_FRAME_RELATED_P (insn) = 1;
     }
 
+  if (stack_realign_fp && cfun->stack_realign_really)
+    {
+      int align_bytes = cfun->stack_alignment_needed / BITS_PER_UNIT;
+      gcc_assert (align_bytes > STACK_BOUNDARY / BITS_PER_UNIT);
+
+      gen_andsp = TARGET_64BIT ? gen_anddi3 : gen_andsi3;
+      /* Align the stack.  */
+      insn = emit_insn ((*gen_andsp) (stack_pointer_rtx,
+				      stack_pointer_rtx,
+				      GEN_INT (-align_bytes)));
+      RTX_FRAME_RELATED_P (insn) = 1;
+    }
+
   allocate = frame.to_allocate;
 
   if (!frame.save_regs_using_mov)
@@ -6395,7 +6536,9 @@
      a red zone location */
   if (TARGET_RED_ZONE && frame.save_regs_using_mov
       && (! TARGET_STACK_PROBE || allocate < CHECK_STACK_LIMIT))
-    ix86_emit_save_regs_using_mov (frame_pointer_needed ? hard_frame_pointer_rtx
+    ix86_emit_save_regs_using_mov ((frame_pointer_needed
+				     && !cfun->stack_realign_really) 
+                                   ? hard_frame_pointer_rtx
 				   : stack_pointer_rtx,
 				   -frame.nregs * UNITS_PER_WORD);
 
@@ -6454,8 +6597,11 @@
       && !(TARGET_RED_ZONE
          && (! TARGET_STACK_PROBE || allocate < CHECK_STACK_LIMIT)))
     {
-      if (!frame_pointer_needed || !frame.to_allocate)
-        ix86_emit_save_regs_using_mov (stack_pointer_rtx, frame.to_allocate);
+      if (!frame_pointer_needed
+	  || !frame.to_allocate
+	  || cfun->stack_realign_really)
+        ix86_emit_save_regs_using_mov (stack_pointer_rtx,
+				       frame.to_allocate);
       else
         ix86_emit_save_regs_using_mov (hard_frame_pointer_rtx,
 				       -frame.nregs * UNITS_PER_WORD);
@@ -6505,6 +6651,16 @@
 	emit_insn (gen_prologue_use (pic_offset_table_rtx));
       emit_insn (gen_blockage ());
     }
+
+  if (cfun->drap_reg && !cfun->stack_realign_really)
+    {
+      /* vDRAP is setup but after reload it turns out stack realign
+         isn't necessary, here we will emit prologue to setup DRAP
+         without stack realign adjustment */
+      int drap_bp_offset = STACK_BOUNDARY / BITS_PER_UNIT * 2;
+      rtx x = plus_constant (hard_frame_pointer_rtx, drap_bp_offset);
+      insn = emit_insn (gen_rtx_SET (VOIDmode, cfun->drap_reg, x));
+    }
 }
 
 /* Emit code to restore saved registers using MOV insns.  First register
@@ -6543,7 +6699,10 @@
 ix86_expand_epilogue (int style)
 {
   int regno;
-  int sp_valid = !frame_pointer_needed || current_function_sp_is_unchanging;
+ /* When stack realign may happen, SP must be valid. */
+  int sp_valid = (!frame_pointer_needed
+		  || current_function_sp_is_unchanging
+		  || (stack_realign_fp && cfun->stack_realign_really));
   struct ix86_frame frame;
   HOST_WIDE_INT offset;
 
@@ -6580,11 +6739,16 @@
     {
       /* Restore registers.  We can use ebp or esp to address the memory
 	 locations.  If both are available, default to ebp, since offsets
-	 are known to be small.  Only exception is esp pointing directly to the
-	 end of block of saved registers, where we may simplify addressing
-	 mode.  */
+	 are known to be small.  Only exception is esp pointing directly
+	 to the end of block of saved registers, where we may simplify
+	 addressing mode.  
 
-      if (!frame_pointer_needed || (sp_valid && !frame.to_allocate))
+	 If we are realigning stack with bp and sp, regs restore can't
+	 be addressed by bp. sp must be used instead.  */
+
+      if (!frame_pointer_needed
+	  || (sp_valid && !frame.to_allocate) 
+	  || (stack_realign_fp && cfun->stack_realign_really))
 	ix86_emit_restore_regs_using_mov (stack_pointer_rtx,
 					  frame.to_allocate, style == 2);
       else
@@ -6596,6 +6760,10 @@
 	{
 	  rtx tmp, sa = EH_RETURN_STACKADJ_RTX;
 
+	  if (cfun->stack_realign_really)
+	    {
+	      error("Stack realign has conflict with eh_return");
+	    }
 	  if (frame_pointer_needed)
 	    {
 	      tmp = gen_rtx_PLUS (Pmode, hard_frame_pointer_rtx, sa);
@@ -6639,10 +6807,16 @@
   else
     {
       /* First step is to deallocate the stack frame so that we can
-	 pop the registers.  */
+	 pop the registers.
+
+	 If we realign stack with frame pointer, then stack pointer
+         won't be able to recover via lea $offset(%bp), %sp, because
+         there is a padding area between bp and sp for realign. 
+         "add $to_allocate, %sp" must be used instead.  */
       if (!sp_valid)
 	{
 	  gcc_assert (frame_pointer_needed);
+          gcc_assert (!(stack_realign_fp && cfun->stack_realign_really));
 	  pro_epilogue_adjust_stack (stack_pointer_rtx,
 				     hard_frame_pointer_rtx,
 				     GEN_INT (offset), style);
@@ -6665,18 +6839,47 @@
 	     able to grok it fast.  */
 	  if (TARGET_USE_LEAVE)
 	    emit_insn (TARGET_64BIT ? gen_leave_rex64 () : gen_leave ());
-	  else if (TARGET_64BIT)
-	    emit_insn (gen_popdi1 (hard_frame_pointer_rtx));
-	  else
-	    emit_insn (gen_popsi1 (hard_frame_pointer_rtx));
+	  else 
+            {
+              /* For stack realigned really happens, recover stack 
+                 pointer to hard frame pointer is a must, if not using 
+                 leave.  */
+              if (stack_realign_fp && cfun->stack_realign_really)
+		pro_epilogue_adjust_stack (stack_pointer_rtx,
+					   hard_frame_pointer_rtx,
+					   const0_rtx, style);
+              if (TARGET_64BIT)
+                emit_insn (gen_popdi1 (hard_frame_pointer_rtx));
+              else
+                emit_insn (gen_popsi1 (hard_frame_pointer_rtx));
+            }
 	}
     }
 
-  if (cfun->machine->force_align_arg_pointer)
+  if (cfun->drap_reg && cfun->stack_realign_really)
     {
-      emit_insn (gen_addsi3 (stack_pointer_rtx,
-			     cfun->machine->force_align_arg_pointer,
-			     GEN_INT (-4)));
+      int param_ptr_offset = (cfun->save_param_ptr_reg
+			      ? STACK_BOUNDARY / BITS_PER_UNIT : 0);
+      gcc_assert (stack_realign_drap);
+      if (TARGET_64BIT)
+        {
+          emit_insn (gen_adddi3 (stack_pointer_rtx,
+				 cfun->drap_reg,
+				 GEN_INT (-(STACK_BOUNDARY / BITS_PER_UNIT
+					    + param_ptr_offset))));
+          if (cfun->save_param_ptr_reg)
+            emit_insn (gen_popdi1 (cfun->drap_reg));
+        }
+      else
+        {
+          emit_insn (gen_addsi3 (stack_pointer_rtx,
+				 cfun->drap_reg,
+				 GEN_INT (-(STACK_BOUNDARY / BITS_PER_UNIT 
+					    + param_ptr_offset))));
+          if (cfun->save_param_ptr_reg)
+            emit_insn (gen_popsi1 (cfun->drap_reg));
+        }
+      
     }
 
   /* Sibcall epilogues don't want a return instruction.  */
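
The option-handling and frame-layout hunks above rely on two small pieces of arithmetic: turning the log2 argument of -mincoming-stack-boundary= into a boundary in bits, and rounding a frame offset up to a power-of-two alignment. A minimal standalone sketch of both follows. It is not part of the patch; the function names incoming_boundary_bits and align_up are hypothetical, and BITS_PER_UNIT is assumed to be 8 as on x86.

```c
#include <assert.h>
#include <stdlib.h>

#define BITS_PER_UNIT 8   /* assumption: 8 bits per byte, as on x86 */

/* Convert the log2 argument of -mincoming-stack-boundary= to a boundary
   in bits.  Returns 0 when the value is outside the accepted range
   (4..12 for 64-bit, 2..12 for 32-bit), where the patch reports an
   error instead.  */
unsigned int
incoming_boundary_bits (const char *arg, int target_64bit)
{
  int i = atoi (arg);
  int min = target_64bit ? 4 : 2;

  if (i < min || i > 12)
    return 0;
  return (1u << i) * BITS_PER_UNIT;
}

/* Round OFFSET up to the next multiple of ALIGN, which must be a
   power of two.  Same expression as the stack_realign_fp case in
   ix86_compute_frame_layout.  */
long
align_up (long offset, long align)
{
  assert (align > 0 && (align & (align - 1)) == 0);
  return (offset + align - 1) & -align;
}
```

For instance, an argument of "4" stands for 2^4 = 16 bytes, i.e. 128 bits, and an offset of 20 rounded up to a 16-byte boundary becomes 32.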



* Re: [RFA]: Merge stack alignment branch
  2008-04-04  6:31 [RFA]: Merge stack alignment branch Ye, Joey
@ 2008-04-04  6:39 ` Andrew Pinski
  2008-04-04 12:40   ` H.J. Lu
  2008-04-05 16:26   ` Ye, Joey
  2008-04-04 19:05 ` Jan Hubicka
  2008-04-10 10:42 ` Ye, Joey
  2 siblings, 2 replies; 26+ messages in thread
From: Andrew Pinski @ 2008-04-04  6:39 UTC (permalink / raw)
  To: Ye, Joey
  Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, <ubizjak@gmail.com>



Sent from my iPhone

On Apr 3, 2008, at 23:23, "Ye, Joey" <joey.ye@intel.com> wrote:

> STACK branch has been created for a while and a bunch of patches to
> implement stack alignment for i386/x86_64 have been checked in. Now
> this branch not only can support all stack variables to be aligned at
> their required boundary effectively, but also introduce zero
> regression against current trunk. Here is the background information
> and the patch. Comments and feedback are highly appreciated.


Why not align just the variables that need the extra alignment?  This
seems simpler and gets rid of the need for target-specific changes. I
already posted a patch to do it that way. It was created to support
the Cell processor. We really need variables with an alignment of
128 bytes, as DMA will only work on memory that is 128-byte aligned
if the size is greater than or equal to 128 bytes. Also, it is not the
common case that we need the extra alignment.

>
>
> -- BACKGROUND --
> Here, we propose a new design to fully support stack alignment while
> overcoming above problems. The new design will
> *  Support arbitrary alignment value, including 4,8,16,32...
> *  Adjust function stack alignment only when necessary
> *  Initial development will be on i386 and x86_64, but can be extended
> to other platforms
> *  Emit efficient prologue/epilogue code for stack align
> *  Coexist with special features like dynamic stack allocation
> (alloca), nested functions, register parameter passing, PIC code and
> tail call optimization, etc
> *  Be able to debug and unwind stack
>
> 2.1 Support arbitrary alignment value
> Different source code and optimizations require different stack
> alignment, as in the following table:
> Feature         Alignment (bytes)
> i386_ABI        4
> x86_64_ABI      16
> char            1
> short           2
> int             4
> long            4/8*
> long long       8
> __m64           8
> __m128          16
> float           4
> double          8
> long double     16
> user specified  any power of 2
>
> *Note: 4 for i386, 8 for x86_64
> The new design will support any alignment value in this table.
>
> 2.2 Adjust function stack alignment only when necessary
>
> Current GCC defines following macros related to stack alignment:
> i. STACK_BOUNDARY in bits, which is preferred by hardware, 32 for i386
> and 64 for x86_64. It is the minimum stack boundary. It is fixed.
> ii. PREFERRED_STACK_BOUNDARY. It sets the stack alignment when calling
> a function. It may be set at command line and has no impact on stack
> alignment at function entry. This proposal requires PREFERRED >=
> STACK, and by default set to ABI_STACK_BOUNDARY
>
> This design will define a few more macros, or concepts not explicitly
> defined in code:
> iii. ABI_STACK_BOUNDARY in bits, which is the stack boundary specified
> by psABI, 32 for i386 and 128 for x86_64.  ABI_STACK_BOUNDARY >=
> STACK_BOUNDARY. It is fixed for a given psABI.
> iv. LOCAL_STACK_BOUNDARY in bits. Each function stack has its own
> stack alignment requirement, which depends on the alignment of its
> stack variables, LOCAL_STACK_BOUNDARY = MAX (alignment of each
> effective stack variable).
> v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary at
> function entry. If a function is marked with
> __attribute__ ((force_align_arg_pointer)) or -mstackrealign option is
> provided, INCOMING = STACK_BOUNDARY. Otherwise, INCOMING ==
> PREFERRED_STACK_BOUNDARY because a function is typically called
> locally with the same PREFERRED_STACK_BOUNDARY. For those functions
> whose PREFERRED is larger than ABI, it is the caller's responsibility
> to invoke them with appropriate PREFERRED.
> vi. REQUIRED_STACK_ALIGNMENT in bits, which is stack alignment
> required by local variables and calling other function.
> REQUIRED_STACK_ALIGNMENT ==
> MAX(LOCAL_STACK_BOUNDARY,PREFERRED_STACK_BOUNDARY) in case of a
> non-leaf function. For a leaf function, REQUIRED_STACK_ALIGNMENT ==
> MAX(LOCAL_STACK_BOUNDARY,STACK_BOUNDARY).
>
> This proposal won't adjust stack when INCOMING_STACK_BOUNDARY >=
> REQUIRED_STACK_ALIGNMENT. Only when INCOMING_STACK_BOUNDARY <
> REQUIRED_STACK_ALIGNMENT, or PREFERRED_STACK_BOUNDARY of entry
> function less than ABI_STACK_BOUNDARY, it will adjust stack to
> REQUIRED_STACK_ALIGNMENT at prologue.
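
The rule in 2.2 can be sketched as a small standalone C function pair. This is only an illustration under stated assumptions: all boundaries are in bits, the function and parameter names are hypothetical, and the extra clause about the entry function's PREFERRED_STACK_BOUNDARY being below ABI_STACK_BOUNDARY is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Larger of two boundaries, both in bits.  */
static unsigned int
max_boundary (unsigned int a, unsigned int b)
{
  return a > b ? a : b;
}

/* REQUIRED_STACK_ALIGNMENT as described in 2.2: a leaf function only
   has to satisfy its own locals and the hardware minimum, while a
   non-leaf function must also provide PREFERRED to its callees.  */
unsigned int
required_stack_alignment (unsigned int local, unsigned int preferred,
			  unsigned int stack, bool is_leaf)
{
  return max_boundary (local, is_leaf ? stack : preferred);
}

/* Realign in the prologue only when the incoming boundary is weaker
   than what the function requires.  */
bool
needs_realign (unsigned int incoming, unsigned int required)
{
  return incoming < required;
}
```

With i386 values (STACK_BOUNDARY 32, PREFERRED 128), a leaf function whose locals need 32 bits requires no realignment for an incoming boundary of 32, while a non-leaf function with the same locals requires 128 and is realigned.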
>
> 2.3 Initial development on i386 and x86_64
> We initially support i386 and x86_64. In this document we focus more
> on i386 because it is hard to implement because of the restriction of
> having a small register file.  But all that we discuss can be easily
> applied to x86_64.
>
> 2.4 Emit more efficient prologue/epilogue
> When a function needs to adjust stack alignment and has no dynamic
> stack allocation, this design will generate following example
> prologue/epilogue code:
> IA32 example Prologue:
>        pushl     %ebp
>        movl      %esp, %ebp
>        andl      $-16, %esp
>        subl      $4, %esp ; is $-4 the local stack size?
> Epilogue:
>        movl      %ebp, %esp
>        popl      %ebp
>        ret
> Locals will be addressed as esp + offset and parameters as ebp +
> offset.
>
> Add x86_64 example here.
>
> Thus BP points to parameter frame and SP points to local frame.
>
> 2.5 Coexist with special features
> Stack alignment adjustment will coexist with various GCC features
> that have special calling conventions and frame layout, such as
> dynamic stack allocation (alloca), nested functions and parameter
> passing via registers to local functions.
>
> I386 hard register usage is the major problem to make the proposal
> friendly to various GCC features. This design requires an additional
> hard register in prologue/epilogue in case of dynamic stack
> allocation. The register is called Dynamic Realigned Argument
> Pointer, or DRAP. Because I386 PIC requires BX as GOT pointer and I386
> may use AX, DX and CX as parameter passing registers, also it has to
> work with setjmp/longjmp, there are limited candidates to choose.
> Current proposal uses CX as DRAP if CX is not used to pass parameter.
> If CX is not available DI will be used because it is preserved across
> setjmp/longjmp since it is callee-saved.
>
> X86_64 is much easier. This proposal just chooses R12 as DRAP, which
> is also preserved across setjmp/longjmp since it is callee-saved.
>
> DRAP will be assigned to a virtual register, or VDRAP, in prologue so
> that DRAP hard register itself can be free for register allocator in
> function body. Usually VDRAP will be allocated as the same DRAP
> register, thus the additional register move instruction is often
> removed.
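
The DRAP choice described in 2.5 can be sketched as a standalone C function modelled on find_drap_reg in the posted patch. Note that the quoted text names R12 for x86_64, whereas find_drap_reg returns R13_REG; the sketch follows the patch. The struct fn_info and its fields are hypothetical stand-ins for the compiler state the real function queries, and the register numbers are assumed to match i386.md (CX as hard register 2 is an assumption; DI_REG 5 and R13_REG 42 appear in the diff).

```c
#include <assert.h>
#include <stdbool.h>

/* Hard register numbers, assumed to mirror i386.md.  */
enum reg { CX_REG = 2, DI_REG = 5, R13_REG = 42 };

/* Hypothetical summary of the properties find_drap_reg looks at.  */
struct fn_info
{
  bool target_64bit;        /* TARGET_64BIT */
  bool needs_static_chain;  /* nested function using the static chain */
  bool tail_call_emit;      /* cfun->tail_call_emit */
  int regparm;              /* result of ix86_function_regparm */
  bool fastcall;            /* has the "fastcall" attribute */
};

/* Pick the register that holds the incoming argument pointer while
   the stack is being realigned.  */
enum reg
pick_drap_reg (const struct fn_info *fn)
{
  if (fn->target_64bit)
    return R13_REG;

  /* CX carries the static chain, and may be live at a tail call.  */
  if (fn->needs_static_chain || fn->tail_call_emit)
    return DI_REG;

  /* CX is free unless it is used to pass a parameter.  */
  if (fn->regparm <= 2 && !fn->fastcall)
    return CX_REG;

  return DI_REG;
}
```

Since DI is callee-saved, choosing it costs an extra push/pop pair in the prologue and epilogue, which is why the patch sets save_param_ptr_reg whenever the chosen register is not CX.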
>
> 2.5.1 When stack alignment adjustment comes together with alloca,
> following example prologue/epilogue will be emitted:
> Prologue:
>       pushl     %edi                     // Save callee save reg edi
>       leal      8(%esp), %edi            // Save address of parameter frame
>       andl      $-16, %esp               // Align local stack
>
> //  Reserve two stack slots and save return address
> //  and previous frame pointer into them. By
> //  pointing new ebp to them, we build a pseudo
> //  stack for unwinding.
>       pushl     -4(%edi)                 //  save return address
>       pushl     %ebp                     //  save old ebp
>       movl      %esp, %ebp               //  point ebp to pseudo frame start
>
>       subl      $24, %esp                // adjust local frame size
>       movl      %edi, vreg1
>
> epilogue:
>       movl      vreg1, %edi
>       movl      %ebp, %esp               // Restore esp to pseudo  
> frame
> start
>       popl      %ebp
>       leal      -8(%edi), %esp           // restore esp to real frame
> start
>       popl      %edi                     // Restore edi
>       ret
>
> Locals will be addressed as ebp - offset, parameters as vreg1 + offset.
>
> Here DI is used to set up the virtual parameter frame pointer, BP
> points to the local frame and SP points to the dynamic allocation frame.
>
> 2.5.2 Nested functions will automatically work because they use CX as
> the static chain pointer, which won't conflict with any registers used
> by stack alignment adjustment, even when nested functions are called
> via a function pointer and a function stub on the stack.
>
> 2.5.3 GCC may optimize to use registers to pass parameters. At most AX,
> DX and CX will be used. Such optimization won't conflict with stack
> alignment adjustment, thus it should automatically work.
>
> 2.5.4 I386 PIC uses an available register or EBX as GOT pointer. This
> design works well under i386 PIC. When picking a register for PIC, we
> will avoid using the DRAP register:
>
> For example:
> i686 Prologue:
>        pushl     %edi
>        leal      8(%esp), %edi
>        andl      $-16, %esp
>        pushl     -4(%edi)
>        pushl     %ebp
>        movl      %esp, %ebp
>        subl      $24,  %esp
>        call      .L1
> .L1:
>        popl      %ebx
>        movl      %edi, vreg1
>
> Body:  // code for alloca
>        movl      (vreg1), %eax
>        subl      %eax, %esp
>        andl      $-16, %esp
>        movl      %esp, %eax
>
> i686 Epilogue:
>        movl      %ebp, %esp
>        popl      %ebp
>        leal      -8(%edi), %esp
>        popl      %edi
>        ret
>
> Locals will be addressed as ebp - offset, parameters as vreg1 + offset,
> ebx has the GOT pointer.
>
> 2.6 Debug and unwind will work since DWARF2 has the flexibility to
> define different frame pointers.
>
> 2.7 Some intrinsics rely on the stack layout and need to be handled
> accordingly. They are __builtin_return_address and
> __builtin_frame_address. This proposal will set up a pseudo frame slot
> to help the unwinder find the return address and parent frame address
> by emitting the following prologue code after adjusting alignment:
>        pushl     -4(%edi)
>        pushl     %ebp
>
> ChangeLog:
> 2008-04-04  Uros Bizjak  <ubizjak@gmail.com>
>        H.J. Lu  <hongjiu.lu@intel.com>
>
>    PR target/12329
>    * config/i386/i386.c (ix86_function_regparm): Limit the number of
>    register passing arguments to 2 for nested functions.
>
> 2008-04-04  Joey Ye  <joey.ye@intel.com>
>        H.J. Lu  <hongjiu.lu@intel.com>
>        Xuepeng Guo  <xuepeng.guo@intel.com>
>
>    * builtins.c (expand_builtin_setjmp_receiver): Replace
>    virtual_incoming_args_rtx with
>    current_function_internal_arg_pointer.
>    (expand_builtin_apply_args_1): Likewise.
>
>    * calls.c (expand_call): Don't calculate preferred stack
>    boundary according to incoming stack boundary. Replace
>    virtual_incoming_args_rtx with
>    current_function_internal_arg_pointer.
>
>    * cfgexpand.c (get_decl_align_unit): Estimate stack variable
>    alignment and store to stack_alignment_estimated and
>    stack_alignment_used.
>    (expand_one_var): Likewise.
>    (gate_stack_realign): Gate new pass pass_collect_stackrealign_info
>    and pass_handle_drap.
>    (collect_stackrealign_info): Execute new pass
>    pass_collect_stackrealign_info.
>    (pass_collect_stackrealign_info): Define new pass.
>    (handle_drap): Execute new pass pass_handle_drap.
>    (pass_handle_drap): Define new pass.
>
>    * defaults.h (MAX_VECTORIZE_STACK_ALIGNMENT): New.
>
>    * dojump.c (clear_pending_stack_adjust): Leave a FIXME in
>    comments in case pending stack adjustment is discarded when stack
>    realign is needed.
>
>    * flags.h (frame_pointer_needed): Removed.
>    * final.c (frame_pointer_needed): Likewise.
>
>    * function.c (assign_stack_local_1): Estimate stack variable
>    alignment and store to stack_alignment_estimated.
>    (instantiate_new_reg): Instantiate virtual incoming args rtx to
>    vDRAP if stack realignment and DRAP is needed.
>    (assign_parms): Collect parameter/return type alignment and
>    contribute to stack_alignment_estimated.
>    (locate_and_pad_parm): Likewise.
>    (allocate_struct_function): Init stack_alignment_estimated and
>    stack_alignment_used.
>    (get_arg_pointer_save_area): Replace virtual_incoming_args_rtx
>    with current_function_internal_arg_pointer.
>
>    * function.h (function): Add drap_reg, stack_alignment_estimated,
>    need_frame_pointer, need_frame_pointer_set, stack_realign_needed,
>    stack_realign_really, need_drap, save_param_ptr_reg,
>    stack_realign_processed, stack_realign_finalized and
>    stack_realign_used.
>    (frame_pointer_needed): New.
>    (stack_realign_fp): Likewise.
>    (stack_realign_drap): Likewise.
>
>    * global.c (compute_regsets): Set frame_pointer_needed cannot_elim
>    wrt stack_realign_needed.
>
>    * stmt.c (expand_nl_goto_receiver): Replace
>    virtual_incoming_args_rtx with
>    current_function_internal_arg_pointer.
>
>    * passes.c (pass_collect_stackrealign_info): Insert this new pass
>    immediately before expand.
>    (pass_handle_drap): Insert this new pass immediately after expand.
>
>    * tree-inline.c (expand_call_inline): Estimate stack variable
>    alignment and store to stack_alignment_estimated.
>
>    * tree-pass.h (pass_handle_drap): New.
>    (pass_collect_stackrealign_info): Likewise.
>
>    * tree-vectorizer.c (vect_can_force_dr_alignment_p): Estimate
>    stack variable alignment and store to stack_alignment_estimated.
>
>    * reload1.c (set_label_offsets): Assert that frame pointer must be
>    eliminated to stack pointer in case stack realignment is estimated
>    to happen without DRAP.
>    (elimination_effects): Likewise.
>    (eliminate_regs_in_insn): Likewise.
>    (mark_not_eliminable): Likewise.
>    (update_eliminables): Frame pointer is needed in case of stack
>    realignment needed.
>    (init_elim_table): Don't set frame_pointer_needed here.
>
>    * dwarf2out.c (CUR_FDE): New.
>    (reg_save_with_expression): Likewise.
>    (dw_fde_struct): Add drap_regnum, stack_realignment,
>    is_stack_realign, is_drap and is_drap_reg_saved.
>    (add_cfi): If stack is realigned, call reg_save_with_expression
>    to represent the location of stored vars.
>    (dwarf2out_frame_debug_expr): Add rules 16-19 to handle stack
>    realign.
>    (output_cfa_loc): Handle DW_CFA_expression.
>    (based_loc_descr): Update assert for stack realign.
>
>    * config/i386/i386.c (ix86_force_align_arg_pointer_string): Break
>    long line.
>    (ix86_user_incoming_stack_boundary): New.
>    (ix86_default_incoming_stack_boundary): Likewise.
>    (ix86_incoming_stack_boundary): Likewise.
>    (find_drap_reg): Likewise.
>    (override_options): Override option value for new options.
>    (ix86_function_ok_for_sibcall): Sibcall is OK even if the stack
>    needs realigning.
>    (ix86_handle_cconv_attribute): Stack realign no longer impacts
>    number of regparm.
>    (ix86_function_regparm): Likewise.
>    (setup_incoming_varargs_64): Remove the logic to set
>    stack_alignment_needed here.
>    (ix86_va_start): Replace virtual_incoming_args_rtx with
>    current_function_internal_arg_pointer.
>    (ix86_save_reg): Replace force_align_arg_pointer with drap_reg.
>    (ix86_compute_frame_layout): Compute frame layout wrt stack
>    realignment.
>    (ix86_internal_arg_pointer): Estimate if stack realignment is
>    needed and returns appropriate arg pointer rtx accordingly.
>    (ix86_expand_prologue): Finally decide if stack realignment
>    is needed and generate prologue code accordingly.
>    (ix86_expand_epilogue): Generate epilogue code wrt stack
>    realignment is really needed or not.
>    * config/i386/i386.c (ix86_select_alt_pic_regnum): Check
>    DRAP register.
>
>    * config/i386/i386.h (MAIN_STACK_BOUNDARY): New.
>    (ABI_STACK_BOUNDARY): Likewise.
>    (PREFERRED_STACK_BOUNDARY_DEFAULT): Likewise.
>    (STACK_REALIGN_DEFAULT): Likewise.
>    (INCOMING_STACK_BOUNDARY): Likewise.
>    (MAX_VECTORIZE_STACK_ALIGNMENT): Likewise.
>    (ix86_incoming_stack_boundary): Likewise.
>    (REAL_PIC_OFFSET_TABLE_REGNUM): Updated to use BX_REG.
>    (CAN_ELIMINATE): Redefine the macro to eliminate frame pointer to
>    stack pointer and arg pointer to hard frame pointer
> <merge-stack-0404.patch>


* Re: [RFA]: Merge stack alignment branch
  2008-04-04  6:39 ` Andrew Pinski
@ 2008-04-04 12:40   ` H.J. Lu
  2008-04-04 19:18     ` Andrew Pinski
  2008-04-05 16:26   ` Ye, Joey
  1 sibling, 1 reply; 26+ messages in thread
From: H.J. Lu @ 2008-04-04 12:40 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Ye, Joey, GCC Patches, Lu, Hongjiu, Guo, Xuepeng,
	<ubizjak@gmail.com>

On Thu, Apr 03, 2008 at 11:34:26PM -0700, Andrew Pinski wrote:
>
>
> Sent from my iPhone
>
> On Apr 3, 2008, at 23:23, "Ye, Joey" <joey.ye@intel.com> wrote:
>
>> STACK branch has been created for a while and a bunch of patches to
>> implement stack alignment for i386/x86_64 have been checked in. Now this
>> branch not only can support all stack variables to be aligned at their
>> required boundary effectively, but also introduce zero regression
>> against current trunk. Here is the background information and the patch.
>> Comments and feedback are highly appreciated.
>
>
> Why not align the variables that need the extra alignment?  This seems 
> simpler and gets rid of the need for target specific changes. I already 
> posted a patch to do it that way. It was created to support the Cell 
> processor. We really need variables which have alignment of 128 byte as the 
> DMA will only work on memory that is 128 byte aligned if the size is 
> greater than or equal to 128. Also it is not the common case that we need 
> the extra alignment.
>

Does your scheme work with reload? To support AVX, we need to
align the stack to 32 bytes whenever an AVX register is accessed on the
stack. That will be a very common case when -mavx is used.  But we want
to keep the preferred stack boundary, which defines the incoming stack
boundary, at 16 bytes for binary compatibility.
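
As a rough illustration of that constraint (a sketch only; the helper name
is hypothetical and is not a function in GCC or in the patch), the prologue
has to realign exactly when the incoming boundary is smaller than what the
function body requires. With a 16-byte (128-bit) incoming boundary and a
32-byte (256-bit) AVX spill slot, that is the case:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not a GCC function: true when the prologue must
   realign the stack.  Both arguments are in bits, like GCC's
   *_STACK_BOUNDARY macros.  */
static bool
needs_stack_realign (unsigned int incoming_bits, unsigned int required_bits)
{
  /* Stack boundaries are nonzero powers of two.  */
  assert (incoming_bits != 0 && (incoming_bits & (incoming_bits - 1)) == 0);
  assert (required_bits != 0 && (required_bits & (required_bits - 1)) == 0);
  return incoming_bits < required_bits;
}
```

Here needs_stack_realign (128, 256) is true, while
needs_stack_realign (128, 128) is false.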

H.J.


* Re: [RFA]: Merge stack alignment branch
  2008-04-04  6:31 [RFA]: Merge stack alignment branch Ye, Joey
  2008-04-04  6:39 ` Andrew Pinski
@ 2008-04-04 19:05 ` Jan Hubicka
  2008-04-04 21:05   ` H.J. Lu
                     ` (2 more replies)
  2008-04-10 10:42 ` Ye, Joey
  2 siblings, 3 replies; 26+ messages in thread
From: Jan Hubicka @ 2008-04-04 19:05 UTC (permalink / raw)
  To: Ye, Joey; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

Hi,
I will look at the patch in detail later this weekend.  I think it would
make sense to break up the necessary changes in generic bits into separate
patches (in -x -cp format).  This will ease the reviewing process since I,
for instance, can't approve non-i386 specific bits of your patch.

Index: tree-inline.c
===================================================================
--- tree-inline.c	(.../trunk/gcc)	(revision 133813)
+++ tree-inline.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2841,8 +2841,26 @@
 	cfun->unexpanded_var_list = tree_cons (NULL_TREE, var,
 					       cfun->unexpanded_var_list);
       else
-	cfun->unexpanded_var_list = tree_cons (NULL_TREE, remap_decl (var, id),
-					       cfun->unexpanded_var_list);
+	{
+	  /* Update stack alignment requirement if needed.  */
+	  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+	    {
+	      unsigned int align;
+
+	      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+		align = TYPE_ALIGN (TREE_TYPE (var));
+	      else
+		align = DECL_ALIGN (var);
+	      if (align  > cfun->stack_alignment_estimated)
+		{
+		  gcc_assert(!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	  cfun->unexpanded_var_list
+	    = tree_cons (NULL_TREE, remap_decl (var, id),
+			 cfun->unexpanded_var_list);
+	}

I think it is a mistake to maintain info about stack alignment during
gimple transformations.  At expansion time we walk the list and we can
figure out the alignment then, after some of the variables have possibly
been optimized out.

The info can also go into the RTL data structures I am trying to introduce
instead of cfun then.
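
A minimal sketch of that idea, assuming a simplified variable list (the
struct and function below are hypothetical stand-ins, not GCC data
structures or the proposed implementation): the alignment is computed in one
walk at expansion time, so variables that were optimized out no longer
contribute to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a stack variable; not a GCC data structure.  */
struct stack_var
{
  unsigned int align_bits;   /* required alignment in bits */
  int optimized_out;         /* nonzero if the variable no longer exists */
  struct stack_var *next;
};

/* Return the largest alignment among the variables that survived
   optimization, never less than MIN_BITS.  */
static unsigned int
estimate_stack_alignment (const struct stack_var *vars, unsigned int min_bits)
{
  unsigned int align = min_bits;

  assert (min_bits != 0);
  for (const struct stack_var *v = vars; v != NULL; v = v->next)
    if (!v->optimized_out && v->align_bits > align)
      align = v->align_bits;
  return align;
}
```

With this shape, a 256-bit variable that has been optimized out does not
force realignment of a frame whose remaining variables need only 128 bits.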

Honza


* Re: [RFA]: Merge stack alignment branch
  2008-04-04 12:40   ` H.J. Lu
@ 2008-04-04 19:18     ` Andrew Pinski
  2008-04-04 20:33       ` H.J. Lu
  2008-04-05 15:24       ` Ye, Joey
  0 siblings, 2 replies; 26+ messages in thread
From: Andrew Pinski @ 2008-04-04 19:18 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Ye, Joey, GCC Patches, Lu, Hongjiu, Guo, Xuepeng,
	<ubizjak@gmail.com>

On Fri, Apr 4, 2008 at 5:35 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>  Does your scheme work with reload? To support AVX, we need to
>  align stack to 32byte whenever AVX register is accessed on stack.
>  That will be a very common case when -mavx is used.  But we want
>  to keep preferred stack boundary, which defines incoming stack
>  boundary, as 16byte for binary compatibility.

Yes it works with reload.  The variable gets reloaded to a stack slot
which is aligned correctly.  The stack on x86 is still aligned to the
word boundary but the variables get aligned to the correct alignment.
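
For illustration only, aligning an individual variable on a stack that is
itself only word aligned is commonly done by over-allocating the slot and
rounding its address up. The sketch below assumes that general technique;
the helper names are hypothetical and are not taken from either patch.

```c
#include <assert.h>
#include <stdint.h>

/* Round ADDR up to the next multiple of ALIGN (a power of two).
   Hypothetical helper, not part of either patch.  */
static uintptr_t
align_up (uintptr_t addr, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (addr + align - 1) & ~(align - 1);
}

/* Bytes to reserve for a SIZE-byte variable so that an ALIGN-aligned
   region of SIZE bytes fits wherever the slot starts, given that the
   stack only guarantees STACK_ALIGN bytes.  */
static uintptr_t
padded_slot_size (uintptr_t size, uintptr_t align, uintptr_t stack_align)
{
  return align > stack_align ? size + (align - stack_align) : size;
}
```

The cost of this approach is the padding per slot plus the address
computation for each over-aligned variable.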

There is still a bug with respect to VLAs not getting the correct
alignment.  This happens with your scheme also.

Thanks,
Andrew Pinski


* Re: [RFA]: Merge stack alignment branch
  2008-04-04 19:18     ` Andrew Pinski
@ 2008-04-04 20:33       ` H.J. Lu
  2008-04-05 15:24       ` Ye, Joey
  1 sibling, 0 replies; 26+ messages in thread
From: H.J. Lu @ 2008-04-04 20:33 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Ye, Joey, GCC Patches, Lu, Hongjiu, Guo, Xuepeng,
	<ubizjak@gmail.com>

On Fri, Apr 04, 2008 at 11:45:21AM -0700, Andrew Pinski wrote:
> On Fri, Apr 4, 2008 at 5:35 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >  Does your scheme work with reload? To support AVX, we need to
> >  align stack to 32byte whenever AVX register is accessed on stack.
> >  That will be a very common case when -mavx is used.  But we want
> >  to keep preferred stack boundary, which defines incoming stack
> >  boundary, as 16byte for binary compatibility.
> 
> Yes it works with reload.  The variable gets reloaded to a stack slot
> which is aligned correctly.  The stack on x86 is still aligned to the

So for every AVX register spill, you will align its stack slot?

> word boundary but the variables get aligned to the correct alignment.
> 
> There is still a bug with respect of VLA's not getting the correct
> alignment.  This happens with your scheme also.

Do you have a testcase? We have 2 vararg testcases for stack alignment
and they work correctly. I can add one more.

Thanks.


H.J.


* Re: [RFA]: Merge stack alignment branch
  2008-04-04 19:05 ` Jan Hubicka
@ 2008-04-04 21:05   ` H.J. Lu
  2008-04-08  1:57   ` Ye, Joey
  2008-04-11 12:32   ` Ye, Joey
  2 siblings, 0 replies; 26+ messages in thread
From: H.J. Lu @ 2008-04-04 21:05 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Ye, Joey, GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

On Fri, Apr 04, 2008 at 08:39:27PM +0200, Jan Hubicka wrote:
> Hi,
> I will look in detail to the patch later this weekend.  I think it would

Thanks.

> make sense to break up necessary changes in generic bits to separate
> patches (in -x -cp format).  This will ease reviewing process since I
> for instance can't approve non-i386 specific bits of your patch.
> 
> Index: tree-inline.c
> ===================================================================
> --- tree-inline.c	(.../trunk/gcc)	(revision 133813)
> +++ tree-inline.c	(.../branches/stack/gcc)	(revision 133869)
> @@ -2841,8 +2841,26 @@
>  	cfun->unexpanded_var_list = tree_cons (NULL_TREE, var,
>  					       cfun->unexpanded_var_list);
>        else
> -	cfun->unexpanded_var_list = tree_cons (NULL_TREE, remap_decl (var, id),
> -					       cfun->unexpanded_var_list);
> +	{
> +	  /* Update stack alignment requirement if needed.  */
> +	  if (MAX_VECTORIZE_STACK_ALIGNMENT)
> +	    {
> +	      unsigned int align;
> +
> +	      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
> +		align = TYPE_ALIGN (TREE_TYPE (var));
> +	      else
> +		align = DECL_ALIGN (var);
> +	      if (align  > cfun->stack_alignment_estimated)
> +		{
> +		  gcc_assert(!cfun->stack_realign_processed);
> +		  cfun->stack_alignment_estimated = align;
> +		}
> +	    }
> +	  cfun->unexpanded_var_list
> +	    = tree_cons (NULL_TREE, remap_decl (var, id),
> +			 cfun->unexpanded_var_list);
> +	}
> 
> I think it is mistake to maintain info about stack alignment during
> gimple transformations.  At expansion time we walk the list and we can
> figure out the alignment once possibly some of the variables are
> optimized out.
> 

We need to collect accurate stack alignment information. We appreciate
any suggestions. Joey, can you give Jan's suggestion a try?

FYI, for testing, we use the enclosed patch to put some stress on
stack alignment. The only known failures should be

FAIL: gcc.target/i386/stackalign/asm-1.c -mstackrealign (internal compiler error)
FAIL: gcc.target/i386/stackalign/asm-1.c -mstackrealign (test for excess errors)
FAIL: gcc.target/i386/stackalign/asm-1.c -mno-stackrealign (internal compiler error)
FAIL: gcc.target/i386/stackalign/asm-1.c -mno-stackrealign (test for excess errors)
FAIL: gcc.target/i386/stackalign/local-1.c -mno-stackrealign scan-assembler-not sub[^\\n]*sp



H.J.
--- ../../gcc-stack-fsf/gcc/gcc/config/i386/i386.h	2008-03-20 14:57:29.000000000 -0700
+++ gcc/config/i386/i386.h	2008-03-20 15:13:33.000000000 -0700
@@ -824,6 +824,18 @@ enum target_cpu_default
    128, stacks for all functions may be realigned.  */
 #define STACK_REALIGN_DEFAULT 0
 
+/* The followings should be removed when we merge the stack alignment
+   branch to mainline.  */ 
+#undef PREFERRED_STACK_BOUNDARY_DEFAULT
+#undef STACK_REALIGN_DEFAULT
+#if 0
+#define PREFERRED_STACK_BOUNDARY_DEFAULT ABI_STACK_BOUNDARY
+#define STACK_REALIGN_DEFAULT 0
+#else
+#define PREFERRED_STACK_BOUNDARY_DEFAULT 128
+#define STACK_REALIGN_DEFAULT (TARGET_64BIT ? 0 : 1)
+#endif
+
 /* Boundary (in *bits*) on which the incoming stack is aligned.  */
 #define INCOMING_STACK_BOUNDARY ix86_incoming_stack_boundary
 


* RE: [RFA]: Merge stack alignment branch
  2008-04-04 19:18     ` Andrew Pinski
  2008-04-04 20:33       ` H.J. Lu
@ 2008-04-05 15:24       ` Ye, Joey
  1 sibling, 0 replies; 26+ messages in thread
From: Ye, Joey @ 2008-04-05 15:24 UTC (permalink / raw)
  To: Andrew Pinski, H.J. Lu; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

Andrew,

Since your scheme works with reload, maybe both schemes can work together
and complement each other. This patch handles stack alignment
correctly and efficiently in most cases. But due to some corner
cases in reload, it handles alignment conservatively. If both schemes
work together, the result can be efficient in most cases and
accurate in the corner cases.

Thanks - Joey

-----Original Message-----
From: gcc-patches-owner@gcc.gnu.org
[mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Andrew Pinski
Sent: Saturday, April 05, 2008 2:45 AM
To: H.J. Lu
Cc: Ye, Joey; GCC Patches; Lu, Hongjiu; Guo, Xuepeng;
<ubizjak@gmail.com>
Subject: Re: [RFA]: Merge stack alignment branch

On Fri, Apr 4, 2008 at 5:35 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>  Does your scheme work with reload? To support AVX, we need to
>  align stack to 32byte whenever AVX register is accessed on stack.
>  That will be a very common case when -mavx is used.  But we want
>  to keep preferred stack boundary, which defines incoming stack
>  boundary, as 16byte for binary compatibility.

Yes it works with reload.  The variable gets reloaded to a stack slot
which is aligned correctly.  The stack on x86 is still aligned to the
word boundary but the variables get aligned to the correct alignment.

There is still a bug with respect of VLA's not getting the correct
alignment.  This happens with your scheme also.

Thanks,
Andrew Pinski


* RE: [RFA]: Merge stack alignment branch
  2008-04-04  6:39 ` Andrew Pinski
  2008-04-04 12:40   ` H.J. Lu
@ 2008-04-05 16:26   ` Ye, Joey
  1 sibling, 0 replies; 26+ messages in thread
From: Ye, Joey @ 2008-04-05 16:26 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

Andrew,

I reviewed your patch. If I understood it correctly, it generates
additional instructions for each stack variable that needs bigger
alignment. That may not be optimal when such variables are accessed
frequently.

This patch uses as little as one additional instruction to align the
frame for the current function. It is more efficient in most cases. As
for the changes to target-specific code, I think the benefit justifies
the effort.
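
The single instruction in question is the "andl $-16, %esp" shown in the
design's prologue examples. A minimal C sketch of what it computes (the
function name is hypothetical, chosen for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Round the stack pointer SP down to a multiple of ALIGN bytes, the C
   equivalent of "andl $-ALIGN, %esp".  The stack grows down on x86, so
   rounding down only adds padding below the incoming frame.  */
static uintptr_t
realign_stack_pointer (uintptr_t sp, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  /* ~(align - 1) equals -align in two's complement.  */
  return sp & ~(align - 1);
}
```

Once the frame base is aligned this way, every local can be placed at a
fixed aligned offset from it without further per-variable work.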

Thanks - Joey 

-----Original Message-----
From: Andrew Pinski [mailto:pinskia@gmail.com] 
Sent: Friday, April 04, 2008 2:34 PM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; <ubizjak@gmail.com>
Subject: Re: [RFA]: Merge stack alignment branch



Sent from my iPhone

On Apr 3, 2008, at 23:23, "Ye, Joey" <joey.ye@intel.com> wrote:

> STACK branch has been created for a while and a bunch of patches to
> implement stack alignment for i386/x86_64 have been checked in. Now this
> branch not only can support all stack variables to be aligned at their
> required boundary effectively, but also introduce zero regression
> against current trunk. Here is the background information and the patch.
> Comments and feedback are highly appreciated.


Why not align the variables that need the extra alignment?  This seems  
simpler and gets rid of the need for target specific changes. I  
already posted a patch to do it that way. It was created to support  
the Cell processor. We really need variables which have alignment of  
128 byte as the DMA will only work on memory that is 128 byte aligned  
if the size is greater than or equal to 128. Also it is not the common  
case that we need the extra alignment.

>
>
> -- BACKGROUND --
> Here, we propose a new design to fully support stack alignment while
> overcoming above problems. The new design will
> *  Support arbitrary alignment value, including 4,8,16,32...
> *  Adjust function stack alignment only when necessary
> *  Initial development will be on i386 and x86_64, but can be extended
> to other platforms
> *  Emit efficient prologue/epilogue code for stack align
> *  Coexist with special features like dynamic stack allocation (alloca),
> nested functions, register parameter passing, PIC code and tail call
> optimization, etc
> *  Be able to debug and unwind stack
>
> 2.1 Support arbitrary alignment value
> Different source code and optimizations require different stack
> alignment, as in the following table:
> Feature         Alignment (bytes)
> i386_ABI        4
> x86_64_ABI      16
> char            1
> short           2
> int             4
> long            4/8*
> long long       8
> __m64           8
> __m128          16
> float           4
> double          8
> long double     16
> user specified  any power of 2
>
> *Note: 4 for i386, 8 for x86_64
> The new design will support any alignment value in this table.
>
> 2.2 Adjust function stack alignment only when necessary
>
> Current GCC defines the following macros related to stack alignment:
> i. STACK_BOUNDARY in bits, which is preferred by hardware, 32 for i386
> and 64 for x86_64. It is the minimum stack boundary. It is fixed.
> ii. PREFERRED_STACK_BOUNDARY. It sets the stack alignment when calling
> a function. It may be set on the command line and has no impact on
> stack alignment at function entry. This proposal requires PREFERRED >=
> STACK, and by default sets it to ABI_STACK_BOUNDARY.
>
> This design will define a few more macros, or concepts not explicitly
> defined in code:
> iii. ABI_STACK_BOUNDARY in bits, which is the stack boundary specified
> by psABI, 32 for i386 and 128 for x86_64.  ABI_STACK_BOUNDARY >=
> STACK_BOUNDARY. It is fixed for a given psABI.
> iv. LOCAL_STACK_BOUNDARY in bits. Each function stack has its own stack
> alignment requirement, which depends on the alignment of its stack
> variables: LOCAL_STACK_BOUNDARY = MAX (alignment of each effective
> stack variable).
> v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary at
> function entry. If a function is marked with
> __attribute__ ((force_align_arg_pointer)) or the -mstackrealign option
> is provided, INCOMING = STACK_BOUNDARY. Otherwise, INCOMING ==
> PREFERRED_STACK_BOUNDARY because a function is typically called
> locally with the same PREFERRED_STACK_BOUNDARY. For those functions
> whose PREFERRED is larger than ABI, it is the caller's responsibility
> to invoke them with appropriate PREFERRED.
> vi. REQUIRED_STACK_ALIGNMENT in bits, which is the stack alignment
> required by local variables and calling other functions.
> REQUIRED_STACK_ALIGNMENT == MAX(LOCAL_STACK_BOUNDARY,
> PREFERRED_STACK_BOUNDARY) in case of a non-leaf function. For a leaf
> function, REQUIRED_STACK_ALIGNMENT == MAX(LOCAL_STACK_BOUNDARY,
> STACK_BOUNDARY).
>
> This proposal won't adjust the stack when INCOMING_STACK_BOUNDARY >=
> REQUIRED_STACK_ALIGNMENT. Only when INCOMING_STACK_BOUNDARY <
> REQUIRED_STACK_ALIGNMENT, or when PREFERRED_STACK_BOUNDARY of the entry
> function is less than ABI_STACK_BOUNDARY, will it adjust the stack to
> REQUIRED_STACK_ALIGNMENT in the prologue.
>
> 2.3 Initial development on i386 and x86_64
> We initially support i386 and x86_64. In this document we focus more on
> i386 because it is harder to implement due to its small register
> file. But all that we discuss can be easily applied to x86_64.
>
> 2.4 Emit more efficient prologue/epilogue
> When a function needs to adjust stack alignment and has no dynamic stack
> allocation, this design will generate the following example
> prologue/epilogue code:
> IA32 example Prologue:
>        pushl     %ebp
>        movl      %esp, %ebp
>        andl      $-16, %esp
>        subl      $4, %esp ; is $-4 the local stack size?
> Epilogue:
>        movl      %ebp, %esp
>        popl      %ebp
>        ret
> Locals will be addressed as esp + offset and parameters as ebp + offset.
>
> Add x86_64 example here.
>
> Thus BP points to parameter frame and SP points to local frame.
>
> 2.5 Coexist with special features
> Stack alignment adjustment will coexist with various GCC features
> that have special calling conventions and frame layout, such as dynamic
> stack allocation (alloca), nested functions and parameter passing via
> registers to local functions.
>
> I386 hard register usage is the major obstacle to making the proposal
> work with the various GCC features. This design requires an additional
> hard register in the prologue/epilogue in case of dynamic stack
> allocation. The register is called the Dynamic Realigned Argument
> Pointer, or DRAP. Because I386 PIC requires BX as the GOT pointer, I386
> may use AX, DX and CX as parameter passing registers, and the design
> has to work with setjmp/longjmp, there are few candidates to choose
> from. The current proposal uses CX as DRAP if CX is not used to pass a
> parameter. If CX is not available, DI will be used because it is
> callee-saved and therefore preserved across setjmp/longjmp.
>
> X86_64 is much easier. This proposal just chooses R12 as DRAP, which is
> also preserved across setjmp/longjmp since it is callee-saved.
>
> DRAP will be assigned to a virtual register, or VDRAP, in the prologue
> so that the DRAP hard register itself is free for the register
> allocator in the function body. Usually VDRAP will be allocated to the
> same register as DRAP, so the additional register move instruction is
> often removed.
>
> 2.5.1 When stack alignment adjustment comes together with alloca, the
> following example prologue/epilogue will be emitted:
> Prologue:
>       pushl     %edi                     // Save callee-saved reg edi
>       leal      8(%esp), %edi            // Save address of parameter frame
>       andl      $-16, %esp               // Align local stack
>
> //  Reserve two stack slots and save return address
> //  and previous frame pointer into them. By
> //  pointing new ebp to them, we build a pseudo
> //  stack for unwinding.
>       pushl     -4(%edi)                 //  save return address
>       pushl     %ebp                     //  save old ebp
>       movl      %esp, %ebp               //  point ebp to pseudo frame start
>
>       subl      $24, %esp                // adjust local frame size
>       movl      %edi, vreg1
>
> Epilogue:
>       movl      vreg1, %edi
>       movl      %ebp, %esp               // Restore esp to pseudo frame start
>       popl      %ebp
>       leal      -8(%edi), %esp           // restore esp to real frame start
>       popl      %edi                     // Restore edi
>       ret
>
> Locals will be addressed as ebp - offset, parameters as vreg1 + offset.
>
> Here DI is used to set up the virtual parameter frame pointer, BP
> points to the local frame and SP points to the dynamic allocation frame.
>
> 2.5.2 Nested functions will automatically work because they use CX as
> the static chain pointer, which won't conflict with any registers used
> by stack alignment adjustment, even when nested functions are called
> via a function pointer and a function stub on the stack.
>
> 2.5.3 GCC may optimize to use registers to pass parameters. At most AX,
> DX and CX will be used. Such optimization won't conflict with stack
> alignment adjustment, thus it should automatically work.
>
> 2.5.4 I386 PIC uses an available register or EBX as GOT pointer. This
> design works well under i386 PIC. When picking a register for PIC, we
> will avoid using the DRAP register:
>
> For example:
> i686 Prologue:
>        pushl     %edi
>        leal      8(%esp), %edi
>        andl      $-16, %esp
>        pushl     -4(%edi)
>        pushl     %ebp
>        movl      %esp, %ebp
>        subl      $24,  %esp
>        call      .L1
> .L1:
>        popl      %ebx
>        movl      %edi, vreg1
>
> Body:  // code for alloca
>        movl      (vreg1), %eax
>        subl      %eax, %esp
>        andl      $-16, %esp
>        movl      %esp, %eax
>
> i686 Epilogue:
>        movl      %ebp, %esp
>        popl      %ebp
>        leal      -8(%edi), %esp
>        popl      %edi
>        ret
>
> Locals will be addressed as ebp - offset, parameters as vreg1 + offset,
> ebx has the GOT pointer.
>
> 2.6 Debug and unwind will work since DWARF2 has the flexibility to
> define different frame pointers.
>
> 2.7 Some intrinsics rely on the stack layout and need to be handled
> accordingly. They are __builtin_return_address and
> __builtin_frame_address. This proposal will set up a pseudo frame slot
> to help the unwinder find the return address and parent frame address
> by emitting the following prologue code after adjusting alignment:
>        pushl     -4(%edi)
>        pushl     %ebp
>
> ChangeLog:
> 2008-04-04  Uros Bizjak  <ubizjak@gmail.com>
>        H.J. Lu  <hongjiu.lu@intel.com>
>
>    PR target/12329
>    * config/i386/i386.c (ix86_function_regparm): Limit the number of
>    register passing arguments to 2 for nested functions.
>
> 2008-04-04  Joey Ye  <joey.ye@intel.com>
>        H.J. Lu  <hongjiu.lu@intel.com>
>        Xuepeng Guo  <xuepeng.guo@intel.com>
>
>    * builtins.c (expand_builtin_setjmp_receiver): Replace
>    virtual_incoming_args_rtx with
>    current_function_internal_arg_pointer.
>    (expand_builtin_apply_args_1): Likewise.
>
>    * calls.c (expand_call): Don't calculate preferred stack
>    boundary according to incoming stack boundary. Replace
>    virtual_incoming_args_rtx with
>    current_function_internal_arg_pointer.
>
>    * cfgexpand.c (get_decl_align_unit): Estimate stack variable
>    alignment and store to stack_alignment_estimated and
>    stack_alignment_used.
>    (expand_one_var): Likewise.
>    (gate_stack_realign): Gate new passes pass_collect_stackrealign_info
>    and pass_handle_drap.
>    (collect_stackrealign_info): Execute new pass
>    pass_collect_stackrealign_info.
>    (pass_collect_stackrealign_info): Define new pass.
>    (handle_drap): Execute new pass pass_handle_drap.
>    (pass_handle_drap): Define new pass.
>
>    * defaults.h (MAX_VECTORIZE_STACK_ALIGNMENT): New.
>
>    * dojump.c (clear_pending_stack_adjust): Leave a FIXME comment for
>    the case where a pending stack adjustment is discarded when stack
>    realignment is needed.
>
>    * flags.h (frame_pointer_needed): Removed.
>    * final.c (frame_pointer_needed): Likewise.
>
>    * function.c (assign_stack_local_1): Estimate stack variable
>    alignment and store to stack_alignment_estimated.
>    (instantiate_new_reg): Instantiate virtual incoming args rtx to
>    vDRAP if stack realignment and DRAP is needed.
>    (assign_parms): Collect parameter/return type alignment and
>    contribute to stack_alignment_estimated.
>    (locate_and_pad_parm): Likewise.
>    (allocate_struct_function): Init stack_alignment_estimated and
>    stack_alignment_used.
>    (get_arg_pointer_save_area): Replace virtual_incoming_args_rtx
>    with current_function_internal_arg_pointer.
>
>    * function.h (function): Add drap_reg, stack_alignment_estimated,
>    need_frame_pointer, need_frame_pointer_set, stack_realign_needed,
>    stack_realign_really, need_drap, save_param_ptr_reg,
>    stack_realign_processed, stack_realign_finalized and
>    stack_realign_used.
>    (frame_pointer_needed): New.
>    (stack_realign_fp): Likewise.
>    (stack_realign_drap): Likewise.
>
>    * global.c (compute_regsets): Set frame_pointer_needed cannot_elim
>    wrt stack_realign_needed.
>
>    * stmt.c (expand_nl_goto_receiver): Replace
>    virtual_incoming_args_rtx with
>    current_function_internal_arg_pointer.
>
>    * passes.c (pass_collect_stackrealign_info): Insert this new pass
>    immediately before expand.
>    (pass_handle_drap): Insert this new pass immediately after expand.
>
>    * tree-inline.c (expand_call_inline): Estimate stack variable
>    alignment and store to stack_alignment_estimated.
>
>    * tree-pass.h (pass_handle_drap): New.
>    (pass_collect_stackrealign_info): Likewise.
>
>    * tree-vectorizer.c (vect_can_force_dr_alignment_p): Estimate
>    stack variable alignment and store to stack_alignment_estimated.
>
>    * reload1.c (set_label_offsets): Assert that frame pointer must be
>    eliminated to stack pointer in case stack realignment is estimated
>    to happen without DRAP.
>    (elimination_effects): Likewise.
>    (eliminate_regs_in_insn): Likewise.
>    (mark_not_eliminable): Likewise.
>    (update_eliminables): Frame pointer is needed in case of stack
>    realignment needed.
>    (init_elim_table): Don't set frame_pointer_needed here.
>
>    * dwarf2out.c (CUR_FDE): New.
>    (reg_save_with_expression): Likewise.
>    (dw_fde_struct): Add drap_regnum, stack_realignment,
>    is_stack_realign, is_drap and is_drap_reg_saved.
>    (add_cfi): If stack is realigned, call reg_save_with_expression
>    to represent the location of stored vars.
>    (dwarf2out_frame_debug_expr): Add rules 16-19 to handle stack
>    realign.
>    (output_cfa_loc): Handle DW_CFA_expression.
>    (based_loc_descr): Update assert for stack realign.
>
>    * config/i386/i386.c (ix86_force_align_arg_pointer_string): Break
>    long line.
>    (ix86_user_incoming_stack_boundary): New.
>    (ix86_default_incoming_stack_boundary): Likewise.
>    (ix86_incoming_stack_boundary): Likewise.
>    (find_drap_reg): Likewise.
>    (override_options): Override option value for new options.
>    (ix86_function_ok_for_sibcall): Sibcall is OK even when the stack
>    needs realigning.
>    (ix86_handle_cconv_attribute): Stack realign no longer impacts
>    number of regparm.
>    (ix86_function_regparm): Likewise.
>    (setup_incoming_varargs_64): Remove the logic to set
>    stack_alignment_needed here.
>    (ix86_va_start): Replace virtual_incoming_args_rtx with
>    current_function_internal_arg_pointer.
>    (ix86_save_reg): Replace force_align_arg_pointer with drap_reg.
>    (ix86_compute_frame_layout): Compute frame layout wrt stack
>    realignment.
>    (ix86_internal_arg_pointer): Estimate if stack realignment is
>    needed and returns appropriate arg pointer rtx accordingly.
>    (ix86_expand_prologue): Finally decide if stack realignment
>    is needed and generate prologue code accordingly.
>    (ix86_expand_epilogue): Generate epilogue code according to whether
>    stack realignment is really needed.
>    * config/i386/i386.c (ix86_select_alt_pic_regnum): Check
>    DRAP register.
>
>    * config/i386/i386.h (MAIN_STACK_BOUNDARY): New.
>    (ABI_STACK_BOUNDARY): Likewise.
>    (PREFERRED_STACK_BOUNDARY_DEFAULT): Likewise.
>    (STACK_REALIGN_DEFAULT): Likewise.
>    (INCOMING_STACK_BOUNDARY): Likewise.
>    (MAX_VECTORIZE_STACK_ALIGNMENT): Likewise.
>    (ix86_incoming_stack_boundary): Likewise.
>    (REAL_PIC_OFFSET_TABLE_REGNUM): Updated to use BX_REG.
>    (CAN_ELIMINATE): Redefine the macro to eliminate frame pointer to
>    stack pointer and arg pointer to hard frame pointer.
> <merge-stack-0404.patch>

* RE: [RFA]: Merge stack alignment branch
  2008-04-04 19:05 ` Jan Hubicka
  2008-04-04 21:05   ` H.J. Lu
@ 2008-04-08  1:57   ` Ye, Joey
  2008-04-11 12:32   ` Ye, Joey
  2 siblings, 0 replies; 26+ messages in thread
From: Ye, Joey @ 2008-04-08  1:57 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng

[-- Attachment #1: Type: text/plain, Size: 2292 bytes --]

Jan,

Here I break the patch into two. One for x86 port and one for generic.

There are cases where alignment information is lost after inlining. In
expand_used_vars, the vars in unexpanded_var_list are supposed to have been
expanded. But for some reason I don't understand, the alignment info of
variables in an inlined function is not collected in expand_one_var. Can
you point me to the right place?

Thanks - Joey

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Saturday, April 05, 2008 2:39 AM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

Hi,
I will look at the patch in detail later this weekend.  I think it would
make sense to break up the necessary changes in generic bits into separate
patches (in -x -cp format).  This will ease the reviewing process since I
for instance can't approve non-i386 specific bits of your patch.

Index: tree-inline.c
===================================================================
--- tree-inline.c	(.../trunk/gcc)	(revision 133813)
+++ tree-inline.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2841,8 +2841,26 @@
 	cfun->unexpanded_var_list = tree_cons (NULL_TREE, var,
 					       cfun->unexpanded_var_list);
       else
-	cfun->unexpanded_var_list = tree_cons (NULL_TREE, remap_decl (var, id),
-					       cfun->unexpanded_var_list);
+	{
+	  /* Update stack alignment requirement if needed.  */
+	  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+	    {
+	      unsigned int align;
+
+	      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+		align = TYPE_ALIGN (TREE_TYPE (var));
+	      else
+		align = DECL_ALIGN (var);
+	      if (align  > cfun->stack_alignment_estimated)
+		{
+		  gcc_assert(!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	  cfun->unexpanded_var_list
+	    = tree_cons (NULL_TREE, remap_decl (var, id),
+			 cfun->unexpanded_var_list);
+	}

I think it is a mistake to maintain info about stack alignment during
gimple transformations.  At expansion time we walk the list and we can
figure out the alignment then, after some of the variables have possibly
been optimized out.
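
A minimal sketch of the idea of computing the alignment in one walk at
expansion time, rather than tracking it during gimple transformations. This
is not GCC code: struct stack_var and estimate_stack_alignment are
hypothetical stand-ins for the real variable list and expander, used only
to show the shape of the computation.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of the unexpanded variable list.  */
struct stack_var
{
  unsigned int align_bits;   /* Required alignment, in bits.  */
  int optimized_out;         /* Nonzero if the variable no longer exists.  */
  struct stack_var *next;
};

/* Walk LIST once and return the largest alignment still required,
   starting from BASE (for example the ABI stack boundary).  Variables
   that were optimized out do not contribute.  */
unsigned int
estimate_stack_alignment (const struct stack_var *list, unsigned int base)
{
  unsigned int max_align = base;
  const struct stack_var *v;

  for (v = list; v != NULL; v = v->next)
    if (!v->optimized_out && v->align_bits > max_align)
      max_align = v->align_bits;
  return max_align;
}
```

With this shape, a 256-bit variable that was removed by earlier passes does
not force realignment; only the surviving variables are counted.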

The info can then also go into the RTL data structures I am trying to
introduce, instead of cfun.

Honza

[-- Attachment #2: stack-align-generic-0408.patch --]
[-- Type: application/octet-stream, Size: 42241 bytes --]

Index: flags.h
===================================================================
--- flags.h	(.../trunk/gcc)	(revision 133813)
+++ flags.h	(.../branches/stack/gcc)	(revision 133869)
@@ -223,12 +223,6 @@
 \f
 /* Other basic status info about current function.  */
 
-/* Nonzero means current function must be given a frame pointer.
-   Set in stmt.c if anything is allocated on the stack there.
-   Set in reload1.c if anything is allocated on the stack there.  */
-
-extern int frame_pointer_needed;
-
 /* Nonzero if subexpressions must be evaluated from left-to-right.  */
 extern int flag_evaluation_order;
 
Index: defaults.h
===================================================================
--- defaults.h	(.../trunk/gcc)	(revision 133813)
+++ defaults.h	(.../branches/stack/gcc)	(revision 133869)
@@ -940,4 +940,8 @@
 #define OUTGOING_REG_PARM_STACK_SPACE 0
 #endif
 
+#ifndef MAX_VECTORIZE_STACK_ALIGNMENT
+#define MAX_VECTORIZE_STACK_ALIGNMENT 0
+#endif
+
 #endif  /* ! GCC_DEFAULTS_H */
Index: tree-pass.h
===================================================================
--- tree-pass.h	(.../trunk/gcc)	(revision 133813)
+++ tree-pass.h	(.../branches/stack/gcc)	(revision 133869)
@@ -473,6 +473,8 @@
 extern struct gimple_opt_pass pass_apply_inline;
 extern struct gimple_opt_pass pass_all_early_optimizations;
 extern struct gimple_opt_pass pass_update_address_taken;
+extern struct gimple_opt_pass pass_collect_stackrealign_info;
+extern struct gimple_opt_pass pass_handle_drap;
 
 /* The root of the compilation pass tree, once constructed.  */
 extern struct opt_pass *all_passes, *all_ipa_passes, *all_lowering_passes;
Index: builtins.c
===================================================================
--- builtins.c	(.../trunk/gcc)	(revision 133813)
+++ builtins.c	(.../branches/stack/gcc)	(revision 133869)
@@ -740,7 +740,7 @@
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (current_function_internal_arg_pointer,
 			  copy_to_reg (get_arg_pointer_save_area ()));
 	}
     }
@@ -1345,7 +1345,7 @@
       }
 
   /* Save the arg pointer to the block.  */
-  tem = copy_to_reg (virtual_incoming_args_rtx);
+  tem = copy_to_reg (current_function_internal_arg_pointer);
 #ifdef STACK_GROWS_DOWNWARD
   /* We need the pointer as the caller actually passed them to us, not
      as we might have pretended they were passed.  Make sure it's a valid
Index: final.c
===================================================================
--- final.c	(.../trunk/gcc)	(revision 133813)
+++ final.c	(.../branches/stack/gcc)	(revision 133869)
@@ -178,12 +178,6 @@
 CC_STATUS cc_prev_status;
 #endif
 
-/* Nonzero means current function must be given a frame pointer.
-   Initialized in function.c to 0.  Set only in reload1.c as per
-   the needs of the function.  */
-
-int frame_pointer_needed;
-
 /* Number of unmatched NOTE_INSN_BLOCK_BEG notes we have seen.  */
 
 static int block_depth;
Index: dojump.c
===================================================================
--- dojump.c	(.../trunk/gcc)	(revision 133813)
+++ dojump.c	(.../branches/stack/gcc)	(revision 133869)
@@ -64,8 +64,11 @@
    so the adjustment won't get done.
 
    Note, if the current function calls alloca, then it must have a
-   frame pointer regardless of the value of flag_omit_frame_pointer.  */
+   frame pointer regardless of the value of flag_omit_frame_pointer.  
 
+   When stack realign is needed, we can't discard pending stack adjustment,
+   in which stack pointer must be restored in epilogue. */
+
 void
 clear_pending_stack_adjust (void)
 {
Index: global.c
===================================================================
--- global.c	(.../trunk/gcc)	(revision 133813)
+++ global.c	(.../branches/stack/gcc)	(revision 133869)
@@ -247,11 +247,21 @@
   static const struct {const int from, to; } eliminables[] = ELIMINABLE_REGS;
   size_t i;
 #endif
+
+  /* FIXME: If EXIT_IGNORE_STACK is set, we will not save and restore
+     sp for alloca.  So we can't eliminate the frame pointer in that
+     case.  At some point, we should improve this by emitting the
+     sp-adjusting insns for this case.  */
   int need_fp
     = (! flag_omit_frame_pointer
        || (current_function_calls_alloca && EXIT_IGNORE_STACK)
-       || FRAME_POINTER_REQUIRED);
+       || FRAME_POINTER_REQUIRED
+       || current_function_accesses_prior_frames
+       || cfun->stack_realign_needed);
 
+  frame_pointer_needed = need_fp;
+  cfun->need_frame_pointer_set = 1;
+
   max_regno = max_reg_num ();
   compact_blocks ();
 
@@ -271,7 +281,10 @@
     {
       bool cannot_elim
 	= (! CAN_ELIMINATE (eliminables[i].from, eliminables[i].to)
-	   || (eliminables[i].to == STACK_POINTER_REGNUM && need_fp));
+	   || (eliminables[i].to == STACK_POINTER_REGNUM
+	       && need_fp 
+	       && (! MAX_VECTORIZE_STACK_ALIGNMENT
+		   || ! stack_realign_fp)));
 
       if (!regs_asm_clobbered[eliminables[i].from])
 	{
Index: dwarf2out.c
===================================================================
--- dwarf2out.c	(.../trunk/gcc)	(revision 133813)
+++ dwarf2out.c	(.../branches/stack/gcc)	(revision 133869)
@@ -110,6 +110,9 @@
 #define DWARF2_FRAME_REG_OUT(REGNO, FOR_EH) (REGNO)
 #endif
 
+/* Define the current fde_table entry we should use. */
+#define CUR_FDE fde_table[fde_table_in_use - 1]
+
 /* Decide whether we want to emit frame unwind information for the current
    translation unit.  */
 
@@ -239,9 +242,18 @@
   bool dw_fde_switched_sections;
   dw_cfi_ref dw_fde_cfi;
   unsigned funcdef_number;
+  /* If it is drap, which register is employed. */
+  int drap_regnum;
+  HOST_WIDE_INT stack_realignment;
   unsigned all_throwers_are_sibcalls : 1;
   unsigned nothrow : 1;
   unsigned uses_eh_lsda : 1;
+  /* Whether we did stack realign in this call frame.*/
+  unsigned is_stack_realign : 1;
+  /* Whether stack realign is drap. */
+  unsigned is_drap : 1;
+  /* Whether we saved this drap register. */
+  unsigned is_drap_reg_saved : 1;
 }
 dw_fde_node;
 
@@ -381,6 +393,7 @@
 static struct dw_loc_descr_struct *build_cfa_loc
   (dw_cfa_location *, HOST_WIDE_INT);
 static void def_cfa_1 (const char *, dw_cfa_location *);
+static void reg_save_with_expression (dw_cfi_ref);
 
 /* How to start an assembler comment.  */
 #ifndef ASM_COMMENT_START
@@ -618,6 +631,13 @@
   for (p = list_head; (*p) != NULL; p = &(*p)->dw_cfi_next)
     ;
 
+  /* If stack is realigned, accessing the stored register via CFA+offset will
+     be invalid. Here we will use a series of expressions in dwarf2 to simulate
+     the stack realign and represent the location of the stored register. */
+  if (fde_table_in_use && (CUR_FDE.is_stack_realign || CUR_FDE.is_drap) 
+      && cfi->dw_cfi_opc == DW_CFA_offset)
+    reg_save_with_expression (cfi);
+
   *p = cfi;
 }
 
@@ -1435,6 +1455,10 @@
   Rules 10-14: Save a register to the stack.  Define offset as the
 	       difference of the original location and cfa_store's
 	       location (or cfa_temp's location if cfa_temp is used).
+  
+  Rules 16-19: If AND operation happens on sp in prologue, we assume stack is
+               realigned. We will use a group of DW_OP_?? expressions to represent
+               the location of the stored register instead of CFA+offset.
 
   The Rules
 
@@ -1529,8 +1553,33 @@
 
   Rule 15:
   (set <reg> {unspec, unspec_volatile})
-  effects: target-dependent  */
+  effects: target-dependent  
+  
+  Rule 16:
+  (set sp (and: sp <const_int>))
+  effects: CUR_FDE.is_stack_realign = 1
+           cfa_store.offset = 0
 
+           if cfa_store.offset >= UNITS_PER_WORD
+             effects: CUR_FDE.is_drap_reg_saved = 1
+
+  Rule 17:
+  (set (mem ({pre_inc, pre_dec} sp)) (mem (plus (cfa.reg) (const_int))))
+  effects: cfa_store.offset += -/+ mode_size(mem)
+  
+  Rule 18:
+  (set (mem({pre_inc, pre_dec} sp)) fp)
+  constraints: CUR_FDE.is_stack_realign == 1
+  effects: CUR_FDE.is_stack_realign = 0
+           CUR_FDE.is_drap = 1
+           CUR_FDE.drap_regnum = cfa.reg
+
+  Rule 19:
+  (set fp sp)
+  constraints: CUR_FDE.is_drap == 1
+  effects: cfa.reg = fp
+           cfa.offset = cfa_store.offset */
+
 static void
 dwarf2out_frame_debug_expr (rtx expr, const char *label)
 {
@@ -1607,7 +1656,20 @@
 	      cfa_temp.reg = cfa.reg;
 	      cfa_temp.offset = cfa.offset;
 	    }
-	  else
+            /* Rule 19 */
+            /* Each time FP is set to SP while the stack is realigned, we
+               assume the realignment uses drap and the drap register is
+               the current cfa's register. We update cfa's register to FP. */
+	  else if (fde_table_in_use && CUR_FDE.is_drap 
+                   && REGNO (src) == STACK_POINTER_REGNUM 
+                   && REGNO (dest) == HARD_FRAME_POINTER_REGNUM)
+            {
+              cfa.reg = REGNO (dest);
+              cfa.offset = cfa_store.offset;
+              cfa_temp.reg = cfa.reg;
+              cfa_temp.offset = cfa.offset;
+            }
+          else
 	    {
 	      /* Saving a register in a register.  */
 	      gcc_assert (!fixed_regs [REGNO (dest)]
@@ -1747,6 +1809,22 @@
 	  targetm.dwarf_handle_frame_unspec (label, expr, XINT (src, 1));
 	  return;
 
+	  /* Rule 16 */
+	case AND:
+          /* If this AND operation happens on stack pointer in prologue, we 
+             assume the stack is realigned and we extract the alignment. */
+          if (XEXP (src, 0) == stack_pointer_rtx && fde_table_in_use)
+            {
+              CUR_FDE.is_stack_realign = 1;
+              CUR_FDE.stack_realignment = INTVAL (XEXP (src, 1));
+              /* If we didn't push anything to stack before stack is realigned,
+                  we assume the drap register isn't saved. */
+              if (cfa_store.offset > UNITS_PER_WORD)
+                CUR_FDE.is_drap_reg_saved = 1;
+              cfa_store.offset = 0;
+            }
+          return;
+
 	default:
 	  gcc_unreachable ();
 	}
@@ -1755,7 +1833,6 @@
       break;
 
     case MEM:
-      gcc_assert (REG_P (src));
 
       /* Saving a register to the stack.  Make sure dest is relative to the
 	 CFA register.  */
@@ -1788,6 +1865,17 @@
 
 	  gcc_assert (REGNO (XEXP (XEXP (dest, 0), 0)) == STACK_POINTER_REGNUM
 		      && cfa_store.reg == STACK_POINTER_REGNUM);
+          
+          /* Rule 18 */
+          /* If we push FP after stack is realigned, we assume this realignment
+             is drap and record the drap register. */
+          if (fde_table_in_use && CUR_FDE.is_stack_realign
+              && REGNO (src) == HARD_FRAME_POINTER_REGNUM)
+            {
+              CUR_FDE.is_stack_realign = 0;
+              CUR_FDE.is_drap = 1;
+              CUR_FDE.drap_regnum = DWARF_FRAME_REGNUM (cfa.reg);
+            }            
 
 	  cfa_store.offset += offset;
 	  if (cfa.reg == STACK_POINTER_REGNUM)
@@ -1882,6 +1970,12 @@
 	      break;
 	    }
 	}
+        /* Rule 17 */
+        /* If the source operand of this MEM operation is not a register, 
+           basically the source is return address. Here we just care how 
+           much stack grew and ignore to save it. */ 
+      if (!REG_P (src))
+        break;
 
       def_cfa_1 (label, &cfa);
       {
@@ -3548,6 +3642,9 @@
   dw_loc_descr_ref loc;
   unsigned long size;
 
+  if (cfi->dw_cfi_opc == DW_CFA_expression)
+    dw2_asm_output_data (1, cfi->dw_cfi_oprnd2.dw_cfi_reg_num, NULL);
+
   /* Output the size of the block.  */
   loc = cfi->dw_cfi_oprnd1.dw_cfi_loc;
   size = size_of_locs (loc);
@@ -9024,8 +9121,9 @@
 	      offset += INTVAL (XEXP (elim, 1));
 	      elim = XEXP (elim, 0);
 	    }
-	  gcc_assert (elim == (frame_pointer_needed ? hard_frame_pointer_rtx
-		      : stack_pointer_rtx));
+	  gcc_assert (stack_realign_fp
+	              || elim == (frame_pointer_needed ? hard_frame_pointer_rtx
+		                                       : stack_pointer_rtx));
 	  offset += frame_pointer_fb_offset;
 
 	  return new_loc_descr (DW_OP_fbreg, offset, 0);
@@ -11155,9 +11253,10 @@
       offset += INTVAL (XEXP (elim, 1));
       elim = XEXP (elim, 0);
     }
-  gcc_assert (elim == (frame_pointer_needed ? hard_frame_pointer_rtx
+
+  gcc_assert (stack_realign_fp 
+              || elim == (frame_pointer_needed ? hard_frame_pointer_rtx
 		       : stack_pointer_rtx));
-
   frame_pointer_fb_offset = -offset;
 }
 
@@ -15438,6 +15537,63 @@
   if (debug_str_hash)
     htab_traverse (debug_str_hash, output_indirect_string, NULL);
 }
+
+/* In this function we use a series of DW_OP_?? expression which simulates
+   how stack is realigned to represent the location of the stored register.*/
+static void
+reg_save_with_expression (dw_cfi_ref cfi)
+{
+  struct dw_loc_descr_struct *head, *tmp;
+  HOST_WIDE_INT alignment = CUR_FDE.stack_realignment;
+  HOST_WIDE_INT offset = cfi->dw_cfi_oprnd2.dw_cfi_offset * UNITS_PER_WORD;
+  int reg = cfi->dw_cfi_oprnd1.dw_cfi_reg_num;
+  unsigned int dwarf_sp = (unsigned)DWARF_FRAME_REGNUM (STACK_POINTER_REGNUM);
+  
+  if (CUR_FDE.is_stack_realign)
+    {
+      head = tmp = new_loc_descr (DW_OP_const4s, 2 * UNITS_PER_WORD, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, alignment, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_and, 0, 0);
+
+      /* If stack grows upward, the offset will be a negative. */
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);  
+   
+      cfi->dw_cfi_opc = DW_CFA_expression;
+      cfi->dw_cfi_oprnd2.dw_cfi_reg_num = reg; 
+      cfi->dw_cfi_oprnd1.dw_cfi_loc = head;
+    }
+
+  /* We need restore drap register through dereference. If we needn't to restore
+     the drap register we just ignore. */
+  if (CUR_FDE.is_drap && reg == CUR_FDE.drap_regnum)
+    {
+       
+      dw_cfi_ref cfi2 = new_cfi();
+
+      cfi->dw_cfi_opc = DW_CFA_expression;
+      head = tmp = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      if (CUR_FDE.is_drap_reg_saved)
+        {
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_deref, 0, 0);
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, 
+                                                  2 * UNITS_PER_WORD, 0);
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+        }
+      cfi->dw_cfi_oprnd2.dw_cfi_reg_num = reg;
+      cfi->dw_cfi_oprnd1.dw_cfi_loc = head;
+
+      /* We also need restore the sp. */
+      head = tmp = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      cfi2->dw_cfi_opc = DW_CFA_expression;
+      cfi2->dw_cfi_oprnd2.dw_cfi_reg_num = dwarf_sp;
+      cfi2->dw_cfi_oprnd1.dw_cfi_loc = head;
+      cfi->dw_cfi_next = cfi2;
+    }  
+}
 #else
 
 /* This should never be used, but its address is needed for comparisons.  */
Index: function.c
===================================================================
--- function.c	(.../trunk/gcc)	(revision 133813)
+++ function.c	(.../branches/stack/gcc)	(revision 133869)
@@ -376,17 +376,19 @@
 {
   rtx x, addr;
   int bigend_correction = 0;
-  unsigned int alignment;
+  unsigned int alignment, mode_alignment, alignment_in_bits;
   int frame_off, frame_alignment, frame_phase;
 
+  if (mode == BLKmode)
+    mode_alignment = BIGGEST_ALIGNMENT;
+  else
+    mode_alignment = GET_MODE_ALIGNMENT (mode);
+
   if (align == 0)
     {
       tree type;
 
-      if (mode == BLKmode)
-	alignment = BIGGEST_ALIGNMENT;
-      else
-	alignment = GET_MODE_ALIGNMENT (mode);
+      alignment = mode_alignment;
 
       /* Allow the target to (possibly) increase the alignment of this
 	 stack slot.  */
@@ -406,16 +408,46 @@
   else
     alignment = align / BITS_PER_UNIT;
 
+  alignment_in_bits = alignment * BITS_PER_UNIT;
+
   if (FRAME_GROWS_DOWNWARD)
     frame_offset -= size;
 
-  /* Ignore alignment we can't do with expected alignment of the boundary.  */
-  if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
-    alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < alignment_in_bits)
+	{
+          if (!cfun->stack_realign_processed)
+            cfun->stack_alignment_estimated = alignment_in_bits;
+          else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized);
+	      if (!cfun->stack_realign_needed)
+		{
+		  /* It is OK to reduce the alignment as long as the
+		     requested size is 0 or the estimated stack
+		     alignment >= mode alignment.  */
+		  gcc_assert (size == 0
+			      || (cfun->stack_alignment_estimated
+				  >= mode_alignment));
+		  alignment_in_bits = cfun->stack_alignment_estimated;
+		  alignment = alignment_in_bits / BITS_PER_UNIT;
+		}
+	    }
+	}
+    }
+  else
+    {
+      /* Ignore alignment we can't do with expected alignment of the
+	 boundary.  */
+      if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
+	alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+    }
+  if (cfun->stack_alignment_needed < alignment_in_bits)
+    cfun->stack_alignment_needed = alignment_in_bits;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
-  if (cfun->stack_alignment_needed < alignment * BITS_PER_UNIT)
-    cfun->stack_alignment_needed = alignment * BITS_PER_UNIT;
-
   /* Calculate how many bytes the start of local variables is off from
      stack alignment.  */
   frame_alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
@@ -1203,7 +1235,17 @@
   HOST_WIDE_INT offset;
 
   if (x == virtual_incoming_args_rtx)
-    new = arg_pointer_rtx, offset = in_arg_offset;
+    {
+      /* Replace virtual_incoming_args_rtx with internal arg pointer here */
+      if (current_function_internal_arg_pointer != virtual_incoming_args_rtx)
+        {
+          gcc_assert (stack_realign_drap);
+          new = current_function_internal_arg_pointer;
+          offset = 0;
+        }
+      else
+        new = arg_pointer_rtx, offset = in_arg_offset;
+    }
   else if (x == virtual_stack_vars_rtx)
     new = frame_pointer_rtx, offset = var_offset;
   else if (x == virtual_stack_dynamic_rtx)
@@ -3002,6 +3044,20 @@
 	  continue;
 	}
 
+      /* Estimate stack alignment from parameter alignment */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT)
+        {
+          unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
+						      data.passed_type);
+	  if (TYPE_ALIGN (data.nominal_type) > align)
+	    align = TYPE_ALIGN (data.passed_type);
+	  if (cfun->stack_alignment_estimated < align)
+	    {
+	      gcc_assert (!cfun->stack_realign_processed);
+	      cfun->stack_alignment_estimated = align;
+	    }
+	}
+	
       if (current_function_stdarg && !TREE_CHAIN (parm))
 	assign_parms_setup_varargs (&all, &data, false);
 
@@ -3039,6 +3095,28 @@
      now that all parameters have been copied out of hard registers.  */
   emit_insn (all.first_conversion_insn);
 
+  /* Estimate reload stack alignment from scalar return mode.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (DECL_RESULT (fndecl))
+	{
+	  tree type = TREE_TYPE (DECL_RESULT (fndecl));
+	  enum machine_mode mode = TYPE_MODE (type);
+
+	  if (mode != BLKmode
+	      && mode != VOIDmode
+	      && !AGGREGATE_TYPE_P (type))
+	    {
+	      unsigned int align = GET_MODE_ALIGNMENT (mode);
+	      if (cfun->stack_alignment_estimated < align)
+		{
+		  gcc_assert (!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	} 
+    }
+
   /* If we are receiving a struct value address as the first argument, set up
      the RTL for the function result. As this might require code to convert
      the transmitted address to Pmode, we do this here to ensure that possible
@@ -3316,12 +3394,32 @@
   locate->where_pad = where_pad;
   locate->boundary = boundary;
 
-  /* Remember if the outgoing parameter requires extra alignment on the
-     calling function side.  */
-  if (boundary > PREFERRED_STACK_BOUNDARY)
-    boundary = PREFERRED_STACK_BOUNDARY;
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      /* stack_alignment_estimated can't change after stack has been
+	 realigned.  */
+      if (cfun->stack_alignment_estimated < boundary)
+        {
+          if (!cfun->stack_realign_processed)
+	    cfun->stack_alignment_estimated = boundary;
+	  else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized
+			  && cfun->stack_realign_needed);
+	    }
+	}
+    }
+  else
+    {
+      /* Remember if the outgoing parameter requires extra alignment on
+         the calling function side.  */
+      if (boundary > PREFERRED_STACK_BOUNDARY)
+        boundary = PREFERRED_STACK_BOUNDARY;
+    }
   if (cfun->stack_alignment_needed < boundary)
     cfun->stack_alignment_needed = boundary;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
 #ifdef ARGS_GROW_DOWNWARD
   locate->slot_offset.constant = -initial_offset_ptr->constant;
@@ -3877,6 +3975,8 @@
   cfun = ggc_alloc_cleared (sizeof (struct function));
 
   cfun->stack_alignment_needed = STACK_BOUNDARY;
+  cfun->stack_alignment_used = STACK_BOUNDARY;
+  cfun->stack_alignment_estimated = STACK_BOUNDARY;
   cfun->preferred_stack_boundary = STACK_BOUNDARY;
 
   current_function_funcdef_no = get_next_funcdef_no ();
@@ -4655,7 +4755,8 @@
 	 generated stack slot may not be a valid memory address, so we
 	 have to check it and fix it if necessary.  */
       start_sequence ();
-      emit_move_insn (validize_mem (ret), virtual_incoming_args_rtx);
+      emit_move_insn (validize_mem (ret),
+                      current_function_internal_arg_pointer);
       seq = get_insns ();
       end_sequence ();
 
Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(.../trunk/gcc)	(revision 133813)
+++ tree-vectorizer.c	(.../branches/stack/gcc)	(revision 133869)
@@ -1786,9 +1786,19 @@
 
   if (TREE_STATIC (decl))
     return (alignment <= MAX_OFILE_ALIGNMENT);
+  else if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      if (alignment <= MAX_VECTORIZE_STACK_ALIGNMENT)
+	{
+	  if (cfun->stack_alignment_estimated < alignment)
+	    cfun->stack_alignment_estimated = alignment;
+	  return true;
+	}
+      else
+	return false;
+    }
   else
-    /* This used to be PREFERRED_STACK_BOUNDARY, however, that is not 100%
-       correct until someone implements forced stack alignment.  */
     return (alignment <= STACK_BOUNDARY); 
 }
 
Index: function.h
===================================================================
--- function.h	(.../trunk/gcc)	(revision 133813)
+++ function.h	(.../branches/stack/gcc)	(revision 133869)
@@ -302,6 +302,9 @@
   /* The arg pointer hard register, or the pseudo into which it was copied.  */
   rtx internal_arg_pointer;
 
+  /* Dynamic Realign Argument Pointer used for realigning stack.  */
+  rtx drap_reg;
+
   /* Opaque pointer used by get_hard_reg_initial_val and
      has_hard_reg_initial_val (see integrate.[hc]).  */
   struct initial_value_struct *hard_reg_initial_vals;
@@ -323,9 +326,16 @@
   /* tm.h can use this to store whatever it likes.  */
   struct machine_function * GTY ((maybe_undef)) machine;
 
-  /* The largest alignment of slot allocated on the stack.  */
+  /* The largest alignment needed on the stack, including requirement
+     for outgoing stack alignment.  */
   unsigned int stack_alignment_needed;
 
+  /* The largest alignment of slot allocated on the stack.  */
+  unsigned int stack_alignment_used;
+
+  /* The estimated stack alignment.  */
+  unsigned int stack_alignment_estimated;
+
   /* Preferred alignment of the end of stack frame.  */
   unsigned int preferred_stack_boundary;
 
@@ -494,6 +504,38 @@
 
   /* Nonzero if pass_tree_profile was run on this function.  */
   unsigned int after_tree_profile : 1;
+
+  /* Nonzero if current function must be given a frame pointer.
+     Set in global.c if anything is allocated on the stack there.  */
+  unsigned int need_frame_pointer : 1;
+
+  /* Nonzero if need_frame_pointer has been set.  */
+  unsigned int need_frame_pointer_set : 1;
+
+  /* Nonzero if, by estimation, current function stack needs realignment. */
+  unsigned int stack_realign_needed : 1;
+
+  /* Nonzero if function stack realignment is really needed.  This flag
+     is set after reload if the criteria for stack realignment still
+     hold by then.  Its value may contradict stack_realign_needed,
+     since the latter was set before reload.  This flag is more accurate
+     than stack_realign_needed, so prologue/epilogue should be generated
+     according to both flags.  */
+  unsigned int stack_realign_really : 1;
+
+  /* Nonzero if function being compiled needs dynamic realigned
+     argument pointer (drap) if stack needs realigning.  */
+  unsigned int need_drap : 1;
+
+  /* Nonzero if current function needs to save/restore parameter
+     pointer register in prolog, because it is a callee save reg.  */
+  unsigned int save_param_ptr_reg : 1;
+
+  /* Nonzero if function stack realignment estimation is done.  */
+  unsigned int stack_realign_processed : 1;
+
+  /* Nonzero if function stack realignment has been finalized.  */
+  unsigned int stack_realign_finalized : 1;
 };
 
 /* If va_list_[gf]pr_size is set to this, it means we don't know how
@@ -556,6 +598,9 @@
 #define dom_computed (cfun->cfg->x_dom_computed)
 #define n_bbs_in_dom_tree (cfun->cfg->x_n_bbs_in_dom_tree)
 #define VALUE_HISTOGRAMS(fun) (fun)->value_histograms
+#define frame_pointer_needed (cfun->need_frame_pointer)
+#define stack_realign_fp (cfun->stack_realign_needed && !cfun->need_drap)
+#define stack_realign_drap (cfun->stack_realign_needed && cfun->need_drap)
 
 /* Given a function decl for a containing function,
    return the `struct function' for it.  */
Index: calls.c
===================================================================
--- calls.c	(.../trunk/gcc)	(revision 133813)
+++ calls.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2099,7 +2099,10 @@
 
   /* Figure out the amount to which the stack should be aligned.  */
   preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
-  if (fndecl)
+
+  /* With automatic stack realignment, we align stack in prologue when
+     needed and there is no need to update preferred_stack_boundary.  */
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT && fndecl)
     {
       struct cgraph_rtl_info *i = cgraph_rtl_info (fndecl);
       if (i && i->preferred_incoming_stack_boundary)
@@ -2401,7 +2404,7 @@
 	 incoming argument block.  */
       if (pass == 0)
 	{
-	  argblock = virtual_incoming_args_rtx;
+	  argblock = current_function_internal_arg_pointer;
 	  argblock
 #ifdef STACK_GROWS_DOWNWARD
 	    = plus_constant (argblock, current_function_pretend_args_size);
Index: cfgexpand.c
===================================================================
--- cfgexpand.c	(.../trunk/gcc)	(revision 133813)
+++ cfgexpand.c	(.../branches/stack/gcc)	(revision 133869)
@@ -161,10 +161,27 @@
 
   align = DECL_ALIGN (decl);
   align = LOCAL_ALIGNMENT (TREE_TYPE (decl), align);
-  if (align > PREFERRED_STACK_BOUNDARY)
-    align = PREFERRED_STACK_BOUNDARY;
+
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < align)
+	{
+	  gcc_assert (!cfun->stack_realign_processed);
+	  cfun->stack_alignment_estimated = align;
+	}
+    }
+  else
+    {
+      if (align > PREFERRED_STACK_BOUNDARY)
+	align = PREFERRED_STACK_BOUNDARY;
+    }
+
+  /* stack_alignment_needed > PREFERRED_STACK_BOUNDARY is permitted.
+     So here we only make sure stack_alignment_needed >= align.  */
   if (cfun->stack_alignment_needed < align)
     cfun->stack_alignment_needed = align;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
   return align / BITS_PER_UNIT;
 }
@@ -748,6 +765,29 @@
 static HOST_WIDE_INT
 expand_one_var (tree var, bool toplevel, bool really_expand)
 {
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && TREE_CODE (var) == VAR_DECL)
+    {
+      unsigned int align;
+
+      /* Because we don't know if VAR will be in register or on stack,
+	 we conservatively assume it will be on stack even if VAR is
+	 eventually put into register after RA pass.  For non-automatic
+	 variables, which won't be on stack, we collect alignment of
+	 type and ignore user specified alignment.  */
+      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+	align = TYPE_ALIGN (TREE_TYPE (var));
+      else
+	align = DECL_ALIGN (var);
+
+      if (cfun->stack_alignment_estimated < align)
+	{
+	  /* stack_alignment_estimated shouldn't change after the stack
+	     realign decision has been made.  */
+	  gcc_assert (!cfun->stack_realign_processed);
+	  cfun->stack_alignment_estimated = align;
+	}
+    }
+
   if (TREE_CODE (var) != VAR_DECL)
     {
       if (really_expand)
@@ -2005,3 +2045,135 @@
   TODO_dump_func,                       /* todo_flags_finish */
  }
 };
+
+static bool
+gate_stack_realign (void)
+{
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT)
+    return false;
+  else
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      return true;
+    }
+}
+
+/* Collect accurate info for stack realign.  */
+
+static unsigned int
+collect_stackrealign_info (void)
+{
+  basic_block bb;
+  block_stmt_iterator bsi;
+
+  if (cfun->has_nonlocal_label)
+    cfun->need_drap = true;
+
+  FOR_EACH_BB (bb)
+    for (bsi = bsi_start (bb); ! bsi_end_p (bsi); bsi_next (&bsi))
+      {
+	tree stmt = bsi_stmt (bsi);
+	tree call = get_call_expr_in (stmt);
+	tree decl, type;
+	int flags;
+
+	if (!call)
+	  continue;
+
+	flags = call_expr_flags (call);
+	if (flags & ECF_MAY_BE_ALLOCA)
+	  cfun->need_drap = true;
+
+	decl = get_callee_fndecl (call);
+	if (decl && DECL_BUILT_IN_CLASS (decl) == BUILT_IN_NORMAL)
+	  switch (DECL_FUNCTION_CODE (decl))
+	    {
+	    case BUILT_IN_NONLOCAL_GOTO:
+	    case BUILT_IN_APPLY:
+	    case BUILT_IN_LONGJMP:
+	      cfun->need_drap = true;
+	      break;
+	    default:
+	      break;
+	    }
+
+	type = TREE_TYPE (call);
+	if (!type || VOID_TYPE_P (type))
+          continue;
+
+	/* FIXME: Do we need DRAP when the result is returned on
+	   stack?  */
+	if (aggregate_value_p (type, decl))
+	  cfun->need_drap = true;
+      }  
+
+  return 0;
+}
+
+struct gimple_opt_pass pass_collect_stackrealign_info =
+{
+ {
+  GIMPLE_PASS,
+  "stack_realign_info",                 /* name */
+  gate_stack_realign,                   /* gate */
+  collect_stackrealign_info,            /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  0,                                    /* tv_id */
+  0,                                    /* properties_required */
+  0,                                    /* properties_provided */
+  0,                                    /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  0,                                    /* todo_flags_finish */
+ }
+};
+
+/* New pass handle_drap.
+   This pass first checks whether DRAP is needed.  If so, it sets
+   current_function_internal_arg_pointer to the virtual register
+   returned by targetm.calls.internal_arg_pointer.  The later lregs
+   pass will replace virtual_incoming_args_rtx with that virtual reg.  */
+static unsigned int
+handle_drap (void)
+{
+  /* Call targetm.calls.internal_arg_pointer again.  This time it will
+     return a virtual reg if DRAP is needed.  */
+  rtx internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
+
+  /* Check that internal_arg_pointer is still set to the expected rtx.  */
+  gcc_assert (current_function_internal_arg_pointer
+	      == virtual_incoming_args_rtx);
+
+  /* Do nothing unless the virtual incoming arg rtx must be replaced.  */
+  if (current_function_internal_arg_pointer != internal_arg_rtx)
+    {
+      current_function_internal_arg_pointer = internal_arg_rtx;
+
+      /* Call fixup_tail_calls to clean up REG_EQUIV notes
+         if DRAP is needed.  */
+      fixup_tail_calls ();
+    }
+
+  return 0;
+}
+
+struct gimple_opt_pass pass_handle_drap =
+{
+ {
+  GIMPLE_PASS,
+  "handle_drap",			/* name */
+  gate_stack_realign,                   /* gate */
+  handle_drap,			        /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  0,				        /* tv_id */
+  /* ??? If TER is enabled, we actually receive GENERIC.  */
+  0,                                    /* properties_required */
+  PROP_rtl,                             /* properties_provided */
+  0,				        /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_dump_func,                       /* todo_flags_finish */
+ }
+};
Index: tree-inline.c
===================================================================
--- tree-inline.c	(.../trunk/gcc)	(revision 133813)
+++ tree-inline.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2841,8 +2841,26 @@
 	cfun->unexpanded_var_list = tree_cons (NULL_TREE, var,
 					       cfun->unexpanded_var_list);
       else
-	cfun->unexpanded_var_list = tree_cons (NULL_TREE, remap_decl (var, id),
-					       cfun->unexpanded_var_list);
+	{
+	  /* Update stack alignment requirement if needed.  */
+	  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+	    {
+	      unsigned int align;
+
+	      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+		align = TYPE_ALIGN (TREE_TYPE (var));
+	      else
+		align = DECL_ALIGN (var);
+	      if (align > cfun->stack_alignment_estimated)
+		{
+		  gcc_assert (!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	  cfun->unexpanded_var_list
+	    = tree_cons (NULL_TREE, remap_decl (var, id),
+			 cfun->unexpanded_var_list);
+	}
     }
 
   /* Clean up.  */
Index: passes.c
===================================================================
--- passes.c	(.../trunk/gcc)	(revision 133813)
+++ passes.c	(.../branches/stack/gcc)	(revision 133869)
@@ -686,7 +686,9 @@
   NEXT_PASS (pass_free_datastructures);
   NEXT_PASS (pass_mudflap_2);
   NEXT_PASS (pass_free_cfg_annotations);
+  NEXT_PASS (pass_collect_stackrealign_info);
   NEXT_PASS (pass_expand);
+  NEXT_PASS (pass_handle_drap); 
   NEXT_PASS (pass_rest_of_compilation);
     {
       struct opt_pass **p = &pass_rest_of_compilation.pass.sub;
Index: stmt.c
===================================================================
--- stmt.c	(.../trunk/gcc)	(revision 133813)
+++ stmt.c	(.../branches/stack/gcc)	(revision 133869)
@@ -1819,7 +1819,7 @@
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (current_function_internal_arg_pointer,
 			  copy_to_reg (get_arg_pointer_save_area ()));
 	}
     }
Index: reload1.c
===================================================================
--- reload1.c	(.../trunk/gcc)	(revision 133813)
+++ reload1.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2279,7 +2279,13 @@
 	  if (offsets_at[CODE_LABEL_NUMBER (x) - first_label_num][i]
 	      != (initial_p ? reg_eliminate[i].initial_offset
 		  : reg_eliminate[i].offset))
-	    reg_eliminate[i].can_eliminate = 0;
+            {
+	      /* Must not disable reg eliminate because stack realignment
+	         must eliminate frame pointer to stack pointer.  */
+	      gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			  || ! stack_realign_fp);
+	      reg_eliminate[i].can_eliminate = 0;
+            }
 
       return;
 
@@ -2358,7 +2364,13 @@
 	 offset because we are doing a jump to a variable address.  */
       for (p = reg_eliminate; p < &reg_eliminate[NUM_ELIMINABLE_REGS]; p++)
 	if (p->offset != p->initial_offset)
-	  p->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    p->can_eliminate = 0;
+	  }
       break;
 
     default:
@@ -2849,7 +2861,13 @@
       /* If we modify the source of an elimination rule, disable it.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->from_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       /* If we modify the target of an elimination rule by adding a constant,
 	 update its offset.  If we modify the target in any other way, we'll
@@ -2875,7 +2893,14 @@
 		    && CONST_INT_P (XEXP (XEXP (x, 1), 1)))
 		  ep->offset -= INTVAL (XEXP (XEXP (x, 1), 1));
 		else
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	      }
 	  }
 
@@ -2918,7 +2943,13 @@
 	 know how this register is used.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->from_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       elimination_effects (XEXP (x, 0), mem_mode);
       return;
@@ -2929,7 +2960,13 @@
 	 be performed.  Otherwise, we need not be concerned about it.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->to_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       elimination_effects (XEXP (x, 0), mem_mode);
       return;
@@ -2963,7 +3000,14 @@
 		    && GET_CODE (XEXP (src, 1)) == CONST_INT)
 		  ep->offset -= INTVAL (XEXP (src, 1));
 		else
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	      }
 	}
 
@@ -3292,7 +3336,14 @@
 	      for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
 		   ep++)
 		if (ep->from_rtx == orig_operand[i])
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	    }
 
 	  /* Companion to the above plus substitution, we can allow
@@ -3422,7 +3473,13 @@
   for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
     {
       if (ep->previous_offset != ep->offset && ep->ref_outside_mem)
-	ep->can_eliminate = 0;
+	{
+	  /* Must not disable reg eliminate because stack realignment
+	     must eliminate frame pointer to stack pointer.  */
+	  gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+		      || ! stack_realign_fp);
+	  ep->can_eliminate = 0;
+	}
 
       ep->ref_outside_mem = 0;
 
@@ -3498,6 +3555,11 @@
 	    || XEXP (SET_SRC (x), 0) != dest
 	    || GET_CODE (XEXP (SET_SRC (x), 1)) != CONST_INT))
       {
+	/* Must not disable reg eliminate because stack realignment
+	   must eliminate frame pointer to stack pointer.  */
+	gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+		    || ! stack_realign_fp);
+
 	reg_eliminate[i].can_eliminate_previous
 	  = reg_eliminate[i].can_eliminate = 0;
 	num_eliminable--;
@@ -3668,8 +3730,11 @@
   frame_pointer_needed = 1;
   for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
     {
-      if (ep->can_eliminate && ep->from == FRAME_POINTER_REGNUM
-	  && ep->to != HARD_FRAME_POINTER_REGNUM)
+      if (ep->can_eliminate
+	  && ep->from == FRAME_POINTER_REGNUM
+	  && ep->to != HARD_FRAME_POINTER_REGNUM
+	  && (! MAX_VECTORIZE_STACK_ALIGNMENT
+	      || ! cfun->stack_realign_needed))
 	frame_pointer_needed = 0;
 
       if (! ep->can_eliminate && ep->can_eliminate_previous)
@@ -3713,19 +3778,9 @@
   if (!reg_eliminate)
     reg_eliminate = xcalloc (sizeof (struct elim_table), NUM_ELIMINABLE_REGS);
 
-  /* Does this function require a frame pointer?  */
+  /* frame_pointer_needed should have been set.  */
+  gcc_assert (cfun->need_frame_pointer_set);
 
-  frame_pointer_needed = (! flag_omit_frame_pointer
-			  /* ?? If EXIT_IGNORE_STACK is set, we will not save
-			     and restore sp for alloca.  So we can't eliminate
-			     the frame pointer in that case.  At some point,
-			     we should improve this by emitting the
-			     sp-adjusting insns for this case.  */
-			  || (current_function_calls_alloca
-			      && EXIT_IGNORE_STACK)
-			  || current_function_accesses_prior_frames
-			  || FRAME_POINTER_REQUIRED);
-
   num_eliminable = 0;
 
 #ifdef ELIMINABLE_REGS
@@ -3736,7 +3791,10 @@
       ep->to = ep1->to;
       ep->can_eliminate = ep->can_eliminate_previous
 	= (CAN_ELIMINATE (ep->from, ep->to)
-	   && ! (ep->to == STACK_POINTER_REGNUM && frame_pointer_needed));
+	   && ! (ep->to == STACK_POINTER_REGNUM
+		 && frame_pointer_needed 
+		 && (! MAX_VECTORIZE_STACK_ALIGNMENT
+		     || ! stack_realign_fp)));
     }
 #else
   reg_eliminate[0].from = reg_eliminate_1[0].from;
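
[Editorial sketch, not part of the patch.]  The generic changes above keep
cfun->stack_alignment_estimated as a running maximum that must not grow
once stack_realign_processed is set, and derive two mutually exclusive
predicates (stack_realign_fp and stack_realign_drap) from
stack_realign_needed and need_drap.  The standalone C below mimics that
bookkeeping under stated assumptions: struct fn_sketch and the function
names note_alignment, realign_via_fp and realign_via_drap are hypothetical
stand-ins, not GCC identifiers, and the plain assert stands in for
gcc_assert.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields this patch adds to
   struct function; only the members used below are modelled.  */
struct fn_sketch
{
  unsigned int stack_alignment_estimated;	/* In bits.  */
  bool stack_realign_processed;
  bool stack_realign_needed;
  bool need_drap;
};

/* Record ALIGN (in bits) as a running maximum, as the cfgexpand.c,
   tree-inline.c and tree-vectorizer.c hunks do.  The estimate may
   only grow before the realign decision has been processed.  */
void
note_alignment (struct fn_sketch *f, unsigned int align)
{
  if (f->stack_alignment_estimated < align)
    {
      assert (!f->stack_realign_processed);
      f->stack_alignment_estimated = align;
    }
}

/* Mirrors the stack_realign_fp macro: realign without DRAP.  */
bool
realign_via_fp (const struct fn_sketch *f)
{
  return f->stack_realign_needed && !f->need_drap;
}

/* Mirrors the stack_realign_drap macro: realign with DRAP.  */
bool
realign_via_drap (const struct fn_sketch *f)
{
  return f->stack_realign_needed && f->need_drap;
}
```

When stack_realign_needed is set, exactly one of the two predicates is
true; when it is clear, both are false.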

[-- Attachment #3: stack-align-x86-0408.patch --]
[-- Type: application/octet-stream, Size: 30034 bytes --]

Index: i386.h
===================================================================
--- i386.h	(.../trunk/gcc/config/i386)	(revision 133813)
+++ i386.h	(.../branches/stack/gcc/config/i386)	(revision 133869)
@@ -800,17 +800,33 @@
 /* Boundary (in *bits*) on which stack pointer should be aligned.  */
 #define STACK_BOUNDARY BITS_PER_WORD
 
+/* Stack boundary of the main function guaranteed by OS.  */
+#define MAIN_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+
+/* Stack boundary guaranteed by ABI.  */
+#define ABI_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+
 /* Boundary (in *bits*) on which the stack pointer prefers to be
    aligned; the compiler cannot rely on having this alignment.  */
 #define PREFERRED_STACK_BOUNDARY ix86_preferred_stack_boundary
 
-/* As of July 2001, many runtimes do not align the stack properly when
-   entering main.  This causes expand_main_function to forcibly align
-   the stack, which results in aligned frames for functions called from
-   main, though it does nothing for the alignment of main itself.  */
-#define FORCE_PREFERRED_STACK_BOUNDARY_IN_MAIN \
-  (ix86_preferred_stack_boundary > STACK_BOUNDARY && !TARGET_64BIT)
+/* It should be ABI_STACK_BOUNDARY.  But we set it to 128 bits for
+   both 32-bit and 64-bit, to support code that needs 128-bit stack
+   alignment for SSE instructions but can't realign the stack.  */
+#define PREFERRED_STACK_BOUNDARY_DEFAULT 128
 
+/* 1 if -mstackrealign should be turned on by default.  It will
+   generate an alternate prologue and epilogue that realign the
+   runtime stack if necessary.  This supports mixing code that keeps a
+   4-byte aligned stack, as specified by the i386 psABI, with code that
+   needs a 16-byte aligned stack, as required by SSE instructions.  If
+   STACK_REALIGN_DEFAULT is 1 and PREFERRED_STACK_BOUNDARY_DEFAULT is
+   128, stacks for all functions may be realigned.  */
+#define STACK_REALIGN_DEFAULT 0
+
+/* Boundary (in *bits*) on which the incoming stack is aligned.  */
+#define INCOMING_STACK_BOUNDARY ix86_incoming_stack_boundary
+
 /* Target OS keeps a vector-aligned (128-bit, 16-byte) stack.  This is
    mandatory for the 64-bit ABI, and may or may not be true for other
    operating systems.  */
@@ -836,6 +852,9 @@
 
 #define BIGGEST_ALIGNMENT 128
 
+/* Maximum stack alignment for vectorizer.  */
+#define MAX_VECTORIZE_STACK_ALIGNMENT BIGGEST_ALIGNMENT
+
 /* Decide whether a variable of mode MODE should be 128 bit aligned.  */
 #define ALIGN_MODE_128(MODE) \
  ((MODE) == XFmode || SSE_REG_MODE_P (MODE))
@@ -1245,7 +1264,7 @@
    the pic register when possible.  The change is visible after the
    prologue has been emitted.  */
 
-#define REAL_PIC_OFFSET_TABLE_REGNUM  3
+#define REAL_PIC_OFFSET_TABLE_REGNUM  BX_REG
 
 #define PIC_OFFSET_TABLE_REGNUM				\
   ((TARGET_64BIT && ix86_cmodel == CM_SMALL_PIC)	\
@@ -1786,7 +1805,10 @@
    All other eliminations are valid.  */
 
 #define CAN_ELIMINATE(FROM, TO) \
-  ((TO) == STACK_POINTER_REGNUM ? !frame_pointer_needed : 1)
+  (stack_realign_fp \
+  ? ((FROM) == ARG_POINTER_REGNUM && (TO) == HARD_FRAME_POINTER_REGNUM) \
+    || ((FROM) == FRAME_POINTER_REGNUM && (TO) == STACK_POINTER_REGNUM) \
+  : ((TO) == STACK_POINTER_REGNUM ? !frame_pointer_needed : 1))
 
 /* Define the offset between two registers, one to be eliminated, and the other
    its replacement, at the start of a routine.  */
@@ -2342,6 +2364,7 @@
 
 extern enum asm_dialect ix86_asm_dialect;
 extern unsigned int ix86_preferred_stack_boundary;
+extern unsigned int ix86_incoming_stack_boundary;
 extern int ix86_branch_cost, ix86_section_threshold;
 
 /* Smallest class containing REGNO.  */
@@ -2443,7 +2466,6 @@
 {
   struct stack_local_entry *stack_locals;
   const char *some_ld_name;
-  rtx force_align_arg_pointer;
   int save_varrargs_registers;
   int accesses_prev_frame;
   int optimize_mode_switching[MAX_386_ENTITIES];
Index: i386.md
===================================================================
--- i386.md	(.../trunk/gcc/config/i386)	(revision 133813)
+++ i386.md	(.../branches/stack/gcc/config/i386)	(revision 133869)
@@ -221,6 +221,7 @@
   [(AX_REG			 0)
    (DX_REG			 1)
    (CX_REG			 2)
+   (BX_REG			 3)
    (SI_REG			 4)
    (DI_REG			 5)
    (BP_REG			 6)
@@ -230,6 +231,7 @@
    (FPCR_REG			19)
    (R10_REG			39)
    (R11_REG			40)
+   (R13_REG			42)
   ])
 
 ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
Index: i386.opt
===================================================================
--- i386.opt	(.../trunk/gcc/config/i386)	(revision 133813)
+++ i386.opt	(.../branches/stack/gcc/config/i386)	(revision 133869)
@@ -78,6 +78,10 @@
 Target RejectNegative Report InverseMask(NO_FANCY_MATH_387, USE_FANCY_MATH_387)
 Generate sin, cos, sqrt for FPU
 
+mforce-drap
+Target Report Var(ix86_force_drap)
+Always use Dynamic Realigned Argument Pointer (DRAP) to realign stack.
+
 mfp-ret-in-387
 Target Report Mask(FLOAT_RETURNS)
 Return values of functions in FPU registers
@@ -134,6 +138,10 @@
 Target RejectNegative Joined Var(ix86_preferred_stack_boundary_string)
 Attempt to keep stack aligned to this power of 2
 
+mincoming-stack-boundary=
+Target RejectNegative Joined Var(ix86_incoming_stack_boundary_string)
+Assume incoming stack aligned to this power of 2
+
 mpush-args
 Target Report InverseMask(NO_PUSH_ARGS, PUSH_ARGS)
 Use push instructions to save outgoing arguments
@@ -159,7 +167,7 @@
 Use SSE register passing conventions for SF and DF mode
 
 mstackrealign
-Target Report Var(ix86_force_align_arg_pointer)
+Target Report Var(ix86_force_align_arg_pointer) Init(-1)
 Realign stack in prologue
 
 mstack-arg-probe
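
[Editorial sketch, not part of the patch.]  Like
-mpreferred-stack-boundary=, the new -mincoming-stack-boundary= option
takes an exponent rather than a byte count.  The i386.c part of this
patch converts it with (1 << i) * BITS_PER_UNIT and accepts i from 2 to
12 (from 4 to 12 for 64-bit).  The standalone C below reproduces only that
arithmetic; the helper name incoming_boundary_bits, the
SKETCH_BITS_PER_UNIT macro and the return-0-on-error convention are
assumptions made for illustration (the patch itself reports an error).

```c
#include <stdbool.h>

/* Bits per addressable unit; 8 on x86 (BITS_PER_UNIT in GCC).  */
#define SKETCH_BITS_PER_UNIT 8

/* Hypothetical helper: map the option's EXPONENT to a boundary in
   bits.  Returns 0 when EXPONENT is outside the accepted range.  */
unsigned int
incoming_boundary_bits (int exponent, bool target_64bit)
{
  int min = target_64bit ? 4 : 2;

  if (exponent < min || exponent > 12)
    return 0;

  /* 2^exponent bytes, expressed in bits.  */
  return (1u << exponent) * SKETCH_BITS_PER_UNIT;
}
```

For example, an exponent of 2 gives 4 bytes (32 bits), the i386 ABI
boundary, and an exponent of 4 gives 16 bytes (128 bits), the x86_64 ABI
boundary.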
Index: i386.c
===================================================================
--- i386.c	(.../trunk/gcc/config/i386)	(revision 133813)
+++ i386.c	(.../branches/stack/gcc/config/i386)	(revision 133869)
@@ -1693,11 +1693,22 @@
 
 /* -mstackrealign option */
 extern int ix86_force_align_arg_pointer;
-static const char ix86_force_align_arg_pointer_string[] = "force_align_arg_pointer";
+static const char ix86_force_align_arg_pointer_string[]
+  = "force_align_arg_pointer";
 
 /* Preferred alignment for stack boundary in bits.  */
 unsigned int ix86_preferred_stack_boundary;
 
+/* Alignment for incoming stack boundary in bits specified at
+   command line.  */
+static unsigned int ix86_user_incoming_stack_boundary;
+
+/* Default alignment for incoming stack boundary in bits.  */
+static unsigned int ix86_default_incoming_stack_boundary;
+
+/* Alignment for incoming stack boundary in bits.  */
+unsigned int ix86_incoming_stack_boundary;
+
 /* Values 1-5: see jump.c */
 int ix86_branch_cost;
 
@@ -2612,11 +2623,9 @@
   if (TARGET_SSE4_2 || TARGET_ABM)
     x86_popcnt = true;
 
-  /* Validate -mpreferred-stack-boundary= value, or provide default.
-     The default of 128 bits is for Pentium III's SSE __m128.  We can't
-     change it because of optimize_size.  Otherwise, we can't mix object
-     files compiled with -Os and -On.  */
-  ix86_preferred_stack_boundary = 128;
+  /* Validate -mpreferred-stack-boundary= value or default it to
+     PREFERRED_STACK_BOUNDARY_DEFAULT.  */
+  ix86_preferred_stack_boundary = PREFERRED_STACK_BOUNDARY_DEFAULT;
   if (ix86_preferred_stack_boundary_string)
     {
       i = atoi (ix86_preferred_stack_boundary_string);
@@ -2627,6 +2636,31 @@
 	ix86_preferred_stack_boundary = (1 << i) * BITS_PER_UNIT;
     }
 
+  /* Set the default value for -mstackrealign.  */
+  if (ix86_force_align_arg_pointer == -1)
+    ix86_force_align_arg_pointer = STACK_REALIGN_DEFAULT;
+
+  /* Validate -mincoming-stack-boundary= value or default it to
+     ABI_STACK_BOUNDARY/PREFERRED_STACK_BOUNDARY.  */
+  if (ix86_force_align_arg_pointer)
+    ix86_default_incoming_stack_boundary = ABI_STACK_BOUNDARY;
+  else
+    ix86_default_incoming_stack_boundary = PREFERRED_STACK_BOUNDARY;
+  ix86_incoming_stack_boundary = ix86_default_incoming_stack_boundary;
+  if (ix86_incoming_stack_boundary_string)
+    {
+      i = atoi (ix86_incoming_stack_boundary_string);
+      if (i < (TARGET_64BIT ? 4 : 2) || i > 12)
+	error ("-mincoming-stack-boundary=%d is not between %d and 12",
+	       i, TARGET_64BIT ? 4 : 2);
+      else
+	{
+	  ix86_user_incoming_stack_boundary = (1 << i) * BITS_PER_UNIT;
+	  ix86_incoming_stack_boundary
+	    = ix86_user_incoming_stack_boundary;
+	}
+    }
+
   /* Accept -msseregparm only if at least SSE support is enabled.  */
   if (TARGET_SSEREGPARM
       && ! TARGET_SSE)
@@ -3066,11 +3100,6 @@
       && ix86_function_regparm (TREE_TYPE (decl), NULL) >= 3)
     return false;
 
-  /* If we forced aligned the stack, then sibcalling would unalign the
-     stack, which may break the called function.  */
-  if (cfun->machine->force_align_arg_pointer)
-    return false;
-
   /* Otherwise okay.  That also includes certain types of indirect calls.  */
   return true;
 }
@@ -3121,15 +3150,6 @@
 	  *no_add_attrs = true;
 	}
 
-      if (!TARGET_64BIT
-	  && lookup_attribute (ix86_force_align_arg_pointer_string,
-			       TYPE_ATTRIBUTES (*node))
-	  && compare_tree_int (cst, REGPARM_MAX-1))
-	{
-	  error ("%s functions limited to %d register parameters",
-		 ix86_force_align_arg_pointer_string, REGPARM_MAX-1);
-	}
-
       return NULL_TREE;
     }
 
@@ -3241,8 +3261,23 @@
 
   attr = lookup_attribute ("regparm", TYPE_ATTRIBUTES (type));
   if (attr)
-    return TREE_INT_CST_LOW (TREE_VALUE (TREE_VALUE (attr)));
+    {
+      regparm
+	= TREE_INT_CST_LOW (TREE_VALUE (TREE_VALUE (attr)));
 
+      if (decl && TREE_CODE (decl) == FUNCTION_DECL)
+	{
+	  /* We can't use regparm(3) for nested functions as these use
+	     static chain pointer in third argument.  */
+	  if (regparm == 3
+	      && decl_function_context (decl)
+	      && !DECL_NO_STATIC_CHAIN (decl))
+	    regparm = 2;
+	}
+
+      return regparm;
+    }
+
   if (lookup_attribute ("fastcall", TYPE_ATTRIBUTES (type)))
     return 2;
 
@@ -3266,8 +3301,7 @@
 	  /* We can't use regparm(3) for nested functions as these use
 	     static chain pointer in third argument.  */
 	  if (local_regparm == 3
-	      && (decl_function_context (decl)
-                  || ix86_force_align_arg_pointer)
+	      && decl_function_context (decl)
 	      && !DECL_NO_STATIC_CHAIN (decl))
 	    local_regparm = 2;
 
@@ -3276,13 +3310,11 @@
 	     the callee DECL_STRUCT_FUNCTION is gone, so we fall back to
 	     scanning the attributes for the self-realigning property.  */
 	  f = DECL_STRUCT_FUNCTION (decl);
-	  if (local_regparm == 3
-	      && (f ? !!f->machine->force_align_arg_pointer
-		  : !!lookup_attribute (ix86_force_align_arg_pointer_string,
-					TYPE_ATTRIBUTES (TREE_TYPE (decl)))))
-	    local_regparm = 2;
+	  /* Since the current internal arg pointer won't conflict
+	     with parameter passing regs, there is no need to change
+	     stack realignment or adjust the regparm number.
 
-	  /* Each fixed register usage increases register pressure,
+	     Each fixed register usage increases register pressure,
 	     so less registers should be used for argument passing.
 	     This functionality can be overriden by an explicit
 	     regparm value.  */
@@ -4995,15 +5027,7 @@
 
   /* Indicate to allocate space on the stack for varargs save area.  */
   ix86_save_varrargs_registers = 1;
-  /* We need 16-byte stack alignment to save SSE registers.  If user
-     asked for lower preferred_stack_boundary, lets just hope that he knows
-     what he is doing and won't varargs SSE values.
 
-     We also may end up assuming that only 64bit values are stored in SSE
-     register let some floating point program work.  */
-  if (ix86_preferred_stack_boundary >= BIGGEST_ALIGNMENT)
-    cfun->stack_alignment_needed = BIGGEST_ALIGNMENT;
-
   save_area = frame_pointer_rtx;
   set = get_varargs_alias_set ();
 
@@ -5170,7 +5194,7 @@
 
   /* Find the overflow area.  */
   type = TREE_TYPE (ovf);
-  t = make_tree (type, virtual_incoming_args_rtx);
+  t = make_tree (type, current_function_internal_arg_pointer);
   if (words != 0)
     t = build2 (POINTER_PLUS_EXPR, type, t,
 	        size_int (words * UNITS_PER_WORD));
@@ -5929,9 +5953,14 @@
   if (current_function_is_leaf && !current_function_profile
       && !ix86_current_function_calls_tls_descriptor)
     {
-      int i;
+      int i, drap;
+      /* Can't use the same register for both PIC and DRAP.  */
+      if (cfun->drap_reg)
+	drap = REGNO (cfun->drap_reg);
+      else
+	drap = -1;
       for (i = 2; i >= 0; --i)
-        if (!df_regs_ever_live_p (i))
+        if (i != drap && !df_regs_ever_live_p (i))
 	  return i;
     }
 
@@ -5967,8 +5996,8 @@
 	}
     }
 
-  if (cfun->machine->force_align_arg_pointer
-      && regno == REGNO (cfun->machine->force_align_arg_pointer))
+  if (cfun->drap_reg
+      && regno == REGNO (cfun->drap_reg))
     return 1;
 
   return (df_regs_ever_live_p (regno)
@@ -6034,6 +6063,9 @@
   stack_alignment_needed = cfun->stack_alignment_needed / BITS_PER_UNIT;
   preferred_alignment = cfun->preferred_stack_boundary / BITS_PER_UNIT;
 
+  gcc_assert (!size || stack_alignment_needed);
+  gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
+
   /* During reload iteration the amount of registers saved can change.
      Recompute the value as needed.  Do not recompute when amount of registers
      didn't change as reload does multiple calls to the function and does not
@@ -6076,19 +6108,10 @@
 
   frame->hard_frame_pointer_offset = offset;
 
-  /* Do some sanity checking of stack_alignment_needed and
-     preferred_alignment, since i386 port is the only using those features
-     that may break easily.  */
+  /* Align offset, because the realigned frame starts from here.  */
+  if (stack_realign_fp)
+    offset = (offset + stack_alignment_needed -1) & -stack_alignment_needed;
 
-  gcc_assert (!size || stack_alignment_needed);
-  gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
-  gcc_assert (preferred_alignment <= PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT);
-  gcc_assert (stack_alignment_needed
-	      <= PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT);
-
-  if (stack_alignment_needed < STACK_BOUNDARY / BITS_PER_UNIT)
-    stack_alignment_needed = STACK_BOUNDARY / BITS_PER_UNIT;
-
   /* Register save area */
   offset += frame->nregs * UNITS_PER_WORD;
 
@@ -6253,35 +6276,129 @@
     RTX_FRAME_RELATED_P (insn) = 1;
 }
 
+/* Find an available register to be used as the dynamic realign argument
+   pointer register.  Such a register will be written in the prologue and
+   used at the beginning of the body, so it must not be
+	1. a parameter passing register.
+	2. the GOT pointer.
+   For i386, we use CX if it is not used to pass a parameter.  Otherwise
+   we just pick DI.
+   For x86_64, we just pick R13 directly.
+
+   Return: the regno of the chosen register.  */
+
+static unsigned int 
+find_drap_reg (void)
+{
+  int param_reg_num;
+
+  if (TARGET_64BIT)
+    return R13_REG;
+
+  /* Use DI for nested functions or functions needing a static chain.  */
+  if (decl_function_context (cfun->decl)
+      && !DECL_NO_STATIC_CHAIN (cfun->decl))
+    return DI_REG;
+
+  if (cfun->tail_call_emit)
+    return DI_REG;
+
+  param_reg_num = ix86_function_regparm (TREE_TYPE (cfun->decl),
+					 cfun->decl);
+
+  if (param_reg_num <= 2
+      && !lookup_attribute ("fastcall",
+			    TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl))))
+    return CX_REG;
+
+  return DI_REG;
+}
+
 /* Handle the TARGET_INTERNAL_ARG_POINTER hook.  */
 
 static rtx
 ix86_internal_arg_pointer (void)
 {
-  bool has_force_align_arg_pointer =
-    (0 != lookup_attribute (ix86_force_align_arg_pointer_string,
-			    TYPE_ATTRIBUTES (TREE_TYPE (current_function_decl))));
-  if ((FORCE_PREFERRED_STACK_BOUNDARY_IN_MAIN
-       && DECL_NAME (current_function_decl)
-       && MAIN_NAME_P (DECL_NAME (current_function_decl))
-       && DECL_FILE_SCOPE_P (current_function_decl))
-      || ix86_force_align_arg_pointer
-      || has_force_align_arg_pointer)
+  /* If called in "expand" pass, currently_expanding_to_rtl will
+     be true */
+  if (currently_expanding_to_rtl) 
+    return virtual_incoming_args_rtx;
+
+  /* Prefer the one specified at command line. */
+  ix86_incoming_stack_boundary 
+    = (ix86_user_incoming_stack_boundary
+       ? ix86_user_incoming_stack_boundary
+       : ix86_default_incoming_stack_boundary);
+
+  /* Current stack realign doesn't support eh_return.  Assume a
+     function that calls eh_return is aligned.  There will be a sanity
+     check later if stack realign happens together with eh_return.  */
+  if (current_function_calls_eh_return)
+    ix86_incoming_stack_boundary = PREFERRED_STACK_BOUNDARY;
+
+  /* Incoming stack alignment can be changed on individual functions
+     via force_align_arg_pointer attribute.  We use the smallest
+     incoming stack boundary.  */
+  if (ix86_incoming_stack_boundary > ABI_STACK_BOUNDARY
+      && lookup_attribute (ix86_force_align_arg_pointer_string,
+			   TYPE_ATTRIBUTES (TREE_TYPE (current_function_decl))))
+    ix86_incoming_stack_boundary = ABI_STACK_BOUNDARY;
+
+  /* Stack at entrance of main is aligned by runtime.  We use the
+     smallest incoming stack boundary. */
+  if (ix86_incoming_stack_boundary > MAIN_STACK_BOUNDARY
+      && DECL_NAME (current_function_decl)
+      && MAIN_NAME_P (DECL_NAME (current_function_decl))
+      && DECL_FILE_SCOPE_P (current_function_decl))
+    ix86_incoming_stack_boundary = MAIN_STACK_BOUNDARY;
+
+  gcc_assert (cfun->stack_alignment_needed 
+              <= cfun->stack_alignment_estimated);
+
+  /* x86_64 vararg needs 16byte stack alignment for register save
+     area.  */
+  if (TARGET_64BIT
+      && current_function_stdarg
+      && cfun->stack_alignment_estimated < 128)
+    cfun->stack_alignment_estimated = 128;
+
+  /* Update cfun->stack_alignment_estimated and use it later to align
+     stack.  FIXME: How to optimize for leaf function?  */
+  if (PREFERRED_STACK_BOUNDARY > cfun->stack_alignment_estimated)
+    cfun->stack_alignment_estimated = PREFERRED_STACK_BOUNDARY;
+  if (PREFERRED_STACK_BOUNDARY > cfun->stack_alignment_needed)
+    cfun->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
+
+  cfun->stack_realign_needed
+    = ix86_incoming_stack_boundary < cfun->stack_alignment_estimated;
+
+  cfun->stack_realign_processed = true;
+
+  if (ix86_force_drap
+      || !ACCUMULATE_OUTGOING_ARGS)
+    cfun->need_drap = true;
+
+  if (stack_realign_drap)
     {
-      /* Nested functions can't realign the stack due to a register
-	 conflict.  */
-      if (DECL_CONTEXT (current_function_decl)
-	  && TREE_CODE (DECL_CONTEXT (current_function_decl)) == FUNCTION_DECL)
-	{
-	  if (ix86_force_align_arg_pointer)
-	    warning (0, "-mstackrealign ignored for nested functions");
-	  if (has_force_align_arg_pointer)
-	    error ("%s not supported for nested functions",
-		   ix86_force_align_arg_pointer_string);
-	  return virtual_incoming_args_rtx;
-	}
-      cfun->machine->force_align_arg_pointer = gen_rtx_REG (Pmode, CX_REG);
-      return copy_to_reg (cfun->machine->force_align_arg_pointer);
+      /* Assign DRAP to vDRAP and return vDRAP.  */
+      unsigned int regno = find_drap_reg ();
+      rtx drap_vreg;
+      rtx arg_ptr;
+      rtx seq;
+
+      if (regno != CX_REG)
+	cfun->save_param_ptr_reg = true;
+
+      arg_ptr = gen_rtx_REG (Pmode, regno);
+      cfun->drap_reg = arg_ptr;
+
+      start_sequence ();
+      drap_vreg = copy_to_reg (arg_ptr);
+      seq = get_insns ();
+      end_sequence ();
+      
+      emit_insn_before (seq, NEXT_INSN (entry_of_function ()));
+      return drap_vreg;
     }
   else
     return virtual_incoming_args_rtx;
@@ -6320,53 +6437,64 @@
   bool pic_reg_used;
   struct ix86_frame frame;
   HOST_WIDE_INT allocate;
+  rtx (*gen_andsp) (rtx, rtx, rtx);
 
+  /* DRAP should not coexist with stack_realign_fp */
+  gcc_assert (!(cfun->drap_reg && stack_realign_fp));
+
+  /* Check if stack realign is really needed after reload, and
+     store the result in cfun.  */
+  cfun->stack_realign_really = (ix86_incoming_stack_boundary
+				< (current_function_is_leaf
+				   ? cfun->stack_alignment_used
+				   : cfun->stack_alignment_needed));
+
+  cfun->stack_realign_finalized = true;
+
   ix86_compute_frame_layout (&frame);
 
-  if (cfun->machine->force_align_arg_pointer)
+  /* Emit prologue code to adjust stack alignment and set up DRAP, in case
+     DRAP is needed and stack realignment is really needed after reload.  */
+  if (cfun->drap_reg && cfun->stack_realign_really)
     {
       rtx x, y;
+      int align_bytes = cfun->stack_alignment_needed / BITS_PER_UNIT;
+      int param_ptr_offset = (cfun->save_param_ptr_reg
+			      ?  STACK_BOUNDARY / BITS_PER_UNIT : 0);
 
+      gcc_assert (stack_realign_drap);
+
       /* Grab the argument pointer.  */
-      x = plus_constant (stack_pointer_rtx, 4);
-      y = cfun->machine->force_align_arg_pointer;
+      x = plus_constant (stack_pointer_rtx, 
+                         (STACK_BOUNDARY / BITS_PER_UNIT 
+			  + param_ptr_offset));
+      y = cfun->drap_reg;
+
+      /* Only need to push parameter pointer reg if it is callee
+	 saved reg.  */
+      if (cfun->save_param_ptr_reg)
+	{
+	  /* Push arg pointer reg */
+	  insn = emit_insn (gen_push (y));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
+
       insn = emit_insn (gen_rtx_SET (VOIDmode, y, x));
-      RTX_FRAME_RELATED_P (insn) = 1;
+      RTX_FRAME_RELATED_P (insn) = 1; 
 
-      /* The unwind info consists of two parts: install the fafp as the cfa,
-	 and record the fafp as the "save register" of the stack pointer.
-	 The later is there in order that the unwinder can see where it
-	 should restore the stack pointer across the and insn.  */
-      x = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, const0_rtx), UNSPEC_DEF_CFA);
-      x = gen_rtx_SET (VOIDmode, y, x);
-      RTX_FRAME_RELATED_P (x) = 1;
-      y = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, stack_pointer_rtx),
-			  UNSPEC_REG_SAVE);
-      y = gen_rtx_SET (VOIDmode, cfun->machine->force_align_arg_pointer, y);
-      RTX_FRAME_RELATED_P (y) = 1;
-      x = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, x, y));
-      x = gen_rtx_EXPR_LIST (REG_FRAME_RELATED_EXPR, x, NULL);
-      REG_NOTES (insn) = x;
-
+      gen_andsp = TARGET_64BIT ? gen_anddi3 : gen_andsi3;
       /* Align the stack.  */
-      emit_insn (gen_andsi3 (stack_pointer_rtx, stack_pointer_rtx,
-			     GEN_INT (-16)));
+      insn = emit_insn ((*gen_andsp) (stack_pointer_rtx,
+				  stack_pointer_rtx,
+				  GEN_INT (-align_bytes)));
+      RTX_FRAME_RELATED_P (insn) = 1;
 
-      /* And here we cheat like madmen with the unwind info.  We force the
-	 cfa register back to sp+4, which is exactly what it was at the
-	 start of the function.  Re-pushing the return address results in
-	 the return at the same spot relative to the cfa, and thus is
-	 correct wrt the unwind info.  */
-      x = cfun->machine->force_align_arg_pointer;
-      x = gen_frame_mem (Pmode, plus_constant (x, -4));
+      x = cfun->drap_reg;
+      x = gen_frame_mem (Pmode,
+                         plus_constant (x,
+					-(STACK_BOUNDARY / BITS_PER_UNIT)));
       insn = emit_insn (gen_push (x));
       RTX_FRAME_RELATED_P (insn) = 1;
-
-      x = GEN_INT (4);
-      x = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, x), UNSPEC_DEF_CFA);
-      x = gen_rtx_SET (VOIDmode, stack_pointer_rtx, x);
-      x = gen_rtx_EXPR_LIST (REG_FRAME_RELATED_EXPR, x, NULL);
-      REG_NOTES (insn) = x;
     }
 
   /* Note: AT&T enter does NOT have reversed args.  Enter is probably
@@ -6381,6 +6509,19 @@
       RTX_FRAME_RELATED_P (insn) = 1;
     }
 
+  if (stack_realign_fp && cfun->stack_realign_really)
+    {
+      int align_bytes = cfun->stack_alignment_needed / BITS_PER_UNIT;
+      gcc_assert (align_bytes > STACK_BOUNDARY / BITS_PER_UNIT);
+
+      gen_andsp = TARGET_64BIT ? gen_anddi3 : gen_andsi3;
+      /* Align the stack.  */
+      insn = emit_insn ((*gen_andsp) (stack_pointer_rtx,
+				      stack_pointer_rtx,
+				      GEN_INT (-align_bytes)));
+      RTX_FRAME_RELATED_P (insn) = 1;
+    }
+
   allocate = frame.to_allocate;
 
   if (!frame.save_regs_using_mov)
@@ -6395,7 +6536,9 @@
      a red zone location */
   if (TARGET_RED_ZONE && frame.save_regs_using_mov
       && (! TARGET_STACK_PROBE || allocate < CHECK_STACK_LIMIT))
-    ix86_emit_save_regs_using_mov (frame_pointer_needed ? hard_frame_pointer_rtx
+    ix86_emit_save_regs_using_mov ((frame_pointer_needed
+				     && !cfun->stack_realign_really) 
+                                   ? hard_frame_pointer_rtx
 				   : stack_pointer_rtx,
 				   -frame.nregs * UNITS_PER_WORD);
 
@@ -6454,8 +6597,11 @@
       && !(TARGET_RED_ZONE
          && (! TARGET_STACK_PROBE || allocate < CHECK_STACK_LIMIT)))
     {
-      if (!frame_pointer_needed || !frame.to_allocate)
-        ix86_emit_save_regs_using_mov (stack_pointer_rtx, frame.to_allocate);
+      if (!frame_pointer_needed
+	  || !frame.to_allocate
+	  || cfun->stack_realign_really)
+        ix86_emit_save_regs_using_mov (stack_pointer_rtx,
+				       frame.to_allocate);
       else
         ix86_emit_save_regs_using_mov (hard_frame_pointer_rtx,
 				       -frame.nregs * UNITS_PER_WORD);
@@ -6505,6 +6651,16 @@
 	emit_insn (gen_prologue_use (pic_offset_table_rtx));
       emit_insn (gen_blockage ());
     }
+
+  if (cfun->drap_reg && !cfun->stack_realign_really)
+    {
+      /* vDRAP is set up, but after reload it turns out stack realign
+         isn't necessary.  Here we emit prologue code to set up DRAP
+         without stack realign adjustment.  */
+      int drap_bp_offset = STACK_BOUNDARY / BITS_PER_UNIT * 2;
+      rtx x = plus_constant (hard_frame_pointer_rtx, drap_bp_offset);
+      insn = emit_insn (gen_rtx_SET (VOIDmode, cfun->drap_reg, x));
+    }
 }
 
 /* Emit code to restore saved registers using MOV insns.  First register
@@ -6543,7 +6699,10 @@
 ix86_expand_epilogue (int style)
 {
   int regno;
-  int sp_valid = !frame_pointer_needed || current_function_sp_is_unchanging;
+  /* When stack realign may happen, SP must be valid.  */
+  int sp_valid = (!frame_pointer_needed
+		  || current_function_sp_is_unchanging
+		  || (stack_realign_fp && cfun->stack_realign_really));
   struct ix86_frame frame;
   HOST_WIDE_INT offset;
 
@@ -6580,11 +6739,16 @@
     {
       /* Restore registers.  We can use ebp or esp to address the memory
 	 locations.  If both are available, default to ebp, since offsets
-	 are known to be small.  Only exception is esp pointing directly to the
-	 end of block of saved registers, where we may simplify addressing
-	 mode.  */
+	 are known to be small.  Only exception is esp pointing directly
+	 to the end of block of saved registers, where we may simplify
+	 addressing mode.  
 
-      if (!frame_pointer_needed || (sp_valid && !frame.to_allocate))
+	 If we are realigning stack with bp and sp, regs restore can't
+	 be addressed by bp. sp must be used instead.  */
+
+      if (!frame_pointer_needed
+	  || (sp_valid && !frame.to_allocate) 
+	  || (stack_realign_fp && cfun->stack_realign_really))
 	ix86_emit_restore_regs_using_mov (stack_pointer_rtx,
 					  frame.to_allocate, style == 2);
       else
@@ -6596,6 +6760,10 @@
 	{
 	  rtx tmp, sa = EH_RETURN_STACKADJ_RTX;
 
+	  if (cfun->stack_realign_really)
+	    {
+	      error("Stack realign has conflict with eh_return");
+	    }
 	  if (frame_pointer_needed)
 	    {
 	      tmp = gen_rtx_PLUS (Pmode, hard_frame_pointer_rtx, sa);
@@ -6639,10 +6807,16 @@
   else
     {
       /* First step is to deallocate the stack frame so that we can
-	 pop the registers.  */
+	 pop the registers.
+
+	 If we realign stack with frame pointer, then stack pointer
+         won't be able to recover via lea $offset(%bp), %sp, because
+         there is a padding area between bp and sp for realign. 
+         "add $to_allocate, %sp" must be used instead.  */
       if (!sp_valid)
 	{
 	  gcc_assert (frame_pointer_needed);
+          gcc_assert (!(stack_realign_fp && cfun->stack_realign_really));
 	  pro_epilogue_adjust_stack (stack_pointer_rtx,
 				     hard_frame_pointer_rtx,
 				     GEN_INT (offset), style);
@@ -6665,18 +6839,47 @@
 	     able to grok it fast.  */
 	  if (TARGET_USE_LEAVE)
 	    emit_insn (TARGET_64BIT ? gen_leave_rex64 () : gen_leave ());
-	  else if (TARGET_64BIT)
-	    emit_insn (gen_popdi1 (hard_frame_pointer_rtx));
-	  else
-	    emit_insn (gen_popsi1 (hard_frame_pointer_rtx));
+	  else 
+            {
+              /* When stack realignment really happens, the stack
+                 pointer must be recovered from the hard frame pointer
+                 if leave is not used.  */
+              if (stack_realign_fp && cfun->stack_realign_really)
+		pro_epilogue_adjust_stack (stack_pointer_rtx,
+					   hard_frame_pointer_rtx,
+					   const0_rtx, style);
+              if (TARGET_64BIT)
+                emit_insn (gen_popdi1 (hard_frame_pointer_rtx));
+              else
+                emit_insn (gen_popsi1 (hard_frame_pointer_rtx));
+            }
 	}
     }
 
-  if (cfun->machine->force_align_arg_pointer)
+  if (cfun->drap_reg && cfun->stack_realign_really)
     {
-      emit_insn (gen_addsi3 (stack_pointer_rtx,
-			     cfun->machine->force_align_arg_pointer,
-			     GEN_INT (-4)));
+      int param_ptr_offset = (cfun->save_param_ptr_reg
+			      ? STACK_BOUNDARY / BITS_PER_UNIT : 0);
+      gcc_assert (stack_realign_drap);
+      if (TARGET_64BIT)
+        {
+          emit_insn (gen_adddi3 (stack_pointer_rtx,
+				 cfun->drap_reg,
+				 GEN_INT (-(STACK_BOUNDARY / BITS_PER_UNIT
+					    + param_ptr_offset))));
+          if (cfun->save_param_ptr_reg)
+            emit_insn (gen_popdi1 (cfun->drap_reg));
+        }
+      else
+        {
+          emit_insn (gen_addsi3 (stack_pointer_rtx,
+				 cfun->drap_reg,
+				 GEN_INT (-(STACK_BOUNDARY / BITS_PER_UNIT 
+					    + param_ptr_offset))));
+          if (cfun->save_param_ptr_reg)
+            emit_insn (gen_popsi1 (cfun->drap_reg));
+        }
+      
     }
 
   /* Sibcall epilogues don't want a return instruction.  */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFA]: Merge stack alignment branch
  2008-04-04  6:31 [RFA]: Merge stack alignment branch Ye, Joey
  2008-04-04  6:39 ` Andrew Pinski
  2008-04-04 19:05 ` Jan Hubicka
@ 2008-04-10 10:42 ` Ye, Joey
  2008-04-11 13:27   ` Jan Hubicka
  2 siblings, 1 reply; 26+ messages in thread
From: Ye, Joey @ 2008-04-10 10:42 UTC (permalink / raw)
  To: Ye, Joey, GCC Patches; +Cc: Lu, Hongjiu, Guo, Xuepeng, ubizjak

[-- Attachment #1: Type: text/plain, Size: 15736 bytes --]

Updated patches. 42 lines smaller than previous one and broken into 3
patches.
ChangeLog:
2008-04-10  Uros Bizjak  <ubizjak@gmail.com>
	    H.J. Lu  <hongjiu.lu@intel.com>

	PR target/12329
	* config/i386/i386.c (ix86_function_regparm): Limit the number of
	register passing arguments to 2 for nested functions.

2008-04-10  Joey Ye  <joey.ye@intel.com>
	    H.J. Lu  <hongjiu.lu@intel.com>
	    Xuepeng Guo  <xuepeng.guo@intel.com>

	* builtins.c (expand_builtin_setjmp_receiver): Replace
	virtual_incoming_args_rtx with
	crtl->args.internal_arg_pointer.
	(expand_builtin_apply_args_1): Likewise.
	(expand_builtin_longjmp): DRAP will be needed if some builtins are
	called.
	(expand_builtin_apply): Likewise.

	* calls.c (expand_call): Don't calculate preferred stack
	boundary according to incoming stack boundary. Replace 
	virtual_incoming_args_rtx with
	crtl->args.internal_arg_pointer.
	(emit_call_1): DRAP will be needed if return pops.

	* emit-rtl.c (gen_reg_rtx): Estimate stack alignment when generating
	virtual registers.

	* cfgexpand.c (get_decl_align_unit): Estimate stack variable
	alignment and store to stack_alignment_estimated and
	stack_alignment_used.
	(expand_one_var): Likewise.
	(gate_handle_drap): Gate new pass pass_handle_drap.
	(handle_drap): Execute new pass pass_handle_drap.
	(pass_handle_drap): Define new pass.

	* defaults.h (MAX_VECTORIZE_STACK_ALIGNMENT): New.

	* dojump.c (clear_pending_stack_adjust): Leave a FIXME comment
	for the case where a pending stack adjustment is discarded when
	stack realignment is needed.

	* flags.h (frame_pointer_needed): Removed.
	* final.c (frame_pointer_needed): Likewise.

	* function.c (assign_stack_local_1): Estimate stack variable 
	alignment and store to stack_alignment_estimated.
	(instantiate_new_reg): Instantiate virtual incoming args rtx to
	vDRAP if stack realignment and DRAP is needed.
	(assign_parms): Collect parameter/return type alignment and 
	contribute to stack_alignment_estimated.
	(locate_and_pad_parm): Likewise.
	(allocate_struct_function): Init stack_alignment_estimated and
	stack_alignment_used.
	(get_arg_pointer_save_area): Replace virtual_incoming_args_rtx
	with crtl->args.internal_arg_pointer.

	* function.h (function): Add new fields stack_alignment_estimated,
	need_frame_pointer, need_frame_pointer_set, stack_realign_needed,
	stack_realign_really, need_drap, save_param_ptr_reg,
	stack_realign_processed, stack_realign_finalized and
	stack_realign_used.
	(rtl_data): Add new field drap_reg. 
	(frame_pointer_needed): New.
	(stack_realign_fp): Likewise.
	(stack_realign_drap): Likewise.

	* global.c (compute_regsets): Set frame_pointer_needed cannot_elim
	wrt stack_realign_needed.

	* stmt.c (expand_nl_goto_receiver): Replace 
	virtual_incoming_args_rtx with
	crtl->args.internal_arg_pointer.

	* passes.c (pass_handle_drap): Insert this new pass immediately
	after expand.

	* tree-inline.c (expand_call_inline): Estimate stack variable
	alignment and store to stack_alignment_estimated.

	* tree-pass.h (pass_handle_drap): New.

	* tree-vectorizer.c (vect_can_force_dr_alignment_p): Estimate
	stack variable alignment and store to stack_alignment_estimated.

	* reload1.c (set_label_offsets): Assert that frame pointer must be
	eliminated to stack pointer in case stack realignment is estimated
	to happen without DRAP.
	(elimination_effects): Likewise.
	(eliminate_regs_in_insn): Likewise.
	(mark_not_eliminable): Likewise.
	(update_eliminables): Frame pointer is needed in case of stack
	realignment needed.
	(init_elim_table): Don't set frame_pointer_needed here.

	* dwarf2out.c (CUR_FDE): New.
	(reg_save_with_expression): Likewise.
	(dw_fde_struct): Add drap_regnum, stack_realignment,
	is_stack_realign, is_drap and is_drap_reg_saved.
	(add_cfi): If stack is realigned, call reg_save_with_expression
	to represent the location of stored vars.
	(dwarf2out_frame_debug_expr): Add rules 16-19 to handle stack
	realign.
	(output_cfa_loc): Handle DW_CFA_expression.
	(based_loc_descr): Update assert for stack realign.

	* config/i386/i386.c (ix86_force_align_arg_pointer_string): Break
	long line.
	(ix86_user_incoming_stack_boundary): New.
	(ix86_default_incoming_stack_boundary): Likewise.
	(ix86_incoming_stack_boundary): Likewise.
	(find_drap_reg): Likewise.
	(override_options): Override option value for new options.
	(ix86_function_ok_for_sibcall): Sibcall is OK even if the stack
	needs realigning.
	(ix86_handle_cconv_attribute): Stack realign no longer impacts
	number of regparm.
	(ix86_function_regparm): Likewise.
	(setup_incoming_varargs_64): Remove the logic to set
	stack_alignment_needed here.
	(ix86_va_start): Replace virtual_incoming_args_rtx with
	crtl->args.internal_arg_pointer.
	(ix86_save_reg): Replace force_align_arg_pointer with drap_reg.
	(ix86_compute_frame_layout): Compute frame layout wrt stack
	realignment.
	(ix86_internal_arg_pointer): Estimate if stack realignment is
	needed and returns appropriate arg pointer rtx accordingly.
	(ix86_expand_prologue): Finally decide if stack realignment
	is needed and generate prologue code accordingly.
	(ix86_expand_epilogue): Generate epilogue code wrt stack
	realignment is really needed or not.
	* config/i386/i386.c (ix86_select_alt_pic_regnum): Check
	DRAP register.
	
	* config/i386/i386.h (MAIN_STACK_BOUNDARY): New.
	(ABI_STACK_BOUNDARY): Likewise.
	(PREFERRED_STACK_BOUNDARY_DEFAULT): Likewise.
	(STACK_REALIGN_DEFAULT): Likewise.
	(INCOMING_STACK_BOUNDARY): Likewise.
	(MAX_VECTORIZE_STACK_ALIGNMENT): Likewise.
	(ix86_incoming_stack_boundary): Likewise.
	(REAL_PIC_OFFSET_TABLE_REGNUM): Updated to use BX_REG.
	(CAN_ELIMINATE): Redefine the macro to eliminate frame pointer to
	stack pointer and arg pointer to hard frame pointer in case of
	stack realignment without DRAP.
	(machine_function): Remove force_align_arg_pointer.

	* config/i386/i386.md (BX_REG): New.
	(R13_REG): Likewise.

	* config/i386/i386.opt (mforce_drap): New.
	(mincoming-stack-boundary): Likewise.
	(mstackrealign): Updated.

	* doc/extend.texi: Update force_align_arg_pointer.
	* doc/invoke.texi: Document -mincoming-stack-boundary.  Update
	-mstackrealign.
	

Thanks - Joey 

-----Original Message-----
From: gcc-patches-owner@gcc.gnu.org
[mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Ye, Joey
Sent: Friday, April 04, 2008 2:23 PM
To: GCC Patches
Cc: Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: [RFA]: Merge stack alignment branch

The STACK branch has been around for a while, and a number of patches
implementing stack alignment for i386/x86_64 have been checked in. The
branch now effectively aligns all stack variables at their required
boundary, and it introduces zero regressions against current trunk.
Here is the background information and the patch. Comments and feedback
are highly appreciated.

-- BACKGROUND --
Here, we propose a new design to fully support stack alignment while
overcoming the above problems. The new design will
*  Support arbitrary alignment values, including 4,8,16,32...
*  Adjust function stack alignment only when necessary
*  Be developed initially on i386 and x86_64, but be extensible to
other platforms
*  Emit efficient prologue/epilogue code for stack alignment
*  Coexist with special features like dynamic stack allocation (alloca),
nested functions, register parameter passing, PIC code and tail call
optimization, etc
*  Keep debugging and stack unwinding working

2.1 Support arbitrary alignment value
Different source code and optimizations require different stack
alignment, as in the following table:
Feature         Alignment (bytes)
i386_ABI        4
x86_64_ABI      16
char            1
short           2
int             4
long            4/8*
long long       8
__m64           8
__m128          16
float           4
double          8
long double     16
user specified  any power of 2

*Note: 4 for i386, 8 for x86_64
The new design will support any alignment value in this table.

2.2 Adjust function stack alignment only when necessary

Current GCC defines the following macros related to stack alignment:
i. STACK_BOUNDARY in bits, which is preferred by hardware: 32 for i386
and 64 for x86_64. It is the minimum stack boundary and is fixed.
ii. PREFERRED_STACK_BOUNDARY. It sets the stack alignment when calling a
function. It may be set on the command line and has no impact on stack
alignment at function entry. This proposal requires PREFERRED >= STACK,
and by default it is set to ABI_STACK_BOUNDARY.

This design will define a few more macros, or concepts not explicitly
defined in code:
iii. ABI_STACK_BOUNDARY in bits, which is the stack boundary specified
by the psABI: 32 for i386 and 128 for x86_64. ABI_STACK_BOUNDARY >=
STACK_BOUNDARY. It is fixed for a given psABI.
iv. LOCAL_STACK_BOUNDARY in bits. Each function stack has its own stack
alignment requirement, which depends on the alignment of its stack
variables: LOCAL_STACK_BOUNDARY = MAX (alignment of each effective stack
variable).
v. INCOMING_STACK_BOUNDARY in bits, which is the stack boundary at
function entry. If a function is marked with
__attribute__ ((force_align_arg_pointer)) or the -mstackrealign option
is provided, INCOMING == STACK_BOUNDARY. Otherwise, INCOMING ==
PREFERRED_STACK_BOUNDARY, because a function is typically called locally
with the same PREFERRED_STACK_BOUNDARY. For those functions whose
PREFERRED is larger than ABI, it is the caller's responsibility to
invoke them with the appropriate PREFERRED.
vi. REQUIRED_STACK_ALIGNMENT in bits, which is the stack alignment
required by local variables and by calling other functions.
REQUIRED_STACK_ALIGNMENT == MAX (LOCAL_STACK_BOUNDARY,
PREFERRED_STACK_BOUNDARY) for a non-leaf function. For a leaf function,
REQUIRED_STACK_ALIGNMENT == MAX (LOCAL_STACK_BOUNDARY, STACK_BOUNDARY).

This proposal won't adjust the stack when INCOMING_STACK_BOUNDARY >=
REQUIRED_STACK_ALIGNMENT. Only when INCOMING_STACK_BOUNDARY <
REQUIRED_STACK_ALIGNMENT, or when PREFERRED_STACK_BOUNDARY of the entry
function is less than ABI_STACK_BOUNDARY, will it adjust the stack to
REQUIRED_STACK_ALIGNMENT in the prologue.
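
As a rough illustration of the rule in 2.2, here is a small standalone
C sketch. It is not code from the patch: the function names and the
plain unsigned parameters are made up for this example, all boundaries
are in bits, and the extra "entry function" condition is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Boundaries are in bits and are assumed to be powers of two.  */
static bool
is_pow2 (unsigned x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

/* REQUIRED_STACK_ALIGNMENT as described in item vi: a leaf function
   only has to honour STACK_BOUNDARY, a non-leaf function has to honour
   PREFERRED_STACK_BOUNDARY for the calls it makes.  */
unsigned
required_stack_alignment (unsigned local, unsigned preferred,
                          unsigned stack_boundary, bool is_leaf)
{
  unsigned base = is_leaf ? stack_boundary : preferred;

  assert (is_pow2 (local) && is_pow2 (base));
  return local > base ? local : base;
}

/* The prologue realigns only when the incoming boundary is smaller
   than what the function requires.  */
bool
needs_realign (unsigned incoming, unsigned required)
{
  return incoming < required;
}
```

For instance, an i386 leaf function with a __m128 local (128 bits) and
a 32-bit incoming boundary would need realignment under this rule, while
the same function entered with a 128-bit boundary would not.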

2.3 Initial development on i386 and x86_64
We initially support i386 and x86_64. In this document we focus more on
i386, because its small register file makes it harder to implement.
But everything we discuss can be easily applied to x86_64.

2.4 Emit more efficient prologue/epilogue
When a function needs to adjust stack alignment and has no dynamic stack
allocation, this design will generate the following example
prologue/epilogue code:
IA32 example Prologue:
        pushl     %ebp
        movl      %esp, %ebp
        andl      $-16, %esp
        subl      $4, %esp ; is $-4 the local stack size?
Epilogue:
        movl      %ebp, %esp
        popl      %ebp
        ret
Locals will be addressed as esp + offset and parameters as ebp + offset.

Add x86_64 example here.

Thus BP points to parameter frame and SP points to local frame.
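
The "andl $-16, %esp" step is plain power-of-two masking. The sketch
below shows the arithmetic in C; the helper names are invented for this
example. The round-up form mirrors the expression the patch uses in
ix86_compute_frame_layout, (offset + stack_alignment_needed - 1) &
-stack_alignment_needed.

```c
#include <assert.h>
#include <stdint.h>

/* Round SP down to a multiple of ALIGN bytes.  For ALIGN == 16 this is
   what "andl $-16, %esp" computes.  ALIGN must be a power of two.  */
uintptr_t
align_down (uintptr_t sp, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return sp & ~(align - 1);
}

/* Round OFFSET up to a multiple of ALIGN bytes, the form used when the
   frame layout is computed.  ALIGN must be a power of two.  */
uintptr_t
align_up (uintptr_t offset, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (offset + align - 1) & ~(align - 1);
}
```

Rounding the stack pointer down is safe because the stack grows
downward, so the padding ends up between the caller's frame and the
realigned local frame.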

2.5 Coexist with special features
Stack alignment adjustment will coexist with various GCC features
that have special calling conventions and frame layouts, such as dynamic
stack allocation (alloca), nested functions and parameter passing via
registers to local functions.

I386 hard register usage is the major obstacle to making the proposal
friendly to various GCC features. This design requires an additional
hard register in the prologue/epilogue in case of dynamic stack
allocation. The register is called the Dynamic Realigned Argument
Pointer, or DRAP. Because I386 PIC requires BX as the GOT pointer, I386
may use AX, DX and CX as parameter passing registers, and DRAP has to
work with setjmp/longjmp, there are limited candidates to choose from.
The current proposal uses CX as DRAP if CX is not used to pass a
parameter. If CX is not available, DI will be used, because it is
preserved across setjmp/longjmp since it is callee-saved.

X86_64 is much easier. This proposal just chooses R13 as DRAP, which is
also preserved across setjmp/longjmp since it is callee-saved.

DRAP will be assigned to a virtual register, or VDRAP, in the prologue
so that the DRAP hard register itself is free for the register allocator
in the function body. Usually VDRAP will be allocated to the same DRAP
register, so the additional register move instruction is often removed.
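
The selection order can be sketched outside of GCC as follows. This
follows the find_drap_reg function in the attached i386.c patch (which
returns R13_REG on x86_64), but the struct, the enum and the function
name are hypothetical stand-ins for this example, not GCC types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the hard registers involved.  */
enum drap_choice { DRAP_CX, DRAP_DI, DRAP_R13 };

/* Hypothetical summary of the properties find_drap_reg looks at.  */
struct fn_traits
{
  bool target_64bit;        /* compiling for x86_64 */
  bool needs_static_chain;  /* nested function using a static chain */
  bool emits_tail_call;     /* function emits a tail call */
  int regparm;              /* number of register-passed parameters */
  bool fastcall;            /* fastcall attribute present */
};

enum drap_choice
choose_drap (const struct fn_traits *f)
{
  assert (f != NULL);

  if (f->target_64bit)
    return DRAP_R13;
  /* CX carries the static chain and may be live across a tail call.  */
  if (f->needs_static_chain || f->emits_tail_call)
    return DRAP_DI;
  /* CX is free unless it is needed for parameter passing.  */
  if (f->regparm <= 2 && !f->fastcall)
    return DRAP_CX;
  return DRAP_DI;
}
```

Choosing DI costs an extra push/pop pair, since DI is callee-saved and
has to be preserved, which is why CX is tried first on i386.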

2.5.1 When stack alignment adjustment comes together with alloca, the
following example prologue/epilogue will be emitted:
Prologue:
       pushl     %edi                     // Save callee save reg edi
       leal      8(%esp), %edi            // Save address of parameter frame
       andl      $-16, %esp               // Align local stack

//  Reserve two stack slots and save return address
//  and previous frame pointer into them. By
//  pointing new ebp to them, we build a pseudo
//  stack for unwinding.
       pushl     -4(%edi)                 //  save return address
       pushl     %ebp                     //  save old ebp
       movl      %esp, %ebp               //  point ebp to pseudo frame start

       subl      $24, %esp                // adjust local frame size
       movl      %edi, vreg1

Epilogue:
       movl      vreg1, %edi
       movl      %ebp, %esp               // Restore esp to pseudo frame start
       popl      %ebp
       leal      -8(%edi), %esp           // restore esp to real frame start
       popl      %edi                     // Restore edi
       ret

Locals will be addressed as ebp - offset, parameters as vreg1 + offset

Where DI is used to set up virtual parameter frame pointer, BP points to
local frame and SP points to dynamic allocation frame.
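
To check the address arithmetic of this sequence, here is a small C
model that tracks only the register values. It is a sketch under
assumptions (32-bit addresses, DI as DRAP, invented struct and function
names) and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Register values after the prologue, i386 with DI as DRAP.  */
struct drap_frame
{
  uint32_t drap;  /* %edi: address of the incoming parameter area */
  uint32_t fp;    /* %ebp: start of the pseudo frame */
  uint32_t sp;    /* %esp: bottom of the local frame */
};

/* ENTRY_SP is %esp at function entry, pointing at the return address.
   ALIGN is in bytes and must be a power of two.  */
struct drap_frame
drap_prologue (uint32_t entry_sp, uint32_t align, uint32_t locals)
{
  struct drap_frame f;
  uint32_t sp = entry_sp;

  assert (align != 0 && (align & (align - 1)) == 0);
  sp -= 4;                /* pushl %edi */
  f.drap = sp + 8;        /* leal 8(%esp), %edi */
  sp &= ~(align - 1u);    /* andl $-align, %esp */
  sp -= 4;                /* push copy of the return address */
  sp -= 4;                /* pushl %ebp */
  f.fp = sp;              /* movl %esp, %ebp */
  sp -= locals;           /* subl $locals, %esp */
  f.sp = sp;
  return f;
}

/* Returns %esp just before "ret".  */
uint32_t
drap_epilogue (struct drap_frame f)
{
  uint32_t sp = f.fp;     /* movl %ebp, %esp */

  sp += 4;                /* popl %ebp */
  sp = f.drap - 8;        /* leal -8(%edi), %esp */
  sp += 4;                /* popl %edi */
  return sp;
}
```

With an entry %esp of 0x100c, 16-byte alignment and 24 bytes of locals,
the model gives %edi = 0x1010 and %esp = 0xfe0 after the prologue, and
the epilogue brings %esp back to 0x100c regardless of how much padding
the "and" introduced.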

2.5.2 Nested functions will automatically work because they use CX as
the static chain pointer, which won't conflict with any registers used
by stack alignment adjustment, even when nested functions are called via
a function pointer and a function stub on the stack.

2.5.3 GCC may optimize to use registers to pass parameters. At most AX,
DX and CX will be used. Such optimization won't conflict with stack
alignment adjustment, so it should automatically work.

2.5.4 I386 PIC uses an available register or EBX as the GOT pointer.
This design works well under i386 PIC. When picking a register for PIC,
we will avoid using the DRAP register.

For example:
i686 Prologue:
        pushl     %edi
        leal      8(%esp), %edi
        andl      $-16, %esp
        pushl     -4(%edi)
        pushl     %ebp
        movl      %esp, %ebp
        subl      $24,  %esp
        call      .L1
.L1:
        popl      %ebx
        movl      %edi, vreg1

Body:  // code for alloca
        movl      (vreg1), %eax
        subl      %eax, %esp
        andl      $-16, %esp
        movl      %esp, %eax

i686 Epilogue:
        movl      %ebp, %esp
        popl      %ebp
        leal      -8(%edi), %esp
        popl      %edi
        ret

Locals will be addressed as ebp - offset, parameters as vreg1 + offset,
ebx has the GOT pointer.
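
The alloca body above keeps the dynamic allocation frame aligned by
re-applying the mask after every allocation. A minimal sketch of that step,
with a hypothetical helper name and addresses modelled as 32-bit integers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the alloca body: grow the stack by SIZE bytes,
   round the new stack pointer down to ALIGN, and return the address of
   the allocated block.  */
static uint32_t
model_alloca (uint32_t *esp, uint32_t size, uint32_t align)
{
    /* The alignment must be a power of 2.  */
    assert (align != 0 && (align & (align - 1)) == 0);

    *esp -= size;               /* subl %eax, %esp */
    *esp &= ~(align - 1);       /* andl $-16, %esp */
    return *esp;                /* movl %esp, %eax */
}
```

Because rounding down can only move SP further from the local frame, the
block is never smaller than requested, and locals stay addressable through
BP regardless of how much is allocated.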

2.6 Debug and unwind will work since DWARF2 has the flexibility to define
different frame pointers.
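
As an illustration, the attached dwarf2out.c patch describes a register
saved after realignment with the operation sequence const4s, minus,
const4s, and, const4s, minus. DWARF evaluates a DW_CFA_expression with the
CFA already pushed on the expression stack, so the sequence computes
((CFA - 2 * word) & mask) - offset. The evaluator below is only a sketch of
that arithmetic: the enum, struct and function names are hypothetical
stand-ins, not GCC or libgcc interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for DW_OP_const4s, DW_OP_minus and DW_OP_and.  */
enum op_kind { OP_CONST, OP_MINUS, OP_AND };

struct op {
    enum op_kind kind;
    int32_t value;              /* operand, used by OP_CONST only */
};

/* Evaluate OPS on a small stack that initially holds the CFA.  */
static int32_t
eval_from_cfa (int32_t cfa, const struct op *ops, size_t n)
{
    int32_t stack[16];
    size_t depth = 0;
    size_t i;

    stack[depth++] = cfa;
    for (i = 0; i < n; i++) {
        if (ops[i].kind == OP_CONST) {
            assert (depth < 16);
            stack[depth++] = ops[i].value;
        } else {
            int32_t top, second;

            assert (depth >= 2);
            top = stack[--depth];
            second = stack[--depth];
            stack[depth++] = (ops[i].kind == OP_MINUS
                              ? second - top : (second & top));
        }
    }
    assert (depth == 1);
    return stack[0];
}

/* Location of a register saved after the stack was realigned.  */
static int32_t
realigned_save_slot (int32_t cfa, int32_t word, int32_t mask, int32_t offset)
{
    const struct op ops[] = {
        { OP_CONST, 2 * word }, { OP_MINUS, 0 },
        { OP_CONST, mask },     { OP_AND, 0 },
        { OP_CONST, offset },   { OP_MINUS, 0 },
    };

    return eval_from_cfa (cfa, ops, sizeof ops / sizeof ops[0]);
}
```

For example, with a CFA of 0x1034, a 4-byte word, a mask of -16 and an
offset of 8, the expression yields 0x1018.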

2.7 Some intrinsics rely on the stack layout and need to be handled
accordingly. They are __builtin_return_address and __builtin_frame_address.
This proposal will set up a pseudo frame slot to help the unwinder find the
return address and parent frame address by emitting the following prologue
code after adjusting alignment:
        pushl     -4(%edi)
        pushl     %ebp
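
A small test of this requirement might look like the sketch below. It
assumes a compiler that provides the GCC builtins and the aligned
attribute; the struct and function names are hypothetical. The function
needs a local aligned beyond the i386 ABI stack boundary and also queries
both intrinsics, so they have to keep working in a realigned frame.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of what the probe function observed.  */
struct frame_info {
    void *ret;                  /* __builtin_return_address (0) */
    void *frame;                /* __builtin_frame_address (0) */
    uintptr_t buf_addr;         /* address of the over-aligned local */
};

/* Keep the function out of line so that it has its own frame.  */
__attribute__ ((noinline))
static struct frame_info
probe (void)
{
    /* A local that needs more alignment than the i386 ABI guarantees.  */
    volatile char buf[64] __attribute__ ((aligned (32)));
    struct frame_info info;

    buf[0] = 0;
    info.ret = __builtin_return_address (0);
    info.frame = __builtin_frame_address (0);
    info.buf_addr = (uintptr_t) buf;
    return info;
}
```

A caller can then check that buf_addr is a multiple of 32 and that both
intrinsics returned non-null values.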


[-- Attachment #2: stack-align-dwarf2-0410.patch --]
[-- Type: application/octet-stream, Size: 10625 bytes --]

Index: dwarf2out.c
===================================================================
--- dwarf2out.c	(.../trunk/gcc)	(revision 134098)
+++ dwarf2out.c	(.../branches/stack/gcc)	(revision 134141)
@@ -110,6 +110,9 @@ static void dwarf2out_source_line (unsig
 #define DWARF2_FRAME_REG_OUT(REGNO, FOR_EH) (REGNO)
 #endif
 
+/* Define the current fde_table entry we should use. */
+#define CUR_FDE fde_table[fde_table_in_use - 1]
+
 /* Decide whether we want to emit frame unwind information for the current
    translation unit.  */
 
@@ -239,9 +242,18 @@ typedef struct dw_fde_struct GTY(())
   bool dw_fde_switched_sections;
   dw_cfi_ref dw_fde_cfi;
   unsigned funcdef_number;
+  /* If it is drap, which register is employed. */
+  int drap_regnum;
+  HOST_WIDE_INT stack_realignment;
   unsigned all_throwers_are_sibcalls : 1;
   unsigned nothrow : 1;
   unsigned uses_eh_lsda : 1;
+  /* Whether we did stack realign in this call frame.*/
+  unsigned is_stack_realign : 1;
+  /* Whether stack realign is drap. */
+  unsigned is_drap : 1;
+  /* Whether we saved this drap register. */
+  unsigned is_drap_reg_saved : 1;
 }
 dw_fde_node;
 
@@ -381,6 +393,7 @@ static void get_cfa_from_loc_descr (dw_c
 static struct dw_loc_descr_struct *build_cfa_loc
   (dw_cfa_location *, HOST_WIDE_INT);
 static void def_cfa_1 (const char *, dw_cfa_location *);
+static void reg_save_with_expression (dw_cfi_ref);
 
 /* How to start an assembler comment.  */
 #ifndef ASM_COMMENT_START
@@ -618,6 +631,13 @@ add_cfi (dw_cfi_ref *list_head, dw_cfi_r
   for (p = list_head; (*p) != NULL; p = &(*p)->dw_cfi_next)
     ;
 
+  /* If stack is realigned, accessing the stored register via CFA+offset will
+     be invalid. Here we will use a series of expressions in dwarf2 to simulate
+     the stack realign and represent the location of the stored register. */
+  if (fde_table_in_use && (CUR_FDE.is_stack_realign || CUR_FDE.is_drap) 
+      && cfi->dw_cfi_opc == DW_CFA_offset)
+    reg_save_with_expression (cfi);
+
   *p = cfi;
 }
 
@@ -1435,6 +1455,10 @@ static dw_cfa_location cfa_temp;
   Rules 10-14: Save a register to the stack.  Define offset as the
 	       difference of the original location and cfa_store's
 	       location (or cfa_temp's location if cfa_temp is used).
+  
+  Rules 16-19: If AND operation happens on sp in prologue, we assume stack is
+               realigned. We will use a group of DW_OP_?? expressions to represent
+               the location of the stored register instead of CFA+offset.
 
   The Rules
 
@@ -1529,7 +1553,32 @@ static dw_cfa_location cfa_temp;
 
   Rule 15:
   (set <reg> {unspec, unspec_volatile})
-  effects: target-dependent  */
+  effects: target-dependent  
+  
+  Rule 16:
+  (set sp (and: sp <const_int>))
+  effects: CUR_FDE.is_stack_realign = 1
+           cfa_store.offset = 0
+
+           if cfa_store.offset >= UNITS_PER_WORD
+             effects: CUR_FDE.is_drap_reg_saved = 1
+
+  Rule 17:
+  (set (mem ({pre_inc, pre_dec} sp)) (mem (plus (cfa.reg) (const_int))))
+  effects: cfa_store.offset += -/+ mode_size(mem)
+  
+  Rule 18:
+  (set (mem({pre_inc, pre_dec} sp)) fp)
+  constraints: CUR_FDE.is_stack_realign == 1
+  effects: CUR_FDE.is_stack_realign = 0
+           CUR_FDE.is_drap = 1
+           CUR_FDE.drap_regnum = cfa.reg
+
+  Rule 19:
+  (set fp sp)
+  constraints: CUR_FDE.is_drap == 1
+  effects: cfa.reg = fp
+           cfa.offset = cfa_store.offset */
 
 static void
 dwarf2out_frame_debug_expr (rtx expr, const char *label)
@@ -1607,7 +1656,20 @@ dwarf2out_frame_debug_expr (rtx expr, co
 	      cfa_temp.reg = cfa.reg;
 	      cfa_temp.offset = cfa.offset;
 	    }
-	  else
+            /* Rule 19 */
+            /* Each time FP is set to SP while the stack is realigned, we
+               assume the realignment uses drap and the drap register is
+               the current cfa's register. We update cfa's register to FP. */
+	  else if (fde_table_in_use && CUR_FDE.is_drap 
+                   && REGNO (src) == STACK_POINTER_REGNUM 
+                   && REGNO (dest) == HARD_FRAME_POINTER_REGNUM)
+            {
+              cfa.reg = REGNO (dest);
+              cfa.offset = cfa_store.offset;
+              cfa_temp.reg = cfa.reg;
+              cfa_temp.offset = cfa.offset;
+            }
+          else
 	    {
 	      /* Saving a register in a register.  */
 	      gcc_assert (!fixed_regs [REGNO (dest)]
@@ -1747,6 +1809,22 @@ dwarf2out_frame_debug_expr (rtx expr, co
 	  targetm.dwarf_handle_frame_unspec (label, expr, XINT (src, 1));
 	  return;
 
+	  /* Rule 16 */
+	case AND:
+          /* If this AND operation happens on stack pointer in prologue, we 
+             assume the stack is realigned and we extract the alignment. */
+          if (XEXP (src, 0) == stack_pointer_rtx && fde_table_in_use)
+            {
+              CUR_FDE.is_stack_realign = 1;
+              CUR_FDE.stack_realignment = INTVAL (XEXP (src, 1));
+              /* If we didn't push anything to stack before stack is realigned,
+                  we assume the drap register isn't saved. */
+              if (cfa_store.offset > UNITS_PER_WORD)
+                CUR_FDE.is_drap_reg_saved = 1;
+              cfa_store.offset = 0;
+            }
+          return;
+
 	default:
 	  gcc_unreachable ();
 	}
@@ -1755,7 +1833,6 @@ dwarf2out_frame_debug_expr (rtx expr, co
       break;
 
     case MEM:
-      gcc_assert (REG_P (src));
 
       /* Saving a register to the stack.  Make sure dest is relative to the
 	 CFA register.  */
@@ -1788,6 +1865,17 @@ dwarf2out_frame_debug_expr (rtx expr, co
 
 	  gcc_assert (REGNO (XEXP (XEXP (dest, 0), 0)) == STACK_POINTER_REGNUM
 		      && cfa_store.reg == STACK_POINTER_REGNUM);
+          
+          /* Rule 18 */
+          /* If we push FP after stack is realigned, we assume this realignment
+             is drap and record the drap register. */
+          if (fde_table_in_use && CUR_FDE.is_stack_realign
+              && REGNO (src) == HARD_FRAME_POINTER_REGNUM)
+            {
+              CUR_FDE.is_stack_realign = 0;
+              CUR_FDE.is_drap = 1;
+              CUR_FDE.drap_regnum = DWARF_FRAME_REGNUM (cfa.reg);
+            }            
 
 	  cfa_store.offset += offset;
 	  if (cfa.reg == STACK_POINTER_REGNUM)
@@ -1882,6 +1970,12 @@ dwarf2out_frame_debug_expr (rtx expr, co
 	      break;
 	    }
 	}
+        /* Rule 17 */
+        /* If the source operand of this MEM operation is not a register, 
+           basically the source is return address. Here we just care how 
+           much stack grew and ignore to save it. */ 
+      if (!REG_P (src))
+        break;
 
       def_cfa_1 (label, &cfa);
       {
@@ -3548,6 +3642,9 @@ output_cfa_loc (dw_cfi_ref cfi)
   dw_loc_descr_ref loc;
   unsigned long size;
 
+  if (cfi->dw_cfi_opc == DW_CFA_expression)
+    dw2_asm_output_data (1, cfi->dw_cfi_oprnd2.dw_cfi_reg_num, NULL);
+
   /* Output the size of the block.  */
   loc = cfi->dw_cfi_oprnd1.dw_cfi_loc;
   size = size_of_locs (loc);
@@ -9024,8 +9121,9 @@ based_loc_descr (rtx reg, HOST_WIDE_INT 
 	      offset += INTVAL (XEXP (elim, 1));
 	      elim = XEXP (elim, 0);
 	    }
-	  gcc_assert (elim == (frame_pointer_needed ? hard_frame_pointer_rtx
-		      : stack_pointer_rtx));
+	  gcc_assert (stack_realign_fp
+	              || elim == (frame_pointer_needed ? hard_frame_pointer_rtx
+		                                       : stack_pointer_rtx));
 	  offset += frame_pointer_fb_offset;
 
 	  return new_loc_descr (DW_OP_fbreg, offset, 0);
@@ -11155,9 +11253,10 @@ compute_frame_pointer_to_fb_displacement
       offset += INTVAL (XEXP (elim, 1));
       elim = XEXP (elim, 0);
     }
-  gcc_assert (elim == (frame_pointer_needed ? hard_frame_pointer_rtx
-		       : stack_pointer_rtx));
 
+  gcc_assert (stack_realign_fp 
+              || elim == (frame_pointer_needed ? hard_frame_pointer_rtx
+		       : stack_pointer_rtx));
   frame_pointer_fb_offset = -offset;
 }
 
@@ -15438,6 +15537,63 @@ dwarf2out_finish (const char *filename)
   if (debug_str_hash)
     htab_traverse (debug_str_hash, output_indirect_string, NULL);
 }
+
+/* In this function we use a series of DW_OP_?? expression which simulates
+   how stack is realigned to represent the location of the stored register.*/
+static void
+reg_save_with_expression (dw_cfi_ref cfi)
+{
+  struct dw_loc_descr_struct *head, *tmp;
+  HOST_WIDE_INT alignment = CUR_FDE.stack_realignment;
+  HOST_WIDE_INT offset = cfi->dw_cfi_oprnd2.dw_cfi_offset * UNITS_PER_WORD;
+  int reg = cfi->dw_cfi_oprnd1.dw_cfi_reg_num;
+  unsigned int dwarf_sp = (unsigned)DWARF_FRAME_REGNUM (STACK_POINTER_REGNUM);
+  
+  if (CUR_FDE.is_stack_realign)
+    {
+      head = tmp = new_loc_descr (DW_OP_const4s, 2 * UNITS_PER_WORD, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, alignment, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_and, 0, 0);
+
+      /* If stack grows upward, the offset will be a negative. */
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);  
+   
+      cfi->dw_cfi_opc = DW_CFA_expression;
+      cfi->dw_cfi_oprnd2.dw_cfi_reg_num = reg; 
+      cfi->dw_cfi_oprnd1.dw_cfi_loc = head;
+    }
+
+  /* We need restore drap register through dereference. If we needn't to restore
+     the drap register we just ignore. */
+  if (CUR_FDE.is_drap && reg == CUR_FDE.drap_regnum)
+    {
+       
+      dw_cfi_ref cfi2 = new_cfi();
+
+      cfi->dw_cfi_opc = DW_CFA_expression;
+      head = tmp = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      if (CUR_FDE.is_drap_reg_saved)
+        {
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_deref, 0, 0);
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, 
+                                                  2 * UNITS_PER_WORD, 0);
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+        }
+      cfi->dw_cfi_oprnd2.dw_cfi_reg_num = reg;
+      cfi->dw_cfi_oprnd1.dw_cfi_loc = head;
+
+      /* We also need restore the sp. */
+      head = tmp = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      cfi2->dw_cfi_opc = DW_CFA_expression;
+      cfi2->dw_cfi_oprnd2.dw_cfi_reg_num = dwarf_sp;
+      cfi2->dw_cfi_oprnd1.dw_cfi_loc = head;
+      cfi->dw_cfi_next = cfi2;
+    }  
+}
 #else
 
 /* This should never be used, but its address is needed for comparisons.  */

[-- Attachment #3: stack-align-generic-0410.patch --]
[-- Type: application/octet-stream, Size: 32983 bytes --]

Index: flags.h
===================================================================
--- flags.h	(.../trunk/gcc)	(revision 134098)
+++ flags.h	(.../branches/stack/gcc)	(revision 134141)
@@ -223,12 +223,6 @@ extern int flag_dump_rtl_in_asm;
 \f
 /* Other basic status info about current function.  */
 
-/* Nonzero means current function must be given a frame pointer.
-   Set in stmt.c if anything is allocated on the stack there.
-   Set in reload1.c if anything is allocated on the stack there.  */
-
-extern int frame_pointer_needed;
-
 /* Nonzero if subexpressions must be evaluated from left-to-right.  */
 extern int flag_evaluation_order;
 
Index: defaults.h
===================================================================
--- defaults.h	(.../trunk/gcc)	(revision 134098)
+++ defaults.h	(.../branches/stack/gcc)	(revision 134141)
@@ -940,4 +940,8 @@ along with GCC; see the file COPYING3.  
 #define OUTGOING_REG_PARM_STACK_SPACE 0
 #endif
 
+#ifndef MAX_VECTORIZE_STACK_ALIGNMENT
+#define MAX_VECTORIZE_STACK_ALIGNMENT 0
+#endif
+
 #endif  /* ! GCC_DEFAULTS_H */
Index: tree-pass.h
===================================================================
--- tree-pass.h	(.../trunk/gcc)	(revision 134098)
+++ tree-pass.h	(.../branches/stack/gcc)	(revision 134141)
@@ -472,6 +472,7 @@ extern struct gimple_opt_pass pass_inlin
 extern struct gimple_opt_pass pass_apply_inline;
 extern struct gimple_opt_pass pass_all_early_optimizations;
 extern struct gimple_opt_pass pass_update_address_taken;
+extern struct gimple_opt_pass pass_handle_drap;
 
 /* The root of the compilation pass tree, once constructed.  */
 extern struct opt_pass *all_passes, *all_ipa_passes, *all_lowering_passes;
Index: builtins.c
===================================================================
--- builtins.c	(.../trunk/gcc)	(revision 134098)
+++ builtins.c	(.../branches/stack/gcc)	(revision 134141)
@@ -740,7 +740,7 @@ expand_builtin_setjmp_receiver (rtx rece
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (crtl->args.internal_arg_pointer,
 			  copy_to_reg (get_arg_pointer_save_area ()));
 	}
     }
@@ -775,6 +775,11 @@ expand_builtin_longjmp (rtx buf_addr, rt
   rtx fp, lab, stack, insn, last;
   enum machine_mode sa_mode = STACK_SAVEAREA_MODE (SAVE_NONLOCAL);
 
+  /* DRAP is needed for stack realign if longjmp is expanded to current 
+     function  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+    cfun->need_drap = true;
+
   if (setjmp_alias_set == -1)
     setjmp_alias_set = new_alias_set ();
 
@@ -1345,7 +1350,7 @@ expand_builtin_apply_args_1 (void)
       }
 
   /* Save the arg pointer to the block.  */
-  tem = copy_to_reg (virtual_incoming_args_rtx);
+  tem = copy_to_reg (crtl->args.internal_arg_pointer);
 #ifdef STACK_GROWS_DOWNWARD
   /* We need the pointer as the caller actually passed them to us, not
      as we might have pretended they were passed.  Make sure it's a valid
@@ -1453,6 +1458,14 @@ expand_builtin_apply (rtx function, rtx 
   /* Allocate a block of memory onto the stack and copy the memory
      arguments to the outgoing arguments address.  */
   allocate_dynamic_stack_space (argsize, 0, BITS_PER_UNIT);
+
+  /* Set DRAP flag to true, even though allocate_dynamic_stack_space
+     may have already set current_function_calls_alloca to true.
+     current_function_calls_alloca won't be set if argsize is zero,
+     so we have to guarantee need_drap is true here.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+    cfun->need_drap = true;
+
   dest = virtual_outgoing_args_rtx;
 #ifndef STACK_GROWS_DOWNWARD
   if (GET_CODE (argsize) == CONST_INT)
Index: final.c
===================================================================
--- final.c	(.../trunk/gcc)	(revision 134098)
+++ final.c	(.../branches/stack/gcc)	(revision 134141)
@@ -178,12 +178,6 @@ CC_STATUS cc_status;
 CC_STATUS cc_prev_status;
 #endif
 
-/* Nonzero means current function must be given a frame pointer.
-   Initialized in function.c to 0.  Set only in reload1.c as per
-   the needs of the function.  */
-
-int frame_pointer_needed;
-
 /* Number of unmatched NOTE_INSN_BLOCK_BEG notes we have seen.  */
 
 static int block_depth;
Index: dojump.c
===================================================================
--- dojump.c	(.../trunk/gcc)	(revision 134098)
+++ dojump.c	(.../branches/stack/gcc)	(revision 134141)
@@ -64,7 +64,10 @@ discard_pending_stack_adjust (void)
    so the adjustment won't get done.
 
    Note, if the current function calls alloca, then it must have a
-   frame pointer regardless of the value of flag_omit_frame_pointer.  */
+   frame pointer regardless of the value of flag_omit_frame_pointer.  
+
+   When stack realign is needed, we can't discard pending stack adjustment,
+   in which stack pointer must be restored in epilogue. */
 
 void
 clear_pending_stack_adjust (void)
Index: global.c
===================================================================
--- global.c	(.../trunk/gcc)	(revision 134098)
+++ global.c	(.../branches/stack/gcc)	(revision 134141)
@@ -247,10 +247,20 @@ compute_regsets (HARD_REG_SET *elim_set,
   static const struct {const int from, to; } eliminables[] = ELIMINABLE_REGS;
   size_t i;
 #endif
+
+  /* FIXME: If EXIT_IGNORE_STACK is set, we will not save and restore
+     sp for alloca.  So we can't eliminate the frame pointer in that
+     case.  At some point, we should improve this by emitting the
+     sp-adjusting insns for this case.  */
   int need_fp
     = (! flag_omit_frame_pointer
        || (current_function_calls_alloca && EXIT_IGNORE_STACK)
-       || FRAME_POINTER_REQUIRED);
+       || FRAME_POINTER_REQUIRED
+       || current_function_accesses_prior_frames
+       || cfun->stack_realign_needed);
+
+  frame_pointer_needed = need_fp;
+  cfun->need_frame_pointer_set = 1;
 
   max_regno = max_reg_num ();
   compact_blocks ();
@@ -271,7 +281,10 @@ compute_regsets (HARD_REG_SET *elim_set,
     {
       bool cannot_elim
 	= (! CAN_ELIMINATE (eliminables[i].from, eliminables[i].to)
-	   || (eliminables[i].to == STACK_POINTER_REGNUM && need_fp));
+	   || (eliminables[i].to == STACK_POINTER_REGNUM
+	       && need_fp 
+	       && (! MAX_VECTORIZE_STACK_ALIGNMENT
+		   || ! stack_realign_fp)));
 
       if (!regs_asm_clobbered[eliminables[i].from])
 	{
Index: function.c
===================================================================
--- function.c	(.../trunk/gcc)	(revision 134098)
+++ function.c	(.../branches/stack/gcc)	(revision 134141)
@@ -342,17 +342,19 @@ assign_stack_local (enum machine_mode mo
 {
   rtx x, addr;
   int bigend_correction = 0;
-  unsigned int alignment;
+  unsigned int alignment, mode_alignment, alignment_in_bits;
   int frame_off, frame_alignment, frame_phase;
 
+  if (mode == BLKmode)
+    mode_alignment = BIGGEST_ALIGNMENT;
+  else
+    mode_alignment = GET_MODE_ALIGNMENT (mode);
+
   if (align == 0)
     {
       tree type;
 
-      if (mode == BLKmode)
-	alignment = BIGGEST_ALIGNMENT;
-      else
-	alignment = GET_MODE_ALIGNMENT (mode);
+      alignment = mode_alignment;
 
       /* Allow the target to (possibly) increase the alignment of this
 	 stack slot.  */
@@ -372,15 +374,45 @@ assign_stack_local (enum machine_mode mo
   else
     alignment = align / BITS_PER_UNIT;
 
+  alignment_in_bits = alignment * BITS_PER_UNIT;
+
   if (FRAME_GROWS_DOWNWARD)
     frame_offset -= size;
 
-  /* Ignore alignment we can't do with expected alignment of the boundary.  */
-  if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
-    alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
-
-  if (cfun->stack_alignment_needed < alignment * BITS_PER_UNIT)
-    cfun->stack_alignment_needed = alignment * BITS_PER_UNIT;
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < alignment_in_bits)
+	{
+          if (!cfun->stack_realign_processed)
+            cfun->stack_alignment_estimated = alignment_in_bits;
+          else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized);
+	      if (!cfun->stack_realign_needed)
+		{
+		  /* It is OK to reduce the alignment as long as the
+		     requested size is 0 or the estimated stack
+		     alignment >= mode alignment.  */
+		  gcc_assert (size == 0
+			      || (cfun->stack_alignment_estimated
+				  >= mode_alignment));
+		  alignment_in_bits = cfun->stack_alignment_estimated;
+		  alignment = alignment_in_bits / BITS_PER_UNIT;
+		}
+	    }
+	}
+    }
+  else
+    {
+      /* Ignore alignment we can't do with expected alignment of the
+	 boundary.  */
+      if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
+	alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+    }
+  if (cfun->stack_alignment_needed < alignment_in_bits)
+    cfun->stack_alignment_needed = alignment_in_bits;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
   /* Calculate how many bytes the start of local variables is off from
      stack alignment.  */
@@ -1169,7 +1201,17 @@ instantiate_new_reg (rtx x, HOST_WIDE_IN
   HOST_WIDE_INT offset;
 
   if (x == virtual_incoming_args_rtx)
-    new = arg_pointer_rtx, offset = in_arg_offset;
+    {
+      /* Replace virtual_incoming_args_rtx with the internal arg pointer here.  */
+      if (crtl->args.internal_arg_pointer != virtual_incoming_args_rtx)
+        {
+          gcc_assert (stack_realign_drap);
+          new = crtl->args.internal_arg_pointer;
+          offset = 0;
+        }
+      else
+        new = arg_pointer_rtx, offset = in_arg_offset;
+    }
   else if (x == virtual_stack_vars_rtx)
     new = frame_pointer_rtx, offset = var_offset;
   else if (x == virtual_stack_dynamic_rtx)
@@ -2968,6 +3010,20 @@ assign_parms (tree fndecl)
 	  continue;
 	}
 
+      /* Estimate stack alignment from parameter alignment */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT)
+        {
+          unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
+						      data.passed_type);
+	  if (TYPE_ALIGN (data.nominal_type) > align)
+	    align = TYPE_ALIGN (data.passed_type);
+	  if (cfun->stack_alignment_estimated < align)
+	    {
+	      gcc_assert (!cfun->stack_realign_processed);
+	      cfun->stack_alignment_estimated = align;
+	    }
+	}
+	
       if (current_function_stdarg && !TREE_CHAIN (parm))
 	assign_parms_setup_varargs (&all, &data, false);
 
@@ -3005,6 +3061,28 @@ assign_parms (tree fndecl)
      now that all parameters have been copied out of hard registers.  */
   emit_insn (all.first_conversion_insn);
 
+  /* Estimate reload stack alignment from scalar return mode.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (DECL_RESULT (fndecl))
+	{
+	  tree type = TREE_TYPE (DECL_RESULT (fndecl));
+	  enum machine_mode mode = TYPE_MODE (type);
+
+	  if (mode != BLKmode
+	      && mode != VOIDmode
+	      && !AGGREGATE_TYPE_P (type))
+	    {
+	      unsigned int align = GET_MODE_ALIGNMENT (mode);
+	      if (cfun->stack_alignment_estimated < align)
+		{
+		  gcc_assert (!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	} 
+    }
+
   /* If we are receiving a struct value address as the first argument, set up
      the RTL for the function result. As this might require code to convert
      the transmitted address to Pmode, we do this here to ensure that possible
@@ -3282,12 +3360,32 @@ locate_and_pad_parm (enum machine_mode p
   locate->where_pad = where_pad;
   locate->boundary = boundary;
 
-  /* Remember if the outgoing parameter requires extra alignment on the
-     calling function side.  */
-  if (boundary > PREFERRED_STACK_BOUNDARY)
-    boundary = PREFERRED_STACK_BOUNDARY;
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      /* stack_alignment_estimated can't change after stack has been
+	 realigned.  */
+      if (cfun->stack_alignment_estimated < boundary)
+        {
+          if (!cfun->stack_realign_processed)
+	    cfun->stack_alignment_estimated = boundary;
+	  else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized
+			  && cfun->stack_realign_needed);
+	    }
+	}
+    }
+  else
+    {
+      /* Remember if the outgoing parameter requires extra alignment on
+         the calling function side.  */
+      if (boundary > PREFERRED_STACK_BOUNDARY)
+        boundary = PREFERRED_STACK_BOUNDARY;
+    }
   if (cfun->stack_alignment_needed < boundary)
     cfun->stack_alignment_needed = boundary;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
 #ifdef ARGS_GROW_DOWNWARD
   locate->slot_offset.constant = -initial_offset_ptr->constant;
@@ -3843,6 +3941,8 @@ allocate_struct_function (tree fndecl, b
   cfun = ggc_alloc_cleared (sizeof (struct function));
 
   cfun->stack_alignment_needed = STACK_BOUNDARY;
+  cfun->stack_alignment_used = STACK_BOUNDARY;
+  cfun->stack_alignment_estimated = STACK_BOUNDARY;
   cfun->preferred_stack_boundary = STACK_BOUNDARY;
 
   current_function_funcdef_no = get_next_funcdef_no ();
@@ -4622,7 +4722,8 @@ get_arg_pointer_save_area (void)
 	 generated stack slot may not be a valid memory address, so we
 	 have to check it and fix it if necessary.  */
       start_sequence ();
-      emit_move_insn (validize_mem (ret), virtual_incoming_args_rtx);
+      emit_move_insn (validize_mem (ret),
+                      crtl->args.internal_arg_pointer);
       seq = get_insns ();
       end_sequence ();
 
Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(.../trunk/gcc)	(revision 134098)
+++ tree-vectorizer.c	(.../branches/stack/gcc)	(revision 134141)
@@ -1786,9 +1786,19 @@ vect_can_force_dr_alignment_p (const_tre
 
   if (TREE_STATIC (decl))
     return (alignment <= MAX_OFILE_ALIGNMENT);
+  else if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      if (alignment <= MAX_VECTORIZE_STACK_ALIGNMENT)
+	{
+	  if (cfun->stack_alignment_estimated < alignment)
+	    cfun->stack_alignment_estimated = alignment;
+	  return true;
+	}
+      else
+	return false;
+    }
   else
-    /* This used to be PREFERRED_STACK_BOUNDARY, however, that is not 100%
-       correct until someone implements forced stack alignment.  */
     return (alignment <= STACK_BOUNDARY); 
 }
 
Index: function.h
===================================================================
--- function.h	(.../trunk/gcc)	(revision 134098)
+++ function.h	(.../branches/stack/gcc)	(revision 134141)
@@ -271,6 +271,9 @@ struct rtl_data GTY(())
      needed by inner routines.  */
   rtx x_arg_pointer_save_area;
 
+  /* Dynamic Realign Argument Pointer used for realigning stack.  */
+  rtx drap_reg;
+
   /* Offset to end of allocated area of stack frame.
      If stack grows down, this is the address of the last stack slot allocated.
      If stack grows up, this is the address for the next slot.  */
@@ -352,9 +355,16 @@ struct function GTY(())
   /* tm.h can use this to store whatever it likes.  */
   struct machine_function * GTY ((maybe_undef)) machine;
 
-  /* The largest alignment of slot allocated on the stack.  */
+  /* The largest alignment needed on the stack, including requirement
+     for outgoing stack alignment.  */
   unsigned int stack_alignment_needed;
 
+  /* The largest alignment of slot allocated on the stack.  */
+  unsigned int stack_alignment_used;
+
+  /* The estimated stack alignment.  */
+  unsigned int stack_alignment_estimated;
+
   /* Preferred alignment of the end of stack frame.  */
   unsigned int preferred_stack_boundary;
 
@@ -509,6 +519,38 @@ struct function GTY(())
 
   /* Nonzero if pass_tree_profile was run on this function.  */
   unsigned int after_tree_profile : 1;
+
+/* Nonzero if current function must be given a frame pointer.
+   Set in global.c if anything is allocated on the stack there.  */
+  unsigned int need_frame_pointer : 1;
+
+  /* Nonzero if need_frame_pointer has been set.  */
+  unsigned int need_frame_pointer_set : 1;
+
+  /* Nonzero if, by estimation, current function stack needs realignment. */
+  unsigned int stack_realign_needed : 1;
+
+  /* Nonzero if function stack realignment is really needed. This flag
+     will be set after reload if by then criteria of stack realignment
+     is still true. Its value may contradict stack_realign_needed
+     since the latter was set before reload. This flag is more accurate
+     than stack_realign_needed so prologue/epilogue should be generated
+     according to both flags  */
+  unsigned int stack_realign_really : 1;
+
+  /* Nonzero if function being compiled needs dynamic realigned
+     argument pointer (drap) if stack needs realigning.  */
+  unsigned int need_drap : 1;
+
+  /* Nonzero if current function needs to save/restore parameter
+     pointer register in prolog, because it is a callee save reg.  */
+  unsigned int save_param_ptr_reg : 1;
+
+  /* Nonzero if function stack realignment estimation is done.  */
+  unsigned int stack_realign_processed : 1;
+
+  /* Nonzero if function stack realignment has been finalized.  */
+  unsigned int stack_realign_finalized : 1;
 };
 
 /* If va_list_[gf]pr_size is set to this, it means we don't know how
@@ -563,6 +605,9 @@ extern void instantiate_decl_rtl (rtx x)
 #define dom_computed (cfun->cfg->x_dom_computed)
 #define n_bbs_in_dom_tree (cfun->cfg->x_n_bbs_in_dom_tree)
 #define VALUE_HISTOGRAMS(fun) (fun)->value_histograms
+#define frame_pointer_needed (cfun->need_frame_pointer)
+#define stack_realign_fp (cfun->stack_realign_needed && !cfun->need_drap)
+#define stack_realign_drap (cfun->stack_realign_needed && cfun->need_drap)
 
 /* Given a function decl for a containing function,
    return the `struct function' for it.  */
Index: calls.c
===================================================================
--- calls.c	(.../trunk/gcc)	(revision 134098)
+++ calls.c	(.../branches/stack/gcc)	(revision 134141)
@@ -419,6 +419,10 @@ emit_call_1 (rtx funexp, tree fntree, tr
       rounded_stack_size -= n_popped;
       rounded_stack_size_rtx = GEN_INT (rounded_stack_size);
       stack_pointer_delta -= n_popped;
+
+      /* If popup is needed, stack realign must use DRAP  */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+        cfun->need_drap = true;
     }
 
   if (!ACCUMULATE_OUTGOING_ARGS)
@@ -2091,7 +2095,10 @@ expand_call (tree exp, rtx target, int i
 
   /* Figure out the amount to which the stack should be aligned.  */
   preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
-  if (fndecl)
+
+  /* With automatic stack realignment, we align stack in prologue when
+     needed and there is no need to update preferred_stack_boundary.  */
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT && fndecl)
     {
       struct cgraph_rtl_info *i = cgraph_rtl_info (fndecl);
       if (i && i->preferred_incoming_stack_boundary)
@@ -2392,7 +2399,7 @@ expand_call (tree exp, rtx target, int i
 	 incoming argument block.  */
       if (pass == 0)
 	{
-	  argblock = virtual_incoming_args_rtx;
+	  argblock = crtl->args.internal_arg_pointer;
 	  argblock
 #ifdef STACK_GROWS_DOWNWARD
 	    = plus_constant (argblock, crtl->args.pretend_args_size);
Index: emit-rtl.c
===================================================================
--- emit-rtl.c	(.../trunk/gcc)	(revision 134098)
+++ emit-rtl.c	(.../branches/stack/gcc)	(revision 134141)
@@ -864,9 +864,20 @@ rtx
 gen_reg_rtx (enum machine_mode mode)
 {
   rtx val;
+  unsigned int align = GET_MODE_ALIGNMENT (mode);
 
   gcc_assert (can_create_pseudo_p ());
 
+  /* If a virtual register with bigger mode alignment is generated,
+     increase stack alignment estimation because it might be spilled
+     to stack later.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT 
+      && cfun->stack_alignment_estimated < align)
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      cfun->stack_alignment_estimated = align;
+    }
+		
   if (generating_concat_p
       && (GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
 	  || GET_MODE_CLASS (mode) == MODE_COMPLEX_INT))
Index: cfgexpand.c
===================================================================
--- cfgexpand.c	(.../trunk/gcc)	(revision 134098)
+++ cfgexpand.c	(.../branches/stack/gcc)	(revision 134141)
@@ -161,10 +161,27 @@ get_decl_align_unit (tree decl)
 
   align = DECL_ALIGN (decl);
   align = LOCAL_ALIGNMENT (TREE_TYPE (decl), align);
-  if (align > PREFERRED_STACK_BOUNDARY)
-    align = PREFERRED_STACK_BOUNDARY;
+
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < align)
+	{
+	  gcc_assert (!cfun->stack_realign_processed);
+	  cfun->stack_alignment_estimated = align;
+	}
+    }
+  else
+    {
+      if (align > PREFERRED_STACK_BOUNDARY)
+	align = PREFERRED_STACK_BOUNDARY;
+    }
+
+  /* stack_alignment_needed > PREFERRED_STACK_BOUNDARY is permitted.
+     So here we only make sure stack_alignment_needed >= align.  */
   if (cfun->stack_alignment_needed < align)
     cfun->stack_alignment_needed = align;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
   return align / BITS_PER_UNIT;
 }
@@ -743,6 +760,29 @@ defer_stack_allocation (tree var, bool t
 static HOST_WIDE_INT
 expand_one_var (tree var, bool toplevel, bool really_expand)
 {
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && TREE_CODE (var) == VAR_DECL)
+    {
+      unsigned int align;
+
+      /* Because we don't know if VAR will be in register or on stack,
+	 we conservatively assume it will be on stack even if VAR is
+	 eventually put into register after RA pass.  For non-automatic
+	 variables, which won't be on stack, we collect alignment of
+	 type and ignore user specified alignment.  */
+      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+	align = TYPE_ALIGN (TREE_TYPE (var));
+      else
+	align = DECL_ALIGN (var);
+
+      if (cfun->stack_alignment_estimated < align)
+        {
+          /* stack_alignment_estimated shouldn't change after stack
+             realign decision made */
+          gcc_assert(!cfun->stack_realign_processed);
+	  cfun->stack_alignment_estimated = align;
+	}
+    }
+
   if (TREE_CODE (var) != VAR_DECL)
     ;
   else if (DECL_EXTERNAL (var))
@@ -1997,3 +2037,71 @@ struct gimple_opt_pass pass_expand =
   TODO_dump_func,                       /* todo_flags_finish */
  }
 };
+
+static bool
+gate_handle_drap (void)
+{
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT)
+    return false;
+  else
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      return true;
+    }
+}
+
+/* This pass sets crtl->args.internal_arg_pointer to a virtual
+   register if DRAP is needed.  Local register allocator will replace
+   virtual_incoming_args_rtx with the virtual register.  */
+
+static unsigned int
+handle_drap (void)
+{
+  rtx internal_arg_rtx; 
+
+  if (!cfun->need_drap
+      && (current_function_calls_alloca
+          || cfun->has_nonlocal_label
+          || current_function_has_nonlocal_goto))
+    cfun->need_drap = true;
+
+  /* Call targetm.calls.internal_arg_pointer again.  This time it will
+     return a virtual register if DRAP is needed.  */
+  internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
+
+  /* Assertion to check internal_arg_pointer is set to the right rtx
+     here.  */
+  gcc_assert (crtl->args.internal_arg_pointer == 
+             virtual_incoming_args_rtx);
+
+  /* Do nothing if no need to replace virtual_incoming_args_rtx.  */
+  if (crtl->args.internal_arg_pointer != internal_arg_rtx)
+    {
+      crtl->args.internal_arg_pointer = internal_arg_rtx;
+
+      /* Call fixup_tail_calls to clean up REG_EQUIV note if DRAP is
+         needed. */
+      fixup_tail_calls ();
+    }
+
+  return 0;
+}
+
+struct gimple_opt_pass pass_handle_drap =
+{
+ {
+  GIMPLE_PASS,
+  "handle_drap",			/* name */
+  gate_handle_drap,			/* gate */
+  handle_drap,			        /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  0,				        /* tv_id */
+  0,                                    /* properties_required */
+  0,                                    /* properties_provided */
+  0,				        /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_dump_func,                       /* todo_flags_finish */
+ }
+};
Index: passes.c
===================================================================
--- passes.c	(.../trunk/gcc)	(revision 134098)
+++ passes.c	(.../branches/stack/gcc)	(revision 134141)
@@ -685,6 +685,7 @@ init_optimization_passes (void)
   NEXT_PASS (pass_mudflap_2);
   NEXT_PASS (pass_free_cfg_annotations);
   NEXT_PASS (pass_expand);
+  NEXT_PASS (pass_handle_drap); 
   NEXT_PASS (pass_rest_of_compilation);
     {
       struct opt_pass **p = &pass_rest_of_compilation.pass.sub;
Index: stmt.c
===================================================================
--- stmt.c	(.../trunk/gcc)	(revision 134098)
+++ stmt.c	(.../branches/stack/gcc)	(revision 134141)
@@ -1819,7 +1819,7 @@ expand_nl_goto_receiver (void)
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (crtl->args.internal_arg_pointer,
 			  copy_to_reg (get_arg_pointer_save_area ()));
 	}
     }
Index: reload1.c
===================================================================
--- reload1.c	(.../trunk/gcc)	(revision 134098)
+++ reload1.c	(.../branches/stack/gcc)	(revision 134141)
@@ -2279,7 +2279,13 @@ set_label_offsets (rtx x, rtx insn, int 
 	  if (offsets_at[CODE_LABEL_NUMBER (x) - first_label_num][i]
 	      != (initial_p ? reg_eliminate[i].initial_offset
 		  : reg_eliminate[i].offset))
-	    reg_eliminate[i].can_eliminate = 0;
+            {
+	      /* Must not disable reg eliminate because stack realignment
+	         must eliminate frame pointer to stack pointer.  */
+	      gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			  || ! stack_realign_fp);
+	      reg_eliminate[i].can_eliminate = 0;
+            }
 
       return;
 
@@ -2358,7 +2364,13 @@ set_label_offsets (rtx x, rtx insn, int 
 	 offset because we are doing a jump to a variable address.  */
       for (p = reg_eliminate; p < &reg_eliminate[NUM_ELIMINABLE_REGS]; p++)
 	if (p->offset != p->initial_offset)
-	  p->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    p->can_eliminate = 0;
+	  }
       break;
 
     default:
@@ -2849,7 +2861,13 @@ elimination_effects (rtx x, enum machine
       /* If we modify the source of an elimination rule, disable it.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->from_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       /* If we modify the target of an elimination rule by adding a constant,
 	 update its offset.  If we modify the target in any other way, we'll
@@ -2875,7 +2893,14 @@ elimination_effects (rtx x, enum machine
 		    && CONST_INT_P (XEXP (XEXP (x, 1), 1)))
 		  ep->offset -= INTVAL (XEXP (XEXP (x, 1), 1));
 		else
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	      }
 	  }
 
@@ -2918,7 +2943,13 @@ elimination_effects (rtx x, enum machine
 	 know how this register is used.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->from_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       elimination_effects (XEXP (x, 0), mem_mode);
       return;
@@ -2929,7 +2960,13 @@ elimination_effects (rtx x, enum machine
 	 be performed.  Otherwise, we need not be concerned about it.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->to_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       elimination_effects (XEXP (x, 0), mem_mode);
       return;
@@ -2963,7 +3000,14 @@ elimination_effects (rtx x, enum machine
 		    && GET_CODE (XEXP (src, 1)) == CONST_INT)
 		  ep->offset -= INTVAL (XEXP (src, 1));
 		else
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	      }
 	}
 
@@ -3292,7 +3336,14 @@ eliminate_regs_in_insn (rtx insn, int re
 	      for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
 		   ep++)
 		if (ep->from_rtx == orig_operand[i])
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	    }
 
 	  /* Companion to the above plus substitution, we can allow
@@ -3422,7 +3473,13 @@ eliminate_regs_in_insn (rtx insn, int re
   for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
     {
       if (ep->previous_offset != ep->offset && ep->ref_outside_mem)
-	ep->can_eliminate = 0;
+	{
+	  /* Must not disable reg eliminate because stack realignment
+	     must eliminate frame pointer to stack pointer.  */
+	  gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+		      || ! stack_realign_fp);
+	  ep->can_eliminate = 0;
+	}
 
       ep->ref_outside_mem = 0;
 
@@ -3498,6 +3555,11 @@ mark_not_eliminable (rtx dest, const_rtx
 	    || XEXP (SET_SRC (x), 0) != dest
 	    || GET_CODE (XEXP (SET_SRC (x), 1)) != CONST_INT))
       {
+	/* Must not disable reg eliminate because stack realignment
+	   must eliminate frame pointer to stack pointer.  */
+	gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+		    || ! stack_realign_fp);
+
 	reg_eliminate[i].can_eliminate_previous
 	  = reg_eliminate[i].can_eliminate = 0;
 	num_eliminable--;
@@ -3668,8 +3730,11 @@ update_eliminables (HARD_REG_SET *pset)
   frame_pointer_needed = 1;
   for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
     {
-      if (ep->can_eliminate && ep->from == FRAME_POINTER_REGNUM
-	  && ep->to != HARD_FRAME_POINTER_REGNUM)
+      if (ep->can_eliminate
+	  && ep->from == FRAME_POINTER_REGNUM
+	  && ep->to != HARD_FRAME_POINTER_REGNUM
+	  && (! MAX_VECTORIZE_STACK_ALIGNMENT
+	      || ! cfun->stack_realign_needed))
 	frame_pointer_needed = 0;
 
       if (! ep->can_eliminate && ep->can_eliminate_previous)
@@ -3713,18 +3778,8 @@ init_elim_table (void)
   if (!reg_eliminate)
     reg_eliminate = xcalloc (sizeof (struct elim_table), NUM_ELIMINABLE_REGS);
 
-  /* Does this function require a frame pointer?  */
-
-  frame_pointer_needed = (! flag_omit_frame_pointer
-			  /* ?? If EXIT_IGNORE_STACK is set, we will not save
-			     and restore sp for alloca.  So we can't eliminate
-			     the frame pointer in that case.  At some point,
-			     we should improve this by emitting the
-			     sp-adjusting insns for this case.  */
-			  || (current_function_calls_alloca
-			      && EXIT_IGNORE_STACK)
-			  || current_function_accesses_prior_frames
-			  || FRAME_POINTER_REQUIRED);
+  /* frame_pointer_needed should have been set.  */
+  gcc_assert (cfun->need_frame_pointer_set);
 
   num_eliminable = 0;
 
@@ -3736,7 +3791,10 @@ init_elim_table (void)
       ep->to = ep1->to;
       ep->can_eliminate = ep->can_eliminate_previous
 	= (CAN_ELIMINATE (ep->from, ep->to)
-	   && ! (ep->to == STACK_POINTER_REGNUM && frame_pointer_needed));
+	   && ! (ep->to == STACK_POINTER_REGNUM
+		 && frame_pointer_needed 
+		 && (! MAX_VECTORIZE_STACK_ALIGNMENT
+		     || ! stack_realign_fp)));
     }
 #else
   reg_eliminate[0].from = reg_eliminate_1[0].from;

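Reviewer aid (not part of the patch): the middle-end hunks above raise
cfun->stack_alignment_estimated from gen_reg_rtx, get_decl_align_unit and
expand_one_var, and assert that it no longer changes once
stack_realign_processed is set. The x86 part below then compares the
incoming stack boundary with that estimate. A minimal standalone sketch of
that bookkeeping follows; the struct and function names are hypothetical
stand-ins, not GCC identifiers, and only the comparison logic is taken from
the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-function fields added by the patch.
   All alignments are in bits, as in GCC.  */
struct align_state
{
  unsigned int estimated;  /* models cfun->stack_alignment_estimated */
  bool processed;          /* models cfun->stack_realign_processed */
  bool realign_needed;     /* models cfun->stack_realign_needed */
};

/* Record that something with ALIGN_BITS alignment may live on the stack.
   The estimate only grows, and must not grow after the realign decision
   has been made.  */
void
note_alignment (struct align_state *s, unsigned int align_bits)
{
  if (s->estimated < align_bits)
    {
      assert (!s->processed);
      s->estimated = align_bits;
    }
}

/* Make the realign decision: realignment is needed only when the incoming
   stack boundary is weaker than the estimated requirement.  */
bool
decide_realign (struct align_state *s, unsigned int incoming_bits)
{
  s->realign_needed = incoming_bits < s->estimated;
  s->processed = true;
  return s->realign_needed;
}
```

For example, a function that sees a 128-bit aligned local while the
incoming boundary is 32 bits would be marked for realignment, whereas one
whose largest requirement is 32 bits with a 128-bit incoming boundary
would not.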
[-- Attachment #4: stack-align-x86-0410.patch --]
[-- Type: application/octet-stream, Size: 30420 bytes --]

Index: i386.h
===================================================================
--- i386.h	(.../trunk/gcc/config/i386)	(revision 134098)
+++ i386.h	(.../branches/stack/gcc/config/i386)	(revision 134141)
@@ -806,16 +806,32 @@ enum target_cpu_default
 /* Boundary (in *bits*) on which stack pointer should be aligned.  */
 #define STACK_BOUNDARY BITS_PER_WORD
 
+/* Stack boundary of the main function guaranteed by OS.  */
+#define MAIN_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+
+/* Stack boundary guaranteed by ABI.  */
+#define ABI_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+
 /* Boundary (in *bits*) on which the stack pointer prefers to be
    aligned; the compiler cannot rely on having this alignment.  */
 #define PREFERRED_STACK_BOUNDARY ix86_preferred_stack_boundary
 
-/* As of July 2001, many runtimes do not align the stack properly when
-   entering main.  This causes expand_main_function to forcibly align
-   the stack, which results in aligned frames for functions called from
-   main, though it does nothing for the alignment of main itself.  */
-#define FORCE_PREFERRED_STACK_BOUNDARY_IN_MAIN \
-  (ix86_preferred_stack_boundary > STACK_BOUNDARY && !TARGET_64BIT)
+/* It should be ABI_STACK_BOUNDARY.  But we set it to 128 bits for
+   both 32bit and 64bit, to support codes that need 128 bit stack
+   alignment for SSE instructions, but can't realign the stack.  */
+#define PREFERRED_STACK_BOUNDARY_DEFAULT 128
+
+/* 1 if -mstackrealign should be turned on by default.  It will
+   generate an alternate prologue and epilogue that realigns the
+   runtime stack if necessary.  This supports mixing code that keeps a
+   4-byte aligned stack, as specified by i386 psABI, with code that
+   needs a 16-byte aligned stack, as required by SSE instructions.  If
+   STACK_REALIGN_DEFAULT is 1 and PREFERRED_STACK_BOUNDARY_DEFAULT is
+   128, stacks for all functions may be realigned.  */
+#define STACK_REALIGN_DEFAULT 0
+
+/* Boundary (in *bits*) on which the incoming stack is aligned.  */
+#define INCOMING_STACK_BOUNDARY ix86_incoming_stack_boundary
 
 /* Target OS keeps a vector-aligned (128-bit, 16-byte) stack.  This is
    mandatory for the 64-bit ABI, and may or may not be true for other
@@ -842,6 +858,9 @@ enum target_cpu_default
 
 #define BIGGEST_ALIGNMENT 128
 
+/* Maximum stack alignment for vectorizer.  */
+#define MAX_VECTORIZE_STACK_ALIGNMENT BIGGEST_ALIGNMENT
+
 /* Decide whether a variable of mode MODE should be 128 bit aligned.  */
 #define ALIGN_MODE_128(MODE) \
  ((MODE) == XFmode || SSE_REG_MODE_P (MODE))
@@ -1251,7 +1270,7 @@ do {									\
    the pic register when possible.  The change is visible after the
    prologue has been emitted.  */
 
-#define REAL_PIC_OFFSET_TABLE_REGNUM  3
+#define REAL_PIC_OFFSET_TABLE_REGNUM  BX_REG
 
 #define PIC_OFFSET_TABLE_REGNUM				\
   ((TARGET_64BIT && ix86_cmodel == CM_SMALL_PIC)	\
@@ -1792,7 +1811,10 @@ typedef struct ix86_args {
    All other eliminations are valid.  */
 
 #define CAN_ELIMINATE(FROM, TO) \
-  ((TO) == STACK_POINTER_REGNUM ? !frame_pointer_needed : 1)
+  (stack_realign_fp \
+  ? ((FROM) == ARG_POINTER_REGNUM && (TO) == HARD_FRAME_POINTER_REGNUM) \
+    || ((FROM) == FRAME_POINTER_REGNUM && (TO) == STACK_POINTER_REGNUM) \
+  : ((TO) == STACK_POINTER_REGNUM ? !frame_pointer_needed : 1))
 
 /* Define the offset between two registers, one to be eliminated, and the other
    its replacement, at the start of a routine.  */
@@ -2348,6 +2370,7 @@ enum asm_dialect {
 
 extern enum asm_dialect ix86_asm_dialect;
 extern unsigned int ix86_preferred_stack_boundary;
+extern unsigned int ix86_incoming_stack_boundary;
 extern int ix86_branch_cost, ix86_section_threshold;
 
 /* Smallest class containing REGNO.  */
@@ -2449,7 +2472,6 @@ struct machine_function GTY(())
 {
   struct stack_local_entry *stack_locals;
   const char *some_ld_name;
-  rtx force_align_arg_pointer;
   int save_varrargs_registers;
   int accesses_prev_frame;
   int optimize_mode_switching[MAX_386_ENTITIES];
Index: i386.md
===================================================================
--- i386.md	(.../trunk/gcc/config/i386)	(revision 134098)
+++ i386.md	(.../branches/stack/gcc/config/i386)	(revision 134141)
@@ -232,6 +232,7 @@
   [(AX_REG			 0)
    (DX_REG			 1)
    (CX_REG			 2)
+   (BX_REG			 3)
    (SI_REG			 4)
    (DI_REG			 5)
    (BP_REG			 6)
@@ -241,6 +242,7 @@
    (FPCR_REG			19)
    (R10_REG			39)
    (R11_REG			40)
+   (R13_REG			42)
   ])
 
 ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
Index: i386.opt
===================================================================
--- i386.opt	(.../trunk/gcc/config/i386)	(revision 134098)
+++ i386.opt	(.../branches/stack/gcc/config/i386)	(revision 134141)
@@ -78,6 +78,10 @@ mfancy-math-387
 Target RejectNegative Report InverseMask(NO_FANCY_MATH_387, USE_FANCY_MATH_387)
 Generate sin, cos, sqrt for FPU
 
+mforce-drap
+Target Report Var(ix86_force_drap)
+Always use Dynamic Realigned Argument Pointer (DRAP) to realign stack.
+
 mfp-ret-in-387
 Target Report Mask(FLOAT_RETURNS)
 Return values of functions in FPU registers
@@ -134,6 +138,10 @@ mpreferred-stack-boundary=
 Target RejectNegative Joined Var(ix86_preferred_stack_boundary_string)
 Attempt to keep stack aligned to this power of 2
 
+mincoming-stack-boundary=
+Target RejectNegative Joined Var(ix86_incoming_stack_boundary_string)
+Assume incoming stack aligned to this power of 2
+
 mpush-args
 Target Report InverseMask(NO_PUSH_ARGS, PUSH_ARGS)
 Use push instructions to save outgoing arguments
@@ -159,7 +167,7 @@ Target RejectNegative Mask(SSEREGPARM)
 Use SSE register passing conventions for SF and DF mode
 
 mstackrealign
-Target Report Var(ix86_force_align_arg_pointer)
+Target Report Var(ix86_force_align_arg_pointer) Init(-1)
 Realign stack in prologue
 
 mstack-arg-probe
Index: i386.c
===================================================================
--- i386.c	(.../trunk/gcc/config/i386)	(revision 134098)
+++ i386.c	(.../branches/stack/gcc/config/i386)	(revision 134141)
@@ -1694,11 +1694,22 @@ static int ix86_regparm;
 
 /* -mstackrealign option */
 extern int ix86_force_align_arg_pointer;
-static const char ix86_force_align_arg_pointer_string[] = "force_align_arg_pointer";
+static const char ix86_force_align_arg_pointer_string[]
+  = "force_align_arg_pointer";
 
 /* Preferred alignment for stack boundary in bits.  */
 unsigned int ix86_preferred_stack_boundary;
 
+/* Alignment for incoming stack boundary in bits specified at
+   command line.  */
+static unsigned int ix86_user_incoming_stack_boundary;
+
+/* Default alignment for incoming stack boundary in bits.  */
+static unsigned int ix86_default_incoming_stack_boundary;
+
+/* Alignment for incoming stack boundary in bits.  */
+unsigned int ix86_incoming_stack_boundary;
+
 /* Values 1-5: see jump.c */
 int ix86_branch_cost;
 
@@ -2627,11 +2638,9 @@ override_options (void)
   if (TARGET_SSE4_2 || TARGET_ABM)
     x86_popcnt = true;
 
-  /* Validate -mpreferred-stack-boundary= value, or provide default.
-     The default of 128 bits is for Pentium III's SSE __m128.  We can't
-     change it because of optimize_size.  Otherwise, we can't mix object
-     files compiled with -Os and -On.  */
-  ix86_preferred_stack_boundary = 128;
+  /* Validate -mpreferred-stack-boundary= value or default it to
+     PREFERRED_STACK_BOUNDARY_DEFAULT.  */
+  ix86_preferred_stack_boundary = PREFERRED_STACK_BOUNDARY_DEFAULT;
   if (ix86_preferred_stack_boundary_string)
     {
       i = atoi (ix86_preferred_stack_boundary_string);
@@ -2642,6 +2651,31 @@ override_options (void)
 	ix86_preferred_stack_boundary = (1 << i) * BITS_PER_UNIT;
     }
 
+  /* Set the default value for -mstackrealign.  */
+  if (ix86_force_align_arg_pointer == -1)
+    ix86_force_align_arg_pointer = STACK_REALIGN_DEFAULT;
+
+  /* Validate -mincoming-stack-boundary= value or default it to
+     ABI_STACK_BOUNDARY/PREFERRED_STACK_BOUNDARY.  */
+  if (ix86_force_align_arg_pointer)
+    ix86_default_incoming_stack_boundary = ABI_STACK_BOUNDARY;
+  else
+    ix86_default_incoming_stack_boundary = PREFERRED_STACK_BOUNDARY;
+  ix86_incoming_stack_boundary = ix86_default_incoming_stack_boundary;
+  if (ix86_incoming_stack_boundary_string)
+    {
+      i = atoi (ix86_incoming_stack_boundary_string);
+      if (i < (TARGET_64BIT ? 4 : 2) || i > 12)
+	error ("-mincoming-stack-boundary=%d is not between %d and 12",
+	       i, TARGET_64BIT ? 4 : 2);
+      else
+	{
+	  ix86_user_incoming_stack_boundary = (1 << i) * BITS_PER_UNIT;
+	  ix86_incoming_stack_boundary
+	    = ix86_user_incoming_stack_boundary;
+	}
+    }
+
   /* Accept -msseregparm only if at least SSE support is enabled.  */
   if (TARGET_SSEREGPARM
       && ! TARGET_SSE)
@@ -3081,11 +3115,6 @@ ix86_function_ok_for_sibcall (tree decl,
       && ix86_function_regparm (TREE_TYPE (decl), NULL) >= 3)
     return false;
 
-  /* If we forced aligned the stack, then sibcalling would unalign the
-     stack, which may break the called function.  */
-  if (cfun->machine->force_align_arg_pointer)
-    return false;
-
   /* Otherwise okay.  That also includes certain types of indirect calls.  */
   return true;
 }
@@ -3136,15 +3165,6 @@ ix86_handle_cconv_attribute (tree *node,
 	  *no_add_attrs = true;
 	}
 
-      if (!TARGET_64BIT
-	  && lookup_attribute (ix86_force_align_arg_pointer_string,
-			       TYPE_ATTRIBUTES (*node))
-	  && compare_tree_int (cst, REGPARM_MAX-1))
-	{
-	  error ("%s functions limited to %d register parameters",
-		 ix86_force_align_arg_pointer_string, REGPARM_MAX-1);
-	}
-
       return NULL_TREE;
     }
 
@@ -3302,8 +3322,7 @@ ix86_function_regparm (const_tree type, 
 	  /* We can't use regparm(3) for nested functions as these use
 	     static chain pointer in third argument.  */
 	  if (local_regparm == 3
-	      && (decl_function_context (decl)
-                  || ix86_force_align_arg_pointer)
+	      && decl_function_context (decl)
 	      && !DECL_NO_STATIC_CHAIN (decl))
 	    local_regparm = 2;
 
@@ -3312,13 +3331,11 @@ ix86_function_regparm (const_tree type, 
 	     the callee DECL_STRUCT_FUNCTION is gone, so we fall back to
 	     scanning the attributes for the self-realigning property.  */
 	  f = DECL_STRUCT_FUNCTION (decl);
-	  if (local_regparm == 3
-	      && (f ? !!f->machine->force_align_arg_pointer
-		  : !!lookup_attribute (ix86_force_align_arg_pointer_string,
-					TYPE_ATTRIBUTES (TREE_TYPE (decl)))))
-	    local_regparm = 2;
+          /* Since the current internal arg pointer won't conflict
+	     with parameter passing regs, there is no need to change
+	     stack realignment and adjust the regparm number.
 
-	  /* Each fixed register usage increases register pressure,
+	     Each fixed register usage increases register pressure,
 	     so less registers should be used for argument passing.
 	     This functionality can be overriden by an explicit
 	     regparm value.  */
@@ -5053,14 +5070,6 @@ setup_incoming_varargs_64 (CUMULATIVE_AR
 
   /* Indicate to allocate space on the stack for varargs save area.  */
   ix86_save_varrargs_registers = 1;
-  /* We need 16-byte stack alignment to save SSE registers.  If user
-     asked for lower preferred_stack_boundary, lets just hope that he knows
-     what he is doing and won't varargs SSE values.
-
-     We also may end up assuming that only 64bit values are stored in SSE
-     register let some floating point program work.  */
-  if (ix86_preferred_stack_boundary >= BIGGEST_ALIGNMENT)
-    cfun->stack_alignment_needed = BIGGEST_ALIGNMENT;
 
   save_area = frame_pointer_rtx;
   set = get_varargs_alias_set ();
@@ -5228,7 +5237,7 @@ ix86_va_start (tree valist, rtx nextarg)
 
   /* Find the overflow area.  */
   type = TREE_TYPE (ovf);
-  t = make_tree (type, virtual_incoming_args_rtx);
+  t = make_tree (type, crtl->args.internal_arg_pointer);
   if (words != 0)
     t = build2 (POINTER_PLUS_EXPR, type, t,
 	        size_int (words * UNITS_PER_WORD));
@@ -5996,9 +6005,14 @@ ix86_select_alt_pic_regnum (void)
   if (current_function_is_leaf && !current_function_profile
       && !ix86_current_function_calls_tls_descriptor)
     {
-      int i;
+      int i, drap;
+      /* Can't use the same register for both PIC and DRAP.  */
+      if (crtl->drap_reg)
+	drap = REGNO (crtl->drap_reg);
+      else
+	drap = -1;
       for (i = 2; i >= 0; --i)
-        if (!df_regs_ever_live_p (i))
+        if (i != drap && !df_regs_ever_live_p (i))
 	  return i;
     }
 
@@ -6034,8 +6048,8 @@ ix86_save_reg (unsigned int regno, int m
 	}
     }
 
-  if (cfun->machine->force_align_arg_pointer
-      && regno == REGNO (cfun->machine->force_align_arg_pointer))
+  if (crtl->drap_reg
+      && regno == REGNO (crtl->drap_reg))
     return 1;
 
   return (df_regs_ever_live_p (regno)
@@ -6101,6 +6115,9 @@ ix86_compute_frame_layout (struct ix86_f
   stack_alignment_needed = cfun->stack_alignment_needed / BITS_PER_UNIT;
   preferred_alignment = cfun->preferred_stack_boundary / BITS_PER_UNIT;
 
+  gcc_assert (!size || stack_alignment_needed);
+  gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
+
   /* During reload iteration the amount of registers saved can change.
      Recompute the value as needed.  Do not recompute when amount of registers
      didn't change as reload does multiple calls to the function and does not
@@ -6143,18 +6160,9 @@ ix86_compute_frame_layout (struct ix86_f
 
   frame->hard_frame_pointer_offset = offset;
 
-  /* Do some sanity checking of stack_alignment_needed and
-     preferred_alignment, since i386 port is the only using those features
-     that may break easily.  */
-
-  gcc_assert (!size || stack_alignment_needed);
-  gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
-  gcc_assert (preferred_alignment <= PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT);
-  gcc_assert (stack_alignment_needed
-	      <= PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT);
-
-  if (stack_alignment_needed < STACK_BOUNDARY / BITS_PER_UNIT)
-    stack_alignment_needed = STACK_BOUNDARY / BITS_PER_UNIT;
+  /* Align the offset, because the realigned frame starts from here.  */
+  if (stack_realign_fp)
+    offset = (offset + stack_alignment_needed -1) & -stack_alignment_needed;
 
   /* Register save area */
   offset += frame->nregs * UNITS_PER_WORD;
@@ -6320,35 +6328,129 @@ pro_epilogue_adjust_stack (rtx dest, rtx
     RTX_FRAME_RELATED_P (insn) = 1;
 }
 
+/* Find an available register to be used as the dynamic realign
+   argument pointer register.  Such a register will be written in the
+   prologue and used at the beginning of the body, so it must not be
+	1. a parameter passing register.
+	2. the GOT pointer.
+   For i386, we use CX if it is not used to pass parameters.  Otherwise
+   we just pick DI.
+   For x86_64, we just pick R13 directly.
+
+   Return: the regno of the chosen register.  */
+
+static unsigned int 
+find_drap_reg (void)
+{
+  int param_reg_num;
+
+  if (TARGET_64BIT)
+    return R13_REG;
+
+  /* Use DI for a nested function or a function that needs a static chain.  */
+  if (decl_function_context (cfun->decl)
+      && !DECL_NO_STATIC_CHAIN (cfun->decl))
+    return DI_REG;
+
+  if (cfun->tail_call_emit)
+    return DI_REG;
+
+  param_reg_num = ix86_function_regparm (TREE_TYPE (cfun->decl),
+					 cfun->decl);
+
+  if (param_reg_num <= 2
+      && !lookup_attribute ("fastcall",
+			    TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl))))
+    return CX_REG;
+
+  return DI_REG;
+}
+
 /* Handle the TARGET_INTERNAL_ARG_POINTER hook.  */
 
 static rtx
 ix86_internal_arg_pointer (void)
 {
-  bool has_force_align_arg_pointer =
-    (0 != lookup_attribute (ix86_force_align_arg_pointer_string,
-			    TYPE_ATTRIBUTES (TREE_TYPE (current_function_decl))));
-  if ((FORCE_PREFERRED_STACK_BOUNDARY_IN_MAIN
-       && DECL_NAME (current_function_decl)
-       && MAIN_NAME_P (DECL_NAME (current_function_decl))
-       && DECL_FILE_SCOPE_P (current_function_decl))
-      || ix86_force_align_arg_pointer
-      || has_force_align_arg_pointer)
-    {
-      /* Nested functions can't realign the stack due to a register
-	 conflict.  */
-      if (DECL_CONTEXT (current_function_decl)
-	  && TREE_CODE (DECL_CONTEXT (current_function_decl)) == FUNCTION_DECL)
-	{
-	  if (ix86_force_align_arg_pointer)
-	    warning (0, "-mstackrealign ignored for nested functions");
-	  if (has_force_align_arg_pointer)
-	    error ("%s not supported for nested functions",
-		   ix86_force_align_arg_pointer_string);
-	  return virtual_incoming_args_rtx;
-	}
-      cfun->machine->force_align_arg_pointer = gen_rtx_REG (Pmode, CX_REG);
-      return copy_to_reg (cfun->machine->force_align_arg_pointer);
+  /* If called in "expand" pass, currently_expanding_to_rtl will
+     be true */
+  if (currently_expanding_to_rtl) 
+    return virtual_incoming_args_rtx;
+
+  /* Prefer the one specified at command line. */
+  ix86_incoming_stack_boundary 
+    = (ix86_user_incoming_stack_boundary
+       ? ix86_user_incoming_stack_boundary
+       : ix86_default_incoming_stack_boundary);
+
+  /* Current stack realign doesn't support eh_return.  Assume a
+     function that calls eh_return is aligned.  There will be a sanity
+     check later if stack realign happens together with eh_return.  */
+  if (current_function_calls_eh_return)
+    ix86_incoming_stack_boundary = PREFERRED_STACK_BOUNDARY;
+
+  /* Incoming stack alignment can be changed on individual functions
+     via force_align_arg_pointer attribute.  We use the smallest
+     incoming stack boundary.  */
+  if (ix86_incoming_stack_boundary > ABI_STACK_BOUNDARY
+      && lookup_attribute (ix86_force_align_arg_pointer_string,
+			   TYPE_ATTRIBUTES (TREE_TYPE (current_function_decl))))
+    ix86_incoming_stack_boundary = ABI_STACK_BOUNDARY;
+
+  /* Stack at entrance of main is aligned by runtime.  We use the
+     smallest incoming stack boundary. */
+  if (ix86_incoming_stack_boundary > MAIN_STACK_BOUNDARY
+      && DECL_NAME (current_function_decl)
+      && MAIN_NAME_P (DECL_NAME (current_function_decl))
+      && DECL_FILE_SCOPE_P (current_function_decl))
+    ix86_incoming_stack_boundary = MAIN_STACK_BOUNDARY;
+
+  gcc_assert (cfun->stack_alignment_needed 
+              <= cfun->stack_alignment_estimated);
+
+  /* x86_64 vararg needs 16byte stack alignment for register save
+     area.  */
+  if (TARGET_64BIT
+      && current_function_stdarg
+      && cfun->stack_alignment_estimated < 128)
+    cfun->stack_alignment_estimated = 128;
+
+  /* Update cfun->stack_alignment_estimated and use it later to align
+     stack.  FIXME: How to optimize for leaf function?  */
+  if (PREFERRED_STACK_BOUNDARY > cfun->stack_alignment_estimated)
+    cfun->stack_alignment_estimated = PREFERRED_STACK_BOUNDARY;
+  if (PREFERRED_STACK_BOUNDARY > cfun->stack_alignment_needed)
+    cfun->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
+
+  cfun->stack_realign_needed
+    = ix86_incoming_stack_boundary < cfun->stack_alignment_estimated;
+
+  cfun->stack_realign_processed = true;
+
+  if (ix86_force_drap
+      || !ACCUMULATE_OUTGOING_ARGS)
+    cfun->need_drap = true;
+
+  if (stack_realign_drap)
+    {
+      /* Assign DRAP to vDRAP and return vDRAP.  */
+      unsigned int regno = find_drap_reg ();
+      rtx drap_vreg;
+      rtx arg_ptr;
+      rtx seq;
+
+      if (regno != CX_REG)
+	cfun->save_param_ptr_reg = true;
+
+      arg_ptr = gen_rtx_REG (Pmode, regno);
+      crtl->drap_reg = arg_ptr;
+
+      start_sequence ();
+      drap_vreg = copy_to_reg(arg_ptr);
+      seq = get_insns ();
+      end_sequence ();
+      
+      emit_insn_before (seq, NEXT_INSN (entry_of_function ()));
+      return drap_vreg;
     }
   else
     return virtual_incoming_args_rtx;
@@ -6387,53 +6489,64 @@ ix86_expand_prologue (void)
   bool pic_reg_used;
   struct ix86_frame frame;
   HOST_WIDE_INT allocate;
+  rtx (*gen_andsp) (rtx, rtx, rtx);
+
+  /* DRAP should not coexist with stack_realign_fp */
+  gcc_assert (!(crtl->drap_reg && stack_realign_fp));
+
+  /* Check if stack realign is really needed after reload, and
+     store the result in cfun.  */
+  cfun->stack_realign_really = (ix86_incoming_stack_boundary
+				< (current_function_is_leaf
+				   ? cfun->stack_alignment_used
+				   : cfun->stack_alignment_needed));
+
+  cfun->stack_realign_finalized = true;
 
   ix86_compute_frame_layout (&frame);
 
-  if (cfun->machine->force_align_arg_pointer)
+  /* Emit prologue code to adjust stack alignment and set up DRAP, in case
+     DRAP is needed and stack realignment is really needed after reload.  */
+  if (crtl->drap_reg && cfun->stack_realign_really)
     {
       rtx x, y;
+      int align_bytes = cfun->stack_alignment_needed / BITS_PER_UNIT;
+      int param_ptr_offset = (cfun->save_param_ptr_reg
+			      ?  STACK_BOUNDARY / BITS_PER_UNIT : 0);
+
+      gcc_assert (stack_realign_drap);
 
       /* Grab the argument pointer.  */
-      x = plus_constant (stack_pointer_rtx, 4);
-      y = cfun->machine->force_align_arg_pointer;
-      insn = emit_insn (gen_rtx_SET (VOIDmode, y, x));
-      RTX_FRAME_RELATED_P (insn) = 1;
+      x = plus_constant (stack_pointer_rtx, 
+                         (STACK_BOUNDARY / BITS_PER_UNIT 
+			  + param_ptr_offset));
+      y = crtl->drap_reg;
+
+      /* Only need to push parameter pointer reg if it is caller
+	 saved reg */
+      if (cfun->save_param_ptr_reg)
+	{
+	  /* Push arg pointer reg */
+	  insn = emit_insn (gen_push (y));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
 
-      /* The unwind info consists of two parts: install the fafp as the cfa,
-	 and record the fafp as the "save register" of the stack pointer.
-	 The later is there in order that the unwinder can see where it
-	 should restore the stack pointer across the and insn.  */
-      x = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, const0_rtx), UNSPEC_DEF_CFA);
-      x = gen_rtx_SET (VOIDmode, y, x);
-      RTX_FRAME_RELATED_P (x) = 1;
-      y = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, stack_pointer_rtx),
-			  UNSPEC_REG_SAVE);
-      y = gen_rtx_SET (VOIDmode, cfun->machine->force_align_arg_pointer, y);
-      RTX_FRAME_RELATED_P (y) = 1;
-      x = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, x, y));
-      x = gen_rtx_EXPR_LIST (REG_FRAME_RELATED_EXPR, x, NULL);
-      REG_NOTES (insn) = x;
+      insn = emit_insn (gen_rtx_SET (VOIDmode, y, x));
+      RTX_FRAME_RELATED_P (insn) = 1; 
 
+      gen_andsp = TARGET_64BIT ? gen_anddi3 : gen_andsi3;
       /* Align the stack.  */
-      emit_insn (gen_andsi3 (stack_pointer_rtx, stack_pointer_rtx,
-			     GEN_INT (-16)));
+      insn = emit_insn ((*gen_andsp) (stack_pointer_rtx,
+				  stack_pointer_rtx,
+				  GEN_INT (-align_bytes)));
+      RTX_FRAME_RELATED_P (insn) = 1;
 
-      /* And here we cheat like madmen with the unwind info.  We force the
-	 cfa register back to sp+4, which is exactly what it was at the
-	 start of the function.  Re-pushing the return address results in
-	 the return at the same spot relative to the cfa, and thus is
-	 correct wrt the unwind info.  */
-      x = cfun->machine->force_align_arg_pointer;
-      x = gen_frame_mem (Pmode, plus_constant (x, -4));
+      x = crtl->drap_reg;
+      x = gen_frame_mem (Pmode,
+                         plus_constant (x,
+					-(STACK_BOUNDARY / BITS_PER_UNIT)));
       insn = emit_insn (gen_push (x));
       RTX_FRAME_RELATED_P (insn) = 1;
-
-      x = GEN_INT (4);
-      x = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, x), UNSPEC_DEF_CFA);
-      x = gen_rtx_SET (VOIDmode, stack_pointer_rtx, x);
-      x = gen_rtx_EXPR_LIST (REG_FRAME_RELATED_EXPR, x, NULL);
-      REG_NOTES (insn) = x;
     }
 
   /* Note: AT&T enter does NOT have reversed args.  Enter is probably
@@ -6448,6 +6561,19 @@ ix86_expand_prologue (void)
       RTX_FRAME_RELATED_P (insn) = 1;
     }
 
+  if (stack_realign_fp && cfun->stack_realign_really)
+    {
+      int align_bytes = cfun->stack_alignment_needed / BITS_PER_UNIT;
+      gcc_assert (align_bytes > STACK_BOUNDARY / BITS_PER_UNIT);
+
+      gen_andsp = TARGET_64BIT ? gen_anddi3 : gen_andsi3;
+      /* Align the stack.  */
+      insn = emit_insn ((*gen_andsp) (stack_pointer_rtx,
+				      stack_pointer_rtx,
+				      GEN_INT (-align_bytes)));
+      RTX_FRAME_RELATED_P (insn) = 1;
+    }
+
   allocate = frame.to_allocate;
 
   if (!frame.save_regs_using_mov)
@@ -6462,7 +6588,9 @@ ix86_expand_prologue (void)
      a red zone location */
   if (TARGET_RED_ZONE && frame.save_regs_using_mov
       && (! TARGET_STACK_PROBE || allocate < CHECK_STACK_LIMIT))
-    ix86_emit_save_regs_using_mov (frame_pointer_needed ? hard_frame_pointer_rtx
+    ix86_emit_save_regs_using_mov ((frame_pointer_needed
+				     && !cfun->stack_realign_really) 
+                                   ? hard_frame_pointer_rtx
 				   : stack_pointer_rtx,
 				   -frame.nregs * UNITS_PER_WORD);
 
@@ -6521,8 +6649,11 @@ ix86_expand_prologue (void)
       && !(TARGET_RED_ZONE
          && (! TARGET_STACK_PROBE || allocate < CHECK_STACK_LIMIT)))
     {
-      if (!frame_pointer_needed || !frame.to_allocate)
-        ix86_emit_save_regs_using_mov (stack_pointer_rtx, frame.to_allocate);
+      if (!frame_pointer_needed
+	  || !frame.to_allocate
+	  || cfun->stack_realign_really)
+        ix86_emit_save_regs_using_mov (stack_pointer_rtx,
+				       frame.to_allocate);
       else
         ix86_emit_save_regs_using_mov (hard_frame_pointer_rtx,
 				       -frame.nregs * UNITS_PER_WORD);
@@ -6572,6 +6703,16 @@ ix86_expand_prologue (void)
 	emit_insn (gen_prologue_use (pic_offset_table_rtx));
       emit_insn (gen_blockage ());
     }
+
+  if (crtl->drap_reg && !cfun->stack_realign_really)
+    {
+      /* vDRAP is set up, but after reload it turns out that stack
+         realignment isn't necessary.  Emit prologue code to set up
+         DRAP without adjusting the stack alignment.  */
+      int drap_bp_offset = STACK_BOUNDARY / BITS_PER_UNIT * 2;
+      rtx x = plus_constant (hard_frame_pointer_rtx, drap_bp_offset);
+      insn = emit_insn (gen_rtx_SET (VOIDmode, crtl->drap_reg, x));
+    }
 }
 
 /* Emit code to restore saved registers using MOV insns.  First register
@@ -6610,7 +6751,10 @@ void
 ix86_expand_epilogue (int style)
 {
   int regno;
-  int sp_valid = !frame_pointer_needed || current_function_sp_is_unchanging;
+  /* When stack realign may happen, SP must be valid.  */
+  int sp_valid = (!frame_pointer_needed
+		  || current_function_sp_is_unchanging
+		  || (stack_realign_fp && cfun->stack_realign_really));
   struct ix86_frame frame;
   HOST_WIDE_INT offset;
 
@@ -6647,11 +6791,16 @@ ix86_expand_epilogue (int style)
     {
       /* Restore registers.  We can use ebp or esp to address the memory
 	 locations.  If both are available, default to ebp, since offsets
-	 are known to be small.  Only exception is esp pointing directly to the
-	 end of block of saved registers, where we may simplify addressing
-	 mode.  */
-
-      if (!frame_pointer_needed || (sp_valid && !frame.to_allocate))
+	 are known to be small.  Only exception is esp pointing directly
+	 to the end of block of saved registers, where we may simplify
+	 addressing mode.  
+
+	 If we are realigning stack with bp and sp, regs restore can't
+	 be addressed by bp. sp must be used instead.  */
+
+      if (!frame_pointer_needed
+	  || (sp_valid && !frame.to_allocate) 
+	  || (stack_realign_fp && cfun->stack_realign_really))
 	ix86_emit_restore_regs_using_mov (stack_pointer_rtx,
 					  frame.to_allocate, style == 2);
       else
@@ -6663,6 +6812,10 @@ ix86_expand_epilogue (int style)
 	{
 	  rtx tmp, sa = EH_RETURN_STACKADJ_RTX;
 
+	  if (cfun->stack_realign_really)
+	    {
+	      error("Stack realign has conflict with eh_return");
+	    }
 	  if (frame_pointer_needed)
 	    {
 	      tmp = gen_rtx_PLUS (Pmode, hard_frame_pointer_rtx, sa);
@@ -6706,10 +6859,16 @@ ix86_expand_epilogue (int style)
   else
     {
       /* First step is to deallocate the stack frame so that we can
-	 pop the registers.  */
+	 pop the registers.
+
+	 If we realign stack with frame pointer, then stack pointer
+         won't be able to recover via lea $offset(%bp), %sp, because
+         there is a padding area between bp and sp for realign. 
+         "add $to_allocate, %sp" must be used instead.  */
       if (!sp_valid)
 	{
 	  gcc_assert (frame_pointer_needed);
+          gcc_assert (!(stack_realign_fp && cfun->stack_realign_really));
 	  pro_epilogue_adjust_stack (stack_pointer_rtx,
 				     hard_frame_pointer_rtx,
 				     GEN_INT (offset), style);
@@ -6732,18 +6891,47 @@ ix86_expand_epilogue (int style)
 	     able to grok it fast.  */
 	  if (TARGET_USE_LEAVE)
 	    emit_insn (TARGET_64BIT ? gen_leave_rex64 () : gen_leave ());
-	  else if (TARGET_64BIT)
-	    emit_insn (gen_popdi1 (hard_frame_pointer_rtx));
-	  else
-	    emit_insn (gen_popsi1 (hard_frame_pointer_rtx));
+	  else 
+            {
+              /* When stack realignment really happens and leave is
+                 not used, the stack pointer must first be restored
+                 from the hard frame pointer.  */
+              if (stack_realign_fp && cfun->stack_realign_really)
+		pro_epilogue_adjust_stack (stack_pointer_rtx,
+					   hard_frame_pointer_rtx,
+					   const0_rtx, style);
+              if (TARGET_64BIT)
+                emit_insn (gen_popdi1 (hard_frame_pointer_rtx));
+              else
+                emit_insn (gen_popsi1 (hard_frame_pointer_rtx));
+            }
 	}
     }
 
-  if (cfun->machine->force_align_arg_pointer)
+  if (crtl->drap_reg && cfun->stack_realign_really)
     {
-      emit_insn (gen_addsi3 (stack_pointer_rtx,
-			     cfun->machine->force_align_arg_pointer,
-			     GEN_INT (-4)));
+      int param_ptr_offset = (cfun->save_param_ptr_reg
+			      ? STACK_BOUNDARY / BITS_PER_UNIT : 0);
+      gcc_assert (stack_realign_drap);
+      if (TARGET_64BIT)
+        {
+          emit_insn (gen_adddi3 (stack_pointer_rtx,
+				 crtl->drap_reg,
+				 GEN_INT (-(STACK_BOUNDARY / BITS_PER_UNIT
+					    + param_ptr_offset))));
+          if (cfun->save_param_ptr_reg)
+            emit_insn (gen_popdi1 (crtl->drap_reg));
+        }
+      else
+        {
+          emit_insn (gen_addsi3 (stack_pointer_rtx,
+				 crtl->drap_reg,
+				 GEN_INT (-(STACK_BOUNDARY / BITS_PER_UNIT 
+					    + param_ptr_offset))));
+          if (cfun->save_param_ptr_reg)
+            emit_insn (gen_popsi1 (crtl->drap_reg));
+        }
+      
     }
 
   /* Sibcall epilogues don't want a return instruction.  */
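
For readers following the prologue changes above, here is a minimal C sketch of the two pieces of arithmetic involved: the post-reload test stored in stack_realign_really, and the AND that rounds the stack pointer down to the required boundary. The struct, its field names and the helper names are hypothetical stand-ins for illustration (they are not GCC's data structures); boundaries are in bits, as in the patch, and BITS_PER_UNIT is assumed to be 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_UNIT 8  /* assumption: 8-bit bytes, as on x86 */

/* Hypothetical stand-in for the per-function alignment state.  */
struct align_info
{
  unsigned incoming_boundary;   /* bits guaranteed on entry */
  unsigned alignment_needed;    /* bits required by the frame */
  unsigned alignment_used;      /* bits actually used by stack slots */
  bool is_leaf;
};

/* Mirror of the stack_realign_really test: realign only when the
   incoming boundary is weaker than what the function requires.  A leaf
   function only has to satisfy the alignment it actually uses.  */
bool
realign_really (const struct align_info *ai)
{
  unsigned required = ai->is_leaf ? ai->alignment_used : ai->alignment_needed;
  return ai->incoming_boundary < required;
}

/* Effect of "and $-align_bytes, %sp": round SP down to a multiple of
   align_bytes, which must be a power of two.  */
uintptr_t
align_stack_pointer (uintptr_t sp, unsigned align_bits)
{
  uintptr_t align_bytes = align_bits / BITS_PER_UNIT;
  assert (align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
  return sp & ~(align_bytes - 1);   /* same bit pattern as -align_bytes */
}
```

With a 32-bit incoming boundary and a 128-bit requirement, a non-leaf function is realigned, while a leaf function that only uses 32-bit slots is not.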

* RE: [RFA]: Merge stack alignment branch
  2008-04-04 19:05 ` Jan Hubicka
  2008-04-04 21:05   ` H.J. Lu
  2008-04-08  1:57   ` Ye, Joey
@ 2008-04-11 12:32   ` Ye, Joey
  2 siblings, 0 replies; 26+ messages in thread
From: Ye, Joey @ 2008-04-11 12:32 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

[-- Attachment #1: Type: text/plain, Size: 8640 bytes --]

Jan,

> Index: tree-inline.c
> ...
> I think it is a mistake to maintain info about stack alignment during
> gimple transformations.  At expansion time we walk the list and we can
> figure out the alignment after some of the variables have possibly
> been optimized out.
Removed from tree-inline.c

> The info can also go into RTL datastructures I am trying to introduce
> instead of cfun then. 
Newly introduced fields are all moved to RTL data structures (function,
rtl_data) now.

Overall, these updated patches remove a redundant pass introduced in the
previous version, clean up the changes in the tree passes, and work with
the new RTL data structures.

2008-04-11  Uros Bizjak  <ubizjak@gmail.com>
	    H.J. Lu  <hongjiu.lu@intel.com>

	PR target/12329
	* config/i386/i386.c (ix86_function_regparm): Limit the number of
	register passing arguments to 2 for nested functions.

2008-04-11  Joey Ye  <joey.ye@intel.com>
	    H.J. Lu  <hongjiu.lu@intel.com>
	    Xuepeng Guo  <xuepeng.guo@intel.com>

	* builtins.c (expand_builtin_setjmp_receiver): Replace
	virtual_incoming_args_rtx with
	crtl->args.internal_arg_pointer.
	(expand_builtin_apply_args_1): Likewise.
	(expand_builtin_longjmp): DRAP will be needed if some builtins are
	called.
	(expand_builtin_apply): Likewise.

	* calls.c (expand_call): Don't calculate preferred stack
	boundary according to incoming stack boundary. Replace 
	virtual_incoming_args_rtx with
	crtl->args.internal_arg_pointer.
	(emit_call_1): DRAP will be needed if return pops.

	* emit-rtl.c (gen_reg_rtx): Estimate stack alignment when generating
	virtual registers.

	* cfgexpand.c (get_decl_align_unit): Estimate stack variable
	alignment and store to stack_alignment_estimated and
	stack_alignment_used.
	(expand_one_var): Likewise.
	(gate_handle_drap): Gate new pass pass_handle_drap.
	(handle_drap): Execute new pass pass_handle_drap.
	(pass_handle_drap): Define new pass.

	* defaults.h (MAX_VECTORIZE_STACK_ALIGNMENT): New.

	* flags.h (frame_pointer_needed): Removed.
	* final.c (frame_pointer_needed): Likewise.

	* function.c (assign_stack_local_1): Estimate stack variable 
	alignment and store to stack_alignment_estimated.
	(instantiate_new_reg): Instantiate virtual incoming args rtx to
	vDRAP if stack realignment and DRAP is needed.
	(assign_parms): Collect parameter/return type alignment and 
	contribute to stack_alignment_estimated.
	(locate_and_pad_parm): Likewise.
	(allocate_struct_function): Init stack_alignment_estimated and
	stack_alignment_used.
	(get_arg_pointer_save_area): Replace virtual_incoming_args_rtx
	with crtl->args.internal_arg_pointer.

	* function.h (function): Add new field stack_alignment_estimated,
	need_frame_pointer, need_frame_pointer_set, stack_realign_needed,
	stack_realign_really, need_drap, save_param_ptr_reg,
	stack_realign_processed, stack_realign_finalized and
	stack_realign_used.
	(rtl_data): Add new field drap_reg. 
	(frame_pointer_needed): New.
	(stack_realign_fp): Likewise.
	(stack_realign_drap): Likewise.

	* global.c (compute_regsets): Set frame_pointer_needed cannot_elim
	wrt stack_realign_needed.

	* stmt.c (expand_nl_goto_receiver): Replace 
	virtual_incoming_args_rtx with
	crtl->args.internal_arg_pointer.

	* passes.c (pass_handle_drap): Insert this new pass immediately
	after expand.

	* tree-inline.c (expand_call_inline): Estimate stack variable
	alignment and store to stack_alignment_estimated.

	* tree-pass.h (pass_handle_drap): New.

	* tree-vectorizer.c (vect_can_force_dr_alignment_p): Return
	true if alignment of variable on stack is less than or
	equal to MAX_VECTORIZE_STACK_ALIGNMENT.

	* reload1.c (set_label_offsets): Assert that frame pointer must be
	eliminated to stack pointer in case stack realignment is estimated
	to happen without DRAP.
	(elimination_effects): Likewise.
	(eliminate_regs_in_insn): Likewise.
	(mark_not_eliminable): Likewise.
	(update_eliminables): Frame pointer is needed in case of stack
	realignment needed.
	(init_elim_table): Don't set frame_pointer_needed here.

	* dwarf2out.c (CUR_FDE): New.
	(reg_save_with_expression): Likewise.
	(dw_fde_struct): Add drap_regnum, stack_realignment,
	is_stack_realign, is_drap and is_drap_reg_saved.
	(add_cfi): If stack is realigned, call reg_save_with_expression
	to represent the location of stored vars.
	(dwarf2out_frame_debug_expr): Add rules 16-19 to handle stack
	realign.
	(output_cfa_loc): Handle DW_CFA_expression.
	(based_loc_descr): Update assert for stack realign.

	* config/i386/i386.c (ix86_force_align_arg_pointer_string): Break
	long line.
	(ix86_user_incoming_stack_boundary): New.
	(ix86_default_incoming_stack_boundary): Likewise.
	(ix86_incoming_stack_boundary): Likewise.
	(find_drap_reg): Likewise.
	(override_options): Override option value for new options.
	(ix86_function_ok_for_sibcall): Sibcall is OK even if the stack
	needs realigning.
	(ix86_handle_cconv_attribute): Stack realign no longer impacts
	number of regparm.
	(ix86_function_regparm): Likewise.
	(setup_incoming_varargs_64): Remove the logic to set
	stack_alignment_needed here.
	(ix86_va_start): Replace virtual_incoming_args_rtx with
	crtl->args.internal_arg_pointer.
	(ix86_save_reg): Replace force_align_arg_pointer with drap_reg.
	(ix86_compute_frame_layout): Compute frame layout wrt stack
	realignment.
	(ix86_internal_arg_pointer): Estimate if stack realignment is
	needed and return the appropriate arg pointer rtx accordingly.
	(ix86_expand_prologue): Finally decide if stack realignment
	is needed and generate prologue code accordingly.
	(ix86_expand_epilogue): Generate epilogue code according to
	whether stack realignment is really needed.
	* config/i386/i386.c (ix86_select_alt_pic_regnum): Check
	DRAP register.
	
	* config/i386/i386.h (MAIN_STACK_BOUNDARY): New.
	(ABI_STACK_BOUNDARY): Likewise.
	(PREFERRED_STACK_BOUNDARY_DEFAULT): Likewise.
	(STACK_REALIGN_DEFAULT): Likewise.
	(INCOMING_STACK_BOUNDARY): Likewise.
	(MAX_VECTORIZE_STACK_ALIGNMENT): Likewise.
	(ix86_incoming_stack_boundary): Likewise.
	(REAL_PIC_OFFSET_TABLE_REGNUM): Updated to use BX_REG.
	(CAN_ELIMINATE): Redefine the macro to eliminate frame pointer to
	stack pointer and arg pointer to hard frame pointer in case of
	stack realignment without DRAP.
	(machine_function): Remove force_align_arg_pointer.

	* config/i386/i386.md (BX_REG): New.
	(R13_REG): Likewise.

	* config/i386/i386.opt (mforce_drap): New.
	(mincoming-stack-boundary): Likewise.
	(mstackrealign): Updated.

	* doc/extend.texi: Update force_align_arg_pointer.
	* doc/invoke.texi: Document -mincoming-stack-boundary.  Update
	-mstackrealign.
	

Thanks - Joey

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Saturday, April 05, 2008 2:39 AM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

Hi,
I will look at the patch in detail later this weekend.  I think it would
make sense to break the necessary changes to the generic bits out into
separate patches (in -x -cp format).  This will ease the review process,
since I, for instance, can't approve the non-i386-specific bits of your patch.

Index: tree-inline.c
===================================================================
--- tree-inline.c	(.../trunk/gcc)	(revision 133813)
+++ tree-inline.c	(.../branches/stack/gcc)	(revision 133869)
@@ -2841,8 +2841,26 @@
 	cfun->unexpanded_var_list = tree_cons (NULL_TREE, var,
 					       cfun->unexpanded_var_list);
       else
-	cfun->unexpanded_var_list = tree_cons (NULL_TREE, remap_decl (var, id),
-					       cfun->unexpanded_var_list);
+	{
+	  /* Update stack alignment requirement if needed.  */
+	  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+	    {
+	      unsigned int align;
+
+	      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+		align = TYPE_ALIGN (TREE_TYPE (var));
+	      else
+		align = DECL_ALIGN (var);
+	      if (align  > cfun->stack_alignment_estimated)
+		{
+		  gcc_assert(!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	  cfun->unexpanded_var_list
+	    = tree_cons (NULL_TREE, remap_decl (var, id),
+			 cfun->unexpanded_var_list);
+	}

I think it is a mistake to maintain info about stack alignment during
gimple transformations.  At expansion time we walk the list and we can
figure out the alignment after some of the variables have possibly
been optimized out.

The info can then also go into the RTL data structures I am trying to
introduce, instead of cfun.
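
A rough sketch of the approach suggested here: instead of updating the estimate each time the inliner remaps a variable, walk the final variable list once at expansion time and take the largest alignment among the variables that survived. The struct and function names below are hypothetical and only model the idea; they are not GCC's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one entry in the unexpanded variable list.  */
struct stack_var
{
  unsigned align_bits;   /* alignment the variable asks for, in bits */
  bool optimized_out;    /* true if the variable no longer needs a slot */
};

/* Walk the list once and return the largest alignment still required,
   never going below the minimum passed in by the caller.  */
unsigned
estimate_stack_alignment (const struct stack_var *vars, size_t n,
                          unsigned min_align_bits)
{
  unsigned align = min_align_bits;

  assert (vars != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (!vars[i].optimized_out && vars[i].align_bits > align)
      align = vars[i].align_bits;
  return align;
}
```

A variable that asked for 256-bit alignment but was optimized out no longer raises the estimate, which is the benefit of deferring the computation.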

Honza

[-- Attachment #2: stack-align-dwarf2-0411.patch --]
[-- Type: application/octet-stream, Size: 10625 bytes --]

Index: dwarf2out.c
===================================================================
--- dwarf2out.c	(.../trunk/gcc)	(revision 134098)
+++ dwarf2out.c	(.../branches/stack/gcc)	(revision 134150)
@@ -110,6 +110,9 @@ static void dwarf2out_source_line (unsig
 #define DWARF2_FRAME_REG_OUT(REGNO, FOR_EH) (REGNO)
 #endif
 
+/* Define the current fde_table entry we should use. */
+#define CUR_FDE fde_table[fde_table_in_use - 1]
+
 /* Decide whether we want to emit frame unwind information for the current
    translation unit.  */
 
@@ -239,9 +242,18 @@ typedef struct dw_fde_struct GTY(())
   bool dw_fde_switched_sections;
   dw_cfi_ref dw_fde_cfi;
   unsigned funcdef_number;
+  /* If it is drap, which register is employed. */
+  int drap_regnum;
+  HOST_WIDE_INT stack_realignment;
   unsigned all_throwers_are_sibcalls : 1;
   unsigned nothrow : 1;
   unsigned uses_eh_lsda : 1;
+  /* Whether we did stack realign in this call frame.*/
+  unsigned is_stack_realign : 1;
+  /* Whether stack realign is drap. */
+  unsigned is_drap : 1;
+  /* Whether we saved this drap register. */
+  unsigned is_drap_reg_saved : 1;
 }
 dw_fde_node;
 
@@ -381,6 +393,7 @@ static void get_cfa_from_loc_descr (dw_c
 static struct dw_loc_descr_struct *build_cfa_loc
   (dw_cfa_location *, HOST_WIDE_INT);
 static void def_cfa_1 (const char *, dw_cfa_location *);
+static void reg_save_with_expression (dw_cfi_ref);
 
 /* How to start an assembler comment.  */
 #ifndef ASM_COMMENT_START
@@ -618,6 +631,13 @@ add_cfi (dw_cfi_ref *list_head, dw_cfi_r
   for (p = list_head; (*p) != NULL; p = &(*p)->dw_cfi_next)
     ;
 
+  /* If stack is realigned, accessing the stored register via CFA+offset will
+     be invalid. Here we will use a series of expressions in dwarf2 to simulate
+     the stack realign and represent the location of the stored register. */
+  if (fde_table_in_use && (CUR_FDE.is_stack_realign || CUR_FDE.is_drap) 
+      && cfi->dw_cfi_opc == DW_CFA_offset)
+    reg_save_with_expression (cfi);
+
   *p = cfi;
 }
 
@@ -1435,6 +1455,10 @@ static dw_cfa_location cfa_temp;
   Rules 10-14: Save a register to the stack.  Define offset as the
 	       difference of the original location and cfa_store's
 	       location (or cfa_temp's location if cfa_temp is used).
+  
+  Rules 16-19: If AND operation happens on sp in prologue, we assume stack is
+               realigned. We will use a group of DW_OP_?? expressions to represent
+               the location of the stored register instead of CFA+offset.
 
   The Rules
 
@@ -1529,7 +1553,32 @@ static dw_cfa_location cfa_temp;
 
   Rule 15:
   (set <reg> {unspec, unspec_volatile})
-  effects: target-dependent  */
+  effects: target-dependent  
+  
+  Rule 16:
+  (set sp (and: sp <const_int>))
+  effects: CUR_FDE.is_stack_realign = 1
+           cfa_store.offset = 0
+
+           if cfa_store.offset >= UNITS_PER_WORD
+             effects: CUR_FDE.is_drap_reg_saved = 1
+
+  Rule 17:
+  (set (mem ({pre_inc, pre_dec} sp)) (mem (plus (cfa.reg) (const_int))))
+  effects: cfa_store.offset += -/+ mode_size(mem)
+  
+  Rule 18:
+  (set (mem({pre_inc, pre_dec} sp)) fp)
+  constraints: CUR_FDE.is_stack_realign == 1
+  effects: CUR_FDE.is_stack_realign = 0
+           CUR_FDE.is_drap = 1
+           CUR_FDE.drap_regnum = cfa.reg
+
+  Rule 19:
+  (set fp sp)
+  constraints: CUR_FDE.is_drap == 1
+  effects: cfa.reg = fp
+           cfa.offset = cfa_store.offset */
 
 static void
 dwarf2out_frame_debug_expr (rtx expr, const char *label)
@@ -1607,7 +1656,20 @@ dwarf2out_frame_debug_expr (rtx expr, co
 	      cfa_temp.reg = cfa.reg;
 	      cfa_temp.offset = cfa.offset;
 	    }
-	  else
+            /* Rule 19 */
+            /* Each time FP is set to SP while the stack is realigned,
+               we assume the realignment uses DRAP and the DRAP register is
+               the current CFA register.  We update the CFA register to FP.  */
+	  else if (fde_table_in_use && CUR_FDE.is_drap 
+                   && REGNO (src) == STACK_POINTER_REGNUM 
+                   && REGNO (dest) == HARD_FRAME_POINTER_REGNUM)
+            {
+              cfa.reg = REGNO (dest);
+              cfa.offset = cfa_store.offset;
+              cfa_temp.reg = cfa.reg;
+              cfa_temp.offset = cfa.offset;
+            }
+          else
 	    {
 	      /* Saving a register in a register.  */
 	      gcc_assert (!fixed_regs [REGNO (dest)]
@@ -1747,6 +1809,22 @@ dwarf2out_frame_debug_expr (rtx expr, co
 	  targetm.dwarf_handle_frame_unspec (label, expr, XINT (src, 1));
 	  return;
 
+	  /* Rule 16 */
+	case AND:
+          /* If this AND operation happens on stack pointer in prologue, we 
+             assume the stack is realigned and we extract the alignment. */
+          if (XEXP (src, 0) == stack_pointer_rtx && fde_table_in_use)
+            {
+              CUR_FDE.is_stack_realign = 1;
+              CUR_FDE.stack_realignment = INTVAL (XEXP (src, 1));
+              /* If we didn't push anything to stack before stack is realigned,
+                  we assume the drap register isn't saved. */
+              if (cfa_store.offset > UNITS_PER_WORD)
+                CUR_FDE.is_drap_reg_saved = 1;
+              cfa_store.offset = 0;
+            }
+          return;
+
 	default:
 	  gcc_unreachable ();
 	}
@@ -1755,7 +1833,6 @@ dwarf2out_frame_debug_expr (rtx expr, co
       break;
 
     case MEM:
-      gcc_assert (REG_P (src));
 
       /* Saving a register to the stack.  Make sure dest is relative to the
 	 CFA register.  */
@@ -1788,6 +1865,17 @@ dwarf2out_frame_debug_expr (rtx expr, co
 
 	  gcc_assert (REGNO (XEXP (XEXP (dest, 0), 0)) == STACK_POINTER_REGNUM
 		      && cfa_store.reg == STACK_POINTER_REGNUM);
+          
+          /* Rule 18 */
+          /* If we push FP after stack is realigned, we assume this realignment
+             is drap, and we record the drap register. */
+          if (fde_table_in_use && CUR_FDE.is_stack_realign
+              && REGNO (src) == HARD_FRAME_POINTER_REGNUM)
+            {
+              CUR_FDE.is_stack_realign = 0;
+              CUR_FDE.is_drap = 1;
+              CUR_FDE.drap_regnum = DWARF_FRAME_REGNUM (cfa.reg);
+            }            
 
 	  cfa_store.offset += offset;
 	  if (cfa.reg == STACK_POINTER_REGNUM)
@@ -1882,6 +1970,12 @@ dwarf2out_frame_debug_expr (rtx expr, co
 	      break;
 	    }
 	}
+        /* Rule 17 */
+        /* If the source operand of this MEM operation is not a register,
+           the source is basically the return address.  Here we only care how
+           much the stack grew and do not record a save for it.  */
+      if (!REG_P (src))
+        break;
 
       def_cfa_1 (label, &cfa);
       {
@@ -3548,6 +3642,9 @@ output_cfa_loc (dw_cfi_ref cfi)
   dw_loc_descr_ref loc;
   unsigned long size;
 
+  if (cfi->dw_cfi_opc == DW_CFA_expression)
+    dw2_asm_output_data (1, cfi->dw_cfi_oprnd2.dw_cfi_reg_num, NULL);
+
   /* Output the size of the block.  */
   loc = cfi->dw_cfi_oprnd1.dw_cfi_loc;
   size = size_of_locs (loc);
@@ -9024,8 +9121,9 @@ based_loc_descr (rtx reg, HOST_WIDE_INT 
 	      offset += INTVAL (XEXP (elim, 1));
 	      elim = XEXP (elim, 0);
 	    }
-	  gcc_assert (elim == (frame_pointer_needed ? hard_frame_pointer_rtx
-		      : stack_pointer_rtx));
+	  gcc_assert (stack_realign_fp
+	              || elim == (frame_pointer_needed ? hard_frame_pointer_rtx
+		                                       : stack_pointer_rtx));
 	  offset += frame_pointer_fb_offset;
 
 	  return new_loc_descr (DW_OP_fbreg, offset, 0);
@@ -11155,9 +11253,10 @@ compute_frame_pointer_to_fb_displacement
       offset += INTVAL (XEXP (elim, 1));
       elim = XEXP (elim, 0);
     }
-  gcc_assert (elim == (frame_pointer_needed ? hard_frame_pointer_rtx
-		       : stack_pointer_rtx));
 
+  gcc_assert (stack_realign_fp 
+              || elim == (frame_pointer_needed ? hard_frame_pointer_rtx
+		       : stack_pointer_rtx));
   frame_pointer_fb_offset = -offset;
 }
 
@@ -15438,6 +15537,63 @@ dwarf2out_finish (const char *filename)
   if (debug_str_hash)
     htab_traverse (debug_str_hash, output_indirect_string, NULL);
 }
+
+/* In this function we use a series of DW_OP_?? expression which simulates
+   how stack is realigned to represent the location of the stored register.*/
+static void
+reg_save_with_expression (dw_cfi_ref cfi)
+{
+  struct dw_loc_descr_struct *head, *tmp;
+  HOST_WIDE_INT alignment = CUR_FDE.stack_realignment;
+  HOST_WIDE_INT offset = cfi->dw_cfi_oprnd2.dw_cfi_offset * UNITS_PER_WORD;
+  int reg = cfi->dw_cfi_oprnd1.dw_cfi_reg_num;
+  unsigned int dwarf_sp = (unsigned)DWARF_FRAME_REGNUM (STACK_POINTER_REGNUM);
+  
+  if (CUR_FDE.is_stack_realign)
+    {
+      head = tmp = new_loc_descr (DW_OP_const4s, 2 * UNITS_PER_WORD, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, alignment, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_and, 0, 0);
+
+      /* If stack grows upward, the offset will be a negative. */
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);  
+   
+      cfi->dw_cfi_opc = DW_CFA_expression;
+      cfi->dw_cfi_oprnd2.dw_cfi_reg_num = reg; 
+      cfi->dw_cfi_oprnd1.dw_cfi_loc = head;
+    }
+
+  /* We need to restore the drap register through a dereference.  If the
+     drap register does not need restoring, we do nothing here.  */
+  if (CUR_FDE.is_drap && reg == CUR_FDE.drap_regnum)
+    {
+       
+      dw_cfi_ref cfi2 = new_cfi();
+
+      cfi->dw_cfi_opc = DW_CFA_expression;
+      head = tmp = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      if (CUR_FDE.is_drap_reg_saved)
+        {
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_deref, 0, 0);
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_const4s, 
+                                                  2 * UNITS_PER_WORD, 0);
+          tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+        }
+      cfi->dw_cfi_oprnd2.dw_cfi_reg_num = reg;
+      cfi->dw_cfi_oprnd1.dw_cfi_loc = head;
+
+      /* We also need to restore the sp. */
+      head = tmp = new_loc_descr (DW_OP_const4s, offset, 0);
+      tmp = tmp->dw_loc_next = new_loc_descr (DW_OP_minus, 0, 0);
+      cfi2->dw_cfi_opc = DW_CFA_expression;
+      cfi2->dw_cfi_oprnd2.dw_cfi_reg_num = dwarf_sp;
+      cfi2->dw_cfi_oprnd1.dw_cfi_loc = head;
+      cfi->dw_cfi_next = cfi2;
+    }  
+}
 #else
 
 /* This should never be used, but its address is needed for comparisons.  */
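
To make the expression built by reg_save_with_expression easier to follow, here is a small sketch that evaluates the same operator sequence for the is_stack_realign case. It assumes the usual DW_CFA_expression convention that the CFA is pushed on the evaluation stack before the expression runs. The function name and the use of int64_t are illustrative choices, and mask stands for the constant taken from the AND instruction (for example -16).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical evaluator for the fixed operator sequence
   const 2*word; minus; const mask; and; const offset; minus
   applied to an initial stack holding the CFA.  */
int64_t
realigned_save_address (int64_t cfa, int64_t word, int64_t mask,
                        int64_t offset)
{
  int64_t stack[2];
  int top = 0;

  assert (mask < 0);                      /* e.g. -16 for 16-byte alignment */
  stack[top++] = cfa;                     /* CFA is on the stack at entry */

  stack[top++] = 2 * word;                /* DW_OP_const4s 2 * UNITS_PER_WORD */
  top--; stack[top - 1] -= stack[top];    /* DW_OP_minus */

  stack[top++] = mask;                    /* DW_OP_const4s alignment */
  top--; stack[top - 1] &= stack[top];    /* DW_OP_and */

  stack[top++] = offset;                  /* DW_OP_const4s offset */
  top--; stack[top - 1] -= stack[top];    /* DW_OP_minus */

  return stack[top - 1];
}
```

For example, with a CFA of 0x1008, 4-byte words, a mask of -16 and an offset of 8, the expression yields 0xff8: the CFA minus two words is 0x1000, which is already 16-byte aligned, and the slot sits 8 bytes below that.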

[-- Attachment #3: stack-align-generic-0411.patch --]
[-- Type: application/octet-stream, Size: 32182 bytes --]

Index: flags.h
===================================================================
--- flags.h	(.../trunk/gcc)	(revision 134098)
+++ flags.h	(.../branches/stack/gcc)	(revision 134150)
@@ -223,12 +223,6 @@ extern int flag_dump_rtl_in_asm;
 \f
 /* Other basic status info about current function.  */
 
-/* Nonzero means current function must be given a frame pointer.
-   Set in stmt.c if anything is allocated on the stack there.
-   Set in reload1.c if anything is allocated on the stack there.  */
-
-extern int frame_pointer_needed;
-
 /* Nonzero if subexpressions must be evaluated from left-to-right.  */
 extern int flag_evaluation_order;
 
Index: defaults.h
===================================================================
--- defaults.h	(.../trunk/gcc)	(revision 134098)
+++ defaults.h	(.../branches/stack/gcc)	(revision 134150)
@@ -940,4 +940,8 @@ along with GCC; see the file COPYING3.  
 #define OUTGOING_REG_PARM_STACK_SPACE 0
 #endif
 
+#ifndef MAX_VECTORIZE_STACK_ALIGNMENT
+#define MAX_VECTORIZE_STACK_ALIGNMENT 0
+#endif
+
 #endif  /* ! GCC_DEFAULTS_H */
Index: tree-pass.h
===================================================================
--- tree-pass.h	(.../trunk/gcc)	(revision 134098)
+++ tree-pass.h	(.../branches/stack/gcc)	(revision 134150)
@@ -472,6 +472,7 @@ extern struct gimple_opt_pass pass_inlin
 extern struct gimple_opt_pass pass_apply_inline;
 extern struct gimple_opt_pass pass_all_early_optimizations;
 extern struct gimple_opt_pass pass_update_address_taken;
+extern struct gimple_opt_pass pass_handle_drap;
 
 /* The root of the compilation pass tree, once constructed.  */
 extern struct opt_pass *all_passes, *all_ipa_passes, *all_lowering_passes;
Index: builtins.c
===================================================================
--- builtins.c	(.../trunk/gcc)	(revision 134098)
+++ builtins.c	(.../branches/stack/gcc)	(revision 134150)
@@ -740,7 +740,7 @@ expand_builtin_setjmp_receiver (rtx rece
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (crtl->args.internal_arg_pointer,
 			  copy_to_reg (get_arg_pointer_save_area ()));
 	}
     }
@@ -775,6 +775,11 @@ expand_builtin_longjmp (rtx buf_addr, rt
   rtx fp, lab, stack, insn, last;
   enum machine_mode sa_mode = STACK_SAVEAREA_MODE (SAVE_NONLOCAL);
 
+  /* DRAP is needed for stack realignment if longjmp is expanded in the
+     current function.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+    cfun->need_drap = true;
+
   if (setjmp_alias_set == -1)
     setjmp_alias_set = new_alias_set ();
 
@@ -1345,7 +1350,7 @@ expand_builtin_apply_args_1 (void)
       }
 
   /* Save the arg pointer to the block.  */
-  tem = copy_to_reg (virtual_incoming_args_rtx);
+  tem = copy_to_reg (crtl->args.internal_arg_pointer);
 #ifdef STACK_GROWS_DOWNWARD
   /* We need the pointer as the caller actually passed them to us, not
      as we might have pretended they were passed.  Make sure it's a valid
@@ -1453,6 +1458,14 @@ expand_builtin_apply (rtx function, rtx 
   /* Allocate a block of memory onto the stack and copy the memory
      arguments to the outgoing arguments address.  */
   allocate_dynamic_stack_space (argsize, 0, BITS_PER_UNIT);
+
+  /* Set DRAP flag to true, even though allocate_dynamic_stack_space
+     may have already set current_function_calls_alloca to true.
+     current_function_calls_alloca won't be set if argsize is zero,
+     so we have to guarantee need_drap is true here.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+    cfun->need_drap = true;
+
   dest = virtual_outgoing_args_rtx;
 #ifndef STACK_GROWS_DOWNWARD
   if (GET_CODE (argsize) == CONST_INT)
Index: final.c
===================================================================
--- final.c	(.../trunk/gcc)	(revision 134098)
+++ final.c	(.../branches/stack/gcc)	(revision 134150)
@@ -178,12 +178,6 @@ CC_STATUS cc_status;
 CC_STATUS cc_prev_status;
 #endif
 
-/* Nonzero means current function must be given a frame pointer.
-   Initialized in function.c to 0.  Set only in reload1.c as per
-   the needs of the function.  */
-
-int frame_pointer_needed;
-
 /* Number of unmatched NOTE_INSN_BLOCK_BEG notes we have seen.  */
 
 static int block_depth;
Index: global.c
===================================================================
--- global.c	(.../trunk/gcc)	(revision 134098)
+++ global.c	(.../branches/stack/gcc)	(revision 134150)
@@ -247,10 +247,20 @@ compute_regsets (HARD_REG_SET *elim_set,
   static const struct {const int from, to; } eliminables[] = ELIMINABLE_REGS;
   size_t i;
 #endif
+
+  /* FIXME: If EXIT_IGNORE_STACK is set, we will not save and restore
+     sp for alloca.  So we can't eliminate the frame pointer in that
+     case.  At some point, we should improve this by emitting the
+     sp-adjusting insns for this case.  */
   int need_fp
     = (! flag_omit_frame_pointer
        || (current_function_calls_alloca && EXIT_IGNORE_STACK)
-       || FRAME_POINTER_REQUIRED);
+       || FRAME_POINTER_REQUIRED
+       || current_function_accesses_prior_frames
+       || cfun->stack_realign_needed);
+
+  frame_pointer_needed = need_fp;
+  cfun->need_frame_pointer_set = 1;
 
   max_regno = max_reg_num ();
   compact_blocks ();
@@ -271,7 +281,10 @@ compute_regsets (HARD_REG_SET *elim_set,
     {
       bool cannot_elim
 	= (! CAN_ELIMINATE (eliminables[i].from, eliminables[i].to)
-	   || (eliminables[i].to == STACK_POINTER_REGNUM && need_fp));
+	   || (eliminables[i].to == STACK_POINTER_REGNUM
+	       && need_fp 
+	       && (! MAX_VECTORIZE_STACK_ALIGNMENT
+		   || ! stack_realign_fp)));
 
       if (!regs_asm_clobbered[eliminables[i].from])
 	{
Index: function.c
===================================================================
--- function.c	(.../trunk/gcc)	(revision 134098)
+++ function.c	(.../branches/stack/gcc)	(revision 134150)
@@ -342,17 +342,19 @@ assign_stack_local (enum machine_mode mo
 {
   rtx x, addr;
   int bigend_correction = 0;
-  unsigned int alignment;
+  unsigned int alignment, mode_alignment, alignment_in_bits;
   int frame_off, frame_alignment, frame_phase;
 
+  if (mode == BLKmode)
+    mode_alignment = BIGGEST_ALIGNMENT;
+  else
+    mode_alignment = GET_MODE_ALIGNMENT (mode);
+
   if (align == 0)
     {
       tree type;
 
-      if (mode == BLKmode)
-	alignment = BIGGEST_ALIGNMENT;
-      else
-	alignment = GET_MODE_ALIGNMENT (mode);
+      alignment = mode_alignment;
 
       /* Allow the target to (possibly) increase the alignment of this
 	 stack slot.  */
@@ -372,15 +374,45 @@ assign_stack_local (enum machine_mode mo
   else
     alignment = align / BITS_PER_UNIT;
 
+  alignment_in_bits = alignment * BITS_PER_UNIT;
+
   if (FRAME_GROWS_DOWNWARD)
     frame_offset -= size;
 
-  /* Ignore alignment we can't do with expected alignment of the boundary.  */
-  if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
-    alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
-
-  if (cfun->stack_alignment_needed < alignment * BITS_PER_UNIT)
-    cfun->stack_alignment_needed = alignment * BITS_PER_UNIT;
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < alignment_in_bits)
+	{
+          if (!cfun->stack_realign_processed)
+            cfun->stack_alignment_estimated = alignment_in_bits;
+          else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized);
+	      if (!cfun->stack_realign_needed)
+		{
+		  /* It is OK to reduce the alignment as long as the
+		     requested size is 0 or the estimated stack
+		     alignment >= mode alignment.  */
+		  gcc_assert (size == 0
+			      || (cfun->stack_alignment_estimated
+				  >= mode_alignment));
+		  alignment_in_bits = cfun->stack_alignment_estimated;
+		  alignment = alignment_in_bits / BITS_PER_UNIT;
+		}
+	    }
+	}
+    }
+  else
+    {
+      /* Ignore alignment we can't do with expected alignment of the
+	 boundary.  */
+      if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
+	alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+    }
+  if (cfun->stack_alignment_needed < alignment_in_bits)
+    cfun->stack_alignment_needed = alignment_in_bits;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
   /* Calculate how many bytes the start of local variables is off from
      stack alignment.  */
@@ -1169,7 +1201,17 @@ instantiate_new_reg (rtx x, HOST_WIDE_IN
   HOST_WIDE_INT offset;
 
   if (x == virtual_incoming_args_rtx)
-    new = arg_pointer_rtx, offset = in_arg_offset;
+    {
+      /* Replace virtual_incoming_args_rtx with the internal arg pointer.  */
+      if (crtl->args.internal_arg_pointer != virtual_incoming_args_rtx)
+        {
+          gcc_assert (stack_realign_drap);
+          new = crtl->args.internal_arg_pointer;
+          offset = 0;
+        }
+      else
+        new = arg_pointer_rtx, offset = in_arg_offset;
+    }
   else if (x == virtual_stack_vars_rtx)
     new = frame_pointer_rtx, offset = var_offset;
   else if (x == virtual_stack_dynamic_rtx)
@@ -2968,6 +3010,20 @@ assign_parms (tree fndecl)
 	  continue;
 	}
 
+      /* Estimate stack alignment from parameter alignment.  */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT)
+        {
+          unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
+						      data.passed_type);
+	  if (TYPE_ALIGN (data.nominal_type) > align)
+	    align = TYPE_ALIGN (data.passed_type);
+	  if (cfun->stack_alignment_estimated < align)
+	    {
+	      gcc_assert (!cfun->stack_realign_processed);
+	      cfun->stack_alignment_estimated = align;
+	    }
+	}
+	
       if (current_function_stdarg && !TREE_CHAIN (parm))
 	assign_parms_setup_varargs (&all, &data, false);
 
@@ -3005,6 +3061,28 @@ assign_parms (tree fndecl)
      now that all parameters have been copied out of hard registers.  */
   emit_insn (all.first_conversion_insn);
 
+  /* Estimate reload stack alignment from scalar return mode.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (DECL_RESULT (fndecl))
+	{
+	  tree type = TREE_TYPE (DECL_RESULT (fndecl));
+	  enum machine_mode mode = TYPE_MODE (type);
+
+	  if (mode != BLKmode
+	      && mode != VOIDmode
+	      && !AGGREGATE_TYPE_P (type))
+	    {
+	      unsigned int align = GET_MODE_ALIGNMENT (mode);
+	      if (cfun->stack_alignment_estimated < align)
+		{
+		  gcc_assert (!cfun->stack_realign_processed);
+		  cfun->stack_alignment_estimated = align;
+		}
+	    }
+	} 
+    }
+
   /* If we are receiving a struct value address as the first argument, set up
      the RTL for the function result. As this might require code to convert
      the transmitted address to Pmode, we do this here to ensure that possible
@@ -3282,12 +3360,34 @@ locate_and_pad_parm (enum machine_mode p
   locate->where_pad = where_pad;
   locate->boundary = boundary;
 
-  /* Remember if the outgoing parameter requires extra alignment on the
-     calling function side.  */
-  if (boundary > PREFERRED_STACK_BOUNDARY)
-    boundary = PREFERRED_STACK_BOUNDARY;
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      /* stack_alignment_estimated can't change after stack has been
+	 realigned.  */
+      if (cfun->stack_alignment_estimated < boundary)
+        {
+          if (!cfun->stack_realign_processed)
+	    cfun->stack_alignment_estimated = boundary;
+	  else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized
+			  && cfun->stack_realign_needed);
+	    }
+	}
+    }
+  else
+    {
+      /* Remember if the outgoing parameter requires extra alignment on
+         the calling function side.  */
+      if (boundary > PREFERRED_STACK_BOUNDARY)
+        boundary = PREFERRED_STACK_BOUNDARY;
+    }
   if (cfun->stack_alignment_needed < boundary)
     cfun->stack_alignment_needed = boundary;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
+  if (cfun->preferred_stack_boundary < boundary)
+    cfun->preferred_stack_boundary = boundary;
 
 #ifdef ARGS_GROW_DOWNWARD
   locate->slot_offset.constant = -initial_offset_ptr->constant;
@@ -3843,6 +3943,8 @@ allocate_struct_function (tree fndecl, b
   cfun = ggc_alloc_cleared (sizeof (struct function));
 
   cfun->stack_alignment_needed = STACK_BOUNDARY;
+  cfun->stack_alignment_used = STACK_BOUNDARY;
+  cfun->stack_alignment_estimated = STACK_BOUNDARY;
   cfun->preferred_stack_boundary = STACK_BOUNDARY;
 
   current_function_funcdef_no = get_next_funcdef_no ();
@@ -4622,7 +4724,8 @@ get_arg_pointer_save_area (void)
 	 generated stack slot may not be a valid memory address, so we
 	 have to check it and fix it if necessary.  */
       start_sequence ();
-      emit_move_insn (validize_mem (ret), virtual_incoming_args_rtx);
+      emit_move_insn (validize_mem (ret),
+                      crtl->args.internal_arg_pointer);
       seq = get_insns ();
       end_sequence ();
 
Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(.../trunk/gcc)	(revision 134098)
+++ tree-vectorizer.c	(.../branches/stack/gcc)	(revision 134150)
@@ -1786,9 +1786,9 @@ vect_can_force_dr_alignment_p (const_tre
 
   if (TREE_STATIC (decl))
     return (alignment <= MAX_OFILE_ALIGNMENT);
+  else if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    return (alignment <= MAX_VECTORIZE_STACK_ALIGNMENT);
   else
-    /* This used to be PREFERRED_STACK_BOUNDARY, however, that is not 100%
-       correct until someone implements forced stack alignment.  */
     return (alignment <= STACK_BOUNDARY); 
 }
 
Index: function.h
===================================================================
--- function.h	(.../trunk/gcc)	(revision 134098)
+++ function.h	(.../branches/stack/gcc)	(revision 134150)
@@ -271,6 +271,9 @@ struct rtl_data GTY(())
      needed by inner routines.  */
   rtx x_arg_pointer_save_area;
 
+  /* Dynamic Realign Argument Pointer used for realigning stack.  */
+  rtx drap_reg;
+
   /* Offset to end of allocated area of stack frame.
      If stack grows down, this is the address of the last stack slot allocated.
      If stack grows up, this is the address for the next slot.  */
@@ -352,9 +355,16 @@ struct function GTY(())
   /* tm.h can use this to store whatever it likes.  */
   struct machine_function * GTY ((maybe_undef)) machine;
 
-  /* The largest alignment of slot allocated on the stack.  */
+  /* The largest alignment needed on the stack, including requirement
+     for outgoing stack alignment.  */
   unsigned int stack_alignment_needed;
 
+  /* The largest alignment of slot allocated on the stack.  */
+  unsigned int stack_alignment_used;
+
+  /* The estimated stack alignment.  */
+  unsigned int stack_alignment_estimated;
+
   /* Preferred alignment of the end of stack frame.  */
   unsigned int preferred_stack_boundary;
 
@@ -509,6 +519,38 @@ struct function GTY(())
 
   /* Nonzero if pass_tree_profile was run on this function.  */
   unsigned int after_tree_profile : 1;
+
+  /* Nonzero if current function must be given a frame pointer.
+     Set in global.c if anything is allocated on the stack there.  */
+  unsigned int need_frame_pointer : 1;
+
+  /* Nonzero if need_frame_pointer has been set.  */
+  unsigned int need_frame_pointer_set : 1;
+
+  /* Nonzero if, by estimation, current function stack needs realignment. */
+  unsigned int stack_realign_needed : 1;
+
+  /* Nonzero if function stack realignment is really needed.  This flag
+     is set after reload if the criteria for stack realignment still
+     hold by then.  Its value may contradict stack_realign_needed, since
+     the latter was set before reload.  This flag is more accurate than
+     stack_realign_needed, so prologue/epilogue should be generated
+     according to both flags.  */
+  unsigned int stack_realign_really : 1;
+
+  /* Nonzero if the function being compiled needs a dynamic realign
+     argument pointer (DRAP) when the stack needs realigning.  */
+  unsigned int need_drap : 1;
+
+  /* Nonzero if current function needs to save/restore parameter
+     pointer register in prolog, because it is a callee save reg.  */
+  unsigned int save_param_ptr_reg : 1;
+
+  /* Nonzero if function stack realignment estimation is done.  */
+  unsigned int stack_realign_processed : 1;
+
+  /* Nonzero if function stack realignment has been finalized.  */
+  unsigned int stack_realign_finalized : 1;
 };
 
 /* If va_list_[gf]pr_size is set to this, it means we don't know how
@@ -563,6 +605,9 @@ extern void instantiate_decl_rtl (rtx x)
 #define dom_computed (cfun->cfg->x_dom_computed)
 #define n_bbs_in_dom_tree (cfun->cfg->x_n_bbs_in_dom_tree)
 #define VALUE_HISTOGRAMS(fun) (fun)->value_histograms
+#define frame_pointer_needed (cfun->need_frame_pointer)
+#define stack_realign_fp (cfun->stack_realign_needed && !cfun->need_drap)
+#define stack_realign_drap (cfun->stack_realign_needed && cfun->need_drap)
 
 /* Given a function decl for a containing function,
    return the `struct function' for it.  */
Index: calls.c
===================================================================
--- calls.c	(.../trunk/gcc)	(revision 134098)
+++ calls.c	(.../branches/stack/gcc)	(revision 134150)
@@ -419,6 +419,10 @@ emit_call_1 (rtx funexp, tree fntree, tr
       rounded_stack_size -= n_popped;
       rounded_stack_size_rtx = GEN_INT (rounded_stack_size);
       stack_pointer_delta -= n_popped;
+
+      /* If arguments are popped by the callee, stack realignment must
+         use DRAP.  */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+        cfun->need_drap = true;
     }
 
   if (!ACCUMULATE_OUTGOING_ARGS)
@@ -2091,7 +2095,10 @@ expand_call (tree exp, rtx target, int i
 
   /* Figure out the amount to which the stack should be aligned.  */
   preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
-  if (fndecl)
+
+  /* With automatic stack realignment, we align stack in prologue when
+     needed and there is no need to update preferred_stack_boundary.  */
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT && fndecl)
     {
       struct cgraph_rtl_info *i = cgraph_rtl_info (fndecl);
       if (i && i->preferred_incoming_stack_boundary)
@@ -2392,7 +2399,7 @@ expand_call (tree exp, rtx target, int i
 	 incoming argument block.  */
       if (pass == 0)
 	{
-	  argblock = virtual_incoming_args_rtx;
+	  argblock = crtl->args.internal_arg_pointer;
 	  argblock
 #ifdef STACK_GROWS_DOWNWARD
 	    = plus_constant (argblock, crtl->args.pretend_args_size);
Index: emit-rtl.c
===================================================================
--- emit-rtl.c	(.../trunk/gcc)	(revision 134098)
+++ emit-rtl.c	(.../branches/stack/gcc)	(revision 134150)
@@ -864,9 +864,20 @@ rtx
 gen_reg_rtx (enum machine_mode mode)
 {
   rtx val;
+  unsigned int align = GET_MODE_ALIGNMENT (mode);
 
   gcc_assert (can_create_pseudo_p ());
 
+  /* If a virtual register with bigger mode alignment is generated,
+     increase stack alignment estimation because it might be spilled
+     to stack later.  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT 
+      && cfun->stack_alignment_estimated < align)
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      cfun->stack_alignment_estimated = align;
+    }
+		
   if (generating_concat_p
       && (GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT
 	  || GET_MODE_CLASS (mode) == MODE_COMPLEX_INT))
Index: cfgexpand.c
===================================================================
--- cfgexpand.c	(.../trunk/gcc)	(revision 134098)
+++ cfgexpand.c	(.../branches/stack/gcc)	(revision 134150)
@@ -161,10 +161,27 @@ get_decl_align_unit (tree decl)
 
   align = DECL_ALIGN (decl);
   align = LOCAL_ALIGNMENT (TREE_TYPE (decl), align);
-  if (align > PREFERRED_STACK_BOUNDARY)
-    align = PREFERRED_STACK_BOUNDARY;
+
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < align)
+	{
+	  gcc_assert (!cfun->stack_realign_processed);
+          cfun->stack_alignment_estimated = align;
+	}
+    }
+  else
+    {
+      if (align > PREFERRED_STACK_BOUNDARY)
+	align = PREFERRED_STACK_BOUNDARY;
+    }
+
+  /* stack_alignment_needed > PREFERRED_STACK_BOUNDARY is permitted.
+     So here we only make sure stack_alignment_needed >= align.  */
   if (cfun->stack_alignment_needed < align)
     cfun->stack_alignment_needed = align;
+  if (cfun->stack_alignment_used < cfun->stack_alignment_needed)
+    cfun->stack_alignment_used = cfun->stack_alignment_needed;
 
   return align / BITS_PER_UNIT;
 }
@@ -743,6 +760,29 @@ defer_stack_allocation (tree var, bool t
 static HOST_WIDE_INT
 expand_one_var (tree var, bool toplevel, bool really_expand)
 {
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && TREE_CODE (var) == VAR_DECL)
+    {
+      unsigned int align;
+
+      /* Because we don't know if VAR will be in register or on stack,
+	 we conservatively assume it will be on stack even if VAR is
+	 eventually put into register after RA pass.  For non-automatic
+	 variables, which won't be on stack, we collect alignment of
+	 type and ignore user specified alignment.  */
+      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+	align = TYPE_ALIGN (TREE_TYPE (var));
+      else
+	align = DECL_ALIGN (var);
+
+      if (cfun->stack_alignment_estimated < align)
+        {
+          /* stack_alignment_estimated shouldn't change after the stack
+             realign decision has been made.  */
+          gcc_assert (!cfun->stack_realign_processed);
+	  cfun->stack_alignment_estimated = align;
+	}
+    }
+
   if (TREE_CODE (var) != VAR_DECL)
     ;
   else if (DECL_EXTERNAL (var))
@@ -1997,3 +2037,71 @@ struct gimple_opt_pass pass_expand =
   TODO_dump_func,                       /* todo_flags_finish */
  }
 };
+
+static bool
+gate_handle_drap (void)
+{
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT)
+    return false;
+  else
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      return true;
+    }
+}
+
+/* This pass sets crtl->args.internal_arg_pointer to a virtual
+   register if DRAP is needed.  Local register allocator will replace
+   virtual_incoming_args_rtx with the virtual register.  */
+
+static unsigned int
+handle_drap (void)
+{
+  rtx internal_arg_rtx; 
+
+  if (!cfun->need_drap
+      && (current_function_calls_alloca
+          || cfun->has_nonlocal_label
+          || current_function_has_nonlocal_goto))
+    cfun->need_drap = true;
+
+  /* Call targetm.calls.internal_arg_pointer again.  This time it will
+     return a virtual register if DRAP is needed.  */
+  internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
+
+  /* Assertion to check internal_arg_pointer is set to the right rtx
+     here.  */
+  gcc_assert (crtl->args.internal_arg_pointer == 
+             virtual_incoming_args_rtx);
+
+  /* Do nothing if no need to replace virtual_incoming_args_rtx.  */
+  if (crtl->args.internal_arg_pointer != internal_arg_rtx)
+    {
+      crtl->args.internal_arg_pointer = internal_arg_rtx;
+
+      /* Call fixup_tail_calls to clean up REG_EQUIV notes if DRAP is
+         needed.  */
+      fixup_tail_calls ();
+    }
+
+  return 0;
+}
+
+struct gimple_opt_pass pass_handle_drap =
+{
+ {
+  GIMPLE_PASS,
+  "handle_drap",			/* name */
+  gate_handle_drap,			/* gate */
+  handle_drap,			        /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  0,				        /* tv_id */
+  0,                                    /* properties_required */
+  0,                                    /* properties_provided */
+  0,				        /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_dump_func,                       /* todo_flags_finish */
+ }
+};
Index: passes.c
===================================================================
--- passes.c	(.../trunk/gcc)	(revision 134098)
+++ passes.c	(.../branches/stack/gcc)	(revision 134150)
@@ -685,6 +685,7 @@ init_optimization_passes (void)
   NEXT_PASS (pass_mudflap_2);
   NEXT_PASS (pass_free_cfg_annotations);
   NEXT_PASS (pass_expand);
+  NEXT_PASS (pass_handle_drap); 
   NEXT_PASS (pass_rest_of_compilation);
     {
       struct opt_pass **p = &pass_rest_of_compilation.pass.sub;
Index: stmt.c
===================================================================
--- stmt.c	(.../trunk/gcc)	(revision 134098)
+++ stmt.c	(.../branches/stack/gcc)	(revision 134150)
@@ -1819,7 +1819,7 @@ expand_nl_goto_receiver (void)
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (crtl->args.internal_arg_pointer,
 			  copy_to_reg (get_arg_pointer_save_area ()));
 	}
     }
Index: reload1.c
===================================================================
--- reload1.c	(.../trunk/gcc)	(revision 134098)
+++ reload1.c	(.../branches/stack/gcc)	(revision 134150)
@@ -2279,7 +2279,13 @@ set_label_offsets (rtx x, rtx insn, int 
 	  if (offsets_at[CODE_LABEL_NUMBER (x) - first_label_num][i]
 	      != (initial_p ? reg_eliminate[i].initial_offset
 		  : reg_eliminate[i].offset))
-	    reg_eliminate[i].can_eliminate = 0;
+            {
+	      /* Must not disable reg eliminate because stack realignment
+	         must eliminate frame pointer to stack pointer.  */
+	      gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			  || ! stack_realign_fp);
+	      reg_eliminate[i].can_eliminate = 0;
+            }
 
       return;
 
@@ -2358,7 +2364,13 @@ set_label_offsets (rtx x, rtx insn, int 
 	 offset because we are doing a jump to a variable address.  */
       for (p = reg_eliminate; p < &reg_eliminate[NUM_ELIMINABLE_REGS]; p++)
 	if (p->offset != p->initial_offset)
-	  p->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    p->can_eliminate = 0;
+	  }
       break;
 
     default:
@@ -2849,7 +2861,13 @@ elimination_effects (rtx x, enum machine
       /* If we modify the source of an elimination rule, disable it.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->from_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       /* If we modify the target of an elimination rule by adding a constant,
 	 update its offset.  If we modify the target in any other way, we'll
@@ -2875,7 +2893,14 @@ elimination_effects (rtx x, enum machine
 		    && CONST_INT_P (XEXP (XEXP (x, 1), 1)))
 		  ep->offset -= INTVAL (XEXP (XEXP (x, 1), 1));
 		else
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	      }
 	  }
 
@@ -2918,7 +2943,13 @@ elimination_effects (rtx x, enum machine
 	 know how this register is used.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->from_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       elimination_effects (XEXP (x, 0), mem_mode);
       return;
@@ -2929,7 +2960,13 @@ elimination_effects (rtx x, enum machine
 	 be performed.  Otherwise, we need not be concerned about it.  */
       for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
 	if (ep->to_rtx == XEXP (x, 0))
-	  ep->can_eliminate = 0;
+	  {
+	    /* Must not disable reg eliminate because stack realignment
+	       must eliminate frame pointer to stack pointer.  */
+	    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+			|| ! stack_realign_fp);
+	    ep->can_eliminate = 0;
+	  }
 
       elimination_effects (XEXP (x, 0), mem_mode);
       return;
@@ -2963,7 +3000,14 @@ elimination_effects (rtx x, enum machine
 		    && GET_CODE (XEXP (src, 1)) == CONST_INT)
 		  ep->offset -= INTVAL (XEXP (src, 1));
 		else
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	      }
 	}
 
@@ -3292,7 +3336,14 @@ eliminate_regs_in_insn (rtx insn, int re
 	      for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
 		   ep++)
 		if (ep->from_rtx == orig_operand[i])
-		  ep->can_eliminate = 0;
+		  {
+		    /* Must not disable reg eliminate because stack
+		       realignment must eliminate frame pointer to
+		       stack pointer.  */
+		    gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+				|| ! stack_realign_fp);
+		    ep->can_eliminate = 0;
+		  }
 	    }
 
 	  /* Companion to the above plus substitution, we can allow
@@ -3422,7 +3473,13 @@ eliminate_regs_in_insn (rtx insn, int re
   for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
     {
       if (ep->previous_offset != ep->offset && ep->ref_outside_mem)
-	ep->can_eliminate = 0;
+	{
+	  /* Must not disable reg eliminate because stack realignment
+	     must eliminate frame pointer to stack pointer.  */
+	  gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+		      || ! stack_realign_fp);
+	  ep->can_eliminate = 0;
+	}
 
       ep->ref_outside_mem = 0;
 
@@ -3498,6 +3555,11 @@ mark_not_eliminable (rtx dest, const_rtx
 	    || XEXP (SET_SRC (x), 0) != dest
 	    || GET_CODE (XEXP (SET_SRC (x), 1)) != CONST_INT))
       {
+	/* Must not disable reg eliminate because stack realignment
+	   must eliminate frame pointer to stack pointer.  */
+	gcc_assert (! MAX_VECTORIZE_STACK_ALIGNMENT
+		    || ! stack_realign_fp);
+
 	reg_eliminate[i].can_eliminate_previous
 	  = reg_eliminate[i].can_eliminate = 0;
 	num_eliminable--;
@@ -3668,8 +3730,11 @@ update_eliminables (HARD_REG_SET *pset)
   frame_pointer_needed = 1;
   for (ep = reg_eliminate; ep < &reg_eliminate[NUM_ELIMINABLE_REGS]; ep++)
     {
-      if (ep->can_eliminate && ep->from == FRAME_POINTER_REGNUM
-	  && ep->to != HARD_FRAME_POINTER_REGNUM)
+      if (ep->can_eliminate
+	  && ep->from == FRAME_POINTER_REGNUM
+	  && ep->to != HARD_FRAME_POINTER_REGNUM
+	  && (! MAX_VECTORIZE_STACK_ALIGNMENT
+	      || ! cfun->stack_realign_needed))
 	frame_pointer_needed = 0;
 
       if (! ep->can_eliminate && ep->can_eliminate_previous)
@@ -3713,18 +3778,8 @@ init_elim_table (void)
   if (!reg_eliminate)
     reg_eliminate = xcalloc (sizeof (struct elim_table), NUM_ELIMINABLE_REGS);
 
-  /* Does this function require a frame pointer?  */
-
-  frame_pointer_needed = (! flag_omit_frame_pointer
-			  /* ?? If EXIT_IGNORE_STACK is set, we will not save
-			     and restore sp for alloca.  So we can't eliminate
-			     the frame pointer in that case.  At some point,
-			     we should improve this by emitting the
-			     sp-adjusting insns for this case.  */
-			  || (current_function_calls_alloca
-			      && EXIT_IGNORE_STACK)
-			  || current_function_accesses_prior_frames
-			  || FRAME_POINTER_REQUIRED);
+  /* frame_pointer_needed should have been set.  */
+  gcc_assert (cfun->need_frame_pointer_set);
 
   num_eliminable = 0;
 
@@ -3736,7 +3791,10 @@ init_elim_table (void)
       ep->to = ep1->to;
       ep->can_eliminate = ep->can_eliminate_previous
 	= (CAN_ELIMINATE (ep->from, ep->to)
-	   && ! (ep->to == STACK_POINTER_REGNUM && frame_pointer_needed));
+	   && ! (ep->to == STACK_POINTER_REGNUM
+		 && frame_pointer_needed 
+		 && (! MAX_VECTORIZE_STACK_ALIGNMENT
+		     || ! stack_realign_fp)));
     }
 #else
   reg_eliminate[0].from = reg_eliminate_1[0].from;
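
The alignment bookkeeping that the generic patch above adds to
assign_stack_local, together with the stack_realign_fp and
stack_realign_drap macros from function.h, can be modelled in isolation.
The following is only a minimal sketch under stated assumptions: struct
align_state, request_alignment, realign_via_fp and realign_via_drap are
hypothetical names made up for illustration, not GCC code, and all
alignment values are taken to be in bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields the patch adds to struct
   function.  All alignments are in bits.  This is not GCC code.  */
struct align_state
{
  unsigned int needed;     /* models stack_alignment_needed */
  unsigned int used;       /* models stack_alignment_used */
  unsigned int estimated;  /* models stack_alignment_estimated */
  bool realign_processed;  /* models stack_realign_processed */
  bool realign_finalized;  /* models stack_realign_finalized */
  bool realign_needed;     /* models stack_realign_needed */
  bool need_drap;          /* models need_drap */
};

/* Record a request for a stack slot aligned to ALIGN_BITS and return
   the alignment that is actually granted.  */
static unsigned int
request_alignment (struct align_state *s, unsigned int align_bits)
{
  if (s->estimated < align_bits)
    {
      if (!s->realign_processed)
        /* The realign decision is still open: raise the estimate.  */
        s->estimated = align_bits;
      else
        {
          /* Larger requests must not arrive once finalized.  */
          assert (!s->realign_finalized);
          if (!s->realign_needed)
            /* No realignment will happen: fall back to the estimate.  */
            align_bits = s->estimated;
        }
    }
  if (s->needed < align_bits)
    s->needed = align_bits;
  if (s->used < s->needed)
    s->used = s->needed;
  return align_bits;
}

/* Models the stack_realign_fp macro.  */
static bool
realign_via_fp (const struct align_state *s)
{
  return s->realign_needed && !s->need_drap;
}

/* Models the stack_realign_drap macro.  */
static bool
realign_via_drap (const struct align_state *s)
{
  return s->realign_needed && s->need_drap;
}
```

In this model a request made before the realign decision raises the
estimate, while a larger request made afterwards in a function that will
not be realigned is reduced to the recorded estimate. The two predicates
are mutually exclusive, so a function that needs realignment uses either
the frame pointer or DRAP, never both.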

[-- Attachment #4: stack-align-x86-0411.patch --]
[-- Type: application/octet-stream, Size: 30484 bytes --]

Index: i386.h
===================================================================
--- i386.h	(.../trunk/gcc/config/i386)	(revision 134098)
+++ i386.h	(.../branches/stack/gcc/config/i386)	(revision 134150)
@@ -806,16 +806,32 @@ enum target_cpu_default
 /* Boundary (in *bits*) on which stack pointer should be aligned.  */
 #define STACK_BOUNDARY BITS_PER_WORD
 
+/* Stack boundary of the main function guaranteed by OS.  */
+#define MAIN_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+
+/* Stack boundary guaranteed by ABI.  */
+#define ABI_STACK_BOUNDARY (TARGET_64BIT ? 128 : 32)
+
 /* Boundary (in *bits*) on which the stack pointer prefers to be
    aligned; the compiler cannot rely on having this alignment.  */
 #define PREFERRED_STACK_BOUNDARY ix86_preferred_stack_boundary
 
-/* As of July 2001, many runtimes do not align the stack properly when
-   entering main.  This causes expand_main_function to forcibly align
-   the stack, which results in aligned frames for functions called from
-   main, though it does nothing for the alignment of main itself.  */
-#define FORCE_PREFERRED_STACK_BOUNDARY_IN_MAIN \
-  (ix86_preferred_stack_boundary > STACK_BOUNDARY && !TARGET_64BIT)
+/* It should be ABI_STACK_BOUNDARY.  But we set it to 128 bits for
+   both 32bit and 64bit, to support codes that need 128 bit stack
+   alignment for SSE instructions, but can't realign the stack.  */
+#define PREFERRED_STACK_BOUNDARY_DEFAULT 128
+
+/* 1 if -mstackrealign should be turned on by default.  It will
+   generate an alternate prologue and epilogue that realigns the
+   runtime stack if necessary.  This supports mixing codes that keep a
+   4-byte aligned stack, as specified by i386 psABI, with codes that
+   need a 16-byte aligned stack, as required by SSE instructions.  If
+   STACK_REALIGN_DEFAULT is 1 and PREFERRED_STACK_BOUNDARY_DEFAULT is
+   128, stacks for all functions may be realigned.  */
+#define STACK_REALIGN_DEFAULT 0
+
+/* Boundary (in *bits*) on which the incoming stack is aligned.  */
+#define INCOMING_STACK_BOUNDARY ix86_incoming_stack_boundary
 
 /* Target OS keeps a vector-aligned (128-bit, 16-byte) stack.  This is
    mandatory for the 64-bit ABI, and may or may not be true for other
@@ -842,6 +858,9 @@ enum target_cpu_default
 
 #define BIGGEST_ALIGNMENT 128
 
+/* Maximum stack alignment for vectorizer.  */
+#define MAX_VECTORIZE_STACK_ALIGNMENT BIGGEST_ALIGNMENT
+
 /* Decide whether a variable of mode MODE should be 128 bit aligned.  */
 #define ALIGN_MODE_128(MODE) \
  ((MODE) == XFmode || SSE_REG_MODE_P (MODE))
@@ -1251,7 +1270,7 @@ do {									\
    the pic register when possible.  The change is visible after the
    prologue has been emitted.  */
 
-#define REAL_PIC_OFFSET_TABLE_REGNUM  3
+#define REAL_PIC_OFFSET_TABLE_REGNUM  BX_REG
 
 #define PIC_OFFSET_TABLE_REGNUM				\
   ((TARGET_64BIT && ix86_cmodel == CM_SMALL_PIC)	\
@@ -1792,7 +1811,10 @@ typedef struct ix86_args {
    All other eliminations are valid.  */
 
 #define CAN_ELIMINATE(FROM, TO) \
-  ((TO) == STACK_POINTER_REGNUM ? !frame_pointer_needed : 1)
+  (stack_realign_fp \
+  ? ((FROM) == ARG_POINTER_REGNUM && (TO) == HARD_FRAME_POINTER_REGNUM) \
+    || ((FROM) == FRAME_POINTER_REGNUM && (TO) == STACK_POINTER_REGNUM) \
+  : ((TO) == STACK_POINTER_REGNUM ? !frame_pointer_needed : 1))
 
 /* Define the offset between two registers, one to be eliminated, and the other
    its replacement, at the start of a routine.  */
@@ -2348,6 +2370,7 @@ enum asm_dialect {
 
 extern enum asm_dialect ix86_asm_dialect;
 extern unsigned int ix86_preferred_stack_boundary;
+extern unsigned int ix86_incoming_stack_boundary;
 extern int ix86_branch_cost, ix86_section_threshold;
 
 /* Smallest class containing REGNO.  */
@@ -2449,7 +2472,6 @@ struct machine_function GTY(())
 {
   struct stack_local_entry *stack_locals;
   const char *some_ld_name;
-  rtx force_align_arg_pointer;
   int save_varrargs_registers;
   int accesses_prev_frame;
   int optimize_mode_switching[MAX_386_ENTITIES];
Index: i386.md
===================================================================
--- i386.md	(.../trunk/gcc/config/i386)	(revision 134098)
+++ i386.md	(.../branches/stack/gcc/config/i386)	(revision 134150)
@@ -232,6 +232,7 @@
   [(AX_REG			 0)
    (DX_REG			 1)
    (CX_REG			 2)
+   (BX_REG			 3)
    (SI_REG			 4)
    (DI_REG			 5)
    (BP_REG			 6)
@@ -241,6 +242,7 @@
    (FPCR_REG			19)
    (R10_REG			39)
    (R11_REG			40)
+   (R13_REG			42)
   ])
 
 ;; Insns whose names begin with "x86_" are emitted by gen_FOO calls
Index: i386.opt
===================================================================
--- i386.opt	(.../trunk/gcc/config/i386)	(revision 134098)
+++ i386.opt	(.../branches/stack/gcc/config/i386)	(revision 134150)
@@ -78,6 +78,10 @@ mfancy-math-387
 Target RejectNegative Report InverseMask(NO_FANCY_MATH_387, USE_FANCY_MATH_387)
 Generate sin, cos, sqrt for FPU
 
+mforce-drap
+Target Report Var(ix86_force_drap)
+Always use Dynamic Realigned Argument Pointer (DRAP) to realign stack.
+
 mfp-ret-in-387
 Target Report Mask(FLOAT_RETURNS)
 Return values of functions in FPU registers
@@ -134,6 +138,10 @@ mpreferred-stack-boundary=
 Target RejectNegative Joined Var(ix86_preferred_stack_boundary_string)
 Attempt to keep stack aligned to this power of 2
 
+mincoming-stack-boundary=
+Target RejectNegative Joined Var(ix86_incoming_stack_boundary_string)
+Assume incoming stack aligned to this power of 2
+
 mpush-args
 Target Report InverseMask(NO_PUSH_ARGS, PUSH_ARGS)
 Use push instructions to save outgoing arguments
@@ -159,7 +167,7 @@ Target RejectNegative Mask(SSEREGPARM)
 Use SSE register passing conventions for SF and DF mode
 
 mstackrealign
-Target Report Var(ix86_force_align_arg_pointer)
+Target Report Var(ix86_force_align_arg_pointer) Init(-1)
 Realign stack in prologue
 
 mstack-arg-probe
Index: i386.c
===================================================================
--- i386.c	(.../trunk/gcc/config/i386)	(revision 134098)
+++ i386.c	(.../branches/stack/gcc/config/i386)	(revision 134150)
@@ -1694,11 +1694,22 @@ static int ix86_regparm;
 
 /* -mstackrealign option */
 extern int ix86_force_align_arg_pointer;
-static const char ix86_force_align_arg_pointer_string[] = "force_align_arg_pointer";
+static const char ix86_force_align_arg_pointer_string[]
+  = "force_align_arg_pointer";
 
 /* Preferred alignment for stack boundary in bits.  */
 unsigned int ix86_preferred_stack_boundary;
 
+/* Alignment for incoming stack boundary in bits specified at
+   command line.  */
+static unsigned int ix86_user_incoming_stack_boundary;
+
+/* Default alignment for incoming stack boundary in bits.  */
+static unsigned int ix86_default_incoming_stack_boundary;
+
+/* Alignment for incoming stack boundary in bits.  */
+unsigned int ix86_incoming_stack_boundary;
+
 /* Values 1-5: see jump.c */
 int ix86_branch_cost;
 
@@ -2627,11 +2638,9 @@ override_options (void)
   if (TARGET_SSE4_2 || TARGET_ABM)
     x86_popcnt = true;
 
-  /* Validate -mpreferred-stack-boundary= value, or provide default.
-     The default of 128 bits is for Pentium III's SSE __m128.  We can't
-     change it because of optimize_size.  Otherwise, we can't mix object
-     files compiled with -Os and -On.  */
-  ix86_preferred_stack_boundary = 128;
+  /* Validate -mpreferred-stack-boundary= value or default it to
+     PREFERRED_STACK_BOUNDARY_DEFAULT.  */
+  ix86_preferred_stack_boundary = PREFERRED_STACK_BOUNDARY_DEFAULT;
   if (ix86_preferred_stack_boundary_string)
     {
       i = atoi (ix86_preferred_stack_boundary_string);
@@ -2642,6 +2651,31 @@ override_options (void)
 	ix86_preferred_stack_boundary = (1 << i) * BITS_PER_UNIT;
     }
 
+  /* Set the default value for -mstackrealign.  */
+  if (ix86_force_align_arg_pointer == -1)
+    ix86_force_align_arg_pointer = STACK_REALIGN_DEFAULT;
+
+  /* Validate -mincoming-stack-boundary= value or default it to
+     ABI_STACK_BOUNDARY/PREFERRED_STACK_BOUNDARY.  */
+  if (ix86_force_align_arg_pointer)
+    ix86_default_incoming_stack_boundary = ABI_STACK_BOUNDARY;
+  else
+    ix86_default_incoming_stack_boundary = PREFERRED_STACK_BOUNDARY;
+  ix86_incoming_stack_boundary = ix86_default_incoming_stack_boundary;
+  if (ix86_incoming_stack_boundary_string)
+    {
+      i = atoi (ix86_incoming_stack_boundary_string);
+      if (i < (TARGET_64BIT ? 4 : 2) || i > 12)
+	error ("-mincoming-stack-boundary=%d is not between %d and 12",
+	       i, TARGET_64BIT ? 4 : 2);
+      else
+	{
+	  ix86_user_incoming_stack_boundary = (1 << i) * BITS_PER_UNIT;
+	  ix86_incoming_stack_boundary
+	    = ix86_user_incoming_stack_boundary;
+	}
+    }
+
   /* Accept -msseregparm only if at least SSE support is enabled.  */
   if (TARGET_SSEREGPARM
       && ! TARGET_SSE)
@@ -3081,11 +3115,6 @@ ix86_function_ok_for_sibcall (tree decl,
       && ix86_function_regparm (TREE_TYPE (decl), NULL) >= 3)
     return false;
 
-  /* If we forced aligned the stack, then sibcalling would unalign the
-     stack, which may break the called function.  */
-  if (cfun->machine->force_align_arg_pointer)
-    return false;
-
   /* Otherwise okay.  That also includes certain types of indirect calls.  */
   return true;
 }
@@ -3136,15 +3165,6 @@ ix86_handle_cconv_attribute (tree *node,
 	  *no_add_attrs = true;
 	}
 
-      if (!TARGET_64BIT
-	  && lookup_attribute (ix86_force_align_arg_pointer_string,
-			       TYPE_ATTRIBUTES (*node))
-	  && compare_tree_int (cst, REGPARM_MAX-1))
-	{
-	  error ("%s functions limited to %d register parameters",
-		 ix86_force_align_arg_pointer_string, REGPARM_MAX-1);
-	}
-
       return NULL_TREE;
     }
 
@@ -3302,8 +3322,7 @@ ix86_function_regparm (const_tree type, 
 	  /* We can't use regparm(3) for nested functions as these use
 	     static chain pointer in third argument.  */
 	  if (local_regparm == 3
-	      && (decl_function_context (decl)
-                  || ix86_force_align_arg_pointer)
+	      && decl_function_context (decl)
 	      && !DECL_NO_STATIC_CHAIN (decl))
 	    local_regparm = 2;
 
@@ -3312,13 +3331,11 @@ ix86_function_regparm (const_tree type, 
 	     the callee DECL_STRUCT_FUNCTION is gone, so we fall back to
 	     scanning the attributes for the self-realigning property.  */
 	  f = DECL_STRUCT_FUNCTION (decl);
-	  if (local_regparm == 3
-	      && (f ? !!f->machine->force_align_arg_pointer
-		  : !!lookup_attribute (ix86_force_align_arg_pointer_string,
-					TYPE_ATTRIBUTES (TREE_TYPE (decl)))))
-	    local_regparm = 2;
+          /* Since the current internal arg pointer won't conflict
+	     with parameter passing regs, there is no need to change
+	     stack realignment and adjust the regparm number.
 
-	  /* Each fixed register usage increases register pressure,
+	     Each fixed register usage increases register pressure,
 	     so less registers should be used for argument passing.
 	     This functionality can be overriden by an explicit
 	     regparm value.  */
@@ -5053,14 +5070,6 @@ setup_incoming_varargs_64 (CUMULATIVE_AR
 
   /* Indicate to allocate space on the stack for varargs save area.  */
   ix86_save_varrargs_registers = 1;
-  /* We need 16-byte stack alignment to save SSE registers.  If user
-     asked for lower preferred_stack_boundary, lets just hope that he knows
-     what he is doing and won't varargs SSE values.
-
-     We also may end up assuming that only 64bit values are stored in SSE
-     register let some floating point program work.  */
-  if (ix86_preferred_stack_boundary >= BIGGEST_ALIGNMENT)
-    cfun->stack_alignment_needed = BIGGEST_ALIGNMENT;
 
   save_area = frame_pointer_rtx;
   set = get_varargs_alias_set ();
@@ -5228,7 +5237,7 @@ ix86_va_start (tree valist, rtx nextarg)
 
   /* Find the overflow area.  */
   type = TREE_TYPE (ovf);
-  t = make_tree (type, virtual_incoming_args_rtx);
+  t = make_tree (type, crtl->args.internal_arg_pointer);
   if (words != 0)
     t = build2 (POINTER_PLUS_EXPR, type, t,
 	        size_int (words * UNITS_PER_WORD));
@@ -5996,9 +6005,14 @@ ix86_select_alt_pic_regnum (void)
   if (current_function_is_leaf && !current_function_profile
       && !ix86_current_function_calls_tls_descriptor)
     {
-      int i;
+      int i, drap;
+      /* Can't use the same register for both PIC and DRAP.  */
+      if (crtl->drap_reg)
+	drap = REGNO (crtl->drap_reg);
+      else
+	drap = -1;
       for (i = 2; i >= 0; --i)
-        if (!df_regs_ever_live_p (i))
+        if (i != drap && !df_regs_ever_live_p (i))
 	  return i;
     }
 
@@ -6034,8 +6048,8 @@ ix86_save_reg (unsigned int regno, int m
 	}
     }
 
-  if (cfun->machine->force_align_arg_pointer
-      && regno == REGNO (cfun->machine->force_align_arg_pointer))
+  if (crtl->drap_reg
+      && regno == REGNO (crtl->drap_reg))
     return 1;
 
   return (df_regs_ever_live_p (regno)
@@ -6101,6 +6115,10 @@ ix86_compute_frame_layout (struct ix86_f
   stack_alignment_needed = cfun->stack_alignment_needed / BITS_PER_UNIT;
   preferred_alignment = cfun->preferred_stack_boundary / BITS_PER_UNIT;
 
+  gcc_assert (!size || stack_alignment_needed);
+  gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
+  gcc_assert (preferred_alignment <= stack_alignment_needed);
+
   /* During reload iteration the amount of registers saved can change.
      Recompute the value as needed.  Do not recompute when amount of registers
      didn't change as reload does multiple calls to the function and does not
@@ -6143,18 +6161,9 @@ ix86_compute_frame_layout (struct ix86_f
 
   frame->hard_frame_pointer_offset = offset;
 
-  /* Do some sanity checking of stack_alignment_needed and
-     preferred_alignment, since i386 port is the only using those features
-     that may break easily.  */
-
-  gcc_assert (!size || stack_alignment_needed);
-  gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
-  gcc_assert (preferred_alignment <= PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT);
-  gcc_assert (stack_alignment_needed
-	      <= PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT);
-
-  if (stack_alignment_needed < STACK_BOUNDARY / BITS_PER_UNIT)
-    stack_alignment_needed = STACK_BOUNDARY / BITS_PER_UNIT;
+  /* Set offset to aligned because the realigned frame starts from here.  */
+  if (stack_realign_fp)
+    offset = (offset + stack_alignment_needed -1) & -stack_alignment_needed;
 
   /* Register save area */
   offset += frame->nregs * UNITS_PER_WORD;
@@ -6320,35 +6329,129 @@ pro_epilogue_adjust_stack (rtx dest, rtx
     RTX_FRAME_RELATED_P (insn) = 1;
 }
 
+/* Find an available register to be used as the dynamic realign argument
+   pointer register.  Such a register will be written in the prologue and
+   used at the beginning of the body, so it must not be
+	1. parameter passing register.
+	2. GOT pointer.
+   For i386, we use CX if it is not used to pass parameter. Otherwise
+   we just pick DI.
+   For x86_64, we just pick R13 directly.
+
+   Return: the regno of the chosen register.  */
+
+static unsigned int 
+find_drap_reg (void)
+{
+  int param_reg_num;
+
+  if (TARGET_64BIT)
+    return R13_REG;
+
+  /* Use DI for nested function or function need static chain.  */
+  if (decl_function_context (cfun->decl)
+      && !DECL_NO_STATIC_CHAIN (cfun->decl))
+    return DI_REG;
+
+  if (cfun->tail_call_emit)
+    return DI_REG;
+
+  param_reg_num = ix86_function_regparm (TREE_TYPE (cfun->decl),
+					 cfun->decl);
+
+  if (param_reg_num <= 2
+      && !lookup_attribute ("fastcall",
+			    TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl))))
+    return CX_REG;
+
+  return DI_REG;
+}
+
 /* Handle the TARGET_INTERNAL_ARG_POINTER hook.  */
 
 static rtx
 ix86_internal_arg_pointer (void)
 {
-  bool has_force_align_arg_pointer =
-    (0 != lookup_attribute (ix86_force_align_arg_pointer_string,
-			    TYPE_ATTRIBUTES (TREE_TYPE (current_function_decl))));
-  if ((FORCE_PREFERRED_STACK_BOUNDARY_IN_MAIN
-       && DECL_NAME (current_function_decl)
-       && MAIN_NAME_P (DECL_NAME (current_function_decl))
-       && DECL_FILE_SCOPE_P (current_function_decl))
-      || ix86_force_align_arg_pointer
-      || has_force_align_arg_pointer)
-    {
-      /* Nested functions can't realign the stack due to a register
-	 conflict.  */
-      if (DECL_CONTEXT (current_function_decl)
-	  && TREE_CODE (DECL_CONTEXT (current_function_decl)) == FUNCTION_DECL)
-	{
-	  if (ix86_force_align_arg_pointer)
-	    warning (0, "-mstackrealign ignored for nested functions");
-	  if (has_force_align_arg_pointer)
-	    error ("%s not supported for nested functions",
-		   ix86_force_align_arg_pointer_string);
-	  return virtual_incoming_args_rtx;
-	}
-      cfun->machine->force_align_arg_pointer = gen_rtx_REG (Pmode, CX_REG);
-      return copy_to_reg (cfun->machine->force_align_arg_pointer);
+  /* If called in "expand" pass, currently_expanding_to_rtl will
+     be true */
+  if (currently_expanding_to_rtl) 
+    return virtual_incoming_args_rtx;
+
+  /* Prefer the one specified at command line. */
+  ix86_incoming_stack_boundary 
+    = (ix86_user_incoming_stack_boundary
+       ? ix86_user_incoming_stack_boundary
+       : ix86_default_incoming_stack_boundary);
+
+  /* Current stack realign doesn't support eh_return.  Assume a
+     function that calls eh_return is aligned.  There will be a sanity
+     check later if stack realign happens together with eh_return.  */
+  if (current_function_calls_eh_return)
+    ix86_incoming_stack_boundary = PREFERRED_STACK_BOUNDARY;
+
+  /* Incoming stack alignment can be changed on individual functions
+     via force_align_arg_pointer attribute.  We use the smallest
+     incoming stack boundary.  */
+  if (ix86_incoming_stack_boundary > ABI_STACK_BOUNDARY
+      && lookup_attribute (ix86_force_align_arg_pointer_string,
+			   TYPE_ATTRIBUTES (TREE_TYPE (current_function_decl))))
+    ix86_incoming_stack_boundary = ABI_STACK_BOUNDARY;
+
+  /* Stack at entrance of main is aligned by runtime.  We use the
+     smallest incoming stack boundary. */
+  if (ix86_incoming_stack_boundary > MAIN_STACK_BOUNDARY
+      && DECL_NAME (current_function_decl)
+      && MAIN_NAME_P (DECL_NAME (current_function_decl))
+      && DECL_FILE_SCOPE_P (current_function_decl))
+    ix86_incoming_stack_boundary = MAIN_STACK_BOUNDARY;
+
+  gcc_assert (cfun->stack_alignment_needed 
+              <= cfun->stack_alignment_estimated);
+
+  /* x86_64 vararg needs 16byte stack alignment for register save
+     area.  */
+  if (TARGET_64BIT
+      && current_function_stdarg
+      && cfun->stack_alignment_estimated < 128)
+    cfun->stack_alignment_estimated = 128;
+
+  /* Update cfun->stack_alignment_estimated and use it later to align
+     stack.  FIXME: How to optimize for leaf function?  */
+  if (PREFERRED_STACK_BOUNDARY > cfun->stack_alignment_estimated)
+    cfun->stack_alignment_estimated = PREFERRED_STACK_BOUNDARY;
+  if (PREFERRED_STACK_BOUNDARY > cfun->stack_alignment_needed)
+    cfun->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
+
+  cfun->stack_realign_needed
+    = ix86_incoming_stack_boundary < cfun->stack_alignment_estimated;
+
+  cfun->stack_realign_processed = true;
+
+  if (ix86_force_drap
+      || !ACCUMULATE_OUTGOING_ARGS)
+    cfun->need_drap = true;
+
+  if (stack_realign_drap)
+    {
+      /* Assign DRAP to vDRAP and return vDRAP.  */
+      unsigned int regno = find_drap_reg ();
+      rtx drap_vreg;
+      rtx arg_ptr;
+      rtx seq;
+
+      if (regno != CX_REG)
+	cfun->save_param_ptr_reg = true;
+
+      arg_ptr = gen_rtx_REG (Pmode, regno);
+      crtl->drap_reg = arg_ptr;
+
+      start_sequence ();
+      drap_vreg = copy_to_reg(arg_ptr);
+      seq = get_insns ();
+      end_sequence ();
+      
+      emit_insn_before (seq, NEXT_INSN (entry_of_function ()));
+      return drap_vreg;
     }
   else
     return virtual_incoming_args_rtx;
@@ -6387,53 +6490,64 @@ ix86_expand_prologue (void)
   bool pic_reg_used;
   struct ix86_frame frame;
   HOST_WIDE_INT allocate;
+  rtx (*gen_andsp) (rtx, rtx, rtx);
+
+  /* DRAP should not coexist with stack_realign_fp */
+  gcc_assert (!(crtl->drap_reg && stack_realign_fp));
+
+  /* Check if stack realign is really needed after reload, and
+     store the result in cfun.  */
+  cfun->stack_realign_really = (ix86_incoming_stack_boundary
+				< (current_function_is_leaf
+				   ? cfun->stack_alignment_used
+				   : cfun->stack_alignment_needed));
+
+  cfun->stack_realign_finalized = true;
 
   ix86_compute_frame_layout (&frame);
 
-  if (cfun->machine->force_align_arg_pointer)
+  /* Emit prologue code to adjust stack alignment and set up DRAP, in case
+     DRAP is needed and stack realignment is really needed after reload.  */
+  if (crtl->drap_reg && cfun->stack_realign_really)
     {
       rtx x, y;
+      int align_bytes = cfun->stack_alignment_needed / BITS_PER_UNIT;
+      int param_ptr_offset = (cfun->save_param_ptr_reg
+			      ?  STACK_BOUNDARY / BITS_PER_UNIT : 0);
+
+      gcc_assert (stack_realign_drap);
 
       /* Grab the argument pointer.  */
-      x = plus_constant (stack_pointer_rtx, 4);
-      y = cfun->machine->force_align_arg_pointer;
-      insn = emit_insn (gen_rtx_SET (VOIDmode, y, x));
-      RTX_FRAME_RELATED_P (insn) = 1;
+      x = plus_constant (stack_pointer_rtx, 
+                         (STACK_BOUNDARY / BITS_PER_UNIT 
+			  + param_ptr_offset));
+      y = crtl->drap_reg;
+
+      /* Only need to push parameter pointer reg if it is caller
+	 saved reg */
+      if (cfun->save_param_ptr_reg)
+	{
+	  /* Push arg pointer reg */
+	  insn = emit_insn (gen_push (y));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	}
 
-      /* The unwind info consists of two parts: install the fafp as the cfa,
-	 and record the fafp as the "save register" of the stack pointer.
-	 The later is there in order that the unwinder can see where it
-	 should restore the stack pointer across the and insn.  */
-      x = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, const0_rtx), UNSPEC_DEF_CFA);
-      x = gen_rtx_SET (VOIDmode, y, x);
-      RTX_FRAME_RELATED_P (x) = 1;
-      y = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, stack_pointer_rtx),
-			  UNSPEC_REG_SAVE);
-      y = gen_rtx_SET (VOIDmode, cfun->machine->force_align_arg_pointer, y);
-      RTX_FRAME_RELATED_P (y) = 1;
-      x = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, x, y));
-      x = gen_rtx_EXPR_LIST (REG_FRAME_RELATED_EXPR, x, NULL);
-      REG_NOTES (insn) = x;
+      insn = emit_insn (gen_rtx_SET (VOIDmode, y, x));
+      RTX_FRAME_RELATED_P (insn) = 1; 
 
+      gen_andsp = TARGET_64BIT ? gen_anddi3 : gen_andsi3;
       /* Align the stack.  */
-      emit_insn (gen_andsi3 (stack_pointer_rtx, stack_pointer_rtx,
-			     GEN_INT (-16)));
+      insn = emit_insn ((*gen_andsp) (stack_pointer_rtx,
+				  stack_pointer_rtx,
+				  GEN_INT (-align_bytes)));
+      RTX_FRAME_RELATED_P (insn) = 1;
 
-      /* And here we cheat like madmen with the unwind info.  We force the
-	 cfa register back to sp+4, which is exactly what it was at the
-	 start of the function.  Re-pushing the return address results in
-	 the return at the same spot relative to the cfa, and thus is
-	 correct wrt the unwind info.  */
-      x = cfun->machine->force_align_arg_pointer;
-      x = gen_frame_mem (Pmode, plus_constant (x, -4));
+      x = crtl->drap_reg;
+      x = gen_frame_mem (Pmode,
+                         plus_constant (x,
+					-(STACK_BOUNDARY / BITS_PER_UNIT)));
       insn = emit_insn (gen_push (x));
       RTX_FRAME_RELATED_P (insn) = 1;
-
-      x = GEN_INT (4);
-      x = gen_rtx_UNSPEC (VOIDmode, gen_rtvec (1, x), UNSPEC_DEF_CFA);
-      x = gen_rtx_SET (VOIDmode, stack_pointer_rtx, x);
-      x = gen_rtx_EXPR_LIST (REG_FRAME_RELATED_EXPR, x, NULL);
-      REG_NOTES (insn) = x;
     }
 
   /* Note: AT&T enter does NOT have reversed args.  Enter is probably
@@ -6448,6 +6562,19 @@ ix86_expand_prologue (void)
       RTX_FRAME_RELATED_P (insn) = 1;
     }
 
+  if (stack_realign_fp && cfun->stack_realign_really)
+    {
+      int align_bytes = cfun->stack_alignment_needed / BITS_PER_UNIT;
+      gcc_assert (align_bytes > STACK_BOUNDARY / BITS_PER_UNIT);
+
+      gen_andsp = TARGET_64BIT ? gen_anddi3 : gen_andsi3;
+      /* Align the stack.  */
+      insn = emit_insn ((*gen_andsp) (stack_pointer_rtx,
+				      stack_pointer_rtx,
+				      GEN_INT (-align_bytes)));
+      RTX_FRAME_RELATED_P (insn) = 1;
+    }
+
   allocate = frame.to_allocate;
 
   if (!frame.save_regs_using_mov)
@@ -6462,7 +6589,9 @@ ix86_expand_prologue (void)
      a red zone location */
   if (TARGET_RED_ZONE && frame.save_regs_using_mov
       && (! TARGET_STACK_PROBE || allocate < CHECK_STACK_LIMIT))
-    ix86_emit_save_regs_using_mov (frame_pointer_needed ? hard_frame_pointer_rtx
+    ix86_emit_save_regs_using_mov ((frame_pointer_needed
+				     && !cfun->stack_realign_really) 
+                                   ? hard_frame_pointer_rtx
 				   : stack_pointer_rtx,
 				   -frame.nregs * UNITS_PER_WORD);
 
@@ -6521,8 +6650,11 @@ ix86_expand_prologue (void)
       && !(TARGET_RED_ZONE
          && (! TARGET_STACK_PROBE || allocate < CHECK_STACK_LIMIT)))
     {
-      if (!frame_pointer_needed || !frame.to_allocate)
-        ix86_emit_save_regs_using_mov (stack_pointer_rtx, frame.to_allocate);
+      if (!frame_pointer_needed
+	  || !frame.to_allocate
+	  || cfun->stack_realign_really)
+        ix86_emit_save_regs_using_mov (stack_pointer_rtx,
+				       frame.to_allocate);
       else
         ix86_emit_save_regs_using_mov (hard_frame_pointer_rtx,
 				       -frame.nregs * UNITS_PER_WORD);
@@ -6572,6 +6704,16 @@ ix86_expand_prologue (void)
 	emit_insn (gen_prologue_use (pic_offset_table_rtx));
       emit_insn (gen_blockage ());
     }
+
+  if (crtl->drap_reg && !cfun->stack_realign_really)
+    {
+      /* vDRAP is set up, but after reload it turns out stack realign
+         isn't necessary.  Here we emit prologue code to set up DRAP
+         without the stack realign adjustment.  */
+      int drap_bp_offset = STACK_BOUNDARY / BITS_PER_UNIT * 2;
+      rtx x = plus_constant (hard_frame_pointer_rtx, drap_bp_offset);
+      insn = emit_insn (gen_rtx_SET (VOIDmode, crtl->drap_reg, x));
+    }
 }
 
 /* Emit code to restore saved registers using MOV insns.  First register
@@ -6610,7 +6752,10 @@ void
 ix86_expand_epilogue (int style)
 {
   int regno;
-  int sp_valid = !frame_pointer_needed || current_function_sp_is_unchanging;
+ /* When stack realign may happen, SP must be valid. */
+  int sp_valid = (!frame_pointer_needed
+		  || current_function_sp_is_unchanging
+		  || (stack_realign_fp && cfun->stack_realign_really));
   struct ix86_frame frame;
   HOST_WIDE_INT offset;
 
@@ -6647,11 +6792,16 @@ ix86_expand_epilogue (int style)
     {
       /* Restore registers.  We can use ebp or esp to address the memory
 	 locations.  If both are available, default to ebp, since offsets
-	 are known to be small.  Only exception is esp pointing directly to the
-	 end of block of saved registers, where we may simplify addressing
-	 mode.  */
-
-      if (!frame_pointer_needed || (sp_valid && !frame.to_allocate))
+	 are known to be small.  Only exception is esp pointing directly
+	 to the end of block of saved registers, where we may simplify
+	 addressing mode.  
+
+	 If we are realigning stack with bp and sp, regs restore can't
+	 be addressed by bp. sp must be used instead.  */
+
+      if (!frame_pointer_needed
+	  || (sp_valid && !frame.to_allocate) 
+	  || (stack_realign_fp && cfun->stack_realign_really))
 	ix86_emit_restore_regs_using_mov (stack_pointer_rtx,
 					  frame.to_allocate, style == 2);
       else
@@ -6663,6 +6813,10 @@ ix86_expand_epilogue (int style)
 	{
 	  rtx tmp, sa = EH_RETURN_STACKADJ_RTX;
 
+	  if (cfun->stack_realign_really)
+	    {
+	      error("Stack realign has conflict with eh_return");
+	    }
 	  if (frame_pointer_needed)
 	    {
 	      tmp = gen_rtx_PLUS (Pmode, hard_frame_pointer_rtx, sa);
@@ -6706,10 +6860,16 @@ ix86_expand_epilogue (int style)
   else
     {
       /* First step is to deallocate the stack frame so that we can
-	 pop the registers.  */
+	 pop the registers.
+
+	 If we realign stack with frame pointer, then stack pointer
+         won't be able to recover via lea $offset(%bp), %sp, because
+         there is a padding area between bp and sp for realign. 
+         "add $to_allocate, %sp" must be used instead.  */
       if (!sp_valid)
 	{
 	  gcc_assert (frame_pointer_needed);
+          gcc_assert (!(stack_realign_fp && cfun->stack_realign_really));
 	  pro_epilogue_adjust_stack (stack_pointer_rtx,
 				     hard_frame_pointer_rtx,
 				     GEN_INT (offset), style);
@@ -6732,18 +6892,47 @@ ix86_expand_epilogue (int style)
 	     able to grok it fast.  */
 	  if (TARGET_USE_LEAVE)
 	    emit_insn (TARGET_64BIT ? gen_leave_rex64 () : gen_leave ());
-	  else if (TARGET_64BIT)
-	    emit_insn (gen_popdi1 (hard_frame_pointer_rtx));
-	  else
-	    emit_insn (gen_popsi1 (hard_frame_pointer_rtx));
+	  else 
+            {
+              /* When stack realignment really happens and leave is
+                 not used, the stack pointer must be recovered from
+                 the hard frame pointer.  */
+              if (stack_realign_fp && cfun->stack_realign_really)
+		pro_epilogue_adjust_stack (stack_pointer_rtx,
+					   hard_frame_pointer_rtx,
+					   const0_rtx, style);
+              if (TARGET_64BIT)
+                emit_insn (gen_popdi1 (hard_frame_pointer_rtx));
+              else
+                emit_insn (gen_popsi1 (hard_frame_pointer_rtx));
+            }
 	}
     }
 
-  if (cfun->machine->force_align_arg_pointer)
+  if (crtl->drap_reg && cfun->stack_realign_really)
     {
-      emit_insn (gen_addsi3 (stack_pointer_rtx,
-			     cfun->machine->force_align_arg_pointer,
-			     GEN_INT (-4)));
+      int param_ptr_offset = (cfun->save_param_ptr_reg
+			      ? STACK_BOUNDARY / BITS_PER_UNIT : 0);
+      gcc_assert (stack_realign_drap);
+      if (TARGET_64BIT)
+        {
+          emit_insn (gen_adddi3 (stack_pointer_rtx,
+				 crtl->drap_reg,
+				 GEN_INT (-(STACK_BOUNDARY / BITS_PER_UNIT
+					    + param_ptr_offset))));
+          if (cfun->save_param_ptr_reg)
+            emit_insn (gen_popdi1 (crtl->drap_reg));
+        }
+      else
+        {
+          emit_insn (gen_addsi3 (stack_pointer_rtx,
+				 crtl->drap_reg,
+				 GEN_INT (-(STACK_BOUNDARY / BITS_PER_UNIT 
+					    + param_ptr_offset))));
+          if (cfun->save_param_ptr_reg)
+            emit_insn (gen_popsi1 (crtl->drap_reg));
+        }
+      
     }
 
   /* Sibcall epilogues don't want a return instruction.  */

^ permalink raw reply	[flat|nested] 26+ messages in thread
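
For readers following the i386.c hunks above: the frame layout code rounds
an offset up with "(offset + align - 1) & -align", the prologue rounds the
stack pointer down with an AND against -align_bytes, and
-mincoming-stack-boundary=N is converted to bits as (1 << N) * BITS_PER_UNIT.
The following is only a stand-alone sketch of that arithmetic, not code from
the patch.  The function names are hypothetical, and BITS_PER_UNIT is assumed
to be 8, as on i386/x86_64.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 8 bits per addressable unit, as on i386/x86_64.  */
#define SKETCH_BITS_PER_UNIT 8

/* Round OFFSET up to the next multiple of ALIGN (a power of two).
   Same idiom as the stack_realign_fp case in ix86_compute_frame_layout.  */
static long
align_up (long offset, long align)
{
  return (offset + align - 1) & -align;
}

/* Round SP down to a multiple of ALIGN (a power of two).  This is the
   effect of the "and $-ALIGN, %sp" insn emitted in the prologue.  */
static uintptr_t
align_down (uintptr_t sp, uintptr_t align)
{
  return sp & ~(align - 1);
}

/* Convert the log2 byte count given to -mincoming-stack-boundary=N
   into a boundary in bits.  */
static unsigned int
boundary_bits (int n)
{
  return (1u << n) * SKETCH_BITS_PER_UNIT;
}
```

With these definitions, align_up (20, 16) gives 32, align_down (0x1003c, 16)
gives 0x10030, and boundary_bits (4) gives 128, the 16-byte boundary.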

* Re: [RFA]: Merge stack alignment branch
  2008-04-10 10:42 ` Ye, Joey
@ 2008-04-11 13:27   ` Jan Hubicka
  2008-04-12  3:39     ` H.J. Lu
                       ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Jan Hubicka @ 2008-04-11 13:27 UTC (permalink / raw)
  To: Ye, Joey; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

Thank you for breaking this up!  It seems to me that the generic part of
the patch still contains several different infrastructure changes (that
is, stack frame alignment tracking, DRAP pointer support, some of the
incoming_arg_rtx related changes and the actual target macro bits).

I would still suggest breaking the generic bits up further and submitting
them one by one so RTL maintainers can handle them more easily.


Index: flags.h
===================================================================
--- flags.h	(.../trunk/gcc)	(revision 134098)
+++ flags.h	(.../branches/stack/gcc)	(revision 134141)
@@ -223,12 +223,6 @@ extern int flag_dump_rtl_in_asm;
 \f
 /* Other basic status info about current function.  */
 
-/* Nonzero means current function must be given a frame pointer.
-   Set in stmt.c if anything is allocated on the stack there.
-   Set in reload1.c if anything is allocated on the stack there.  */
-
-extern int frame_pointer_needed;

frame_pointer_needed should IMO go into crtl, instead of cfun.  It is
computed only at expansion time, right?

Index: builtins.c
===================================================================
--- builtins.c	(.../trunk/gcc)	(revision 134098)
+++ builtins.c	(.../branches/stack/gcc)	(revision 134141)
@@ -740,7 +740,7 @@ expand_builtin_setjmp_receiver (rtx rece
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (crtl->args.internal_arg_pointer,

Shouldn't the move into virtual_incoming_args_rtx be eliminated back to
an internal_arg_pointer store anyway?
I was under the impression that virtual_incoming_args_rtx should expand
to a direct stack frame reference when internal_arg_pointer is unused and
to internal_arg_pointer otherwise...
 
+  /* DRAP is needed for stack realign if longjmp is expanded to current 
+     function  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+    cfun->need_drap = true;

Similarly, need_drap is an RTL property, so it should live in crtl.  I
know that most of the flags are still in cfun; I plan to move them soon,
once the patch already posted in this series is reviewed.  Sorry for all
the conflicts I must've caused.

How exactly longjmp/setjmp machinery imply need for DRAP in functions
not needing alignment otherwise?
Index: global.c
===================================================================
--- global.c	(.../trunk/gcc)	(revision 134098)
+++ global.c	(.../branches/stack/gcc)	(revision 134141)
@@ -247,10 +247,20 @@ compute_regsets (HARD_REG_SET *elim_set,
   static const struct {const int from, to; } eliminables[] = ELIMINABLE_REGS;
   size_t i;
 #endif
+
+  /* FIXME: If EXIT_IGNORE_STACK is set, we will not save and restore
+     sp for alloca.  So we can't eliminate the frame pointer in that
+     case.  At some point, we should improve this by emitting the
+     sp-adjusting insns for this case.  */
   int need_fp
     = (! flag_omit_frame_pointer
        || (current_function_calls_alloca && EXIT_IGNORE_STACK)
-       || FRAME_POINTER_REQUIRED);
+       || FRAME_POINTER_REQUIRED
+       || current_function_accesses_prior_frames
+       || cfun->stack_realign_needed);
+
+  frame_pointer_needed = need_fp;
+  cfun->need_frame_pointer_set = 1;

Originally we decided whether we need a frame pointer during reload.
It was always my impression that this is done because we want to be able
to decide later, during reload iterations, that the frame pointer is
required, via the FRAME_POINTER_REQUIRED macro on some architectures.

Are you sure that you can always safely decide on the frame pointer
beforehand?
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < alignment_in_bits)
+	{
+          if (!cfun->stack_realign_processed)
+            cfun->stack_alignment_estimated = alignment_in_bits;
+          else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized);
+	      if (!cfun->stack_realign_needed)
+		{
+		  /* It is OK to reduce the alignment as long as the
+		     requested size is 0 or the estimated stack
+		     alignment >= mode alignment.  */

So basically the purpose of stack_alignment_estimated is to avoid
wasting stack frame space on padding when LOCAL_ALIGNMENT requests an
alignment greater than STACK_BOUNDARY and we believe that the function
won't need the DRAP code?

The comment should probably mention why we would ever request a size of 0.
I, at least, don't know ;)
+		  gcc_assert (size == 0
+			      || (cfun->stack_alignment_estimated
+				  >= mode_alignment));
+		  alignment_in_bits = cfun->stack_alignment_estimated;
+		  alignment = alignment_in_bits / BITS_PER_UNIT;
+		}
+	    }
+	}
+    }
+  else
+    {
+      /* Ignore alignment we can't do with expected alignment of the
+	 boundary.  */
+      if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
+	alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;

It seems that the alignment_in_bits recomputation is missing here.
@@ -2968,6 +3010,20 @@ assign_parms (tree fndecl)

 	  continue;
 	}
 
+      /* Estimate stack alignment from parameter alignment */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT)
+        {
+          unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
+						      data.passed_type);
+	  if (TYPE_ALIGN (data.nominal_type) > align)
+	    align = TYPE_ALIGN (data.passed_type);
+	  if (cfun->stack_alignment_estimated < align)
+	    {
+	      gcc_assert (!cfun->stack_realign_processed);
+	      cfun->stack_alignment_estimated = align;
+	    }
+	}
+	

That TYPE_ALIGN bump seems wrong.  If you want to pass that type
aligned, shouldn't FUNCTION_ARG_BOUNDARY return the proper value?

When an incoming argument or return value is aligned, why does it affect
the function's stack frame alignment at all?  Those live in the caller's
stack frame, which is already aligned.

Do we take into account that the alignment of the stack pointer is known
and that we don't need to re-align?
       if (current_function_stdarg && !TREE_CHAIN (parm))
 	assign_parms_setup_varargs (&all, &data, false);
 
Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(.../trunk/gcc)	(revision 134098)
+++ tree-vectorizer.c	(.../branches/stack/gcc)	(revision 134141)
@@ -1786,9 +1786,19 @@ vect_can_force_dr_alignment_p (const_tre
 
   if (TREE_STATIC (decl))
     return (alignment <= MAX_OFILE_ALIGNMENT);
+  else if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      if (alignment <= MAX_VECTORIZE_STACK_ALIGNMENT)
+	{
+	  if (cfun->stack_alignment_estimated < alignment)
+	    cfun->stack_alignment_estimated = alignment;

I would prefer it if all alignments were visible at RTL expansion time.

So the reason why you need to handle stack_alignment_estimated at the tree
level is that when vectorizing you know you are introducing an alignment
requirement on a local array, while later RTL expansion can't work it
out, since the memory accesses are no longer clearly associated with the
stack frame area?

Perhaps deriving a type and changing the static array type to an array
with greater alignment would work?
Index: function.h
===================================================================
--- function.h	(.../trunk/gcc)	(revision 134098)
+++ function.h	(.../branches/stack/gcc)	(revision 134141)
@@ -352,9 +355,16 @@ struct function GTY(())
   /* tm.h can use this to store whatever it likes.  */
   struct machine_function * GTY ((maybe_undef)) machine;
 
-  /* The largest alignment of slot allocated on the stack.  */
+  /* The largest alignment needed on the stack, including requirement
+     for outgoing stack alignment.  */
   unsigned int stack_alignment_needed;
 
+  /* The largest alignment of slot allocated on the stack.  */
+  unsigned int stack_alignment_used;
+
+  /* The estimated stack alignment.  */
+  unsigned int stack_alignment_estimated;
+
   /* Preferred alignment of the end of stack frame.  */
   unsigned int preferred_stack_boundary;
 
@@ -509,6 +519,38 @@ struct function GTY(())
 
   /* Nonzero if pass_tree_profile was run on this function.  */
   unsigned int after_tree_profile : 1;
+
+/* Nonzero if current function must be given a frame pointer.
+   Set in global.c if anything is allocated on the stack there.  */
+  unsigned int need_frame_pointer : 1;
+
+  /* Nonzero if need_frame_pointer has been set.  */
+  unsigned int need_frame_pointer_set : 1;
+
+  /* Nonzero if, by estimation, current function stack needs realignment. */
+  unsigned int stack_realign_needed : 1;
+
+  /* Nonzero if function stack realignment is really needed. This flag
+     will be set after reload if by then criteria of stack realignment
+     is still true. Its value may be contridition to stack_realign_needed
+     since the latter was set before reload. This flag is more accurate
+     than stack_realign_needed so prologue/epilogue should be generated
+     according to both flags  */
+  unsigned int stack_realign_really : 1;
+
+  /* Nonzero if function being compiled needs dynamic realigned
+     argument pointer (drap) if stack needs realigning.  */
+  unsigned int need_drap : 1;
+
+  /* Nonzero if current function needs to save/restore parameter
+     pointer register in prolog, because it is a callee save reg.  */
+  unsigned int save_param_ptr_reg : 1;
+
+  /* Nonzero if function stack realignment estimatoin is done.  */
+  unsigned int stack_realign_processed : 1;
+
+  /* Nonzero if function stack realignment has been finalized.  */
+  unsigned int stack_realign_finalized : 1;


As I mentioned originally, it would be nice to place all the
variables and flags computed only at expansion time or later in
rtl_data.  I guess that covers the majority of the above variables.

+#define frame_pointer_needed (cfun->need_frame_pointer)

It might be better to stick with the x_frame_pointer_needed accessor
scheme or simply replace it in the sources with crtl->frame_pointer_needed
(I would definitely prefer the second).

+      /* If popup is needed, stack realign must use DRAP  */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+        cfun->need_drap = true;

No need to test !cfun->need_drap.

@@ -743,6 +760,29 @@ defer_stack_allocation (tree var, bool t
 static HOST_WIDE_INT
 expand_one_var (tree var, bool toplevel, bool really_expand)
 {
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && TREE_CODE (var) == VAR_DECL)
+    {
+      unsigned int align;
+
+      /* Because we don't know if VAR will be in register or on stack,
+	 we conservatively assume it will be on stack even if VAR is
+	 eventually put into register after RA pass.  For non-automatic
+	 variables, which won't be on stack, we collect alignment of
+	 type and ignore user specified alignment.  */
+      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+	align = TYPE_ALIGN (TREE_TYPE (var));
+      else
+	align = DECL_ALIGN (var);
+
+      if (cfun->stack_alignment_estimated < align)
+        {
+          /* stack_alignment_estimated shouldn't change after stack
+             realign decision made */
+          gcc_assert(!cfun->stack_realign_processed);
+	  cfun->stack_alignment_estimated = align;
+	}
+    }

It seems a bit overzealous to track the alignment everywhere.  The
variables end up either on the stack or in a pseudo.  In both cases
stack_alignment_estimated should have been bumped already?
+/* This pass sets crtl->args.internal_arg_pointer to a virtual
+   register if DRAP is needed.  Local register allocator will replace
+   virtual_incoming_args_rtx with the virtual register.  */
+
+static unsigned int
+handle_drap (void)
+{
+  rtx internal_arg_rtx; 
+
+  if (!cfun->need_drap
+      && (current_function_calls_alloca
+          || cfun->has_nonlocal_label
+          || current_function_has_nonlocal_goto))
+    cfun->need_drap = true;
+
+  /* Call targetm.calls.internal_arg_pointer again.  This time it will
+     return a virtual register if DRAP is needed.  */
+  internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
+
+  /* Assertion to check internal_arg_pointer is set to the right rtx
+     here.  */
+  gcc_assert (crtl->args.internal_arg_pointer == 
+             virtual_incoming_args_rtx);
+
+  /* Do nothing if no need to replace virtual_incoming_args_rtx.  */
+  if (crtl->args.internal_arg_pointer != internal_arg_rtx)
+    {
+      crtl->args.internal_arg_pointer = internal_arg_rtx;
+
+      /* Call fixup_tail_casss to clean up REG_EQUIV note if DRAP is
+         needed. */
+      fixup_tail_calls ();
+    }
+
+  return 0;
+}
+
+struct gimple_opt_pass pass_handle_drap =
+{
+ {
+  GIMPLE_PASS,

This should be RTL_PASS, but in general it looks like the other stuff we
do during cfgexpand at the end of function body expansion, so it
probably belongs there more naturally than in an extra pass.

Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
@ 2008-04-12  3:39     ` H.J. Lu
  2008-04-12 18:12     ` H.J. Lu
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: H.J. Lu @ 2008-04-12  3:39 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Ye, Joey, GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

On Fri, Apr 11, 2008 at 12:03:38PM +0200, Jan Hubicka wrote:
> +		  gcc_assert (size == 0
> +			      || (cfun->stack_alignment_estimated
> +				  >= mode_alignment));
> +		  alignment_in_bits = cfun->stack_alignment_estimated;
> +		  alignment = alignment_in_bits / BITS_PER_UNIT;
> +		}
> +	    }
> +	}
> +    }
> +  else
> +    {
> +      /* Ignore alignment we can't do with expected alignment of the
> +	 boundary.  */
> +      if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
> +	alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
> 
> It seems that the alignment_in_bits recomputation is missing here.

I am checking in this patch to fix it. It also sets alignment on stack
slot.
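
For readers following along, here is a minimal standalone sketch of the
invariant the fix restores: whenever the byte alignment is clamped, the
bit alignment has to be clamped with it.  The macro values and the
helper name clamp_slot_alignment are assumptions for illustration only;
they are not the definitions used in function.c.

```c
#include <assert.h>

/* Hypothetical stand-ins for the target macros; the values are assumed.  */
#define BITS_PER_UNIT 8
#define PREFERRED_STACK_BOUNDARY 128

/* Byte and bit view of one stack slot alignment.  */
struct slot_align
{
  unsigned int bytes;
  unsigned int bits;
};

/* Clamp a requested alignment (in bytes) to the preferred stack
   boundary, updating the byte and bit values together.  */
static struct slot_align
clamp_slot_alignment (unsigned int alignment)
{
  struct slot_align a;

  a.bytes = alignment;
  a.bits = alignment * BITS_PER_UNIT;
  if (a.bits > PREFERRED_STACK_BOUNDARY)
    {
      a.bits = PREFERRED_STACK_BOUNDARY;
      a.bytes = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
    }

  /* The two views must never disagree.  */
  assert (a.bits == a.bytes * BITS_PER_UNIT);
  return a;
}
```

With these assumed values, a 32-byte request is clamped to 16 bytes and
128 bits, while a 4-byte request passes through unchanged.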

Thanks.


H.J.
---
2008-04-11  H.J. Lu  <hongjiu.lu@intel.com>

	* function.c (assign_stack_local): Update alignment_in_bits.
	Set alignment on stack slot.

2008-04-11  Joey Ye  <joey.ye@intel.com>

	* dojump.c (discard_pending_stack_adjust): Insert empty line, which
	make it same as trunk.

Index: gcc/testsuite/ChangeLog.test
===================================================================
Index: gcc/dojump.c
===================================================================
--- gcc/dojump.c	(.../fsf/branches/stack)	(revision 2152)
+++ gcc/dojump.c	(.../branches/stack-test)	(revision 2152)
@@ -65,6 +65,7 @@ discard_pending_stack_adjust (void)
 
    Note, if the current function calls alloca, then it must have a
    frame pointer regardless of the value of flag_omit_frame_pointer.  */
+
 void
 clear_pending_stack_adjust (void)
 {
Index: gcc/function.c
===================================================================
--- gcc/function.c	(.../fsf/branches/stack)	(revision 2152)
+++ gcc/function.c	(.../branches/stack-test)	(revision 2152)
@@ -407,7 +407,10 @@ assign_stack_local (enum machine_mode mo
       /* Ignore alignment we can't do with expected alignment of the
 	 boundary.  */
       if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
-	alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+	{
+	  alignment_in_bits = PREFERRED_STACK_BOUNDARY;
+	  alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+	}
     }
   if (cfun->stack_alignment_needed < alignment_in_bits)
     cfun->stack_alignment_needed = alignment_in_bits;
@@ -465,6 +468,7 @@ assign_stack_local (enum machine_mode mo
     frame_offset += size;
 
   x = gen_rtx_MEM (mode, addr);
+  set_mem_align (x, alignment_in_bits);
   MEM_NOTRAP_P (x) = 1;
 
   stack_slot_list

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
  2008-04-12  3:39     ` H.J. Lu
@ 2008-04-12 18:12     ` H.J. Lu
  2008-04-12 18:36     ` H.J. Lu
                       ` (7 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: H.J. Lu @ 2008-04-12 18:12 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Ye, Joey, GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

[-- Attachment #1: Type: text/plain, Size: 638 bytes --]

On Fri, Apr 11, 2008 at 3:03 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>
>  +      /* If popup is needed, stack realign must use DRAP  */
>  +      if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
>  +        cfun->need_drap = true;
>
>  No need to test !cfun->need_drap.
>

Hi,

I am checking this patch into stack branch.

Thanks.

H.J.
---
2008-04-12  H.J. Lu  <hongjiu.lu@intel.com>

        * builtins.c (expand_builtin_longjmp): Don't check !cfun->need_drap
        when setting cfun->need_drap.
        (expand_builtin_apply): Likewise.
        * calls.c (emit_call_1): Likewise.
        * cfgexpand.c (handle_drap): Likewise.

[-- Attachment #2: drap.txt --]
[-- Type: text/plain, Size: 2506 bytes --]

Index: builtins.c
===================================================================
--- builtins.c	(revision 134220)
+++ builtins.c	(working copy)
@@ -777,7 +777,7 @@ expand_builtin_longjmp (rtx buf_addr, rt
 
   /* DRAP is needed for stack realign if longjmp is expanded to current 
      function  */
-  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
     cfun->need_drap = true;
 
   if (setjmp_alias_set == -1)
@@ -1463,7 +1463,7 @@ expand_builtin_apply (rtx function, rtx 
      may have already set current_function_calls_alloca to true.
      current_function_calls_alloca won't be set if argsize is zero,
      so we have to guarantee need_drap is true here.  */
-  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
     cfun->need_drap = true;
 
   dest = virtual_outgoing_args_rtx;
Index: ChangeLog.stackalign
===================================================================
--- ChangeLog.stackalign	(revision 134220)
+++ ChangeLog.stackalign	(working copy)
@@ -1,3 +1,11 @@
+2008-04-12  H.J. Lu  <hongjiu.lu@intel.com>
+
+	* builtins.c (expand_builtin_longjmp): Don't check !cfun->need_drap
+	when setting cfun->need_drap.
+	(expand_builtin_apply): Likewise.
+	* calls.c (emit_call_1): Likewise.
+	* cfgexpand.c (handle_drap): Likewise.
+
 2008-04-11  H.J. Lu  <hongjiu.lu@intel.com>
 
 	* function.c (assign_stack_local): Update alignment_in_bits.
Index: calls.c
===================================================================
--- calls.c	(revision 134220)
+++ calls.c	(working copy)
@@ -421,7 +421,7 @@ emit_call_1 (rtx funexp, tree fntree, tr
       stack_pointer_delta -= n_popped;
 
       /* If popup is needed, stack realign must use DRAP  */
-      if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+      if (MAX_VECTORIZE_STACK_ALIGNMENT)
         cfun->need_drap = true;
     }
 
Index: cfgexpand.c
===================================================================
--- cfgexpand.c	(revision 134220)
+++ cfgexpand.c	(working copy)
@@ -2059,10 +2059,9 @@ handle_drap (void)
 {
   rtx internal_arg_rtx; 
 
-  if (!cfun->need_drap
-      && (current_function_calls_alloca
-          || cfun->has_nonlocal_label
-          || current_function_has_nonlocal_goto))
+  if (current_function_calls_alloca
+      || cfun->has_nonlocal_label
+      || current_function_has_nonlocal_goto)
     cfun->need_drap = true;
 
   /* Call targetm.calls.internal_arg_pointer again.  This time it will

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
  2008-04-12  3:39     ` H.J. Lu
  2008-04-12 18:12     ` H.J. Lu
@ 2008-04-12 18:36     ` H.J. Lu
  2008-04-14  9:53     ` Ye, Joey
                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: H.J. Lu @ 2008-04-12 18:36 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Ye, Joey, GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

On Fri, Apr 11, 2008 at 3:03 AM, Jan Hubicka <hubicka@ucw.cz> wrote:

>  Index: function.h
>  ===================================================================
>  --- function.h  (.../trunk/gcc) (revision 134098)
>  +++ function.h  (.../branches/stack/gcc)        (revision 134141)
>  @@ -352,9 +355,16 @@ struct function GTY(())
>    /* tm.h can use this to store whatever it likes.  */
>    struct machine_function * GTY ((maybe_undef)) machine;
>
>  -  /* The largest alignment of slot allocated on the stack.  */
>  +  /* The largest alignment needed on the stack, including requirement
>  +     for outgoing stack alignment.  */
>    unsigned int stack_alignment_needed;
>
>  +  /* The largest alignment of slot allocated on the stack.  */
>  +  unsigned int stack_alignment_used;
>  +
>  +  /* The estimated stack alignment.  */
>  +  unsigned int stack_alignment_estimated;
>  +
>    /* Preferred alignment of the end of stack frame.  */
>    unsigned int preferred_stack_boundary;
>
>  @@ -509,6 +519,38 @@ struct function GTY(())
>
>    /* Nonzero if pass_tree_profile was run on this function.  */
>    unsigned int after_tree_profile : 1;
>  +
>  +/* Nonzero if current function must be given a frame pointer.
>  +   Set in global.c if anything is allocated on the stack there.  */
>  +  unsigned int need_frame_pointer : 1;
>  +
>  +  /* Nonzero if need_frame_pointer has been set.  */
>  +  unsigned int need_frame_pointer_set : 1;
>  +
>  +  /* Nonzero if, by estimation, current function stack needs realignment. */
>  +  unsigned int stack_realign_needed : 1;
>  +
>  +  /* Nonzero if function stack realignment is really needed. This flag
>  +     will be set after reload if by then criteria of stack realignment
>  +     is still true. Its value may be contridition to stack_realign_needed
>  +     since the latter was set before reload. This flag is more accurate
>  +     than stack_realign_needed so prologue/epilogue should be generated
>  +     according to both flags  */
>  +  unsigned int stack_realign_really : 1;
>  +
>  +  /* Nonzero if function being compiled needs dynamic realigned
>  +     argument pointer (drap) if stack needs realigning.  */
>  +  unsigned int need_drap : 1;
>  +
>  +  /* Nonzero if current function needs to save/restore parameter
>  +     pointer register in prolog, because it is a callee save reg.  */
>  +  unsigned int save_param_ptr_reg : 1;
>  +
>  +  /* Nonzero if function stack realignment estimatoin is done.  */
>  +  unsigned int stack_realign_processed : 1;
>  +
>  +  /* Nonzero if function stack realignment has been finalized.  */
>  +  unsigned int stack_realign_finalized : 1;
>
>
>  As I mentioned originally, it would be nice to place all the
>  variables and flags computed only at expansion time or later in
>  rtl_data.  I guess that covers the majority of the above variables.
>
>  +#define frame_pointer_needed (cfun->need_frame_pointer)
>
>  It might be better to stick with the x_frame_pointer_needed accessor
>  scheme or simply replace it in the sources with crtl->frame_pointer_needed
>  (I would definitely prefer the second).
>

The new fields should stay next to stack_alignment_needed and
preferred_stack_boundary, which are in "struct function". We can move the
new stack alignment fields to rtl_data when those two are moved there.

Thanks.


H.J.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
                       ` (3 preceding siblings ...)
  2008-04-14  9:53     ` Ye, Joey
@ 2008-04-14  9:53     ` Ye, Joey
  2008-04-14  9:53     ` Ye, Joey
                       ` (4 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Ye, Joey @ 2008-04-14  9:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

Jan,

As to your comments:
Index: builtins.c
===================================================================
--- builtins.c	(.../trunk/gcc)	(revision 134098)
+++ builtins.c	(.../branches/stack/gcc)	(revision 134141)
@@ -740,7 +740,7 @@ expand_builtin_setjmp_receiver (rtx rece
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (crtl->args.internal_arg_pointer,

> Shouldn't the move into virtual_incoming_args_rtx be eliminated back to
> an internal_arg_pointer store anyway?
> I was under the impression that virtual_incoming_args_rtx should expand to
> a direct stack frame reference when internal_arg_pointer is unused, and to
> internal_arg_pointer otherwise...
It works exactly the way you described: virtual_incoming_args will be
eliminated to internal_arg_pointer after expansion, so replacing it during
expansion does seem unnecessary.

However, there are places in the existing code where internal_arg_pointer
is referenced, for example in calls.c. I think the expansion pass should
consistently use either virtual_incoming_args or internal_arg_pointer.
This patch chooses internal_arg_pointer so that it is more consistent with
the existing code. It can be done the other way if there is a reason to
prefer virtual_incoming_args.

Thanks - Joey
-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Friday, April 11, 2008 6:04 PM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
                       ` (2 preceding siblings ...)
  2008-04-12 18:36     ` H.J. Lu
@ 2008-04-14  9:53     ` Ye, Joey
  2008-04-14 16:46       ` H.J. Lu
  2008-04-14  9:53     ` Ye, Joey
                       ` (5 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Ye, Joey @ 2008-04-14  9:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Friday, April 11, 2008 6:04 PM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

+/* This pass sets crtl->args.internal_arg_pointer to a virtual
+   register if DRAP is needed.  Local register allocator will replace
+   virtual_incoming_args_rtx with the virtual register.  */

> This should be RTL_PASS, but in general it looks like the other stuff we
> do during cfgexpand at the end of function body expansion, so it
> probably belongs there more naturally than in an extra pass.
Accepted. See this patch:

Index: cfgexpand.c
===================================================================
--- cfgexpand.c	(revision 134220)
+++ cfgexpand.c	(working copy)
@@ -1828,6 +1828,41 @@ discover_nonconstant_array_refs (void)
     }
 }
 
+/* This function sets crtl->args.internal_arg_pointer to a virtual
+   register if DRAP is needed.  Local register allocator will replace
+   virtual_incoming_args_rtx with the virtual register.  */
+
+static void
+handle_drap (void)
+{
+  rtx internal_arg_rtx; 
+
+  if (!cfun->need_drap
+      && (current_function_calls_alloca
+          || cfun->has_nonlocal_label
+          || current_function_has_nonlocal_goto))
+    cfun->need_drap = true;
+
+  /* Call targetm.calls.internal_arg_pointer again.  This time it will
+     return a virtual register if DRAP is needed.  */
+  internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
+
+  /* Assertion to check internal_arg_pointer is set to the right rtx
+     here.  */
+  gcc_assert (crtl->args.internal_arg_pointer == 
+             virtual_incoming_args_rtx);
+
+  /* Do nothing if no need to replace virtual_incoming_args_rtx.  */
+  if (crtl->args.internal_arg_pointer != internal_arg_rtx)
+    {
+      crtl->args.internal_arg_pointer = internal_arg_rtx;
+
+      /* Call fixup_tail_casss to clean up REG_EQUIV note if DRAP is
+         needed. */
+      fixup_tail_calls ();
+    }
+}
+
 /* Translate the intermediate representation contained in the CFG
    from GIMPLE trees to RTL.
 
@@ -1930,6 +1965,10 @@ tree_expand_cfg (void)
   sbitmap_free (blocks);
 
   compact_blocks ();
+
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    handle_drap();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif

Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
                       ` (4 preceding siblings ...)
  2008-04-14  9:53     ` Ye, Joey
@ 2008-04-14  9:53     ` Ye, Joey
  2008-04-14  9:53     ` Ye, Joey
                       ` (3 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Ye, Joey @ 2008-04-14  9:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

[-- Attachment #1: Type: text/plain, Size: 2189 bytes --]

Jan,

As to your comments about expand_one_var:
> It seems a bit overzealous to track the alignment everywhere.  The
> variables end up either on the stack or in a pseudo.  In both cases
> stack_alignment_estimated should have been bumped already?
As I said in the code comments, when expanding a static variable we don't
know whether it will be spilled to the stack. The current strategy is to
estimate very conservatively. It is not the most optimized way and
there is room for tuning, which I'd like to leave for later patches.

The good news is that even if the stack alignment estimation during
expansion is too conservative, the fix-up code after reload can reduce the
penalty to as little as zero or one additional instruction, depending on
DRAP usage.

The attached case shows a static variable that ends up spilled to the
stack later.
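
To make the estimation strategy concrete, here is a small standalone
sketch of the "only ever grow, and never after the decision is made"
bookkeeping used for stack_alignment_estimated. The struct and function
names are hypothetical and only mirror the shape of the checks in the
patch; they are not GCC code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of the per-function state discussed above.  */
struct align_state
{
  unsigned int estimated;   /* Largest alignment seen so far, in bits.  */
  bool processed;           /* Set once the realign decision is made.  */
};

/* Record one alignment requirement.  The estimate may only grow, and
   it must not grow after the realign decision has been made.  */
static void
note_alignment (struct align_state *s, unsigned int align)
{
  if (s->estimated < align)
    {
      assert (!s->processed);
      s->estimated = align;
    }
}

/* Feed N requirements in and return the final conservative estimate.  */
static unsigned int
estimate_for (const unsigned int *aligns, int n)
{
  struct align_state s = { 0, false };
  int i;

  for (i = 0; i < n; i++)
    note_alignment (&s, aligns[i]);
  s.processed = true;
  return s.estimated;
}
```

The estimate is simply the maximum over everything seen during
expansion, which is why it can be larger than what is finally needed.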

As to your comments:
> @@ -2968,6 +3010,20 @@ assign_parms (tree fndecl)

> That TYPE_ALIGN bump seems wrong.  If you want to pass that type
> aligned, shouldn't FUNCTION_ARG_BOUNDARY return the proper value?
This is an optimization of the estimation. For the following function:
typedef int __attribute__((aligned (16))) type_a;
void foo(type_a arg) {...}

arg can be copied into a vreg and spilled to the stack. The stack
alignment would then be increased to align_of(int), but not to
align_of(type_a). So we estimate stack alignment from TYPE_ALIGN instead.
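
As a rough illustration of the estimate described above, the sketch
below takes the larger of the argument-passing boundary and the type's
own alignment. It is a simplification: the patch compares the nominal
type but takes the value from the passed type, and the helper name
estimate_parm_alignment is hypothetical. It relies on the GNU aligned
attribute and C11 _Alignof.

```c
#include <assert.h>

/* Over-aligned typedef, as in the example above (GNU extension).  */
typedef int __attribute__ ((aligned (16))) type_a;

/* Hypothetical helper: given the boundary at which an argument is
   passed and the alignment of its type (both in bytes), return the
   alignment a spilled copy of the argument could require.  */
static unsigned int
estimate_parm_alignment (unsigned int arg_boundary, unsigned int type_align)
{
  return type_align > arg_boundary ? type_align : arg_boundary;
}
```

For type_a passed at a 4-byte boundary, the estimate is driven by the
type's 16-byte alignment rather than by the passing convention.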

> When an incoming argument or return value is aligned, why does it affect
> the function's stack frame alignment at all?  Those live in the caller's
> stack frame, which is already aligned.
Again, it is an estimation. An incoming argument can be copied into a
virtual reg and finally spilled to the stack.

> Do we take into account that the alignment of the stack pointer is known
> and that we don't need to re-align?
Yes. If incoming_stack_boundary >= stack_alignment_estimated, then no
re-alignment is needed. incoming_stack_boundary depends on the ABI and
PREFERRED_STACK_BOUNDARY.
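
Written out as code, the rule above is a single comparison. The function
name stack_realign_needed_p is an assumption for this sketch; both
arguments are alignments in bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate for the rule stated above: realignment is
   only needed when the stack pointer on entry is guaranteed to be
   less aligned than what the function is estimated to require.  */
static bool
stack_realign_needed_p (unsigned int incoming_stack_boundary,
                        unsigned int stack_alignment_estimated)
{
  return incoming_stack_boundary < stack_alignment_estimated;
}
```

For example, with a 32-bit incoming boundary and a 128-bit estimate the
function would realign, while with a 128-bit incoming boundary it would
not.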

Thanks - Joey

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Friday, April 11, 2008 6:04 PM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

[-- Attachment #2: slot.c --]
[-- Type: application/octet-stream, Size: 279 bytes --]

double g = 1.0;
double g1;
double g2,g3,g4, g5, g6, g7, g8;
void foo(int g_c)
{
	volatile char c;
	char i,j;
	for (i=0; i<g_c; i++)
	{
		g += 2.0;
		for (j=0; j<g_c; j++)
			g1 += g;
		g2 += g1;
		g3 += g;
		g4 += g2;
		g5 += g4;
		g6 += g5;
		g7 += g6;
		g8 += g7;
	}
	c = i;
}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
                       ` (5 preceding siblings ...)
  2008-04-14  9:53     ` Ye, Joey
@ 2008-04-14  9:53     ` Ye, Joey
  2008-04-15 17:35       ` H.J. Lu
  2008-04-14  9:55     ` Ye, Joey
                       ` (2 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Ye, Joey @ 2008-04-14  9:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

[-- Attachment #1: Type: text/plain, Size: 2412 bytes --]

Jan,

Thanks for your review. I'll reply in a series of emails to try to
address all your concerns.

As to your comments:
> I would still suggest breaking the generic bits up further and submitting
> them one by one so RTL maintainers can handle them more easily.
I'm doing that.

> frame_pointer_needed should IMO go into crtl, instead of cfun.  It is
> computed only at expansion time, right?
It is computed even after expansion time. 

As to your comments:
> Index: global.c
> Originally we decided whether we need a frame pointer during reload.
> It was always my impression that this is done because we want to be able
> to decide later, during reload iterations, that the frame pointer is
> required, via the FRAME_POINTER_REQUIRED macro on some architectures.

> Are you sure that you can always safely decide on the frame pointer
> beforehand?
What I read from the existing code doesn't exactly match your impression.
frame_pointer_needed is set in two places.

The first is init_elim_table, which is called at the very beginning
of reload. The second is update_eliminables, which is invoked
during reload iterations. There it is rewritten and finalized.

My patch just moves the first place to compute_regsets, which is called
at the beginning of greg. The conditions used to decide
frame_pointer_needed in init_elim_table don't differ from those in
compute_regsets. In other words, the following condition check won't
behave differently in init_elim_table and compute_regsets:
       (! flag_omit_frame_pointer
        || (current_function_calls_alloca && EXIT_IGNORE_STACK)
        || FRAME_POINTER_REQUIRED
        || current_function_accesses_prior_frames
        || cfun->stack_realign_needed)

So it is safe.

As to why I suggest moving it, one reason is code refactoring: it
isn't good coding practice to duplicate almost the same condition
check in two places. The other is that the i386 version of the
CAN_ELIMINATE macro references frame_pointer_needed, which is used in
compute_regsets before it has been set.

Such a change can be independent of stack alignment, so I have broken it
out into a separate patch.
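
To show what a single shared check could look like, here is a standalone
sketch of the condition quoted above as one pure function. The struct
fp_inputs and the name compute_need_fp are hypothetical; in GCC the
inputs are global flags, cfun fields and target macros rather than
struct members.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened view of the inputs named in the condition.  */
struct fp_inputs
{
  bool omit_frame_pointer;      /* flag_omit_frame_pointer  */
  bool calls_alloca;            /* current_function_calls_alloca  */
  bool exit_ignore_stack;       /* EXIT_IGNORE_STACK  */
  bool frame_pointer_required;  /* FRAME_POINTER_REQUIRED  */
  bool accesses_prior_frames;   /* current_function_accesses_prior_frames  */
  bool stack_realign_needed;    /* cfun->stack_realign_needed  */
};

/* One place that decides whether a frame pointer is needed, so the
   same condition does not have to be duplicated elsewhere.  */
static bool
compute_need_fp (const struct fp_inputs *in)
{
  return (!in->omit_frame_pointer
          || (in->calls_alloca && in->exit_ignore_stack)
          || in->frame_pointer_required
          || in->accesses_prior_frames
          || in->stack_realign_needed);
}
```

With -fomit-frame-pointer in effect and nothing else set, the result is
false; setting stack_realign_needed alone is enough to force a frame
pointer.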

Thanks - Joey

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Friday, April 11, 2008 6:04 PM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

[-- Attachment #2: frame_pointer_needed-0414.patch --]
[-- Type: application/octet-stream, Size: 2312 bytes --]

Index: global.c
===================================================================
--- global.c	(revision 134220)
+++ global.c	(working copy)
@@ -252,6 +252,9 @@ compute_regsets (HARD_REG_SET *elim_set,
        || (current_function_calls_alloca && EXIT_IGNORE_STACK)
        || FRAME_POINTER_REQUIRED);
 
+  frame_pointer_needed = need_fp;
+  crtl->need_frame_pointer_set = 1;
+
   max_regno = max_reg_num ();
   compact_blocks ();
 
Index: function.h
===================================================================
--- function.h	(revision 134220)
+++ function.h	(working copy)
@@ -288,6 +288,12 @@ struct rtl_data GTY(())
   /* Current nesting level for temporaries.  */
   int x_temp_slot_level;
 
+  /* Nonzero if current function must be given a frame pointer.
+     Set in global.c if anything is allocated on the stack there.  */
+  unsigned int need_frame_pointer : 1;
+
+  /* Nonzero if need_frame_pointer has been set.  */
+  unsigned int need_frame_pointer_set : 1;
 };
 
 #define return_label (crtl->x_return_label)
@@ -301,6 +307,7 @@ struct rtl_data GTY(())
 #define avail_temp_slots (crtl->x_avail_temp_slots)
 #define temp_slot_level (crtl->x_temp_slot_level)
 #define nonlocal_goto_handler_labels (crtl->x_nonlocal_goto_handler_labels)
+#define frame_pointer_needed (crtl->need_frame_pointer)
 
 extern GTY(()) struct rtl_data x_rtl;
 
Index: reload1.c
===================================================================
--- reload1.c	(revision 134220)
+++ reload1.c	(working copy)
@@ -3713,18 +3713,8 @@ init_elim_table (void)
   if (!reg_eliminate)
     reg_eliminate = xcalloc (sizeof (struct elim_table), NUM_ELIMINABLE_REGS);
 
-  /* Does this function require a frame pointer?  */
-
-  frame_pointer_needed = (! flag_omit_frame_pointer
-			  /* ?? If EXIT_IGNORE_STACK is set, we will not save
-			     and restore sp for alloca.  So we can't eliminate
-			     the frame pointer in that case.  At some point,
-			     we should improve this by emitting the
-			     sp-adjusting insns for this case.  */
-			  || (current_function_calls_alloca
-			      && EXIT_IGNORE_STACK)
-			  || current_function_accesses_prior_frames
-			  || FRAME_POINTER_REQUIRED);
+  /* frame_pointer_needed should have been set.  */
+  gcc_assert (crtl->need_frame_pointer_set);
 
   num_eliminable = 0;
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
                       ` (6 preceding siblings ...)
  2008-04-14  9:53     ` Ye, Joey
@ 2008-04-14  9:55     ` Ye, Joey
  2008-04-14  9:58     ` Ye, Joey
  2008-04-14 10:37     ` Ye, Joey
  9 siblings, 0 replies; 26+ messages in thread
From: Ye, Joey @ 2008-04-14  9:55 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

Jan,

As to your comments:
+  /* DRAP is needed for stack realign if longjmp is expanded to current
+     function  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+    cfun->need_drap = true;

> Similarly need_drap is an RTL property, so it should live in crtl.  I know
> that most of the flags are still in cfun, I plan to move them soon, once
> the patch already posted in this series is reviewed.  Sorry for all the
> conflicts I must've caused.
need_drap is set during expansion. It should be moved to rtl_data
together with the other similar flags, as per HJ's comment:
http://gcc.gnu.org/ml/gcc-patches/2008-04/msg01019.html

> How exactly longjmp/setjmp machinery imply need for DRAP in functions
> not needing alignment otherwise? 
longjmp is expanded into several RTL insns, one of which sets the stack
pointer. Without DRAP, stack realignment cannot cope with the stack
pointer being set in the function body.
The reason is that reload may not eliminate frame_pointer
into stack_pointer once the stack pointer is altered anywhere other than
in the prologue/epilogue.
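
As a rough illustration of the underlying arithmetic (this is not the
reload logic, only a model with plain integers standing in for
addresses): once the stack pointer is rounded down to the requested
alignment, its distance to the incoming arguments depends on the
run-time value of the incoming stack pointer, so a pointer captured
before realignment is needed to reach them. The helper names below are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round the stack pointer down to ALIGN bytes (ALIGN must be a power
   of 2), as a realigning prologue would do with an AND instruction.  */
static uintptr_t
realign_sp (uintptr_t sp, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return sp & ~(align - 1);
}

/* Distance from the realigned stack pointer back to the incoming
   stack pointer.  It depends on the run-time value of SP, so incoming
   arguments cannot be addressed at a fixed offset from the realigned
   stack pointer.  */
static uintptr_t
realign_padding (uintptr_t sp_at_entry, uintptr_t align)
{
  return sp_at_entry - realign_sp (sp_at_entry, align);
}

/* A DRAP-like pointer is captured before realignment, so it keeps a
   fixed relation to the incoming arguments regardless of the padding.  */
static uintptr_t
capture_drap (uintptr_t sp_at_entry, uintptr_t return_addr_size)
{
  return sp_at_entry + return_addr_size;
}
```

For example, with 32-byte alignment an entry stack pointer of 0x1008
gives 8 bytes of padding while 0x1018 gives 24, which is why a fixed
offset from the realigned stack pointer does not work.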

Thanks - Joey

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Friday, April 11, 2008 6:04 PM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

Thank you for breaking this up!  It seems to me that the generic part of
the patch still contains several different infrastructure changes (that is,
stack frame alignment tracking, DRAP pointer support, some of the
incoming_arg_rtx related changes and the actual target macro bits).

I would still suggest breaking the generic bits up further and submitting
them one by one so RTL maintainers can handle them more easily.


Index: flags.h
===================================================================
--- flags.h	(.../trunk/gcc)	(revision 134098)
+++ flags.h	(.../branches/stack/gcc)	(revision 134141)
@@ -223,12 +223,6 @@ extern int flag_dump_rtl_in_asm;
 \f
 /* Other basic status info about current function.  */
 
-/* Nonzero means current function must be given a frame pointer.
-   Set in stmt.c if anything is allocated on the stack there.
-   Set in reload1.c if anything is allocated on the stack there.  */
-
-extern int frame_pointer_needed;

frame_pointer_needed should IMO go into crtl, instead of cfun.  It is
computed only at expansion time, right?

Index: builtins.c
===================================================================
--- builtins.c	(.../trunk/gcc)	(revision 134098)
+++ builtins.c	(.../branches/stack/gcc)	(revision 134141)
@@ -740,7 +740,7 @@ expand_builtin_setjmp_receiver (rtx rece
 	{
 	  /* Now restore our arg pointer from the address at which it
 	     was saved in our stack frame.  */
-	  emit_move_insn (virtual_incoming_args_rtx,
+	  emit_move_insn (crtl->args.internal_arg_pointer,

Shouldn't the move into virtual_incoming_args_rtx be eliminated back to
an internal_arg_pointer store anyway?
I was under the impression that virtual_incoming_args_rtx should expand to
a direct stack frame reference when internal_arg_pointer is unused and to
internal_arg_pointer otherwise...
 
+  /* DRAP is needed for stack realign if longjmp is expanded to current
+     function  */
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+    cfun->need_drap = true;

Similarly need_drap is an RTL property, so it should live in crtl.  I know
that most of the flags are still in cfun; I plan to move them soon, once
the patch already posted in this series is reviewed.  Sorry for all the
conflicts I must've caused.

How exactly does the longjmp/setjmp machinery imply the need for DRAP in
functions not needing alignment otherwise?
Index: global.c
===================================================================
--- global.c	(.../trunk/gcc)	(revision 134098)
+++ global.c	(.../branches/stack/gcc)	(revision 134141)
@@ -247,10 +247,20 @@ compute_regsets (HARD_REG_SET *elim_set,
   static const struct {const int from, to; } eliminables[] = ELIMINABLE_REGS;
   size_t i;
 #endif
+
+  /* FIXME: If EXIT_IGNORE_STACK is set, we will not save and restore
+     sp for alloca.  So we can't eliminate the frame pointer in that
+     case.  At some point, we should improve this by emitting the
+     sp-adjusting insns for this case.  */
   int need_fp
     = (! flag_omit_frame_pointer
        || (current_function_calls_alloca && EXIT_IGNORE_STACK)
-       || FRAME_POINTER_REQUIRED);
+       || FRAME_POINTER_REQUIRED
+       || current_function_accesses_prior_frames
+       || cfun->stack_realign_needed);
+
+  frame_pointer_needed = need_fp;
+  cfun->need_frame_pointer_set = 1;

Originally we decided whether we need a frame pointer during reload.
It was always my impression that this is done because we do want to be able
to decide later, during reload iterations, that the frame pointer is
required, via the FRAME_POINTER_REQUIRED macro on some architectures.

Are you sure that you can always safely decide on frame pointer
beforehand?
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < alignment_in_bits)
+	{
+          if (!cfun->stack_realign_processed)
+            cfun->stack_alignment_estimated = alignment_in_bits;
+          else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized);
+	      if (!cfun->stack_realign_needed)
+		{
+		  /* It is OK to reduce the alignment as long as the
+		     requested size is 0 or the estimated stack
+		     alignment >= mode alignment.  */

So basically the purpose of stack_alignment_estimated is to avoid
wasting stack frame space on padding when LOCAL_ALIGNMENT requests
alignment greater than STACK_BOUNDARY and we believe that the function
won't need the DRAP code?

The comment should probably mention why we would ever request a size of 0.
At least I don't know ;)
+		  gcc_assert (size == 0
+			      || (cfun->stack_alignment_estimated
+				  >= mode_alignment));
+		  alignment_in_bits = cfun->stack_alignment_estimated;
+		  alignment = alignment_in_bits / BITS_PER_UNIT;
+		}
+	    }
+	}
+    }
+  else
+    {
+      /* Ignore alignment we can't do with expected alignment of the
+	 boundary.  */
+      if (alignment * BITS_PER_UNIT > PREFERRED_STACK_BOUNDARY)
+	alignment = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;

It seems that the alignment_in_bits recomputation is missing here.
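
A minimal sketch of what keeping the two values in sync could look like,
assuming 8-bit units; the struct and function names are hypothetical and
this is not the code from the branch:

```c
#include <assert.h>

#define BITS_PER_UNIT_SKETCH 8   /* assumed width of one storage unit */

/* Alignment request expressed both in bytes and in bits.  */
struct align_req
{
  unsigned int bytes;
  unsigned int bits;
};

/* Clamp a requested alignment (in bytes) to the preferred stack
   boundary (in bits) and recompute the bit count afterwards, so the
   two representations never disagree.  */
static struct align_req
clamp_alignment (unsigned int alignment_bytes,
                 unsigned int preferred_boundary_bits)
{
  struct align_req r;

  r.bytes = alignment_bytes;
  if (r.bytes * BITS_PER_UNIT_SKETCH > preferred_boundary_bits)
    r.bytes = preferred_boundary_bits / BITS_PER_UNIT_SKETCH;

  /* This is the step that appears to be missing in the hunk above.  */
  r.bits = r.bytes * BITS_PER_UNIT_SKETCH;

  assert (r.bits <= preferred_boundary_bits);
  return r;
}
```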
@@ -2968,6 +3010,20 @@ assign_parms (tree fndecl)

 	  continue;
 	}
 
+      /* Estimate stack alignment from parameter alignment */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT)
+        {
+          unsigned int align = FUNCTION_ARG_BOUNDARY (data.promoted_mode,
+						      data.passed_type);
+	  if (TYPE_ALIGN (data.nominal_type) > align)
+	    align = TYPE_ALIGN (data.passed_type);
+	  if (cfun->stack_alignment_estimated < align)
+	    {
+	      gcc_assert (!cfun->stack_realign_processed);
+	      cfun->stack_alignment_estimated = align;
+	    }
+	}
+	

That TYPE_ALIGN bump seems wrong.  If you want to pass that type
aligned, shouldn't FUNCTION_ARG_BOUNDARY return the proper value?

When an incoming argument or return value is aligned, why does it affect the
function's stack frame alignment at all? Those live in the caller
function's stack frame, which is already aligned.

Do we take into account that the alignment of the stack pointer is known and
we don't need to re-align?
       if (current_function_stdarg && !TREE_CHAIN (parm))
 	assign_parms_setup_varargs (&all, &data, false);
 
Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(.../trunk/gcc)	(revision 134098)
+++ tree-vectorizer.c	(.../branches/stack/gcc)	(revision 134141)
@@ -1786,9 +1786,19 @@ vect_can_force_dr_alignment_p (const_tre
 
   if (TREE_STATIC (decl))
     return (alignment <= MAX_OFILE_ALIGNMENT);
+  else if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      gcc_assert (!cfun->stack_realign_processed);
+      if (alignment <= MAX_VECTORIZE_STACK_ALIGNMENT)
+	{
+	  if (cfun->stack_alignment_estimated < alignment)
+	    cfun->stack_alignment_estimated = alignment;

I would prefer it if all alignments were visible at RTL expansion time.

So the reason why you need to handle stack_alignment_estimated at the tree
level is that when vectorizing you know you are introducing an alignment
requirement on a local array, while later RTL expansion can't work it
out, since the memory accesses are no longer clearly associated with the
stack frame area?

Perhaps deriving a type and changing the static array type to an array
with greater alignment would work?
Index: function.h
===================================================================
--- function.h	(.../trunk/gcc)	(revision 134098)
+++ function.h	(.../branches/stack/gcc)	(revision 134141)
@@ -352,9 +355,16 @@ struct function GTY(())
   /* tm.h can use this to store whatever it likes.  */
   struct machine_function * GTY ((maybe_undef)) machine;
 
-  /* The largest alignment of slot allocated on the stack.  */
+  /* The largest alignment needed on the stack, including requirement
+     for outgoing stack alignment.  */
   unsigned int stack_alignment_needed;
 
+  /* The largest alignment of slot allocated on the stack.  */
+  unsigned int stack_alignment_used;
+
+  /* The estimated stack alignment.  */
+  unsigned int stack_alignment_estimated;
+
   /* Preferred alignment of the end of stack frame.  */
   unsigned int preferred_stack_boundary;
 
@@ -509,6 +519,38 @@ struct function GTY(())
 
   /* Nonzero if pass_tree_profile was run on this function.  */
   unsigned int after_tree_profile : 1;
+
+/* Nonzero if current function must be given a frame pointer.
+   Set in global.c if anything is allocated on the stack there.  */
+  unsigned int need_frame_pointer : 1;
+
+  /* Nonzero if need_frame_pointer has been set.  */
+  unsigned int need_frame_pointer_set : 1;
+
+  /* Nonzero if, by estimation, current function stack needs realignment. */
+  unsigned int stack_realign_needed : 1;
+
+  /* Nonzero if function stack realignment is really needed. This flag
+     will be set after reload if by then criteria of stack realignment
+     is still true. Its value may contradict stack_realign_needed
+     since the latter was set before reload. This flag is more accurate
+     than stack_realign_needed so prologue/epilogue should be generated
+     according to both flags  */
+  unsigned int stack_realign_really : 1;
+
+  /* Nonzero if function being compiled needs dynamic realigned
+     argument pointer (drap) if stack needs realigning.  */
+  unsigned int need_drap : 1;
+
+  /* Nonzero if current function needs to save/restore parameter
+     pointer register in prolog, because it is a callee save reg.  */
+  unsigned int save_param_ptr_reg : 1;
+
+  /* Nonzero if function stack realignment estimation is done.  */
+  unsigned int stack_realign_processed : 1;
+
+  /* Nonzero if function stack realignment has been finalized.  */
+  unsigned int stack_realign_finalized : 1;


As I've mentioned originally, it would be nice to place all the
variables and flags computed only at expansion time or later into
rtl_data.  I guess that covers the majority of the above variables.

+#define frame_pointer_needed (cfun->need_frame_pointer)

It might be better to stick with the x_frame_pointer_needed accessor
scheme or simply replace it in the sources with crtl->frame_pointer_needed
(I would definitely prefer the second).

+      /* If popup is needed, stack realign must use DRAP  */
+      if (MAX_VECTORIZE_STACK_ALIGNMENT && !cfun->need_drap)
+        cfun->need_drap = true;

No need to test !cfun->need_drap.

@@ -743,6 +760,29 @@ defer_stack_allocation (tree var, bool t
 static HOST_WIDE_INT
 expand_one_var (tree var, bool toplevel, bool really_expand)
 {
+  if (MAX_VECTORIZE_STACK_ALIGNMENT && TREE_CODE (var) == VAR_DECL)
+    {
+      unsigned int align;
+
+      /* Because we don't know if VAR will be in register or on stack,
+	 we conservatively assume it will be on stack even if VAR is
+	 eventually put into register after RA pass.  For non-automatic
+	 variables, which won't be on stack, we collect alignment of
+	 type and ignore user specified alignment.  */
+      if (TREE_STATIC (var) || DECL_EXTERNAL (var))
+	align = TYPE_ALIGN (TREE_TYPE (var));
+      else
+	align = DECL_ALIGN (var);
+
+      if (cfun->stack_alignment_estimated < align)
+        {
+          /* stack_alignment_estimated shouldn't change after stack
+             realign decision made */
+          gcc_assert(!cfun->stack_realign_processed);
+	  cfun->stack_alignment_estimated = align;
+	}
+    }

It seems a bit overzealous to track the alignment everywhere.  The
variables end up either on the stack or in a pseudo.  In both cases
stack_alignment_estimated should have been bumped already?
+/* This pass sets crtl->args.internal_arg_pointer to a virtual
+   register if DRAP is needed.  Local register allocator will replace
+   virtual_incoming_args_rtx with the virtual register.  */
+
+static unsigned int
+handle_drap (void)
+{
+  rtx internal_arg_rtx; 
+
+  if (!cfun->need_drap
+      && (current_function_calls_alloca
+          || cfun->has_nonlocal_label
+          || current_function_has_nonlocal_goto))
+    cfun->need_drap = true;
+
+  /* Call targetm.calls.internal_arg_pointer again.  This time it will
+     return a virtual register if DRAP is needed.  */
+  internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
+
+  /* Assertion to check internal_arg_pointer is set to the right rtx
+     here.  */
+  gcc_assert (crtl->args.internal_arg_pointer == 
+             virtual_incoming_args_rtx);
+
+  /* Do nothing if no need to replace virtual_incoming_args_rtx.  */
+  if (crtl->args.internal_arg_pointer != internal_arg_rtx)
+    {
+      crtl->args.internal_arg_pointer = internal_arg_rtx;
+
+      /* Call fixup_tail_casss to clean up REG_EQUIV note if DRAP is
+         needed. */
+      fixup_tail_calls ();
+    }
+
+  return 0;
+}
+
+struct gimple_opt_pass pass_handle_drap =
+{
+ {
+  GIMPLE_PASS,

This should be RTL_PASS, but in general it looks like the other things we
do during cfgexpand at the end of function body expansion, so it
probably belongs there more naturally than in an extra pass.

Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
                       ` (7 preceding siblings ...)
  2008-04-14  9:55     ` Ye, Joey
@ 2008-04-14  9:58     ` Ye, Joey
  2008-04-14 14:22       ` H.J. Lu
  2008-04-14 10:37     ` Ye, Joey
  9 siblings, 1 reply; 26+ messages in thread
From: Ye, Joey @ 2008-04-14  9:58 UTC (permalink / raw)
  To: Jan Hubicka, Lu, Hongjiu; +Cc: GCC Patches, Guo, Xuepeng, ubizjak

Jan,

As to your comments:
+  if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    {
+      if (cfun->stack_alignment_estimated < alignment_in_bits)
+	{
+          if (!cfun->stack_realign_processed)
+            cfun->stack_alignment_estimated = alignment_in_bits;
+          else
+	    {
+	      gcc_assert (!cfun->stack_realign_finalized);
+	      if (!cfun->stack_realign_needed)
+		{
+		  /* It is OK to reduce the alignment as long as the
+		     requested size is 0 or the estimated stack
+		     alignment >= mode alignment.  */

> So basically the purpose of stack_alignment_estimated is to avoid
> wasting stack frame space by padding when LOCAL_ALIGNMENT requests
> alignment greater than STACK_BOUNDARY and we believe that the function
> won't need the DRAP code?
stack_alignment_estimated is a conservative estimate of the stack
alignment requirement, based on the knowledge we have before reload. We
may make a mistake by underestimating the requirement, which we treat as
an assertion failure.

Basically this piece of code is saying: if I'm called before reload,
just update my estimate; if I'm called in/after reload, the
alignment is bigger than what I estimated, and the stack realignment
decision was made incorrectly, crash.
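
The decision logic described above can be modelled as a small standalone
sketch. This is only an illustration that follows the quoted hunk; the
struct and function names are hypothetical, and where the real code
would hit gcc_assert the sketch returns 0 so the behaviour can be
observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-function alignment state.  */
struct align_state
{
  unsigned int estimated;   /* models cfun->stack_alignment_estimated */
  bool processed;           /* models cfun->stack_realign_processed */
  bool finalized;           /* models cfun->stack_realign_finalized */
  bool realign_needed;      /* models cfun->stack_realign_needed */
};

/* Return the alignment (in bits) actually used for a stack slot
   request, or 0 where the real code would fail an assertion.  */
static unsigned int
request_alignment (struct align_state *s, unsigned int alignment_in_bits,
                   unsigned int size, unsigned int mode_alignment)
{
  assert (alignment_in_bits != 0);   /* 0 is reserved for "would crash" */

  if (s->estimated < alignment_in_bits)
    {
      if (!s->processed)
        /* Before the realign decision: just raise the estimate.  */
        s->estimated = alignment_in_bits;
      else
        {
          if (s->finalized)
            return 0;   /* too late to ask for more alignment */
          if (!s->realign_needed)
            {
              /* Reducing the alignment is only OK for a zero-size
                 request or when the estimate covers the mode.  */
              if (!(size == 0 || s->estimated >= mode_alignment))
                return 0;   /* the estimate was too low */
              alignment_in_bits = s->estimated;
            }
        }
    }
  return alignment_in_bits;
}
```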

> The comment should probably mention why we ever request size of 0.  I
> don't know at least ;) 
Neither do I. HJ, can you help to answer?

Thanks - Joey

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Friday, April 11, 2008 6:04 PM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFA]: Merge stack alignment branch
  2008-04-11 13:27   ` Jan Hubicka
                       ` (8 preceding siblings ...)
  2008-04-14  9:58     ` Ye, Joey
@ 2008-04-14 10:37     ` Ye, Joey
  9 siblings, 0 replies; 26+ messages in thread
From: Ye, Joey @ 2008-04-14 10:37 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

Jan,

As to your comments:
> I would prefer it if all alignments were visible at RTL expansion time.

> So the reason why you need to handle stack_alignment_estimated at tree
> level is that when vectorizing you know you are introducing alignment
> requirement on a local array, while later RTL expansion can't work it
> out, since the memory accesses are no longer clearly associated with the
> stack frame area?

> Perhaps deriving type and changing the static array type to an array
> with greater alignment would work?

Estimating alignment at the tree level turns out to be unnecessary. This is
fixed by the following patch, which is also included in my latest post:
http://gcc.gnu.org/ml/gcc-patches/2008-04/msg00928.html

Index: tree-vectorizer.c
===================================================================
--- tree-vectorizer.c	(.../trunk/gcc)	(revision 134098)
+++ tree-vectorizer.c	(.../branches/stack/gcc)	(revision 134150)
@@ -1786,9 +1786,9 @@ vect_can_force_dr_alignment_p (const_tre
 
   if (TREE_STATIC (decl))
     return (alignment <= MAX_OFILE_ALIGNMENT);
+  else if (MAX_VECTORIZE_STACK_ALIGNMENT)
+    return (alignment <= MAX_VECTORIZE_STACK_ALIGNMENT);
   else
-    /* This used to be PREFERRED_STACK_BOUNDARY, however, that is not 100%
-       correct until someone implements forced stack alignment.  */
     return (alignment <= STACK_BOUNDARY); 
 }
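
For readers skimming the diff, the resulting three-way decision can be
restated as a standalone function. This is only a sketch: the limits are
passed in as parameters instead of being target macros, and the function
name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether ALIGNMENT can be forced on a declaration.  The three
   limits model MAX_OFILE_ALIGNMENT, MAX_VECTORIZE_STACK_ALIGNMENT and
   STACK_BOUNDARY; a zero max_vectorize_stack_alignment means the
   target does not support stack realignment for vectorization.  */
static bool
can_force_alignment (bool is_static, unsigned int alignment,
                     unsigned int max_ofile_alignment,
                     unsigned int max_vectorize_stack_alignment,
                     unsigned int stack_boundary)
{
  /* Alignments are expected to be powers of 2.  */
  assert ((alignment & (alignment - 1)) == 0);

  if (is_static)
    return alignment <= max_ofile_alignment;
  else if (max_vectorize_stack_alignment)
    return alignment <= max_vectorize_stack_alignment;
  else
    return alignment <= stack_boundary;
}
```

The numeric limits used when trying this out are up to the caller; the
real values come from the target definition.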
  

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Friday, April 11, 2008 6:04 PM
To: Ye, Joey
Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
Subject: Re: [RFA]: Merge stack alignment branch

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFA]: Merge stack alignment branch
  2008-04-14  9:58     ` Ye, Joey
@ 2008-04-14 14:22       ` H.J. Lu
  0 siblings, 0 replies; 26+ messages in thread
From: H.J. Lu @ 2008-04-14 14:22 UTC (permalink / raw)
  To: Ye, Joey; +Cc: Jan Hubicka, Lu, Hongjiu, GCC Patches, Guo, Xuepeng, ubizjak

On Mon, Apr 14, 2008 at 05:51:30PM +0800, Joey Ye wrote:
> Jan,
> 
> As to your comments:
> +  if (MAX_VECTORIZE_STACK_ALIGNMENT)
> +    {
> +      if (cfun->stack_alignment_estimated < alignment_in_bits)
> +	{
> +          if (!cfun->stack_realign_processed)
> +            cfun->stack_alignment_estimated = alignment_in_bits;
> +          else
> +	    {
> +	      gcc_assert (!cfun->stack_realign_finalized);
> +	      if (!cfun->stack_realign_needed)
> +		{
> +		  /* It is OK to reduce the alignment as long as the
> +		     requested size is 0 or the estimated stack
> +		     alignment >= mode alignment.  */
> 
> > So basically the purpose of stack_alignment_estimated is to avoid
> > wasting stack frame space by padding when LOCAL_ALIGNMENT requests
> > alignment greater than STACK_BOUNDARY and we believe that the function
> > won't need the DRAP code?
> stack_alignment_estimated is a conservative estimate of the stack
> alignment requirement, based on the knowledge we have before reload. We
> may make a mistake by underestimating the requirement, which we treat as
> an assertion failure.
> 
> Basically this piece of code is saying: if I'm called before reload,
> just update my estimation; if I'm called in/after reload and the
> alignment is bigger than what I estimated and the stack realignment
> decision was made incorrectly, crash.
> 
> > The comment should probably mention why we ever request size of 0.  I
> > don't know at least ;) 
> Neither do I. HJ, can you help to answer?
> 

reload () in reload1.c has

      if (starting_frame_size && cfun->stack_alignment_needed)
	{
	  /* If we have a stack frame, we must align it now.  The
	     stack size may be a part of the offset computation for
	     register elimination.  So if this changes the stack size,
	     then repeat the elimination bookkeeping.  We don't
	     realign when there is no stack, as that will cause a
	     stack frame when none is needed should
	     STARTING_FRAME_OFFSET not be already aligned to
	     STACK_BOUNDARY.  */
	  assign_stack_local (BLKmode, 0, cfun->stack_alignment_needed);
	  if (starting_frame_size != get_frame_size ())
	    continue;
	}

Although this call will never trigger the assert in assign_stack_local, since
cfun->stack_alignment_estimated >= cfun->stack_alignment_needed,
the size == 0 case is there to cover any such usage.
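
To show what a zero-size request does, here is a simplified model of a
frame that grows upward in bytes. It is only a sketch with hypothetical
names; the real assign_stack_local also deals with frame growth
direction and target-specific offsets.

```c
#include <assert.h>

/* Hypothetical model of a stack frame: only its current size in bytes.  */
struct frame_sketch
{
  unsigned int size;
};

/* Allocate SIZE bytes aligned to ALIGN bytes (a power of 2) and return
   the offset of the new slot.  With SIZE == 0 the only effect is to
   round the frame size up to ALIGN.  */
static unsigned int
assign_slot (struct frame_sketch *f, unsigned int size, unsigned int align)
{
  unsigned int offset;

  assert (align != 0 && (align & (align - 1)) == 0);

  /* Round the current frame size up to the requested alignment.  */
  offset = (f->size + align - 1) & ~(align - 1);
  f->size = offset + size;
  return offset;
}
```

A 20-byte frame becomes 32 bytes after a zero-size request with 16-byte
alignment, and a second identical request leaves it unchanged. That
corresponds to the check in reload, which repeats the elimination
bookkeeping only when the frame size actually changed.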


H.J.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFA]: Merge stack alignment branch
  2008-04-14  9:53     ` Ye, Joey
@ 2008-04-14 16:46       ` H.J. Lu
  0 siblings, 0 replies; 26+ messages in thread
From: H.J. Lu @ 2008-04-14 16:46 UTC (permalink / raw)
  To: Ye, Joey; +Cc: Jan Hubicka, GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

[-- Attachment #1: Type: text/plain, Size: 1248 bytes --]

On Mon, Apr 14, 2008 at 2:52 AM, Ye, Joey <joey.ye@intel.com> wrote:
> -----Original Message-----
>  From: Jan Hubicka [mailto:hubicka@ucw.cz]
>  Sent: Friday, April 11, 2008 6:04 PM
>  To: Ye, Joey
>
> Cc: GCC Patches; Lu, Hongjiu; Guo, Xuepeng; ubizjak@gmail.com
>  Subject: Re: [RFA]: Merge stack alignment branch
>
>  +/* This pass sets crtl->args.internal_arg_pointer to a virtual
>  +   register if DRAP is needed.  Local register allocator will replace
>  +   virtual_incoming_args_rtx with the virtual register.  */
>
>
>  > This should be RTL_PASS, but in general it looks like the other things we
>  > do during cfgexpand at the end of function body expansion, so it
>  > probably belongs there more naturally than in an extra pass.
>  Accept. Check out this patch:
>

I checked this patch into the stack branch:

2008-04-14  Joey Ye  <joey.ye@intel.com>
            H.J. Lu  <hongjiu.lu@intel.com>

        * cfgexpand.c (handle_drap): Make it void.  Return if
        MAX_VECTORIZE_STACK_ALIGNMENT is 0.
        (tree_expand_cfg): Call handle_drap.
        (gate_handle_drap): Removed.
        (pass_handle_drap): Likewise.
        * passes.c (pass_handle_drap): Likewise.
        * tree-pass.h (pass_handle_drap): Likewise.

Thanks.

H.J.

[-- Attachment #2: drap.txt --]
[-- Type: text/plain, Size: 4840 bytes --]

Index: tree-pass.h
===================================================================
--- tree-pass.h	(revision 134224)
+++ tree-pass.h	(working copy)
@@ -472,7 +472,6 @@ extern struct gimple_opt_pass pass_inlin
 extern struct gimple_opt_pass pass_apply_inline;
 extern struct gimple_opt_pass pass_all_early_optimizations;
 extern struct gimple_opt_pass pass_update_address_taken;
-extern struct gimple_opt_pass pass_handle_drap;
 
 /* The root of the compilation pass tree, once constructed.  */
 extern struct opt_pass *all_passes, *all_ipa_passes, *all_lowering_passes;
Index: cfgexpand.c
===================================================================
--- cfgexpand.c	(revision 134224)
+++ cfgexpand.c	(working copy)
@@ -1868,6 +1868,43 @@ discover_nonconstant_array_refs (void)
     }
 }
 
+/* This function sets crtl->args.internal_arg_pointer to a virtual
+   register if DRAP is needed.  Local register allocator will replace
+   virtual_incoming_args_rtx with the virtual register.  */
+
+static void
+handle_drap (void)
+{
+  rtx internal_arg_rtx; 
+
+  if (!MAX_VECTORIZE_STACK_ALIGNMENT)
+    return;
+  
+  if (current_function_calls_alloca
+      || cfun->has_nonlocal_label
+      || current_function_has_nonlocal_goto)
+    cfun->need_drap = true;
+
+  /* Call targetm.calls.internal_arg_pointer again.  This time it will
+     return a virtual register if DRAP is needed.  */
+  internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
+
+  /* Assertion to check internal_arg_pointer is set to the right rtx
+     here.  */
+  gcc_assert (crtl->args.internal_arg_pointer == 
+             virtual_incoming_args_rtx);
+
+  /* Do nothing if no need to replace virtual_incoming_args_rtx.  */
+  if (crtl->args.internal_arg_pointer != internal_arg_rtx)
+    {
+      crtl->args.internal_arg_pointer = internal_arg_rtx;
+
+      /* Call fixup_tail_casss to clean up REG_EQUIV note if DRAP is
+         needed. */
+      fixup_tail_calls ();
+    }
+}
+
 /* Translate the intermediate representation contained in the CFG
    from GIMPLE trees to RTL.
 
@@ -1970,6 +2007,9 @@ tree_expand_cfg (void)
   sbitmap_free (blocks);
 
   compact_blocks ();
+
+  handle_drap ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2037,70 +2077,3 @@ struct gimple_opt_pass pass_expand =
   TODO_dump_func,                       /* todo_flags_finish */
  }
 };
-
-static bool
-gate_handle_drap (void)
-{
-  if (!MAX_VECTORIZE_STACK_ALIGNMENT)
-    return false;
-  else
-    {
-      gcc_assert (!cfun->stack_realign_processed);
-      return true;
-    }
-}
-
-/* This pass sets crtl->args.internal_arg_pointer to a virtual
-   register if DRAP is needed.  Local register allocator will replace
-   virtual_incoming_args_rtx with the virtual register.  */
-
-static unsigned int
-handle_drap (void)
-{
-  rtx internal_arg_rtx; 
-
-  if (current_function_calls_alloca
-      || cfun->has_nonlocal_label
-      || current_function_has_nonlocal_goto)
-    cfun->need_drap = true;
-
-  /* Call targetm.calls.internal_arg_pointer again.  This time it will
-     return a virtual register if DRAP is needed.  */
-  internal_arg_rtx = targetm.calls.internal_arg_pointer (); 
-
-  /* Assertion to check internal_arg_pointer is set to the right rtx
-     here.  */
-  gcc_assert (crtl->args.internal_arg_pointer == 
-             virtual_incoming_args_rtx);
-
-  /* Do nothing if no need to replace virtual_incoming_args_rtx.  */
-  if (crtl->args.internal_arg_pointer != internal_arg_rtx)
-    {
-      crtl->args.internal_arg_pointer = internal_arg_rtx;
-
-      /* Call fixup_tail_casss to clean up REG_EQUIV note if DRAP is
-         needed. */
-      fixup_tail_calls ();
-    }
-
-  return 0;
-}
-
-struct gimple_opt_pass pass_handle_drap =
-{
- {
-  GIMPLE_PASS,
-  "handle_drap",			/* name */
-  gate_handle_drap,			/* gate */
-  handle_drap,			        /* execute */
-  NULL,                                 /* sub */
-  NULL,                                 /* next */
-  0,                                    /* static_pass_number */
-  0,				        /* tv_id */
-  0,                                    /* properties_required */
-  0,                                    /* properties_provided */
-  0,				        /* properties_destroyed */
-  0,                                    /* todo_flags_start */
-  TODO_dump_func,                       /* todo_flags_finish */
- }
-};
Index: passes.c
===================================================================
--- passes.c	(revision 134224)
+++ passes.c	(working copy)
@@ -685,7 +685,6 @@ init_optimization_passes (void)
   NEXT_PASS (pass_mudflap_2);
   NEXT_PASS (pass_free_cfg_annotations);
   NEXT_PASS (pass_expand);
-  NEXT_PASS (pass_handle_drap); 
   NEXT_PASS (pass_rest_of_compilation);
     {
       struct opt_pass **p = &pass_rest_of_compilation.pass.sub;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFA]: Merge stack alignment branch
  2008-04-14  9:53     ` Ye, Joey
@ 2008-04-15 17:35       ` H.J. Lu
  0 siblings, 0 replies; 26+ messages in thread
From: H.J. Lu @ 2008-04-15 17:35 UTC (permalink / raw)
  To: Ye, Joey; +Cc: Jan Hubicka, GCC Patches, Lu, Hongjiu, Guo, Xuepeng, ubizjak

[-- Attachment #1: Type: text/plain, Size: 692 bytes --]

Hi,

On Mon, Apr 14, 2008 at 2:51 AM, Ye, Joey <joey.ye@intel.com> wrote:
>
>  > frame_pointer_needed should IMO go into crtl, instead of cfun.  It is
>  > computed only at expansion time, right?
>  It is computed even after expansion time.
>

I am checking the enclosed patch into the stack branch.

Thanks.


H.J.
----
2008-04-15  Joey Ye  <joey.ye@intel.com>

        * global.c (compute_regsets): Replace cfun->need_frame_pointer_set
        with crtl->need_frame_pointer_set.
        * reload1.c (init_elim_table): Likewise.

        * function.h (function): Move need_frame_pointer and
        need_frame_pointer_set to ...
        (rtl_data): Here.
        (frame_pointer_needed): Updated.

[-- Attachment #2: fp.txt --]
[-- Type: text/plain, Size: 2755 bytes --]

Index: global.c
===================================================================
--- global.c	(revision 2168)
+++ global.c	(working copy)
@@ -260,7 +260,7 @@ compute_regsets (HARD_REG_SET *elim_set,
        || cfun->stack_realign_needed);
 
   frame_pointer_needed = need_fp;
-  cfun->need_frame_pointer_set = 1;
+  crtl->need_frame_pointer_set = 1;
 
   max_regno = max_reg_num ();
   compact_blocks ();
Index: function.h
===================================================================
--- function.h	(revision 2168)
+++ function.h	(working copy)
@@ -291,6 +291,12 @@ struct rtl_data GTY(())
   /* Current nesting level for temporaries.  */
   int x_temp_slot_level;
 
+  /* Nonzero if current function must be given a frame pointer.
+     Set in global.c if anything is allocated on the stack there.  */
+  unsigned int need_frame_pointer : 1;
+
+  /* Nonzero if need_frame_pointer has been set.  */
+  unsigned int need_frame_pointer_set : 1;
 };
 
 #define return_label (crtl->x_return_label)
@@ -304,6 +310,7 @@ struct rtl_data GTY(())
 #define avail_temp_slots (crtl->x_avail_temp_slots)
 #define temp_slot_level (crtl->x_temp_slot_level)
 #define nonlocal_goto_handler_labels (crtl->x_nonlocal_goto_handler_labels)
+#define frame_pointer_needed (crtl->need_frame_pointer)
 
 extern GTY(()) struct rtl_data x_rtl;
 
@@ -520,13 +527,6 @@ struct function GTY(())
   /* Nonzero if pass_tree_profile was run on this function.  */
   unsigned int after_tree_profile : 1;
 
-/* Nonzero if current function must be given a frame pointer.
-   Set in global.c if anything is allocated on the stack there.  */
-  unsigned int need_frame_pointer : 1;
-
-  /* Nonzero if need_frame_pointer has been set.  */
-  unsigned int need_frame_pointer_set : 1;
-
   /* Nonzero if, by estimation, current function stack needs realignment. */
   unsigned int stack_realign_needed : 1;
 
@@ -605,7 +605,6 @@ extern void instantiate_decl_rtl (rtx x)
 #define dom_computed (cfun->cfg->x_dom_computed)
 #define n_bbs_in_dom_tree (cfun->cfg->x_n_bbs_in_dom_tree)
 #define VALUE_HISTOGRAMS(fun) (fun)->value_histograms
-#define frame_pointer_needed (cfun->need_frame_pointer)
 #define stack_realign_fp (cfun->stack_realign_needed && !cfun->need_drap)
 #define stack_realign_drap (cfun->stack_realign_needed && cfun->need_drap)
 
Index: reload1.c
===================================================================
--- reload1.c	(revision 2168)
+++ reload1.c	(working copy)
@@ -3779,7 +3779,7 @@ init_elim_table (void)
     reg_eliminate = xcalloc (sizeof (struct elim_table), NUM_ELIMINABLE_REGS);
 
   /* frame_pointer_needed should has been set.  */
-  gcc_assert (cfun->need_frame_pointer_set);
+  gcc_assert (crtl->need_frame_pointer_set);
 
   num_eliminable = 0;
 



Thread overview: 26+ messages
2008-04-04  6:31 [RFA]: Merge stack alignment branch Ye, Joey
2008-04-04  6:39 ` Andrew Pinski
2008-04-04 12:40   ` H.J. Lu
2008-04-04 19:18     ` Andrew Pinski
2008-04-04 20:33       ` H.J. Lu
2008-04-05 15:24       ` Ye, Joey
2008-04-05 16:26   ` Ye, Joey
2008-04-04 19:05 ` Jan Hubicka
2008-04-04 21:05   ` H.J. Lu
2008-04-08  1:57   ` Ye, Joey
2008-04-11 12:32   ` Ye, Joey
2008-04-10 10:42 ` Ye, Joey
2008-04-11 13:27   ` Jan Hubicka
2008-04-12  3:39     ` H.J. Lu
2008-04-12 18:12     ` H.J. Lu
2008-04-12 18:36     ` H.J. Lu
2008-04-14  9:53     ` Ye, Joey
2008-04-14 16:46       ` H.J. Lu
2008-04-14  9:53     ` Ye, Joey
2008-04-14  9:53     ` Ye, Joey
2008-04-14  9:53     ` Ye, Joey
2008-04-15 17:35       ` H.J. Lu
2008-04-14  9:55     ` Ye, Joey
2008-04-14  9:58     ` Ye, Joey
2008-04-14 14:22       ` H.J. Lu
2008-04-14 10:37     ` Ye, Joey
