public inbox for gcc-patches@gcc.gnu.org
* Honnor ix86_accumulate_outgoing_args again
@ 2013-10-02 17:32 Jan Hubicka
  2013-10-02 22:45 ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2013-10-02 17:32 UTC (permalink / raw)
  To: gcc-patches, hjl.tools, ubizjak, rth, Ganesh.Gopalasubramanian

Hi,
currently ix86_accumulate_outgoing_args is ignored on all targets except for
Solaris (which sets USE_IX86_FRAME_POINTER to true).  This seems to be an
accidental effect of http://gcc.gnu.org/ml/gcc-patches/2010-08/txt00102.txt,
which enabled omit-frame-pointer for 32bit (I take it the 64bit change was
purely accidental), probably based on the fact that
non-accumulate-outgoing-args did not work well with asynchronous unwind info.

The reason for this seems to have gone away with
http://gcc.gnu.org/ml/gcc-patches/2013-03/msg00995.html

So I think we ought to honor accumulate-outgoing-args again and in fact
consider disabling it for generic - it is disabled for core (that may need
re-benchmarking).  For all AMD targets it is currently on.  I tested disabling
it on Bulldozer 32bit and it seems mostly SPEC neutral for specint2000 (I am
waiting for more benchmarks), with very nice code size improvements in all
benchmarks with the exception of MCF with LTO (not sure at all why), with an
overall reduction of 5.2% (approximately the same gain as we get from -flto)
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/size.html

There may be close-to-noise drops, as seen in
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/recent.html  I
will see how the other tests shape up and wait for multiple runs to show how
much of this is actual noise.  We may consider disabling it at least for
size-optimized functions and for -O2 (but not for -O3).

This patch, however, only removes the code forcibly enabling
MASK_ACCUMULATE_OUTGOING_ARGS.  If there are no complaints, I will commit it
tomorrow.

Honza

	* i386.c (ix86_option_override_internal): Do not force
	ACCUMULATE_OUTGOING_ARGS when unwind info is generated.
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 203117)
+++ config/i386/i386.c	(working copy)
@@ -3793,28 +3793,11 @@ ix86_option_override_internal (bool main
       }
 
   ix86_tune_mask = 1u << ix86_tune;
-  if ((!USE_IX86_FRAME_POINTER
-       || (x86_accumulate_outgoing_args & ix86_tune_mask))
+  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
       && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
       && !optimize_size)
     target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
 
-  /* ??? Unwind info is not correct around the CFG unless either a frame
-     pointer is present or M_A_O_A is set.  Fixing this requires rewriting
-     unwind info generation to be aware of the CFG and propagating states
-     around edges.  */
-  if ((flag_unwind_tables || flag_asynchronous_unwind_tables
-       || flag_exceptions || flag_non_call_exceptions)
-      && flag_omit_frame_pointer
-      && !(target_flags & MASK_ACCUMULATE_OUTGOING_ARGS))
-    {
-      if (target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
-	warning (0, "unwind tables currently require either a frame pointer "
-		 "or %saccumulate-outgoing-args%s for correctness",
-		 prefix, suffix);
-      target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
-    }
-
   /* If stack probes are required, the space used for large function
      arguments on the stack must also be probed, so enable
      -maccumulate-outgoing-args so this happens in the prologue.  */

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-02 17:32 Honnor ix86_accumulate_outgoing_args again Jan Hubicka
@ 2013-10-02 22:45 ` Jan Hubicka
  2013-10-02 22:51   ` H.J. Lu
  2013-10-03  1:11   ` Vladimir Makarov
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Hubicka @ 2013-10-02 22:45 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: gcc-patches, hjl.tools, ubizjak, rth, Ganesh.Gopalasubramanian, vmakarov

> So I thing we ought to honnor accumulate-outgoing-args again and in fact
> consider disabling it for generic - it is disabled for core (that may need
> re-benchmarking). For all AMD targets it is currently on.  I tested disabling
> it on buldozer 32bit and it seems mostly SPEC neutral for specint2000 (I am
> wating for more benchmarks) with very nice code size improvements in all
> benchmarks with exception of MCF with LTO (not sure at all why), with overall
> reduction of 5.2% (same gain as we get for -flto aproximately)
> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/size.html
Specint 2000/2006 seems quite good (though for spec2k6 the size differences
are no longer anywhere near the -flto code size reductions).

Unfortunately there is a 40% regression on mgrid with -flto (and also a
noticeable regression without LTO).  The first thing I noticed is that we stop
omitting the frame pointer in the hottest function.  This is because we see:

(set (reg/f:SI 7 sp)
    (plus:SI (reg/f:SI 7 sp)
        (const_int -8 [0xfffffffffffffff8])))

and we end up marking SP as uneliminable in:

          /* See if this is setting the replacement hard register for
             an elimination.

             If DEST is the hard frame pointer, we do nothing because
             we assume that all assignments to the frame pointer are
             for non-local gotos and are being done at a time when
             they are valid and do not disturb anything else.  Some
             machines want to eliminate a fake argument pointer (or
             even a fake frame pointer) with either the real frame
             pointer or the stack pointer.  Assignments to the hard
             frame pointer must not prevent this elimination.  */

          for (ep = reg_eliminate;
               ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
               ep++)
            if (ep->to_rtx == SET_DEST (x)
                && SET_DEST (x) != hard_frame_pointer_rtx
                && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
                       && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
                    || GET_CODE (SET_SRC (x)) != PLUS
                    || XEXP (SET_SRC (x), 0) != SET_DEST (x)
                    || ! CONST_INT_P (XEXP (SET_SRC (x), 1))))
              setup_can_eliminate (ep, false);

It is because of

                && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
                       && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)

I am somewhat confused about why we need to stop eliminating.  The function is
not marked as needing drap (and in that case stack_realign_fp would be true).
What is this conditional shooting for?

Mgrid is somewhat sensitive to register pressure, so this actually may
explain the 40% difference...

Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-02 22:45 ` Jan Hubicka
@ 2013-10-02 22:51   ` H.J. Lu
  2013-10-03  9:24     ` Jan Hubicka
  2013-10-03  1:11   ` Vladimir Makarov
  1 sibling, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2013-10-02 22:51 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: GCC Patches, Uros Bizjak, Richard Henderson,
	Ganesh.Gopalasubramanian, Vladimir Makarov

On Wed, Oct 2, 2013 at 3:45 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> So I thing we ought to honnor accumulate-outgoing-args again and in fact
>> consider disabling it for generic - it is disabled for core (that may need
>> re-benchmarking). For all AMD targets it is currently on.  I tested disabling
>> it on buldozer 32bit and it seems mostly SPEC neutral for specint2000 (I am
>> wating for more benchmarks) with very nice code size improvements in all
>> benchmarks with exception of MCF with LTO (not sure at all why), with overall
>> reduction of 5.2% (same gain as we get for -flto aproximately)
>> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/size.html
> Specint 2000/2006 seems quite good. (though for spec2k6 the differences in sizes
> are no longer anywhere near to -flto code size reductions)
>
> Unfortunately there is 40% regression on mgrid with -flto (and also noticeable
> regression without LTO).  First thing I noticed is that we stop omitting frame
> pointer in the hottest function.  This is because we see:

Does it happen with both 32-bit and 64-bit?

> (set (reg/f:SI 7 sp)
>     (plus:SI (reg/f:SI 7 sp)
>         (const_int -8 [0xfffffffffffffff8])))
>
> and we end up marking SP as as uneliminable in:
>
>           /* See if this is setting the replacement hard register for
>              an elimination.
>
>              If DEST is the hard frame pointer, we do nothing because
>              we assume that all assignments to the frame pointer are
>              for non-local gotos and are being done at a time when
>              they are valid and do not disturb anything else.  Some
>              machines want to eliminate a fake argument pointer (or
>              even a fake frame pointer) with either the real frame
>              pointer or the stack pointer.  Assignments to the hard
>              frame pointer must not prevent this elimination.  */
>
>           for (ep = reg_eliminate;
>                ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
>                ep++)
>             if (ep->to_rtx == SET_DEST (x)
>                 && SET_DEST (x) != hard_frame_pointer_rtx
>                 && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
>                        && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
>                     || GET_CODE (SET_SRC (x)) != PLUS
>                     || XEXP (SET_SRC (x), 0) != SET_DEST (x)
>                     || ! CONST_INT_P (XEXP (SET_SRC (x), 1))))
>               setup_can_eliminate (ep, false);
>
> It is because of
>
>                 && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
>                        && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
>
> I am somewhat confused why do we need to stop eliminating.  Function is not
> marked as needing drap (and in that case stack_realign_fp would be true)
> What is this conditional shooting for?

Why is stack_realign_fp true?


-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-02 22:45 ` Jan Hubicka
  2013-10-02 22:51   ` H.J. Lu
@ 2013-10-03  1:11   ` Vladimir Makarov
  2013-10-03 13:05     ` Jan Hubicka
  1 sibling, 1 reply; 21+ messages in thread
From: Vladimir Makarov @ 2013-10-03  1:11 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: gcc-patches, hjl.tools, ubizjak, rth, Ganesh.Gopalasubramanian

On 13-10-02 6:45 PM, Jan Hubicka wrote:
>> So I thing we ought to honnor accumulate-outgoing-args again and in fact
>> consider disabling it for generic - it is disabled for core (that may need
>> re-benchmarking). For all AMD targets it is currently on.  I tested disabling
>> it on buldozer 32bit and it seems mostly SPEC neutral for specint2000 (I am
>> wating for more benchmarks) with very nice code size improvements in all
>> benchmarks with exception of MCF with LTO (not sure at all why), with overall
>> reduction of 5.2% (same gain as we get for -flto aproximately)
>> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-32o-32bit/size.html
> Specint 2000/2006 seems quite good. (though for spec2k6 the differences in sizes
> are no longer anywhere near to -flto code size reductions)
>
> Unfortunately there is 40% regression on mgrid with -flto (and also noticeable
> regression without LTO).  First thing I noticed is that we stop omitting frame
> pointer in the hottest function.  This is because we see:
>
> (set (reg/f:SI 7 sp)
>      (plus:SI (reg/f:SI 7 sp)
>          (const_int -8 [0xfffffffffffffff8])))
>
> and we end up marking SP as as uneliminable in:
>
>            /* See if this is setting the replacement hard register for
>               an elimination.
>
>               If DEST is the hard frame pointer, we do nothing because
>               we assume that all assignments to the frame pointer are
>               for non-local gotos and are being done at a time when
>               they are valid and do not disturb anything else.  Some
>               machines want to eliminate a fake argument pointer (or
>               even a fake frame pointer) with either the real frame
>               pointer or the stack pointer.  Assignments to the hard
>               frame pointer must not prevent this elimination.  */
>
>            for (ep = reg_eliminate;
>                 ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
>                 ep++)
>              if (ep->to_rtx == SET_DEST (x)
>                  && SET_DEST (x) != hard_frame_pointer_rtx
>                  && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
>                         && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
>                      || GET_CODE (SET_SRC (x)) != PLUS
>                      || XEXP (SET_SRC (x), 0) != SET_DEST (x)
>                      || ! CONST_INT_P (XEXP (SET_SRC (x), 1))))
>                setup_can_eliminate (ep, false);
>
> It is because of
>
>                  && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
>                         && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
>
> I am somewhat confused why do we need to stop eliminating.  Function is not
> marked as needing drap (and in that case stack_realign_fp would be true)
> What is this conditional shooting for?
This part was added to fix

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57018

and because LRA still misses some reload functionality for elimination.
I am a bit embarrassed: this has been on my to-do list for 4 months and I
still have not started working on it yet.  There are too many things on my
plate.

As we are going to use outgoing arg accumulation, this problem is
becoming a higher-priority one.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-02 22:51   ` H.J. Lu
@ 2013-10-03  9:24     ` Jan Hubicka
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2013-10-03  9:24 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, GCC Patches, Uros Bizjak, Richard Henderson,
	Ganesh.Gopalasubramanian, Vladimir Makarov

> > Unfortunately there is 40% regression on mgrid with -flto (and also noticeable
> > regression without LTO).  First thing I noticed is that we stop omitting frame
> > pointer in the hottest function.  This is because we see:
> 
> Does it happen with both 32-bit and 64-bit?
No, 32bit only.
> 
> > (set (reg/f:SI 7 sp)
> >     (plus:SI (reg/f:SI 7 sp)
> >         (const_int -8 [0xfffffffffffffff8])))
> >
> > and we end up marking SP as as uneliminable in:
> >
> >           /* See if this is setting the replacement hard register for
> >              an elimination.
> >
> >              If DEST is the hard frame pointer, we do nothing because
> >              we assume that all assignments to the frame pointer are
> >              for non-local gotos and are being done at a time when
> >              they are valid and do not disturb anything else.  Some
> >              machines want to eliminate a fake argument pointer (or
> >              even a fake frame pointer) with either the real frame
> >              pointer or the stack pointer.  Assignments to the hard
> >              frame pointer must not prevent this elimination.  */
> >
> >           for (ep = reg_eliminate;
> >                ep < &reg_eliminate[NUM_ELIMINABLE_REGS];
> >                ep++)
> >             if (ep->to_rtx == SET_DEST (x)
> >                 && SET_DEST (x) != hard_frame_pointer_rtx
> >                 && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
> >                        && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
> >                     || GET_CODE (SET_SRC (x)) != PLUS
> >                     || XEXP (SET_SRC (x), 0) != SET_DEST (x)
> >                     || ! CONST_INT_P (XEXP (SET_SRC (x), 1))))
> >               setup_can_eliminate (ep, false);
> >
> > It is because of
> >
> >                 && (! (SUPPORTS_STACK_ALIGNMENT && stack_realign_fp
> >                        && REGNO (ep->to_rtx) == STACK_POINTER_REGNUM)
> >
> > I am somewhat confused why do we need to stop eliminating.  Function is not
> > marked as needing drap (and in that case stack_realign_fp would be true)
> > What is this conditional shooting for?
> 
> Why is stack_realign_fp true?
It is false, but there is a ! in the conditional.

Honza
> 
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-03  1:11   ` Vladimir Makarov
@ 2013-10-03 13:05     ` Jan Hubicka
  2013-10-10 18:48       ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2013-10-03 13:05 UTC (permalink / raw)
  To: Vladimir Makarov
  Cc: Jan Hubicka, gcc-patches, hjl.tools, ubizjak, rth,
	Ganesh.Gopalasubramanian

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57018
> 
> and because LRA still misses some reload functionality for
> elimination.  I am a bit embarrassed: I have this thing to do for 4
> months and I still did not start to work on it yet.  There are too
> much things on my plate.
> 
> As we are going to use outgoing arg accumulation, this problem is
> becoming higher priority one.

Thank you,
we currently always use outgoing arg accumulation on x86_64; I plan to
re-disable arg accumulation on CPUs that handle push/pop well (i.e. have a
stack engine).  This brings nice code size savings.

I wonder how much of this actually comes from not omitting the frame pointer
in non-leaf functions with IRA.  EBP-based addressing is more compact than
ESP-based, and thus -fomit-frame-pointer is disabled with -Os.
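
To illustrate (a generic hand-written example, not taken from the benchmark):
a disp8 memory operand based on EBP needs no SIB byte, while the same operand
based on ESP does, so the ESP form is one byte longer:

   mov 0x8(%ebp),%eax     # 8b 45 08       3 bytes
   mov 0x8(%esp),%eax     # 8b 44 24 08    4 bytes; an ESP base forces a SIB byte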

Perhaps frame pointer elimination could actually be decided by the register
allocator?

On a similar note, I just benchmarked -mfpmath=sse for 32bit code and it is
quite a big performance win, and again causes about a 5% code size regression.
I want to propose defaulting to -mfpmath=sse for 32bit with -ffast-math and
-Ofast.  (In a way I would like to see -mfpmath=sse by default for 32bit on
CPUs supporting SSE2, but that was voted down a long time ago because it loses
the 80bit precision for temporaries in double/float computations.)
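
A minimal sketch of the 80bit issue (a hypothetical test case, not part of
this patch): with -m32 -mfpmath=387 the temporary below may be kept in an
80bit x87 register, while -mfpmath=sse rounds to 64bit double after each
operation, so the two modes can give different results.

  #include <stdio.h>

  int
  main (void)
  {
    volatile double big = 1e16;   /* exactly representable as a double */
    volatile double one = 1.0;
    double sum = big + one;       /* exact in 80bit, rounds back to 1e16 in 64bit */
    /* SSE math prints 1; 387 math may print 0 when the temporary stays
       in an x87 register.  */
    printf ("%d\n", sum == big);
    return 0;
  }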

I wonder if we can eventually make -mfpmath=sse,387 working well (I did not
bechmark it yet, but statically it still produces more spiling than -mfpmath=sse)
and/or if we can possibly decide on fpmath based on hotness of function
(at least with profile around).

Thanks for all the hard work on IRA!
Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-03 13:05     ` Jan Hubicka
@ 2013-10-10 18:48       ` Jan Hubicka
       [not found]         ` <0EFAB2BDD0F67E4FB6CCC8B9F87D7569427DC81A@IRSMSX101.ger.corp.intel.com>
                           ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jan Hubicka @ 2013-10-10 18:48 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Vladimir Makarov, gcc-patches, hjl.tools, ubizjak, rth,
	Ganesh.Gopalasubramanian

Hi,
this patch makes ACCUMULATE_OUTGOING_ARGS disable itself when the function is
cold.  I did some extra testing, and to my amusement we now seem to output
more compact unwind info when ACCUMULATE_OUTGOING_ARGS is disabled, so this
seems to be a quite consistent code size win.

We can actually do better and enable ACCUMULATE_OUTGOING_ARGS only when the
function contains hot calls.  This should then also avoid the need for frame
allocation in the prologue/epilogue on the hot path.  I will look into this
incrementally.

I also noticed that we still have some tuning flags in i386.c rather than
in x86-tune.def, so I moved them there.

Testing on x86_64-linux; I will commit it once testing converges.

Honza
	* config/i386/i386.h (ACCUMULATE_OUTGOING_ARGS): Disable accumulation
	for cold functions.
	* x86-tune.def (X86_TUNE_USE_LEAVE): Update comment.
	(X86_TUNE_PUSH_MEMORY): Likewise.
	(X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
	X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL): New.
	(X86_TUNE_ACCUMULATE_OUTGOING_ARGS, X86_TUNE_ALWAYS_FANCY_MATH_387): New.
	* i386.c (x86_accumulate_outgoing_args, x86_arch_always_fancy_math_387,
	x86_avx256_split_unaligned_load, x86_avx256_split_unaligned_store):
	Remove.
	(ix86_option_override_internal): Update to use tune features instead
	of variables.
Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 203380)
+++ config/i386/i386.h	(working copy)
@@ -1492,13 +1492,26 @@ enum reg_class
    will be computed and placed into the variable `crtl->outgoing_args_size'.
    No space will be pushed onto the stack for each call; instead, the
    function prologue should increase the stack frame size by this amount.  
+
+   In 32bit mode enabling argument accumulation results in about 5% code size
+   growth because move instructions are less compact than push.  In 64bit
+   mode the difference is less drastic but visible.  
+
+   FIXME: Unlike earlier implementations, the size of unwind info seems to
+   actually grow with accumulation.  Is that because accumulated args
+   unwind info became unnecessarily bloated?
    
    64-bit MS ABI seem to require 16 byte alignment everywhere except for
-   function prologue and apilogue.  This is not possible without
-   ACCUMULATE_OUTGOING_ARGS.  */
+   function prologue and epilogue.  This is not possible without
+   ACCUMULATE_OUTGOING_ARGS.  
+
+   If stack probes are required, the space used for large function
+   arguments on the stack must also be probed, so enable
+   -maccumulate-outgoing-args so this happens in the prologue.  */
 
 #define ACCUMULATE_OUTGOING_ARGS \
-  (TARGET_ACCUMULATE_OUTGOING_ARGS || TARGET_64BIT_MS_ABI)
+  ((TARGET_ACCUMULATE_OUTGOING_ARGS && optimize_function_for_speed_p (cfun)) \
+   || TARGET_STACK_PROBE || TARGET_64BIT_MS_ABI)
 
 /* If defined, a C expression whose value is nonzero when we want to use PUSH
    instructions to pass outgoing arguments.  */
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def	(revision 203387)
+++ config/i386/x86-tune.def	(working copy)
@@ -18,15 +18,13 @@ a copy of the GCC Runtime Library Except
 see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 <http://www.gnu.org/licenses/>.  */
 
-/* X86_TUNE_USE_LEAVE: Leave does not affect Nocona SPEC2000 results
-   negatively, so enabling for Generic64 seems like good code size
-   tradeoff.  We can't enable it for 32bit generic because it does not
-   work well with PPro base chips.  */
+/* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it fits.  */
 DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave", 
 	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
 
 /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
-   Some chips, like 486 and Pentium have problems with these sequences.  */
+   Some chips, like 486 and Pentium, work faster with separate load
+   and push instructions.  */
 DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory", 
           m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE 
           | m_GENERIC)
@@ -210,6 +208,16 @@ DEF_TUNE (X86_TUNE_SSE_UNALIGNED_LOAD_OP
 DEF_TUNE (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL, "sse_unaligned_store_optimal",
           m_COREI7 | m_BDVER | m_SLM | m_GENERIC)
 
+/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads are
+   split.  */
+DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
+          ~(m_COREI7 | m_GENERIC))
+
+/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores are
+   split.  */
+DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
+          ~(m_COREI7 | m_BDVER | m_GENERIC))
+
 /* Use packed single precision instructions where posisble.  I.e. movups instead
    of movupd.  */
 DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal",
@@ -398,3 +406,24 @@ DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CM
    fp converts to destination register.  */
 DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
           m_SLM)
+
+/* X86_TUNE_ACCUMULATE_OUTGOING_ARGS: Allocate stack space for outgoing
+   arguments in prologue/epilogue instead of separately for each call
+   by push/pop instructions.
+   This increases code size by about 5% in 32bit mode, less so in 64bit mode
+   because parameters are passed in registers.  It is a considerable
+   win for targets without a stack engine that would allow multiple push
+   operations to happen in parallel.
+
+   FIXME: the flag is incorrectly enabled for amdfam10, Bulldozer,
+   Bobcat and Generic.  This is because disabling it causes a large
+   regression on mgrid due to an IRA limitation leading to unnecessary
+   use of the frame pointer in 32bit mode.  */
+DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args", 
+	  m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC)
+
+/* X86_TUNE_ALWAYS_FANCY_MATH_387: controls use of fancy 387 operations,
+   such as fsqrt, fprem, fsin, fcos, fsincos etc.
+   Should be enabled for all targets that always have a coprocessor.  */
+DEF_TUNE (X86_TUNE_ALWAYS_FANCY_MATH_387, "always_fancy_math_387", 
+          ~(m_386 | m_486))
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 203380)
+++ config/i386/i386.c	(working copy)
@@ -1898,18 +1898,6 @@ static unsigned int initial_ix86_arch_fe
   ~m_386,
 };
 
-static const unsigned int x86_accumulate_outgoing_args
-  = m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
-
-static const unsigned int x86_arch_always_fancy_math_387
-  = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
-
-static const unsigned int x86_avx256_split_unaligned_load
-  = m_COREI7 | m_GENERIC;
-
-static const unsigned int x86_avx256_split_unaligned_store
-  = m_COREI7 | m_BDVER | m_GENERIC;
-
 /* In case the average insn count for single function invocation is
    lower than this constant, emit fast (but longer) prologue and
    epilogue code.  */
@@ -2920,7 +2908,7 @@ static void
 ix86_option_override_internal (bool main_args_p)
 {
   int i;
-  unsigned int ix86_arch_mask, ix86_tune_mask;
+  unsigned int ix86_arch_mask;
   const bool ix86_tune_specified = (ix86_tune_string != NULL);
   const char *prefix;
   const char *suffix;
@@ -3673,7 +3661,7 @@ ix86_option_override_internal (bool main
 
   /* If the architecture always has an FPU, turn off NO_FANCY_MATH_387,
      since the insns won't need emulation.  */
-  if (x86_arch_always_fancy_math_387 & ix86_arch_mask)
+  if (ix86_tune_features [X86_TUNE_ALWAYS_FANCY_MATH_387])
     target_flags &= ~MASK_NO_FANCY_MATH_387;
 
   /* Likewise, if the target doesn't have a 387, or we've specified
@@ -3805,8 +3793,7 @@ ix86_option_override_internal (bool main
 	gcc_unreachable ();
       }
 
-  ix86_tune_mask = 1u << ix86_tune;
-  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
+  if (ix86_tune_features [X86_TUNE_ACCUMULATE_OUTGOING_ARGS]
       && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
       && !optimize_size)
     target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
@@ -3946,10 +3933,10 @@ ix86_option_override_internal (bool main
       if (flag_expensive_optimizations
 	  && !(target_flags_explicit & MASK_VZEROUPPER))
 	target_flags |= MASK_VZEROUPPER;
-      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
+      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL]
 	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
 	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
-      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
+      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL]
 	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
       /* Enable 128-bit AVX instruction generation

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
       [not found]               ` <0EFAB2BDD0F67E4FB6CCC8B9F87D7569427F84D0@IRSMSX101.ger.corp.intel.com>
@ 2013-10-19 21:28                 ` Jan Hubicka
  2013-10-21  8:01                   ` Vladimir Makarov
  2013-10-22  8:36                   ` Zamyatin, Igor
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Hubicka @ 2013-10-19 21:28 UTC (permalink / raw)
  To: Zamyatin, Igor, gcc-patches, vmakarov; +Cc: 'Jan Hubicka'

> Jan,
> 
> Does this seem reasonable to you?

Oops, sorry, I missed your email. (I was travelling and I am finishing a paper
now).
> 
> Thanks,
> Igor
> 
> > -----Original Message-----
> > From: Zamyatin, Igor
> > Sent: Tuesday, October 15, 2013 3:48 PM
> > To: Jan Hubicka
> > Subject: RE: Honnor ix86_accumulate_outgoing_args again
> > 
> > Jan,
> > 
> > Now we have following prologue in, say, phi0 routine in equake
> > 
> > 0x804aa90 1  push   %ebp
> > 0x804aa91 2  mov    %esp,%ebp
> > 0x804aa93 3  sub    $0x18,%esp
> > 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
> > 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
> > 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see big stall somewhere here
> > or 1-2 instructions above
> > 
> > While earlier it was
> > 
> > 0x804abd0 1 sub    $0x2c,%esp
> > 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
> > 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
> > 0x804abe1 4 vcomisd %xmm1,%xmm0

Thanks for the analysis!  It is a different benchmark than for Bulldozer, but
apparently the same case.  Again we used to eliminate the frame pointer here,
but IRA now doesn't.  Do you see the same regression with
-fno-omit-frame-pointer -maccumulate-outgoing-args?

I suppose this is a conflict between the push instruction, which is handled by
the stack engine, and the initialization of EBP, which isn't.  That would
explain why Bulldozer doesn't seem to care about this particular benchmark
(its stack engine seems to have a quite different design).

This is a bit of a sad situation - accumulate-outgoing-args is expensive
code-size-wise, and it seems we don't really need esp with
-mno-accumulate-outgoing-args.  The non-accumulation code path was mistakenly
disabled for too long ;(

Vladimir, how much effort do you think it would take to fix the frame pointer
elimination here?
I can imagine it is quite a tricky case.  If so, I would suggest adding
m_CORE_ALL to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a comment explaining the
problem and mentioning the regression on equake on Core and mgrid on
Bulldozer, and opening an enhancement request for this...

I also wonder, if direct ESP use and push/pop instructions are causing such
noticeable issues, whether we can't "shrink wrap" this into the red zone in
64bit compilation.  It seems that even with -maccumulate-outgoing-args,
pushing the frame allocation as late as possible in the function would be a
good idea, so it is not close to the push/pop/call/ret.

Honza

> > 
> > Note that for Atom and SLM no regression happens because they are already
> > in x86_accumulate_outgoing_args. As for other machines  - seems now
> > (after your change) they don't get that
> > MASK_ACCUMULATE_OUTGOING_ARGS and it leads to using ebp in the
> > prologue.
> > 
> > Thanks,
> > Igor
> > 
> > -----Original Message-----
> > From: Jan Hubicka [mailto:hubicka@ucw.cz]
> > Sent: Monday, October 14, 2013 8:44 PM
> > To: Zamyatin, Igor
> > Cc: Jan Hubicka
> > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > 
> > > Jan,
> > >
> > > I see that you haven't committed this change. Any particular reason for
> > this?
> > 
> > No, seems I forgot about it.
> > >
> > > BTW, after r203171 (http://gcc.gnu.org/ml/gcc-cvs/2013-
> > 10/msg00122.html) we see lot of performance degradation on spec2000 and
> > spec2006 tests on SandyBridge and IvyBridge in 32 bits mode. E.g.
> > 183.equake got ~-15%. Have you seen it?
> > 
> > I tested this only on Bulldozer chips where I saw large regression from mgrid,
> > but curiously not from equake.  I tracked it down to frame pointer being
> > forced by IRA that Vladimir promised to look at later.
> > 
> > I assumed that arg accumulation was tested before the flags was set, so I did
> > not benchmarked chips that previously dsabled it. Perhaps things changed
> > because IRA was merged in meantime while this feature was accidentally
> > disabled.
> > 
> > It would be great if you can figure out why equake regress - in FP
> > benchmarks we do not use push/pop so perhaps we just end up emitting
> > something really stupid or we manage to confuse stack engine...
> > 
> > Honza
> > 
> > >
> > > Thanks,
> > > Igor
> > >
> > > -----Original Message-----
> > > From: gcc-patches-owner@gcc.gnu.org
> > > [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Jan Hubicka
> > > Sent: Thursday, October 10, 2013 10:40 PM
> > > To: Jan Hubicka
> > > Cc: Vladimir Makarov; gcc-patches@gcc.gnu.org; hjl.tools@gmail.com;
> > > ubizjak@gmail.com; rth@redhat.com;
> > Ganesh.Gopalasubramanian@amd.com
> > > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > >
> > > Hi,
> > > this patch makes ACCUMULATE_OUTGOING_ARGS to disable itself when
> > function is cold.  I did some extra testing and to my amusement we now
> > seem to output more compact unwind info when
> > ACCUMULATE_OUTGOING_ARGS is disabled, so this seems quite consistent
> > code size win.
> > >
> > > We actually can do better and enable ACCUMULATE_OUTGOING_ARGS
> > only when function contains hot calls.  This should also avoid need for frame
> > allocation in prologue/epilogue on hot path then. I will look into this
> > incrementally..
> > >
> > > I also noticed that we still have some tuning flags in i386.c rather than in
> > x86-tune.c so I moved them there.
> > >
> > > Testing x86_64-linux and will commit it once testing converge.
> > >
> > > Honza
> > > 	* config/i386/i386.h (ACCUMULATE_OUTGOING_ARGS): Disable
> > accumulation
> > > 	for cold functions.
> > > 	* x86-tune.def (X86_TUNE_USE_LEAVE): Update comment.
> > > 	(X86_TUNE_PUSH_MEMORY): Likewise.
> > > 	(X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
> > > 	X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL): New.
> > > 	(X86_TUNE_ACCUMULATE_OUTGOING_ARGS,
> > X86_TUNE_ALWAYS_FANCY_MATH_387): New.
> > > 	* i386.c (x86_accumulate_outgoing_args,
> > x86_arch_always_fancy_math_387,
> > > 	x86_avx256_split_unaligned_load,
> > x86_avx256_split_unaligned_store):
> > > 	Remove.
> > > 	(ix86_option_override_internal): Update to use tune features
> > instead
> > > 	of variables.
> > > Index: config/i386/i386.h
> > >
> > ==========================================================
> > =========
> > > --- config/i386/i386.h	(revision 203380)
> > > +++ config/i386/i386.h	(working copy)
> > > @@ -1492,13 +1492,26 @@ enum reg_class
> > >     will be computed and placed into the variable `crtl->outgoing_args_size'.
> > >     No space will be pushed onto the stack for each call; instead, the
> > >     function prologue should increase the stack frame size by this amount.
> > > +
> > > +   In 32bit mode enabling argument accumulation results in about 5% code
> > size
> > > +   growth becuase move instructions are less compact than push.  In 64bit
> > > +   mode the difference is less drastic but visible.
> > > +
> > > +   FIXME: Unlike earlier implementations, the size of unwind info seems to
> > > +   actually grouw with accumulation.  Is that because accumulated args
> > > +   unwind info became unnecesarily bloated?
> > >
> > >     64-bit MS ABI seem to require 16 byte alignment everywhere except for
> > > -   function prologue and apilogue.  This is not possible without
> > > -   ACCUMULATE_OUTGOING_ARGS.  */
> > > +   function prologue and epilogue.  This is not possible without
> > > +   ACCUMULATE_OUTGOING_ARGS.
> > > +
> > > +   If stack probes are required, the space used for large function
> > > +   arguments on the stack must also be probed, so enable
> > > +   -maccumulate-outgoing-args so this happens in the prologue.  */
> > >
> > >  #define ACCUMULATE_OUTGOING_ARGS \
> > > -  (TARGET_ACCUMULATE_OUTGOING_ARGS || TARGET_64BIT_MS_ABI)
> > > +  ((TARGET_ACCUMULATE_OUTGOING_ARGS &&
> > optimize_function_for_speed_p (cfun)) \
> > > +   || TARGET_STACK_PROBE || TARGET_64BIT_MS_ABI)
> > >
> > >  /* If defined, a C expression whose value is nonzero when we want to use
> > PUSH
> > >     instructions to pass outgoing arguments.  */
> > > Index: config/i386/x86-tune.def
> > >
> > ==========================================================
> > =========
> > > --- config/i386/x86-tune.def	(revision 203387)
> > > +++ config/i386/x86-tune.def	(working copy)
> > > @@ -18,15 +18,13 @@ a copy of the GCC Runtime Library Except  see the
> > > files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> > > <http://www.gnu.org/licenses/>.  */
> > >
> > > -/* X86_TUNE_USE_LEAVE: Leave does not affect Nocona SPEC2000 results
> > > -   negatively, so enabling for Generic64 seems like good code size
> > > -   tradeoff.  We can't enable it for 32bit generic because it does not
> > > -   work well with PPro base chips.  */
> > > +/* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where it
> > > +fits.  */
> > >  DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
> > >  	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE |
> > m_GENERIC)
> > >
> > >  /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem"
> > instructions.
> > > -   Some chips, like 486 and Pentium have problems with these sequences..
> > */
> > > +   Some chips, like 486 and Pentium works faster with separate load
> > > +   and push instructions.  */
> > >  DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
> > >            m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE |
> > m_AMD_MULTIPLE
> > >            | m_GENERIC)
> > > @@ -210,6 +208,16 @@ DEF_TUNE
> > (X86_TUNE_SSE_UNALIGNED_LOAD_OP  DEF_TUNE
> > (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL,
> > "sse_unaligned_store_optimal",
> > >            m_COREI7 | m_BDVER | m_SLM | m_GENERIC)
> > >
> > > +/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if true, unaligned
> > loads are
> > > +   split.  */
> > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
> > "256_unaligned_load_optimal",
> > > +          ~(m_COREI7 | m_GENERIC))
> > > +
> > > +/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if true, unaligned
> > loads are
> > > +   split.  */
> > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL,
> > "256_unaligned_load_optimal",
> > > +          ~(m_COREI7 | m_BDVER | m_GENERIC))
> > > +
> > >  /* Use packed single precision instructions where posisble.  I.e. movups
> > instead
> > >     of movupd.  */
> > >  DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL,
> > "sse_packed_single_insn_optimal", @@ -398,3 +406,24 @@ DEF_TUNE
> > (X86_TUNE_AVOID_MEM_OPND_FOR_CM
> > >     fp converts to destination register.  */  DEF_TUNE
> > (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS,
> > "split_mem_opnd_for_fp_converts",
> > >            m_SLM)
> > > +
> > > +/* X86_TUNE_ACCUMULATE_OUTGOING_ARGS: Allocate stack space for
> > outgoing
> > > +   arguments in prologue/epilogue instead of separately for each call
> > > +   by push/pop instructions.
> > > +   This increase code size by about 5% in 32bit mode, less so in 64bit mode
> > > +   because parameters are passed in registers.  It is considerable
> > > +   win for targets without stack engine that prevents multple push
> > operations
> > > +   to happen in parallel.
> > > +
> > > +   FIXME: the flags is incorrectly enabled for amdfam10, Bulldozer,
> > > +   Bobcat and Generic.  This is because disabling it causes large
> > > +   regression on mgrid due to IRA limitation leading to unecessary
> > > +   use of the frame pointer in 32bit mode.  */ DEF_TUNE
> > > +(X86_TUNE_ACCUMULATE_OUTGOING_ARGS,
> > "accumulate_outgoing_args",
> > > +	  m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM |
> > m_AMD_MULTIPLE |
> > > +m_GENERIC)
> > > +
> > > +/* X86_TUNE_ALWAYS_FANCY_MATH_387: controls use of fancy 387
> > operations,
> > > +   such as fsqrt, fprem, fsin, fcos, fsincos etc.
> > > +   Should be enabled for all targets that always has coprocesor.  */
> > > +DEF_TUNE (X86_TUNE_ALWAYS_FANCY_MATH_387,
> > "always_fancy_math_387",
> > > +          ~(m_386 | m_486))
> > > Index: config/i386/i386.c
> > >
> > ==========================================================
> > =========
> > > --- config/i386/i386.c	(revision 203380)
> > > +++ config/i386/i386.c	(working copy)
> > > @@ -1898,18 +1898,6 @@ static unsigned int initial_ix86_arch_fe
> > >    ~m_386,
> > >  };
> > >
> > > -static const unsigned int x86_accumulate_outgoing_args
> > > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE
> > |
> > > m_GENERIC;
> > > -
> > > -static const unsigned int x86_arch_always_fancy_math_387
> > > -  = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM |
> > m_SLM |
> > > m_AMD_MULTIPLE | m_GENERIC;
> > > -
> > > -static const unsigned int x86_avx256_split_unaligned_load
> > > -  = m_COREI7 | m_GENERIC;
> > > -
> > > -static const unsigned int x86_avx256_split_unaligned_store
> > > -  = m_COREI7 | m_BDVER | m_GENERIC;
> > > -
> > >  /* In case the average insn count for single function invocation is
> > >     lower than this constant, emit fast (but longer) prologue and
> > >     epilogue code.  */
> > > @@ -2920,7 +2908,7 @@ static void
> > >  ix86_option_override_internal (bool main_args_p)  {
> > >    int i;
> > > -  unsigned int ix86_arch_mask, ix86_tune_mask;
> > > +  unsigned int ix86_arch_mask;
> > >    const bool ix86_tune_specified = (ix86_tune_string != NULL);
> > >    const char *prefix;
> > >    const char *suffix;
> > > @@ -3673,7 +3661,7 @@ ix86_option_override_internal (bool main
> > >
> > >    /* If the architecture always has an FPU, turn off NO_FANCY_MATH_387,
> > >       since the insns won't need emulation.  */
> > > -  if (x86_arch_always_fancy_math_387 & ix86_arch_mask)
> > > +  if (ix86_tune_features [X86_TUNE_ALWAYS_FANCY_MATH_387])
> > >      target_flags &= ~MASK_NO_FANCY_MATH_387;
> > >
> > >    /* Likewise, if the target doesn't have a 387, or we've specified @@ -
> > 3805,8 +3793,7 @@ ix86_option_override_internal (bool main
> > >  	gcc_unreachable ();
> > >        }
> > >
> > > -  ix86_tune_mask = 1u << ix86_tune;
> > > -  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
> > > +  if (ix86_tune_features [X86_TUNE_ACCUMULATE_OUTGOING_ARGS]
> > >        && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
> > >        && !optimize_size)
> > >      target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS; @@ -3946,10
> > +3933,10 @@ ix86_option_override_internal (bool main
> > >        if (flag_expensive_optimizations
> > >  	  && !(target_flags_explicit & MASK_VZEROUPPER))
> > >  	target_flags |= MASK_VZEROUPPER;
> > > -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
> > > +      if
> > (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
> > >  	  && !(target_flags_explicit &
> > MASK_AVX256_SPLIT_UNALIGNED_LOAD))
> > >  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> > > -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
> > > +      if
> > (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]
> > >  	  && !(target_flags_explicit &
> > MASK_AVX256_SPLIT_UNALIGNED_STORE))
> > >  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
> > >        /* Enable 128-bit AVX instruction generation

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-19 21:28                 ` Jan Hubicka
@ 2013-10-21  8:01                   ` Vladimir Makarov
  2013-11-03  9:59                     ` Zamyatin, Igor
  2013-10-22  8:36                   ` Zamyatin, Igor
  1 sibling, 1 reply; 21+ messages in thread
From: Vladimir Makarov @ 2013-10-21  8:01 UTC (permalink / raw)
  To: Jan Hubicka, Zamyatin, Igor, gcc-patches

On 13-10-19 4:30 PM, Jan Hubicka wrote:
>> Jan,
>>
>> Does this seem reasonable to you?
> Oops, sorry, I missed your email. (I was travelling and I am finishing a paper
> now).
>> Thanks,
>> Igor
>>
>>> -----Original Message-----
>>> From: Zamyatin, Igor
>>> Sent: Tuesday, October 15, 2013 3:48 PM
>>> To: Jan Hubicka
>>> Subject: RE: Honnor ix86_accumulate_outgoing_args again
>>>
>>> Jan,
>>>
>>> Now we have following prologue in, say, phi0 routine in equake
>>>
>>> 0x804aa90 1  push   %ebp
>>> 0x804aa91 2  mov    %esp,%ebp
>>> 0x804aa93 3  sub    $0x18,%esp
>>> 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
>>> 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
>>> 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see big stall somewhere here
>>> or 1-2 instructions above
>>>
>>> While earlier it was
>>>
>>> 0x804abd0 1 sub    $0x2c,%esp
>>> 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
>>> 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
>>> 0x804abe1 4 vcomisd %xmm1,%xmm0
> Thanks for analysis! It is a different benchmark than for bulldozer, but
> apparently same case.  Again we used to eliminate frame pointer here but IRS
> now doesn't Do you see the same regression with -fno-omit-frame-pointer
> -maccumulate-outgoing-args?
>
> I suppose this is a conflict in between the push instruction hanled by stack
> engine and initialization of EBP that isn't.  That would explain why bulldozer
> don't seem to care about this particular benchmark (its stack engine seems to
> have quite different design).
>
> This is a bit sad situation - accumulate-outgoing-args is expensive code size
> wise and it seems we don't really need esp with -mno-accumulate-outgoing-args.
> The non-accumulation code path was mistakely disabled for too long ;(
>
> Vladimir, how much effort do you think it will be to fix the frame pointer
> elimination here?
My guess is a week.  The problem is that I am busy and having some problems
with two small projects right now which I'd like to include in gcc-4.9.

But I think this can still be fixed in stage2, as it is a PR.

> I can imagine it is a quite tricky case. If so I would suggest adding m_CORE_ALL
> to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a comment explaining the problem and
> mentioning the regression on equake on core and mgrid on Bulldizer and opening
> an enhancement request for this...
>
> I also wonder if direct ESP use and push/pop instructions are causing so
> noticeable issues, I wonder if we can't "shrink wrap" this into red-zone in the
> 64bit compilation.  It seems that even with -maccumulate-outgoing-args pushing
> the frame allocation as late as possible in the function would be a good idea
> so it is not close to the push/pop/call/ret.
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Honnor ix86_accumulate_outgoing_args again
  2013-10-19 21:28                 ` Jan Hubicka
  2013-10-21  8:01                   ` Vladimir Makarov
@ 2013-10-22  8:36                   ` Zamyatin, Igor
  1 sibling, 0 replies; 21+ messages in thread
From: Zamyatin, Igor @ 2013-10-22  8:36 UTC (permalink / raw)
  To: Jan Hubicka, gcc-patches, vmakarov

Jan,

Please see my answers below

> -----Original Message-----
> From: Jan Hubicka [mailto:hubicka@ucw.cz]
> Sent: Sunday, October 20, 2013 12:30 AM
> To: Zamyatin, Igor; gcc-patches@gcc.gnu.org; vmakarov@redhat.com
> Cc: 'Jan Hubicka'
> Subject: Re: Honnor ix86_accumulate_outgoing_args again
> 
> > Jan,
> >
> > Does this seem reasonable to you?
> 
> Oops, sorry, I missed your email. (I was travelling and I am finishing a paper
> now).
> >
> > Thanks,
> > Igor
> >
> > > -----Original Message-----
> > > From: Zamyatin, Igor
> > > Sent: Tuesday, October 15, 2013 3:48 PM
> > > To: Jan Hubicka
> > > Subject: RE: Honnor ix86_accumulate_outgoing_args again
> > >
> > > Jan,
> > >
> > > Now we have following prologue in, say, phi0 routine in equake
> > >
> > > 0x804aa90 1  push   %ebp
> > > 0x804aa91 2  mov    %esp,%ebp
> > > 0x804aa93 3  sub    $0x18,%esp
> > > 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
> > > 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
> > > 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see big stall somewhere
> here
> > > or 1-2 instructions above
> > >
> > > While earlier it was
> > >
> > > 0x804abd0 1 sub    $0x2c,%esp
> > > 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
> > > 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
> > > 0x804abe1 4 vcomisd %xmm1,%xmm0
> 
> Thanks for analysis! It is a different benchmark than for bulldozer, but
> apparently same case.  Again we used to eliminate frame pointer here but
> IRS now doesn't Do you see the same regression with -fno-omit-frame-
> pointer -maccumulate-outgoing-args?

No, with these flags performance is back

> 
> I suppose this is a conflict in between the push instruction hanled by stack
> engine and initialization of EBP that isn't.  That would explain why bulldozer
> don't seem to care about this particular benchmark (its stack engine seems to
> have quite different design).
> 
> This is a bit sad situation - accumulate-outgoing-args is expensive code size
> wise and it seems we don't really need esp with -mno-accumulate-outgoing-
> args.

Could you please explain this a bit more?

Thanks,
Igor

> The non-accumulation code path was mistakely disabled for too long ;(
> 
> Vladimir, how much effort do you think it will be to fix the frame pointer
> elimination here?
> I can imagine it is a quite tricky case. If so I would suggest adding
> m_CORE_ALL to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a
> comment explaining the problem and mentioning the regression on equake
> on core and mgrid on Bulldizer and opening an enhancement request for
> this...
> 
> I also wonder if direct ESP use and push/pop instructions are causing so
> noticeable issues, I wonder if we can't "shrink wrap" this into red-zone in the
> 64bit compilation.  It seems that even with -maccumulate-outgoing-args
> pushing the frame allocation as late as possible in the function would be a
> good idea so it is not close to the push/pop/call/ret.
> 
> Honza
> 
> > >
> > > Note that for Atom and SLM no regression happens because they are
> > > already in x86_accumulate_outgoing_args. As for other machines  -
> > > seems now (after your change) they don't get that
> > > MASK_ACCUMULATE_OUTGOING_ARGS and it leads to using ebp in the
> > > prologue.
> > >
> > > Thanks,
> > > Igor
> > >
> > > -----Original Message-----
> > > From: Jan Hubicka [mailto:hubicka@ucw.cz]
> > > Sent: Monday, October 14, 2013 8:44 PM
> > > To: Zamyatin, Igor
> > > Cc: Jan Hubicka
> > > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > >
> > > > Jan,
> > > >
> > > > I see that you haven't committed this change. Any particular
> > > > reason for
> > > this?
> > >
> > > No, seems I forgot about it.
> > > >
> > > > BTW, after r203171 (http://gcc.gnu.org/ml/gcc-cvs/2013-
> > > 10/msg00122.html) we see lot of performance degradation on spec2000
> > > and
> > > spec2006 tests on SandyBridge and IvyBridge in 32 bits mode. E.g.
> > > 183.equake got ~-15%. Have you seen it?
> > >
> > > I tested this only on Bulldozer chips where I saw large regression
> > > from mgrid, but curiously not from equake.  I tracked it down to
> > > frame pointer being forced by IRA that Vladimir promised to look at later.
> > >
> > > I assumed that arg accumulation was tested before the flags was set,
> > > so I did not benchmarked chips that previously dsabled it. Perhaps
> > > things changed because IRA was merged in meantime while this feature
> > > was accidentally disabled.
> > >
> > > It would be great if you can figure out why equake regress - in FP
> > > benchmarks we do not use push/pop so perhaps we just end up emitting
> > > something really stupid or we manage to confuse stack engine...
> > >
> > > Honza
> > >
> > > >
> > > > Thanks,
> > > > Igor
> > > >
> > > > -----Original Message-----
> > > > From: gcc-patches-owner@gcc.gnu.org
> > > > [mailto:gcc-patches-owner@gcc.gnu.org] On Behalf Of Jan Hubicka
> > > > Sent: Thursday, October 10, 2013 10:40 PM
> > > > To: Jan Hubicka
> > > > Cc: Vladimir Makarov; gcc-patches@gcc.gnu.org;
> > > > hjl.tools@gmail.com; ubizjak@gmail.com; rth@redhat.com;
> > > Ganesh.Gopalasubramanian@amd.com
> > > > Subject: Re: Honnor ix86_accumulate_outgoing_args again
> > > >
> > > > Hi,
> > > > this patch makes ACCUMULATE_OUTGOING_ARGS to disable itself
> when
> > > function is cold.  I did some extra testing and to my amusement we
> > > now seem to output more compact unwind info when
> > > ACCUMULATE_OUTGOING_ARGS is disabled, so this seems quite
> consistent
> > > code size win.
> > > >
> > > > We actually can do better and enable ACCUMULATE_OUTGOING_ARGS
> > > only when function contains hot calls.  This should also avoid need
> > > for frame allocation in prologue/epilogue on hot path then. I will
> > > look into this incrementally..
> > > >
> > > > I also noticed that we still have some tuning flags in i386.c
> > > > rather than in
> > > x86-tune.c so I moved them there.
> > > >
> > > > Testing x86_64-linux and will commit it once testing converge.
> > > >
> > > > Honza
> > > > 	* config/i386/i386.h (ACCUMULATE_OUTGOING_ARGS): Disable
> > > accumulation
> > > > 	for cold functions.
> > > > 	* x86-tune.def (X86_TUNE_USE_LEAVE): Update comment.
> > > > 	(X86_TUNE_PUSH_MEMORY): Likewise.
> > > > 	(X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL,
> > > > 	X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL): New.
> > > > 	(X86_TUNE_ACCUMULATE_OUTGOING_ARGS,
> > > X86_TUNE_ALWAYS_FANCY_MATH_387): New.
> > > > 	* i386.c (x86_accumulate_outgoing_args,
> > > x86_arch_always_fancy_math_387,
> > > > 	x86_avx256_split_unaligned_load,
> > > x86_avx256_split_unaligned_store):
> > > > 	Remove.
> > > > 	(ix86_option_override_internal): Update to use tune features
> > > instead
> > > > 	of variables.
> > > > Index: config/i386/i386.h
> > > >
> > >
> ==========================================================
> > > =========
> > > > --- config/i386/i386.h	(revision 203380)
> > > > +++ config/i386/i386.h	(working copy)
> > > > @@ -1492,13 +1492,26 @@ enum reg_class
> > > >     will be computed and placed into the variable `crtl-
> >outgoing_args_size'.
> > > >     No space will be pushed onto the stack for each call; instead, the
> > > >     function prologue should increase the stack frame size by this
> amount.
> > > > +
> > > > +   In 32bit mode enabling argument accumulation results in about
> > > > + 5% code
> > > size
> > > > +   growth becuase move instructions are less compact than push.  In
> 64bit
> > > > +   mode the difference is less drastic but visible.
> > > > +
> > > > +   FIXME: Unlike earlier implementations, the size of unwind info
> seems to
> > > > +   actually grouw with accumulation.  Is that because accumulated args
> > > > +   unwind info became unnecesarily bloated?
> > > >
> > > >     64-bit MS ABI seem to require 16 byte alignment everywhere except
> for
> > > > -   function prologue and apilogue.  This is not possible without
> > > > -   ACCUMULATE_OUTGOING_ARGS.  */
> > > > +   function prologue and epilogue.  This is not possible without
> > > > +   ACCUMULATE_OUTGOING_ARGS.
> > > > +
> > > > +   If stack probes are required, the space used for large function
> > > > +   arguments on the stack must also be probed, so enable
> > > > +   -maccumulate-outgoing-args so this happens in the prologue.
> > > > + */
> > > >
> > > >  #define ACCUMULATE_OUTGOING_ARGS \
> > > > -  (TARGET_ACCUMULATE_OUTGOING_ARGS ||
> TARGET_64BIT_MS_ABI)
> > > > +  ((TARGET_ACCUMULATE_OUTGOING_ARGS &&
> > > optimize_function_for_speed_p (cfun)) \
> > > > +   || TARGET_STACK_PROBE || TARGET_64BIT_MS_ABI)
> > > >
> > > >  /* If defined, a C expression whose value is nonzero when we want
> > > > to use
> > > PUSH
> > > >     instructions to pass outgoing arguments.  */
> > > > Index: config/i386/x86-tune.def
> > > >
> > >
> ==========================================================
> > > =========
> > > > --- config/i386/x86-tune.def	(revision 203387)
> > > > +++ config/i386/x86-tune.def	(working copy)
> > > > @@ -18,15 +18,13 @@ a copy of the GCC Runtime Library Except  see
> > > > the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> > > > <http://www.gnu.org/licenses/>.  */
> > > >
> > > > -/* X86_TUNE_USE_LEAVE: Leave does not affect Nocona SPEC2000
> results
> > > > -   negatively, so enabling for Generic64 seems like good code size
> > > > -   tradeoff.  We can't enable it for 32bit generic because it does not
> > > > -   work well with PPro base chips.  */
> > > > +/* X86_TUNE_USE_LEAVE: Use "leave" instruction in epilogues where
> > > > +it fits.  */
> > > >  DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave",
> > > >  	  m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE |
> > > m_GENERIC)
> > > >
> > > >  /* X86_TUNE_PUSH_MEMORY: Enable generation of "push mem" instructions.
> > > > -   Some chips, like 486 and Pentium have problems with these sequences..  */
> > > > +   Some chips, like 486 and Pentium works faster with separate load
> > > > +   and push instructions.  */
> > > >  DEF_TUNE (X86_TUNE_PUSH_MEMORY, "push_memory",
> > > >            m_386 | m_P4_NOCONA | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE
> > > >            | m_GENERIC)
> > > > @@ -210,6 +208,16 @@ DEF_TUNE (X86_TUNE_SSE_UNALIGNED_LOAD_OP
> > > >  DEF_TUNE (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL, "sse_unaligned_store_optimal",
> > > >            m_COREI7 | m_BDVER | m_SLM | m_GENERIC)
> > > >
> > > > +/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if true, unaligned loads are
> > > > +   split.  */
> > > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
> > > > +          ~(m_COREI7 | m_GENERIC))
> > > > +
> > > > +/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if true, unaligned loads are
> > > > +   split.  */
> > > > +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_load_optimal",
> > > > +          ~(m_COREI7 | m_BDVER | m_GENERIC))
> > > > +
> > > >  /* Use packed single precision instructions where posisble.  I.e. movups instead
> > > >     of movupd.  */
> > > >  DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal",
> > > > @@ -398,3 +406,24 @@ DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CM
> > > >     fp converts to destination register.  */
> > > >  DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
> > > >            m_SLM)
> > > > +
> > > > +/* X86_TUNE_ACCUMULATE_OUTGOING_ARGS: Allocate stack space for outgoing
> > > > +   arguments in prologue/epilogue instead of separately for each call
> > > > +   by push/pop instructions.
> > > > +   This increase code size by about 5% in 32bit mode, less so in 64bit mode
> > > > +   because parameters are passed in registers.  It is considerable
> > > > +   win for targets without stack engine that prevents multple push operations
> > > > +   to happen in parallel.
> > > > +
> > > > +   FIXME: the flags is incorrectly enabled for amdfam10, Bulldozer,
> > > > +   Bobcat and Generic.  This is because disabling it causes large
> > > > +   regression on mgrid due to IRA limitation leading to unecessary
> > > > +   use of the frame pointer in 32bit mode.  */
> > > > +DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
> > > > +	  m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC)
> > > > +
> > > > +/* X86_TUNE_ALWAYS_FANCY_MATH_387: controls use of fancy 387 operations,
> > > > +   such as fsqrt, fprem, fsin, fcos, fsincos etc.
> > > > +   Should be enabled for all targets that always has coprocesor.  */
> > > > +DEF_TUNE (X86_TUNE_ALWAYS_FANCY_MATH_387, "always_fancy_math_387",
> > > > +          ~(m_386 | m_486))
> > > > Index: config/i386/i386.c
> > > >
> > > > ===================================================================
> > > > --- config/i386/i386.c	(revision 203380)
> > > > +++ config/i386/i386.c	(working copy)
> > > > @@ -1898,18 +1898,6 @@ static unsigned int initial_ix86_arch_fe
> > > >    ~m_386,
> > > >  };
> > > >
> > > > -static const unsigned int x86_accumulate_outgoing_args
> > > > -  = m_PPRO | m_P4_NOCONA | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
> > > > -
> > > > -static const unsigned int x86_arch_always_fancy_math_387
> > > > -  = m_PENT | m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | m_GENERIC;
> > > > -
> > > > -static const unsigned int x86_avx256_split_unaligned_load
> > > > -  = m_COREI7 | m_GENERIC;
> > > > -
> > > > -static const unsigned int x86_avx256_split_unaligned_store
> > > > -  = m_COREI7 | m_BDVER | m_GENERIC;
> > > > -
> > > >  /* In case the average insn count for single function invocation is
> > > >     lower than this constant, emit fast (but longer) prologue and
> > > >     epilogue code.  */
> > > > @@ -2920,7 +2908,7 @@ static void
> > > >  ix86_option_override_internal (bool main_args_p)
> > > >  {
> > > >    int i;
> > > > -  unsigned int ix86_arch_mask, ix86_tune_mask;
> > > > +  unsigned int ix86_arch_mask;
> > > >    const bool ix86_tune_specified = (ix86_tune_string != NULL);
> > > >    const char *prefix;
> > > >    const char *suffix;
> > > > @@ -3673,7 +3661,7 @@ ix86_option_override_internal (bool main
> > > >
> > > >    /* If the architecture always has an FPU, turn off NO_FANCY_MATH_387,
> > > >       since the insns won't need emulation.  */
> > > > -  if (x86_arch_always_fancy_math_387 & ix86_arch_mask)
> > > > +  if (ix86_tune_features [X86_TUNE_ALWAYS_FANCY_MATH_387])
> > > >      target_flags &= ~MASK_NO_FANCY_MATH_387;
> > > >
> > > >    /* Likewise, if the target doesn't have a 387, or we've specified
> > > > @@ -3805,8 +3793,7 @@ ix86_option_override_internal (bool main
> > > >  	gcc_unreachable ();
> > > >        }
> > > >
> > > > -  ix86_tune_mask = 1u << ix86_tune;
> > > > -  if ((x86_accumulate_outgoing_args & ix86_tune_mask)
> > > > +  if (ix86_tune_features [X86_TUNE_ACCUMULATE_OUTGOING_ARGS]
> > > >        && !(target_flags_explicit & MASK_ACCUMULATE_OUTGOING_ARGS)
> > > >        && !optimize_size)
> > > >      target_flags |= MASK_ACCUMULATE_OUTGOING_ARGS;
> > > > @@ -3946,10 +3933,10 @@ ix86_option_override_internal (bool main
> > > >        if (flag_expensive_optimizations
> > > >  	  && !(target_flags_explicit & MASK_VZEROUPPER))
> > > >  	target_flags |= MASK_VZEROUPPER;
> > > > -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
> > > > +      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
> > > >  	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
> > > >  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> > > > -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
> > > > +      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]
> > > >  	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
> > > >  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
> > > >        /* Enable 128-bit AVX instruction generation

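To make the code-size trade-off discussed in the patch concrete, here is a
minimal sketch (the function names are invented for illustration): compiling
it with -m32 -O2 and toggling -maccumulate-outgoing-args shows the two
argument-passing styles side by side.

/* With -mno-accumulate-outgoing-args each argument is a short "push";
   with -maccumulate-outgoing-args the prologue reserves the outgoing
   area once and each argument becomes a longer "mov" into that area,
   which is where the roughly 5% 32bit size growth comes from.  */
extern int callee (int, int, int, int);

int
caller (int a, int b)
{
  return callee (a, b, a + b, a - b);
}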
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-10 18:48       ` Jan Hubicka
       [not found]         ` <0EFAB2BDD0F67E4FB6CCC8B9F87D7569427DC81A@IRSMSX101.ger.corp.intel.com>
@ 2013-10-31 11:23         ` Florian Weimer
  2013-11-12  7:01         ` Jakub Jelinek
  2 siblings, 0 replies; 21+ messages in thread
From: Florian Weimer @ 2013-10-31 11:23 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: GCC Patches

On 10/10/2013 08:40 PM, Jan Hubicka wrote:
> +   In 32bit mode enabling argument accumulation results in about 5% code size
> +   growth becuase move instructions are less compact than push.  In 64bit
> +   mode the difference is less drastic but visible.
> +
> +   FIXME: Unlike earlier implementations, the size of unwind info seems to
> +   actually grouw with accumulation.  Is that because accumulated args
> +   unwind info became unnecesarily bloated?

Several typos: "32bit" "64bit", "becuase", "grouw".  "push." should be 
"pushes.", I think.  I can't parse the question at the end.

Sorry, no comments on the actual code changes. :-/

-- 
Florian Weimer / Red Hat Product Security Team

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Honnor ix86_accumulate_outgoing_args again
  2013-10-21  8:01                   ` Vladimir Makarov
@ 2013-11-03  9:59                     ` Zamyatin, Igor
  0 siblings, 0 replies; 21+ messages in thread
From: Zamyatin, Igor @ 2013-11-03  9:59 UTC (permalink / raw)
  To: Vladimir Makarov, Jan Hubicka, gcc-patches

So, Jan, what do you think will be the best solution for stage 1?

Thanks,
Igor

> -----Original Message-----
> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> owner@gcc.gnu.org] On Behalf Of Vladimir Makarov
> Sent: Monday, October 21, 2013 6:52 AM
> To: Jan Hubicka; Zamyatin, Igor; gcc-patches@gcc.gnu.org
> Subject: Re: Honnor ix86_accumulate_outgoing_args again
> 
> On 13-10-19 4:30 PM, Jan Hubicka wrote:
> >> Jan,
> >>
> >> Does this seem reasonable to you?
> > Oops, sorry, I missed your email. (I was travelling and I am finishing
> > a paper now).
> >> Thanks,
> >> Igor
> >>
> >>> -----Original Message-----
> >>> From: Zamyatin, Igor
> >>> Sent: Tuesday, October 15, 2013 3:48 PM
> >>> To: Jan Hubicka
> >>> Subject: RE: Honnor ix86_accumulate_outgoing_args again
> >>>
> >>> Jan,
> >>>
> >>> Now we have the following prologue in, say, the phi0 routine in equake
> >>>
> >>> 0x804aa90 1  push   %ebp
> >>> 0x804aa91 2  mov    %esp,%ebp
> >>> 0x804aa93 3  sub    $0x18,%esp
> >>> 0x804aa96 4  vmovsd 0x80ef7a8,%xmm0
> >>> 0x804aa9e 5  vmovsd 0x8(%ebp),%xmm1
> >>> 0x804aaa3 6  vcomisd %xmm1,%xmm0   <-- we see a big stall somewhere here
> >>> or 1-2 instructions above
> >>>
> >>> While earlier it was
> >>>
> >>> 0x804abd0 1 sub    $0x2c,%esp
> >>> 0x804abd3 2 vmovsd 0x30(%esp),%xmm1
> >>> 0x804abd9 3 vmovsd 0x80efcc8,%xmm0
> >>> 0x804abe1 4 vcomisd %xmm1,%xmm0
> > Thanks for the analysis!  It is a different benchmark than for Bulldozer,
> > but apparently the same case.  Again, we used to eliminate the frame pointer
> > here but IRA now doesn't.  Do you see the same regression with
> > -fno-omit-frame-pointer -maccumulate-outgoing-args?
> >
> > I suppose this is a conflict between the push instruction, handled by the
> > stack engine, and the initialization of EBP, which isn't.  That would explain
> > why Bulldozer doesn't seem to care about this particular benchmark (its
> > stack engine seems to have a quite different design).
> >
> > This is a bit of a sad situation - accumulate-outgoing-args is expensive
> > code-size wise and it seems we don't really need esp with
> > -mno-accumulate-outgoing-args.
> > The non-accumulation code path was mistakenly disabled for too long ;(
> >
> > Vladimir, how much effort do you think it would take to fix the frame
> > pointer elimination here?
> My guess is a week.  The problem is that I am busy and having some problems
> with two small projects right now which I'd like to include in gcc-4.9.
> 
> But I think this can still be fixed in stage 2, as it is a PR.
> 
> > I can imagine it is a quite tricky case.  If so, I would suggest adding
> > m_CORE_ALL to X86_TUNE_ACCUMULATE_OUTGOING_ARGS with a comment
> > explaining the problem and mentioning the regression on equake on Core
> > and mgrid on Bulldozer, and opening an enhancement request for this...
> >
> > Also, since direct ESP use and push/pop instructions are causing such
> > noticeable issues, I wonder if we can't "shrink wrap" this into the
> > red zone in the 64bit compilation.  It seems that even with
> > -maccumulate-outgoing-args, pushing the frame allocation as late as
> > possible in the function would be a good idea so it is not close to the
> > push/pop/call/ret.
> >
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-10-10 18:48       ` Jan Hubicka
       [not found]         ` <0EFAB2BDD0F67E4FB6CCC8B9F87D7569427DC81A@IRSMSX101.ger.corp.intel.com>
  2013-10-31 11:23         ` Florian Weimer
@ 2013-11-12  7:01         ` Jakub Jelinek
  2013-11-12  7:09           ` H.J. Lu
  2 siblings, 1 reply; 21+ messages in thread
From: Jakub Jelinek @ 2013-11-12  7:01 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Vladimir Makarov, gcc-patches, hjl.tools, ubizjak, rth,
	Ganesh.Gopalasubramanian

On Thu, Oct 10, 2013 at 08:40:05PM +0200, Jan Hubicka wrote:
> --- config/i386/x86-tune.def	(revision 203387)
> +++ config/i386/x86-tune.def	(working copy)

> +/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if true, unaligned loads are
> +   split.  */
> +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal", 
> +          ~(m_COREI7 | m_GENERIC))
> +
> +/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if true, unaligned loads are

s/loads/stores/

> +   split.  */
> +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_load_optimal", 
> +          ~(m_COREI7 | m_BDVER | m_GENERIC))

s/load/store/

Also, I wonder if we couldn't improve the generated code for
-mavx2 -mtune=generic or -march=core-avx2 -mtune=generic etc.
- m_GENERIC is included clearly because vmovup{s,d} was really bad
on SandyBridge (am I right here?), but if the ISA includes AVX2, then
the code will not run on that chip at all, so can't we override it?

> @@ -3946,10 +3933,10 @@ ix86_option_override_internal (bool main
>        if (flag_expensive_optimizations
>  	  && !(target_flags_explicit & MASK_VZEROUPPER))
>  	target_flags |= MASK_VZEROUPPER;
> -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
> +      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]

Didn't you mean to use X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL here?

>  	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
>  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
> +      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]

And similarly for STORE here?

>  	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
>  	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
>        /* Enable 128-bit AVX instruction generation

	Jakub

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-11-12  7:01         ` Jakub Jelinek
@ 2013-11-12  7:09           ` H.J. Lu
  2013-11-12 12:45             ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2013-11-12  7:09 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jan Hubicka, Vladimir Makarov, GCC Patches, Uros Bizjak,
	Richard Henderson, Ganesh.Gopalasubramanian

On Mon, Nov 11, 2013 at 4:18 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Oct 10, 2013 at 08:40:05PM +0200, Jan Hubicka wrote:
>> --- config/i386/x86-tune.def  (revision 203387)
>> +++ config/i386/x86-tune.def  (working copy)
>
>> +/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if true, unaligned loads are
>> +   split.  */
>> +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
>> +          ~(m_COREI7 | m_GENERIC))
>> +
>> +/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if true, unaligned loads are
>
> s/loads/stores/
>
>> +   split.  */
>> +DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_load_optimal",
>> +          ~(m_COREI7 | m_BDVER | m_GENERIC))
>
> s/load/store/
>

We should exclude m_COREI7_AVX since Ivy Bridge needs to split
32-byte unaligned loads and stores.

Also, the comments should say "if false, unaligned loads are split." instead
of "if true, ..."

> Also, I wonder if we couldn't improve the generated code for
> -mavx2 -mtune=generic or -march=core-avx2 -mtune=generic etc.
> - m_GENERIC is included clearly because vmovup{s,d} was really bad
> on SandyBridge (am I right here?), but if the ISA includes AVX2, then
> the code will not run on that chip at all, so can't we override it?

Yes, we should do it, similar to

       if (TARGET_AVX
              || TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
              || TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL
              || optimize_insn_for_size_p ())
            {
              /* We will eventually emit movups based on insn attributes.  */
              emit_insn (gen_sse2_loadupd (op0, op1));
              return;
            }

>> @@ -3946,10 +3933,10 @@ ix86_option_override_internal (bool main
>>        if (flag_expensive_optimizations
>>         && !(target_flags_explicit & MASK_VZEROUPPER))
>>       target_flags |= MASK_VZEROUPPER;
>> -      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
>> +      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
>
> Didn't you mean to use X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL here?
>
>>         && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
>>       target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
>> -      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
>> +      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]
>
> And similarly for STORE here?
>
>>         && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
>>       target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
>>        /* Enable 128-bit AVX instruction generation
>
>         Jakub

Something like this.


-- 
H.J.
---
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 430d562..b8cb871 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -3974,10 +3974,10 @@ ix86_option_override_internal (bool main_args_p,
       if (flag_expensive_optimizations
       && !(opts_set->x_target_flags & MASK_VZEROUPPER))
     opts->x_target_flags |= MASK_VZEROUPPER;
-      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
+      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL]
       && !(opts_set->x_target_flags & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
     opts->x_target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
-      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]
+      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL]
       && !(opts_set->x_target_flags & MASK_AVX256_SPLIT_UNALIGNED_STORE))
     opts->x_target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
       /* Enable 128-bit AVX instruction generation
@@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1)

   if (MEM_P (op1))
     {
-      if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
+      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
     {
       rtx r = gen_reg_rtx (mode);
       m = adjust_address (op1, mode, 0);
@@ -16596,7 +16596,7 @@ ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1)
     }
   else if (MEM_P (op0))
     {
-      if (TARGET_AVX256_SPLIT_UNALIGNED_STORE)
+      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_STORE)
     {
       m = adjust_address (op0, mode, 0);
       emit_insn (extract (m, op1, const0_rtx));
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 1a85ce2..6c7063a 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -376,15 +376,15 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/

-/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if true, unaligned loads are
+/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads are
    split.  */
 DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
-          ~(m_COREI7 | m_GENERIC))
+          ~(m_COREI7 | m_COREI7_AVX | m_GENERIC))

-/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if true, unaligned loads are
+/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores are
    split.  */
-DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_load_optimal",
-          ~(m_COREI7 | m_BDVER | m_GENERIC))
+DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
+          ~(m_COREI7 | m_COREI7_AVX | m_BDVER | m_GENERIC))

 /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
    the auto-vectorizer.  */

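For readers who have not looked at this knob before, a conceptual sketch of
what "splitting" a 256-bit unaligned access means at the instruction level
(an intrinsics version, not GCC source; whether the split form or a single
256-bit vmovups is faster is exactly what the tune flag encodes per CPU):

#include <immintrin.h>

/* Split 256-bit unaligned load: two 128-bit loads plus vinsertf128
   instead of one 256-bit vmovups.  */
static __m256
load_256_split (const float *p)
{
  __m256 lo = _mm256_castps128_ps256 (_mm_loadu_ps (p));
  return _mm256_insertf128_ps (lo, _mm_loadu_ps (p + 4), 1);
}

/* Split 256-bit unaligned store: low half stored directly, high half
   extracted with vextractf128.  */
static void
store_256_split (float *p, __m256 v)
{
  _mm_storeu_ps (p, _mm256_castps256_ps128 (v));
  _mm_storeu_ps (p + 4, _mm256_extractf128_ps (v, 1));
}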
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-11-12  7:09           ` H.J. Lu
@ 2013-11-12 12:45             ` Jan Hubicka
  2013-11-12 13:03               ` Jakub Jelinek
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2013-11-12 12:45 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jakub Jelinek, Jan Hubicka, Vladimir Makarov, GCC Patches,
	Uros Bizjak, Richard Henderson, Ganesh.Gopalasubramanian

This is OK, thanks for catching the pasto!
Only...
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 430d562..b8cb871 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -3974,10 +3974,10 @@ ix86_option_override_internal (bool main_args_p,
>        if (flag_expensive_optimizations
>        && !(opts_set->x_target_flags & MASK_VZEROUPPER))
>      opts->x_target_flags |= MASK_VZEROUPPER;
> -      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
> +      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL]
>        && !(opts_set->x_target_flags & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
>      opts->x_target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
> -      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]
> +      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL]
>        && !(opts_set->x_target_flags & MASK_AVX256_SPLIT_UNALIGNED_STORE))
>      opts->x_target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
>        /* Enable 128-bit AVX instruction generation
> @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx
> op0, rtx op1)
> 
>    if (MEM_P (op1))
>      {
> -      if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
> +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
>      {
>        rtx r = gen_reg_rtx (mode);
>        m = adjust_address (op1, mode, 0);
> @@ -16596,7 +16596,7 @@ ix86_avx256_split_vector_move_misalign (rtx
> op0, rtx op1)
>      }
>    else if (MEM_P (op0))
>      {
> -      if (TARGET_AVX256_SPLIT_UNALIGNED_STORE)
> +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_STORE)

I would add explanation comment on those two.

Shall we also disable argument accumulation for cores? It seems we won't solve the IRA issues, right?

Honza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-11-12 12:45             ` Jan Hubicka
@ 2013-11-12 13:03               ` Jakub Jelinek
  2013-11-12 14:36                 ` H.J. Lu
                                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jakub Jelinek @ 2013-11-12 13:03 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: H.J. Lu, Vladimir Makarov, GCC Patches, Uros Bizjak,
	Richard Henderson, Ganesh.Gopalasubramanian

On Tue, Nov 12, 2013 at 11:05:45AM +0100, Jan Hubicka wrote:
> > @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx
> > op0, rtx op1)
> > 
> >    if (MEM_P (op1))
> >      {
> > -      if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
> > +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
> >      {
> >        rtx r = gen_reg_rtx (mode);
> >        m = adjust_address (op1, mode, 0);
> > @@ -16596,7 +16596,7 @@ ix86_avx256_split_vector_move_misalign (rtx
> > op0, rtx op1)
> >      }
> >    else if (MEM_P (op0))
> >      {
> > -      if (TARGET_AVX256_SPLIT_UNALIGNED_STORE)
> > +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_STORE)
> 
> I would add explanation comment on those two.

Looking at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01235.html
we are going to have some AMD CPU with AVX2 support soon, the question is
if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
if it will prefer split, the question is if like bdver{1,2,3} it will
be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
loads/stores are handled is much less important there.  Ganesh?

> Shall we also disable argument accumulation for cores? It seems we won't
> solve the IRA issues, right?

You mean LRA issues here, right?  If you are starting to use
no-accumulate-outgoing-args much more often than in the past, I think
the problem that LRA forces a frame pointer in that case is much more
important now (or has that been fixed in the mean time?).  Vlad?

	Jakub

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-11-12 13:03               ` Jakub Jelinek
@ 2013-11-12 14:36                 ` H.J. Lu
  2013-11-12 16:43                 ` Vladimir Makarov
  2013-11-13  8:10                 ` Gopalasubramanian, Ganesh
  2 siblings, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2013-11-12 14:36 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jan Hubicka, Vladimir Makarov, GCC Patches, Uros Bizjak,
	Richard Henderson, Ganesh.Gopalasubramanian

On Tue, Nov 12, 2013 at 2:26 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Nov 12, 2013 at 11:05:45AM +0100, Jan Hubicka wrote:
>> > @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx
>> > op0, rtx op1)
>> >
>> >    if (MEM_P (op1))
>> >      {
>> > -      if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
>> > +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
>> >      {
>> >        rtx r = gen_reg_rtx (mode);
>> >        m = adjust_address (op1, mode, 0);
>> > @@ -16596,7 +16596,7 @@ ix86_avx256_split_vector_move_misalign (rtx
>> > op0, rtx op1)
>> >      }
>> >    else if (MEM_P (op0))
>> >      {
>> > -      if (TARGET_AVX256_SPLIT_UNALIGNED_STORE)
>> > +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_STORE)
>>
>> I would add explanation comment on those two.
>
> Looking at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01235.html
> we are going to have some AMD CPU with AVX2 support soon, the question is
> if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
> if it will prefer split, the question is if like bdver{1,2,3} it will
> be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
> loads/stores are handled is much less important there.  Ganesh?

I left those two out. This is what I checked in.

H.J.
---
Index: ChangeLog
===================================================================
--- ChangeLog    (revision 204699)
+++ ChangeLog    (working copy)
@@ -1,3 +1,16 @@
+2013-11-12  H.J. Lu  <hongjiu.lu@intel.com>
+
+    PR target/59084
+    * config/i386/i386.c (ix86_option_override_internal): Check
+    X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL and
+    X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL for
+    MASK_AVX256_SPLIT_UNALIGNED_LOAD and
+    MASK_AVX256_SPLIT_UNALIGNED_STORE.
+
+    * config/i386/x86-tune.def (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL):
+    Clear m_COREI7_AVX and update comments.
+    (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL): Likewise.
+
 2013-11-12  Martin Jambor  <mjambor@suse.cz>

     PR rtl-optimization/10474
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c    (revision 204699)
+++ config/i386/i386.c    (working copy)
@@ -3974,10 +3974,10 @@ ix86_option_override_internal (bool main
       if (flag_expensive_optimizations
       && !(opts_set->x_target_flags & MASK_VZEROUPPER))
     opts->x_target_flags |= MASK_VZEROUPPER;
-      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL]
+      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL]
       && !(opts_set->x_target_flags & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
     opts->x_target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
-      if (!ix86_tune_features[X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL]
+      if (!ix86_tune_features[X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL]
       && !(opts_set->x_target_flags & MASK_AVX256_SPLIT_UNALIGNED_STORE))
     opts->x_target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
       /* Enable 128-bit AVX instruction generation
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def    (revision 204699)
+++ config/i386/x86-tune.def    (working copy)
@@ -376,15 +376,15 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS,
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/

-/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if true, unaligned loads are
+/* X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL: if false, unaligned loads are
    split.  */
 DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL, "256_unaligned_load_optimal",
-          ~(m_COREI7 | m_GENERIC))
+          ~(m_COREI7 | m_COREI7_AVX | m_GENERIC))

-/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if true, unaligned loads are
+/* X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL: if false, unaligned stores are
    split.  */
-DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_load_optimal",
-          ~(m_COREI7 | m_BDVER | m_GENERIC))
+DEF_TUNE (X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, "256_unaligned_store_optimal",
+          ~(m_COREI7 | m_COREI7_AVX | m_BDVER | m_GENERIC))

 /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
    the auto-vectorizer.  */


-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-11-12 13:03               ` Jakub Jelinek
  2013-11-12 14:36                 ` H.J. Lu
@ 2013-11-12 16:43                 ` Vladimir Makarov
  2013-11-12 16:44                   ` Jakub Jelinek
  2013-11-13  8:10                 ` Gopalasubramanian, Ganesh
  2 siblings, 1 reply; 21+ messages in thread
From: Vladimir Makarov @ 2013-11-12 16:43 UTC (permalink / raw)
  To: Jakub Jelinek, Jan Hubicka
  Cc: H.J. Lu, GCC Patches, Uros Bizjak, Richard Henderson,
	Ganesh.Gopalasubramanian

On 11/12/2013 05:26 AM, Jakub Jelinek wrote:
> On Tue, Nov 12, 2013 at 11:05:45AM +0100, Jan Hubicka wrote:
>>> @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx
>>> op0, rtx op1)
>>>
>>>    if (MEM_P (op1))
>>>      {
>>> -      if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
>>> +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
>>>      {
>>>        rtx r = gen_reg_rtx (mode);
>>>        m = adjust_address (op1, mode, 0);
>>> @@ -16596,7 +16596,7 @@ ix86_avx256_split_vector_move_misalign (rtx
>>> op0, rtx op1)
>>>      }
>>>    else if (MEM_P (op0))
>>>      {
>>> -      if (TARGET_AVX256_SPLIT_UNALIGNED_STORE)
>>> +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_STORE)
>> I would add explanation comment on those two.
> Looking at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01235.html
> we are going to have some AMD CPU with AVX2 support soon, the question is
> if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
> if it will prefer split, the question is if like bdver{1,2,3} it will
> be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
> loads/stores are handled is much less important there.  Ganesh?
>
>> Shall we also disable argument accumulation for cores? It seems we won't
>> solve the IRA issues, right?
> You mean LRA issues here, right?  If you are starting to use
> no-accumulate-outgoing-args much more often than in the past, I think
> the problem that LRA forces a frame pointer in that case is much more
> important now (or has that been fixed in the mean time?).  Vlad?
>
>
I guess it is serious.  So we should fix this for gcc-4.9 in any case.
I'd say it needs 1-2 weeks of my work.  Right now I am still thinking about
how best to approach the implementation of it in LRA.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-11-12 16:43                 ` Vladimir Makarov
@ 2013-11-12 16:44                   ` Jakub Jelinek
  2013-11-12 19:42                     ` Jan Hubicka
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Jelinek @ 2013-11-12 16:44 UTC (permalink / raw)
  To: Vladimir Makarov
  Cc: Jan Hubicka, H.J. Lu, GCC Patches, Uros Bizjak,
	Richard Henderson, Ganesh.Gopalasubramanian

On Tue, Nov 12, 2013 at 10:39:28AM -0500, Vladimir Makarov wrote:
> >> Shall we also disable argument accumulation for cores? It seems we won't
> >> solve the IRA issues, right?
> > You mean LRA issues here, right?  If you are starting to use
> > no-accumulate-outgoing-args much more often than in the past, I think
> > the problem that LRA forces a frame pointer in that case is much more
> > important now (or has that been fixed in the mean time?).  Vlad?
> >
> >
> > I guess it is serious.  So we should fix this for gcc-4.9 in any case.
> > I'd say it needs 1-2 weeks of my work.  Right now I am still thinking about
> > how best to approach the implementation of it in LRA.

That would be nice.  Given that it is a regression from 4.7 anyway, it can
be fixed in stage3 too, but preferably sooner rather than later, so that there
is some time to tune the backend tunables.

	Jakub

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Honnor ix86_accumulate_outgoing_args again
  2013-11-12 16:44                   ` Jakub Jelinek
@ 2013-11-12 19:42                     ` Jan Hubicka
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2013-11-12 19:42 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Vladimir Makarov, Jan Hubicka, H.J. Lu, GCC Patches, Uros Bizjak,
	Richard Henderson, Ganesh.Gopalasubramanian

> On Tue, Nov 12, 2013 at 10:39:28AM -0500, Vladimir Makarov wrote:
> > >> Shall we also disable argument accumulation for cores? It seems we won't
> > >> solve the IRA issues, right?
> > > You mean LRA issues here, right?  If you are starting to use
> > > no-accumulate-outgoing-args much more often than in the past, I think
> > > the problem that LRA forces a frame pointer in that case is much more
> > > important now (or has that been fixed in the mean time?).  Vlad?
> > >
> > >
> > > I guess it is serious.  So we should fix this for gcc-4.9 in any case.
> > > I'd say it needs 1-2 weeks of my work.  Right now I am still thinking about
> > > how best to approach the implementation of it in LRA.
> 
> That would be nice.  Given that it is a regression from 4.7 anyway, it can
> be fixed in stage3 too, but preferably sooner rather than later, so that there
> is some time to tune the backend tunables.

Sounds good.  I do not think it is critical - we can always just keep argument
accumulation on as we did for 4.8 and probably a few earlier releases, but it
would be really nice to fix this the correct way.

Thank you!
Honza
> 
> 	Jakub

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: Honnor ix86_accumulate_outgoing_args again
  2013-11-12 13:03               ` Jakub Jelinek
  2013-11-12 14:36                 ` H.J. Lu
  2013-11-12 16:43                 ` Vladimir Makarov
@ 2013-11-13  8:10                 ` Gopalasubramanian, Ganesh
  2 siblings, 0 replies; 21+ messages in thread
From: Gopalasubramanian, Ganesh @ 2013-11-13  8:10 UTC (permalink / raw)
  To: Jakub Jelinek, Jan Hubicka
  Cc: H.J. Lu, Vladimir Makarov, GCC Patches, Uros Bizjak, Richard Henderson

> we are going to have some AMD CPU with AVX2 support soon, the question is
> if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
> if it will prefer split, the question is if like bdver{1,2,3} it will
> be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
> loads/stores are handled is much less important there.  Ganesh?

256-bit is friendly on bdver4.
But 256-bit unaligned stores are micro-coded, which we would like to avoid.  So we require 128-bit MOVUPS.

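As an illustration of the preference described above (the example is invented;
the exact vectorization depends on compiler version and flags), a loop like the
one below may be vectorized with 256-bit unaligned vmovups stores by default,
while -mprefer-avx128 - which is what X86_TUNE_AVX128_OPTIMAL requests for such
targets - keeps the accesses at 128 bits and avoids the micro-coded 256-bit
unaligned store:

/* Simple loop the vectorizer will typically turn into unaligned vector
   loads and stores when the alignment is unknown.  */
void
scale_add (float *y, const float *x, float a, int n)
{
  int i;
  for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}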
-----Original Message-----
From: Jakub Jelinek [mailto:jakub@redhat.com] 
Sent: Tuesday, November 12, 2013 3:57 PM
To: Jan Hubicka
Cc: H.J. Lu; Vladimir Makarov; GCC Patches; Uros Bizjak; Richard Henderson; Gopalasubramanian, Ganesh
Subject: Re: Honnor ix86_accumulate_outgoing_args again

On Tue, Nov 12, 2013 at 11:05:45AM +0100, Jan Hubicka wrote:
> > @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx 
> > op0, rtx op1)
> > 
> >    if (MEM_P (op1))
> >      {
> > -      if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
> > +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
> >      {
> >        rtx r = gen_reg_rtx (mode);
> >        m = adjust_address (op1, mode, 0); @@ -16596,7 +16596,7 @@ 
> > ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1)
> >      }
> >    else if (MEM_P (op0))
> >      {
> > -      if (TARGET_AVX256_SPLIT_UNALIGNED_STORE)
> > +      if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_STORE)
> 
> I would add explanation comment on those two.

Looking at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01235.html
we are going to have some AMD CPU with AVX2 support soon, the question is
if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
if it will prefer split, the question is if like bdver{1,2,3} it will
be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
loads/stores are handled is much less important there.  Ganesh?

> Shall we also disable argument accumulation for cores? It seems we won't
> solve the IRA issues, right?

You mean LRA issues here, right?  If you are starting to use
no-accumulate-outgoing-args much more often than in the past, I think
the problem that LRA forces a frame pointer in that case is much more
important now (or has that been fixed in the mean time?).  Vlad?

	Jakub


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-11-13  5:10 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-02 17:32 Honnor ix86_accumulate_outgoing_args again Jan Hubicka
2013-10-02 22:45 ` Jan Hubicka
2013-10-02 22:51   ` H.J. Lu
2013-10-03  9:24     ` Jan Hubicka
2013-10-03  1:11   ` Vladimir Makarov
2013-10-03 13:05     ` Jan Hubicka
2013-10-10 18:48       ` Jan Hubicka
     [not found]         ` <0EFAB2BDD0F67E4FB6CCC8B9F87D7569427DC81A@IRSMSX101.ger.corp.intel.com>
     [not found]           ` <20131014164343.GA17422@kam.mff.cuni.cz>
     [not found]             ` <0EFAB2BDD0F67E4FB6CCC8B9F87D7569427DE1D0@IRSMSX101.ger.corp.intel.com>
     [not found]               ` <0EFAB2BDD0F67E4FB6CCC8B9F87D7569427F84D0@IRSMSX101.ger.corp.intel.com>
2013-10-19 21:28                 ` Jan Hubicka
2013-10-21  8:01                   ` Vladimir Makarov
2013-11-03  9:59                     ` Zamyatin, Igor
2013-10-22  8:36                   ` Zamyatin, Igor
2013-10-31 11:23         ` Florian Weimer
2013-11-12  7:01         ` Jakub Jelinek
2013-11-12  7:09           ` H.J. Lu
2013-11-12 12:45             ` Jan Hubicka
2013-11-12 13:03               ` Jakub Jelinek
2013-11-12 14:36                 ` H.J. Lu
2013-11-12 16:43                 ` Vladimir Makarov
2013-11-12 16:44                   ` Jakub Jelinek
2013-11-12 19:42                     ` Jan Hubicka
2013-11-13  8:10                 ` Gopalasubramanian, Ganesh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).