public inbox for gcc@gcc.gnu.org
* Shrink wrapping issues
@ 2011-11-05  9:51 Jakub Jelinek
  2011-11-05 10:28 ` Alan Modra
  0 siblings, 1 reply; 3+ messages in thread
From: Jakub Jelinek @ 2011-11-05  9:51 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Henderson, gcc

Hi!

On the following testcase with -m64 -O3 -mavx2 (but it is just an example,
you can replace the loop there with any code that doesn't touch the
stack or frame pointer at all), only f3 is shrink-wrapped, and in that case,
on the other hand, it doesn't add the vzeroupper before leaving the AVX-using
code that it IMNSHO should.  But I wonder why we can't also shrink-wrap
the first two testcases (well, in the second testcase it wouldn't be textbook
shrink-wrapping, but essentially throwing away the prologue/epilogue).

From a quick look, f1 probably isn't shrink-wrapped because the set
of bb's that need the prologue/epilogue around them doesn't end in a return,
but in a tail call.  Can't we just add a prologue before the bar call
and throw the epilogue away?  (Normally the epilogue in a function that
ends only in a tail call is just emitted after the barrier and
optimized away, I think; could we do the same?)

And f2 is something that IMHO happens very often, especially with AVX/AVX2
code: the prologue is expensive as it realigns the stack.  The reason
for that is that until reload we don't know whether something will be
spilled on the stack, and we need/want 32-byte aligned stack slots
for that spilling.  Isn't the case when none of the bbs actually needs the
stack/frame pointer just a special case of shrink wrapping?  Can't we
just throw the prologue/epilogue away then and end the function
in simple_return?  f4 is another test case for the same thing,
this time with no AVX/AVX2 intrinsics, but which the vectorizer
vectorizes using 256-bit vectors.

#include <x86intrin.h>

__m256i a[16], b[16], f;
__m256d g[16], h;
extern void bar (void);
extern void baz (void);

void
f1 (int c)
{
  int i;
  if (c)
    for (i = 0; i < 16; i++)
      a[i] = _mm256_i64gather_epi64 (NULL, b[i], 1);
  else
    {
      bar ();
      baz ();
    }
}

void
f2 (void)
{
  int i;
  for (i = 0; i < 16; i++)
    a[i] = _mm256_i64gather_epi64 (NULL, b[i], 1);
}

int
f3 (int c)
{
  int i;
  if (c)
    for (i = 0; i < 16; i++)
      a[i] = _mm256_i64gather_epi64 (NULL, b[i], 1);
  else
    {
      bar ();
      baz ();
    }
  return c;
}

float x[8], y[8];

void
f4 (void)
{
  int i;
  for (i = 0; i < 8; i++)
    x[i] = y[i] * 2 - x[i];
}


	Jakub


* Re: Shrink wrapping issues
  2011-11-05  9:51 Shrink wrapping issues Jakub Jelinek
@ 2011-11-05 10:28 ` Alan Modra
  2011-11-05 10:50   ` Jakub Jelinek
  0 siblings, 1 reply; 3+ messages in thread
From: Alan Modra @ 2011-11-05 10:28 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Bernd Schmidt, Richard Henderson, gcc

On Sat, Nov 05, 2011 at 10:50:44AM +0100, Jakub Jelinek wrote:
> From quick look, f1 isn't shrink-wrapped probably because of the set
> of bb's that need prologue/epilogue around it doesn't end in a return,
> but in a tail call.  Can't we just add a prologue before the bar call
> and throw the epilogue away (normally the epilogue in a function that
> ends only in a tail call is just emitted after the barrier and
> optimized away I think, we could do the same?).

http://gcc.gnu.org/ml/gcc-patches/2011-11/msg00046.html ought to cure
this particular problem.  With that patch, similar code on
powerpc-linux does result in shrink wrapping.

> And f2 is something that IMHO with especially AVX/AVX2 code happens very
> often, the prologue is expensive as it realigns the stack.  The reason
> for that is that until reload we don't know whether something won't be
> spilled on the stack and we need/want 32-byte aligned stack slots
> for that spilling.

Huh?  thread_prologue_and_epilogue is after reload.  So your backend
ought to be able to figure out whether an aligned stack is needed.

-- 
Alan Modra
Australia Development Lab, IBM


* Re: Shrink wrapping issues
  2011-11-05 10:28 ` Alan Modra
@ 2011-11-05 10:50   ` Jakub Jelinek
  0 siblings, 0 replies; 3+ messages in thread
From: Jakub Jelinek @ 2011-11-05 10:50 UTC (permalink / raw)
  To: Bernd Schmidt, Richard Henderson, gcc

On Sat, Nov 05, 2011 at 08:58:18PM +1030, Alan Modra wrote:
> On Sat, Nov 05, 2011 at 10:50:44AM +0100, Jakub Jelinek wrote:
> > From quick look, f1 isn't shrink-wrapped probably because of the set
> > of bb's that need prologue/epilogue around it doesn't end in a return,
> > but in a tail call.  Can't we just add a prologue before the bar call
> > and throw the epilogue away (normally the epilogue in a function that
> > ends only in a tail call is just emitted after the barrier and
> > optimized away I think, we could do the same?).
> 
> http://gcc.gnu.org/ml/gcc-patches/2011-11/msg00046.html ought to cure
> this particular problem.  With that patch, similar code on
> powerpc-linux does result in shrink wrapping.

I'll try that.

> > And f2 is something that IMHO with especially AVX/AVX2 code happens very
> > often, the prologue is expensive as it realigns the stack.  The reason
> > for that is that until reload we don't know whether something won't be
> > spilled on the stack and we need/want 32-byte aligned stack slots
> > for that spilling.
> 
> Huh?  thread_prologue_and_epilogue is after reload.  So your backend
> ought to be able to figure out whether an aligned stack is needed.

Sure, the backend could figure it out.  But it would need to duplicate
parts of what function.c does.  In particular, roughly the equivalent of
computing the prologue_used and set_up_by_prologue hard regsets (sure, it
doesn't necessarily have to generate the prologue for that), then
the
  FOR_EACH_BB (bb)
    FOR_BB_INSNS (bb, insn)
      if (requires_stack_frame_p (insn, prologue_used,
				  set_up_by_prologue))
	goto needs_prologue;
(and perhaps only do this if -fno-omit-frame-pointer wasn't used, i.e. if
the prologue frame pointer setup was not requested by the user explicitly).
If it is just i?86/x86_64, then perhaps we just should export
requires_stack_frame_p and do it in the backend.  But if it is for more
targets, can't function.c just handle it in the generic code when it
already computes everything needed?
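
To make the scan above concrete, here is a minimal, self-contained sketch in
plain C.  It is a toy model, not GCC code: the toy_insn and toy_bb types, the
TOY_REG_* bitmasks and both toy_* functions are hypothetical stand-ins, and
the predicate is a deliberately simplified guess at the kind of test
requires_stack_frame_p performs (the real function works on RTL and hard
register sets and checks more than this).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for GCC's insn and basic-block structures.  None of
   these types or names exist in GCC; register sets are plain bitmasks.  */
enum
{
  TOY_REG_SP   = 1u << 0,	/* stack pointer */
  TOY_REG_FP   = 1u << 1,	/* frame pointer */
  TOY_REG_AX   = 1u << 2,	/* some scratch integer register */
  TOY_REG_YMM0 = 1u << 3	/* some vector register */
};

struct toy_insn
{
  unsigned uses;		/* registers the insn reads */
  unsigned sets;		/* registers the insn writes */
  bool is_call;			/* true for a non-sibling call */
};

struct toy_bb
{
  const struct toy_insn *insns;
  size_t n_insns;
};

/* Simplified assumption: an insn needs the frame if it is a real call,
   clobbers a register the prologue itself uses, or touches a register
   that only becomes valid once the prologue has run.  */
static bool
toy_requires_stack_frame_p (const struct toy_insn *insn,
			    unsigned prologue_used,
			    unsigned set_up_by_prologue)
{
  if (insn->is_call)
    return true;
  if (insn->sets & prologue_used)
    return true;
  if ((insn->uses | insn->sets) & set_up_by_prologue)
    return true;
  return false;
}

/* Walk every insn of every block; the function needs a prologue as
   soon as one insn requires the frame.  */
static bool
toy_function_needs_prologue (const struct toy_bb *bbs, size_t n_bbs,
			     unsigned prologue_used,
			     unsigned set_up_by_prologue)
{
  assert (bbs != NULL || n_bbs == 0);
  for (size_t b = 0; b < n_bbs; b++)
    for (size_t i = 0; i < bbs[b].n_insns; i++)
      if (toy_requires_stack_frame_p (&bbs[b].insns[i], prologue_used,
				      set_up_by_prologue))
	return true;
  return false;
}
```

In this model an f2-like function, whose only block reads and writes vector
registers, needs no prologue, while adding a single non-sibling call (as in
the else branch of f1) makes the scan report that one is needed.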

	Jakub
