RFC: Extend x86-64 psABI for 256bit AVX register

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* RFC: Extend x86-64 psABI for 256bit AVX register
@ 2008-06-05 14:31 H.J. Lu
  2008-06-05 14:49 ` Richard Guenther
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: H.J. Lu @ 2008-06-05 14:31 UTC (permalink / raw)
  To: discuss, GCC, Girkar, Milind, Dmitriev, Serguei N

Hi,

x86-64 psABI defines

typedef struct
{
  unsigned int gp_offset;
  unsigned int fp_offset;
  void *overflow_arg_area;
  void *reg_save_area;
} va_list[1];

for variable argument list. "va_list" is used to access variable argument
list:

void
bar (const char *format, va_list ap)
{
  if (va_arg (ap, int) != 0)
    abort ();
}

void
foo(char *fmt, ...)
{
  va_list ap;
  va_start (fmt, ap);
  bar (fmt, ap);
  va_end (ap);
}

foo and bar may be compiled with different compilers. We have to keep
the current layout for va_list so that we can mix va_list codes compiled
with AVX and non-AVX compilers. We need to extend the variable argument
handling in the x86-64 psABI to support passing __m256/__m256d/__m256i
on the variable argument list. We propose 2 ways to extend the register
save area to add 256bit AVX registers support:

1. Extend the register save area to put upper 128bit at the end.
  Pros:
    Aligned access.
    Save stack space if 256bit registers are used.
  Cons
    Split access. Require more split access beyond 256bit.

2. Extend the register save area to put full 265bit YMMs at the end.
The first DWORD after the register save area has the offset of
the extended array for YMM registers. The next DWORD has the
element size of the extended array. Unaligned access will be used.
  Pros:
    No split access.
    Easily extendable beyond 256bit.
    Limited unaligned access penalty if stack is aligned at 32byte.
  Cons:
    May require store both the lower 128bit and full 256bit register
    content. We may avoid saving the lower 128bit if correct type
    is required when accessing variable argument list, similar to int
    vs. double.
    Waste 272 byte on stack when 256bit registers are used.
    Unaligned load and store.

We should agree on one approach to ensure compatibility between
different compilers.

Personally, I prefer #2 for its simplicity. Does anyone else have a
preference?

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-05 14:31 RFC: Extend x86-64 psABI for 256bit AVX register H.J. Lu
@ 2008-06-05 14:49 ` Richard Guenther
  2008-06-05 15:52   ` H.J. Lu
  2008-06-05 15:15 ` Jan Hubicka
  2008-06-06 15:01 ` Jakub Jelinek
  2 siblings, 1 reply; 25+ messages in thread
From: Richard Guenther @ 2008-06-05 14:49 UTC (permalink / raw)
  To: H.J. Lu; +Cc: discuss, GCC, Girkar, Milind, Dmitriev, Serguei N

On Thu, Jun 5, 2008 at 4:31 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> Hi,
>
> x86-64 psABI defines
>
> typedef struct
> {
>  unsigned int gp_offset;
>  unsigned int fp_offset;
>  void *overflow_arg_area;
>  void *reg_save_area;
> } va_list[1];
>
> for variable argument list. "va_list" is used to access variable argument
> list:
>
> void
> bar (const char *format, va_list ap)
> {
>  if (va_arg (ap, int) != 0)
>    abort ();
> }
>
> void
> foo(char *fmt, ...)
> {
>  va_list ap;
>  va_start (fmt, ap);
>  bar (fmt, ap);
>  va_end (ap);
> }
>
> foo and bar may be compiled with different compilers. We have to keep
> the current layout for va_list so that we can mix va_list codes compiled
> with AVX and non-AVX compilers. We need to extend the variable argument
> handling in the x86-64 psABI to support passing __m256/__m256d/__m256i
> on the variable argument list. We propose 2 ways to extend the register
> save area to add 256bit AVX registers support:
>
> 1. Extend the register save area to put upper 128bit at the end.
>  Pros:
>    Aligned access.
>    Save stack space if 256bit registers are used.
>  Cons
>    Split access. Require more split access beyond 256bit.
>
> 2. Extend the register save area to put full 265bit YMMs at the end.
> The first DWORD after the register save area has the offset of
> the extended array for YMM registers. The next DWORD has the
> element size of the extended array. Unaligned access will be used.
>  Pros:
>    No split access.
>    Easily extendable beyond 256bit.
>    Limited unaligned access penalty if stack is aligned at 32byte.
>  Cons:
>    May require store both the lower 128bit and full 256bit register
>    content. We may avoid saving the lower 128bit if correct type
>    is required when accessing variable argument list, similar to int
>    vs. double.
>    Waste 272 byte on stack when 256bit registers are used.
>    Unaligned load and store.
>
> We should agree on one approach to ensure compatibility between
> different compilers.
>
> Personally, I prefer #2 for its simplicity. Does anyone else have a
> preference?

If you want to mix AVX and non-AVX code then you need a way to
detect if AVX information was saved at runtime.  What is it in those
both cases?

If you don't want to mix AVX and non-AVX code then basically you
can declare the ABIs incompatible anyway?

There is also a third option of passing AVX values by reference.

For simplicity I would also prefer 2) - after all we don't need to fill
in the XMM area / the AVX area if the value is unused.

Richard.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-05 14:31 RFC: Extend x86-64 psABI for 256bit AVX register H.J. Lu
  2008-06-05 14:49 ` Richard Guenther
@ 2008-06-05 15:15 ` Jan Hubicka
  2008-06-05 16:14   ` H.J. Lu
  2008-06-06 15:01 ` Jakub Jelinek
  2 siblings, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2008-06-05 15:15 UTC (permalink / raw)
  To: H.J. Lu; +Cc: discuss, GCC, Girkar, Milind, Dmitriev, Serguei N

> 
> 1. Extend the register save area to put upper 128bit at the end.
>   Pros:
>     Aligned access.
>     Save stack space if 256bit registers are used.
>   Cons
>     Split access. Require more split access beyond 256bit.
> 
> 2. Extend the register save area to put full 265bit YMMs at the end.
> The first DWORD after the register save area has the offset of
> the extended array for YMM registers. The next DWORD has the
> element size of the extended array. Unaligned access will be used.
>   Pros:
>     No split access.
>     Easily extendable beyond 256bit.
>     Limited unaligned access penalty if stack is aligned at 32byte.
>   Cons:
>     May require store both the lower 128bit and full 256bit register
>     content. We may avoid saving the lower 128bit if correct type
>     is required when accessing variable argument list, similar to int
>     vs. double.
>     Waste 272 byte on stack when 256bit registers are used.
>     Unaligned load and store.
> 
> We should agree on one approach to ensure compatibility between
> different compilers.

This is something that definitly should be hanlded by ABI update.

We probably need to also somehow update the way to specify what to save
to varargs prologue.  Otherwise if you would have YMM aware printf
running on non-AVX hardware, we would end up with invalid instructions.

At the moment, eax is required to specify number of XMM registers, we
probably can extend it to have number of XMM registers in AL and YMM in
AH.

I personally don't have much preferences over 1. or 2.. 1. seems
relatively easy to implement too, or is packaging two 128bit values to
single 256bit difficult in va_arg expansion?

Honza
> 
> Personally, I prefer #2 for its simplicity. Does anyone else have a
> preference?
> 
> Thanks.
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-05 14:49 ` Richard Guenther
@ 2008-06-05 15:52   ` H.J. Lu
  0 siblings, 0 replies; 25+ messages in thread
From: H.J. Lu @ 2008-06-05 15:52 UTC (permalink / raw)
  To: Richard Guenther; +Cc: discuss, GCC, Girkar, Milind, Dmitriev, Serguei N

On Thu, Jun 5, 2008 at 7:49 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Thu, Jun 5, 2008 at 4:31 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> Hi,
>>
>> x86-64 psABI defines
>>
>> typedef struct
>> {
>>  unsigned int gp_offset;
>>  unsigned int fp_offset;
>>  void *overflow_arg_area;
>>  void *reg_save_area;
>> } va_list[1];
>>
>> for variable argument list. "va_list" is used to access variable argument
>> list:
>>
>> void
>> bar (const char *format, va_list ap)
>> {
>>  if (va_arg (ap, int) != 0)
>>    abort ();
>> }
>>
>> void
>> foo(char *fmt, ...)
>> {
>>  va_list ap;
>>  va_start (fmt, ap);
>>  bar (fmt, ap);
>>  va_end (ap);
>> }
>>
>> foo and bar may be compiled with different compilers. We have to keep
>> the current layout for va_list so that we can mix va_list codes compiled
>> with AVX and non-AVX compilers. We need to extend the variable argument
>> handling in the x86-64 psABI to support passing __m256/__m256d/__m256i
>> on the variable argument list. We propose 2 ways to extend the register
>> save area to add 256bit AVX registers support:
>>
>> 1. Extend the register save area to put upper 128bit at the end.
>>  Pros:
>>    Aligned access.
>>    Save stack space if 256bit registers are used.
>>  Cons
>>    Split access. Require more split access beyond 256bit.
>>
>> 2. Extend the register save area to put full 265bit YMMs at the end.
>> The first DWORD after the register save area has the offset of
>> the extended array for YMM registers. The next DWORD has the
>> element size of the extended array. Unaligned access will be used.
>>  Pros:
>>    No split access.
>>    Easily extendable beyond 256bit.
>>    Limited unaligned access penalty if stack is aligned at 32byte.
>>  Cons:
>>    May require store both the lower 128bit and full 256bit register
>>    content. We may avoid saving the lower 128bit if correct type
>>    is required when accessing variable argument list, similar to int
>>    vs. double.
>>    Waste 272 byte on stack when 256bit registers are used.
>>    Unaligned load and store.
>>
>> We should agree on one approach to ensure compatibility between
>> different compilers.
>>
>> Personally, I prefer #2 for its simplicity. Does anyone else have a
>> preference?
>
> If you want to mix AVX and non-AVX code then you need a way to
> detect if AVX information was saved at runtime.  What is it in those
> both cases?
>
> If you don't want to mix AVX and non-AVX code then basically you
> can declare the ABIs incompatible anyway?

We want to extend the psABI in such a way that we can link
AVX enabled code to call vfprintf in glibc which is compiled
with the older compiler and doesn't use YMM registers.
That is if bar, in the example above, doesn't use YMM
registers, it can be compiled by any compilers. bar doesn't
need to know if YMM  registers are used in caller at all.
All necessary information for YMM registers are specified
in the psABI. If  a compiler doesn't use YMM registers,
it  doesn't have to do anything.

>
> There is also a third option of passing AVX values by reference.
>
> For simplicity I would also prefer 2) - after all we don't need to fill
> in the XMM area / the AVX area if the value is unused.
>

That is what I believe.

Thanks.


-- 
H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-05 15:15 ` Jan Hubicka
@ 2008-06-05 16:14   ` H.J. Lu
  2008-06-06  8:29     ` Jan Hubicka
  0 siblings, 1 reply; 25+ messages in thread
From: H.J. Lu @ 2008-06-05 16:14 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: discuss, GCC, Girkar, Milind, Dmitriev, Serguei N

On Thu, Jun 5, 2008 at 8:15 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> 1. Extend the register save area to put upper 128bit at the end.
>>   Pros:
>>     Aligned access.
>>     Save stack space if 256bit registers are used.
>>   Cons
>>     Split access. Require more split access beyond 256bit.
>>
>> 2. Extend the register save area to put full 265bit YMMs at the end.
>> The first DWORD after the register save area has the offset of
>> the extended array for YMM registers. The next DWORD has the
>> element size of the extended array. Unaligned access will be used.
>>   Pros:
>>     No split access.
>>     Easily extendable beyond 256bit.
>>     Limited unaligned access penalty if stack is aligned at 32byte.
>>   Cons:
>>     May require store both the lower 128bit and full 256bit register
>>     content. We may avoid saving the lower 128bit if correct type
>>     is required when accessing variable argument list, similar to int
>>     vs. double.
>>     Waste 272 byte on stack when 256bit registers are used.
>>     Unaligned load and store.
>>
>> We should agree on one approach to ensure compatibility between
>> different compilers.
>
> This is something that definitly should be hanlded by ABI update.
>
> We probably need to also somehow update the way to specify what to save
> to varargs prologue.  Otherwise if you would have YMM aware printf

Yes, but I believe that is compiler specific. Different compilers may
have different approaches for varargs prologue, as long as they follow
the psABI.

> running on non-AVX hardware, we would end up with invalid instructions.

That is nothing new. The same applies to SSE on ia32. Basically, you
shouldn't call YMM aware printf on non-AVX hardware.  You can have
/lib64/avx/libc.so.6 if necessary.

>
> At the moment, eax is required to specify number of XMM registers, we
> probably can extend it to have number of XMM registers in AL and YMM in
> AH.

ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
of xmm0. I am not sure if we need separate XMM registers from
YMM registers.

>
> I personally don't have much preferences over 1. or 2.. 1. seems
> relatively easy to implement too, or is packaging two 128bit values to
> single 256bit difficult in va_arg expansion?
>

Access to 256bit register as lower and upper 128bits needs 2
instructions. For store

vmovaps   %xmm7, -143(%rax)
vextractf128 $1, %ymm7, -15(%rax)

For load

vmovaps  -143(%rax),%xmm7
vinsert128 $1, -15(%rax),%ymm7,%ymm7

If we go beyond 256bit, we need more instructions to access
the full register. For 512bit, it will be split into lower 128bit,
middle 128bit and upper 256bit. 1024bit will have 4 parts.

For #2, only one instruction will be needed for 256bit and
beyond.

Thanks.


-- 
H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-05 16:14   ` H.J. Lu
@ 2008-06-06  8:29     ` Jan Hubicka
  2008-06-06 13:50       ` H.J. Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2008-06-06  8:29 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev, Serguei N

> 
> ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
> of xmm0. I am not sure if we need separate XMM registers from
> YMM registers.


Yes, I know that xmm0 is lower part of ymm0.  I still think we ought to
be able to support varargs that do save ymm0 registers only when ymm
values are passed same way as we touch SSE only when SSE values are
passed via EAX hint.
This way we will be able to support e.g. printf that has YMM printing %
construct but don't need YMM enabled hardware when those are not used.

This is why I think extending EAX to contain information about amount of
XMM values to save and in addition YMM values to save is sane.  Then old
non-YMM aware varargs prologues will crash when YMM values are passed,
but all other combinations will work.
> 
> >
> > I personally don't have much preferences over 1. or 2.. 1. seems
> > relatively easy to implement too, or is packaging two 128bit values to
> > single 256bit difficult in va_arg expansion?
> >
> 
> Access to 256bit register as lower and upper 128bits needs 2
> instructions. For store
> 
> vmovaps   %xmm7, -143(%rax)
> vextractf128 $1, %ymm7, -15(%rax)
> 
> For load
> 
> vmovaps  -143(%rax),%xmm7
> vinsert128 $1, -15(%rax),%ymm7,%ymm7
> 
> If we go beyond 256bit, we need more instructions to access
> the full register. For 512bit, it will be split into lower 128bit,
> middle 128bit and upper 256bit. 1024bit will have 4 parts.
> 
> For #2, only one instruction will be needed for 256bit and
> beyond.

Yes, but we will still save half of stack space.  Well, I don't have
much preferences here.  If it seems saner to simply save whole thing
saving lower part twice, I am fine with that.

Honza
> 
> Thanks.
> 
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-06  8:29     ` Jan Hubicka
@ 2008-06-06 13:50       ` H.J. Lu
  2008-06-06 14:28         ` H.J. Lu
  0 siblings, 1 reply; 25+ messages in thread
From: H.J. Lu @ 2008-06-06 13:50 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev, Serguei N,
	Kreitzer, David L

On Fri, Jun 06, 2008 at 10:28:34AM +0200, Jan Hubicka wrote:
> > 
> > ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
> > of xmm0. I am not sure if we need separate XMM registers from
> > YMM registers.
> 
> 
> Yes, I know that xmm0 is lower part of ymm0.  I still think we ought to
> be able to support varargs that do save ymm0 registers only when ymm
> values are passed same way as we touch SSE only when SSE values are
> passed via EAX hint.

Which register do you propose for hint? The current psABI uses RAX
for XMM registers. We can't change it to AL and AH for YMM without
breaking backward compatibility.

> This way we will be able to support e.g. printf that has YMM printing %
> construct but don't need YMM enabled hardware when those are not used.
> 
> This is why I think extending EAX to contain information about amount of
> XMM values to save and in addition YMM values to save is sane.  Then old
> non-YMM aware varargs prologues will crash when YMM values are passed,
> but all other combinations will work.

I don't think it is necessary since -mavx will enable AVX code
generation for all SSE codes. Unless the function only uses integer,
it will crash on non-YMM aware hardware.  That is if there is one
SSE register is used, which is hinted in RAX, varargs prologue will
use AVX instructions to save it. We don't need another hint for AVX
instructions.

> > 
> > >
> > > I personally don't have much preferences over 1. or 2.. 1. seems
> > > relatively easy to implement too, or is packaging two 128bit values to
> > > single 256bit difficult in va_arg expansion?
> > >
> > 
> > Access to 256bit register as lower and upper 128bits needs 2
> > instructions. For store
> > 
> > vmovaps   %xmm7, -143(%rax)
> > vextractf128 $1, %ymm7, -15(%rax)
> > 
> > For load
> > 
> > vmovaps  -143(%rax),%xmm7
> > vinsert128 $1, -15(%rax),%ymm7,%ymm7
> > 
> > If we go beyond 256bit, we need more instructions to access
> > the full register. For 512bit, it will be split into lower 128bit,
> > middle 128bit and upper 256bit. 1024bit will have 4 parts.
> > 
> > For #2, only one instruction will be needed for 256bit and
> > beyond.
> 
> Yes, but we will still save half of stack space.  Well, I don't have
> much preferences here.  If it seems saner to simply save whole thing
> saving lower part twice, I am fine with that.

I was told that it wasn't very easy to get decent performance with
split access. I extended my proposal to include a 16bit bitmask to
indicate which YMM regisetrs should be saved. If the bit is 0,
we should only save the the lower 128bit in the original register
save area. Otherwise, we should only save the same whole YMM register.

H.J.
----
x86-64 psABI defines

typedef struct
{
  unsigned int gp_offset;
  unsigned int fp_offset;
  void *overflow_arg_area;
  void *reg_save_area;
} va_list[1];

for variable argument list. "va_list" is used to access variable argument
list:

void
bar (const char *format, va_list ap)
{
  if (va_arg (ap, int) != 0)
    abort ();
}

void
foo(char *fmt, ...)
{
  va_list ap;
  va_start (fmt, ap); 
  bar (fmt, ap);
  va_end (ap);
}

foo and bar may be compiled with different compilers. We have to keep
the current layout for va_list so that we can mix va_list codes compiled
with AVX and non-AVX compilers. We need to extend the variable argument
handling in the x86-64 psABI to support passing __m256/__m256d/__m256i
on the variable argument list. We propose 2 ways to extend the register
save area to add 256bit AVX registers support:

1. Extend the register save area to put upper 128bit at the end.
  Pros: 
    Aligned access.
    Save stack space if 256bit registers are used. 
  Cons 
    Split access. Require more split access beyond 256bit.

2. Extend the register save area to put full 265bit YMMs at the end.
The first DWORD after the register save area has the offset of the
extended array for YMM registers from the start of the register save
area. The next DWORD has the element size of the extended array.  The
next WORD encodes which YMM registers should be saved.  Unaligned access
will be used.

The		Offset 	Register 
original	0	%rdi
register	8	%rsi
save 		16	%rdx
area 		24	%rcx
		32	%r8
		40	%r9
		48	%xmm0
		64	%xmm1
		...
		288	%xmm15
Hints		304	320	offset from offset 0.
		308	32	size of element
		312	bitmask	for used YMM registers
		314	Unused
Extended	320	%ymm0
array for	352	%ymm1
YMM		...
registers	800	%ymm15

  Pros: 
    No split access.
    Easily extendable beyond 256bit.
    Limited unaligned access penalty if stack is aligned at 32byte.
  Cons:
    May require store both the lower 128bit and full 256bit register
    content. We may avoid saving the lower 128bit if correct type
    is required when accessing variable argument list, similar to int
    vs. double.
    Waste 272 byte on stack when 256bit registers are used.
    Unaligned load and store.

We should agree on one approach to ensure compatibility between
different compilers.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-06 13:50       ` H.J. Lu
@ 2008-06-06 14:28         ` H.J. Lu
  2008-06-06 14:31           ` Richard Guenther
  2008-06-09 14:41           ` Jan Hubicka
  0 siblings, 2 replies; 25+ messages in thread
From: H.J. Lu @ 2008-06-06 14:28 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev, Serguei N,
	Kreitzer, David L

On Fri, Jun 06, 2008 at 06:50:26AM -0700, H.J. Lu wrote:
> On Fri, Jun 06, 2008 at 10:28:34AM +0200, Jan Hubicka wrote:
> > > 
> > > ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
> > > of xmm0. I am not sure if we need separate XMM registers from
> > > YMM registers.
> > 
> > 
> > Yes, I know that xmm0 is lower part of ymm0.  I still think we ought to
> > be able to support varargs that do save ymm0 registers only when ymm
> > values are passed same way as we touch SSE only when SSE values are
> > passed via EAX hint.
> 
> Which register do you propose for hint? The current psABI uses RAX
> for XMM registers. We can't change it to AL and AH for YMM without
> breaking backward compatibility.
> 
> > This way we will be able to support e.g. printf that has YMM printing %
> > construct but don't need YMM enabled hardware when those are not used.
> > 
> > This is why I think extending EAX to contain information about amount of
> > XMM values to save and in addition YMM values to save is sane.  Then old
> > non-YMM aware varargs prologues will crash when YMM values are passed,
> > but all other combinations will work.
> 
> I don't think it is necessary since -mavx will enable AVX code
> generation for all SSE codes. Unless the function only uses integer,
> it will crash on non-YMM aware hardware.  That is if there is one
> SSE register is used, which is hinted in RAX, varargs prologue will
> use AVX instructions to save it. We don't need another hint for AVX
> instructions.
> 
> > > 
> > > >
> > > > I personally don't have much preferences over 1. or 2.. 1. seems
> > > > relatively easy to implement too, or is packaging two 128bit values to
> > > > single 256bit difficult in va_arg expansion?
> > > >
> > > 
> > > Access to 256bit register as lower and upper 128bits needs 2
> > > instructions. For store
> > > 
> > > vmovaps   %xmm7, -143(%rax)
> > > vextractf128 $1, %ymm7, -15(%rax)
> > > 
> > > For load
> > > 
> > > vmovaps  -143(%rax),%xmm7
> > > vinsert128 $1, -15(%rax),%ymm7,%ymm7
> > > 
> > > If we go beyond 256bit, we need more instructions to access
> > > the full register. For 512bit, it will be split into lower 128bit,
> > > middle 128bit and upper 256bit. 1024bit will have 4 parts.
> > > 
> > > For #2, only one instruction will be needed for 256bit and
> > > beyond.
> > 
> > Yes, but we will still save half of stack space.  Well, I don't have
> > much preferences here.  If it seems saner to simply save whole thing
> > saving lower part twice, I am fine with that.
> 
> I was told that it wasn't very easy to get decent performance with
> split access. I extended my proposal to include a 16bit bitmask to
> indicate which YMM regisetrs should be saved. If the bit is 0,
> we should only save the the lower 128bit in the original register
> save area. Otherwise, we should only save the same whole YMM register.
> 

My second thought. How useful is such a bitmask? Do we really
need it? Is that accepetable to save the lower 128bit twice?

Thanks.


H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-06 14:28         ` H.J. Lu
@ 2008-06-06 14:31           ` Richard Guenther
  2008-06-06 14:41             ` H.J. Lu
  2008-06-09 14:41           ` Jan Hubicka
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Guenther @ 2008-06-06 14:31 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev,
	Serguei N, Kreitzer, David L

On Fri, Jun 6, 2008 at 4:28 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jun 06, 2008 at 06:50:26AM -0700, H.J. Lu wrote:
>> On Fri, Jun 06, 2008 at 10:28:34AM +0200, Jan Hubicka wrote:
>> > >
>> > > ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
>> > > of xmm0. I am not sure if we need separate XMM registers from
>> > > YMM registers.
>> >
>> >
>> > Yes, I know that xmm0 is lower part of ymm0.  I still think we ought to
>> > be able to support varargs that do save ymm0 registers only when ymm
>> > values are passed same way as we touch SSE only when SSE values are
>> > passed via EAX hint.
>>
>> Which register do you propose for hint? The current psABI uses RAX
>> for XMM registers. We can't change it to AL and AH for YMM without
>> breaking backward compatibility.
>>
>> > This way we will be able to support e.g. printf that has YMM printing %
>> > construct but don't need YMM enabled hardware when those are not used.
>> >
>> > This is why I think extending EAX to contain information about amount of
>> > XMM values to save and in addition YMM values to save is sane.  Then old
>> > non-YMM aware varargs prologues will crash when YMM values are passed,
>> > but all other combinations will work.
>>
>> I don't think it is necessary since -mavx will enable AVX code
>> generation for all SSE codes. Unless the function only uses integer,
>> it will crash on non-YMM aware hardware.  That is if there is one
>> SSE register is used, which is hinted in RAX, varargs prologue will
>> use AVX instructions to save it. We don't need another hint for AVX
>> instructions.
>>
>> > >
>> > > >
>> > > > I personally don't have much preferences over 1. or 2.. 1. seems
>> > > > relatively easy to implement too, or is packaging two 128bit values to
>> > > > single 256bit difficult in va_arg expansion?
>> > > >
>> > >
>> > > Access to 256bit register as lower and upper 128bits needs 2
>> > > instructions. For store
>> > >
>> > > vmovaps   %xmm7, -143(%rax)
>> > > vextractf128 $1, %ymm7, -15(%rax)
>> > >
>> > > For load
>> > >
>> > > vmovaps  -143(%rax),%xmm7
>> > > vinsert128 $1, -15(%rax),%ymm7,%ymm7
>> > >
>> > > If we go beyond 256bit, we need more instructions to access
>> > > the full register. For 512bit, it will be split into lower 128bit,
>> > > middle 128bit and upper 256bit. 1024bit will have 4 parts.
>> > >
>> > > For #2, only one instruction will be needed for 256bit and
>> > > beyond.
>> >
>> > Yes, but we will still save half of stack space.  Well, I don't have
>> > much preferences here.  If it seems saner to simply save whole thing
>> > saving lower part twice, I am fine with that.
>>
>> I was told that it wasn't very easy to get decent performance with
>> split access. I extended my proposal to include a 16bit bitmask to
>> indicate which YMM regisetrs should be saved. If the bit is 0,
>> we should only save the the lower 128bit in the original register
>> save area. Otherwise, we should only save the same whole YMM register.
>>
>
> My second thought. How useful is such a bitmask? Do we really
> need it? Is that accepetable to save the lower 128bit twice?

Why do we need to save the lower 128bit at all if a ymm reg is passed?
Can't we assume "type-correctness"?

Richard.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-06 14:31           ` Richard Guenther
@ 2008-06-06 14:41             ` H.J. Lu
  2008-06-06 14:44               ` Richard Guenther
  0 siblings, 1 reply; 25+ messages in thread
From: H.J. Lu @ 2008-06-06 14:41 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Jan Hubicka, Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev,
	Serguei N, Kreitzer, David L

On Fri, Jun 6, 2008 at 7:31 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Fri, Jun 6, 2008 at 4:28 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, Jun 06, 2008 at 06:50:26AM -0700, H.J. Lu wrote:
>>> On Fri, Jun 06, 2008 at 10:28:34AM +0200, Jan Hubicka wrote:
>>> > >
>>> > > ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
>>> > > of xmm0. I am not sure if we need separate XMM registers from
>>> > > YMM registers.
>>> >
>>> >
>>> > Yes, I know that xmm0 is lower part of ymm0.  I still think we ought to
>>> > be able to support varargs that do save ymm0 registers only when ymm
>>> > values are passed same way as we touch SSE only when SSE values are
>>> > passed via EAX hint.
>>>
>>> Which register do you propose for hint? The current psABI uses RAX
>>> for XMM registers. We can't change it to AL and AH for YMM without
>>> breaking backward compatibility.
>>>
>>> > This way we will be able to support e.g. printf that has YMM printing %
>>> > construct but don't need YMM enabled hardware when those are not used.
>>> >
>>> > This is why I think extending EAX to contain information about amount of
>>> > XMM values to save and in addition YMM values to save is sane.  Then old
>>> > non-YMM aware varargs prologues will crash when YMM values are passed,
>>> > but all other combinations will work.
>>>
>>> I don't think it is necessary since -mavx will enable AVX code
>>> generation for all SSE codes. Unless the function only uses integer,
>>> it will crash on non-YMM aware hardware.  That is if there is one
>>> SSE register is used, which is hinted in RAX, varargs prologue will
>>> use AVX instructions to save it. We don't need another hint for AVX
>>> instructions.
>>>
>>> > >
>>> > > >
>>> > > > I personally don't have much preferences over 1. or 2.. 1. seems
>>> > > > relatively easy to implement too, or is packaging two 128bit values to
>>> > > > single 256bit difficult in va_arg expansion?
>>> > > >
>>> > >
>>> > > Access to 256bit register as lower and upper 128bits needs 2
>>> > > instructions. For store
>>> > >
>>> > > vmovaps   %xmm7, -143(%rax)
>>> > > vextractf128 $1, %ymm7, -15(%rax)
>>> > >
>>> > > For load
>>> > >
>>> > > vmovaps  -143(%rax),%xmm7
>>> > > vinsert128 $1, -15(%rax),%ymm7,%ymm7
>>> > >
>>> > > If we go beyond 256bit, we need more instructions to access
>>> > > the full register. For 512bit, it will be split into lower 128bit,
>>> > > middle 128bit and upper 256bit. 1024bit will have 4 parts.
>>> > >
>>> > > For #2, only one instruction will be needed for 256bit and
>>> > > beyond.
>>> >
>>> > Yes, but we will still save half of stack space.  Well, I don't have
>>> > much preferences here.  If it seems saner to simply save whole thing
>>> > saving lower part twice, I am fine with that.
>>>
>>> I was told that it wasn't very easy to get decent performance with
>>> split access. I extended my proposal to include a 16bit bitmask to
>>> indicate which YMM regisetrs should be saved. If the bit is 0,
>>> we should only save the the lower 128bit in the original register
>>> save area. Otherwise, we should only save the same whole YMM register.
>>>
>>
>> My second thought. How useful is such a bitmask? Do we really
>> need it? Is that accepetable to save the lower 128bit twice?
>
> Why do we need to save the lower 128bit at all if a ymm reg is passed?
> Can't we assume "type-correctness"?

Say a double is passed in YMM0/XMM0, we should save it in XMM0 area.
Do we also need to save the whole 256bit YMM0? If we save both XMM0 and
YMM0, we are free to use any location to load the saved register content.
Either one will be correct.


-- 
H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-06 14:41             ` H.J. Lu
@ 2008-06-06 14:44               ` Richard Guenther
  0 siblings, 0 replies; 25+ messages in thread
From: Richard Guenther @ 2008-06-06 14:44 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev,
	Serguei N, Kreitzer, David L

On Fri, Jun 6, 2008 at 4:40 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jun 6, 2008 at 7:31 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Fri, Jun 6, 2008 at 4:28 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Fri, Jun 06, 2008 at 06:50:26AM -0700, H.J. Lu wrote:
>>>> On Fri, Jun 06, 2008 at 10:28:34AM +0200, Jan Hubicka wrote:
>>>> > >
>>>> > > ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
>>>> > > of xmm0. I am not sure if we need separate XMM registers from
>>>> > > YMM registers.
>>>> >
>>>> >
>>>> > Yes, I know that xmm0 is lower part of ymm0.  I still think we ought to
>>>> > be able to support varargs that do save ymm0 registers only when ymm
>>>> > values are passed same way as we touch SSE only when SSE values are
>>>> > passed via EAX hint.
>>>>
>>>> Which register do you propose for hint? The current psABI uses RAX
>>>> for XMM registers. We can't change it to AL and AH for YMM without
>>>> breaking backward compatibility.
>>>>
>>>> > This way we will be able to support e.g. printf that has YMM printing %
>>>> > construct but don't need YMM enabled hardware when those are not used.
>>>> >
>>>> > This is why I think extending EAX to contain information about amount of
>>>> > XMM values to save and in addition YMM values to save is sane.  Then old
>>>> > non-YMM aware varargs prologues will crash when YMM values are passed,
>>>> > but all other combinations will work.
>>>>
>>>> I don't think it is necessary since -mavx will enable AVX code
>>>> generation for all SSE codes. Unless the function only uses integer,
>>>> it will crash on non-YMM aware hardware.  That is if there is one
>>>> SSE register is used, which is hinted in RAX, varargs prologue will
>>>> use AVX instructions to save it. We don't need another hint for AVX
>>>> instructions.
>>>>
>>>> > >
>>>> > > >
>>>> > > > I personally don't have much preferences over 1. or 2.. 1. seems
>>>> > > > relatively easy to implement too, or is packaging two 128bit values to
>>>> > > > single 256bit difficult in va_arg expansion?
>>>> > > >
>>>> > >
>>>> > > Access to 256bit register as lower and upper 128bits needs 2
>>>> > > instructions. For store
>>>> > >
>>>> > > vmovaps   %xmm7, -143(%rax)
>>>> > > vextractf128 $1, %ymm7, -15(%rax)
>>>> > >
>>>> > > For load
>>>> > >
>>>> > > vmovaps  -143(%rax),%xmm7
>>>> > > vinsert128 $1, -15(%rax),%ymm7,%ymm7
>>>> > >
>>>> > > If we go beyond 256bit, we need more instructions to access
>>>> > > the full register. For 512bit, it will be split into lower 128bit,
>>>> > > middle 128bit and upper 256bit. 1024bit will have 4 parts.
>>>> > >
>>>> > > For #2, only one instruction will be needed for 256bit and
>>>> > > beyond.
>>>> >
>>>> > Yes, but we will still save half of stack space.  Well, I don't have
>>>> > much preferences here.  If it seems saner to simply save whole thing
>>>> > saving lower part twice, I am fine with that.
>>>>
>>>> I was told that it wasn't very easy to get decent performance with
>>>> split access. I extended my proposal to include a 16bit bitmask to
>>>> indicate which YMM regisetrs should be saved. If the bit is 0,
>>>> we should only save the the lower 128bit in the original register
>>>> save area. Otherwise, we should only save the same whole YMM register.
>>>>
>>>
>>> My second thought. How useful is such a bitmask? Do we really
>>> need it? Is that accepetable to save the lower 128bit twice?
>>
>> Why do we need to save the lower 128bit at all if a ymm reg is passed?
>> Can't we assume "type-correctness"?
>
> Say a double is passed in YMM0/XMM0, we should save it in XMM0 area.
> Do we also need to save the whole 256bit YMM0? If we save both XMM0 and
> YMM0, we are free to use any location to load the saved register content.
> Either one will be correct.

What is the benefit here?  (What would the contents of the upper 128bit
be - apart from "undefined")

I suppose you can load into xmm0 and then "extend" to ymm0?

Richard.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-05 14:31 RFC: Extend x86-64 psABI for 256bit AVX register H.J. Lu
  2008-06-05 14:49 ` Richard Guenther
  2008-06-05 15:15 ` Jan Hubicka
@ 2008-06-06 15:01 ` Jakub Jelinek
  2 siblings, 0 replies; 25+ messages in thread
From: Jakub Jelinek @ 2008-06-06 15:01 UTC (permalink / raw)
  To: H.J. Lu; +Cc: discuss, GCC, Girkar, Milind, Dmitriev, Serguei N

On Thu, Jun 05, 2008 at 07:31:12AM -0700, H.J. Lu wrote:
> 1. Extend the register save area to put upper 128bit at the end.
>   Pros:
>     Aligned access.
>     Save stack space if 256bit registers are used.
>   Cons
>     Split access. Require more split access beyond 256bit.
> 
> 2. Extend the register save area to put full 265bit YMMs at the end.
> The first DWORD after the register save area has the offset of
> the extended array for YMM registers. The next DWORD has the
> element size of the extended array. Unaligned access will be used.
>   Pros:
>     No split access.
>     Easily extendable beyond 256bit.
>     Limited unaligned access penalty if stack is aligned at 32byte.
>   Cons:
>     May require store both the lower 128bit and full 256bit register
>     content. We may avoid saving the lower 128bit if correct type
>     is required when accessing variable argument list, similar to int
>     vs. double.
>     Waste 272 byte on stack when 256bit registers are used.
>     Unaligned load and store.

Or:

3. Pass unnamed __m256 arguments both in YMM registers and on the
stack or just on the stack.  How often do you think people pass
vectors to varargs functions?  I think I haven't seen that yet except
in gcc testcases.  The x86_64 float varargs setup prologue is already
quite slow now, do we want to make it even slower for something
very rarely used?  Although we have tree-stdarg optimization pass
which is able to optimize the varargs prologue setup code in some cases,
e.g. for printf etc. it can't help, as printf etc. just
does va_start, passes the va_list to another function and does va_end,
so it must count with any possibility.  Named __m256 arguments would
still be passed in YMM registers only...

	Jakub

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-06 14:28         ` H.J. Lu
  2008-06-06 14:31           ` Richard Guenther
@ 2008-06-09 14:41           ` Jan Hubicka
  2008-06-10 11:24             ` Jakub Jelinek
  1 sibling, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2008-06-09 14:41 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev,
	Serguei N, Kreitzer, David L

> On Fri, Jun 06, 2008 at 06:50:26AM -0700, H.J. Lu wrote:
> > On Fri, Jun 06, 2008 at 10:28:34AM +0200, Jan Hubicka wrote:
> > > > 
> > > > ymm0 and xmm0 are the same register. xmm0 is the lower 128bit
> > > > of xmm0. I am not sure if we need separate XMM registers from
> > > > YMM registers.
> > > 
> > > 
> > > Yes, I know that xmm0 is lower part of ymm0.  I still think we ought to
> > > be able to support varargs that do save ymm0 registers only when ymm
> > > values are passed same way as we touch SSE only when SSE values are
> > > passed via EAX hint.
> > 
> > Which register do you propose for hint? The current psABI uses RAX
> > for XMM registers. We can't change it to AL and AH for YMM without
> > breaking backward compatibility.
> > 
> > > This way we will be able to support e.g. printf that has YMM printing %
> > > construct but don't need YMM enabled hardware when those are not used.
> > > 
> > > This is why I think extending EAX to contain information about amount of
> > > XMM values to save and in addition YMM values to save is sane.  Then old
> > > non-YMM aware varargs prologues will crash when YMM values are passed,
> > > but all other combinations will work.
> > 
> > I don't think it is necessary since -mavx will enable AVX code
> > generation for all SSE codes. Unless the function only uses integer,
> > it will crash on non-YMM aware hardware.  That is if there is one
> > SSE register is used, which is hinted in RAX, varargs prologue will
> > use AVX instructions to save it. We don't need another hint for AVX
> > instructions.
> > 
> > > > 
> > > > >
> > > > > I personally don't have much preferences over 1. or 2.. 1. seems
> > > > > relatively easy to implement too, or is packaging two 128bit values to
> > > > > single 256bit difficult in va_arg expansion?
> > > > >
> > > > 
> > > > Access to 256bit register as lower and upper 128bits needs 2
> > > > instructions. For store
> > > > 
> > > > vmovaps   %xmm7, -143(%rax)
> > > > vextractf128 $1, %ymm7, -15(%rax)
> > > > 
> > > > For load
> > > > 
> > > > vmovaps  -143(%rax),%xmm7
> > > > vinsert128 $1, -15(%rax),%ymm7,%ymm7
> > > > 
> > > > If we go beyond 256bit, we need more instructions to access
> > > > the full register. For 512bit, it will be split into lower 128bit,
> > > > middle 128bit and upper 256bit. 1024bit will have 4 parts.
> > > > 
> > > > For #2, only one instruction will be needed for 256bit and
> > > > beyond.
> > > 
> > > Yes, but we will still save half of stack space.  Well, I don't have
> > > much preferences here.  If it seems saner to simply save whole thing
> > > saving lower part twice, I am fine with that.
> > 
> > I was told that it wasn't very easy to get decent performance with
> > split access. I extended my proposal to include a 16bit bitmask to
> > indicate which YMM regisetrs should be saved. If the bit is 0,
> > we should only save the the lower 128bit in the original register
> > save area. Otherwise, we should only save the same whole YMM register.
> > 
> 
> My second thought. How useful is such a bitmask? Do we really
> need it? Is that accepetable to save the lower 128bit twice?

I dont' see much benefit in bitmask.  I think we only should try to
enforce:
  1) that AVX prologue will not ICE on non-AVX hardware for functions
  not using AVX va_arg constructs.
  2) backward compatibility with current va_lists.  That is make
     calling AVX function from non-AVX code work as well as calling
     non-AVX function from AVX code.

I don't think unconditionally saving the AVX registers or guarding them
same way as we do for SSE is good because it breaks 1).

We can't use new register to hint number of AVX operands, because the
register would be uninitialized in non-AVX code.

Still it seems to me that we can use extend current eax convention.
Currently the value must be in range 0...8 as it specify number of SSE
registers.  We can pack both numbers into it.  This way we get
unforutnately wild jump on case of AVX code calling non-AVX function and
passing in AVX arguments, but this seems less important than 1) and 2)
to me and I don't see how to get all three cases working.

Duplicating the value seems OK with me if it simplifies implementation
significandly.

Honza
> 
> Thanks.
> 
> 
> H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-09 14:41           ` Jan Hubicka
@ 2008-06-10 11:24             ` Jakub Jelinek
  2008-06-10 11:32               ` Jan Hubicka
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Jelinek @ 2008-06-10 11:24 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: H.J. Lu, Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev,
	Serguei N, Kreitzer, David L

On Mon, Jun 09, 2008 at 04:40:54PM +0200, Jan Hubicka wrote:
> Still it seems to me that we can use extend current eax convention.
> Currently the value must be in range 0...8 as it specify number of SSE
> registers.  We can pack both numbers into it.  This way we get
> unforutnately wild jump on case of AVX code calling non-AVX function and
> passing in AVX arguments, but this seems less important than 1) and 2)
> to me and I don't see how to get all three cases working.
> 
> Duplicating the value seems OK with me if it simplifies implementation
> significandly.

I don't understand why you want to pass __m256 and 256-bit vector values
to anonymous arguments in registers.  The only thing the vararg functions
would do with it would be save it somewhere on the stack.
Given the x86_64 ABI, you can't expect calling an implicitly
prototyped or non-vararg prototyped function which is actually
defined as vararg function (as %rax wouldn't be properly initialized),
which means you need a prototype for all vararg functions and
at that point the caller can just do the job for the callee and push stuff
on the stack.  Then vararg prologue doesn't need to save %ymm* registers
at all and va_arg will handle __m256 just fine.

	Jakub

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-10 11:24             ` Jakub Jelinek
@ 2008-06-10 11:32               ` Jan Hubicka
  2008-06-10 13:48                 ` H.J. Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2008-06-10 11:32 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jan Hubicka, H.J. Lu, Jan Hubicka, discuss, GCC, Girkar, Milind,
	Dmitriev, Serguei N, Kreitzer, David L

> 
> I don't understand why you want to pass __m256 and 256-bit vector values
> to anonymous arguments in registers.  The only thing the vararg functions
> would do with it would be save it somewhere on the stack.
> Given the x86_64 ABI, you can't expect calling an implicitly
> prototyped or non-vararg prototyped function which is actually
> defined as vararg function (as %rax wouldn't be properly initialized),

Unprototyped functions calls all get rax set. If calle is variadic,
things still work.  Sure, for __m256 we can also declare prototypes for
variadic functions mandatory and simply pass things on stack.

Honza
> which means you need a prototype for all vararg functions and
> at that point the caller can just do the job for the callee and push stuff
> on the stack.  Then vararg prologue doesn't need to save %ymm* registers
> at all and va_arg will handle __m256 just fine.
> 
> 	Jakub

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-10 11:32               ` Jan Hubicka
@ 2008-06-10 13:48                 ` H.J. Lu
  2008-06-10 14:50                   ` Jan Hubicka
  0 siblings, 1 reply; 25+ messages in thread
From: H.J. Lu @ 2008-06-10 13:48 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Jan Hubicka, discuss, GCC, Girkar, Milind,
	Dmitriev, Serguei N, Kreitzer, David L

On Tue, Jun 10, 2008 at 4:32 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I don't understand why you want to pass __m256 and 256-bit vector values
>> to anonymous arguments in registers.  The only thing the vararg functions
>> would do with it would be save it somewhere on the stack.
>> Given the x86_64 ABI, you can't expect calling an implicitly
>> prototyped or non-vararg prototyped function which is actually
>> defined as vararg function (as %rax wouldn't be properly initialized),
>
> Unprototyped functions calls all get rax set. If calle is variadic,
> things still work.  Sure, for __m256 we can also declare prototypes for
> variadic functions mandatory and simply pass things on stack.
>

Do unprototyped functions calls work with __m128 and vararg on ia32?
I don't think it works since the first 3 __m128 were passed in registers,
but everything is passed on stack for vararg. If we require prototypes
for ia32, we should do the same for x86-64.

-- 
H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-10 13:48                 ` H.J. Lu
@ 2008-06-10 14:50                   ` Jan Hubicka
  2008-06-10 14:57                     ` Jakub Jelinek
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2008-06-10 14:50 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, Jakub Jelinek, Jan Hubicka, discuss, GCC, Girkar,
	Milind, Dmitriev, Serguei N, Kreitzer, David L

> On Tue, Jun 10, 2008 at 4:32 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>
> >> I don't understand why you want to pass __m256 and 256-bit vector values
> >> to anonymous arguments in registers.  The only thing the vararg functions
> >> would do with it would be save it somewhere on the stack.
> >> Given the x86_64 ABI, you can't expect calling an implicitly
> >> prototyped or non-vararg prototyped function which is actually
> >> defined as vararg function (as %rax wouldn't be properly initialized),
> >
> > Unprototyped functions calls all get rax set. If calle is variadic,
> > things still work.  Sure, for __m256 we can also declare prototypes for
> > variadic functions mandatory and simply pass things on stack.
> >
> 
> Do unprototyped functions calls work with __m128 and vararg on ia32?
> I don't think it works since the first 3 __m128 were passed in registers,
> but everything is passed on stack for vararg. If we require prototypes
> for ia32, we should do the same for x86-64.

unprototyped varargs don't work for __m128 or regparm conventions on
ia32 indeed.

Main motivation for getting varargs to work right on x86-64 with
unprototyped functions was to make legacy codebases using FP operands
happy.  There are no legacy codebases using vector extensions (I hope :)
so this motivation is not that important.  We might however want to do
consistent here.  So I guess we now have following choices:

 1) make __m256 passed on stack on variadic functions and in registers
 otherwse. Then we don't need to worry about varargs changes at all.
 This will break unprototyped calls.
 2) extend rax to pass info about if __m256 registers are present and
 upper half needs to be saved.
 This will break passing __m256 arguments to functions with prologues
 compiled with legacy compiler that will do wild jump.  All other cases
 should work
 3) Save upper halves whenever we want to save SSE registers.  This will
 break calling variadic functions compiled with __m256 support in.

I guess either 1) or 2) is fine for me, as I told earlier, I am not big
fan of 3).  I guess 1) is easier and probably make more sense?

Honza
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-10 14:50                   ` Jan Hubicka
@ 2008-06-10 14:57                     ` Jakub Jelinek
  2008-06-10 15:41                       ` H.J. Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Jelinek @ 2008-06-10 14:57 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: H.J. Lu, Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev,
	Serguei N, Kreitzer, David L

On Tue, Jun 10, 2008 at 04:50:14PM +0200, Jan Hubicka wrote:
>  1) make __m256 passed on stack on variadic functions and in registers
>  otherwse. Then we don't need to worry about varargs changes at all.
>  This will break unprototyped calls.
>  2) extend rax to pass info about if __m256 registers are present and
>  upper half needs to be saved.
>  This will break passing __m256 arguments to functions with prologues
>  compiled with legacy compiler that will do wild jump.  All other cases
>  should work
>  3) Save upper halves whenever we want to save SSE registers.  This will
>  break calling variadic functions compiled with __m256 support in.
> 
> I guess either 1) or 2) is fine for me, as I told earlier, I am not big
> fan of 3).  I guess 1) is easier and probably make more sense?

I vote for 1), though I think it should be passed on stack only for ...
args.  E.g. for
void foo (__m256 x, ...);
void bar (__m256 x, __m256 y, __m256 z)
{
  foo (x, y, z);
}
the first argument would be passed in %ymm0, while the unnamed arguments
y and z would be pushed to stack.

	Jakub

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-10 14:57                     ` Jakub Jelinek
@ 2008-06-10 15:41                       ` H.J. Lu
  2008-06-10 15:49                         ` Jan Hubicka
  0 siblings, 1 reply; 25+ messages in thread
From: H.J. Lu @ 2008-06-10 15:41 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Jan Hubicka, Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev,
	Serguei N, Kreitzer, David L

On Tue, Jun 10, 2008 at 8:11 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jun 10, 2008 at 04:50:14PM +0200, Jan Hubicka wrote:
>>  1) make __m256 passed on stack on variadic functions and in registers
>>  otherwse. Then we don't need to worry about varargs changes at all.
>>  This will break unprototyped calls.
>>  2) extend rax to pass info about if __m256 registers are present and
>>  upper half needs to be saved.
>>  This will break passing __m256 arguments to functions with prologues
>>  compiled with legacy compiler that will do wild jump.  All other cases
>>  should work
>>  3) Save upper halves whenever we want to save SSE registers.  This will
>>  break calling variadic functions compiled with __m256 support in.
>>
>> I guess either 1) or 2) is fine for me, as I told earlier, I am not big
>> fan of 3).  I guess 1) is easier and probably make more sense?
>
> I vote for 1), though I think it should be passed on stack only for ...
> args.  E.g. for
> void foo (__m256 x, ...);
> void bar (__m256 x, __m256 y, __m256 z)
> {
>  foo (x, y, z);
> }
> the first argument would be passed in %ymm0, while the unnamed arguments
> y and z would be pushed to stack.
>

I agree. We will add testcases for whatever psABI extension we have
chosen.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-10 15:41                       ` H.J. Lu
@ 2008-06-10 15:49                         ` Jan Hubicka
  2008-06-10 16:18                           ` H.J. Lu
  2008-06-11 14:49                           ` H.J. Lu
  0 siblings, 2 replies; 25+ messages in thread
From: Jan Hubicka @ 2008-06-10 15:49 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jakub Jelinek, Jan Hubicka, Jan Hubicka, discuss, GCC, Girkar,
	Milind, Dmitriev, Serguei N, Kreitzer, David L

> On Tue, Jun 10, 2008 at 8:11 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> > On Tue, Jun 10, 2008 at 04:50:14PM +0200, Jan Hubicka wrote:
> >>  1) make __m256 passed on stack on variadic functions and in registers
> >>  otherwse. Then we don't need to worry about varargs changes at all.
> >>  This will break unprototyped calls.
> >>  2) extend rax to pass info about if __m256 registers are present and
> >>  upper half needs to be saved.
> >>  This will break passing __m256 arguments to functions with prologues
> >>  compiled with legacy compiler that will do wild jump.  All other cases
> >>  should work
> >>  3) Save upper halves whenever we want to save SSE registers.  This will
> >>  break calling variadic functions compiled with __m256 support in.
> >>
> >> I guess either 1) or 2) is fine for me, as I told earlier, I am not big
> >> fan of 3).  I guess 1) is easier and probably make more sense?
> >
> > I vote for 1), though I think it should be passed on stack only for ...
> > args.  E.g. for
> > void foo (__m256 x, ...);
> > void bar (__m256 x, __m256 y, __m256 z)
> > {
> >  foo (x, y, z);
> > }
> > the first argument would be passed in %ymm0, while the unnamed arguments
> > y and z would be pushed to stack.
> >
> 
> I agree. We will add testcases for whatever psABI extension we have
> chosen.

I guess we all agree on passing variadic arguments on stack (that is
only those belonging on ...) and rest in registers.  It seems easiest in
regard to future register set extensions too.  Only negative thing is
that calls to variadic functions will become bit longer, but I guess it
is not big deal. (the fact that register passing conventions are shorter
and variadic functions tends to be called many times was also original
motivation to support register passing on pretty much everything for
varargs in psABI)

Would you mind preparing psABI patch too?

Thanks,
Honza
> 
> Thanks.
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-10 15:49                         ` Jan Hubicka
@ 2008-06-10 16:18                           ` H.J. Lu
  2008-06-11 14:49                           ` H.J. Lu
  1 sibling, 0 replies; 25+ messages in thread
From: H.J. Lu @ 2008-06-10 16:18 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Jan Hubicka, discuss, GCC, Girkar, Milind,
	Dmitriev, Serguei N, Kreitzer, David L

On Tue, Jun 10, 2008 at 8:48 AM, Jan Hubicka <jh@suse.cz> wrote:
>> On Tue, Jun 10, 2008 at 8:11 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>> > On Tue, Jun 10, 2008 at 04:50:14PM +0200, Jan Hubicka wrote:
>> >>  1) make __m256 passed on stack on variadic functions and in registers
>> >>  otherwse. Then we don't need to worry about varargs changes at all.
>> >>  This will break unprototyped calls.
>> >>  2) extend rax to pass info about if __m256 registers are present and
>> >>  upper half needs to be saved.
>> >>  This will break passing __m256 arguments to functions with prologues
>> >>  compiled with legacy compiler that will do wild jump.  All other cases
>> >>  should work
>> >>  3) Save upper halves whenever we want to save SSE registers.  This will
>> >>  break calling variadic functions compiled with __m256 support in.
>> >>
>> >> I guess either 1) or 2) is fine for me, as I told earlier, I am not big
>> >> fan of 3).  I guess 1) is easier and probably make more sense?
>> >
>> > I vote for 1), though I think it should be passed on stack only for ...
>> > args.  E.g. for
>> > void foo (__m256 x, ...);
>> > void bar (__m256 x, __m256 y, __m256 z)
>> > {
>> >  foo (x, y, z);
>> > }
>> > the first argument would be passed in %ymm0, while the unnamed arguments
>> > y and z would be pushed to stack.
>> >
>>
>> I agree. We will add testcases for whatever psABI extension we have
>> chosen.
>
> I guess we all agree on passing variadic arguments on stack (that is
> only those belonging on ...) and rest in registers.  It seems easiest in
> regard to future register set extensions too.  Only negative thing is
> that calls to variadic functions will become bit longer, but I guess it
> is not big deal. (the fact that register passing conventions are shorter
> and variadic functions tends to be called many times was also original
> motivation to support register passing on pretty much everything for
> varargs in psABI)
>
> Would you mind preparing psABI patch too?
>

We will have an AVX  BoF at gcc summit  next week. We have
reserved 5:15pm to 6:15pm on Wednesday. We will submit the
psABI patch before gcc summit.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-10 15:49                         ` Jan Hubicka
  2008-06-10 16:18                           ` H.J. Lu
@ 2008-06-11 14:49                           ` H.J. Lu
  2008-06-15 22:37                             ` Jakub Jelinek
  1 sibling, 1 reply; 25+ messages in thread
From: H.J. Lu @ 2008-06-11 14:49 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Jan Hubicka, discuss, GCC, Girkar, Milind,
	Dmitriev, Serguei N, Kreitzer, David L

On Tue, Jun 10, 2008 at 05:48:57PM +0200, Jan Hubicka wrote:
> > On Tue, Jun 10, 2008 at 8:11 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> > > On Tue, Jun 10, 2008 at 04:50:14PM +0200, Jan Hubicka wrote:
> > >>  1) make __m256 passed on stack on variadic functions and in registers
> > >>  otherwse. Then we don't need to worry about varargs changes at all.
> > >>  This will break unprototyped calls.
> > >>  2) extend rax to pass info about if __m256 registers are present and
> > >>  upper half needs to be saved.
> > >>  This will break passing __m256 arguments to functions with prologues
> > >>  compiled with legacy compiler that will do wild jump.  All other cases
> > >>  should work
> > >>  3) Save upper halves whenever we want to save SSE registers.  This will
> > >>  break calling variadic functions compiled with __m256 support in.
> > >>
> > >> I guess either 1) or 2) is fine for me, as I told earlier, I am not big
> > >> fan of 3).  I guess 1) is easier and probably make more sense?
> > >
> > > I vote for 1), though I think it should be passed on stack only for ...
> > > args.  E.g. for
> > > void foo (__m256 x, ...);
> > > void bar (__m256 x, __m256 y, __m256 z)
> > > {
> > >  foo (x, y, z);
> > > }
> > > the first argument would be passed in %ymm0, while the unnamed arguments
> > > y and z would be pushed to stack.
> > >
> > 
> > I agree. We will add testcases for whatever psABI extension we have
> > chosen.
> 
> I guess we all agree on passing variadic arguments on stack (that is
> only those belonging on ...) and rest in registers.  It seems easiest in
> regard to future register set extensions too.  Only negative thing is
> that calls to variadic functions will become bit longer, but I guess it
> is not big deal. (the fact that register passing conventions are shorter
> and variadic functions tends to be called many times was also original
> motivation to support register passing on pretty much everything for
> varargs in psABI)
> 

There is no precedent for passing named parameters in registers but
unnamed parameters on the stack.  On IA32 for the __m128 types, we
pass ALL __m128 parameters on the stack for varargs functions, not
just the unnamed ones. I think we should do the same for x86-64.


H.J.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-11 14:49                           ` H.J. Lu
@ 2008-06-15 22:37                             ` Jakub Jelinek
  2008-06-16  1:49                               ` Jan Hubicka
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Jelinek @ 2008-06-15 22:37 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev, Serguei N,
	Kreitzer, David L

On Wed, Jun 11, 2008 at 07:49:12AM -0700, H.J. Lu wrote:
> > I guess we all agree on passing variadic arguments on stack (that is
> > only those belonging on ...) and rest in registers.  It seems easiest in
> > regard to future register set extensions too.  Only negative thing is
> > that calls to variadic functions will become bit longer, but I guess it
> > is not big deal. (the fact that register passing conventions are shorter
> > and variadic functions tends to be called many times was also original
> > motivation to support register passing on pretty much everything for
> > varargs in psABI)
> > 
> 
> There is no precedent for passing named parameters in registers but
> unnamed parameters on the stack.  On IA32 for the __m128 types, we
> pass ALL __m128 parameters on the stack for varargs functions, not
> just the unnamed ones. I think we should do the same for x86-64.

Why should the 32-bit ABI influence x86-64 ABI decisions?
There are clear advantages of passing __m128 named arguments in registers
(shorter/faster code both on the caller and callee side) and there
are advantages of passing __m128 unnamed arguments on the stack
(for va_arg to work, they need to be on the stack, and if they
are passed in registers, the callee would need to push them
to the stack).

	Jakub

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-15 22:37                             ` Jakub Jelinek
@ 2008-06-16  1:49                               ` Jan Hubicka
  2008-06-18 23:16                                 ` H.J. Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2008-06-16  1:49 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: H.J. Lu, Jan Hubicka, discuss, GCC, Girkar, Milind, Dmitriev,
	Serguei N, Kreitzer, David L

> On Wed, Jun 11, 2008 at 07:49:12AM -0700, H.J. Lu wrote:
> > > I guess we all agree on passing variadic arguments on stack (that is
> > > only those belonging on ...) and rest in registers.  It seems easiest in
> > > regard to future register set extensions too.  Only negative thing is
> > > that calls to variadic functions will become bit longer, but I guess it
> > > is not big deal. (the fact that register passing conventions are shorter
> > > and variadic functions tends to be called many times was also original
> > > motivation to support register passing on pretty much everything for
> > > varargs in psABI)
> > > 
> > 
> > There is no precedent for passing named parameters in registers but
> > unnamed parameters on the stack.  On IA32 for the __m128 types, we
> > pass ALL __m128 parameters on the stack for varargs functions, not
> > just the unnamed ones. I think we should do the same for x86-64.
> 
> Why should the 32-bit ABI influence x86-64 ABI decisions?
> There are clear advantages of passing __m128 named arguments in registers
> (shorter/faster code both on the caller and callee side) and there
> are advantages of passing __m128 unnamed arguments on the stack
> (for va_arg to work, they need to be on the stack, and if they
> are passed in registers, the callee would need to push them
> to the stack).

For record I would also preffer passing all named AVX arguments in
registers.  x86-64 ABI was designed for performance not for backward
compatibility, so it should be consistent with original idea and I think
that ABIs are divergent anough so this won't cause too much of extra
confussion.  But I am happy with both solutions.

Honza
> 
> 	Jakub

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: RFC: Extend x86-64 psABI for 256bit AVX register
  2008-06-16  1:49                               ` Jan Hubicka
@ 2008-06-18 23:16                                 ` H.J. Lu
  0 siblings, 0 replies; 25+ messages in thread
From: H.J. Lu @ 2008-06-18 23:16 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jakub Jelinek, Jan Hubicka, discuss, GCC, Girkar, Milind,
	Dmitriev, Serguei N, Kreitzer, David L, Michael Meissner,
	Dwarakanath Rajagopal

[-- Attachment #1: Type: text/plain, Size: 1889 bytes --]

Hi,

Here is the AVX patch for x86-64 psABI proposed at gcc submmit 2008.

H.J.
---
On Sun, Jun 15, 2008 at 6:49 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Wed, Jun 11, 2008 at 07:49:12AM -0700, H.J. Lu wrote:
>> > > I guess we all agree on passing variadic arguments on stack (that is
>> > > only those belonging on ...) and rest in registers.  It seems easiest in
>> > > regard to future register set extensions too.  Only negative thing is
>> > > that calls to variadic functions will become bit longer, but I guess it
>> > > is not big deal. (the fact that register passing conventions are shorter
>> > > and variadic functions tends to be called many times was also original
>> > > motivation to support register passing on pretty much everything for
>> > > varargs in psABI)
>> > >
>> >
>> > There is no precedent for passing named parameters in registers but
>> > unnamed parameters on the stack.  On IA32 for the __m128 types, we
>> > pass ALL __m128 parameters on the stack for varargs functions, not
>> > just the unnamed ones. I think we should do the same for x86-64.
>>
>> Why should the 32-bit ABI influence x86-64 ABI decisions?
>> There are clear advantages of passing __m128 named arguments in registers
>> (shorter/faster code both on the caller and callee side) and there
>> are advantages of passing __m128 unnamed arguments on the stack
>> (for va_arg to work, they need to be on the stack, and if they
>> are passed in registers, the callee would need to push them
>> to the stack).
>
> For record I would also preffer passing all named AVX arguments in
> registers.  x86-64 ABI was designed for performance not for backward
> compatibility, so it should be consistent with original idea and I think
> that ABIs are divergent anough so this won't cause too much of extra
> confussion.  But I am happy with both solutions.
>
> Honza
>>
>>       Jakub
>



-- 
H.J.

[-- Attachment #2: avx-5.patch --]
[-- Type: application/octet-stream, Size: 8970 bytes --]

Index: low-level-sys-info.tex
===================================================================
--- low-level-sys-info.tex	(revision 217)
+++ low-level-sys-info.tex	(working copy)
@@ -25,8 +25,8 @@ object, and the term \emph{\textindex{\s
 \subsubsection{Fundamental Types}
 
 Figure~\ref{basic-types} shows the correspondence between ISO C's
-scalar types and the processor's.  \code{__int128},
-\code{__float128}, \code{__m64} and \code{__m128} types are optional.
+scalar types and the processor's.  \code{__int128}, \code{__float128},
+\code{__m64}, \code{__m128} and \code{__m256} types are optional.
 
 \begin{figure}
   \caption{Scalar Types}\label{basic-types}
@@ -91,8 +91,10 @@ scalar types and the processor's.  \code
     Packed & \texttt{__m64}$^{\dagger\dagger}$ & 8 & 8 & \MMX{} and \threednow \\
     \cline{2-5}
     & \texttt{__m128}$^{\dagger\dagger}$ & 16 & 16 & SSE and SSE-2 \\
+    \cline{2-5}
+    & \texttt{__m256}$^{\dagger\dagger}$ & 32 & 32 & AVX \\
 \noalign{\smallskip}
-\cline{1-2}
+\cline{1-5}
 \multicolumn{3}{l}{\small $^\dagger$ This type is called \texttt{bool}
 in C++.}\\
 \multicolumn{3}{l}{\small $^{\dagger\dagger}$ These types are optional.}\\
@@ -134,8 +136,8 @@ any nonzero value is considered \code{tr
 Like the Intel386 architecture, the \xARCH architecture in general
 does not require all data accesses to be properly aligned.  Misaligned
 data accesses are slower than aligned accesses
-but otherwise behave identically.  The only exception is that
-\code{__m128} must always be aligned properly.
+but otherwise behave identically.  The only exceptions are that
+\code{__m128} and \code{__m256} must always be aligned properly.
 
 \subsubsection{Aggregates and Unions}
 
@@ -194,9 +196,9 @@ integral values of a specified size.
 \Hrule
 \end{figure}
 
-The ABI does not permit bit-fields having the type \texttt{__m64} or
-\texttt{__m128}.  Programs using bit-fields of these types are not
-portable.
+The ABI does not permit bit-fields having the type \texttt{__m64},
+\texttt{__m128} or \texttt{__m256}.  Programs using bit-fields of
+these types are not portable.
 
 Bit-fields that are neither signed nor unsigned
 always have non-negative values. Although they may have type char,
@@ -240,6 +242,15 @@ the x87 floating point registers may be 
 mode as a 64-bit register.  All of these registers are global to all
 procedures active for a given thread.
 
+Intel\textsuperscript{\textregistered} AVX
+(Intel\textsuperscript{\textregistered} Advanced Vector
+Extensions) provides 16 256-bit wide AVX registers
+(\reg{ymm0} - \reg{ymm15}).  The lower 128-bits of \reg{ymm0} - \reg{ymm15}
+are aliased to the respective 128b-bit SSE registers (\reg{xmm0} -
+\reg{xmm15}). For purposes of parameter passing and function return,
+\reg{xmmN} and \reg{ymmN} refer to the same register. Only one of them
+can be used at the same time.
+
 This subsection discusses usage of each register.  Registers \RBP, \RBX and
 \reg{r12} through \reg{r15} ``belong'' to the calling function and the
 called function is required to preserve their values.  In other words,
@@ -300,9 +311,10 @@ stack.  This stack grows downwards from 
 \Hrule
 \end{figure}
 
-The end of the input argument area shall be aligned on a 16 byte
-boundary.  In other words, the value $(\RSP - 8)$ is always a multiple
-of $16$ when control is transferred to the function entry point.  The
+The end of the input argument area shall be aligned on a 16 (32, if
+\texttt{__m256} is passed on stack) byte boundary.  In other
+words, the value $(\RSP - 8)$ is always a multiple of $16$ ($32$) when
+control is transferred to the function entry point.  The
 stack pointer, \RSP, always points to the end of the latest allocated
 stack frame.  \footnote{The conventional use of \RBP{} as a frame
   pointer for the stack frame may be avoided by using \RSP (the stack
@@ -333,9 +345,10 @@ classes are corresponding to \xARCH regi
 \begin{description}
 \item[INTEGER] This class consists of integral types that fit into one of
   the general purpose registers.
-\item[SSE] The class consists of types that fits into a SSE register.
+\item[SSE] The class consists of types that fit into a SSE register.
 \item[SSEUP] The class consists of types that fit into a SSE register
   and can be passed and returned in the most significant half of it.
+\item[AVX] The class consists of types that fit into a AVX register.
 \item[X87, X87UP] These classes consists of types that will be returned via
   the x87 FPU.
 \item[COMPLEX\_X87] This class consists of types that will be returned
@@ -361,6 +374,7 @@ The basic types are assigned their natur
 \item Arguments of types \code{__float128}, \code{_Decimal128}
   and \code{__m128} are split into two halves.  The least significant
   ones belong to class SSE, the most significant one to class SSEUP.
+\item Arguments of type \code{__m256} are in class AVX.
 \item The 64-bit mantissa of arguments of type \code{long double}
   belongs to class X87, the 16-bit exponent plus 6 bytes of padding
   belongs to class X87UP.
@@ -468,6 +482,9 @@ left-to-right order) for passing as foll
 \item If the class is SSEUP, the \eightbyte is passed in the upper
    half of the last used SSE register.
 
+\item If the class is AVX, the next available AVX register is used, the
+   registers are taken in the order from \reg{ymm0} to \reg{ymm7}.
+
 \item If the class is X87, X87UP or COMPLEX\_X87, it is passed in memory.
 \end{enumerate}
 
@@ -544,6 +561,9 @@ the number of SSE registers used. The co
 match exactly the number of registers, but must be an upper bound on
 the number of SSE registers used and is in the range 0--8 inclusive.
 
+When passing \texttt{__m256} arguments to functions that use varargs
+or stdarg, function prototypes must be provided.  Otherwise, the
+run-time behavior is undefined.
 
 \paragraph{Returning of Values}
 The returning of values is done according to the following algorithm:
@@ -567,6 +587,9 @@ The returning of values is done accordin
 \item If the class is SSEUP, the \eightbyte is passed in the upper half of the
    last used SSE register.
 
+\item If the class is AVX, the next available AVX register of the
+   sequence \reg{ymm0}, \reg{ymm1} is used.
+
 \item If the class is X87, the value is returned on the X87 stack in
    \reg{st0} as 80-bit x87 number.
 
@@ -598,13 +621,15 @@ structparm s;\\
 int e, f, g, h, i, j, k;\\
 long double ld;\\
 double m, n;\\
+__m256 y;\\
 \\
 extern void func (int e, int f,\\
 \phantom{extern void func (}structparm s, int g, int h,\\
 \phantom{extern void func (}long double ld, double m,\\
+\phantom{extern void func (}__m256 y,\\
 \phantom{extern void func (}double n, int i, int j, int k);\\
 \\
-func (e, f, s, g, h, ld, m, n, i, j, k);\\
+func (e, f, s, g, h, ld, m, y, n, i, j, k);\\
 \cline{1-1}
 \end{tabular}
 }
@@ -622,12 +647,12 @@ func (e, f, s, g, h, ld, m, n, i, j, k);
 \multicolumn{2}{c}{Floating Point Registers} &
 \multicolumn{2}{c}{Stack Frame Offset}\\
 \hline
-\RDI: &\code{e} & \reg{xmm0}: &\code{s.d} &\code{0:}& \code{ld} \\
-\RSI: &\code{f} & \reg{xmm1}: &\code{m}& \code{16:}& \code{j} \\
-\RDX: &\code{s.a,s.b} & \reg{xmm2}: &\code{n}&\code{24:}& \code{k} \\
-\RCX: &\code{g} & & & \\
-\reg{r8}:&\code{h} & && & \\
-\reg{r9}:&\code{i} & && & \\
+\RDI:    &\code{e}      &\reg{xmm0}:&\code{s.d}&\code{0:} &\code{ld} \\
+\RSI:    &\code{f}      &\reg{xmm1}:&\code{m}  &\code{16:}&\code{j} \\
+\RDX:    &\code{s.a,s.b}&\reg{ymm2}:&\code{y}  &\code{24:}&\code{k} \\
+\RCX:    &\code{g}      &\reg{xmm3}:&\code{n}  &          & \\
+\reg{r8}:&\code{h}      &           &          &          & \\
+\reg{r9}:&\code{i}      &           &          &          & \\
 \end{tabular}
 
 \end{center}
@@ -1957,6 +1982,10 @@ the function in SSE registers.%
 %%% XXX: Really only floating pointer parameters?
 %%% XXX: Use %al or %rax?
 
+When \texttt{__m256} is passed as variable-argument, it should always
+be passed on stack. Only named \texttt{__m256} arguments may be passed
+in register as specified in section \ref{sec-calling-conventions}.
+
 \begin{figure}[H]
 \Hrule
 \caption{Parameter Passing Example with Variable-Argument List}
@@ -1968,10 +1997,11 @@ the function in SSE registers.%
 int a, b;\\
 long double ld;\\
 double m, n;\\
+__m256 u, y;\\
 \\
-extern void func (int a, double  m,...);\\
+extern void func (int a, double m, __m256 u, ...);\\
 \\
-func (a, m, b, ld, n);\\
+func (a, m, u, b, ld, y, n);\\
 \cline{1-1}
 \end{tabular}
 }
@@ -1989,9 +2019,9 @@ func (a, m, b, ld, n);\\
 \multicolumn{2}{c}{Floating Point Registers} &
 \multicolumn{2}{c}{Stack Frame Offset}\\
 \hline
-\RDI: &\code{a} & \reg{xmm0}: &\code{m} &\code{0:}& \code{ld} \\
-\RSI: &\code{b} & \reg{xmm1}: &\code{n}& &  \\
-\RAX: & 2 & & & \\
+\RDI: &\code{a}&\reg{xmm0}:&\code{m}&\code{0:} &\code{ld} \\
+\RSI: &\code{b}&\reg{ymm1}:&\code{u}&\code{32:}&\code{y} \\
+\RAX: & 3      &\reg{xmm2}:&\code{n}& \\
 \end{tabular}
 \end{center}
 \Hrule

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2008-06-18 23:16 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-06-05 14:31 RFC: Extend x86-64 psABI for 256bit AVX register H.J. Lu
2008-06-05 14:49 ` Richard Guenther
2008-06-05 15:52   ` H.J. Lu
2008-06-05 15:15 ` Jan Hubicka
2008-06-05 16:14   ` H.J. Lu
2008-06-06  8:29     ` Jan Hubicka
2008-06-06 13:50       ` H.J. Lu
2008-06-06 14:28         ` H.J. Lu
2008-06-06 14:31           ` Richard Guenther
2008-06-06 14:41             ` H.J. Lu
2008-06-06 14:44               ` Richard Guenther
2008-06-09 14:41           ` Jan Hubicka
2008-06-10 11:24             ` Jakub Jelinek
2008-06-10 11:32               ` Jan Hubicka
2008-06-10 13:48                 ` H.J. Lu
2008-06-10 14:50                   ` Jan Hubicka
2008-06-10 14:57                     ` Jakub Jelinek
2008-06-10 15:41                       ` H.J. Lu
2008-06-10 15:49                         ` Jan Hubicka
2008-06-10 16:18                           ` H.J. Lu
2008-06-11 14:49                           ` H.J. Lu
2008-06-15 22:37                             ` Jakub Jelinek
2008-06-16  1:49                               ` Jan Hubicka
2008-06-18 23:16                                 ` H.J. Lu
2008-06-06 15:01 ` Jakub Jelinek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).