public inbox for gcc@gcc.gnu.org
* [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
@ 2013-07-23 19:34 H.J. Lu
  2013-07-23 19:57 ` Joseph S. Myers
  2013-07-24 15:20 ` Richard Biener
  0 siblings, 2 replies; 25+ messages in thread
From: H.J. Lu @ 2013-07-23 19:34 UTC (permalink / raw)
  To: GNU C Library, GCC Development, Binutils, Girkar, Milind,
	Kreitzer, David L

[-- Attachment #1: Type: text/plain, Size: 136 bytes --]

Hi,

Here is a patch to extend x86-64 psABI to support AVX-512:

http://software.intel.com/sites/default/files/319433-015.pdf


--
H.J.

[-- Attachment #2: avx512.patch --]
[-- Type: application/octet-stream, Size: 8834 bytes --]

diff --git a/low-level-sys-info.tex b/low-level-sys-info.tex
index c125a5f..b80f4e3 100644
--- a/low-level-sys-info.tex
+++ b/low-level-sys-info.tex
@@ -26,7 +26,8 @@ object, and the term \emph{\textindex{\sixteenbyte{}}} refers to a
 
 Figure~\ref{basic-types} shows the correspondence between ISO C's
 scalar types and the processor's.  \code{__int128}, \code{__float128},
-\code{__m64}, \code{__m128} and \code{__m256} types are optional.
+\code{__m64}, \code{__m128}, \code{__m256} and \code{__m512} types are
+optional.
 
 \begin{figure}
   \caption{Scalar Types}\label{basic-types}
@@ -104,6 +105,8 @@ scalar types and the processor's.  \code{__int128}, \code{__float128},
     & \texttt{__m128}$^{\dagger\dagger}$ & 16 & 16 & SSE and SSE-2 \\
     \cline{2-5}
     & \texttt{__m256}$^{\dagger\dagger}$ & 32 & 32 & AVX \\
+    \cline{2-5}
+    & \texttt{__m512}$^{\dagger\dagger}$ & 64 & 64 & AVX-512 \\
 \noalign{\smallskip}
 \cline{1-5}
 \multicolumn{3}{l}{\small $^\dagger$ This type is called \texttt{bool}
@@ -148,7 +151,8 @@ Like the Intel386 architecture, the \xARCH architecture in general
 does not require all data accesses to be properly aligned.  Misaligned
 data accesses are slower than aligned accesses
 but otherwise behave identically.  The only exceptions are that
-\code{__m128} and \code{__m256} must always be aligned properly.
+\code{__m128}, \code{__m256} and \code{__m512} must always be aligned
+properly.
 
 \subsubsection{Aggregates and Unions}
 
@@ -215,8 +219,8 @@ integral values of a specified size.
 \end{figure}
 
 The ABI does not permit bit-fields having the type \texttt{__m64},
-\texttt{__m128} or \texttt{__m256}.  Programs using bit-fields of
-these types are not portable.
+\texttt{__m128}, \texttt{__m256} or \texttt{__m512}.  Programs using
+bit-fields of these types are not portable.
 
 Bit-fields that are neither signed nor unsigned
 always have non-negative values. Although they may have type char,
@@ -263,10 +267,17 @@ procedures active for a given thread.
 Intel AVX (Advanced Vector Extensions) provides 16 256-bit wide AVX registers
 (\reg{ymm0} - \reg{ymm15}).  The lower 128-bits of \reg{ymm0} - \reg{ymm15}
 are aliased to the respective 128b-bit SSE registers (\reg{xmm0} -
-\reg{xmm15}). For purposes of parameter passing and function return,
-\reg{xmmN} and \reg{ymmN} refer to the same register. Only one of them
-can be used at the same time.  We use vector register to refer to either
-SSE or AVX register.
+\reg{xmm15}).  Intel AVX-512 provides 32 512-bit wide SIMD registers
+(\reg{zmm0} - \reg{zmm31}).  The lower 128-bits of \reg{zmm0} - \reg{zmm31}
+are aliased to the respective 128-bit SSE registers
+(\reg{xmm0} - \reg{xmm31}).  The lower 256-bits of \reg{zmm0} - \reg{zmm31}
+are aliased to the respective 256-bit AVX registers
+(\reg{ymm0} - \reg{ymm31}).  For purposes of parameter passing
+and function return, \reg{xmmN}, \reg{ymmN} and \reg{zmmN} refer to the
+same register. Only one of them can be used at the same time.  We use
+vector register to refer to either SSE, AVX or AVX-512 register.  In
+addition, Intel AVX-512 also provides 8 vector mask registers (\reg{k0}
+- \reg{k7}), each 64-bit wide.
 
 This subsection discusses usage of each register.  Registers \RBP, \RBX and
 \reg{r12} through \reg{r15} ``belong'' to the calling function and the
@@ -328,9 +339,10 @@ stack.  This stack grows downwards from high addresses.  Figure
 \Hrule
 \end{figure}
 
-The end of the input argument area shall be aligned on a 16 (32, if
-\texttt{__m256} is passed on stack) byte boundary.  In other
-words, the value $(\RSP + 8)$ is always a multiple of $16$ ($32$) when
+The end of the input argument area shall be aligned on a 16 (32 or 64, if
+\texttt{__m256} or \texttt{__m512} is passed on stack) byte boundary. 
+In other words, the value $(\RSP + 8)$ is always a multiple of $16$
+($32$ or $64$) when
 control is transferred to the function entry point.  The
 stack pointer, \RSP, always points to the end of the latest allocated
 stack frame.  \footnote{The conventional use of \RBP{} as a frame
@@ -393,6 +405,9 @@ The basic types are assigned their natural classes:
 \item Arguments of type \code{__m256} are split into four \eightbyte
   chunks.  The least significant one belongs to class SSE and all the
   others to class SSEUP.
+\item Arguments of type \code{__m512} are split into eight \eightbyte
+  chunks.  The least significant one belongs to class SSE and all the
+  others to class SSEUP.
 \item The 64-bit mantissa of arguments of type \code{long double}
   belongs to class X87, the 16-bit exponent plus 6 bytes of padding
   belongs to class X87UP.
@@ -557,6 +573,7 @@ arguments & No\\
 \reg{xmm2}--\reg{xmm7} & used to pass floating point arguments & No\\
 \reg{xmm8}--\reg{xmm15} & temporary registers & No\\
 \reg{mmx0}--\reg{mmx7}& temporary registers & No\\
+\reg{k0}--\reg{k7} & temporary registers & No\\
 \reg{st0},\reg{st1} & temporary registers; used to return \code{long double} arguments & No \\
 \reg{st2}--\reg{st7} & temporary registers & No \\
 \reg{fs}& Reserved for system (as thread specific data register) & No\\
@@ -592,9 +609,9 @@ For calls that may call functions that use varargs or stdargs
 match exactly the number of registers, but must be an upper bound on
 the number of vector registers used and is in the range 0--8 inclusive.
 
-When passing \texttt{__m256} arguments to functions that use varargs
-or stdarg, function prototypes must be provided.  Otherwise, the
-run-time behavior is undefined.
+When passing \texttt{__m256} or \texttt{__m512} arguments to functions
+that use varargs or stdarg, function prototypes must be provided.
+Otherwise, the run-time behavior is undefined.
 
 \paragraph{Returning of Values}
 The returning of values is done according to the following algorithm:
@@ -652,14 +669,16 @@ int e, f, g, h, i, j, k;\\
 long double ld;\\
 double m, n;\\
 __m256 y;\\
+__m512 z;\\
 \\
 extern void func (int e, int f,\\
 \phantom{extern void func (}structparm s, int g, int h,\\
 \phantom{extern void func (}long double ld, double m,\\
 \phantom{extern void func (}__m256 y,\\
+\phantom{extern void func (}__m512 z,\\
 \phantom{extern void func (}double n, int i, int j, int k);\\
 \\
-func (e, f, s, g, h, ld, m, y, n, i, j, k);\\
+func (e, f, s, g, h, ld, m, y, z, n, i, j, k);\\
 \cline{1-1}
 \end{tabular}
 }
@@ -680,8 +699,8 @@ func (e, f, s, g, h, ld, m, y, n, i, j, k);\\
 \RDI:    &\code{e}      &\reg{xmm0}:&\code{s.d}&\code{0:} &\code{ld} \\
 \RSI:    &\code{f}      &\reg{xmm1}:&\code{m}  &\code{16:}&\code{j} \\
 \RDX:    &\code{s.a,s.b}&\reg{ymm2}:&\code{y}  &\code{24:}&\code{k} \\
-\RCX:    &\code{g}      &\reg{xmm3}:&\code{n}  &          & \\
-\reg{r8}:&\code{h}      &           &          &          & \\
+\RCX:    &\code{g}      &\reg{zmm3}:&\code{z}  &          & \\
+\reg{r8}:&\code{h}      &\reg{xmm4}:&\code{n}  &          & \\
 \reg{r9}:&\code{i}      &           &          &          & \\
 \end{tabular}
 
@@ -2015,9 +2034,10 @@ the function in vector registers.%
 %%% XXX: Really only floating pointer parameters?
 %%% XXX: Use %al or %rax?
 
-When \texttt{__m256} is passed as variable-argument, it should always
-be passed on stack. Only named \texttt{__m256} arguments may be passed
-in register as specified in section \ref{sec-calling-conventions}.
+When \texttt{__m256} or \texttt{__m512} is passed as variable-argument,
+it should always be passed on stack. Only named \texttt{__m256} and
+\texttt{__m512} arguments may be passed in register as specified in
+section \ref{sec-calling-conventions}.
 
 \begin{figure}[H]
 \Hrule
@@ -2031,10 +2051,11 @@ int a, b;\\
 long double ld;\\
 double m, n;\\
 __m256 u, y;\\
+__m512 v, z;\\
 \\
-extern void func (int a, double m, __m256 u, ...);\\
+extern void func (int a, double m, __m256 u, __m512 v, ...);\\
 \\
-func (a, m, u, b, ld, y, n);\\
+func (a, m, u, v, b, ld, y, z, n);\\
 \cline{1-1}
 \end{tabular}
 }
@@ -2054,7 +2075,8 @@ func (a, m, u, b, ld, y, n);\\
 \hline
 \RDI: &\code{a}&\reg{xmm0}:&\code{m}&\code{0:} &\code{ld} \\
 \RSI: &\code{b}&\reg{ymm1}:&\code{u}&\code{32:}&\code{y} \\
-\RAX: & 3      &\reg{xmm2}:&\code{n}& \\
+\RAX: & 3      &\reg{zmm2}:&\code{v}& \\
+\     &        &\reg{xmm3}:&\code{n}& \\
 \end{tabular}
 \end{center}
 \Hrule
@@ -2297,6 +2319,9 @@ LDT Register                    & 63    & \reg{ldtr} \\
 128-bit Media Control and Status & 64   & \reg{mxcsr} \\
 x87 Control Word                & 65    & \reg{fcw} \\
 x87 Status Word                 & 66    & \reg{fsw} \\
+Upper Vector Registers 16--31   & 67-82 & \reg{xmm16}--\reg{xmm31} \\
+Reserved                        & 83-117 & \\
+Vector Mask Registers 0--7      & 118-125 & \reg{k0}--\reg{k7} \\
 \end{tabular}
 \end{center}
 \Hrule
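
As a reading aid, the classification and stack-alignment rules that the patch adds for __m512 can be sketched in C. This is a minimal sketch and not code from the psABI or from any compiler: classify_vector and entry_rsp_aligned are hypothetical helper names, and only the 32-byte and 64-byte cases that appear in the patch are handled.

```c
#include <stddef.h>

/* The two argument classes the patch text uses for vector types. */
enum arg_class { CLASS_SSE, CLASS_SSEUP };

/*
 * Hypothetical helper: classify the eightbytes of a vector argument of
 * `size` bytes (32 for __m256, 64 for __m512).  Writes one class per
 * eightbyte into `out` and returns the number of eightbytes, or 0 if
 * the size is not handled or `out` is too small.
 */
size_t classify_vector(size_t size, enum arg_class *out, size_t out_len)
{
    if (size != 32 && size != 64)
        return 0;

    size_t n = size / 8;            /* number of eightbyte chunks */
    if (out_len < n)
        return 0;

    out[0] = CLASS_SSE;             /* least significant eightbyte */
    for (size_t i = 1; i < n; i++)
        out[i] = CLASS_SSEUP;       /* all remaining eightbytes */
    return n;
}

/*
 * Hypothetical helper: check the entry condition from the patch, that
 * (rsp + 8) is a multiple of `align` (16, or 32/64 when __m256/__m512
 * is passed on the stack).
 */
int entry_rsp_aligned(unsigned long long rsp, unsigned align)
{
    return (rsp + 8) % align == 0;
}
```

For a 64-byte argument the helper reports eight eightbytes, the first SSE and the other seven SSEUP, which matches the new \item in the patch.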

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-23 19:34 [x86-64 psABI]: Extend x86-64 psABI to support AVX-512 H.J. Lu
@ 2013-07-23 19:57 ` Joseph S. Myers
  2013-07-25 14:48   ` Gopalasubramanian, Ganesh
  2013-07-24 15:20 ` Richard Biener
  1 sibling, 1 reply; 25+ messages in thread
From: Joseph S. Myers @ 2013-07-23 19:57 UTC (permalink / raw)
  To: H.J. Lu
  Cc: GNU C Library, GCC Development, Binutils, Girkar, Milind,
	Kreitzer, David L, Ganesh.Gopalasubramanian

On Tue, 23 Jul 2013, H.J. Lu wrote:

> Here is a patch to extend x86-64 psABI to support AVX-512:

I have no comments on this patch for now - but where is the version 
control repository we should use for the ABI source code, since x86-64.org 
has been down for some time?

(I've also CC:ed the last person from AMD to post to gcc-patches, in the 
hope that they have the right contacts to get x86-64.org - website, 
mailing lists, version control - brought back up again.)

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-23 19:34 [x86-64 psABI]: Extend x86-64 psABI to support AVX-512 H.J. Lu
  2013-07-23 19:57 ` Joseph S. Myers
@ 2013-07-24 15:20 ` Richard Biener
  2013-07-24 15:27   ` H.J. Lu
  2013-07-24 18:25   ` [x86-64 psABI]: Extend x86-64 psABI to support AVX-512 Richard Henderson
  1 sibling, 2 replies; 25+ messages in thread
From: Richard Biener @ 2013-07-24 15:20 UTC (permalink / raw)
  To: H.J. Lu, GNU C Library, GCC Development, Binutils, Girkar,
	Milind, Kreitzer, David L

"H.J. Lu" <hjl.tools@gmail.com> wrote:

>Hi,
>
>Here is a patch to extend x86-64 psABI to support AVX-512:

AFAIK AVX-512 doubles the number of xmm registers. Can we get them callee-saved, please?

Thanks,
Richard.

>http://software.intel.com/sites/default/files/319433-015.pdf
>
>
>--
>H.J.



* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 15:20 ` Richard Biener
@ 2013-07-24 15:27   ` H.J. Lu
  2013-07-24 15:38     ` Joseph S. Myers
  2013-07-24 17:34     ` Richard Biener
  2013-07-24 18:25   ` [x86-64 psABI]: Extend x86-64 psABI to support AVX-512 Richard Henderson
  1 sibling, 2 replies; 25+ messages in thread
From: H.J. Lu @ 2013-07-24 15:27 UTC (permalink / raw)
  To: Richard Biener
  Cc: GNU C Library, GCC Development, Binutils, Girkar, Milind,
	Kreitzer, David L

On Wed, Jul 24, 2013 at 8:23 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> "H.J. Lu" <hjl.tools@gmail.com> wrote:
>
>>Hi,
>>
>>Here is a patch to extend x86-64 psABI to support AVX-512:
>
> Afaik avx 512 doubles the amount of xmm registers. Can we get them callee saved please?
>

Making them callee-saved means we need to change ld.so to
preserve them, and we need to change the unwind library to
support them.  It is certainly doable.

--
H.J.


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 15:27   ` H.J. Lu
@ 2013-07-24 15:38     ` Joseph S. Myers
  2013-07-24 17:34     ` Richard Biener
  1 sibling, 0 replies; 25+ messages in thread
From: Joseph S. Myers @ 2013-07-24 15:38 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Richard Biener, GNU C Library, GCC Development, Binutils, Girkar,
	Milind, Kreitzer, David L

On Wed, 24 Jul 2013, H.J. Lu wrote:

> > Afaik avx 512 doubles the amount of xmm registers. Can we get them 
> > callee saved please?
> 
> Make them callee saved means we need to change ld.so to
> preserve them and we need to change unwind library to
> support them.  It is certainly doable.

And setjmp/longjmp (with consequent versioning implications if there isn't 
enough space in jmp_buf).  Avoiding the need for such library changes in 
order to use new instruction set features is why it's usual to make new 
registers (or new bits of existing registers) call-clobbered.
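
To put rough numbers on the jmp_buf concern, here is a minimal C sketch. The helper names are hypothetical and the register counts are only the ones floated in this thread; nothing here reflects how any libc actually lays out jmp_buf.

```c
#include <setjmp.h>
#include <stddef.h>

/*
 * Hypothetical helper: bytes needed to save `nregs` callee-saved vector
 * registers that are each `width_bits` wide.
 */
size_t vector_save_area_bytes(unsigned nregs, unsigned width_bits)
{
    return (size_t)nregs * (width_bits / 8);
}

/*
 * Hypothetical helper: would `extra` more bytes still fit in jmp_buf if
 * `used` bytes are already taken?  sizeof(jmp_buf) is baked into every
 * existing binary, which is why growing it has versioning implications.
 */
int jmp_buf_has_room(size_t used, size_t extra)
{
    return used + extra <= sizeof(jmp_buf);
}
```

Under these assumptions, saving 16 registers at 512 bits needs 1024 bytes, and saving 8 needs 512 bytes, in addition to whatever setjmp already stores.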

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 15:27   ` H.J. Lu
  2013-07-24 15:38     ` Joseph S. Myers
@ 2013-07-24 17:34     ` Richard Biener
  2013-07-24 17:42       ` H.J. Lu
                         ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Richard Biener @ 2013-07-24 17:34 UTC (permalink / raw)
  To: H.J. Lu
  Cc: GNU C Library, GCC Development, Binutils, Girkar, Milind,
	Kreitzer, David L

"H.J. Lu" <hjl.tools@gmail.com> wrote:

>On Wed, Jul 24, 2013 at 8:23 AM, Richard Biener
><richard.guenther@gmail.com> wrote:
>> "H.J. Lu" <hjl.tools@gmail.com> wrote:
>>
>>>Hi,
>>>
>>>Here is a patch to extend x86-64 psABI to support AVX-512:
>>
>> Afaik avx 512 doubles the amount of xmm registers. Can we get them
>callee saved please?
>>
>
>Make them callee saved means we need to change ld.so to
>preserve them and we need to change unwind library to
>support them.  It is certainly doable.

IMHO it was a mistake to not have any callee saved xmm register in the original abi - we should fix this at this opportunity. Loops with function calls are not that uncommon.
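
The kind of loop meant here can be sketched as follows. This is only an illustration under stated assumptions: sum_mapped and twice are hypothetical names, and the comments describe what a compiler generally has to do when every xmm register is call-clobbered, not the output of any particular compiler.

```c
#include <stddef.h>

/* Illustrative callee; stands in for any function the loop cannot inline. */
double twice(double v)
{
    return 2.0 * v;
}

/*
 * Sum f(x[i]) over an array.  The accumulator would ideally stay in an
 * xmm register for the whole loop.  With no callee-saved xmm registers,
 * its value cannot be kept in a vector register across the call to f,
 * so it has to be spilled to the stack and reloaded on every iteration.
 */
double sum_mapped(const double *x, size_t n, double (*f)(double))
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += f(x[i]);     /* acc is live across this call */
    return acc;
}
```

With a few callee-saved vector registers, acc could in principle stay in a register across the call, at the cost of f having to save and restore that register if it uses it.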

Richard.

>--
>H.J.



* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 17:34     ` Richard Biener
@ 2013-07-24 17:42       ` H.J. Lu
  2013-07-24 17:55         ` Peter Bergner
  2013-07-24 18:14       ` Ondřej Bílka
  2013-07-25  3:07       ` Jakub Jelinek
  2 siblings, 1 reply; 25+ messages in thread
From: H.J. Lu @ 2013-07-24 17:42 UTC (permalink / raw)
  To: Richard Biener
  Cc: GNU C Library, GCC Development, Binutils, Girkar, Milind,
	Kreitzer, David L

On Wed, Jul 24, 2013 at 10:36 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> "H.J. Lu" <hjl.tools@gmail.com> wrote:
>
>>On Wed, Jul 24, 2013 at 8:23 AM, Richard Biener
>><richard.guenther@gmail.com> wrote:
>>> "H.J. Lu" <hjl.tools@gmail.com> wrote:
>>>
>>>>Hi,
>>>>
>>>>Here is a patch to extend x86-64 psABI to support AVX-512:
>>>
>>> Afaik avx 512 doubles the amount of xmm registers. Can we get them
>>callee saved please?
>>>
>>
>>Make them callee saved means we need to change ld.so to
>>preserve them and we need to change unwind library to
>>support them.  It is certainly doable.
>
> IMHO it was a mistake to not have any callee saved xmm register in the original abi - we should fix this at this opportunity. Loops with function calls are not that uncommon.
>

Are there any other Linux targets with callee saved vector registers?


--
H.J.


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 17:42       ` H.J. Lu
@ 2013-07-24 17:55         ` Peter Bergner
  2013-07-24 19:22           ` H.J. Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Bergner @ 2013-07-24 17:55 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Richard Biener, GNU C Library, GCC Development, Binutils, Girkar,
	Milind, Kreitzer, David L

On Wed, 2013-07-24 at 10:42 -0700, H.J. Lu wrote:
> Are there any other Linux targets with callee saved vector registers?

Yes, on POWER.  From our ABI:

  On processors with the VMX feature.
    v0-v1 Volatile scratch registers
    v2-v13 Volatile vector parameters registers
    v14-v19 Volatile scratch registers
    v20-v31 Non-volatile registers

I'll note that the new VSX register state we recently added with power7
was made volatile, but then we already had these non-volatile altivec
regs to use.
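
The VMX register split listed above can be written as a small lookup. This is a minimal sketch that encodes only the four ranges quoted in this message; the enum and function names are hypothetical.

```c
/* Roles taken from the ranges quoted above; names are illustrative. */
enum vmx_role { VMX_SCRATCH, VMX_PARAM, VMX_NONVOLATILE, VMX_INVALID };

/* Map a VMX register number (v0..v31) to its role in the quoted table. */
enum vmx_role vmx_register_role(int n)
{
    if (n < 0 || n > 31)
        return VMX_INVALID;
    if (n <= 1)
        return VMX_SCRATCH;         /* v0-v1: volatile scratch */
    if (n <= 13)
        return VMX_PARAM;           /* v2-v13: volatile parameters */
    if (n <= 19)
        return VMX_SCRATCH;         /* v14-v19: volatile scratch */
    return VMX_NONVOLATILE;         /* v20-v31: non-volatile */
}

/* Count the non-volatile (callee-saved) registers in the table. */
int vmx_nonvolatile_count(void)
{
    int count = 0;
    for (int n = 0; n < 32; n++)
        if (vmx_register_role(n) == VMX_NONVOLATILE)
            count++;
    return count;
}
```

By this table, 12 of the 32 VMX registers are callee-saved.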

Peter



* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 17:34     ` Richard Biener
  2013-07-24 17:42       ` H.J. Lu
@ 2013-07-24 18:14       ` Ondřej Bílka
  2013-07-25  3:07       ` Jakub Jelinek
  2 siblings, 0 replies; 25+ messages in thread
From: Ondřej Bílka @ 2013-07-24 18:14 UTC (permalink / raw)
  To: Richard Biener
  Cc: H.J. Lu, GNU C Library, GCC Development, Binutils, Girkar,
	Milind, Kreitzer, David L

On Wed, Jul 24, 2013 at 07:36:31PM +0200, Richard Biener wrote:
> "H.J. Lu" <hjl.tools@gmail.com> wrote:
> 
> >On Wed, Jul 24, 2013 at 8:23 AM, Richard Biener
> ><richard.guenther@gmail.com> wrote:
> >> "H.J. Lu" <hjl.tools@gmail.com> wrote:
> >>
> >>>Hi,
> >>>
> >>>Here is a patch to extend x86-64 psABI to support AVX-512:
> >>
> >> Afaik avx 512 doubles the amount of xmm registers. Can we get them
> >callee saved please?
> >>
> >
> >Make them callee saved means we need to change ld.so to
> >preserve them and we need to change unwind library to
> >support them.  It is certainly doable.
> 
> IMHO it was a mistake to not have any callee saved xmm register in the original abi - we should fix this at this opportunity. Loops with function calls are not that uncommon.
>
I also noticed this problem, and the best solution I came up with is an
analogue of __attribute__((fastcall)).

This would make it possible for libraries to add versioned symbols that use
the attribute and migrate to a saner calling convention.

> Richard.
> 
> >--
> >H.J.
> 


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 15:20 ` Richard Biener
  2013-07-24 15:27   ` H.J. Lu
@ 2013-07-24 18:25   ` Richard Henderson
  2013-07-24 18:52     ` Ondřej Bílka
  2013-07-30 13:55     ` Kirill Yukhin
  1 sibling, 2 replies; 25+ messages in thread
From: Richard Henderson @ 2013-07-24 18:25 UTC (permalink / raw)
  To: Richard Biener
  Cc: H.J. Lu, GNU C Library, GCC Development, Binutils, Girkar,
	Milind, Kreitzer, David L

On 07/24/2013 05:23 AM, Richard Biener wrote:
> "H.J. Lu" <hjl.tools@gmail.com> wrote:
> 
>> Hi,
>>
>> Here is a patch to extend x86-64 psABI to support AVX-512:
> 
> Afaik avx 512 doubles the amount of xmm registers. Can we get them callee saved please?

Having them callee saved pre-supposes that one knows the width of the register.

There's room in the instruction set for avx1024.  Does anyone believe that is
not going to appear in the next few years?


r~


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 18:25   ` [x86-64 psABI]: Extend x86-64 psABI to support AVX-512 Richard Henderson
@ 2013-07-24 18:52     ` Ondřej Bílka
  2013-07-25 12:17       ` Janne Blomqvist
  2013-07-30 13:55     ` Kirill Yukhin
  1 sibling, 1 reply; 25+ messages in thread
From: Ondřej Bílka @ 2013-07-24 18:52 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Richard Biener, H.J. Lu, GNU C Library, GCC Development,
	Binutils, Girkar, Milind, Kreitzer, David L

On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> On 07/24/2013 05:23 AM, Richard Biener wrote:
> > "H.J. Lu" <hjl.tools@gmail.com> wrote:
> > 
> >> Hi,
> >>
> >> Here is a patch to extend x86-64 psABI to support AVX-512:
> > 
> > Afaik avx 512 doubles the amount of xmm registers. Can we get them callee saved please?
> 
> Having them callee saved pre-supposes that one knows the width of the register.
> 
> There's room in the instruction set for avx1024.  Does anyone believe that is
> not going to appear in the next few years?
> 
It would be a mistake for Intel to focus on avx1024. You hit diminishing
returns, and only a few workloads would make use of loading 128 bytes at once.
The problem with vectorization is that it becomes memory bound, so you will
not gain much because performance is dominated by cache throughput.

You would get a bigger speedup from more effective pipelining, more
fusion...


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 17:55         ` Peter Bergner
@ 2013-07-24 19:22           ` H.J. Lu
  0 siblings, 0 replies; 25+ messages in thread
From: H.J. Lu @ 2013-07-24 19:22 UTC (permalink / raw)
  To: Peter Bergner
  Cc: Richard Biener, GNU C Library, GCC Development, Binutils, Girkar,
	Milind, Kreitzer, David L

On Wed, Jul 24, 2013 at 10:55 AM, Peter Bergner <bergner@vnet.ibm.com> wrote:
> On Wed, 2013-07-24 at 10:42 -0700, H.J. Lu wrote:
>> Are there any other Linux targets with callee saved vector registers?
>
> Yes, on POWER.  From our ABI:
>
>   On processors with the VMX feature.
>     v0-v1 Volatile scratch registers
>     v2-v13 Volatile vector parameters registers
>     v14-v19 Volatile scratch registers
>     v20-v31 Non-volatile registers
>
> I'll note that the new VSX register state we recently added with power7
> were made volatile, but then we already had these non-volatile altivec
> regs to use.

How do you save/restore those vector registers for
exceptions? The unwinder in libgcc uses _Unwind_Word
to save and restore registers in a DWARF unwind frame.
It doesn't support anything wider than _Unwind_Word,
which is usually smaller than a vector register.
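
The width mismatch can be made concrete with a small sketch. It does not include libgcc's unwind.h; unwind_word_t is a stand-in typedef that assumes _Unwind_Word is 64 bits wide, as on x86-64, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for _Unwind_Word, taken to be 64 bits here. */
typedef uint64_t unwind_word_t;

/* Can a register of `width_bits` be held in a single unwind word? */
int fits_in_unwind_word(unsigned width_bits)
{
    return width_bits / 8 <= sizeof(unwind_word_t);
}

/* How many unwind words would a register of `width_bits` occupy? */
size_t words_per_register(unsigned width_bits)
{
    size_t bytes = width_bits / 8;
    return (bytes + sizeof(unwind_word_t) - 1) / sizeof(unwind_word_t);
}
```

Under that assumption a 64-bit general purpose register fits in one word, while a 512-bit zmm register would need eight, which a word-at-a-time interface cannot express directly.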

--
H.J.


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 17:34     ` Richard Biener
  2013-07-24 17:42       ` H.J. Lu
  2013-07-24 18:14       ` Ondřej Bílka
@ 2013-07-25  3:07       ` Jakub Jelinek
  2013-07-25  7:09         ` Ondřej Bílka
  2 siblings, 1 reply; 25+ messages in thread
From: Jakub Jelinek @ 2013-07-25  3:07 UTC (permalink / raw)
  To: Richard Biener
  Cc: H.J. Lu, GNU C Library, GCC Development, Binutils, Girkar,
	Milind, Kreitzer, David L

On Wed, Jul 24, 2013 at 07:36:31PM +0200, Richard Biener wrote:
> >Make them callee saved means we need to change ld.so to
> >preserve them and we need to change unwind library to
> >support them.  It is certainly doable.
> 
> IMHO it was a mistake to not have any callee saved xmm register in the
> original abi - we should fix this at this opportunity.  Loops with
> function calls are not that uncommon.

I've raised that earlier already.  One issue with that, beyond having to
teach unwinders about this (for the dynamic linker, if you mean only lazy PLT
resolving, it is only a matter of whether the dynamic linker itself has been
built with a compiler that would clobber those registers anywhere), is that,
as history shows, the vector registers keep growing over time.
So if we reserve now either 8 or all 16 of the zmm16 to zmm31 registers as call
saved, do we save them as 512 bit registers, or say 1024 bit already?
If just 512 bit, then the next time the vector registers grow in size (will
they?), would we have just the low parts of the 1024 bit registers call saved
and the upper halves call clobbered?  (I guess that is the case for the M$Win
64-bit ABI now, just with 128 bit vs. more.)

But yeah, it would be nice to have some call saved ones.
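
The "low part saved, upper part clobbered" outcome can be modelled in plain C. This is a toy model under loud assumptions: the 1024-bit register is hypothetical, wide_reg is just a byte array, and callee_model stands for a callee built against an ABI that only knows about the low saved_bytes of the register.

```c
#include <stddef.h>
#include <string.h>

#define REG_BYTES 128   /* hypothetical 1024-bit register */

typedef struct { unsigned char b[REG_BYTES]; } wide_reg;

/*
 * Model a callee that preserves only the low `saved_bytes` of the
 * register but overwrites the full register in its body.
 */
void callee_model(wide_reg *r, size_t saved_bytes)
{
    unsigned char save[REG_BYTES];
    if (saved_bytes > REG_BYTES)
        saved_bytes = REG_BYTES;

    memcpy(save, r->b, saved_bytes);    /* prologue: save the low part */
    memset(r->b, 0xAA, REG_BYTES);      /* body: uses the whole register */
    memcpy(r->b, save, saved_bytes);    /* epilogue: restore the low part */
}

/* Count how many bytes of the caller's value survive the call. */
size_t surviving_bytes(size_t saved_bytes)
{
    wide_reg r, before;
    for (size_t i = 0; i < REG_BYTES; i++)
        r.b[i] = (unsigned char)i;      /* 0..127, never equal to 0xAA */
    before = r;

    callee_model(&r, saved_bytes);

    size_t same = 0;
    for (size_t i = 0; i < REG_BYTES; i++)
        if (r.b[i] == before.b[i])
            same++;
    return same;
}
```

With 64 saved bytes (512 bits), only the low half of the 1024-bit value survives, so a caller using the full width would still have to treat the register as clobbered.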

	Jakub


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-25  3:07       ` Jakub Jelinek
@ 2013-07-25  7:09         ` Ondřej Bílka
  2013-07-25 16:51           ` Rich Felker
  0 siblings, 1 reply; 25+ messages in thread
From: Ondřej Bílka @ 2013-07-25  7:09 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, H.J. Lu, GNU C Library, GCC Development,
	Binutils, Girkar, Milind, Kreitzer, David L

On Thu, Jul 25, 2013 at 05:06:55AM +0200, Jakub Jelinek wrote:
> On Wed, Jul 24, 2013 at 07:36:31PM +0200, Richard Biener wrote:
> > >Make them callee saved means we need to change ld.so to
> > >preserve them and we need to change unwind library to
> > >support them.  It is certainly doable.
> > 
> > IMHO it was a mistake to not have any callee saved xmm register in the
> > original abi - we should fix this at this opportunity.  Loops with
> > function calls are not that uncommon.
> 
> I've raised that earlier already.  One issue with that beyond having to
> teach unwinders about this (dynamic linker if you mean only for the lazy PLT
> resolving is only a matter of whether the dynamic linker itself has been
> built with a compiler that would clobber those registers anywhere) is that
> as history shows, the vector registers keep growing over time.
> So if we reserve now either 8 or all 16 zmm16 to zmm31 registers as call
> saved, do we save them as 512 bit registers, or say 1024 bit already?

We shouldn't save them all, as we would often need to save registers
unnecessarily in leaf functions. I am fine with 8. In practice 4 should be
enough for most use cases.

> If just 512 bit, then when next time the vector registers grow in size (will
> they?), would we have just low parts of the 1024 bits registers call saved
> and upper half call clobbered (I guess that is the case for M$Win 64-bit ABI
> now, just with 128 bit vs. more).
>
I do not think that 1024 bit registers will come in the next ten years.
If they come, then call clobbered is better. The full 1024 bits would be used
rarely, given that in most cases we will use them just to store 64 bits
for doubles.
 
> But yeah, it would be nice to have some call saved ones.
> 
> 	Jakub


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 18:52     ` Ondřej Bílka
@ 2013-07-25 12:17       ` Janne Blomqvist
  2013-07-25 12:47         ` Ondřej Bílka
  0 siblings, 1 reply; 25+ messages in thread
From: Janne Blomqvist @ 2013-07-25 12:17 UTC (permalink / raw)
  To: Ondřej Bílka
  Cc: Richard Henderson, Richard Biener, H.J. Lu, GNU C Library,
	GCC Development, Binutils, Girkar, Milind, Kreitzer, David L

On Wed, Jul 24, 2013 at 9:52 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
>> On 07/24/2013 05:23 AM, Richard Biener wrote:
>> > "H.J. Lu" <hjl.tools@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> Here is a patch to extend x86-64 psABI to support AVX-512:
>> >
>> > Afaik avx 512 doubles the amount of xmm registers. Can we get them callee saved please?
>>
>> Having them callee saved pre-supposes that one knows the width of the register.
>>
>> There's room in the instruction set for avx1024.  Does anyone believe that is
>> not going to appear in the next few years?
>>
> It would be mistake for intel to focus on avx1024. You hit diminishing
> returns and only few workloads would utilize loading 128 bytes at once.
> Problem with vectorization is that it becomes memory bound so you will
> not got much because performance is dominated by cache throughput.
>
> You would get bigger speedup from more effective pipelining, more
> fusion...

ISTR that one of the main reasons "long" vector ISAs did so well on
some workloads was not that the vector length was big, per se, but
rather that the scatter/gather instructions these ISAs typically have
allowed them to extract much more parallelism from the memory
subsystem. The typical example is sparse matrix style problems, but
I suppose other types of problems with indirect accesses could benefit
as well. Deeper OoO buffers would in principle allow the same memory
level parallelism extraction, but those apparently have quite steep
power and silicon area cost scaling (O(n**2) or maybe even O(n**3)),
making really deep buffers impractical.

And, IIRC, scatter/gather instructions have been featured as of some
recent-ish AVX-something version. That being said, maybe current
cache-based memory subsystems are different enough from the vector
supercomputers of yore that the above doesn't hold to the same extent
anymore.


--
Janne Blomqvist


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-25 12:17       ` Janne Blomqvist
@ 2013-07-25 12:47         ` Ondřej Bílka
  0 siblings, 0 replies; 25+ messages in thread
From: Ondřej Bílka @ 2013-07-25 12:47 UTC (permalink / raw)
  To: Janne Blomqvist
  Cc: Richard Henderson, Richard Biener, H.J. Lu, GNU C Library,
	GCC Development, Binutils, Girkar, Milind, Kreitzer, David L

On Thu, Jul 25, 2013 at 03:17:43PM +0300, Janne Blomqvist wrote:
> On Wed, Jul 24, 2013 at 9:52 PM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> >> On 07/24/2013 05:23 AM, Richard Biener wrote:
> >> > "H.J. Lu" <hjl.tools@gmail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> Here is a patch to extend x86-64 psABI to support AVX-512:
> >> >
> >> > Afaik avx 512 doubles the amount of xmm registers. Can we get them callee saved please?
> >>
> >> Having them callee saved pre-supposes that one knows the width of the register.
> >>
> >> There's room in the instruction set for avx1024.  Does anyone believe that is
> >> not going to appear in the next few years?
> >>
> > It would be mistake for intel to focus on avx1024. You hit diminishing
> > returns and only few workloads would utilize loading 128 bytes at once.
> > Problem with vectorization is that it becomes memory bound so you will
> > not got much because performance is dominated by cache throughput.
> >
> > You would get bigger speedup from more effective pipelining, more
> > fusion...
> 
> ISTR that one of the main reason "long" vector ISA's did so well on
> some workloads was not that the vector length was big, per se, but
> rather that the scatter/gather instructions these ISA's typically have
> allowed them to extract much more parallelism from the memory
> subsystem. The typical example being sparse matrix style problems, but
> I suppose other types of problems with indirect accesses could benefit
> as well. Deeper OoO buffers would in principle allow the same memory
> level parallelism extraction, but those apparently have quite steep
> power and silicon area cost scaling (O(n**2) or maybe even O(n**3)),
> making really deep buffers impractical.
> 
> And, IIRC scatter/gather instructions are featured as of some
> recent-ish AVX-something version. That being said, maybe current
> cache-based memory subsystems are different enough from the vector
> supercomputers of yore that the above doesn't hold to the same extent
> anymore..
>
Also this depends how many details intel got right. One example is
pmovmsk instruction. It is trivial to implement in silicon and gives
advantage over other architectures.

When a problem is 'find elements in array that satisfy some expression'
then without pmovmsk or equivalent finding what changed is relatively expensive.

One problem is that, depending on the profile, you may spend the majority
of the time on small sizes. So you need efficient branches for these
sizes (gcc does not handle that well yet). Then you get the problem that
this increases icache pressure.

Another problem is that you could often benefit from vector
instructions if you could read/write more memory than was requested.
Reading can be done inexpensively by checking whether the access crosses
a page boundary; writing is the problem, so we take a suboptimal path
just to write only the data that changed.
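
A rough sketch of the page-crossing check mentioned above. The 4096-byte
page size and the 16-byte vector width are assumptions for illustration,
and load_stays_in_page is a made-up name. Reading past the end of an
object is still outside what ISO C allows, so this is the kind of test a
hand-written assembly routine would do, not something portable C code
can rely on.

```c
#include <stdint.h>

enum { PAGE_SIZE_ASSUMED = 4096, VEC_BYTES = 16 };

/* Nonzero if a VEC_BYTES-wide load starting at p stays inside one page,
   so it cannot fault on a following unmapped page. */
static int load_stays_in_page(const void *p)
{
    uintptr_t offset = (uintptr_t)p & (PAGE_SIZE_ASSUMED - 1);
    return offset <= PAGE_SIZE_ASSUMED - VEC_BYTES;
}
```

If the check fails, the routine falls back to a slower path for that one
block.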

This could also be solved in hardware if a masked move instruction
were guaranteed to access only the memory that changed and thus avoid
possible race conditions in the unchanged parts.
> 
> --
> Janne Blomqvist


* RE: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-23 19:57 ` Joseph S. Myers
@ 2013-07-25 14:48   ` Gopalasubramanian, Ganesh
  0 siblings, 0 replies; 25+ messages in thread
From: Gopalasubramanian, Ganesh @ 2013-07-25 14:48 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: GNU C Library, GCC Development, Binutils, Girkar, Milind,
	Kreitzer, David L, H.J. Lu

Hi,

This got lost in our site-consolidation efforts.
We are working to make it active again.
Will update the community soon.

Regards
Ganesh
________________________________________
From: Joseph Myers [joseph@codesourcery.com]
Sent: Tuesday, July 23, 2013 2:57 PM
To: H.J. Lu
Cc: GNU C Library; GCC Development; Binutils; Girkar, Milind; Kreitzer, David L; Gopalasubramanian, Ganesh
Subject: Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512

On Tue, 23 Jul 2013, H.J. Lu wrote:

> Here is a patch to extend x86-64 psABI to support AVX-512:

I have no comments on this patch for now - but where is the version
control repository we should use for the ABI source code, since x86-64.org
has been down for some time?

(I've also CC:ed the last person from AMD to post to gcc-patches, in the
hope that they have the right contacts to get x86-64.org - website,
mailing lists, version control - brought back up again.)

--
Joseph S. Myers
joseph@codesourcery.com



* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-25  7:09         ` Ondřej Bílka
@ 2013-07-25 16:51           ` Rich Felker
  2013-07-27 15:44             ` Ondřej Bílka
  0 siblings, 1 reply; 25+ messages in thread
From: Rich Felker @ 2013-07-25 16:51 UTC (permalink / raw)
  To: Ondřej Bílka
  Cc: Jakub Jelinek, Richard Biener, H.J. Lu, GNU C Library,
	GCC Development, Binutils, Girkar, Milind, Kreitzer, David L

On Thu, Jul 25, 2013 at 08:55:38AM +0200, Ondřej Bílka wrote:
> On Thu, Jul 25, 2013 at 05:06:55AM +0200, Jakub Jelinek wrote:
> > On Wed, Jul 24, 2013 at 07:36:31PM +0200, Richard Biener wrote:
> > > >Make them callee saved means we need to change ld.so to
> > > >preserve them and we need to change unwind library to
> > > >support them.  It is certainly doable.
> > > 
> > > IMHO it was a mistake to not have any callee saved xmm register in the
> > > original abi - we should fix this at this opportunity.  Loops with
> > > function calls are not that uncommon.
> > 
> > I've raised that earlier already.  One issue with that beyond having to
> > teach unwinders about this (dynamic linker if you mean only for the lazy PLT
> > resolving is only a matter of whether the dynamic linker itself has been
> > built with a compiler that would clobber those registers anywhere) is that
> > as history shows, the vector registers keep growing over time.
> > So if we reserve now either 8 or all 16 zmm16 to zmm31 registers as call
> > saved, do we save them as 512 bit registers, or say 1024 bit already?
> 
> We shouldn't save them all as we would often need to unnecessarily save
> register in leaf function. I am fine with 8. In practice 4 should be
> enough for most use cases. 

You can't add call-saved registers without breaking the ABI, because
they need to be saved in the jmp_buf, which does not have space for
them.

Also, unless you add them at the same time the registers are added to
the machine (so there's no existing code using those registers),
you'll have ABI problems like this: function using the new call-saved
registers calls qsort, which calls application code, which assumes the
registers are call-clobbered and clobbers them; after qsort returns,
the original caller's state is gone.

Adding call-saved registers to an existing psABI is just fundamentally
misguided.

Rich


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-25 16:51           ` Rich Felker
@ 2013-07-27 15:44             ` Ondřej Bílka
  2013-07-27 16:13               ` Rich Felker
  0 siblings, 1 reply; 25+ messages in thread
From: Ondřej Bílka @ 2013-07-27 15:44 UTC (permalink / raw)
  To: Rich Felker
  Cc: Jakub Jelinek, Richard Biener, H.J. Lu, GNU C Library,
	GCC Development, Binutils, Girkar, Milind, Kreitzer, David L

On Thu, Jul 25, 2013 at 12:50:53PM -0400, Rich Felker wrote:
> On Thu, Jul 25, 2013 at 08:55:38AM +0200, Ondřej Bílka wrote:
> > On Thu, Jul 25, 2013 at 05:06:55AM +0200, Jakub Jelinek wrote:
> > > On Wed, Jul 24, 2013 at 07:36:31PM +0200, Richard Biener wrote:
> > > > >Make them callee saved means we need to change ld.so to
> > > > >preserve them and we need to change unwind library to
> > > > >support them.  It is certainly doable.
> > > > 
> > > > IMHO it was a mistake to not have any callee saved xmm register in the
> > > > original abi - we should fix this at this opportunity.  Loops with
> > > > function calls are not that uncommon.
> > > 
> > > I've raised that earlier already.  One issue with that beyond having to
> > > teach unwinders about this (dynamic linker if you mean only for the lazy PLT
> > > resolving is only a matter of whether the dynamic linker itself has been
> > > built with a compiler that would clobber those registers anywhere) is that
> > > as history shows, the vector registers keep growing over time.
> > > So if we reserve now either 8 or all 16 zmm16 to zmm31 registers as call
> > > saved, do we save them as 512 bit registers, or say 1024 bit already?
> > 
> > We shouldn't save them all as we would often need to unnecessarily save
> > register in leaf function. I am fine with 8. In practice 4 should be
> > enough for most use cases. 
> 
> You can't add call-saved registers without breaking the ABI, because
> they need to be saved in the jmp_buf, which does not have space for
> them.
>
Well you can. Use versioning: the structure will not change and the
layout for the old setjmp/longjmp is unchanged. The new setjmp sets the
jump address field to the address of the jmp_buf itself, to distinguish
it from the first case. Then for each thread we keep a stack with the
extra space needed to save the additional registers.
When setjmp/longjmp is called we prune frames from exited functions.

 
> Also, unless you add them at the same time the registers are added to
> the machine (so there's no existing code using those registers),
> you'll have ABI problems like this: function using the new call-saved
> registers calls qsort, which calls application code, which assumes the
> registers are call-clobbered and clobbers them; after qsort returns,
> the original caller's state is gone.
>
What are you talking about? Do you mean that user wrongly marked qsort
as a function that does not clobber arguments?

> Adding call-saved registers to an existing psABI is just fundamentally
> misguided.
> 
> Rich


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-27 15:44             ` Ondřej Bílka
@ 2013-07-27 16:13               ` Rich Felker
  2013-07-27 16:24                 ` Rich Felker
  2013-07-27 18:27                 ` Support setjmp in x86-64 psABI with AVX-512 Ondřej Bílka
  0 siblings, 2 replies; 25+ messages in thread
From: Rich Felker @ 2013-07-27 16:13 UTC (permalink / raw)
  To: Ondřej Bílka
  Cc: Jakub Jelinek, Richard Biener, H.J. Lu, GNU C Library,
	GCC Development, Binutils, Girkar, Milind, Kreitzer, David L

On Sat, Jul 27, 2013 at 05:44:05PM +0200, Ondřej Bílka wrote:
> On Thu, Jul 25, 2013 at 12:50:53PM -0400, Rich Felker wrote:
> > On Thu, Jul 25, 2013 at 08:55:38AM +0200, Ondřej Bílka wrote:
> > > On Thu, Jul 25, 2013 at 05:06:55AM +0200, Jakub Jelinek wrote:
> > > > On Wed, Jul 24, 2013 at 07:36:31PM +0200, Richard Biener wrote:
> > > > > >Make them callee saved means we need to change ld.so to
> > > > > >preserve them and we need to change unwind library to
> > > > > >support them.  It is certainly doable.
> > > > > 
> > > > > IMHO it was a mistake to not have any callee saved xmm register in the
> > > > > original abi - we should fix this at this opportunity.  Loops with
> > > > > function calls are not that uncommon.
> > > > 
> > > > I've raised that earlier already.  One issue with that beyond having to
> > > > teach unwinders about this (dynamic linker if you mean only for the lazy PLT
> > > > resolving is only a matter of whether the dynamic linker itself has been
> > > > built with a compiler that would clobber those registers anywhere) is that
> > > > as history shows, the vector registers keep growing over time.
> > > > So if we reserve now either 8 or all 16 zmm16 to zmm31 registers as call
> > > > saved, do we save them as 512 bit registers, or say 1024 bit already?
> > > 
> > > We shouldn't save them all as we would often need to unnecessarily save
> > > register in leaf function. I am fine with 8. In practice 4 should be
> > > enough for most use cases. 
> > 
> > You can't add call-saved registers without breaking the ABI, because
> > they need to be saved in the jmp_buf, which does not have space for
> > them.
> >
> Well you can. Use versioning, structure will not change and layout for
> old setjmp/longjmp is unchanged. For new setjmp we set jump address to
> jmp_buf address to distinguish it from first case. Then for each thread
> we keep a stack with extra space needed to save additional registers. 
> When setjmp/longjmp is called we prune frames from exited functions.

This requires unbounded storage, which does not exist. From a practical
standpoint you would either have to reserve a huge amount of storage
(e.g. double the allocated thread stack size and use half of it as
reserved space for jmp_buf) or make the calling program crash when the
small, reasonable amount of reserved space is exhausted. The latter is
highly unacceptable since the main purpose (IMO:) of jmp_buf is to
work around bad library code that can't handle resource exhaustion by
replacing its 'xmalloc' type functions with ones that longjmp to a
thread-local jmp_buf set by the caller (e.g. this is the only way to
use glib robustly).
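
For readers unfamiliar with this pattern, a minimal sketch follows. All
names (oom_recovery, xmalloc_or_jump, library_work, call_library) are
hypothetical stand-ins, not glib or glibc interfaces; the point is only
that the allocator replacement jumps back to a recovery point set by
the caller instead of aborting.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Recovery point owned by whoever is currently calling into the library. */
static _Thread_local jmp_buf *oom_recovery;

/* Hypothetical replacement for a library's abort-on-failure allocator. */
static void *xmalloc_or_jump(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        if (oom_recovery != NULL)
            longjmp(*oom_recovery, 1);   /* unwind to the caller */
        abort();                         /* nobody asked to recover */
    }
    return p;
}

/* Stand-in for library code that never checks for allocation failure. */
static void library_work(size_t n)
{
    unsigned char *buf = xmalloc_or_jump(n);
    buf[0] = 1;
    free(buf);
}

/* Returns 0 on success, -1 if the library ran out of memory. */
static int call_library(size_t n)
{
    jmp_buf env;
    if (setjmp(env) != 0) {
        oom_recovery = NULL;
        return -1;
    }
    oom_recovery = &env;
    library_work(n);
    oom_recovery = NULL;
    return 0;
}
```

If setjmp itself could fail or crash for lack of reserved space, this
recovery path would be lost exactly when it is needed.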

By the way, I do have another horrible idea for how you could do it.
glibc's jmp_buf is actually a sigjmp_buf and contains 120 wasted bytes
of sigset_t for nonexistent HURD signals. So you could store a few
registers after the actually-used part of the sigset_t.

> > Also, unless you add them at the same time the registers are added to
> > the machine (so there's no existing code using those registers),
> > you'll have ABI problems like this: function using the new call-saved
> > registers calls qsort, which calls application code, which assumes the
> > registers are call-clobbered and clobbers them; after qsort returns,
> > the original caller's state is gone.
> >
> What are you talking about? Do you mean that user wrongly marked qsort
> as a function that does not clobber arguments?

OK, you're obviously thinking of some kind of special way of tagging
individual functions as preserving new registers, rather than whole
object or shared library files, in which case it's plausible that you
can make this part work.

Rich


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-27 16:13               ` Rich Felker
@ 2013-07-27 16:24                 ` Rich Felker
  2013-07-27 18:27                 ` Support setjmp in x86-64 psABI with AVX-512 Ondřej Bílka
  1 sibling, 0 replies; 25+ messages in thread
From: Rich Felker @ 2013-07-27 16:24 UTC (permalink / raw)
  To: Ondřej Bílka
  Cc: Jakub Jelinek, Richard Biener, H.J. Lu, GNU C Library,
	GCC Development, Binutils, Girkar, Milind, Kreitzer, David L

On Sat, Jul 27, 2013 at 12:12:57PM -0400, Rich Felker wrote:
> By the way, I do have another horrible idea for how you could do it.
> glibc's jmp_buf is actually a sigjmp_buf and contains 120 wasted bytes
> of sigset_t for nonexistant HURD signals. So you could store a few
> registers after the actually-used part of the sigset_t.

And another, possibly more acceptable way to do it:

#define setjmp(x) __new_setjmp(x, (__new_jmp_buf){0}, sizeof(__new_jmp_buf))

This would allocate the extra space on the caller's stack with a
lifetime equivalent to the validity lifetime of the jmp_buf, so it
should be valid, but I'm not sure if it covers all the needed cases
for interaction between old code and new code.
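
A small sketch of the mechanism this macro relies on, with setjmp itself
left out. The types ext_slot and ext_buf, the function record_ext and
the macro SAVE_WITH_EXT are all hypothetical; the only thing shown is
that a compound literal in a macro argument gives the callee storage
that lives in the caller's frame.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical extension area.  ext_buf is an array type, like jmp_buf,
   so a compound literal of that type decays to a pointer when passed. */
typedef struct { unsigned char bytes[256]; } ext_slot;
typedef ext_slot ext_buf[1];

static ext_slot *last_ext;       /* where the extra storage was found */
static size_t last_ext_size;

/* Stand-in for __new_setjmp: it only records and fills the extra area. */
static int record_ext(ext_slot *ext, size_t size)
{
    last_ext = ext;
    last_ext_size = size;
    memset(ext->bytes, 0xAB, size);   /* pretend to save registers here */
    return 0;
}

/* The literal is allocated in the block where the macro is expanded. */
#define SAVE_WITH_EXT() record_ext((ext_buf){0}, sizeof(ext_buf))
```

In C99 and later the literal lives as long as the enclosing block, so a
use nested inside an inner block ends earlier than the function does.
That may be one of the cases that still needs checking.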

Rich


* Support setjmp in x86-64 psABI with AVX-512
  2013-07-27 16:13               ` Rich Felker
  2013-07-27 16:24                 ` Rich Felker
@ 2013-07-27 18:27                 ` Ondřej Bílka
  2013-07-27 20:09                   ` Rich Felker
  1 sibling, 1 reply; 25+ messages in thread
From: Ondřej Bílka @ 2013-07-27 18:27 UTC (permalink / raw)
  To: Rich Felker; +Cc: GNU C Library, GCC Development, Binutils

On Sat, Jul 27, 2013 at 12:12:57PM -0400, Rich Felker wrote:
> On Sat, Jul 27, 2013 at 05:44:05PM +0200, Ondřej Bílka wrote:
> > On Thu, Jul 25, 2013 at 12:50:53PM -0400, Rich Felker wrote:
> > > On Thu, Jul 25, 2013 at 08:55:38AM +0200, Ondřej Bílka wrote:
> > > You can't add call-saved registers without breaking the ABI, because
> > > they need to be saved in the jmp_buf, which does not have space for
> > > them.
> > >
> > Well you can. Use versioning, structure will not change and layout for
> > old setjmp/longjmp is unchanged. For new setjmp we set jump address to
> > jmp_buf address to distinguish it from first case. Then for each thread
> > we keep a stack with extra space needed to save additional registers. 
> > When setjmp/longjmp is called we prune frames from exited functions.
> 
> This required unbounded storage which does not exist. From a practical
> standpoint you would either have to reserve a huge amount of storage
> (e.g. double the allocated thread stack size and use half of it as
> reserved space for jmp_buf) or make the calling program crash when the

Standard trick mmap and double.
> small, reasonable amount of reserved space is exhausted. The latter is
> highly unacceptable since the main purpose (IMO:) of jmp_buf is to
> work around bad library code that can't handle resource exhaustion by
> replacing its 'xmalloc' type functions with ones that longjmp to a
> thread-local jmp_buf set by the caller (e.g. this is the only way to
> use glib robustly).
> 
Well, what I wrote is meant to work around pathological cases.
With versioning and a changed structure size I could do the trick of
distinguishing new buffers by having them point to themselves, and it
would mostly work.

It would break when a function that uses setjmp obtains the jmp_buf as
a parameter from another unit.

To avoid that we need to allocate some extra space. Most programs have
a limited number of jmp_buf instances, so never deallocating the extra
space would not cause a problem. To exceed that limit you need a
variation of these:

int rec(...){
  jmp_buf x;            /* a new jmp_buf at every recursion level */
  setjmp(x);
  ...
  rec(x);
}

while (cond){
  jmp_buf *x = malloc(sizeof(jmp_buf));  /* a new jmp_buf per iteration */
  setjmp(*x);
  ...
  free(x);
}

> By the way, I do have another horrible idea for how you could do it.

The next idea would be to hack gcc to mark all variables volatile in
functions that call setjmp.

> glibc's jmp_buf is actually a sigjmp_buf and contains 120 wasted bytes
> of sigset_t for nonexistant HURD signals. So you could store a few
> registers after the actually-used part of the sigset_t.
> 
> > > Also, unless you add them at the same time the registers are added to
> > > the machine (so there's no existing code using those registers),
> > > you'll have ABI problems like this: function using the new call-saved
> > > registers calls qsort, which calls application code, which assumes the
> > > registers are call-clobbered and clobbers them; after qsort returns,
> > > the original caller's state is gone.
> > >
> > What are you talking about? Do you mean that user wrongly marked qsort
> > as a function that does not clobber arguments?
> 
> OK, you're obviously thinking of some kind of special way of tagging
> individual functions as preserving new registers, rather than whole
> object or shared library files, in which case it's plausible that you
> can make this part work.
> 
> Rich


* Re: Support setjmp in x86-64 psABI with AVX-512
  2013-07-27 18:27                 ` Support setjmp in x86-64 psABI with AVX-512 Ondřej Bílka
@ 2013-07-27 20:09                   ` Rich Felker
  0 siblings, 0 replies; 25+ messages in thread
From: Rich Felker @ 2013-07-27 20:09 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: GNU C Library, GCC Development, Binutils

On Sat, Jul 27, 2013 at 08:27:07PM +0200, Ondřej Bílka wrote:
> On Sat, Jul 27, 2013 at 12:12:57PM -0400, Rich Felker wrote:
> > On Sat, Jul 27, 2013 at 05:44:05PM +0200, Ondřej Bílka wrote:
> > > On Thu, Jul 25, 2013 at 12:50:53PM -0400, Rich Felker wrote:
> > > > On Thu, Jul 25, 2013 at 08:55:38AM +0200, Ondřej Bílka wrote:
> > > > You can't add call-saved registers without breaking the ABI, because
> > > > they need to be saved in the jmp_buf, which does not have space for
> > > > them.
> > > >
> > > Well you can. Use versioning, structure will not change and layout for
> > > old setjmp/longjmp is unchanged. For new setjmp we set jump address to
> > > jmp_buf address to distinguish it from first case. Then for each thread
> > > we keep a stack with extra space needed to save additional registers. 
> > > When setjmp/longjmp is called we prune frames from exited functions.
> > 
> > This required unbounded storage which does not exist. From a practical
> > standpoint you would either have to reserve a huge amount of storage
> > (e.g. double the allocated thread stack size and use half of it as
> > reserved space for jmp_buf) or make the calling program crash when the
> 
> Standard trick mmap and double.

??

> > small, reasonable amount of reserved space is exhausted. The latter is
> > highly unacceptable since the main purpose (IMO:) of jmp_buf is to
> > work around bad library code that can't handle resource exhaustion by
> > replacing its 'xmalloc' type functions with ones that longjmp to a
> > thread-local jmp_buf set by the caller (e.g. this is the only way to
> > use glib robustly).
> > 
> Well what I wrote is to work around pathologic cases. 
> With versioning and changing size of structure I could do trick with
> distinguishing by pointing to itself and it would mostly work. 
> 
> It would break when function that uses setjmp obtains jmp_buf by
> parameter from other unit.

This is a fairly reasonable usage, e.g. when the jmp_buf is part of
some context structure passed around, and different call levels want
to replace it to 'handle exceptions'. Personally, I think the right
solution is to use a jmp_buf* in the context rather than a jmp_buf, so
different call levels can swap around which one it points to (and so
the ultimate caller can re-use the same jmp_buf in multiple contexts),
but I can't force everybody to do The Right Thing. If code is valid,
conforming C (and sometimes even when it's not) then it needs to be
supported by the implementation.

> To avoid it we need allocate some extra space. Most programs would have
> number of jmp_buf instances limited so not deallocating extra would not 
> cause problem. To violate that limit you need to have variation to these:

Well the number of jmp_buf instances you could use in a reasonable
sense is limited by the stack size/number of call frames, which is why
I thought a "safe" amount of space would be equal to the size of the
stack. However as you've said, one could allocate many more...

> > By the way, I do have another horrible idea for how you could do it.
> 
> Next idea would be hack gcc to mark all variables volatile in functions
> with setjmp.

That does not help. setjmp may not be backing up the caller's
variables, but rather register values belonging to

    (the caller of)^N the caller

for arbitrarily large values of N.

Rich


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-24 18:25   ` [x86-64 psABI]: Extend x86-64 psABI to support AVX-512 Richard Henderson
  2013-07-24 18:52     ` Ondřej Bílka
@ 2013-07-30 13:55     ` Kirill Yukhin
  2013-08-02 12:49       ` Kirill Yukhin
  1 sibling, 1 reply; 25+ messages in thread
From: Kirill Yukhin @ 2013-07-30 13:55 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Richard Biener, H.J. Lu, GNU C Library, GCC Development,
	Binutils, Girkar, Milind, Kreitzer, David L

On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> On 07/24/2013 05:23 AM, Richard Biener wrote:
> > "H.J. Lu" <hjl.tools@gmail.com> wrote:
> > 
> >> Hi,
> >>
> >> Here is a patch to extend x86-64 psABI to support AVX-512:
> > 
> > Afaik avx 512 doubles the amount of xmm registers. Can we get them callee saved please?
> 
> Having them callee saved pre-supposes that one knows the width of the register.

The whole architecture of SSE/AVX is based on zeroing the upper bits.
For reference, take a look at the definition of VLMAX in the Spec.
E.g. for AVX2 we had:
     vaddps %ymm1, %ymm2, %ymm3

Intuition says (at least to me) that, once compiled, this code should know nothing about an `upper' half beyond 256 bits.
But with AVX-512 we have (again, see the Spec, operation section of vaddps, VEX.256 encoded):
    DEST[31:0] = SRC1[31:0] + SRC2[31:0]
    ...
    DEST[255:224] = SRC1[255:224] + SRC2[255:224].
    DEST[MAX_VL-1:256] = 0
So, legacy code *will* change the upper 256 bits of the vector register.
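
As a rough model of the pseudocode above (an illustration, not an
emulator): the register is represented as 16 float lanes, MAX_VL is
assumed to be 512 bits, and the names vreg and vex256_addps are made up.

```c
#include <stddef.h>

enum { MAX_VL_LANES = 16 };   /* assumed MAX_VL of 512 bits, 32-bit lanes */

typedef struct { float lane[MAX_VL_LANES]; } vreg;

/* Model of a VEX.256-encoded vaddps on a machine with MAX_VL-bit wide
   registers: lanes 0..7 are added, everything above bit 255 is zeroed. */
static void vex256_addps(vreg *dest, const vreg *src1, const vreg *src2)
{
    vreg out;
    for (size_t i = 0; i < 8; i++)
        out.lane[i] = src1->lane[i] + src2->lane[i];
    for (size_t i = 8; i < MAX_VL_LANES; i++)
        out.lane[i] = 0.0f;   /* DEST[MAX_VL-1:256] = 0 */
    *dest = out;
}
```

A callee compiled for AVX2 therefore destroys the upper half of any
wider register it writes, without ever naming that register as zmm.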

The roots of this can be found in the 64-bit GPR insns: we get different behavior on 64-bit and 32-bit targets for the following sequence:
    push %eax
    ;; play with eax
    pop %eax
On a 64-bit machine the upper 32 bits of %rax will be zeroed, and if we then try to use the old value of %rax, we fail!

So, following this philosophy makes it impossible to make whole vector registers callee-saved.

BUT.

What if we make a couple of the new registers callee-saved in the sense of a *scalar* type?
So, what we can do:
    1. make only bits [0..XXX) of a vector register callee-saved.
    2. make bits [XXX..VLMAX) of the same register call-clobbered.

XXX is the number of bits to be callee-saved: 64, 80, 128 or even 512.

The advantage is that when we are doing FP scalar code, we don't need to save/restore the callee-saved part around calls:
    vaddss %xmm17, %xmm17, %xmm17
    call foo
    vaddss %xmm17, %xmm17, %xmm17

We don't care whether `foo':
    - is legacy in the AVX-512 sense: it just sees no xmm17
    - is legacy in a future-ISA sense: if this code uses 1024-bit wide registers and `foo' is AVX-512, it will save XXX bits, allowing us to continue scalar calculations without save/restore

--
Thanks, K


* Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
  2013-07-30 13:55     ` Kirill Yukhin
@ 2013-08-02 12:49       ` Kirill Yukhin
  0 siblings, 0 replies; 25+ messages in thread
From: Kirill Yukhin @ 2013-08-02 12:49 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Richard Biener, H.J. Lu, GNU C Library, GCC Development,
	Binutils, Girkar, Milind, Kreitzer, David L

[-- Attachment #1: Type: text/plain, Size: 4544 bytes --]

On 30 Jul 17:55, Kirill Yukhin wrote:
> On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> > On 07/24/2013 05:23 AM, Richard Biener wrote:
> > > "H.J. Lu" <hjl.tools@gmail.com> wrote:
> > > 
> > >> Hi,
> > >>
> > >> Here is a patch to extend x86-64 psABI to support AVX-512:
> > > 
> > > Afaik avx 512 doubles the amount of xmm registers. Can we get them callee saved please?

Hello,
I've implemented a tiny patch on top of the `avx512' branch.
It makes the low 128 bits of 8 of the new AVX-512 registers callee-saved: xmm16 through xmm23.

Here is the performance data. It seems we have a small degradation in GEOMEAN.

Workload: Spec2006
Dataset: test
Options experiment: -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast -funroll-loops -flto -fwhole-program -mavx512f
Options reference : -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast -funroll-loops -flto -fwhole-program

Benchmark	icount, 8 regs	icount, all	icount
		callee-saved	call-clobbered	decrease
--------------------------------------------------------
400.perlbench	1686198567	1682320942	-0.23%
401.bzip2	18983033855	18983033907	0.00%
403.gcc		3999481141	3999095681	-0.01%
410.bwaves	13736672428	13736640026	0.00%
416.gamess	1531782811	1531350122	-0.03%
429.mcf		3079764286	3080957858	0.04%
433.milc	14628097067	14628175244	0.00%
434.zeusmp	21336261982	21359384879	0.11%
435.gromacs	3593653152	3588581849	-0.14%
436.cactusADM	2822346689	2828797842	0.23%
437.leslie3d	15903712760	15975143040	0.45%
444.namd	42446067469	43607637322	2.74%
445.gobmk	35272482208	35268743690	-0.01%
447.dealII	42476324881	42507009849	0.07%
450.soplex	45943150	45652666	-0.63%
453.povray	2314481169	2222157619	-3.99%
454.calculix	131024939	131078501	0.04%
456.hmmer	13853478444	13853306947	0.00%
458.sjeng	14173066874	14173066909	0.00%
459.GemsFDTD	2437559044	2437819638	0.01%
462.libquantum	175827242	175657854	-0.10%
464.h264ref	75718510217	75711714226	-0.01%
465.tonto	2505737844	2511457541	0.23%
470.lbm		4799298802	4812180033	0.27%
473.astar	17435751523	17435498947	0.00%
481.wrf		7144685575	7170593748	0.36%
482.sphinx3	6000198462	5984438416	-0.26%
483.xalancbmk	273958223	273638145	-0.12%
--------------------------------------------------------
GEOMEAN		4678862313	4677012093	-0.04%

A bigger % is better; a negative value means that icount
increased in the experiment (8 callee-saved registers).
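
The message does not state how the last column is computed. Judging from
the numbers, it appears to be the difference relative to the
callee-saved count; the sketch below encodes that inferred formula, and
icount_decrease_pct is a made-up name.

```c
/* Relative icount decrease in percent, as inferred from the table:
   positive when the build with 8 callee-saved registers executed fewer
   instructions than the all-call-clobbered build. */
static double icount_decrease_pct(double callee_saved, double all_clobbered)
{
    return 100.0 * (all_clobbered - callee_saved) / callee_saved;
}
```

Applied to the 444.namd and 453.povray rows, this gives values that
round to the 2.74% and -3.99% shown in the table.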


It seems to me that LRA is not always optimal, e.g. if you compile the attached testcase
with: ./build-x86_64-linux/gcc/xgcc -B./build-x86_64-linux/gcc repro.c -S -Ofast -mavx512f

Assembler for main looks like:
main:
.LFB2331:
        vcvtsi2ss       %edi, %xmm1, %xmm1
        subq    $24, %rsp
        vextractf32x4   $0x0, %zmm16, (%rsp)
        vmovaps %zmm1, %zmm16
        call    test
        vfmadd132ss     .LC1(%rip), %xmm16, %xmm16
        vmovaps %zmm16, %zmm2
        movl    $.LC2, %edi
        movl    $1, %eax
        vunpcklps       %xmm2, %xmm2, %xmm2
        vcvtps2pd       %xmm2, %xmm0
        call    printf
        vmovaps %zmm16, %zmm3
        vinsertf32x4    $0x0, (%rsp), %zmm16, %zmm16
        addq    $24, %rsp
        vcvttss2si      %xmm3, %eax
        ret
I have no idea why we are doing the conversion into %xmm1 and then moving it to %xmm16.
However, it may be a non-LRA issue.

Thanks, K


---
 gcc/config/i386/i386.c | 2 +-
 gcc/config/i386/i386.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 6b13ac9..d6d8040 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -9125,7 +9125,7 @@ ix86_nsaved_sseregs (void)
   int nregs = 0;
   int regno;
 
-  if (!TARGET_64BIT_MS_ABI)
+  if (!(TARGET_64BIT_MS_ABI || TARGET_AVX512F))
     return 0;
   for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
     if (SSE_REGNO_P (regno) && ix86_save_reg (regno, true))
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index d7a934d..9faab8b 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1026,9 +1026,9 @@ enum target_cpu_default
 /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/		\
      6,   6,    6,    6,    6,    6,    6,    6,		\
 /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/		\
-     6,    6,     6,    6,    6,    6,    6,    6,		\
+     0,    0,     0,    0,    0,    0,    0,    0,		\
 /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/		\
-     6,    6,     6,    6,    6,    6,    6,    6,		\
+     1,    1,     1,    1,    1,    1,    1,    1,		\
  /* k0,  k1,  k2,  k3,  k4,  k5,  k6,  k7*/			\
      1,   1,   1,   1,   1,   1,   1,   1 }
 
-- 
1.7.11.7



[-- Attachment #2: repro.c --]
[-- Type: text/plain, Size: 370 bytes --]

#include <stdio.h>
#include <immintrin.h>

int *p;
volatile float g1 = 100, g2  = 200;

void foo ()
{
  printf ("Hi\n");
}

void extern
test (void)
{
  float x, y, z;
  y = g1;
  z = g2;
  x = y + z;
  foo ();
  x += y * z;
  g2 = x;
}

int
main (int argc, char **argv)
{
  float a = argc;
  a += argc;
  test ();
  a += argc;

  printf ("==> %f\n", a);

  return a;
}



end of thread, other threads:[~2013-08-02 12:49 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2013-07-23 19:34 [x86-64 psABI]: Extend x86-64 psABI to support AVX-512 H.J. Lu
2013-07-23 19:57 ` Joseph S. Myers
2013-07-25 14:48   ` Gopalasubramanian, Ganesh
2013-07-24 15:20 ` Richard Biener
2013-07-24 15:27   ` H.J. Lu
2013-07-24 15:38     ` Joseph S. Myers
2013-07-24 17:34     ` Richard Biener
2013-07-24 17:42       ` H.J. Lu
2013-07-24 17:55         ` Peter Bergner
2013-07-24 19:22           ` H.J. Lu
2013-07-24 18:14       ` Ondřej Bílka
2013-07-25  3:07       ` Jakub Jelinek
2013-07-25  7:09         ` Ondřej Bílka
2013-07-25 16:51           ` Rich Felker
2013-07-27 15:44             ` Ondřej Bílka
2013-07-27 16:13               ` Rich Felker
2013-07-27 16:24                 ` Rich Felker
2013-07-27 18:27                 ` Support setjmp in x86-64 psABI with AVX-512 Ondřej Bílka
2013-07-27 20:09                   ` Rich Felker
2013-07-24 18:25   ` [x86-64 psABI]: Extend x86-64 psABI to support AVX-512 Richard Henderson
2013-07-24 18:52     ` Ondřej Bílka
2013-07-25 12:17       ` Janne Blomqvist
2013-07-25 12:47         ` Ondřej Bílka
2013-07-30 13:55     ` Kirill Yukhin
2013-08-02 12:49       ` Kirill Yukhin
