x86 CPU features detection for applications (and AMX)

public inbox for libc-alpha@sourceware.org

* x86 CPU features detection for applications (and AMX)
@ 2021-06-23 15:04 Florian Weimer
  2021-06-23 15:32 ` Dave Hansen
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Florian Weimer @ 2021-06-23 15:04 UTC (permalink / raw)
  To: libc-alpha, linux-api, x86, linux-arch

We have an interface in glibc to query CPU features:

  X86-specific Facilities
  <https://www.gnu.org/software/libc/manual/html_node/X86.html>

CPU_FEATURE_USABLE means all preconditions for a feature are met;
HAS_CPU_FEATURE means it's in silicon but possibly dormant.
CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
enabling the relevant bit (so it cannot pass through any unknown bits).

It turns out we screwed up in the glibc 2.33 release, where the
absolutely required headers weren't actually installed:

  [PATCH] x86: Install <bits/platform/x86.h> [BZ #27958]
  <https://sourceware.org/pipermail/libc-alpha/2021-June/127215.html>

Given that the magic constants aren't available in any other way, this
feature was completely unusable, so we can perhaps revisit it and switch
to a different approach.

Previously kernel developers have expressed dismay that we didn't
coordinate the interface with them.  This is why I want to raise this now.

When we designed this glibc interface, we assumed that bits would be
static during the life-time of the process, initialized at process
start.  That follows the model of previous x86 CPU feature enablement.
In the background, CPU_FEATURE_USABLE/HAS_CPU_FEATURE calls a function
which returns a pointer to eight 32-bit words, based on the index passed
to the function (out-of-range indices return a pointer to zeros,
enabling forward compatibility).  The macros then use a magic constant
that encodes the lookup index and which of those 128 bits to extract,
plus the feature/usable choice.  This means that we
*could* keep this interface unchanged if the kernel gives us a way to
read up-to-date feature state from a 256-bit area (or at least a 32-bit
word) in thread-specific data, similar to what we have with
set_robust_list and rseq today.
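
As a rough illustration of the lookup described above, here is a
simplified model.  It is not glibc's code: the struct layout (four CPUID
words followed by four usable words), the encoding of the magic constant,
and every name in it are assumptions made for this sketch.

```c
#include <stdint.h>

/* Hypothetical layout: eight 32-bit words per lookup index. */
struct feature_words {
    uint32_t cpuid[4];   /* EAX, EBX, ECX, EDX as reported by CPUID */
    uint32_t usable[4];  /* same bit positions, set only if usable */
};

#define LEAF_COUNT 2
/* A real libc would fill this once at process start. */
static struct feature_words feature_table[LEAF_COUNT];
static const struct feature_words all_zero = { {0}, {0} };

/* Out-of-range indices yield zeros, so bits unknown to this library
   version simply read as "absent". */
static const struct feature_words *get_feature_words(unsigned int leaf)
{
    return leaf < LEAF_COUNT ? &feature_table[leaf] : &all_zero;
}

/* Hypothetical magic constant: bits 0-4 hold the bit number, bits 5-6
   the register word, bits 8 and up the lookup index. */
#define FEATURE_ID(leaf, reg, bit) (((leaf) << 8) | ((reg) << 5) | (bit))

static int feature_present(unsigned int id)
{
    const struct feature_words *w = get_feature_words(id >> 8);
    return (w->cpuid[(id >> 5) & 3] >> (id & 31)) & 1;
}

static int feature_usable(unsigned int id)
{
    const struct feature_words *w = get_feature_words(id >> 8);
    return (w->usable[(id >> 5) & 3] >> (id & 31)) & 1;
}
```

Because the callers only ever read through the returned pointer, a kernel
that updated the words behind it would change the answers without any
change to the macros, which is the property the paragraph above relies on.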

This still wouldn't cover the enable/disable side, but at least it would
work for CPU features which are modal and come and go.  The fact that we
tell GCC to cache the returned pointer from that internal function, but
not that the data is immutable, works to our advantage here.

On the other hand, maybe there is a way to give users a better
interface.  Obviously we want to avoid a syscall for a simple CPU
feature check.  And we also need something to enable/disable CPU
features.

Thanks,
Florian

PS: Is it true that there is no public mailing list for Linux
discussions specific to x86?


* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
@ 2021-06-23 15:32 ` Dave Hansen
  2021-07-08  6:05   ` Florian Weimer
  2021-06-25 23:31 ` Thiago Macieira
  2021-07-08 17:56 ` Mark Brown
  2 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2021-06-23 15:32 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha, linux-api, x86, linux-arch

On 6/23/21 8:04 AM, Florian Weimer wrote:
> https://www.gnu.org/software/libc/manual/html_node/X86.html
...
> Previously kernel developers have expressed dismay that we didn't
> coordinate the interface with them.  This is why I want to raise this now.

This looks basically like someone dumped a bunch of CPUID bit values and
exposed them to applications without considering whether applications
would ever need them.  For instance, why would an app ever care about:

	PKS – Protection keys for supervisor-mode pages.

And how could glibc ever give applications accurate information about
whether PKS "is supported by the operating system"?  It just plain
doesn't know, or at least only knows from a really weak ABI like
/proc/cpuinfo.

It also doesn't seem to tell applications what they want which is, "can
I, the application, *use* this feature?"

> PS: Is it true that there is no public mailing list for Linux
> discussions specific to x86?

Yes.  I recently asked for an x86-related list, more of a brainstorming
place to put x86-specific RFCs, but folks were concerned that what I was
asking for was too specific.

	https://subspace.kernel.org/lists.linux.dev.html


* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:32 ` Dave Hansen
@ 2021-07-08  6:05   ` Florian Weimer
  2021-07-08 14:19     ` Dave Hansen
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2021-07-08  6:05 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> On 6/23/21 8:04 AM, Florian Weimer wrote:
>> https://www.gnu.org/software/libc/manual/html_node/X86.html
> ...
>> Previously kernel developers have expressed dismay that we didn't
>> coordinate the interface with them.  This is why I want to raise this now.
>
> This looks basically like someone dumped a bunch of CPUID bit values and
> exposed them to applications without considering whether applications
> would ever need them.  For instance, why would an app ever care about:
>
> 	PKS – Protection keys for supervisor-mode pages.
>
> And how could glibc ever give applications accurate information about
> whether PKS "is supported by the operating system"?  It just plain
> doesn't know, or at least only knows from a really weak ABI like
> /proc/cpuinfo.

glibc is expected to mask these bits for CPU_FEATURE_USABLE because they
have unknown semantics (to glibc).

They are still exposed via HAS_CPU_FEATURE.

I argued against HAS_CPU_FEATURE because the mere presence of this
interface will introduce application bugs: applications really
must use CPU_FEATURE_USABLE instead.

I wanted to go with a curated set of bits, but we couldn't get consensus
around that.  Curiously, the present interface can expose changing CPU
state (if the kernel updates some fixed memory region accordingly); my
preferred interface would not have supported that.

> It also doesn't seem to tell applications what they want which is, "can
> I, the application, *use* this feature?"

CPU_FEATURE_USABLE is supposed to be that interface.

Thanks,
Florian


* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08  6:05   ` Florian Weimer
@ 2021-07-08 14:19     ` Dave Hansen
  2021-07-08 14:31       ` Florian Weimer
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2021-07-08 14:19 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

On 7/7/21 11:05 PM, Florian Weimer wrote:
>> This looks basically like someone dumped a bunch of CPUID bit values and
>> exposed them to applications without considering whether applications
>> would ever need them.  For instance, why would an app ever care about:
>>
>> 	PKS – Protection keys for supervisor-mode pages.
>>
>> And how could glibc ever give applications accurate information about
>> whether PKS "is supported by the operating system"?  It just plain
>> doesn't know, or at least only knows from a really weak ABI like
>> /proc/cpuinfo.
> glibc is expected to mask these bits for CPU_FEATURE_USABLE because they
> have unknown semantics (to glibc).

OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS
supported in the operating system, I'll get false from an interface that
claims to be:

> This macro returns a nonzero value (true) if the processor has the
> feature name and the feature is supported by the operating system.

The interface just seems buggy by *design*.


* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08 14:19     ` Dave Hansen
@ 2021-07-08 14:31       ` Florian Weimer
  2021-07-08 14:36         ` Dave Hansen
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2021-07-08 14:31 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> On 7/7/21 11:05 PM, Florian Weimer wrote:
>>> This looks basically like someone dumped a bunch of CPUID bit values and
>>> exposed them to applications without considering whether applications
>>> would ever need them.  For instance, why would an app ever care about:
>>>
>>> 	PKS – Protection keys for supervisor-mode pages.
>>>
>>> And how could glibc ever give applications accurate information about
>>> whether PKS "is supported by the operating system"?  It just plain
>>> doesn't know, or at least only knows from a really weak ABI like
>>> /proc/cpuinfo.
>> glibc is expected to mask these bits for CPU_FEATURE_USABLE because they
>> have unknown semantics (to glibc).
>
> OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS
> supported in the operating system, I'll get false from an interface that
> claims to be:
>
>> This macro returns a nonzero value (true) if the processor has the
>> feature name and the feature is supported by the operating system.
>
> The interface just seems buggy by *design*.

Yes, but that is largely a documentation matter.  We should have said
something about “userspace” there, and that the bit needs to be known to
glibc.  There is another exception: FSGSBASE, and that's a real bug we
need to fix (it has to go through AT_HWCAP2).

If we want to avoid that, we need to go down the road of a curated set
of CPUID bits, where a bit only exists if we have taught glibc its
semantics.  You still might get a false negative by running against an
older glibc than the application was built for.  (We are not going to
force applications that e.g. look for FSGSBASE to only run with a glibc
that is at least the version which implemented semantics for the
FSGSBASE bit.)

Thanks,
Florian


* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08 14:31       ` Florian Weimer
@ 2021-07-08 14:36         ` Dave Hansen
  2021-07-08 14:41           ` Florian Weimer
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2021-07-08 14:36 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

On 7/8/21 7:31 AM, Florian Weimer wrote:
>> OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS
>> supported in the operating system, I'll get false from an interface that
>> claims to be:
>>
>>> This macro returns a nonzero value (true) if the processor has the
>>> feature name and the feature is supported by the operating system.
>> The interface just seems buggy by *design*.
> Yes, but that is largely a documentation matter.  We should have said
> something about “userspace” there, and that the bit needs to be known to
> glibc.  There is another exception: FSGSBASE, and that's a real bug we
> need to fix (it has to go through AT_HWCAP2).
> 
> If we want to avoid that, we need to go down the road of a curated set
> of CPUID bits, where a bit only exists if we have taught glibc its
> semantics.  You still might get a false negative by running against an
> older glibc than the application was built for.  (We are not going to
> force applications that e.g. look for FSGSBASE to only run with a glibc
> that is at least the version which implemented semantics for the
> FSGSBASE bit.)

That's kinda my whole point.

These *MUST* be curated to be meaningful.  Right now, someone just
dumped a set of CPUID bits into the documentation.

The interface really needs *three* modes:

1. Yes, the CPU/OS supports this feature
2. No, the CPU/OS doesn't support this feature
3. Hell if I know, never heard of this feature
	
The interface really conflates 2 and 3.  To me, that makes it
fundamentally flawed.
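
The three modes listed above could be modelled roughly like this.  This
is only a sketch of the idea: the enum, the two masks and the function
name are hypothetical and do not correspond to any existing glibc
interface.

```c
#include <stdint.h>

enum feature_state {
    FEATURE_UNKNOWN,      /* mode 3: never heard of this feature */
    FEATURE_UNSUPPORTED,  /* mode 2: known, but CPU/OS does not support it */
    FEATURE_SUPPORTED     /* mode 1: known and supported by CPU and OS */
};

struct feature_masks {
    uint32_t known;   /* bits whose semantics this library version knows */
    uint32_t usable;  /* bits that the CPU and the OS both support */
};

static enum feature_state query_feature(const struct feature_masks *m,
                                        unsigned int bit)
{
    if (bit >= 32)
        return FEATURE_UNKNOWN;

    uint32_t mask = UINT32_C(1) << bit;

    /* Check "known" first, so that an unknown bit is never reported
       as unsupported. */
    if (!(m->known & mask))
        return FEATURE_UNKNOWN;
    return (m->usable & mask) ? FEATURE_SUPPORTED : FEATURE_UNSUPPORTED;
}
```

Keeping a separate "known" mask is what stops modes 2 and 3 from
collapsing into a single zero bit.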


* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08 14:36         ` Dave Hansen
@ 2021-07-08 14:41           ` Florian Weimer
  0 siblings, 0 replies; 27+ messages in thread
From: Florian Weimer @ 2021-07-08 14:41 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> That's kinda my whole point.
>
> These *MUST* be curated to be meaningful.  Right now, someone just
> dumped a set of CPUID bits into the documentation.
>
> The interface really needs *three* modes:
>
> 1. Yes, the CPU/OS supports this feature
> 2. No, the CPU/OS doesn't support this feature
> 3. Hell if I know, never heard of this feature
> 	
> The interface really conflates 2 and 3.  To me, that makes it
> fundamentally flawed.

That's an interesting point.

3 looks potentially more useful than the feature/usable distinction to
me.

The recent RTM change suggests that there are more states, but we
probably can't do much about such soft-disable changes.

Thanks,
Florian



* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
  2021-06-23 15:32 ` Dave Hansen
@ 2021-06-25 23:31 ` Thiago Macieira
       [not found]   ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>
  2021-07-08  7:08   ` Florian Weimer
  2021-07-08 17:56 ` Mark Brown
  2 siblings, 2 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-25 23:31 UTC (permalink / raw)
  To: fweimer; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 23 Jun 2021 17:04:27 +0200, Florian Weimer wrote:
> We have an interface in glibc to query CPU features:
> X86-specific Facilities
> <https://www.gnu.org/software/libc/manual/html_node/X86.html>
>
> CPU_FEATURE_USABLE means all preconditions for a feature are met;
> HAS_CPU_FEATURE means it's in silicon but possibly dormant.
> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
> enabling the relevant bit (so it cannot pass through any unknown bits).

It's a nice initiative, but it doesn't help libraries and applications that need 
to be either cross-platform or backwards compatible.

The first problem is the cross-platformness need. Because we library and 
application developers need to support other OSes, we'll need to deploy our 
own CPUID-based detection. It's far better to use common code everywhere, 
where one developer working on Linux can fix bugs in FreeBSD, macOS or Windows 
or any of the permutations. Every platform-specific deviation adds to 
maintenance requirements and is a source of potential latent bugs, now or in 
the future due to refactoring. That is why doing everything in the form of 
instructions would be far better and easier, rather than system calls.

[Unless said system calls were standardised and actually deployed. Making this 
a cross-platform library that is not part of libc would be a major step in 
that direction]

The second problem is going to be backwards compatibility. Applications and 
libraries may want to ship precompiled binaries that make use of the new CPU 
features, whether they are open source or not. It comes as no surprise to 
anyone that we CPU makers will have made software that uses those features and 
want to have it ready on Day 1 of the HW being available for the market (if 
we're doing our jobs right). That often involves precompiling because everyone 
who installed their compilers more than one year ago will not have the 
necessary tools to build. That runs counter to the need to use a libc 
interface that didn't exist until recently.

And by "recently", I mean "anything since the glibc that came with Red Hat 
Enterprise Linux 7" (2.17).

So no, application and library developers will not use libc functions they 
don't need to, especially if it adds to their problems, unless there's no way 
around it.

> Previously kernel developers have expressed dismay that we didn't
> coordinate the interface with them.  This is why I want to raise this now.

You also need to coordinate with your users.

A platform-specific API to solve a problem that is already solved is "knock 
yourself out, we're not going to use this." So my first suggestion is to 
remove the "platform-specific" part and make this a cross-platform solution.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering


[parent not found: <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>]

* Re: x86 CPU features detection for applications (and AMX)
       [not found]   ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>
@ 2021-06-28 13:20     ` Peter Zijlstra
       [not found]       ` <534d0171-2cc5-cd0a-904f-cd3c499b55af@metux.net>
  2021-06-28 15:08     ` Thiago Macieira
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2021-06-28 13:20 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, fweimer, hjl.tools, libc-alpha, linux-api,
	linux-arch, x86

On Mon, Jun 28, 2021 at 02:40:32PM +0200, Enrico Weigelt, metux IT consult wrote:

> Going back to AMX - just had a quick look at the spec (*1). Sorry, but
> this thing is really weird and horrible to use. Come on, these chips
> already have billions of transistors, it really can't hurt so much
> spending a few more to provide a clean and easy to use machine code
> interface. Grmmpf! (This is a general problem we've got with so many
> HW folks, why can't they just talk to us SW folks first so we can find
> a good solution for both sides, before that goes into the field ?)
> 
> And one point that immediately jumps into my mind (w/o looking deeper
> into it): it introduces completely new registers - do we now need extra
> code for task switching etc ?

No, but because it's register state and part of XSAVE, it has immediate
impact on the ABI. In particular, the signal stack layout includes XSAVE (as
does ptrace()).

At the same time, 'legacy' applications (up until _very_ recently) had a
minimum signal stack size of 2K, which is already violated by the
addition of AVX512 (there's actual breakage due to that).

Adding the insane AMX state (8k+) into that is a complete trainwreck
waiting to happen. Not to mention that having !INIT AMX state has direct
consequences for P-state selection and thus performance.

For these reasons, us OS folks, will mandate you get to do a prctl() to
request/release AMX (and we get to say: no). If you use AMX without
this, the instruction will fault (because not set in XCR0) and we'll
SIGBUS or something.

Userspace will have to do something like:

 - check CPUID, if !AMX -> fail
 - issue prctl(), if error -> fail
 - issue XGETBV and check the AMX bit is set, if not -> fail
 - request the signal stack size / spawn threads
 - use AMX

Spawning threads prior to enabling AMX will result in using the wrong
signal stack size and result in malfunction, you get to keep the pieces.
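
The first three steps of that sequence might look like the sketch below.
Note the assumptions: the request call was not settled when this message
was written, so the sketch borrows the arch_prctl(ARCH_REQ_XCOMP_PERM)
interface and the state component numbers that later landed in mainline
Linux (5.16); the helper names are invented here.  It targets x86 Linux
with GCC or Clang.

```c
#include <cpuid.h>        /* __get_cpuid, __get_cpuid_count */
#include <stdint.h>
#include <sys/syscall.h>  /* SYS_arch_prctl */
#include <unistd.h>       /* syscall */

/* Values from the later Linux uapi <asm/prctl.h>; an assumption here. */
#define REQ_XCOMP_PERM     0x1023  /* ARCH_REQ_XCOMP_PERM */
#define XFEATURE_XTILECFG  17
#define XFEATURE_XTILEDATA 18

static uint64_t amx_read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* Returns 1 if this process may use AMX tile state, 0 otherwise. */
static int amx_enable(void)
{
    unsigned int a, b, c, d;

    /* Step 1: CPUID.  OSXSAVE (leaf 1, ECX bit 27) must be set before
       XGETBV may be executed; AMX-TILE is leaf 7 subleaf 0, EDX bit 24. */
    if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & (1u << 27)))
        return 0;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d) || !(d & (1u << 24)))
        return 0;

    /* Step 2: ask the kernel, which is allowed to say no. */
    if (syscall(SYS_arch_prctl, REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0)
        return 0;

    /* Step 3: confirm that both tile components are enabled in XCR0. */
    uint64_t need = (UINT64_C(1) << XFEATURE_XTILECFG)
                  | (UINT64_C(1) << XFEATURE_XTILEDATA);
    return (amx_read_xcr0() & need) == need;
}
```

Following the ordering given above, a program would call amx_enable()
once, early, before sizing any signal stacks or spawning threads.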


[parent not found: <534d0171-2cc5-cd0a-904f-cd3c499b55af@metux.net>]

* Re: x86 CPU features detection for applications (and AMX)
       [not found]       ` <534d0171-2cc5-cd0a-904f-cd3c499b55af@metux.net>
@ 2021-06-30 15:36         ` Thiago Macieira
  0 siblings, 0 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-30 15:36 UTC (permalink / raw)
  To: Peter Zijlstra, Enrico Weigelt, metux IT consult
  Cc: fweimer, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Wednesday, 30 June 2021 05:50:30 PDT Enrico Weigelt, metux IT consult 
wrote:
> > No, but because it's register state and part of XSAVE, it has immediate
> > impact on the ABI. In particular, the signal stack layout includes XSAVE (as
> > does ptrace()).
> 
> OMGs, I've already suspected such sickness. I don't even dare thinking
> about consequences for compilers and library ABIs.
> 
> Does anyone here know why they designed this as inline operations ? This
> thing seems to be pretty much what typical TPUs are doing (or a subset
> of it). Why not just adding a TPU next to the CPU on the same chip ?

To be clear: this is a SW ABI. It has nothing to do with the presence or absence of 
other processing units in the system.

The moment you receive a Unix signal with SA_SIGINFO, the mcontext state needs 
to be saved somewhere. Where would you save it? Please remember that:

- signal handlers can be called at any point in the execution, including
  in the middle of malloc()
- signal handlers can longjmp out of the handler back into non-handler code
- in a multithreaded application, each thread can be handling a signal 
  simultaneously

We could have the kernel hold on to that and have a system call to extract 
them, but that's an ABI change and I think won't work for the longjmp case.

> > Userspace will have to do something like:
> >   - check CPUID, if !AMX -> fail
> >   - issue prctl(), if error -> fail
> >   - issue XGETBV and check the AMX bit is set, if not -> fail
> 
> Can't we do this just by prctl() call ?
> IOW: ask the kernel, who gonna say yes or no.

That's possible. The kernel can't enable an AMX state on a system without AMX.

> Are there any situations where kernel says yes, but process still can't
> use it ? Why so ?

Today there is no such case that I can think of.

> >   - request the signal stack size / spawn threads
> 
> Signal stack is separate from the usual stack, right ?
> Why can't this all be done in one shot ?

Yes, we're talking about the sigaltstack() call.

What is "this all" in the sentence above?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering





* Re: x86 CPU features detection for applications (and AMX)
       [not found]   ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>
  2021-06-28 13:20     ` Peter Zijlstra
@ 2021-06-28 15:08     ` Thiago Macieira
  2021-06-28 15:27       ` Peter Zijlstra
       [not found]       ` <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net>
  1 sibling, 2 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-28 15:08 UTC (permalink / raw)
  To: fweimer, Enrico Weigelt, metux IT consult
  Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Monday, 28 June 2021 05:40:32 PDT Enrico Weigelt, metux IT consult wrote:
> > The first problem is the cross-platformness need. Because we library and
> > application developers need to support other OSes, we'll need to deploy
> > our own CPUID-based detection. It's far better to use common code
> > everywhere, where one developer working on Linux can fix bugs in FreeBSD,
> > macOS or Windows or any of the permutations. Every platform-specific
> > deviation adds to maintenance requirements and is a source of potential
> > latent bugs, now or in the future due to refactoring. That is why doing
> > everything in the form of instructions would be far better and easier,
> > rather than system calls.
> hmm, maybe some libcpuid ?

Indeed. I'm querying inside Intel to see if I can get buy-in to create such a 
library.

> > The second problem is going to be backwards compatibility. Applications
> > and libraries may want to ship precompiled binaries that make use of the
> > new CPU features, whether they are open source or not.
> 
> Since we're talking about GNU libc here, binary-only stuff is probably
> out of scope here. OTOH, using different libc versions in those special
> cases isn't such a big deal.

Shipping a libc is not trivial, either technically or due to licensing 
requirements. Most applications want to link against whatever libc the system 
already provides, if that's possible.

> > It comes as no surprise to anyone that we CPU makers will have made
> > software that uses those features and want to have it ready on Day 1 of
> > the HW being available for the market (if we're doing our jobs right).
> > That often involves precompiling because everyone who installed their
> > compilers more than one year ago will not have the necessary tools to
> > build.
> 
> Actually, you should talk to the compiler folks much more early, at the
> point where you know how those features look like.

We do, but it's not enough.

GCC releases once a year, so it's easy to miss the feature freeze. Then there 
are Linux distros that do LTS every 2 years or so. Worse, those two are 
usually out of phase. For example, if you're using the current Ubuntu LTS 
today (almost July 2021), you're using 20.04, which was released one month 
before the GCC 10 release. So you're using GCC 9, released May 2019, which 
means its features were frozen in December 2018. That's an incredibly long 
lead time.

As a consequence, you will see precompiled binaries.

> For using certain new CPU specific features, the need for a compiler
> upgrade really should be no excuse. And at least for vast majority of
> cases, a proper compiler could do it much better than the average
> programmer.

To compile the software that uses those instructions, undoubtedly. But what if 
I did that for you and you could simply download the binary for the library 
and/or plugins such that you could slot into your existing systems and CI? 
This could make a difference between adoption or not.

> > And by "recently", I mean "anything since the glibc that came with Red Hat
> > Enterprise Linux 7" (2.17).
> 
> Uh, that's really ancient. Nobody can seriously expect modern features
> on such an ancient distro. If people really insist spending so much
> money for running such old code, instead of just doing a dist upgrade,
> then I can only reply with "not our problem".

Yes and no.

Red Hat has been incredibly successful in backporting kernel features to the 
old 3.10 that came with RHEL 7. Whether they will do that for AMX state saving 
and the system call that we're discussing here, I can't say. AFAIU, they did 
backport the AVX512 state-saving to that 3.10, so they may.

Even if they don't, the *software* that people deploy may be the same build 
for RHEL 7 and for a modern distro that will have a 5.14 kernel. That software 
may have non-AVX, AVX2, AVX512 and AMX-specific code paths and would do 
runtime detection of which one is best to use. If a system call is needed, the 
system call needs to be issued even on that 3.10 and if it responds with 
-ENOSYS or -EINVAL, then it will fall back to the next best option.
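
A minimal sketch of that fallback logic, with the kernel request
abstracted behind a callback so the same code can be exercised on any
machine.  The tiers, the struct and the callback are hypothetical; a real
library would fill the flags from CPUID and point request_amx at whatever
system call wrapper ends up existing.

```c
#include <errno.h>
#include <stddef.h>

enum code_path { PATH_BASELINE, PATH_AVX2, PATH_AVX512, PATH_AMX };

struct cpu_probe {
    int has_avx2;
    int has_avx512;
    int has_amx;               /* CPUID reports AMX in silicon */
    int (*request_amx)(void);  /* hypothetical: 0 on success, -errno on failure */
};

static enum code_path pick_code_path(const struct cpu_probe *p)
{
    /* Only issue the request when the CPU has the feature at all. */
    if (p->has_amx && p->request_amx != NULL && p->request_amx() == 0)
        return PATH_AMX;

    /* -ENOSYS or -EINVAL from an old kernel ends up here: take the
       next best option instead of failing. */
    if (p->has_avx512)
        return PATH_AVX512;
    if (p->has_avx2)
        return PATH_AVX2;
    return PATH_BASELINE;
}

/* Stand-ins for the kernel's answer, used to exercise both outcomes. */
static int old_kernel_request(void) { return -ENOSYS; }
static int new_kernel_request(void) { return 0; }
```

The same binary thus behaves sensibly on a 3.10 kernel and on a recent
one, without depending on a libc wrapper being present.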

So my point is: this shouldn't be in glibc because the glibc will not have the 
new system call wrappers or TLS fields.

> What we SW engineers need is an easy and fast method to act depending on
> whether some CPU supports some feature (eg. a new opcode). Things like
> cpuinfo are only a tiny piece of that. What we could really use is a
> conditional jump/call based on whether feature X is supported - without
> any kernel intervention. Then the machine code could be easily laid out
> to support both cases with or without some feature X. Alternatively we
> could have a fast trapping in userland - hw generated call already would
> be a big help.

That's what cpuid is for. With GCC function multi-versioning or equivalent 
manually-rolled solutions, you can get exactly what you're asking for.

Yes, the checking became far more complex with the need to check XCR0 after 
AVX came along, but since the instruction itself is slow and serialising, 
any library will just cache the results. And as a result, the level of CPU 
features is not expected to change. It never has in the past, so this hasn't 
been an issue.

> If we had something in that direction, we wouldn't have to have this
> kind discussion here anymore - it would be entirely up to compiler and
> library folks, no need for any kernel support at all.

For most features, there isn't. You don't see us discussing 
AVX512VP2INTERSECT, for example. This discussion only exists because AMX 
requires more state to be saved during context switches and signal delivery. 
See Peter's email.

> And one point that immediately jumps into my mind (w/o looking deeper
> into it): it introduces completely new registers - do we now need extra
> code for task switching etc ?

Yes, this is the crux of this discussion.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering


* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 15:08     ` Thiago Macieira
@ 2021-06-28 15:27       ` Peter Zijlstra
  2021-06-28 16:13         ` Thiago Macieira
       [not found]       ` <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net>
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2021-06-28 15:27 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 08:08:41AM -0700, Thiago Macieira wrote:
> On Monday, 28 June 2021 05:40:32 PDT Enrico Weigelt, metux IT consult wrote:

> > What we SW engineers need is an easy and fast method to act depending on
> > whether some CPU supports some feature (eg. a new opcode). Things like
> > cpuinfo are only a tiny piece of that. What we could really use is a
> > conditional jump/call based on whether feature X is supported - without
> > any kernel intervention. Then the machine code could be easily laid out
> > to support both cases with or without some feature X. Alternatively we
> > could have a fast trapping in userland - hw generated call already would
> > be a big help.
> 
> That's what cpuid is for. With GCC function multi-versioning or equivalent 
> manually-rolled solutions, you can get exactly what you're asking for.

Right, lots of self-modifying code solutions there, some of which can be
linker driven, some not. In the kernel we use alternative() to replace
short code sequences depending on CPUID.

Userspace *could* do the same; rewriting code before first execution is
fairly straightforward.

> Yes, the checking became far more complex with the need to check XCR0 after 
> AVX came along, but since the instruction itself is slow and serialising, 
> any library will just cache the results. And as a result, the level of CPU 
> features is not expected to change. It never has in the past, so this hasn't 
> been an issue.

Arguably you should be checking XCR0 for any feature there, including
SSE/AVX/AVX512 and now AMX.

Ideally we'd do a prctl() for AVX512 too, except it's too late :-(
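
The XCR0 check mentioned above can be sketched like this for AVX and
AVX-512.  The bit positions are the architectural ones; the helper names
are made up for this example, and it assumes x86 with GCC or Clang.

```c
#include <cpuid.h>   /* __get_cpuid, __get_cpuid_count */
#include <stdint.h>

#define XCR0_SSE     (UINT64_C(1) << 1)
#define XCR0_AVX     (UINT64_C(1) << 2)
#define XCR0_AVX512  (UINT64_C(7) << 5)  /* opmask, ZMM_Hi256, Hi16_ZMM */

static uint64_t xgetbv0(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* True if the OS has enabled every state component in mask. */
static int os_enabled(uint64_t mask)
{
    unsigned int a, b, c, d;

    /* XGETBV is only valid when OSXSAVE (leaf 1, ECX bit 27) is set. */
    if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & (1u << 27)))
        return 0;
    return (xgetbv0() & mask) == mask;
}

static int avx_usable(void)
{
    unsigned int a, b, c, d;

    if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & (1u << 28)))  /* AVX */
        return 0;
    return os_enabled(XCR0_SSE | XCR0_AVX);
}

static int avx512f_usable(void)
{
    unsigned int a, b, c, d;

    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d) || !(b & (1u << 16)))
        return 0;                                    /* no AVX512F */
    return os_enabled(XCR0_SSE | XCR0_AVX | XCR0_AVX512);
}
```

The CPUID bit alone says the silicon has the feature; only the XCR0 test
says the OS will actually save and restore its registers.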


* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 15:27       ` Peter Zijlstra
@ 2021-06-28 16:13         ` Thiago Macieira
  2021-06-28 17:11           ` Peter Zijlstra
  2021-06-28 17:43           ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-28 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Monday, 28 June 2021 08:27:24 PDT Peter Zijlstra wrote:
> > That's what cpuid is for. With GCC function multi-versioning or equivalent
> > manually-rolled solutions, you can get exactly what you're asking for.
> 
> Right, lots of self-modifying code solutions there, some of which can be
> linker driven, some not. In the kernel we use alternative() to replace
> short code sequences depending on CPUID.
> 
> Userspace *could* do the same; rewriting code before first execution is
> fairly straightforward.

Userspace shouldn't do SMC. It's bad enough that JITs without caching exist, 
but having pure paged code is better. Pure pages are shared as needed by the 
kernel.

All you need is a simple bit test. You can then either branch to different 
code paths or write to a function pointer so it'll go there directly the next 
time. You can also choose to load different plugins depending on what CPU 
features were found.
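
A minimal sketch of the function-pointer variant, assuming GCC or Clang on
x86 (__builtin_cpu_supports). The names sum_u32, sum_generic and sum_avx2 are
hypothetical, and the "AVX2" version is only the same loop compiled with AVX2
permitted.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*sum_fn)(const uint32_t *, size_t);

/* Baseline implementation, valid on any x86-64 CPU. */
static uint64_t sum_generic(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Placeholder "fast" version: same loop, but the compiler may use AVX2. */
__attribute__((target("avx2")))
static uint64_t sum_avx2(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

static uint64_t sum_resolve(const uint32_t *v, size_t n);

/* Starts out pointing at the resolver; overwritten on first call. */
static sum_fn sum_impl = sum_resolve;

/* The bit test happens once; later calls go straight to the chosen code. */
static uint64_t sum_resolve(const uint32_t *v, size_t n)
{
    __builtin_cpu_init();
    sum_impl = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_generic;
    return sum_impl(v, n);
}

uint64_t sum_u32(const uint32_t *v, size_t n)
{
    return sum_impl(v, n);
}
```

The pointer update here is not synchronised; real code would want an atomic
store or would resolve before any threads start.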

Consequence: CPU feature checking is done *very* early, often before main().

> Arguably you should be checking XCR0 for any feature there, including
> SSE/AVX/AVX512 and now AMX.
> 
> Ideally we'd do a prctl() for AVX512 too, except it's too late :-(

Right.

But speaking of which, this library would deal with Apple having done the 
allocate-state-on-demand feature for AVX512 without XFD. See
https://github.com/qt/qtbase/blob/dev/src/corelib/global/qsimd.cpp#L346-L369

Anyway, what's the current thinking on what the arch_prctl() should be? Is 
that a per-thread state or will it affect the entire process group? And is it 
a sticky functionality, or are we talking about ref/deref?

Maybe in order to answer that, we need to understand what the worst case 
scenario we need to support is. What are they?

1) alt-stack signal handlers, usually for crashing signals (to catch a stack 
overflow)

2) cooperative user-space task schedulers, e.g. coroutines

3) preemptive user-space task schedulers (I don't know if such software exists 
or even if it is possible)

4) combination of 1 and 3

5) #4, in which each part comes from a separate library with no knowledge 
of each other, and initialised concurrently in different threads

I'd *assume* that any user-space task scheduler is aware of XSAVE at the very 
least and will know how to allocate context-saving buffers of sufficient size 
for each task. I think this is a safe assumption because AVX is over 10 years 
old now and XSAVE is a required feature for enabling the AVX state. That is, 
any library that knows to save AVX state (the upper 128-bits of the YMM 
registers) is aware of XSAVE and the fact that the state size is dynamic.

Crash handlers are another story. Speaking from experience, my first attempt 
at doing them simply used a global char array of MINSIGSTKSZ and that failed 
to get delivered (note that code will now fail to compile because MINSIGSTKSZ 
is no longer a constant expression). My code was attempting to launch gdb on 
itself, so it wasn't even a SA_SIGINFO signal and therefore the failure was 
baffling. I had to read the kernel source code to figure out that regardless 
of SA_SIGINFO, the state is saved on stack anyway and therefore needs to be 
big enough. So I simply increased the global variable's size until it 
succeeded in delivering on my AVX512 machine. And because it is no longer 
using MINSIGSTKSZ, it will not fail to compile after the glibc upgrade, but it 
will fail to deliver with AMX state enabled.

[I've since learned to check the XSAVE state size in order to create the alt-
stack]
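
As a sketch of what that check can look like, assuming GCC or Clang on x86:
CPUID leaf 0xD, sub-leaf 0 reports the XSAVE area size for the currently
enabled features in EBX and for all supported features in ECX. The function
names and the 16 kB of handler headroom are assumptions made for this
example, not values from any API.

```c
#include <cpuid.h>
#include <stddef.h>

/* Hypothetical headroom for the handler's own frames. */
#define HANDLER_HEADROOM (16u * 1024u)

/* XSAVE area size for the features currently enabled in XCR0. */
size_t xsave_enabled_size(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ebx;
}

/* XSAVE area size for every feature the CPU supports in XCR0. */
size_t xsave_max_size(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx;
}

/* Size an alt-stack from the maximum state size plus handler headroom. */
size_t altstack_size_for_xsave(void)
{
    return xsave_max_size() + HANDLER_HEADROOM;
}
```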

How much do we need to worry about these crash handlers?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 16:13         ` Thiago Macieira
@ 2021-06-28 17:11           ` Peter Zijlstra
  2021-06-28 17:23             ` Thiago Macieira
  2021-06-28 17:43           ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2021-06-28 17:11 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 09:13:29AM -0700, Thiago Macieira wrote:
> On Monday, 28 June 2021 08:27:24 PDT Peter Zijlstra wrote:
> > > That's what cpuid is for. With GCC function multi-versioning or equivalent
> > > manually-rolled solutions, you can get exactly what you're asking for.
> > 
> > Right, lots of self-modifying code solutions there, some of which can be
> > linker driven, some not. In the kernel we use alternative() to replace
> > short code sequences depending on CPUID.
> > 
> > Userspace *could* do the same; rewriting code before first execution is
> > fairly straightforward.
> 
> Userspace shouldn't do SMC. It's bad enough that JITs without caching exist, 
> but having pure paged code is better. Pure pages are shared as needed by the 
> kernel.

I don't feel that strongly; if SMC gets you measurable performance
gains, go for it. If you're short on memory, buy more.

> All you need is a simple bit test. You can then either branch to different 
> code paths or write to a function pointer so it'll go there directly the next 
> time. You can also choose to load different plugins depending on what CPU 
> features were found.

Both bit tests and indirect function calls suffer the extra memory load,
which is not free.

> Consequence: CPU feature checking is done *very* early, often before main().

For the linker based ones, yes. IIRC the ifunc() attribute is
particularly useful here.
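
A minimal sketch of an ifunc-based dispatch, assuming GCC or Clang targeting
an ELF/glibc system on x86. which_impl, impl_generic and impl_avx2 are
hypothetical names; the resolver runs while the dynamic loader processes
relocations, which is why the CPU check ends up happening so early.

```c
/* Two candidate implementations; they only report which one was picked. */
static int impl_generic(void) { return 0; }
static int impl_avx2(void)    { return 1; }

/* Resolver: called by the loader, returns the implementation to bind. */
static int (*resolve_which_impl(void))(void)
{
    /* Needed because the resolver may run before libgcc's constructor. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? impl_avx2 : impl_generic;
}

/* Callers simply call which_impl(); no bit test remains in the call path. */
int which_impl(void) __attribute__((ifunc("resolve_which_impl")));
```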

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 17:11           ` Peter Zijlstra
@ 2021-06-28 17:23             ` Thiago Macieira
  2021-06-28 19:08               ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Thiago Macieira @ 2021-06-28 17:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Monday, 28 June 2021 10:11:16 PDT Peter Zijlstra wrote:
> > Consequence: CPU feature checking is done *very* early, often before
> > main().
> For the linker based ones, yes. IIRC the ifunc() attribute is
> particularly useful here.

Exactly. ifunc was designed for this exact purpose. And hence the fact that 
CPUID initialisation will be done very, very early.

Anyway, if the AMX state is a sticky "set once per process", it's likely going 
to get set early for every process that *may* use AMX. And this is assuming we 
do the library right and only set it if it has AMX code at all, instead of all 
the time.

On the other hand, if it's not set once and for all, we'll have to contend 
with the size changing. TBH, this is a lot more complicated to deal with. Take 
the hypothetical example of a preemptive user-space task scheduler that 
interrupts an AMX routine (let's say for the sake of the argument that it is 
an on-stack signal; I don't see why a scheduler would need to be alt-stack). 
It will record the state and then transition to another routine. And this 
routine may be resumed in another thread of the same process.

Will the kernel understand that the new routine does not need the AMX state? 
Will it understand that the *other* routine, in the other thread will? If this 
is not done automatically by the kernel, then the task scheduler will need to 
know to ask the kernel what the reference count for the AMX state is and will 
need a syscall to set it (not just increment/decrement, though one could 
implement that with a loop).

This applies differently in the case of cooperative scheduling. The SysV ABI 
will probably say that the AMX state is caller-save, so the function call from 
the AMX-using routine implies all its state has been saved somewhere. But what 
about the kernel-side AMX refcount? Is that part of the ABI?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 17:23             ` Thiago Macieira
@ 2021-06-28 19:08               ` Peter Zijlstra
  2021-06-28 19:26                 ` Thiago Macieira
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2021-06-28 19:08 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 10:23:47AM -0700, Thiago Macieira wrote:
> On Monday, 28 June 2021 10:11:16 PDT Peter Zijlstra wrote:
> > > Consequence: CPU feature checking is done *very* early, often before
> > > main().
> > For the linker based ones, yes. IIRC the ifunc() attribute is
> > particularly useful here.
> 
> Exactly. ifunc was designed for this exact purpose. And hence the fact that 
> CPUID initialisation will be done very, very early.
> 
> Anyway, if the AMX state is a sticky "set once per process", it's likely going 
> to get set early for every process that *may* use AMX. And this is assuming we 
> do the library right and only set it if it has AMX code at all, instead of all 
> the time.

This, AFAIU. If the ifunc() resolver finds we haz AMX it can do the
prctl() and on success pick the AMX routine.

Assuming of course, that if a program links with a library that supports
AMX, it will actually end up using it.
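
A rough sketch of that "check, request, then pick" sequence. The interface
was still undecided at the time of this thread; the constants below assume
the arch_prctl() request that was later merged in Linux 5.16
(ARCH_REQ_XCOMP_PERM with the XTILEDATA feature number). pick_kernel() and
its return strings are hypothetical stand-ins for selecting a real routine.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <cpuid.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values, matching ARCH_REQ_XCOMP_PERM / XTILEDATA in Linux >= 5.16. */
#define REQ_XCOMP_PERM     0x1023
#define XFEATURE_TILEDATA  18

/* CPUID.7.0:EDX bit 24 is AMX-TILE. */
static bool cpu_has_amx_tile(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 24)) != 0;
}

/* Ask the kernel for permission; fails on older kernels or without AMX. */
static bool request_amx_permission(void)
{
    return syscall(SYS_arch_prctl, REQ_XCOMP_PERM, XFEATURE_TILEDATA) == 0;
}

/* Pick the AMX routine only if the CPU has it AND the kernel agreed. */
const char *pick_kernel(void)
{
    if (cpu_has_amx_tile() && request_amx_permission())
        return "amx";
    return "generic";
}
```

The same logic could sit in an ifunc resolver, with the caveat that calling
into libc from a resolver is not always safe that early.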

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 19:08               ` Peter Zijlstra
@ 2021-06-28 19:26                 ` Thiago Macieira
  0 siblings, 0 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-28 19:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Monday, 28 June 2021 12:08:16 PDT Peter Zijlstra wrote:
> > Anyway, if the AMX state is a sticky "set once per process", it's likely
> > going to get set early for every process that *may* use AMX. And this is
> > assuming we do the library right and only set it if it has AMX code at all,
> > instead of all the time.
> 
> This, AFAIU. If the ifunc() resolver finds we haz AMX it can do the
> prctl() and on success pick the AMX routine.
> 
> Assuming of course, that if a program links with a library that supports
> AMX, it will actually end up using it.

That's what I meant and I agree. If it has an AMX function for *anything*, it 
will do the arch_prctl() and enable the state, even if said function is never 
called.

This is the good case. The bad case is that it does the arch_prctl() before it 
sees whether there is any AMX function.

Do we expect that the dynamic loader will have this code? It currently 
searches the multiple ABI levels (up to x86-64-v4 to include AVX512) and HW 
capabilities. I can readily see AMX being one of the capabilities, if not an 
ABI level. Though it should be trivial for it to call the arch_prctl() if and 
only if it is about to load an ELF module that declares use of AMX and also 
*not* load it if the syscall fails.

$ LD_DEBUG=libs /lib64/ld-linux-x86-64.so.2 --inhibit-cache /bin/ls 
      1620:     find library=librt.so.1 [0]; searching
      1620:      search path=.....
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v4/librt.so.1
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v3/librt.so.1
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v2/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/librt.so.1
      1620:       trying file=/usr/lib64/tls/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/tls/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/librt.so.1
      1620:       trying file=/usr/lib64/haswell/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/haswell/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/haswell/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/haswell/librt.so.1
      1620:       trying file=/usr/lib64/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/librt.so.1

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 16:13         ` Thiago Macieira
  2021-06-28 17:11           ` Peter Zijlstra
@ 2021-06-28 17:43           ` Peter Zijlstra
  2021-06-28 19:05             ` Thiago Macieira
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2021-06-28 17:43 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 09:13:29AM -0700, Thiago Macieira wrote:

> Anyway, what's the current thinking on what the arch_prctl() should be? Is 
> that a per-thread state or will it affect the entire process group? And is it 
> a sticky functionality, or are we talking about ref/deref?

So I didn't follow the initial discussion too well; so I might be
getting this wrong. In which case I'm hoping Thomas and/or Andy will
correct me.

But I think the proposal was per process. Having this per thread would
be really unfortunate IMO.

> Maybe in order to answer that, we need to understand what the worst case 
> scenario we need to support is. What are they?
> 
> 1) alt-stack signal handlers, usually for crashing signals (to catch a stack 
> overflow)
> 
> 2) cooperative user-space task schedulers, e.g. coroutines
> 
> 3) preemptive user-space task schedulers (I don't know if such software exists 
> or even if it is possible)

I think it's been done; use sigsetmask()/pthread_sigmask() as 'IRQ'
disable, and run a preemption tick off of SIGALRM or something.
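
A minimal single-threaded sketch of that scheme, using sigprocmask() for the
'IRQ disable' part and an interval timer delivering SIGALRM as the tick. All
function names are hypothetical; a real scheduler would switch contexts from
the handler or at the next safe point instead of only setting a flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/* Set by the tick; a scheduler would test it at its safe points. */
static volatile sig_atomic_t need_resched;

static void on_tick(int sig)
{
    (void)sig;
    need_resched = 1;
}

/* Install the SIGALRM handler and start a periodic tick every usec us. */
int preempt_tick_start(long usec)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    it.it_interval.tv_sec = usec / 1000000;
    it.it_interval.tv_usec = usec % 1000000;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_REAL, &it, NULL);
}

/* 'IRQ disable': block the tick, remembering the previous mask. */
int preempt_disable(sigset_t *saved)
{
    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    return sigprocmask(SIG_BLOCK, &block, saved);
}

/* 'IRQ enable': restore the mask saved by preempt_disable(). */
int preempt_enable(const sigset_t *saved)
{
    return sigprocmask(SIG_SETMASK, saved, NULL);
}

int preempt_pending(void)
{
    return need_resched;
}
```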

> 4) combination of 1 and 3

None of those I think. The worst case is old executables using
MINSIGSTKSZ and not using the magic signal context at all, just regular
old signals. If you run them on an AVX512 enabled machine, they overflow
their stack and cause memory corruption.

AFAICT the only feasible way forward with that is some sysctl which
default disables AVX512 and add the prctl() and have some unsafe wrapper
that enables AVX512 for select 'legacy' programs for as long as they
exist :/

That is, binaries/libraries compiled against a static (and small)
MINSIGSTKSZ are the enemy. Which brings us to:

> 5) #4, in which each part comes from a separate library with no knowledge 
> of each other, and initialised concurrently in different threads

That's terrible... library init should *NEVER* spawn threads (I know,
don't start).

Anything that does this is basically unfixable, because we can't
guarantee the AMX prctl() gets done before the first thread.

So yes, worst case I suppose...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 17:43           ` Peter Zijlstra
@ 2021-06-28 19:05             ` Thiago Macieira
  0 siblings, 0 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-28 19:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Monday, 28 June 2021 10:43:47 PDT Peter Zijlstra wrote:
> None of those I think. The worst case is old executables using
> MINSIGSTKSZ and not using the magic signal context at all, just regular
> old signals. If you run them on an AVX512 enabled machine, they overflow
> their stack and cause memory corruption.

Indeed, they are already broken today. Retroactively fixing them 5 years later 
can be an additional goal, but it shouldn't stop us from having an AMX 
solution.

BTW, exactly MINSIGSTKSZ on an SKX overflows inside the kernel, so there's no 
memory corruption in userspace. The signal handler is not even invoked and the 
process is killed by a second signal delivery. But somewhere between 
MINSIGSTKSZ and SIGSTKSZ, it will invoke the signal handler and will overflow 
inside the user code. If that alt-stack wasn't designed with guard pages, it 
will corrupt memory, as you said.

> AFAICT the only feasible way forward with that is some sysctl which
> default disables AVX512 and add the prctl() and have some unsafe wrapper
> that enables AVX512 for select 'legacy' programs for as long as they
> exist :/
> 
> That is, binaries/libraries compiled against a static (and small)
> MINSIGSTKSZ are the enemy. Which brings us to:

> > 5) #4, in which each part comes from a separate library with no
> > knowledge of each other, and initialised concurrently in different
> > threads
> 
> That's terrible... library init should *NEVER* spawn threads (I know,
> don't start).

Indeed, but that wasn't even what I was suggesting. I was thinking that the 
application would have started threads and the library init was run on 
separate threads. This may happen with lazy initialisation on first use.

But threads aren't required for the problem to happen. If the crash-handling 
case runs first, before AMX state is enabled, it might decide that it doesn't 
need to allocate sufficient alt-stack space for AMX. I think the 
recommendation here for userspace is clear: allocate the maximum that XSAVE 
tells you that you'll need, regardless of what the ambient enabled feature set 
is.

[And pretty please set up guard pages. Given that the XSAVE state area for SPR 
appears to be too close to 12 kB (see below), I'd say they should mmap() 20 kB 
and then mprotect() the lowest page to PROT_NONE.]
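
A minimal sketch of such a guarded alt-stack on Linux. The function name
install_guarded_altstack() is hypothetical; with 4 kB pages and a 16 kB
usable size it maps 20 kB and makes the lowest page inaccessible, so an
overflow faults instead of silently corrupting adjacent memory.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map usable bytes (rounded up to pages) plus one guard page below them. */
int install_guarded_altstack(size_t usable)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t pg, total;
    void *base;
    stack_t ss;

    if (page <= 0)
        return -1;
    pg = (size_t)page;
    usable = (usable + pg - 1) / pg * pg;
    total = usable + pg;

    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Stacks grow down, so the guard is the lowest page of the mapping. */
    if (mprotect(base, pg, PROT_NONE) != 0) {
        munmap(base, total);
        return -1;
    }

    ss.ss_sp = (char *)base + pg;
    ss.ss_size = usable;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        munmap(base, total);
        return -1;
    }
    return 0;
}
```

The usable size would ideally come from the maximum XSAVE size rather than a
constant; 16 kB is only the figure implied by the 20 kB suggestion above.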

> Anything that does this is basically unfixable, because we can't
> guarantee the AMX prctl() gets done before the first thread.
> 
> So yes, worst case I suppose...

To wit: the worst case is a static, small alt-stack *without* guard pages 
(e.g., a malloc()ed or even static buffer) of sufficient size to let the 
kernel transition back to userspace but not for the userspace routine to run.

Any code using the old MINSIGSTKSZ (2048) will fail to run for AVX512 in the 
first place. There will be no data corruption, the crash handler will not run 
and the application will simply crash. An out-of-process core dump handler (if 
any, like systemd-coredump) will still get run.

Code using SIGSTKSZ (8192) will run for AVX512 and there'll be about 5 kB left 
to the user routine. So if it doesn't have too deep a call stack, it will not 
corrupt memory. And this code will not run for AMX state, because the stack is 
smaller than the XSAVE state (see below), making that the same case as the 
MINSIGSTKSZ case for AVX512 above.

It's possible someone would use something between those two values, but why? I 
expect that alt-stack handlers that use a constant value use either of the two 
constants or maybe some multiple of SIGSTKSZ, but nothing in-between them.

$ /opt/intel/sde-external-8.63.0-2021-01-18-lin/sde64 -spr -- cpuid -1 -l 0xd        
CPU:
   XSAVE features (0xd/0):
      XCR0 lower 32 bits valid bit field mask = 0x000600ff
      XCR0 upper 32 bits valid bit field mask = 0x00000000
         XCR0 supported: x87 state            = true
         XCR0 supported: SSE state            = true
         XCR0 supported: AVX state            = true
         XCR0 supported: MPX BNDREGS          = true
         XCR0 supported: MPX BNDCSR           = true
         XCR0 supported: AVX-512 opmask       = true
         XCR0 supported: AVX-512 ZMM_Hi256    = true
         XCR0 supported: AVX-512 Hi16_ZMM     = true
         IA32_XSS supported: PT state         = false
         XCR0 supported: PKRU state           = false
         XCR0 supported: CET_U state          = false
         XCR0 supported: CET_S state          = false
         IA32_XSS supported: HDC state        = false
         IA32_XSS supported: UINTR state      = false
         LBR supported                        = false
         IA32_XSS supported: HWP state        = false
         XTILECFG supported                   = true
         XTILEDATA supported                  = true
      bytes required by fields in XCR0        = 0x00002b00 (11008)
      bytes required by XSAVE/XRSTOR area     = 0x00002b00 (11008)
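
To read the mask above by hand: 0x000600ff has bits 0-7 set (x87 through
Hi16_ZMM) plus bits 17 and 18 (XTILECFG and XTILEDATA). A small sketch that
decodes it, with bit numbers per the Intel SDM and hypothetical helper names:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Name of an XCR0 state component, or NULL if this sketch does not know it. */
const char *xcr0_bit_name(unsigned bit)
{
    switch (bit) {
    case 0:  return "x87";
    case 1:  return "SSE";
    case 2:  return "AVX";
    case 3:  return "MPX BNDREGS";
    case 4:  return "MPX BNDCSR";
    case 5:  return "AVX-512 opmask";
    case 6:  return "AVX-512 ZMM_Hi256";
    case 7:  return "AVX-512 Hi16_ZMM";
    case 9:  return "PKRU";
    case 17: return "XTILECFG";
    case 18: return "XTILEDATA";
    default: return NULL;
    }
}

/* True if the given component bit is set in the mask. */
bool xcr0_has(uint64_t mask, unsigned bit)
{
    return bit < 64 && ((mask >> bit) & 1);
}

/* Number of state components present in the mask. */
unsigned xcr0_count(uint64_t mask)
{
    unsigned n = 0;
    for (; mask; mask >>= 1)
        n += (unsigned)(mask & 1);
    return n;
}
```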

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering

^ permalink raw reply	[flat|nested] 27+ messages in thread

[parent not found: <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net>]

* Re: x86 CPU features detection for applications (and AMX)
       [not found]       ` <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net>
@ 2021-06-30 14:34         ` Florian Weimer
       [not found]           ` <030f1462-2bf9-39bc-d620-6d9fbe454a27@metux.net>
  2021-06-30 15:29         ` Thiago Macieira
  1 sibling, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2021-06-30 14:34 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> OTOH, if one really needs to be independent of distros, one should build
> a complete nano-distro, where everything's installed under certain
> prefix, including libc. Isn't a big deal at all - we have plenty tools
> for that, daily practise in embedded world. The only difference would be
> tweaking the ld scripts to set a different ld.so path.

It breaks integration with system-wide settings, such as user/group
databases, host name lookup, and cryptographic policies.  In many
environments, that is not really an option.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 27+ messages in thread

[parent not found: <030f1462-2bf9-39bc-d620-6d9fbe454a27@metux.net>]

* Re: x86 CPU features detection for applications (and AMX)
       [not found]           ` <030f1462-2bf9-39bc-d620-6d9fbe454a27@metux.net>
@ 2021-06-30 15:38             ` Florian Weimer
       [not found]               ` <4ba30cb7-6854-0691-fad6-4ca9ce674ac2@metux.net>
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2021-06-30 15:38 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> On 30.06.21 16:34, Florian Weimer wrote:
>
>> It breaks integration with system-wide settings, such as user/group
>> databases, host name lookup, and cryptographic policies.  In many
>> environments, that is not really an option.
>
> Not necessarily, these can still be applied (and fairly simple).
> You actually have to twist more extra knobs if you wanted those weird
> things to happen.

Sorry, this is just not true.  You cannot load system libraries such as
NSS modules or cryptographic libraries with a custom glibc because the
system glibc could be newer, and glibc does not provide that kind of
compatibility (only the other way round).

Thanks,
Florian


^ permalink raw reply	[flat|nested] 27+ messages in thread

[parent not found: <4ba30cb7-6854-0691-fad6-4ca9ce674ac2@metux.net>]

* Re: x86 CPU features detection for applications (and AMX)
       [not found]               ` <4ba30cb7-6854-0691-fad6-4ca9ce674ac2@metux.net>
@ 2021-07-01  8:21                 ` Florian Weimer
       [not found]                   ` <034dcf9b-1f8c-23ee-86a6-791122bc0f8c@metux.net>
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2021-07-01  8:21 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> And I'm repeating my previous questions: can you name some actual real
> world (not hypothetical or academical) scenarios where:
>
> somebody really needs some binary-only application &&
> needs those extra modules *into that* application &&
> cannot recompile these modules into the applications's prefix &&
> needs AMX in that application &&
> cannot just use chroot &&
> cannot put it into container ?

There are no real-world scenarios yet which involve AMX, so I'm not sure
what you are after with this question.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 27+ messages in thread

[parent not found: <034dcf9b-1f8c-23ee-86a6-791122bc0f8c@metux.net>]

* Re: x86 CPU features detection for applications (and AMX)
       [not found]                   ` <034dcf9b-1f8c-23ee-86a6-791122bc0f8c@metux.net>
@ 2021-07-06 12:57                     ` Florian Weimer
  0 siblings, 0 replies; 27+ messages in thread
From: Florian Weimer @ 2021-07-06 12:57 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> On 01.07.21 10:21, Florian Weimer wrote:
>> * Enrico Weigelt:
>> 
>>> And I'm repeating my previous questions: can you name some actual real
>>> world (not hypothetical or academical) scenarios where:
>>>
>>> somebody really needs some binary-only application &&
>>> needs those extra modules *into that* application &&
>>> cannot recompile these modules into the applications's prefix &&
>>> needs AMX in that application &&
>>> cannot just use chroot &&
>>> cannot put it into container ?
>> There are no real-world scenarios yet which involve AMX, so I'm not
>> sure
>> what you are after with this question.
>
> Okay, let's take AMX out of the equation (until it actually arrives
> in the field). How does it look like then ?

We have customers that want to use name service switch (NSS) plugins in
proprietary software and who do not want to distribute the (GNU)
toolchain with their application.  The latter excludes
chroot/containers.  Some applications more or less require to run
directly on the host (e.g., if they have some system monitoring aspect).

Thanks,
Florian


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
       [not found]       ` <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net>
  2021-06-30 14:34         ` Florian Weimer
@ 2021-06-30 15:29         ` Thiago Macieira
  1 sibling, 0 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-30 15:29 UTC (permalink / raw)
  To: fweimer, Enrico Weigelt, metux IT consult
  Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Wednesday, 30 June 2021 07:32:29 PDT Enrico Weigelt, metux IT consult 
wrote:
> What does "buy-in" mean in that context ? Some other department ? Some
> external developers ?
> 
> I tend to believe that those things should better be done by some
> independent party, maybe GNU, FSF, etc, and cpu vendors should just
> sponsor this work and provide necessary specs.

For me specifically, I need to identify some SW dept. that would take on the 
responsibility for it long-term. Abandonware would serve no one.

I wouldn't mind if this were a collaborative project under the auspices of 
freedesktop.org or similar, so long as it is cross-platform to at least macOS, 
FreeBSD and Windows. But given that it is in Intel's interest for this 
library to exist and make it easy for people to use our CPU features, it 
seemed like a natural fit for Intel. And even if it isn't an Intel-owned 
project, we probably want to be contributors.

> Shipping precompiled binaries and linking against system libraries is
> always a risky game. The cleanest approach here IMHO would be building
> packages for various distros (means: using their toolchains / libs).
> This actually isn't as work intensive as it might sound - I'm doing this
> all the day and have a bunch of helpful tools for that.

I understand, but the fact that it is easier and better for 99% of cases does 
not mean it is so for 100%. And most especially, it does not guarantee that it 
will be used by everyone. For reasons real or not, there are precompiled 
binaries. Just see Google Chrome, for example.

> Licensing with glibc also isn't a serious problem here. All you need to
> do is be compliant with the LGPL. In short: publish all your patches to
> glibc, offer the license and link dynamically. Already done that a
> thousand times.

We can agree it's an additional hurdle, which will likely cause people to 
investigate a solution that doesn't require that hurdle.

> Wait a minute ... how long does it take from the architectural design,
> until the real silicon is out in the field ? I would be very surprised
> whether the whole process is done in a much shorter time frame.
> 
> Note: by "much more early", I meant already at the point where the spec
> of the new feature exists, at least on paper.

I'm not going to comment on the timing of architectural decisions. But just 
from the example I gave: in order to be ready for a late 2021 or early 2022 
launch, we'd need to have the feature's specification published and the 
patches accepted by December 2018. That's about 3 years lead time.

How many software projects (let alone mixed software and hardware) do you know 
that know 3 years ahead of time what they will need?

>  > Then there are Linux distros that do LTS every 2 years or so.
> 
> Why don't the few actually affected parties just upgrade their compiler
> on their build machines when needed ?

Have you tried?

Besides, the whole problem here is barrier of entry. If we don't make it easy 
for them to use the new features, they won't. And I was using this as an 
argument for why precompiled binaries will exist: the interested parties will 
take the pain to upgrade the compilers and other supporting software so that 
the build even of Open Source software is the most capable one, then release 
that binary for others who haven't. This lowers the barrier of entry 
significantly.

And this is all to justify that such a functionality shouldn't be part of 
glibc, where it can't be used by those precompiled binaries which, for one 
reason or another, will exist.

It should be in a small, permissively-licensed library that will often get 
statically linked into the binary in question.

> > To compile the software that uses those instructions, undoubtedly. But
> > what if I did that for you and you could simply download the binary for
> > the library and/or plugins such that you could slot into your existing
> > systems and CI? This could make a difference between adoption or not.
> 
> For me, it wouldn't, at all. I never download binaries from untrusted
> sources. (except for forensic analysis).

I understand and I am, myself, almost like you. I do have some precompiled 
binaries (aforementioned Google Chrome), but as a rule I avoid them.

But not everyone is like the two of us.

> BUT: we're talking about about brand new silicon here. Why should
> anybody - who really needs these new features - install such an ancient
> OS on a brand new machine ?

I don't know. It might be for fleet homogeneity: everything has the same SW 
installed, facilitating maintenance. Just coming up with reasons.

> > Even if they don't, the *software* that people deploy may be the same
> > build
> > for RHEL 7 and for a modern distro that will have a 5.14 kernel.
> 
> Now we're getting to the vital point: trying to make "universal"
> binaries for various different distros. This is something I've strictly
> advised against for 25 years, because with that you're putting
> yourself into *a lot* of trouble (ABI compatibility between arbitrary
> distros or even various distro releases always had been pretty much a
> myth, only works for some specific cases). Just don't do it, unless you
> *really* don't have any other chance.

Well, that's the point, isn't it? Are we ready to call this use-case not 
valid, so it can't be used to support the argument of a solution that needs to 
be deployable to old distros?

> > So my point is: this shouldn't be in glibc because the glibc will not have
> > the new system call wrappers or TLS fields.
> 
> Yes, I'm fully on your side here. Glibc already is overloaded with too
> much of those kind of things that shouldn't belong in there. Actually,
> even stuff like DNS resolving IMHO doesn't belong into libc.

Thanks.

(name resolving is required by POSIX to be there, so it exists in every 
system; might as well be every libc)

> My proposal would be a conditional jump opcode that directly checks for
> specific features. If this is well designed, I believe that can be
> resolved by the cpu's internal prefetcher unit. But for that we'd also
> need some extra task status bit so the cpu knows it is enabled for the
> current task.

That's more of a "can I use this now", instead of "can I use this ever". So 
far, the answer to the two has been the same. Therefore, there has been no 
need to have the functionality that you're describing.

> > For most features, there isn't. You don't see us discussing
> > AVX512VP2INTERSECT, for example. This discussion only exists because AMX
> > requires more state to be saved during context switches and signal
> > delivery.
> But over all these years, some new registers have been introduced.
> I fail to imagine how context switches can be done properly w/o also
> saving/restoring such new registers.

There have been a few small registers and state that need to be saved here and 
there, but the biggest blocks were:

- SSE state
- AVX state
- AVX512 state
- AMX state

The first two were small enough (and long enough ago) that the discussions 
were small and aren't relevant today. The AVX512 state was added in the past 
decade. And as you've seen from this thread, that is still a sticking point, 
and that was only about 1.5 kB.

However, the vast majority of CPU features do not add new context state.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-25 23:31 ` Thiago Macieira
       [not found]   ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>
@ 2021-07-08  7:08   ` Florian Weimer
  2021-07-08 15:13     ` Thiago Macieira
  1 sibling, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2021-07-08  7:08 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Thiago Macieira:

> On 23 Jun 2021 17:04:27 +0200, Florian Weimer wrote:
>> We have an interface in glibc to query CPU features:
>> X86-specific Facilities
>> <https://www.gnu.org/software/libc/manual/html_node/X86.html>
>>
>> CPU_FEATURE_USABLE all preconditions for a feature are met,
>> HAS_CPU_FEATURE means it's in silicon but possibly dormant.
>> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
>> enabling the relevant bit (so it cannot pass through any unknown bits).
>
> It's a nice initiative, but it doesn't help library and applications that need 
> to be either cross-platform or backwards compatible.
>
> The first problem is the cross-platformness need. Because we library and 
> application developers need to support other OSes, we'll need to deploy our 
> own CPUID-based detection. It's far better to use common code everywhere, 
> where one developer working on Linux can fix bugs in FreeBSD, macOS or Windows 
> or any of the permutations. Every platform-specific deviation adds to 
> maintenance requirements and is a source of potential latent bugs, now or in 
> the future due to refactoring. That is why doing everything in the form of 
> instructions would be far better and easier, rather than system calls.

I must say this is a rather application-specific view.  Sure, you get
consistency within the application across different targets, but for
those who work on multiple applications (but perhaps on a single
distribution/OS), things are very inconsistent.

And the reason why I started this is that CPUID-based feature detection
is dead anyway (assuming the kernel developers do not implement lazy
initialization of the AMX state).  CPUID (and ancillary data such as
XCR0) will say that AMX support is there, but it will not work unless
some (yet to be decided) steps are executed by the userspace thread.
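
To make the gap concrete, here is a minimal sketch of the classic
CPUID-plus-XCR0 check applied to AMX.  It assumes GCC or Clang on x86 (for
<cpuid.h> and inline assembly); the function and macro names are made up for
illustration.  The bit positions are the architectural ones: CPUID leaf 7,
subleaf 0, EDX bit 24 for AMX-TILE, and XCR0 bits 17 and 18 for the tile
configuration and tile data state.

```c
#include <cpuid.h>    /* GCC/Clang wrappers around the CPUID instruction */
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions; the macro names are illustrative only. */
#define CPUID1_ECX_OSXSAVE  (1u << 27)  /* OS has enabled XSAVE/XGETBV */
#define CPUID7_EDX_AMX_TILE (1u << 24)  /* AMX tile instructions in silicon */
#define XCR0_XTILECFG       (1ull << 17)
#define XCR0_XTILEDATA      (1ull << 18)

/* Read XCR0.  Only valid once OSXSAVE has been confirmed via CPUID. */
static uint64_t read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: true if CPUID and XCR0 both claim AMX tile support.
 * As described above, a "true" result does not by itself guarantee that
 * the calling thread may execute AMX instructions. */
bool amx_tile_reported(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & CPUID1_ECX_OSXSAVE))
        return false;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)
        || !(edx & CPUID7_EDX_AMX_TILE))
        return false;

    const uint64_t need = XCR0_XTILECFG | XCR0_XTILEDATA;
    return (read_xcr0() & need) == need;
}
```

For earlier extensions such as AVX, a check of this shape was the whole
story.  For AMX it can only say "possibly", which is the problem.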

While I consider the CPUID-based model a success (and the cross-OS
consistency may have contributed to that), its days seem to be over.

> [Unless said system calls were standardised and actually
> deployed. Making this a cross-platform library that is not part of
> libc would be a major step in that direction]

That won't help with AMX, as far as I can tell.

Thanks,
Florian


* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08  7:08   ` Florian Weimer
@ 2021-07-08 15:13     ` Thiago Macieira
  0 siblings, 0 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-07-08 15:13 UTC (permalink / raw)
  To: Florian Weimer; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Thursday, 8 July 2021 00:08:16 PDT Florian Weimer wrote:
> > The first problem is the cross-platformness need. Because we library and
> > application developers need to support other OSes, we'll need to deploy
> > our own CPUID-based detection. It's far better to use common code everywhere,
> > where one developer working on Linux can fix bugs in FreeBSD, macOS or
> > Windows or any of the permutations. Every platform-specific deviation
> > adds to maintenance requirements and is a source of potential latent
> > bugs, now or in the future due to refactoring. That is why doing
> > everything in the form of instructions would be far better and easier,
> > rather than system calls.
> I must say this is a rather application-specific view.  Sure, you get
> consistency within the application across different targets, but for
> those who work on multiple applications (but perhaps on a single
> distribution/OS), things are very inconsistent.

Why would they be inconsistent, if the library is cross-platform?

> And the reason why I started this is that CPUID-based feature detection
> is dead anyway (assuming the kernel developers do not implement lazy
> initialization of the AMX state).  CPUID (and ancillary data such as
> XCR0) will say that AMX support is there, but it will not work unless
> some (yet to be decided) steps are executed by the userspace thread.
> 
> While I consider the CPUID-based model a success (and the cross-OS
> consistency may have contributed to that), its days seem to be over.

Well, we need to design the API of this library such that we can accommodate 
the various possibilities. For every CPU feature, the library needs to be 
able to report one of three support states ("already enabled", "possible but 
not enabled" or "impossible") and to offer a call that enables the feature. 
The enabling call should be supported at least for the AVX512 and AMX states. 
On Linux, only AMX will be tristate, but on macOS we need the tristate for 
AVX512 too.

This library would then wrap all the necessary checking for OSXSAVE and XCR0, 
so the user doesn't need to worry about them or how the OS enables them, only 
the features they're interested in.
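
As a rough sketch of what that tristate could look like (all names here are 
hypothetical, not taken from any existing library), the classification step 
can be separated from the probing. The probing itself (CPUID, OSXSAVE, XCR0 
and whatever the OS offers for on-demand enabling) is assumed to happen 
elsewhere and is only represented by three booleans:

```c
#include <stdbool.h>

/* The three support states described above. */
enum feature_state {
    FEATURE_IMPOSSIBLE,  /* not in silicon, or the OS can never enable it */
    FEATURE_POSSIBLE,    /* present, but needs an explicit enable call */
    FEATURE_ENABLED      /* usable right now */
};

/* Hypothetical result of the platform-specific probing for one feature. */
struct feature_probe {
    bool in_silicon;         /* CPUID reports the feature */
    bool state_enabled;      /* OSXSAVE set and the XCR0 state bits are on */
    bool enable_on_request;  /* the OS offers a way to turn the state on */
};

/* Collapse the probe results into one of the three states. */
enum feature_state classify_feature(struct feature_probe p)
{
    if (!p.in_silicon)
        return FEATURE_IMPOSSIBLE;
    if (p.state_enabled)
        return FEATURE_ENABLED;
    return p.enable_on_request ? FEATURE_POSSIBLE : FEATURE_IMPOSSIBLE;
}
```

With this shape, a feature that the OS enables unconditionally never reports 
FEATURE_POSSIBLE, so callers that only care about "can I use it now" can 
compare against FEATURE_ENABLED and ignore the rest.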

Additionally, I'd like the library to also have constant expression paths that 
evaluate to constant true if the feature was already enabled at compile time 
(e.g., -march=x86-64-v3 sets __AVX2__ and __FMA__, so you can always run AVX2 
and FMA code, without checking). But that's just icing on top.
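
A minimal sketch of such a constant path, assuming GCC or Clang (the function 
name is made up; __builtin_cpu_supports is the compilers' own runtime check 
and merely stands in for whatever the library would do at run time):

```c
#include <stdbool.h>

/* Hypothetical query: folds to a constant when the compiler was already
 * told that AVX2 is available, and falls back to a runtime check otherwise. */
static inline bool can_use_avx2(void)
{
#if defined(__AVX2__)
    /* e.g. built with -march=x86-64-v3: no runtime check needed */
    return true;
#elif defined(__x86_64__) || defined(__i386__)
    /* GCC/Clang builtin; stands in for the library's runtime detection */
    return __builtin_cpu_supports("avx2") != 0;
#else
    /* not an x86 target at all */
    return false;
#endif
}
```

Callers write "if (can_use_avx2())" in both cases; when __AVX2__ is defined 
the compiler can drop the fallback branch entirely.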

(it won't come as a surprise that I already have code for most of this)

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
  2021-06-23 15:32 ` Dave Hansen
  2021-06-25 23:31 ` Thiago Macieira
@ 2021-07-08 17:56 ` Mark Brown
  2 siblings, 0 replies; 27+ messages in thread
From: Mark Brown @ 2021-07-08 17:56 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, Catalin Marinas,
	Will Deacon

On Wed, Jun 23, 2021 at 05:04:27PM +0200, Florian Weimer wrote:

Copying in Catalin & Will.

> We have an interface in glibc to query CPU features:

>   X86-specific Facilities
>   <https://www.gnu.org/software/libc/manual/html_node/X86.html>

> CPU_FEATURE_USABLE all preconditions for a feature are met,
> HAS_CPU_FEATURE means it's in silicon but possibly dormant.
> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
> enabling the relevant bit (so it cannot pass through any unknown bits).

...

> When we designed this glibc interface, we assumed that bits would be
> static during the life-time of the process, initialized at process
> start.  That follows the model of previous x86 CPU feature enablement.

...

> This still wouldn't cover the enable/disable side, but at least it would
> work for CPU features which are modal and come and go.  The fact that we
> tell GCC to cache the returned pointer from that internal function, but
> not that the data is immutable works to our advantage here.

> On the other hand, maybe there is a way to give users a better
> interface.  Obviously we want to avoid a syscall for a simple CPU
> feature check.  And we also need something to enable/disable CPU
> features.

This enabling and disabling of CPU features sounds like something that
might also become relevant for arm64, for example I can see a use case
for having something that allows some of the more expensive features
to be masked from some userspace processes for resource management
purposes.  This sounds like a bit of a different use case to x86 AIUI
but I think there's overlap in the actual operations that would be
needed.

Thread overview: 27+ messages
2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
2021-06-23 15:32 ` Dave Hansen
2021-07-08  6:05   ` Florian Weimer
2021-07-08 14:19     ` Dave Hansen
2021-07-08 14:31       ` Florian Weimer
2021-07-08 14:36         ` Dave Hansen
2021-07-08 14:41           ` Florian Weimer
2021-06-25 23:31 ` Thiago Macieira
     [not found]   ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>
2021-06-28 13:20     ` Peter Zijlstra
     [not found]       ` <534d0171-2cc5-cd0a-904f-cd3c499b55af@metux.net>
2021-06-30 15:36         ` Thiago Macieira
2021-06-28 15:08     ` Thiago Macieira
2021-06-28 15:27       ` Peter Zijlstra
2021-06-28 16:13         ` Thiago Macieira
2021-06-28 17:11           ` Peter Zijlstra
2021-06-28 17:23             ` Thiago Macieira
2021-06-28 19:08               ` Peter Zijlstra
2021-06-28 19:26                 ` Thiago Macieira
2021-06-28 17:43           ` Peter Zijlstra
2021-06-28 19:05             ` Thiago Macieira
     [not found]       ` <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net>
2021-06-30 14:34         ` Florian Weimer
     [not found]           ` <030f1462-2bf9-39bc-d620-6d9fbe454a27@metux.net>
2021-06-30 15:38             ` Florian Weimer
     [not found]               ` <4ba30cb7-6854-0691-fad6-4ca9ce674ac2@metux.net>
2021-07-01  8:21                 ` Florian Weimer
     [not found]                   ` <034dcf9b-1f8c-23ee-86a6-791122bc0f8c@metux.net>
2021-07-06 12:57                     ` Florian Weimer
2021-06-30 15:29         ` Thiago Macieira
2021-07-08  7:08   ` Florian Weimer
2021-07-08 15:13     ` Thiago Macieira
2021-07-08 17:56 ` Mark Brown
