* x86 CPU features detection for applications (and AMX)
@ 2021-06-23 15:04 Florian Weimer
  2021-06-23 15:32 ` Dave Hansen
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Florian Weimer @ 2021-06-23 15:04 UTC (permalink / raw)
  To: libc-alpha, linux-api, x86, linux-arch

We have an interface in glibc to query CPU features:

  X86-specific Facilities
  <https://www.gnu.org/software/libc/manual/html_node/X86.html>

CPU_FEATURE_USABLE means all preconditions for a feature are met;
HAS_CPU_FEATURE means it's in silicon but possibly dormant.
CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
enabling the relevant bit (so it cannot pass through any unknown bits).

It turns out we screwed up in the glibc 2.33 release: the absolutely
required headers weren't actually installed:

  [PATCH] x86: Install <bits/platform/x86.h> [BZ #27958]
  <https://sourceware.org/pipermail/libc-alpha/2021-June/127215.html>

Given that the magic constants aren't available in any other way, this
feature was completely unusable, so we can perhaps revisit it and switch
to a different approach.

Previously kernel developers have expressed dismay that we didn't
coordinate the interface with them. This is why I want to raise this now.

When we designed this glibc interface, we assumed that bits would be
static during the lifetime of the process, initialized at process
start. That follows the model of previous x86 CPU feature enablement.

In the background, CPU_FEATURE_USABLE/HAS_CPU_FEATURE calls a function
which returns a pointer to eight 32-bit words, based on the index passed
to the function (out-of-range indices return a pointer to zeros,
enabling forward compatibility). The macros then use a magic constant
that encodes the lookup index and which of those 128 bits to extract,
plus the feature/usable choice.

This means that we *could* keep this interface unchanged if the kernel
gives us a way to read up-to-date feature state from a 256-bit area (or
at least a 32-bit word) in thread-specific data, similar to what we have
with set_robust_list and rseq today. This still wouldn't cover the
enable/disable side, but at least it would work for CPU features which
are modal and come and go. The fact that we tell GCC to cache the
returned pointer from that internal function, but not that the data is
immutable, works to our advantage here.

On the other hand, maybe there is a way to give users a better
interface. Obviously we want to avoid a syscall for a simple CPU
feature check. And we also need something to enable/disable CPU
features.

Thanks,
Florian

PS: Is it true that there is no public mailing list for Linux
discussions specific to x86?

^ permalink raw reply	[flat|nested] 27+ messages in thread
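A minimal sketch of the lookup scheme described in the message above,
under stated assumptions: every name here (feature_leaf,
get_feature_leaf, FEATURE_ID, the SKETCH_* macros) is hypothetical and
is not glibc's actual implementation. It only illustrates a leaf of
eight 32-bit words, a zero leaf for out-of-range indices, and a constant
that packs the leaf index together with one of 128 bit positions.

```c
#include <stdint.h>

/* Hypothetical layout: eight 32-bit words per leaf. The first four hold
   "present in silicon" bits, the last four hold "usable" bits. */
struct feature_leaf {
    uint32_t present[4];
    uint32_t usable[4];
};

#define LEAF_COUNT 2u   /* leaves this hypothetical library knows about */

static const struct feature_leaf all_zero_leaf = { {0}, {0} };
static struct feature_leaf leaves[LEAF_COUNT];   /* filled at startup */

/* Out-of-range indices yield a pointer to zeros, so a program built
   against a newer header just sees "not available" on an older library. */
static const struct feature_leaf *get_feature_leaf(unsigned int index)
{
    return index < LEAF_COUNT ? &leaves[index] : &all_zero_leaf;
}

/* The constant packs the leaf index (upper bits) and the bit number
   0..127 (low 7 bits). */
#define FEATURE_ID(leaf, bit)  (((leaf) << 7) | (bit))

static int feature_bit(unsigned int id, int want_usable)
{
    const struct feature_leaf *l = get_feature_leaf(id >> 7);
    const uint32_t *words = want_usable ? l->usable : l->present;
    unsigned int bit = id & 127u;

    return (int) ((words[bit / 32] >> (bit % 32)) & 1u);
}

#define SKETCH_HAS_FEATURE(id)     feature_bit((id), 0)
#define SKETCH_FEATURE_USABLE(id)  feature_bit((id), 1)
```

Because the caller only ever sees a pointer plus a bit test, the words
behind the pointer could in principle be updated later without changing
the macro interface, which is the property the message relies on.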
* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
@ 2021-06-23 15:32 ` Dave Hansen
  2021-07-08  6:05   ` Florian Weimer
  2021-06-25 23:31 ` Thiago Macieira
  2021-07-08 17:56 ` Mark Brown
  2 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2021-06-23 15:32 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha, linux-api, x86, linux-arch

On 6/23/21 8:04 AM, Florian Weimer wrote:
> https://www.gnu.org/software/libc/manual/html_node/X86.html
...
> Previously kernel developers have expressed dismay that we didn't
> coordinate the interface with them. This is why I want to raise this now.

This looks basically like someone dumped a bunch of CPUID bit values and
exposed them to applications without considering whether applications
would ever need them. For instance, why would an app ever care about:

	PKS – Protection keys for supervisor-mode pages.

And how could glibc ever give applications accurate information about
whether PKS "is supported by the operating system"? It just plain
doesn't know, or at least only knows from a really weak ABI like
/proc/cpuinfo.

It also doesn't seem to tell applications what they want, which is, "can
I, the application, *use* this feature?"

> PS: Is it true that there is no public mailing list for Linux
> discussions specific to x86?

Yes. I asked recently for something x86-related, but folks were
concerned that what I was asking for was too specific: it was more of a
brainstorming place to put x86-specific RFCs.

	https://subspace.kernel.org/lists.linux.dev.html

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:32 ` Dave Hansen
@ 2021-07-08  6:05   ` Florian Weimer
  2021-07-08 14:19     ` Dave Hansen
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2021-07-08 6:05 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> On 6/23/21 8:04 AM, Florian Weimer wrote:
>> https://www.gnu.org/software/libc/manual/html_node/X86.html
> ...
>> Previously kernel developers have expressed dismay that we didn't
>> coordinate the interface with them. This is why I want to raise this now.
>
> This looks basically like someone dumped a bunch of CPUID bit values and
> exposed them to applications without considering whether applications
> would ever need them. For instance, why would an app ever care about:
>
> 	PKS – Protection keys for supervisor-mode pages.
>
> And how could glibc ever give applications accurate information about
> whether PKS "is supported by the operating system"? It just plain
> doesn't know, or at least only knows from a really weak ABI like
> /proc/cpuinfo.

glibc is expected to mask these bits for CPU_FEATURE_USABLE because they
have unknown semantics (to glibc). They are still exposed via
HAS_CPU_FEATURE.

I argued against HAS_CPU_FEATURE because the mere presence of this
interface will introduce application bugs: applications really must use
CPU_FEATURE_USABLE instead.

I wanted to go with a curated set of bits, but we couldn't get consensus
around that. Curiously, the present interface can expose changing CPU
state (if the kernel updates some fixed memory region accordingly); my
preferred interface would not have supported that.

> It also doesn't seem to tell applications what they want which is, "can
> I, the application, *use* this feature?"

CPU_FEATURE_USABLE is supposed to be that interface.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: x86 CPU features detection for applications (and AMX) 2021-07-08 6:05 ` Florian Weimer @ 2021-07-08 14:19 ` Dave Hansen 2021-07-08 14:31 ` Florian Weimer 0 siblings, 1 reply; 27+ messages in thread From: Dave Hansen @ 2021-07-08 14:19 UTC (permalink / raw) To: Florian Weimer Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel On 7/7/21 11:05 PM, Florian Weimer wrote: >> This looks basically like someone dumped a bunch of CPUID bit values and >> exposed them to applications without considering whether applications >> would ever need them. For instance, why would an app ever care about: >> >> PKS – Protection keys for supervisor-mode pages. >> >> And how could glibc ever give applications accurate information about >> whether PKS "is supported by the operating system"? It just plain >> doesn't know, or at least only knows from a really weak ABI like >> /proc/cpuinfo. > glibc is expected to mask these bits for CPU_FEATURE_USABLE because they > have unknown semantics (to glibc). OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS supported in the operating system, I'll get false from an interface that claims to be: > This macro returns a nonzero value (true) if the processor has the > feature name and the feature is supported by the operating system. The interface just seems buggy by *design*. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08 14:19       ` Dave Hansen
@ 2021-07-08 14:31         ` Florian Weimer
  2021-07-08 14:36           ` Dave Hansen
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2021-07-08 14:31 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> On 7/7/21 11:05 PM, Florian Weimer wrote:
>>> This looks basically like someone dumped a bunch of CPUID bit values and
>>> exposed them to applications without considering whether applications
>>> would ever need them. For instance, why would an app ever care about:
>>>
>>> 	PKS – Protection keys for supervisor-mode pages.
>>>
>>> And how could glibc ever give applications accurate information about
>>> whether PKS "is supported by the operating system"? It just plain
>>> doesn't know, or at least only knows from a really weak ABI like
>>> /proc/cpuinfo.
>> glibc is expected to mask these bits for CPU_FEATURE_USABLE because they
>> have unknown semantics (to glibc).
>
> OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS
> supported in the operating system, I'll get false from an interface that
> claims to be:
>
>> This macro returns a nonzero value (true) if the processor has the
>> feature name and the feature is supported by the operating system.
>
> The interface just seems buggy by *design*.

Yes, but that is largely a documentation matter. We should have said
something about “userspace” there, and that the bit needs to be known to
glibc. There is another exception: FSGSBASE, and that's a real bug we
need to fix (it has to go through AT_HWCAP2).

If we want to avoid that, we need to go down the road of a curated set
of CPUID bits, where a bit only exists if we have taught glibc its
semantics. You still might get a false negative by running against an
older glibc than the application was built for. (We are not going to
force applications that e.g. look for FSGSBASE to run only with a glibc
that is at least the version which implemented semantics for the
FSGSBASE bit.)

Thanks,
Florian

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: x86 CPU features detection for applications (and AMX) 2021-07-08 14:31 ` Florian Weimer @ 2021-07-08 14:36 ` Dave Hansen 2021-07-08 14:41 ` Florian Weimer 0 siblings, 1 reply; 27+ messages in thread From: Dave Hansen @ 2021-07-08 14:36 UTC (permalink / raw) To: Florian Weimer Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel On 7/8/21 7:31 AM, Florian Weimer wrote: >> OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS >> supported in the operating system, I'll get false from an interface that >> claims to be: >> >>> This macro returns a nonzero value (true) if the processor has the >>> feature name and the feature is supported by the operating system. >> The interface just seems buggy by *design*. > Yes, but that is largely a documentation matter. We should have said > something about “userspace” there, and that the bit needs to be known to > glibc. There is another exception: FSGSBASE, and that's a real bug we > need to fix (it has to go through AT_HWCAP2). > > If we want to avoid that, we need to go down the road of a curated set > of CPUID bits, where a bit only exists if we have taught glibc its > semantics. You still might get a false negative by running against an > older glibc than the application was built for. (We are not going to > force applications that e.g. look for FSGSBASE only run with a glibc > that is at least of that version which implemented semantics for the > FSGSBASE bit.) That's kinda my whole point. These *MUST* be curated to be meaningful. Right now, someone just dumped a set of CPUID bits into the documentation. The interface really needs *three* modes: 1. Yes, the CPU/OS supports this feature 2. No, the CPU/OS doesn't support this feature 3. Hell if I know, never heard of this feature The interface really conflates 2 and 3. To me, that makes it fundamentally flawed. ^ permalink raw reply [flat|nested] 27+ messages in thread
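A minimal sketch of the three-mode query proposed in the message above,
assuming a library that keeps, next to each word of feature bits, a mask
of the bits whose semantics it has been taught. The type and function
names (feature_state, feature_word, query_feature) are hypothetical and
do not correspond to any existing glibc interface.

```c
#include <stdint.h>

enum feature_state {
    FEATURE_UNKNOWN = -1,  /* 3. the library has never heard of this bit */
    FEATURE_NO      = 0,   /* 2. known, but the CPU/OS does not support it */
    FEATURE_YES     = 1    /* 1. known, and the CPU/OS supports it */
};

/* Hypothetical per-word tables: 'known' has a bit set for every feature
   the library understands; 'usable' holds the curated result. */
struct feature_word {
    uint32_t known;
    uint32_t usable;
};

static enum feature_state
query_feature(const struct feature_word *w, unsigned int bit)
{
    uint32_t mask;

    if (bit >= 32)
        return FEATURE_UNKNOWN;
    mask = UINT32_C(1) << bit;
    if (!(w->known & mask))
        return FEATURE_UNKNOWN;   /* no longer conflated with "no" */
    return (w->usable & mask) ? FEATURE_YES : FEATURE_NO;
}
```

With a separate "unknown" result, an application running against an
older library can tell a missing feature from a bit the library simply
does not understand, and can fall back to its own detection in the
latter case.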
* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08 14:36           ` Dave Hansen
@ 2021-07-08 14:41             ` Florian Weimer
  0 siblings, 0 replies; 27+ messages in thread
From: Florian Weimer @ 2021-07-08 14:41 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> That's kinda my whole point.
>
> These *MUST* be curated to be meaningful. Right now, someone just
> dumped a set of CPUID bits into the documentation.
>
> The interface really needs *three* modes:
>
> 1. Yes, the CPU/OS supports this feature
> 2. No, the CPU/OS doesn't support this feature
> 3. Hell if I know, never heard of this feature
>
> The interface really conflates 2 and 3. To me, that makes it
> fundamentally flawed.

That's an interesting point. 3 looks potentially more useful than the
feature/usable distinction to me.

The recent RTM change suggests that there are more states, but we
probably can't do much about such soft-disable changes.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
  2021-06-23 15:32 ` Dave Hansen
@ 2021-06-25 23:31 ` Thiago Macieira
       [not found]   ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>
  2021-07-08  7:08   ` Florian Weimer
  2021-07-08 17:56 ` Mark Brown
  2 siblings, 2 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-25 23:31 UTC (permalink / raw)
  To: fweimer; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 23 Jun 2021 17:04:27 +0200, Florian Weimer wrote:
> We have an interface in glibc to query CPU features:
> X86-specific Facilities
> <https://www.gnu.org/software/libc/manual/html_node/X86.html>
>
> CPU_FEATURE_USABLE means all preconditions for a feature are met;
> HAS_CPU_FEATURE means it's in silicon but possibly dormant.
> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
> enabling the relevant bit (so it cannot pass through any unknown bits).

It's a nice initiative, but it doesn't help libraries and applications
that need to be either cross-platform or backwards compatible.

The first problem is the need to be cross-platform. Because we library
and application developers need to support other OSes, we'll need to
deploy our own CPUID-based detection. It's far better to use common code
everywhere, where one developer working on Linux can fix bugs in
FreeBSD, macOS or Windows or any of the permutations. Every
platform-specific deviation adds to maintenance requirements and is a
source of potential latent bugs, now or in the future due to
refactoring.

That is why doing everything in the form of instructions would be far
better and easier, rather than system calls. [Unless said system calls
were standardised and actually deployed. Making this a cross-platform
library that is not part of libc would be a major step in that
direction]

The second problem is going to be backwards compatibility. Applications
and libraries may want to ship precompiled binaries that make use of the
new CPU features, whether they are open source or not. It comes as no
surprise to anyone that we CPU makers will have made software that uses
those features and want to have it ready on Day 1 of the HW being
available for the market (if we're doing our jobs right). That often
involves precompiling because everyone who installed their compilers
more than one year ago will not have the necessary tools to build.

That runs counter to the need to use a libc interface that didn't exist
until recently. And by "recently", I mean "anything since the glibc that
came with Red Hat Enterprise Linux 7" (2.17). So no, application and
library developers will not use libc functions they don't need to,
especially if it adds to their problems, unless there's no way around
it.

> Previously kernel developers have expressed dismay that we didn't
> coordinate the interface with them. This is why I want to raise this now.

You also need to coordinate with your users. A platform-specific API to
solve a problem that is already solved is "knock yourself out, we're not
going to use this."

So my first suggestion is to remove the "platform-specific" part and
make this a cross-platform solution.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering

^ permalink raw reply	[flat|nested] 27+ messages in thread
[parent not found: <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>]
* Re: x86 CPU features detection for applications (and AMX) [not found] ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net> @ 2021-06-28 13:20 ` Peter Zijlstra [not found] ` <534d0171-2cc5-cd0a-904f-cd3c499b55af@metux.net> 2021-06-28 15:08 ` Thiago Macieira 1 sibling, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2021-06-28 13:20 UTC (permalink / raw) To: Enrico Weigelt, metux IT consult Cc: Thiago Macieira, fweimer, hjl.tools, libc-alpha, linux-api, linux-arch, x86 On Mon, Jun 28, 2021 at 02:40:32PM +0200, Enrico Weigelt, metux IT consult wrote: > Going back to AMX - just had a quick look at the spec (*1). Sorry, but > this thing is really weird and horrible to use. Come on, these chips > already have billions of transistors, it really can't hurt so much > spending a few more to provide a clean and easy to use machine code > interface. Grmmpf! (This is a general problem we've got with so many > HW folks, why can't them just talk to us SW folks first so we can find > a good solution for both sides, before that goes into the field ?) > > And one point that immediately jumps into my mind (w/o looking deeper > into it): it introduces completely new registers - do we now need extra > code for tasks switching etc ? No, but because it's register state and part of XSAVE, it has immediate impact in ABI. In particular, the signal stack layout includes XSAVE (as does ptrace()). At the same time, 'legacy' applications (up until _very_ recently) had a minimum signal stack size of 2K, which is already violated by the addition of AVX512 (there's actual breakage due to that). Adding the insane AMX state (8k+) into that is a complete trainwreck waiting to happen. Not to mention that having !INIT AMX state has direct consequences for P-state selection and thus performance. For these reasons, us OS folks, will mandate you get to do a prctl() to request/release AMX (and we get to say: no). 
If you use AMX without this, the instruction will fault (because the AMX
state is not set in XCR0) and we'll SIGBUS or something.

Userspace will have to do something like:

 - check CPUID, if !AMX -> fail
 - issue prctl(), if error -> fail
 - issue XGETBV and check the AMX bit is set, if not -> fail
 - request the signal stack size / spawn threads
 - use AMX

Spawning threads prior to enabling AMX will result in using the wrong
signal stack size and result in malfunction, you get to keep the pieces.

^ permalink raw reply	[flat|nested] 27+ messages in thread
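The first three steps listed above can be sketched as follows. This is
not the interface as it stood when the message was written: the sketch
assumes the arch_prctl() request code and feature number that were later
merged in Linux 5.16 (ARCH_REQ_XCOMP_PERM = 0x1023, XTILEDATA = 18),
defined here under SKETCH_* names so the code builds against older
headers. It also assumes x86-64 Linux with GCC or Clang. The function
name try_enable_amx is hypothetical.

```c
#define _GNU_SOURCE
#include <cpuid.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed values from the interface merged after this thread. */
#define SKETCH_ARCH_REQ_XCOMP_PERM 0x1023
#define SKETCH_XFEATURE_XTILEDATA  18

static uint64_t read_xcr0(void)
{
    uint32_t lo, hi;

    __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));
    return ((uint64_t) hi << 32) | lo;
}

/* Returns 1 if AMX tile state may be used by this process, 0 otherwise. */
static int try_enable_amx(void)
{
    unsigned int a, b, c, d;

    /* Step 1: CPUID.(EAX=7,ECX=0):EDX bit 24 is AMX-TILE. */
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d) || !(d & (1u << 24)))
        return 0;

    /* Step 2: ask the kernel for permission; it is allowed to say no. */
    if (syscall(SYS_arch_prctl, SKETCH_ARCH_REQ_XCOMP_PERM,
                SKETCH_XFEATURE_XTILEDATA) != 0)
        return 0;

    /* Step 3: XGETBV requires OSXSAVE (CPUID.1:ECX bit 27); then check
       that XCR0 bits 17 (tile config) and 18 (tile data) are both set. */
    if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & (1u << 27)))
        return 0;
    if ((read_xcr0() & (3ull << 17)) != (3ull << 17))
        return 0;

    return 1;
}
```

Per the ordering in the list, a caller would run this before sizing
signal stacks and before spawning threads, and would take a non-AMX code
path whenever it returns 0.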
[parent not found: <534d0171-2cc5-cd0a-904f-cd3c499b55af@metux.net>]
* Re: x86 CPU features detection for applications (and AMX)
       [not found]       ` <534d0171-2cc5-cd0a-904f-cd3c499b55af@metux.net>
@ 2021-06-30 15:36         ` Thiago Macieira
  0 siblings, 0 replies; 27+ messages in thread
From: Thiago Macieira @ 2021-06-30 15:36 UTC (permalink / raw)
  To: Peter Zijlstra, Enrico Weigelt, metux IT consult
  Cc: fweimer, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Wednesday, 30 June 2021 05:50:30 PDT Enrico Weigelt, metux IT consult wrote:
> > No, but because it's register state and part of XSAVE, it has immediate
> > impact in ABI. In particular, the signal stack layout includes XSAVE (as
> > does ptrace()).
>
> OMGs, I've already suspected such sickness. I don't even dare thinking
> about consequences for compilers and library ABIs.
>
> Does anyone here know why they designed this as inline operations ? This
> thing seems to be pretty much what typical TPUs are doing (or a subset
> of it). Why not just adding a TPU next to the CPU on the same chip ?

To be clear: this is a SW ABI. It has nothing to do with the presence or
absence of other processing units in the system. The moment you receive
a Unix signal with SA_SIGINFO, the mcontext state needs to be saved
somewhere. Where would you save it?

Please remember that:
 - signal handlers can be called at any point in the execution, including
   in the middle of malloc()
 - signal handlers can longjmp out of the handler back into non-handler
   code
 - in a multithreaded application, each thread can be handling a signal
   simultaneously

We could have the kernel hold on to that and have a system call to
extract them, but that's an ABI change and I think it won't work for the
longjmp case.

> > Userspace will have to do something like:
> > - check CPUID, if !AMX -> fail
> > - issue prctl(), if error -> fail
> > - issue XGETBV and check the AMX bit is set, if not -> fail
>
> Can't we do this just by prctl() call ?
> IOW: ask the kernel, who gonna say yes or no.

That's possible.
The kernel can't enable an AMX state on a system without AMX. > Are there any situations where kernel says yes, but process still can't > use it ? Why so ? Today there is no such case that I can think of. > > - request the signal stack size / spawn threads > > Signal stack is separate from the usual stack, right ? > Why can't this all be done in one shot ? Yes, we're talking about the sigaltstack() call. What is "this all" in the sentence above? -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel DPG Cloud Engineering ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: x86 CPU features detection for applications (and AMX) [not found] ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net> 2021-06-28 13:20 ` Peter Zijlstra @ 2021-06-28 15:08 ` Thiago Macieira 2021-06-28 15:27 ` Peter Zijlstra [not found] ` <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net> 1 sibling, 2 replies; 27+ messages in thread From: Thiago Macieira @ 2021-06-28 15:08 UTC (permalink / raw) To: fweimer, Enrico Weigelt, metux IT consult Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86 On Monday, 28 June 2021 05:40:32 PDT Enrico Weigelt, metux IT consult wrote: > > The first problem is the cross-platformness need. Because we library and > > application developers need to support other OSes, we'll need to deploy > > our > > own CPUID-based detection. It's far better to use common code everywhere, > > where one developer working on Linux can fix bugs in FreeBSD, macOS or > > Windows or any of the permutations. Every platform-specific deviation > > adds to maintenance requirements and is a source of potential latent > > bugs, now or in the future due to refactoring. That is why doing > > everything in the form of instructions would be far better and easier, > > rather than system calls. > hmm, maybe some libcpuid ? Indeed. I'm querying inside Intel to see if I can get buy-in to create such a library. > > The second problem is going to be backwards compatibility. Applications > > and > > libraries may want to ship precompiled binaries that make use of the new > > CPU features, whether they are open source or not. > > Since we're talking about GNU libc here, binary-only stuff is probably > out of scope here. OTOH, using differnt libc versions in those special > cases isn't such a big deal. Shipping a libc is not trivial, either technically or due to licensing requirements. Most applications want to link against whatever libc the system already provides, if that's possible. 
> > It comes as no surprise to anyone that we CPU makers will have made
> > software that uses those features and want to have it ready on Day 1
> > of the HW being available for the market (if we're doing our jobs
> > right). That often involves precompiling because everyone who
> > installed their compilers more than one year ago will not have the
> > necessary tools to build.
>
> Actually, you should talk to the compiler folks much more early, at the
> point where you know how those features look like.

We do, but it's not enough.

GCC releases once a year, so it's easy to miss the feature freeze. Then
there are Linux distros that do LTS every 2 years or so. Worse, those
two are usually out of phase. For example, if you're using the current
Ubuntu LTS today (almost July 2021), you're using 20.04, which was
released one month before the GCC 10 release. So you're using GCC 9,
released May 2019, which means its features were frozen in December
2018. That's an incredibly long lead time.

As a consequence, you will see precompiled binaries.

> For using certain new CPU specific features, the need for a compiler
> upgrade really should be no excuse. And at least for vast majority of
> cases, a proper compiler could do it much better than the average
> programmer.

To compile the software that uses those instructions, undoubtedly. But
what if I did that for you and you could simply download the binary for
the library and/or plugins such that you could slot it into your existing
systems and CI? This could make the difference between adoption or not.

> > And by "recently", I mean "anything since the glibc that came with Red Hat
> > Enterprise Linux 7" (2.17).
>
> Uh, that's really ancient. Nobody can seriously expect modern features
> on such an ancient distro. If people really insist spending so much
> money for running such old code, instead of just doing a dist upgrade,
> then I can only reply with "not our problem".

Yes and no.
Red Hat has been incredibly successful in backporting kernel features to
the old 3.10 that came with RHEL 7. Whether they will do that for AMX
state saving and the system call that we're discussing here, I can't
say. AFAIU, they did backport the AVX512 state-saving to that 3.10, so
they may.

Even if they don't, the *software* that people deploy may be the same
build for RHEL 7 and for a modern distro that will have a 5.14 kernel.
That software may have non-AVX, AVX2, AVX512 and AMX-specific code paths
and would do runtime detection of which one is best to use. If a system
call is needed, the system call needs to be issued even on that 3.10 and
if it responds with -ENOSYS or -EINVAL, then it will fall back to the
next best option.

So my point is: this shouldn't be in glibc because the glibc will not
have the new system call wrappers or TLS fields.

> What we SW engineers need is an easy and fast method to act depending on
> whether some CPU supports some feature (eg. a new opcode). Things like
> cpuinfo are only a tiny piece of that. What we could really use is a
> conditional jump/call based on whether feature X is supported - without
> any kernel intervention. Then the machine code could be easily laid out
> to support both cases with or without some feature X. Alternatively we
> could have a fast trapping in userland - hw generated call already would
> be a big help.

That's what cpuid is for. With GCC function multi-versioning or
equivalent manually-rolled solutions, you can get exactly what you're
asking for.

Yes, the checking became far more complex with the need to check XCR0
after AVX came along, but since the instruction itself is slow and
serialising, any library will just cache the results. And as a result,
the level of CPU features is not expected to change. It never has in the
past, so this hasn't been an issue.
> If we had something in that direction, we wouldn't have to have this > kind discussion here anymore - it would be entirely up to compiler and > library folks, no need for any kernel support at all. For most features, there isn't. You don't see us discussing AVX512VP2INTERSECT, for example. This discussion only exists because AMX requires more state to be saved during context switches and signal delivery. See Peter's email. > And one point that immediately jumps into my mind (w/o looking deeper > into it): it introduces completely new registers - do we now need extra > code for tasks switching etc ? Yes, this is the crux of this discussion. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel DPG Cloud Engineering ^ permalink raw reply [flat|nested] 27+ messages in thread
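A minimal sketch of the "manually-rolled" dispatch mentioned in the
message above, assuming GCC or Clang on x86: the feature check runs once
and its result is cached in a function pointer. The function names
(sum, sum_generic, sum_avx2, sum_resolve) are hypothetical, and
sum_avx2 is only a stand-in for a hand-optimized variant.
__builtin_cpu_init and __builtin_cpu_supports are compiler builtins.

```c
#include <stddef.h>

static long sum_generic(const int *v, size_t n)
{
    long s = 0;

    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for an optimized variant; a real one would use AVX2
   intrinsics and be built with the matching target attribute. */
static long sum_avx2(const int *v, size_t n)
{
    return sum_generic(v, n);
}

static long sum_resolve(const int *v, size_t n);

/* Starts out pointing at the resolver; after the first call it points
   directly at the selected implementation. */
static long (*sum_impl)(const int *, size_t) = sum_resolve;

static long sum_resolve(const int *v, size_t n)
{
    __builtin_cpu_init();
    sum_impl = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_generic;
    return sum_impl(v, n);
}

static long sum(const int *v, size_t n)
{
    return sum_impl(v, n);
}
```

This caching is exactly what assumes the feature level never changes
during the life of the process: once sum_impl has been resolved, the
check is never repeated.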
* Re: x86 CPU features detection for applications (and AMX) 2021-06-28 15:08 ` Thiago Macieira @ 2021-06-28 15:27 ` Peter Zijlstra 2021-06-28 16:13 ` Thiago Macieira [not found] ` <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net> 1 sibling, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2021-06-28 15:27 UTC (permalink / raw) To: Thiago Macieira Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha, linux-api, linux-arch, x86 On Mon, Jun 28, 2021 at 08:08:41AM -0700, Thiago Macieira wrote: > On Monday, 28 June 2021 05:40:32 PDT Enrico Weigelt, metux IT consult wrote: > > What we SW engineers need is an easy and fast method to act depending on > > whether some CPU supports some feature (eg. a new opcode). Things like > > cpuinfo are only a tiny piece of that. What we could really use is a > > conditional jump/call based on whether feature X is supported - without > > any kernel intervention. Then the machine code could be easily layed out > > to support both cases with our without some feature X. Alternatively we > > could have a fast trapping in useland - hw generated call already would > > be a big help. > > That's what cpuid is for. With GCC function multi-versioning or equivalent > manually-rolled solutions, you can get exactly what you're asking for. Right, lots of self-modifying code solutions there, some of which can be linker driven, some not. In the kernel we use alternative() to replace short code sequences depending on CPUID. Userspace *could* do the same, rewriting code before first execution is fairly straight forward. > Yes, the checking became far more complex with the need to check XCR0 after > AVX came along, but since the instruction itself is a slow and serialising, > any library will just cache the results. And as a result, the level of CPU > features is not expected to change. It never has in the past, so this hasn't > been an issue. 
Arguably you should be checking XCR0 for any feature there, including SSE/AVX/AVX512 and now AMX. Ideally we'd do a prctl() for AVX512 too, except it's too late :-( ^ permalink raw reply [flat|nested] 27+ messages in thread
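A minimal sketch of the XCR0 check argued for above, assuming x86 with
GCC or Clang (for <cpuid.h> and inline asm). The helper names are
hypothetical. The bit positions are the architectural ones: CPUID.1:ECX
bits 27 (OSXSAVE) and 28 (AVX), CPUID.(7,0):EBX bit 16 (AVX512F), XCR0
bits 1 and 2 for SSE/AVX state and bits 5 to 7 for AVX-512 state.

```c
#include <cpuid.h>
#include <stdint.h>

static uint64_t read_xcr0(void)
{
    uint32_t lo, hi;

    __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));
    return ((uint64_t) hi << 32) | lo;
}

/* AVX is usable only if the CPU has it, the OS enabled XSAVE, and the
   OS saves both SSE and AVX register state. */
static int avx_usable(void)
{
    unsigned int a, b, c, d;

    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    if (!(c & (1u << 27)) || !(c & (1u << 28)))
        return 0;               /* XGETBV would fault without OSXSAVE */
    return (read_xcr0() & 0x6) == 0x6;
}

/* AVX-512 additionally needs the opmask, ZMM_Hi256 and Hi16_ZMM state
   components (XCR0 bits 5, 6 and 7). */
static int avx512f_usable(void)
{
    unsigned int a, b, c, d;

    if (!avx_usable())
        return 0;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d) || !(b & (1u << 16)))
        return 0;
    return (read_xcr0() & 0xe6) == 0xe6;
}
```

The CPUID bit alone corresponds to "in silicon"; the XCR0 test is what
turns it into "the OS will actually preserve this state".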
* Re: x86 CPU features detection for applications (and AMX)
From: Thiago Macieira @ 2021-06-28 16:13 UTC
To: Peter Zijlstra
Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Monday, 28 June 2021 08:27:24 PDT Peter Zijlstra wrote:
> > That's what cpuid is for. With GCC function multi-versioning or equivalent
> > manually-rolled solutions, you can get exactly what you're asking for.
>
> Right, lots of self-modifying code solutions there, some of which can be
> linker driven, some not. In the kernel we use alternative() to replace
> short code sequences depending on CPUID.
>
> Userspace *could* do the same, rewriting code before first execution is
> fairly straight forward.

Userspace shouldn't do SMC. It's bad enough that JITs without caching exist,
but having pure paged code is better. Pure pages are shared as needed by the
kernel.

All you need is a simple bit test. You can then either branch to different
code paths or write to a function pointer so it'll go there directly the next
time. You can also choose to load different plugins depending on what CPU
features were found.

Consequence: CPU feature checking is done *very* early, often before main().

> Arguably you should be checking XCR0 for any feature there, including
> SSE/AVX/AVX512 and now AMX.
>
> Ideally we'd do a prctl() for AVX512 too, except it's too late :-(

Right. But speaking of which, this library would deal with Apple having done
the allocate-state-on-demand feature for AVX512 without XFD. See
https://github.com/qt/qtbase/blob/dev/src/corelib/global/qsimd.cpp#L346-L369

Anyway, what's the current thinking on what the arch_prctl() should be? Is
that a per-thread state or will it affect the entire process group? And is it
a sticky functionality, or are we talking about ref/deref?

Maybe in order to answer that, we need to understand what the worst case
scenario we need to support is. What are they?

1) alt-stack signal handlers, usually for crashing signals (to catch a stack
   overflow)

2) cooperative user-space task schedulers, e.g. coroutines

3) preemptive user-space task schedulers (I don't know if such software exists
   or even if it is possible)

4) combination of 1 and 3

5) #4, in which each part comes from a separate library with no knowledge
   of each other, and initialised concurrently in different threads

I'd *assume* that any user-space task scheduler is aware of XSAVE at the very
least and will know how to allocate context-saving buffers of sufficient size
for each task. I think this is a safe assumption because AVX is over 10 years
old now and XSAVE is a required feature for enabling the AVX state. That is,
any library that knows to save AVX state (the upper 128 bits of the YMM
registers) is aware of XSAVE and the fact that the state size is dynamic.

Crash handlers are another story. Speaking from experience, my first attempt
at doing them simply used a global char array of MINSIGSTKSZ and that failed
to get delivered (note that code will now fail to compile because MINSIGSTKSZ
is no longer a constant expression). My code was attempting to launch gdb on
itself, so it wasn't even a SA_SIGINFO signal and therefore the failure was
baffling. I had to read the kernel source code to figure out that regardless
of SA_SIGINFO, the state is saved on stack anyway and therefore needs to be
big enough.

So I simply increased the global variable's size until it succeeded in
delivering on my AVX512 machine. And because it is no longer using
MINSIGSTKSZ, it will not fail to compile after the glibc upgrade, but it will
fail to deliver with AMX state enabled.

[I've since learned to check the XSAVE state size in order to create the
alt-stack]

How much do we need to worry about these crash handlers?

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering
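The "check the XSAVE state size" step mentioned above can be sketched roughly as follows. This is only an illustration, assuming an x86 build with GCC or Clang (for <cpuid.h>); the function names and the HANDLER_HEADROOM value are hypothetical and do not come from the thread. CPUID leaf 0xD, sub-leaf 0 reports in EBX the XSAVE area size for the features currently enabled in XCR0 and in ECX the size for every feature the CPU supports.

```c
#include <cpuid.h>
#include <stddef.h>

/* Hypothetical headroom for the handler's own frames and the rest of the
 * signal frame (siginfo, ucontext); not a value taken from the thread. */
#define HANDLER_HEADROOM (16 * 1024)

/* XSAVE area size for the features currently enabled in XCR0
 * (CPUID.(EAX=0DH,ECX=0):EBX), or 0 if leaf 0xD is unavailable. */
size_t xsave_enabled_size(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ebx;
}

/* XSAVE area size for all features the CPU supports, enabled or not
 * (CPUID.(EAX=0DH,ECX=0):ECX), or 0 if leaf 0xD is unavailable. */
size_t xsave_max_size(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx;
}

/* Size hint for an alternate signal stack: worst-case register state
 * plus headroom for the handler itself. */
size_t altstack_size_hint(void)
{
    return xsave_max_size() + HANDLER_HEADROOM;
}
```

Sizing from the maximum rather than the currently enabled set means the buffer stays large enough if more state is enabled later in the process lifetime.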
* Re: x86 CPU features detection for applications (and AMX)
From: Peter Zijlstra @ 2021-06-28 17:11 UTC
To: Thiago Macieira
Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 09:13:29AM -0700, Thiago Macieira wrote:
> On Monday, 28 June 2021 08:27:24 PDT Peter Zijlstra wrote:
> > > That's what cpuid is for. With GCC function multi-versioning or equivalent
> > > manually-rolled solutions, you can get exactly what you're asking for.
> >
> > Right, lots of self-modifying code solutions there, some of which can be
> > linker driven, some not. In the kernel we use alternative() to replace
> > short code sequences depending on CPUID.
> >
> > Userspace *could* do the same, rewriting code before first execution is
> > fairly straight forward.
>
> Userspace shouldn't do SMC. It's bad enough that JITs without caching exist,
> but having pure paged code is better. Pure pages are shared as needed by the
> kernel.

I don't feel that strongly; if SMC gets you measurable performance gains,
go for it. If you're short on memory, buy more.

> All you need is a simple bit test. You can then either branch to different
> code paths or write to a function pointer so it'll go there directly the next
> time. You can also choose to load different plugins depending on what CPU
> features were found.

Both bit tests and indirect function calls suffer the extra memory load,
which is not free.

> Consequence: CPU feature checking is done *very* early, often before main().

For the linker based ones, yes. IIRC the ifunc() attribute is
particularly useful here.
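The "write to a function pointer so it'll go there directly the next time" approach could look something like the sketch below. It assumes GCC or Clang on x86 (for __builtin_cpu_supports() and the target attribute); the sum_* names are invented for illustration. An ifunc resolver would do the same selection at relocation time instead of on first call.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline implementation, valid on any x86 CPU. */
static uint32_t sum_generic(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Same loop, but the compiler may use AVX2 instructions in this function. */
__attribute__((target("avx2")))
static uint32_t sum_avx2(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

static uint32_t sum_resolve(const uint32_t *v, size_t n);

/* Starts out pointing at the resolver; overwritten on first call. */
static uint32_t (*sum_impl)(const uint32_t *, size_t) = sum_resolve;

/* First call: do the feature bit test once, store the chosen routine in
 * the pointer, then forward the call. A production version would want an
 * atomic store if several threads can race on the first call. */
static uint32_t sum_resolve(const uint32_t *v, size_t n)
{
    __builtin_cpu_init();
    sum_impl = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_generic;
    return sum_impl(v, n);
}

/* Public entry point: one indirect call through the cached pointer. */
uint32_t sum(const uint32_t *v, size_t n)
{
    return sum_impl(v, n);
}
```

Every call still pays the pointer load noted above; the code pages themselves stay unmodified and shareable.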
* Re: x86 CPU features detection for applications (and AMX)
From: Thiago Macieira @ 2021-06-28 17:23 UTC
To: Peter Zijlstra
Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Monday, 28 June 2021 10:11:16 PDT Peter Zijlstra wrote:
> > Consequence: CPU feature checking is done *very* early, often before
> > main().
> For the linker based ones, yes. IIRC the ifunc() attribute is
> particularly useful here.

Exactly. ifunc was designed for this exact purpose. And hence the fact that
CPUID initialisation will be done very, very early.

Anyway, if the AMX state is a sticky "set once per process", it's likely going
to get set early for every process that *may* use AMX. And this is assuming we
do the library right and only set it if it has AMX code at all, instead of all
the time.

On the other hand, if it's not set once and for all, we'll have to contend
with the size changing. TBH, this is a lot more complicated to deal with. Take
the hypothetical example of a preemptive user-space task scheduler that
interrupts an AMX routine (let's say for the sake of the argument that it is
an on-stack signal; I don't see why a scheduler would need to be alt-stack).
It will record the state and then transition to another routine. And this
routine may be resumed in another thread of the same process.

Will the kernel understand that the new routine does not need the AMX state?
Will it understand that the *other* routine, in the other thread will? If this
is not done automatically by the kernel, then the task scheduler will need to
know to ask the kernel what the reference count for the AMX state is and will
need a syscall to set it (not just increment/decrement, though one could
implement that with a loop).

This applies differently in the case of cooperative scheduling. The SysV ABI
will probably say that the AMX state is caller-save, so the function call from
the AMX-using routine implies all its state has been saved somewhere. But what
about the kernel-side AMX refcount? Is that part of the ABI?

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering
* Re: x86 CPU features detection for applications (and AMX)
From: Peter Zijlstra @ 2021-06-28 19:08 UTC
To: Thiago Macieira
Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 10:23:47AM -0700, Thiago Macieira wrote:
> On Monday, 28 June 2021 10:11:16 PDT Peter Zijlstra wrote:
> > > Consequence: CPU feature checking is done *very* early, often before
> > > main().
> > For the linker based ones, yes. IIRC the ifunc() attribute is
> > particularly useful here.
>
> Exactly. ifunc was designed for this exact purpose. And hence the fact that
> CPUID initialisation will be done very, very early.
>
> Anyway, if the AMX state is a sticky "set once per process", it's likely going
> to get set early for every process that *may* use AMX. And this is assuming we
> do the library right and only set it if has AMX code at all, instead of all
> the time.

This, AFAIU. If the ifunc() resolver finds we haz AMX it can do the
prctl() and on success pick the AMX routine.

Assuming of course, that if a program links with a library that supports
AMX, it will actually end up using it.
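For illustration, the "do the prctl() and on success pick the AMX routine" step might be sketched as below. The interface was still undecided when this thread was written; the sketch assumes the one that later shipped in Linux 5.16, where arch_prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) requests a process-wide, sticky permission. The constants are copied from the kernel's x86 uapi headers because older userspace headers lack them, and amx_request_permission() is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for syscall() */
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* Values as defined by Linux 5.16 and later; an assumption relative to
 * this thread, which predates the final interface. */
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#ifndef XFEATURE_XTILEDATA
#define XFEATURE_XTILEDATA 18
#endif

/* Ask the kernel for permission to use AMX tile data in this process.
 * Returns 1 if permission was granted, 0 if the kernel or CPU refused
 * (for example an older kernel or a CPU without AMX). */
int amx_request_permission(void)
{
    long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM,
                      (unsigned long)XFEATURE_XTILEDATA);
    return rc == 0;
}
```

A resolver would call this once and fall back to a non-AMX routine whenever it returns 0.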
* Re: x86 CPU features detection for applications (and AMX)
From: Thiago Macieira @ 2021-06-28 19:26 UTC
To: Peter Zijlstra
Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Monday, 28 June 2021 12:08:16 PDT Peter Zijlstra wrote:
> > Anyway, if the AMX state is a sticky "set once per process", it's likely
> > going to get set early for every process that *may* use AMX. And this is
> > assuming we do the library right and only set it if has AMX code at all,
> > instead of all the time.
>
> This, AFAIU. If the ifunc() resolver finds we haz AMX it can do the
> prctl() and on success pick the AMX routine.
>
> Assuming of course, that if a program links with a library that supports
> AMX, it will actually end up using it.

That's what I meant and I agree. If it has an AMX function for *anything*, it
will do the arch_prctl() and enable the state, even if said function is never
called.

This is the good case. The bad case is that it does the arch_prctl() before it
sees whether there is any AMX function.

Do we expect that the dynamic loader will have this code? It currently
searches the multiple ABI levels (up to x86-64-v4 to include AVX512) and HW
capabilities. I can readily see AMX being one of the capabilities, if not an
ABI level. Though it should be trivial for it to call the arch_prctl() if and
only if it is about to load an ELF module that declares use of AMX and also
*not* load it if the syscall fails.

$ LD_DEBUG=libs /lib64/ld-linux-x86-64.so.2 --inhibit-cache /bin/ls
      1620:     find library=librt.so.1 [0]; searching
      1620:      search path=.....
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v4/librt.so.1
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v3/librt.so.1
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v2/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/librt.so.1
      1620:       trying file=/usr/lib64/tls/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/tls/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/librt.so.1
      1620:       trying file=/usr/lib64/haswell/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/haswell/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/haswell/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/haswell/librt.so.1
      1620:       trying file=/usr/lib64/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/librt.so.1

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering
* Re: x86 CPU features detection for applications (and AMX)
From: Peter Zijlstra @ 2021-06-28 17:43 UTC
To: Thiago Macieira
Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 09:13:29AM -0700, Thiago Macieira wrote:

> Anyway, what's the current thinking on what the arch_prctl() should be? Is
> that a per-thread state or will it affect the entire process group? And is it
> a sticky functionality, or are we talking about ref/deref?

So I didn't follow the initial discussion too well; so I might be
getting this wrong. In which case I'm hoping Thomas and/or Andy will
correct me.

But I think the proposal was per process. Having this per thread would
be really unfortunate IMO.

> Maybe in order to answer that, we need to understand what the worst case
> scenario we need to support is. What are they?
>
> 1) alt-stack signal handlers, usually for crashing signals (to catch a stack
> overflow)
>
> 2) cooperative user-space task schedulers, e.g. coroutines
>
> 3) preemptive user-space task schedulers (I don't know if such software exists
> or even if it is possible)

I think it's been done; use sigsetmask()/pthread_sigmask() as 'IRQ'
disable, and run a preemption tick off of SIGALRM or something.

> 4) combination of 1 and 3

None of those I think. The worst case is old executables using
MINSIGSTKSZ and not using the magic signal context at all, just regular
old signals. If you run them on an AVX512 enabled machine, they overflow
their stack and cause memory corruption.

AFAICT the only feasible way forward with that is some sysctl which
default disables AVX512 and add the prctl() and have some unsafe wrapper
that enables AVX512 for select 'legacy' programs for as long as they
exist :/

That is, binaries/libraries compiled against a static (and small)
MINSIGSTKSZ are the enemy. Which brings us to:

> 5) #4, in which each part comes from a separate library with no knowledge
> of each other, and initialised concurrently in different threads

That's terrible... library init should *NEVER* spawn threads (I know,
don't start).

Anything that does this is basically unfixable, because we can't
guarantee the AMX prctl() gets done before the first thread.

So yes, worst case I suppose...
* Re: x86 CPU features detection for applications (and AMX)
From: Thiago Macieira @ 2021-06-28 19:05 UTC
To: Peter Zijlstra
Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Monday, 28 June 2021 10:43:47 PDT Peter Zijlstra wrote:
> None of those I think. The worst case is old executables using
> MINSIGSTKSZ and not using the magic signal context at all, just regular
> old signals. If you run them on an AVX512 enabled machine, they overflow
> their stack and cause memory corruption.

Indeed, they are already broken today. Retroactively fixing them 5 years later
can be an additional goal, but it shouldn't stop us from having an AMX
solution.

BTW, exactly MINSIGSTKSZ on an SKX overflows inside the kernel, so there's no
memory corruption in userspace. The signal handler is not even invoked and the
process is killed by a second signal delivery. But somewhere between
MINSIGSTKSZ and SIGSTKSZ, it will invoke the signal handler and will overflow
inside the user code. If that alt-stack wasn't designed with guard pages, it
will corrupt memory, as you said.

> AFAICT the only feasible way forward with that is some sysctl which
> default disables AVX512 and add the prctl() and have some unsafe wrapper
> that enables AVX512 for select 'legacy' programs for as long as they
> exist :/
>
> That is, binaries/libraries compiled against a static (and small)
> MINSIGSTKSZ are the enemy. Which brings us to:
> > 5) #4, in which each part comes from a separate library with no
> > knowledge of each other, and initialised concurrently in different
> > threads
>
> That's terrible... library init should *NEVER* spawn threads (I know,
> don't start).

Indeed, but that wasn't even what I was suggesting. I was thinking that the
application would have started threads and the library init was run on
separate threads. This may happen with lazy initialisation on first use.

But threads aren't required for the problem to happen. If the crash-handling
case runs first, before AMX state is enabled, it might decide that it doesn't
need to allocate sufficient alt-stack space for AMX.

I think the recommendation here for userspace is clear: allocate the maximum
that XSAVE tells you that you'll need, regardless of what the ambient enabled
feature set is.

[And pretty please set up guard pages. Given that the XSAVE state area for SPR
appears to be too close to 12 kB (see below), I'd say they should mmap() 20 kB
and then mprotect() the lowest page to PROT_NONE.]

> Anything that does this is basically unfixable, because we can't
> guarantee the AMX prctl() gets done before the first thread.
>
> So yes, worst case I suppose...

To wit: the worst case is a static, small alt-stack *without* guard pages
(e.g., a malloc()ed or even static buffer) of sufficient size to let the
kernel transition back to userspace but not for the userspace routine to run.

Any code using the old MINSIGSTKSZ (2048) will fail to run for AVX512 in the
first place. There will be no data corruption, the crash handler will not run
and the application will simply crash. An out-of-process core dump handler (if
any, like systemd-coredump) will still get run.

Code using SIGSTKSZ (8192) will run for AVX512 and there'll be about 5 kB left
to the user routine. So if it doesn't have too deep a call stack, it will not
corrupt memory. And this code will not run for AMX state, because it's smaller
than the XSAVE state (see below), making that the same case as the MINSIGSTKSZ
for AVX512 above.

It's possible someone would use something between those two values, but why?
I expect that alt-stack handlers that use a constant value use either of the
two constants or maybe some multiple of SIGSTKSZ, but nothing in-between them.

$ /opt/intel/sde-external-8.63.0-2021-01-18-lin/sde64 -spr -- cpuid -1 -l 0xd
CPU:
   XSAVE features (0xd/0):
      XCR0 lower 32 bits valid bit field mask = 0x000600ff
      XCR0 upper 32 bits valid bit field mask = 0x00000000
         XCR0 supported: x87 state            = true
         XCR0 supported: SSE state            = true
         XCR0 supported: AVX state            = true
         XCR0 supported: MPX BNDREGS          = true
         XCR0 supported: MPX BNDCSR           = true
         XCR0 supported: AVX-512 opmask       = true
         XCR0 supported: AVX-512 ZMM_Hi256    = true
         XCR0 supported: AVX-512 Hi16_ZMM     = true
         IA32_XSS supported: PT state         = false
         XCR0 supported: PKRU state           = false
         XCR0 supported: CET_U state          = false
         XCR0 supported: CET_S state          = false
         IA32_XSS supported: HDC state        = false
         IA32_XSS supported: UINTR state      = false
         LBR supported                        = false
         IA32_XSS supported: HWP state        = false
         XTILECFG supported                   = true
         XTILEDATA supported                  = true
      bytes required by fields in XCR0        = 0x00002b00 (11008)
      bytes required by XSAVE/XRSTOR area     = 0x00002b00 (11008)

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering
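The guard-page recommendation above (mmap() the stack, then mprotect() the lowest page to PROT_NONE) could be sketched like this. The function name is hypothetical and the caller picks the usable size; passing 16 kB on a system with 4 kB pages gives the 20 kB mapping suggested in the message.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS on strict -std= builds */
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map an alternate signal stack with at least `usable` bytes plus one
 * inaccessible guard page below it, and register it with sigaltstack().
 * Returns 0 on success, -1 on failure. The mapping is intentionally
 * never freed: it has to stay valid for as long as signals can arrive. */
int install_guarded_altstack(size_t usable)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    /* Round the usable part up to whole pages, then add the guard page. */
    size_t len = (usable + page - 1) / page * page + page;
    stack_t ss;

    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* The stack grows down, so an overflow runs into the lowest page.
     * Making it PROT_NONE turns silent corruption into a fault. */
    if (mprotect(base, page, PROT_NONE) != 0) {
        munmap(base, len);
        return -1;
    }

    ss.ss_sp = (char *)base + page;
    ss.ss_size = len - page;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        munmap(base, len);
        return -1;
    }
    return 0;
}
```

With the 11008-byte XSAVE area shown in the cpuid output, a 16 kB usable stack leaves roughly 5 kB for the rest of the signal frame and the handler, so a larger request may be preferable for anything beyond a trivial handler.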
[parent not found: <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net>]
* Re: x86 CPU features detection for applications (and AMX)
From: Florian Weimer @ 2021-06-30 14:34 UTC
To: Enrico Weigelt, metux IT consult
Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> OTOH, if one really needs to be independent of distros, one should build
> a complete nano-distro, where everything's installed under certain
> prefix, including libc. Isn't a big deal at all - we have plenty tools
> for that, daily practise in embedded world. The only difference would be
> tweaking the ld scripts to set a different ld.so path.

It breaks integration with system-wide settings, such as user/group
databases, host name lookup, and cryptographic policies. In many
environments, that is not really an option.

Thanks,
Florian
[parent not found: <030f1462-2bf9-39bc-d620-6d9fbe454a27@metux.net>]
* Re: x86 CPU features detection for applications (and AMX)
From: Florian Weimer @ 2021-06-30 15:38 UTC
To: Enrico Weigelt, metux IT consult
Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> On 30.06.21 16:34, Florian Weimer wrote:
>
>> It breaks integration with system-wide settings, such as user/group
>> databases, host name lookup, and cryptographic policies. In many
>> environments, that is not really an option.
>
> Not necessarily, these can still be applied (and fairly simple).
> You actually have to twist more extra knobs if you wanted those weird
> things to happen.

Sorry, this is just not true. You cannot load system libraries such as
NSS modules or cryptographic libraries with a custom glibc because the
system glibc could be newer, and glibc does not provide that kind of
compatibility (only the other way round).

Thanks,
Florian
[parent not found: <4ba30cb7-6854-0691-fad6-4ca9ce674ac2@metux.net>]
* Re: x86 CPU features detection for applications (and AMX)
From: Florian Weimer @ 2021-07-01 8:21 UTC
To: Enrico Weigelt, metux IT consult
Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> And I'm repeating my previous questions: can you name some actual real
> world (not hypothetical or academical) scenarios where:
>
> somebody really needs some binary-only application &&
> needs those extra modules *into that* application &&
> cannot recompile these modules into the applications's prefix &&
> needs AMX in that application &&
> cannot just use chroot &&
> cannot put it into container ?

There are no real-world scenarios yet which involve AMX, so I'm not sure
what you are after with this question.

Thanks,
Florian
[parent not found: <034dcf9b-1f8c-23ee-86a6-791122bc0f8c@metux.net>]
* Re: x86 CPU features detection for applications (and AMX)
From: Florian Weimer @ 2021-07-06 12:57 UTC
To: Enrico Weigelt, metux IT consult
Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> On 01.07.21 10:21, Florian Weimer wrote:
>> * Enrico Weigelt:
>>
>>> And I'm repeating my previous questions: can you name some actual real
>>> world (not hypothetical or academical) scenarios where:
>>>
>>> somebody really needs some binary-only application &&
>>> needs those extra modules *into that* application &&
>>> cannot recompile these modules into the applications's prefix &&
>>> needs AMX in that application &&
>>> cannot just use chroot &&
>>> cannot put it into container ?
>> There are no real-world scenarios yet which involve AMX, so I'm not sure
>> what you are after with this question.
>
> Okay, let's take AMX out of the equation (until it actually arrives
> in the field). How does it look like then ?

We have customers that want to use name service switch (NSS) plugins in
proprietary software and who do not want to distribute the (GNU)
toolchain with their application. The latter excludes
chroot/containers. Some applications more or less require to run
directly on the host (e.g., if they have some system monitoring aspect).

Thanks,
Florian
* Re: x86 CPU features detection for applications (and AMX)
From: Thiago Macieira @ 2021-06-30 15:29 UTC
To: fweimer, Enrico Weigelt, metux IT consult
Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Wednesday, 30 June 2021 07:32:29 PDT Enrico Weigelt, metux IT consult wrote:
> What does "buy-in" mean in that context ? Some other department ? Some
> external developers ?
>
> I tend to believe that those things should better be done by some
> independent party, maybe GNU, FSF, etc, and cpu vendors should just
> sponsor this work and provide necessary specs.

For me specifically, I need to identify some SW dept. that would take on the
responsibility for it long-term. Abandonware would serve no one.

I wouldn't mind if this were a collaborative project under the auspices of
freedesktop.org or similar, so long as it is cross-platform to at least macOS,
FreeBSD and Windows. But given that it is in Intel's interest for this
library to exist and make it easy for people to use our CPU features, it
seemed like a natural fit for Intel.

And even if it isn't an Intel-owned project, we probably want to be
contributors.

> Shipping precompiled binaries and linking against system libraries is
> always a risky game. The cleanest approach here IMHO would be building
> packages for various distros (means: using their toolchains / libs).
> This actually isn't as work intensive as it might sound - I'm doing this
> all the day and have a bunch of helpful tools for that.

I understand, but whether it is easier and better for 99% of the cases does
not mean it is so for 100%. And most especially it does not guarantee that it
will be used for everyone. For reasons real or not, there are precompiled
binaries. Just see Google Chrome, for example.

> Licensing with glibc also isn't a serious problem here. All you need to
> do is be compliant with the LGPL. In short: publish all your patches to
> glibc, offer the license and link dynamically. Already done that a
> thousand times.

We can agree it's an additional hurdle, which will likely cause people to
investigate a solution that doesn't require that hurdle.

> Wait a minute ... how long does it take from the architectural design,
> until the real silicon is out in the field ? I would be very surprised
> whether the whole process is done in a much shorter time frame.
>
> Note: by "much more early", I meant already at the point where the spec
> of the new feature exists, at least on paper.

I'm not going to comment on the timing of architectural decisions.

But just from the example I gave: in order to be ready for a late 2021 or
early 2022 launch, we'd need to have the feature's specification published and
the patches accepted by December 2018. That's about 3 years lead time.

How many software projects (let alone mixed software and hardware) do you know
that know 3 years ahead of time what they will need?

> > Then there are Linux distros that do LTS every 2 years or so.
>
> Why don't the few actually affected parties just upgrade their compiler
> on their build machines when needed ?

Have you tried?

Besides, the whole problem here is barrier of entry. If we don't make it easy
for them to use the new features, they won't.

And I was using this as an argument for why precompiled binaries will exist:
the interested parties will take the pain to upgrade the compilers and other
supporting software so that the build even of Open Source software is the most
capable one, then release that binary for others who haven't. This lowers the
barrier of entry significantly.

And this is all to justify that such a functionality shouldn't be part of
glibc, where it can't be used by those precompiled binaries which, for one
reason or another, will exist. It should be in a small, permissively-licensed
library that will often get statically linked into the binary in question.

> > To compile the software that uses those instructions, undoubtedly. But
> > what if I did that for you and you could simply download the binary for
> > the library and/or plugins such that you could slot into your existing
> > systems and CI? This could make a difference between adoption or not.
>
> For me, it wouldn't, at all. I never download binaries from untrusted
> sources. (except for forensic analysis).

I understand and I am, myself, almost like you. I do have some precompiled
binaries (aforementioned Google Chrome), but as a rule I avoid them.

But not everyone is like the two of us.

> BUT: we're talking about brand new silicon here. Why should
> anybody - who really needs these new features - install such an ancient
> OS on a brand new machine ?

I don't know. It might be for fleet homogeneity: everything has the same SW
installed, facilitating maintenance. Just coming up with reasons.

> > Even if they don't, the *software* that people deploy may be the same build
> > for RHEL 7 and for a modern distro that will have a 5.14 kernel.
>
> Now we're getting to the vital point: trying to make "universal"
> binaries for various different distros. This is something I'm strictly
> advising against since 25 years, because with that you're putting
> yourself into *a lot* trouble (ABI compatibility between arbitrary
> distros or even various distro releases always had been pretty much a
> myth, only works for some specific cases). Just don't do it, unless you
> *really* don't have any other chance.

Well, that's the point, isn't it? Are we ready to call this use-case not
valid, so it can't be used to support the argument of a solution that needs to
be deployable to old distros?

> > So my point is: this shouldn't be in glibc because the glibc will not have
> > the new system call wrappers or TLS fields.
>
> Yes, I'm fully on your side here. Glibc already is overloaded with too
> much of those kind of things that shouldn't belong in there. Actually,
> even stuff like DNS resolving IMHO doesn't belong into libc.

Thanks.

(name resolving is required by POSIX to be there, so it exists in every
system; might as well be every libc)

> My proposal would be a conditional jump opcode that directly checks for
> specific features. If this is well designed, I believe that can be
> resolved by the cpu's internal prefetcher unit. But for that we'd also
> need some extra task status bit so the cpu knows it is enabled for the
> current task.

That's more of a "can I use this now", instead of "can I use this ever". So
far, the answer to the two has been the same. Therefore, there has been no
need to have the functionality that you're describing.

> > For most features, there isn't. You don't see us discussing
> > AVX512VP2INTERSECT, for example. This discussion only exists because AMX
> > requires more state to be saved during context switches and signal
> > delivery.
> But over all these years, some new registers have been introduced.
> I fail to imagine how context switches can be done properly w/o also
> saving/restoring such new registers.

There have been a few small registers and state that need to be saved here and
there, but the biggest blocks were:

 - SSE state
 - AVX state
 - AVX512 state
 - AMX state

The first two were small enough (and long enough ago) that the discussions
were small and aren't relevant today. The AVX512 state was added in the past
decade. And as you've seen from this thread, that is still a sticky point, and
that was only about 1.5 kB.

However, the vast majority of CPU features do not add new context state.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering
* Re: x86 CPU features detection for applications (and AMX) 2021-06-25 23:31 ` Thiago Macieira [not found] ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net> @ 2021-07-08 7:08 ` Florian Weimer 2021-07-08 15:13 ` Thiago Macieira 1 sibling, 1 reply; 27+ messages in thread From: Florian Weimer @ 2021-07-08 7:08 UTC (permalink / raw) To: Thiago Macieira; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86 * Thiago Macieira: > On 23 Jun 2021 17:04:27 +0200, Florian Weimer wrote: >> We have an interface in glibc to query CPU features: >> X86-specific Facilities >> <https://www.gnu.org/software/libc/manual/html_node/X86.html> >> >> CPU_FEATURE_USABLE all preconditions for a feature are met, >> HAS_CPU_FEATURE means it's in silicon but possibly dormant. >> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before >> enabling the relevant bit (so it cannot pass through any unknown bits). > > It's a nice initiative, but it doesn't help library and applications that need > to be either cross-platform or backwards compatible. > > The first problem is the cross-platformness need. Because we library and > application developers need to support other OSes, we'll need to deploy our > own CPUID-based detection. It's far better to use common code everywhere, > where one developer working on Linux can fix bugs in FreeBSD, macOS or Windows > or any of the permutations. Every platform-specific deviation adds to > maintenance requirements and is a source of potential latent bugs, now or in > the future due to refactoring. That is why doing everything in the form of > instructions would be far better and easier, rather than system calls. I must say this is a rather application-specific view. Sure, you get consistency within the application across different targets, but for those who work on multiple applications (but perhaps on a single distribution/OS), things are very inconsistent. 
And the reason why I started this is that CPUID-based feature detection
is dead anyway (assuming the kernel developers do not implement lazy
initialization of the AMX state). CPUID (and ancillary data such as
XCR0) will say that AMX support is there, but it will not work unless
some (yet to be decided) steps are executed by the userspace thread.

While I consider the CPUID-based model a success (and the cross-OS
consistency may have contributed to that), its days seem to be over.

> [Unless said system calls were standardised and actually
> deployed. Making this a cross-platform library that is not part of
> libc would be a major step in that direction]

That won't help with AMX, as far as I can tell.

Thanks,
Florian
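As an illustration of the CPUID-plus-XCR0 check referred to above, here is
a minimal sketch that is not part of the original message. It assumes a
GCC or Clang toolchain with <cpuid.h>; the function names are invented for
this example. It only answers "do CPUID and XCR0 report AMX tile
support?". Per the message, a positive answer would still not mean that
AMX instructions work, because the additional per-thread steps were
undecided at the time and are deliberately not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper header: __get_cpuid, __get_cpuid_count */
#endif

/* XCR0 bits for the two AMX state components. */
#define XCR0_TILECFG  (UINT64_C(1) << 17)
#define XCR0_TILEDATA (UINT64_C(1) << 18)

/* True when both AMX state components are enabled in the given XCR0 value. */
static bool xcr0_has_tile_state(uint64_t xcr0)
{
    const uint64_t need = XCR0_TILECFG | XCR0_TILEDATA;
    return (xcr0 & need) == need;
}

/* The classic detection model: CPUID feature bit plus XCR0 state bits.
   Always false on non-x86 builds. */
static bool amx_tile_reported(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    /* CPUID.1:ECX bit 27 (OSXSAVE) must be set before XGETBV may be used. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return false;

    /* CPUID.(EAX=7,ECX=0):EDX bit 24 is AMX-TILE. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
        !(edx & (1u << 24)))
        return false;

    /* Read XCR0 (extended control register 0). */
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return xcr0_has_tile_state(((uint64_t)hi << 32) | lo);
#else
    return false;
#endif
}
```

The pure helper xcr0_has_tile_state is kept separate so the bit logic can
be checked without AMX hardware.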
* Re: x86 CPU features detection for applications (and AMX)
From: Thiago Macieira @ 2021-07-08 15:13 UTC
To: Florian Weimer; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Thursday, 8 July 2021 00:08:16 PDT Florian Weimer wrote:
> > The first problem is the cross-platformness need. Because we library and
> > application developers need to support other OSes, we'll need to deploy
> > our own CPUID-based detection. It's far better to use common code
> > everywhere, where one developer working on Linux can fix bugs in FreeBSD,
> > macOS or Windows or any of the permutations. Every platform-specific
> > deviation adds to maintenance requirements and is a source of potential
> > latent bugs, now or in the future due to refactoring. That is why doing
> > everything in the form of instructions would be far better and easier,
> > rather than system calls.
>
> I must say this is a rather application-specific view. Sure, you get
> consistency within the application across different targets, but for
> those who work on multiple applications (but perhaps on a single
> distribution/OS), things are very inconsistent.

Why would they be inconsistent, if the library is cross-platform?

> And the reason why I started this is that CPUID-based feature detection
> is dead anyway (assuming the kernel developers do not implement lazy
> initialization of the AMX state). CPUID (and ancillary data such as
> XCR0) will say that AMX support is there, but it will not work unless
> some (yet to be decided) steps are executed by the userspace thread.
>
> While I consider the CPUID-based model a success (and the cross-OS
> consistency may have contributed to that), its days seem to be over.

Well, we need to design the API of this library such that we can accommodate
the various possibilities.
For all CPU possibilities, the library needs to be able to tell what the
state of support is, among a state of "already enabled", "possible but not
enabled" and "impossible", along with a call to enable them. The latter
should be supported at least for AVX512 and AMX states. On Linux, only AMX
will be tristate, but on macOS we need the tristate for AVX512 too.

This library would then wrap all the necessary checking for OSXSAVE and
XCR0, so the user doesn't need to worry about them or how the OS enables
them, only the features they're interested in.

Additionally, I'd like the library to also have constant expression paths
that evaluate to constant true if the feature was already enabled at compile
time (e.g., -march=x86-64-v3 sets __AVX2__ and __FMA__, so you can always
run AVX2 and FMA code, without checking). But that's just icing on top.

(it won't come as a surprise that I already have code for most of this)

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering
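The API shape described above could look roughly like the following. This
is a minimal sketch, not the code the author mentions having: every name
(feature_state, classify, avx2_state_runtime, avx2_state) is hypothetical,
the on_request/requested inputs stand in for an OS enable mechanism that
the thread leaves undecided, and a GCC or Clang toolchain with <cpuid.h>
is assumed. AVX2 is used as the worked example because its detection rules
are settled.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper header: __get_cpuid, __get_cpuid_count */
#endif

/* The three states named in the message. */
enum feature_state {
    FEATURE_IMPOSSIBLE,  /* CPU or OS cannot provide the feature */
    FEATURE_POSSIBLE,    /* available, but an enable call is still needed */
    FEATURE_ENABLED      /* usable right now */
};

/* Pure decision logic (hypothetical policy):
   in_cpuid      - the CPU advertises the feature
   state_in_xcr0 - the OS has enabled the register state in XCR0
   on_request    - the OS enables this feature only when asked
   requested     - that request has already been made successfully */
static enum feature_state
classify(bool in_cpuid, bool state_in_xcr0, bool on_request, bool requested)
{
    if (!in_cpuid)
        return FEATURE_IMPOSSIBLE;
    if (on_request)
        return requested ? FEATURE_ENABLED : FEATURE_POSSIBLE;
    return state_in_xcr0 ? FEATURE_ENABLED : FEATURE_IMPOSSIBLE;
}

/* Runtime path for AVX2, wrapping the OSXSAVE and XCR0 checks. */
static enum feature_state avx2_state_runtime(void)
{
    bool in_cpuid = false, in_xcr0 = false;
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    /* CPUID.1:ECX bit 27: the OS has enabled XSAVE, so XGETBV is legal. */
    bool osxsave = __get_cpuid(1, &eax, &ebx, &ecx, &edx) &&
                   (ecx & (1u << 27));

    /* CPUID.(EAX=7,ECX=0):EBX bit 5 is AVX2. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        in_cpuid = (ebx & (1u << 5)) != 0;

    if (osxsave) {
        uint32_t lo, hi;
        __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
        (void)hi;
        in_xcr0 = (lo & 0x6) == 0x6;  /* XCR0 bits 1 (SSE) and 2 (AVX) */
    }
#endif
    /* AVX2 is never enabled on request here, so it is effectively bistate. */
    return classify(in_cpuid, in_xcr0, false, false);
}

/* Constant path: building with e.g. -march=x86-64-v3 defines __AVX2__, so
   the check folds to a constant and the runtime query is skipped. */
#ifdef __AVX2__
#define avx2_state() FEATURE_ENABLED
#else
#define avx2_state() avx2_state_runtime()
#endif
```

A caller would write something like
"if (avx2_state() == FEATURE_ENABLED) use_avx2_path();" and never touch
CPUID or XCR0 directly. A tristate feature would pass on_request = true
and pair the query with an enable call, which is omitted here because its
OS-side mechanism is not specified in the thread.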
* Re: x86 CPU features detection for applications (and AMX)
From: Mark Brown @ 2021-07-08 17:56 UTC
To: Florian Weimer
Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, Catalin Marinas, Will Deacon

On Wed, Jun 23, 2021 at 05:04:27PM +0200, Florian Weimer wrote:

Copying in Catalin & Will.

> We have an interface in glibc to query CPU features:

> X86-specific Facilities
> <https://www.gnu.org/software/libc/manual/html_node/X86.html>

> CPU_FEATURE_USABLE all preconditions for a feature are met,
> HAS_CPU_FEATURE means it's in silicon but possibly dormant.
> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
> enabling the relevant bit (so it cannot pass through any unknown bits).

...

> When we designed this glibc interface, we assumed that bits would be
> static during the life-time of the process, initialized at process
> start. That follows the model of previous x86 CPU feature enablement.

...

> This still wouldn't cover the enable/disable side, but at least it would
> work for CPU features which are modal and come and go. The fact that we
> tell GCC to cache the returned pointer from that internal function, but
> not that the data is immutable works to our advantage here.

> On the other hand, maybe there is a way to give users a better
> interface. Obviously we want to avoid a syscall for a simple CPU
> feature check. And we also need something to enable/disable CPU
> features.
This enabling and disabling of CPU features sounds like something that
might also become relevant for arm64, for example I can see a use case
for having something that allows some of the more expensive features to
be masked from some userspace processes for resource management
purposes. This sounds like a bit of a different use case to x86 AIUI
but I think there's overlap in the actual operations that would be
needed.
end of thread, other threads:[~2021-07-08 17:57 UTC | newest]

Thread overview: 27+ messages
2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
2021-06-23 15:32 ` Dave Hansen
2021-07-08 6:05 ` Florian Weimer
2021-07-08 14:19 ` Dave Hansen
2021-07-08 14:31 ` Florian Weimer
2021-07-08 14:36 ` Dave Hansen
2021-07-08 14:41 ` Florian Weimer
2021-06-25 23:31 ` Thiago Macieira
[not found] ` <3c5c29e2-1b52-3576-eda2-018fb1e58ff9@metux.net>
2021-06-28 13:20 ` Peter Zijlstra
[not found] ` <534d0171-2cc5-cd0a-904f-cd3c499b55af@metux.net>
2021-06-30 15:36 ` Thiago Macieira
2021-06-28 15:08 ` Thiago Macieira
2021-06-28 15:27 ` Peter Zijlstra
2021-06-28 16:13 ` Thiago Macieira
2021-06-28 17:11 ` Peter Zijlstra
2021-06-28 17:23 ` Thiago Macieira
2021-06-28 19:08 ` Peter Zijlstra
2021-06-28 19:26 ` Thiago Macieira
2021-06-28 17:43 ` Peter Zijlstra
2021-06-28 19:05 ` Thiago Macieira
[not found] ` <e07294c9-b02a-e1c5-3620-7fae7269fdf1@metux.net>
2021-06-30 14:34 ` Florian Weimer
[not found] ` <030f1462-2bf9-39bc-d620-6d9fbe454a27@metux.net>
2021-06-30 15:38 ` Florian Weimer
[not found] ` <4ba30cb7-6854-0691-fad6-4ca9ce674ac2@metux.net>
2021-07-01 8:21 ` Florian Weimer
[not found] ` <034dcf9b-1f8c-23ee-86a6-791122bc0f8c@metux.net>
2021-07-06 12:57 ` Florian Weimer
2021-06-30 15:29 ` Thiago Macieira
2021-07-08 7:08 ` Florian Weimer
2021-07-08 15:13 ` Thiago Macieira
2021-07-08 17:56 ` Mark Brown