public inbox for libc-alpha@sourceware.org
* Why does glibc use AVX-512?
@ 2021-03-26  4:38 Andy Lutomirski
  2021-03-26 10:06 ` Borislav Petkov
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Andy Lutomirski @ 2021-03-26  4:38 UTC (permalink / raw)
  To: libc-alpha, H. J. Lu, X86 ML, LKML, Bae, Chang Seok,
	Florian Weimer, Carlos O'Donell, Rich Felker

Hi all-

glibc appears to use AVX512F for memcpy by default.  (Unless
Prefer_ERMS is default-on, but I genuinely can't tell if this is the
case.  I did some searching.)  The commit adding it refers to a 2016
email saying that it's 30% faster on KNL.  Unfortunately, AVX-512 is now
available in normal hardware, and the overhead from switching between
normal and AVX-512 code appears to vary from bad to genuinely
horrible.  And, once anything has used the high parts of YMM and/or
ZMM, those states tend to get stuck with XINUSE=1.
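
To make the XINUSE point concrete, here is a minimal sketch of how a
program could check which state components the CPU currently tracks as
in use.  The bit positions follow the XCR0 layout in the Intel SDM; the
macro and function names are local to this example and are not kernel or
glibc identifiers.  read_xinuse() assumes a GCC/Clang x86 build and
assumes the caller has already confirmed that the CPU supports XGETBV
with ECX=1 (CPUID.(EAX=0DH,ECX=1):EAX bit 2), since the instruction
faults otherwise.

```c
#include <stdint.h>

/* XSAVE state-component bits, numbered as in XCR0 (names are local). */
#define XFEATURE_SSE        (1ULL << 1)
#define XFEATURE_YMM_HI128  (1ULL << 2)  /* upper halves of YMM0-15 */
#define XFEATURE_OPMASK     (1ULL << 5)  /* k0-k7 */
#define XFEATURE_ZMM_HI256  (1ULL << 6)  /* upper halves of ZMM0-15 */
#define XFEATURE_HI16_ZMM   (1ULL << 7)  /* ZMM16-31 */

/* Nonzero if the upper YMM halves are marked in use. */
static inline int ymm_high_in_use(uint64_t xinuse)
{
    return (xinuse & XFEATURE_YMM_HI128) != 0;
}

/* Nonzero if any AVX-512 state component is marked in use. */
static inline int avx512_state_in_use(uint64_t xinuse)
{
    const uint64_t avx512 = XFEATURE_OPMASK | XFEATURE_ZMM_HI256 |
                            XFEATURE_HI16_ZMM;
    return (xinuse & avx512) != 0;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
/* XGETBV with ECX=1 returns XCR0 & XINUSE.  Assumption: the caller has
   verified CPUID.(EAX=0DH,ECX=1):EAX[2] before calling this. */
static inline uint64_t read_xinuse(void)
{
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(1));
    return ((uint64_t)hi << 32) | lo;
}
#endif
```

Calling avx512_state_in_use(read_xinuse()) before and after a memcpy()
would be one way to observe whether the upper state gets stuck.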

I'm wondering whether glibc should stop using AVX-512 by default.

Meanwhile, some of you may have noticed a little ABI break we have.
On AVX-512 hardware, the size of a signal frame is unreasonably large,
and this is causing problems even for existing software that doesn't
use AVX-512.  Do any of you have any clever ideas for how to fix it?
We have some kernel patches around to try to fail more cleanly, but we
still fail.

I think we should seriously consider solutions in which, for new
tasks, XCR0 has new giant features (e.g. AMX) and possibly even
AVX-512 cleared, and programs need to explicitly request enablement.
This would allow programs to opt into not saving/restoring across
signals or to save/restore in buffers supplied when the feature is
enabled.  This has all kinds of pros and cons, and I'm not sure it's a
great idea.  But, in the absence of some change to the ABI, the
default outcome is that, on AMX-enabled kernels on AMX-enabled
hardware, the signal frame will be more than 8kB, and this will affect
*every* signal regardless of whether AMX is in use.
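
The 8kB figure can be checked with simple arithmetic over the
architectural sizes of the XSAVE state components.  The sketch below is
only a lower bound: it ignores MPX and PKRU state, alignment padding
between components, and the non-FPU part of the signal frame.  The
struct and function names are made up for this example.

```c
#include <stddef.h>

/* One XSAVE state component and its architectural size in bytes. */
struct xstate_component {
    const char *name;
    size_t      bytes;
};

static const struct xstate_component avx512_components[] = {
    { "legacy x87/SSE region", 512 },
    { "XSAVE header",           64 },
    { "AVX YMM_Hi128",         256 },  /* 16 regs x 16 bytes */
    { "AVX-512 opmask",         64 },  /*  8 regs x  8 bytes */
    { "AVX-512 ZMM_Hi256",     512 },  /* 16 regs x 32 bytes */
    { "AVX-512 Hi16_ZMM",     1024 },  /* 16 regs x 64 bytes */
};

static const struct xstate_component amx_components[] = {
    { "AMX TILECFG",    64 },
    { "AMX TILEDATA", 8192 },          /* 8 tiles x 1 KiB */
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Sum the sizes of n components; padding is deliberately ignored. */
static inline size_t xstate_sum(const struct xstate_component *c, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += c[i].bytes;
    return total;
}
```

Under these assumptions the AVX-512 set sums to 2432 bytes and adding
the AMX components brings the total to 10688 bytes, so the tile data
alone already accounts for 8kB.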

--Andy


* Re: Why does glibc use AVX-512?
  2021-03-26  4:38 Why does glibc use AVX-512? Andy Lutomirski
@ 2021-03-26 10:06 ` Borislav Petkov
  2021-03-26 18:17   ` Andy Lutomirski
  2021-03-26 12:12 ` Florian Weimer
  2021-03-26 13:32 ` David Laight
  2 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2021-03-26 10:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: libc-alpha, H. J. Lu, X86 ML, LKML, Bae, Chang Seok,
	Florian Weimer, Carlos O'Donell, Rich Felker

On Thu, Mar 25, 2021 at 09:38:24PM -0700, Andy Lutomirski wrote:
> I think we should seriously consider solutions in which, for new
> tasks, XCR0 has new giant features (e.g. AMX) and possibly even
> AVX-512 cleared, and programs need to explicitly request enablement.

I totally agree with making this depend on an explicit user request,
but...

> This would allow programs to opt into not saving/restoring across
> signals or to save/restore in buffers supplied when the feature is
> enabled.  This has all kinds of pros and cons, and I'm not sure it's a
> great idea.  But, in the absence of some change to the ABI, the
> default outcome is that, on AMX-enabled kernels on AMX-enabled
> hardware, the signal frame will be more than 8kB, and this will affect
> *every* signal regardless of whether AMX is in use.

... what's stopping the library from issuing that new ABI call before it
starts the app and get <insert fat feature here> automatically enabled
for everything by default?

And then we'll get the lazy FPU thing all over again.

So the ABI should be explicit user interaction or a kernel cmdline param
or so.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: Why does glibc use AVX-512?
  2021-03-26  4:38 Why does glibc use AVX-512? Andy Lutomirski
  2021-03-26 10:06 ` Borislav Petkov
@ 2021-03-26 12:12 ` Florian Weimer
  2021-03-26 18:14   ` Andy Lutomirski
  2021-03-26 13:32 ` David Laight
  2 siblings, 1 reply; 14+ messages in thread
From: Florian Weimer @ 2021-03-26 12:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. J. Lu, X86 ML, LKML, Bae, Chang Seok, Carlos O'Donell,
	Rich Felker, libc-alpha

* Andy Lutomirski:

> glibc appears to use AVX512F for memcpy by default.  (Unless
> Prefer_ERMS is default-on, but I genuinely can't tell if this is the
> case.  I did some searching.)  The commit adding it refers to a 2016
> email saying that it's 30% on KNL.

As far as I know, glibc only does that on KNL, and there it is
actually beneficial.  The relevant code is:

      /* Since AVX512ER is unique to Xeon Phi, set Prefer_No_VZEROUPPER
         if AVX512ER is available.  Don't use AVX512 to avoid lower CPU
         frequency if AVX512ER isn't available.  */
      if (CPU_FEATURES_CPU_P (cpu_features, AVX512ER))
        cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
          |= bit_arch_Prefer_No_VZEROUPPER;
      else
        cpu_features->preferred[index_arch_Prefer_No_AVX512]
          |= bit_arch_Prefer_No_AVX512;

So it's not just about Prefer_ERMS.
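
For anyone who wants to see which branch their own machine would take,
the decision above can be reproduced outside glibc from CPUID leaf 7.
This is a rough sketch, not glibc's code: it only mirrors the quoted
if/else, the function names are invented for the example, and the
cpuid7_ebx() helper assumes GCC or Clang with <cpuid.h>.

```c
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX feature bits. */
#define CPUID7_EBX_AVX512F   (1u << 16)
#define CPUID7_EBX_AVX512ER  (1u << 27)

/* Mirrors the quoted branch: without AVX512ER (i.e. not Xeon Phi),
   the Prefer_No_AVX512 preference would be set.  Returns 1 in that case. */
static inline int prefer_no_avx512(uint32_t cpuid7_ebx_value)
{
    return (cpuid7_ebx_value & CPUID7_EBX_AVX512ER) == 0;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>

/* Reads CPUID leaf 7, subleaf 0.  Returns 1 on success, 0 if the leaf
   is not available on this CPU. */
static inline int cpuid7_ebx(uint32_t *ebx_out)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    *ebx_out = ebx;
    return 1;
}
#endif
```

On a part that reports AVX512F but not AVX512ER, prefer_no_avx512()
returns 1, which corresponds to the "Don't use AVX512" case in the
comment.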

> I think we should seriously consider solutions in which, for new
> tasks, XCR0 has new giant features (e.g. AMX) and possibly even

I think the AMX programming model will be different, yes.

> AVX-512 cleared, and programs need to explicitly request enablement.
> This would allow programs to opt into not saving/restoring across
> signals or to save/restore in buffers supplied when the feature is
> enabled.

Isn't XSAVEOPT already able to handle that?

In glibc, we use XSAVE/XSAVEC for the dynamic loader trampoline, so it
should not needlessly enable AVX-512 state today, while still enabling
AVX-512 calling conventions transparently.

There is a discussion about using the higher (AVX-512-only) %ymm
registers, to avoid the %xmm transition penalty without the need for
VZEROUPPER.  (VZEROUPPER is incompatible with RTM from a performance
point of view.)  That would perhaps negatively impact XSAVEOPT.

Assuming you can make XSAVEOPT work for you on the kernel side, my
instincts tell me that we should have markup for RTM, not for AVX-512.
This way, we could avoid use of the AVX-512 registers and keep using
VZEROUPPER, without run-time transaction checks, and deal with other
idiosyncrasies needed for transaction support that users might
encounter once this feature sees more use.  But the VZEROUPPER vs RTM
issue is currently stuck in some internal process issue on my end (or
two, come to think of it), which I hope to untangle next month.


* RE: Why does glibc use AVX-512?
  2021-03-26  4:38 Why does glibc use AVX-512? Andy Lutomirski
  2021-03-26 10:06 ` Borislav Petkov
  2021-03-26 12:12 ` Florian Weimer
@ 2021-03-26 13:32 ` David Laight
  2 siblings, 0 replies; 14+ messages in thread
From: David Laight @ 2021-03-26 13:32 UTC (permalink / raw)
  To: 'Andy Lutomirski',
	libc-alpha, H. J. Lu, X86 ML, LKML, Bae, Chang Seok,
	Florian Weimer, Carlos O'Donell, Rich Felker

From: Andy Lutomirski
> Sent: 26 March 2021 04:38
> 
> Hi all-
> 
> glibc appears to use AVX512F for memcpy by default.  (Unless
> Prefer_ERMS is default-on, but I genuinely can't tell if this is the
> case.  I did some searching.)  The commit adding it refers to a 2016
> email saying that it's 30% on KNL.  Unfortunately, AVX-512 is now
> available in normal hardware, and the overhead from switching between
> normal and AVX-512 code appears to vary from bad to genuinely
> horrible.  And, once anything has used the high parts of YMM and/or
> ZMM, those states tend to get stuck with XINUSE=1.

Yes I wonder how much faster 'normal' copies ever get because
of these optimisations.
Not many programs sit in a loop repeatedly copying the same 8k buffer.

Not to mention the CPUs where the 'wide' instructions either
use the 'narrow' execution unit twice or at half frequency.
So while supported, using them isn't really useful.

IIRC the [XYZ]MM registers are all caller saved?
So system calls (or rather the C wrappers) are allowed to
trash them all.
So the system call entry could zero all the [XYZ]MM registers.
I think the XSAVExxx and later XRSTORxxx are then quick.
In particular they don't need saving on a context switch from
a system call.
This might get them marked 'not in use' more often.
But probably not if memcpy() starts using them.
(This doesn't help signal handlers.)

ISTR one cpu family where VZEROUPPER goes from 'cheap' to
'expensive'.

	David


* Re: Why does glibc use AVX-512?
  2021-03-26 12:12 ` Florian Weimer
@ 2021-03-26 18:14   ` Andy Lutomirski
  2021-03-26 19:34     ` Florian Weimer
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2021-03-26 18:14 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, H. J. Lu, X86 ML, LKML, Bae, Chang Seok,
	Carlos O'Donell, Rich Felker, libc-alpha

On Fri, Mar 26, 2021 at 5:12 AM Florian Weimer <fw@deneb.enyo.de> wrote:
>
> * Andy Lutomirski:
>
> > glibc appears to use AVX512F for memcpy by default.  (Unless
> > Prefer_ERMS is default-on, but I genuinely can't tell if this is the
> > case.  I did some searching.)  The commit adding it refers to a 2016
> > email saying that it's 30% on KNL.
>
> As far as I know, glibc only does that on KNL, and there it is
> actually beneficial.  The relevant code is:
>
>       /* Since AVX512ER is unique to Xeon Phi, set Prefer_No_VZEROUPPER
>          if AVX512ER is available.  Don't use AVX512 to avoid lower CPU
>          frequency if AVX512ER isn't available.  */
>       if (CPU_FEATURES_CPU_P (cpu_features, AVX512ER))
>         cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
>           |= bit_arch_Prefer_No_VZEROUPPER;
>       else
>         cpu_features->preferred[index_arch_Prefer_No_AVX512]
>           |= bit_arch_Prefer_No_AVX512;
>
> So it's not just about Prefer_ERMS.

Phew.

>
> > AVX-512 cleared, and programs need to explicitly request enablement.
> > This would allow programs to opt into not saving/restoring across
> > signals or to save/restore in buffers supplied when the feature is
> > enabled.
>
> Isn't XSAVEOPT already able to handle that?
>

Yes, but we need a place to put the data, and we need to acknowledge
that, with the current save-everything-on-signal model, the amount of
time and memory used is essentially unbounded.  This isn't great.

>
> There is a discussion about using the higher (AVX-512-only) %ymm
> registers, to avoid the %xmm transition penalty without the need for
> VZEROUPPER.  (VZEROUPPER is incompatible with RTM from a performance
> point of view.)  That would perhaps negatively impact XSAVEOPT.
>
> Assuming you can make XSAVEOPT work for you on the kernel side, my
> instincts tell me that we should have markup for RTM, not for AVX-512.
> This way, we could avoid use of the AVX-512 registers and keep using
> VZEROUPPER, without run-time transaction checks, and deal with other
> idiosyncrasies needed for transaction support that users might
> encounter once this feature sees more use.  But the VZEROUPPER vs RTM
> issue is currently stuck in some internal process issue on my end (or
> two, come to think of it), which I hope to untangle next month.

Can you elaborate on the issue?


* Re: Why does glibc use AVX-512?
  2021-03-26 10:06 ` Borislav Petkov
@ 2021-03-26 18:17   ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2021-03-26 18:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, libc-alpha, H. J. Lu, X86 ML, LKML, Bae,
	Chang Seok, Florian Weimer, Carlos O'Donell, Rich Felker

On Fri, Mar 26, 2021 at 3:08 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Mar 25, 2021 at 09:38:24PM -0700, Andy Lutomirski wrote:
> > I think we should seriously consider solutions in which, for new
> > tasks, XCR0 has new giant features (e.g. AMX) and possibly even
> > AVX-512 cleared, and programs need to explicitly request enablement.
>
> I totally agree with making this depend on an explicit user request,
> but...
>
> > This would allow programs to opt into not saving/restoring across
> > signals or to save/restore in buffers supplied when the feature is
> > enabled.  This has all kinds of pros and cons, and I'm not sure it's a
> > great idea.  But, in the absence of some change to the ABI, the
> > default outcome is that, on AMX-enabled kernels on AMX-enabled
> > hardware, the signal frame will be more than 8kB, and this will affect
> > *every* signal regardless of whether AMX is in use.
>
> ... what's stopping the library from issuing that new ABI call before it
> starts the app and get <insert fat feature here> automatically enabled
> for everything by default?
>
> And then we'll get the lazy FPU thing all over again.

At the end of the day, it's not the kernel's job to make userspace be
sane or to make users or programmers make the right decisions.  But it
is our job to make sure that it's even possible to make the system
work well, and we are responsible for making sure that old binaries
continue to work, preferably well, on new kernels and new hardware.


* Re: Why does glibc use AVX-512?
  2021-03-26 18:14   ` Andy Lutomirski
@ 2021-03-26 19:34     ` Florian Weimer
  2021-03-26 19:47       ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Weimer @ 2021-03-26 19:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. J. Lu, X86 ML, LKML, Bae, Chang Seok, Carlos O'Donell,
	Rich Felker, libc-alpha

* Andy Lutomirski:

>> > AVX-512 cleared, and programs need to explicitly request enablement.
>> > This would allow programs to opt into not saving/restoring across
>> > signals or to save/restore in buffers supplied when the feature is
>> > enabled.
>>
>> Isn't XSAVEOPT already able to handle that?
>>
>
> Yes, but we need a place to put the data, and we need to acknowledge
> that, with the current save-everything-on-signal model, the amount of
> time and memory used is essentially unbounded.  This isn't great.

The size has to have a known upper bound, but the save amount can be
dynamic, right?

How was the old lazy FPU initialization support for i386 implemented?

>> Assuming you can make XSAVEOPT work for you on the kernel side, my
>> instincts tell me that we should have markup for RTM, not for AVX-512.
>> This way, we could avoid use of the AVX-512 registers and keep using
>> VZEROUPPER, without run-time transaction checks, and deal with other
>> idiosyncrasies needed for transaction support that users might
>> encounter once this feature sees more use.  But the VZEROUPPER vs RTM
>> issue is currently stuck in some internal process issue on my end (or
>> two, come to think of it), which I hope to untangle next month.
>
> Can you elaborate on the issue?

This is the bug:

  vzeroupper use in AVX2 multiarch string functions cause HTM aborts 
  <https://sourceware.org/bugzilla/show_bug.cgi?id=27457>

Unfortunately we have a bug (outside of glibc) that makes me wonder if
we can actually roll out RTM transaction checks (or any RTM
instruction) on a large scale:

  x86: Sporadic failures in tst-cpu-features-cpuinfo 
  <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>

The dynamic RTM check might trap due to this bug.  (We have a bit more
information about the nature of the bug, currently missing from
Bugzilla.)

I'm also worried that the new dynamic RTM check in the string
functions has a performance impact.  Due to its nature, it will be
enabled for every program once running on RTM-capable hardware, not
just those that actually use RTM.


* Re: Why does glibc use AVX-512?
  2021-03-26 19:34     ` Florian Weimer
@ 2021-03-26 19:47       ` Andy Lutomirski
  2021-03-26 20:06         ` Andrew Cooper
  2021-03-26 20:35         ` Florian Weimer
  0 siblings, 2 replies; 14+ messages in thread
From: Andy Lutomirski @ 2021-03-26 19:47 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, H. J. Lu, X86 ML, LKML, Bae, Chang Seok,
	Carlos O'Donell, Rich Felker, libc-alpha

On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>
> * Andy Lutomirski:
>
> >> > AVX-512 cleared, and programs need to explicitly request enablement.
> >> > This would allow programs to opt into not saving/restoring across
> >> > signals or to save/restore in buffers supplied when the feature is
> >> > enabled.
> >>
> >> Isn't XSAVEOPT already able to handle that?
> >>
> >
> > Yes, but we need a place to put the data, and we need to acknowledge
> > that, with the current save-everything-on-signal model, the amount of
> > time and memory used is essentially unbounded.  This isn't great.
>
> The size has to have a known upper bound, but the save amount can be
> dynamic, right?
>
> How was the old lazy FPU initialization support for i386 implemented?
>
> >> Assuming you can make XSAVEOPT work for you on the kernel side, my
> >> instincts tell me that we should have markup for RTM, not for AVX-512.
> >> This way, we could avoid use of the AVX-512 registers and keep using
> >> VZEROUPPER, without run-time transaction checks, and deal with other
> >> idiosyncrasies needed for transaction support that users might
> >> encounter once this feature sees more use.  But the VZEROUPPER vs RTM
> >> issue is currently stuck in some internal process issue on my end (or
> >> two, come to think of it), which I hope to untangle next month.
> >
> > Can you elaborate on the issue?
>
> This is the bug:
>
>   vzeroupper use in AVX2 multiarch string functions cause HTM aborts
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27457>
>
> Unfortunately we have a bug (outside of glibc) that makes me wonder if
> we can actually roll out RTM transaction checks (or any RTM
> instruction) on a large scale:
>
>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>

It's worth noting that recent microcode updates have made RTM
considerably less likely to actually work on many parts.  It's
possible you should just disable it. :(


* Re: Why does glibc use AVX-512?
  2021-03-26 19:47       ` Andy Lutomirski
@ 2021-03-26 20:06         ` Andrew Cooper
  2021-03-26 20:35         ` Florian Weimer
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Cooper @ 2021-03-26 20:06 UTC (permalink / raw)
  To: Andy Lutomirski, Florian Weimer
  Cc: H. J. Lu, X86 ML, LKML, Bae, Chang Seok, Carlos O'Donell,
	Rich Felker, libc-alpha

On 26/03/2021 19:47, Andy Lutomirski wrote:
> On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>> * Andy Lutomirski:
>>
>>>>> AVX-512 cleared, and programs need to explicitly request enablement.
>>>>> This would allow programs to opt into not saving/restoring across
>>>>> signals or to save/restore in buffers supplied when the feature is
>>>>> enabled.
>>>> Isn't XSAVEOPT already able to handle that?
>>>>
>>> Yes, but we need a place to put the data, and we need to acknowledge
>>> that, with the current save-everything-on-signal model, the amount of
>>> time and memory used is essentially unbounded.  This isn't great.
>> The size has to have a known upper bound, but the save amount can be
>> dynamic, right?
>>
>> How was the old lazy FPU initialization support for i386 implemented?
>>
>>>> Assuming you can make XSAVEOPT work for you on the kernel side, my
>>>> instincts tell me that we should have markup for RTM, not for AVX-512.
>>>> This way, we could avoid use of the AVX-512 registers and keep using
>>>> VZEROUPPER, without run-time transaction checks, and deal with other
>>>> idiosyncrasies needed for transaction support that users might
>>>> encounter once this feature sees more use.  But the VZEROUPPER vs RTM
>>>> issue is currently stuck in some internal process issue on my end (or
>>>> two, come to think of it), which I hope to untangle next month.
>>> Can you elaborate on the issue?
>> This is the bug:
>>
>>   vzeroupper use in AVX2 multiarch string functions cause HTM aborts
>>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27457>
>>
>> Unfortunately we have a bug (outside of glibc) that makes me wonder if
>> we can actually roll out RTM transaction checks (or any RTM
>> instruction) on a large scale:
>>
>>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
> It's worth noting that recent microcode updates have made RTM
> considerably less likely to actually work on many parts.  It's
> possible you should just disable it. :(

For a variety of errata and speculative security reasons, hypervisors
now have the ability to hide/show the HLE/RTM CPUID bits, independently
of letting TSX actually work or not.

For migration compatibility reasons, you might quite possibly find
yourself in a VM which advertises the HLE/RTM bits but will
unconditionally abort any transaction.

Honestly, if I were you, I'd just leave it to the user to explicitly opt
in if they want transactions.

~Andrew



* Re: Why does glibc use AVX-512?
  2021-03-26 19:47       ` Andy Lutomirski
  2021-03-26 20:06         ` Andrew Cooper
@ 2021-03-26 20:35         ` Florian Weimer
  2021-03-26 20:43           ` H.J. Lu
  2021-03-26 20:48           ` Andy Lutomirski
  1 sibling, 2 replies; 14+ messages in thread
From: Florian Weimer @ 2021-03-26 20:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. J. Lu, X86 ML, LKML, Bae, Chang Seok, Carlos O'Donell,
	Rich Felker, libc-alpha

* Andy Lutomirski:

> On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
>
> It's worth noting that recent microcode updates have made RTM
> considerably less likely to actually work on many parts.  It's
> possible you should just disable it. :(

Sorry, I'm not sure who should disable it.

Let me sum up the situation:

We have a request for a performance enhancement in glibc, so that
applications can use it on server parts where RTM actually works.

For CPUs that support AVX-512, we may be able to meet that with a
change that uses the new 256-bit registers, to avoid the %xmm
transition penalty.  (This is the easy case, hopefully—there shouldn't
be any frequency issues associated with that, and if the kernel
doesn't optimize the context switch today, that's a nonissue as well.)

For CPUs that do not support AVX-512 but support RTM (and AVX2), we
need a dynamic run-time check whether the string function is invoked
in a transaction.  In that case, we need to use VZEROALL instead of
VZEROUPPER.  (It's apparently too costly to issue VZEROALL
unconditionally.)
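
In C with intrinsics, the dynamic check described above could look
roughly like the sketch below.  This is not the glibc implementation
(the string functions are written in assembly); choose_epilogue() and
vector_epilogue() are names invented for this example, and the target
attribute assumes GCC or Clang on x86-64.  vector_epilogue() must only
be called on CPUs where RTM and AVX actually work, because XTEST raises
#UD otherwise.

```c
/* Which register-clearing instruction to issue after AVX2 code. */
enum epilogue { EPILOGUE_VZEROUPPER, EPILOGUE_VZEROALL };

/* Inside a transaction use VZEROALL, since (per this thread) VZEROUPPER
   aborts the transaction; otherwise use the cheaper VZEROUPPER. */
static inline enum epilogue choose_epilogue(int in_transaction)
{
    return in_transaction ? EPILOGUE_VZEROALL : EPILOGUE_VZEROUPPER;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Assumption: the caller has established that RTM and AVX are usable
   on every CPU this thread may run on. */
static inline __attribute__((target("avx,rtm"))) void vector_epilogue(void)
{
    /* _xtest() returns nonzero while executing inside a transaction. */
    if (choose_epilogue(_xtest()) == EPILOGUE_VZEROALL)
        _mm256_zeroall();
    else
        _mm256_zeroupper();
}
#endif
```

The extra XTEST and branch on every call is the per-call cost that the
performance concern elsewhere in this thread refers to.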

All this needs to work transparently without user intervention.  We
cannot require firmware upgrades to fix the incorrect RTM reporting
issue (the bug I referenced).  I think we can require software updates
which tell glibc when to use RTM-enabled string functions if the
dynamic selection does not work (either for performance reasons, or
because of the RTM reporting bug).

I want to avoid a situation where one in eight processes fail to work
correctly because the CPUID checks ran on CPU 0, where RTM is reported
as available, and then we trap when executing XTEST on other CPUs.


* Re: Why does glibc use AVX-512?
  2021-03-26 20:35         ` Florian Weimer
@ 2021-03-26 20:43           ` H.J. Lu
  2021-03-26 20:48           ` Andy Lutomirski
  1 sibling, 0 replies; 14+ messages in thread
From: H.J. Lu @ 2021-03-26 20:43 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, X86 ML, LKML, Bae, Chang Seok,
	Carlos O'Donell, Rich Felker, libc-alpha

On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer <fw@deneb.enyo.de> wrote:

>
> All this needs to work transparently without user intervention.  We
> cannot require firmware upgrades to fix the incorrect RTM reporting
> issue (the bug I referenced).  I think we can require software updates
> which tell glibc when to use RTM-enabled string functions if the
> dynamic selection does not work (either for performance reasons, or
> because of the RTM reporting bug).
>
> I want to avoid a situation where one in eight processes fail to work
> correctly because the CPUID checks ran on CPU 0, where RTM is reported
> as available, and then we trap when executing XTEST on other CPUs.

glibc can disable RTM based on CPU model and stepping.
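
As an illustration of what a model/stepping check involves, the sketch
below decodes family, model and stepping from CPUID leaf 1 EAX using
the standard Intel encoding.  The deny-list entry in rtm_denied() is a
placeholder picked for the example; it is not copied from glibc and
should not be read as the actual list.  All names are hypothetical.

```c
#include <stdint.h>

/* Decoded CPUID leaf 1 signature. */
struct cpu_sig {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* Standard decoding: the extended family is added only when the base
   family is 0xf; the extended model is used for base family 6 or 0xf. */
static inline struct cpu_sig decode_cpuid1_eax(uint32_t eax)
{
    struct cpu_sig s;
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned base_model  = (eax >> 4) & 0xf;

    s.stepping = eax & 0xf;
    s.family = base_family;
    if (base_family == 0xf)
        s.family += (eax >> 20) & 0xff;
    s.model = base_model;
    if (base_family == 0x6 || base_family == 0xf)
        s.model |= ((eax >> 16) & 0xf) << 4;
    return s;
}

/* HYPOTHETICAL deny list with a single illustrative entry. */
static inline int rtm_denied(struct cpu_sig s)
{
    return s.family == 6 && s.model == 0x3c && s.stepping <= 3;
}
```

For example, the signature 0x000306C3 decodes to family 6, model 0x3c,
stepping 3, which the placeholder list above would reject.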

-- 
H.J.


* Re: Why does glibc use AVX-512?
  2021-03-26 20:35         ` Florian Weimer
  2021-03-26 20:43           ` H.J. Lu
@ 2021-03-26 20:48           ` Andy Lutomirski
  2021-03-26 21:11             ` Florian Weimer
  1 sibling, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2021-03-26 20:48 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, H. J. Lu, X86 ML, LKML, Bae, Chang Seok,
	Carlos O'Donell, Rich Felker, libc-alpha

On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>
> * Andy Lutomirski:
>
> > On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <fw@deneb.enyo.de> wrote:
> >>   x86: Sporadic failures in tst-cpu-features-cpuinfo
> >>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
> >
> > It's worth noting that recent microcode updates have made RTM
> > considerably less likely to actually work on many parts.  It's
> > possible you should just disable it. :(
>
> Sorry, I'm not sure who should disable it.
>
> Let me sum up the situation:
>
> We have a request for a performance enhancement in glibc, so that
> applications can use it on server parts where RTM actually works.
>
> For CPUs that support AVX-512, we may be able to meet that with a
> change that uses the new 256-bit registers, to avoid the %xmm
> transition penalty.  (This is the easy case, hopefully—there shouldn't
> be any frequency issues associated with that, and if the kernel
> doesn't optimize the context switch today, that's a nonissue as well.)

I would make sure that the transition penalty actually works the way
you think it does.  My general experience with the transition
penalties is that the CPU is rather more aggressive about penalizing
you than makes sense.

>
> For CPUs that do not support AVX-512 but support RTM (and AVX2), we
> need a dynamic run-time check whether the string function is invoked
> in a transaction.  In that case, we need to use VZEROALL instead of
> VZEROUPPER.  (It's apparently too costly to issue VZEROALL
> unconditionally.)

So VZEROALL works in a transaction and VZEROUPPER doesn't?  That's bizarre.


> All this needs to work transparently without user intervention.  We
> cannot require firmware upgrades to fix the incorrect RTM reporting
> issue (the bug I referenced).  I think we can require software updates
> which tell glibc when to use RTM-enabled string functions if the
> dynamic selection does not work (either for performance reasons, or
> because of the RTM reporting bug).
>
> I want to avoid a situation where one in eight processes fail to work
> correctly because the CPUID checks ran on CPU 0, where RTM is reported
> as available, and then we trap when executing XTEST on other CPUs.

What kind of system has that problem?  If RTM reports as available,
then it should work in the sense of not trapping.  (There is no
guarantee that transactions will *ever* complete, and that part is no
joke.)


* Re: Why does glibc use AVX-512?
  2021-03-26 20:48           ` Andy Lutomirski
@ 2021-03-26 21:11             ` Florian Weimer
  2021-03-26 21:21               ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Weimer @ 2021-03-26 21:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. J. Lu, X86 ML, LKML, Bae, Chang Seok, Carlos O'Donell,
	Rich Felker, libc-alpha

* Andy Lutomirski:

> On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>>
>> * Andy Lutomirski:
>>
>> > On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>> >>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>> >>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
>> >
>> > It's worth noting that recent microcode updates have made RTM
>> > considerably less likely to actually work on many parts.  It's
>> > possible you should just disable it. :(
>>
>> Sorry, I'm not sure who should disable it.
>>
>> Let me sum up the situation:
>>
>> We have a request for a performance enhancement in glibc, so that
>> applications can use it on server parts where RTM actually works.
>>
>> For CPUs that support AVX-512, we may be able to meet that with a
>> change that uses the new 256-bit registers, to avoid the %xmm
>> transition penalty.  (This is the easy case, hopefully—there shouldn't
>> be any frequency issues associated with that, and if the kernel
>> doesn't optimize the context switch today, that's a nonissue as well.)
>
> I would make sure that the transition penalty actually works the way
> you think it does.  My general experience with the transition
> penalties is that the CPU is rather more aggressive about penalizing
> you than makes sense.

Do you mean the frequency/thermal budget?

I mean the immense slowdown you get if you use %xmm registers after
their %ymm counterparts (doesn't have to be %zmm, that issue is
present starting with AVX) and you have not issued VZEROALL or
VZEROUPPER between the two uses.

It's a bit like EMMS, I guess, only that you don't get corruption, just
really poor performance.

>> For CPUs that do not support AVX-512 but support RTM (and AVX2), we
>> need a dynamic run-time check whether the string function is invoked
>> in a transaction.  In that case, we need to use VZEROALL instead of
>> VZEROUPPER.  (It's apparently too costly to issue VZEROALL
>> unconditionally.)
>
> So VZEROALL works in a transaction and VZEROUPPER doesn't?  That's bizarre.

Apparently yes.

>> All this needs to work transparently without user intervention.  We
>> cannot require firmware upgrades to fix the incorrect RTM reporting
>> issue (the bug I referenced).  I think we can require software updates
>> which tell glibc when to use RTM-enabled string functions if the
>> dynamic selection does not work (either for performance reasons, or
>> because of the RTM reporting bug).
>>
>> I want to avoid a situation where one in eight processes fail to work
>> correctly because the CPUID checks ran on CPU 0, where RTM is reported
>> as available, and then we trap when executing XTEST on other CPUs.
>
> What kind of system has that problem?

It's a standard laptop after a suspend/resume cycle.  It's either a
kernel or firmware bug.

> If RTM reports as available, then it should work in the sense of not
> trapping.  (There is no guarantee that transactions will *ever*
> complete, and that part is no joke.)

XTEST doesn't abort transactions, but it traps without RTM support.
If CPU0 has RTM support and we enable XTEST use in glibc based on that
(because the startup code runs on CPU0), then the XTEST instruction
must not trap when running on other CPUs.

Currently, we do not use RTM for anything in glibc by default, even if
it is available according to CPUID.  (There are ways to opt in, unless
the CPU is on the disallow list due to the early Haswell bug.)  I'm
worried that if we start executing XTEST on all CPUs that indicate RTM
support, we will see lots of weird issues, along the lines of bug 27398.


* Re: Why does glibc use AVX-512?
  2021-03-26 21:11             ` Florian Weimer
@ 2021-03-26 21:21               ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2021-03-26 21:21 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, H. J. Lu, X86 ML, LKML, Bae, Chang Seok,
	Carlos O'Donell, Rich Felker, libc-alpha



> On Mar 26, 2021, at 2:11 PM, Florian Weimer <fw@deneb.enyo.de> wrote:
> 
> * Andy Lutomirski:
> 
>> On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer <fw@deneb.enyo.de> wrote:
>
> I mean the immense slowdown you get if you use %xmm registers after
> their %ymm counterparts (doesn't have to be %zmm, that issue is
> present starting with AVX) and you have not issued VZEROALL or
> VZEROUPPER between the two uses.

It turns out that it’s not necessary to access the registers in
question to trigger this behavior. You just need to make the CPU think
it should penalize you. For example, LDMXCSR appears to be a legacy SSE
insn for this purpose, and VLDMXCSR is an AVX insn for this purpose. I
wouldn’t trust that using ymm9 would avoid the penalty just because
common sense says it should.

>> What kind of system has that problem?
> 
> It's a standard laptop after a suspend/resume cycle.  It's either a
> kernel or firmware bug.

What kernel version?  I think fixing the kernel makes more sense than
fixing glibc.

