public inbox for glibc-bugs@sourceware.org
* [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts
@ 2021-02-22 12:40 rguenth at gcc dot gnu.org
  2021-02-22 12:40 ` [Bug string/27457] " rguenth at gcc dot gnu.org
                   ` (40 more replies)
  0 siblings, 41 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-22 12:40 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

            Bug ID: 27457
           Summary: vzeroupper use in AVX2 multiarch string functions
                    cause HTM aborts
           Product: glibc
           Version: 2.31
            Status: NEW
          Severity: normal
          Priority: P2
         Component: string
          Assignee: unassigned at sourceware dot org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

The use of vzeroupper in, for example, strcmp on an AVX2-capable machine like
Skylake-X causes HTM aborts when the function is called inside a transaction.
This causes severe performance degradation for some workloads compared to glibc
without those multiarch implementations.

For one specific benchmark the following hack restores performance (as does
removing the VZEROUPPER or replacing it with the far more costly VZEROALL):

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index ee82fa3e19..208b396557 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -127,7 +127,8 @@ L(return):
        movzbl  (%rsi, %rdx), %edx
        subl    %edx, %eax
 # endif
-       VZEROUPPER
+       vpxor   %ymm0, %ymm0, %ymm0
+       vpxor   %ymm1, %ymm1, %ymm1
        ret

        .p2align 4
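
For anyone without access to the benchmark, the effect can be probed with a
small sketch like the one below.  It is not part of the report: the names
cpu_has_rtm, txn_stats and strcmp_in_transactions are made up for
illustration, and it assumes GCC or Clang on x86-64 with strcmp resolving to
the glibc multiarch implementation.  It only counts commits and aborts;
transactions can abort for unrelated reasons, so a high abort count alone does
not prove that vzeroupper is the cause.

```c
#include <cpuid.h>
#include <immintrin.h>
#include <string.h>

/* Hypothetical helper: CPUID.(EAX=7,ECX=0):EBX bit 11 enumerates RTM. */
static int cpu_has_rtm(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 11) & 1;
}

/* Illustrative result type, not taken from any library. */
struct txn_stats {
    unsigned int attempts;  /* transactions started (0 if RTM is absent) */
    unsigned int commits;   /* transactions that reached _xend */
    unsigned int aborts;    /* transactions that were rolled back */
};

/* Call strcmp once inside each of `attempts` RTM transactions. */
__attribute__((target("rtm")))
struct txn_stats strcmp_in_transactions(const char *a, const char *b,
                                        unsigned int attempts)
{
    struct txn_stats s = { 0, 0, 0 };

    if (!cpu_has_rtm())
        return s;               /* xbegin would fault without RTM */

    s.attempts = attempts;
    for (unsigned int i = 0; i < attempts; i++) {
        unsigned int status = _xbegin();

        if (status == _XBEGIN_STARTED) {
            volatile int r = strcmp(a, b);  /* keep the call alive */
            (void) r;
            _xend();
            s.commits++;
        } else {
            s.aborts++;         /* control resumes here after an abort */
        }
    }
    return s;
}
```

Comparing the abort count with and without the hack above (or with a
non-multiarch strcmp) on the same machine is the interesting measurement.  The
target attribute should make a separate -mrtm flag unnecessary.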

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
@ 2021-02-22 12:40 ` rguenth at gcc dot gnu.org
  2021-02-22 14:50 ` matz at suse dot de
                   ` (39 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-22 12:40 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl at sourceware dot org


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
  2021-02-22 12:40 ` [Bug string/27457] " rguenth at gcc dot gnu.org
@ 2021-02-22 14:50 ` matz at suse dot de
  2021-02-22 15:00 ` rguenth at gcc dot gnu.org
                   ` (38 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: matz at suse dot de @ 2021-02-22 14:50 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

Michael Matz <matz at suse dot de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at suse dot de

--- Comment #1 from Michael Matz <matz at suse dot de> ---
FWIW, the (proprietary) benchmark regresses by 40% (!) when using the avx2
strcmp routines, even though strcmp accounts for only about 1% of the overall
runtime.  So the cost of the transaction aborts caused by vzeroupper here is
quite tremendous.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
  2021-02-22 12:40 ` [Bug string/27457] " rguenth at gcc dot gnu.org
  2021-02-22 14:50 ` matz at suse dot de
@ 2021-02-22 15:00 ` rguenth at gcc dot gnu.org
  2021-02-22 15:26 ` hjl.tools at gmail dot com
                   ` (37 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-02-22 15:00 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
A more correct version; the wcscmp path also ends here, with higher %ymm regs
in use:

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index ee82fa3e19..bd3b6243e2 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -122,13 +122,16 @@ L(wcscmp_return):
        negl    %eax
        orl     $1, %eax
 L(return):
+       VZEROUPPER
+       ret
 # else
        movzbl  (%rdi, %rdx), %eax
        movzbl  (%rsi, %rdx), %edx
        subl    %edx, %eax
-# endif
-       VZEROUPPER
+       vpxor   %ymm0, %ymm0, %ymm0
+       vpxor   %ymm1, %ymm1, %ymm1
        ret
+# endif

        .p2align 4
 L(return_vec_size):


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-02-22 15:00 ` rguenth at gcc dot gnu.org
@ 2021-02-22 15:26 ` hjl.tools at gmail dot com
  2021-02-22 15:26 ` hjl.tools at gmail dot com
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-02-22 15:26 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at sourceware dot org   |hjl.tools at gmail dot com


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-02-22 15:26 ` hjl.tools at gmail dot com
@ 2021-02-22 15:26 ` hjl.tools at gmail dot com
  2021-02-22 18:45 ` fweimer at redhat dot com
                   ` (35 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-02-22 15:26 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|hjl at sourceware dot org          |hjl.tools at gmail dot com


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-02-22 15:26 ` hjl.tools at gmail dot com
@ 2021-02-22 18:45 ` fweimer at redhat dot com
  2021-02-23  9:44 ` roman.dementiev at intel dot com
                   ` (34 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: fweimer at redhat dot com @ 2021-02-22 18:45 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fweimer at redhat dot com
              Flags|                            |security-


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-02-22 18:45 ` fweimer at redhat dot com
@ 2021-02-23  9:44 ` roman.dementiev at intel dot com
  2021-02-27  2:39 ` hjl.tools at gmail dot com
                   ` (33 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: roman.dementiev at intel dot com @ 2021-02-23  9:44 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

Roman Dementiev <roman.dementiev at intel dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roman.dementiev at intel dot com


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-02-23  9:44 ` roman.dementiev at intel dot com
@ 2021-02-27  2:39 ` hjl.tools at gmail dot com
  2021-02-27  7:34 ` rguenther at suse dot de
                   ` (32 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-02-27  2:39 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |2.34

--- Comment #3 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Richard Biener from comment #2)
> More correct, the wcscmp path ends here with higher %ymm regs used
> 
> diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> index ee82fa3e19..bd3b6243e2 100644
> --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
> @@ -122,13 +122,16 @@ L(wcscmp_return):
>         negl    %eax
>         orl     $1, %eax
>  L(return):
> +       VZEROUPPER
> +       ret
>  # else
>         movzbl  (%rdi, %rdx), %eax
>         movzbl  (%rsi, %rdx), %edx
>         subl    %edx, %eax
> -# endif
> -       VZEROUPPER
> +       vpxor   %ymm0, %ymm0, %ymm0
> +       vpxor   %ymm1, %ymm1, %ymm1
>         ret
> +# endif
>  
>         .p2align 4
>  L(return_vec_size):

These won't remove the AVX-SSE transition penalty.  I am re-implementing
all AVX string/memory functions with YMM16-YMM31, which don't need
VZEROUPPER.  My current work is on the users/hjl/pr27457/evex branch at

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr27457/evex


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2021-02-27  2:39 ` hjl.tools at gmail dot com
@ 2021-02-27  7:34 ` rguenther at suse dot de
  2021-02-28 14:53 ` hjl.tools at gmail dot com
                   ` (31 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenther at suse dot de @ 2021-02-27  7:34 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #4 from rguenther at suse dot de ---
On February 27, 2021 3:39:50 AM GMT+01:00, "hjl.tools at gmail dot com"
<sourceware-bugzilla@sourceware.org> wrote:
>https://sourceware.org/bugzilla/show_bug.cgi?id=27457
>
>H.J. Lu <hjl.tools at gmail dot com> changed:
>
>           What    |Removed                     |Added
>----------------------------------------------------------------------------
>   Target Milestone|---                         |2.34
>
>--- Comment #3 from H.J. Lu <hjl.tools at gmail dot com> ---
>(In reply to Richard Biener from comment #2)
>> More correct, the wcscmp path ends here with higher %ymm regs used
>> 
>> diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S
>> b/sysdeps/x86_64/multiarch/strcmp-avx2.S
>> index ee82fa3e19..bd3b6243e2 100644
>> --- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
>> +++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
>> @@ -122,13 +122,16 @@ L(wcscmp_return):
>>         negl    %eax
>>         orl     $1, %eax
>>  L(return):
>> +       VZEROUPPER
>> +       ret
>>  # else
>>         movzbl  (%rdi, %rdx), %eax
>>         movzbl  (%rsi, %rdx), %edx
>>         subl    %edx, %eax
>> -# endif
>> -       VZEROUPPER
>> +       vpxor   %ymm0, %ymm0, %ymm0
>> +       vpxor   %ymm1, %ymm1, %ymm1
>>         ret
>> +# endif
>>  
>>         .p2align 4
>>  L(return_vec_size):
>
>These won't remove AVX-SSE transition penalty.  I am re-implementing
>all AVX string/memory functions with YMM16-YMM31, which don't need
>VZEROUPPER. 

It should still avoid any false dependences. Ymm16 to ymm31 are only available
with AVX512, so that will make the AVX2 string functions unusable on non-AVX512
hardware. Are you introducing another set of functions then?

>My current work is on users/hjl/pr27457/evex branch at
>
>https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr27457/evex


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2021-02-27  7:34 ` rguenther at suse dot de
@ 2021-02-28 14:53 ` hjl.tools at gmail dot com
  2021-03-01 11:32 ` fweimer at redhat dot com
                   ` (30 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-02-28 14:53 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to rguenther from comment #4)
> 
> It should still avoid any false dependences. Ymm16 to ymm31 are only
> available with AVX512, that will make the AVX2 string functions unusable on
> non-avx512 hardware. Are you introducing another set of functions then?
>
> > My current work is on users/hjl/pr27457/evex branch at
> >
> >https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/pr27457/evex

I added another set of AVX/AVX2 functions to support RTM.  You can use
libcpu-rt-c.so:

https://gitlab.com/cpu-rt/glibc/-/wikis/libcpu-rt-c.so-to-avoid-vzeroupper-in-RTM-region

to try it out.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2021-02-28 14:53 ` hjl.tools at gmail dot com
@ 2021-03-01 11:32 ` fweimer at redhat dot com
  2021-03-01 12:24 ` mliska at suse dot cz
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: fweimer at redhat dot com @ 2021-03-01 11:32 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #6 from Florian Weimer <fweimer at redhat dot com> ---
(In reply to H.J. Lu from comment #5)
> I added another set of AVX/AVX2 functions to support RTM.  You can use
> libcpu-rt-c.so:
> 
> https://gitlab.com/cpu-rt/glibc/-/wikis/libcpu-rt-c.so-to-avoid-vzeroupper-in-RTM-region
> 
> to try it out.

The sources do not seem to show the selection logic.
__memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
instruction, which can't be good for general application performance.

Anyway, is there anything we can realistically do here on CPUs with AVX2, but
without AVX-512? VPXOR doesn't avoid the transition penalty according to the
Intel documentation. Switching to 128-bit registers once the CPU supports RTM
is also likely to introduce performance regressions in non-transactional code.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2021-03-01 11:32 ` fweimer at redhat dot com
@ 2021-03-01 12:24 ` mliska at suse dot cz
  2021-03-01 12:47 ` rguenther at suse dot de
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: mliska at suse dot cz @ 2021-03-01 12:24 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

Martin Liska <mliska at suse dot cz> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mliska at suse dot cz


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2021-03-01 12:24 ` mliska at suse dot cz
@ 2021-03-01 12:47 ` rguenther at suse dot de
  2021-03-01 13:13 ` roman.dementiev at intel dot com
                   ` (27 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenther at suse dot de @ 2021-03-01 12:47 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #7 from rguenther at suse dot de ---
On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:

> https://sourceware.org/bugzilla/show_bug.cgi?id=27457
> 
> --- Comment #6 from Florian Weimer <fweimer at redhat dot com> ---
> (In reply to H.J. Lu from comment #5)
> > I added another set of AVX/AVX2 functions to support RTM.  You can use
> > libcpu-rt-c.so:
> > 
> > https://gitlab.com/cpu-rt/glibc/-/wikis/libcpu-rt-c.so-to-avoid-vzeroupper-in-RTM-region
> > 
> > to try it out.
> 
> The sources do not seem to show the selection logic.
> __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> instruction, which can't be good for general application performance.

The intent was to use %ymm16+ only which does not cause any transition
penalty even w/o vzeroupper.

> Anyway, is there anything we can realistically do here on CPUs with AVX2, but
> without AVX-512? VPXOR doesn't avoid the transition penalty according to the
> Intel documentation. Switching to 128-bit registers once the CPU supports RTM
> is also likely to introduce performance regressions in non-transactional code.

There's the option to use another path when HTM is available, doing
xtest and branch to the non-AVX variants when a transaction is active.
I'm not sure about the overhead of xtest here.  Nor am I sure whether
Intel has (or plans any) SKUs with HTM but not AVX512.  Since HTM
on broadwell + haswell is crippled due to bugs (and thus usually
disabled in firmware) this leaves Skylake and later where I don't
know of any HTM but no-AVX512 SKUs.  But this is Intel, so ...


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2021-03-01 12:47 ` rguenther at suse dot de
@ 2021-03-01 13:13 ` roman.dementiev at intel dot com
  2021-03-01 13:19 ` fweimer at redhat dot com
                   ` (26 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: roman.dementiev at intel dot com @ 2021-03-01 13:13 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #8 from Roman Dementiev <roman.dementiev at intel dot com> ---
xtest has a latency of a few cycles.

RTM is not disabled in firmware by default on HSX-EX, BDW-EP, BDW-EX, SKX, CLX,
CPX (server SKUs). This is different on client SKUs.

Examples of RTM but no-AVX512 SKUs: HSX-EX, BDW-EP, BDW-EX.

--
Roman


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2021-03-01 13:13 ` roman.dementiev at intel dot com
@ 2021-03-01 13:19 ` fweimer at redhat dot com
  2021-03-01 13:21 ` hjl.tools at gmail dot com
                   ` (25 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: fweimer at redhat dot com @ 2021-03-01 13:19 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #9 from Florian Weimer <fweimer at redhat dot com> ---
(In reply to rguenther from comment #7)
> On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > The sources do not seem to show the selection logic.
> > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > instruction, which can't be good for general application performance.
> 
> The intent was to use %ymm16+ only which does not cause any transition
> penalty even w/o vzeroupper.

I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
AVX2, there isn't much choice. I didn't try to reverse-engineer the
corresponding IFUNC selector.

> There's the option to use another path when HTM is available, doing
> xtest and branch to the non-AVX variants when a transaction is active.
> I'm not sure about the overhead of xtest here.  Nor am I sure whether
> Intel has (or plans any) SKUs with HTM but not AVX512.  Since HTM
> on broadwell + haswell is crippled due to bugs (and thus usually
> disabled in firmware) this leaves Skylake and later where I don't
> know of any HTM but no-AVX512 SKUs.  But this is Intel, so ...

Skylake is an overloaded term. There is e.g. Xeon E3-1240 v5 which is
advertised as having the Skylake microarchitecture, but it doesn't have
AVX-512. (I haven't checked RTM status, but I don't see why it wouldn't be
supported.) Xeon E3-1240 v6 is Kaby Lake, but it doesn't have AVX-512 either,
and according to what I see, RTM is not disabled by at least one current
distribution kernel/microcode combination.

But I think the gist is that RTM without AVX-512 exists out there even in
server parts.
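
Put differently, the choice of variant becomes at least three-way.  A rough
model of the selection discussed in this thread, written as plain C, might
look like the sketch below.  This is not glibc's IFUNC selector: the enum, the
function name and the three boolean inputs are hypothetical, and which
AVX-512 subsets the %ymm16+ variants really require is not stated in the
thread.

```c
#include <stdbool.h>

/* Hypothetical labels for the variants mentioned in the thread. */
enum string_variant {
    VARIANT_SSE2,      /* 128-bit registers only, nothing to clear on return */
    VARIANT_AVX2,      /* %ymm0-%ymm15, plain VZEROUPPER on return */
    VARIANT_AVX2_RTM,  /* %ymm0-%ymm15, XTEST-guarded return path */
    VARIANT_EVEX       /* %ymm16-%ymm31 only, no VZEROUPPER at all */
};

/* Pick a variant from CPU feature bits (an illustrative ordering, not the
   real selector). */
enum string_variant select_string_variant(bool has_avx2, bool has_avx512,
                                          bool has_rtm)
{
    if (!has_avx2)
        return VARIANT_SSE2;
    if (has_avx512)
        return VARIANT_EVEX;      /* avoids the problem on any CPU */
    if (has_rtm)
        return VARIANT_AVX2_RTM;  /* RTM without AVX-512, e.g. HSX-EX */
    return VARIANT_AVX2;          /* no transactions possible */
}
```

With this ordering, only CPUs that have RTM but lack AVX-512 pay for the
extra xtest on return.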


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2021-03-01 13:19 ` fweimer at redhat dot com
@ 2021-03-01 13:21 ` hjl.tools at gmail dot com
  2021-03-01 13:24 ` hjl.tools at gmail dot com
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 13:21 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #10 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Florian Weimer from comment #6)
> (In reply to H.J. Lu from comment #5)
> > I added another set of AVX/AVX2 functions to support RTM.  You can use
> > libcpu-rt-c.so:
> > 
> > https://gitlab.com/cpu-rt/glibc/-/wikis/libcpu-rt-c.so-to-avoid-vzeroupper-in-RTM-region
> > 
> > to try it out.
> 
> The sources do not seem to show the selection logic.
> __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> instruction, which can't be good for general application performance.

memset-avx2-unaligned-erms-rtm.S has

#define ZERO_UPPER_VEC_REGISTERS_RETURN \
  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST

0000000000000040 <__memset_avx2_unaligned_rtm>:
  40:   f3 0f 1e fa             endbr64
  44:   c5 f9 6e c6             vmovd  %esi,%xmm0
  48:   48 89 f8                mov    %rdi,%rax
  4b:   c4 e2 7d 78 c0          vpbroadcastb %xmm0,%ymm0
  50:   48 83 fa 20             cmp    $0x20,%rdx
  54:   0f 82 0b 01 00 00       jb     165 <__memset_avx2_unaligned_erms_rtm+0xc5>
  5a:   48 83 fa 40             cmp    $0x40,%rdx
  5e:   77 75                   ja     d5 <__memset_avx2_unaligned_erms_rtm+0x35>
  60:   c5 fe 7f 44 17 e0       vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  66:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
  6a:   e9 84 00 00 00          jmp    f3 <__memset_avx2_unaligned_erms_rtm+0x53>
  6f:   90                      nop
...
00000000000000a0 <__memset_avx2_unaligned_erms_rtm>:
  a0:   f3 0f 1e fa             endbr64
  a4:   c5 f9 6e c6             vmovd  %esi,%xmm0
  a8:   48 89 f8                mov    %rdi,%rax
  ab:   c4 e2 7d 78 c0          vpbroadcastb %xmm0,%ymm0
  b0:   48 83 fa 20             cmp    $0x20,%rdx
  b4:   0f 82 ab 00 00 00       jb     165 <__memset_avx2_unaligned_erms_rtm+0xc5>
  ba:   48 83 fa 40             cmp    $0x40,%rdx
  be:   77 0c                   ja     cc <__memset_avx2_unaligned_erms_rtm+0x2c>
  c0:   c5 fe 7f 44 17 e0       vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  c6:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
  ca:   eb 27                   jmp    f3 <__memset_avx2_unaligned_erms_rtm+0x53>
  cc:   48 3b 15 00 00 00 00    cmp    0x0(%rip),%rdx        # d3 <__memset_avx2_unaligned_erms_rtm+0x33>
  d3:   77 9f                   ja     74 <__memset_avx2_erms_rtm+0x4>
  d5:   48 81 fa 80 00 00 00    cmp    $0x80,%rdx
  dc:   77 22                   ja     100 <__memset_avx2_unaligned_erms_rtm+0x60>
  de:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
  e2:   c5 fe 7f 47 20          vmovdqu %ymm0,0x20(%rdi)
  e7:   c5 fe 7f 44 17 e0       vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
  ed:   c5 fe 7f 44 17 c0       vmovdqu %ymm0,-0x40(%rdi,%rdx,1)
  f3:   0f 01 d6                xtest
  f6:   74 04                   je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
  f8:   c5 fc 77                vzeroall
  fb:   c3                      ret
  fc:   c5 f8 77                vzeroupper
  ff:   c3                      ret
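
In C terms, the tail of the listing (xtest, then either vzeroall or
vzeroupper before ret) corresponds roughly to the sketch below, written with
compiler intrinsics.  It assumes GCC or Clang on x86; the enum, cpu_has_rtm
and zero_upper_vec_registers are names invented for this illustration, and
the CPUID guards exist only so the sketch can be run safely on machines
without RTM or AVX.  They are not part of the glibc code.

```c
#include <cpuid.h>
#include <immintrin.h>

/* Illustrative result codes, not taken from glibc. */
enum upper_clear {
    CLEARED_NOTHING,          /* no AVX: nothing to clear */
    CLEARED_WITH_VZEROUPPER,  /* normal, non-transactional return path */
    CLEARED_WITH_VZEROALL     /* return path inside an RTM transaction */
};

/* Hypothetical helper: CPUID.(EAX=7,ECX=0):EBX bit 11 enumerates RTM. */
static int cpu_has_rtm(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 11) & 1;
}

/* Clear the upper halves of the YMM registers without aborting an active
   transaction: VZEROALL inside a transaction, VZEROUPPER otherwise. */
__attribute__((target("avx,rtm")))
enum upper_clear zero_upper_vec_registers(void)
{
    if (!__builtin_cpu_supports("avx"))
        return CLEARED_NOTHING;

    /* _xtest() returns nonzero only while a transaction is active. */
    if (cpu_has_rtm() && _xtest()) {
        _mm256_zeroall();
        return CLEARED_WITH_VZEROALL;
    }

    _mm256_zeroupper();
    return CLEARED_WITH_VZEROUPPER;
}
```

Outside a transaction this always takes the vzeroupper branch, so
non-transactional callers only pay for the xtest and the branch.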


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2021-03-01 13:21 ` hjl.tools at gmail dot com
@ 2021-03-01 13:24 ` hjl.tools at gmail dot com
  2021-03-01 13:27 ` hjl.tools at gmail dot com
                   ` (23 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 13:24 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #11 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Florian Weimer from comment #9)
> (In reply to rguenther from comment #7)
> > On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > > The sources do not seem to show the selection logic.
> > > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > > instruction, which can't be good for general application performance.
> > 
> > The intent was to use %ymm16+ only which does not cause any transition
> > penalty even w/o vzeroupper.
> 
> I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> AVX2, there isn't much choice. I didn't try to reverse-engineer the
> corresponding IFUNC selector.

At function exit, there is

  f3:   0f 01 d6                xtest  
  f6:   74 04                   je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
  f8:   c5 fc 77                vzeroall 
  fb:   c3                      ret    
  fc:   c5 f8 77                vzeroupper 
  ff:   c3                      ret    

> > There's the option to use another path when HTM is available, doing
> > xtest and branch to the non-AVX variants when a transaction is active.
> > I'm not sure about the overhead of xtest here.  Nor am I sure whether
> > Intel has (or plans any) SKUs with HTM but not AVX512.  Since HTM
> > on broadwell + haswell is crippled due to bugs (and thus usually
> > disabled in firmware) this leaves Skylake and later where I don't
> > know of any HTM but no-AVX512 SKUs.  But this is Intel, so ...
> 
> Skylake is an overloaded term. There is e.g. Xeon E3-1240 v5 which is
> advertised as having the Skylake microarchitecture, but it doesn't have
> AVX-512. (I haven't checked RTM status, but I don't see why it wouldn't be
> supported.) Xeon E3-1240 v6 is Kaby Lake, but it doesn't have AVX-512
> either, and according to what I see, RTM is not disabled by at least one
> current distribution kernel/microcode combination.
> 
> But I think the gist is that RTM without AVX-512 exists out there even in
> server parts.

It is handled by

#define ZERO_UPPER_VEC_REGISTERS_RETURN \
  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2021-03-01 13:24 ` hjl.tools at gmail dot com
@ 2021-03-01 13:27 ` hjl.tools at gmail dot com
  2021-03-01 13:29 ` hjl.tools at gmail dot com
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 13:27 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #12 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Florian Weimer from comment #9)
> (In reply to rguenther from comment #7)
> > On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > > The sources do not seem to show the selection logic.
> > > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > > instruction, which can't be good for general application performance.
> > 
> > The intent was to use %ymm16+ only which does not cause any transition
> > penalty even w/o vzeroupper.
> 
> I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> AVX2, there isn't much choice. I didn't try to reverse-engineer the
> corresponding IFUNC selector.
> 

memset-evex-unaligned-erms.S has

# define VEC_SIZE       32
# define XMM0           xmm16
# define YMM0           ymm16
# define VEC0           ymm16
# define VEC(i)         VEC##i
# define VMOVU          vmovdqu64
# define VMOVA          vmovdqa64
# define VZEROUPPER

There is neither %ymm0 nor vzeroupper.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2021-03-01 13:27 ` hjl.tools at gmail dot com
@ 2021-03-01 13:29 ` hjl.tools at gmail dot com
  2021-03-01 13:44 ` rguenth at gcc dot gnu.org
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 13:29 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #13 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to rguenther from comment #7)
> On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
>
> There's the option to use another path when HTM is available, doing
> xtest and branch to the non-AVX variants when a transaction is active.
> I'm not sure about the overhead of xtest here.  Nor am I sure whether

The overhead is low.  Can you verify it?


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2021-03-01 13:29 ` hjl.tools at gmail dot com
@ 2021-03-01 13:44 ` rguenth at gcc dot gnu.org
  2021-03-01 14:05 ` hjl.tools at gmail dot com
                   ` (20 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-01 13:44 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #11)
> (In reply to Florian Weimer from comment #9)
> > (In reply to rguenther from comment #7)
> > > On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > > > The sources do not seem to show the selection logic.
> > > > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > > > instruction, which can't be good for general application performance.
> > > 
> > > The intent was to use %ymm16+ only which does not cause any transition
> > > penalty even w/o vzeroupper.
> > 
> > I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> > AVX2, there isn't much choice. I didn't try to reverse-engineer the
> > corresponding IFUNC selector.
> 
> At function exit, there is
> 
>   f3:	0f 01 d6             	xtest  
>   f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
>   f8:	c5 fc 77             	vzeroall 
>   fb:	c3                   	ret    
>   fc:	c5 f8 77             	vzeroupper 
>   ff:	c3                   	ret    

Note that according to Agner Fog's instruction tables, vzeroall decodes to
20 uops on Haswell, for example, while vzeroupper requires only 4.  On
Skylake it is even worse (34 uops).  For short sizes (our benchmark had
16-31 byte strcmp calls) this might make a bigger difference than using the
SSE2 variant based on an early xtest result.  That said, why not have, for
HTM + AVX2 CPUs, an intermediate dispatcher that uses xtest to choose
between the AVX2 and the SSE variant?  That leaves the actual
implementations unchanged and thus with known performance characteristics.
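
To make the dispatcher idea concrete, here is a minimal sketch in C.  All
names are hypothetical, the two "variants" are plain stand-ins that call
strcmp, and the transaction state is passed in as a flag so the sketch runs
on any machine.  A real dispatcher would presumably query the state with
xtest (for instance the _xtest () intrinsic from <immintrin.h>) and would
live in assembly.

```c
#include <string.h>

typedef int (*strcmp_fn) (const char *, const char *);

/* Hypothetical stand-ins for the tuned AVX2 and SSE2 implementations.
   Both simply forward to strcmp so that the sketch is runnable.  */
static int
strcmp_avx2_stub (const char *a, const char *b)
{
  return strcmp (a, b);
}

static int
strcmp_sse2_stub (const char *a, const char *b)
{
  return strcmp (a, b);
}

/* Pick the SSE2 variant inside a transaction (no vzeroupper needed),
   and the AVX2 variant everywhere else.  */
static strcmp_fn
pick_strcmp (int in_transaction)
{
  return in_transaction ? strcmp_sse2_stub : strcmp_avx2_stub;
}

/* Hypothetical dispatcher entry point.  in_transaction is an assumption
   standing in for an xtest query.  */
static int
strcmp_dispatch (const char *a, const char *b, int in_transaction)
{
  return pick_strcmp (in_transaction) (a, b);
}
```

The existing implementations stay untouched in this scheme; only the
selection step is new, at the cost of one extra test and branch per call.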


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2021-03-01 13:44 ` rguenth at gcc dot gnu.org
@ 2021-03-01 14:05 ` hjl.tools at gmail dot com
  2021-03-01 14:14 ` rguenther at suse dot de
                   ` (19 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 14:05 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #15 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Richard Biener from comment #14)
> 
> Note according to Agner vzeroall, for example on Haswell, decodes to
> 20 uops while vzeroupper only requires 4.  On Skylake it's even worse
> (34 uops).  For short sizes (as in our benchmark which had 16-31 byte
> strcmp) this might be a bigger difference than using the SSE2 variant
> off an early xtest result.  That said, why not, for HTM + AVX2 CPUs,
> have an intermediate dispatcher between the AVX2 and the SSE variant
> using xtest?  That leaves the actual implementations unchanged and thus
> with known performance characteristic?

It is implemented on the users/hjl/pr27457/wrapper branch:

https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper

There are 2 problems:

1. Many RTM tests failed for other reasons.
2. Even with the vzeroall overhead, the AVX version may still be faster
than the SSE version.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2021-03-01 14:05 ` hjl.tools at gmail dot com
@ 2021-03-01 14:14 ` rguenther at suse dot de
  2021-03-01 14:25 ` rguenth at gcc dot gnu.org
                   ` (18 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenther at suse dot de @ 2021-03-01 14:14 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #16 from rguenther at suse dot de ---
On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:

> https://sourceware.org/bugzilla/show_bug.cgi?id=27457
> 
> --- Comment #15 from H.J. Lu <hjl.tools at gmail dot com> ---
> (In reply to Richard Biener from comment #14)
> > 
> > Note according to Agner vzeroall, for example on Haswell, decodes to
> > 20 uops while vzeroupper only requires 4.  On Skylake it's even worse
> > (34 uops).  For short sizes (as in our benchmark which had 16-31 byte
> > strcmp) this might be a bigger difference than using the SSE2 variant
> > off an early xtest result.  That said, why not, for HTM + AVX2 CPUs,
> > have an intermediate dispatcher between the AVX2 and the SSE variant
> > using xtest?  That leaves the actual implementations unchanged and thus
> > with known performance characteristic?
> 
> It is implemented on users/hjl/pr27457/wrapper branch:
> 
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> 
> There are 2 problems:
> 
> 1. Many RTM tests failed for other reasons.
> 2. Even with vzeroall overhead, AVX version may still be faster than
> SSE version.

And the SSE version may still be faster than the AVX version with
vzeroall.

I guess we should mostly care about optimizing for "modern" CPUs
which likely means HTM + AVX512 which should be already optimal
on your branches by using %ymm16+.  So we're talking about
the "legacy" AVX2 + HTM path.

And there I think we should optimize the path that is _not_ in
a transaction since that will be 99% of the cases.  Which to
me means using the proven tuned (on their respective ISA subsets)
SSE2 and AVX2 variants and simply switch between them based on
xtest.  Yeah, so strcmp of a large string inside a transaction
might not run at optimal AVX2 speed.  But it will be faster
than before the xtest dispatch since before that it would have
aborted the transaction.
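
As a back-of-the-envelope illustration of why the non-transactional path
dominates, the following sketch weights the epilogue uop counts quoted
above by how often a call runs inside a transaction.  The function name is
hypothetical, the 1% transactional share is just the rough figure from the
paragraph above, and the cost of xtest and the branch is ignored.

```c
/* Expected number of epilogue uops per call, given the fraction of
   calls made inside a transaction (which take vzeroall) and the uop
   counts of the two instructions.  Hypothetical helper, not glibc
   code; xtest and branch costs are deliberately left out.  */
static double
expected_epilogue_uops (double p_txn, double uops_vzeroall,
			double uops_vzeroupper)
{
  return p_txn * uops_vzeroall + (1.0 - p_txn) * uops_vzeroupper;
}
```

With the quoted numbers, expected_epilogue_uops (0.01, 20, 4) comes to about
4.16 for Haswell and expected_epilogue_uops (0.01, 34, 4) to about 4.30 for
Skylake, so under these assumptions the average stays close to the plain
vzeroupper cost.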


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2021-03-01 14:14 ` rguenther at suse dot de
@ 2021-03-01 14:25 ` rguenth at gcc dot gnu.org
  2021-03-01 14:37 ` hjl.tools at gmail dot com
                   ` (17 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-01 14:25 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #11)
> (In reply to Florian Weimer from comment #9)
> > (In reply to rguenther from comment #7)
> > > On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > > > The sources do not seem to show the selection logic.
> > > > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > > > instruction, which can't be good for general application performance.
> > > 
> > > The intent was to use %ymm16+ only which does not cause any transition
> > > penalty even w/o vzeroupper.
> > 
> > I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> > AVX2, there isn't much choice. I didn't try to reverse-engineer the
> > corresponding IFUNC selector.
> 
> At function exit, there is
> 
>   f3:	0f 01 d6             	xtest  
>   f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
>   f8:	c5 fc 77             	vzeroall 
>   fb:	c3                   	ret    
>   fc:	c5 f8 77             	vzeroupper 
>   ff:	c3                   	ret    

Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction,
will the speculative execution of vzeroupper abort the transaction, or
does it abort the transaction only when the instruction retires?


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2021-03-01 14:25 ` rguenth at gcc dot gnu.org
@ 2021-03-01 14:37 ` hjl.tools at gmail dot com
  2021-03-01 14:47 ` hjl.tools at gmail dot com
                   ` (16 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 14:37 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #18 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to rguenther from comment #16)
> On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> 
> > https://sourceware.org/bugzilla/show_bug.cgi?id=27457
> > 
> > --- Comment #15 from H.J. Lu <hjl.tools at gmail dot com> ---
> > (In reply to Richard Biener from comment #14)
> > > 
> > > Note according to Agner vzeroall, for example on Haswell, decodes to
> > > 20 uops while vzeroupper only requires 4.  On Skylake it's even worse
> > > (34 uops).  For short sizes (as in our benchmark which had 16-31 byte
> > > strcmp) this might be a bigger difference than using the SSE2 variant
> > > off an early xtest result.  That said, why not, for HTM + AVX2 CPUs,
> > > have an intermediate dispatcher between the AVX2 and the SSE variant
> > > using xtest?  That leaves the actual implementations unchanged and thus
> > > with known performance characteristic?
> > 
> > It is implemented on users/hjl/pr27457/wrapper branch:
> > 
> > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> > 
> > There are 2 problems:
> > 
> > 1. Many RTM tests failed for other reasons.
> > 2. Even with vzeroall overhead, AVX version may still be faster than
> > SSE version.
> 
> And the SSE version may still be faster than the AVX version with
> vzeroall.

Here is some data:

Function: strcmp
Variant: default
                                       __strcmp_avx2    __strcmp_sse2_unaligned
     length=14, align1=14, align2=14:        11.36             17.50    
     length=14, align1=14, align2=14:        11.36             15.59    
     length=14, align1=14, align2=14:        11.43             15.55    
     length=15, align1=15, align2=15:        11.36             17.42    
     length=15, align1=15, align2=15:        11.96             17.41    
     length=15, align1=15, align2=15:        11.36             16.97    
     length=16, align1=16, align2=16:        11.36             18.58    
     length=16, align1=16, align2=16:        11.36             17.41    
     length=16, align1=16, align2=16:        11.43             17.34    
     length=17, align1=17, align2=17:        11.36             21.37    
     length=17, align1=17, align2=17:        11.36             18.52    
     length=17, align1=17, align2=17:        11.36             17.94    
     length=18, align1=18, align2=18:        11.36             19.73    
     length=18, align1=18, align2=18:        11.36             19.20    
     length=18, align1=18, align2=18:        11.36             19.13    
     length=19, align1=19, align2=19:        11.36             20.38    
     length=19, align1=19, align2=19:        11.36             19.39    
     length=19, align1=19, align2=19:        11.36             20.39    
     length=20, align1=20, align2=20:        11.36             21.53    
     length=20, align1=20, align2=20:        11.36             20.98    
     length=20, align1=20, align2=20:        11.36             20.93    
     length=21, align1=21, align2=21:        11.36             22.83    
     length=21, align1=21, align2=21:        11.36             22.26    
     length=21, align1=21, align2=21:        11.36             22.25    
     length=22, align1=22, align2=22:        11.43             23.37    
     length=22, align1=22, align2=22:        11.36             22.78    
     length=22, align1=22, align2=22:        12.29             22.12    
     length=23, align1=23, align2=23:        11.36             24.63    
     length=23, align1=23, align2=23:        12.53             23.97    
     length=23, align1=23, align2=23:        11.36             23.97    
     length=24, align1=24, align2=24:        11.36             24.52    
     length=24, align1=24, align2=24:        11.36             43.47    
     length=24, align1=24, align2=24:        11.36             44.47    
     length=25, align1=25, align2=25:        11.36             39.50    
     length=25, align1=25, align2=25:        11.36             48.97    
     length=25, align1=25, align2=25:        11.36             48.53    
     length=26, align1=26, align2=26:        11.36             47.87    
     length=26, align1=26, align2=26:        11.36             47.20    
     length=26, align1=26, align2=26:        11.36             47.15    
     length=27, align1=27, align2=27:        11.36             50.90    
     length=27, align1=27, align2=27:        11.44             49.98    
     length=27, align1=27, align2=27:        11.36             49.77    
     length=28, align1=28, align2=28:        11.36             49.74    
     length=28, align1=28, align2=28:        11.36             48.86    
     length=28, align1=28, align2=28:        11.36             49.08    
     length=29, align1=29, align2=29:        11.36             52.74    
     length=29, align1=29, align2=29:        11.36             54.04    
     length=29, align1=29, align2=29:        11.36             29.49    
     length=30, align1=30, align2=30:        11.36             50.91    
     length=30, align1=30, align2=30:        11.36             51.09    
     length=30, align1=30, align2=30:        11.36             51.13    
     length=31, align1=31, align2=31:        12.36             54.33    
     length=31, align1=31, align2=31:        11.36             53.49    
     length=31, align1=31, align2=31:        11.36             53.29    
       length=16, align1=0, align2=0:        11.36             18.02    
       length=16, align1=0, align2=0:        11.36             18.58    
       length=16, align1=0, align2=0:        11.36             17.34    
       length=16, align1=0, align2=0:        11.44             19.88    
       length=16, align1=0, align2=0:        11.36             16.74    
       length=16, align1=0, align2=0:        11.36             17.42    
       length=16, align1=0, align2=3:        11.36             17.34    
       length=16, align1=3, align2=4:        11.36             17.34    
       length=32, align1=0, align2=0:        12.29             61.07    
       length=32, align1=0, align2=0:        12.63             61.08    
       length=32, align1=0, align2=0:        11.36             60.48    
       length=32, align1=0, align2=0:        11.36             60.48    
       length=32, align1=0, align2=0:        11.36             60.40    
       length=32, align1=0, align2=0:        11.36             60.40    
       length=32, align1=0, align2=4:        11.36             60.40    
       length=32, align1=4, align2=5:        12.10             59.72    
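
To read the table as ratios, a small helper like the one below can be used.
It is only a sketch: the struct and function names are hypothetical, and the
rows are the first sample for a few lengths copied from the table above.
The table itself does not say whether vzeroall was executed in the AVX2
runs.

```c
/* One row of the table above: timings for the two strcmp variants.  */
struct bench_row
{
  int length;
  double avx2;
  double sse2_unaligned;
};

/* First sample for lengths 14, 16, 24 and 32, copied from the table.  */
static const struct bench_row rows[] = {
  { 14, 11.36, 17.50 },
  { 16, 11.36, 18.58 },
  { 24, 11.36, 24.52 },
  { 32, 12.29, 61.07 },
};

/* How many times slower the SSE2 variant is than the AVX2 variant.  */
static double
sse2_over_avx2 (const struct bench_row *row)
{
  return row->sse2_unaligned / row->avx2;
}
```

For these rows the ratio grows from roughly 1.5 at length 14 to just under
5 at length 32.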

> I guess we should mostly care about optimizing for "modern" CPUs
> which likely means HTM + AVX512 which should be already optimal
> on your branches by using %ymm16+.  So we're talking about
> the "legacy" AVX2 + HTM path.
> 
> And there I think we should optimize the path that is _not_ in
> a transaction since that will be 99% of the cases.  Which to
> me means using the proven tuned (on their respective ISA subsets)
> SSE2 and AVX2 variants and simply switch between them based on
> xtest.  Yeah, so strcmp of a large string inside a transaction

I tried it and I got RTM abort for other reasons.

> might not run at optimal AVX2 speed.  But it will be faster
> than before the xtest dispatch since before that it would have
> aborted the transaction.

Please give my current approach a try.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (23 preceding siblings ...)
  2021-03-01 14:37 ` hjl.tools at gmail dot com
@ 2021-03-01 14:47 ` hjl.tools at gmail dot com
  2021-03-01 14:49 ` rguenth at gcc dot gnu.org
                   ` (15 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 14:47 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #19 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Richard Biener from comment #17)
> (In reply to H.J. Lu from comment #11)
> > (In reply to Florian Weimer from comment #9)
> > > (In reply to rguenther from comment #7)
> > > > On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > > > > The sources do not seem to show the selection logic.
> > > > > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > > > > instruction, which can't be good for general application performance.
> > > > 
> > > > The intent was to use %ymm16+ only which does not cause any transition
> > > > penalty even w/o vzeroupper.
> > > 
> > > I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> > > AVX2, there isn't much choice. I didn't try to reverse-engineer the
> > > corresponding IFUNC selector.
> > 
> > At function exit, there is
> > 
> >   f3:	0f 01 d6             	xtest  
> >   f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
> >   f8:	c5 fc 77             	vzeroall 
> >   fb:	c3                   	ret    
> >   fc:	c5 f8 77             	vzeroupper 
> >   ff:	c3                   	ret    
> 
> Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction
> will the speculative execution of vzeroupper abort the transaction or
> does it only abort the transaction when retired?

My branch includes an RTM test:

  for (i = 0; i < loop; i++)
    {
      if (_xbegin() == _XBEGIN_STARTED)
        {
          failed |= function ();
          _xend();
        }
      else
        {
          failed |= function ();
          ++naborts;
        }
    }

It passes with xtest + je.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (24 preceding siblings ...)
  2021-03-01 14:47 ` hjl.tools at gmail dot com
@ 2021-03-01 14:49 ` rguenth at gcc dot gnu.org
  2021-03-01 14:53 ` rguenth at gcc dot gnu.org
                   ` (14 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-01 14:49 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #20 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #18)
> (In reply to rguenther from comment #16)
> > On Mon, 1 Mar 2021, hjl.tools at gmail dot com wrote:
> > 
> > > https://sourceware.org/bugzilla/show_bug.cgi?id=27457
> > > 
> > > --- Comment #15 from H.J. Lu <hjl.tools at gmail dot com> ---
> > > (In reply to Richard Biener from comment #14)
> > > > 
> > > > Note according to Agner vzeroall, for example on Haswell, decodes to
> > > > 20 uops while vzeroupper only requires 4.  On Skylake it's even worse
> > > > (34 uops).  For short sizes (as in our benchmark which had 16-31 byte
> > > > strcmp) this might be a bigger difference than using the SSE2 variant
> > > > off an early xtest result.  That said, why not, for HTM + AVX2 CPUs,
> > > > have an intermediate dispatcher between the AVX2 and the SSE variant
> > > > using xtest?  That leaves the actual implementations unchanged and thus
> > > > with known performance characteristic?
> > > 
> > > It is implemented on users/hjl/pr27457/wrapper branch:
> > > 
> > > https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/pr27457/wrapper
> > > 
> > > There are 2 problems:
> > > 
> > > 1. Many RTM tests failed for other reasons.
> > > 2. Even with vzeroall overhead, AVX version may still be faster than
> > > SSE version.
> > 
> > And the SSE version may still be faster than the AVX version with
> > vzeroall.
> 
> Here is some data:
> 
> [strcmp benchmark table snipped; identical to the data in comment #18]

That's with or without the vzeroall actually executing?

> > I guess we should mostly care about optimizing for "modern" CPUs
> > which likely means HTM + AVX512 which should be already optimal
> > on your branches by using %ymm16+.  So we're talking about
> > the "legacy" AVX2 + HTM path.
> > 
> > And there I think we should optimize the path that is _not_ in
> > a transaction since that will be 99% of the cases.  Which to
> > me means using the proven tuned (on their respective ISA subsets)
> > SSE2 and AVX2 variants and simply switch between them based on
> > xtest.  Yeah, so strcmp of a large string inside a transaction
> 
> I tried it and I got RTM abort for other reasons.
> 
> > might not run at optimal AVX2 speed.  But it will be faster
> > than before the xtest dispatch since before that it would have
> > aborted the transaction.
> 
> Please give my current approach a try.

Well, I know that even unconditionally doing vzeroall will fix our observed
regression, since the time is dominated by all the other code inside the
transaction, which is then retried a few times (and always re-fails with
vzeroupper); the strcmp part is just ~1%.

I also only have AVX512 HW with HTM, so I can't easily test the AVX2 + HTM
path.  That said, I'm fine with the xtest/vzero{all,upper} epilogue.

But I have also received reports of libmicro regressions for strcpy with
length 10 (32-byte aligned), hot cache, when using AVX2 vs. SSE2 (SSE2 being
faster by 20%).  [Note: strcpy, not strcmp, here.]  Yeah, stupid benchmark ...
but it likely shows that for small lengths every detail matters if you
want to shave off the last ns.


* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (25 preceding siblings ...)
  2021-03-01 14:49 ` rguenth at gcc dot gnu.org
@ 2021-03-01 14:53 ` rguenth at gcc dot gnu.org
  2021-03-01 15:19 ` hjl.tools at gmail dot com
                   ` (13 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-01 14:53 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #19)
> (In reply to Richard Biener from comment #17)
> > (In reply to H.J. Lu from comment #11)
> > > (In reply to Florian Weimer from comment #9)
> > > > (In reply to rguenther from comment #7)
> > > > > On Mon, 1 Mar 2021, fweimer at redhat dot com wrote:
> > > > > > The sources do not seem to show the selection logic.
> > > > > > __memset_avx2_unaligned_rtm seems to simply have dropped the VZEROUPPER
> > > > > > instruction, which can't be good for general application performance.
> > > > > 
> > > > > The intent was to use %ymm16+ only which does not cause any transition
> > > > > penalty even w/o vzeroupper.
> > > > 
> > > > I still saw %ymm0 usage in the disassembly, if I recall correctly. And for
> > > > AVX2, there isn't much choice. I didn't try to reverse-engineer the
> > > > corresponding IFUNC selector.
> > > 
> > > At function exit, there is
> > > 
> > >   f3:	0f 01 d6             	xtest  
> > >   f6:	74 04                	je     fc <__memset_avx2_unaligned_erms_rtm+0x5c>
> > >   f8:	c5 fc 77             	vzeroall 
> > >   fb:	c3                   	ret    
> > >   fc:	c5 f8 77             	vzeroupper 
> > >   ff:	c3                   	ret    
> > 
> > Btw, if the 'je' mispredicts to the vzeroupper case inside a transaction
> > will the speculative execution of vzeroupper abort the transaction or
> > does it only abort the transaction when retired?
> 
> My branch includes RTM test:
> 
>   for (i = 0; i < loop; i++)
>     {
>       if (_xbegin() == _XBEGIN_STARTED)
> 	{
> 	  failed |= function ();
> 	  _xend();
> 	}
>       else
> 	{
> 	  failed |= function ();
> 	  ++naborts;
> 	}
>     }
> 
> It passes with xtest + je.

That's good to hear.  Does it still work when you add an unconditional
non-transactional

      failed |= function ();

before the loop?  Just trying to make sure we do actually mispredict the
xtest + je.
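
For readers following the quoted exit sequence, here is a small C model of
how it reads.  It is only an illustration, not glibc code: the helper names
(rel8_target, epilogue_taken) and the enum are made up for this sketch.  It
assumes the documented XTEST behaviour (ZF is cleared inside a transactional
region and set otherwise) and the usual short-jump rule (target = address of
the next instruction plus the signed 8-bit displacement).

```c
#include <stdint.h>

/* Target of a 2-byte short jump (opcode + rel8): the address of the
   following instruction plus the sign-extended 8-bit displacement.  */
static uint32_t rel8_target(uint32_t insn_addr, uint8_t disp)
{
  return insn_addr + 2 + (int8_t) disp;
}

/* Hypothetical labels for the two exit paths in the quoted listing.  */
enum epilogue { EPILOGUE_VZEROALL, EPILOGUE_VZEROUPPER };

/* Model of "xtest; je": XTEST clears ZF inside a transaction and sets it
   otherwise, and JE is taken when ZF is set.  So the jump to the
   vzeroupper path is taken only outside a transaction; inside one the
   code falls through to vzeroall.  */
static enum epilogue epilogue_taken(int in_transaction)
{
  int zf = !in_transaction;
  return zf ? EPILOGUE_VZEROUPPER : EPILOGUE_VZEROALL;
}
```

With the bytes from the listing, rel8_target(0xf6, 0x04) gives 0xfc, the
address of the vzeroupper path.  The model says nothing about what a
mispredicted, never-retired vzeroupper does to a transaction; that is the
open question asked above.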

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (26 preceding siblings ...)
  2021-03-01 14:53 ` rguenth at gcc dot gnu.org
@ 2021-03-01 15:19 ` hjl.tools at gmail dot com
  2021-03-01 23:39 ` hjl.tools at gmail dot com
                   ` (12 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 15:19 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #22 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Richard Biener from comment #21)

> 
> That's good to hear.  Does it still work when you add an unconditional
> non-transactional
> 
>       failed |= function ();
> 
> before the loop?  Just trying to make sure we do actually mispredict the
> xtest + je.

I will give it a try.

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (27 preceding siblings ...)
  2021-03-01 15:19 ` hjl.tools at gmail dot com
@ 2021-03-01 23:39 ` hjl.tools at gmail dot com
  2021-03-05 16:54 ` hjl.tools at gmail dot com
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-01 23:39 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #23 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to H.J. Lu from comment #22)
> (In reply to Richard Biener from comment #21)
> 
> > 
> > That's good to hear.  Does it still work when you add an unconditional
> > non-transactional
> > 
> >       failed |= function ();
> > 
> > before the loop?  Just trying to make sure we do actually mispredict the
> > xtest + je.
> 
> I will give it a try.

I did:

  for (i = 0; i < loop; i++)
    {
      failed |= function ();
      if (_xbegin() == _XBEGIN_STARTED)
        {
          failed |= function ();
          _xend();
        }
      else
        {
          failed |= function ();
          ++naborts;
        }
    }

There is no issue.
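
A self-contained variant of the loop above, as a sketch only.  Everything
outside the loop is an assumption added to make it compile on its own:
check_strcmp stands in for the real function under test, cpu_has_rtm and
try_in_transaction are hypothetical helpers, and the CPUID guard (leaf 7,
EBX bit 11) is there so the program does not execute XBEGIN on hardware
without RTM.  It assumes GCC or Clang; on non-x86 targets it falls back to
the non-transactional path.

```c
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <immintrin.h>

/* CPUID.(EAX=7,ECX=0):EBX bit 11 reports RTM support.  */
static int cpu_has_rtm(void)
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return (ebx >> 11) & 1;
}

/* Run FN once inside an RTM transaction.  Returns 1 if the transaction
   committed (result OR-ed into *FAILED), 0 if it aborted.  */
__attribute__((target("rtm")))
static int try_in_transaction(int (*fn)(void), int *failed)
{
  if (_xbegin() == _XBEGIN_STARTED)
    {
      *failed |= fn();
      _xend();
      return 1;
    }
  return 0;
}
#else
static int cpu_has_rtm(void) { return 0; }
static int try_in_transaction(int (*fn)(void), int *failed)
{
  (void) fn;
  (void) failed;
  return 0;
}
#endif

/* Hypothetical function under test: 0 on success, nonzero on failure.
   The volatile pointers keep the compiler from folding the strcmp call.  */
static char lhs[] = "vzeroupper";
static char rhs[] = "vzeroupper";
static const char *volatile lhs_ptr = lhs;
static const char *volatile rhs_ptr = rhs;

static int check_strcmp(void)
{
  return strcmp(lhs_ptr, rhs_ptr) != 0;
}

/* Each iteration first calls FN outside any transaction, then tries it
   inside one, and repeats it non-transactionally if that aborts.  */
static int run_rtm_test(int (*fn)(void), int loop, int *naborts)
{
  int failed = 0;
  int use_rtm = cpu_has_rtm();

  *naborts = 0;
  for (int i = 0; i < loop; i++)
    {
      failed |= fn();
      if (!(use_rtm && try_in_transaction(fn, &failed)))
        {
          failed |= fn();
          if (use_rtm)
            ++*naborts;
        }
    }
  return failed;
}
```

Calling run_rtm_test(check_strcmp, loop, &naborts) returns the accumulated
failure flag and leaves the abort count in naborts.  A low abort count on
RTM hardware would be consistent with the result reported here, but the
sketch by itself does not show that the xtest + je branch was mispredicted.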

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (28 preceding siblings ...)
  2021-03-01 23:39 ` hjl.tools at gmail dot com
@ 2021-03-05 16:54 ` hjl.tools at gmail dot com
  2021-03-11 10:42 ` rguenth at gcc dot gnu.org
                   ` (10 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-05 16:54 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |https://sourceware.org/pipermail/libc-alpha/2021-March/123302.html

--- Comment #24 from H.J. Lu <hjl.tools at gmail dot com> ---
A patch set is posted at

https://sourceware.org/pipermail/libc-alpha/2021-March/123302.html

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (29 preceding siblings ...)
  2021-03-05 16:54 ` hjl.tools at gmail dot com
@ 2021-03-11 10:42 ` rguenth at gcc dot gnu.org
  2021-03-16 13:53 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-11 10:42 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #25 from Richard Biener <rguenth at gcc dot gnu.org> ---
We have successfully tested the backport to the 2.31 branch where it mitigates
the regression seen.

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (30 preceding siblings ...)
  2021-03-11 10:42 ` rguenth at gcc dot gnu.org
@ 2021-03-16 13:53 ` rguenth at gcc dot gnu.org
  2021-03-16 14:12 ` hjl.tools at gmail dot com
                   ` (8 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-16 13:53 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
Any progress?  I see the patchset did not get any feedback in the last 10 days.

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (31 preceding siblings ...)
  2021-03-16 13:53 ` rguenth at gcc dot gnu.org
@ 2021-03-16 14:12 ` hjl.tools at gmail dot com
  2021-03-29 23:00 ` hjl.tools at gmail dot com
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-16 14:12 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #27 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Richard Biener from comment #26)
> Any progress?  I see the patchset did not get any feedback in the last 10
> days.

We discussed it at the glibc patch meeting on Monday.  I posted the v2 patch:

https://sourceware.org/pipermail/libc-alpha/2021-March/123867.html

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (32 preceding siblings ...)
  2021-03-16 14:12 ` hjl.tools at gmail dot com
@ 2021-03-29 23:00 ` hjl.tools at gmail dot com
  2022-01-27 20:21 ` cvs-commit at gcc dot gnu.org
                   ` (6 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2021-03-29 23:00 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #28 from H.J. Lu <hjl.tools at gmail dot com> ---
Fixed for 2.34 on master so far by 10 commits.  The last commit is

commit e4fda4631017e49d4ee5a2755db34289b6860fa4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sun Mar 7 09:45:23 2021 -0800

    x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions

    Update ifunc-memmove.h to select the function optimized with AVX512
    instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable
    AVX512VL since VZEROUPPER isn't needed at function exit.
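
As a rough illustration of the selection idea described in this commit
message and earlier in the thread, the sketch below picks a variant from a
few feature bits.  It is a deliberately simplified model and not the logic
in ifunc-memmove.h: the struct, the enum names and the ordering of the
checks are assumptions made for this example only.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the CPU features that matter here.  */
struct cpu_features
{
  bool avx512vl;  /* usable AVX512VL: ymm16-ymm31/zmm16-zmm31 available */
  bool avx2;
  bool rtm;
};

/* Hypothetical variant labels, not glibc symbol names.  */
enum variant
{
  VARIANT_SSE2,      /* no upper vector state touched */
  VARIANT_AVX2,      /* plain vzeroupper at exit */
  VARIANT_AVX2_RTM,  /* xtest + vzeroall/vzeroupper at exit */
  VARIANT_EVEX       /* registers 16-31 only, no vzeroupper needed */
};

static enum variant select_variant(const struct cpu_features *f)
{
  /* Code that only uses registers 16-31 needs no vzeroupper at exit,
     so it cannot trigger the RTM abort discussed in this bug.  */
  if (f->avx512vl)
    return VARIANT_EVEX;
  /* With AVX2 only, an RTM-capable CPU gets the xtest-guarded exit.  */
  if (f->avx2)
    return f->rtm ? VARIANT_AVX2_RTM : VARIANT_AVX2;
  return VARIANT_SSE2;
}
```

The real selector takes more tunables and feature bits into account; the
point here is only the order of preference the thread argues for.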

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (33 preceding siblings ...)
  2021-03-29 23:00 ` hjl.tools at gmail dot com
@ 2022-01-27 20:21 ` cvs-commit at gcc dot gnu.org
  2022-01-27 20:23 ` cvs-commit at gcc dot gnu.org
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-27 20:21 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #29 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
The release/2.33/master branch has been updated by H.J. Lu
<hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=902af2f5eee71c3e48fe30d43fd7c61d563e975b

commit 902af2f5eee71c3e48fe30d43fd7c61d563e975b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:20:21 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (34 preceding siblings ...)
  2022-01-27 20:21 ` cvs-commit at gcc dot gnu.org
@ 2022-01-27 20:23 ` cvs-commit at gcc dot gnu.org
  2022-01-27 20:47 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-27 20:23 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #30 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
The release/2.32/master branch has been updated by H.J. Lu
<hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=05751d1c5c85b7f55086e81567cd55b025b25625

commit 05751d1c5c85b7f55086e81567cd55b025b25625
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:22:42 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (35 preceding siblings ...)
  2022-01-27 20:23 ` cvs-commit at gcc dot gnu.org
@ 2022-01-27 20:47 ` cvs-commit at gcc dot gnu.org
  2022-01-27 20:47 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-27 20:47 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #31 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
The release/2.31/master branch has been updated by H.J. Lu
<hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c0cbb9345ea2d81d017e7725e3ac5250ef870513

commit c0cbb9345ea2d81d017e7725e3ac5250ef870513
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:23:42 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (36 preceding siblings ...)
  2022-01-27 20:47 ` cvs-commit at gcc dot gnu.org
@ 2022-01-27 20:47 ` cvs-commit at gcc dot gnu.org
  2022-01-27 20:48 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  40 siblings, 0 replies; 42+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-27 20:47 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #32 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
The release/2.30/master branch has been updated by H.J. Lu
<hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=de28bb3c612daedb08fa975325c1a293fbca07a9

commit de28bb3c612daedb08fa975325c1a293fbca07a9
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:24:44 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (37 preceding siblings ...)
  2022-01-27 20:47 ` cvs-commit at gcc dot gnu.org
@ 2022-01-27 20:48 ` cvs-commit at gcc dot gnu.org
  2022-01-27 22:41 ` cvs-commit at gcc dot gnu.org
  2022-01-28  2:24 ` hjl.tools at gmail dot com
  40 siblings, 0 replies; 42+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-27 20:48 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #33 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
The release/2.29/master branch has been updated by H.J. Lu
<hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=c3535cb6cdd4cbbce22018df09cc69633781d808

commit c3535cb6cdd4cbbce22018df09cc69633781d808
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:25:41 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (38 preceding siblings ...)
  2022-01-27 20:48 ` cvs-commit at gcc dot gnu.org
@ 2022-01-27 22:41 ` cvs-commit at gcc dot gnu.org
  2022-01-28  2:24 ` hjl.tools at gmail dot com
  40 siblings, 0 replies; 42+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-01-27 22:41 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

--- Comment #34 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
The release/2.28/master branch has been updated by H.J. Lu
<hjl@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2baf5616d5ec5e592d64746253713969eb473f5b

commit 2baf5616d5ec5e592d64746253713969eb473f5b
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Jan 27 12:49:55 2022 -0800

    NEWS: Add a bug fix entry for BZ #27457

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Bug string/27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
  2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
                   ` (39 preceding siblings ...)
  2022-01-27 22:41 ` cvs-commit at gcc dot gnu.org
@ 2022-01-28  2:24 ` hjl.tools at gmail dot com
  40 siblings, 0 replies; 42+ messages in thread
From: hjl.tools at gmail dot com @ 2022-01-28  2:24 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=27457

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #35 from H.J. Lu <hjl.tools at gmail dot com> ---
Fixed for 2.34 and all release branches.

end of thread, other threads:[~2022-01-28  2:24 UTC | newest]

Thread overview: 42+ messages
2021-02-22 12:40 [Bug string/27457] New: vzeroupper use in AVX2 multiarch string functions cause HTM aborts rguenth at gcc dot gnu.org
2021-02-22 12:40 ` [Bug string/27457] " rguenth at gcc dot gnu.org
2021-02-22 14:50 ` matz at suse dot de
2021-02-22 15:00 ` rguenth at gcc dot gnu.org
2021-02-22 15:26 ` hjl.tools at gmail dot com
2021-02-22 15:26 ` hjl.tools at gmail dot com
2021-02-22 18:45 ` fweimer at redhat dot com
2021-02-23  9:44 ` roman.dementiev at intel dot com
2021-02-27  2:39 ` hjl.tools at gmail dot com
2021-02-27  7:34 ` rguenther at suse dot de
2021-02-28 14:53 ` hjl.tools at gmail dot com
2021-03-01 11:32 ` fweimer at redhat dot com
2021-03-01 12:24 ` mliska at suse dot cz
2021-03-01 12:47 ` rguenther at suse dot de
2021-03-01 13:13 ` roman.dementiev at intel dot com
2021-03-01 13:19 ` fweimer at redhat dot com
2021-03-01 13:21 ` hjl.tools at gmail dot com
2021-03-01 13:24 ` hjl.tools at gmail dot com
2021-03-01 13:27 ` hjl.tools at gmail dot com
2021-03-01 13:29 ` hjl.tools at gmail dot com
2021-03-01 13:44 ` rguenth at gcc dot gnu.org
2021-03-01 14:05 ` hjl.tools at gmail dot com
2021-03-01 14:14 ` rguenther at suse dot de
2021-03-01 14:25 ` rguenth at gcc dot gnu.org
2021-03-01 14:37 ` hjl.tools at gmail dot com
2021-03-01 14:47 ` hjl.tools at gmail dot com
2021-03-01 14:49 ` rguenth at gcc dot gnu.org
2021-03-01 14:53 ` rguenth at gcc dot gnu.org
2021-03-01 15:19 ` hjl.tools at gmail dot com
2021-03-01 23:39 ` hjl.tools at gmail dot com
2021-03-05 16:54 ` hjl.tools at gmail dot com
2021-03-11 10:42 ` rguenth at gcc dot gnu.org
2021-03-16 13:53 ` rguenth at gcc dot gnu.org
2021-03-16 14:12 ` hjl.tools at gmail dot com
2021-03-29 23:00 ` hjl.tools at gmail dot com
2022-01-27 20:21 ` cvs-commit at gcc dot gnu.org
2022-01-27 20:23 ` cvs-commit at gcc dot gnu.org
2022-01-27 20:47 ` cvs-commit at gcc dot gnu.org
2022-01-27 20:47 ` cvs-commit at gcc dot gnu.org
2022-01-27 20:48 ` cvs-commit at gcc dot gnu.org
2022-01-27 22:41 ` cvs-commit at gcc dot gnu.org
2022-01-28  2:24 ` hjl.tools at gmail dot com
