public inbox for glibc-bugs@sourceware.org
* [Bug string/26852] New: aarch64/strcmp has performance regression for some cases
@ 2020-11-09  8:10 xuchunmei at linux dot alibaba.com
  2020-11-09  8:12 ` [Bug string/26852] " xuchunmei at linux dot alibaba.com
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: xuchunmei at linux dot alibaba.com @ 2020-11-09  8:10 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

            Bug ID: 26852
           Summary: aarch64/strcmp has performance regression for some
                    cases
           Product: glibc
           Version: 2.32
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: string
          Assignee: unassigned at sourceware dot org
          Reporter: xuchunmei at linux dot alibaba.com
  Target Milestone: ---

Created attachment 12948
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12948&action=edit
bench-strcmp data of glibc2.30 and glibc2.32

unixbench dhry2reg shows a performance regression on glibc 2.32 compared with
glibc 2.30. perf record data shows that the regression comes from strcmp.
I simplified the test case to the following code:
#include <stdio.h>
#include <string.h>

int main()
{
        int count=1000000;
        int i = 0;
        char str1[32] = "DHRYSTONE PROGRAM, 1'ST STRING";
        char str2[32] = "DHRYSTONE PROGRAM, 2'ND STRING";
        int ret;

        while (i<count) {
                ret = strcmp(str1, str2);
                if (ret > 0)
                        printf("11111\n");
                i++;
        }
}

I then compared the execution time on glibc 2.30 and glibc 2.32.
On glibc 2.30, the result is:
# time ./test

real    0m0.008s
user    0m0.008s
sys     0m0.000s
while on glibc2.32, the result is:
# time ./test

real    0m0.023s
user    0m0.023s
sys     0m0.000s

Also, I modified bench-strcmp.c to test length=20 by adding the following code:
  do_test(&json_ctx, 0, 0, 20, MIDCHAR, 0);
  do_test(&json_ctx, 0, 1, 20, MIDCHAR, 0);
  do_test(&json_ctx, 1, 0, 20, MIDCHAR, 0);
  do_test(&json_ctx, 1, 1, 20, MIDCHAR, 0);

on glibc2.30, result is:
       length=20, align1=0, align2=0:         7.69             20.39 (-164.98%)
       length=20, align1=0, align2=1:        10.00             20.49 (-104.84%)
       length=20, align1=1, align2=0:        13.57             20.53 (-51.30%)
       length=20, align1=1, align2=1:         8.85             20.39 (-130.33%)
on glibc2.32, result is:
       length=20, align1=0, align2=0:         7.81             20.00 (-156.15%)
       length=20, align1=0, align2=1:        26.54             20.16 ( 24.02%)
       length=20, align1=1, align2=0:        14.24             20.00 (-40.50%)
       length=20, align1=1, align2=1:         8.47             20.12 (-137.67%)

On glibc 2.32, the "length=20, align1=0, align2=1" case shows a performance
regression (26.54 versus 10.00 on glibc 2.30).

glibc 2.32 has only one more commit than glibc 2.30 touching
sysdeps/aarch64/strcmp.S:
commit adac54ffc5ded48cba7deb18e46df984b213b0ac
Author: Alex Butler <Alex.Butler@arm.com>
Date:   Tue Jun 16 12:42:38 2020 +0000

    aarch64: MTE compatible strcmp


I also compared the bench-strcmp data between glibc 2.30 and glibc 2.32; see the
attachment. In some cases, glibc 2.32 shows a performance regression.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
@ 2020-11-09  8:12 ` xuchunmei at linux dot alibaba.com
  2020-11-09  8:13 ` xuchunmei at linux dot alibaba.com
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: xuchunmei at linux dot alibaba.com @ 2020-11-09  8:12 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

xuchunmei <xuchunmei at linux dot alibaba.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nsz at gcc dot gnu.org


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
  2020-11-09  8:12 ` [Bug string/26852] " xuchunmei at linux dot alibaba.com
@ 2020-11-09  8:13 ` xuchunmei at linux dot alibaba.com
  2020-11-09 10:30 ` nsz at gcc dot gnu.org
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: xuchunmei at linux dot alibaba.com @ 2020-11-09  8:13 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

xuchunmei <xuchunmei at linux dot alibaba.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xuchunmei at linux dot alibaba.com


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
  2020-11-09  8:12 ` [Bug string/26852] " xuchunmei at linux dot alibaba.com
  2020-11-09  8:13 ` xuchunmei at linux dot alibaba.com
@ 2020-11-09 10:30 ` nsz at gcc dot gnu.org
  2020-11-09 12:53 ` xuchunmei at linux dot alibaba.com
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: nsz at gcc dot gnu.org @ 2020-11-09 10:30 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

Szabolcs Nagy <nsz at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wdijkstr at arm dot com

--- Comment #1 from Szabolcs Nagy <nsz at gcc dot gnu.org> ---
which core are you measuring?

the code was committed based on the measurements in
https://sourceware.org/pipermail/libc-alpha/2020-June/115109.html

| length | align1 | align2 | uplift A72 | uplift A53 | uplift N1 |
...
|     19 |     19 |     19 |      1.00x |      1.05x |     1.00x |
|     19 |     19 |     19 |      1.01x |      1.06x |     1.00x |
|     19 |     19 |     19 |      1.00x |      1.06x |     1.00x |
|     20 |     20 |     20 |      1.03x |      1.07x |     1.04x |
|     20 |     20 |     20 |      1.00x |      1.07x |     1.00x |
|     20 |     20 |     20 |      0.99x |      1.06x |     1.00x |
|     21 |     21 |     21 |      1.03x |      1.07x |     1.04x |
|     21 |     21 |     21 |      1.03x |      1.07x |     1.04x |
|     21 |     21 |     21 |      1.05x |      1.07x |     1.04x |
|     22 |     22 |     22 |      1.03x |      1.08x |     1.04x |
|     22 |     22 |     22 |      1.03x |      1.07x |     1.04x |
|     22 |     22 |     22 |      1.03x |      1.07x |     1.04x |
|     23 |     23 |     23 |      1.02x |      1.07x |     1.03x |
|     23 |     23 |     23 |      1.03x |      1.07x |     1.04x |
|     23 |     23 |     23 |      1.03x |      1.07x |     1.04x |
...
|     16 |      0 |      0 |      1.01x |      1.02x |     1.01x |
|     16 |      0 |      0 |      1.00x |      0.97x |     1.01x |
|     16 |      0 |      0 |      1.00x |      1.00x |     0.76x |
|     16 |      0 |      0 |      1.00x |      1.00x |     0.97x |
|     16 |      0 |      0 |      1.00x |      1.00x |     0.97x |
|     16 |      0 |      0 |      1.00x |      1.00x |     0.97x |
|     16 |      0 |      3 |      0.86x |      1.00x |     0.88x |
|     16 |      3 |      4 |      1.00x |      0.93x |     1.10x |
|     32 |      0 |      0 |      1.07x |      1.04x |     1.08x |
|     32 |      0 |      0 |      1.08x |      1.04x |     1.08x |
|     32 |      0 |      0 |      1.04x |      1.05x |     1.05x |
|     32 |      0 |      0 |      1.04x |      0.96x |     1.05x |
|     32 |      0 |      0 |      1.04x |      0.96x |     1.05x |
|     32 |      0 |      0 |      1.04x |      0.98x |     1.05x |
|     32 |      0 |      4 |      0.91x |      1.03x |     0.93x |
|     32 |      4 |      5 |      0.94x |      1.00x |     1.00x |
...
|     16 |      1 |      2 |      0.96x |      0.96x |     1.08x |
|     16 |      2 |      1 |      0.86x |      0.95x |     0.97x |
|     16 |      1 |      2 |      0.97x |      0.96x |     1.08x |
|     16 |      2 |      1 |      0.86x |      0.95x |     0.97x |
|     16 |      1 |      2 |      0.95x |      0.95x |     1.19x |
|     16 |      2 |      1 |      0.86x |      0.95x |     1.07x |
|     32 |      2 |      4 |      0.91x |      0.98x |     1.00x |
|     32 |      4 |      2 |      0.92x |      0.93x |     0.97x |
|     32 |      2 |      4 |      0.89x |      0.96x |     1.00x |
|     32 |      4 |      2 |      0.92x |      0.92x |     0.97x |
|     32 |      2 |      4 |      0.91x |      0.96x |     1.00x |
|     32 |      4 |      2 |      0.92x |      1.00x |     0.97x |


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (2 preceding siblings ...)
  2020-11-09 10:30 ` nsz at gcc dot gnu.org
@ 2020-11-09 12:53 ` xuchunmei at linux dot alibaba.com
  2020-11-09 13:06 ` wdijkstr at arm dot com
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: xuchunmei at linux dot alibaba.com @ 2020-11-09 12:53 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #2 from xuchunmei <xuchunmei at linux dot alibaba.com> ---
(In reply to Szabolcs Nagy from comment #1)
> which core are you measuring?

glibc 2.30 is from
https://kojipkgs.fedoraproject.org//packages/glibc/2.30/5.fc31/src/glibc-2.30-5.fc31.src.rpm
with release: http://ftp.gnu.org/gnu/glibc/glibc-2.30.tar.gz

and glibc2.32 is from:
https://kojipkgs.fedoraproject.org//packages/glibc/2.32/1.fc33/src/glibc-2.32-1.fc33.src.rpm
with release: http://ftp.gnu.org/gnu/glibc/glibc-2.32.tar.gz


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (3 preceding siblings ...)
  2020-11-09 12:53 ` xuchunmei at linux dot alibaba.com
@ 2020-11-09 13:06 ` wdijkstr at arm dot com
  2020-11-09 13:08 ` wdijkstr at arm dot com
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: wdijkstr at arm dot com @ 2020-11-09 13:06 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #3 from Wilco <wdijkstr at arm dot com> ---
(In reply to xuchunmei from comment #0)
> Created attachment 12948 [details]
> bench-strcmp data of glibc2.30 and glibc2.32
> 
> unixbench dhry2reg shows a performance regression on glibc 2.32 compared with
> glibc 2.30. perf record data shows that the regression comes from strcmp.
> I simplified the test case to the following code:
> #include <stdio.h>
> #include <string.h>
> 
> int main()
> {
> 	int count=1000000;
> 	int i = 0;
> 	char str1[32] = "DHRYSTONE PROGRAM, 1'ST STRING";
> 	char str2[32] = "DHRYSTONE PROGRAM, 2'ND STRING";
> 	int ret;
> 
> 	while (i<count) {
> 		ret = strcmp(str1, str2);
> 		if (ret > 0)
> 			printf("11111\n");
> 		i++;
> 	}
> }
> 
> I then compared the execution time on glibc 2.30 and glibc 2.32.
> On glibc 2.30, the result is:
> # time ./test
> 
> real	0m0.008s
> user	0m0.008s
> sys	0m0.000s
> while on glibc2.32, the result is:
> # time ./test
> 
> real	0m0.023s
> user	0m0.023s
> sys	0m0.000s

This looks like you're getting branch mispredictions in some small loops on the
microarchitecture you're using - your spreadsheet shows larger sizes at the
same alignment are significantly faster in 2.32. Try increasing the iteration
count by a factor of 1000 and do a perf stat with the number of instructions
and mispredictions (perf record may also be useful in showing where the extra
cycles are spent).

If it is branch prediction then usually there is little that can be done -
sometimes aligning loops helps but the new strcmp code already carefully aligns
all loops.


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (4 preceding siblings ...)
  2020-11-09 13:06 ` wdijkstr at arm dot com
@ 2020-11-09 13:08 ` wdijkstr at arm dot com
  2020-11-10  1:46 ` xuchunmei at linux dot alibaba.com
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: wdijkstr at arm dot com @ 2020-11-09 13:08 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #4 from Wilco <wdijkstr at arm dot com> ---
(In reply to xuchunmei from comment #2)
> (In reply to Szabolcs Nagy from comment #1)
> > which core are you measuring?

Szabolcs meant which CPU?


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (5 preceding siblings ...)
  2020-11-09 13:08 ` wdijkstr at arm dot com
@ 2020-11-10  1:46 ` xuchunmei at linux dot alibaba.com
  2020-11-10 12:53 ` wdijkstr at arm dot com
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: xuchunmei at linux dot alibaba.com @ 2020-11-10  1:46 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #5 from xuchunmei <xuchunmei at linux dot alibaba.com> ---
(In reply to Wilco from comment #3)
> (In reply to xuchunmei from comment #0)
> > Created attachment 12948 [details]

> 
> This looks like you're getting branch mispredictions in some small loops on
> the microarchitecture you're using - your spreadsheet shows larger sizes at
> the same alignment are significantly faster in 2.32. Try increasing the
> iteration count by a factor of 1000 and do a perf stat with the number of
> instructions and mispredictions (perf record may also be useful in showing
> where the extra cycles are spent).
> 
> If it is branch prediction then usually there is little that can be done -
> sometimes aligning loops helps but the new strcmp code already carefully
> aligns all loops.

I increased the iteration count by a factor of 1000 and used perf stat to record
the data.
on glibc2.30:
# perf stat ./test

 Performance counter stats for './test':

          6,169.61 msec task-clock                #    1.000 CPUs utilized
                 3      context-switches          #    0.000 K/sec
                 1      cpu-migrations            #    0.000 K/sec
                36      page-faults               #    0.006 K/sec
    16,038,780,874      cycles                    #    2.600 GHz
    54,023,778,377      instructions              #    3.37  insn per cycle
   <not supported>      branches
           102,800      branch-misses

       6.170468980 seconds time elapsed

       6.170341000 seconds user
       0.000000000 seconds sys
on glibc2.32:
# perf stat ./test

 Performance counter stats for './test':

         21,583.06 msec task-clock                #    1.000 CPUs utilized
                 5      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                37      page-faults               #    0.002 K/sec
    56,113,183,027      cycles                    #    2.600 GHz
    55,074,860,805      instructions              #    0.98  insn per cycle
   <not supported>      branches
     2,000,188,234      branch-misses

      21.584040588 seconds time elapsed

      21.583636000 seconds user
       0.000000000 seconds sys

My test platform is Kunpeng 920. glibc 2.32 has far more branch misses.


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (6 preceding siblings ...)
  2020-11-10  1:46 ` xuchunmei at linux dot alibaba.com
@ 2020-11-10 12:53 ` wdijkstr at arm dot com
  2022-02-23  3:22 ` yangyanchao6 at huawei dot com
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: wdijkstr at arm dot com @ 2020-11-10 12:53 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #6 from Wilco <wdijkstr at arm dot com> ---
(In reply to xuchunmei from comment #5)

>     54,023,778,377      instructions              #    3.37  insn per cycle
>            102,800      branch-misses


>     55,074,860,805      instructions              #    0.98  insn per cycle
>      2,000,188,234      branch-misses

> 
> my test platform is kunpeng920. glibc2.32 has more branch-misses.

That shows 2 misses per iteration which means this is a branch predictor issue
indeed. I'd show this example to the kunpeng designers - there should be zero
mispredictions after the first few iterations.


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (7 preceding siblings ...)
  2020-11-10 12:53 ` wdijkstr at arm dot com
@ 2022-02-23  3:22 ` yangyanchao6 at huawei dot com
  2022-02-23  3:29 ` yangyanchao6 at huawei dot com
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: yangyanchao6 at huawei dot com @ 2022-02-23  3:22 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

yangyanchao6 at huawei dot com <yangyanchao6 at huawei dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yangyanchao6 at huawei dot com

--- Comment #7 from yangyanchao6 at huawei dot com <yangyanchao6 at huawei dot com> ---
Created attachment 13992
  --> https://sourceware.org/bugzilla/attachment.cgi?id=13992&action=edit
strcmp performance test program


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (8 preceding siblings ...)
  2022-02-23  3:22 ` yangyanchao6 at huawei dot com
@ 2022-02-23  3:29 ` yangyanchao6 at huawei dot com
  2022-03-05 19:31 ` goldstein.w.n at gmail dot com
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: yangyanchao6 at huawei dot com @ 2022-02-23  3:29 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #8 from yangyanchao6 at huawei dot com <yangyanchao6 at huawei dot com> ---
(In reply to Szabolcs Nagy from comment #1)
> which core are you measuring?

I've also done some testing on this issue:

Using attachment 13992 to test:
[root@localhost test]# ./a.out 50 50
0xaaaae0ac12a0,0xaaaae0ac12e0
base:50, diff:50, postion:0, cycle=48677957
base:50, diff:50, postion:8, cycle=60936950
base:50, diff:50, postion:16, cycle=277960241
base:50, diff:50, postion:24, cycle=69354865
base:50, diff:50, postion:32, cycle=78853402
base:50, diff:50, postion:40, cycle=90342072
base:50, diff:50, postion:48, cycle=101327114
Performance deteriorates only when the strings first differ at characters 16 to 23.

This issue involves two commits:
https://sourceware.org/git/?p=glibc.git;a=commit;h=adac54ffc5
From adac54ffc5ded48cba7deb18e46df984b213b0ac Mon Sep 17 00:00:00 2001
From: Alex Butler <Alex.Butler@arm.com>
Date: Tue, 16 Jun 2020 12:42:38 +0000
Subject: [PATCH] aarch64: MTE compatible strcmp

https://sourceware.org/git/?p=glibc.git;a=commit;h=34f0d01d5e
From 34f0d01d5e43c7dedd002ab47f6266dfb5b79c22 Mon Sep 17 00:00:00 2001
From: Wilco Dijkstra <wdijkstr@arm.com>
Date: Wed, 15 Jul 2020 16:50:02 +0100
Subject: [PATCH] AArch64: Align ENTRY to a cacheline

With only adac54ffc5ded48cba7deb18e46df984b213b0ac applied, the performance
is good, but after 34f0d01d5e43c7dedd002ab47f6266dfb5b79c22, the performance
deteriorates.

Does this problem also have something to do with alignment?

diff --git a/sysdeps/aarch64/strcmp.S b/sysdeps/aarch64/strcmp.S
index f225d718..7a048b66 100644
--- a/sysdeps/aarch64/strcmp.S
+++ b/sysdeps/aarch64/strcmp.S
@@ -71,8 +71,6 @@ ENTRY(strcmp)
 	b.ne	L(misaligned8)
 	cbnz	tmp, L(mutual_align)
 
-	.p2align 4
-
 L(loop_aligned):
 	ldr	data2, [src1, off2]
 	ldr	data1, [src1], 8
I removed the first alignment in strcmp and the 300% performance degradation
disappeared.


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (9 preceding siblings ...)
  2022-02-23  3:29 ` yangyanchao6 at huawei dot com
@ 2022-03-05 19:31 ` goldstein.w.n at gmail dot com
  2022-03-15 17:04 ` wdijkstr at arm dot com
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: goldstein.w.n at gmail dot com @ 2022-03-05 19:31 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

Noah Goldstein <goldstein.w.n at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |goldstein.w.n at gmail dot com

--- Comment #9 from Noah Goldstein <goldstein.w.n at gmail dot com> ---
(In reply to yangyanchao6@huawei.com from comment #8)
> This problem also has something to do with alignment?
> 
> diff --git a/sysdeps/aarch64/strcmp.S b/sysdeps/aarch64/strcmp.S
> index f225d718..7a048b66 100644
> --- a/sysdeps/aarch64/strcmp.S
> +++ b/sysdeps/aarch64/strcmp.S
> @@ -71,8 +71,6 @@ ENTRY(strcmp)
>  	b.ne	L(misaligned8)
>  	cbnz	tmp, L(mutual_align)
> 
> -	.p2align 4
> -
>  L(loop_aligned):
>  	ldr	data2, [src1, off2]
>  	ldr	data1, [src1], 8
> 
> I removed the first alignment in strcmp and the 300% performance degradation
> disappeared.

I'm not an expert on the microarch, but that sounds like a benchmark artifact.
Possibly check the decode path? If the change in alignment causes decode
to run / not run out of the trace cache, that would severely impact the
benchmark (where presumably strcmp is just run back-to-back-to-back in a
loop) but not necessarily affect real-world performance.
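
For readers without access to attachment 13992, the kind of measurement
discussed above can be approximated with a small C harness. This is only a
sketch under stated assumptions: the helper names make_pair and time_strcmp
are hypothetical and are not taken from the attachment, and clock() is used
as a coarse, portable timer rather than a cycle counter.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: fill a and b (each at least len + 1 bytes) with
   identical text of length len, then make them differ at index pos.
   The caller must ensure pos < len.  */
static void make_pair(char *a, char *b, size_t len, size_t pos)
{
    memset(a, 'x', len);
    memset(b, 'x', len);
    a[len] = '\0';
    b[len] = '\0';
    b[pos] = 'y';
}

/* Hypothetical helper: time iters back-to-back strcmp calls on the pair
   and return the elapsed CPU time in seconds.  The volatile function
   pointer and the volatile sink keep the compiler from folding the
   calls away.  */
static double time_strcmp(const char *a, const char *b, long iters)
{
    int (*volatile cmp)(const char *, const char *) = strcmp;
    volatile int sink = 0;
    clock_t t0, t1;

    t0 = clock();
    for (long i = 0; i < iters; i++)
        sink += cmp(a, b);
    t1 = clock();
    (void) sink;
    return (double) (t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling make_pair(a, b, 50, pos) for pos = 0, 8, 16, ..., 48 and timing each
pair with time_strcmp would mirror the sweep in the quoted output. Because
the loop calls strcmp back-to-back, it shares the limitation raised above.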

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (10 preceding siblings ...)
  2022-03-05 19:31 ` goldstein.w.n at gmail dot com
@ 2022-03-15 17:04 ` wdijkstr at arm dot com
  2022-03-15 17:33 ` goldstein.w.n at gmail dot com
  2022-03-16 18:53 ` wdijkstr at arm dot com
  13 siblings, 0 replies; 15+ messages in thread
From: wdijkstr at arm dot com @ 2022-03-15 17:04 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #10 from Wilco <wdijkstr at arm dot com> ---
(In reply to Noah Goldstein from comment #9)
> (In reply to yangyanchao6@huawei.com from comment #8)

> > - .p2align 4
> > -
> > L(loop_aligned):
> > ldr data2, [src1, off2]
> > ldr data1, [src1], 8
> > 
> > I removed the first alignment in strcmp and the 300% performance degradation
> > disappeared.
> 
> I'm not an expert on the microarch, but that sounds like a benchmark artifact.
> Possibly check the decode path? If the change in alignment causes decode
> to run / not run out of the trace cache, that would severely impact the
> benchmark (where presumably strcmp is just run back-to-back-to-back in a
> loop) but not necessarily affect real-world performance.

As reported it does also occur in Dhrystone, so it's not due to calling strcmp
in a tiny loop. It seems like an issue with the branch predictor not learning
to predict certain loops. The question is whether the proposed workaround means
all sizes now work without misprediction. Other string functions use small
loops as well and compiled code will be affected too, so it's not clear to me
there is an easy fix here.
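
For reference, the directive removed in the proposed workaround,
.p2align 4, pads the location counter to the next multiple of 2^4 = 16
bytes. The sketch below computes that padding for a given address. The
function name p2align_padding and the example addresses are hypothetical,
and nothing here models how a particular CPU reacts to the padding.

```c
#include <stdint.h>

/* Hypothetical helper: number of padding bytes that a ".p2align n"
   directive would insert when the location counter is at addr, so that
   the next instruction starts on a multiple of 2^n bytes.
   Assumes n < 64.  */
static uint64_t p2align_padding(uint64_t addr, unsigned n)
{
    uint64_t mask = (UINT64_C(1) << n) - 1;

    /* Distance from addr up to the next multiple of 2^n (0 if aligned).  */
    return (0 - addr) & mask;
}
```

Since AArch64 instructions are 4 bytes each, a 16-byte alignment can add at
most 12 bytes (three instruction slots) of padding in front of the loop.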


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (11 preceding siblings ...)
  2022-03-15 17:04 ` wdijkstr at arm dot com
@ 2022-03-15 17:33 ` goldstein.w.n at gmail dot com
  2022-03-16 18:53 ` wdijkstr at arm dot com
  13 siblings, 0 replies; 15+ messages in thread
From: goldstein.w.n at gmail dot com @ 2022-03-15 17:33 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #11 from Noah Goldstein <goldstein.w.n at gmail dot com> ---
(In reply to Wilco from comment #10)
> (In reply to Noah Goldstein from comment #9)
> > (In reply to yangyanchao6@huawei.com from comment #8)
> 
> > > - .p2align 4
> > > -
> > > L(loop_aligned):
> > > ldr data2, [src1, off2]
> > > ldr data1, [src1], 8
> > > 
> > > I removed the first alignment in strcmp and the 300% performance degradation
> > > disappeared.
> > 
> > I'm not an expert on the microarch, but that sounds like a benchmark
> > artifact. Possibly check the decode path? If the change in alignment
> > causes decode to run / not run out of the trace cache, that would severely
> > impact the benchmark (where presumably strcmp is just run
> > back-to-back-to-back in a loop) but not necessarily affect real-world
> > performance.
> 
> As reported it does also occur in Dhrystone, so it's not due to calling
> strcmp in a tiny loop. It seems like an issue with the branch predictor not
> learning to predict certain loops. The question is whether the proposed
> workaround means all sizes now work without misprediction. Other string
> functions use small loops as well and compiled code will be affected too, so
> it's not clear to me there is an easy fix here.

Well, what's causing the mispredictions? Clobber in the BHT?

By decode I was wondering if there was something like the
loop-stream-detector (LSD) in aarch64. On x86, entering the LSD is
related to code alignment and can cause a spike in branch misses
because it's implemented such that the only way to exit LSD decode 'mode'
is a branch miss.


* [Bug string/26852] aarch64/strcmp has performance regression for some cases
  2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
                   ` (12 preceding siblings ...)
  2022-03-15 17:33 ` goldstein.w.n at gmail dot com
@ 2022-03-16 18:53 ` wdijkstr at arm dot com
  13 siblings, 0 replies; 15+ messages in thread
From: wdijkstr at arm dot com @ 2022-03-16 18:53 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26852

--- Comment #12 from Wilco <wdijkstr at arm dot com> ---
(In reply to Noah Goldstein from comment #11)
> (In reply to Wilco from comment #10)

> > As reported it does also occur in Dhrystone, so it's not due to calling
> > strcmp in a tiny loop. It seems like an issue with the branch predictor not
> > learning to predict certain loops. The question is whether the proposed
> > workaround means all sizes now work without misprediction. Other string
> > functions use small loops as well and compiled code will be affected too, so
> > it's not clear to me there is an easy fix here.
> 
> Well, what's causing the mispredictions? Clobber in the BHT?

We can't tell from the results - I think that's a question for the CPU
designers.

> By decode I was wondering if there was something like the
> loop-stream-detector (LSD) in aarch64. On x86, entering the LSD is
> related to code alignment and can cause a spike in branch misses
> because it's implemented such that the only way to exit LSD decode 'mode'
> is a branch miss.

AFAIK there is no public description of the microarchitecture, so it may not
even have a loop buffer. If it was something like that, I would expect a
slowdown in other cases, not just for exactly 3 iterations.
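
The "3 iterations" figure follows from the loop shown earlier in the thread,
which advances src1 by 8 bytes per pass (ldr data1, [src1], 8). A minimal
sketch of the arithmetic, assuming both strings are 8-byte aligned and the
loop starts at byte 0 (loop_iterations is a hypothetical helper, not part of
glibc):

```c
#include <stddef.h>

/* Hypothetical helper: number of passes through a compare loop that
   consumes 8 bytes per pass before it reaches the first differing byte
   at index pos.  Assumes the loop begins at byte 0 of aligned strings.  */
static size_t loop_iterations(size_t pos)
{
    return pos / 8 + 1;
}
```

Under this model, a first difference anywhere in bytes 16 to 23 exits on the
third pass, which matches the slow case reported for position 16.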


Thread overview: 15+ messages
2020-11-09  8:10 [Bug string/26852] New: aarch64/strcmp has performance regression for some cases xuchunmei at linux dot alibaba.com
2020-11-09  8:12 ` [Bug string/26852] " xuchunmei at linux dot alibaba.com
2020-11-09  8:13 ` xuchunmei at linux dot alibaba.com
2020-11-09 10:30 ` nsz at gcc dot gnu.org
2020-11-09 12:53 ` xuchunmei at linux dot alibaba.com
2020-11-09 13:06 ` wdijkstr at arm dot com
2020-11-09 13:08 ` wdijkstr at arm dot com
2020-11-10  1:46 ` xuchunmei at linux dot alibaba.com
2020-11-10 12:53 ` wdijkstr at arm dot com
2022-02-23  3:22 ` yangyanchao6 at huawei dot com
2022-02-23  3:29 ` yangyanchao6 at huawei dot com
2022-03-05 19:31 ` goldstein.w.n at gmail dot com
2022-03-15 17:04 ` wdijkstr at arm dot com
2022-03-15 17:33 ` goldstein.w.n at gmail dot com
2022-03-16 18:53 ` wdijkstr at arm dot com
