[Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform

public inbox for gcc-bugs@sourceware.org

* [Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
@ 2024-05-10  8:18 colin.king at intel dot com
  2024-05-10  8:23 ` [Bug target/115025] " colin.king at intel dot com
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: colin.king at intel dot com @ 2024-05-10  8:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

            Bug ID: 115025
           Summary: prime computation performance regression, x86, between
                    gcc-14 and gcc-13 on skylake platform
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: colin.king at intel dot com
  Target Milestone: ---

Created attachment 58163
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58163&action=edit
reproducer source code

I'm seeing a ~7% performance regression with gcc-14 compared to gcc-13 when
computing prime numbers on Ubuntu 24.04:

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4) 
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f]
(Ubuntu 14-20240412-0ubuntu1) 

cking@skylake:~$ CFLAGS="" gcc-13 -O2 reproducer-prime.c -lm
cking@skylake:~$ ./a.out 
473.04 prime ops per sec

cking@skylake:~$ CFLAGS="" gcc-14 -O2 reproducer-prime.c -lm
cking@skylake:~$ ./a.out 
439.86 prime ops per sec

Attached is the reproducer. Note that the use of
__attribute__((optimize("-O3"))) and/or __builtin_expect((x), 0) does not
affect the performance regression.
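
For readers without the attachment, a minimal sketch of how those two
annotations might be applied to a trial-division prime test. The function name
is_prime_annotated, the UNLIKELY macro, and the integer loop bound are
assumptions for illustration; the attached reproducer may be structured
differently.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the reproducer. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/*
 * Assumed shape of the prime test: trial division by 6k-1 and 6k+1.
 * The optimize attribute asks GCC to compile this one function at -O3.
 */
int __attribute__((optimize("-O3"))) is_prime_annotated(uint32_t n)
{
	uint32_t i;

	if (n <= 3)
		return n >= 2;
	if (UNLIKELY(n % 2 == 0) || UNLIKELY(n % 3 == 0))
		return 0;
	/* 64-bit product so the bound check cannot overflow. */
	for (i = 5; (uint64_t)i * i <= n; i += 6)
		if (UNLIKELY(n % i == 0) || UNLIKELY(n % (i + 2) == 0))
			return 0;
	return 1;
}
```

Calling is_prime_annotated() over a fixed range of n in a timed loop gives an
ops-per-second figure comparable in spirit to the numbers above.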

The original issue appeared when regression testing the stress-ng cpu prime
number stressor [1]. I extracted the attached reproducer from that code.

Attached are the reproducer C source and the disassembled object code.

References: [1]
https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c

* [Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
  2024-05-10  8:18 [Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform colin.king at intel dot com
@ 2024-05-10  8:23 ` colin.king at intel dot com
  2024-05-10  8:23 ` colin.king at intel dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: colin.king at intel dot com @ 2024-05-10  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #1 from Colin Ian King <colin.king at intel dot com> ---
Created attachment 58164
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58164&action=edit
gcc-13 disassembly

* [Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
  2024-05-10  8:18 [Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform colin.king at intel dot com
  2024-05-10  8:23 ` [Bug target/115025] " colin.king at intel dot com
@ 2024-05-10  8:23 ` colin.king at intel dot com
  2024-05-10  8:28 ` colin.king at intel dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: colin.king at intel dot com @ 2024-05-10  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #2 from Colin Ian King <colin.king at intel dot com> ---
Created attachment 58165
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58165&action=edit
gcc-14 disassembly

* [Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
  2024-05-10  8:18 [Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform colin.king at intel dot com
  2024-05-10  8:23 ` [Bug target/115025] " colin.king at intel dot com
  2024-05-10  8:23 ` colin.king at intel dot com
@ 2024-05-10  8:28 ` colin.king at intel dot com
  2024-05-10  8:29 ` colin.king at intel dot com
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: colin.king at intel dot com @ 2024-05-10  8:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #3 from Colin Ian King <colin.king at intel dot com> ---
Created attachment 58166
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58166&action=edit
perf output for gcc-13 compiled code

* [Bug target/115025] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
  2024-05-10  8:18 [Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform colin.king at intel dot com
                   ` (2 preceding siblings ...)
  2024-05-10  8:28 ` colin.king at intel dot com
@ 2024-05-10  8:29 ` colin.king at intel dot com
  2024-05-16  1:50 ` [Bug target/115025] [14/15 regression] " sjames at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: colin.king at intel dot com @ 2024-05-10  8:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #4 from Colin Ian King <colin.king at intel dot com> ---
Created attachment 58167
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58167&action=edit
perf output for gcc-14 compiled code

* [Bug target/115025] [14/15 regression] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
  2024-05-10  8:18 [Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform colin.king at intel dot com
                   ` (3 preceding siblings ...)
  2024-05-10  8:29 ` colin.king at intel dot com
@ 2024-05-16  1:50 ` sjames at gcc dot gnu.org
  2024-05-16  8:54 ` haochen.jiang at intel dot com
  2024-05-22  7:24 ` haochen.jiang at intel dot com
  6 siblings, 0 replies; 8+ messages in thread
From: sjames at gcc dot gnu.org @ 2024-05-16  1:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

Sam James <sjames at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.2

* [Bug target/115025] [14/15 regression] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
  2024-05-10  8:18 [Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform colin.king at intel dot com
                   ` (4 preceding siblings ...)
  2024-05-16  1:50 ` [Bug target/115025] [14/15 regression] " sjames at gcc dot gnu.org
@ 2024-05-16  8:54 ` haochen.jiang at intel dot com
  2024-05-22  7:24 ` haochen.jiang at intel dot com
  6 siblings, 0 replies; 8+ messages in thread
From: haochen.jiang at intel dot com @ 2024-05-16  8:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

--- Comment #5 from Haochen Jiang <haochen.jiang at intel dot com> ---
My guess concerns the primality test loop:

        for (i = 5; i < max; i += 6)
                if ((n % i == 0) || (n % (i + 2) == 0))
                        return 0;

GCC 13 pulls the first iteration, (n % 5 == 0) || (n % 7 == 0), out of the
loop, so those two tests are done with imul+cmp instead of div.

On current trunk, however, the first iteration still uses div, which is slower.
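
To make the difference concrete, here is a hand-written sketch of the loop
before and after pulling out the first iteration. The function names and the
max parameter are assumptions for illustration (max is taken to be a bound
near sqrt(n) + 1); this is not the compiler's actual output.

```c
#include <stdint.h>

/* Loop as written: every test divides by a variable, so div is needed. */
int has_factor_loop(uint32_t n, uint32_t max)
{
	uint32_t i;

	for (i = 5; i < max; i += 6)
		if ((n % i == 0) || (n % (i + 2) == 0))
			return 1;
	return 0;
}

/* Same search with the first iteration (i == 5) pulled out by hand. */
int has_factor_peeled(uint32_t n, uint32_t max)
{
	uint32_t i;

	if (5 < max) {
		/* Constant divisors: the compiler can avoid div here. */
		if ((n % 5 == 0) || (n % 7 == 0))
			return 1;
		/* Remaining iterations still divide by a variable. */
		for (i = 11; i < max; i += 6)
			if ((n % i == 0) || (n % (i + 2) == 0))
				return 1;
	}
	return 0;
}
```

Both functions return the same result for every (n, max); only the first pair
of tests changes from variable to constant divisors.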

BTW, there is also a codegen regression that does not affect performance: on
current trunk, the sqrt BB is not merged together. This increases code size but
has no perf impact.

* [Bug target/115025] [14/15 regression] prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform
  2024-05-10  8:18 [Bug target/115025] New: prime computation performance regression, x86, between gcc-14 and gcc-13 on skylake platform colin.king at intel dot com
                   ` (5 preceding siblings ...)
  2024-05-16  8:54 ` haochen.jiang at intel dot com
@ 2024-05-22  7:24 ` haochen.jiang at intel dot com
  6 siblings, 0 replies; 8+ messages in thread
From: haochen.jiang at intel dot com @ 2024-05-22  7:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115025

Haochen Jiang <haochen.jiang at intel dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jh at suse dot cz

--- Comment #6 from Haochen Jiang <haochen.jiang at intel dot com> ---
From my investigation, two commits are related to this PR. Both involve the
copy header pass (ch2).

This is the dump before the ch2 pass for that loop.

  <bb 9> [local count: 109475452]:
  _4 = n_1(D) % 5;
  if (_4 == 0)
    goto <bb 14>; [3.66%]
  else
    goto <bb 10>; [96.34%]

  <bb 10> [local count: 105468650]:
  _24 = n_1(D) % 7;
  if (_24 == 0)
    goto <bb 14>; [3.66%]
  else
    goto <bb 13>; [96.34%]
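
For context, the tests in bb 9 and bb 10 have constant divisors, which is what
makes a multiply-and-compare form possible. Below is a sketch of one well-known
way to test divisibility by 5 and 7 without a division, assuming n is a 32-bit
unsigned value. The function names are hypothetical, and the exact instruction
sequence in the attached disassembly may differ.

```c
#include <stdint.h>

/*
 * 0xCCCCCCCD is the multiplicative inverse of 5 modulo 2^32 and
 * 0x33333333 is UINT32_MAX / 5. The product wraps modulo 2^32; it
 * lands at or below the threshold exactly when n is a multiple of 5.
 */
int divisible_by_5(uint32_t n)
{
	return (uint32_t)(n * UINT32_C(0xCCCCCCCD)) <= UINT32_C(0x33333333);
}

/*
 * 0xB6DB6DB7 is the multiplicative inverse of 7 modulo 2^32 and
 * 0x24924924 is UINT32_MAX / 7.
 */
int divisible_by_7(uint32_t n)
{
	return (uint32_t)(n * UINT32_C(0xB6DB6DB7)) <= UINT32_C(0x24924924);
}
```

Each test costs one multiply and one compare, which is much cheaper than a
32-bit div on this class of CPU.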


The first is r14-2675. After this commit, the ch2 pass refuses to duplicate
bb 9 and bb 10, which it previously duplicated, for the reason quoted below.
This caused half of the total regression.

"Not duplicating bb 9: condition based on non-IV loop variant."

The other is r14-2709. After this commit, the ch2 pass tried to duplicate both
bb 9 and bb 10 but eventually did not. This commit contributed the other half
of the regression.
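
As background on what "duplicating" a header block means, here is a generic
source-level sketch of loop header copying: the exit test is copied in front
of the loop as a guard and the loop itself becomes a do-while. The functions
are illustrative only and are not taken from the reproducer or from the GCC
dumps.

```c
#include <stdint.h>

/* Before: the exit test sits in the header and runs before each iteration. */
uint32_t sum_while(const uint32_t *v, uint32_t len)
{
	uint32_t i = 0, sum = 0;

	while (i < len) {
		sum += v[i];
		i++;
	}
	return sum;
}

/* After: one copy of the test guards entry, the other closes the loop. */
uint32_t sum_header_copied(const uint32_t *v, uint32_t len)
{
	uint32_t i = 0, sum = 0;

	if (i < len) {
		do {
			sum += v[i];
			i++;
		} while (i < len);
	}
	return sum;
}
```

The copied test is evaluated once with the initial values of the loop
variables, which is where later passes get a chance to simplify it.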

I am going to dig deeper.
