public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
@ 2020-03-31 17:33 jamborm at gcc dot gnu.org
  2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-31 17:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

            Bug ID: 94427
           Summary: 456.hmmer is 8-17% slower when compiled at -Ofast than
                    with GCC 9
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

SPECINT 2006 benchmark 456.hmmer runs 18% slower on AMD Zen2 CPUs,
15% slower on AMD Zen1 CPUs and 8% slower on Intel Cascade Lake
server CPUs when built with trunk (revision 26b3e568a60) and just
-Ofast (so with generic march/mtune) than when compiled with GCC 9.

Bisecting the regression leads to commit:

  commit 14ec49a7537004633b7fff859178cbebd288ca1d
  Author: Richard Biener <rguenther@suse.de>
  Date:   Tue Jul 2 07:35:23 2019 +0000

    re PR tree-optimization/58483 (missing optimization opportunity for const std::vector compared to std::array)

    2019-07-02  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/58483
            * tree-ssa-scopedtables.c (avail_expr_hash): Use OEP_ADDRESS_OF
            for MEM_REF base hashing.
            (equal_mem_array_ref_p): Likewise for base comparison.

            * gcc.dg/tree-ssa/ssa-dom-cse-8.c: New testcase.

    From-SVN: r272922


The collected profiles are weird - almost the other way round from
what I would expect, because the *slow* version spends less time in
the cold section - but IMHO both spend too much time there.  The
following data were collected on AMD Zen2, but the numbers from Intel
are similar in this regard.  What is different is that on Intel perf
stat reports a doubling of branch misses - and because it has an
older perf, it does not report front-end/back-end stalls.

Before the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

         163360.87 msec task-clock:u              #    0.992 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
              7639      page-faults:u             #    0.047 K/sec
      525635661818      cycles:u                  #    
         809847511      stalled-cycles-frontend:u #    0.15% frontend cycles idle     (83.35%)
      299331255326      stalled-cycles-backend:u  #   56.95% backend cycles idle      (83.30%)
     1757801907547      instructions:u            #    3.34  insn per cycle
                                                  #    0.17  stalled cycles per insn  (83.34%)
      133496985084      branches:u                #  817.191 M/sec   (83.35%)
         682351923      branch-misses:u           #    0.51% of all branches   (83.31%)

     164.659685804 seconds time elapsed

     163.325420000 seconds user
       0.022183000 seconds sys

# Samples: 637K of event 'cycles:u'
# Event count (approx.): 527143782584
#
# Overhead       Samples  Shared Object            Symbol
# ........  ............  .......................  ....................
#   
    58.43%        372284  hmmer_peak.mine-std-gen  [.] P7Viterbi
    35.12%        223887  hmmer_peak.mine-std-gen  [.] P7Viterbi.cold
     2.59%         16418  hmmer_peak.mine-std-gen  [.] FChoose
     2.51%         15906  hmmer_peak.mine-std-gen  [.] sre_random


At the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':                    

         191483.84 msec task-clock:u              #    0.994 CPUs utilized      
                 0      context-switches:u        #    0.000 K/sec              
                 0      cpu-migrations:u          #    0.000 K/sec              
              7639      page-faults:u             #    0.040 K/sec              
      622159384711      cycles:u                  #    
         817604010      stalled-cycles-frontend:u #    0.13% frontend cycles idle     (83.31%)
      439972264588      stalled-cycles-backend:u  #   70.72% backend cycles idle      (83.34%)
     1707838992202      instructions:u            #    2.75  insn per cycle
                                                  #    0.26  stalled cycles per insn  (83.35%)
       91309384910      branches:u                #  476.852 M/sec   (83.32%)
         655463713      branch-misses:u           #    0.72% of all branches   (83.33%)

     192.564513355 seconds time elapsed

     191.443774000 seconds user
       0.023978000 seconds sys

# Samples: 752K of event 'cycles:u'
# Event count (approx.): 622947549968
#
# Overhead       Samples  Shared Object             Symbol
# ........  ............  ........................  ....................
#   
    83.68%        629645  hmmer_peak.small-std-gen  [.] P7Viterbi
    10.84%         81591  hmmer_peak.small-std-gen  [.] P7Viterbi.cold
     2.21%         16546  hmmer_peak.small-std-gen  [.] FChoose
     2.11%         15793  hmmer_peak.small-std-gen  [.] sre_random


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
  2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
@ 2020-03-31 23:12 ` jamborm at gcc dot gnu.org
  2020-04-01  6:48 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: jamborm at gcc dot gnu.org @ 2020-03-31 23:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #1 from Martin Jambor <jamborm at gcc dot gnu.org> ---
OK, so it turns out the identified commit only allows us to shoot
ourselves in the foot - and there are too few branches here, not too
many.

The hottest loop, consuming most of the time, is:

Percent         Instructions
------------------------------------------------
  0.03 │ fb0:┌─+add     -0x8(%r9,%rcx,4),%eax
  5.03 │     │  mov     %eax,-0x4(%r13,%rcx,4)
  2.48 │     │  mov     -0x8(%r8,%rcx,4),%esi
  0.02 │     │  add     -0x8(%rdx,%rcx,4),%esi
  0.06 │     │  cmp     %eax,%esi
  4.49 │     │  cmovge  %esi,%eax
 17.17 │     │  mov     %ecx,%esi
  0.03 │     │  cmp     $0xc521974f,%eax
  3.50 │     │  cmovl   %ebx,%eax   <----------- this used to be a branch
 21.84 │     │  mov     %eax,-0x4(%r13,%rcx,4)
  3.88 │     │  add     $0x1,%rcx
  0.00 │     │  cmp     %rdi,%rcx
  0.04 │     └──jne     fb0

where the marked conditional move was a branch one revision earlier,
because after fwprop3 the IL looked like:

  <bb 16> [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14), [fast_algorithms.c:142:53] cstore_249(15)>
  [fast_algorithms.c:142:49] MEM <int> [(void *)_72] = cstore_281;
  [fast_algorithms.c:143:13] _78 = [fast_algorithms.c:143:13] *_72;
  [fast_algorithms.c:143:10] if (_78 < -987654321)
    goto <bb 18>; [50.00%]
  else
    goto <bb 17>; [50.00%]

  <bb 17> [local count: 477815109]:

  <bb 18> [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16), [fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;

The aforementioned revision turned this into more optimized code:

  <bb 16> [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14), [fast_algorithms.c:142:53] _73(15)>
  [fast_algorithms.c:143:10] if (cstore_281 < -987654321)
    goto <bb 18>; [50.00%]
  else
    goto <bb 17>; [50.00%]

  <bb 17> [local count: 477815109]:

  <bb 18> [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16), [fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;

phiopt3 then changed this to:

  cstore_248 = MAX_EXPR <cstore_249, -987654321>;
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_248;

and the expander apparently always expands MAX_EXPR into a
conditional move if it can(?).
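
For reference, a minimal sketch of the source pattern involved,
reconstructed from the fast_algorithms.c lines quoted in the perf
annotations later in this thread (the function name and parameter
list are made up for illustration; the real P7Viterbi works on many
more arrays):

  /* Delete-state recurrence around fast_algorithms.c lines 142-143.
     -INFTY is -987654321, the special constant visible in the GIMPLE
     above.  */
  #define INFTY 987654321

  static void
  delete_state_loop (int *dc, const int *mc, const int *tpdd,
                     const int *tpmd, int M)
  {
    int k, sc;
    for (k = 1; k <= M; k++) {
      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;  /* line 142 */
      if (dc[k] < -INFTY) dc[k] = -INFTY;                  /* line 143 */
    }
  }

The second if is exactly dc[k] = MAX (dc[k], -INFTY), which is the
MAX_EXPR that phiopt3 creates above.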

When I hacked phiopt not to do the transformation for - ehm - any
GIMPLE_COND statement originating from source line 143, I recovered
the original run-time of the benchmark, on both AMD and Intel.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
  2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
  2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
@ 2020-04-01  6:48 ` rguenth at gcc dot gnu.org
  2023-08-07  9:19 ` hubicka at gcc dot gnu.org
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-01  6:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The hot/cold thing smells like PR90911 btw.

The cmov thing you identified is PR91154, which was fixed for archs
that can do the required SSE min/max operation, which is SSE 4.1
IIRC.

It is then also a "duplicate" of the (various?) bugs about
to-cmov-or-not-to-cmov where the microarchitectural details are
unclear.

So I fear that with just SSE2 we can't do anything here but apply yet
another heuristic, special to hmmer, that will likely hurt another
case (there are corresponding bugs asking for _more_ cmov...).  One
of my first patches was to notice the special constant for the
max/min operation and not do conditional move expansion (plus prevent
later RTL ifcvt from applying with the same heuristics).
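
To make the trade-off concrete, here is the clamp in isolation (a
hand-written reproducer sketch, not the testcase from any of those
PRs):

  /* phiopt turns this branch into MAX_EXPR <x, -987654321>; the
     expander can then emit a branch, a conditional move (the cmovl
     seen in comment #1), or - with the PR91154 fix on SSE 4.1
     targets - an SSE integer max.  Because the clamp almost never
     fires in hmmer, the well-predicted branch beats the cmov here.  */
  int
  clamp (int x)
  {
    if (x < -987654321)
      x = -987654321;
    return x;
  }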

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
  2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
  2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
  2020-04-01  6:48 ` rguenth at gcc dot gnu.org
@ 2023-08-07  9:19 ` hubicka at gcc dot gnu.org
  2023-08-07  9:38 ` hubicka at gcc dot gnu.org
  2023-08-07  9:44 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-08-07  9:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
With profile feedback on Zen4 we now get the hottest loop as:

       │      dc[k] = dc[k-1] + tpdd[k-1];
       │16b0:┌─ vmovd         (%r14,%rdx,1),%xmm2
  0.15 │     │  vpaddd        %xmm2,%xmm0,%xmm0
  5.79 │     │  vmovd         %xmm0,0x4(%rax,%rdx,1)
       │     │if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  8.04 │     │  vmovd         (%r15,%rdx,1),%xmm7
  0.16 │     │  vmovd         (%rcx,%rdx,1),%xmm2
  0.41 │     │  vpaddd        %xmm7,%xmm2,%xmm2
       │     │if (dc[k] < -INFTY) dc[k] = -INFTY;
  0.71 │     │  vmovdqa       _IO_stdin_used+0x560,%xmm7
  1.07 │     │  vpmaxsd       %xmm7,%xmm2,%xmm2
  0.73 │     │  vpmaxsd       %xmm0,%xmm2,%xmm0
  5.83 │     │  vmovd         %xmm0,0x4(%rax,%rdx,1)
       │     │for (k = 1; k <= M; k++) {
  5.86 │     │  add           $0x4,%rdx
  1.40 │     ├──cmp           %rdx,%r13
  0.00 │     └──jne           16b0

No time is spent in the cold section.

Without profile feedback I get:
  88.80%  hmmer_peak.chn-  [.] P7Viterbi
   5.10%  hmmer_peak.chn-  [.] sre_random
   2.31%  hmmer_peak.chn-  [.] FChoose
   1.35%  hmmer_peak.chn-  [.] RandomSequence

so no time in the cold section either.

The inner loop is almost identical:
       │17e0:┌─ vmovd         (%r11,%rdi,4),%xmm3
  0.07 │     │  mov           %rdi,%r8
  0.09 │     │  vpaddd        %xmm3,%xmm0,%xmm0
  6.20 │     │  vmovd         %xmm0,0x4(%rdx,%rdi,4)
       │     │if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  7.00 │     │  vmovd         (%rax,%rdi,4),%xmm6
  0.19 │     │  vmovd         (%r10,%rdi,4),%xmm3
  0.16 │     │  vpaddd        %xmm3,%xmm6,%xmm3
       │     │if (dc[k] < -INFTY) dc[k] = -INFTY;
  1.25 │     │  vmovdqa       _IO_stdin_used+0x600,%xmm6
  0.89 │     │  vpmaxsd       %xmm6,%xmm3,%xmm3
  0.46 │     │  vpmaxsd       %xmm0,%xmm3,%xmm0
  5.85 │     │  vmovd         %xmm0,0x4(%rdx,%rdi,4)
       │     │for (k = 1; k <= M; k++) {
  6.02 │     │  inc           %rdi
  2.48 │     ├──cmp           %r8,%r9
  0.00 │     └──jne           17e0
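
The two vpmaxsd instructions make both conditional updates
branch-free.  Roughly, in intrinsics, the per-element computation
corresponds to something like the sketch below (my rendering, not
compiler output; note the compiled loops above still process one
element per iteration, keeping the scalars in xmm registers):

  #include <smmintrin.h>  /* SSE4.1 _mm_max_epi32, i.e. pmaxsd */

  /* Both the (sc > dc[k]) update and the -INFTY clamp become integer
     max operations; max is associative and commutative, so the order
     differs from the source without changing the result.  */
  static __m128i
  dc_update (__m128i dc_prev_plus_tpdd, __m128i mc_plus_tpmd)
  {
    const __m128i neg_infty = _mm_set1_epi32 (-987654321);
    __m128i sc = _mm_max_epi32 (mc_plus_tpmd, neg_infty);
    return _mm_max_epi32 (sc, dc_prev_plus_tpdd);
  }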

However, the hottest loop seems to be completely elsewhere than the
one you showed, since it is an FP loop and yours seems to be integer?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
  2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2023-08-07  9:19 ` hubicka at gcc dot gnu.org
@ 2023-08-07  9:38 ` hubicka at gcc dot gnu.org
  2023-08-07  9:44 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-08-07  9:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
It is the same loop - it was float only in my mind (since the
function returns a float value :)

With loop splitting we no longer have the last-iteration check, but
we still have the underflow checks, which are indeed likely well
predicted and which, in the unvectorized version, may make sense not
to if-convert.
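
Schematically, with made-up arrays just to illustrate what loop
splitting does with the last-iteration check:

  /* Before splitting: the guard is tested in every iteration.  */
  for (k = 1; k <= M; k++) {
    dc[k] = dc[k-1] + tpdd[k-1];
    if (k < M)
      ic[k] = mc[k] + tpmi[k];
  }

  /* After splitting: the guard is gone from the body and the last
     iteration is peeled.  */
  for (k = 1; k < M; k++) {
    dc[k] = dc[k-1] + tpdd[k-1];
    ic[k] = mc[k] + tpmi[k];
  }
  dc[M] = dc[M-1] + tpdd[M-1];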

So I guess in the unvectorized loop the 100% predictable conditional
should still be a win, but vectorization should likely outweigh that?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9
  2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-08-07  9:38 ` hubicka at gcc dot gnu.org
@ 2023-08-07  9:44 ` rguenth at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-08-07  9:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #4)
> It is the same loop - it was float only in my mind (since the
> function returns a float value :)
> 
> With loop splitting we no longer have the last-iteration check, but
> we still have the underflow checks, which are indeed likely well
> predicted and which, in the unvectorized version, may make sense not
> to if-convert.
> 
> So I guess in the unvectorized loop the 100% predictable conditional
> should still be a win, but vectorization should likely outweigh that?

I think so.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-08-07  9:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-31 17:33 [Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9 jamborm at gcc dot gnu.org
2020-03-31 23:12 ` [Bug tree-optimization/94427] " jamborm at gcc dot gnu.org
2020-04-01  6:48 ` rguenth at gcc dot gnu.org
2023-08-07  9:19 ` hubicka at gcc dot gnu.org
2023-08-07  9:38 ` hubicka at gcc dot gnu.org
2023-08-07  9:44 ` rguenth at gcc dot gnu.org
