[Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO
@ 2024-02-22 16:19 pheeck at gcc dot gnu.org
  2024-02-22 17:50 ` [Bug target/114057] " pheeck at gcc dot gnu.org
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: pheeck at gcc dot gnu.org @ 2024-02-22 16:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

            Bug ID: 114057
           Summary: [14 Regression] 435.gromacs fails verification on with
                    -Ofast -march=znver4 and PGO
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: needs-bisection, wrong-code
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pheeck at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

Compiling the 2006 CPU benchmark 435.gromacs on an AMD Zen4 CPU with -Ofast
-march=native and PGO using the current trunk GCC and running the benchmark
results in the following miscomparison:

 0002:  3.07684e+02
        3.11606e+02
                  ^

According to https://www.spec.org/cpu2006/Docs/435.gromacs.html
> The gromacs.out results shouldn't differ by more than 1.25% from the reference values
And this difference is ~1.27%.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug target/114057] [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
@ 2024-02-22 17:50 ` pheeck at gcc dot gnu.org
  2024-02-22 18:08 ` pheeck at gcc dot gnu.org
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pheeck at gcc dot gnu.org @ 2024-02-22 17:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

--- Comment #1 from Filip Kastl <pheeck at gcc dot gnu.org> ---
The oldest commit with this miscomparison i found so far is g:eb619490b01baa2f.
The most recent commit without the miscomparison i found so far is
g:405096f908e1ceb0.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug target/114057] [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
  2024-02-22 17:50 ` [Bug target/114057] " pheeck at gcc dot gnu.org
@ 2024-02-22 18:08 ` pheeck at gcc dot gnu.org
  2024-02-23  6:54 ` rguenth at gcc dot gnu.org
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pheeck at gcc dot gnu.org @ 2024-02-22 18:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

--- Comment #2 from Filip Kastl <pheeck at gcc dot gnu.org> ---
Hm, seems like g:eb619490b01baa2f actually doesn't miscompare. My bad.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug target/114057] [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
  2024-02-22 17:50 ` [Bug target/114057] " pheeck at gcc dot gnu.org
  2024-02-22 18:08 ` pheeck at gcc dot gnu.org
@ 2024-02-23  6:54 ` rguenth at gcc dot gnu.org
  2024-02-23  9:44 ` pheeck at gcc dot gnu.org
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-02-23  6:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.0

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
If it is a FP rounding issue then my guess would be the extra reduc_scal_*
patterns in the x86 backend to do more BB reduction vectorization.

I guess it doesn't miscompare with -O3 or -O3 -fno-signed-zeros
-fno-trapping-math -ffinite-math-only -ffp-contract=fast?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug target/114057] [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-02-23  6:54 ` rguenth at gcc dot gnu.org
@ 2024-02-23  9:44 ` pheeck at gcc dot gnu.org
  2024-02-23 10:26 ` [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march=znver4 and PGO after r14-7272-g57f611604e8bab pheeck at gcc dot gnu.org
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pheeck at gcc dot gnu.org @ 2024-02-23  9:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Filip Kastl <pheeck at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fxue at os dot amperecomputing.com
           Keywords|needs-bisection             |

--- Comment #4 from Filip Kastl <pheeck at gcc dot gnu.org> ---
I've bisected this to g:57f611604e8bab67af6c0bcfe6ea88c001408412.


(In reply to Richard Biener from comment #3)
> I guess it doesn't miscompare with -O3 or -O3 -fno-signed-zeros
> -fno-trapping-math -ffinite-math-only -ffp-contract=fast?

I'll take a look.


(In reply to Filip Kastl from comment #2)
> Hm, seems like g:eb619490b01baa2f actually doesn't miscompare. My bad.

Disregard this. g:eb619490b01baa2f *does* miscompare. I had some troubles with
my bisect script.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march=znver4 and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-02-23  9:44 ` pheeck at gcc dot gnu.org
@ 2024-02-23 10:26 ` pheeck at gcc dot gnu.org
  2024-02-25 11:11 ` [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} " pheeck at gcc dot gnu.org
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pheeck at gcc dot gnu.org @ 2024-02-23 10:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Filip Kastl <pheeck at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[14 Regression] 435.gromacs |[14 Regression] 435.gromacs
                   |fails verification on with  |fails verification with
                   |-Ofast -march=znver4 and    |-Ofast -march=znver4 and
                   |PGO                         |PGO after
                   |                            |r14-7272-g57f611604e8bab
          Component|target                      |tree-optimization

--- Comment #5 from Filip Kastl <pheeck at gcc dot gnu.org> ---
Can confirm that with -O3 -march=native PGO nor with -O3 -march=native PGO
-fno-signed-zeros -fno-trapping-math -ffinite-math-only -ffp-contract=fast the
benchmark doesn't miscompare.

Btw I just noticed that the bisection lead to the same commit as the bisection
of pr113833. I suppose that both PRs relate to the same problem.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-02-23 10:26 ` [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march=znver4 and PGO after r14-7272-g57f611604e8bab pheeck at gcc dot gnu.org
@ 2024-02-25 11:11 ` pheeck at gcc dot gnu.org
  2024-03-07 20:38 ` law at gcc dot gnu.org
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pheeck at gcc dot gnu.org @ 2024-02-25 11:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Filip Kastl <pheeck at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[14 Regression] 435.gromacs |[14 Regression] 435.gromacs
                   |fails verification with     |fails verification with
                   |-Ofast -march=znver4 and    |-Ofast
                   |PGO after                   |-march={znver2,znver4} and
                   |r14-7272-g57f611604e8bab    |PGO after
                   |                            |r14-7272-g57f611604e8bab

--- Comment #6 from Filip Kastl <pheeck at gcc dot gnu.org> ---
I just observed the same miscomparison on a Zen2 machine. Exactly the same
values:

0002:  3.07684e+02
       3.11606e+02

So I assume it's the same problem.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2024-02-25 11:11 ` [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} " pheeck at gcc dot gnu.org
@ 2024-03-07 20:38 ` law at gcc dot gnu.org
  2024-03-26 10:59 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: law at gcc dot gnu.org @ 2024-03-07 20:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Jeffrey A. Law <law at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
                 CC|                            |law at gcc dot gnu.org

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2024-03-07 20:38 ` law at gcc dot gnu.org
@ 2024-03-26 10:59 ` rguenth at gcc dot gnu.org
  2024-03-26 11:27 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-26 10:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I guess that bisecting with -fno-vect-cost-model might reveal the actual
offender.  (if it doesn't turn up other problems, of course)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-03-26 10:59 ` rguenth at gcc dot gnu.org
@ 2024-03-26 11:27 ` rguenth at gcc dot gnu.org
  2024-03-26 12:02 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-26 11:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-03-26
             Status|UNCONFIRMED                 |NEW

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed on Zen2 btw.  I'm going to have a closer look here.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-03-26 11:27 ` rguenth at gcc dot gnu.org
@ 2024-03-26 12:02 ` rguenth at gcc dot gnu.org
  2024-03-26 15:32 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-26 12:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Interestingly r14-7272-g57f611604e8bab causes quite some BB vectorization
cases to be rejected - I would have expected it to only get us more
vectorization?

-innerf.f:277:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:281:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:284:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:323:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:327:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:330:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:630:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:634:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:637:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:679:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:683:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:686:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1283:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1287:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1290:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1358:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1361:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2004:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2008:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2011:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2082:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2086:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2089:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2365:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2369:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2372:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3347:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3351:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3390:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3394:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3397:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3434:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3438:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3441:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4477:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4481:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4484:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4520:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4524:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4527:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4567:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4571:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4574:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5702:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5706:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5709:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5745:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5749:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5752:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5815:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5819:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5822:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:6114:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:6118:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:6127:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7050:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7054:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7057:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7093:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7097:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7100:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7170:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7174:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7177:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:7479:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:7483:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:7492:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8280:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8284:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8287:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9306:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9310:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9313:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9350:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9357:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9394:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9398:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9401:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10483:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10487:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10490:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10527:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10531:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10534:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10574:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10578:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10581:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11751:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11755:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11758:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11795:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11799:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11802:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11865:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11869:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11872:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:12173:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:12177:72: optimized: basic block part vectorized using 32 byte
vectors
+innerf.f:12186:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13136:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13140:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13143:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13180:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13184:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13187:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13257:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13261:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13264:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:13574:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:13578:72: optimized: basic block part vectorized using 32 byte
vectors
+innerf.f:13587:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14631:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14635:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14638:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15896:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15900:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15903:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15953:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15957:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15960:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15998:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:16002:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:16005:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17325:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17329:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17332:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17382:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17386:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17389:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17429:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17433:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17436:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19038:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19042:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19045:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19095:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19099:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19102:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19164:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19168:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19171:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20844:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20848:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20851:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20901:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20905:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20908:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20977:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20981:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20984:72: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
-clincs.c:65:12: optimized: basic block part vectorized using 8 byte vectors
-clincs.c:151:12: optimized: basic block part vectorized using 8 byte vectors
-vec.h:537:15: optimized: basic block part vectorized using 32 byte vectors
-vec.h:453:15: optimized: basic block part vectorized using 32 byte vectors
-coupling.c:188:10: optimized: basic block part vectorized using 8 byte vectors
-vec.h:426:15: optimized: basic block part vectorized using 32 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-update.c:266:16: optimized: basic block part vectorized using 8 byte vectors

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2024-03-26 12:02 ` rguenth at gcc dot gnu.org
@ 2024-03-26 15:32 ` rguenth at gcc dot gnu.org
  2024-03-26 15:34 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-26 15:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the ref output is

-3.22397e+05
3.07684e+02
1.06621e+10

and before the change we get

-3.22205e+05
3.05161e+02
1.06660e+10

while after it is

-3.22401e+05
3.11606e+02
1.06579e+10

vectorization differences show in innerf.o, bondfree.o, clincs.o, coupling.o,
disre.o and update.o while all but innerf.o show only less vectorization.
Only using the "bad" version of innerf.o gets us

-3.23378e+05
3.08348e+02
1.06697e+10

which should still PASS.  Replacing all above TUs with the bad objects
reproduces the bad output.

Replacing update.o, disre.o, coupling.o or clincs.o with the GOOD version
doesn't change the output, so it's only innerf.o and bondfree.o making
a difference.  Using only BAD bondfree.o gives

-3.22265e+05
3.07882e+02
1.06644e+10

That would have been OK as well.

The bondfree.o change is small:

@@ -2720,9 +2588,6 @@
 vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
 bondfree.c:806:16: optimized: basic block part vectorized using 8 byte vectors
 vec.h:239:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:265:8: optimized: basic block part vectorized using 8 byte vectors

while the innerf.o changes are many (but possibly similar).

I will see to understand the bondfree change first.  That's the following
change in the function idihs:

 vec.h:380:9: note: Cost model analysis for part in loop 1:
-  Vector cost: 624
-  Scalar cost: 700
-vec.h:380:9: note: Basic block will be vectorized using SLP
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: note: Vectorizing SLP tree:
-vec.h:380:9: note: node 0x345f188 (max_nunits=2, refcnt=1) vector(2) float
+  Vector cost: 640
+  Scalar cost: 532
+vec.h:380:9: missed: not vectorized: vectorization is not profitable.

where it basically changes what nodes we think are live.  Note this is
a larger graph with multiple instances so we might suffer from
what I noted in PR114413.

The IL has all but the call to do_dih_fup inlined into idihs.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2024-03-26 15:32 ` rguenth at gcc dot gnu.org
@ 2024-03-26 15:34 ` rguenth at gcc dot gnu.org
  2024-03-26 15:35 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-26 15:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 57816
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57816&action=edit
manually reduced preprocessed source

This is the TU reduced to idihs where I put flatten on, with just -Ofast
-march=znver2 one can reproduce the vectorization difference when reverting
the offending revision ontop of trunk.

I'll have a closer look tomorrow.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2024-03-26 15:34 ` rguenth at gcc dot gnu.org
@ 2024-03-26 15:35 ` rguenth at gcc dot gnu.org
  2024-03-27 10:36 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-26 15:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2024-03-26 15:35 ` rguenth at gcc dot gnu.org
@ 2024-03-27 10:36 ` rguenth at gcc dot gnu.org
  2024-03-27 11:50 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-27 10:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so I think the change is that we get to "correctly" notice

-vec.h:380:9: note: node (external) 0x6a2e9d8 (max_nunits=2, refcnt=1)
vector(2) float
-vec.h:380:9: note:     stmt 0 _164 = MEM[(const real *)_27 + 8B];
-vec.h:380:9: note:     stmt 1 _158 = MEM[(const real *)_27];
+vec.h:380:9: note: node (external) 0x5a823a8 (max_nunits=2, refcnt=1)
vector(2) float
+vec.h:380:9: note:     [l] stmt 0 _164 = MEM[(const real *)_27 + 8B];
+vec.h:380:9: note:     [l] stmt 1 _158 = MEM[(const real *)_27];

for the loads we do not handle because of gaps and promoted external.  That
leads to extra costs.

But also

+vec.h:380:9: note: node 0x5a81770 (max_nunits=2, refcnt=2) vector(2) float
 vec.h:380:9: note: op template: x_160 = _158 - _159;
 vec.h:380:9: note:     stmt 0 x_160 = _158 - _159;
-vec.h:380:9: note:     [l] stmt 1 y_163 = _161 - _162;
+vec.h:380:9: note:     stmt 1 y_163 = _161 - _162;

so y_163 isn't considered live for some reason.  We find

_123 = _117 * y_163;

is vectorized as part of a reduction.  On the costing side we then see

-_161 - _162 1 times scalar_stmt costs 12 in body
-MEM[(const real *)_27 + 4B] 1 times scalar_load costs 12 in body
-MEM[(const real *)_24 + 4B] 1 times scalar_load costs 12 in body

which is the live (and dependent) stmts no longer costed on the scalar
side but also

+MEM[(const real *)_27 + 8B] 1 times vec_to_scalar costs 4 in epilogue
+MEM[(const real *)_24 + 8B] 1 times vec_to_scalar costs 4 in epilogue

costed in the vector epilog.  This is because we're conservative as we
don't really know whether we'll be able to code-generate the live
operation.  The costing side here is also not in sync as can be seen
from the _161 - _162 op removed.

I should also note that the setting of PURE_SLP is done a bit too early,
before we analyze operations and eventually throw away instances or
prune it by promoting ops external.

For reductions we also falsely claim all root stmts are vectorized - we
do have remain ops.  Fixing this restores the LIVE on them and in some
way restores vectorization.

I'm going to test this as fix for now.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2024-03-27 10:36 ` rguenth at gcc dot gnu.org
@ 2024-03-27 11:50 ` cvs-commit at gcc dot gnu.org
  2024-03-27 11:51 ` rguenth at gcc dot gnu.org
  2024-03-28  7:04 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-03-27 11:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

--- Comment #13 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:0b02da5b99e89347f5f8bf875ec8318f84adff18

commit r14-9687-g0b02da5b99e89347f5f8bf875ec8318f84adff18
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Mar 27 11:37:16 2024 +0100

    tree-optimization/114057 - handle BB reduction remain defs as LIVE

    The following makes sure to record the scalars we add to the BB
    reduction vectorization result as scalar uses for the purpose of
    computing live lanes.  This restores vectorization in the
    bondfree.c TU of 435.gromacs.

            PR tree-optimization/114057
            * tree-vect-slp.cc (vect_bb_slp_mark_live_stmts): Mark
            BB reduction remain defs as scalar uses.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2024-03-27 11:50 ` cvs-commit at gcc dot gnu.org
@ 2024-03-27 11:51 ` rguenth at gcc dot gnu.org
  2024-03-28  7:04 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-27 11:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
I wasn't able to reproduce the miscompare on a Zen4 machine (but with
-march=znver2).  But the original vectorizataion of bondfree.c should be
restored and thus the miscompare gone.  I'll verify tomorrow when I'm back
at the machine I was able to reproduce the issue.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
  2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2024-03-27 11:51 ` rguenth at gcc dot gnu.org
@ 2024-03-28  7:04 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-28  7:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed fixed.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-03-28  7:04 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-22 16:19 [Bug target/114057] New: [14 Regression] 435.gromacs fails verification on with -Ofast -march=znver4 and PGO pheeck at gcc dot gnu.org
2024-02-22 17:50 ` [Bug target/114057] " pheeck at gcc dot gnu.org
2024-02-22 18:08 ` pheeck at gcc dot gnu.org
2024-02-23  6:54 ` rguenth at gcc dot gnu.org
2024-02-23  9:44 ` pheeck at gcc dot gnu.org
2024-02-23 10:26 ` [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march=znver4 and PGO after r14-7272-g57f611604e8bab pheeck at gcc dot gnu.org
2024-02-25 11:11 ` [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} " pheeck at gcc dot gnu.org
2024-03-07 20:38 ` law at gcc dot gnu.org
2024-03-26 10:59 ` rguenth at gcc dot gnu.org
2024-03-26 11:27 ` rguenth at gcc dot gnu.org
2024-03-26 12:02 ` rguenth at gcc dot gnu.org
2024-03-26 15:32 ` rguenth at gcc dot gnu.org
2024-03-26 15:34 ` rguenth at gcc dot gnu.org
2024-03-26 15:35 ` rguenth at gcc dot gnu.org
2024-03-27 10:36 ` rguenth at gcc dot gnu.org
2024-03-27 11:50 ` cvs-commit at gcc dot gnu.org
2024-03-27 11:51 ` rguenth at gcc dot gnu.org
2024-03-28  7:04 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).