[Bug tree-optimization/97984] New: [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add

* [Bug tree-optimization/97984] New: [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
@ 2020-11-25 11:49 ktkachov at gcc dot gnu.org
  2020-11-25 12:59 ` [Bug tree-optimization/97984] " rguenth at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: ktkachov at gcc dot gnu.org @ 2020-11-25 11:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

            Bug ID: 97984
           Summary: [10/11 Regression] Worse code for -O3 than -O2 on
                    aarch64 vector multiply-add
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

The code:
void x (long * __restrict a, long * __restrict b)
{
  a[0] *= b[0];
  a[1] *= b[1];
  a[0] += b[0];
  a[1] += b[1];
}

at -O2 generates:
x:
        ldp     x4, x3, [x0]
        ldp     x2, x1, [x1]
        madd    x2, x2, x4, x2
        madd    x1, x1, x3, x1
        stp     x2, x1, [x0]
        ret

whereas at -O3 it does:
x:
        ldp     x2, x3, [x0]
        ldr     x4, [x1]
        ldr     q1, [x1]
        mul     x2, x2, x4
        ldr     x4, [x1, 8]
        fmov    d0, x2
        ins     v0.d[1], x3
        mul     x1, x3, x4
        ins     v0.d[1], x1
        add     v0.2d, v0.2d, v1.2d
        str     q0, [x0]
        ret

which is clearly inferior.
GCC 9 generated the good code at both -O2 and -O3.
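
The -O2 sequence works because each lane reduces to a single multiply-add: `madd x2, x2, x4, x2` computes x2 * x4 + x2, that is b[0] * a[0] + b[0]. The following is a minimal sketch of that equivalence, not part of the original report. The names `x_ref` and `check_x` are hypothetical helpers introduced here for illustration, and the input values are arbitrary small numbers chosen to avoid overflow.

```c
#include <assert.h>

/* Test case from the report, unchanged.  */
void x (long * __restrict a, long * __restrict b)
{
  a[0] *= b[0];
  a[1] *= b[1];
  a[0] += b[0];
  a[1] += b[1];
}

/* Hypothetical reference version: each lane written as one
   multiply-add, a[i] = b[i] * a[i] + b[i].  This is the shape of
   the two madd instructions in the -O2 output.  */
void x_ref (long *a, const long *b)
{
  for (int i = 0; i < 2; i++)
    a[i] = b[i] * a[i] + b[i];
}

/* Hypothetical check: both versions agree on a sample input.  */
void check_x (void)
{
  long a[2] = { 3, 4 };
  long r[2] = { 3, 4 };
  long b[2] = { 5, 6 };

  x (a, b);
  x_ref (r, b);

  assert (a[0] == r[0]);
  assert (a[1] == r[1]);
}
```

Calling `check_x ()` from a test driver exercises both versions. It only confirms the arithmetic identity and says nothing about which instructions a given compiler emits.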

* [Bug tree-optimization/97984] [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: rguenth at gcc dot gnu.org @ 2020-11-25 12:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unknown                     |11.0
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2020-11-25
   Target Milestone|---                         |10.3
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
We vectorize the add but not the multiplication.  FMA discovery comes after
vectorization, so it can't inhibit the transform (and vectorizer costing cannot
factor that in).  Is there a vector madd available?

The arm vectorizer costing could honor the fact that there are ldp/stp
instructions and thus not artificially make a vector load cheaper than two
adjacent scalar loads.  That would only make the costings equal, though.

0x3151930 *b_11(D) 1 times unaligned_load (misalign -1) costs 1 in body
0x3151930 _2 + _3 1 times vector_stmt costs 1 in body
0x3151930 <unknown> 1 times vec_construct costs 2 in prologue
0x3151930 _7 1 times unaligned_store (misalign -1) costs 1 in body
0x31569d0 _7 1 times scalar_store costs 1 in body
0x31569d0 _8 1 times scalar_store costs 1 in body
0x31569d0 _2 + _3 1 times scalar_stmt costs 1 in body
0x31569d0 _5 + _6 1 times scalar_stmt costs 1 in body
0x31569d0 *b_11(D) 1 times scalar_load costs 1 in body
0x31569d0 MEM[(long int *)b_11(D) + 8B] 1 times scalar_load costs 1 in body
t.c:5:8: note: Cost model analysis:
  Vector inside of basic block cost: 3
  Vector prologue cost: 2
  Vector epilogue cost: 0
  Scalar cost of basic block: 6
t.c:5:8: note: Basic block will be vectorized using SLP
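
The totals in the dump can be re-derived by summing the per-statement costs. The sketch below does only that arithmetic; it is not GCC's implementation. The function names are hypothetical, and the "vector cost strictly below scalar cost" test is an assumed simplification of the real profitability check. The second case, where the vector load is costed like two scalar loads, is one hypothetical reading of the adjustment suggested in the comment.

```c
#include <assert.h>

/* Per-statement costs copied from the dump above.  */
enum
{
  VEC_ADD = 1,        /* vector_stmt, in body */
  VEC_STORE = 1,      /* unaligned_store, in body */
  VEC_CONSTRUCT = 2,  /* vec_construct, in prologue */
  SCALAR_LOAD = 1,
  SCALAR_ADD = 1,
  SCALAR_STORE = 1
};

/* Total vector cost (body + prologue + epilogue of 0) for a given
   cost of the single vector load.  The dump uses a load cost of 1.  */
int vector_cost (int vec_load_cost)
{
  return vec_load_cost + VEC_ADD + VEC_STORE + VEC_CONSTRUCT;
}

/* Scalar cost of the basic block: two loads, two adds, two stores.  */
int scalar_cost (void)
{
  return 2 * SCALAR_LOAD + 2 * SCALAR_ADD + 2 * SCALAR_STORE;
}

/* Assumed simplification: vectorize only if strictly cheaper.  */
int slp_looks_profitable (int vec, int scalar)
{
  return vec < scalar;
}

/* Hypothetical check of the numbers in the dump.  */
void check_costs (void)
{
  /* As dumped: 3 in body + 2 in prologue = 5, against 6 scalar.  */
  assert (vector_cost (1) == 5);
  assert (scalar_cost () == 6);
  assert (slp_looks_profitable (vector_cost (1), scalar_cost ()));

  /* Hypothetical: vector load costed like two scalar loads.
     The two totals then come out equal.  */
  assert (vector_cost (2 * SCALAR_LOAD) == scalar_cost ());
}
```

With the dumped costs the vector version totals 5 against a scalar 6. With the hypothetical load cost of 2 both sides are 6.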

* [Bug target/97984] [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: rguenth at gcc dot gnu.org @ 2021-01-14  9:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
          Component|tree-optimization           |target

* [Bug target/97984] [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: rguenth at gcc dot gnu.org @ 2021-04-08 12:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.3                        |10.4

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10.3 is being released, retargeting bugs to GCC 10.4.

* [Bug target/97984] [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: pinskia at gcc dot gnu.org @ 2021-11-22  0:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |needs-bisection
      Known to work|                            |12.0
            Summary|[10/11/12 Regression] Worse |[10/11 Regression] Worse
                   |code for -O3 than -O2 on    |code for -O3 than -O2 on
                   |aarch64 vector multiply-add |aarch64 vector multiply-add

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The cost model on the trunk seems to have been fixed.

* [Bug target/97984] [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: marxin at gcc dot gnu.org @ 2021-12-08 14:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marxin at gcc dot gnu.org

--- Comment #4 from Martin Liška <marxin at gcc dot gnu.org> ---
Hm, can't replicate for GCC 10, do you use any -mtune or so?

* [Bug target/97984] [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: pinskia at gcc dot gnu.org @ 2021-12-08 19:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Martin Liška from comment #4)
> Hm, can't replicate for GCC 10, do you use any -mtune or so?

I can reproduce worse code for GCC 10 at -O3 -mtune=generic:

        ldp     x2, x3, [x0]
        ldr     x4, [x1]
        ldr     q1, [x1]
        mul     x2, x2, x4
        ldr     x4, [x1, 8]
        fmov    d0, x2
        ins     v0.d[1], x3
        mul     x1, x3, x4
        ins     v0.d[1], x1
        add     v0.2d, v0.2d, v1.2d
        str     q0, [x0]

But with -O3 -mtune=cortex-a57, the decent code is generated.

* [Bug target/97984] [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: marxin at gcc dot gnu.org @ 2021-12-09 10:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

--- Comment #6 from Martin Liška <marxin at gcc dot gnu.org> ---
OK, so it started on the GCC 10 branch with
r10-4677-g60838d634634a70d65a126166c944b159ac7649c.

* [Bug target/97984] [10/11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: jakub at gcc dot gnu.org @ 2022-06-28 10:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.4                        |10.5

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 10.4 is being released, retargeting bugs to GCC 10.5.

* [Bug target/97984] [11 Regression] Worse code for -O3 than -O2 on aarch64 vector multiply-add
From: rguenth at gcc dot gnu.org @ 2023-07-07 10:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97984

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.5                        |11.5

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 branch is being closed.
