[Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations

public inbox for gcc-bugs@sourceware.org

* [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations
@ 2014-10-09 21:56 e.menezes at samsung dot com
  2014-10-09 22:01 ` [Bug target/63503] " pinskia at gcc dot gnu.org
                   ` (23 more replies)
  0 siblings, 24 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-09 21:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

            Bug ID: 63503
           Summary: [AArch64] A57 executes fused multiply-add poorly in
                    some situations
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com
                CC: spop at gcc dot gnu.org
            Target: aarch64-*

Curious why Geekbench's {D,S}GEMM built by GCC were 8-9% slower than when built
by LLVM, I was baffled to find that the code emitted by GCC for the innermost
loop in the algorithm core is actually very good:

.L8:
    ldr d2, [x8, w5, uxtw 3]
    ldr d1, [x7, w5, uxtw 3]
    add w5, w5, 1
    cmp w5, w6
    fmadd   d0, d2, d1, d0
    bne .L8

LLVM's code is not so neat:

.LBB0_10:
    ldr d1, [x27, x22, lsl #3]
    ldr d2, [x9, x22, lsl #3]
    fmul    d1, d1, d2
    fadd    d0, d0, d1
    add w21, w21, #1
    add x22, x22, #1
    cmp w21, w24, uxtw
    b.ne .LBB0_10

However, it runs faster.

Methinks that the A57 microarchitecture is performing tricks for discrete FP
operations but not for fused multiply-add, since both code sequences are
semantically the same.  Whatever it is, it seems that fused multiply-add, and
perhaps its cousins, is actually a performance hit only when one depends on the
results of a previous one, as in this case on the results of the fused
operation in the previous loop iteration.

I'll try to create a simple test-case, but, in the meantime, please chime in
about your thoughts.



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
@ 2014-10-09 22:01 ` pinskia at gcc dot gnu.org
  2014-10-09 22:02 ` pinskia at gcc dot gnu.org
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-10-09 22:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This might be true for A57 but for our chip (ThunderX), using fused
multiply-add is better.

The other question here is: are there denormals happening?  That might cause
some performance differences between using fmadd and fmul/fadd.
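
One way to check the inputs for denormals is sketched below. This is only an illustration: the helper name count_subnormals is hypothetical, and it inspects the input arrays only, not the intermediate products or sums.

```c
#include <float.h>
#include <stddef.h>

/* Hypothetical helper: count subnormal (denormal) values in an array.
   A nonzero double whose magnitude is below DBL_MIN is subnormal.
   NaNs fail both range comparisons and are therefore not counted. */
size_t count_subnormals(const double *v, size_t n)
{
  size_t count = 0;
  size_t i;

  for (i = 0; i < n; i++) {
    double x = v[i];
    if (x != 0.0 && x > -DBL_MIN && x < DBL_MIN)
      count++;
  }
  return count;
}
```

Running it over both operand arrays before the timed loop would show whether the inputs contain any denormals.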

On most normal processors using fused multiply-add is an improvement also.

Can you attach the preprocessed source and what options you are using?



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
  2014-10-09 22:01 ` [Bug target/63503] " pinskia at gcc dot gnu.org
@ 2014-10-09 22:02 ` pinskia at gcc dot gnu.org
  2014-10-09 22:05 ` e.menezes at samsung dot com
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-10-09 22:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The other option is that it is the fusion of the cmp and branch which is causing
the improvement.

Can you manually edit the assembly and swap the cmp and fmadd in the GCC output
and try again?



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
  2014-10-09 22:01 ` [Bug target/63503] " pinskia at gcc dot gnu.org
  2014-10-09 22:02 ` pinskia at gcc dot gnu.org
@ 2014-10-09 22:05 ` e.menezes at samsung dot com
  2014-10-09 22:14 ` e.menezes at samsung dot com
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-09 22:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #3 from Evandro Menezes <e.menezes at samsung dot com> ---
(In reply to Andrew Pinski from comment #1)
> The other question here is: are there denormals happening?  That might cause
> some performance differences between using fmadd and fmul/fadd.

Nope, no denormals.



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (2 preceding siblings ...)
  2014-10-09 22:05 ` e.menezes at samsung dot com
@ 2014-10-09 22:14 ` e.menezes at samsung dot com
  2014-10-09 23:08 ` pinskia at gcc dot gnu.org
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-09 22:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #4 from Evandro Menezes <e.menezes at samsung dot com> ---
Here's a simplified code to reproduce these results:

double sum(double *A, double *B, int n) 
{
  int i;
  double res = 0;

  for (i = 0; i < n; i++)
    res += A [i] * B [i];

  return res;
}
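
A minimal driver for this function might look like the sketch below. The name run_sum and the fill values are assumptions made for illustration; the test-case attached later in this bug may differ.

```c
#include <stdlib.h>

/* Function from the comment above, reproduced so the sketch is self-contained. */
double sum(double *A, double *B, int n)
{
  int i;
  double res = 0;

  for (i = 0; i < n; i++)
    res += A [i] * B [i];

  return res;
}

/* Hypothetical driver: fill two arrays with constants and reduce them.
   Returns -1.0 if an allocation fails. */
double run_sum(int n)
{
  double *A = malloc((size_t)n * sizeof *A);
  double *B = malloc((size_t)n * sizeof *B);
  double res = -1.0;
  int i;

  if (A && B) {
    for (i = 0; i < n; i++) {
      A[i] = 1.0;
      B[i] = 2.0;
    }
    res = sum(A, B, n);
  }
  free(A);
  free(B);
  return res;
}
```

With these fill values the result is exactly 2.0 * n, which makes it easy to confirm that differently optimized builds still compute the same thing.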



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (3 preceding siblings ...)
  2014-10-09 22:14 ` e.menezes at samsung dot com
@ 2014-10-09 23:08 ` pinskia at gcc dot gnu.org
  2014-10-10 13:24 ` wdijkstr at arm dot com
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-10-09 23:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Also how sure are you that it is the fused multiply-add and not the scheduling
of the instructions?  As I mentioned, try swapping the cmp and fmadd; you might
get a performance boost.



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (4 preceding siblings ...)
  2014-10-09 23:08 ` pinskia at gcc dot gnu.org
@ 2014-10-10 13:24 ` wdijkstr at arm dot com
  2014-10-10 13:59 ` ramana at gcc dot gnu.org
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-10 13:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Wilco <wdijkstr at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wdijkstr at arm dot com

--- Comment #6 from Wilco <wdijkstr at arm dot com> ---
I ran the assembler examples on A57 hardware with identical input. The FMADD
code is ~20% faster irrespectively of the size of the input. This is not a
surprise given that the FMADD latency is lower than the FADD and FMUL latency.

The alignment of the loop or scheduling don't matter at all as the FMADD
latency dominates by far - with serious optimization this code could run 4-5
times as fast and would only be limited by memory bandwidth on datasets larger
than L2.

So this particular example shows issues in LLVM, not in GCC.



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (5 preceding siblings ...)
  2014-10-10 13:24 ` wdijkstr at arm dot com
@ 2014-10-10 13:59 ` ramana at gcc dot gnu.org
  2014-10-14 20:21 ` e.menezes at samsung dot com
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: ramana at gcc dot gnu.org @ 2014-10-10 13:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2014-10-10
     Ever confirmed|0                           |1

--- Comment #7 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---

(In reply to Wilco from comment #6)
> I ran the assembler examples on A57 hardware with identical input. The FMADD
> code is ~20% faster irrespectively of the size of the input. This is not a
> surprise given that the FMADD latency is lower than the FADD and FMUL
> latency.
> 
> The alignment of the loop or scheduling don't matter at all as the FMADD
> latency dominates by far - with serious optimization this code could run 4-5
> times as fast and would only be limited by memory bandwidth on datasets
> larger than L2.
> 
> So this particular example shows issues in LLVM, not in GCC.

The reason why GCC puts out an fma while LLVM doesn't is probably the default
language standards. GCC defaults to GNU89 while LLVM defaults to C99. If you
used -std=c99 with GCC as well you'd get the same sequence as LLVM.

As Evandro doesn't mention flags it's hard to say whether there really is a
problem here or not.

I only know of a separate gotcha with fmadds which is unfortunate but that's
not relevant to this discussion. 

http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/200282

This probably needs more analysis than the current state.



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (6 preceding siblings ...)
  2014-10-10 13:59 ` ramana at gcc dot gnu.org
@ 2014-10-14 20:21 ` e.menezes at samsung dot com
  2014-10-14 22:38 ` e.menezes at samsung dot com
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-14 20:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #8 from Evandro Menezes <e.menezes at samsung dot com> ---
(In reply to Ramana Radhakrishnan from comment #7)
> As Evandro doesn't mention flags it's hard to say whether there really is a
> problem here or not.

Both GCC and LLVM were given "-O3 -ffast-math".



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (7 preceding siblings ...)
  2014-10-14 20:21 ` e.menezes at samsung dot com
@ 2014-10-14 22:38 ` e.menezes at samsung dot com
  2014-10-21 17:47 ` wdijkstr at arm dot com
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-14 22:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #9 from Evandro Menezes <e.menezes at samsung dot com> ---
(In reply to Wilco from comment #6)
> I ran the assembler examples on A57 hardware with identical input. The FMADD
> code is ~20% faster irrespectively of the size of the input. This is not a
> surprise given that the FMADD latency is lower than the FADD and FMUL
> latency.

I ran the same Geekbench binaries on A53 and the result is about the same
between the GCC and the LLVM code, albeit with a slight (< 1%) advantage for GCC.



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (8 preceding siblings ...)
  2014-10-14 22:38 ` e.menezes at samsung dot com
@ 2014-10-21 17:47 ` wdijkstr at arm dot com
  2014-10-21 18:35 ` pinskia at gcc dot gnu.org
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-21 17:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #10 from Wilco <wdijkstr at arm dot com> ---
The loops shown are not the correct inner loops for those options - with
-ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
question is why GCC doesn't unroll vectorized loops like LLVM?

GCC:

.L24:
    ldr    q3, [x13, x5]
    add    x6, x6, 1
    ldr    q2, [x16, x5]
    cmp    x6, x12
    add    x5, x5, 16
    fmla    v1.2d, v3.2d, v2.2d
    bcc    .L24

LLVM:

.LBB2_12:
    ldur    q2, [x8, #-16]
    ldr    q3, [x8], #32
    ldur    q4, [x21, #-16]
    ldr    q5, [x21], #32
    fmla    v1.2d, v2.2d, v4.2d
    fmla    v0.2d, v3.2d, v5.2d
    sub    x30, x30, #4            // =4
    cbnz    x30, .LBB2_12
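
At the source level, the difference between the two loops can be pictured roughly as below. This is a scalar sketch, not what either compiler emits: the name sum2 is hypothetical, and splitting the sum changes the rounding order, which is presumably why the compilers only do it under -ffast-math.

```c
/* Hypothetical C-level picture of a 2x-unrolled reduction: two independent
   accumulators, so consecutive multiply-adds do not wait on each other. */
double sum2(const double *A, const double *B, int n)
{
  double acc0 = 0, acc1 = 0;
  int i;

  for (i = 0; i + 1 < n; i += 2) {
    acc0 += A[i] * B[i];
    acc1 += A[i + 1] * B[i + 1];
  }
  if (i < n)                 /* odd element left over */
    acc0 += A[i] * B[i];

  return acc0 + acc1;        /* combine the partial sums once, after the loop */
}
```

The single-accumulator form has one dependency chain through the whole loop; here there are two shorter ones that can overlap.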



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (9 preceding siblings ...)
  2014-10-21 17:47 ` wdijkstr at arm dot com
@ 2014-10-21 18:35 ` pinskia at gcc dot gnu.org
  2014-10-21 21:41 ` e.menezes at samsung dot com
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-10-21 18:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Wilco from comment #10)
> The loops shown are not the correct inner loops for those options - with
> -ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
> question is why GCC doesn't unroll vectorized loops like LLVM?

Because unrolling is not enabled at -O3.  Try adding -funroll-loops.



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (10 preceding siblings ...)
  2014-10-21 18:35 ` pinskia at gcc dot gnu.org
@ 2014-10-21 21:41 ` e.menezes at samsung dot com
  2014-10-22 12:13 ` wdijkstr at arm dot com
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-21 21:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #12 from Evandro Menezes <e.menezes at samsung dot com> ---
Created attachment 33774
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33774&action=edit
Simple test-case



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (11 preceding siblings ...)
  2014-10-21 21:41 ` e.menezes at samsung dot com
@ 2014-10-22 12:13 ` wdijkstr at arm dot com
  2014-10-22 16:54 ` e.menezes at samsung dot com
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-22 12:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #13 from Wilco <wdijkstr at arm dot com> ---
(In reply to Andrew Pinski from comment #11)
> (In reply to Wilco from comment #10)
> > The loops shown are not the correct inner loops for those options - with
> > -ffast-math they are vectorized. LLVM unrolls 2x but GCC doesn't. So the
> > question is why GCC doesn't unroll vectorized loops like LLVM?
> 
> Because unrolling is not enabled at -O3.  Try adding -funroll-loops.

Isn't it odd that GCC doesn't even do the most basic unrolling at its maximum
optimization setting? But it does do vectorization?

Note -funroll-loops is not sufficient either, you need
-fvariable-expansion-in-unroller as well for this particular loop which also
isn't enabled at -O3. Plus setting the associated param to 4 or 8.

So GCC is certainly capable of generating quality code for this example, it
just doesn't do so by default - unlike LLVM.
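
For reference, the options above can be combined as in the comment at the top of the sketch below. The file name test.c is a placeholder, and the param is assumed to be max-variable-expansions-in-unroller, which is the one GCC documents alongside -fvariable-expansion-in-unroller. The function underneath is only a hand-written picture of what expanding the accumulator four times amounts to; the name sum_expanded is hypothetical and this is not actual compiler output.

```c
/* Build sketch (GCC), all on one command line:
     gcc -O3 -ffast-math -funroll-loops -fvariable-expansion-in-unroller
         --param max-variable-expansions-in-unroller=4 test.c
   test.c is a placeholder file name. */

/* Hand-written picture of variable expansion: the single accumulator is
   split into four copies that are only summed after the loop. */
double sum_expanded(const double *A, const double *B, int n)
{
  double res[4] = {0, 0, 0, 0};
  int i, k;

  for (i = 0; i + 3 < n; i += 4)
    for (k = 0; k < 4; k++)
      res[k] += A[i + k] * B[i + k];

  for (; i < n; i++)         /* remaining 0-3 elements */
    res[0] += A[i] * B[i];

  return (res[0] + res[1]) + (res[2] + res[3]);
}
```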



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (12 preceding siblings ...)
  2014-10-22 12:13 ` wdijkstr at arm dot com
@ 2014-10-22 16:54 ` e.menezes at samsung dot com
  2014-10-22 17:58 ` wdijkstr at arm dot com
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-22 16:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #14 from Evandro Menezes <e.menezes at samsung dot com> ---
Compiling the test-case above with just -O2, I can reproduce the code I
mentioned initially and easily measure the cycle count to run it on target
using perf.

The binary created by GCC runs in about 447000 user cycles and the one created
by LLVM, in about 499000 user cycles.  IOW, fused multiply-add is a win on A57.
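
To put the two counts in relative terms, the arithmetic is sketched below; percent_diff is a hypothetical helper and the inputs are the cycle counts quoted above. By this measure the LLVM binary takes about 11.6% more cycles than the GCC one.

```c
/* Relative difference of b with respect to a, in percent. */
double percent_diff(double a, double b)
{
  return (b - a) / a * 100.0;
}

/* Cycle counts from this comment: GCC build first, LLVM build second. */
double llvm_vs_gcc_percent(void)
{
  return percent_diff(447000.0, 499000.0);
}
```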

Looking further why Geekbench's {D,S}GEMM performs worse with GCC than with
LLVM, both using "-Ofast", GCC fails to vectorize the loop in
"gemm_block_kernel", while LLVM does.

I should've done a more detailed analysis in this issue before submitting this
bug, sorry.


* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (13 preceding siblings ...)
  2014-10-22 16:54 ` e.menezes at samsung dot com
@ 2014-10-22 17:58 ` wdijkstr at arm dot com
  2014-10-22 23:30 ` e.menezes at samsung dot com
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-22 17:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #15 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro Menezes from comment #14)
> Compiling the test-case above with just -O2, I can reproduce the code I
> mentioned initially and easily measure the cycle count to run it on target
> using perf.
> 
> The binary created by GCC runs in about 447000 user cycles and the one
> created by LLVM, in about 499000 user cycles.  IOW, fused multiply-add is a
> win on A57.
> 
> Looking further why Geekbench's {D,S}GEMM performs worse with GCC than with
> LLVM, both using "-Ofast", GCC fails to vectorize the loop in
> "gemm_block_kernel", while LLVM does.
>   
> I should've done a more detailed analysis in this issue before submitting
> this bug, sorry.

Using -Ofast is not any different from -O3 -ffast-math when compiling
non-Fortran code. As comment 10 shows, both loops are vectorized, however LLVM
unrolls twice and uses multiple accumulators while GCC doesn't.

I still don't see what this has to do with A57. You should open a generic bug
about GCC not applying basic loop optimizations with -O3 (in fact limited
unrolling is useful even for -O2).



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (15 preceding siblings ...)
  2014-10-22 23:30 ` e.menezes at samsung dot com
@ 2014-10-22 23:30 ` e.menezes at samsung dot com
  2014-10-22 23:59 ` e.menezes at samsung dot com
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-22 23:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #17 from Evandro <e.menezes at samsung dot com> ---
Created attachment 33785
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33785&action=edit
Simple matrix multiplication



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (14 preceding siblings ...)
  2014-10-22 17:58 ` wdijkstr at arm dot com
@ 2014-10-22 23:30 ` e.menezes at samsung dot com
  2014-10-22 23:30 ` e.menezes at samsung dot com
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-22 23:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #16 from Evandro <e.menezes at samsung dot com> ---
(In reply to Wilco from comment #15)
> Using -Ofast is not any different from -O3 -ffast-math when compiling
> non-Fortran code. As comment 10 shows, both loops are vectorized, however
> LLVM unrolls twice and uses multiple accumulators while GCC doesn't.

You're right.  LLVM produces:

.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add      x11, x9, x8
        add      x12, x10, x8
        ldp      q2, q3, [x11]
        ldp      q4, q5, [x12]
        add      x8, x8, #32             // =32
        fmla     v0.2d, v2.2d, v4.2d
        fmla     v1.2d, v3.2d, v5.2d
        cmp      x8, #128, lsl #12      // =524288
        b.ne    .LBB0_1

And GCC:

.L3:
        ldr     q2, [x2, x0]
        add     w1, w1, 1
        ldr     q1, [x3, x0]
        cmp     w1, w4
        add     x0, x0, 16
        fmla    v0.2d, v2.2d, v1.2d
        bcc     .L3

> I still don't see what this has to do with A57. You should open a generic
> bug about GCC not applying basic loop optimizations with -O3 (in fact
> limited unrolling is useful even for -O2).

Indeed, but I think that there's still a code-generation opportunity for A57
here.

Note above that LLVM loads the registers in pairs, while GCC, when it unrolls
the loop (more aggressively, BTW), loads each vector individually:

.L3:
        ldr     q28, [x15, x16]
        add     x17, x16, 16
        ldr     q29, [x14, x16]
        add     x0, x16, 32
        ldr     q30, [x15, x17]
        add     x18, x16, 48
        ldr     q31, [x14, x17]
        add     x1, x16, 64
        ...
        fmla    v27.2d, v28.2d, v29.2d
        ...
        fmla    v27.2d, v30.2d, v31.2d
        ...     # Rest of 8x unroll
        bcc     .L3

It also goes without saying that this code could also benefit from the
post-increment addressing mode.



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (16 preceding siblings ...)
  2014-10-22 23:30 ` e.menezes at samsung dot com
@ 2014-10-22 23:59 ` e.menezes at samsung dot com
  2014-10-23  0:31 ` wdijkstr at arm dot com
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-22 23:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

Evandro <e.menezes at samsung dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #33774|0                           |1
        is obsolete|                            |

--- Comment #18 from Evandro <e.menezes at samsung dot com> ---
Created attachment 33786
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33786&action=edit
Simple test-case



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (17 preceding siblings ...)
  2014-10-22 23:59 ` e.menezes at samsung dot com
@ 2014-10-23  0:31 ` wdijkstr at arm dot com
  2014-10-23 10:26 ` ramana.radhakrishnan at arm dot com
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-23  0:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #19 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro from comment #16)
> (In reply to Wilco from comment #15)
> > Using -Ofast is not any different from -O3 -ffast-math when compiling
> > non-Fortran code. As comment 10 shows, both loops are vectorized, however
> > LLVM unrolls twice and uses multiple accumulators while GCC doesn't.
> 
> You're right.  LLVM produces:
> 
> .LBB0_1:                                // %vector.body
>                                         // =>This Inner Loop Header: Depth=1
>         add      x11, x9, x8
>         add      x12, x10, x8
>         ldp      q2, q3, [x11]
>         ldp      q4, q5, [x12]
>         add      x8, x8, #32             // =32
>         fmla     v0.2d, v2.2d, v4.2d
>         fmla     v1.2d, v3.2d, v5.2d
>         cmp      x8, #128, lsl #12      // =524288
>         b.ne    .LBB0_1
> 
> And GCC:
> 
> .L3:
>         ldr     q2, [x2, x0]
>         add     w1, w1, 1
>         ldr     q1, [x3, x0]
>         cmp     w1, w4
>         add     x0, x0, 16
>         fmla    v0.2d, v2.2d, v1.2d
>         bcc     .L3
> 
> > I still don't see what this has to do with A57. You should open a generic
> > bug about GCC not applying basic loop optimizations with -O3 (in fact
> > limited unrolling is useful even for -O2).
> 
> Indeed, but I think that there's still a code-generation opportunity for A57
> here.
> 
> Note above that LLVM loads the registers in pairs, while GCC, when it unrolls
> the loop (more aggressively, BTW), loads each vector individually:

Load/store pair optimization should be committed soon:
https://gcc.gnu.org/ml/gcc-patches/2014-10/msg02005.html

> .L3:
>         ldr     q28, [x15, x16]
>         add     x17, x16, 16
>         ldr     q29, [x14, x16]
>         add     x0, x16, 32
>         ldr     q30, [x15, x17]
>         add     x18, x16, 48
>         ldr     q31, [x14, x17]
>         add     x1, x16, 64
>         ...
>         fmla    v27.2d, v28.2d, v29.2d
>         ...
>         fmla    v27.2d, v30.2d, v31.2d
>         ...     # Rest of 8x unroll
>         bcc     .L3
> 
> It also goes without saying that this code could also benefit from the
> post-increment addressing mode.

Yes I've noticed bad addressing like that and fixes are in progress. It's an
issue in iv-opt - even without post-increment enabled the obvious addressing
mode to use is immediate offset.
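
As a source-level illustration of the addressing shape meant here, consider the sketch below. The name sum_offsets is hypothetical, and the compiler remains free to pick any addressing mode regardless of how the source is written; the point is only that each unrolled access is a constant offset from one base pointer per array, with a single pointer update per iteration.

```c
/* Hypothetical 4x-unrolled loop written with one base pointer per array. */
double sum_offsets(const double *A, const double *B, int n)
{
  const double *a = A, *b = B;
  double res = 0;
  int blocks = n > 0 ? n / 4 : 0;
  int tail = n > 0 ? n % 4 : 0;

  while (blocks-- > 0) {
    /* constant offsets from the base: candidates for immediate-offset
       loads, and adjacent ones for load pairs */
    res += a[0] * b[0];
    res += a[1] * b[1];
    res += a[2] * b[2];
    res += a[3] * b[3];
    a += 4;                  /* one pointer bump per array per iteration */
    b += 4;
  }
  while (tail-- > 0)
    res += *a++ * *b++;      /* post-increment form for the leftovers */

  return res;
}
```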



* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (18 preceding siblings ...)
  2014-10-23  0:31 ` wdijkstr at arm dot com
@ 2014-10-23 10:26 ` ramana.radhakrishnan at arm dot com
  2014-10-28 20:57 ` e.menezes at samsung dot com
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: ramana.radhakrishnan at arm dot com @ 2014-10-23 10:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #20 from ramana.radhakrishnan at arm dot com <ramana.radhakrishnan at arm dot com> ---
On 23/10/14 00:28, e.menezes at samsung dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
>
> --- Comment #16 from Evandro <e.menezes at samsung dot com> ---
> (In reply to Wilco from comment #15)
>> Using -Ofast is not any different from -O3 -ffast-math when compiling
>> non-Fortran code. As comment 10 shows, both loops are vectorized, however
>> LLVM unrolls twice and uses multiple accumulators while GCC doesn't.
>
> You're right.  LLVM produces:
>
> .LBB0_1:                                // %vector.body
>                                          // =>This Inner Loop Header: Depth=1
>          add      x11, x9, x8
>          add      x12, x10, x8
>          ldp      q2, q3, [x11]
>          ldp      q4, q5, [x12]
>          add      x8, x8, #32             // =32
>          fmla     v0.2d, v2.2d, v4.2d
>          fmla     v1.2d, v3.2d, v5.2d
>          cmp      x8, #128, lsl #12      // =524288
>          b.ne    .LBB0_1
>
> And GCC:
>
> .L3:
>          ldr     q2, [x2, x0]
>          add     w1, w1, 1
>          ldr     q1, [x3, x0]
>          cmp     w1, w4
>          add     x0, x0, 16
>          fmla    v0.2d, v2.2d, v1.2d
>          bcc     .L3
>
>> I still don't see what this has to do with A57. You should open a generic
>> bug about GCC not applying basic loop optimizations with -O3 (in fact
>> limited unrolling is useful even for -O2).
>
> Indeed, but I think that there's still a code-generation opportunity for A57
> here.

What you mention is a general code generation improvement for AArch64.

There's nothing Cortex-A57 specific about it. In the AArch64 backend, we 
think architecture and then micro-architecture.

>
> Note above that LLVM loads the registers in pairs, while GCC, when it unrolls
> the loop (more aggressively, BTW), loads each vector individually:
>
> .L3:
>          ldr     q28, [x15, x16]
>          add     x17, x16, 16
>          ldr     q29, [x14, x16]
>          add     x0, x16, 32
>          ldr     q30, [x15, x17]
>          add     x18, x16, 48
>          ldr     q31, [x14, x17]
>          add     x1, x16, 64
>          ...
>          fmla    v27.2d, v28.2d, v29.2d
>          ...
>          fmla    v27.2d, v30.2d, v31.2d
>          ...     # Rest of 8x unroll
>          bcc     .L3
>
> It also goes without saying that this code could also benefit from the
> post-increment addressing mode.


What kind of performance delta do you see if you manage to unroll 
the loop just a wee bit? Probably not much, looking at the code produced 
here.

Ramana



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (19 preceding siblings ...)
  2014-10-23 10:26 ` ramana.radhakrishnan at arm dot com
@ 2014-10-28 20:57 ` e.menezes at samsung dot com
  2014-10-28 22:54 ` e.menezes at samsung dot com
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-28 20:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #21 from Evandro <e.menezes at samsung dot com> ---
(In reply to ramana.radhakrishnan@arm.com from comment #20)
> What kind of performance delta do you see if you manage to unroll 
> the loop just a wee bit? Probably not much, looking at the code produced 
> here.

Here are the cycle counts on Juno for the matrix multiplication test above,
built with -Ofast and different unrolling options:

-fno-unroll-loops:                                    592000
-funroll-loops --param max-unroll-times=2:            594000
-funroll-loops --param max-unroll-times=4:            592000
-funroll-loops (implies --param max-unroll-times=8):  590000
-funroll-loops --param max-unroll-times=16:           581000

It seems to me that without effective iv-opt in place, loops have to be
unrolled too aggressively to make any difference in this case, greatly
sacrificing code size.
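
To put these numbers in perspective, the relative change against the
-fno-unroll-loops baseline can be worked out as below. The helper name
speedup_percent is made up for this sketch; the inputs are the cycle counts
reported above.

```c
#include <assert.h>

/* Relative improvement, in percent, of `cycles` over `baseline`.
   Positive means faster than the baseline, negative means slower.
   Hypothetical helper, not part of any tool mentioned in this thread. */
double speedup_percent(double baseline, double cycles)
{
    assert(baseline > 0.0);
    return (baseline - cycles) / baseline * 100.0;
}
```

With the figures above, 16x unrolling comes out a little under 2% faster than
no unrolling, and 2x unrolling comes out slightly slower.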


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (20 preceding siblings ...)
  2014-10-28 20:57 ` e.menezes at samsung dot com
@ 2014-10-28 22:54 ` e.menezes at samsung dot com
  2014-10-29  0:07 ` wdijkstr at arm dot com
  2015-04-28  8:11 ` thopre01 at gcc dot gnu.org
  23 siblings, 0 replies; 25+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-28 22:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #23 from Evandro <e.menezes at samsung dot com> ---
(In reply to Wilco from comment #22)
> Unrolling alone isn't good enough in sum reductions. As I mentioned before,
> GCC doesn't enable any of the useful loop optimizations by default. So add
> -fvariable-expansion-in-unroller to get a good speedup with unrolling. Again
> these are all generic GCC issues.

Adding -fvariable-expansion-in-unroller when using -funroll-loops results in
practically the same code being emitted.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (21 preceding siblings ...)
  2014-10-28 22:54 ` e.menezes at samsung dot com
@ 2014-10-29  0:07 ` wdijkstr at arm dot com
  2015-04-28  8:11 ` thopre01 at gcc dot gnu.org
  23 siblings, 0 replies; 25+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-29  0:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #24 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro from comment #23)
> (In reply to Wilco from comment #22)
> > Unrolling alone isn't good enough in sum reductions. As I mentioned before,
> > GCC doesn't enable any of the useful loop optimizations by default. So add
> > -fvariable-expansion-in-unroller to get a good speedup with unrolling. Again
> > these are all generic GCC issues.
> 
> Adding -fvariable-expansion-in-unroller when using -funroll-loops results in
> practically the same code being emitted.

Correct, all it does is cut the dependency chain of the accumulates. But that's
enough to get the speedup.
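
For illustration, the effect can be sketched by hand in C. This is only a
sketch under assumptions: the function names dot_one_acc and dot_two_acc are
hypothetical, and the second function approximates what expanding the
accumulator does rather than reproducing GCC's actual output. Splitting the
sum reassociates the floating-point additions, which is why the compiler
needs -ffast-math (or -Ofast) to do it on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: every multiply-add depends on the result of the
   previous one, so consecutive multiply-adds cannot overlap. */
double dot_one_acc(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Two accumulators (hand-written, hypothetical equivalent of unrolling
   twice with the accumulator expanded): the two dependency chains are
   independent of each other. */
double dot_two_acc(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum0 = 0.0, sum1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                  /* odd element left over */
        sum0 += a[i] * b[i];
    return sum0 + sum1;         /* combine the partial sums once */
}
```

Both functions return the same value whenever the products and partial sums
are exactly representable; in general the results may differ in the last
bits because the order of the additions changes.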


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug target/63503] [AArch64] A57 executes fused multiply-add poorly in some situations
  2014-10-09 21:56 [Bug target/63503] New: [AArch64] A57 executes fused multiply-add poorly in some situations e.menezes at samsung dot com
                   ` (22 preceding siblings ...)
  2014-10-29  0:07 ` wdijkstr at arm dot com
@ 2015-04-28  8:11 ` thopre01 at gcc dot gnu.org
  23 siblings, 0 replies; 25+ messages in thread
From: thopre01 at gcc dot gnu.org @ 2015-04-28  8:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503

--- Comment #25 from Thomas Preud'homme <thopre01 at gcc dot gnu.org> ---
Author: thopre01
Date: Tue Apr 28 08:10:44 2015
New Revision: 222512

URL: https://gcc.gnu.org/viewcvs?rev=222512&root=gcc&view=rev
Log:
2015-04-28  Thomas Preud'homme  <thomas.preudhomme@arm.com>

    gcc/
    PR target/63503
    * config.gcc: Add cortex-a57-fma-steering.o to extra_objs for
    aarch64-*-*.
    * config/aarch64/t-aarch64: Add a rule for cortex-a57-fma-steering.o.
    * config/aarch64/aarch64.h (AARCH64_FL_USE_FMA_STEERING_PASS): Define.
    (AARCH64_TUNE_FMA_STEERING): Likewise.
    * config/aarch64/aarch64-cores.def: Set
    AARCH64_FL_USE_FMA_STEERING_PASS for cores with dynamic steering of
    FMUL/FMADD instructions.
    * config/aarch64/aarch64.c (aarch64_register_fma_steering): Declare.
    (aarch64_override_options): Include cortex-a57-fma-steering.h. Call
    aarch64_register_fma_steering () if AARCH64_TUNE_FMA_STEERING is true.
    * config/aarch64/cortex-a57-fma-steering.h: New file.
    * config/aarch64/cortex-a57-fma-steering.c: Likewise.

Added:
    trunk/gcc/config/aarch64/cortex-a57-fma-steering.c
    trunk/gcc/config/aarch64/cortex-a57-fma-steering.h
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config.gcc
    trunk/gcc/config/aarch64/aarch64-cores.def
    trunk/gcc/config/aarch64/aarch64.c
    trunk/gcc/config/aarch64/aarch64.h
    trunk/gcc/config/aarch64/t-aarch64


^ permalink raw reply	[flat|nested] 25+ messages in thread
