[Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size

public inbox for gcc-bugs@sourceware.org

* [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size
@ 2014-07-25 22:37 e.menezes at samsung dot com
  2014-07-25 22:40 ` [Bug target/61915] " pinskia at gcc dot gnu.org
                   ` (22 more replies)
  0 siblings, 23 replies; 24+ messages in thread
From: e.menezes at samsung dot com @ 2014-07-25 22:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

            Bug ID: 61915
           Summary: [AArch64] Default use of the LRA results in extra code
                    size
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: e.menezes at samsung dot com

The default use of the LRA increases code size: the FP register that variables
are spilled into must itself be spilled.

For example, in Dhrystone, out of dhry_1.c I see sequences like this:

  ldr    d9, [sp, 144]
  ...
  fmov    x0, d9
  bl    printf
  ...
  fmov    x0, d9
  ...
  bl    printf

With the LRA disabled, the code is a tad leaner (about 2%):

  ldr    x0, [sp, 144]
  ...
  bl    printf
  ...
  ldr    x0, [sp, 144]
  ...
  bl    printf
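
As a rough illustration of the difference between the two fragments: A64
instructions have a fixed 4-byte encoding, so the visible instructions can
simply be counted. The sketch below is only an assumption-laden tally (the
helper name `code_bytes` is made up for this example, and the elided `...`
lines are skipped), so it ignores everything not shown above, including any
prologue/epilogue cost of preserving d9.

```python
# Fragments copied from the two listings above; "..." lines are elided code.
WITH_LRA = """
  ldr    d9, [sp, 144]
  ...
  fmov    x0, d9
  bl    printf
  ...
  fmov    x0, d9
  ...
  bl    printf
"""

WITHOUT_LRA = """
  ldr    x0, [sp, 144]
  ...
  bl    printf
  ...
  ldr    x0, [sp, 144]
  ...
  bl    printf
"""

A64_INSN_BYTES = 4  # every A64 instruction is encoded in 4 bytes


def code_bytes(listing: str) -> int:
    """Bytes taken by the visible instructions of an assembly fragment."""
    insns = [
        line for line in listing.splitlines()
        if line.strip() and line.strip() != "..."
    ]
    return len(insns) * A64_INSN_BYTES


if __name__ == "__main__":
    print("with LRA:   ", code_bytes(WITH_LRA), "bytes")
    print("without LRA:", code_bytes(WITHOUT_LRA), "bytes")
```

The count covers only these two call sites; the 2% figure refers to the whole
object file.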

Moreover, is transferring registers between the GP and the FP register files
always cheap?  In some x86 processors this used to be accomplished internally
through the load-store unit anyway (e.g., Opteron).  How is this accomplished
internally in A53 and A57?

Is using the LRA by default clearly beneficial in other cases?

At the Cauldron I mentioned some variables that could be rematerialized when
needed instead of being spilled, but I could not reproduce that.  I'll try some
more to spot this behavior.

* [Bug target/61915] [AArch64] Default use of the LRA results in extra code size
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
@ 2014-07-25 22:40 ` pinskia at gcc dot gnu.org
  2014-07-25 22:41 ` pinskia at gcc dot gnu.org
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-07-25 22:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> How is this accomplished internally in A53 and A57?

I don't know about A53 and A57, but on Cavium's Thunder the transfer does not
go through the load/store unit: there is a direct path between the GPRs and
FPRs, and its latency is low.


* [Bug target/61915] [AArch64] Default use of the LRA results in extra code size
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
  2014-07-25 22:40 ` [Bug target/61915] " pinskia at gcc dot gnu.org
@ 2014-07-25 22:41 ` pinskia at gcc dot gnu.org
  2014-07-25 22:45 ` e.menezes at samsung dot com
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-07-25 22:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
https://gcc.gnu.org/ml/gcc/2014-05/msg00160.html


* [Bug target/61915] [AArch64] Default use of the LRA results in extra code size
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
  2014-07-25 22:40 ` [Bug target/61915] " pinskia at gcc dot gnu.org
  2014-07-25 22:41 ` pinskia at gcc dot gnu.org
@ 2014-07-25 22:45 ` e.menezes at samsung dot com
  2014-08-05  9:08 ` [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 ramana at gcc dot gnu.org
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: e.menezes at samsung dot com @ 2014-07-25 22:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #3 from Evandro Menezes <e.menezes at samsung dot com> ---
In Opteron, there was a path from the FP to the GP registers, but not the other
way around.  That path only became symmetric in Barcelona.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (2 preceding siblings ...)
  2014-07-25 22:45 ` e.menezes at samsung dot com
@ 2014-08-05  9:08 ` ramana at gcc dot gnu.org
  2014-08-05 15:27 ` e.menezes at samsung dot com
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: ramana at gcc dot gnu.org @ 2014-08-05  9:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |aarch64-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-08-05
           Assignee|unassigned at gcc dot gnu.org      |ramana at gcc dot gnu.org
            Summary|[AArch64] Default use of    |[AArch64] High amounts of
                   |the LRA results in extra    |GP to FP register moves
                   |code size                   |using LRA on AArch64
     Ever confirmed|0                           |1

--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
We've noticed this overeagerness hurting in a number of places, including
SPEC2k(6) for Cortex-A57, and are in the process of adjusting REGISTER_MOVE_COST
and MEMORY_MOVE_COST to fix this for those cores. That is the first step toward
reducing the number of these moves.

If you have more examples and more analysis from outside these benchmarks, they
would help us look for such cases.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (3 preceding siblings ...)
  2014-08-05  9:08 ` [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 ramana at gcc dot gnu.org
@ 2014-08-05 15:27 ` e.menezes at samsung dot com
  2014-08-14 14:28 ` vmakarov at gcc dot gnu.org
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: e.menezes at samsung dot com @ 2014-08-05 15:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #5 from Evandro Menezes <e.menezes at samsung dot com> ---
Created attachment 33249
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33249&action=edit
Dhrystone, part 2 of 3

I first observed this issue when looking into Dhrystone built with fairly
standard options:

-O2 -fno-short-enums -fno-inline -fno-inline-functions
-fno-inline-small-functions -fno-inline-functions-called-once
-fomit-frame-pointer -funroll-all-loops

If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (4 preceding siblings ...)
  2014-08-05 15:27 ` e.menezes at samsung dot com
@ 2014-08-14 14:28 ` vmakarov at gcc dot gnu.org
  2014-08-14 14:53 ` e.menezes at samsung dot com
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2014-08-14 14:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #6 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Evandro Menezes from comment #5)
> Created attachment 33249 [details]
> Dhrystone, part 2 of 3
> 
> I first observed this issue when looking into Dhrystone built with fairly
> standard options:
> 
> -O2 -fno-short-enums -fno-inline -fno-inline-functions
> -fno-inline-small-functions -fno-inline-functions-called-once
> -fomit-frame-pointer -funroll-all-loops
> 
> If I add -mno-lra, the code size in dhry_1.o is about 2% smaller.

Evandro, thanks for reporting this.  Sorry, I am busy with other things these
days.  I'll start to work on this PR in September to try to make some progress
for the next GCC release.

Maybe the better rematerialization in LRA that I am working on now will help
with this PR too.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (5 preceding siblings ...)
  2014-08-14 14:28 ` vmakarov at gcc dot gnu.org
@ 2014-08-14 14:53 ` e.menezes at samsung dot com
  2014-08-14 15:02 ` vmakarov at gcc dot gnu.org
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: e.menezes at samsung dot com @ 2014-08-14 14:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #7 from Evandro Menezes <e.menezes at samsung dot com> ---
(In reply to Vladimir Makarov from comment #6)
> 
> Evandro, thanks for reporting this.  Sorry, I am busy with other things these
> days.  I'll start to work on this PR in September to try to make some
> progress for the next GCC release.
> 
> Maybe the better rematerialization in LRA that I am working on now will help
> with this PR too.

Vladimir,

I was thinking about using the hook function to avoid using FPR, at least when
-Os is specified, for the time being.  This way, registers would still be
allocated by the LRA, but this side-effect would be under control.  Or do y'all
think that it's better to wait a little while longer?


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (6 preceding siblings ...)
  2014-08-14 14:53 ` e.menezes at samsung dot com
@ 2014-08-14 15:02 ` vmakarov at gcc dot gnu.org
  2014-10-22 21:23 ` wdijkstr at arm dot com
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: vmakarov at gcc dot gnu.org @ 2014-08-14 15:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #8 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Evandro Menezes from comment #7)
> (In reply to Vladimir Makarov from comment #6)
> > 
> > Evandro, thanks for reporting this.  Sorry, I am busy with other things these
> > days.  I'll start to work on this PR in September to try to make some
> > progress for the next GCC release.
> > 
> > Maybe the better rematerialization in LRA that I am working on now will help
> > with this PR too.
> 
> Vladimir,
> 
> I was thinking about using the hook function to avoid using FPR, at least
> when -Os is specified, for the time being.  This way, registers would still
> be allocated by the LRA, but this side-effect would be under control.  Or do
> y'all think that it's better to wait a little while longer?

If it works and it is ok for the ARM maintainers, it is ok for me too.

I will look at this from the LRA side to see whether the code size can be
decreased or not.

Your solution is in the machine-dependent part.  So it is up to you and the ARM
maintainers.  I think you should not wait for what I may or may not find in LRA
itself to fix it.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (7 preceding siblings ...)
  2014-08-14 15:02 ` vmakarov at gcc dot gnu.org
@ 2014-10-22 21:23 ` wdijkstr at arm dot com
  2014-10-22 23:28 ` wdijkstr at arm dot com
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-22 21:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Wilco <wdijkstr at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wdijkstr at arm dot com

--- Comment #9 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro Menezes from comment #0)
> The default use of the LRA increases code size: the FP register that
> variables are spilled into must itself be spilled.

The performance cost is a much bigger issue than code size. The problem is that
when register pressure is high, the register allocator decides to allocate
integer live ranges to FP registers and insert int<->fp moves for every
use/define (i.e., you end up with far more moves than you would if it were
spilled, so it is a bad thing even if int<->fp moves are cheap).

I committed a workaround
(http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
int<->fp move cost. Can you try this and check the issue has indeed gone? You
need -mcpu=cortex-a57.

* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (8 preceding siblings ...)
  2014-10-22 21:23 ` wdijkstr at arm dot com
@ 2014-10-22 23:28 ` wdijkstr at arm dot com
  2014-10-24 21:34 ` e.menezes at samsung dot com
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-22 23:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #10 from Wilco <wdijkstr at arm dot com> ---
(In reply to Andrew Pinski from comment #2)
> https://gcc.gnu.org/ml/gcc/2014-05/msg00160.html

Note currently it is not possible to use FP registers for spilling using the
hooks - basically you still end up with int<->fp moves for every definition and
use (even when multiple uses are right next to each other), and
rematerialization does not happen at all.

However what you'd expect is that all spill optimizations apply first and if
all else fails every load/store of a stack slot is turned into an int<->fp
move.
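
To make the difference concrete, here is a toy count, not a model of what LRA
actually does. It assumes that an integer value living in an FP register costs
one int<->fp move per definition and per use, while a stack spill with the
usual optimizations needs one store per definition and only one reload per
group of adjacent uses. The function names and the example numbers are made up
for illustration.

```python
from typing import Sequence


def fp_register_moves(defs: int, use_groups: Sequence[int]) -> int:
    """Moves when the value lives in an FP register.

    Assumption: one fmov per definition and one per use, even when
    several uses sit right next to each other.
    """
    return defs + sum(use_groups)


def stack_spill_accesses(defs: int, use_groups: Sequence[int]) -> int:
    """Memory accesses when the value is spilled to a stack slot.

    Assumption: one store per definition and one reload per group of
    adjacent uses, since a reloaded value can be reused within a group.
    """
    return defs + len(use_groups)


if __name__ == "__main__":
    # One definition, six uses clustered as 3 + 2 + 1 (hypothetical).
    groups = [3, 2, 1]
    print("int<->fp moves:", fp_register_moves(1, groups))
    print("loads/stores:  ", stack_spill_accesses(1, groups))
```

Under these assumptions the two counts are equal only when no two uses are
adjacent; any clustering of uses favours the ordinary spill.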

* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (9 preceding siblings ...)
  2014-10-22 23:28 ` wdijkstr at arm dot com
@ 2014-10-24 21:34 ` e.menezes at samsung dot com
  2014-10-24 21:39 ` e.menezes at samsung dot com
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-24 21:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #11 from Evandro <e.menezes at samsung dot com> ---
(In reply to Wilco from comment #9)
> The performance cost is a much bigger issue than codesize. The problem is
> that when register pressure is high, the register allocator decides to
> allocate integer liveranges to FP registers and insert int<->fp moves for
> every use/define (ie. you end up with far more moves than you would if it
> were spilled, so it is a bad thing even if int<->fp moves are cheap).
> 
> I committed a workaround
> (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
> int<->fp move cost. Can you try this and check the issue has indeed gone?
> You need -mcpu=cortex-a57.

I believe that it pretty much is, after a cursory examination.  The code size 
after the patch is back down about 2% for the test case above.  Of note, the
prolog and epilog are much smaller, because the FP registers don't have to be
saved and restored anymore, and the stack frame shrank correspondingly.

Do you have an idea of the performance impact of this patch?


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (10 preceding siblings ...)
  2014-10-24 21:34 ` e.menezes at samsung dot com
@ 2014-10-24 21:39 ` e.menezes at samsung dot com
  2014-10-24 22:39 ` pinskia at gcc dot gnu.org
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-24 21:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #12 from Evandro <e.menezes at samsung dot com> ---
(In reply to Evandro from comment #11)
> Do you have an idea of the performance impact of this patch?

At least in Dhrystone, it improved by over 2% on A57.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (11 preceding siblings ...)
  2014-10-24 21:39 ` e.menezes at samsung dot com
@ 2014-10-24 22:39 ` pinskia at gcc dot gnu.org
  2014-10-25  0:57 ` e.menezes at samsung dot com
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-10-24 22:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #13 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Wilco from comment #9)
> I committed a workaround
> (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
> int<->fp move cost. Can you try this and check the issue has indeed gone?
> You need -mcpu=cortex-a57.

Note when I submitted ThunderX support I used a base of 2 instead of a base of
1 due to 2 being the default and all values are relative to that.  This is
mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html .  In fact a
value of 2 means reload will not look at the constraints of a move instruction.

So I think the cortex* cpus should also re-base these values so that 2 is the
gpr-to-gpr value.
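
A sketch of what such a re-basing could look like, assuming a table whose
gpr-to-gpr entry is currently 1. The key names and the cost values below are
hypothetical and chosen only for this illustration; they are not taken from
any real tuning table.

```python
from fractions import Fraction
from typing import Dict


def rebase(costs: Dict[str, int], base: int = 2) -> Dict[str, int]:
    """Scale every move cost so that the GP-to-GP cost equals `base`.

    Ratios between entries are preserved; an error is raised if the
    scaled value would not be a whole number.
    """
    scale = Fraction(base, costs["GP2GP"])
    rebased = {}
    for name, cost in costs.items():
        scaled = cost * scale
        if scaled.denominator != 1:
            raise ValueError(f"{name} does not scale to an integer cost")
        rebased[name] = int(scaled)
    return rebased


if __name__ == "__main__":
    # Hypothetical table with a gpr-to-gpr base of 1.
    base_one = {"GP2GP": 1, "GP2FP": 3, "FP2GP": 3, "FP2FP": 2}
    print(rebase(base_one))
```

Scaling rather than adding a constant keeps the relative ordering and the
ratios of the entries unchanged.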


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (12 preceding siblings ...)
  2014-10-24 22:39 ` pinskia at gcc dot gnu.org
@ 2014-10-25  0:57 ` e.menezes at samsung dot com
  2014-10-25  1:29 ` wdijkstr at arm dot com
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-25  0:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #14 from Evandro <e.menezes at samsung dot com> ---
(In reply to Wilco from comment #10)
> Note currently it is not possible to use FP registers for spilling using the
> hooks - basically you still end up with int<->fp moves for every definition
> and use (even when multiple uses are right next to each other), and
> rematerialization does not happen at all.

Vladimir,

I had also noticed that the hooks that you pointed me to didn't seem to work as
documented.  Are we missing anything?


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (13 preceding siblings ...)
  2014-10-25  0:57 ` e.menezes at samsung dot com
@ 2014-10-25  1:29 ` wdijkstr at arm dot com
  2014-10-25  1:41 ` wdijkstr at arm dot com
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-25  1:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #15 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro from comment #12)
> (In reply to Evandro from comment #11)
> > Do you have an idea of the performance impact of this patch?
> 
> At least in Dhrystone, it improved by over 2% on A57.

It was ~2% on SPECINT2k, ~3% on SPECFP2k. There were large gains on other
benchmarks where the allocator had gone berserk on FP moves inside the hot
loop. The removal of the redundant FP saves/restores in many functions helps as
well.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (14 preceding siblings ...)
  2014-10-25  1:29 ` wdijkstr at arm dot com
@ 2014-10-25  1:41 ` wdijkstr at arm dot com
  2014-10-25  6:46 ` pinskia at gcc dot gnu.org
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: wdijkstr at arm dot com @ 2014-10-25  1:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #16 from Wilco <wdijkstr at arm dot com> ---
(In reply to Andrew Pinski from comment #13)
> (In reply to Wilco from comment #9)
> > I committed a workaround
> > (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
> > int<->fp move cost. Can you try this and check the issue has indeed gone?
> > You need -mcpu=cortex-a57.
> 
> Note when I submitted ThunderX support I used a base of 2 instead of a base
> of 1 due to 2 being the default and all values are relative to that.  This
> is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html .  In fact
> a value of 2 means reload will not look at the constraints of a move
> instruction.
> 
> So I think the cortex* cpus should also re-base these values based on 2
> being gpr-to-gpr value.

You mean only use multiples of 2? That's interesting as I've not seen that done
elsewhere. Are these costs in any way related to real issue and latency cycles?
Most targets have complex tables with all the exact latencies for every little
uarch detail, but from what I've seen in the allocator these costs have almost
no meaning.

So did you find that setting the FP move cost so low actually works alright on
ThunderX? I'd like to figure out a setting for the generic target that works
out well across all AArch64 implementations.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (15 preceding siblings ...)
  2014-10-25  1:41 ` wdijkstr at arm dot com
@ 2014-10-25  6:46 ` pinskia at gcc dot gnu.org
  2014-10-28 10:51 ` [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost ramana at gcc dot gnu.org
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: pinskia at gcc dot gnu.org @ 2014-10-25  6:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #17 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Wilco from comment #16)
> (In reply to Andrew Pinski from comment #13)
> > (In reply to Wilco from comment #9)
> > > I committed a workaround
> > > (http://gcc.gnu.org/ml/gcc-patches/2014-09/msg00362.html) by increasing the
> > > int<->fp move cost. Can you try this and check the issue has indeed gone?
> > > You need -mcpu=cortex-a57.
> > 
> > Note when I submitted ThunderX support I used a base of 2 instead of a base
> > of 1 due to 2 being the default and all values are relative to that.  This
> > is mentioned in https://gcc.gnu.org/onlinedocs/gccint/Costs.html .  In fact
> > a value of 2 means reload will not look at the constraints of a move
> > instruction.
> > 
> > So I think the cortex* cpus should also re-base these values based on 2
> > being gpr-to-gpr value.
> 
> You mean only use multiples of 2? That's interesting as I've not seen that
> done elsewhere. Are these costs in any way related to real issue and latency
> cycles? Most targets have complex tables with all the exact latencies for
> every little uarch detail, but from what I've seen in the allocator these
> costs have almost no meaning.

Not always multiples of 2, though in the case of ThunderX they are.  The costs
are not directly related to the latency; they are relative to one another.  So
I could have used 2, 3, 4 (meaning latencies of 1, 2, 3) instead.  I used the
factor of 2 instead of 1 for ThunderX because 2 + 3 != 4 but rather 5.
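
One possible reading of that arithmetic, written out as a sketch: if costs are
derived from latencies by adding 1 (giving 2, 3, 4), the cost of a latency-1
move plus a latency-2 move is 5, which differs from the cost assigned to a
single latency-3 move (4). Multiplying by 2 (giving 2, 4, 6) keeps the sums
consistent. The function names are made up for this illustration.

```python
from typing import Callable


def offset_cost(latency: int) -> int:
    """Cost = latency + 1, i.e. latencies 1, 2, 3 become 2, 3, 4."""
    return latency + 1


def scaled_cost(latency: int) -> int:
    """Cost = 2 * latency, i.e. latencies 1, 2, 3 become 2, 4, 6."""
    return 2 * latency


def is_additive(cost: Callable[[int], int], a: int, b: int) -> bool:
    """True if the cost of two moves equals the cost of their combined latency."""
    return cost(a) + cost(b) == cost(a + b)


if __name__ == "__main__":
    print("offset:", offset_cost(1) + offset_cost(2), "vs", offset_cost(3))
    print("scaled:", scaled_cost(1) + scaled_cost(2), "vs", scaled_cost(3))
    print(is_additive(offset_cost, 1, 2), is_additive(scaled_cost, 1, 2))
```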

> 
> So did you find that setting the FP move cost so low actually works alright
> on ThunderX? I'd like to figure out a setting for the generic target that
> works out well across all AArch64 implementations.

Yes, it seems to, at least on the things we have benchmarked, but we have not
run many big benchmarks like SPEC yet.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (16 preceding siblings ...)
  2014-10-25  6:46 ` pinskia at gcc dot gnu.org
@ 2014-10-28 10:51 ` ramana at gcc dot gnu.org
  2014-10-28 11:14 ` ramana at gcc dot gnu.org
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: ramana at gcc dot gnu.org @ 2014-10-28 10:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[AArch64] High amounts of   |[AArch64] High amounts of
                   |GP to FP register moves     |GP to FP register moves
                   |using LRA on AArch64        |using LRA on AArch64 -
                   |                            |Improve Generic
                   |                            |register_move_cost and
                   |                            |memory_move_cost

--- Comment #19 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
To my mind it seems like 407 fmoves is just a bit too berserk and regardless of
how efficient your core is, there is no point in having so many moves back and
forth.


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (17 preceding siblings ...)
  2014-10-28 10:51 ` [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost ramana at gcc dot gnu.org
@ 2014-10-28 11:14 ` ramana at gcc dot gnu.org
  2014-10-31 16:25 ` e.menezes at samsung dot com
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: ramana at gcc dot gnu.org @ 2014-10-28 11:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|ramana at gcc dot gnu.org          |wdijkstr at arm dot com
   Target Milestone|---                         |5.0


* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (18 preceding siblings ...)
  2014-10-28 11:14 ` ramana at gcc dot gnu.org
@ 2014-10-31 16:25 ` e.menezes at samsung dot com
  2014-11-19 14:41 ` jiwang at gcc dot gnu.org
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 24+ messages in thread
From: e.menezes at samsung dot com @ 2014-10-31 16:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #20 from Evandro <e.menezes at samsung dot com> ---
(In reply to Ramana Radhakrishnan from comment #19)
> To my mind it seems like 407 fmoves is just a bit too berserk and regardless
> of how efficient your core is, there is no point in having so many moves
> back and forth.

It seems that the only LRA parameter exposed is
lra-max-considered-reload-pseudos. It defaults to 500; decreasing it results
in more FMOVs, and increasing it results in fewer. Values above 1000 have no
further effect. At 1000, the number of FMOVs decreases by 5% in some cases.
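
A comparison like the one above can be reproduced by compiling to assembly with
different values of the parameter (for example, gcc -O2 -S --param
lra-max-considered-reload-pseudos=1000 dhry_1.c) and counting the FMOVs that
cross between the GP and FP register files. The sketch below is only one way
to do the counting. The function name count_gp_fp_fmovs and the sample text
are assumptions made for illustration, and the pattern deliberately ignores
vector-element forms such as v1.d[1].

```python
import re

# fmov with two plain register operands (x/w = GP, d/s = FP).
# Immediate and vector-element forms are intentionally not matched.
_FMOV = re.compile(r"^\s*fmov\s+([xwds])\d+\s*,\s*([xwds])\d+\b", re.IGNORECASE)


def count_gp_fp_fmovs(asm_text):
    """Count fmov instructions that move a value between a GP and an FP register."""
    count = 0
    for line in asm_text.splitlines():
        m = _FMOV.match(line)
        if not m:
            continue
        dst, src = m.group(1).lower(), m.group(2).lower()
        # Exactly one of the two operands must be a GP register.
        if (dst in "xw") != (src in "xw"):
            count += 1
    return count


# Hypothetical sample modelled on the sequence from the original report,
# plus one FP-to-FP move that must not be counted.
sample = """\
  ldr    d9, [sp, 144]
  fmov    x0, d9
  bl    printf
  fmov    x0, d9
  bl    printf
  fmov    d1, d2
"""

print(count_gp_fp_fmovs(sample))
```

Running the same function over the .s files produced with each parameter
value gives the counts to compare.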



* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (19 preceding siblings ...)
  2014-10-31 16:25 ` e.menezes at samsung dot com
@ 2014-11-19 14:41 ` jiwang at gcc dot gnu.org
  2014-11-19 14:47 ` wdijkstr at arm dot com
  2015-03-10  7:34 ` collison at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: jiwang at gcc dot gnu.org @ 2014-11-19 14:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #21 from Jiong Wang <jiwang at gcc dot gnu.org> ---
Author: jiwang
Date: Wed Nov 19 14:40:26 2014
New Revision: 217780

URL: https://gcc.gnu.org/viewcvs?rev=217780&root=gcc&view=rev
Log:
[AArch64] Adjust generic move costs

  2014-11-19  Wilco Dijkstra  <wdijkstr@arm.com>

    PR target/61915
    * config/aarch64/aarch64.c (generic_regmove_cost): Increase FP move cost.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.c
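
To illustrate why increasing a GP<->FP move cost can change where a value is
spilled, here is a toy model. It is not GCC's allocator and does not reproduce
the contents of generic_regmove_cost: the class MoveCosts, the function
spill_choice, and every number below are assumptions chosen only to show the
effect of a move cost falling below or rising above the memory cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveCosts:
    gp2fp: int   # cost of one GP -> FP register move (illustrative)
    fp2gp: int   # cost of one FP -> GP register move (illustrative)
    memory: int  # cost of one load or one store (illustrative)


def spill_choice(costs, uses):
    """Pick the cheaper place to park a GP value that is reloaded `uses` times."""
    # Park in an FP register: one move out, one move back per use.
    via_fp = costs.gp2fp + uses * costs.fp2gp
    # Park on the stack: one store, one load per use.
    via_mem = costs.memory + uses * costs.memory
    return "fp-register" if via_fp < via_mem else "stack"


# Hypothetical cost tables: moves cheaper than memory vs. dearer than memory.
cheap_moves = MoveCosts(gp2fp=2, fp2gp=2, memory=4)
costly_moves = MoveCosts(gp2fp=5, fp2gp=5, memory=4)

print(spill_choice(cheap_moves, uses=2))
print(spill_choice(costly_moves, uses=2))
```

In this model the first table favours parking the value in an FP register and
the second favours the stack, which matches the direction of the change
described in the log message.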



* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (20 preceding siblings ...)
  2014-11-19 14:41 ` jiwang at gcc dot gnu.org
@ 2014-11-19 14:47 ` wdijkstr at arm dot com
  2015-03-10  7:34 ` collison at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: wdijkstr at arm dot com @ 2014-11-19 14:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

Wilco <wdijkstr at arm dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #22 from Wilco <wdijkstr at arm dot com> ---
Fixed



* [Bug target/61915] [AArch64] High amounts of GP to FP register moves using LRA on AArch64 - Improve Generic register_move_cost and memory_move_cost
  2014-07-25 22:37 [Bug target/61915] New: [AArch64] Default use of the LRA results in extra code size e.menezes at samsung dot com
                   ` (21 preceding siblings ...)
  2014-11-19 14:47 ` wdijkstr at arm dot com
@ 2015-03-10  7:34 ` collison at gcc dot gnu.org
  22 siblings, 0 replies; 24+ messages in thread
From: collison at gcc dot gnu.org @ 2015-03-10  7:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61915

--- Comment #23 from collison at gcc dot gnu.org ---
Author: collison
Date: Tue Mar 10 07:34:20 2015
New Revision: 221302

URL: https://gcc.gnu.org/viewcvs?rev=221302&root=gcc&view=rev
Log:
2015-03-10  Michael Collison  <michael.collison@linaro.org>

    Backport from trunk r217780.
    2014-11-19  Wilco Dijkstra  <wdijkstr@arm.com>

    PR target/61915
    * config/aarch64/aarch64.c (generic_regmove_cost): Increase FP move
    cost.


Modified:
    branches/linaro/gcc-4_9-branch/gcc/ChangeLog.linaro
    branches/linaro/gcc-4_9-branch/gcc/config/aarch64/aarch64.c



