public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/94145] New: Longcalls mis-optimize loading the function address
@ 2020-03-11 16:00 meissner at gcc dot gnu.org
  2020-03-11 18:27 ` [Bug target/94145] " meissner at gcc dot gnu.org
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: meissner at gcc dot gnu.org @ 2020-03-11 16:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

            Bug ID: 94145
           Summary: Longcalls mis-optimize loading the function address
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: meissner at gcc dot gnu.org
  Target Milestone: ---

I'm working on a feature where we convert some/all built-in function calls to
use the longcall sequence.  I discovered that the compiler is mis-optimizing
loading up the function address.  This showed up in the Spec 2017 wrf_r
benchmark where I replaced some 60,000 direct calls to longcalls.

In particular, the PowerPC backend is not marking the load of the function
address as being volatile.  This allows the compiler to move the load out of a
loop.

However with the current ELF semantics, you don't want to do this because the
function address changes.  The first call to the function, the address is the
PLT stub, but in subsequent calls it is the address of the function itself
after the shared library is loaded.

In addition, because UNSPECs are used, the compiler is likely to store the
function address in the stack and reload it.  Given that the UNSPEC is just a
load, it would be better not to optimize this to doing the extra load/store.

In fixing the linker bug that this feature uncovered, Alan Modra has a simple
patch to fix it.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
@ 2020-03-11 18:27 ` meissner at gcc dot gnu.org
  2020-03-11 21:33 ` segher at gcc dot gnu.org
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: meissner at gcc dot gnu.org @ 2020-03-11 18:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #1 from Michael Meissner <meissner at gcc dot gnu.org> ---
Created attachment 48021
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48021&action=edit
Example code

Compile with -mcpu=future -mpcrel -O3 to see the load of the address being
moved out of the loop.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
  2020-03-11 18:27 ` [Bug target/94145] " meissner at gcc dot gnu.org
@ 2020-03-11 21:33 ` segher at gcc dot gnu.org
  2020-03-11 21:35 ` segher at gcc dot gnu.org
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: segher at gcc dot gnu.org @ 2020-03-11 21:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #2 from Segher Boessenkool <segher at gcc dot gnu.org> ---
How could the function address ever not be constant?

Hoisting it to somewhere where it is (dynamically) more expensive is a
bad idea, of course.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
  2020-03-11 18:27 ` [Bug target/94145] " meissner at gcc dot gnu.org
  2020-03-11 21:33 ` segher at gcc dot gnu.org
@ 2020-03-11 21:35 ` segher at gcc dot gnu.org
  2020-03-11 22:50 ` amodra at gmail dot com
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: segher at gcc dot gnu.org @ 2020-03-11 21:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #3 from Segher Boessenkool <segher at gcc dot gnu.org> ---
C11 6.6/9 says it *always* is constant, even.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-03-11 21:35 ` segher at gcc dot gnu.org
@ 2020-03-11 22:50 ` amodra at gmail dot com
  2020-03-11 22:50 ` amodra at gmail dot com
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: amodra at gmail dot com @ 2020-03-11 22:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

Alan Modra <amodra at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amodra at gmail dot com

--- Comment #4 from Alan Modra <amodra at gmail dot com> ---
Created attachment 48022
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48022&action=edit
gcc fix to make plt loads volatile

This patch has some commentary explaining what is going on.  Strictly speaking
what is in the PLT is *not* the "address of a function".  I haven't
bootstrapped it yet, but Mike has used it to build spec with -mlongcalls which
probably is better testing for this patch than a bootstrap anyway.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2020-03-11 22:50 ` amodra at gmail dot com
@ 2020-03-11 22:50 ` amodra at gmail dot com
  2020-03-12  8:48 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: amodra at gmail dot com @ 2020-03-11 22:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

Alan Modra <amodra at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |amodra at gmail dot com
   Last reconfirmed|                            |2020-03-11
     Ever confirmed|0                           |1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2020-03-11 22:50 ` amodra at gmail dot com
@ 2020-03-12  8:48 ` rguenth at gcc dot gnu.org
  2020-03-12 13:00 ` amodra at gmail dot com
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-12  8:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
So what prevents GIMPLE from doing the transform to an indirect call and
hoisting the call address computation out of the loop?  I fear your volatile
marking is
papering over an entirely different issue.  Of course it will likely work
as a workaround since nobody is going to do that above mentioned dance.  Maybe
code like

void foo();

void bar()
{
  void (volatile fn*)() = foo;
  void (fn2 *)() = fn;
  for (int i = 0; i<10000; ++i)
    fn2();
}

will expose the same "issue" whatever it really is?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2020-03-12  8:48 ` rguenth at gcc dot gnu.org
@ 2020-03-12 13:00 ` amodra at gmail dot com
  2020-03-12 13:17 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: amodra at gmail dot com @ 2020-03-12 13:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #6 from Alan Modra <amodra at gmail dot com> ---
Transformations to indirect calls and hoisting function addresses out of loops
is fine.  That sort of thing has nothing to do with this problem.

The problem is that the PLT really is volatile, and the inline PLT code for
powerpc exposes those PLT loads without letting gcc know they are in fact
volatile.  If gcc decides to cache a PLT load in a register and then use it for
multiple calls to the same function you might end up going via the ld.so symbol
resolver for every one of those calls rather than only on the first call.  That
is very definitely not an optimisation.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2020-03-12 13:00 ` amodra at gmail dot com
@ 2020-03-12 13:17 ` rguenth at gcc dot gnu.org
  2020-03-12 13:34 ` amodra at gmail dot com
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-12 13:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ah, I read "mis-optimize" as produce wrong-code ...  OTOH CSEing the load from
the PLT once it is resolved _would_ be an optimization.  Asks for loop
peeling I guess?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2020-03-12 13:17 ` rguenth at gcc dot gnu.org
@ 2020-03-12 13:34 ` amodra at gmail dot com
  2020-03-12 13:57 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: amodra at gmail dot com @ 2020-03-12 13:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #8 from Alan Modra <amodra at gmail dot com> ---
(In reply to Richard Biener from comment #7)
> OTOH CSEing the load from the PLT once it is resolved _would_ be an
> optimization.

Possibly.  Sometimes making the call sequence seem more efficient runs into
stalls particularly when the called function is short.

>  Asks for loop peeling I guess?

Yeah, that might be one way to get the first call of a function inside a loop
over and done with.  And so you'd know the PLT entry was resolved and thus no
longer volatile.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2020-03-12 13:34 ` amodra at gmail dot com
@ 2020-03-12 13:57 ` rguenth at gcc dot gnu.org
  2020-03-12 15:38 ` segher at gcc dot gnu.org
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-03-12 13:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Alan Modra from comment #8)
> (In reply to Richard Biener from comment #7)
> > OTOH CSEing the load from the PLT once it is resolved _would_ be an
> > optimization.
> 
> Possibly.  Sometimes making the call sequence seem more efficient runs into
> stalls particularly when the called function is short.
>
> >  Asks for loop peeling I guess?
> 
> Yeah, that might be one way to get the first call of a function inside a
> loop over and done with.  And so you'd know the PLT entry was resolved and
> thus no longer volatile.

I suppose there's no (portable) way to "speculate" the call, thus _just_
eventually resolve the PLT?  That way we could do such hoisted PLT loads
as

  load PTL + speculate call
  load PTL

or rather always do

  resolve-and-load-PLT

since the times we want to lazily load the PLT with resolving later are
scarce?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2020-03-12 13:57 ` rguenth at gcc dot gnu.org
@ 2020-03-12 15:38 ` segher at gcc dot gnu.org
  2020-03-27 22:16 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: segher at gcc dot gnu.org @ 2020-03-12 15:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #10 from Segher Boessenkool <segher at gcc dot gnu.org> ---
The resolved address can only change before the first call to it, so we
could even automatically insert code checking that, perhaps.

My other concern is not slowing down the code if LD_BIND_NOW is in use.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2020-03-12 15:38 ` segher at gcc dot gnu.org
@ 2020-03-27 22:16 ` cvs-commit at gcc dot gnu.org
  2020-05-01  1:18 ` cvs-commit at gcc dot gnu.org
  2020-05-01  1:19 ` amodra at gmail dot com
  13 siblings, 0 replies; 15+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-03-27 22:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Alan Modra <amodra@gcc.gnu.org>:

https://gcc.gnu.org/g:19e5389debb03c3623f6a2ce8a8f6f4aa2118901

commit r10-7430-g19e5389debb03c3623f6a2ce8a8f6f4aa2118901
Author: Alan Modra <amodra@gmail.com>
Date:   Wed Mar 11 21:22:37 2020 +1030

    [RS6000] PR94145, make PLT loads volatile

    The PLT is volatile.  On PowerPC it is a bss style section which the
    dynamic loader initialises to point at resolver stubs (called glink on
    PowerPC64) to support lazy resolution of function addresses.  The
    first call to a given function goes via the dynamic loader symbol
    resolver, which updates the PLT entry for that function and calls the
    function.  The second call, if there is one and we don't have a
    multi-threaded race, will use the updated PLT entry and thus avoid
    the relatively slow symbol resolver path.

    Calls via the PLT are like calls via a function pointer, except that
    no initialised function pointer is volatile like the PLT.  All
    initialised function pointers are resolved at program startup to point
    at the function or are left as NULL.  There is no support for lazy
    resolution of any user visible function pointer.

    So why does any of this matter to gcc?  Well, normally the PLT call
    mechanism happens entirely behind gcc's back, but since we implemented
    inline PLT calls (effectively putting the PLT code stub that loads the
    PLT entry inline and making that code sequence scheduled), the load of
    the PLT entry is visible to gcc.  That load then is subject to gcc
    optimization, for example in

    /* -S -mcpu=future -mpcrel -mlongcall -O2.  */
    int foo (int);
    void bar (void)
    {
      while (foo(0))
        foo (99);
    }

    we see the PLT load for foo being hoisted out of the loop and stashed
    in a call-saved register.  If that happens to be the first call to
    foo, then the stashed value is that for the resolver stub, and every
    call to foo in the loop will then go via the slow resolver path.  Not
    a good idea.  Also, if foo turns out to be a local function and the
    linker replaces the PLT calls with direct calls to foo then gcc has
    just wasted a call-saved register.

    This patch teaches gcc that the PLT loads are volatile.  The change
    doesn't affect other loads of function pointers and thus has no effect
    on normal indirect function calls.  Note that because the
    "optimization" this patch prevents can only occur over function calls,
    the only place gcc can stash PLT loads is in call-saved registers or
    in other memory.  I'm reasonably confident that this change will be
    neutral or positive for the "ld -z now" case where the PLT is not
    volatile, in code where there is any register pressure.  Even if gcc
    could be taught to recognise cases where the PLT is resolved, you'd
    need to discount use of registers to cache PLT loads by some factor
    involving the chance that those calls would be converted to direct
    calls.

            PR target/94145
            * config/rs6000/rs6000.c (rs6000_longcall_ref): Use unspec_volatile
            for PLT16_LO and PLT_PCREL.
            * config/rs6000/rs6000.md (UNSPEC_PLT16_LO, UNSPEC_PLT_PCREL):
Remove.
            (UNSPECV_PLT16_LO, UNSPECV_PLT_PCREL): Define.
            (pltseq_plt16_lo_, pltseq_plt_pcrel): Use unspec_volatile.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2020-03-27 22:16 ` cvs-commit at gcc dot gnu.org
@ 2020-05-01  1:18 ` cvs-commit at gcc dot gnu.org
  2020-05-01  1:19 ` amodra at gmail dot com
  13 siblings, 0 replies; 15+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2020-05-01  1:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-9 branch has been updated by Alan Modra <amodra@gcc.gnu.org>:

https://gcc.gnu.org/g:0c3519218fb11bdde5356aec9fcac133b4988698

commit r9-8556-g0c3519218fb11bdde5356aec9fcac133b4988698
Author: Alan Modra <amodra@gmail.com>
Date:   Wed Mar 11 21:22:37 2020 +1030

    [RS6000] PR94145, make PLT loads volatile

            PR target/94145
            * config/rs6000/rs6000.c (rs6000_longcall_ref): Use unspec_volatile
            for PLT16_LO.
            * config/rs6000/rs6000.md (UNSPEC_PLT16_LO): Remove.
            (UNSPECV_PLT16_LO): Define.
            (pltseq_plt16_lo_): Use unspec_volatile.

    (cherry picked from commit 19e5389debb03c3623f6a2ce8a8f6f4aa2118901)
    minus the PLT_PCREL parts

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Bug target/94145] Longcalls mis-optimize loading the function address
  2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2020-05-01  1:18 ` cvs-commit at gcc dot gnu.org
@ 2020-05-01  1:19 ` amodra at gmail dot com
  13 siblings, 0 replies; 15+ messages in thread
From: amodra at gmail dot com @ 2020-05-01  1:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94145

Alan Modra <amodra at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #13 from Alan Modra <amodra at gmail dot com> ---
Fixed for gcc-10 and next gcc-9 release.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-05-01  1:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-11 16:00 [Bug target/94145] New: Longcalls mis-optimize loading the function address meissner at gcc dot gnu.org
2020-03-11 18:27 ` [Bug target/94145] " meissner at gcc dot gnu.org
2020-03-11 21:33 ` segher at gcc dot gnu.org
2020-03-11 21:35 ` segher at gcc dot gnu.org
2020-03-11 22:50 ` amodra at gmail dot com
2020-03-11 22:50 ` amodra at gmail dot com
2020-03-12  8:48 ` rguenth at gcc dot gnu.org
2020-03-12 13:00 ` amodra at gmail dot com
2020-03-12 13:17 ` rguenth at gcc dot gnu.org
2020-03-12 13:34 ` amodra at gmail dot com
2020-03-12 13:57 ` rguenth at gcc dot gnu.org
2020-03-12 15:38 ` segher at gcc dot gnu.org
2020-03-27 22:16 ` cvs-commit at gcc dot gnu.org
2020-05-01  1:18 ` cvs-commit at gcc dot gnu.org
2020-05-01  1:19 ` amodra at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).