[RS6000] make PLT loads volatile

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [RS6000] make PLT loads volatile
@ 2020-03-12  2:48 Alan Modra
  2020-03-12 16:57 ` Segher Boessenkool
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Modra @ 2020-03-12  2:48 UTC (permalink / raw)
  To: gcc-patches; +Cc: Segher Boessenkool

With lazy PLT resolution the first load of a PLT entry may be a value
pointing at a resolver stub.  gcc's loop processing can result in the
PLT load in inline PLT calls being hoisted out of a loop in the
mistaken idea that this is an optimisation.  It isn't.  If the value
hoisted was that for a resolver stub then every call to that function
in the loop will go via the resolver, slowing things down quite
dramatically.

The PLT really is volatile, so teach gcc about that.

Bootstrapped and regression tested on powerpc64le-linux and tested
with a spec build using -mlongcalls.  OK for mainline?

	* config/rs6000/rs6000.c (rs6000_longcall_ref): Use unspec_volatile
	for PLT16_LO and PLT_PCREL.
	* config/rs6000/rs6000.md (UNSPEC_PLT16_LO, UNSPEC_PLT_PCREL): Remove.
	(UNSPECV_PLT16_LO, UNSPECV_PLT_PCREL): Define.
	(pltseq_plt16_lo_, pltseq_plt_pcrel): Use unspec_volatile.

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 46b7dec2abd..2d6790877fa 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -19264,8 +19264,9 @@ rs6000_longcall_ref (rtx call_ref, rtx arg)
       if (rs6000_pcrel_p (cfun))
 	{
 	  rtx reg = gen_rtx_REG (Pmode, regno);
-	  rtx u = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, base, call_ref, arg),
-				  UNSPEC_PLT_PCREL);
+	  rtx u = gen_rtx_UNSPEC_VOLATILE (Pmode,
+					   gen_rtvec (3, base, call_ref, arg),
+					   UNSPECV_PLT_PCREL);
 	  emit_insn (gen_rtx_SET (reg, u));
 	  return reg;
 	}
@@ -19284,8 +19285,9 @@ rs6000_longcall_ref (rtx call_ref, rtx arg)
       rtx reg = gen_rtx_REG (Pmode, regno);
       rtx hi = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, base, call_ref, arg),
 			       UNSPEC_PLT16_HA);
-      rtx lo = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, reg, call_ref, arg),
-			       UNSPEC_PLT16_LO);
+      rtx lo = gen_rtx_UNSPEC_VOLATILE (Pmode,
+					gen_rtvec (3, reg, call_ref, arg),
+					UNSPECV_PLT16_LO);
       emit_insn (gen_rtx_SET (reg, hi));
       emit_insn (gen_rtx_SET (reg, lo));
       return reg;
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index ad88b6783af..5a8e9de670b 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -148,8 +148,6 @@
    UNSPEC_SI_FROM_SF
    UNSPEC_PLTSEQ
    UNSPEC_PLT16_HA
-   UNSPEC_PLT16_LO
-   UNSPEC_PLT_PCREL
   ])
 
 ;;
@@ -178,6 +176,8 @@
    UNSPECV_MTFSB1		; Set FPSCR Field bit to 1
    UNSPECV_SPLIT_STACK_RETURN   ; A camouflaged return
    UNSPECV_SPEC_BARRIER         ; Speculation barrier
+   UNSPECV_PLT16_LO
+   UNSPECV_PLT_PCREL
   ])
 
 ; The three different kinds of epilogue.
@@ -10359,10 +10359,10 @@
 
 (define_insn "*pltseq_plt16_lo_<mode>"
   [(set (match_operand:P 0 "gpc_reg_operand" "=r")
-	(unspec:P [(match_operand:P 1 "gpc_reg_operand" "b")
-		   (match_operand:P 2 "symbol_ref_operand" "s")
-		   (match_operand:P 3 "" "")]
-		  UNSPEC_PLT16_LO))]
+	(unspec_volatile:P [(match_operand:P 1 "gpc_reg_operand" "b")
+			    (match_operand:P 2 "symbol_ref_operand" "s")
+			    (match_operand:P 3 "" "")]
+			   UNSPECV_PLT16_LO))]
   "TARGET_PLTSEQ"
 {
   return rs6000_pltseq_template (operands, RS6000_PLTSEQ_PLT16_LO);
@@ -10382,10 +10382,10 @@
 
 (define_insn "*pltseq_plt_pcrel<mode>"
   [(set (match_operand:P 0 "gpc_reg_operand" "=r")
-	(unspec:P [(match_operand:P 1 "" "")
-		   (match_operand:P 2 "symbol_ref_operand" "s")
-		   (match_operand:P 3 "" "")]
-		  UNSPEC_PLT_PCREL))]
+	(unspec_volatile:P [(match_operand:P 1 "" "")
+			    (match_operand:P 2 "symbol_ref_operand" "s")
+			    (match_operand:P 3 "" "")]
+			   UNSPECV_PLT_PCREL))]
   "HAVE_AS_PLTSEQ && TARGET_ELF
    && rs6000_pcrel_p (cfun)"
 {

-- 
Alan Modra
Australia Development Lab, IBM

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RS6000] make PLT loads volatile
  2020-03-12  2:48 [RS6000] make PLT loads volatile Alan Modra
@ 2020-03-12 16:57 ` Segher Boessenkool
  2020-03-12 23:36   ` Alan Modra
  0 siblings, 1 reply; 8+ messages in thread
From: Segher Boessenkool @ 2020-03-12 16:57 UTC (permalink / raw)
  To: Alan Modra; +Cc: gcc-patches

Hi!

On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote:
> With lazy PLT resolution the first load of a PLT entry may be a value
> pointing at a resolver stub.  gcc's loop processing can result in the
> PLT load in inline PLT calls being hoisted out of a loop in the
> mistaken idea that this is an optimisation.  It isn't.  If the value
> hoisted was that for a resolver stub then every call to that function
> in the loop will go via the resolver, slowing things down quite
> dramatically.
> 
> The PLT really is volatile, so teach gcc about that.

It would be nice if we could keep it cached after it has been resolved
once, this has potential for regressing performance if we don't?  And
LD_BIND_NOW should keep working just as fast as it is now, too?


Segher

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RS6000] make PLT loads volatile
  2020-03-12 16:57 ` Segher Boessenkool
@ 2020-03-12 23:36   ` Alan Modra
  2020-03-13 15:40     ` Segher Boessenkool
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Modra @ 2020-03-12 23:36 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

On Thu, Mar 12, 2020 at 11:57:17AM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote:
> > With lazy PLT resolution the first load of a PLT entry may be a value
> > pointing at a resolver stub.  gcc's loop processing can result in the
> > PLT load in inline PLT calls being hoisted out of a loop in the
> > mistaken idea that this is an optimisation.  It isn't.  If the value
> > hoisted was that for a resolver stub then every call to that function
> > in the loop will go via the resolver, slowing things down quite
> > dramatically.
> > 
> > The PLT really is volatile, so teach gcc about that.
> 
> It would be nice if we could keep it cached after it has been resolved
> once, this has potential for regressing performance if we don't?  And
> LD_BIND_NOW should keep working just as fast as it is now, too?

Using a call-saved register to cache a load out of the PLT looks
really silly when the inline PLT call is turned back into a direct
call by the linker.  You end up with an unnecessary save and restore
of the register, plus copies from the register to r12.  What's the
chance of someone reporting that as a gcc "bug"?  :-)  Then there's
the possibility that shortening the number of instructions between two
calls of a small function runs into stalls.

How can we teach gcc about these unknowns?  ie. How to weight use of a
call-saved register to cache PLT loads against other possible uses of
that register in a loop?  It's quite likely not a good use, even when
gcc knows the PLT entry has been resolved..  Which means some gcc
infrastructure would be needed to do this sensibly and without the
necessary infrastructure, I think gcc hoisting a PLT load out of a
loop should never be done.

-- 
Alan Modra
Australia Development Lab, IBM

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RS6000] make PLT loads volatile
  2020-03-12 23:36   ` Alan Modra
@ 2020-03-13 15:40     ` Segher Boessenkool
  2020-03-13 23:00       ` Alan Modra
  0 siblings, 1 reply; 8+ messages in thread
From: Segher Boessenkool @ 2020-03-13 15:40 UTC (permalink / raw)
  To: Alan Modra; +Cc: gcc-patches

Hi!

On Fri, Mar 13, 2020 at 10:06:01AM +1030, Alan Modra wrote:
> On Thu, Mar 12, 2020 at 11:57:17AM -0500, Segher Boessenkool wrote:
> > On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote:
> > > With lazy PLT resolution the first load of a PLT entry may be a value
> > > pointing at a resolver stub.  gcc's loop processing can result in the
> > > PLT load in inline PLT calls being hoisted out of a loop in the
> > > mistaken idea that this is an optimisation.  It isn't.  If the value
> > > hoisted was that for a resolver stub then every call to that function
> > > in the loop will go via the resolver, slowing things down quite
> > > dramatically.
> > > 
> > > The PLT really is volatile, so teach gcc about that.
> > 
> > It would be nice if we could keep it cached after it has been resolved
> > once, this has potential for regressing performance if we don't?  And
> > LD_BIND_NOW should keep working just as fast as it is now, too?
> 
> Using a call-saved register to cache a load out of the PLT looks
> really silly

Who said anything about using call-saved registers?  GCC will usually
make a stack slot for this, and only use a non-volatile register when
that is profitable.  (I know it is a bit too aggressive with it, but
that is a generic problem).

> when the inline PLT call is turned back into a direct
> call by the linker.

Ah, so yeah, for direct calls we do not want this.  I was thinking this
was about indirect calls (via a bctrl that is), dunno how I got that
misperception.  Sorry.

What is this like for indirect calls (at C level)?  Does your patch do
anything to those?


Segher

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RS6000] make PLT loads volatile
  2020-03-13 15:40     ` Segher Boessenkool
@ 2020-03-13 23:00       ` Alan Modra
  2020-03-18 21:53         ` Segher Boessenkool
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Modra @ 2020-03-13 23:00 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

On Fri, Mar 13, 2020 at 10:40:38AM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Mar 13, 2020 at 10:06:01AM +1030, Alan Modra wrote:
> > On Thu, Mar 12, 2020 at 11:57:17AM -0500, Segher Boessenkool wrote:
> > > On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote:
> > > > With lazy PLT resolution the first load of a PLT entry may be a value
> > > > pointing at a resolver stub.  gcc's loop processing can result in the
> > > > PLT load in inline PLT calls being hoisted out of a loop in the
> > > > mistaken idea that this is an optimisation.  It isn't.  If the value
> > > > hoisted was that for a resolver stub then every call to that function
> > > > in the loop will go via the resolver, slowing things down quite
> > > > dramatically.
> > > > 
> > > > The PLT really is volatile, so teach gcc about that.
> > > 
> > > It would be nice if we could keep it cached after it has been resolved
> > > once, this has potential for regressing performance if we don't?  And
> > > LD_BIND_NOW should keep working just as fast as it is now, too?
> > 
> > Using a call-saved register to cache a load out of the PLT looks
> > really silly
> 
> Who said anything about using call-saved registers?  GCC will usually
> make a stack slot for this, and only use a non-volatile register when
> that is profitable.  (I know it is a bit too aggressive with it, but
> that is a generic problem).

Using a stack slot comes about due to hoisting then running out of
call-saved registers in the loop.  Score another reason not to hoist
PLT loads.

> > when the inline PLT call is turned back into a direct
> > call by the linker.
> 
> Ah, so yeah, for direct calls we do not want this.  I was thinking this
> was about indirect calls (via a bctrl that is), dunno how I got that
> misperception.  Sorry.
> 
> What is this like for indirect calls (at C level)?  Does your patch do
> anything to those?

No effect at all.  To put your mind at rest on this point you can
verify quite easily by noticing that UNSPECV_PLT* is only generated in
rs6000_longcall_ref, and calls to that function are conditional on
GET_CODE (func_desc) == SYMBOL_REF.

-- 
Alan Modra
Australia Development Lab, IBM

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RS6000] make PLT loads volatile
  2020-03-13 23:00       ` Alan Modra
@ 2020-03-18 21:53         ` Segher Boessenkool
  2020-03-23  1:51           ` [RS6000] PR94145, " Alan Modra
  0 siblings, 1 reply; 8+ messages in thread
From: Segher Boessenkool @ 2020-03-18 21:53 UTC (permalink / raw)
  To: Alan Modra; +Cc: gcc-patches

Hi!

On Sat, Mar 14, 2020 at 09:30:02AM +1030, Alan Modra wrote:
> On Fri, Mar 13, 2020 at 10:40:38AM -0500, Segher Boessenkool wrote:
> > > Using a call-saved register to cache a load out of the PLT looks
> > > really silly
> > 
> > Who said anything about using call-saved registers?  GCC will usually
> > make a stack slot for this, and only use a non-volatile register when
> > that is profitable.  (I know it is a bit too aggressive with it, but
> > that is a generic problem).
> 
> Using a stack slot comes about due to hoisting then running out of
> call-saved registers in the loop.  Score another reason not to hoist
> PLT loads.

LRA can do this directly, without ever using call-saved registers.  There
are some other passes that can do rematerialisation as well.  Not enough
yet, but GCC does *not* use non-volatile registers to save values, unless
it thinks that is cheaper (which it currently thinks too often, but that
is a separate problem).

> > > when the inline PLT call is turned back into a direct
> > > call by the linker.
> > 
> > Ah, so yeah, for direct calls we do not want this.  I was thinking this
> > was about indirect calls (via a bctrl that is), dunno how I got that
> > misperception.  Sorry.
> > 
> > What is this like for indirect calls (at C level)?  Does your patch do
> > anything to those?
> 
> No effect at all.  To put your mind at rest on this point you can
> verify quite easily by noticing that UNSPECV_PLT* is only generated in
> rs6000_longcall_ref, and calls to that function are conditional on
> GET_CODE (func_desc) == SYMBOL_REF.

Could you please send a new patch (could be the same patch even) that
is easier to review for me?  With things like all of the above info in
the message (describing the setting, the problem, and the solution).

Or should I read the original message another ten times until it clicks?
It certainly is possible this matter just is hard to grasp :-/

Thanks in advance,


Segher

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RS6000] PR94145, make PLT loads volatile
  2020-03-18 21:53         ` Segher Boessenkool
@ 2020-03-23  1:51           ` Alan Modra
  2020-03-26 17:12             ` Segher Boessenkool
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Modra @ 2020-03-23  1:51 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

On Wed, Mar 18, 2020 at 04:53:59PM -0500, Segher Boessenkool wrote:
> Could you please send a new patch (could be the same patch even) that
> is easier to review for me?

The PLT is volatile.  On PowerPC it is a bss style section which the
dynamic loader initialises to point at resolver stubs (called glink on
PowerPC64) to support lazy resolution of function addresses.  The
first call to a given function goes via the dynamic loader symbol
resolver, which updates the PLT entry for that function and calls the
function.  The second call, if there is one and we don't have a
multi-threaded race, will use the updated PLT entry and thus avoid
the relatively slow symbol resolver path.

Calls via the PLT are like calls via a function pointer, except that
no initialised function pointer is volatile like the PLT.  All
initialised function pointers are resolved at program startup to point
at the function or are left as NULL.  There is no support for lazy
resolution of any user visible function pointer.

So why does any of this matter to gcc?  Well, normally the PLT call
mechanism happens entirely behind gcc's back, but since we implemented
inline PLT calls (effectively putting the PLT code stub that loads the
PLT entry inline and making that code sequence scheduled), the load of
the PLT entry is visible to gcc.  That load then is subject to gcc
optimization, for example in

/* -S -mcpu=future -mpcrel -mlongcall -O2.  */
int foo (int);
void bar (void)
{
  while (foo(0))
    foo (99);
}

we see the PLT load for foo being hoisted out of the loop and stashed
in a call-saved register.  If that happens to be the first call to
foo, then the stashed value is that for the resolver stub, and every
call to foo in the loop will then go via the slow resolver path.  Not
a good idea.  Also, if foo turns out to be a local function and the
linker replaces the PLT calls with direct calls to foo then gcc has
just wasted a call-saved register.

This patch teaches gcc that the PLT loads are volatile.  The change
doesn't affect other loads of function pointers and thus has no effect
on normal indirect function calls.  Note that because the
"optimization" this patch prevents can only occur over function calls,
the only place gcc can stash PLT loads is in call-saved registers or
in other memory.  I'm reasonably confident that this change will be
neutral or positive for the "ld -z now" case where the PLT is not
volatile, in code where there is any register pressure.  Even if gcc
could be taught to recognise cases where the PLT is resolved, you'd
need to discount use of registers to cache PLT loads by some factor
involving the chance that those calls would be converted to direct
calls..

	PR target/94145
	* config/rs6000/rs6000.c (rs6000_longcall_ref): Use unspec_volatile
	for PLT16_LO and PLT_PCREL.
	* config/rs6000/rs6000.md (UNSPEC_PLT16_LO, UNSPEC_PLT_PCREL): Remove.
	(UNSPECV_PLT16_LO, UNSPECV_PLT_PCREL): Define.
	(pltseq_plt16_lo_, pltseq_plt_pcrel): Use unspec_volatile.

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 07f7cf516ba..68046fdb5ee 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -19274,8 +19274,9 @@ rs6000_longcall_ref (rtx call_ref, rtx arg)
       if (rs6000_pcrel_p (cfun))
 	{
 	  rtx reg = gen_rtx_REG (Pmode, regno);
-	  rtx u = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, base, call_ref, arg),
-				  UNSPEC_PLT_PCREL);
+	  rtx u = gen_rtx_UNSPEC_VOLATILE (Pmode,
+					   gen_rtvec (3, base, call_ref, arg),
+					   UNSPECV_PLT_PCREL);
 	  emit_insn (gen_rtx_SET (reg, u));
 	  return reg;
 	}
@@ -19294,8 +19295,9 @@ rs6000_longcall_ref (rtx call_ref, rtx arg)
       rtx reg = gen_rtx_REG (Pmode, regno);
       rtx hi = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, base, call_ref, arg),
 			       UNSPEC_PLT16_HA);
-      rtx lo = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, reg, call_ref, arg),
-			       UNSPEC_PLT16_LO);
+      rtx lo = gen_rtx_UNSPEC_VOLATILE (Pmode,
+					gen_rtvec (3, reg, call_ref, arg),
+					UNSPECV_PLT16_LO);
       emit_insn (gen_rtx_SET (reg, hi));
       emit_insn (gen_rtx_SET (reg, lo));
       return reg;
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index ad88b6783af..5a8e9de670b 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -148,8 +148,6 @@
    UNSPEC_SI_FROM_SF
    UNSPEC_PLTSEQ
    UNSPEC_PLT16_HA
-   UNSPEC_PLT16_LO
-   UNSPEC_PLT_PCREL
   ])

 ;;
@@ -178,6 +176,8 @@
    UNSPECV_MTFSB1		; Set FPSCR Field bit to 1
    UNSPECV_SPLIT_STACK_RETURN   ; A camouflaged return
    UNSPECV_SPEC_BARRIER         ; Speculation barrier
+   UNSPECV_PLT16_LO
+   UNSPECV_PLT_PCREL
   ])

 ; The three different kinds of epilogue.
@@ -10359,10 +10359,10 @@

 (define_insn "*pltseq_plt16_lo_<mode>"
   [(set (match_operand:P 0 "gpc_reg_operand" "=r")
-	(unspec:P [(match_operand:P 1 "gpc_reg_operand" "b")
-		   (match_operand:P 2 "symbol_ref_operand" "s")
-		   (match_operand:P 3 "" "")]
-		  UNSPEC_PLT16_LO))]
+	(unspec_volatile:P [(match_operand:P 1 "gpc_reg_operand" "b")
+			    (match_operand:P 2 "symbol_ref_operand" "s")
+			    (match_operand:P 3 "" "")]
+			   UNSPECV_PLT16_LO))]
   "TARGET_PLTSEQ"
 {
   return rs6000_pltseq_template (operands, RS6000_PLTSEQ_PLT16_LO);
@@ -10382,10 +10382,10 @@

 (define_insn "*pltseq_plt_pcrel<mode>"
   [(set (match_operand:P 0 "gpc_reg_operand" "=r")
-	(unspec:P [(match_operand:P 1 "" "")
-		   (match_operand:P 2 "symbol_ref_operand" "s")
-		   (match_operand:P 3 "" "")]
-		  UNSPEC_PLT_PCREL))]
+	(unspec_volatile:P [(match_operand:P 1 "" "")
+			    (match_operand:P 2 "symbol_ref_operand" "s")
+			    (match_operand:P 3 "" "")]
+			   UNSPECV_PLT_PCREL))]
   "HAVE_AS_PLTSEQ && TARGET_ELF
    && rs6000_pcrel_p (cfun)"
 {

-- 
Alan Modra
Australia Development Lab, IBM

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RS6000] PR94145, make PLT loads volatile
  2020-03-23  1:51           ` [RS6000] PR94145, " Alan Modra
@ 2020-03-26 17:12             ` Segher Boessenkool
  0 siblings, 0 replies; 8+ messages in thread
From: Segher Boessenkool @ 2020-03-26 17:12 UTC (permalink / raw)
  To: Alan Modra; +Cc: gcc-patches

Hi!

On Mon, Mar 23, 2020 at 12:21:03PM +1030, Alan Modra wrote:
> On Wed, Mar 18, 2020 at 04:53:59PM -0500, Segher Boessenkool wrote:
> > Could you please send a new patch (could be the same patch even) that
> > is easier to review for me?
> 
> The PLT is volatile.  On PowerPC it is a bss style section which the
> dynamic loader initialises to point at resolver stubs (called glink on
> PowerPC64) to support lazy resolution of function addresses.  The
> first call to a given function goes via the dynamic loader symbol
> resolver, which updates the PLT entry for that function and calls the
> function.  The second call, if there is one and we don't have a
> multi-threaded race, will use the updated PLT entry and thus avoid
> the relatively slow symbol resolver path.

Okay, so it isn't volatile, we have the guarantee that it will stay the
same after we have called the function once (on this same execution
thread)?

> Calls via the PLT are like calls via a function pointer, except that
> no initialised function pointer is volatile like the PLT.  All
> initialised function pointers are resolved at program startup to point
> at the function or are left as NULL.  There is no support for lazy
> resolution of any user visible function pointer.
> 
> So why does any of this matter to gcc?  Well, normally the PLT call
> mechanism happens entirely behind gcc's back, but since we implemented
> inline PLT calls (effectively putting the PLT code stub that loads the
> PLT entry inline and making that code sequence scheduled), the load of
> the PLT entry is visible to gcc.  That load then is subject to gcc
> optimization, for example in
> 
> /* -S -mcpu=future -mpcrel -mlongcall -O2.  */
> int foo (int);
> void bar (void)
> {
>   while (foo(0))
>     foo (99);
> }
> 
> we see the PLT load for foo being hoisted out of the loop and stashed
> in a call-saved register.  If that happens to be the first call to
> foo, then the stashed value is that for the resolver stub, and every
> call to foo in the loop will then go via the slow resolver path.  Not
> a good idea.  Also, if foo turns out to be a local function and the
> linker replaces the PLT calls with direct calls to foo then gcc has
> just wasted a call-saved register.

So you are saying that calling the PLT directly is always faster than
calling via a function pointer, even if that is the correct resolved
address?  That is the part I am worried about.

I think that is right, but I don't quite see it, and I don't know what
is done at runtime nearly well enough :-/

> This patch teaches gcc that the PLT loads are volatile.  The change
> doesn't affect other loads of function pointers and thus has no effect
> on normal indirect function calls.  Note that because the
> "optimization" this patch prevents can only occur over function calls,
> the only place gcc can stash PLT loads is in call-saved registers or
> in other memory.  I'm reasonably confident that this change will be
> neutral or positive for the "ld -z now" case where the PLT is not
> volatile, in code where there is any register pressure.

That is good enough for me :-)

> Even if gcc
> could be taught to recognise cases where the PLT is resolved, you'd
> need to discount use of registers to cache PLT loads by some factor
> involving the chance that those calls would be converted to direct
> calls..

> 	PR target/94145
> 	* config/rs6000/rs6000.c (rs6000_longcall_ref): Use unspec_volatile
> 	for PLT16_LO and PLT_PCREL.
> 	* config/rs6000/rs6000.md (UNSPEC_PLT16_LO, UNSPEC_PLT_PCREL): Remove.
> 	(UNSPECV_PLT16_LO, UNSPECV_PLT_PCREL): Define.
> 	(pltseq_plt16_lo_, pltseq_plt_pcrel): Use unspec_volatile.

Okay for trunk.  Thank you!


Segher

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-03-26 17:12 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-12  2:48 [RS6000] make PLT loads volatile Alan Modra
2020-03-12 16:57 ` Segher Boessenkool
2020-03-12 23:36   ` Alan Modra
2020-03-13 15:40     ` Segher Boessenkool
2020-03-13 23:00       ` Alan Modra
2020-03-18 21:53         ` Segher Boessenkool
2020-03-23  1:51           ` [RS6000] PR94145, " Alan Modra
2020-03-26 17:12             ` Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).