[PATCH i386] Move CLOBBERED_REGS earlier in register class list

public inbox for gcc-patches@gcc.gnu.org

* [PATCH i386] Move CLOBBERED_REGS earlier in register class list
  2015-05-04 16:38 PIC calls without PLT, generic implementation Alexander Monakov
  2015-05-04 16:38 ` [PATCH i386] Extend sibcall peepholes to allow source in %eax Alexander Monakov
@ 2015-05-04 16:38 ` Alexander Monakov
  2015-05-10 16:44   ` Jan Hubicka
  2015-05-04 16:38 ` [PATCH i386] PR65753: allow PIC tail calls via function pointers Alexander Monakov
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-04 16:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Alexander Monakov, Rich Felker

On 32-bit x86, register class CLOBBERED_REGS is a proper subset of
LEGACY_REGS, which causes IRA not to consider it separately for register
allocation, even when it has lower cost than other classes.  This patch helps
fix a code generation problem that appears with no-PLT PIC tail calls.

Was there a specific reason for the CLOBBERED_REGS class to be listed as late
as it is?  On 32-bit this class contains only EAX, ECX, EDX.
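
The subset relation can be checked with a minimal C sketch that uses the
first-word masks from REG_CLASS_CONTENTS in this patch (0x03, 0x07, 0x0f and
0x1100ff).  The helper names is_proper_subset and check_class_nesting are
illustrative assumptions, not GCC functions, and the sketch ignores the second
and third mask words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First mask word of each class, copied from REG_CLASS_CONTENTS.  */
#define AD_REGS_MASK        UINT32_C(0x03)
#define CLOBBERED_REGS_MASK UINT32_C(0x07)
#define Q_REGS_MASK         UINT32_C(0x0f)
#define LEGACY_REGS_MASK    UINT32_C(0x1100ff)

/* Hypothetical helper (not a GCC function): true if every register in A
   is also in B, and B has at least one register that A lacks.  */
static bool
is_proper_subset (uint32_t a, uint32_t b)
{
  return (a & ~b) == 0 && a != b;
}

/* Checks that the new slot between AD_REGS and Q_REGS keeps this part of
   the list ordered from smaller to larger classes.  */
static void
check_class_nesting (void)
{
  assert (is_proper_subset (AD_REGS_MASK, CLOBBERED_REGS_MASK));
  assert (is_proper_subset (CLOBBERED_REGS_MASK, Q_REGS_MASK));
  assert (is_proper_subset (CLOBBERED_REGS_MASK, LEGACY_REGS_MASK));
}
```

Calling check_class_nesting () from a test driver passes silently when the
nesting holds for these masks.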

OK?
	* config/i386/i386.h (enum reg_class): Move CLOBBERED_REGS before Q_REGS.
	(REG_CLASS_NAMES): Ditto.
	(REG_CLASS_CONTENTS): Ditto.

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 1e755d3..75071ac 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1300,17 +1300,17 @@ extern const char *host_detect_local_cpu (int argc, const char **argv);
 
 enum reg_class
 {
   NO_REGS,
   AREG, DREG, CREG, BREG, SIREG, DIREG,
   AD_REGS,			/* %eax/%edx for DImode */
+  CLOBBERED_REGS,		/* call-clobbered integer registers */
   Q_REGS,			/* %eax %ebx %ecx %edx */
   NON_Q_REGS,			/* %esi %edi %ebp %esp */
   INDEX_REGS,			/* %eax %ebx %ecx %edx %esi %edi %ebp */
   LEGACY_REGS,			/* %eax %ebx %ecx %edx %esi %edi %ebp %esp */
-  CLOBBERED_REGS,		/* call-clobbered integer registers */
   GENERAL_REGS,			/* %eax %ebx %ecx %edx %esi %edi %ebp %esp
 				   %r8 %r9 %r10 %r11 %r12 %r13 %r14 %r15 */
   FP_TOP_REG, FP_SECOND_REG,	/* %st(0) %st(1) */
   FLOAT_REGS,
   SSE_FIRST_REG,
   NO_REX_SSE_REGS,
@@ -1361,16 +1361,16 @@ enum reg_class
 
 #define REG_CLASS_NAMES \
 {  "NO_REGS",				\
    "AREG", "DREG", "CREG", "BREG",	\
    "SIREG", "DIREG",			\
    "AD_REGS",				\
+   "CLOBBERED_REGS",			\
    "Q_REGS", "NON_Q_REGS",		\
    "INDEX_REGS",			\
    "LEGACY_REGS",			\
-   "CLOBBERED_REGS",			\
    "GENERAL_REGS",			\
    "FP_TOP_REG", "FP_SECOND_REG",	\
    "FLOAT_REGS",			\
    "SSE_FIRST_REG",			\
    "NO_REX_SSE_REGS",			\
    "SSE_REGS",				\
@@ -1400,17 +1400,17 @@ enum reg_class
       { 0x02,       0x0,    0x0 },       /* DREG */                      \
       { 0x04,       0x0,    0x0 },       /* CREG */                      \
       { 0x08,       0x0,    0x0 },       /* BREG */                      \
       { 0x10,       0x0,    0x0 },       /* SIREG */                     \
       { 0x20,       0x0,    0x0 },       /* DIREG */                     \
       { 0x03,       0x0,    0x0 },       /* AD_REGS */                   \
+      { 0x07,       0x0,    0x0 },       /* CLOBBERED_REGS */            \
       { 0x0f,       0x0,    0x0 },       /* Q_REGS */                    \
   { 0x1100f0,    0x1fe0,    0x0 },       /* NON_Q_REGS */                \
       { 0x7f,    0x1fe0,    0x0 },       /* INDEX_REGS */                \
   { 0x1100ff,       0x0,    0x0 },       /* LEGACY_REGS */               \
-      { 0x07,       0x0,    0x0 },       /* CLOBBERED_REGS */            \
   { 0x1100ff,    0x1fe0,    0x0 },       /* GENERAL_REGS */              \
      { 0x100,       0x0,    0x0 },       /* FP_TOP_REG */                \
     { 0x0200,       0x0,    0x0 },       /* FP_SECOND_REG */             \
     { 0xff00,       0x0,    0x0 },       /* FLOAT_REGS */                \
   { 0x200000,       0x0,    0x0 },       /* SSE_FIRST_REG */             \
 { 0x1fe00000,  0x000000,    0x0 },       /* NO_REX_SSE_REGS */           \

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [RFC PATCH] ira: accept loads via argp rtx in validate_equiv_mem
  2015-05-04 16:38 PIC calls without PLT, generic implementation Alexander Monakov
                   ` (2 preceding siblings ...)
  2015-05-04 16:38 ` [PATCH i386] PR65753: allow PIC tail calls via function pointers Alexander Monakov
@ 2015-05-04 16:38 ` Alexander Monakov
  2015-05-04 17:37   ` Jeff Law
  2015-05-04 16:38 ` [PATCH] Expand PIC calls without PLT with -fno-plt Alexander Monakov
  2015-05-04 16:38 ` [PATCH i386] Allow sibcalls in no-PLT PIC Alexander Monakov
  5 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-04 16:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Alexander Monakov, Rich Felker

With this patch at hand, I'd like to discuss a code generation problem, which
my patch solves only partially.  FWIW, it passes bootstrap/regtest on x86-64.

With other patches in series applied, GCC with -fno-plt can generate tail
calls in PIC mode more frequently, but sometimes poorer code is generated.
I've tried to look for possible causes, and found one issue so far.

Consider the following testcase:

void foo1(int a, int b, int c, int d, int e, int f, int g, int h);
int bar(int x);
void foo2(int a, int b, int c, int d, int e, int f, int g, int h)
{
  bar(a);
  foo1(a, b, c, d, e, f, g, h);
}

Comparing x86 code generation with -O2 -m32 and with/without -fPIC, you can
see that -fPIC happens to produce smaller code.  Without -fPIC, GCC
saves/restores all arguments before/after call to 'bar'.

The reason is that without -fPIC, GCC performs tail call optimization on
'foo1', which causes it to drop REG_EQUIV notes for incoming arguments in
fixup_tail_calls.  After that, code generation diverges at the IRA stage, where
the lack of equivalences prevents loads of pseudos from being moved to the
point of first use.

The patch tries to repair the problem by allowing REG_EQUIV notes to be
resynthesized at ira init for loads that happen via `argp' rtx.  It helps for
the simple testcase above, but not for problematic Clang/LLVM functions where
I noticed the issue.

I hope there's a way around the 'big hammer' approach of fixup_tail_calls.
Might it be possible, instead of dropping REG_EQUIV notes, to copy incoming
arguments into other pseudos just prior to the stack pointer adjustment in
preparation for the tail call?
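
The address test that the diff below adds to validate_equiv_mem can be
modelled in plain C.  This is only a toy sketch: struct toy_rtx, toy_argp and
both helper functions are hypothetical stand-ins for GCC's rtx machinery, and
the model says nothing about whether keeping the equivalence is actually safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for GCC's rtx; none of these types exist in GCC.  */
enum toy_code { TOY_REG, TOY_CONST_INT, TOY_PLUS };

struct toy_rtx
{
  enum toy_code code;
  const struct toy_rtx *op0;	/* first operand of TOY_PLUS */
  const struct toy_rtx *op1;	/* second operand of TOY_PLUS */
};

/* Stand-in for arg_pointer_rtx; compared by identity, as in the patch.  */
static const struct toy_rtx toy_argp = { TOY_REG, NULL, NULL };

/* Mirror of the patched test: strip a constant offset, then ask whether
   the base of the address is the argument pointer.  */
static bool
address_based_on_argp (const struct toy_rtx *addr)
{
  if (addr->code == TOY_PLUS && addr->op1->code == TOY_CONST_INT)
    addr = addr->op0;
  return addr == &toy_argp;
}

/* With the patch, a call between the load and the use rejects the
   equivalence unless the memory is addressed via the argument pointer.  */
static bool
call_invalidates_equiv (bool insn_is_call, const struct toy_rtx *addr)
{
  return insn_is_call && !address_based_on_argp (addr);
}
```

In this model an address of the form argp+const survives an intervening call,
while the same offset from any other base register does not.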

diff --git a/gcc/ira.c b/gcc/ira.c
index ea2b69f..e6b82e2 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -3001,13 +3001,16 @@ validate_equiv_mem (rtx_insn *start, rtx reg, rtx memref)
 
       /* This used to ignore readonly memory and const/pure calls.  The problem
 	 is the equivalent form may reference a pseudo which gets assigned a
 	 call clobbered hard reg.  When we later replace REG with its
 	 equivalent form, the value in the call-clobbered reg has been
 	 changed and all hell breaks loose.  */
-      if (CALL_P (insn))
+      rtx addr = XEXP (memref, 0);
+      if (GET_CODE (addr) == PLUS && GET_CODE (XEXP (addr, 1)) == CONST_INT)
+	addr = XEXP (addr, 0);
+      if (CALL_P (insn) && addr != arg_pointer_rtx)
 	return 0;
 
       note_stores (PATTERN (insn), validate_equiv_mem_from_store, NULL);
 
       /* If a register mentioned in MEMREF is modified via an
 	 auto-increment, we lose the equivalence.  Do the same if one


* [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-04 16:38 PIC calls without PLT, generic implementation Alexander Monakov
                   ` (3 preceding siblings ...)
  2015-05-04 16:38 ` [RFC PATCH] ira: accept loads via argp rtx in validate_equiv_mem Alexander Monakov
@ 2015-05-04 16:38 ` Alexander Monakov
  2015-05-04 17:34   ` Jeff Law
                     ` (2 more replies)
  2015-05-04 16:38 ` [PATCH i386] Allow sibcalls in no-PLT PIC Alexander Monakov
  5 siblings, 3 replies; 106+ messages in thread
From: Alexander Monakov @ 2015-05-04 16:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Alexander Monakov, Rich Felker

This patch introduces the option -fno-plt, which expands calls that would go
via the PLT so that the address of the function is loaded at the call site
(which introduces a GOT load).  The cover letter explains the motivation for
this patch.

New option documentation for invoke.texi is missing from the patch; if this is
accepted I'll be happy to send a v2 with documentation added.

	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
	indirect call by forcing address into a pseudo with -fno-plt.
	* common.opt (flag_plt): New option.

diff --git a/gcc/calls.c b/gcc/calls.c
index 970415d..0c3b9aa 100644
--- a/gcc/calls.c
+++ b/gcc/calls.c
@@ -222,12 +222,18 @@ prepare_call_address (tree fndecl_or_type, rtx funexp, rtx static_chain_value,
     /* If we are using registers for parameters, force the
        function address into a register now.  */
     funexp = ((reg_parm_seen
 	       && targetm.small_register_classes_for_mode_p (FUNCTION_MODE))
 	      ? force_not_mem (memory_address (FUNCTION_MODE, funexp))
 	      : memory_address (FUNCTION_MODE, funexp));
+  else if (flag_pic && !flag_plt && fndecl_or_type
+	   && TREE_CODE (fndecl_or_type) == FUNCTION_DECL
+	   && !targetm.binds_local_p (fndecl_or_type))
+    {
+      funexp = force_reg (Pmode, funexp);
+    }
   else if (! sibcallp)
     {
 #ifndef NO_FUNCTION_CSE
       if (optimize && ! flag_no_function_cse)
 	funexp = force_reg (Pmode, funexp);
 #endif
diff --git a/gcc/common.opt b/gcc/common.opt
index b49ac46..cd8b256 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1773,12 +1773,16 @@ Common Report Var(flag_pic,1) Negative(fpie)
 Generate position-independent code if possible (small mode)
 
 fpie
 Common Report Var(flag_pie,1) Negative(fPIC)
 Generate position-independent code for executables if possible (small mode)
 
+fplt
+Common Report Var(flag_plt) Init(1)
+Use PLT for PIC calls (-fno-plt: load the address from GOT at call site)
+
 fplugin=
 Common Joined RejectNegative Var(common_deferred_options) Defer
 Specify a plugin to load
 
 fplugin-arg-
 Common Joined RejectNegative Var(common_deferred_options) Defer


* [PATCH i386] PR65753: allow PIC tail calls via function pointers
  2015-05-04 16:38 PIC calls without PLT, generic implementation Alexander Monakov
  2015-05-04 16:38 ` [PATCH i386] Extend sibcall peepholes to allow source in %eax Alexander Monakov
  2015-05-04 16:38 ` [PATCH i386] Move CLOBBERED_REGS earlier in register class list Alexander Monakov
@ 2015-05-04 16:38 ` Alexander Monakov
  2015-05-10 16:37   ` Jan Hubicka
  2015-05-04 16:38 ` [RFC PATCH] ira: accept loads via argp rtx in validate_equiv_mem Alexander Monakov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-04 16:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Alexander Monakov, Rich Felker

In the i386 backend, tailcalls are incorrectly disallowed in PIC mode for
calls via function pointers on the basis that indirect calls, like direct
calls, would go via PLT and thus require %ebx to point to GOT -- but that is
not true.  Quoting Rich Felker who reported the bug,

  "For PLT slots in the non-PIE main executable, %ebx is not required at all.
  PLT slots in PIE or shared libraries need %ebx, but a function pointer can
  never evaluate to such a PLT slot; it always evaluates to the nominal address
  of the function which is the same in all DSOs and therefore fundamentally
  cannot depend on the address of the GOT in the calling DSO"

As far as I can see it's simply a mistake that was there from day 1 (comment 4
in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65753 points to original patch).
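
The effect of the one-line change can be seen by modelling just the last
clause of the condition.  This is a simplified sketch that assumes the leading
tests (!TARGET_MACHO, !TARGET_64BIT, flag_pic) are all true; the function
names are illustrative, and has_decl == false stands for a call through a
function pointer (DECL is NULL).

```c
#include <assert.h>
#include <stdbool.h>

/* Old clause: (!decl || !targetm.binds_local_p (decl)).
   Returns true when the sibcall is rejected.  */
static bool
rejected_before (bool has_decl, bool binds_local)
{
  return !has_decl || !binds_local;
}

/* New clause: (decl && !targetm.binds_local_p (decl)).  */
static bool
rejected_after (bool has_decl, bool binds_local)
{
  return has_decl && !binds_local;
}

/* Walks the three interesting cases.  */
static void
check_sibcall_cases (void)
{
  /* Indirect call: rejected before the patch, allowed after it.  */
  assert (rejected_before (false, false) && !rejected_after (false, false));
  /* Direct call to a global function: rejected in both versions.  */
  assert (rejected_before (true, false) && rejected_after (true, false));
  /* Direct call to a locally bound function: allowed in both versions.  */
  assert (!rejected_before (true, true) && !rejected_after (true, true));
}
```

Only the indirect-call row changes, which matches the intent stated in the
ChangeLog entry.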

Bootstrapped and regtested on 32-bit x86, OK for trunk?
(the comment before the condition will need to be adjusted too, i.e.
s/optimize any indirect call, or a direct call/optimize any direct call/ )

	PR target/65753
	* config/i386/i386.c (ix86_function_ok_for_sibcall): Allow PIC sibcalls
	via function pointers.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 3263656..f29e053 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5448,13 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
   /* If we are generating position-independent code, we cannot sibcall
      optimize any indirect call, or a direct call to a global function,
      as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
   if (!TARGET_MACHO
       && !TARGET_64BIT
       && flag_pic
-      && (!decl || !targetm.binds_local_p (decl)))
+      && (decl && !targetm.binds_local_p (decl)))
     return false;
 
   /* If we need to align the outgoing stack, then sibcalling would
      unalign the stack, which may break the called function.  */
   if (ix86_minimum_incoming_stack_boundary (true)
       < PREFERRED_STACK_BOUNDARY)


* PIC calls without PLT, generic implementation
@ 2015-05-04 16:38 Alexander Monakov
  2015-05-04 16:38 ` [PATCH i386] Extend sibcall peepholes to allow source in %eax Alexander Monakov
                   ` (5 more replies)
  0 siblings, 6 replies; 106+ messages in thread
From: Alexander Monakov @ 2015-05-04 16:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Alexander Monakov, Rich Felker, Sriraman Tallam

A recent post by Sriraman prompts me to post my -fno-plt approach sooner rather
than later; I was working on no-PLT PIC codegen in the last few days too.
Although I'm posting a patch series, half of it is i386 backend tuning and can
go in independently.  Except for one patch where it's noted specifically, the
patches were bootstrapped and regtested together, not separately, on x86-64.
Likewise, the improvement claimed below is obtained with GCC with all patches
applied, the difference being only in the -fno-plt flag.

The approach taken here is different.  Instead of adjusting call expansion in
the back end, I force the callee address to be loaded into a pseudo at RTL
expansion time, similar to "function CSE", which is not enabled on most
targets.  The address load (which loads from the GOT) can be moved out of
loops, scheduled, or, on x86, re-fused with the indirect jump by peepholes.  On
32-bit x86, it also allows the compiler to use registers other than %ebx for
the GOT pointer (which can be a win since %ebx is callee-saved).
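
A hand-written C analogue may help picture the loop case.  This is a sketch
under assumptions: external_callee stands in for a function defined in another
shared object, and the local function pointer plays the role of the pseudo
that holds the address loaded from the GOT.  What the compiler really emits
depends on the target and on the optimization passes.

```c
#include <assert.h>

/* Stand-in for a function that would live in another shared object.  */
static int
external_callee (int x)
{
  return x + 1;
}

/* Analogue of the transformed code: the callee address is loaded once,
   outside the loop, and every iteration makes an indirect call.  */
static int
sum_with_hoisted_address (int n)
{
  int (*callee) (int) = external_callee;	/* models the single GOT load */
  int sum = 0;

  for (int i = 0; i < n; i++)
    sum += callee (i);
  return sum;
}
```

With a PLT call the address is not visible to the optimizers at all, so there
is nothing to hoist; making the load explicit is what exposes it to loop
invariant motion and scheduling.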

The benefit of the PLT is the possibility of lazy relocation.  That is not
possible with BIND_NOW, in particular when the -z relro -z now flags were used
at link time as a security hardening measure.  Performance-critical executables
do not particularly need the PLT and lazy relocation either, unless they are
run very frequently with each individual run being extremely short -- but in
that case they can benefit massively from static linking or less massively from
prelinking, and with prelinking they can get the benefit of no-PLT.

I've used LLVM/Clang to evaluate performance impact of PLT-less PIC codegen.
I configured with
  cmake -DLLVM_ENABLE_PIC=ON -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=OFF
from the 3.6 release branch; this configuration mimics the non-static build
that e.g. OpenSUSE is using, and produces a Clang that depends on 112
clang/llvm shared libraries, with roughly 24000 externally visible functions.

Without input files time is mostly spent on dynamic linking, so without
prelink there's a predictable regression, from 55 to 140 ms.  On C++ hello
world, I get:
            PLT   no-PLT  PLT+BIND_NOW
[32bit]  430 ms   535 ms  590 ms
[64bit]  410 ms   495 ms  555 ms

So no-PLT is >20% slower than the default, but already about 10% faster than
PLT when non-lazy binding is forced.

On tramp3d compilation with -O2 -g I get:
            PLT   no-PLT
[32bit]  49.0 s   43.3 s
[64bit]  41.6 s   36.8 s

So on long-running compiles -fno-plt is a very significant win.  Note that I'm
using Clang as a (perhaps extreme) example of PIC-call-intensive code, but the
argument about -fno-plt being useful for performance should apply generally.
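
The relative differences behind these statements can be recomputed from the
two tables.  The sketch below is plain arithmetic on the posted timings; the
helper names percent_change and check_claims are illustrative.

```c
#include <assert.h>

/* Relative change of MEASURED against BASELINE, in percent.  Negative
   values mean MEASURED is smaller (faster) than BASELINE.  */
static double
percent_change (double baseline, double measured)
{
  return (measured - baseline) / baseline * 100.0;
}

/* Re-checks the claims against the posted numbers.  */
static void
check_claims (void)
{
  /* Hello world, PLT vs no-PLT (ms): more than 20% slower.  */
  assert (percent_change (430.0, 535.0) > 20.0);	/* 32-bit */
  assert (percent_change (410.0, 495.0) > 20.0);	/* 64-bit */

  /* tramp3d, PLT vs no-PLT (s): more than 11% faster.  */
  assert (percent_change (49.0, 43.3) < -11.0);		/* 32-bit */
  assert (percent_change (41.6, 36.8) < -11.0);		/* 64-bit */
}
```

For hello world this gives about +24.4% (32-bit) and +20.7% (64-bit) for
no-PLT against PLT, and about -9.3% and -10.8% for no-PLT against
PLT+BIND_NOW.  For tramp3d the change is about -11.6% and -11.5%.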

When looking at code size changes, there's a 1% improvement on 32-bit
libstdc++ and a small regression on 64-bit.  On LLVM/Clang, there's an overall
size regression on both 32-bit and 64-bit; I've tried to analyze it and so far
came up with one possible cause, which is detailed in the IRA REG_EQUIV patch.

Thanks.
Alexander


* [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-04 16:38 PIC calls without PLT, generic implementation Alexander Monakov
@ 2015-05-04 16:38 ` Alexander Monakov
  2015-05-10 16:54   ` Jan Hubicka
  2015-05-04 16:38 ` [PATCH i386] Move CLOBBERED_REGS earlier in register class list Alexander Monakov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-04 16:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Alexander Monakov, Rich Felker

On i386, the peepholes that transform a memory load and a register-indirect
jump into a memory-indirect jump are overly restrictive: they don't allow
combining when the jump target is loaded into %eax and the called function
returns a value (also in %eax, so the register is not dead after the call).
Fix this by checking for identical source and output register operands
separately.
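
Source of the following shape is the kind of input that can reach these
peepholes: the call target is loaded from memory and the call is in tail
position with its result returned.  This is only an illustrative sketch; the
names twice, handler and dispatch are made up, and whether the register
allocator actually picks %eax for the loaded address is not guaranteed.

```c
#include <assert.h>

static int
twice (int x)
{
  return 2 * x;
}

/* Non-static so that the target really has to be loaded from memory.  */
int (*handler) (int) = twice;

/* The call is the last action and its result is returned unchanged, so it
   is a sibcall candidate.  On 32-bit x86 the return value lives in %eax,
   the same register the address may have been loaded into.  */
int
dispatch (int x)
{
  return handler (x);
}
```

When the load and the jump are fused, the function body can reduce to a single
memory-indirect jump instead of a load followed by a register-indirect jump.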

OK?
	* config/i386/i386.md (sibcall_value_memory): Extend peepholes to
	allow memory address in %eax.
	(sibcall_value_pop_memory): Likewise.

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 729db75..7f81bcc 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -11872,13 +11872,14 @@
   [(set (match_operand:W 0 "register_operand")
 	(match_operand:W 1 "memory_operand"))
    (set (match_operand 2)
    (call (mem:QI (match_dup 0))
 		 (match_operand 3)))]
   "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (1))
-   && peep2_reg_dead_p (2, operands[0])"
+   && (REGNO (operands[2]) == REGNO (operands[0])
+       || peep2_reg_dead_p (2, operands[0]))"
   [(parallel [(set (match_dup 2)
 		   (call (mem:QI (match_dup 1))
 			 (match_dup 3)))
 	      (unspec [(const_int 0)] UNSPEC_PEEPSIB)])])
 
 (define_peephole2
@@ -11886,13 +11887,14 @@
 	(match_operand:W 1 "memory_operand"))
    (unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
    (set (match_operand 2)
 	(call (mem:QI (match_dup 0))
 	      (match_operand 3)))]
   "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (2))
-   && peep2_reg_dead_p (3, operands[0])"
+   && (REGNO (operands[2]) == REGNO (operands[0])
+       || peep2_reg_dead_p (3, operands[0]))"
   [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
    (parallel [(set (match_dup 2)
 		   (call (mem:QI (match_dup 1))
 			 (match_dup 3)))
 	      (unspec [(const_int 0)] UNSPEC_PEEPSIB)])])
 
@@ -11951,13 +11953,14 @@
 		   (call (mem:QI (match_dup 0))
 			 (match_operand 3)))
 	      (set (reg:SI SP_REG)
 		   (plus:SI (reg:SI SP_REG)
 			    (match_operand:SI 4 "immediate_operand")))])]
   "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (1))
-   && peep2_reg_dead_p (2, operands[0])"
+   && (REGNO (operands[2]) == REGNO (operands[0])
+       || peep2_reg_dead_p (2, operands[0]))"
   [(parallel [(set (match_dup 2)
 		   (call (mem:QI (match_dup 1))
 			 (match_dup 3)))
 	      (set (reg:SI SP_REG)
 		   (plus:SI (reg:SI SP_REG)
 			    (match_dup 4)))
@@ -11971,13 +11974,14 @@
 		   (call (mem:QI (match_dup 0))
 			 (match_operand 3)))
 	      (set (reg:SI SP_REG)
 		   (plus:SI (reg:SI SP_REG)
 			    (match_operand:SI 4 "immediate_operand")))])]
   "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (2))
-   && peep2_reg_dead_p (3, operands[0])"
+   && (REGNO (operands[2]) == REGNO (operands[0])
+       || peep2_reg_dead_p (3, operands[0]))"
   [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
    (parallel [(set (match_dup 2)
 		   (call (mem:QI (match_dup 1))
 			 (match_dup 3)))
 	      (set (reg:SI SP_REG)
 		   (plus:SI (reg:SI SP_REG)


* [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-04 16:38 PIC calls without PLT, generic implementation Alexander Monakov
                   ` (4 preceding siblings ...)
  2015-05-04 16:38 ` [PATCH] Expand PIC calls without PLT with -fno-plt Alexander Monakov
@ 2015-05-04 16:38 ` Alexander Monakov
  2015-05-15 16:37   ` Alexander Monakov
  5 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-04 16:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Alexander Monakov, Rich Felker

With -fno-plt, we don't have to reject even direct calls as sibcall
candidates.

This patch depends on the '-fplt' flag that is introduced in another patch.

This patch requires that, with -fno-plt, all sibcall candidates go through
prepare_call_address, which transforms the call to a GOT lookup.

OK?
	* config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index f29e053..b734350 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
   /* If we are generating position-independent code, we cannot sibcall
      optimize any indirect call, or a direct call to a global function,
      as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
   if (!TARGET_MACHO
       && !TARGET_64BIT
       && flag_pic
+      && flag_plt
       && (decl && !targetm.binds_local_p (decl)))
     return false;
 
   /* If we need to align the outgoing stack, then sibcalling would
      unalign the stack, which may break the called function.  */
   if (ix86_minimum_incoming_stack_boundary (true)


* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-04 16:38 ` [PATCH] Expand PIC calls without PLT with -fno-plt Alexander Monakov
@ 2015-05-04 17:34   ` Jeff Law
  2015-05-04 17:40     ` Jakub Jelinek
  2015-05-10 16:59   ` Jan Hubicka
  2015-06-22 15:52   ` Jiong Wang
  2 siblings, 1 reply; 106+ messages in thread
From: Jeff Law @ 2015-05-04 17:34 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Rich Felker

On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> This patch introduces option -fno-plt that allows to expand calls that would
> go via PLT to load the address of the function immediately at call site (which
> introduces a GOT load).  Cover letter explains the motivation for this patch.
>
> New option documentation for invoke.texi is missing from the patch; if this is
> accepted I'll be happy to send a v2 with documentation added.
>
> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> 	indirect call by forcing address into a pseudo with -fno-plt.
> 	* common.opt (flag_plt): New option.
OK once you cobble together the invoke.texi changes.

Jeff



* Re: [RFC PATCH] ira: accept loads via argp rtx in validate_equiv_mem
  2015-05-04 16:38 ` [RFC PATCH] ira: accept loads via argp rtx in validate_equiv_mem Alexander Monakov
@ 2015-05-04 17:37   ` Jeff Law
  0 siblings, 0 replies; 106+ messages in thread
From: Jeff Law @ 2015-05-04 17:37 UTC (permalink / raw)
  To: Alexander Monakov, gcc-patches; +Cc: Rich Felker

On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> With this patch at hand, I'd like to discuss a code generation problem, which
> my patch solves only partially.  FWIW, it passes bootstrap/regtest on x86-64.
>
> With other patches in series applied, GCC with -fno-plt can generate tail
> calls in PIC mode more frequently, but sometimes poorer code is generated.
> I've tried to look for possible causes, and found one issue so far.
>
> Consider the following testcase:
>
> void foo1(int a, int b, int c, int d, int e, int f, int g, int h);
> int bar(int x);
> void foo2(int a, int b, int c, int d, int e, int f, int g, int h)
> {
>    bar(a);
>    foo1(a, b, c, d, e, f, g, h);
> }
>
> Comparing x86 code generation with -O2 -m32 and with/without -fPIC, you can
> see that -fPIC happens to produce smaller code.  Without -fPIC, GCC
> saves/restores all arguments before/after call to 'bar'.
>
> The reason for that is without -fPIC, GCC performs tail call optimization on
> 'foo1', and that causes it to drop REG_EQUIV notes for incoming arguments in
> fixup_tail_calls.  After that, code generation diverges at IRA stage, where
> lack of equivalences prevents loads of pseudos to be moved to the point of
> first use.
>
> The patch tries to repair the problem by allowing REG_EQUIV notes to be
> resynthesized at ira init for loads that happen via `argp' rtx.  It helps for
> the simple testcase above, but not for problematic Clang/LLVM functions where
> I noticed the issue.
>
> I hope there's a way around the 'big hammer' approach of fixup_tail_calls.
> Might it be possible instead of dropping REG_EQUIV notes, to copy incoming
> arguments into other pseudos just prior to stack pointer adjustment in
> preparation for tailcall?
Isn't the whole point of dropping the notes to indicate that those
argument slots are no longer guaranteed to hold the value at all points
throughout the function?

That can certainly be relaxed, but you'll have to have some kind of code 
to analyze the data in the argument slots to ensure they haven't 
changed.  You can't just blindly put the notes back if I remember this 
stuff correctly.

Jeff


* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-04 17:34   ` Jeff Law
@ 2015-05-04 17:40     ` Jakub Jelinek
  2015-05-04 17:42       ` Jeff Law
  0 siblings, 1 reply; 106+ messages in thread
From: Jakub Jelinek @ 2015-05-04 17:40 UTC (permalink / raw)
  To: Jeff Law; +Cc: Alexander Monakov, gcc-patches, Rich Felker

On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
> On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> >This patch introduces option -fno-plt that allows to expand calls that would
> >go via PLT to load the address of the function immediately at call site (which
> >introduces a GOT load).  Cover letter explains the motivation for this patch.
> >
> >New option documentation for invoke.texi is missing from the patch; if this is
> >accepted I'll be happy to send a v2 with documentation added.
> >
> >	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> >	indirect call by forcing address into a pseudo with -fno-plt.
> >	* common.opt (flag_plt): New option.
> OK once you cobble together the invoke.texi changes.

Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
inline the plt slot's first part, then lazy binding will work fine.

	Jakub


* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-04 17:40     ` Jakub Jelinek
@ 2015-05-04 17:42       ` Jeff Law
  2015-05-06  3:08         ` Rich Felker
  2015-05-06 15:25         ` Alexander Monakov
  0 siblings, 2 replies; 106+ messages in thread
From: Jeff Law @ 2015-05-04 17:42 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Alexander Monakov, gcc-patches, Rich Felker

On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
> On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
>> On 05/04/2015 10:37 AM, Alexander Monakov wrote:
>>> This patch introduces option -fno-plt that allows to expand calls that would
>>> go via PLT to load the address of the function immediately at call site (which
>>> introduces a GOT load).  Cover letter explains the motivation for this patch.
>>>
>>> New option documentation for invoke.texi is missing from the patch; if this is
>>> accepted I'll be happy to send a v2 with documentation added.
>>>
>>> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
>>> 	indirect call by forcing address into a pseudo with -fno-plt.
>>> 	* common.opt (flag_plt): New option.
>> OK once you cobble together the invoke.texi changes.
>
> Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
> inline the plt slot's first part, then lazy binding will work fine.
I must have missed Alan/Michael's message.

ISTM the win here is that by going through the GOT, you can CSE the GOT 
reference and possibly get some more register allocation freedom.  Is 
that still the case with Alan/Michael's approach?

jeff


* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-04 17:42       ` Jeff Law
@ 2015-05-06  3:08         ` Rich Felker
  2015-05-10 17:07           ` Jan Hubicka
  2015-05-06 15:25         ` Alexander Monakov
  1 sibling, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-06  3:08 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jakub Jelinek, Alexander Monakov, gcc-patches

On Mon, May 04, 2015 at 11:42:20AM -0600, Jeff Law wrote:
> On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
> >On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
> >>On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> >>>This patch introduces option -fno-plt that allows to expand calls that would
> >>>go via PLT to load the address of the function immediately at call site (which
> >>>introduces a GOT load).  Cover letter explains the motivation for this patch.
> >>>
> >>>New option documentation for invoke.texi is missing from the patch; if this is
> >>>accepted I'll be happy to send a v2 with documentation added.
> >>>
> >>>	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> >>>	indirect call by forcing address into a pseudo with -fno-plt.
> >>>	* common.opt (flag_plt): New option.
> >>OK once you cobble together the invoke.texi changes.
> >
> >Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
> >inline the plt slot's first part, then lazy binding will work fine.
> I must have missed Alan/Michael's message.
> 
> ISTM the win here is that by going through the GOT, you can CSE the
> GOT reference and possibly get some more register allocation
> freedom.  Is that still the case with Alan/Michael's approach?

There are many advantages to 'going through the GOT'. CSE'ing the
reference is just one. The biggest (IMO) is that you can avoid the bad
PLT ABI that most targets have, where making a call to a PLT slot
requires the GOT address to be pre-loaded into a fixed, call-saved
register. This precludes sibcalls and forces many functions which
otherwise would not need their own stack frames to create one for
saving the old value of the GOT register. See my blog entry on the
topic here: http://ewontfix.com/18/

Anyone who really wants lazy binding can use -fplt (which is
presumably still the default; I didn't check) but lazy binding should
largely be considered deprecated anyway since effective use of relro
protection requires -z now too, in which case you're paying all the
costs (which are considerable!) for lazy binding support even though
you won't get it.

Rich


* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-04 17:42       ` Jeff Law
  2015-05-06  3:08         ` Rich Felker
@ 2015-05-06 15:25         ` Alexander Monakov
  2015-05-06 15:46           ` Jakub Jelinek
  2015-05-07 18:22           ` Jeff Law
  1 sibling, 2 replies; 106+ messages in thread
From: Alexander Monakov @ 2015-05-06 15:25 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jakub Jelinek, gcc-patches, Rich Felker

On Mon, 4 May 2015, Jeff Law wrote:
> On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
> > On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
> > > On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> > > > This patch introduces option -fno-plt that allows to expand calls that
> > > > would
> > > > go via PLT to load the address of the function immediately at call site
> > > > (which
> > > > introduces a GOT load).  Cover letter explains the motivation for this
> > > > patch.
> > > >
> > > > New option documentation for invoke.texi is missing from the patch; if
> > > > this is
> > > > accepted I'll be happy to send a v2 with documentation added.
> > > >
> > > >  * calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> > > >  indirect call by forcing address into a pseudo with -fno-plt.
> > > >  * common.opt (flag_plt): New option.
> > > OK once you cobble together the invoke.texi changes.
> >
> > Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
> > inline the plt slot's first part, then lazy binding will work fine.
> I must have missed Alan/Michael's message.
> 
> ISTM the win here is that by going through the GOT, you can CSE the GOT
> reference and possibly get some more register allocation freedom.  Is that
> still the case with Alan/Michael's approach?

If the same PLT stubs as today are to be used, it constrains the compiler on
32-bit x86 and possibly other arches where PLT stubs need GOT pointer in a
specific register.  It's possible to imagine more complex PLT stubs that
obtain GOT pointer on their own, but in that case you can't let optimizations
such as loop invariant motion move the GOT load away from the call in a
fashion that could result in the PLT stub pointer being reused many times.

Going ahead with this patch now allows anyone to play with no-PLT codegen on
any architecture.  As you can see from this series, on x86 it uncovered several
codegen blunders (and fixing those should improve normal codegen as well -- so
everybody wins).

Below is my proposed patch for invoke.texi.  Still OK to check in?

	* doc/invoke.texi (Code Generation Options): Add -fno-plt.
	([-fno-plt]): Document.

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 520c2c5..fd4199c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1122,7 +1122,7 @@ See S/390 and zSeries Options.
 -finstrument-functions-exclude-function-list=@var{sym},@var{sym},@dots{} @gol
 -finstrument-functions-exclude-file-list=@var{file},@var{file},@dots{} @gol
 -fno-common  -fno-ident @gol
--fpcc-struct-return  -fpic  -fPIC -fpie -fPIE @gol
+-fpcc-struct-return  -fpic  -fPIC -fpie -fPIE -fno-plt @gol
 -fno-jump-tables @gol
 -frecord-gcc-switches @gol
 -freg-struct-return  -fshort-enums @gol
@@ -23615,6 +23615,16 @@ used during linking.
 @code{__pie__} and @code{__PIE__}.  The macros have the value 1
 for @option{-fpie} and 2 for @option{-fPIE}.
 
+@item -fno-plt
+@opindex fno-plt
+Do not use PLT for external function calls in position-independent code.
+Instead, load callee address at call site from GOT and branch to it.
+This leads to more efficient code by eliminating PLT stubs and exposing
+GOT load to optimizations.  On architectures such as 32-bit x86 where
+PLT stubs expect GOT pointer in a specific register, this gives more
+register allocation freedom to the compiler.  Lazy binding requires PLT:
+with @option{-fno-plt} all external symbols are resolved at load time.
+
 @item -fno-jump-tables
 @opindex fno-jump-tables
 Do not use jump tables for switch statements even where it would be

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 15:25         ` Alexander Monakov
@ 2015-05-06 15:46           ` Jakub Jelinek
  2015-05-06 15:55             ` Jeff Law
  2015-05-06 16:44             ` Alexander Monakov
  2015-05-07 18:22           ` Jeff Law
  1 sibling, 2 replies; 106+ messages in thread
From: Jakub Jelinek @ 2015-05-06 15:46 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jeff Law, gcc-patches, Rich Felker

On Wed, May 06, 2015 at 06:24:58PM +0300, Alexander Monakov wrote:
> If the same PLT stubs as today are to be used, it constrains the compiler on
> 32-bit x86 and possibly other arches where PLT stubs need GOT pointer in a
> specific register.  It's possible to imagine more complex PLT stubs that
> obtain GOT pointer on their own, but in that case you can't let optimizations
> such as loop invariant motion move the GOT load away from the call in a
> fashion that could result in PLT stub pointer be reused many times.

Why?
The 32-bit x86 PLT (shouldn't we care much more about x86-64, where this is
a non-issue?) looks like:

4c2b7310 <_Unwind_Find_FDE@plt-0x10>:
4c2b7310:       ff b3 04 00 00 00       pushl  0x4(%ebx)
4c2b7316:       ff a3 08 00 00 00       jmp    *0x8(%ebx)
4c2b731c:       00 00                   add    %al,(%eax)
        ...

4c2b7320 <_Unwind_Find_FDE@plt>:
4c2b7320:       ff a3 0c 00 00 00       jmp    *0xc(%ebx)
4c2b7326:       68 00 00 00 00          push   $0x0
4c2b732b:       e9 e0 ff ff ff          jmp    4c2b7310

4c2b7330 <realloc@plt>:
4c2b7330:       ff a3 10 00 00 00       jmp    *0x10(%ebx)
4c2b7336:       68 08 00 00 00          push   $0x8
4c2b733b:       e9 d0 ff ff ff          jmp    4c2b7310

The linker would know very well what kind of relocations are used for a
particular PLT slot.  For the new relocations, which would resolve to the
address of the .got.plt slot, it could just tweak the corresponding 3rd insn
in the slot so that it jumps not to first plt slot - 16, but to a few bytes
before that, where code would load the address of _G_O_T_ into %ebx and then
fall through into the 0x4c2b7310 snippet above.  Lazy binding would be a few
ticks slower in that case, but with no requirement on %ebx to contain _G_O_T_.

As for hoisting the load of the call address before the loop, with lazy
binding that has the obvious disadvantage that you'd resolve the slot again
and again, if you are unlucky enough that the function hasn't been resolved
yet.  Unless the shared PLT stub after computing _G_O_T_ (for x86) also
rechecks the .got.plt address.
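
The repeated-resolution effect described above can be modelled in plain C.
This is only a simulation under stated assumptions: got_slot stands in for
a .got.plt entry, lazy_stub for the lazy resolver path, and all names are
hypothetical rather than taken from any real dynamic linker.

```c
/* Counts how many times the simulated resolver has run. */
int resolve_count;

static int real_inc(int x) { return x + 1; }
static int lazy_stub(int x);

/* Simulated .got.plt entry: initially points at the resolver stub. */
static int (*got_slot)(int) = lazy_stub;

/* Simulated lazy resolver: patch the slot, then forward the call. */
static int lazy_stub(int x)
{
    resolve_count++;
    got_slot = real_inc;
    return real_inc(x);
}

/* Put the slot back into its unresolved state. */
void reset_binding(void)
{
    resolve_count = 0;
    got_slot = lazy_stub;
}

/* The slot is re-read on every iteration, so only the first call
   goes through the resolver. */
int call_reloading(int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc = got_slot(acc);
    return acc;
}

/* The slot is read once before the loop, as a hoisted load would be.
   If it was still unresolved, every iteration hits the resolver. */
int call_hoisted(int n)
{
    int (*fn)(int) = got_slot;
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc = fn(acc);
    return acc;
}
```

In this model, starting from an unresolved slot, call_hoisted(n) runs the
resolver n times while call_reloading(n) runs it once. With bindings
resolved at load time the slot never points at the stub, so the hoisted
form carries no such penalty.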

	Jakub

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 15:46           ` Jakub Jelinek
@ 2015-05-06 15:55             ` Jeff Law
  2015-05-06 16:44             ` Alexander Monakov
  1 sibling, 0 replies; 106+ messages in thread
From: Jeff Law @ 2015-05-06 15:55 UTC (permalink / raw)
  To: Jakub Jelinek, Alexander Monakov; +Cc: gcc-patches, Rich Felker

On 05/06/2015 09:45 AM, Jakub Jelinek wrote:

> As for hoisting the load of the call address before the loop, with lazy
> binding that has the obvious disadvantage that you'd resolve the slot again
> and again, if you are unlucky enough that the function hasn't been resolved
> yet.  Unless the shared PLT stub after computing _G_O_T_ (for x86) also
> rechecks the .got.plt address.
Yea, but I suspect that's the rare case rather than the common case.

Of course, it's so bloody expensive when it happens, it might totally 
outweigh the aggregated benefits from all the other profitable hoisted 
GOT loads.

jeff

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 15:46           ` Jakub Jelinek
  2015-05-06 15:55             ` Jeff Law
@ 2015-05-06 16:44             ` Alexander Monakov
  2015-05-06 17:35               ` Rich Felker
  1 sibling, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-06 16:44 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jeff Law, gcc-patches, Rich Felker

On Wed, 6 May 2015, Jakub Jelinek wrote:
> The linker would know very well what kind of relocations are used for
> particular PLT slot, and for the new relocations which would resolve to the
> address of the .got.plt slot it could just tweak corresponding 3rd insn
> in the slot, to not jump to first plt slot - 16, but a few bytes before that
> that would just load the address of _G_O_T_ into %ebx and then fallthru
> into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
> slower in that case, but no requirement on %ebx to contain _G_O_T_.

No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.

Alexander

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 16:44             ` Alexander Monakov
@ 2015-05-06 17:35               ` Rich Felker
  2015-05-06 18:26                 ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-06 17:35 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jakub Jelinek, Jeff Law, gcc-patches

On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
> On Wed, 6 May 2015, Jakub Jelinek wrote:
> > The linker would know very well what kind of relocations are used for
> > particular PLT slot, and for the new relocations which would resolve to the
> > address of the .got.plt slot it could just tweak corresponding 3rd insn
> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
> > that would just load the address of _G_O_T_ into %ebx and then fallthru
> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
> 
> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.

Indeed. And the situation is the same on almost all targets. The only
exceptions are those with direct PC-relative addressing (like x86_64)
and those with reserved inter-procedural linkage registers and
efficient PC-relative address loading via them (like ARM and AArch64).
MIPS (o32) is also an interesting exception in that the normal ABI is
already PLT-free, and while callees need a PIC register loaded, it's a
call-clobbered register, not a call-saved one, so it doesn't make the
same kind of trouble.

I really don't see a need to make no-PLT code gen support lazy binding
when it's necessarily going to be costly to do so, and precludes most
of the benefits of the no-PLT approach. Anyone still wanting/needing
lazy binding semantics can use PLT, and can even choose on a per-TU
basis (or maybe even more fine-grained with pragmas/attributes?).
Those of us who are suffering the cost of PLT with no benefits
(because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
adding -fno-plt) and enjoy something like a 10% performance boost in
PIC/PIE.

Rich

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 17:35               ` Rich Felker
@ 2015-05-06 18:26                 ` H.J. Lu
  2015-05-06 18:37                   ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-06 18:26 UTC (permalink / raw)
  To: Rich Felker; +Cc: Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
>> On Wed, 6 May 2015, Jakub Jelinek wrote:
>> > The linker would know very well what kind of relocations are used for
>> > particular PLT slot, and for the new relocations which would resolve to the
>> > address of the .got.plt slot it could just tweak corresponding 3rd insn
>> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
>> > that would just load the address of _G_O_T_ into %ebx and then fallthru
>> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
>> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
>>
>> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
>
> Indeed. And the situation is the same on almost all targets. The only
> exceptions are those with direct PC-relative addressing (like x86_64)
> and those with reserved inter-procedural linkage registers and
> efficient PC-relative address loading via them (like ARM and AArch64).
> MIPS (o32) is also an interesting exception in that the normal ABI is
> already PLT-free, and while callees need a PIC register loaded, it's a
> call-clobbered register, not a call-saved one, so it doesn't make the
> same kind of trouble,
>
> I really don't see a need to make no-PLT code gen support lazy binding
> when it's necessarily going to be costly to do so, and precludes most
> of the benefits of the no-PLT approach. Anyone still wanting/needing
> lazy binding semantics can use PLT, and can even choose on a per-TU
> basis (or maybe even more fine-grained with pragmas/attributes?).
> Those of us who are suffering the cost of PLT with no benefits
> (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
> adding -fno-plt) and enjoy something like a 10% performance boost in
> PIC/PIE.
>

There are things the compiler can do for performance and correctness
if it is told what options will be passed to the linker.  -z now is one and
-Bsymbolic is another:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886

I think we should add -fnow and -fsymbolic.  Together with LTO,
we can generate faster executables as well as shared libraries.

-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 18:26                 ` H.J. Lu
@ 2015-05-06 18:37                   ` Rich Felker
  2015-05-06 18:45                     ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-06 18:37 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

On Wed, May 06, 2015 at 11:26:29AM -0700, H.J. Lu wrote:
> On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
> > On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
> >> On Wed, 6 May 2015, Jakub Jelinek wrote:
> >> > The linker would know very well what kind of relocations are used for
> >> > particular PLT slot, and for the new relocations which would resolve to the
> >> > address of the .got.plt slot it could just tweak corresponding 3rd insn
> >> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
> >> > that would just load the address of _G_O_T_ into %ebx and then fallthru
> >> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
> >> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
> >>
> >> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
> >
> > Indeed. And the situation is the same on almost all targets. The only
> > exceptions are those with direct PC-relative addressing (like x86_64)
> > and those with reserved inter-procedural linkage registers and
> > efficient PC-relative address loading via them (like ARM and AArch64).
> > MIPS (o32) is also an interesting exception in that the normal ABI is
> > already PLT-free, and while callees need a PIC register loaded, it's a
> > call-clobbered register, not a call-saved one, so it doesn't make the
> > same kind of trouble,
> >
> > I really don't see a need to make no-PLT code gen support lazy binding
> > when it's necessarily going to be costly to do so, and precludes most
> > of the benefits of the no-PLT approach. Anyone still wanting/needing
> > lazy binding semantics can use PLT, and can even choose on a per-TU
> > basis (or maybe even more fine-grained with pragmas/attributes?).
> > Those of us who are suffering the cost of PLT with no benefits
> > (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
> > adding -fno-plt) and enjoy something like a 10% performance boost in
> > PIC/PIE.
> >
> 
> There are things compiler can do for performance and correctness
> if it is told what options will be passed to linker.  -z now is one and
> -Bsymbolic is another one:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886
> 
> I think we should add -fnow and -fsymbolic.  Together with LTO,
> we can generate faster executables as well as shared libraries.

I don't see how knowing about -Bsymbolic can help the compiler
optimize. Without visibility, it can't know whether the symbols will
be defined in the same DSO. With visibility, it can already do the
equivalent hints. Perhaps it helps in the case where the symbol is
already defined (and non-weak) in the same TU, but I think in this
case it should already be optimizing the reference. Symbol
interposition over top of a non-weak symbol from the same TU is always
invalid and the compiler should not be pessimizing code to make it
work.

As for -fnow, I haven't thought about it much but I also don't see
many places where it could help. The only benefit that comes to mind
is on targets with weak memory order, where it would eliminate some of
the cost of synchronizing TLSDESC lazy bindings (see Szabolcs Nagy's
work on AArch64). It might also benefit PLT calls on such targets, but
you would get a lot more benefit from -fno-plt, and in that case -fnow
would not allow any further optimization.

Rich

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 18:37                   ` Rich Felker
@ 2015-05-06 18:45                     ` H.J. Lu
  2015-05-06 19:01                       ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-06 18:45 UTC (permalink / raw)
  To: Rich Felker; +Cc: Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

On Wed, May 6, 2015 at 11:37 AM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 06, 2015 at 11:26:29AM -0700, H.J. Lu wrote:
>> On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
>> > On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
>> >> On Wed, 6 May 2015, Jakub Jelinek wrote:
>> >> > The linker would know very well what kind of relocations are used for
>> >> > particular PLT slot, and for the new relocations which would resolve to the
>> >> > address of the .got.plt slot it could just tweak corresponding 3rd insn
>> >> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
>> >> > that would just load the address of _G_O_T_ into %ebx and then fallthru
>> >> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
>> >> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
>> >>
>> >> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
>> >
>> > Indeed. And the situation is the same on almost all targets. The only
>> > exceptions are those with direct PC-relative addressing (like x86_64)
>> > and those with reserved inter-procedural linkage registers and
>> > efficient PC-relative address loading via them (like ARM and AArch64).
>> > MIPS (o32) is also an interesting exception in that the normal ABI is
>> > already PLT-free, and while callees need a PIC register loaded, it's a
>> > call-clobbered register, not a call-saved one, so it doesn't make the
>> > same kind of trouble,
>> >
>> > I really don't see a need to make no-PLT code gen support lazy binding
>> > when it's necessarily going to be costly to do so, and precludes most
>> > of the benefits of the no-PLT approach. Anyone still wanting/needing
>> > lazy binding semantics can use PLT, and can even choose on a per-TU
>> > basis (or maybe even more fine-grained with pragmas/attributes?).
>> > Those of us who are suffering the cost of PLT with no benefits
>> > (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
>> > adding -fno-plt) and enjoy something like a 10% performance boost in
>> > PIC/PIE.
>> >
>>
>> There are things compiler can do for performance and correctness
>> if it is told what options will be passed to linker.  -z now is one and
>> -Bsymbolic is another one:
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886
>>
>> I think we should add -fnow and -fsymbolic.  Together with LTO,
>> we can generate faster executables as well as shared libraries.
>
> I don't see how knowing about -Bsymbolic can help the compiler
> optimize. Without visibility, it can't know whether the symbols will
> be defined in the same DSO. With visibility, it can already do the
> equivalent hints. Perhaps it helps in the case where the symbol is
> already defined (and non-weak) in the same TU, but I think in this
> case it should already be optimizing the reference. Symbol
> interposition over top of a non-weak symbol from the same TU is always
> invalid and the compiler should not be pessimizing code to make it
> work.

-Bsymbolic will bind all references to local definitions in shared libraries,
with or without visibility, weak or non-weak.  The compiler can use it
in binds_tls_local_p and we can generate much better code in shared
libraries.

> As for -fnow, I haven't thought about it much but I also don't see
> many places where it could help. The only benefit that comes to mind
> is on targets with weak memory order, where it would eliminate some of
> the cost of synchronizing TLSDESC lazy bindings (see Szabolcs Nagy's
> work on AArch64). It might also benefit PLT calls on such targets, but
> you would get a lot more benefit from -fno-plt, and in that case -fnow
> would not allow any further optimization.
>

-fno-plt doesn't work with lazy binding.  -fnow tells the compiler that
lazy binding is not used and that it can optimize without PLT.  With
-flto -fnow, the compiler can make much better choices.

-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 18:45                     ` H.J. Lu
@ 2015-05-06 19:01                       ` Rich Felker
  2015-05-06 19:05                         ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-06 19:01 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

On Wed, May 06, 2015 at 11:44:57AM -0700, H.J. Lu wrote:
> On Wed, May 6, 2015 at 11:37 AM, Rich Felker <dalias@libc.org> wrote:
> > On Wed, May 06, 2015 at 11:26:29AM -0700, H.J. Lu wrote:
> >> On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
> >> > On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
> >> >> On Wed, 6 May 2015, Jakub Jelinek wrote:
> >> >> > The linker would know very well what kind of relocations are used for
> >> >> > particular PLT slot, and for the new relocations which would resolve to the
> >> >> > address of the .got.plt slot it could just tweak corresponding 3rd insn
> >> >> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
> >> >> > that would just load the address of _G_O_T_ into %ebx and then fallthru
> >> >> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
> >> >> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
> >> >>
> >> >> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
> >> >
> >> > Indeed. And the situation is the same on almost all targets. The only
> >> > exceptions are those with direct PC-relative addressing (like x86_64)
> >> > and those with reserved inter-procedural linkage registers and
> >> > efficient PC-relative address loading via them (like ARM and AArch64).
> >> > MIPS (o32) is also an interesting exception in that the normal ABI is
> >> > already PLT-free, and while callees need a PIC register loaded, it's a
> >> > call-clobbered register, not a call-saved one, so it doesn't make the
> >> > same kind of trouble,
> >> >
> >> > I really don't see a need to make no-PLT code gen support lazy binding
> >> > when it's necessarily going to be costly to do so, and precludes most
> >> > of the benefits of the no-PLT approach. Anyone still wanting/needing
> >> > lazy binding semantics can use PLT, and can even choose on a per-TU
> >> > basis (or maybe even more fine-grained with pragmas/attributes?).
> >> > Those of us who are suffering the cost of PLT with no benefits
> >> > (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
> >> > adding -fno-plt) and enjoy something like a 10% performance boost in
> >> > PIC/PIE.
> >> >
> >>
> >> There are things compiler can do for performance and correctness
> >> if it is told what options will be passed to linker.  -z now is one and
> >> -Bsymbolic is another one:
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886
> >>
> >> I think we should add -fnow and -fsymbolic.  Together with LTO,
> >> we can generate faster executables as well as shared libraries.
> >
> > I don't see how knowing about -Bsymbolic can help the compiler
> > optimize. Without visibility, it can't know whether the symbols will
> > be defined in the same DSO. With visibility, it can already do the
> > equivalent hints. Perhaps it helps in the case where the symbol is
> > already defined (and non-weak) in the same TU, but I think in this
> > case it should already be optimizing the reference. Symbol
> > interposition over top of a non-weak symbol from the same TU is always
> > invalid and the compiler should not be pessimizing code to make it
> > work.
> 
> -Bsymbolic will bind all references to local definitions in shared libraries,
> with and without visibility, weak or non-weak.  Compiler can use it
> in binds_tls_local_p and we can generate much better codes in shared
> libraries.

Yes, I'm aware of what it does. But at compile-time the compiler can't
know whether the referenced symbol will be defined in the same DSO
unless there is a visibility annotation telling it. Even when linking a
shared library using -Bsymbolic, the library code can still make calls
(or data references) to symbols in other DSOs.
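
For reference, the kind of visibility annotation meant here looks like the
following sketch. The function names are hypothetical; the attribute is the
GCC visibility attribute, also accepted by Clang.

```c
/* Hidden visibility: the symbol is not exported from the shared object,
   so the compiler may assume references to it bind locally and need no
   GOT or PLT indirection. */
__attribute__((visibility("hidden")))
int helper(void)
{
    return -1;
}

/* Default visibility: still exported, but its call to helper() can be
   emitted as a direct call. */
int foo(void)
{
    return helper();
}
```

The same effect can be applied to a whole translation unit with
-fvisibility=hidden, marking only the intended exports as default.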

> > As for -fnow, I haven't thought about it much but I also don't see
> > many places where it could help. The only benefit that comes to mind
> > is on targets with weak memory order, where it would eliminate some of
> > the cost of synchronizing TLSDESC lazy bindings (see Szabolcs Nagy's
> > work on AArch64). It might also benefit PLT calls on such targets, but
> > you would get a lot more benefit from -fno-plt, and in that case -fnow
> > would not allow any further optimization.
> 
> -fno-plt doesn't work with lazy binding.  -fnow tells compiler that
> lazy binding is not used and it can optimize without PLT.  With
> -flto -fnow, compiler can make much better choices.

Ah, I see now you had LTO in mind. In that case the compiler does know
when the symbol is defined in the same DSO for -Bsymbolic. So that
clears up the usefulness of your proposed -fsymbolic. I still don't
see how -fnow would have a lot of practical usefulness, but I'm
certainly not opposed to it.

Rich

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 19:01                       ` Rich Felker
@ 2015-05-06 19:05                         ` H.J. Lu
  2015-05-06 19:18                           ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-06 19:05 UTC (permalink / raw)
  To: Rich Felker; +Cc: Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

On Wed, May 6, 2015 at 12:01 PM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 06, 2015 at 11:44:57AM -0700, H.J. Lu wrote:
>> On Wed, May 6, 2015 at 11:37 AM, Rich Felker <dalias@libc.org> wrote:
>> > On Wed, May 06, 2015 at 11:26:29AM -0700, H.J. Lu wrote:
>> >> On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
>> >> > On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
>> >> >> On Wed, 6 May 2015, Jakub Jelinek wrote:
>> >> >> > The linker would know very well what kind of relocations are used for
>> >> >> > particular PLT slot, and for the new relocations which would resolve to the
>> >> >> > address of the .got.plt slot it could just tweak corresponding 3rd insn
>> >> >> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
>> >> >> > that would just load the address of _G_O_T_ into %ebx and then fallthru
>> >> >> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
>> >> >> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
>> >> >>
>> >> >> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
>> >> >
>> >> > Indeed. And the situation is the same on almost all targets. The only
>> >> > exceptions are those with direct PC-relative addressing (like x86_64)
>> >> > and those with reserved inter-procedural linkage registers and
>> >> > efficient PC-relative address loading via them (like ARM and AArch64).
>> >> > MIPS (o32) is also an interesting exception in that the normal ABI is
>> >> > already PLT-free, and while callees need a PIC register loaded, it's a
>> >> > call-clobbered register, not a call-saved one, so it doesn't make the
>> >> > same kind of trouble,
>> >> >
>> >> > I really don't see a need to make no-PLT code gen support lazy binding
>> >> > when it's necessarily going to be costly to do so, and precludes most
>> >> > of the benefits of the no-PLT approach. Anyone still wanting/needing
>> >> > lazy binding semantics can use PLT, and can even choose on a per-TU
>> >> > basis (or maybe even more fine-grained with pragmas/attributes?).
>> >> > Those of us who are suffering the cost of PLT with no benefits
>> >> > (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
>> >> > adding -fno-plt) and enjoy something like a 10% performance boost in
>> >> > PIC/PIE.
>> >> >
>> >>
>> >> There are things compiler can do for performance and correctness
>> >> if it is told what options will be passed to linker.  -z now is one and
>> >> -Bsymbolic is another one:
>> >>
>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886
>> >>
>> >> I think we should add -fnow and -fsymbolic.  Together with LTO,
>> >> we can generate faster executables as well as shared libraries.
>> >
>> > I don't see how knowing about -Bsymbolic can help the compiler
>> > optimize. Without visibility, it can't know whether the symbols will
>> > be defined in the same DSO. With visibility, it can already do the
>> > equivalent hints. Perhaps it helps in the case where the symbol is
>> > already defined (and non-weak) in the same TU, but I think in this
>> > case it should already be optimizing the reference. Symbol
>> > interposition over top of a non-weak symbol from the same TU is always
>> > invalid and the compiler should not be pessimizing code to make it
>> > work.
>>
>> -Bsymbolic will bind all references to local definitions in shared libraries,
>> with and without visibility, weak or non-weak.  Compiler can use it
>> in binds_tls_local_p and we can generate much better codes in shared
>> libraries.
>
> Yes, I'm aware of what it does. But at compile-time the compiler can't
> know whether the referenced symbol will be defined in the same DSO
> unless this is visibility annotation telling it. Even when linking a
> shared library using -Bsymbolic, the library code can still make calls
> (or data references) to symbols in other DSOs.

Even without LTO, -fsymbolic -fPIC will generate better code for

---
int glob_a = 1;

int foo ()
{
  return glob_a;
}
---

and

---
int glob_a (void)
{
  return -1;
}

int foo ()
{
  return glob_a ();
}
---


>> > As for -fnow, I haven't thought about it much but I also don't see
>> > many places where it could help. The only benefit that comes to mind
>> > is on targets with weak memory order, where it would eliminate some of
>> > the cost of synchronizing TLSDESC lazy bindings (see Szabolcs Nagy's
>> > work on AArch64). It might also benefit PLT calls on such targets, but
>> > you would get a lot more benefit from -fno-plt, and in that case -fnow
>> > would not allow any further optimization.
>>
>> -fno-plt doesn't work with lazy binding.  -fnow tells compiler that
>> lazy binding is not used and it can optimize without PLT.  With
>> -flto -fnow, compiler can make much better choices.
>
> Ah, I see now you had LTO in mind. In that case the compiler does know
> when the symbol is defined in the same DSO for -Bsymbolic. So that
> clears up the usefulness of your proposed -fsymbolic. I still don't
> see how -fnow would have a lot of practical usefulness, but I'm
> certainly not opposed to it.
>
> Rich



-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 19:05                         ` H.J. Lu
@ 2015-05-06 19:18                           ` Rich Felker
  2015-05-06 19:24                             ` H.J. Lu
  2015-05-11 11:48                             ` Michael Matz
  0 siblings, 2 replies; 106+ messages in thread
From: Rich Felker @ 2015-05-06 19:18 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

On Wed, May 06, 2015 at 12:05:20PM -0700, H.J. Lu wrote:
> >> -Bsymbolic will bind all references to local definitions in shared libraries,
> >> with and without visibility, weak or non-weak.  Compiler can use it
> >> in binds_tls_local_p and we can generate much better code in shared
> >> libraries.
> >
> > Yes, I'm aware of what it does. But at compile-time the compiler can't
> > know whether the referenced symbol will be defined in the same DSO
> > unless there is a visibility annotation telling it. Even when linking a
> > shared library using -Bsymbolic, the library code can still make calls
> > (or data references) to symbols in other DSOs.
> 
> Even without LTO, -fsymbolic -fPIC will generate better code for
> 
> ---
> int glob_a = 1;
> 
> int foo ()
> {
>   return glob_a;
> }
> ---

I see how this case is improved, but it depends on the dubious (and
undocumented?) behavior of -Bsymbolic breaking copy relocations.

> and
> 
> ---
> int glob_a (void)
> {
>   return -1;
> }
> 
> int foo ()
> {
>   return glob_a ();
> }
> ---

I don't see how this case is improved unless GCC is failing to
consider strong definitions in the same TU as locally-binding. If this
is the case, is there a reason for that behavior? IMO it's wrong.

Rich

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 19:18                           ` Rich Felker
@ 2015-05-06 19:24                             ` H.J. Lu
  2015-05-11 11:48                             ` Michael Matz
  1 sibling, 0 replies; 106+ messages in thread
From: H.J. Lu @ 2015-05-06 19:24 UTC (permalink / raw)
  To: Rich Felker; +Cc: Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

On Wed, May 6, 2015 at 12:17 PM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 06, 2015 at 12:05:20PM -0700, H.J. Lu wrote:
>> >> -Bsymbolic will bind all references to local definitions in shared libraries,
>> >> with and without visibility, weak or non-weak.  Compiler can use it
>> >> in binds_tls_local_p and we can generate much better code in shared
>> >> libraries.
>> >
>> > Yes, I'm aware of what it does. But at compile-time the compiler can't
>> > know whether the referenced symbol will be defined in the same DSO
>> > unless there is a visibility annotation telling it. Even when linking a
>> > shared library using -Bsymbolic, the library code can still make calls
>> > (or data references) to symbols in other DSOs.
>>
>> Even without LTO, -fsymbolic -fPIC will generate better code for
>>
>> ---
>> int glob_a = 1;
>>
>> int foo ()
>> {
>>   return glob_a;
>> }
>> ---
>
> I see how this case is improved, but it depends on the dubious (and
> undocumented?) behavior of -Bsymbolic breaking copy relocations.

-Bsymbolic breaks copy relocations, independent of the compiler.
However, we can pass -fsymbolic when building PIE to avoid
copy relocations.  With -fsymbolic -fPIE -pie -flto, we can generate
direct references for locally defined symbols.


>> and
>>
>> ---
>> int glob_a (void)
>> {
>>   return -1;
>> }
>>
>> int foo ()
>> {
>>   return glob_a ();
>> }
>> ---
>
> I don't see how this case is improved unless GCC is failing to
> consider strong definitions in the same TU as locally-binding. If this
> is the case, is there a reason for that behavior? IMO it's wrong.

glob_a is a strong definition.  If you have another strong definition,
you will get a linker error.


-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 15:25         ` Alexander Monakov
  2015-05-06 15:46           ` Jakub Jelinek
@ 2015-05-07 18:22           ` Jeff Law
  2015-05-07 19:13             ` H.J. Lu
  1 sibling, 1 reply; 106+ messages in thread
From: Jeff Law @ 2015-05-07 18:22 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jakub Jelinek, gcc-patches, Rich Felker

On 05/06/2015 09:24 AM, Alexander Monakov wrote:
> On Mon, 4 May 2015, Jeff Law wrote:
>> On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
>>> On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
>>>> On 05/04/2015 10:37 AM, Alexander Monakov wrote:
>>>>> This patch introduces option -fno-plt that allows to expand calls that
>>>>> would
>>>>> go via PLT to load the address of the function immediately at call site
>>>>> (which
>>>>> introduces a GOT load).  Cover letter explains the motivation for this
>>>>> patch.
>>>>>
>>>>> New option documentation for invoke.texi is missing from the patch; if
>>>>> this is
>>>>> accepted I'll be happy to send a v2 with documentation added.
>>>>>
>>>>>   * calls.c (prepare_call_address): Transform PLT call to GOT lookup and
>>>>>   indirect call by forcing address into a pseudo with -fno-plt.
>>>>>   * common.opt (flag_plt): New option.
>>>> OK once you cobble together the invoke.texi changes.
>>>
>>> Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
>>> inline the plt slot's first part, then lazy binding will work fine.
>> I must have missed Alan/Michael's message.
>>
>> ISTM the win here is that by going through the GOT, you can CSE the GOT
>> reference and possibly get some more register allocation freedom.  Is that
>> still the case with Alan/Michael's approach?
>
> If the same PLT stubs as today are to be used, it constrains the compiler on
> 32-bit x86 and possibly other arches where PLT stubs need GOT pointer in a
> specific register.  It's possible to imagine more complex PLT stubs that
> obtain GOT pointer on their own, but in that case you can't let optimizations
> such as loop invariant motion move the GOT load away from the call in a
> fashion that could result in PLT stub pointer be reused many times.
>
> Going ahead with this patch now allows anyone to play with no-PLT codegen on
> any architecture.  As you can see from this series, on x86 it uncovered several
> codegen blunders (and fixing those should improve normal codegen as well -- so
> everybody wins).
>
> Below is my proposed patch for invoke.texi.  Still OK to check in?
>
> 	* doc/invoke.texi (Code Generation Options): Add -fno-plt.
> 	([-fno-plt]): Document.
We're not changing the defaults, so I think this is fine.  Whether or 
not it proves useful is still to be determined.

jeff

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-07 18:22           ` Jeff Law
@ 2015-05-07 19:13             ` H.J. Lu
  0 siblings, 0 replies; 106+ messages in thread
From: H.J. Lu @ 2015-05-07 19:13 UTC (permalink / raw)
  To: Jeff Law; +Cc: Alexander Monakov, Jakub Jelinek, GCC Patches, Rich Felker

On Thu, May 7, 2015 at 11:22 AM, Jeff Law <law@redhat.com> wrote:
> On 05/06/2015 09:24 AM, Alexander Monakov wrote:
>>
>> On Mon, 4 May 2015, Jeff Law wrote:
>>>
>>> On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
>>>>
>>>> On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
>>>>>
>>>>> On 05/04/2015 10:37 AM, Alexander Monakov wrote:
>>>>>>
>>>>>> This patch introduces option -fno-plt that allows to expand calls that
>>>>>> would
>>>>>> go via PLT to load the address of the function immediately at call
>>>>>> site
>>>>>> (which
>>>>>> introduces a GOT load).  Cover letter explains the motivation for this
>>>>>> patch.
>>>>>>
>>>>>> New option documentation for invoke.texi is missing from the patch; if
>>>>>> this is
>>>>>> accepted I'll be happy to send a v2 with documentation added.
>>>>>>
>>>>>>   * calls.c (prepare_call_address): Transform PLT call to GOT lookup
>>>>>> and
>>>>>>   indirect call by forcing address into a pseudo with -fno-plt.
>>>>>>   * common.opt (flag_plt): New option.
>>>>>
>>>>> OK once you cobble together the invoke.texi changes.
>>>>
>>>>
>>>> Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes
>>>> to
>>>> inline the plt slot's first part, then lazy binding will work fine.
>>>
>>> I must have missed Alan/Michael's message.
>>>
>>> ISTM the win here is that by going through the GOT, you can CSE the GOT
>>> reference and possibly get some more register allocation freedom.  Is
>>> that
>>> still the case with Alan/Michael's approach?
>>
>>
>> If the same PLT stubs as today are to be used, it constrains the compiler
>> on
>> 32-bit x86 and possibly other arches where PLT stubs need GOT pointer in a
>> specific register.  It's possible to imagine more complex PLT stubs that
>> obtain GOT pointer on their own, but in that case you can't let
>> optimizations
>> such as loop invariant motion move the GOT load away from the call in a
>> fashion that could result in PLT stub pointer be reused many times.
>>
>> Going ahead with this patch now allows anyone to play with no-PLT codegen
>> on
>> any architecture.  As you can see from this series, on x86 it uncovered
>> several
>> codegen blunders (and fixing those should improve normal codegen as well
>> -- so
>> everybody wins).
>>
>> Below is my proposed patch for invoke.texi.  Still OK to check in?
>>
>>         * doc/invoke.texi (Code Generation Options): Add -fno-plt.
>>         ([-fno-plt]): Document.
>
> We're not changing the defaults, so I think this is fine.  Whether or not it
> proves useful is still to be determined.
>

We should do it if we know -z now will be passed to the linker and function
foo is defined in a shared library.  Without the new relocation, we will
only know for sure that foo is defined in a shared library when we do LTO.
With the new relocation, we can do it for all non-local functions via a
compiler switch.


-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] PR65753: allow PIC tail calls via function pointers
  2015-05-04 16:38 ` [PATCH i386] PR65753: allow PIC tail calls via function pointers Alexander Monakov
@ 2015-05-10 16:37   ` Jan Hubicka
  2015-05-11 16:11     ` Alexander Monakov
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Hubicka @ 2015-05-10 16:37 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Rich Felker

> In the i386 backend, tailcalls are incorrectly disallowed in PIC mode for
> calls via function pointers on the basis that indirect calls, like direct
> calls, would go via PLT and thus require %ebx to point to GOT -- but that is
> not true.  Quoting Rich Felker who reported the bug,
> 
>   "For PLT slots in the non-PIE main executable, %ebx is not required at all.
>   PLT slots in PIE or shared libraries need %ebx, but a function pointer can
>   never evaluate to such a PLT slot; it always evaluates to the nominal address
>   of the function which is the same in all DSOs and therefore fundamentally
>   cannot depend on the address of the GOT in the calling DSO"
> 
> As far as I can see it's simply a mistake that was there from day 1 (comment 4
> in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65753 points to original patch).
> 
> Bootstrapped and regtested on 32-bit x86, OK for trunk?
> (the comment before the condition will need to be adjusted too, i.e.
> s/optimize any indirect call, or a direct call/optimize any direct call/ )
> 
> 	PR target/65753
> 	* config/i386/i386.c (ix86_function_ok_for_sibcall): Allow PIC sibcalls
> 	via function pointers.
> 
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 3263656..f29e053 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -5448,13 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
>    /* If we are generating position-independent code, we cannot sibcall
>       optimize any indirect call, or a direct call to a global function,
>       as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
>    if (!TARGET_MACHO
>        && !TARGET_64BIT
>        && flag_pic
> -      && (!decl || !targetm.binds_local_p (decl)))
> +      && (decl && !targetm.binds_local_p (decl)))

You probably need to update the comment here. I wonder what happens when we optimize
an indirect call to a direct call to a global function at RTL level? I suppose we are
safe here, because at RTL level we explicitly represent whether we refer to the PLT entry
or the function address itself and we never optimize one to the other?

Patch is OK if you make sure that this works and update the comment.

Honza
>      return false;
>  
>    /* If we need to align the outgoing stack, then sibcalling would
>       unalign the stack, which may break the called function.  */
>    if (ix86_minimum_incoming_stack_boundary (true)
>        < PREFERRED_STACK_BOUNDARY)

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Move CLOBBERED_REGS earlier in register class list
  2015-05-04 16:38 ` [PATCH i386] Move CLOBBERED_REGS earlier in register class list Alexander Monakov
@ 2015-05-10 16:44   ` Jan Hubicka
  2015-05-10 17:51     ` Uros Bizjak
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Hubicka @ 2015-05-10 16:44 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Rich Felker, ubizjak, vmakarov

> On 32-bit x86, register class CLOBBERED_REGS is a proper subset of
> LEGACY_REGS, which causes IRA not to consider it separately for register
> allocation, even when it has lower cost than other classes.  This patch is
> useful to fix code generation problem that appears with no-PLT PIC tailcalls.
> 
> Was there a specific reason for CLOBBERED_REGS class to be listed as late as
> it is?  On 32-bit this class contains only EAX, ECX, EDX.

Uros moved CLOBBERED_REGS late in 
https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00796.html
which contains a rationale, too.

I am adding Uros and Vladimir to CC just in case they missed the email :)
Honza
> 
> OK?
> 	* config/i386/i386.h (enum reg_class): Move CLOBBERED_REGS before Q_REGS.
> 	(REG_CLASS_NAMES): Ditto.
> 	(REG_CLASS_CONTENTS): Ditto.
> 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-04 16:38 ` [PATCH i386] Extend sibcall peepholes to allow source in %eax Alexander Monakov
@ 2015-05-10 16:54   ` Jan Hubicka
  2015-05-11 17:50     ` Alexander Monakov
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Hubicka @ 2015-05-10 16:54 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Rich Felker

> On i386, peepholes that transform memory load and register-indirect jump into
> memory-indirect jump are overly restrictive in that they don't allow combining
> when the jump target is loaded into %eax, and the called function returns a
> value (also in %eax, so it's not dead after the call).  Fix this by checking
> for same source and output register operands separately.
> 
> OK?
> 	* config/i386/i386.md (sibcall_value_memory): Extend peepholes to
> 	allow memory address in %eax.
> 	(sibcall_value_pop_memory): Likewise.

Why do we need the check for liveness after all?  There is SIBLING_CALL_P
(peep2_next_insn (1)) so we know that the function terminates by the call
and there are no other uses of the value.

Don't we however need to check that operands[0] is not used by the call_insn as
parameter of the call?  I.e. something like

void
test(void (*callback ()))
{
  callback(callback);
}

I think instead of peep2_reg_dead_p we want to check that the parameter is not in
CALL_INSN_FUNCTION_USAGE of the sibcall.

Honza
> 
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 729db75..7f81bcc 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -11872,13 +11872,14 @@
>    [(set (match_operand:W 0 "register_operand")
>  	(match_operand:W 1 "memory_operand"))
>     (set (match_operand 2)
>     (call (mem:QI (match_dup 0))
>  		 (match_operand 3)))]
>    "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (1))
> -   && peep2_reg_dead_p (2, operands[0])"
> +   && (REGNO (operands[2]) == REGNO (operands[0])
> +       || peep2_reg_dead_p (2, operands[0]))"
>    [(parallel [(set (match_dup 2)
>  		   (call (mem:QI (match_dup 1))
>  			 (match_dup 3)))
>  	      (unspec [(const_int 0)] UNSPEC_PEEPSIB)])])
>  
>  (define_peephole2
> @@ -11886,13 +11887,14 @@
>  	(match_operand:W 1 "memory_operand"))
>     (unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
>     (set (match_operand 2)
>  	(call (mem:QI (match_dup 0))
>  	      (match_operand 3)))]
>    "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (2))
> -   && peep2_reg_dead_p (3, operands[0])"
> +   && (REGNO (operands[2]) == REGNO (operands[0])
> +       || peep2_reg_dead_p (3, operands[0]))"
>    [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
>     (parallel [(set (match_dup 2)
>  		   (call (mem:QI (match_dup 1))
>  			 (match_dup 3)))
>  	      (unspec [(const_int 0)] UNSPEC_PEEPSIB)])])
>  
> @@ -11951,13 +11953,14 @@
>  		   (call (mem:QI (match_dup 0))
>  			 (match_operand 3)))
>  	      (set (reg:SI SP_REG)
>  		   (plus:SI (reg:SI SP_REG)
>  			    (match_operand:SI 4 "immediate_operand")))])]
>    "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (1))
> -   && peep2_reg_dead_p (2, operands[0])"
> +   && (REGNO (operands[2]) == REGNO (operands[0])
> +       || peep2_reg_dead_p (2, operands[0]))"
>    [(parallel [(set (match_dup 2)
>  		   (call (mem:QI (match_dup 1))
>  			 (match_dup 3)))
>  	      (set (reg:SI SP_REG)
>  		   (plus:SI (reg:SI SP_REG)
>  			    (match_dup 4)))
> @@ -11971,13 +11974,14 @@
>  		   (call (mem:QI (match_dup 0))
>  			 (match_operand 3)))
>  	      (set (reg:SI SP_REG)
>  		   (plus:SI (reg:SI SP_REG)
>  			    (match_operand:SI 4 "immediate_operand")))])]
>    "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (2))
> -   && peep2_reg_dead_p (3, operands[0])"
> +   && (REGNO (operands[2]) == REGNO (operands[0])
> +       || peep2_reg_dead_p (3, operands[0]))"
>    [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
>     (parallel [(set (match_dup 2)
>  		   (call (mem:QI (match_dup 1))
>  			 (match_dup 3)))
>  	      (set (reg:SI SP_REG)
>  		   (plus:SI (reg:SI SP_REG)

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-04 16:38 ` [PATCH] Expand PIC calls without PLT with -fno-plt Alexander Monakov
  2015-05-04 17:34   ` Jeff Law
@ 2015-05-10 16:59   ` Jan Hubicka
  2015-05-11 20:36     ` Jeff Law
  2015-06-22 15:52   ` Jiong Wang
  2 siblings, 1 reply; 106+ messages in thread
From: Jan Hubicka @ 2015-05-10 16:59 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Rich Felker

> This patch introduces option -fno-plt that allows to expand calls that would
> go via PLT to load the address of the function immediately at call site (which
> introduces a GOT load).  Cover letter explains the motivation for this patch.
> 
> New option documentation for invoke.texi is missing from the patch; if this is
> accepted I'll be happy to send a v2 with documentation added.
> 
> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> 	indirect call by forcing address into a pseudo with -fno-plt.
> 	* common.opt (flag_plt): New option.
> 
> diff --git a/gcc/common.opt b/gcc/common.opt
> index b49ac46..cd8b256 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1773,12 +1773,16 @@ Common Report Var(flag_pic,1) Negative(fpie)
>  Generate position-independent code if possible (small mode)
>  
>  fpie
>  Common Report Var(flag_pie,1) Negative(fPIC)
>  Generate position-independent code for executables if possible (small mode)
>  
> +fplt
> +Common Report Var(flag_plt) Init(1)
> +Use PLT for PIC calls (-fno-plt: load the address from GOT at call site)
> +

This won't play well with LTO since fplt will become another global flag while
it affects codegen.

I still did not catch up with the other thread and H.J.'s work on doing this
transparently in the linker, but if this is getting in, please add Optimization to
fplt, so the PLT usage can be decided with per-function granularity.

Honza

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06  3:08         ` Rich Felker
@ 2015-05-10 17:07           ` Jan Hubicka
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Hubicka @ 2015-05-10 17:07 UTC (permalink / raw)
  To: Rich Felker; +Cc: Jeff Law, Jakub Jelinek, Alexander Monakov, gcc-patches

> On Mon, May 04, 2015 at 11:42:20AM -0600, Jeff Law wrote:
> > On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
> > >On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
> > >>On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> > >>>This patch introduces option -fno-plt that allows to expand calls that would
> > >>>go via PLT to load the address of the function immediately at call site (which
> > >>>introduces a GOT load).  Cover letter explains the motivation for this patch.
> > >>>
> > >>>New option documentation for invoke.texi is missing from the patch; if this is
> > >>>accepted I'll be happy to send a v2 with documentation added.
> > >>>
> > >>>	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> > >>>	indirect call by forcing address into a pseudo with -fno-plt.
> > >>>	* common.opt (flag_plt): New option.
> > >>OK once you cobble together the invoke.texi changes.
> > >
> > >Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
> > >inline the plt slot's first part, then lazy binding will work fine.
> > I must have missed Alan/Michael's message.
> > 
> > ISTM the win here is that by going through the GOT, you can CSE the
> > GOT reference and possibly get some more register allocation
> > freedom.  Is that still the case with Alan/Michael's approach?
> 
> There are many advantages to 'going through the GOT'. CSE'ing the
> reference is just one. The biggest (IMO) is that you can avoid the bad
> PLT ABI that most targets have, where making a call to a PLT slot
> requires the GOT address to be pre-loaded into a fixed, call-saved
> register. This precludes sibcalls and forces many functions which
> otherwise would not need their own stack frames to create one for
> saving the old value of the GOT register. See my blog entry on the
> topic here: http://ewontfix.com/18/

One common pattern I noticed while looking at codegen for speculative devirtualization
is that in case we do not inline the virtual call we end up with

if (ptr == &foo)
  foo()

which leads to both a GOT lookup to figure out the address of foo and a call across the PLT.
It would be nice to handle this gracefully.

Note that one of the improvements I want to make to the devirt machinery is to change
the code sequence to:

 if (vptr == &expected_vtable)
   foo ()
 else
   vptr[token]();

to save the vtable lookup. But this is not possible in all cases - it happens
that there are multiple predicted vtables all agreeing on the particular slot.
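
A minimal C sketch of the guarded sequence above, only to illustrate the idea.
The names struct obj, expected_vtable, other_vtable and call_slot0 are
hypothetical stand-ins for a C++ object and its vtables, not anything from GCC:

```c
/* Hypothetical stand-in for a C++ object carrying a vtable pointer. */
typedef int (*slot_fn)(void);

struct obj {
  const slot_fn *vptr;
};

static int foo(void) { return 1; }
static int other(void) { return 2; }

static const slot_fn expected_vtable[] = { foo };
static const slot_fn other_vtable[] = { other };

/* Guard on the vtable pointer rather than on the loaded slot: the
   predicted path calls foo directly without loading from the vtable,
   and the fallback path does the ordinary indirect call.  */
static int call_slot0(const struct obj *o)
{
  if (o->vptr == expected_vtable)
    return foo();
  else
    return o->vptr[0]();
}
```

The guard only helps when a single vtable is predicted; with several predicted
vtables that share the slot, the comparison on the slot itself is still needed.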

Honza
> 
> Anyone who really wants lazy binding can use -fplt (which is
> presumably still the default; I didn't check) but lazy binding should
> largely be considered deprecated anyway since effective use of relro
> protection requires -z now too, in which case you're paying all the
> costs (which are considerable!) for lazy binding support even though
> you won't get it.
> 
> Rich

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Move CLOBBERED_REGS earlier in register class list
  2015-05-10 16:44   ` Jan Hubicka
@ 2015-05-10 17:51     ` Uros Bizjak
  2015-05-10 18:09       ` Uros Bizjak
  0 siblings, 1 reply; 106+ messages in thread
From: Uros Bizjak @ 2015-05-10 17:51 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Alexander Monakov, gcc-patches, Rich Felker, Vladimir Makarov

On Sun, May 10, 2015 at 6:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On 32-bit x86, register class CLOBBERED_REGS is a proper subset of
>> LEGACY_REGS, which causes IRA not to consider it separately for register
>> allocation, even when it has lower cost than other classes.  This patch is
>> useful to fix code generation problem that appears with no-PLT PIC tailcalls.
>>
>> Was there a specific reason for CLOBBERED_REGS class to be listed as late as
>> it is?  On 32-bit this class contains only EAX, ECX, EDX.
>
> Uros moved CLOBBERED_REGS late in
> https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00796.html
> which contains a rationale, too.
>
> I am adding Uros and Vladimir to CC just in case they missed the email :)
> Honza

Uh, I don't remember that far, but from the context of the referred
patch, it looks like a "cleanup" of some sort. My patch matched 32bit
to 64bit, but it could also have been done the opposite way. Let's try the patch
and see what breaks.

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Move CLOBBERED_REGS earlier in register class list
  2015-05-10 17:51     ` Uros Bizjak
@ 2015-05-10 18:09       ` Uros Bizjak
  2015-05-11 16:26         ` Alexander Monakov
  0 siblings, 1 reply; 106+ messages in thread
From: Uros Bizjak @ 2015-05-10 18:09 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Alexander Monakov, gcc-patches, Rich Felker, Vladimir Makarov

On Sun, May 10, 2015 at 7:51 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sun, May 10, 2015 at 6:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> On 32-bit x86, register class CLOBBERED_REGS is a proper subset of
>>> LEGACY_REGS, which causes IRA not to consider it separately for register
>>> allocation, even when it has lower cost than other classes.  This patch is
>>> useful to fix code generation problem that appears with no-PLT PIC tailcalls.
>>>
>>> Was there a specific reason for CLOBBERED_REGS class to be listed as late as
>>> it is?  On 32-bit this class contains only EAX, ECX, EDX.
>>
>> Uros moved CLOBBERED_REGS late in
>> https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00796.html
>> which contains a rationale, too.
>>
>> I am adding Uros and Vladimir to CC just in case they missed the email :)
>> Honza
>
> Uh, I don't remember that far, but from the context of the referred
> patch, it looks like a "cleanup" of some sort. My patch matched 32bit
> to 64bit, but could be also in the opposite way. Let's try the patch
> and see what breaks.

Ah, the reason was that 64bit targets have many more call-clobbered
registers. So, I tried to position CLOBBERED_REGS according to the
ascending number of registers in the register set. Maybe the cleanest
solution is to split the class into CLOBBERED_REGS_32 and
CLOBBERED_REGS_64 classes and set the register constraint in
constraints.md depending on TARGET_64BIT.

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-06 19:18                           ` Rich Felker
  2015-05-06 19:24                             ` H.J. Lu
@ 2015-05-11 11:48                             ` Michael Matz
  2015-05-11 14:20                               ` Rich Felker
  1 sibling, 1 reply; 106+ messages in thread
From: Michael Matz @ 2015-05-11 11:48 UTC (permalink / raw)
  To: Rich Felker
  Cc: H.J. Lu, Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

Hi,

On Wed, 6 May 2015, Rich Felker wrote:

> I don't see how this case is improved unless GCC is failing to consider 
> strong definitions in the same TU as locally-binding.

Interposition of non-static non-inline non-weak symbols is supported 
independent of whether they are defined in the same TU or not (if you're 
producing a shared lib, that is).  I.e. no, they are not considered 
locally-binding (for instance, they aren't automatically inlined).

> If this is the case, is there a reason for that behavior?

Because IMHO interposition is orthogonal to TU placement, and hence 
shouldn't be influenced by it.  There's visibility, inline hints or 
static-ness to achieve different effects.  (perhaps the real reason is: 
because it always worked like that :) )

> IMO it's wrong.

Why?  I think it's right.

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-11 11:48                             ` Michael Matz
@ 2015-05-11 14:20                               ` Rich Felker
  0 siblings, 0 replies; 106+ messages in thread
From: Rich Felker @ 2015-05-11 14:20 UTC (permalink / raw)
  To: Michael Matz
  Cc: H.J. Lu, Alexander Monakov, Jakub Jelinek, Jeff Law, GCC Patches

On Mon, May 11, 2015 at 01:48:03PM +0200, Michael Matz wrote:
> Hi,
> 
> On Wed, 6 May 2015, Rich Felker wrote:
> 
> > I don't see how this case is improved unless GCC is failing to consider 
> > strong definitions in the same TU as locally-binding.
> 
> Interposition of non-static non-inline non-weak symbols is supported 
> independent of whether they are defined in the same TU or not (if you're 
> producing a shared lib, that is).  I.e. no, they are not considered 
> locally-binding (for instance, they aren't automatically inlined).
>
> > If this is the case, is there a reason for that behavior?
> 
> Because IMHO interposition is orthogonal to TU placement, and hence 
> shouldn't be influenced by it.  There's visibility, inline hints or 
> static-ness to achieve different effects.  (perhaps the real reason is: 
> because it always worked like that :) )
> 
> > IMO it's wrong.
> 
> Why?  I think it's right.

I see it as an unnecessary pessimization. The ELF shared library
semantics for allowing interposition were designed to avoid behavioral
regressions versus static linking, and this is not such a case. With
an archive-type library, it's possible to cause whole TUs to be
omitted when linking as long as whatever symbol(s) may have been
needed from them are already defined elsewhere; interposition makes
the same possible with dynamic linking. But if symbols A and B were
both in the same TU, having A defined prior to searching an archive
but B undefined will cause the TU that defines both to be pulled in,
and is thus a linking error (multiple definitions). So I'm not sure
why it's desirable to support this.

The "it always worked like that" argument may suffice if people are
depending on this behavior now (OTOH I'd rather see it break so they
fix their breakage of static linking) but I suspect the historical
reason it worked like that is that compilers were not smart enough to
process whole TUs at a time but just worked with one function at a
time and did not know that a referenced symbol was in the same TU.

BTW visibility can't really address the issue except with hacks
(hidden aliases) or protected visibility (which is hard to use because
it's broken on lots of toolchain versions).
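
For reference, a minimal sketch of the hidden-alias hack mentioned above, using
the glob_a/foo example from earlier in the thread. It assumes GCC on an ELF
target; the alias name glob_a_local is hypothetical:

```c
/* Exported definition; calls through this name stay interposable.  */
int glob_a(void)
{
  return -1;
}

/* Hidden alias for the same function.  It is not exported from the DSO,
   so references to it bind locally and need neither PLT nor GOT.  */
extern __typeof__(glob_a) glob_a_local
  __attribute__((alias("glob_a"), visibility("hidden")));

int foo(void)
{
  /* Calls the local definition even if glob_a is interposed elsewhere.  */
  return glob_a_local();
}
```

The cost is that every internal caller has to be changed to use the alias,
which is why it counts as a hack rather than a real fix.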

Rich

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] PR65753: allow PIC tail calls via function pointers
  2015-05-10 16:37   ` Jan Hubicka
@ 2015-05-11 16:11     ` Alexander Monakov
  0 siblings, 0 replies; 106+ messages in thread
From: Alexander Monakov @ 2015-05-11 16:11 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, Rich Felker

On Sun, 10 May 2015, Jan Hubicka wrote:
> You probably need to update comment here. I wonder what happens when we optimize
> indirect call to direct call to global function at RTL level? I suppose we are
> safe here, because at RTL level we explicitly represent if we refer to PLT entry
> or the functionaddress itself and we never optimize one to the other?
> 
> Patch is OK if you make sure that this works and update the comment.

I think we are safe: to have things break we'd have to have a GOT-relative
memory load be combined with a branch on the RTL level, and GOT loads have
UNSPEC_GOT.  I have used the following example to try to induce failure:

  void foo(void);
  void bar()
  {
    void (*p)(void) = foo;
    p();
  }

With the following options: 

  gcc -fPIC -m32 -O -foptimize-sibling-calls -fno-tree-ccp -fno-tree-copy-prop
  -fno-tree-fre -fno-tree-dominator-opts -fno-tree-ter

GCC has indirect call after pass_expand.  Without -fPIC it is transformed into
direct call in pass_combine, with -fPIC it is kept as is.

I've added a testcase.  Below is what I'm checking in.  Thanks!

Index: testsuite/gcc.target/i386/pr65753.c
===================================================================
--- testsuite/gcc.target/i386/pr65753.c	(revision 0)
+++ testsuite/gcc.target/i386/pr65753.c	(revision 0)
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-fPIC -O" } */
+/* { dg-final { scan-assembler-not "call" } } */
+
+void foo(void (*bar)(void))
+{
+  bar();
+}
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 223002)
+++ config/i386/i386.c	(working copy)
@@ -5473,12 +5473,12 @@
   rtx a, b;
 
   /* If we are generating position-independent code, we cannot sibcall
-     optimize any indirect call, or a direct call to a global function,
-     as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
+     optimize direct calls to global functions, as the PLT requires
+     %ebx be live. (Darwin does not have a PLT.)  */
   if (!TARGET_MACHO
       && !TARGET_64BIT
       && flag_pic
-      && (!decl || !targetm.binds_local_p (decl)))
+      && decl && !targetm.binds_local_p (decl))
     return false;

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Move CLOBBERED_REGS earlier in register class list
  2015-05-10 18:09       ` Uros Bizjak
@ 2015-05-11 16:26         ` Alexander Monakov
  2015-05-11 16:30           ` Uros Bizjak
  0 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-11 16:26 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: Jan Hubicka, gcc-patches, Rich Felker, Vladimir Makarov

On Sun, 10 May 2015, Uros Bizjak wrote:
> On Sun, May 10, 2015 at 7:51 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> > On Sun, May 10, 2015 at 6:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>> On 32-bit x86, register class CLOBBERED_REGS is a proper subset of
> >>> LEGACY_REGS, which causes IRA not to consider it separately for register
> >>> allocation, even when it has lower cost than other classes.  This patch is
> >>> useful to fix code generation problem that appears with no-PLT PIC tailcalls.
> >>>
> >>> Was there a specific reason for CLOBBERED_REGS class to be listed as late as
> >>> it is?  On 32-bit this class contains only EAX, ECX, EDX.
> >>
> >> Uros moved CLOBBERED_REGS late in
> >> https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00796.html
> >> which contains a rationale, too.
> >>
> >> I am adding Uros and Vladimir to CC just in case they missed the email :)
> >> Honza
> >
> > Uh, I don't remember that far, but from the context of the referred
> > patch, it looks like a "cleanup" of some sort. My patch matched 32bit
> > to 64bit, but could be also in the opposite way. Let's try the patch
> > and see what breaks.
> 
> Ah, the reason was that 64bit targets have many more call-clobbered
> registers. So, I tried to position CLOBBERED_REGS according to the
> ascending number of registers in the register set. Maybe the most
> clean solution is to split the class to CLOBBERED_REGS_32 and
> CLOBBERED_REGS_64 classes and set the register constraint in
> constraints.md depending on TARGET_64BIT.

Is there something to be gained by doing such a split?

It seems to me that trying clobbered regs for allocation earlier than others
(thus, callee-saved) makes sense in general, at least for non-static functions,
as callers in other TUs would have to save/restore those registers anyway.
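
To make the subset relation concrete, here is a toy C model of the 32-bit
integer classes as bitmasks.  This is not GCC code: the bit assignments and
the helper names (proper_subset_p, class_size) are made up for illustration;
only the class contents are taken from the comments in i386.h.

```c
/* Toy model of the 32-bit integer register classes as bitmasks.
   Bit positions and helper names are illustrative, not GCC's.  */
enum { EAX = 1 << 0, EDX = 1 << 1, ECX = 1 << 2, EBX = 1 << 3,
       ESI = 1 << 4, EDI = 1 << 5, EBP = 1 << 6, ESP = 1 << 7 };

#define CLOBBERED_REGS_32 (EAX | ECX | EDX)
#define Q_REGS_32         (EAX | EBX | ECX | EDX)
#define LEGACY_REGS_32    (Q_REGS_32 | ESI | EDI | EBP | ESP)

/* Nonzero if every register in A is also in B and B has at least
   one register that A lacks.  */
static int
proper_subset_p (unsigned a, unsigned b)
{
  return (a & ~b) == 0 && a != b;
}

/* Number of registers in class MASK.  */
static int
class_size (unsigned mask)
{
  int n = 0;
  for (; mask; mask &= mask - 1)
    n++;
  return n;
}
```

In this model CLOBBERED_REGS has three registers and is a proper subset of
both Q_REGS (four) and LEGACY_REGS (eight), so on 32-bit listing it before
Q_REGS also keeps the classes in ascending size order.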

Alexander

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Move CLOBBERED_REGS earlier in register class list
  2015-05-11 16:26         ` Alexander Monakov
@ 2015-05-11 16:30           ` Uros Bizjak
  0 siblings, 0 replies; 106+ messages in thread
From: Uros Bizjak @ 2015-05-11 16:30 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jan Hubicka, gcc-patches, Rich Felker, Vladimir Makarov

On Mon, May 11, 2015 at 6:25 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
>> >>> LEGACY_REGS, which causes IRA not to consider it separately for register
>> >>> allocation, even when it has lower cost than other classes.  This patch is
>> >>> useful to fix code generation problem that appears with no-PLT PIC tailcalls.
>> >>>
>> >>> Was there a specific reason for CLOBBERED_REGS class to be listed as late as
>> >>> it is?  On 32-bit this class contains only EAX, ECX, EDX.
>> >>
>> >> Uros moved CLOBBERED_REGS late in
>> >> https://gcc.gnu.org/ml/gcc-patches/2012-08/msg00796.html
>> >> which contains a rationale, too.
>> >>
>> >> I am adding Uros and Vladimir to CC just in case they missed the email :)
>> >> Honza
>> >
>> > Uh, I don't remember that far, but from the context of the referred
>> > patch, it looks like a "cleanup" of some sort. My patch matched 32bit
>> > to 64bit, but could be also in the opposite way. Let's try the patch
>> > and see what breaks.
>>
>> Ah, the reason was that 64bit targets have many more call-clobbered
>> registers. So, I tried to position CLOBBERED_REGS according to the
>> ascending number of registers in the register set. Maybe the most
>> clean solution is to split the class to CLOBBERED_REGS_32 and
>> CLOBBERED_REGS_64 classes and set the register constraint in
>> constraints.md depending on TARGET_64BIT.
>
> Is there something to be gained by doing such a split?
>
> It seems to me that trying clobbered regs for allocation earlier than others
> (thus, callee-saved) makes sense in general, at least for non-static functions,
> as callers in other TUs would have to save/restore those registers anyway.

OK, if the position of CLOBBERED_REGS doesn't affect other
functionality (for 32, 64 and MS_ABI targets), then just move the
definition to the better place. Please also add the comment explaining
the rationale to prevent possible "cleanup" attempts in the future.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-10 16:54   ` Jan Hubicka
@ 2015-05-11 17:50     ` Alexander Monakov
  2015-05-11 18:00       ` Jan Hubicka
  0 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-11 17:50 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, Uros Bizjak, Rich Felker

On Sun, 10 May 2015, Jan Hubicka wrote:

> > On i386, peepholes that transform memory load and register-indirect jump into
> > memory-indirect jump are overly restrictive in that they don't allow combining
> > when the jump target is loaded into %eax, and the called function returns a
> > value (also in %eax, so it's not dead after the call).  Fix this by checking
> > for same source and output register operands separately.
> > 
> > OK?
> > 	* config/i386/i386.md (sibcall_value_memory): Extend peepholes to
> > 	allow memory address in %eax.
> > 	(sibcall_value_pop_memory): Likewise.
> 
> Why do we need the check for liveness after all?  There is SIBLING_CALL_P
> (peep2_next_insn (1)) so we know that the function terminates by the call
> and there are no other uses of the value.

Indeed.  Uros, the peep2_reg_dead_p check was added by your patch as svn
revision 211776, git commit e51f8b8fed.  Would you agree that the check is not
necessary for sibcalls as Honza explains?  Would you approve a patch that
removes it in the sibcall peepholes I modify in the patch under discussion?
 
> Don't we however need to check that operands[0] is not used by the call_insn as
> parameter of the call?  I.e. something like
> 
> void
> test(void (*callback ()))
> {
>   callback(callback);
> }

You need a pointer-to-pointer-to-function to trigger the peephole.  Something
like this:

  void foo()
  {
    void (**bar)(void*);
    asm("":"=r"(bar));
    (*bar)(*bar);
  }

> I think instead of peep2_reg_dead_p we want to check that the parameter is not in
> CALL_INSN_FUNCTION_USAGE of the sibcall..

Playing with the above testcase I can't induce failure.  It seems today GCC
won't allocate the same register as callee address and one of the arguments.
Do you want me to implement such a check anyway?

> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index 729db75..7f81bcc 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -11872,13 +11872,14 @@
> >    [(set (match_operand:W 0 "register_operand")
> >  	(match_operand:W 1 "memory_operand"))
> >     (set (match_operand 2)
> >     (call (mem:QI (match_dup 0))
> >  		 (match_operand 3)))]
> >    "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (1))
> > -   && peep2_reg_dead_p (2, operands[0])"
> > +   && (REGNO (operands[2]) == REGNO (operands[0])
> > +       || peep2_reg_dead_p (2, operands[0]))"
> >    [(parallel [(set (match_dup 2)
> >  		   (call (mem:QI (match_dup 1))
> >  			 (match_dup 3)))
> >  	      (unspec [(const_int 0)] UNSPEC_PEEPSIB)])])
> >  
> >  (define_peephole2
> > @@ -11886,13 +11887,14 @@
> >  	(match_operand:W 1 "memory_operand"))
> >     (unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
> >     (set (match_operand 2)
> >  	(call (mem:QI (match_dup 0))
> >  	      (match_operand 3)))]
> >    "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (2))
> > -   && peep2_reg_dead_p (3, operands[0])"
> > +   && (REGNO (operands[2]) == REGNO (operands[0])
> > +       || peep2_reg_dead_p (3, operands[0]))"
> >    [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
> >     (parallel [(set (match_dup 2)
> >  		   (call (mem:QI (match_dup 1))
> >  			 (match_dup 3)))
> >  	      (unspec [(const_int 0)] UNSPEC_PEEPSIB)])])
> >  
> > @@ -11951,13 +11953,14 @@
> >  		   (call (mem:QI (match_dup 0))
> >  			 (match_operand 3)))
> >  	      (set (reg:SI SP_REG)
> >  		   (plus:SI (reg:SI SP_REG)
> >  			    (match_operand:SI 4 "immediate_operand")))])]
> >    "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (1))
> > -   && peep2_reg_dead_p (2, operands[0])"
> > +   && (REGNO (operands[2]) == REGNO (operands[0])
> > +       || peep2_reg_dead_p (2, operands[0]))"
> >    [(parallel [(set (match_dup 2)
> >  		   (call (mem:QI (match_dup 1))
> >  			 (match_dup 3)))
> >  	      (set (reg:SI SP_REG)
> >  		   (plus:SI (reg:SI SP_REG)
> >  			    (match_dup 4)))
> > @@ -11971,13 +11974,14 @@
> >  		   (call (mem:QI (match_dup 0))
> >  			 (match_operand 3)))
> >  	      (set (reg:SI SP_REG)
> >  		   (plus:SI (reg:SI SP_REG)
> >  			    (match_operand:SI 4 "immediate_operand")))])]
> >    "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (2))
> > -   && peep2_reg_dead_p (3, operands[0])"
> > +   && (REGNO (operands[2]) == REGNO (operands[0])
> > +       || peep2_reg_dead_p (3, operands[0]))"
> >    [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
> >     (parallel [(set (match_dup 2)
> >  		   (call (mem:QI (match_dup 1))
> >  			 (match_dup 3)))
> >  	      (set (reg:SI SP_REG)
> >  		   (plus:SI (reg:SI SP_REG)
> 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-11 17:50     ` Alexander Monakov
@ 2015-05-11 18:00       ` Jan Hubicka
  2015-05-11 19:46         ` Uros Bizjak
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Hubicka @ 2015-05-11 18:00 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jan Hubicka, gcc-patches, Uros Bizjak, Rich Felker

> On Sun, 10 May 2015, Jan Hubicka wrote:
> 
> > > On i386, peepholes that transform memory load and register-indirect jump into
> > > memory-indirect jump are overly restrictive in that they don't allow combining
> > > when the jump target is loaded into %eax, and the called function returns a
> > > value (also in %eax, so it's not dead after the call).  Fix this by checking
> > > for same source and output register operands separately.
> > > 
> > > OK?
> > > 	* config/i386/i386.md (sibcall_value_memory): Extend peepholes to
> > > 	allow memory address in %eax.
> > > 	(sibcall_value_pop_memory): Likewise.
> > 
> > Why do we need the check for liveness after all?  There is SIBLING_CALL_P
> > (peep2_next_insn (1)) so we know that the function terminates by the call
> > and there are no other uses of the value.
> 
> Indeed.  Uros, the peep2_reg_dead_p check was added by your patch as svn
> revision 211776, git commit e51f8b8fed.  Would you agree that the check is not
> necessary for sibcalls as Honza explains?  Would you approve a patch that
> removes it in the sibcall peepholes I modify in the patch under discussion?
>  
> > Don't we however need to check that operands[0] is not used by the call_insn as
> > parameter of the call?  I.e. something like
> > 
> > void
> > test(void (*callback ()))
> > {
> >   callback(callback);
> > }
> 
> You need a pointer-to-pointer-to-function to trigger the peephole.  Something
> like this:
> 
>   void foo()
>   {
>     void (**bar)(void*);
>     asm("":"=r"(bar));
>     (*bar)(*bar);
>   }
> 
> > I think instead of peep2_reg_dead_p we want to check that the parameter is not in
> > CALL_INSN_FUNCTION_USAGE of the sibcall..
> 
> Playing with the above testcase I can't induce failure.  It seems today GCC
> won't allocate the same register as callee address and one of the arguments.
> Do you want me to implement such a check anyway?

Hmm, only way I can trigger same register is:
void foo()
  {
    void (**bar)(void*);
    asm("":"=r"(bar));
    register void (*var)(void *) asm("%eax");
    var=*bar;
    asm("":"+r"(var));
    var(var);
  }

removing the second asm makes CSE forward propagate the memory operand
to call that makes the call different from the register variable.

Still I would check for that, but this is more Uros' area.

Honza

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-11 18:00       ` Jan Hubicka
@ 2015-05-11 19:46         ` Uros Bizjak
  2015-05-11 19:48           ` Jeff Law
  0 siblings, 1 reply; 106+ messages in thread
From: Uros Bizjak @ 2015-05-11 19:46 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Alexander Monakov, gcc-patches, Rich Felker

On Mon, May 11, 2015 at 8:00 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Sun, 10 May 2015, Jan Hubicka wrote:
>>
>> > > On i386, peepholes that transform memory load and register-indirect jump into
>> > > memory-indirect jump are overly restrictive in that they don't allow combining
>> > > when the jump target is loaded into %eax, and the called function returns a
>> > > value (also in %eax, so it's not dead after the call).  Fix this by checking
>> > > for same source and output register operands separately.
>> > >
>> > > OK?
>> > >   * config/i386/i386.md (sibcall_value_memory): Extend peepholes to
>> > >   allow memory address in %eax.
>> > >   (sibcall_value_pop_memory): Likewise.
>> >
>> > Why do we need the check for liveness after all?  There is SIBLING_CALL_P
>> > (peep2_next_insn (1)) so we know that the function terminates by the call
>> > and there are no other uses of the value.
>>
>> Indeed.  Uros, the peep2_reg_dead_p check was added by your patch as svn
>> revision 211776, git commit e51f8b8fed.  Would you agree that the check is not
>> necessary for sibcalls as Honza explains?  Would you approve a patch that
>> removes it in the sibcall peepholes I modify in the patch under discussion?
>>
>> > Don't we however need to check that operands[0] is not used by the call_insn as
>> > parameter of the call?  I.e. something like
>> >
>> > void
>> > test(void (*callback ()))
>> > {
>> >   callback(callback);
>> > }
>>
>> You need a pointer-to-pointer-to-function to trigger the peephole.  Something
>> like this:
>>
>>   void foo()
>>   {
>>     void (**bar)(void*);
>>     asm("":"=r"(bar));
>>     (*bar)(*bar);
>>   }
>>
>> > I think instead of peep2_reg_dead_p we want to check that the parameter is not in
>> > CALL_INSN_FUNCTION_USAGE of the sibcall..
>>
>> Playing with the above testcase I can't induce failure.  It seems today GCC
>> won't allocate the same register as callee address and one of the arguments.
>> Do you want me to implement such a check anyway?
>
> Hmm, only way I can trigger same register is:
> void foo()
>   {
>     void (**bar)(void*);
>     asm("":"=r"(bar));
>     register void (*var)(void *) asm("%eax");
>     var=*bar;
>     asm("":"+r"(var));
>     var(var);
>   }
>
> removing the second asm makes CSE forward propagate the memory operand
> to call that makes the call different from the register variable.
>
> Still I would check for that, but this is more Uros' area.

This is from [1], and reading this reference, it looks to me that the
check was introduced due to:

- Adds check that eliminated register is really dead after the call
(maybe an overkill, but some hard-to-debug problems surfaced due to
missing liveness checks in the past)

Going down that memory lane, it looks like a safety check for
something that *might* happen. Looking at the comment, I'd say we can
remove the check, but we should look for possible fallout.

[1] https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01451.html

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-11 19:46         ` Uros Bizjak
@ 2015-05-11 19:48           ` Jeff Law
  2015-05-11 20:16             ` Jan Hubicka
  0 siblings, 1 reply; 106+ messages in thread
From: Jeff Law @ 2015-05-11 19:48 UTC (permalink / raw)
  To: Uros Bizjak, Jan Hubicka; +Cc: Alexander Monakov, gcc-patches, Rich Felker

On 05/11/2015 01:46 PM, Uros Bizjak wrote:
> On Mon, May 11, 2015 at 8:00 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> On Sun, 10 May 2015, Jan Hubicka wrote:
>>>
>>>>> On i386, peepholes that transform memory load and register-indirect jump into
>>>>> memory-indirect jump are overly restrictive in that they don't allow combining
>>>>> when the jump target is loaded into %eax, and the called function returns a
>>>>> value (also in %eax, so it's not dead after the call).  Fix this by checking
>>>>> for same source and output register operands separately.
>>>>>
>>>>> OK?
>>>>>    * config/i386/i386.md (sibcall_value_memory): Extend peepholes to
>>>>>    allow memory address in %eax.
>>>>>    (sibcall_value_pop_memory): Likewise.
>>>>
>>>> Why do we need the check for liveness after all?  There is SIBLING_CALL_P
>>>> (peep2_next_insn (1)) so we know that the function terminates by the call
>>>> and there are no other uses of the value.
>>>
>>> Indeed.  Uros, the peep2_reg_dead_p check was added by your patch as svn
>>> revision 211776, git commit e51f8b8fed.  Would you agree that the check is not
>>> necessary for sibcalls as Honza explains?  Would you approve a patch that
>>> removes it in the sibcall peepholes I modify in the patch under discussion?
>>>
>>>> Don't we however need to check that operands[0] is not used by the call_insn as
>>>> parameter of the call?  I.e. something like
>>>>
>>>> void
>>>> test(void (*callback ()))
>>>> {
>>>>    callback(callback);
>>>> }
>>>
>>> You need a pointer-to-pointer-to-function to trigger the peephole.  Something
>>> like this:
>>>
>>>    void foo()
>>>    {
>>>      void (**bar)(void*);
>>>      asm("":"=r"(bar));
>>>      (*bar)(*bar);
>>>    }
>>>
>>>> I think instead of peep2_reg_dead_p we want to check that the parameter is not in
>>>> CALL_INSN_FUNCTION_USAGE of the sibcall..
>>>
>>> Playing with the above testcase I can't induce failure.  It seems today GCC
>>> won't allocate the same register as callee address and one of the arguments.
>>> Do you want me to implement such a check anyway?
>>
>> Hmm, only way I can trigger same register is:
>> void foo()
>>    {
>>      void (**bar)(void*);
>>      asm("":"=r"(bar));
>>      register void (*var)(void *) asm("%eax");
>>      var=*bar;
>>      asm("":"+r"(var));
>>      var(var);
>>    }
>>
>> removing the second asm makes CSE forward propagate the memory operand
>> to call that makes the call different from the register variable.
>>
>> Still I would check for that, but this is more Uros' area.
>
> This is from [1], and reading this reference, it looks to me that the
> check was introduced due to:
>
> - Adds check that eliminated register is really dead after the call
> (maybe an overkill, but some hard-to-debug problems surfaced due to
> missing liveness checks in the past)
>
> Going down that memory lane, it looks like a safety check for
> something that *might* happen. Looking at the comment, I'd say we can
> remove the check, but we should look for possible fallout.
I'd tend to agree.

jeff

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-11 19:48           ` Jeff Law
@ 2015-05-11 20:16             ` Jan Hubicka
  2015-05-13 19:05               ` Alexander Monakov
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Hubicka @ 2015-05-11 20:16 UTC (permalink / raw)
  To: Jeff Law
  Cc: Uros Bizjak, Jan Hubicka, Alexander Monakov, gcc-patches, Rich Felker

> On 05/11/2015 01:46 PM, Uros Bizjak wrote:
> >On Mon, May 11, 2015 at 8:00 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>>On Sun, 10 May 2015, Jan Hubicka wrote:
> >>>
> >>>>>On i386, peepholes that transform memory load and register-indirect jump into
> >>>>>memory-indirect jump are overly restrictive in that they don't allow combining
> >>>>>when the jump target is loaded into %eax, and the called function returns a
> >>>>>value (also in %eax, so it's not dead after the call).  Fix this by checking
> >>>>>for same source and output register operands separately.
> >>>>>
> >>>>>OK?
> >>>>>   * config/i386/i386.md (sibcall_value_memory): Extend peepholes to
> >>>>>   allow memory address in %eax.
> >>>>>   (sibcall_value_pop_memory): Likewise.
> >>>>
> >>>>Why do we need the check for liveness after all?  There is SIBLING_CALL_P
> >>>>(peep2_next_insn (1)) so we know that the function terminates by the call
> >>>>and there are no other uses of the value.
> >>>
> >>>Indeed.  Uros, the peep2_reg_dead_p check was added by your patch as svn
> >>>revision 211776, git commit e51f8b8fed.  Would you agree that the check is not
> >>>necessary for sibcalls as Honza explains?  Would you approve a patch that
> >>>removes it in the sibcall peepholes I modify in the patch under discussion?
> >>>
> >>>>Don't we however need to check that operands[0] is not used by the call_insn as
> >>>>parameter of the call?  I.e. something like
> >>>>
> >>>>void
> >>>>test(void (*callback ()))
> >>>>{
> >>>>   callback(callback);
> >>>>}
> >>>
> >>>You need a pointer-to-pointer-to-function to trigger the peephole.  Something
> >>>like this:
> >>>
> >>>   void foo()
> >>>   {
> >>>     void (**bar)(void*);
> >>>     asm("":"=r"(bar));
> >>>     (*bar)(*bar);
> >>>   }
> >>>
> >>>>I think instead of peep2_reg_dead_p we want to check that the parameter is not in
> >>>>CALL_INSN_FUNCTION_USAGE of the sibcall..
> >>>
> >>>Playing with the above testcase I can't induce failure.  It seems today GCC
> >>>won't allocate the same register as callee address and one of the arguments.
> >>>Do you want me to implement such a check anyway?
> >>
> >>Hmm, only way I can trigger same register is:
> >>void foo()
> >>   {
> >>     void (**bar)(void*);
> >>     asm("":"=r"(bar));
> >>     register void (*var)(void *) asm("%eax");
> >>     var=*bar;
> >>     asm("":"+r"(var));
> >>     var(var);
> >>   }
> >>
> >>removing the second asm makes CSE forward propagate the memory operand
> >>to call that makes the call different from the register variable.
> >>
> >>Still I would check for that, but this is more Uros' area.
> >
> >This is from [1], and reading this reference, it looks to me that the
> >check was introduced due to:
> >
> >- Adds check that eliminated register is really dead after the call
> >(maybe an overkill, but some hard-to-debug problems surfaced due to
> >missing liveness checks in the past)
> >
> >Going down that memory lane, it looks like a safety check for
> >something that *might* happen. Looking at the comment, I'd say we can
> >remove the check, but we should look for possible fallout.
> I'd tend to agree.

Yes, to make my original email clear, I think we are safe to remove
peep2_reg_dead_p.

I would however introduce a check that the call target is not also among
parameters of the function. In this case the peephole would remove the load
and make the parameter undefined.

While current mainline doesn't seem to be able to translate the testcase above
that way, perhaps future improvements to LRA/postreload gcse may make it happen
and generally RTL patterns are better to be safe by definition not
only for the actual RTL we are able to generate. I suppose reg_mentioned_p
on call usage is enough.
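
As a sketch of the condition I have in mind, here is a toy model in plain C.
It is not GCC internals: struct toy_call, usage_mask and can_fold_load_p are
hypothetical names standing in for the call insn, its
CALL_INSN_FUNCTION_USAGE and the peephole condition.

```c
#include <stdbool.h>

/* Toy stand-in for a sibcall through a register.  */
struct toy_call
{
  unsigned addr_reg;    /* hard register holding the callee address */
  unsigned usage_mask;  /* bit N set if hard register N carries an argument */
};

/* The load into ADDR_REG may be folded into the call only if no
   argument is passed in that same register; otherwise deleting the
   load would leave the argument undefined.  */
static bool
can_fold_load_p (const struct toy_call *c)
{
  return (c->usage_mask & (1u << c->addr_reg)) == 0;
}
```

The point is that the test looks only at what the call itself uses, not at
liveness after the call, which is irrelevant for a sibcall.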

Honza
> 
> jeff

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-10 16:59   ` Jan Hubicka
@ 2015-05-11 20:36     ` Jeff Law
  2015-05-11 20:55       ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Jeff Law @ 2015-05-11 20:36 UTC (permalink / raw)
  To: Jan Hubicka, Alexander Monakov; +Cc: gcc-patches, Rich Felker

On 05/10/2015 10:59 AM, Jan Hubicka wrote:
>> This patch introduces option -fno-plt that allows to expand calls that would
>> go via PLT to load the address of the function immediately at call site (which
>> introduces a GOT load).  Cover letter explains the motivation for this patch.
>>
>> New option documentation for invoke.texi is missing from the patch; if this is
>> accepted I'll be happy to send a v2 with documentation added.
>>
>> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
>> 	indirect call by forcing address into a pseudo with -fno-plt.
>> 	* common.opt (flag_plt): New option.
>>
>> diff --git a/gcc/common.opt b/gcc/common.opt
>> index b49ac46..cd8b256 100644
>> --- a/gcc/common.opt
>> +++ b/gcc/common.opt
>> @@ -1773,12 +1773,16 @@ Common Report Var(flag_pic,1) Negative(fpie)
>>   Generate position-independent code if possible (small mode)
>>
>>   fpie
>>   Common Report Var(flag_pie,1) Negative(fPIC)
>>   Generate position-independent code for executables if possible (small mode)
>>
>> +fplt
>> +Common Report Var(flag_plt) Init(1)
>> +Use PLT for PIC calls (-fno-plt: load the address from GOT at call site)
>> +
>
> This won't play well with LTO since fplt will become another global flag while
> it affects codegen.
I know Richi explained this to me in the past, but I can't remember the 
details of why this is bad.  Can you walk me through it again?

jeff

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-11 20:36     ` Jeff Law
@ 2015-05-11 20:55       ` H.J. Lu
  2015-05-11 22:13         ` Jan Hubicka
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-11 20:55 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Rich Felker

On Mon, May 11, 2015 at 1:36 PM, Jeff Law <law@redhat.com> wrote:
> On 05/10/2015 10:59 AM, Jan Hubicka wrote:
>>>
>>> This patch introduces option -fno-plt that allows to expand calls that
>>> would
>>> go via PLT to load the address of the function immediately at call site
>>> (which
>>> introduces a GOT load).  Cover letter explains the motivation for this
>>> patch.
>>>
>>> New option documentation for invoke.texi is missing from the patch; if
>>> this is
>>> accepted I'll be happy to send a v2 with documentation added.
>>>
>>>         * calls.c (prepare_call_address): Transform PLT call to GOT
>>> lookup and
>>>         indirect call by forcing address into a pseudo with -fno-plt.
>>>         * common.opt (flag_plt): New option.
>>>
>>> diff --git a/gcc/common.opt b/gcc/common.opt
>>> index b49ac46..cd8b256 100644
>>> --- a/gcc/common.opt
>>> +++ b/gcc/common.opt
>>> @@ -1773,12 +1773,16 @@ Common Report Var(flag_pic,1) Negative(fpie)
>>>   Generate position-independent code if possible (small mode)
>>>
>>>   fpie
>>>   Common Report Var(flag_pie,1) Negative(fPIC)
>>>   Generate position-independent code for executables if possible (small
>>> mode)
>>>
>>> +fplt
>>> +Common Report Var(flag_plt) Init(1)
>>> +Use PLT for PIC calls (-fno-plt: load the address from GOT at call site)
>>> +
>>
>>
>> This won't play well with LTO since fplt will become another global flag
>> while
>> it affects codegen.
>
> I know Richi explained this to me in the past, but I can't remember the
> details of why this is bad.  Can you walk me through it again?
>

I have proposed a different approach:

https://gcc.gnu.org/ml/gcc/2015-05/msg00086.html


-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-11 20:55       ` H.J. Lu
@ 2015-05-11 22:13         ` Jan Hubicka
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Hubicka @ 2015-05-11 22:13 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jeff Law, Jan Hubicka, Alexander Monakov, GCC Patches, Rich Felker

> >> This won't play well with LTO since fplt will become another global flag
> >> while
> >> it affects codegen.
> >
> > I know Richi explained this to me in the past, but I can't remember the
> > details of why this is bad.  Can you walk me through it again?
> >
> 
> I have proposed a different approach:
> 
> https://gcc.gnu.org/ml/gcc/2015-05/msg00086.html

The RELAX_PC* approach looks indeed interesting (I still need to catch up
with the thread), but to answer Jeff's question.
With LTO we need to handle stuff like
gcc a.c -fplt -flto -c -O2
gcc b.c -fno-plt -flto -c -Os
gcc a.o b.o -flto

and generally we would like to mimic as closely as possible what happens with
non-LTO builds. That is, functions originating from a.c should be -O2 optimized
with PLT and functions from b.c should be size optimized w/o PLT (which makes
cross-module inlining fun).
To do so we now attach an implicit optimization/target node to each function that
stores the flags used to build the unit.  Optimization nodes contain only
those flags that are defined as Optimization.

So in general if we have a flag that is about function codegen and we are able
to produce functions with different values of the flag in one unit, we really
want to mark it as Optimization (and decide what we want to do about inlining
across the flag boundary). Not all flags work like this; for example, -fPIC is
a global flag, and then there is Richi's code in lto-wrapper that reads those
options from all .o files first and somehow chooses the prevailing one for the
whole program.

In the longer term we want to eliminate as many of those global flags as possible
(for example -m32 can stay global as you cannot mix it with -m64) and also
to explicitly represent some of the flags in IL, so inlining across boundaries
works as expected.
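
The closest source-level analogue of a per-function optimization node is the
optimize function attribute, which already lets a single unit hold functions
built with different settings.  A minimal sketch, with made-up function names
(compilers other than GCC may simply ignore the attribute):

```c
/* Two functions in one TU carrying different optimization settings,
   similar to what LTO sees after merging a.o and b.o.  */
__attribute__ ((optimize ("O2")))
int
sum_fast (const int *v, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += v[i];
  return s;
}

__attribute__ ((optimize ("Os")))
int
sum_small (const int *v, int n)
{
  int s = 0;
  while (n-- > 0)
    s += v[n];
  return s;
}
```

Both functions compute the same result; only the settings recorded on each
function differ, which is exactly what has to be decided on when one is
inlined into the other.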

Honza
> 
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-11 20:16             ` Jan Hubicka
@ 2015-05-13 19:05               ` Alexander Monakov
  2015-05-13 20:04                 ` Jan Hubicka
  0 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-13 19:05 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Jeff Law, Uros Bizjak, gcc-patches, Rich Felker

On Mon, 11 May 2015, Jan Hubicka wrote:
> Yes, to make my original email clear, I think we are safe to remove
> peep2_reg_dead_p.
> 
> I would however introduce a check that the call target is not also among
> parameters of the function. In this case the peephole would remove the load
> and make the parameter undefined.
> 
> While current mainline doesn't seem to be able to translate the testcase above
> that way, perhaps future improvements to LRA/postreload gcse may make it happen
> and generally RTL patterns are better to be safe by definition not
> only for the actual RTL we are able to generate. I suppose reg_mentioned_p
> on call usage is enough.
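
The condition discussed above can be modelled with a toy sketch.  This is not
GCC code: the register numbering, `struct toy_call` and both functions are
hypothetical stand-ins for reg_mentioned_p applied to
CALL_INSN_FUNCTION_USAGE, assuming the argument registers can be summarized as
a bitmask:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical register numbering, for this sketch only. */
enum toy_reg { TOY_AX, TOY_CX, TOY_DX, TOY_BX, TOY_REG_COUNT };

/* Toy stand-in for the call's usage list: one bit per argument register. */
struct toy_call {
  unsigned arg_regs;
};

/* Toy stand-in for reg_mentioned_p on the call usage. */
bool toy_reg_mentioned(enum toy_reg r, const struct toy_call *call)
{
  assert(r < TOY_REG_COUNT);
  return ((call->arg_regs >> r) & 1u) != 0;
}

/* The peephole deletes the load into addr_reg and calls through memory
   instead.  That is only safe if no argument is passed in addr_reg.  */
bool toy_peephole_ok(enum toy_reg addr_reg, const struct toy_call *call)
{
  return !toy_reg_mentioned(addr_reg, call);
}
```

With an argument passed in the same register that holds the callee address,
the toy predicate rejects the transformation, which is the case the new
condition is meant to guard.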

Thanks.  I have bootstrapped and regtested the following patch.  OK?

	* config/i386/i386.md (sibcall_memory): Check that register with
	callee address is not also used as one of the arguments, instead
	of checking that it is not live after the sibcall.
	(sibcall_pop_memory): Ditto.
	(sibcall_value_memory): Ditto.
	(sibcall_value_pop_memory): Ditto.
testsuite:
	* gcc.target/i386/sibcall-7.c: New test.

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 0959aef..9c1aa7d 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -11673,7 +11673,8 @@
    (call (mem:QI (match_dup 0))
         (match_operand 3))]
   "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (1))
-   && peep2_reg_dead_p (2, operands[0])"
+   && !reg_mentioned_p (operands[0],
+                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (1)))"
   [(parallel [(call (mem:QI (match_dup 1))
                    (match_dup 3))
              (unspec [(const_int 0)] UNSPEC_PEEPSIB)])])
@@ -11685,7 +11686,8 @@
    (call (mem:QI (match_dup 0))
         (match_operand 3))]
   "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (2))
-   && peep2_reg_dead_p (3, operands[0])"
+   && !reg_mentioned_p (operands[0],
+                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (2)))"
   [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
    (parallel [(call (mem:QI (match_dup 1))
                    (match_dup 3))
@@ -11744,7 +11746,8 @@
                   (plus:SI (reg:SI SP_REG)
                            (match_operand:SI 4 "immediate_operand")))])]
   "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (1))
-   && peep2_reg_dead_p (2, operands[0])"
+   && !reg_mentioned_p (operands[0],
+                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (1)))"
   [(parallel [(call (mem:QI (match_dup 1))
                    (match_dup 3))
              (set (reg:SI SP_REG)
@@ -11762,7 +11765,8 @@
                   (plus:SI (reg:SI SP_REG)
                            (match_operand:SI 4 "immediate_operand")))])]
   "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (2))
-   && peep2_reg_dead_p (3, operands[0])"
+   && !reg_mentioned_p (operands[0],
+                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (2)))"
   [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
    (parallel [(call (mem:QI (match_dup 1))
                    (match_dup 3))
@@ -11838,7 +11842,8 @@
    (call (mem:QI (match_dup 0))
                 (match_operand 3)))]
   "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (1))
-   && peep2_reg_dead_p (2, operands[0])"
+   && !reg_mentioned_p (operands[0],
+                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (1)))"
   [(parallel [(set (match_dup 2)
                   (call (mem:QI (match_dup 1))
                         (match_dup 3)))
@@ -11852,7 +11857,8 @@
       (call (mem:QI (match_dup 0))
              (match_operand 3)))]
   "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (2))
-   && peep2_reg_dead_p (3, operands[0])"
+   && !reg_mentioned_p (operands[0],
+                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (2)))"
   [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
    (parallel [(set (match_dup 2)
                   (call (mem:QI (match_dup 1))
@@ -11917,7 +11923,8 @@
                   (plus:SI (reg:SI SP_REG)
                            (match_operand:SI 4 "immediate_operand")))])]
   "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (1))
-   && peep2_reg_dead_p (2, operands[0])"
+   && !reg_mentioned_p (operands[0],
+                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (1)))"
   [(parallel [(set (match_dup 2)
                   (call (mem:QI (match_dup 1))
                         (match_dup 3)))
@@ -11937,7 +11944,8 @@
                   (plus:SI (reg:SI SP_REG)
                            (match_operand:SI 4 "immediate_operand")))])]
   "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (2))
-   && peep2_reg_dead_p (3, operands[0])"
+   && !reg_mentioned_p (operands[0],
+                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (2)))"
   [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
    (parallel [(set (match_dup 2)
                   (call (mem:QI (match_dup 1))
diff --git a/gcc/testsuite/gcc.target/i386/sibcall-7.c b/gcc/testsuite/gcc.target/i386/sibcall-7.c
index e69de29..72fdaff 100644
--- a/gcc/testsuite/gcc.target/i386/sibcall-7.c
+++ b/gcc/testsuite/gcc.target/i386/sibcall-7.c
@@ -0,0 +1,11 @@
+/* { dg-do compile { target { { ! x32 } } } } */
+/* { dg-options "-O2" } */
+
+int foo()
+{
+  int (**bar)(void);
+  asm("":"=a"(bar));
+  return (*bar)();
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-13 19:05               ` Alexander Monakov
@ 2015-05-13 20:04                 ` Jan Hubicka
  2015-05-14 17:36                   ` Alexander Monakov
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Hubicka @ 2015-05-13 20:04 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Jan Hubicka, Jeff Law, Uros Bizjak, gcc-patches, Rich Felker

> On Mon, 11 May 2015, Jan Hubicka wrote:
> > Yes, to make my original email clear, I think we are safe to remove
> > peep2_reg_dead_p.
> > 
> > I would however introduce a check that the call target is not also among
> > parameters of the function. In this case the peephole would remove the load
> > and make the parameter undefined.
> > 
> > While current mainline doesn't seem to be able to translate the testcase above
> > that way, perhaps future improvements to LRA/postreload gcse may make it happen
> > and generally RTL patterns are better to be safe by definition not
> > only for the actual RTL we are able to generate. I suppose reg_mentioned_p
> > on call usage is enough.
> 
> Thanks.  I have bootstrapped and regtested the following patch.  OK?
> 
> 	* config/i386/i386.md (sibcall_memory): Check that register with
> 	callee address is not also used as one of the arguments, instead
> 	of checking that it is not live after the sibcall.
> 	(sibcall_pop_memory): Ditto.
> 	(sibcall_value_memory): Ditto.
> 	(sibcall_value_pop_memory): Ditto.
> testsuite:
> 	* gcc.target/i386/sibcall-7.c: New test.

Thank you! This looks fine.  Please also add a testcase that would break if
the new check were wrong and someone fixed postreload to allow use of the same
register, so that this check is shown to prevent wrong code.

OK with that change.
Honza
> 
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 0959aef..9c1aa7d 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -11673,7 +11673,8 @@
>     (call (mem:QI (match_dup 0))
>          (match_operand 3))]
>    "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (1))
> -   && peep2_reg_dead_p (2, operands[0])"
> +   && !reg_mentioned_p (operands[0],
> +                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (1)))"
>    [(parallel [(call (mem:QI (match_dup 1))
>                     (match_dup 3))
>               (unspec [(const_int 0)] UNSPEC_PEEPSIB)])])
> @@ -11685,7 +11686,8 @@
>     (call (mem:QI (match_dup 0))
>          (match_operand 3))]
>    "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (2))
> -   && peep2_reg_dead_p (3, operands[0])"
> +   && !reg_mentioned_p (operands[0],
> +                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (2)))"
>    [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
>     (parallel [(call (mem:QI (match_dup 1))
>                     (match_dup 3))
> @@ -11744,7 +11746,8 @@
>                    (plus:SI (reg:SI SP_REG)
>                             (match_operand:SI 4 "immediate_operand")))])]
>    "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (1))
> -   && peep2_reg_dead_p (2, operands[0])"
> +   && !reg_mentioned_p (operands[0],
> +                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (1)))"
>    [(parallel [(call (mem:QI (match_dup 1))
>                     (match_dup 3))
>               (set (reg:SI SP_REG)
> @@ -11762,7 +11765,8 @@
>                    (plus:SI (reg:SI SP_REG)
>                             (match_operand:SI 4 "immediate_operand")))])]
>    "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (2))
> -   && peep2_reg_dead_p (3, operands[0])"
> +   && !reg_mentioned_p (operands[0],
> +                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (2)))"
>    [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
>     (parallel [(call (mem:QI (match_dup 1))
>                     (match_dup 3))
> @@ -11838,7 +11842,8 @@
>     (call (mem:QI (match_dup 0))
>                  (match_operand 3)))]
>    "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (1))
> -   && peep2_reg_dead_p (2, operands[0])"
> +   && !reg_mentioned_p (operands[0],
> +                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (1)))"
>    [(parallel [(set (match_dup 2)
>                    (call (mem:QI (match_dup 1))
>                          (match_dup 3)))
> @@ -11852,7 +11857,8 @@
>        (call (mem:QI (match_dup 0))
>               (match_operand 3)))]
>    "!TARGET_X32 && SIBLING_CALL_P (peep2_next_insn (2))
> -   && peep2_reg_dead_p (3, operands[0])"
> +   && !reg_mentioned_p (operands[0],
> +                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (2)))"
>    [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
>     (parallel [(set (match_dup 2)
>                    (call (mem:QI (match_dup 1))
> @@ -11917,7 +11923,8 @@
>                    (plus:SI (reg:SI SP_REG)
>                             (match_operand:SI 4 "immediate_operand")))])]
>    "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (1))
> -   && peep2_reg_dead_p (2, operands[0])"
> +   && !reg_mentioned_p (operands[0],
> +                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (1)))"
>    [(parallel [(set (match_dup 2)
>                    (call (mem:QI (match_dup 1))
>                          (match_dup 3)))
> @@ -11937,7 +11944,8 @@
>                    (plus:SI (reg:SI SP_REG)
>                             (match_operand:SI 4 "immediate_operand")))])]
>    "!TARGET_64BIT && SIBLING_CALL_P (peep2_next_insn (2))
> -   && peep2_reg_dead_p (3, operands[0])"
> +   && !reg_mentioned_p (operands[0],
> +                       CALL_INSN_FUNCTION_USAGE (peep2_next_insn (2)))"
>    [(unspec_volatile [(const_int 0)] UNSPECV_BLOCKAGE)
>     (parallel [(set (match_dup 2)
>                    (call (mem:QI (match_dup 1))
> diff --git a/gcc/testsuite/gcc.target/i386/sibcall-7.c b/gcc/testsuite/gcc.target/i386/sibcall-7.c
> index e69de29..72fdaff 100644
> --- a/gcc/testsuite/gcc.target/i386/sibcall-7.c
> +++ b/gcc/testsuite/gcc.target/i386/sibcall-7.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { { ! x32 } } } } */
> +/* { dg-options "-O2" } */
> +
> +int foo()
> +{
> +  int (**bar)(void);
> +  asm("":"=a"(bar));
> +  return (*bar)();
> +}
> +
> +/* { dg-final { scan-assembler-not "mov" } } */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Extend sibcall peepholes to allow source in %eax
  2015-05-13 20:04                 ` Jan Hubicka
@ 2015-05-14 17:36                   ` Alexander Monakov
  0 siblings, 0 replies; 106+ messages in thread
From: Alexander Monakov @ 2015-05-14 17:36 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Jeff Law, Uros Bizjak, gcc-patches, Rich Felker

On Wed, 13 May 2015, Jan Hubicka wrote:

> Thank you! This looks fine.  Please also add a testcase that would break if
> the new check were wrong and someone fixed postreload to allow use of the same
> register, so that this check is shown to prevent wrong code.

I'm checking in a patch with the following additional test.

diff --git a/gcc/testsuite/gcc.target/i386/sibcall-8.c b/gcc/testsuite/gcc.target/i386/sibcall-8.c
index e69de29..3ab3809 100644
--- a/gcc/testsuite/gcc.target/i386/sibcall-8.c
+++ b/gcc/testsuite/gcc.target/i386/sibcall-8.c
@@ -0,0 +1,24 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+extern void abort (void);
+
+static int __attribute__((regparm(1)))
+bar(void *arg)
+{
+  return arg != bar;
+}
+
+static int __attribute__((noinline,noclone,regparm(1)))
+foo(int (__attribute__((regparm(1))) **bar)(void*))
+{
+  return (*bar)(*bar);
+}
+
+int main()
+{
+  int (__attribute__((regparm(1))) *p)(void*) = bar;
+  if (foo(&p))
+    abort();
+  return 0;
+}

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-04 16:38 ` [PATCH i386] Allow sibcalls in no-PLT PIC Alexander Monakov
@ 2015-05-15 16:37   ` Alexander Monakov
  2015-05-15 16:48     ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-15 16:37 UTC (permalink / raw)
  To: gcc-patches; +Cc: Rich Felker, Jan Hubicka, Uros Bizjak

Ping?  Any comment about this patch?

On Mon, 4 May 2015, Alexander Monakov wrote:

> With -fno-plt, we don't have to reject even direct calls as sibcall
> candidates.
> 
> This patch depends on '-fplt' flag that is introduced in another patch.
> 
> This patch requires that with -fno-plt all sibcall candidates go through
> prepare_call_address that transforms the call to a GOT lookup.
> 
> OK?
> 	* config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.
> 
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index f29e053..b734350 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
>    /* If we are generating position-independent code, we cannot sibcall
>       optimize any indirect call, or a direct call to a global function,
>       as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
>    if (!TARGET_MACHO
>        && !TARGET_64BIT
>        && flag_pic
> +      && flag_plt
>        && (decl && !targetm.binds_local_p (decl)))
>      return false;
>  
>    /* If we need to align the outgoing stack, then sibcalling would
>       unalign the stack, which may break the called function.  */
>    if (ix86_minimum_incoming_stack_boundary (true)
> 
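
For reference, a minimal sketch of the kind of call the quoted patch is about:
a direct call in tail position to a function with default visibility.  The
names are hypothetical, and the flags in the comment are an assumption about
how one would try it, not verified output:

```c
#include <assert.h>

/* Stand-in for a global function that could live in another DSO.
   In a real test it would only be declared extern here.  */
__attribute__((noinline)) int callee(int x)
{
  return x + 1;
}

/* The call is the last thing caller does, so it is a sibcall candidate.
   Assumed build: gcc -m32 -O2 -fPIC -fno-plt.  With the PLT in use,
   %ebx would have to stay live across the jump, which is why the
   existing check rejects it.  */
int caller(int x)
{
  return callee(x * 2);
}
```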

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 16:37   ` Alexander Monakov
@ 2015-05-15 16:48     ` H.J. Lu
  2015-05-15 20:08       ` Jan Hubicka
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-15 16:48 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: GCC Patches, Rich Felker, Jan Hubicka, Uros Bizjak

On Fri, May 15, 2015 at 9:27 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
> Ping?  Any comment about this patch?
>
> On Mon, 4 May 2015, Alexander Monakov wrote:
>
>> With -fno-plt, we don't have to reject even direct calls as sibcall
>> candidates.
>>
>> This patch depends on '-fplt' flag that is introduced in another patch.
>>
>> This patch requires that with -fno-plt all sibcall candidates go through
>> prepare_call_address that transforms the call to a GOT lookup.
>>
>> OK?
>>       * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.
>>
>> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
>> index f29e053..b734350 100644
>> --- a/gcc/config/i386/i386.c
>> +++ b/gcc/config/i386/i386.c
>> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
>>    /* If we are generating position-independent code, we cannot sibcall
>>       optimize any indirect call, or a direct call to a global function,
>>       as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
>>    if (!TARGET_MACHO
>>        && !TARGET_64BIT
>>        && flag_pic
>> +      && flag_plt
>>        && (decl && !targetm.binds_local_p (decl)))
>>      return false;
>>
>>    /* If we need to align the outgoing stack, then sibcalling would
>>       unalign the stack, which may break the called function.  */
>>    if (ix86_minimum_incoming_stack_boundary (true)
>>

I think it should be done via psABI change similar to

https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI

which I have implemented on users/hjl/relax branch in binutils.

-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 16:48     ` H.J. Lu
@ 2015-05-15 20:08       ` Jan Hubicka
  2015-05-15 20:23         ` H.J. Lu
  2015-05-18 18:25         ` Alexander Monakov
  0 siblings, 2 replies; 106+ messages in thread
From: Jan Hubicka @ 2015-05-15 20:08 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Alexander Monakov, GCC Patches, Rich Felker, Jan Hubicka, Uros Bizjak

> On Fri, May 15, 2015 at 9:27 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
> > Ping?  Any comment about this patch?
> >
> > On Mon, 4 May 2015, Alexander Monakov wrote:
> >
> >> With -fno-plt, we don't have to reject even direct calls as sibcall
> >> candidates.
> >>
> >> This patch depends on '-fplt' flag that is introduced in another patch.
> >>
> >> This patch requires that with -fno-plt all sibcall candidates go through
> >> prepare_call_address that transforms the call to a GOT lookup.
> >>
> >> OK?
> >>       * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.
> >>
> >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> >> index f29e053..b734350 100644
> >> --- a/gcc/config/i386/i386.c
> >> +++ b/gcc/config/i386/i386.c
> >> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
> >>    /* If we are generating position-independent code, we cannot sibcall
> >>       optimize any indirect call, or a direct call to a global function,
> >>       as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
> >>    if (!TARGET_MACHO
> >>        && !TARGET_64BIT
> >>        && flag_pic
> >> +      && flag_plt
> >>        && (decl && !targetm.binds_local_p (decl)))
> >>      return false;
> >>
> >>    /* If we need to align the outgoing stack, then sibcalling would
> >>       unalign the stack, which may break the called function.  */
> >>    if (ix86_minimum_incoming_stack_boundary (true)
> >>
> 
> I think it should be done via psABI change similar to
> 
> https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI
> 
> which I have implemented on users/hjl/relax branch in binutils.

OK, I am trying to understand how the relax branch works and what difference it makes.
As I understand it, the main purpose is to be able to make a relaxed call of

   call function

that will, in 64bit mode, result either in a RIP relative call with an extra NOP just
before the instruction if FUNCTION binds within the DSO, or in an indirect call through
the GOT bypassing the PLT.  This saves the overhead of the PLT and increases every such
call by an extra NOP for no-LTO builds and even in LTO when the symbol is defined but
interposable.  This is actually a really nice trick.

Now this is about 32bit mode where an explicit GOT pointer register is needed
(how does this work with the large code model on x86-64?). It is needed by the PLT, but
I suppose that to implement the same relaxation for 32bit it would need to use EBX to
look up the GOT pointer, too, so the check above would still be valid.

The patches make sense to me given that we support -fno-plt now.

Honza
> 
> -- 
> H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 20:08       ` Jan Hubicka
@ 2015-05-15 20:23         ` H.J. Lu
  2015-05-15 20:35           ` Rich Felker
  2015-05-18 18:25         ` Alexander Monakov
  1 sibling, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-15 20:23 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Alexander Monakov, GCC Patches, Rich Felker, Uros Bizjak

On Fri, May 15, 2015 at 12:48 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Fri, May 15, 2015 at 9:27 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
>> > Ping?  Any comment about this patch?
>> >
>> > On Mon, 4 May 2015, Alexander Monakov wrote:
>> >
>> >> With -fno-plt, we don't have to reject even direct calls as sibcall
>> >> candidates.
>> >>
>> >> This patch depends on '-fplt' flag that is introduced in another patch.
>> >>
>> >> This patch requires that with -fno-plt all sibcall candidates go through
>> >> prepare_call_address that transforms the call to a GOT lookup.
>> >>
>> >> OK?
>> >>       * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.
>> >>
>> >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
>> >> index f29e053..b734350 100644
>> >> --- a/gcc/config/i386/i386.c
>> >> +++ b/gcc/config/i386/i386.c
>> >> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
>> >>    /* If we are generating position-independent code, we cannot sibcall
>> >>       optimize any indirect call, or a direct call to a global function,
>> >>       as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
>> >>    if (!TARGET_MACHO
>> >>        && !TARGET_64BIT
>> >>        && flag_pic
>> >> +      && flag_plt
>> >>        && (decl && !targetm.binds_local_p (decl)))
>> >>      return false;
>> >>
>> >>    /* If we need to align the outgoing stack, then sibcalling would
>> >>       unalign the stack, which may break the called function.  */
>> >>    if (ix86_minimum_incoming_stack_boundary (true)
>> >>
>>
>> I think it should be done via psABI change similar to
>>
>> https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI
>>
>> which I have implemented on users/hjl/relax branch in binutils.
>
> OK, I am trying to understand how the relax branch works and what difference it makes.
> As I understand it, the main purpose is to be able to make a relaxed call of
>
>    call function
>
> that will, in 64bit mode, result either in a RIP relative call with an extra NOP just
> before the instruction if FUNCTION binds within the DSO, or in an indirect call through
> the GOT bypassing the PLT.  This saves the overhead of the PLT and increases every such
> call by an extra NOP for no-LTO builds and even in LTO when the symbol is defined but
> interposable.  This is actually a really nice trick.
>
> Now this is about 32bit mode where an explicit GOT pointer register is needed
> (how does this work with the large code model on x86-64?). It is needed by the PLT, but
> I suppose that to implement the same relaxation for 32bit it would need to use EBX to
> look up the GOT pointer, too, so the check above would still be valid.
>

With relax branch in 32-bit, there are 2 cases:

1. PIC or PIE:  We generate

set up EBX
relax call foo@PLT

It is almost the same as we do now, except for the relax prefix.
If foo is defined in another shared library or may be preempted,
linker will generate

call *foo@GOTPLT(%ebx)

If foo turns out local, linker will output

relax call foo

2. Non PIC/PIE: We generate

relax call foo

If foo is defined in a DSO,  linker will generate

call/jmp *foo@GOTPLT

We don't set up EBX in this case.  If foo turns out local, linker will output

relax call foo
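
The two cases above amount to a small decision table.  A sketch of it in C,
where the enum and function names are hypothetical and only the outcomes are
taken from the description above:

```c
#include <assert.h>
#include <stdbool.h>

/* What the linker is described as emitting for "relax call foo". */
enum call_form {
  CALL_DIRECT,      /* relax call foo            (foo turns out local) */
  CALL_GOTPLT_EBX,  /* call *foo@GOTPLT(%ebx)    (PIC/PIE, not local)  */
  CALL_GOTPLT_ABS   /* call/jmp *foo@GOTPLT      (non-PIC, in a DSO)   */
};

/* pic: the object was built as PIC or PIE, so EBX was set up.
   binds_locally: the linker found that foo resolves locally.  */
enum call_form relaxed_call_form(bool pic, bool binds_locally)
{
  if (binds_locally)
    return CALL_DIRECT;
  return pic ? CALL_GOTPLT_EBX : CALL_GOTPLT_ABS;
}
```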

-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 20:23         ` H.J. Lu
@ 2015-05-15 20:35           ` Rich Felker
  2015-05-15 20:37             ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-15 20:35 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote:
> With relax branch in 32-bit, there are 2 cases:
> 
> 1. PIC or PIE:  We generate
> 
> set up EBX
> relax call foo@PLT
> 
> It is almost the same as we do now, except for the relax prefix.
> If foo is defined in another shared library or may be preempted,
> linker will generate
> 
> call *foo@GOTPLT(%ebx)
> 
> If foo turns out local, linker will output
> 
> relax call foo

This does not address the initial and primary motivation for no-plt on
32-bit: eliminating the awful codegen constraint costs of the
GOT-register (ebx, and equivalent on other targets) ABI for calling
PLT entries. If instead you generated code that sets up an expression
for the GOT slot using arbitrary registers, and relaxed it to a direct
call (possibly rendering the register setup useless), it would be
comparable to the no-plt approach. So for example:

set up ecx (or whatever register)
relax call *foo@GOT(%ecx)

and relax to:

set up ecx (or whatever register; now useless)
relax call foo

But the no-plt approach is still superior in that the address load
from the GOT can be hoisted out of loops, etc., resulting in something
like:

call *%esi

This could be valuable in loops calling a math function repeatedly,
for example.
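
A hand-written sketch of what that hoisting amounts to, assuming the callee
would normally live in another shared library.  `half` and `sum_halves` are
hypothetical names, and `half` is defined locally only to keep the example
self-contained:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an external math function. */
__attribute__((noinline)) double half(double x)
{
  return x * 0.5;
}

double sum_halves(const double *v, size_t n)
{
  /* The callee address is obtained once, before the loop.  With no-plt
     the compiler can do this itself by loading the GOT slot up front.  */
  double (*fn)(double) = half;
  double acc = 0.0;

  assert(v != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    acc += fn(v[i]);  /* each iteration is a plain indirect call */
  return acc;
}
```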

Overall I'm still not a fan of the relaxation approach. There are very
few places it would actually help that couldn't already be improved
better with use of visibility, and it can't give codegen as good as
no-plt option.

Rich

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 20:35           ` Rich Felker
@ 2015-05-15 20:37             ` H.J. Lu
  2015-05-15 20:45               ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-15 20:37 UTC (permalink / raw)
  To: Rich Felker; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 1:23 PM, Rich Felker <dalias@libc.org> wrote:
> On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote:
>> With relax branch in 32-bit, there are 2 cases:
>>
>> 1. PIC or PIE:  We generate
>>
>> set up EBX
>> relax call foo@PLT
>>
>> It is almost the same as we do now, except for the relax prefix.
>> If foo is defined in another shared library or may be preempted,
>> linker will generate
>>
>> call *foo@GOTPLT(%ebx)
>>
>> If foo turns out local, linker will output
>>
>> relax call foo
>
> This does not address the initial and primary motivation for no-plt on
> 32-bit: eliminating the awful codegen constraint costs of the
> GOT-register (ebx, and equivalent on other targets) ABI for calling
> PLT entries. If instead you generated code that sets up an expression
> for the GOT slot using arbitrary registers, and relaxed it to a direct
> call (possibly rendering the register setup useless), it would be
> comparable to the no-plt approach. So for example:
>
> set up ecx (or whatever register)
> relax call *foo@GOT(%ecx)
>
> and relax to:
>
> set up ecx (or whatever register; now useless)
> relax call foo
>
> But the no-plt approach is still superior in that the address load
> from the GOT can be hoisted out of loops, etc., resulting in something
> like:
>
> call *%esi
>
> This could be valuable in loops calling a math function repeatedly,
> for example.
>
> Overall I'm still not a fan of the relaxation approach. There are very
> few places it would actually help that couldn't already be improved
> better with use of visibility, and it can't give codegen as good as
> no-plt option.

With no-plt option, compiler has to know if a function is external
or may be preempted.  If compiler guessed wrong, the generated
DSO or executable will always go through indirect branch even
though the target is local.  With relax branch, the decision is left
to linker.  Of course, EBX must be used unless we add a new PLT
relocation for each register used to hold GOT base, like

relax call foo@PLT_ECX
relax call foo@PLT_EDX
...


-- 
H.J.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 20:37             ` H.J. Lu
@ 2015-05-15 20:45               ` Rich Felker
  2015-05-15 22:16                 ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-15 20:45 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 01:35:14PM -0700, H.J. Lu wrote:
> On Fri, May 15, 2015 at 1:23 PM, Rich Felker <dalias@libc.org> wrote:
> > On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote:
> >> With relax branch in 32-bit, there are 2 cases:
> >>
> >> 1. PIC or PIE:  We generate
> >>
> >> set up EBX
> >> relax call foo@PLT
> >>
> >> It is almost the same as we do now, except for the relax prefix.
> >> If foo is defined in another shared library or may be preempted,
> >> linker will generate
> >>
> >> call *foo@GOTPLT(%ebx)
> >>
> >> If foo turns out local, linker will output
> >>
> >> relax call foo
> >
> > This does not address the initial and primary motivation for no-plt on
> > 32-bit: eliminating the awful codegen constraint costs of the
> > GOT-register (ebx, and equivalent on other targets) ABI for calling
> > PLT entries. If instead you generated code that sets up an expression
> > for the GOT slot using arbitrary registers, and relaxed it to a direct
> > call (possibly rendering the register setup useless), it would be
> > comparable to the no-plt approach. So for example:
> >
> > set up ecx (or whatever register)
> > relax call *foo@GOT(%ecx)
> >
> > and relax to:
> >
> > set up ecx (or whatever register; now useless)
> > relax call foo
> >
> > But the no-plt approach is still superior in that the address load
> > from the GOT can be hoisted out of loops, etc., resulting in something
> > like:
> >
> > call *%esi
> >
> > This could be valuable in loops calling a math function repeatedly,
> > for example.
> >
> > Overall I'm still not a fan of the relaxation approach. There are very
> > few places it would actually help that couldn't already be improved
> > better with use of visibility, and it can't give codegen as good as
> > no-plt option.
> 
> With no-plt option, compiler has to know if a function is external
> or may be preempted.

I still don't see significant practical cases where the linker would
know this but the compiler can't. If you use visibility properly, the
compiler knows, and if you do LTO and -Bsymbolic[-functions], the
compiler should have that information available at LTO time (this is
an enhancement that needs to be made, though).
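
A minimal sketch of the visibility case, assuming GCC's `visibility` function
attribute on an ELF target; the function names are hypothetical:

```c
#include <assert.h>

/* Hidden visibility: the symbol cannot be interposed from outside the
   DSO, so the compiler already knows the call below binds locally.  */
__attribute__((visibility("hidden"))) int helper(int x)
{
  return x * 3;
}

/* Default visibility: exported, and interposable unless something like
   -Bsymbolic[-functions] is used at link time.  */
int api_entry(int x)
{
  return helper(x) + 1;
}
```

Here the call to helper needs neither the PLT nor a GOT lookup, and no linker
relaxation is involved in getting there.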

> If compiler guessed wrong, the generated
> DSO or executable will always go through indirect branch even
> though the target is local.

The only way this is avoided now is with -Bsymbolic[-functions] which
is not widely used. Otherwise interposition is always allowed for
default-visibility functions, so I don't see how the indirect branch
here is suboptimal.

> With relax branch, the decision is left
> to linker.  Of course, EBX must be used unless we add a new PLT
> relocation for each register used to hold GOT base, like
> 
> relax call foo@PLT_ECX
> relax call foo@PLT_EDX

No, that's not needed. If the linker doesn't make the relaxation, the
instruction the compiler generated remains in place, and has the
effective address expression using whichever register it wanted:

relax call *foo@GOT(%ecx)
relax call *foo@GOT(%edx)
etc.

If the linker chooses to relax it to a direct call, no register at all
is needed, so the linker can just throw this away and use:

call foo

for all of them.

Rich

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 20:45               ` Rich Felker
@ 2015-05-15 22:16                 ` H.J. Lu
  2015-05-15 23:14                   ` Jan Hubicka
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-15 22:16 UTC (permalink / raw)
  To: Rich Felker; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 1:42 PM, Rich Felker <dalias@libc.org> wrote:
> On Fri, May 15, 2015 at 01:35:14PM -0700, H.J. Lu wrote:
>> On Fri, May 15, 2015 at 1:23 PM, Rich Felker <dalias@libc.org> wrote:
>> > On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote:
>> >> With relax branch in 32-bit, there are 2 cases:
>> >>
>> >> 1. PIC or PIE:  We generate
>> >>
>> >> set up EBX
>> >> relax call foo@PLT
>> >>
>> >> It is almost the same as we do now, except for the relax prefix.
>> >> If foo is defined in another shared library or may be preempted,
>> >> linker will generate
>> >>
>> >> call *foo@GOTPLT(%ebx)
>> >>
>> >> If foo turns out local, linker will output
>> >>
>> >> relax call foo
>> >
>> > This does not address the initial and primary motivation for no-plt on
>> > 32-bit: eliminating the awful codegen constraint costs of the
>> > GOT-register (ebx, and equivalent on other targets) ABI for calling
>> > PLT entries. If instead you generated code that sets up an expression
>> > for the GOT slot using arbitrary registers, and relaxed it to a direct
>> > call (possibly rendering the register setup useless), it would be
>> > comparable to the no-plt approach. So for example:
>> >
>> > set up ecx (or whatever register)
>> > relax call *foo@GOT(%ecx)
>> >
>> > and relax to:
>> >
>> > set up ecx (or whatever register; now useless)
>> > relax call foo
>> >
>> > But the no-plt approach is still superior in that the address load
>> > from the GOT can be hoisted out of loops, etc., resulting in something
>> > like:
>> >
>> > call *%esi
>> >
>> > This could be valuable in loops calling a math function repeatedly,
>> > for example.
>> >
>> > Overall I'm still not a fan of the relaxation approach. There are very
>> > few places it would actually help that couldn't already be improved
>> > better with use of visibility, and it can't give codegen as good as
>> > no-plt option.
>>
>> With no-plt option, compiler has to know if a function is external
>> or may be preempted.
>
> I still don't see significant practical cases where the linker would
> know this but the compiler can't. If you use visibility properly, the
> compiler knows, and if you do LTO and -Bsymbolic[-functions], the
> compiler should have that information available at LTO time (this is
> an enhancement that needs to be made, though).

There are codes like

extern void foo (void);

void
bar (void)
{
  foo ();
}

Even with LTO, compiler may have to assume foo is external
when foo is compiled with LTO.

>> If compiler guessed wrong, the generated
>> DSO or executable will always go through indirect branch even
>> though the target is local.
>
> The only way this is avoided now is with -Bsymbolic[-functions] which
> is not widely used. Otherwise interposition is always allowed for
> default-visibility functions, so I don't see how the indirect branch
> here is suboptimal.

Relax branch is to avoid indirect branches to local targets.  If
you don't think an indirect branch to a local target is a performance
issue, relax branch isn't for you.

>> With relax branch, the decision is left
>> to linker.  Of course, EBX must be used unless we add a new PLT
>> relocation for each register used to hold GOT base, like
>>
>> relax call foo@PLT_ECX
>> relax call foo@PLT_EDX
>
> No, that's not needed. If the linker doesn't make the relaxation, the
> instruction the compiler generated remains in place, and has the
> effective address expression using whichever register it wanted:
>
> relax call *foo@GOT(%ecx)
> relax call *foo@GOT(%edx)
> etc.

Relax branch is only used for direct branches, not for indirect
branches. I will implement

relax call foo@PLT(%reg)

The compiler can pick any register to hold the GOT base.  Lazy
binding is supported only when EBX is used.

> If the linker chooses to relax it to a direct call, no register at all
> is needed, so the linker can just throw this away and use:
>
> call foo
>
> for all of them.
>
> Rich



-- 
H.J.

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 22:16                 ` H.J. Lu
@ 2015-05-15 23:14                   ` Jan Hubicka
  2015-05-15 23:30                     ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Jan Hubicka @ 2015-05-15 23:14 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Rich Felker, Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

Hello,
> 
> There are codes like
> 
> extern void foo (void);
> 
> void
> bar (void)
> {
>   foo ();
> }
> 
> Even with LTO, compiler may have to assume foo is external
> when foo is compiled with LTO.

This is not exactly true if FOO is defined in another translation unit
compiled with LTO and hidden visibility.
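
A minimal sketch of the hidden-visibility case, assuming GCC's visibility
attribute.  The names foo_default and foo_hidden are made up for illustration,
and both are defined in the same file only so that the example builds on its
own; in the situation discussed here the definitions would live in another
translation unit:

```c
/* Default visibility: when building PIC, the compiler has to assume a call
   to this symbol may be interposed or resolved in another DSO.  */
extern int foo_default (void);

/* Hidden visibility (hypothetical name): the symbol cannot be referenced
   from outside the DSO, so the compiler may treat calls to it as binding
   locally.  */
extern int foo_hidden (void) __attribute__ ((visibility ("hidden")));

/* Definitions placed here only to keep the sketch self-contained.  */
int foo_default (void) { return 1; }
int foo_hidden (void) { return 2; }

int
bar (void)
{
  return foo_default () + foo_hidden ();
}
```

Comparing the assembly for the two calls in bar when built with -fPIC is
probably the quickest way to see the difference.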

OK, so as I understand it, we have the following cases:

 1) compiler knows it is generating call to a local symbol in the current
    unit (binds_to_current_def_p returns true).

    We handle this correctly by doing IP relative call.

 2) compiler knows it is generating call to a local symbol in DSO
    (binds_local_p returns true).
    Currently I think this is only the -fno-pic case or the case of explicit
    hidden visibility, and in this case we do IP relative call.

    We may want to propose a plugin API update adding PREVAILING_DEF_EXP,
    so the compiler would be able to default to this case for PREVAILING_DEF
    and we would also catch cases where the symbol is defined in the current
    DSO as a weak symbol, but the definition is not LTO.
    This would also be a way to communicate -Bsymbolic[-functions] across
    the plugin API.

 3) compiler knows there is going to be a definition in the current DSO
    (by seeing a COMDAT function body or resolution info) that is interposable,
    but because the function is inline or -fno-semantic-interposition is used,
    the semantics will not change.

    In this case it would be nice to arrange IP relative call to the
    hidden alias.  This may require an extension both on compiler and linker
    side.

    I was thinking of doing so for comdats by adding a hidden alias with
    fixed mangling, like __gnu_<function>.hiddenalias, and referring to it.
    But I think it is not safe, as the linker may throw away the section that
    is produced by GCC and keep a prevailing section that is not, leading to
    an undefined symbol?

    I think this is rather common case in C++ (never made any stats) because
    uninlined comdats are quite common.

 4) compiler has no clue but linker may know better

    Here we traditionally always produce a PLT call.  In cases the call
    is known to be hot in the program it makes sense to trade lazy binding
    for performance and produce call via GOT reference (-fno-plt).
    I also see that H.J.'s branch helps us to actually avoid the GOT
    reference in cases the symbol ends up binding locally. How does lazy
    binding work with relaxation?

    We may try to communicate down the information whether the symbol can
    or can not semantically interpose to the linker, so it can do
    -Bsymbolic by default for inline and COMDAT functions.
    Actually perhaps the linker can just default to this for all comdat
    defined symbols?

    I think it still makes sense to work on non-LTO codegen improvements.
    As much as I would like everyone to use LTO and FDO, most people don't.

 5) Compiler knows it is generating call to external function.
    We do not special case this, but we could add binds_external_p and
    have it determine this case from resolution info during LTO.

    I do not see that this case is any different from 4 from a PIC codegen
    perspective, except that perhaps the relax relocation will allow us to lazy
    bind?
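
The cases above can be condensed into a small decision sketch.  This is only
a hypothetical model for following the discussion, not GCC code: the enum,
the struct fields and choose_call_form are invented names, and the
hidden-alias form for case 3 is a proposal rather than something that exists.

```c
#include <stdbool.h>

/* Hypothetical model of the call forms discussed above; not GCC code.  */
enum call_form
{
  CALL_IP_RELATIVE,   /* direct call, cases 1 and 2 */
  CALL_HIDDEN_ALIAS,  /* direct call to a hidden alias, case 3 (proposed) */
  CALL_VIA_PLT,       /* traditional PLT call, cases 4 and 5 */
  CALL_VIA_GOT        /* indirect call through the GOT, -fno-plt */
};

struct callee_info
{
  bool binds_to_current_def;    /* case 1 */
  bool binds_local;             /* case 2 */
  bool interposition_harmless;  /* case 3: defined in this DSO, and inline
                                   or -fno-semantic-interposition */
};

enum call_form
choose_call_form (struct callee_info c, bool flag_plt)
{
  if (c.binds_to_current_def || c.binds_local)
    return CALL_IP_RELATIVE;
  if (c.interposition_harmless)
    return CALL_HIDDEN_ALIAS;
  /* Cases 4 and 5: the compiler cannot tell, or knows the callee is
     external.  */
  return flag_plt ? CALL_VIA_PLT : CALL_VIA_GOT;
}
```

The relaxation proposal would only matter for the last branch, where the
linker may know more than the compiler did.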

Honza

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 23:14                   ` Jan Hubicka
@ 2015-05-15 23:30                     ` H.J. Lu
  2015-05-15 23:35                       ` H.J. Lu
  2015-05-15 23:49                       ` Rich Felker
  0 siblings, 2 replies; 106+ messages in thread
From: H.J. Lu @ 2015-05-15 23:30 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Rich Felker, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 4:08 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hello,
>>
>> There are codes like
>>
>> extern void foo (void);
>>
>> void
>> bar (void)
>> {
>>   foo ();
>> }
>>
>> Even with LTO, compiler may have to assume foo is external
>> when foo is compiled with LTO.
>
> This is not exactly true if FOO is defined in other translation unit
> compiled with LTO and hidden visibility.

I meant to say "when foo is compiled without LTO".

> OK, so as I get it, we get the following cases:
>
>  1) compiler knows it is generating call to a local symbol a current
>     unit (binds_to_current_def_p returns true).
>
>     We handle this correctly by doing IP relative call.
>
>  2) compiler knows it is generating call to a local symbol in DSO
>     (binds_local_p return true)
>     Currently I think this is only the -fno-pic case or case of explicit
>     hidden visibility and in this case we do IP relative call.
>
>     We may want to propose plugin API update adding PREVAILING_DEF_EXP.
>     So compiler would be able to default to this case for PREVAILING_DEF
>     and we will also catch cases where the symbol is defined in current
>     DSO as weak symbol, but the definition is not LTO.
>     This would be also way to communicate -Bsymbolic[-functions] across
>     the plugin API.
>
>  3) compiler knows there is going to be definition in the current DSO
>     (by seeing a COMDAT function body or resolution info) that is interposable
>     but because the function is inline or -fno-semantic-interposition happens,
>     the semantics will not change.
>
>     In this case it would be nice to arrange IP relative call to the
>     hidden alias.  This may require an extension both on compiler and linker
>     side.
>
>     I was thinking of doing so for comdats by adding hidden alias with
>     fixed mangling, like __gnu_<function>.hiddenalias, and referring it.
>     But I think it is not safe as linker may throw away section that
>     is produced by GCC and prevail section that is not leaving to an undefined
>     symbol?
>
>     I think this is rather common case in C++ (never made any stats) because
>     uninlined comdats are quite common.
>
>  4) compiler has no clue but linker may know better
>
>     Here we traditionally always produce a PLT call.  In cases the call
>     is known to be hot in the program it makes sense to trade lazy binding
>     for performance and produce call via GOT reference (-fno-plt).
>     I also see that H.J.'s branch helps us to actually avoid the GOT
>     reference in cases the symbol ends up binding locally. How the lazy
>     binding with relaxation works?

If there is no GOT slot allocated for symbol foo, linker should resolve
foo@GOTPLT(%ebx) to its PLT slot address + 6, which is the push
instruction, to support lazy binding.  Otherwise, linker should resolve it
to its GOT slot address.
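
As a small worked example of the "+ 6", assuming the usual i386 PLT entry
layout (an indirect jmp with a 32-bit displacement, which encodes in 6 bytes,
followed by the push and a jmp back to the first PLT entry).  The macro and
function names are made up for illustration:

```c
#include <stdint.h>

/* Size of an indirect jmp with a 32-bit displacement:
   opcode byte, ModRM byte, four displacement bytes.  */
#define I386_PLT_JMP_SIZE 6

/* Address of the push instruction inside a PLT entry, i.e. the place a
   lazily bound call would land before the symbol has been resolved.  */
uint32_t
lazy_binding_target (uint32_t plt_entry_addr)
{
  return plt_entry_addr + I386_PLT_JMP_SIZE;
}
```

So for a PLT entry at 0x1000 the push would sit at 0x1006.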

>     We may try to communicate down the information whether the symbol can
>     or can not semantically interpose to the linker, so it can do
>     -Bsymbolic by default for inline and COMDAT functions.
>     Actually perhaps the linker can just default to this for all comdat
>     defined symbols?
>
>     I think it still make sense to work on non-LTO codegen improvements.
>     As much as I would like everyone to LTO and FDO, most people don't.
>
>  5) Compiler knows it is generating call to external function.
>     We do not special case this, but we could add binds_external_p and
>     make it to determine this case from resolution info during LTO.
>
>     I do not see if this case is any different from 4 from PIC codegen
>     perspective except that perhaps the relax relocation will allow us to lazy
>     bind?

My relax branch proposal works even without LTO.


-- 
H.J.

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 23:30                     ` H.J. Lu
@ 2015-05-15 23:35                       ` H.J. Lu
  2015-05-15 23:44                         ` H.J. Lu
  2015-05-15 23:49                       ` Rich Felker
  1 sibling, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-15 23:35 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Rich Felker, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> My relax branch proposal works even without LTO.
>

I will borrow GOTPCREL from x86-64 and do

[hjl@gnu-6 relax-4]$ cat b.S
call *foo@GOTPCREL(%eax)
[hjl@gnu-6 relax-4]$ ./as -32 -o b.o b.S
[hjl@gnu-6 relax-4]$ ./objdump -dwr b.o

b.o:     file format elf32-i386


Disassembly of section .text:

00000000 <.text>:
   0: ff 90 fc ff ff ff     call   *-0x4(%eax) 2: R_386_RELAX_GOT32 foo
[hjl@gnu-6 relax-4]$

And linker can turn it into

relax call foo

if foo is defined locally.
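
To read the dump above: ff 90 is an indirect call through disp32(%eax), and
the relocation at offset 2 covers the four displacement bytes fc ff ff ff.
A minimal sketch of decoding that field, with read_disp32 being a made-up
helper name:

```c
#include <stdint.h>

/* Read a little-endian 32-bit signed displacement starting at p.
   The final conversion relies on two's complement, which is what
   GCC and Clang implement.  */
int32_t
read_disp32 (const unsigned char *p)
{
  uint32_t v = (uint32_t) p[0]
               | ((uint32_t) p[1] << 8)
               | ((uint32_t) p[2] << 16)
               | ((uint32_t) p[3] << 24);
  return (int32_t) v;
}
```

Applied to the bytes at offset 2 this gives the -0x4 that objdump prints.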

-- 
H.J.

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 23:35                       ` H.J. Lu
@ 2015-05-15 23:44                         ` H.J. Lu
  2015-05-16  0:18                           ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-15 23:44 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Rich Felker, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> My relax branch proposal works even without LTO.
>>
>
> I will borrow GOTPCREL from x86-64 and do
>
> [hjl@gnu-6 relax-4]$ cat b.S
> call *foo@GOTPCREL(%eax)

call *foo@GOTPLT(%eax)

is a better choice.

> [hjl@gnu-6 relax-4]$ ./as -32 -o b.o b.S
> [hjl@gnu-6 relax-4]$ ./objdump -dwr b.o
>
> b.o:     file format elf32-i386
>
>
> Disassembly of section .text:
>
> 00000000 <.text>:
>    0: ff 90 fc ff ff ff     call   *-0x4(%eax) 2: R_386_RELAX_GOT32 foo
> [hjl@gnu-6 relax-4]$
>
> And linker can turn it into
>
> relax call foo
>
> if foo is defined locally.

-- 
H.J.

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 23:30                     ` H.J. Lu
  2015-05-15 23:35                       ` H.J. Lu
@ 2015-05-15 23:49                       ` Rich Felker
  2015-05-19 14:48                         ` Michael Matz
  1 sibling, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-15 23:49 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 04:14:07PM -0700, H.J. Lu wrote:
> On Fri, May 15, 2015 at 4:08 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Hello,
> >>
> >> There are codes like
> >>
> >> extern void foo (void);
> >>
> >> void
> >> bar (void)
> >> {
> >>   foo ();
> >> }
> >>
> >> Even with LTO, compiler may have to assume foo is external
> >> when foo is compiled with LTO.
> >
> > This is not exactly true if FOO is defined in other translation unit
> > compiled with LTO and hidden visibility.
> 
> I meant to say "when foo is compiled without LTO".
> 
> > OK, so as I get it, we get the following cases:
> >
> >  1) compiler knows it is generating call to a local symbol a current
> >     unit (binds_to_current_def_p returns true).
> >
> >     We handle this correctly by doing IP relative call.
> >
> >  2) compiler knows it is generating call to a local symbol in DSO
> >     (binds_local_p return true)
> >     Currently I think this is only the -fno-pic case or case of explicit
> >     hidden visibility and in this case we do IP relative call.
> >
> >     We may want to propose plugin API update adding PREVAILING_DEF_EXP.
> >     So compiler would be able to default to this case for PREVAILING_DEF
> >     and we will also catch cases where the symbol is defined in current
> >     DSO as weak symbol, but the definition is not LTO.
> >     This would be also way to communicate -Bsymbolic[-functions] across
> >     the plugin API.
> >
> >  3) compiler knows there is going to be definition in the current DSO
> >     (by seeing a COMDAT function body or resolution info) that is interposable
> >     but because the function is inline or -fno-semantic-interposition happens,
> >     the semantics will not change.
> >
> >     In this case it would be nice to arrange IP relative call to the
> >     hidden alias.  This may require an extension both on compiler and linker
> >     side.
> >
> >     I was thinking of doing so for comdats by adding hidden alias with
> >     fixed mangling, like __gnu_<function>.hiddenalias, and referring it.
> >     But I think it is not safe as linker may throw away section that
> >     is produced by GCC and prevail section that is not leaving to an undefined
> >     symbol?
> >
> >     I think this is rather common case in C++ (never made any stats) because
> >     uninlined comdats are quite common.
> >
> >  4) compiler has no clue but linker may know better
> >
> >     Here we traditionally always produce a PLT call.  In cases the call
> >     is known to be hot in the program it makes sense to trade lazy binding
> >     for performance and produce call via GOT reference (-fno-plt).
> >     I also see that H.J.'s branch helps us to actually avoid the GOT
> >     reference in cases the symbol ends up binding locally. How the lazy
> >     binding with relaxation works?
> 
> If there is no GOT slot allocated for symbol foo, linker should resolve
> foo@GOTPLT(%ebx) to its PLT slot address + 6, which is the push
> instruction, to support lazy binding.  Otherwise, linker should resolve it
> to its GOT slot address.

Forget lazy binding. It's dead anyway because serious distros want
PIE+relro+bindnow+... If people really want lazy binding, they can use
options which support it, but I don't want to keep suffering the
codegen cost of lazy binding despite never using it. There should be
an option to generate optimal code equivalent to what you get with
Alexander Monakov's patches for those of us who aren't trying to
support this legacy feature that precludes good performance and
precludes hardening.

Rich

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 23:44                         ` H.J. Lu
@ 2015-05-16  0:18                           ` Rich Felker
  2015-05-16 14:33                             ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-16  0:18 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >> My relax branch proposal works even without LTO.
> >>
> >
> > I will borrow GOTPCREL from x86-64 and do
> >
> > [hjl@gnu-6 relax-4]$ cat b.S
> > call *foo@GOTPCREL(%eax)
> 
> call *foo@GOTPLT(%eax)
> 
> is a better choice.

foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
reloc type would have to be added) since it saves a useless add.
Instead of:

	call __x86.get_pc_thunk.ax
	addl $_GLOBAL_OFFSET_TABLE_, %eax
	call *foo@GOTPLT(%eax)

you can just do:

	call __x86.get_pc_thunk.ax
	call *foo@GOTPCREL(%eax)

Note that it also works to have extra instructions between:

	call __x86.get_pc_thunk.ax
1:	...
	call *foo@GOTPCREL+(1b-.)(%eax)

I may not have gotten the syntax quite right, but hopefully you get
the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
accesses, including global data, to eliminate the useless add.
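
The arithmetic behind dropping the addl can be sketched as follows.  This
only models addresses as plain integers; the function and parameter names are
invented, and how an assembler or linker would actually encode the
displacement is not modelled:

```c
#include <stdint.h>

/* Classic scheme: the register holds the GOT base (after the addl) and
   the displacement is the slot's offset within the GOT.  */
uint32_t
slot_via_got_base (uint32_t got_base, uint32_t slot_offset_in_got)
{
  return got_base + slot_offset_in_got;
}

/* PC-relative scheme: the register holds the address saved by the thunk
   (the label), and the displacement is slot address minus label address.  */
uint32_t
slot_via_label (uint32_t label_addr, uint32_t got_base,
                uint32_t slot_offset_in_got)
{
  uint32_t disp = got_base + slot_offset_in_got - label_addr;
  return label_addr + disp;
}
```

Both forms reach the same slot; the second just folds the distance from the
label to the GOT into the displacement instead of into the register.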

Rich

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-16  0:18                           ` Rich Felker
@ 2015-05-16 14:33                             ` H.J. Lu
  2015-05-16 19:03                               ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-16 14:33 UTC (permalink / raw)
  To: Rich Felker; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >> My relax branch proposal works even without LTO.
>> >>
>> >
>> > I will borrow GOTPCREL from x86-64 and do
>> >
>> > [hjl@gnu-6 relax-4]$ cat b.S
>> > call *foo@GOTPCREL(%eax)
>>
>> call *foo@GOTPLT(%eax)
>>
>> is a better choice.
>
> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
> reloc type would have to be added) since it saves a useless add.
> Instead of:
>
>         call __x86.get_pc_thunk.ax
>         addl $_GLOBAL_OFFSET_TABLE_, %eax
>         call *foo@GOTPLT(%eax)
>
> you can just do:
>
>         call __x86.get_pc_thunk.ax
>         call *foo@GOTPCREL(%eax)
>
> Note that it also works to have extra instructions between:
>
>         call __x86.get_pc_thunk.ax
> 1:      ...
>         call *foo@GOTPCREL+(1b-.)(%eax)
>
> I may not have gotten the syntax quite right, but hopefully you get
> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
> accesses, including global data, to eliminate the useless add.
>

This is a good idea.  But I'd like to use something for both i386 and
x86-64.  I am proposing

call/jmp *foo@GOTPCRELAX+addend(%reg)

It is similar to @GOTPCREL, but with a new relax relocation.  Before
I can do that, I need to fix

https://sourceware.org/bugzilla/show_bug.cgi?id=18423

first.

-- 
H.J.

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-16 14:33                             ` H.J. Lu
@ 2015-05-16 19:03                               ` H.J. Lu
  2015-05-16 19:32                                 ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-16 19:03 UTC (permalink / raw)
  To: Rich Felker; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
>> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
>>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> >> My relax branch proposal works even without LTO.
>>> >>
>>> >
>>> > I will borrow GOTPCREL from x86-64 and do
>>> >
>>> > [hjl@gnu-6 relax-4]$ cat b.S
>>> > call *foo@GOTPCREL(%eax)
>>>
>>> call *foo@GOTPLT(%eax)
>>>
>>> is a better choice.
>>
>> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
>> reloc type would have to be added) since it saves a useless add.
>> Instead of:
>>
>>         call __x86.get_pc_thunk.ax
>>         addl $_GLOBAL_OFFSET_TABLE_, %eax
>>         call *foo@GOTPLT(%eax)
>>
>> you can just do:
>>
>>         call __x86.get_pc_thunk.ax
>>         call *foo@GOTPCREL(%eax)
>>
>> Note that it also works to have extra instructions between:
>>
>>         call __x86.get_pc_thunk.ax
>> 1:      ...
>>         call *foo@GOTPCREL+(1b-.)(%eax)
>>
>> I may not have gotten the syntax quite right, but hopefully you get
>> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
>> accesses, including global data, to eliminate the useless add.
>>
>
> This is a good idea.  But I'd like to use something for both i386 and
> x86-64.  I am proposing
>
> call/jmp *foo@GOTPCRELAX+addend(%reg)
>
> It is similar to @GOTPCREL, but with a new relax relocation.  Before
> I can do that, I need to fix

It doesn't work.  REG must hold GOT base for other GOT relocations.
We need to keep

addl $_GLOBAL_OFFSET_TABLE_, %eax

-- 
H.J.

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-16 19:03                               ` H.J. Lu
@ 2015-05-16 19:32                                 ` Rich Felker
  2015-05-16 23:23                                   ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-16 19:32 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Sat, May 16, 2015 at 11:59:56AM -0700, H.J. Lu wrote:
> On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
> >> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
> >>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >>> >> My relax branch proposal works even without LTO.
> >>> >>
> >>> >
> >>> > I will borrow GOTPCREL from x86-64 and do
> >>> >
> >>> > [hjl@gnu-6 relax-4]$ cat b.S
> >>> > call *foo@GOTPCREL(%eax)
> >>>
> >>> call *foo@GOTPLT(%eax)
> >>>
> >>> is a better choice.
> >>
> >> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
> >> reloc type would have to be added) since it saves a useless add.
> >> Instead of:
> >>
> >>         call __x86.get_pc_thunk.ax
> >>         addl $_GLOBAL_OFFSET_TABLE_, %eax
> >>         call *foo@GOTPLT(%eax)
> >>
> >> you can just do:
> >>
> >>         call __x86.get_pc_thunk.ax
> >>         call *foo@GOTPCREL(%eax)
> >>
> >> Note that it also works to have extra instructions between:
> >>
> >>         call __x86.get_pc_thunk.ax
> >> 1:      ...
> >>         call *foo@GOTPCREL+(1b-.)(%eax)
> >>
> >> I may not have gotten the syntax quite right, but hopefully you get
> >> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
> >> accesses, including global data, to eliminate the useless add.
> >>
> >
> > This is a good idea.  But I'd like to use something for both i386 and
> > x86-64.  I am proposing
> >
> > call/jmp *foo@GOTPCRELAX+addend(%reg)
> >
> > It is similar to @GOTPCREL, but with a new relax relocation.  Before
> > I can do that, I need to fix
> 
> It doesn't work.  REG must hold GOT base for other GOT relocations.
> We need to keep
> 
> addl $_GLOBAL_OFFSET_TABLE_, %eax

Like I just said, all foo@GOT(%gotreg) can be replaced with
foo@GOTPCREL+[label-.](%labelreg) where %labelreg is a register
pointing to the referenced label (the point at which the program
counter was saved). This is a minor but useful optimization that can
be made for all GOT accesses, not just ones for [relaxable] function
calls.

Rich

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-16 19:32                                 ` Rich Felker
@ 2015-05-16 23:23                                   ` H.J. Lu
  0 siblings, 0 replies; 106+ messages in thread
From: H.J. Lu @ 2015-05-16 23:23 UTC (permalink / raw)
  To: Rich Felker; +Cc: Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Sat, May 16, 2015 at 12:03 PM, Rich Felker <dalias@libc.org> wrote:
> On Sat, May 16, 2015 at 11:59:56AM -0700, H.J. Lu wrote:
>> On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> > On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
>> >> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
>> >>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >>> >> My relax branch proposal works even without LTO.
>> >>> >>
>> >>> >
>> >>> > I will borrow GOTPCREL from x86-64 and do
>> >>> >
>> >>> > [hjl@gnu-6 relax-4]$ cat b.S
>> >>> > call *foo@GOTPCREL(%eax)
>> >>>
>> >>> call *foo@GOTPLT(%eax)
>> >>>
>> >>> is a better choice.
>> >>
>> >> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
>> >> reloc type would have to be added) since it saves a useless add.
>> >> Instead of:
>> >>
>> >>         call __x86.get_pc_thunk.ax
>> >>         addl $_GLOBAL_OFFSET_TABLE_, %eax
>> >>         call *foo@GOTPLT(%eax)
>> >>
>> >> you can just do:
>> >>
>> >>         call __x86.get_pc_thunk.ax
>> >>         call *foo@GOTPCREL(%eax)
>> >>
>> >> Note that it also works to have extra instructions between:
>> >>
>> >>         call __x86.get_pc_thunk.ax
>> >> 1:      ...
>> >>         call *foo@GOTPCREL+(1b-.)(%eax)
>> >>
>> >> I may not have gotten the syntax quite right, but hopefully you get
>> >> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
>> >> accesses, including global data, to eliminate the useless add.
>> >>
>> >
>> > This is a good idea.  But I'd like to use something for both i386 and
>> > x86-64.  I am proposing
>> >
>> > call/jmp *foo@GOTPCRELAX+addend(%reg)
>> >
>> > It is similar to @GOTPCREL, but with a new relax relocation.  Before
>> > I can do that, I need to fix
>>
>> It doesn't work.  REG must hold GOT base for other GOT relocations.
>> We need to keep
>>
>> addl $_GLOBAL_OFFSET_TABLE_, %eax
>
> Like I just said, all foo@GOT(%gotreg) can be replaced with
> foo@GOTPCREL+[label-.](%labelreg) where %labelreg is a register
> pointing to the referenced label (the point at which the program
> counter was saved). This is a minor but useful optimization that can
> be made for all GOT accesses, not just ones for [relaxable] function
> calls.

There is also foo@GOTOFF(%reg).  Removing the addl is independent of
relax branch, so I will leave it out.  Relax branch will support

call/jmp   *bar@GOTRELAX(%reg)

for both i386 and x86-64.


-- 
H.J.

* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 20:08       ` Jan Hubicka
  2015-05-15 20:23         ` H.J. Lu
@ 2015-05-18 18:25         ` Alexander Monakov
  2015-05-18 19:03           ` Jan Hubicka
  1 sibling, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-05-18 18:25 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: H.J. Lu, GCC Patches, Rich Felker, Uros Bizjak

On Fri, 15 May 2015, Jan Hubicka wrote:
> > >> With -fno-plt, we don't have to reject even direct calls as sibcall
> > >> candidates.
> > >>
> > >> This patch depends on '-fplt' flag that is introduced in another patch.
> > >>
> > >> This patch requires that with -fno-plt all sibcall candidates go through
> > >> prepare_call_address that transforms the call to a GOT lookup.
> > >>
> > >> OK?
> > >>       * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.
> > >>
> > >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > >> index f29e053..b734350 100644
> > >> --- a/gcc/config/i386/i386.c
> > >> +++ b/gcc/config/i386/i386.c
> > >> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
> > >>    /* If we are generating position-independent code, we cannot sibcall
> > >>       optimize any indirect call, or a direct call to a global function,
> > >>       as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
> > >>    if (!TARGET_MACHO
> > >>        && !TARGET_64BIT
> > >>        && flag_pic
> > >> +      && flag_plt
> > >>        && (decl && !targetm.binds_local_p (decl)))
> > >>      return false;
> > >>
> > >>    /* If we need to align the outgoing stack, then sibcalling would
> > >>       unalign the stack, which may break the called function.  */
> > >>    if (ix86_minimum_incoming_stack_boundary (true)
> > >>
> > 
> > I think it should be done via psABI change similar to
> > 
> > https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI
> > 
> > which I have implemented on users/hjl/relax branch in binutils.
> 
> OK, I am trying to understand how relax branch works and what difference it makes.
> As I understand it, the main purpose is to be able to make relaxed call of
> 
>    call function
> 
> that will, in 64bit mode, either result to RIP relative call with extra NOP just
> before the instruction if FUNCTION binds within the DSO or to indirect call through
> GOT bypassing the PLT.  This saves overhead of PLT and increase every such call
> by extra NOP for no-LTO builds and even in LTO when the symbol is defined but
> interposable.  This is actually really nice trick.
> 
> Now this is about 32bit mode where explicit GOT pointer register is needed
> (how this work with large code model on x86-64?). It is needed by PLT, but I suppose
> to implement the same relaxation for 32bit it would need to use EBX to lookup the
> GOT pointer, too, so the check above would still be valid.
> 
> The patches makes sense to be given that we support -fno-plt now.

After this message the discussion diverged in the direction of H.J.Lu's
proposed relaxation scheme involving a new type of relocation.

I'm not clear if my patch is actually approved.  I'd like to point out that it
doesn't clash with H.J.Lu's work.  It improves codegen by allowing sibcalls in
more circumstances.
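
For illustration, a minimal sketch of the kind of call this affects.  The
names target and wrapper are hypothetical, and target is defined locally here
only so the snippet is self-contained; in the scenario under discussion it
would be a global function in another DSO.  With -m32 -fPIC -fno-plt the call
in tail position is reached through a GOT load rather than the PLT, so the
PLT's requirement that %ebx be live no longer applies and the call can become
a jump.  The exact instruction sequence depends on the compiler version.

```c
/* Stand-in for a global function that would normally live in another DSO
   (hypothetical name, defined here only to keep the sketch self-contained). */
int target(int x)
{
    return x * 3;
}

/* The call to target() is in tail position: nothing remains to be done in
   wrapper() after it returns, so the compiler may emit it as a sibcall
   (a jump instead of call + ret) when the target allows it. */
int wrapper(int x)
{
    return target(x + 1);
}
```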

Alexander


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-18 18:25         ` Alexander Monakov
@ 2015-05-18 19:03           ` Jan Hubicka
  0 siblings, 0 replies; 106+ messages in thread
From: Jan Hubicka @ 2015-05-18 19:03 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Jan Hubicka, H.J. Lu, GCC Patches, Rich Felker, Uros Bizjak

> 
> After this message the discussion diverged in the direction of H.J.Lu's
> proposed relaxation scheme involving new type of relocations.
> 
> I'm not clear if my patch is actually approved.  I'd like to point out that it
> doesn't clash with H.J.Lu's work.  It improves codegen by allowing sibcalls in
> more circumstances.

Yes, the original patch is OK.

Honza
> 
> Alexander


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-15 23:49                       ` Rich Felker
@ 2015-05-19 14:48                         ` Michael Matz
  2015-05-19 15:11                           ` Jeff Law
  2015-05-19 18:08                           ` Rich Felker
  0 siblings, 2 replies; 106+ messages in thread
From: Michael Matz @ 2015-05-19 14:48 UTC (permalink / raw)
  To: Rich Felker
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

Hi,

On Fri, 15 May 2015, Rich Felker wrote:

> Forget lazy binding. It's dead anyway because serious distros want
> PIE+relro+bindnow+...

You keep saying this, but I can't help the feeling it's mostly because 
musl doesn't support it ;-)

No, you don't have to use bindnow to get the effects of relro.  Sure 
there are more parts of the GOT protected with it, but whether that's really 
that much more hardened is up for debate.

> If people really want lazy binding, they can use options which support 
> it, but I don't want to keep suffering the codegen cost of lazy binding 
> despite never using it.

> There should be an option to generate optimal code equivalent to what 
> you get with Alexander Monakov's patches for those of us who aren't 
> trying to support this legacy feature that precludes good performance 
> and precludes hardening.

H.J.'s branch is for _improving_ code on top of the no-plt code, it's not 
replacing it or an alternative for it.


Ciao,
Michael.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 14:48                         ` Michael Matz
@ 2015-05-19 15:11                           ` Jeff Law
  2015-05-19 16:03                             ` Michael Matz
  2015-05-19 18:08                           ` Rich Felker
  1 sibling, 1 reply; 106+ messages in thread
From: Jeff Law @ 2015-05-19 15:11 UTC (permalink / raw)
  To: Michael Matz, Rich Felker
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On 05/19/2015 08:43 AM, Michael Matz wrote:
> Hi,
>
> On Fri, 15 May 2015, Rich Felker wrote:
>
>> Forget lazy binding. It's dead anyway because serious distros want
>> PIE+relro+bindnow+...
>
> You keep saying this, but I can't help the feeling it's mostly because
> musl doesn't support it ;-)
FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the 
distribution.  It's not clear yet how far bindnow will go though.

jeff


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 15:11                           ` Jeff Law
@ 2015-05-19 16:03                             ` Michael Matz
  2015-05-19 19:11                               ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: Michael Matz @ 2015-05-19 16:03 UTC (permalink / raw)
  To: Jeff Law
  Cc: Rich Felker, H.J. Lu, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

Hi,

On Tue, 19 May 2015, Jeff Law wrote:

> > > Forget lazy binding. It's dead anyway because serious distros want 
> > > PIE+relro+bindnow+...
> > 
> > You keep saying this, but I can't help the feeling it's mostly because 
> > musl doesn't support it ;-)
> 
> FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the 
> distribution.

Yeah, us as well, though I don't necessarily see the point for most 
packages; feels a bit like a checkmark item :)


Ciao,
Michael.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 14:48                         ` Michael Matz
  2015-05-19 15:11                           ` Jeff Law
@ 2015-05-19 18:08                           ` Rich Felker
  2015-05-19 19:03                             ` Richard Henderson
  1 sibling, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-19 18:08 UTC (permalink / raw)
  To: Michael Matz
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 04:43:53PM +0200, Michael Matz wrote:
> Hi,
> 
> On Fri, 15 May 2015, Rich Felker wrote:
> 
> > Forget lazy binding. It's dead anyway because serious distros want
> > PIE+relro+bindnow+...
> 
> You keep saying this, but I can't help the feeling it's mostly because 
> musl doesn't support it ;-)

Well the reasons musl doesn't support it are partly the above, and
partly that it's been a continuous source of subtle bugs in glibc --
things like clobbering new vector registers, missing synchronization,
failures to be async-signal-safe, etc. So it's not that I think lazy
binding is bad because musl doesn't support it, but rather that musl
doesn't support lazy binding because I think it's bad. :-)

> No, you don't have to use bindnow to get the effects of relro.  Sure 
> there's more parts of the GOT protected with it, but if that's really that 
> much more hardened is up for debate.

Normally it's function addresses that you care about protecting --
they're the easy vector for arbitrary code execution -- and they're
unprotected without bindnow. Addresses of global data could also be an
attack vector, but a more difficult one to exploit.

> > If people really want lazy binding, they can use options which support 
> > it, but I don't want to keep suffering the codegen cost of lazy binding 
> > despite never using it.
> 
> > There should be an option to generate optimal code equivalent to what 
> > you get with Alexander Monakov's patches for those of us who aren't 
> > trying to support this legacy feature that precludes good performance 
> > and precludes hardening.
> 
> H.J.'s branch is for _improving_ code on top of the no-plt code, it's not 
> replacing it or an alternative for it.

Thanks for the clarification -- this was the part I was failing to
understand. I'm still mildly worried that concerns for supporting
relaxation might lead to decisions not to optimize code in ways that
would be difficult to relax (e.g. certain types of address load
reordering or hoisting) but I don't understand GCC internals
sufficiently to know if this concern is warranted or not. As long as
his work isn't interfering with the ability of -fno-plt to generate
optimal code, I agree it's both inappropriate and counter-productive
for me to be objecting to part or all of it.

I would still like to see the @GOTPCREL stuff added and used instead
of @GOT, as I mentioned earlier in the thread, but I agree that's
independent of relaxation support and shouldn't block it.

Rich


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 18:08                           ` Rich Felker
@ 2015-05-19 19:03                             ` Richard Henderson
  2015-05-19 19:10                               ` H.J. Lu
                                                 ` (2 more replies)
  0 siblings, 3 replies; 106+ messages in thread
From: Richard Henderson @ 2015-05-19 19:03 UTC (permalink / raw)
  To: Rich Felker, Michael Matz
  Cc: H.J. Lu, Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On 05/19/2015 11:06 AM, Rich Felker wrote:
> I'm still mildly worried that concerns for supporting
> relaxation might lead to decisions not to optimize code in ways that
> would be difficult to relax (e.g. certain types of address load
> reordering or hoisting) but I don't understand GCC internals
> sufficiently to know if this concern is warranted or not.

It is.  The relaxation that HJ is working on requires that the reads from the
got not be hoisted.  I'm not especially convinced that what he's working on is
a win.

With LTO, the compiler can do the same job that he's attempting in the linker,
without an extra nop.  Without LTO, leaving it to the linker means that you
can't hoist the load and hide the memory latency.

> I would still like to see the @GOTPCREL stuff added and used instead
> of @GOT, as I mentioned earlier in the thread, but I agree that's
> independent of relaxation support and shouldn't block it.

I don't think that @GOTPCREL for 32-bit is a good idea.  This is the scheme
that Darwin uses, so we do have some experience with it.

In order for it to work you've got to have a pointer to a random address in the
function.  It means that you can only "easily" compute the address once.  If
you need the value again you wind up with the same "extra" addl insn that we
have with the current GOT pointer.

We've just started to do inter-function register allocation.  The next step
along those lines is to share the computation of GOT between multiple
functions.  At which point it really helps to have one global base address to
talk about.

r~


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 19:03                             ` Richard Henderson
@ 2015-05-19 19:10                               ` H.J. Lu
  2015-05-19 19:17                                 ` Richard Henderson
  2015-05-19 19:48                               ` Rich Felker
  2015-05-20 12:13                               ` Michael Matz
  2 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-19 19:10 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Rich Felker, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> On 05/19/2015 11:06 AM, Rich Felker wrote:
>> I'm still mildly worried that concerns for supporting
>> relaxation might lead to decisions not to optimize code in ways that
>> would be difficult to relax (e.g. certain types of address load
>> reordering or hoisting) but I don't understand GCC internals
>> sufficiently to know if this concern is warranted or not.
>
> It is.  The relaxation that HJ is working on requires that the reads from the
> got not be hoisted.  I'm not especially convinced that what he's working on is
> a win.
>
> With LTO, the compiler can do the same job that he's attempting in the linker,
> without an extra nop.  Without LTO, leaving it to the linker means that you
> can't hoist the load and hide the memory latency.
>

My relax approach won't take away any optimization done by the compiler.
It simply turns an indirect branch into a direct branch with a nop prefix at
link-time.  I am having a hard time understanding why we shouldn't do it.


-- 
H.J.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 16:03                             ` Michael Matz
@ 2015-05-19 19:11                               ` Rich Felker
  0 siblings, 0 replies; 106+ messages in thread
From: Rich Felker @ 2015-05-19 19:11 UTC (permalink / raw)
  To: Michael Matz
  Cc: Jeff Law, H.J. Lu, Jan Hubicka, Alexander Monakov, GCC Patches,
	Uros Bizjak

On Tue, May 19, 2015 at 06:01:07PM +0200, Michael Matz wrote:
> Hi,
> 
> On Tue, 19 May 2015, Jeff Law wrote:
> 
> > > > Forget lazy binding. It's dead anyway because serious distros want 
> > > > PIE+relro+bindnow+...
> > > 
> > > You keep saying this, but I can't help the feeling it's mostly because 
> > > musl doesn't support it ;-)
> > 
> > FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the 
> > distribution.
> 
> Yeah, us as well, though I don't necessarily see the point for most 
> packages; feels a bit like a checkmark item :)

These days it's fairly rare to have software which does not interact
at all with untrusted data. Consider how much user-facing application
software that was not previously considered security-critical is
making network connections using complex protocols (e.g. anything with
TLS, IM protocols, ...), opening image files from random sources
(attachments, files that happen to be in a browsed-to directory, on
USB sticks, etc.), and so on. I think it's smart to be hardening
everything, at least for distros providing all sorts of random
unvetted software.

Rich


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 19:10                               ` H.J. Lu
@ 2015-05-19 19:17                                 ` Richard Henderson
  2015-05-19 19:20                                   ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Richard Henderson @ 2015-05-19 19:17 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Rich Felker, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On 05/19/2015 12:06 PM, H.J. Lu wrote:
> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>>> I'm still mildly worried that concerns for supporting
>>> relaxation might lead to decisions not to optimize code in ways that
>>> would be difficult to relax (e.g. certain types of address load
>>> reordering or hoisting) but I don't understand GCC internals
>>> sufficiently to know if this concern is warranted or not.
>>
>> It is.  The relaxation that HJ is working on requires that the reads from the
>> got not be hoisted.  I'm not especially convinced that what he's working on is
>> a win.
>>
>> With LTO, the compiler can do the same job that he's attempting in the linker,
>> without an extra nop.  Without LTO, leaving it to the linker means that you
>> can't hoist the load and hide the memory latency.
>>
> 
> My relax approach won't take away any optimization done by compiler.
> It simply turns indirect branch into direct branch with a nop prefix at
> link-time.  I am having a hard time to understand why we shouldn't do it.

I well understand what you're doing.

But my point is that the only time the compiler should present you with the
form of indirect branch you're looking for is when there's no place to hoist
the load.

At which point, is it really worth adding a new relocation to the ABI?  Is it
really worth adding new code to the linker that won't be exercised often?


r~


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 19:17                                 ` Richard Henderson
@ 2015-05-19 19:20                                   ` H.J. Lu
  2015-05-19 19:54                                     ` Richard Henderson
  2015-05-19 20:27                                     ` Rich Felker
  0 siblings, 2 replies; 106+ messages in thread
From: H.J. Lu @ 2015-05-19 19:20 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Rich Felker, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
> On 05/19/2015 12:06 PM, H.J. Lu wrote:
>> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>>>> I'm still mildly worried that concerns for supporting
>>>> relaxation might lead to decisions not to optimize code in ways that
>>>> would be difficult to relax (e.g. certain types of address load
>>>> reordering or hoisting) but I don't understand GCC internals
>>>> sufficiently to know if this concern is warranted or not.
>>>
>>> It is.  The relaxation that HJ is working on requires that the reads from the
>>> got not be hoisted.  I'm not especially convinced that what he's working on is
>>> a win.
>>>
>>> With LTO, the compiler can do the same job that he's attempting in the linker,
>>> without an extra nop.  Without LTO, leaving it to the linker means that you
>>> can't hoist the load and hide the memory latency.
>>>
>>
>> My relax approach won't take away any optimization done by compiler.
>> It simply turns indirect branch into direct branch with a nop prefix at
>> link-time.  I am having a hard time to understand why we shouldn't do it.
>
> I well understand what you're doing.
>
> But my point is that the only time the compiler should present you with the
> form of indirect branch you're looking for is when there's no place to hoist
> the load.
>
> At which point, is it really worth adding a new relocation to the ABI?  Is it
> really worth adding new code to the linker that won't be exercised often?

I believe there are plenty of indirect branches via GOT when compiling
PIE/PIC with -fno-plt:

[hjl@gnu-6 gcc]$ cat /tmp/x.c
extern void foo (void);

void
bar (void)
{
  foo ();
}
[hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
[hjl@gnu-6 gcc]$ cat x.s
.file "x.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl bar
.type bar, @function
bar:
.LFB0:
.cfi_startproc
jmp *foo@GOTPCREL(%rip)
.cfi_endproc
.LFE0:
.size bar, .-bar

-- 
H.J.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 19:03                             ` Richard Henderson
  2015-05-19 19:10                               ` H.J. Lu
@ 2015-05-19 19:48                               ` Rich Felker
  2015-05-19 20:16                                 ` Richard Henderson
  2015-05-20 12:13                               ` Michael Matz
  2 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-19 19:48 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Michael Matz, H.J. Lu, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 11:59:00AM -0700, Richard Henderson wrote:
> On 05/19/2015 11:06 AM, Rich Felker wrote:
> > I'm still mildly worried that concerns for supporting
> > relaxation might lead to decisions not to optimize code in ways that
> > would be difficult to relax (e.g. certain types of address load
> > reordering or hoisting) but I don't understand GCC internals
> > sufficiently to know if this concern is warranted or not.
> 
> It is.  The relaxation that HJ is working on requires that the reads from the
> got not be hoisted.  I'm not especially convinced that what he's working on is
> a win.

Well as long as -fno-plt actually generates a load from the GOT like
what would be done for data access, and does not go out of its way to
produce something compatible with relaxation, my hope is that it would
not be affected by the pessimization. I'm not sure if that's the case
though.

> With LTO, the compiler can do the same job that he's attempting in the linker,
> without an extra nop.  Without LTO, leaving it to the linker means that you
> can't hoist the load and hide the memory latency.

Yes, this is my feeling too. Alexander Monakov and I have been discussing
it on #musl a bit and I think the conclusion we reached is that
relaxation is possibly a significant real-world win for non-PIC main
executables, where it's very likely that addresses will be resolved at
ld-time and for the programmer not to specifically annotate this with
protected visibility. In such a case, you get either a direct call or
a direct address load and indirect call, rather than hitting an extra
cache line in the PLT thunk to do the address load and indirect call.
Note that, being non-PIC, there is no GOT register involved here.

> > I would still like to see the @GOTPCREL stuff added and used instead
> > of @GOT, as I mentioned earlier in the thread, but I agree that's
> > independent of relaxation support and shouldn't block it.
> 
> I don't think that @GOTPCREL for 32-bit is a good idea.  This is the scheme
> that Darwin uses, so we do have some experience with it.
> 
> In order for it to work you've got to have a pointer to a random address in the
> function.  It means that you can only "easily" compute the address once.  If
> you need the value again you wind up with the same "extra" addl insn that we
> have with the current GOT pointer.

Why would you recompute it (this requires a fairly expensive call that
reads or pops its own return address) rather than simply spilling the
already-computed value and reloading it from the stack?

The only example I can think of where it might make sense is when you
don't want to load the address unconditionally because there are
shrink-wrappable code paths that don't need it, but multiple code paths
that do, in which case they would each load different values. Is this
the concern you have in mind?

> We've just started to do inter-function register allocation.  The next step
> along those lines is to share the computation of GOT between multiple
> functions.  At which point it really helps to have one global base address to
> talk about.

I see -- that would be another case where it simplifies things.

Rich


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 19:20                                   ` H.J. Lu
@ 2015-05-19 19:54                                     ` Richard Henderson
  2015-05-19 20:27                                     ` Rich Felker
  1 sibling, 0 replies; 106+ messages in thread
From: Richard Henderson @ 2015-05-19 19:54 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Rich Felker, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On 05/19/2015 12:17 PM, H.J. Lu wrote:
>> But my point is that the only time the compiler should present you with the
>> form of indirect branch you're looking for is when there's no place to hoist
>> the load.
>>
>> At which point, is it really worth adding a new relocation to the ABI?  Is it
>> really worth adding new code to the linker that won't be exercised often?
> 
> I believe there are plenty of indirect branches via GOT when compiling
> PIE/PIC with -fno-plt:
> 
> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> extern void foo (void);
> 
> void
> bar (void)
> {
>   foo ();
> }

Sure, as I said, when there's no place to hoist the load.

Try anything more complicated,

void bar (void)
{
  int i;
  for (i = 0; i < 10; ++i)
    foo ();
}

void baz (void)
{
  foo ();
  foo ();
}

and you'll not see the call *foo@GOTPCREL(%rip) form.

Of course there's also plenty of times where combine recreates exactly that
form when perhaps the scheduler might have preferred otherwise.  Those are
optimization choices to be addressed under separate cover.

My point that we can already do what you want via LTO, without adding new
relocations, is still relevant.
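
To make the hoisting concrete, here is a hedged source-level sketch.  It is
not compiler output; bar_hoisted, calls and call_count are hypothetical names,
and foo is defined locally only so the snippet is self-contained.  Taking the
function's address once outside the loop is roughly analogous to loading
foo@GOTPCREL(%rip) into a register once and then issuing call *%reg in the
loop body.

```c
static int calls;

/* Stand-in for the external function; in the scenario under discussion it
   would live in another DSO and be reached through the GOT. */
void foo(void)
{
    ++calls;
}

/* What the source says: ten separate calls to foo. */
void bar(void)
{
    int i;
    for (i = 0; i < 10; ++i)
        foo();
}

/* Source-level picture of the hoisted form: the address is obtained once
   (analogous to a single GOT load) and every iteration calls through it
   (analogous to an indirect call through a register). */
void bar_hoisted(void)
{
    void (*fp)(void) = foo;
    int i;
    for (i = 0; i < 10; ++i)
        fp();
}

/* Helper so a caller can observe how many times foo ran. */
int call_count(void)
{
    return calls;
}
```

Both functions perform the same ten calls; the only difference is where the
address of foo is obtained, which is the part a link-time relaxation of the
call instruction itself cannot see.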


r~


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 19:48                               ` Rich Felker
@ 2015-05-19 20:16                                 ` Richard Henderson
  0 siblings, 0 replies; 106+ messages in thread
From: Richard Henderson @ 2015-05-19 20:16 UTC (permalink / raw)
  To: Rich Felker
  Cc: Michael Matz, H.J. Lu, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On 05/19/2015 12:35 PM, Rich Felker wrote:
> Why would you recompute it (this requires a fairly expensive call that
> reads or pops its own return address) rather than simply spilling the
> already-computed value and reloading it from the stack?
> 
> The only example I can think of where it might make sense is when you
> don't want to load the address unconditionally because there are
> shrink-wrappable code paths that don't need it, but multiple code paths
> that do, in which case they would each load different values. Is this
> the concern you have in mind?

That too.  I was thinking of exception landing pads, i.e. catches and cleanups,
where in the past we've had to re-compute the GOT address.  Though now that I
think on that more, it wasn't x86 that had that particular landing pad trouble.


r~


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 19:20                                   ` H.J. Lu
  2015-05-19 19:54                                     ` Richard Henderson
@ 2015-05-19 20:27                                     ` Rich Felker
  2015-05-19 20:44                                       ` H.J. Lu
  1 sibling, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-19 20:27 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Richard Henderson, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
> >>>> I'm still mildly worried that concerns for supporting
> >>>> relaxation might lead to decisions not to optimize code in ways that
> >>>> would be difficult to relax (e.g. certain types of address load
> >>>> reordering or hoisting) but I don't understand GCC internals
> >>>> sufficiently to know if this concern is warranted or not.
> >>>
> >>> It is.  The relaxation that HJ is working on requires that the reads from the
> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
> >>> a win.
> >>>
> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
> >>> can't hoist the load and hide the memory latency.
> >>>
> >>
> >> My relax approach won't take away any optimization done by compiler.
> >> It simply turns indirect branch into direct branch with a nop prefix at
> >> link-time.  I am having a hard time to understand why we shouldn't do it.
> >
> > I well understand what you're doing.
> >
> > But my point is that the only time the compiler should present you with the
> > form of indirect branch you're looking for is when there's no place to hoist
> > the load.
> >
> > At which point, is it really worth adding a new relocation to the ABI?  Is it
> > really worth adding new code to the linker that won't be exercised often?
> 
> I believe there are plenty of indirect branches via GOT when compiling
> PIE/PIC with -fno-plt:
> 
> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> extern void foo (void);
> 
> void
> bar (void)
> {
>   foo ();
> }
> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
> [hjl@gnu-6 gcc]$ cat x.s
> .file "x.c"
> .section .text.unlikely,"ax",@progbits
> .LCOLDB0:
> .text
> .LHOTB0:
> .p2align 4,,15
> .globl bar
> .type bar, @function
> bar:
> .LFB0:
> .cfi_startproc
> jmp *foo@GOTPCREL(%rip)
> .cfi_endproc
> .LFE0:
> .size bar, .-bar

I agree these exist. What I question is whether the savings from the
linker being able to relax this to a direct call in the case where the
programmer failed to let the compiler make it a direct call to begin
with (by using hidden or protected visibility) are worth the cost of
not being able to hoist the load out of loops or schedule it earlier
in cases where relaxation is not possible because the call target is
not defined in the same DSO.
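
As a sketch of the annotation I mean (helper and api_entry are hypothetical
names, and this assumes a compiler supporting the GCC-style visibility
attribute): declaring a function with hidden visibility tells the compiler the
symbol binds within the DSO, so even under -fPIC it can emit a direct call
with no GOT or PLT indirection and no need for linker relaxation.

```c
/* Hidden visibility: the symbol is not exported from the DSO and cannot be
   interposed, so calls to it can be direct even in PIC code. */
__attribute__((visibility("hidden"))) int helper(int x);

int helper(int x)
{
    return x + 1;
}

/* Exported entry point (default visibility) that calls the hidden helper. */
int api_entry(int x)
{
    return helper(x) * 2;
}
```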

Rich


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 20:27                                     ` Rich Felker
@ 2015-05-19 20:44                                       ` H.J. Lu
  2015-05-19 21:28                                         ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-19 20:44 UTC (permalink / raw)
  To: Rich Felker
  Cc: Richard Henderson, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote:
> On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
>> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
>> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
>> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>> >>>> I'm still mildly worried that concerns for supporting
>> >>>> relaxation might lead to decisions not to optimize code in ways that
>> >>>> would be difficult to relax (e.g. certain types of address load
>> >>>> reordering or hoisting) but I don't understand GCC internals
>> >>>> sufficiently to know if this concern is warranted or not.
>> >>>
>> >>> It is.  The relaxation that HJ is working on requires that the reads from the
>> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
>> >>> a win.
>> >>>
>> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
>> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
>> >>> can't hoist the load and hide the memory latency.
>> >>>
>> >>
>> >> My relax approach won't take away any optimization done by compiler.
>> >> It simply turns indirect branch into direct branch with a nop prefix at
>> >> link-time.  I am having a hard time to understand why we shouldn't do it.
>> >
>> > I well understand what you're doing.
>> >
>> > But my point is that the only time the compiler should present you with the
>> > form of indirect branch you're looking for is when there's no place to hoist
>> > the load.
>> >
>> > At which point, is it really worth adding a new relocation to the ABI?  Is it
>> > really worth adding new code to the linker that won't be exercised often?
>>
>> I believe there are plenty of indirect branches via GOT when compiling
>> PIE/PIC with -fno-plt:
>>
>> [hjl@gnu-6 gcc]$ cat /tmp/x.c
>> extern void foo (void);
>>
>> void
>> bar (void)
>> {
>>   foo ();
>> }
>> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
>> [hjl@gnu-6 gcc]$ cat x.s
>> .file "x.c"
>> .section .text.unlikely,"ax",@progbits
>> .LCOLDB0:
>> .text
>> .LHOTB0:
>> .p2align 4,,15
>> .globl bar
>> .type bar, @function
>> bar:
>> .LFB0:
>> .cfi_startproc
>> jmp *foo@GOTPCREL(%rip)
>> .cfi_endproc
>> .LFE0:
>> .size bar, .-bar
>
> I agree these exist. What I question is whether the savings from the
> linker being able to relax this to a direct call in the case where the
> programmer failed to let the compiler make it a direct call to begin
> with (by using hidden or protected visibility) are worth the cost of
> not being able to hoist the load out of loops or schedule it earlier
> in cases where relaxation is not possible because the call target is
> not defined in the same DSO.

Just for fun.  I compiled binutils as PIE with -fno-plt -flto:

[hjl@gnu-mic-2 gas]$ file as-new
as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
stripped
[hjl@gnu-mic-2 gas]$

There are 43 indirect jumps through the GOT of the form:

ff 25 21 93 2d 00     jmpq   *0x2d9321(%rip)        # 3d5f58 <_DYNAMIC+0x1e8>

and 1983 indirect calls of the form:

ff 15 eb f4 38 00     callq  *0x38f4eb(%rip)        # 3d60e0 <_DYNAMIC+0x370>

-- 
H.J.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 20:44                                       ` H.J. Lu
@ 2015-05-19 21:28                                         ` Rich Felker
  2015-05-20  0:52                                           ` H.J. Lu
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-19 21:28 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Richard Henderson, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
> On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote:
> > On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
> >> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
> >> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
> >> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> >> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
> >> >>>> I'm still mildly worried that concerns for supporting
> >> >>>> relaxation might lead to decisions not to optimize code in ways that
> >> >>>> would be difficult to relax (e.g. certain types of address load
> >> >>>> reordering or hoisting) but I don't understand GCC internals
> >> >>>> sufficiently to know if this concern is warranted or not.
> >> >>>
> >> >>> It is.  The relaxation that HJ is working on requires that the reads from the
> >> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
> >> >>> a win.
> >> >>>
> >> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
> >> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
> >> >>> can't hoist the load and hide the memory latency.
> >> >>>
> >> >>
> >> >> My relax approach won't take away any optimization done by compiler.
> >> >> It simply turns indirect branch into direct branch with a nop prefix at
> >> >> link-time.  I am having a hard time to understand why we shouldn't do it.
> >> >
> >> > I well understand what you're doing.
> >> >
> >> > But my point is that the only time the compiler should present you with the
> >> > form of indirect branch you're looking for is when there's no place to hoist
> >> > the load.
> >> >
> >> > At which point, is it really worth adding a new relocation to the ABI?  Is it
> >> > really worth adding new code to the linker that won't be exercised often?
> >>
> >> I believe there are plenty of indirect branches via GOT when compiling
> >> PIE/PIC with -fno-plt:
> >>
> >> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> >> extern void foo (void);
> >>
> >> void
> >> bar (void)
> >> {
> >>   foo ();
> >> }
> >> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
> >> [hjl@gnu-6 gcc]$ cat x.s
> >> .file "x.c"
> >> .section .text.unlikely,"ax",@progbits
> >> .LCOLDB0:
> >> .text
> >> .LHOTB0:
> >> .p2align 4,,15
> >> .globl bar
> >> .type bar, @function
> >> bar:
> >> .LFB0:
> >> .cfi_startproc
> >> jmp *foo@GOTPCREL(%rip)
> >> .cfi_endproc
> >> .LFE0:
> >> .size bar, .-bar
> >
> > I agree these exist. What I question is whether the savings from the
> > linker being able to relax this to a direct call in the case where the
> > programmer failed to let the compiler make it a direct call to begin
> > with (by using hidden or protected visibility) are worth the cost of
> > not being able to hoist the load out of loops or schedule it earlier
> > in cases where relaxation is not possible because the call target is
> > not defined in the same DSO.
> 
> Just for fun.  I compiled binutils as PIE with -fno-plt -flto:
> 
> [hjl@gnu-mic-2 gas]$ file as-new
> as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
> dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
> stripped
> [hjl@gnu-mic-2 gas]$
> 
> There are 43:
> 
> ff 25 21 93 2d 00     jmpq   *0x2d9321(%rip)        # 3d5f58 <_DYNAMIC+0x1e8>
> 
> and 1983
> 
> ff 15 eb f4 38 00     callq  *0x38f4eb(%rip)        # 3d60e0 <_DYNAMIC+0x370>

How many of those would be relaxed? I suspect it depends a lot on
whether libbfd is static or shared.

Rich


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 21:28                                         ` Rich Felker
@ 2015-05-20  0:52                                           ` H.J. Lu
  2015-05-20  1:09                                             ` Rich Felker
  0 siblings, 1 reply; 106+ messages in thread
From: H.J. Lu @ 2015-05-20  0:52 UTC (permalink / raw)
  To: Rich Felker
  Cc: Richard Henderson, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 1:54 PM, Rich Felker <dalias@libc.org> wrote:
> On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
>> On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote:
>> > On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
>> >> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
>> >> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
>> >> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>> >> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>> >> >>>> I'm still mildly worried that concerns for supporting
>> >> >>>> relaxation might lead to decisions not to optimize code in ways that
>> >> >>>> would be difficult to relax (e.g. certain types of address load
>> >> >>>> reordering or hoisting) but I don't understand GCC internals
>> >> >>>> sufficiently to know if this concern is warranted or not.
>> >> >>>
>> >> >>> It is.  The relaxation that HJ is working on requires that the reads from the
>> >> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
>> >> >>> a win.
>> >> >>>
>> >> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
>> >> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
>> >> >>> can't hoist the load and hide the memory latency.
>> >> >>>
>> >> >>
>> >> >> My relax approach won't take away any optimization done by compiler.
>> >> >> It simply turns indirect branch into direct branch with a nop prefix at
>> >> >> link-time.  I am having a hard time to understand why we shouldn't do it.
>> >> >
>> >> > I well understand what you're doing.
>> >> >
>> >> > But my point is that the only time the compiler should present you with the
>> >> > form of indirect branch you're looking for is when there's no place to hoist
>> >> > the load.
>> >> >
>> >> > At which point, is it really worth adding a new relocation to the ABI?  Is it
>> >> > really worth adding new code to the linker that won't be exercised often?
>> >>
>> >> I believe there are plenty of indirect branches via GOT when compiling
>> >> PIE/PIC with -fno-plt:
>> >>
>> >> [hjl@gnu-6 gcc]$ cat /tmp/x.c
>> >> extern void foo (void);
>> >>
>> >> void
>> >> bar (void)
>> >> {
>> >>   foo ();
>> >> }
>> >> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
>> >> [hjl@gnu-6 gcc]$ cat x.s
>> >> .file "x.c"
>> >> .section .text.unlikely,"ax",@progbits
>> >> .LCOLDB0:
>> >> .text
>> >> .LHOTB0:
>> >> .p2align 4,,15
>> >> .globl bar
>> >> .type bar, @function
>> >> bar:
>> >> .LFB0:
>> >> .cfi_startproc
>> >> jmp *foo@GOTPCREL(%rip)
>> >> .cfi_endproc
>> >> .LFE0:
>> >> .size bar, .-bar
>> >
>> > I agree these exist. What I question is whether the savings from the
>> > linker being able to relax this to a direct call in the case where the
>> > programmer failed to let the compiler make it a direct call to begin
>> > with (by using hidden or protected visibility) are worth the cost of
>> > not being able to hoist the load out of loops or schedule it earlier
>> > in cases where relaxation is not possible because the call target is
>> > not defined in the same DSO.
>>
>> Just for fun.  I compiled binutils as PIE with -fno-plt -flto:
>>
>> [hjl@gnu-mic-2 gas]$ file as-new
>> as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
>> dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
>> stripped
>> [hjl@gnu-mic-2 gas]$
>>
>> There are 43:
>>
>> ff 25 21 93 2d 00     jmpq   *0x2d9321(%rip)        # 3d5f58 <_DYNAMIC+0x1e8>
>>
>> and 1983
>>
>> ff 15 eb f4 38 00     callq  *0x38f4eb(%rip)        # 3d60e0 <_DYNAMIC+0x370>
>
> How many of those would be relaxed? I suspect it depends a lot on
> whether libbfd is static or shared.

When shared libraries are enabled, there are 177 indirect branches
to locally defined functions.  Any call to a locally defined function
that isn't compiled with LTO is indirect.
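
For comparison, the compile-time alternative mentioned earlier in the
thread (hidden visibility) looks roughly like the sketch below.  It assumes
GCC-style attributes; local_helper and caller are hypothetical names, not
functions from binutils.

```c
/* Hypothetical example: mark a DSO-local function hidden so that a
   -fPIC -fno-plt compile can call it directly instead of via the GOT. */
__attribute__((visibility("hidden"))) int local_helper(int x);

int local_helper(int x)
{
    return x + 1;
}

int caller(int x)
{
    /* With hidden visibility the callee cannot be interposed from another
       DSO, so the compiler is free to emit a direct call here. */
    return local_helper(x) * 2;
}
```

Whether a given call actually becomes direct still depends on the compiler
version and flags, so this is only an illustration of the idea.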

-- 
H.J.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-20  0:52                                           ` H.J. Lu
@ 2015-05-20  1:09                                             ` Rich Felker
  2015-05-22 19:32                                               ` Richard Henderson
  0 siblings, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-20  1:09 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Richard Henderson, Michael Matz, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Tue, May 19, 2015 at 05:10:11PM -0700, H.J. Lu wrote:
> On Tue, May 19, 2015 at 1:54 PM, Rich Felker <dalias@libc.org> wrote:
> > On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
> >> On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote:
> >> > On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
> >> >> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
> >> >> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
> >> >> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> >> >> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
> >> >> >>>> I'm still mildly worried that concerns for supporting
> >> >> >>>> relaxation might lead to decisions not to optimize code in ways that
> >> >> >>>> would be difficult to relax (e.g. certain types of address load
> >> >> >>>> reordering or hoisting) but I don't understand GCC internals
> >> >> >>>> sufficiently to know if this concern is warranted or not.
> >> >> >>>
> >> >> >>> It is.  The relaxation that HJ is working on requires that the reads from the
> >> >> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
> >> >> >>> a win.
> >> >> >>>
> >> >> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
> >> >> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
> >> >> >>> can't hoist the load and hide the memory latency.
> >> >> >>>
> >> >> >>
> >> >> >> My relax approach won't take away any optimization done by compiler.
> >> >> >> It simply turns indirect branch into direct branch with a nop prefix at
> >> >> >> link-time.  I am having a hard time to understand why we shouldn't do it.
> >> >> >
> >> >> > I well understand what you're doing.
> >> >> >
> >> >> > But my point is that the only time the compiler should present you with the
> >> >> > form of indirect branch you're looking for is when there's no place to hoist
> >> >> > the load.
> >> >> >
> >> >> > At which point, is it really worth adding a new relocation to the ABI?  Is it
> >> >> > really worth adding new code to the linker that won't be exercised often?
> >> >>
> >> >> I believe there are plenty of indirect branches via GOT when compiling
> >> >> PIE/PIC with -fno-plt:
> >> >>
> >> >> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> >> >> extern void foo (void);
> >> >>
> >> >> void
> >> >> bar (void)
> >> >> {
> >> >>   foo ();
> >> >> }
> >> >> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
> >> >> [hjl@gnu-6 gcc]$ cat x.s
> >> >> .file "x.c"
> >> >> .section .text.unlikely,"ax",@progbits
> >> >> .LCOLDB0:
> >> >> .text
> >> >> .LHOTB0:
> >> >> .p2align 4,,15
> >> >> .globl bar
> >> >> .type bar, @function
> >> >> bar:
> >> >> .LFB0:
> >> >> .cfi_startproc
> >> >> jmp *foo@GOTPCREL(%rip)
> >> >> .cfi_endproc
> >> >> .LFE0:
> >> >> .size bar, .-bar
> >> >
> >> > I agree these exist. What I question is whether the savings from the
> >> > linker being able to relax this to a direct call in the case where the
> >> > programmer failed to let the compiler make it a direct call to begin
> >> > with (by using hidden or protected visibility) are worth the cost of
> >> > not being able to hoist the load out of loops or schedule it earlier
> >> > in cases where relaxation is not possible because the call target is
> >> > not defined in the same DSO.
> >>
> >> Just for fun.  I compiled binutils as PIE with -fno-plt -flto:
> >>
> >> [hjl@gnu-mic-2 gas]$ file as-new
> >> as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
> >> dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
> >> stripped
> >> [hjl@gnu-mic-2 gas]$
> >>
> >> There are 43:
> >>
> >> ff 25 21 93 2d 00     jmpq   *0x2d9321(%rip)        # 3d5f58 <_DYNAMIC+0x1e8>
> >>
> >> and 1983
> >>
> >> ff 15 eb f4 38 00     callq  *0x38f4eb(%rip)        # 3d60e0 <_DYNAMIC+0x370>
> >
> > How many of those would be relaxed? I suspect it depends a lot on
> > whether libbfd is static or shared.
> 
> When shared libraries are enabled, there are 177 indirect branches
> to locally defined functions.  Call to any locally defined functions,
> which aren't compiled with LTO, is indirect.

And are the above indirect calls/jumps (1983+43) candidates for
scheduling/hoisting the address load (that's not being done yet), or
are they the ones the compiler opted not to schedule/hoist? The win
from relaxation seems small here, but as long as you're not going to
block optimizations that would preclude relaxing, I don't see any
disadvantages to doing it.

Rich


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-19 19:03                             ` Richard Henderson
  2015-05-19 19:10                               ` H.J. Lu
  2015-05-19 19:48                               ` Rich Felker
@ 2015-05-20 12:13                               ` Michael Matz
  2015-05-20 12:40                                 ` H.J. Lu
  2015-05-20 14:17                                 ` Rich Felker
  2 siblings, 2 replies; 106+ messages in thread
From: Michael Matz @ 2015-05-20 12:13 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Rich Felker, H.J. Lu, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

Hi,

On Tue, 19 May 2015, Richard Henderson wrote:

> It is.  The relaxation that HJ is working on requires that the reads 
> from the got not be hoisted.  I'm not especially convinced that what 
> he's working on is a win.
> 
> With LTO, the compiler can do the same job that he's attempting in the 
> linker, without an extra nop.  Without LTO, leaving it to the linker 
> means that you can't hoist the load and hide the memory latency.

Well, hoisting always needs a register, and if hoisted out of a loop 
(which you all seem to be after) that register is live through the whole 
loop body.  You need a register for each different called function in such 
loop, trading the one GOT pointer with N other registers.  For 
register-starved machines this is a real problem, even x86-64 doesn't have 
that many.  I.e. I'm not convinced that this hoisting will really be much 
of a win that often, outside toy examples.  Sure, the compiler can hoist 
function addresses trivially, but I think it will lead to spilling more 
often than not, or alternatively the hoisting will be undone by the 
register allocator's rematerialization.  Of course, this would have to be 
measured for real not hand-waved, but, well, I'd be surprised if it's not 
so.
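
To make the trade-off concrete, here is a minimal C sketch of the two
shapes being discussed.  The table fake_got is only a stand-in for GOT
slots, and add_one, twice, sum_reload and sum_hoisted are hypothetical
names; nothing here is taken from GCC itself.

```c
#include <stddef.h>

typedef int (*step_fn)(int);

/* Hypothetical callees standing in for functions from another DSO. */
static int add_one(int x) { return x + 1; }
static int twice(int x)   { return x * 2; }

/* Stand-in for two GOT slots.  The volatile qualifier forces a reload of
   the slot on every access, like call *foo@GOTPCREL(%rip) does. */
static step_fn volatile fake_got[2] = { add_one, twice };

/* Not hoisted: each call reloads its target address from memory. */
int sum_reload(const int *in, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += fake_got[1](fake_got[0](in[i]));
    return acc;
}

/* Hoisted: the addresses are loaded once, but f and g are now live across
   the whole loop body, one value per distinct callee. */
int sum_hoisted(const int *in, size_t n)
{
    step_fn f = fake_got[0];
    step_fn g = fake_got[1];
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += g(f(in[i]));
    return acc;
}
```

With N distinct callees in the loop, the hoisted form needs N such values
kept live, which is the register cost described above; whether they stay in
registers or get spilled is up to the register allocator.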

Ciao,
Michael.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-20 12:13                               ` Michael Matz
@ 2015-05-20 12:40                                 ` H.J. Lu
  2015-05-20 14:17                                 ` Rich Felker
  1 sibling, 0 replies; 106+ messages in thread
From: H.J. Lu @ 2015-05-20 12:40 UTC (permalink / raw)
  To: Michael Matz
  Cc: Richard Henderson, Rich Felker, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Wed, May 20, 2015 at 5:10 AM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Tue, 19 May 2015, Richard Henderson wrote:
>
>> It is.  The relaxation that HJ is working on requires that the reads
>> from the got not be hoisted.  I'm not especially convinced that what
>> he's working on is a win.
>>
>> With LTO, the compiler can do the same job that he's attempting in the
>> linker, without an extra nop.  Without LTO, leaving it to the linker
>> means that you can't hoist the load and hide the memory latency.
>
> Well, hoisting always needs a register, and if hoisted out of a loop
> (which you all seem to be after) that register is live through the whole
> loop body.  You need a register for each different called function in such
> loop, trading the one GOT pointer with N other registers.  For
> register-starved machines this is a real problem, even x86-64 doesn't have
> that many.  I.e. I'm not convinced that this hoisting will really be much
> of a win that often, outside toy examples.  Sure, the compiler can hoist
> function addresses trivially, but I think it will lead to spilling more
> often than not, or alternatively the hoisting will be undone by the
> register allocators rematerialization.  Of course, this would have to be
> measured for real not hand-waved, but, well, I'd be surprised if it's not
> so.
>

We should replace "call/jmp *foo@GOTPCREL(%rip)" with
"call/jmp *foo@GOTRELAX(%rip)".  As an option, we can apply
-fno-plt to both PIC and non-PIC code if foo is externally defined.
It will save one indirect branch if GCC is right.  If GCC is wrong
and foo is defined locally, we get a nop prefix/suffix.  We have
nothing to lose.

-- 
H.J.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-20 12:13                               ` Michael Matz
  2015-05-20 12:40                                 ` H.J. Lu
@ 2015-05-20 14:17                                 ` Rich Felker
  2015-05-20 14:33                                   ` Michael Matz
  1 sibling, 1 reply; 106+ messages in thread
From: Rich Felker @ 2015-05-20 14:17 UTC (permalink / raw)
  To: Michael Matz
  Cc: Richard Henderson, H.J. Lu, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

On Wed, May 20, 2015 at 02:10:41PM +0200, Michael Matz wrote:
> Hi,
> 
> On Tue, 19 May 2015, Richard Henderson wrote:
> 
> > It is.  The relaxation that HJ is working on requires that the reads 
> > from the got not be hoisted.  I'm not especially convinced that what 
> > he's working on is a win.
> > 
> > With LTO, the compiler can do the same job that he's attempting in the 
> > linker, without an extra nop.  Without LTO, leaving it to the linker 
> > means that you can't hoist the load and hide the memory latency.
> 
> Well, hoisting always needs a register, and if hoisted out of a loop 
> (which you all seem to be after) that register is live through the whole 
> loop body.  You need a register for each different called function in such 
> loop, trading the one GOT pointer with N other registers.  For 
> register-starved machines this is a real problem, even x86-64 doesn't have 
> that many.  I.e. I'm not convinced that this hoisting will really be much 
> of a win that often, outside toy examples.  Sure, the compiler can hoist 
> function addresses trivially, but I think it will lead to spilling more 
> often than not, or alternatively the hoisting will be undone by the 
> register allocators rematerialization.  Of course, this would have to be 
> measured for real not hand-waved, but, well, I'd be surprised if it's not 
> so.

The obvious example where it's useful on x86_64 is a major class:
anything where the majority of the callee's data is floating point and
thus kept in xmm registers. In that case register pressure is a lot
lower, and there's also an obvious class of cross-DSO functions calls
you'd be making over and over again: anything from libm.

Rich


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-20 14:17                                 ` Rich Felker
@ 2015-05-20 14:33                                   ` Michael Matz
  0 siblings, 0 replies; 106+ messages in thread
From: Michael Matz @ 2015-05-20 14:33 UTC (permalink / raw)
  To: Rich Felker
  Cc: Richard Henderson, H.J. Lu, Jan Hubicka, Alexander Monakov,
	GCC Patches, Uros Bizjak

Hi,

On Wed, 20 May 2015, Rich Felker wrote:

> > of a win that often, outside toy examples.  Sure, the compiler can hoist 
> > function addresses trivially, but I think it will lead to spilling more 
> > often than not, or alternatively the hoisting will be undone by the 
> > register allocators rematerialization.  Of course, this would have to be 
> > measured for real not hand-waved, but, well, I'd be surprised if it's not 
> > so.
> 
> The obvious example where it's useful on x86_64 is a major class: 

Yes, I can construct all kinds of examples where it's useful.  That 
doesn't touch the topic of real-world cases or hard numbers actually 
comparing the number of hoisted callee addresses, the number that stay 
hoisted until after register allocation and the number of spills added by 
hoisting, on some relevant code base, like gcc itself, or SPEC.

> anything where the majority of the callee's data is floating point and 
> thus kept in xmm registers.

This code tends to work on multiple arrays in practice, and hence integer 
registers are required for all the addresses and offsets and loop 
counters.

> In that case register pressure is a lot lower,

Register pressure on x86 is never low :)  Yes, x86-64 and others are much 
better in this regard.

> and there's also an obvious class of cross-DSO functions calls you'd be 
> making over and over again: anything from libm.

Ciao,
Michael.


* Re: [PATCH i386] Allow sibcalls in no-PLT PIC
  2015-05-20  1:09                                             ` Rich Felker
@ 2015-05-22 19:32                                               ` Richard Henderson
  0 siblings, 0 replies; 106+ messages in thread
From: Richard Henderson @ 2015-05-22 19:32 UTC (permalink / raw)
  To: Rich Felker, H.J. Lu
  Cc: Michael Matz, Jan Hubicka, Alexander Monakov, GCC Patches, Uros Bizjak

On 05/19/2015 06:06 PM, Rich Felker wrote:
> And are the above indirect calls/jumps (1983+43) candidates for
> scheduling/hoisting the address load (that's not being done yet), or
> are they the ones the compiler opted not to schedule/hoist? The win
> from relaxation seems small here, but as long as you're not going to
> block optimizations that would preclude relaxing, I don't see any
> disadvantages to doing it.

FWIW, I bootstrapped gcc with lto and -fpie -fno-plt:

	total calls	252436
	total indirect	21198	(8.4%)
	via got		10128	(4.0% / 48%)
	via reg		9007	(3.6% / 42%)
	via data	2063	(0.8% / 10%)
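
As a sanity check on the table, the percentages can be recomputed from the
raw counts.  A small sketch follows; pct and print_breakdown are
hypothetical helper names, and the counts are copied from the table.

```c
#include <stdio.h>

/* Counts copied from the table above. */
enum {
    TOTAL_CALLS    = 252436,
    TOTAL_INDIRECT = 21198,
    VIA_GOT        = 10128,
    VIA_REG        = 9007,
    VIA_DATA       = 2063
};

/* Percentage that part represents of whole. */
static double pct(long part, long whole)
{
    return 100.0 * (double)part / (double)whole;
}

/* Print each row as (share of all calls / share of indirect calls). */
void print_breakdown(void)
{
    printf("total indirect  %.1f%%\n", pct(TOTAL_INDIRECT, TOTAL_CALLS));
    printf("via got         %.1f%% / %.0f%%\n",
           pct(VIA_GOT, TOTAL_CALLS), pct(VIA_GOT, TOTAL_INDIRECT));
    printf("via reg         %.1f%% / %.0f%%\n",
           pct(VIA_REG, TOTAL_CALLS), pct(VIA_REG, TOTAL_INDIRECT));
    printf("via data        %.1f%% / %.0f%%\n",
           pct(VIA_DATA, TOTAL_CALLS), pct(VIA_DATA, TOTAL_INDIRECT));
}
```

The three indirect categories add up to the 21198 total, so in each pair
the first figure is relative to all calls and the second to indirect calls
only.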

Those via data are things like

        callq  *0x145fdc4(%rip) # 19c0ea8 <lang_hooks+0x1e8>
        callq  *0x14517cc(%rip) # 19c0388 <targetm+0x328>

where we have a call to a hook at a known address.

Those via reg (or complex address) are also self explanatory -- we have all
sorts of hooks and indirection inside gcc, so this is unsurprising.  That said,
the very first one I examined,

000000000056735e <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334>:
  ...
  56736f: mov    0x144f6f2(%rip),%r13        # 19b6a68 <_DYNAMIC+0x928>
  ...
  567380: sub    $0x18,%r12
  567384: test   %ebx,%ebx
  567386: js     567394 <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334+0x36>
  567388: mov    0x28(%rbp,%r12,1),%rdi
  56738d: dec    %ebx
  56738f: callq  *%r13
  567392: jmp    567380 <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334+0x22>
  ...

does in fact hoist the address of "free" out of the loop.


Those via got can be identified by comparing the address against readelf -r to
examine the dynamic relocations.  There are plenty of truly non-local calls,
e.g. to libc.  These obviously cannot be relaxed.

Of those 10128 calls via the got, I found EXACTLY ONE that was local, to

  _Z22const_0_to_255_operandP7rtx_def12machine_mode

from

  _ZL19ix86_expand_builtinP9tree_nodeP7rtx_defS2_12machine_modei.lto_priv.2163

This is certain to be a bug, though I don't know where.  There are plenty of
other calls to const_0_to_255_operand elsewhere, and they are all, as expected,
direct.  This will likely take significant detective work...



r~


* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-05-04 16:38 ` [PATCH] Expand PIC calls without PLT with -fno-plt Alexander Monakov
  2015-05-04 17:34   ` Jeff Law
  2015-05-10 16:59   ` Jan Hubicka
@ 2015-06-22 15:52   ` Jiong Wang
  2015-06-22 18:18     ` Alexander Monakov
  2 siblings, 1 reply; 106+ messages in thread
From: Jiong Wang @ 2015-06-22 15:52 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches, Rich Felker

On 04/05/15 17:37, Alexander Monakov wrote:
> This patch introduces option -fno-plt that allows to expand calls that would
> go via PLT to load the address of the function immediately at call site (which
> introduces a GOT load).  Cover letter explains the motivation for this patch.
>
> New option documentation for invoke.texi is missing from the patch; if this is
> accepted I'll be happy to send a v2 with documentation added.
>
> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> 	indirect call by forcing address into a pseudo with -fno-plt.
> 	* common.opt (flag_plt): New option.

I did a quick experiment: -fno-plt doesn't work on AArch64.

Although this patch forces the function address into a register, the
combine pass, which runs later, combines it back, because AArch64 defines
such an insn pattern.

On x86 it is not combined back.  From the RTL dump, that is because the RTL
pre pass has moved the address load instruction into another basic block,
and the combine pass doesn't combine across basic blocks.  Also, the x86
backend checks flag_plt in the newly added ix86_nopic_noplt_attribute_p,
which could help generate correct insns.

The fix I can think of for AArch64 is to allow the call-symbol pattern only
under "flag_plt == true", so that a call via register can't be combined
into a direct call to the symbol.

Or would it be better to prohibit this kind of combining in the combine
pass?  A generic fix in combine may fix other broken targets too.

Thoughts?

Regards,
Jiong


* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-06-22 15:52   ` Jiong Wang
@ 2015-06-22 18:18     ` Alexander Monakov
  2015-06-23  8:41       ` Ramana Radhakrishnan
  0 siblings, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-06-22 18:18 UTC (permalink / raw)
  To: Jiong Wang; +Cc: gcc-patches, Rich Felker, Dmitry Melnik, Eugene Kudryashov

On Mon, 22 Jun 2015, Jiong Wang wrote:
> Have done a quick experiment, -fno-plt doesn't work on AArch64.
> 
> it's because although this patch force the function address into register,
> but the combine pass runs later combine it back as AArch64 have defined such
> insn pattern.
> 
> For X86, it's not combined back. From the rtl dump, it's because the rtl pre
> pass has moved the address load instruction into another basic block and
> combine pass don't combine across basic blocks. Also, x86 backend has done
> some check on flag_plt in the new added ix86_nopic_noplt_attribute_p which
> could help generate correct insns.
> 
> What I can think of the fix on AArch64 is by restricting the call symbol
> under "flag_plt == true" only, so that call via register can't be combined
> into call symbol direct,
> 
> Or better to prohibit combine pass for such combining? as the generic fix on
> combine may fix other broken targets.

My colleagues at ISP RAS (CC'ed) have been looking at arm (and aarch64) no-plt
codegen.  We also saw the problem with the combine pass you describe.  I think
your description of why it's not observed on x86 is incorrect; the newly added
ix86_nopic_noplt_attribute_p should not have anything to do with that.  It's
just that the GOT load insn has a REG_EQUAL note, and the combine pass can use
it to replace the register in the indirect branch, producing a direct branch
to a symbol (i.e. a PLT jump).

Actually we are not hitting the same problem on x86 by pure luck.  Early RTL
passes manage to lose the REG_EQUAL note, so by the time combine runs, the
register annotation is lost.  It's possible to reproduce the arm/aarch64
problem on x86 with -fno-gcse and the following hack:

diff --git a/gcc/cse.c b/gcc/cse.c
index 2a33827..88cff96 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -6634,6 +6634,9 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
   int *rc_order = XNEWVEC (int, last_basic_block_for_fn (cfun));
   int i, n_blocks;

+  if (!flag_gcse)
+    return 0;
+
   df_set_flags (DF_LR_RUN_DCE);
   df_note_add_problem ();
   df_analyze ();

Regarding fixing the issue, I also think that combine pass might be a better
place (than the backends).  I'd appreciate comments from maintainers.

If you try disabling the REG_EQUAL note generation [*], you'll probably find a
performance regression on arm32 (and probably on aarch64 as well? we only
tried arm32 so far).  The main reason for that is that GCC emits pretty bad
code for a GOT load.  Instead of using two add instructions and one ldr for
the GOT slot access, like the PLT stubs do, it uses three(!) ldr instructions
and one add.  The first ldr is for loading the GOT address, and the second is
for the offset of the GOT slot.  As I understand, to fix that, GCC has to
learn to use the GOT_PREL relocation type.

[*] To do that, we hacked arm legitimize_pic_address not to emit the REG_EQUAL
note under !flag_plt.

Alexander


* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-06-22 18:18     ` Alexander Monakov
@ 2015-06-23  8:41       ` Ramana Radhakrishnan
  2015-06-23 10:43         ` Alexander Monakov
  2015-06-23 13:28         ` Jeff Law
  0 siblings, 2 replies; 106+ messages in thread
From: Ramana Radhakrishnan @ 2015-06-23  8:41 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Jiong Wang, gcc-patches, Rich Felker, Dmitry Melnik, Eugene Kudryashov

On Mon, Jun 22, 2015 at 7:11 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Mon, 22 Jun 2015, Jiong Wang wrote:
>> Have done a quick experiment, -fno-plt doesn't work on AArch64.
>>
>> it's because although this patch force the function address into register,
>> but the combine pass runs later combine it back as AArch64 have defined such
>> insn pattern.
>>
>> For X86, it's not combined back. From the rtl dump, it's because the rtl pre
>> pass has moved the address load instruction into another basic block and
>> combine pass don't combine across basic blocks. Also, x86 backend has done
>> some check on flag_plt in the new added ix86_nopic_noplt_attribute_p which
>> could help generate correct insns.
>>
>> What I can think of the fix on AArch64 is by restricting the call symbol
>> under "flag_plt == true" only, so that call via register can't be combined
>> into call symbol direct,
>>
>> Or better to prohibit combine pass for such combining? as the generic fix on
>> combine may fix other broken targets.
>
> My colleagues at ISP RAS (CC'ed) have been looking on arm (and aarch64) no-plt
> codegen.  We also saw the problem with the combine pass you describe.  I think
> your description of why it's not observed on x86 is incorrect; the newly added
> ix86_nopic_noplt_attribute_p should not have anything to do with that.  It's
> just that the GOT load insn has a REG_EQUAL note, and the combine pass can use
> it to replace the register in the indirect branch, producing a direct branch
> to a symbol (i.e. a PLT jump).


>
> Actually we are not hitting the same problem on x86 by pure luck.  Early RTL
> passes manage to lose the REG_EQUAL note, so by the time combine runs, the
> register annotation is lost.  It's possible to reproduce the arm/aarch64
> problem on x86 with -fno-gcse and the following hack:
>
> diff --git a/gcc/cse.c b/gcc/cse.c
> index 2a33827..88cff96 100644
> --- a/gcc/cse.c
> +++ b/gcc/cse.c
> @@ -6634,6 +6634,9 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
>    int *rc_order = XNEWVEC (int, last_basic_block_for_fn (cfun));
>    int i, n_blocks;
>
> +  if (!flag_gcse)
> +    return 0;
> +
>    df_set_flags (DF_LR_RUN_DCE);
>    df_note_add_problem ();
>    df_analyze ();
>
> Regarding fixing the issue, I also think that combine pass might be a better
> place (than the backends).  I'd appreciate comments from maintainers.
>
>

Note that on AArch64 the GOT slot can be accessed with a single PC relative
instruction followed by a load, thus I don't expect there to be any more
work to be done in the AArch64 backend other than massaging this into
an indirect call in the "call" related patterns.

So you'd get something like

adrp x0, :got:a
ldr x0, [x0, :got_lo12:a]
blr x0

and in the tiny model

ldr x0, :got:a
blr x0

if your elf module is small enough.
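
As a rough C-level illustration of what those sequences do (this is not GCC
output, and the names a_impl, got_slot_a and call_a_noplt are made up for the
sketch), a no-PLT call amounts to loading the callee's address from a table
slot and then calling through a register:

```c
/* Stand-in for an external function that would normally live in
   another shared object.  */
static int
a_impl (void)
{
  return 42;
}

/* Hypothetical model of the GOT slot for `a'.  In a real link the
   dynamic loader fills this slot in; here it is initialised by hand.  */
static int (*got_slot_a) (void) = a_impl;

/* Models the no-PLT call: fetch the address from the slot (the
   adrp/ldr pair above), then branch to it through a register (blr).  */
int
call_a_noplt (void)
{
  int (*target) (void) = got_slot_a;
  return target ();
}
```

The point of the sketch is only that the call site itself performs the load,
instead of jumping to a PLT stub that performs it.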

> If you try disabling the REG_EQUAL note generation [*], you'll probably find a
> performance regression on arm32 (and probably on aarch64 as well?
> we only

IMHO disabling the REG_EQUAL note generation is the wrong way to go about this.

> tried arm32 so far).  The main reason for that is that GCC emits pretty bad
> code for a GOT load.  Instead of using two add instructions and one ldr for
> the GOT slot access, like the PLT stubs do, it uses three(!) ldr instructions
> and one add.  The first ldr is for loading the GOT address, and the second is
> for the offset of the GOT slot.  As I understand, to fix that, GCC has to
> learn using the GOT_PREL relocation type.

Irrespective of combine, as a first step we should fix the predicates
and the call expanders to prevent this sort of replacement in the
backends. Tightening the predicates in the call patterns will achieve
the same for you and then we can investigate the use of GOT_PREL. My
recollection of this is that you need to work out when it's more
beneficial to use GOT_PREL over GOT but it's been a while since I
looked in that area.

>
> [*] To do that, we hacked arm legitimize_pic_address not to emit REG_EQUAL
> note under !flag_plt.
>
> Alexander

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-06-23  8:41       ` Ramana Radhakrishnan
@ 2015-06-23 10:43         ` Alexander Monakov
  2015-06-23 13:28         ` Jeff Law
  1 sibling, 0 replies; 106+ messages in thread
From: Alexander Monakov @ 2015-06-23 10:43 UTC (permalink / raw)
  To: Ramana Radhakrishnan
  Cc: Jiong Wang, gcc-patches, Rich Felker, Dmitry Melnik, Eugene Kudryashov

On Tue, 23 Jun 2015, Ramana Radhakrishnan wrote:
> > If you try disabling the REG_EQUAL note generation [*], you'll probably find a
> > performance regression on arm32 (and probably on aarch64 as well?
> > we only
> 
> IMHO disabling the REG_EQUAL note generation is the wrong way to go about this.

Of course.  I only mentioned that as a way to look at no-plt codegen that we
used, lacking a solution to the combine problem.

Alexander

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH] Expand PIC calls without PLT with -fno-plt
  2015-06-23  8:41       ` Ramana Radhakrishnan
  2015-06-23 10:43         ` Alexander Monakov
@ 2015-06-23 13:28         ` Jeff Law
  2015-07-16 10:37           ` [AArch64] Tighten direct call pattern to repair -fno-plt Jiong Wang
  1 sibling, 1 reply; 106+ messages in thread
From: Jeff Law @ 2015-06-23 13:28 UTC (permalink / raw)
  To: ramrad01, Alexander Monakov
  Cc: Jiong Wang, gcc-patches, Rich Felker, Dmitry Melnik, Eugene Kudryashov

On 06/23/2015 02:29 AM, Ramana Radhakrishnan wrote:

>> If you try disabling the REG_EQUAL note generation [*], you'll probably find a
>> performance regression on arm32 (and probably on aarch64 as well?
>> we only
>
> IMHO disabling the REG_EQUAL note generation is the wrong way to go about this.
Agreed.

> Irrespective of combine, as a first step we should fix the predicates
> and the call expanders to prevent this sort of replacement in the
> backends. Tightening the predicates in the call patterns will achieve
> the same for you and then we can investigate the use of GOT_PREL. My
> recollection of this is that you need to work out when it's more
> beneficial to use GOT_PREL over GOT but it's been a while since I
> looked in that area.
Also agreed.  This is primarily a backend issue with the call patterns.

This is similar to the situation on the PA with the 32bit SOM runtime 
where direct and indirect calls have different calling conventions. 
Those different calling conventions combined with the early loading of 
the parameter registers in effect restricts us from being able to 
transform an indirect call into a direct call (combine) or vice-versa (cse).

The way we handled this was to split the calls into two patterns, one 
for direct one for indirect and tightening their predicates appropriately.
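
A toy C model of that idea, assuming nothing about the real machine
descriptions (every name below is invented for illustration): each pattern
accepts only its own operand kind, so a combine-style substitution of a
symbol for a register is rejected when the symbol must not use a PLT stub.

```c
#include <stdbool.h>

/* Hypothetical operand description; not a GCC data structure.  */
enum operand_kind { OP_REG, OP_SYMBOL };

struct operand
{
  enum operand_kind kind;
  bool noplt;			/* Symbol must not be called via a PLT stub.  */
};

/* Direct-call pattern: symbols only, and only where a PLT stub is fine.  */
bool
direct_call_matches (const struct operand *op)
{
  return op->kind == OP_SYMBOL && !op->noplt;
}

/* Indirect-call pattern: registers only.  */
bool
indirect_call_matches (const struct operand *op)
{
  return op->kind == OP_REG;
}

/* A substitution is kept only if the rewritten call still matches
   one of the two patterns.  */
bool
substitution_accepted (const struct operand *replacement)
{
  return direct_call_matches (replacement)
	 || indirect_call_matches (replacement);
}
```

With predicates like these, replacing the register by a no-PLT symbol fails
to match either pattern, so the indirect call survives.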

Jeff

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [AArch64] Tighten direct call pattern to repair -fno-plt
  2015-06-23 13:28         ` Jeff Law
@ 2015-07-16 10:37           ` Jiong Wang
  2015-07-16 10:47             ` Alexander Monakov
  2015-08-04  9:50             ` [AArch64] Tighten " James Greenhalgh
  0 siblings, 2 replies; 106+ messages in thread
From: Jiong Wang @ 2015-07-16 10:37 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2160 bytes --]


Jeff Law writes:

> On 06/23/2015 02:29 AM, Ramana Radhakrishnan wrote:
>
>>> If you try disabling the REG_EQUAL note generation [*], you'll probably find a
>>> performance regression on arm32 (and probably on aarch64 as well?
>>> we only
>>
>> IMHO disabling the REG_EQUAL note generation is the wrong way to go about this.
> Agreed.
>
>> Irrespective of combine, as a first step we should fix the predicates
>> and the call expanders to prevent this sort of replacement in the
>> backends. Tightening the predicates in the call patterns will achieve
>> the same for you and then we can investigate the use of GOT_PREL. My
>> recollection of this is that you need to work out when it's more
>> beneficial to use GOT_PREL over GOT but it's been a while since I
>> looked in that area.
> Also agreed.  This is primarily a backend issue with the call patterns.
>
> This is similar to the situation on the PA with the 32bit SOM runtime 
> where direct and indirect calls have different calling conventions. 
> Those different calling conventions combined with the early loading of 
> the parameter registers in effect restricts us from being able to 
> transform an indirect call into a direct call (combine) or vice-versa (cse).
>
> The way we handled this was to split the calls into two patterns, one 
> for direct one for indirect and tightening their predicates appropriately.
>
> Jeff

Attached is the patch which repairs -fno-plt support for AArch64.

aarch64_is_noplt_call_p will only be true if:

  * gcc is generating position independent code.
  * the function symbol has a declaration.
  * either -fno-plt or the "noplt" attribute is specified.
  * it's an external function.
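
For readers who want to check the combinations without building GCC, the
four conditions can be modelled as a plain C predicate.  This is only a
sketch: struct call_info and its fields are hypothetical stand-ins for
flag_pic, the presence of a decl, flag_plt, the noplt attribute lookup and
targetm.binds_local_p in the real function.

```c
#include <stdbool.h>

/* Hypothetical summary of one call site; not a GCC type.  */
struct call_info
{
  bool pic;		/* Generating position independent code.  */
  bool has_decl;	/* The called symbol has a declaration.  */
  bool plt_enabled;	/* False when -fno-plt is given.  */
  bool noplt_attr;	/* Declaration carries the noplt attribute.  */
  bool binds_local;	/* Symbol resolves within this module.  */
};

/* True when the call should not go through a PLT stub.  */
bool
is_noplt_call (const struct call_info *c)
{
  return c->pic
	 && c->has_decl
	 && (!c->plt_enabled || c->noplt_attr)
	 && !c->binds_local;
}
```

Note that a locally binding function never qualifies, even with -fno-plt,
since such a call needs no PLT stub in the first place.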
  
OK for trunk?

2015-07-16  Jiong Wang  <jiong.wang@arm.com>

gcc/
  * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
  declaration.
  * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
  * config/aarch64/aarch64.md (call_value_symbol): Check noplt
  scenarios.
  (call_symbol): Ditto.

gcc/testsuite/
  * gcc.target/aarch64/noplt_1.c: New testcase.
  * gcc.target/aarch64/noplt_2.c: Ditto.


[-- Attachment #2: noplt.patch --]
[-- Type: text/x-diff, Size: 3277 bytes --]

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 4062c27..c354dc6 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -254,6 +254,7 @@ bool aarch64_gen_movmemqi (rtx *);
 bool aarch64_gimple_fold_builtin (gimple_stmt_iterator *);
 bool aarch64_is_extend_from_extract (machine_mode, rtx, rtx);
 bool aarch64_is_long_call_p (rtx);
+bool aarch64_is_noplt_call_p (rtx);
 bool aarch64_label_mentioned_p (rtx);
 bool aarch64_legitimate_pic_operand_p (rtx);
 bool aarch64_modes_tieable_p (machine_mode mode1,
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 5d4dc83..4522fc2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -747,6 +747,24 @@ aarch64_is_long_call_p (rtx sym)
   return aarch64_decl_is_long_call_p (SYMBOL_REF_DECL (sym));
 }
 
+/* Return true if calls to symbol-ref SYM should not go through
+   plt stubs.  */
+
+bool
+aarch64_is_noplt_call_p (rtx sym)
+{
+  const_tree decl = SYMBOL_REF_DECL (sym);
+
+  if (flag_pic
+      && decl
+      && (!flag_plt
+	  || lookup_attribute ("noplt", DECL_ATTRIBUTES (decl)))
+      && !targetm.binds_local_p (decl))
+    return true;
+
+  return false;
+}
+
 /* Return true if the offsets to a zero/sign-extract operation
    represent an expression that matches an extend operation.  The
    operands represent the paramters from
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 2d56a75..b88aac2 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -603,7 +603,8 @@
    (use (match_operand 2 "" ""))
    (clobber (reg:DI LR_REGNUM))]
   "GET_CODE (operands[0]) == SYMBOL_REF
-   && !aarch64_is_long_call_p (operands[0])"
+   && !aarch64_is_long_call_p (operands[0])
+   && !aarch64_is_noplt_call_p (operands[0])"
   "bl\\t%a0"
   [(set_attr "type" "call")]
 )
@@ -665,7 +666,8 @@
    (use (match_operand 3 "" ""))
    (clobber (reg:DI LR_REGNUM))]
   "GET_CODE (operands[1]) == SYMBOL_REF
-   && !aarch64_is_long_call_p (operands[1])"
+   && !aarch64_is_long_call_p (operands[1])
+   && !aarch64_is_noplt_call_p (operands[1])"
   "bl\\t%a1"
   [(set_attr "type" "call")]
 )
diff --git a/gcc/testsuite/gcc.target/aarch64/noplt_1.c b/gcc/testsuite/gcc.target/aarch64/noplt_1.c
new file mode 100644
index 0000000..4d778a4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/noplt_1.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fpic -fno-plt" } */
+
+int* bar (void) ;
+
+int
+foo (int a)
+{
+  int *b = bar ();
+  return b[a];
+}
+
+/* { dg-final { scan-assembler "#:got_lo12:" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/noplt_2.c b/gcc/testsuite/gcc.target/aarch64/noplt_2.c
new file mode 100644
index 0000000..226737a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/noplt_2.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fpic" } */
+
+__attribute__ ((noplt))
+int* bar0 (void) ;
+int* bar1 (void) ;
+
+int
+foo (int a)
+{
+  int *b0 = bar0 ();
+  int *b1 = bar1 ();
+  return b0[a] + b1[a];
+}
+
+/* { dg-final { scan-assembler-times "#:got_lo12:" 1 } } */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [AArch64] Tighten direct call pattern to repair -fno-plt
  2015-07-16 10:37           ` [AArch64] Tighten direct call pattern to repair -fno-plt Jiong Wang
@ 2015-07-16 10:47             ` Alexander Monakov
  2015-07-16 10:48               ` Jiong Wang
  2015-08-04  9:50             ` [AArch64] Tighten " James Greenhalgh
  1 sibling, 1 reply; 106+ messages in thread
From: Alexander Monakov @ 2015-07-16 10:47 UTC (permalink / raw)
  To: Jiong Wang; +Cc: gcc-patches

> Attached is the patch which repairs -fno-plt support for AArch64.
> 
> aarch64_is_noplt_call_p will only be true if:
> 
>   * gcc is generating position independent code.
>   * the function symbol has a declaration.
>   * either -fno-plt or the "noplt" attribute is specified.
>   * it's an external function.
>   
> OK for trunk?
> 
> 2015-07-16  Jiong Wang  <jiong.wang@arm.com>
> 
> gcc/
>   * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
>   declaration.
>   * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
>   * config/aarch64/aarch64.md (call_value_symbol): Check noplt
>   scenarios.
>   (call_symbol): Ditto.

Shouldn't the same treatment be applied to tailcall (sibcall_{,value_}symbol)
patterns?  I guess it could be done as a followup patch, but would be nice if
that isn't forgotten.

Alexander

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [AArch64] Tighten direct call pattern to repair -fno-plt
  2015-07-16 10:47             ` Alexander Monakov
@ 2015-07-16 10:48               ` Jiong Wang
  2015-07-21 12:52                 ` [AArch64][sibcall]Tighten " Jiong Wang
  0 siblings, 1 reply; 106+ messages in thread
From: Jiong Wang @ 2015-07-16 10:48 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: gcc-patches


Alexander Monakov writes:

>> Attached is the patch which repairs -fno-plt support for AArch64.
>> 
>> aarch64_is_noplt_call_p will only be true if:
>> 
>>   * gcc is generating position independent code.
>>   * the function symbol has a declaration.
>>   * either -fno-plt or the "noplt" attribute is specified.
>>   * it's an external function.
>>   
>> OK for trunk?
>> 
>> 2015-07-16  Jiong Wang  <jiong.wang@arm.com>
>> 
>> gcc/
>>   * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
>>   declaration.
>>   * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
>>   * config/aarch64/aarch64.md (call_value_symbol): Check noplt
>>   scenarios.
>>   (call_symbol): Ditto.
>
> Shouldn't the same treatment be applied to tailcall (sibcall_{,value_}symbol)
> patterns?  I guess it could be done as a followup patch, but would be nice if
> that isn't forgotten.

Thanks for the reminder; that will be done as a followup patch.

-- 
Regards,
Jiong

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [AArch64][sibcall]Tighten direct call pattern to repair -fno-plt
  2015-07-16 10:48               ` Jiong Wang
@ 2015-07-21 12:52                 ` Jiong Wang
  2015-08-04  9:50                   ` James Greenhalgh
  0 siblings, 1 reply; 106+ messages in thread
From: Jiong Wang @ 2015-07-21 12:52 UTC (permalink / raw)
  To: gcc-patches; +Cc: Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 1644 bytes --]


Jiong Wang writes:

> Alexander Monakov writes:
>
>>> Attached is the patch which repairs -fno-plt support for AArch64.
>>> 
>>> aarch64_is_noplt_call_p will only be true if:
>>> 
>>>   * gcc is generating position independent code.
>>>   * the function symbol has a declaration.
>>>   * either -fno-plt or the "noplt" attribute is specified.
>>>   * it's an external function.
>>>   
>>> OK for trunk?
>>> 
>>> 2015-07-16  Jiong Wang  <jiong.wang@arm.com>
>>> 
>>> gcc/
>>>   * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
>>>   declaration.
>>>   * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
>>>   * config/aarch64/aarch64.md (call_value_symbol): Check noplt
>>>   scenarios.
>>>   (call_symbol): Ditto.
>>
>> Shouldn't the same treatment be applied to tailcall (sibcall_{,value_}symbol)
>> patterns?  I guess it could be done as a followup patch, but would be nice if
>> that isn't forgotten.
>
> Thanks for the reminder; that will be done as a followup patch.

Patch attached.

This adds one more restriction to the "Usf" constraint, which the sibcall
patterns use when matching a direct call.

Given an example like

void
cal_novalue (int a)
{
  dec (a);
}

with -fpic -fno-plt specified we now generate:

cal_novalue:
        adrp    x1, :got:dec
        ldr     x1, [x1, #:got_lo12:dec]
        br      x1

instead of:

cal_novalue:
        b dec

2015-07-20  Jiong Wang  <jiong.wang@arm.com>

gcc/
  * config/aarch64/constraints.md (Usf): Add the test of
  aarch64_is_noplt_call_p.

gcc/testsuite/
  * gcc.target/aarch64/noplt_3.c: New test.

-- 
Regards,
Jiong


[-- Attachment #2: noplt_sib.patch --]
[-- Type: text/x-diff, Size: 1195 bytes --]

diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
index 5b189ea..9dc2108 100644
--- a/gcc/config/aarch64/constraints.md
+++ b/gcc/config/aarch64/constraints.md
@@ -101,8 +101,9 @@
        (match_test "(unsigned HOST_WIDE_INT) ival < 64")))
 
 (define_constraint "Usf"
-  "@internal Usf is a symbol reference."
-  (match_code "symbol_ref"))
+  "@internal Usf is a symbol reference under the context where plt stub allowed."
+  (and (match_code "symbol_ref")
+       (match_test "!aarch64_is_noplt_call_p (op)")))
 
 (define_constraint "UsM"
   "@internal
diff --git a/gcc/testsuite/gcc.target/aarch64/noplt_3.c b/gcc/testsuite/gcc.target/aarch64/noplt_3.c
new file mode 100644
index 0000000..54b51bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/noplt_3.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fpic -fno-plt" } */
+/* { dg-skip-if "-mcmodel=large, no support for -fpic" { aarch64-*-* }  { "-mcmodel=large" } { "" } } */
+
+int dec (int);
+
+int
+cal (int a)
+{
+  return dec (a);
+}
+
+void
+cal_novalue (int a)
+{
+  dec (a);
+}
+
+/* { dg-final { scan-assembler-times "#:got:" 2 } } */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [AArch64][sibcall]Tighten direct call pattern to repair -fno-plt
  2015-07-21 12:52                 ` [AArch64][sibcall]Tighten " Jiong Wang
@ 2015-08-04  9:50                   ` James Greenhalgh
  2015-08-06 16:18                     ` [COMMITTED][AArch64][sibcall]Tighten " Jiong Wang
  0 siblings, 1 reply; 106+ messages in thread
From: James Greenhalgh @ 2015-08-04  9:50 UTC (permalink / raw)
  To: Jiong Wang; +Cc: gcc-patches, Alexander Monakov

On Tue, Jul 21, 2015 at 01:42:35PM +0100, Jiong Wang wrote:
> 
> Jiong Wang writes:
> 
> > Alexander Monakov writes:
> >
> >>> Attached is the patch which repairs -fno-plt support for AArch64.
> >>> 
> >>> aarch64_is_noplt_call_p will only be true if:
> >>> 
> >>>   * gcc is generating position independent code.
> >>>   * the function symbol has a declaration.
> >>>   * either -fno-plt or the "noplt" attribute is specified.
> >>>   * it's an external function.
> >>>   
> >>> OK for trunk?
> >>> 
> >>> 2015-07-16  Jiong Wang  <jiong.wang@arm.com>
> >>> 
> >>> gcc/
> >>>   * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
> >>>   declaration.
> >>>   * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
> >>>   * config/aarch64/aarch64.md (call_value_symbol): Check noplt
> >>>   scenarios.
> >>>   (call_symbol): Ditto.
> >>
> >> Shouldn't the same treatment be applied to tailcall (sibcall_{,value_}symbol)
> >> patterns?  I guess it could be done as a followup patch, but would be nice if
> >> that isn't forgotten.
> >
> > Thanks for the reminder; that will be done as a followup patch.
> 
> Patch attached.
> 
> This adds one more restriction to the "Usf" constraint, which the sibcall
> patterns use when matching a direct call.
> 
> Given an example like
> 
> void
> cal_novalue (int a)
> {
>   dec (a);
> }
> 
> with -fpic -fno-plt specified we now generate:
> 
> cal_novalue:
>         adrp    x1, :got:dec
>         ldr     x1, [x1, #:got_lo12:dec]
>         br      x1
> 
> instead of:
> 
> cal_novalue:
>         b dec

OK.

Thanks,
James

> 2015-07-20  Jiong Wang  <jiong.wang@arm.com>
> 
> gcc/
>   * config/aarch64/constraints.md (Usf): Add the test of
>   aarch64_is_noplt_call_p.
> 
> gcc/testsuite/
>   * gcc.target/aarch64/noplt_3.c: New test.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [AArch64] Tighten direct call pattern to repair -fno-plt
  2015-07-16 10:37           ` [AArch64] Tighten direct call pattern to repair -fno-plt Jiong Wang
  2015-07-16 10:47             ` Alexander Monakov
@ 2015-08-04  9:50             ` James Greenhalgh
  2015-08-06 16:16               ` [COMMITTED][AArch64] " Jiong Wang
  1 sibling, 1 reply; 106+ messages in thread
From: James Greenhalgh @ 2015-08-04  9:50 UTC (permalink / raw)
  To: Jiong Wang; +Cc: gcc-patches

On Thu, Jul 16, 2015 at 11:21:25AM +0100, Jiong Wang wrote:
> 
> Jeff Law writes:
> 
> > On 06/23/2015 02:29 AM, Ramana Radhakrishnan wrote:
> >
> >>> If you try disabling the REG_EQUAL note generation [*], you'll probably find a
> >>> performance regression on arm32 (and probably on aarch64 as well?
> >>> we only
> >>
> >> IMHO disabling the REG_EQUAL note generation is the wrong way to go about this.
> > Agreed.
> >
> >> Irrespective of combine, as a first step we should fix the predicates
> >> and the call expanders to prevent this sort of replacement in the
> >> backends. Tightening the predicates in the call patterns will achieve
> >> the same for you and then we can investigate the use of GOT_PREL. My
> >> recollection of this is that you need to work out when it's more
> >> beneficial to use GOT_PREL over GOT but it's been a while since I
> >> looked in that area.
> > Also agreed.  This is primarily a backend issue with the call patterns.
> >
> > This is similar to the situation on the PA with the 32bit SOM runtime 
> > where direct and indirect calls have different calling conventions. 
> > Those different calling conventions combined with the early loading of 
> > the parameter registers in effect restricts us from being able to 
> > transform an indirect call into a direct call (combine) or vice-versa (cse).
> >
> > The way we handled this was to split the calls into two patterns, one 
> > for direct one for indirect and tightening their predicates appropriately.
> >
> > Jeff
> 
> Attached is the patch which repairs -fno-plt support for AArch64.
> 
> aarch64_is_noplt_call_p will only be true if:
> 
>   * gcc is generating position independent code.
>   * the function symbol has a declaration.
>   * either -fno-plt or the "noplt" attribute is specified.
>   * it's an external function.
>   
> OK for trunk?

OK.

Thanks,
James

> 
> 2015-07-16  Jiong Wang  <jiong.wang@arm.com>
> 
> gcc/
>   * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
>   declaration.
>   * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
>   * config/aarch64/aarch64.md (call_value_symbol): Check noplt
>   scenarios.
>   (call_symbol): Ditto.
> 
> gcc/testsuite/
>   * gcc.target/aarch64/noplt_1.c: New testcase.
>   * gcc.target/aarch64/noplt_2.c: Ditto.
> 

(Though do check the ChangeLog formatting when you commit :-).)

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [COMMITTED][AArch64] Tighten direct call pattern to repair -fno-plt
  2015-08-04  9:50             ` [AArch64] Tighten " James Greenhalgh
@ 2015-08-06 16:16               ` Jiong Wang
  0 siblings, 0 replies; 106+ messages in thread
From: Jiong Wang @ 2015-08-06 16:16 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2797 bytes --]


James Greenhalgh writes:

> On Thu, Jul 16, 2015 at 11:21:25AM +0100, Jiong Wang wrote:
>> 
>> Jeff Law writes:
>> 
>> > On 06/23/2015 02:29 AM, Ramana Radhakrishnan wrote:
>> >
>> >>> If you try disabling the REG_EQUAL note generation [*], you'll probably find a
>> >>> performance regression on arm32 (and probably on aarch64 as well?
>> >>> we only
>> >>
>> >> IMHO disabling the REG_EQUAL note generation is the wrong way to go about this.
>> > Agreed.
>> >
>> >> Irrespective of combine, as a first step we should fix the predicates
>> >> and the call expanders to prevent this sort of replacement in the
>> >> backends. Tightening the predicates in the call patterns will achieve
>> >> the same for you and then we can investigate the use of GOT_PREL. My
>> >> recollection of this is that you need to work out when it's more
>> >> beneficial to use GOT_PREL over GOT but it's been a while since I
>> >> looked in that area.
>> > Also agreed.  This is primarily a backend issue with the call patterns.
>> >
>> > This is similar to the situation on the PA with the 32bit SOM runtime 
>> > where direct and indirect calls have different calling conventions. 
>> > Those different calling conventions combined with the early loading of 
>> > the parameter registers in effect restricts us from being able to 
>> > transform an indirect call into a direct call (combine) or vice-versa (cse).
>> >
>> > The way we handled this was to split the calls into two patterns, one 
>> > for direct one for indirect and tightening their predicates appropriately.
>> >
>> > Jeff
>> 
>> Attached is the patch which repairs -fno-plt support for AArch64.
>> 
>> aarch64_is_noplt_call_p will only be true if:
>> 
>>   * gcc is generating position independent code.
>>   * the function symbol has a declaration.
>>   * either -fno-plt or the "noplt" attribute is specified.
>>   * it's an external function.
>>   
>> OK for trunk?
>
> OK.
>
> Thanks,
> James
>
>> 
>> 2015-07-16  Jiong Wang  <jiong.wang@arm.com>
>> 
>> gcc/
>>   * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
>>   declaration.
>>   * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
>>   * config/aarch64/aarch64.md (call_value_symbol): Check noplt
>>   scenarios.
>>   (call_symbol): Ditto.
>> 
>> gcc/testsuite/
>>   * gcc.target/aarch64/noplt_1.c: New testcase.
>>   * gcc.target/aarch64/noplt_2.c: Ditto.
>> 
>
> (Though do check the ChangeLog formatting when you commit :-).)

Thanks for the review.

I realized I need to apply the same trick as I did in

  https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01653.html

so that the included testcases behave correctly under any of the tiny,
small and large code models.

Committed the following patch:


[-- Attachment #2: new-1.patch --]
[-- Type: text/x-diff, Size: 5606 bytes --]

commit 2bcb7473b37f9aa76e530f0a2007893489f61586
Author: jiwang <jiwang@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Aug 6 15:57:36 2015 +0000

    [AArch64] Tighten direct call pattern to repair -fno-plt
    
    2015-08-06  Jiong Wang  <jiong.wang@arm.com>
    
    gcc/
      * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New declaration.
      * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
      * config/aarch64/aarch64.md (call_value_symbol): Check noplt scenarios.
      (call_symbol): Likewise.
    
    gcc/testsuite/
      * gcc.target/aarch64/noplt_1.c: New testcase.
      * gcc.target/aarch64/noplt_2.c: Likewise.
    
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@226681 138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 43df172..2b364ce 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,10 @@
+2015-08-06  Jiong Wang  <jiong.wang@arm.com>
+
+	* config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New declaration.
+	* config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
+	* config/aarch64/aarch64.md (call_value_symbol): Check noplt scenarios.
+	(call_symbol): Likewise.
+
 2015-08-06  Venkataramanan Kumar  <Venkataramanan.kumar@amd.com>
 
 	* tree-vect-patterns.c (vect_recog_mult_pattern): New function
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 5d8902f..32b5d09 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -258,6 +258,7 @@ bool aarch64_handle_option (struct gcc_options *, struct gcc_options *,
 			     const struct cl_decoded_option *, location_t);
 bool aarch64_is_extend_from_extract (machine_mode, rtx, rtx);
 bool aarch64_is_long_call_p (rtx);
+bool aarch64_is_noplt_call_p (rtx);
 bool aarch64_label_mentioned_p (rtx);
 void aarch64_declare_function_name (FILE *, const char*, tree);
 bool aarch64_legitimate_pic_operand_p (rtx);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1394ed7..e991a49 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -744,6 +744,24 @@ aarch64_is_long_call_p (rtx sym)
   return aarch64_decl_is_long_call_p (SYMBOL_REF_DECL (sym));
 }
 
+/* Return true if calls to symbol-ref SYM should not go through
+   plt stubs.  */
+
+bool
+aarch64_is_noplt_call_p (rtx sym)
+{
+  const_tree decl = SYMBOL_REF_DECL (sym);
+
+  if (flag_pic
+      && decl
+      && (!flag_plt
+	  || lookup_attribute ("noplt", DECL_ATTRIBUTES (decl)))
+      && !targetm.binds_local_p (decl))
+    return true;
+
+  return false;
+}
+
 /* Return true if the offsets to a zero/sign-extract operation
    represent an expression that matches an extend operation.  The
    operands represent the paramters from
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index b7b04c4..7f99753 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -603,7 +603,8 @@
    (use (match_operand 2 "" ""))
    (clobber (reg:DI LR_REGNUM))]
   "GET_CODE (operands[0]) == SYMBOL_REF
-   && !aarch64_is_long_call_p (operands[0])"
+   && !aarch64_is_long_call_p (operands[0])
+   && !aarch64_is_noplt_call_p (operands[0])"
   "bl\\t%a0"
   [(set_attr "type" "call")]
 )
@@ -665,7 +666,8 @@
    (use (match_operand 3 "" ""))
    (clobber (reg:DI LR_REGNUM))]
   "GET_CODE (operands[1]) == SYMBOL_REF
-   && !aarch64_is_long_call_p (operands[1])"
+   && !aarch64_is_long_call_p (operands[1])
+   && !aarch64_is_noplt_call_p (operands[1])"
   "bl\\t%a1"
   [(set_attr "type" "call")]
 )
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 76afd8e..fb3bf07 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2015-08-06  Jiong Wang  <jiong.wang@arm.com>
+
+	* gcc.target/aarch64/noplt_1.c: New testcase.
+	* gcc.target/aarch64/noplt_2.c: Likewise.
+
 2015-08-06  Venkataramanan Kumar  <Venkataramanan.kumar@amd.com>
 
 	* gcc.dg/vect/vect-mult-pattern-1.c: New test.
diff --git a/gcc/testsuite/gcc.target/aarch64/noplt_1.c b/gcc/testsuite/gcc.target/aarch64/noplt_1.c
new file mode 100644
index 0000000..4e9bb62
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/noplt_1.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fpic -fno-plt" } */
+/* { dg-skip-if "-mcmodel=large, no support for -fpic" { aarch64-*-* }  { "-mcmodel=large" } { "" } } */
+
+int* bar (void) ;
+
+int
+foo (int a)
+{
+  int *b = bar ();
+  return b[a];
+}
+
+/* { dg-final { scan-assembler "#:got:" { target { aarch64_tiny || aarch64_small } } } } */
+/* { dg-final { scan-assembler "#:got_lo12:" { target aarch64_small } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/noplt_2.c b/gcc/testsuite/gcc.target/aarch64/noplt_2.c
new file mode 100644
index 0000000..718999b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/noplt_2.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fpic" } */
+/* { dg-skip-if "-mcmodel=large, no support for -fpic" { aarch64-*-* }  { "-mcmodel=large" } { "" } } */
+
+__attribute__ ((noplt))
+int* bar0 (void) ;
+int* bar1 (void) ;
+
+int
+foo (int a)
+{
+  int *b0 = bar0 ();
+  int *b1 = bar1 ();
+  return b0[a] + b1[a];
+}
+
+/* { dg-final { scan-assembler-times "#:got:" 1 { target { aarch64_tiny || aarch64_small } } } } */
+/* { dg-final { scan-assembler-times "#:got_lo12:" 1 { target aarch64_small } } } */

^ permalink raw reply	[flat|nested] 106+ messages in thread

* [COMMITTED][AArch64][sibcall]Tighten direct call pattern to repair -fno-plt
  2015-08-04  9:50                   ` James Greenhalgh
@ 2015-08-06 16:18                     ` Jiong Wang
  2015-08-07  8:22                       ` James Greenhalgh
  0 siblings, 1 reply; 106+ messages in thread
From: Jiong Wang @ 2015-08-06 16:18 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: gcc-patches, Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 1853 bytes --]


James Greenhalgh writes:

> On Tue, Jul 21, 2015 at 01:42:35PM +0100, Jiong Wang wrote:
>> 
>> Jiong Wang writes:
>> 
>> > Alexander Monakov writes:
>> >
>> >>> Attached is the patch which repairs -fno-plt support for AArch64.
>> >>> 
>> >>> aarch64_is_noplt_call_p will only be true if:
>> >>> 
>> >>>   * gcc is generating position independent code.
>> >>>   * the function symbol has a declaration.
>> >>>   * either -fno-plt or the "noplt" attribute is specified.
>> >>>   * it's an external function.
>> >>>   
>> >>> OK for trunk?
>> >>> 
>> >>> 2015-07-16  Jiong Wang  <jiong.wang@arm.com>
>> >>> 
>> >>> gcc/
>> >>>   * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
>> >>>   declaration.
>> >>>   * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
>> >>>   * config/aarch64/aarch64.md (call_value_symbol): Check noplt
>> >>>   scenarios.
>> >>>   (call_symbol): Ditto.
>> >>
>> >> Shouldn't the same treatment be applied to tailcall (sibcall_{,value_}symbol)
>> >> patterns?  I guess it could be done as a followup patch, but would be nice if
>> >> that isn't forgotten.
>> >
>> > Thanks for the remaind, that will be done as a followup patch.
>> 
>> Patch attached.
>> 
>> Added one more restriction to "Usf" constraint which is used by sibcall
>> pattern when matching direct call.
>> 
>> given example like
>> 
>> void
>> cal_novalue (int a)
>> {
>>   dec (a);
>> }
>> 
>> when -fpic -fno-plt specified we now generate:
>> 
>> cal:
>>         adrp    x1, :got:dec
>>         ldr     x1, [x1, #:got_lo12:dec]
>>         br      x1
>> 
>> instead of:
>> 
>> cal:
>>         b dec
>
> OK.
>
> Thanks,
> James
>

Committed the following patch, which makes minor adjustments so that the
testcases work with any of the tiny, small, and large code models.

Thanks.


[-- Attachment #2: new-2.patch --]
[-- Type: text/x-diff, Size: 2437 bytes --]

Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 226681)
+++ gcc/ChangeLog	(working copy)
@@ -1,5 +1,10 @@
 2015-08-06  Jiong Wang  <jiong.wang@arm.com>
 
+	* config/aarch64/constraints.md (Usf): Add the test of
+	aarch64_is_noplt_call_p.
+
+2015-08-06  Jiong Wang  <jiong.wang@arm.com>
+
 	* config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New declaration.
 	* config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
 	* config/aarch64/aarch64.md (call_value_symbol): Check noplt scenarios.
Index: gcc/config/aarch64/constraints.md
===================================================================
--- gcc/config/aarch64/constraints.md	(revision 226680)
+++ gcc/config/aarch64/constraints.md	(working copy)
@@ -101,8 +101,9 @@
        (match_test "(unsigned HOST_WIDE_INT) ival < 64")))
 
 (define_constraint "Usf"
-  "@internal Usf is a symbol reference."
-  (match_code "symbol_ref"))
+  "@internal Usf is a symbol reference under the context where plt stub allowed."
+  (and (match_code "symbol_ref")
+       (match_test "!aarch64_is_noplt_call_p (op)")))
 
 (define_constraint "UsM"
   "@internal
Index: gcc/testsuite/ChangeLog
===================================================================
--- gcc/testsuite/ChangeLog	(revision 226681)
+++ gcc/testsuite/ChangeLog	(working copy)
@@ -1,5 +1,9 @@
 2015-08-06  Jiong Wang  <jiong.wang@arm.com>
 
+	* gcc.target/aarch64/noplt_3.c: New testcase.
+
+2015-08-06  Jiong Wang  <jiong.wang@arm.com>
+
 	* gcc.target/aarch64/noplt_1.c: New testcase.
 	* gcc.target/aarch64/noplt_2.c: Likewise.
 
Index: gcc/testsuite/gcc.target/aarch64/noplt_3.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/noplt_3.c	(revision 0)
+++ gcc/testsuite/gcc.target/aarch64/noplt_3.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fpic -fno-plt" } */
+/* { dg-skip-if "-mcmodel=large, no support for -fpic" { aarch64-*-* }  { "-mcmodel=large" } { "" } } */
+
+int dec (int);
+
+int
+cal (int a)
+{
+  return dec (a);
+}
+
+void
+cal_novalue (int a)
+{
+  dec (a);
+}
+
+/* { dg-final { scan-assembler-times "#:got:" 2 { target { aarch64_tiny || aarch64_small } } } } */
+/* { dg-final { scan-assembler-times "#:got_lo12:" 2 { target aarch64_small } } } */


* Re: [COMMITTED][AArch64][sibcall]Tighten direct call pattern to repair -fno-plt
  2015-08-06 16:18                     ` [COMMITTED][AArch64][sibcall]Tighten " Jiong Wang
@ 2015-08-07  8:22                       ` James Greenhalgh
  2015-08-07 13:28                         ` Jiong Wang
  0 siblings, 1 reply; 106+ messages in thread
From: James Greenhalgh @ 2015-08-07  8:22 UTC (permalink / raw)
  To: Jiong Wang; +Cc: gcc-patches, Alexander Monakov

On Thu, Aug 06, 2015 at 05:16:33PM +0100, Jiong Wang wrote:
> 
> James Greenhalgh writes:
> 
> > On Tue, Jul 21, 2015 at 01:42:35PM +0100, Jiong Wang wrote:
> >> 
> >> Jiong Wang writes:
> >> 
> >> > Alexander Monakov writes:
> >> >
> >> >>> Attachment is the patch which repair -fno-plt support for AArch64.
> >> >>> 
> >> >>> aarch64_is_noplt_call_p will only be true if:
> >> >>> 
> >> >>>   * gcc is generating position independent code.
> >> >>>   * function symbol has declaration.
> >> >>>   * either -fno-plt or "(no_plt)" attribute specified.
> >> >>>   * it's a external function.
> >> >>>   
> >> >>> OK for trunk?
> >> >>> 
> >> >>> 2015-07-16  Jiong Wang  <jiong.wang@arm.com>
> >> >>> 
> >> >>> gcc/
> >> >>>   * config/aarch64/aarch64-protos.h (aarch64_is_noplt_call_p): New
> >> >>>   declaration.
> >> >>>   * config/aarch64/aarch64.c (aarch64_is_noplt_call_p): New function.
> >> >>>   * config/aarch64/aarch64.md (call_value_symbol): Check noplt
> >> >>>   scenarios.
> >> >>>   (call_symbol): Ditto.
> >> >>
> >> >> Shouldn't the same treatment be applied to tailcall (sibcall_{,value_}symbol)
> >> >> patterns?  I guess it could be done as a followup patch, but would be nice if
> >> >> that isn't forgotten.
> >> >
> >> > Thanks for the remaind, that will be done as a followup patch.

Hi Jiong,

The new testcases introduced in this and the related patch are failing
for me on aarch64-none-elf:

    aarch64-none-elf

	NA->FAIL: gcc.target/aarch64/noplt_1.c scan-assembler
	NA->FAIL: gcc.target/aarch64/noplt_2.c scan-assembler-times
	NA->FAIL: gcc.target/aarch64/noplt_3.c scan-assembler-times

For this invocation:
 
  .../build/obj/gcc2/gcc/xgcc -B.../build/obj/gcc2/gcc/ .../src/gcc/testsuite/gcc.target/aarch64/noplt_1.c -fno-diagnostics-show-caret -fdiagnostics-color=never -O2 -fno-plt -fpic -S  -mcmodel=small -o noplt_1.s

I get this code generation for the small memory model:

foo:
	stp	x29, x30, [sp, -32]!
	adrp	x1, _GLOBAL_OFFSET_TABLE_
	add	x29, sp, 0
	str	x19, [sp, 16]
	mov	w19, w0
	ldr	x0, [x1, #:gotpage_lo15:bar]
	blr	x0
	ldr	w0, [x0, w19, sxtw 2]
	ldr	x19, [sp, 16]
	ldp	x29, x30, [sp], 32
	ret
	.size	foo, .-foo

Which uses a different relocation.

Did you intend for these tests to be run with -fPIC -fno-plt rather than
-fpic -fno-plt, or does this indicate a bug?

Thanks,
James



* Re: [COMMITTED][AArch64][sibcall]Tighten direct call pattern to repair -fno-plt
  2015-08-07  8:22                       ` James Greenhalgh
@ 2015-08-07 13:28                         ` Jiong Wang
  0 siblings, 0 replies; 106+ messages in thread
From: Jiong Wang @ 2015-08-07 13:28 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: gcc-patches, Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 3410 bytes --]

James Greenhalgh writes:

> On Thu, Aug 06, 2015 at 05:16:33PM +0100, Jiong Wang wrote:
>
> Hi Jiong,
>
> The new testcases introduced in this and the related patch are failing
> for me on aarch64-none-elf:
>
>     aarch64-none-elf
>
> 	NA->FAIL: gcc.target/aarch64/noplt_1.c scan-assembler
> 	NA->FAIL: gcc.target/aarch64/noplt_2.c scan-assembler-times
> 	NA->FAIL: gcc.target/aarch64/noplt_3.c scan-assembler-times
>
> For this invocation:
>  
>   .../build/obj/gcc2/gcc/xgcc -B.../build/obj/gcc2/gcc/ .../src/gcc/testsuite/gcc.target/aarch64/noplt_1.c -fno-diagnostics-show-caret -fdiagnostics-color=never -O2 -fno-plt -fpic -S  -mcmodel=small -o noplt_1.s
>
> I get this code generation for the small memory model:
>
> foo:
> 	stp	x29, x30, [sp, -32]!
> 	adrp	x1, _GLOBAL_OFFSET_TABLE_
> 	add	x29, sp, 0
> 	str	x19, [sp, 16]
> 	mov	w19, w0
> 	ldr	x0, [x1, #:gotpage_lo15:bar]
> 	blr	x0
> 	ldr	w0, [x0, w19, sxtw 2]
> 	ldr	x19, [sp, 16]
> 	ldp	x29, x30, [sp], 32
> 	ret
> 	.size	foo, .-foo
>
> Which uses a different relocation.
>
> Did you intend for these tests to be run with -fPIC -fno-plt rather than
> -fpic -fno-plt, or does this indicate a bug?
>
> Thanks,
> James

  Thanks for noticing this.

  The dg-options use -fpic, so the tests are supposed to work under
  -fpic, but I was checking for the -fPIC instruction sequences, which
  is wrong.

  They passed on my local machine because I was doing a cross check
  with no cross binutils installed: I only built cc1 and then ran the
  check.  After double-checking, I found that none of those
  scan-assembler tests had actually been triggered.  It looks like
  DejaGnu was using the local x86 assembler for some prerequisite
  checks, found that -EL was not supported, and so those prerequisite
  checks returned false.

  Even worse, -fpic for AArch64 falls back to -fPIC if the installed
  aarch64 binutils do not support the recently added relocation types
  for -fpic.  So even if I had written the correct instruction
  sequences in these testcases, they would fail in environments with
  old binutils installed.  And even with new binutils installed, they
  may still fail if the user configures gcc with -mabi=ilp32, because
  for ILP32 the relocation modifier for the small code model under
  -fpic is gotpage_lo14 instead of the gotpage_lo15 used for LP64.

  Unfortunately, even with all of the above taken into account, these
  testcases still fail if the user forces -mcmodel=tiny in the
  compilation options, because the DejaGnu effective-target checks do
  not know about the extra options the user added.

  On second thought, my previous checking logic was not good.  It is
  better to check the final branch type than to be bothered by those
  relocation modifiers.

  The fundamental change -fno-plt brings is that it turns a direct
  branch into an indirect branch, i.e. "bl/b" into "blr/br", while the
  GOT-related modifiers only prepare the branch destination register.
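
  As a rough illustration of checking the branch type rather than the
  relocation modifiers, here is a minimal sketch in C.  The helper
  count_mnemonic is hypothetical and not part of GCC or DejaGnu; it
  counts assembly lines whose mnemonic matches exactly, whereas the
  real scan-assembler directives apply a regular expression to the
  whole assembler output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of GCC or DejaGnu): count the lines of
   ASM_TEXT whose first token, after leading blanks, is exactly MNEMONIC
   followed by a space or tab.  */
static int
count_mnemonic (const char *asm_text, const char *mnemonic)
{
  assert (asm_text != NULL && mnemonic != NULL);

  size_t mlen = strlen (mnemonic);
  int count = 0;
  const char *line = asm_text;

  while (*line != '\0')
    {
      /* Skip the indentation in front of the instruction.  */
      const char *p = line;
      while (*p == ' ' || *p == '\t')
        p++;

      /* Require a blank after the mnemonic so "bl" does not match "blr".  */
      if (strncmp (p, mnemonic, mlen) == 0
          && (p[mlen] == ' ' || p[mlen] == '\t'))
        count++;

      /* Advance to the next line, or stop at the end of the text.  */
      const char *nl = strchr (line, '\n');
      if (nl == NULL)
        break;
      line = nl + 1;
    }

  return count;
}
```

  With -fno-plt the call to an external function should show up as one
  "blr" and no "bl", whichever GOT relocation modifier was used to
  load the destination register.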

  The patch passes with -fpic, -fPIC, binutils without -fpic support,
  binutils with -fpic support, and ilp32.

  Should be OK now.

  Committed as obvious.

  Thanks.

  2015-08-07  Jiong Wang  <jiong.wang@arm.com>

  gcc/testsuite/
    * gcc.target/aarch64/noplt_1.c: Check branch type instead of
    relocation modifiers.
    * gcc.target/aarch64/noplt_2.c: Likewise.
    * gcc.target/aarch64/noplt_3.c: Likewise.

-- 
Regards,
Jiong

[-- Attachment #2: fix.patch --]
[-- Type: text/x-diff, Size: 1737 bytes --]

diff --git a/gcc/testsuite/gcc.target/aarch64/noplt_1.c b/gcc/testsuite/gcc.target/aarch64/noplt_1.c
index 4e9bb62..731fcae 100644
--- a/gcc/testsuite/gcc.target/aarch64/noplt_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/noplt_1.c
@@ -11,5 +11,5 @@ foo (int a)
   return b[a];
 }

-/* { dg-final { scan-assembler "#:got:" { target { aarch64_tiny || aarch64_small } } } } */
-/* { dg-final { scan-assembler "#:got_lo12:" { target aarch64_small } } } */
+/* { dg-final { scan-assembler "blr" } } */
+/* { dg-final { scan-assembler-not "bl\t" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/noplt_2.c b/gcc/testsuite/gcc.target/aarch64/noplt_2.c
index 718999b..3be94aa 100644
--- a/gcc/testsuite/gcc.target/aarch64/noplt_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/noplt_2.c
@@ -14,5 +14,5 @@ foo (int a)
   return b0[a] + b1[a];
 }

-/* { dg-final { scan-assembler-times "#:got:" 1 { target { aarch64_tiny || aarch64_small } } } } */
-/* { dg-final { scan-assembler-times "#:got_lo12:" 1 { target aarch64_small } } } */
+/* { dg-final { scan-assembler-times "blr" 1 } } */
+/* { dg-final { scan-assembler-times "bl\t" 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/noplt_3.c b/gcc/testsuite/gcc.target/aarch64/noplt_3.c
index c1993b6..ef6e65d 100644
--- a/gcc/testsuite/gcc.target/aarch64/noplt_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/noplt_3.c
@@ -16,5 +16,5 @@ cal_novalue (int a)
   dec (a);
 }

-/* { dg-final { scan-assembler-times "#:got:" 2 { target { aarch64_tiny || aarch64_small } } } } */
-/* { dg-final { scan-assembler-times "#:got_lo12:" 2 { target aarch64_small } } } */
+/* { dg-final { scan-assembler-times "br" 2 } } */
+/* { dg-final { scan-assembler-not "b\t" } } */


Thread overview: 106+ messages
2015-05-04 16:38 PIC calls without PLT, generic implementation Alexander Monakov
2015-05-04 16:38 ` [PATCH i386] Extend sibcall peepholes to allow source in %eax Alexander Monakov
2015-05-10 16:54   ` Jan Hubicka
2015-05-11 17:50     ` Alexander Monakov
2015-05-11 18:00       ` Jan Hubicka
2015-05-11 19:46         ` Uros Bizjak
2015-05-11 19:48           ` Jeff Law
2015-05-11 20:16             ` Jan Hubicka
2015-05-13 19:05               ` Alexander Monakov
2015-05-13 20:04                 ` Jan Hubicka
2015-05-14 17:36                   ` Alexander Monakov
2015-05-04 16:38 ` [PATCH i386] Move CLOBBERED_REGS earlier in register class list Alexander Monakov
2015-05-10 16:44   ` Jan Hubicka
2015-05-10 17:51     ` Uros Bizjak
2015-05-10 18:09       ` Uros Bizjak
2015-05-11 16:26         ` Alexander Monakov
2015-05-11 16:30           ` Uros Bizjak
2015-05-04 16:38 ` [PATCH i386] PR65753: allow PIC tail calls via function pointers Alexander Monakov
2015-05-10 16:37   ` Jan Hubicka
2015-05-11 16:11     ` Alexander Monakov
2015-05-04 16:38 ` [RFC PATCH] ira: accept loads via argp rtx in validate_equiv_mem Alexander Monakov
2015-05-04 17:37   ` Jeff Law
2015-05-04 16:38 ` [PATCH] Expand PIC calls without PLT with -fno-plt Alexander Monakov
2015-05-04 17:34   ` Jeff Law
2015-05-04 17:40     ` Jakub Jelinek
2015-05-04 17:42       ` Jeff Law
2015-05-06  3:08         ` Rich Felker
2015-05-10 17:07           ` Jan Hubicka
2015-05-06 15:25         ` Alexander Monakov
2015-05-06 15:46           ` Jakub Jelinek
2015-05-06 15:55             ` Jeff Law
2015-05-06 16:44             ` Alexander Monakov
2015-05-06 17:35               ` Rich Felker
2015-05-06 18:26                 ` H.J. Lu
2015-05-06 18:37                   ` Rich Felker
2015-05-06 18:45                     ` H.J. Lu
2015-05-06 19:01                       ` Rich Felker
2015-05-06 19:05                         ` H.J. Lu
2015-05-06 19:18                           ` Rich Felker
2015-05-06 19:24                             ` H.J. Lu
2015-05-11 11:48                             ` Michael Matz
2015-05-11 14:20                               ` Rich Felker
2015-05-07 18:22           ` Jeff Law
2015-05-07 19:13             ` H.J. Lu
2015-05-10 16:59   ` Jan Hubicka
2015-05-11 20:36     ` Jeff Law
2015-05-11 20:55       ` H.J. Lu
2015-05-11 22:13         ` Jan Hubicka
2015-06-22 15:52   ` Jiong Wang
2015-06-22 18:18     ` Alexander Monakov
2015-06-23  8:41       ` Ramana Radhakrishnan
2015-06-23 10:43         ` Alexander Monakov
2015-06-23 13:28         ` Jeff Law
2015-07-16 10:37           ` [AArch64] Tighten direct call pattern to repair -fno-plt Jiong Wang
2015-07-16 10:47             ` Alexander Monakov
2015-07-16 10:48               ` Jiong Wang
2015-07-21 12:52                 ` [AArch64][sibcall]Tighten " Jiong Wang
2015-08-04  9:50                   ` James Greenhalgh
2015-08-06 16:18                     ` [COMMITTED][AArch64][sibcall]Tighten " Jiong Wang
2015-08-07  8:22                       ` James Greenhalgh
2015-08-07 13:28                         ` Jiong Wang
2015-08-04  9:50             ` [AArch64] Tighten " James Greenhalgh
2015-08-06 16:16               ` [COMMITTED][AArch64] " Jiong Wang
2015-05-04 16:38 ` [PATCH i386] Allow sibcalls in no-PLT PIC Alexander Monakov
2015-05-15 16:37   ` Alexander Monakov
2015-05-15 16:48     ` H.J. Lu
2015-05-15 20:08       ` Jan Hubicka
2015-05-15 20:23         ` H.J. Lu
2015-05-15 20:35           ` Rich Felker
2015-05-15 20:37             ` H.J. Lu
2015-05-15 20:45               ` Rich Felker
2015-05-15 22:16                 ` H.J. Lu
2015-05-15 23:14                   ` Jan Hubicka
2015-05-15 23:30                     ` H.J. Lu
2015-05-15 23:35                       ` H.J. Lu
2015-05-15 23:44                         ` H.J. Lu
2015-05-16  0:18                           ` Rich Felker
2015-05-16 14:33                             ` H.J. Lu
2015-05-16 19:03                               ` H.J. Lu
2015-05-16 19:32                                 ` Rich Felker
2015-05-16 23:23                                   ` H.J. Lu
2015-05-15 23:49                       ` Rich Felker
2015-05-19 14:48                         ` Michael Matz
2015-05-19 15:11                           ` Jeff Law
2015-05-19 16:03                             ` Michael Matz
2015-05-19 19:11                               ` Rich Felker
2015-05-19 18:08                           ` Rich Felker
2015-05-19 19:03                             ` Richard Henderson
2015-05-19 19:10                               ` H.J. Lu
2015-05-19 19:17                                 ` Richard Henderson
2015-05-19 19:20                                   ` H.J. Lu
2015-05-19 19:54                                     ` Richard Henderson
2015-05-19 20:27                                     ` Rich Felker
2015-05-19 20:44                                       ` H.J. Lu
2015-05-19 21:28                                         ` Rich Felker
2015-05-20  0:52                                           ` H.J. Lu
2015-05-20  1:09                                             ` Rich Felker
2015-05-22 19:32                                               ` Richard Henderson
2015-05-19 19:48                               ` Rich Felker
2015-05-19 20:16                                 ` Richard Henderson
2015-05-20 12:13                               ` Michael Matz
2015-05-20 12:40                                 ` H.J. Lu
2015-05-20 14:17                                 ` Rich Felker
2015-05-20 14:33                                   ` Michael Matz
2015-05-18 18:25         ` Alexander Monakov
2015-05-18 19:03           ` Jan Hubicka
