public inbox for gcc-patches@gcc.gnu.org
* Re: [RFC][PATCH][X86_64] Eliminate PLT stubs for specified external functions via -fno-plt=
@ 2015-05-10 15:19 H.J. Lu
       [not found] ` <CAAs8HmwWSDY+KjKcB4W=TiYV0Pz7NSvfL_8igp+hPT-LU1utTg@mail.gmail.com>
  0 siblings, 1 reply; 65+ messages in thread
From: H.J. Lu @ 2015-05-10 15:19 UTC
  To: Michael Matz; +Cc: Sriraman Tallam, GCC Patches, David Li

On Sat, May 9, 2015 at 9:34 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, May 4, 2015 at 7:45 AM, Michael Matz <matz@suse.de> wrote:
>> Hi,
>>
>> On Thu, 30 Apr 2015, Sriraman Tallam wrote:
>>
>>> We noticed that one of our benchmarks sped-up by ~1% when we eliminated
>>> PLT stubs for some of the hot external library functions like memcmp,
>>> pow.  The win was from better icache and itlb performance. The main
>>> reason was that the PLT stubs had no spatial locality with the
>>> call-sites. I have started looking at ways to tell the compiler to
>>> eliminate PLT stubs (in-effect inline them) for specified external
>>> functions, for x86_64. I have a proposal and a patch and I would like to
>>> hear what you think.
>>>
>>> This comes with caveats.  This cannot be generally done for all
>>> functions marked extern as it is impossible for the compiler to say if a
>>> function is "truly extern" (defined in a shared library). If a function
>>> is not truly extern (ends up defined in the final executable), then
>>> calling it indirectly is a performance penalty as it could have been a
>>> direct call.
>>
>> This can be fixed by Alan's idea.
>>
>>> Further, the newly created GOT entries are fixed up at
>>> start-up and do not get lazily bound.
>>
>> And this can be fixed by some enhancements in the linker and dynamic
>> linker.  The idea is to still generate a PLT stub and make its GOT entry
>> point to it initially (like a normal got.plt slot).  Then the first
>> indirect call will use the address of PLT entry (starting lazy resolution)
>> and update the GOT slot with the real address, so further indirect calls
>> will directly go to the function.
>>
>> This requires a new asm marker (and hence new reloc) as normally if
>> there's a GOT slot it's filled by the real symbol's address, unlike if
>> there's only a got.plt slot.  E.g. a
>>
>>   call *foo@GOTPLT(%rip)
>>
>> would generate a GOT slot (and fill its address into above call insn), but
>> generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one.
>>
>
> I added "relax" prefix support to the x86 assembler on the
> users/hjl/relax branch at
>
> https://sourceware.org/git/?p=binutils-gdb.git;a=summary
>
> [hjl@gnu-tools-1 relax-3]$ cat r.S
> .text
> relax jmp foo
> relax call foo
> relax jmp foo@plt
> relax call foo@plt
> [hjl@gnu-tools-1 relax-3]$ ./as -o r.o r.S
> [hjl@gnu-tools-1 relax-3]$ ./objdump -drw r.o
>
> r.o:     file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <.text>:
>    0: 66 e9 00 00 00 00     data16 jmpq 0x6 2: R_X86_64_RELAX_PC32 foo-0x4
>    6: 66 e8 00 00 00 00     data16 callq 0xc 8: R_X86_64_RELAX_PC32 foo-0x4
>    c: 66 e9 00 00 00 00     data16 jmpq 0x12 e: R_X86_64_RELAX_PLT32 foo-0x4
>   12: 66 e8 00 00 00 00     data16 callq 0x18 14: R_X86_64_RELAX_PLT32 foo-0x4
> [hjl@gnu-tools-1 relax-3]$
>
> Right now, the relax relocations are treated as PC32/PLT32 relocations.
> I am working on linker support.
>

I implemented the linker support for x86-64:

00000000 <main>:
   0: 48 83 ec 08           sub    $0x8,%rsp
   4: e8 00 00 00 00       callq  9 <main+0x9> 5: R_X86_64_PC32 plt-0x4
   9: e8 00 00 00 00       callq  e <main+0xe> a: R_X86_64_PLT32 plt-0x4
   e: e8 00 00 00 00       callq  13 <main+0x13> f: R_X86_64_PC32 bar-0x4
  13: 66 e8 00 00 00 00     data16 callq 19 <main+0x19> 15: R_X86_64_RELAX_PC32 bar-0x4
  19: 66 e8 00 00 00 00     data16 callq 1f <main+0x1f> 1b: R_X86_64_RELAX_PLT32 bar-0x4
  1f: 66 e8 00 00 00 00     data16 callq 25 <main+0x25> 21: R_X86_64_RELAX_PC32 foo-0x4
  25: 66 e8 00 00 00 00     data16 callq 2b <main+0x2b> 27: R_X86_64_RELAX_PLT32 foo-0x4
  2b: 31 c0                 xor    %eax,%eax
  2d: 48 83 c4 08           add    $0x8,%rsp
  31: c3                   retq

00400460 <main>:
  400460: 48 83 ec 08           sub    $0x8,%rsp
  400464: e8 d7 ff ff ff       callq  400440 <plt@plt>
  400469: e8 d2 ff ff ff       callq  400440 <plt@plt>
  40046e: e8 ad ff ff ff       callq  400420 <bar@plt>
  400473: ff 15 ff 03 20 00     callq  *0x2003ff(%rip)        # 600878 <_DYNAMIC+0xf8>
  400479: ff 15 f9 03 20 00     callq  *0x2003f9(%rip)        # 600878 <_DYNAMIC+0xf8>
  40047f: 66 e8 f3 00 00 00     data16 callq 400578 <foo>
  400485: 66 e8 ed 00 00 00     data16 callq 400578 <foo>
  40048b: 31 c0                 xor    %eax,%eax
  40048d: 48 83 c4 08           add    $0x8,%rsp
  400491: c3                   retq
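
The targets in the linked listing above can be checked by hand: for each
of these calls, the destination (or, for the indirect form, the address of
the GOT slot) is the address of the next instruction plus the signed
32-bit displacement stored in the last four bytes. The following is a
minimal sketch of that arithmetic in C. The helper names read_disp32 and
rel32_target are made up for illustration and are not part of binutils or
of the patch; the sketch assumes the displacement is the final four bytes
of the instruction, which holds for the call forms shown here but not for
every x86 instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Read a little-endian signed 32-bit displacement from instruction bytes. */
int32_t read_disp32(const uint8_t *p)
{
  uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
  return (int32_t)u;
}

/* Hypothetical helper: address referred to by a rel32 call/jmp or by a
   RIP-relative operand.  It is the next-instruction address plus the
   displacement, assumed here to be the last four bytes of the insn.  */
uint64_t rel32_target(uint64_t insn_addr, const uint8_t *insn, unsigned len)
{
  assert(len >= 5);   /* at least one opcode byte plus four displacement bytes */
  int32_t disp = read_disp32(insn + len - 4);
  return insn_addr + len + (uint64_t)(int64_t)disp;
}
```

Feeding it the bytes at 400464 (e8 d7 ff ff ff) gives 0x400440, the PLT
entry; the bytes at 400473 (ff 15 ff 03 20 00) give 0x600878, the GOT slot
the relaxed call to bar goes through; and the bytes at 40047f
(66 e8 f3 00 00 00) give 0x400578, the local definition of foo.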

Sriraman, can you give it a try?

-- 
H.J.

* [RFC][PATCH][X86_64] Eliminate PLT stubs for specified external functions via -fno-plt=
@ 2015-05-01  0:31 Sriraman Tallam
  2015-05-01  3:21 ` Alan Modra
                   ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Sriraman Tallam @ 2015-05-01  0:31 UTC
  To: GCC Patches, H.J. Lu, David Li

Hi,

We noticed that one of our benchmarks sped-up by ~1% when we
eliminated PLT stubs for some of the hot external library functions
like memcmp, pow.  The win was from better icache and itlb
performance. The main reason was that the PLT stubs had no spatial
locality with the call-sites. I have started looking at ways to tell
the compiler to eliminate PLT stubs (in-effect inline them) for
specified external functions, for x86_64. I have a proposal and a
patch and I would like to hear what you think.

Here is a summary of what happens currently. A call to an external
function is a direct call to the PLT stub, which then jumps indirectly
through the GOT entry.  If I replace the direct call to the PLT stub
with an indirect call through a GOT entry that holds the address of
the external function, I get rid of the PLT stub.  Here is an example:

foo.cc
=====

// Truly external library function, defined in a shared library.
extern int foo ();

int main() {
  foo();
  ...
}

Currently, this is what is happening.

foo.s looks like this:

main:
.....
callq _Z3foov

but the linker redirects this call to the PLT stub of foo instead.

Function main calls the plt stub directly:

0000000000400766 <main>:
    ….
    40076a:       e8 71 fe ff ff          callq  4005e0 <_Z3foov@plt>

and the PLT stub does this:

00000000004005e0 <_Z3foov@plt>:
  4005e0:       jmpq   *0x15d2(%rip)        # 401bb8 <_GLOBAL_OFFSET_TABLE_+0x28>
  4005e6:       pushq  $0x2
  4005eb:       jmpq   4005b0 <_init+0x28>

The GOT entry at address 0x401bb8 holds the address of foo once it has
been lazily bound on the first call.

My proposal is to change foo.s to look like this:

callq *_Z3foov@GOTPCREL(%rip)

which is indirectly calling foo via a GOT entry that contains the
address of foo.  The address in the GOT entry is fixed up at load time
and the linker creates only one GOT entry per function irrespective of
the number of callers.

a.out now looks like this:

0000000000400746 <main>:
...
40074a:       ff 15 20 14 00 00       callq  *0x1420(%rip)        # 401b70 <_DYNAMIC+0x1e8>
...

Function main indirectly calls foo using the contents at location
0x401b70 which is actually a GOT entry containing the address of foo.
Notice that we have in effect inlined the PLT stub.
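
The difference between the two call paths can be modelled in plain C with
a function pointer standing in for the GOT slot. This is only a toy sketch
of the behaviour described above, not what the dynamic linker actually
does: got_foo, lazy_resolve_foo, foo_plt_stub and the other names are all
invented for illustration, and the "resolver" simply patches the pointer.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*fn_t)(void);

static int lazy_resolutions;   /* how often the simulated resolver ran */
static fn_t got_foo;           /* the single GOT slot shared by all callers */

/* Stands in for the function defined in a shared library. */
static int foo(void) { return 42; }

/* Simulated lazy resolver: reached on the first call through the stub.
   It patches the slot so that later calls go straight to foo.  */
static int lazy_resolve_foo(void)
{
  lazy_resolutions++;
  got_foo = foo;
  return got_foo();
}

/* PLT-style stub: callers call this directly, it jumps through the slot. */
static int foo_plt_stub(void) { return got_foo(); }

/* Lazy binding: the slot starts out pointing at the resolver. */
void bind_lazily(void) { got_foo = lazy_resolve_foo; lazy_resolutions = 0; }

/* Proposed scheme: the slot is filled with foo's address at start-up. */
void bind_at_startup(void) { got_foo = foo; lazy_resolutions = 0; }

/* Today: direct call to the stub, then an indirect jump. */
int call_via_plt(void) { return foo_plt_stub(); }

/* Proposed: one indirect call through the slot, no stub in between. */
int call_via_got(void)
{
  assert(got_foo != NULL);   /* slot must have been bound first */
  return got_foo();
}

int lazy_resolution_count(void) { return lazy_resolutions; }
```

After bind_lazily(), the first call_via_plt() runs the resolver once and
later calls do not; after bind_at_startup(), call_via_got() reaches foo
without the resolver ever running, which mirrors the loss of lazy binding
mentioned in the caveats.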

This comes with caveats.  This cannot be done in general for all
functions marked extern, as the compiler cannot tell whether a
function is "truly extern" (defined in a shared library). If a
function is not truly extern (it ends up defined in the final
executable), then calling it indirectly is a performance penalty, as
it could have been a direct call.  Further, the newly created GOT
entries are fixed up at start-up and do not get lazily bound.

Given this, I propose adding a new option called
-fno-plt=<function-name> to the compiler.  This tells the compiler
that we know that the function is truly extern and we want the
indirect call only for these call-sites.  I have attached a patch that
adds -fno-plt= to GCC.  Any number of "-fno-plt=" can be specified and
all call-sites corresponding to these named functions will be done
indirectly using the mechanism described above without the use of a
PLT stub.
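
The selection logic behind the option can be sketched outside of GCC as
follows. This is a stand-alone illustration under simplifying assumptions,
not the attached patch: the patch keeps the names in a libiberty hash
table and picks an output template in ix86_output_call_insn, whereas this
sketch uses a small fixed array, and add_no_plt_name, avoid_plt_for and
format_call are hypothetical names. As far as I can tell from the patch,
the name is compared against the assembler symbol name, so a C++ function
would have to be given in its mangled form (e.g. _Z3foov).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NO_PLT 16   /* hypothetical fixed capacity, for illustration only */

static const char *no_plt_names[MAX_NO_PLT];
static size_t no_plt_count;

/* Record one -fno-plt=<name> argument; returns 0 on success, -1 if full. */
int add_no_plt_name(const char *name)
{
  assert(name != NULL);
  if (no_plt_count == MAX_NO_PLT)
    return -1;
  no_plt_names[no_plt_count++] = name;
  return 0;
}

/* Return 1 if calls to NAME were requested to bypass the PLT, else 0. */
int avoid_plt_for(const char *name)
{
  for (size_t i = 0; i < no_plt_count; i++)
    if (strcmp(no_plt_names[i], name) == 0)
      return 1;
  return 0;
}

/* Write the call instruction for NAME into BUF: an indirect call through
   the GOT for listed names, an ordinary direct call otherwise.  */
int format_call(char *buf, size_t size, const char *name)
{
  if (avoid_plt_for(name))
    return snprintf(buf, size, "call\t*%s@GOTPCREL(%%rip)", name);
  return snprintf(buf, size, "call\t%s", name);
}
```

With "_Z3foov" registered, format_call emits the indirect GOTPCREL form
for that symbol and leaves every other call, such as one to memcmp, as a
plain direct call.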

Alternatively, we can do this entirely in the linker.  We can
introduce a new relocation type to tell the linker to convert all
direct calls to truly extern functions into indirect calls via GOT
entries.  The GCC patch just seems simpler.
Also, we could link statically, but we do not want that.  Or we could
copy the specific external functions into our executable, but while
this might work for executable A, a different set of external
functions might be hot for executable B. We want a more general
solution.


Please let me know what you think.

Thanks
Sri

[-- Attachment #2: avoid_plt_patch.txt --]
[-- Type: text/plain, Size: 4091 bytes --]

	* common.opt (-fno-plt=): New option.
	* config/i386/i386.c (avoid_plt_to_call): New function.
	(ix86_output_call_insn):  Check if PLT needs to be avoided
	and call or jump indirectly if true.
	* opts-global.c (htab_str_eq): New function.
	(avoid_plt_fnsymbol_names_tab): New htab.
	(handle_common_deferred_options): Handle -fno-plt=

Index: common.opt
===================================================================
--- common.opt	(revision 222641)
+++ common.opt	(working copy)
@@ -1087,6 +1087,11 @@ fdbg-cnt=
 Common RejectNegative Joined Var(common_deferred_options) Defer
 -fdbg-cnt=<counter>:<limit>[,<counter>:<limit>,...]	Set the debug counter limit.   
 
+fno-plt=
+Common RejectNegative Joined Var(common_deferred_options) Defer
+-fno-plt=<symbol1>  Avoid going through the PLT when calling the specified function.
+Allow multiple instances of this option with different function names.
+
 fdebug-prefix-map=
 Common Joined RejectNegative Var(common_deferred_options) Defer
 Map one directory name to another in debug information
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 222641)
+++ config/i386/i386.c	(working copy)
@@ -25282,6 +25282,25 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call
   return call;
 }
 
+extern htab_t avoid_plt_fnsymbol_names_tab;
+/* If the call target referenced by call_op is an external function
+   and calls via PLT must be avoided as specified by -fno-plt=, then
+   return true.  */
+
+static int
+avoid_plt_to_call (rtx call_op)
+{
+  const char *name;
+  if (GET_CODE (call_op) != SYMBOL_REF
+      || SYMBOL_REF_LOCAL_P (call_op)
+      || avoid_plt_fnsymbol_names_tab == NULL)
+    return 0;
+  name = XSTR (call_op, 0);
+  if (htab_find_slot (avoid_plt_fnsymbol_names_tab, name, NO_INSERT) != NULL)
+    return 1;
+  return 0;
+}
+
 /* Output the assembly for a call instruction.  */
 
 const char *
@@ -25294,7 +25313,12 @@ ix86_output_call_insn (rtx insn, rtx call_op)
   if (SIBLING_CALL_P (insn))
     {
       if (direct_p)
-	xasm = "jmp\t%P0";
+	{
+	  if (avoid_plt_to_call (call_op))
+	    xasm = "jmp\t*%p0@GOTPCREL(%%rip)";
+	  else
+	    xasm = "jmp\t%P0";
+	}
       /* SEH epilogue detection requires the indirect branch case
 	 to include REX.W.  */
       else if (TARGET_SEH)
@@ -25346,9 +25370,15 @@ ix86_output_call_insn (rtx insn, rtx call_op)
     }
 
   if (direct_p)
-    xasm = "call\t%P0";
+    {
+      if (avoid_plt_to_call (call_op))
+        xasm = "call\t*%p0@GOTPCREL(%%rip)";
+      else
+        xasm = "call\t%P0";
+    }
   else
     xasm = "call\t%A0";
+ 
 
   output_asm_insn (xasm, &call_op);
 
Index: opts-global.c
===================================================================
--- opts-global.c	(revision 222641)
+++ opts-global.c	(working copy)
@@ -47,6 +47,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "xregex.h"
 #include "attribs.h"
 #include "stringpool.h"
+#include "hash-table.h"
 
 typedef const char *const_char_p; /* For DEF_VEC_P.  */
 
@@ -420,6 +421,17 @@ decode_options (struct gcc_options *opts, struct g
   finish_options (opts, opts_set, loc);
 }
 
+/* Helper function for the hash table that compares the
+   existing entry (S1) with the given string (S2).  */
+
+static int
+htab_str_eq (const void *s1, const void *s2)
+{
+  return !strcmp ((const char *) s1, (const char *) s2);
+}
+
+htab_t avoid_plt_fnsymbol_names_tab = NULL;
+
 /* Process common options that have been deferred until after the
    handlers have been called for all options.  */
 
@@ -539,6 +551,15 @@ handle_common_deferred_options (void)
 	  stack_limit_rtx = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup (opt->arg));
 	  break;
 
+	case OPT_fno_plt_:
+	  void **slot;
+	  if (avoid_plt_fnsymbol_names_tab == NULL)
+	    avoid_plt_fnsymbol_names_tab = htab_create (10, htab_hash_string,
+							htab_str_eq, NULL);
+	  slot = htab_find_slot (avoid_plt_fnsymbol_names_tab, opt->arg, INSERT);
+	  *slot = (void *) opt->arg;
+	  break;
+
 	default:
 	  gcc_unreachable ();
 	}


end of thread, other threads:[~2015-07-24 18:44 UTC | newest]

Thread overview: 65+ messages
2015-05-10 15:19 [RFC][PATCH][X86_64] Eliminate PLT stubs for specified external functions via -fno-plt= H.J. Lu
     [not found] ` <CAAs8HmwWSDY+KjKcB4W=TiYV0Pz7NSvfL_8igp+hPT-LU1utTg@mail.gmail.com>
2015-05-21 21:31   ` Sriraman Tallam
2015-05-21 21:39     ` Sriraman Tallam
2015-05-21 22:02     ` Pedro Alves
2015-05-21 22:02       ` Jakub Jelinek
2015-05-22  1:47         ` H.J. Lu
2015-05-22  3:38         ` Xinliang David Li
2015-05-21 22:34       ` Sriraman Tallam
2015-05-22  9:22         ` Pedro Alves
2015-05-22 15:13           ` Sriraman Tallam
2015-05-28 18:53           ` Sriraman Tallam
2015-05-28 19:05             ` H.J. Lu
2015-05-28 19:48               ` Sriraman Tallam
2015-05-28 20:19                 ` H.J. Lu
2015-05-28 21:27                   ` Sriraman Tallam
2015-05-28 21:31                     ` H.J. Lu
2015-05-28 21:52                       ` Sriraman Tallam
2015-05-28 22:48                         ` H.J. Lu
2015-05-29  3:51                           ` Sriraman Tallam
2015-05-29  5:13                             ` H.J. Lu
2015-05-29  7:13                               ` Sriraman Tallam
2015-05-29 17:36                                 ` Sriraman Tallam
2015-05-29 17:52                                   ` H.J. Lu
2015-05-29 18:33                                     ` Sriraman Tallam
2015-05-29 20:50                                 ` Jan Hubicka
2015-05-29 22:56                                   ` Sriraman Tallam
2015-05-29 23:08                                     ` Sriraman Tallam
     [not found]                                     ` <CAJA7tRYsMiq7rx34c=z6KwRdwYxxaeP6Z6qzA4XEwnJSMT7z=Q@mail.gmail.com>
2015-05-30  4:44                                       ` Sriraman Tallam
2015-06-01  8:24                                         ` Ramana Radhakrishnan
2015-06-01 18:01                                           ` Sriraman Tallam
2015-06-01 18:41                                             ` Ramana Radhakrishnan
2015-06-01 18:55                                               ` Sriraman Tallam
2015-06-01 20:33                                                 ` Ramana Radhakrishnan
2015-06-02 18:27                                                   ` Sriraman Tallam
2015-06-02 19:59                                                     ` Bernhard Reutner-Fischer
2015-06-02 20:09                                                       ` Sriraman Tallam
2015-06-02 21:18                                                         ` Bernhard Reutner-Fischer
2015-06-02 21:09                                                     ` Ramana Radhakrishnan
2015-06-02 21:25                                                       ` Xinliang David Li
2015-06-02 21:52                                                         ` Bernhard Reutner-Fischer
2015-06-02 21:40                                                       ` Sriraman Tallam
2015-06-03 14:37                                                         ` Ramana Radhakrishnan
2015-06-03 18:53                                                           ` Sriraman Tallam
2015-06-03 20:16                                                             ` Richard Henderson
2015-06-03 20:59                                                               ` Sriraman Tallam
2015-06-04 16:56                                                                 ` Sriraman Tallam
2015-06-04 17:30                                                                   ` Richard Henderson
2015-06-04 21:34                                                                     ` Sriraman Tallam
2015-07-24 19:02                                                                   ` H.J. Lu
2015-06-03 19:57                                                       ` Richard Henderson
  -- strict thread matches above, loose matches on Subject: below --
2015-05-01  0:31 Sriraman Tallam
2015-05-01  3:21 ` Alan Modra
2015-05-01  3:26   ` Sriraman Tallam
2015-05-01 15:01 ` Andi Kleen
2015-05-01 16:19   ` Xinliang David Li
2015-05-01 16:23     ` H.J. Lu
2015-05-01 16:26       ` Xinliang David Li
2015-05-01 18:06         ` Sriraman Tallam
2015-05-02 12:12           ` Andi Kleen
2015-05-01 17:50   ` Sriraman Tallam
2015-05-04 14:45 ` Michael Matz
2015-05-04 16:43   ` Xinliang David Li
2015-05-04 16:58     ` Michael Matz
2015-05-04 17:22       ` Xinliang David Li
2015-05-09 16:35   ` H.J. Lu
