Hi,

We noticed that one of our benchmarks sped up by ~1% when we
eliminated PLT stubs for some of the hot external library functions
such as memcmp and pow.  The win came from better icache and iTLB
performance, mainly because the PLT stubs had no spatial locality
with the call sites. I have started looking at ways to tell the
compiler to eliminate PLT stubs (in effect, inline them) for
specified external functions on x86_64. I have a proposal and a
patch, and I would like to hear what you think.

Here is a summary of what happens currently. A call to an
external function is a direct call, but it targets the PLT stub, which
then jumps indirectly through the GOT entry.  If I replace the direct
call to the PLT stub with an indirect call through a GOT entry that
holds the address of the external function, I get rid of the PLT
stub.  Here is an example:

foo.cc
=====

extern int foo ();  // Truly external library function, defined in a
                    // shared library.

int main() {
  foo();
  ...
}

Currently, this is what is happening.

foo.s looks like this:

main:
.....
callq _Z3foov

but the linker redirects this call to the PLT stub of foo.

Function main calls the PLT stub directly:

0000000000400766 <main>:
    ….
    40076a:       e8 71 fe ff ff          callq  4005e0 <_Z3foov@plt>

and the PLT stub does this:

00000000004005e0 <_Z3foov@plt>:
  4005e0:       jmpq   *0x15d2(%rip)        # 401bb8 <_GLOBAL_OFFSET_TABLE_+0x28>
  4005e6:       pushq  $0x2
  4005eb:       jmpq   4005b0 <_init+0x28>

The GOT entry at address 0x401bb8 contains the address of foo which
will be lazily bound.
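
To make the lazy-binding path concrete, here is a toy C++ model of it. This is only a sketch: got_foo, plt_foo, lazy_resolve_foo, real_foo and resolver_calls are hypothetical names, and a plain function pointer stands in for the GOT entry. The point is that the entry initially leads to the resolver path, and the first call patches it to the real address.

```cpp
#include <cassert>

// Toy model of lazy binding through a PLT stub. All names are hypothetical.
using Fn = int (*)();

int resolver_calls = 0;        // how many times the "dynamic linker" ran

int real_foo() { return 42; }  // stands in for foo() in the shared library

int lazy_resolve_foo();        // resolver path, defined below

// Stands in for foo's GOT entry. Before the first call it leads to the
// resolver path rather than to foo itself.
Fn got_foo = lazy_resolve_foo;

// Models the "pushq / jmpq" tail of the PLT stub: resolve the symbol,
// patch the GOT entry, then continue into the real function.
int lazy_resolve_foo() {
    ++resolver_calls;
    got_foo = real_foo;
    return got_foo();
}

// Models the head of the PLT stub: an indirect jump through the GOT entry.
int plt_foo() {
    assert(got_foo != nullptr);
    return got_foo();
}
```

After the first call to plt_foo(), every later call goes through the patched entry and never reaches the resolver again.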

What my proposal does is change foo.s to look like this:

callq *_Z3foov@GOTPCREL(%rip)

which calls foo indirectly through a GOT entry that contains the
address of foo.  The address in the GOT entry is fixed up at load time
and the linker creates only one GOT entry per function irrespective of
the number of callers.

a.out now looks like this:

0000000000400746 <main>:
...
40074a:       ff 15 20 14 00 00       callq  *0x1420(%rip)        # 401b70 <_DYNAMIC+0x1e8>
...

Function main indirectly calls foo using the contents at location
0x401b70 which is actually a GOT entry containing the address of foo.
Notice that we have in effect inlined the PLT stub.

This comes with caveats.  It cannot be done in general for all
functions marked extern, because the compiler cannot tell whether a
function is "truly extern" (defined in a shared library). If a
function is not truly extern (it ends up defined in the final
executable), then calling it indirectly is a performance penalty, as
it could have been a direct call.  Further, the newly created GOT
entries are fixed up at start-up and do not get lazily bound.

Given this, I propose adding a new option called
-fno-plt=<function-name> to the compiler.  This tells the compiler
that we know the function is truly extern and that we want the
indirect call only for calls to it.  I have attached a patch that
adds -fno-plt= to GCC.  Any number of "-fno-plt=" options can be
specified, and all calls to the named functions will be made
indirectly using the mechanism described above, without the use of a
PLT stub.
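
As an illustration of the intended semantics only (this is not code from the attached patch), the handling of repeated options could be pictured like this in C++. The helper names collect_no_plt_names and call_without_plt are hypothetical, and the sketch assumes the name is matched exactly as written on the command line; whether that should be the source name or the mangled name is left open here.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: gather the names given by repeated
// "-fno-plt=<function-name>" arguments. Not taken from the GCC patch.
std::set<std::string> collect_no_plt_names(const std::vector<std::string> &args) {
    const std::string prefix = "-fno-plt=";
    std::set<std::string> names;
    for (const std::string &arg : args) {
        // Accept only arguments that start with the prefix and carry a name.
        if (arg.size() > prefix.size() &&
            arg.compare(0, prefix.size(), prefix) == 0) {
            names.insert(arg.substr(prefix.size()));
        }
    }
    return names;
}

// Hypothetical predicate: should a call to `callee` bypass the PLT stub?
// Assumption: the name is compared exactly as it was written in the option.
bool call_without_plt(const std::set<std::string> &names,
                      const std::string &callee) {
    return names.count(callee) != 0;
}
```

With the arguments "-fno-plt=memcmp -fno-plt=pow", only calls to memcmp and pow would be emitted as indirect calls through the GOT; every other external call would keep its PLT stub.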

Alternatively, we can do this entirely in the linker.  We can
introduce a new relocation type to tell the linker to convert all
direct calls to truly extern functions into indirect calls via GOT
entries.  The GCC patch just seems simpler.

Also, we could link statically, but we do not want that. Or we could
copy the specific external functions into our executable. That might
work for executable A, but a different set of external functions might
be hot for executable B. We want a more general solution.


Please let me know what you think.

Thanks,
Sri