Date: Sun, 10 May 2015 15:19:00 -0000
Subject: Re: [RFC][PATCH][X86_64] Eliminate PLT stubs for specified external functions via -fno-plt=
From: "H.J. Lu"
To: Michael Matz
Cc: Sriraman Tallam, GCC Patches, David Li

On Sat, May 9, 2015 at 9:34 AM, H.J. Lu wrote:
> On Mon, May 4, 2015 at 7:45 AM, Michael Matz wrote:
>> Hi,
>>
>> On Thu, 30 Apr 2015, Sriraman Tallam wrote:
>>
>>> We noticed that one of our benchmarks sped up by ~1% when we eliminated
>>> PLT stubs for some of the hot external library functions like memcmp,
>>> pow. The win was from better icache and itlb performance. The main
>>> reason was that the PLT stubs had no spatial locality with the
>>> call-sites. I have started looking at ways to tell the compiler to
>>> eliminate PLT stubs (in effect inline them) for specified external
>>> functions, for x86_64. I have a proposal and a patch and I would like to
>>> hear what you think.
>>>
>>> This comes with caveats. This cannot be generally done for all
>>> functions marked extern as it is impossible for the compiler to say if a
>>> function is "truly extern" (defined in a shared library). If a function
>>> is not truly extern (ends up defined in the final executable), then
>>> calling it indirectly is a performance penalty as it could have been a
>>> direct call.
>>
>> This can be fixed by Alan's idea.
>>
>>> Further, the newly created GOT entries are fixed up at
>>> start-up and do not get lazily bound.
>>
>> And this can be fixed by some enhancements in the linker and dynamic
>> linker. The idea is to still generate a PLT stub and make its GOT entry
>> point to it initially (like a normal got.plt slot). Then the first
>> indirect call will use the address of the PLT entry (starting lazy
>> resolution) and update the GOT slot with the real address, so further
>> indirect calls will directly go to the function.
>>
>> This requires a new asm marker (and hence new reloc) as normally if
>> there's a GOT slot it's filled by the real symbol's address, unlike if
>> there's only a got.plt slot. E.g. a
>>
>>   call *foo@GOTPLT(%rip)
>>
>> would generate a GOT slot (and fill its address into the above call insn),
>> but generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one.
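
To make the quoted scheme concrete, here is a toy model in Python. It is
not linker or dynamic linker code. It only assumes what Michael describes:
the GOT slot starts out pointing at the PLT stub, the first indirect call
goes through the stub, and the stub overwrites the slot with the real
address. ToyProcess, call_indirect and the resolutions counter are
hypothetical names made up for this sketch.

```python
class ToyProcess:
    """Toy model of a lazily bound GOT slot (illustration only)."""

    def __init__(self, symbols):
        # symbols: name -> real function, standing in for what the
        # dynamic linker would find in a shared library.
        self.symbols = symbols
        self.resolutions = 0  # how many times lazy resolution ran
        # JUMP_SLOT-like start state: every slot points at its PLT stub.
        self.got = {name: self._make_plt_stub(name) for name in symbols}

    def _make_plt_stub(self, name):
        def plt_stub(*args):
            # First call: resolve the symbol, patch the GOT slot,
            # then forward the call to the real function.
            self.resolutions += 1
            target = self.symbols[name]
            self.got[name] = target
            return target(*args)
        return plt_stub

    def call_indirect(self, name, *args):
        # Stands in for `call *foo@GOTPLT(%rip)`: jump to whatever
        # address the GOT slot currently holds.
        return self.got[name](*args)


def real_foo(x):
    return x + 1


proc = ToyProcess({"foo": real_foo})
first = proc.call_indirect("foo", 1)    # goes through the stub
second = proc.call_indirect("foo", 41)  # goes straight to real_foo
print(first, second, proc.resolutions)
```

After the first call the slot holds the real function, so later calls skip
the stub. With a GLOB_DAT-style slot the model would instead fill the slot
with real_foo in __init__, i.e. resolve everything at start-up.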
>>
>
> I added the "relax" prefix support to x86 assembler on users/hjl/relax
> branch at
>
> https://sourceware.org/git/?p=binutils-gdb.git;a=summary
>
> [hjl@gnu-tools-1 relax-3]$ cat r.S
> .text
> relax jmp foo
> relax call foo
> relax jmp foo@plt
> relax call foo@plt
> [hjl@gnu-tools-1 relax-3]$ ./as -o r.o r.S
> [hjl@gnu-tools-1 relax-3]$ ./objdump -drw r.o
>
> r.o:     file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <.text>:
>    0: 66 e9 00 00 00 00    data16 jmpq 0x6      2: R_X86_64_RELAX_PC32 foo-0x4
>    6: 66 e8 00 00 00 00    data16 callq 0xc     8: R_X86_64_RELAX_PC32 foo-0x4
>    c: 66 e9 00 00 00 00    data16 jmpq 0x12     e: R_X86_64_RELAX_PLT32 foo-0x4
>   12: 66 e8 00 00 00 00    data16 callq 0x18   14: R_X86_64_RELAX_PLT32 foo-0x4
> [hjl@gnu-tools-1 relax-3]$
>
> Right now, the relax relocations are treated as PC32/PLT32 relocations.
> I am working on linker support.
>

I implemented the linker support for x86-64:

00000000 :
   0: 48 83 ec 08          sub    $0x8,%rsp
   4: e8 00 00 00 00       callq  9             5: R_X86_64_PC32 plt-0x4
   9: e8 00 00 00 00       callq  e             a: R_X86_64_PLT32 plt-0x4
   e: e8 00 00 00 00       callq  13            f: R_X86_64_PC32 bar-0x4
  13: 66 e8 00 00 00 00    data16 callq 19     15: R_X86_64_RELAX_PC32 bar-0x4
  19: 66 e8 00 00 00 00    data16 callq 1f     1b: R_X86_64_RELAX_PLT32 bar-0x4
  1f: 66 e8 00 00 00 00    data16 callq 25     21: R_X86_64_RELAX_PC32 foo-0x4
  25: 66 e8 00 00 00 00    data16 callq 2b     27: R_X86_64_RELAX_PLT32 foo-0x4
  2b: 31 c0                xor    %eax,%eax
  2d: 48 83 c4 08          add    $0x8,%rsp
  31: c3                   retq

00400460 :
  400460: 48 83 ec 08          sub    $0x8,%rsp
  400464: e8 d7 ff ff ff       callq  400440
  400469: e8 d2 ff ff ff       callq  400440
  40046e: e8 ad ff ff ff       callq  400420
  400473: ff 15 ff 03 20 00    callq  *0x2003ff(%rip)        # 600878 <_DYNAMIC+0xf8>
  400479: ff 15 f9 03 20 00    callq  *0x2003f9(%rip)        # 600878 <_DYNAMIC+0xf8>
  40047f: 66 e8 f3 00 00 00    data16 callq 400578
  400485: 66 e8 ed 00 00 00    data16 callq 400578
  40048b: 31 c0                xor    %eax,%eax
  40048d: 48 83 c4 08          add    $0x8,%rsp
  400491: c3                   retq

Sriraman, can you give it a try?

-- 
H.J.
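
As a cross-check on the linked output above, the addresses can be
recomputed from the instruction encodings. This is a minimal sketch in
Python, assuming only that the rel32 of the e8 form and the disp32 of the
RIP-relative ff 15 form are both relative to the address of the next
instruction. rel32_target is a hypothetical helper name.

```python
def rel32_target(addr, encoding_hex):
    """Return next-instruction address plus the signed 32-bit displacement.

    addr is the address of the instruction and encoding_hex its full
    encoding, prefix and opcode included. The displacement is assumed to
    be the last four bytes, little-endian, which holds for the e8 and
    ff 15 forms in the listing above.
    """
    raw = bytes.fromhex(encoding_hex)
    disp = int.from_bytes(raw[-4:], "little", signed=True)
    return addr + len(raw) + disp


# Direct call (e8 rel32): the result is the branch target.
print(hex(rel32_target(0x400464, "e8 d7 ff ff ff")))     # 0x400440

# Indirect call (ff 15 disp32): the result is the address of the
# memory slot that holds the target, not the target itself.
print(hex(rel32_target(0x400473, "ff 15 ff 03 20 00")))  # 0x600878
print(hex(rel32_target(0x400479, "ff 15 f9 03 20 00")))  # 0x600878

# Direct call that kept its data16 prefix: still six bytes long.
print(hex(rel32_target(0x40047f, "66 e8 f3 00 00 00")))  # 0x400578
```

Both ff 15 calls resolve to the same slot at 0x600878, and both prefixed
e8 calls reach 0x400578, matching the objdump annotations.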