From: Alexander Monakov
To: gcc-patches@gcc.gnu.org
Cc: Alexander Monakov, Rich Felker, Sriraman Tallam
Subject: PIC calls without PLT, generic implementation
Date: Mon, 04 May 2015 16:38:00 -0000
Message-Id: <1430757479-14241-1-git-send-email-amonakov@ispras.ru>

A recent post by Sriraman prompts me to post my -fno-plt approach sooner
rather than later; I was working on no-PLT PIC codegen in the last few
days too.

Although I'm posting a patch series, half of it is i386 backend tuning
and can go in independently.  Except for one patch where it's noted
specifically, the patches were bootstrapped and regtested together, not
separately, on x86-64.  Likewise, the improvement claimed below is
obtained with GCC with all patches applied, the difference being only in
the -fno-plt flag.

The approach taken here is different.
Instead of adjusting call expansion in the back end, I force the callee
address to be loaded into a pseudo at RTL expansion time, similar to
"function CSE", which is not enabled on most targets.  The address load
(which loads from the GOT) can be moved out of loops, scheduled, or, on
x86, re-fused with the indirect jump by peepholes.  On 32-bit x86, it
also allows the compiler to use registers other than %ebx for the GOT
pointer (which can be a win since %ebx is callee-saved).

The benefit of the PLT is the possibility of lazy relocation.  It is not
possible with BIND_NOW, in particular when the -z relro -z now flags were
used at link time as a security hardening measure.  Performance-critical
executables do not particularly need the PLT and lazy relocation either,
except if they are run very frequently, with each individual run time
extremely small -- but in that case they can benefit massively from
static linking or less massively from prelinking, and with prelinking
they can get the benefit of no-PLT.

I've used LLVM/Clang to evaluate the performance impact of PLT-less PIC
codegen.  I configured with

  cmake -DLLVM_ENABLE_PIC=ON -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=OFF

from the 3.6 release branch; this configuration mimics the non-static
build that e.g. OpenSUSE is using, and produces a Clang dependent on 112
clang/llvm shared libraries, with roughly 24000 externally visible
functions.

Without input files, time is mostly spent on dynamic linking, so without
prelink there's a predictable regression for no-PLT, from 55 to 140 ms.

On C++ hello world, I get:

           PLT      no-PLT   PLT+BIND_NOW
[32bit]    430 ms   535 ms   590 ms
[64bit]    410 ms   495 ms   555 ms

So no-PLT is >20% slower than the default, but already >10% faster than
PLT when non-lazy binding is forced.

On tramp3d compilation with -O2 -g I get:

           PLT      no-PLT
[32bit]    49.0 s   43.3 s
[64bit]    41.6 s   36.8 s

So on long-running compiles -fno-plt is a very significant win (about 12%
less time in both cases).
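As a rough source-level picture of the expansion-time approach described
above (callee address loaded into a pseudo, then hoisted out of a loop),
the following C sketch may help.  It is only an analogy, not the actual
RTL or the patch itself: external_callee, sum_direct and sum_hoisted are
made-up names, and the function pointer merely models the value that
would be loaded from the GOT.

```c
#include <stddef.h>

/* Hypothetical stand-in for a function defined in another shared
   library.  In a real PIC build its address would come from the GOT. */
int external_callee(int x)
{
    return x + 1;
}

/* Models the PLT style: every iteration names the symbol directly, which
   in PIC code would normally be a call through a PLT stub. */
long sum_direct(size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += external_callee((int)i);
    return total;
}

/* Models the no-PLT style: the callee address is loaded once into a
   local (playing the role of the pseudo), outside the loop, and every
   iteration is an indirect call through that value. */
long sum_hoisted(size_t n)
{
    int (*callee)(int) = external_callee; /* models the single GOT load */
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += callee((int)i);
    return total;
}
```

Both functions compute the same result; the only difference is where the
callee address comes from on each call, which is the part the optimizers
get to move or schedule once it is an ordinary pseudo.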
Note that I'm using Clang as a (perhaps extreme) example of
PIC-call-intensive code, but the argument about -fno-plt being useful for
performance should apply generally.

When looking at code size changes, there's a 1% improvement on 32-bit
libstdc++ and a small regression on 64-bit.  On LLVM/Clang, there's an
overall size regression on both 32-bit and 64-bit; I've tried to analyze
it and so far have come up with one possible cause, which is detailed in
the IRA REG_EQUIV patch.

Thanks.
Alexander