From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-396787-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 75603 invoked by alias); 4 May 2015 16:38:22 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 75591 invoked by uid 89); 4 May 2015 16:38:21 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.5 required=5.0 tests=AWL,BAYES_00,KAM_LAZY_DOMAIN_SECURITY,T_RP_MATCHES_RCVD autolearn=no version=3.3.2
X-HELO: smtp.ispras.ru
Received: from smtp.ispras.ru (HELO smtp.ispras.ru) (83.149.199.79) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 04 May 2015 16:38:20 +0000
Received: from condor.intra.ispras.ru (unknown [83.149.199.91])	by smtp.ispras.ru (Postfix) with ESMTP id B527A214EA;	Mon,  4 May 2015 19:38:16 +0300 (MSK)
Received: by condor.intra.ispras.ru (Postfix, from userid 23246)	id 3D11A1227575; Mon,  4 May 2015 19:38:16 +0300 (MSK)
From: Alexander Monakov <amonakov@ispras.ru>
To: gcc-patches@gcc.gnu.org
Cc: Alexander Monakov <amonakov@ispras.ru>,	Rich Felker <dalias@libc.org>,	Sriraman Tallam <tmsriram@google.com>
Subject: PIC calls without PLT, generic implementation
Date: Mon, 04 May 2015 16:38:00 -0000
Message-Id: <1430757479-14241-1-git-send-email-amonakov@ispras.ru>
X-IsSubscribed: yes
X-SW-Source: 2015-05/txt/msg00225.txt.bz2

A recent post by Sriraman prompts me to post my -fno-plt approach sooner rather
than later; I have been working on no-PLT PIC codegen in the last few days too.
Although I'm posting a patch series, half of it is i386 backend tuning and can
go in independently.  Except for one patch, where it is noted specifically, the
patches were bootstrapped and regtested together, not separately, on x86-64.
Likewise, the improvement claimed below was obtained with GCC with all patches
applied, the only difference being the -fno-plt flag.

The approach taken here is different.  Instead of adjusting call expansion in
the back end, I force the callee address to be loaded into a pseudo at RTL
expansion time, similar to "function CSE", which is not enabled on most
targets.  The address load (which loads from the GOT) can be moved out of
loops, scheduled, or, on x86, re-fused with the indirect jump by peepholes.  On
32-bit x86, it also allows the compiler to use registers other than %ebx for
the GOT pointer (which can be a win since %ebx is callee-saved).
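
As a rough source-level analogy only (the patch itself works on RTL, not on C
source), hoisting the address load out of a loop looks roughly like the sketch
below.  The names sum_abs and fn are made up for illustration, and an
optimizing compiler is free to transform this source differently.

```c
#include <stdlib.h>

/* Hypothetical helper: sums |v[i]| while calling the external function
   through a pointer that is initialized once, before the loop.  */
long sum_abs(const int *v, int n)
{
    /* Take the callee's address a single time.  In PIC code built without
       a PLT this would conceptually correspond to one load from the GOT.  */
    int (*fn)(int) = abs;
    long sum = 0;

    for (int i = 0; i < n; i++)
        sum += fn(v[i]);   /* indirect call through the saved address */

    return sum;
}
```

With a PLT, each call in the loop would instead go through the PLT stub, so
there is no separate address load for the optimizers to move or schedule.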

The benefit of the PLT is the possibility of lazy relocation.  Lazy relocation
is not possible with BIND_NOW, in particular when the -z relro -z now flags are
used at link time as a security hardening measure.  Performance-critical
executables do not particularly need the PLT and lazy relocation either, unless
they are run very frequently with each individual run being extremely short;
but in that case they can benefit massively from static linking, or less
massively from prelinking, and with prelinking they can also get the benefit of
no-PLT.

I've used LLVM/Clang to evaluate performance impact of PLT-less PIC codegen.
I configured with
  cmake -DLLVM_ENABLE_PIC=ON -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=OFF
from the 3.6 release branch; this configuration mimics the non-static build
that e.g. OpenSUSE is using, and produces a Clang that depends on 112
clang/llvm shared libraries, with roughly 24000 externally visible functions.

Without input files, time is mostly spent on dynamic linking, so without
prelink there's a predictable regression, from 55 to 140 ms.  On a C++ hello
world, I get:
            PLT   no-PLT  PLT+BIND_NOW
[32bit]  430 ms   535 ms  590 ms
[64bit]  410 ms   495 ms  555 ms

So no-PLT is >20% slower than the default (lazily bound) PLT, but already >10%
faster than PLT when non-lazy binding is forced.

On tramp3d compilation with -O2 -g I get:
            PLT   no-PLT
[32bit]  49.0 s   43.3 s
[64bit]  41.6 s   36.8 s

So on long-running compiles -fno-plt is a very significant win.  Note that I'm
using Clang as a (perhaps extreme) example of PIC-call-intensive code, but the
argument about -fno-plt being useful for performance should apply generally.
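
For anyone who wants to redo the arithmetic on the timings above, here is a
small sketch.  The helper names overhead_pct and report are made up for this
example; the inputs are just the numbers from the two tables, and "overhead" is
taken as the slower time divided by the faster time, minus one.

```c
#include <stdio.h>

/* Percentage by which `slower` exceeds `faster` (same unit for both).  */
double overhead_pct(double slower, double faster)
{
    return (slower / faster - 1.0) * 100.0;
}

/* Prints the relative differences for the measurements quoted above.  */
void report(void)
{
    /* C++ hello world, times in ms.  */
    printf("hello 32bit: no-PLT vs PLT          %+.1f%%\n", overhead_pct(535, 430));
    printf("hello 64bit: no-PLT vs PLT          %+.1f%%\n", overhead_pct(495, 410));
    printf("hello 32bit: PLT+BIND_NOW vs no-PLT %+.1f%%\n", overhead_pct(590, 535));
    printf("hello 64bit: PLT+BIND_NOW vs no-PLT %+.1f%%\n", overhead_pct(555, 495));

    /* tramp3d at -O2 -g, times in s.  */
    printf("tramp3d 32bit: PLT vs no-PLT        %+.1f%%\n", overhead_pct(49.0, 43.3));
    printf("tramp3d 64bit: PLT vs no-PLT        %+.1f%%\n", overhead_pct(41.6, 36.8));
}
```

Calling report() from a main function prints one line per comparison.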

When looking at code size changes, there's a 1% improvement on 32-bit
libstdc++ and a small regression on 64-bit.  On LLVM/Clang, there's an overall
size regression on both 32-bit and 64-bit; I've tried to analyze it and so far
have come up with one possible cause, which is detailed in the IRA REG_EQUIV
patch.

Thanks.
Alexander