From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 28 Feb 2022 16:04:34 -0800
Message-ID: <CAMe9rOoVn0LKNCjiQKj31Fyoq_i8CsCvQzmiDvTsEUJCTd1TvQ@mail.gmail.com>
Subject: Re: x86-64: new CET-enabled PLT format proposal
To: Rui Ueyama <rui314@gmail.com>, "Moreira, Joao" <joao.moreira@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>, x86-64-abi <x86-64-abi@googlegroups.com>,
 Binutils <binutils@sourceware.org>

On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote:
>
> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
> > <binutils@sourceware.org> wrote:
> > >
> > > Hello,
> > >
> > > I'd like to propose an alternative instruction sequence for the Intel
> > > CET-enabled PLT section. Compared to the existing one, the new scheme is
> > > simpler, more compact (16 bytes vs. 32 bytes for each PLT entry), and does
> > > not require a separate second PLT section (.plt.sec).
> > >
> > > Here is the proposed code sequence:
> > >
> > >   PLT0:
> > >
> > >   f3 0f 1e fa        // endbr64
> > >   41 53              // push %r11
> > >   ff 35 00 00 00 00  // push GOT[1]
> > >   ff 25 00 00 00 00  // jmp *GOT[2]
> > >   0f 1f 40 00        // nop
> > >   0f 1f 40 00        // nop
> > >   0f 1f 40 00        // nop
> > >   66 90              // nop
> > >
> > >   PLTn:
> > >
> > >   f3 0f 1e fa        // endbr64
> > >   41 bb 00 00 00 00  // mov $namen_reloc_index, %r11d
> > >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
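To make the byte layout of the proposed PLTn concrete, here is a minimal
sketch of how a linker might fill in the two zeroed fields. The helper name
encode_plt_entry and the addresses in the usage note are hypothetical and are
not taken from mold or binutils. The sketch assumes the jmp operand is a
32-bit RIP-relative displacement measured from the end of the entry, as it is
for any "ff 25" jmp in 64-bit mode.

```python
import struct

PLT_ENTRY_SIZE = 16  # size of one proposed PLTn entry, per the listing above


def encode_plt_entry(reloc_index: int, plt_entry_addr: int,
                     got_slot_addr: int) -> bytes:
    """Encode one proposed PLTn entry (hypothetical helper for illustration)."""
    # The jmp is the last instruction of the entry, so its RIP-relative
    # displacement is measured from the end of the 16-byte entry.
    disp = got_slot_addr - (plt_entry_addr + PLT_ENTRY_SIZE)
    entry = (
        bytes.fromhex("f30f1efa")                                  # endbr64
        + bytes.fromhex("41bb") + struct.pack("<I", reloc_index)   # mov $index, %r11d
        + bytes.fromhex("ff25") + struct.pack("<i", disp)          # jmp *GOT[n](%rip)
    )
    assert len(entry) == PLT_ENTRY_SIZE
    return entry
```

For example, encode_plt_entry(3, 0x1020, 0x4018) places relocation index 3 in
the mov immediate and a displacement of 0x2fe8 in the jmp.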
> >
> > All PLT calls will have an extra MOV.
>
> One extra load-immediate mov instruction is executed per function
> call through a PLT entry. Its cost is so small that I couldn't see any
> difference in real-world apps.
>
> > > GOT[namen_index] is initialized to PLT0 for every PLT entry, so that when
> > > a PLT entry is called for the first time, control passes to PLT0, which
> > > calls the resolver function.
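The lazy-binding flow described above can be modelled in a few lines. This is
only a toy sketch: Python callables stand in for code addresses, the
relocation index is passed as an ordinary argument rather than in %r11, and
the names make_lazy_plt, plt0 and call_plt are made up for illustration.

```python
def make_lazy_plt(symbols, resolve):
    """Toy model: every GOT slot starts at PLT0 and is patched on first call."""
    resolver_log = []

    def plt0(index, *args):
        # PLT0 receives the relocation index, asks the resolver for the real
        # target, patches the GOT slot, and then transfers to the target.
        resolver_log.append(index)
        target = resolve(symbols[index])
        got[index] = lambda _index, *a: target(*a)
        return target(*args)

    got = [plt0] * len(symbols)  # every slot initially points at PLT0

    def call_plt(index, *args):
        # PLTn: load the relocation index, then jump through GOT[index].
        return got[index](index, *args)

    return call_plt, resolver_log
```

With symbols ["abs", "len"] and a resolver that maps those names to the
Python built-ins, the first call_plt(0, -5) goes through plt0, and later calls
with index 0 reach abs directly without touching the resolver again.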
> > >
> > > It uses %r11 as a scratch register. The x86-64 psABI explicitly allows PLT
> > > entries to clobber this register (*1), and the resolver function
> > > (_dl_runtime_resolve) already clobbers it.
> > >
> > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
> > > preserved, nor is it used to pass arguments. Making this register available as
> > > scratch register means that code in the PLT need not spill any registers when
> > > computing the address to which control needs to be transferred."
> > >
> > > FYI, this is the current CET-enabled PLT:
> > >
> > >   PLT0:
> > >
> > >   ff 35 00 00 00 00    // push GOT[1]
> > >   f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[2]
> > >   0f 1f 00             // nop
> > >
> > >   PLTn in .plt:
> > >
> > >   f3 0f 1e fa          // endbr64
> > >   68 00 00 00 00       // push $namen_reloc_index
> > >   f2 e9 e1 ff ff ff    // bnd jmpq PLT0
> > >   90                   // nop
> > >
> > >   PLTn in .plt.sec:
> > >
> > >   f3 0f 1e fa          // endbr64
> > >   f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
> > >   0f 1f 44 00 00       // nop
> > >
> > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
> > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes (16 bytes in
> > > .plt plus 16 bytes in .plt.sec). Usually, we have many PLT entries but only
> > > one header, so in practice, the proposed format is almost 50% smaller than
> > > the existing one.
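The size claim above can be checked with a little arithmetic. The sketch below
uses only the header and entry sizes quoted in this thread; the function names
are made up for illustration.

```python
def plt_size_proposed(n_entries: int) -> int:
    # 32-byte PLT0 header plus one 16-byte entry per symbol.
    return 32 + 16 * n_entries


def plt_size_existing(n_entries: int) -> int:
    # 16-byte PLT0 header plus, per symbol, a 16-byte .plt entry
    # and a 16-byte .plt.sec entry.
    return 16 + 32 * n_entries


def relative_saving(n_entries: int) -> float:
    # Fraction of the existing size that the proposed layout saves.
    return 1 - plt_size_proposed(n_entries) / plt_size_existing(n_entries)
```

With a single entry both layouts take 48 bytes, so there is no saving. With
1000 entries the totals are 16032 and 32016 bytes, and the saving approaches
50% as the entry count grows.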
> >
> > Does it have any impact on performance? .plt.sec can be placed
> > on a different page from .plt.
> >
> > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
> > > has been deprecated.
> > >
> > > I have already implemented the proposed scheme in my linker
> > > (https://github.com/rui314/mold) and it appears to work fine.
> > >
> > > Any thoughts?
> >
> > I'd like to see visible performance improvements or new features in
> > a new PLT layout.
>
> I didn't see any visible performance improvement with real-world apps.
> I might be able to craft a microbenchmark that hammers PLT entries in
> some particular pattern to show a difference, but I don't think that
> would be meaningful. The size reduction is real, though.

I am aware of two other proposals to use R11 in the PLT or in function
calls, but those introduce new features. I don't think we should use R11
in the PLT without a real performance improvement.

> > I cced x86-64 psABI mailing list.
> >
> >
> > --
> > H.J.


-- 
H.J.