From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 28 Feb 2022 18:22:47 -0800
From: Fangrui Song <i@maskray.me>
To: Rui Ueyama <rui314@gmail.com>
Cc: "H.J. Lu" <hjl.tools@gmail.com>, x86-64-abi <x86-64-abi@googlegroups.com>,
 Andi Kleen <andi@firstfloor.org>, Binutils <binutils@sourceware.org>,
 "Moreira, Joao" <joao.moreira@intel.com>
Subject: Re: x86-64: new CET-enabled PLT format proposal
Message-ID: <20220301022247.kwcolxruopevfwcc@gmail.com>
References: <CACKH++aqb-QUnyRmOZWR-L1wzmUsEv7sGB+KXs53TRSjp1xjsw@mail.gmail.com>
 <CAMe9rOrROkcPQ3vrBTkXdaM84ca-HZBGiAmmfGCGK+33uRsC0A@mail.gmail.com>
 <CACKH++ZC2W8m5wwu-hfBzdpgta3841A9K6htMU_0yZPn=jZYYA@mail.gmail.com>
 <CAMe9rOoVn0LKNCjiQKj31Fyoq_i8CsCvQzmiDvTsEUJCTd1TvQ@mail.gmail.com>
 <CACKH++ZC37gf-s75wNTm2L-vYU+vHyOhiSxAFWUpW7DYnRNkVA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Disposition: inline
In-Reply-To: <CACKH++ZC37gf-s75wNTm2L-vYU+vHyOhiSxAFWUpW7DYnRNkVA@mail.gmail.com>

On 2022-03-01, Rui Ueyama via Binutils wrote:
>I think size reduction matters to some users even if you do not care
>about it that much. But I'm not trying too hard to push GNU binutils
>to adopt it. I just wanted to let you guys know that we invented a
>compact (and we believe better) instruction sequence for the
>CET-enabled PLT and we are already using it.
>
>On Tue, Mar 1, 2022 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui314@gmail.com> wrote:
>> >
>> > On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> > >
>> > > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
>> > > <binutils@sourceware.org> wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I'd like to propose an alternative instruction sequence for the Intel
>> > > > CET-enabled PLT section. Compared to the existing one, the new scheme is
>> > > > simple, compact (16 bytes vs. 32 bytes for each PLT entry) and does not
>> > > > require a separate second PLT section (.plt.sec).
>> > > >
>> > > > Here is the proposed code sequence:
>> > > >
>> > > >   PLT0:
>> > > >
>> > > >   f3 0f 1e fa        // endbr64
>> > > >   41 53              // push %r11
>> > > >   ff 35 00 00 00 00  // push GOT[1]
>> > > >   ff 25 00 00 00 00  // jmp *GOT[2]
>> > > >   0f 1f 40 00        // nop
>> > > >   0f 1f 40 00        // nop
>> > > >   0f 1f 40 00        // nop
>> > > >   66 90              // nop
>> > > >
>> > > >   PLTn:
>> > > >
>> > > >   f3 0f 1e fa        // endbr64
>> > > >   41 bb 00 00 00 00  // mov $namen_reloc_index, %r11d
>> > > >   ff 25 00 00 00 00  // jmp *GOT[namen_index]
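
For readers who want to see how the 16 bytes of a PLTn entry fit together,
here is a minimal sketch. It is not taken from mold or any other linker:
the function name, its parameters, and the sample addresses are made up for
illustration. It only assumes the standard x86-64 encodings shown in the
listing above and that the jmp displacement is relative to the end of the
entry (plt_addr + 16).

```python
import struct

ENDBR64 = bytes.fromhex("f30f1efa")  # endbr64


def encode_plt_entry(reloc_index: int, plt_addr: int, got_slot_addr: int) -> bytes:
    """Build one 16-byte PLTn entry of the proposed format (hypothetical helper)."""
    # mov $reloc_index, %r11d  ->  41 bb imm32 (little-endian)
    mov = b"\x41\xbb" + struct.pack("<I", reloc_index)
    # jmp *disp32(%rip)  ->  ff 25 disp32; %rip points just past this
    # instruction, which is also the end of the 16-byte entry.
    disp = got_slot_addr - (plt_addr + 16)
    jmp = b"\xff\x25" + struct.pack("<i", disp)
    return ENDBR64 + mov + jmp


# Example with made-up addresses: entry at 0x1030, GOT slot at 0x4018.
entry = encode_plt_entry(3, 0x1030, 0x4018)
```

The entry needs no padding because 4 + 6 + 6 bytes is exactly 16.
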
>> > >
>> > > All PLT calls will have an extra MOV.
>> >
>> > One extra load-immediate mov instruction is executed per function
>> > call through a PLT entry. The cost is so small that I couldn't see
>> > any difference in real-world apps.
>> >
>> > > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
>> > > > PLT entry is called for the first time, the control is passed to PLT0 to call
>> > > > the resolver function.
>> > > >
>> > > > It uses %r11 as a scratch register. The x86-64 psABI explicitly allows PLT
>> > > > entries to clobber this register (*1), and the resolver function
>> > > > (_dl_runtime_resolve) already clobbers it.
>> > > >
>> > > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
>> > > > preserved, nor is it used to pass arguments. Making this register available as
>> > > > scratch register means that code in the PLT need not spill any registers when
>> > > > computing the address to which control needs to be transferred."
>> > > >
>> > > > FYI, this is the current CET-enabled PLT:
>> > > >
>> > > >   PLT0:
>> > > >
>> > > >   ff 35 00 00 00 00    // push GOT[0]
>> > > >   f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[1]
>> > > >   0f 1f 00             // nop
>> > > >
>> > > >   PLTn in .plt:
>> > > >
>> > > >   f3 0f 1e fa          // endbr64
>> > > >   68 00 00 00 00       // push $namen_reloc_index
>> > > >   f2 e9 e1 ff ff ff    // bnd jmpq PLT0
>> > > >   90                   // nop
>> > > >
>> > > >   PLTn in .plt.sec:
>> > > >
>> > > >   f3 0f 1e fa          // endbr64
>> > > >   f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
>> > > >   0f 1f 44 00 00       // nop
>> > > >
>> > > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
>> > > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
>> > > > have many PLT entries but only one header, so in practice, the
>> > > > proposed format is almost 50% smaller than the existing one.
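
The arithmetic behind the "almost 50%" figure can be sketched as below. The
sizes come straight from the paragraph above; the function names are made
up for illustration, and the existing format is counted as .plt plus
.plt.sec together.

```python
def plt_size_proposed(n: int) -> int:
    """Total PLT bytes in the proposed format: 32-byte PLT0 + 16 bytes per entry."""
    return 32 + 16 * n


def plt_size_existing(n: int) -> int:
    """Total bytes in the existing CET format: 16-byte PLT0 + 32 bytes per entry
    (16 in .plt and 16 in .plt.sec)."""
    return 16 + 32 * n


def size_ratio(n: int) -> float:
    """Proposed size as a fraction of the existing size for n PLT entries."""
    return plt_size_proposed(n) / plt_size_existing(n)
```

With a single entry the two formats are the same size (48 bytes each); the
ratio approaches 0.5 as the number of entries grows, since the larger PLT0
is a one-time cost.
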
>> > >
>> > > Does it have any impact on performance?   .plt.sec can be placed
>> > > in a different page from .plt.
>> > >
>> > > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
>> > > > has been deprecated.
>> > > >
>> > > > I already implemented the proposed scheme in my linker
>> > > > (https://github.com/rui314/mold) and it looks like it's working fine.
>> > > >
>> > > > Any thoughts?
>> > >
>> > > I'd like to see visible performance improvements or new features in
>> > > a new PLT layout.
>> >
>> > I didn't see any visible performance improvement with real-world apps.
>> > I might be able to craft a microbenchmark to hammer PLT entries really
>> > hard in some pattern to see some difference, but I think that doesn't
>> > make much sense. The size reduction is for real though.
>>
>> I am aware that there are 2 other proposals to use R11 in PLT/function
>> call.   But they are introducing new features.  I don't think we should
>> use R11 in PLT without any real performance improvements.

I like the proposal.  Its merits are a simpler implementation and smaller
code size, plus two less obvious ones: (a) linker script users won't need
to mention .plt.sec, and (b) tools can identify PLT entries with a more
unified approach, as they do on other architectures.

>> > > I cced x86-64 psABI mailing list.
>> > >
>> > >
>> > > --
>> > > H.J.