From: "H.J. Lu"
Date: Mon, 28 Feb 2022 16:04:34 -0800
Subject: Re: x86-64: new CET-enabled PLT format proposal
To: Rui Ueyama, "Moreira, Joao"
Cc: Andi Kleen, x86-64-abi, Binutils

On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama wrote:
>
> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu wrote:
> >
> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
> > wrote:
> > >
> > > Hello,
> > >
> > > I'd like to propose an alternative instruction sequence for the Intel
> > > CET-enabled PLT section. Compared to the existing one, the new scheme is
> > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not
> > > require a separate second PLT section (.plt.sec).
> > >
> > > Here is the proposed code sequence:
> > >
> > > PLT0:
> > >
> > > f3 0f 1e fa        // endbr64
> > > 41 53              // push %r11
> > > ff 35 00 00 00 00  // push GOT[1]
> > > ff 25 00 00 00 00  // jmp *GOT[2]
> > > 0f 1f 40 00        // nop
> > > 0f 1f 40 00        // nop
> > > 0f 1f 40 00        // nop
> > > 66 90              // nop
> > >
> > > PLTn:
> > >
> > > f3 0f 1e fa        // endbr64
> > > 41 bb 00 00 00 00  // mov $namen_reloc_index %r11d
> > > ff 25 00 00 00 00  // jmp *GOT[namen_index]
> >
> > All PLT calls will have an extra MOV.
>
> One extra load-immediate mov instruction is executed per a function
> call through a PLT entry. It's so tiny that I couldn't see any
> difference in real-world apps.
>
> > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
> > > PLT entry is called for the first time, the control is passed to PLT0 to call
> > > the resolver function.
> > >
> > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries
> > > to clobber this register (*1), and the resolve function (__dl_runtime_resolve)
> > > already clobbers it.
> > >
> > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
> > > preserved, nor is it used to pass arguments. Making this register available as
> > > scratch register means that code in the PLT need not spill any registers when
> > > computing the address to which control needs to be transferred."
> > >
> > > FYI, this is the current CET-enabled PLT:
> > >
> > > PLT0:
> > >
> > > ff 35 00 00 00 00     // push GOT[0]
> > > f2 ff 25 e3 2f 00 00  // bnd jmp *GOT[1]
> > > 0f 1f 00              // nop
> > >
> > > PLTn in .plt:
> > >
> > > f3 0f 1e fa           // endbr64
> > > 68 00 00 00 00        // push $namen_reloc_index
> > > f2 e9 e1 ff ff ff     // bnd jmpq PLT0
> > > 90                    // nop
> > >
> > > PLTn in .plt.sec:
> > >
> > > f3 0f 1e fa           // endbr64
> > > f2 ff 25 ad 2f 00 00  // bnd jmpq *GOT[namen_index]
> > > 0f 1f 44 00 00        // nop
> > >
> > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
> > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
> > > have many PLT sections while we have only one header, so in practice, the
> > > proposed format is almost 50% smaller than the existing one.
> >
> > Does it have any impact on performance? .plt.sec can be placed
> > in a different page from .plt.
> >
> > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
> > > has been deprecated.
> > >
> > > I already implemented the proposed scheme to my linker
> > > (https://github.com/rui314/mold) and it looks like it's working fine.
> > >
> > > Any thoughts?
> >
> > I'd like to see visible performance improvements or new features in
> > a new PLT layout.
>
> I didn't see any visible performance improvement with real-world apps.
> I might be able to craft a microbenchmark to hammer PLT entries really
> hard in some pattern to see some difference, but I think that doesn't
> make much sense. The size reduction is for real though.

I am aware of two other proposals to use R11 in the PLT or in function
calls, but those introduce new features. I don't think we should use R11
in the PLT without a real performance improvement.

> > I cced x86-64 psABI mailing list.
> >
> >
> > --
> > H.J.

-- 
H.J.
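
For anyone who wants to check the size figures quoted above, here is a
minimal sketch in Python. The instruction lengths are read off the byte
listings in this thread; the function names are hypothetical and exist
only for this illustration. It assumes one PLT0 header per module and,
for the current format, one 16-byte entry in .plt plus one 16-byte entry
in .plt.sec per function.

```python
# Instruction lengths in bytes, read off the byte listings quoted in this thread.
PROPOSED_PLT0 = [4, 2, 6, 6, 4, 4, 4, 2]  # endbr64, push %r11, push, jmp, four nops
PROPOSED_PLTN = [4, 6, 6]                 # endbr64, mov imm32 to %r11d, jmp
CURRENT_PLT0 = [6, 7, 3]                  # push, bnd jmp, nop
CURRENT_PLTN = [4, 5, 6, 1]               # .plt entry: endbr64, push, bnd jmpq, nop
CURRENT_PLTSEC = [4, 7, 5]                # .plt.sec entry: endbr64, bnd jmpq, nop


def proposed_size(n):
    """Total PLT bytes for n entries in the proposed format (hypothetical helper)."""
    return sum(PROPOSED_PLT0) + n * sum(PROPOSED_PLTN)


def current_size(n):
    """Total .plt plus .plt.sec bytes for n entries in the current CET format."""
    return sum(CURRENT_PLT0) + n * (sum(CURRENT_PLTN) + sum(CURRENT_PLTSEC))


def saving(n):
    """Fraction of PLT bytes saved by the proposed format for n entries."""
    return 1 - proposed_size(n) / current_size(n)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, proposed_size(n), current_size(n), round(saving(n), 4))
```

With a single entry the two formats come out the same size (48 bytes),
and the saving approaches but never reaches 50% as the entry count
grows, because the proposed PLT0 is 16 bytes larger.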