From: Fangrui Song
To: Rui Ueyama
Cc: "H.J. Lu", x86-64-abi, Andi Kleen, Binutils, "Moreira, Joao"
Subject: Re: x86-64: new CET-enabled PLT format proposal
Date: Mon, 28 Feb 2022 18:22:47 -0800
Message-ID: <20220301022247.kwcolxruopevfwcc@gmail.com>

On 2022-03-01, Rui Ueyama via Binutils wrote:
>I think size reduction matters to some users even if you do not care
>about that that much. But I'm not trying too hard to push GNU binutils
>to adopt it. I just wanted to let you guys know that we invented a
>compact (and we believe better) instruction sequence for the
>CET-enabled PLT and we are already using it.
>
>On Tue, Mar 1, 2022 at 9:05 AM H.J. Lu wrote:
>>
>> On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama wrote:
>> >
>> > On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu wrote:
>> > >
>> > > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
>> > > wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I'd like to propose an alternative instruction sequence for the Intel
>> > > > CET-enabled PLT section. Compared to the existing one, the new scheme is
>> > > > simple, compact (32 bytes vs. 16 bytes for each PLT entry) and does not
>> > > > require a separate second PLT section (.plt.sec).
>> > > >
>> > > > Here is the proposed code sequence:
>> > > >
>> > > > PLT0:
>> > > >
>> > > > f3 0f 1e fa        // endbr64
>> > > > 41 53              // push %r11
>> > > > ff 35 00 00 00 00  // push GOT[1]
>> > > > ff 25 00 00 00 00  // jmp *GOT[2]
>> > > > 0f 1f 40 00        // nop
>> > > > 0f 1f 40 00        // nop
>> > > > 0f 1f 40 00        // nop
>> > > > 66 90              // nop
>> > > >
>> > > > PLTn:
>> > > >
>> > > > f3 0f 1e fa        // endbr64
>> > > > 41 bb 00 00 00 00  // mov $namen_reloc_index %r11d
>> > > > ff 25 00 00 00 00  // jmp *GOT[namen_index]
>> > >
>> > > All PLT calls will have an extra MOV.
>> >
>> > One extra load-immediate mov instruction is executed per a function
>> > call through a PLT entry. It's so tiny that I couldn't see any
>> > difference in real-world apps.
>> >
>> > > > GOT[namen_index] is initialized to PLT0 for all PLT entries, so that when a
>> > > > PLT entry is called for the first time, the control is passed to PLT0 to call
>> > > > the resolver function.
>> > > >
>> > > > It uses %r11 as a scratch register. x86-64 psABI explicitly allows PLT entries
>> > > > to clobber this register (*1), and the resolve function (__dl_runtime_resolve)
>> > > > already clobbers it.
>> > > >
>> > > > (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
>> > > > preserved, nor is it used to pass arguments. Making this register available as
>> > > > scratch register means that code in the PLT need not spill any registers when
>> > > > computing the address to which control needs to be transferred."
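
As a rough illustration of the PLTn encoding quoted above, here is a
minimal Python sketch that fills in the two zeroed fields of one entry.
The function name encode_plt_entry and the example addresses are
hypothetical, not taken from mold or binutils. It assumes the jmp uses
RIP-relative addressing, so the displacement is measured from the end of
the jmp, which is also the end of the 16-byte entry.

```python
import struct

PLT_ENTRY_SIZE = 16  # size of one PLTn entry in the proposed format


def encode_plt_entry(reloc_index: int, plt_entry_addr: int, got_slot_addr: int) -> bytes:
    """Encode one proposed-format PLTn entry (hypothetical helper)."""
    entry = bytes.fromhex("f30f1efa")  # endbr64
    # mov $reloc_index, %r11d  (41 bb imm32, little-endian immediate)
    entry += b"\x41\xbb" + struct.pack("<I", reloc_index)
    # jmp *disp32(%rip)  (ff 25 disp32); RIP points just past this instruction,
    # i.e. at plt_entry_addr + 16.
    disp = got_slot_addr - (plt_entry_addr + PLT_ENTRY_SIZE)
    entry += b"\xff\x25" + struct.pack("<i", disp)
    return entry


# Example with made-up addresses: entry at 0x1020, its GOT slot at 0x4018.
entry = encode_plt_entry(reloc_index=3, plt_entry_addr=0x1020, got_slot_addr=0x4018)
print(entry.hex(" "))
```

The result is always 16 bytes, matching the per-entry size stated in the
proposal; a real linker would of course take the addresses from its own
layout rather than from constants.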
>> > > >
>> > > > FYI, this is the current CET-enabled PLT:
>> > > >
>> > > > PLT0:
>> > > >
>> > > > ff 35 00 00 00 00     // push GOT[0]
>> > > > f2 ff 25 e3 2f 00 00  // bnd jmp *GOT[1]
>> > > > 0f 1f 00              // nop
>> > > >
>> > > > PLTn in .plt:
>> > > >
>> > > > f3 0f 1e fa           // endbr64
>> > > > 68 00 00 00 00        // push $namen_reloc_index
>> > > > f2 e9 e1 ff ff ff     // bnd jmpq PLT0
>> > > > 90                    // nop
>> > > >
>> > > > PLTn in .plt.sec:
>> > > >
>> > > > f3 0f 1e fa           // endbr64
>> > > > f2 ff 25 ad 2f 00 00  // bnd jmpq *GOT[namen_index]
>> > > > 0f 1f 44 00 00        // nop
>> > > >
>> > > > In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
>> > > > the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
>> > > > have many PLT sections while we have only one header, so in practice, the
>> > > > proposed format is almost 50% smaller than the existing one.
>> > >
>> > > Does it have any impact on performance? .plt.sec can be placed
>> > > in a different page from .plt.
>> > >
>> > > > The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
>> > > > has been deprecated.
>> > > >
>> > > > I already implemented the proposed scheme to my linker
>> > > > (https://github.com/rui314/mold) and it looks like it's working fine.
>> > > >
>> > > > Any thoughts?
>> > >
>> > > I'd like to see visible performance improvements or new features in
>> > > a new PLT layout.
>> >
>> > I didn't see any visible performance improvement with real-world apps.
>> > I might be able to craft a microbenchmark to hammer PLT entries really
>> > hard in some pattern to see some difference, but I think that doesn't
>> > make much sense. The size reduction is for real though.
>>
>> I am aware that there are 2 other proposals to use R11 in PLT/function
>> call. But they are introducing new features. I don't think we should
>> use R11 in PLT without any real performance improvements.

I like the proposal.
Its merits are a simpler implementation and smaller code size, plus two
less obvious ones: (a) linker script users won't need to mention
.plt.sec; (b) tools can identify PLTs in a more uniform way, as they do
on other architectures.

>> > > I cced x86-64 psABI mailing list.
>> > >
>> > >
>> > > --
>> > > H.J.
>>
>>
>>
>> --
>> H.J.
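
The size figures quoted in the thread can be checked with a small
sketch: 32-byte PLT0 plus 16 bytes per entry for the proposed format,
versus 16-byte PLT0 plus 32 bytes per entry (16 in .plt and 16 in
.plt.sec) for the existing one. The names plt_size and savings are
made up for illustration, and alignment padding between sections is
ignored.

```python
# Sizes in bytes, as stated in the quoted proposal.
PROPOSED = {"header_size": 32, "entry_size": 16}
EXISTING = {"header_size": 16, "entry_size": 32}  # 16 in .plt + 16 in .plt.sec


def plt_size(num_entries: int, header_size: int, entry_size: int) -> int:
    """Total PLT bytes for one header plus num_entries entries."""
    return header_size + num_entries * entry_size


def savings(num_entries: int) -> float:
    """Fraction of PLT bytes saved by the proposed format."""
    old = plt_size(num_entries, **EXISTING)
    new = plt_size(num_entries, **PROPOSED)
    return 1 - new / old


for n in (1, 2, 10, 1000):
    print(n, plt_size(n, **EXISTING), plt_size(n, **PROPOSED), f"{savings(n):.1%}")
```

With a single entry the two formats come out the same size, and the
saving approaches 50% as the number of entries grows, which is
consistent with the "almost 50% smaller" claim for typical binaries.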