Date: Mon, 20 Apr 2020 14:31:58 +1000
From: Nicholas Piggin
Subject: Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
To: Rich Felker
Cc: Adhemerval Zanella, libc-alpha@sourceware.org, libc-dev@lists.llvm.org,
  linuxppc-dev@lists.ozlabs.org, musl@lists.openwall.com
References: <20200415225539.GL11469@brightrain.aerifal.cx>
  <20200416153756.GU11469@brightrain.aerifal.cx>
  <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org>
  <20200416175932.GZ11469@brightrain.aerifal.cx>
  <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org>
  <20200416183151.GA11469@brightrain.aerifal.cx>
  <1587344003.daumxvs1kh.astroid@bobo.none>
  <20200420013412.GZ11469@brightrain.aerifal.cx>
  <1587348538.l1ioqml73m.astroid@bobo.none>
  <20200420040926.GA11469@brightrain.aerifal.cx>
In-Reply-To: <20200420040926.GA11469@brightrain.aerifal.cx>
Message-Id: <1587356128.aslvdnmtbw.astroid@bobo.none>

Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
> On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> > Note that because lr is clobbered we need at least one normally
>> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> > Otherwise stack frame setup is required to spill it.
>> >>
>> >> The kernel would like to use r9-r12 for itself. We could do with fewer
>> >> registers, but we have some delay establishing the stack (depends on a
>> >> load which depends on a mfspr), and entry code tends to be quite
>> >> store-heavy, whereas on the caller side you have r1 set up (modulo stack
>> >> updates), and the system call is a long delay during which time the
>> >> store queue has significant time to drain.
>> >>
>> >> My feeling is it would be better for the kernel to have these scratch
>> >> registers.
>> >
>> > If your new kernel syscall mechanism requires the caller to make a
>> > whole stack frame it otherwise doesn't need and spill registers to it,
>> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> > immediately lost on the userspace side, plus you either waste icache
>> > at the call point or require the syscall to go through a
>> > userspace-side helper function that performs the spill and restore.
>>
>> You would be surprised how few cycles that takes on a high-end CPU. Some
>> might be a couple of %. I am one for counting cycles, mind you; I'm not
>> being flippant about it. If we can come up with something faster I'd be
>> up for it.
>
> If the cycle count is trivial then just do it on the kernel side.

The cycle count is trivial for userspace, because you have r1 ready. The
kernel does not have its stack ready; it has to do mfspr rX ; ld rY,N(rX)
to get a stack to save into, which is extra work that a userspace-side
save avoids.

Now that I think about it, no stack frame is even required! lr is saved
into the caller's stack when it's clobbered with an asm, just as when
it's used for a function call.

>> > The right way to do this is to have the kernel preserve enough
>> > registers that userspace can avoid having any spills. It doesn't have
>> > to preserve everything, probably just enough to save lr. (BTW are
>>
>> Again, the problem is the kernel doesn't have its dependencies
>> immediately ready to spill, and spilling may be more costly
>> immediately after the call because we're doing a lot of stores.
>>
>> I could try to measure this. Unfortunately our pipeline simulator tool
>> doesn't model system calls properly, so it's hard to see what's happening
>> across the user/kernel horizon. I might check if that can be improved,
>> or I can hack it by putting some isync in there or something.
>
> I think it's unlikely to make any real difference to the total number
> of cycles spent which side it happens on, but putting it on the kernel
> side makes it easier to avoid wasting size/icache at each syscall
> site.
>
>> > syscall arg registers still preserved? If not, this is a major cost on
>> > the userspace side, since any call point that has to loop-and-retry
>> > (e.g. futex) now needs to make its own place to store the original
>> > values.)
>>
>> Powerpc system calls never did. We could have scv preserve them, but
>> you'd still need to restore r3. We could make an ABI which does not
>> clobber r3 but puts the return value in r9, say. I'd like to see what
>> the user side code looks like to take advantage of such a thing though.
>
> Oh wow, I hadn't realized that, but indeed the code we have now is
> allowing for the kernel to clobber them all. So at least this isn't
> getting any worse I guess. I think it was a very poor choice of
> behavior though and a disadvantage vs what other archs do (some of
> them preserve all registers; others preserve only normally call-saved
> ones plus the syscall arg ones and possibly a few other specials).

Well, we could change it. Does the generated code improve significantly
if we take those clobbers away?

Thanks,
Nick