From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 54002 invoked by alias); 27 Sep 2018 15:13:20 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 53991 invoked by uid 89); 27 Sep 2018 15:13:20 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-11.9 required=5.0 tests=BAYES_00,GIT_PATCH_2,GIT_PATCH_3,SPF_PASS autolearn=ham version=3.3.2 spammy=27092018, 27.09.2018 X-HELO: mx1.suse.de Received: from mx2.suse.de (HELO mx1.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 27 Sep 2018 15:13:18 +0000 Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id CBC37B045; Thu, 27 Sep 2018 15:13:15 +0000 (UTC) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (1.0) Subject: Re: [PATCH, AArch64 00/11] LSE atomics out-of-line From: Alexander Graf In-Reply-To: <590e4768-0d3b-92e4-9103-ed36902125ff@arm.com> Date: Thu, 27 Sep 2018 15:19:00 -0000 Cc: "rth7680@gmail.com" , "gcc-patches@gcc.gnu.org" , "matz@suse.de" , Richard Henderson , nd , will.deacon@arm.com Content-Transfer-Encoding: quoted-printable Message-Id: <94FFDC4A-D3FE-4524-9EA0-0EC4BAA9E398@suse.de> References: <20180926050355.32746-1-richard.henderson@linaro.org> <590e4768-0d3b-92e4-9103-ed36902125ff@arm.com> To: Ramana Radhakrishnan X-SW-Source: 2018-09/txt/msg01671.txt.bz2 > Am 27.09.2018 um 15:07 schrieb Ramana Radhakrishnan : >=20 >> On 26/09/2018 06:03, rth7680@gmail.com wrote: >> From: Richard Henderson >> ARMv8.1 adds an (mandatory) Atomics extension, also known as the >> Large System Extension. Deploying this extension at the OS level >> has proved challenging. >> The following is the result of a conversation between myself, >> Alex Graf of SuSE, and Ramana Radhakrishnan of ARM, at last week's >> Linaro Connect in Vancouver. >> The current state of the world is that one could distribute two >> different copies of a given shared library and place the LSE-enabled >> version in /lib64/atomics/ and it will be selected over the /lib64/ >> version by ld.so when HWCAP_ATOMICS is present. >> Alex's main concern with this is that (1) he doesn't want to >> distribute two copies of every library, or determine what a >> resonable subset would be and (2) this solution does not work >> for executables, e.g. mysql. >> Ramana's main concern was to avoid the overhead of an indirect jump, >> especially in how that would affect the (non-)branch-prediction of >> the smallest implementations. >> Therefore, I've created small out-of-line helpers that are directly >> linked into every library or executable that requires them. There >> will be two direct branches, both of which will be well-predicted. >> In the process, I discovered a number of places within the code >> where the existing implementation could be improved. In particular: >> - the LSE patterns didn't use predicates or constraints that >> match the actual instructions, requiring unnecessary splitting. >> - the non-LSE compare-and-swap can use an extending compare to >> avoid requiring the input to have been previously extended. >> - TImode compare-and-swap was missing entirely. This brings >> aarch64 to parity with x86_64 wrt __sync_val_compare_and_swap. >> There is a final patch that enables the new option by default. >> I am not necessarily expecting this to be merged upstream, but >> for the operating system to decide what the default should be. >> It might be that this should be a configure option, so as to >> make that OS choice easier, but I've just now thought of that. ;-) >> I'm going to have to rely on Alex and/or Ramana to perform >> testing on a system that supports LSE. >=20 > Thanks for this patchset - >=20 > I'll give this a whirl in the next couple of days but don't expect result= s until Monday or so. >=20 > I do have an additional concern that I forgot to mention in Vancouver - >=20 > Thanks Wilco for reminding me that this now replaces a bunch of inline in= structions with effectively a library call therefore clobbering a whole bun= ch of caller saved registers. >=20 > In which case I see 2 options. >=20 > - maybe we should consider a private interface and restrict the register= s that these files are compiled with to minimise the number of caller saved= registers we trash. >=20 > - Alternatively we should consider an option to inline these at O2 or O3 = as we may just be trading the performance improvements we get with using th= e lse atomics I talked to Will Deacon about lse atomics today a bit. Apparently, a key be= nefit that you get from using them is guaranteed forward progress when comp= ared to an exclusives loop. So IMHO even a tiny slowdown might be better than not progressing. Another concern he did bring up was that due to the additional conditional = code a cmpxchg loop may become bigger, so converges slower/never than a nat= ive implementation. I assume we can identify those cases later and solve th= em with ifuncs in the target code though. Alex > for additional stacking and unstacking of caller saved registers in the m= ain functions... >=20 > But anyway while we discuss that we'll have a look at testing and benchma= rking this. >=20 >=20 > regards > Ramana >=20 >> r~ >> Richard Henderson (11): >> aarch64: Simplify LSE cas generation >> aarch64: Improve cas generation >> aarch64: Improve swp generation >> aarch64: Improve atomic-op lse generation >> aarch64: Emit LSE st instructions >> Add visibility to libfunc constructors >> Link static libgcc after shared libgcc for -shared-libgcc >> aarch64: Add out-of-line functions for LSE atomics >> aarch64: Implement -matomic-ool >> aarch64: Implement TImode compare-and-swap >> Enable -matomic-ool by default >> gcc/config/aarch64/aarch64-protos.h | 20 +- >> gcc/optabs-libfuncs.h | 2 + >> gcc/common/config/aarch64/aarch64-common.c | 6 +- >> gcc/config/aarch64/aarch64.c | 480 ++++++-------- >> gcc/gcc.c | 9 +- >> gcc/optabs-libfuncs.c | 26 +- >> .../atomic-comp-swap-release-acquire.c | 2 +- >> .../gcc.target/aarch64/atomic-inst-ldadd.c | 18 +- >> .../gcc.target/aarch64/atomic-inst-ldlogic.c | 54 +- >> .../gcc.target/aarch64/atomic-op-acq_rel.c | 2 +- >> .../gcc.target/aarch64/atomic-op-acquire.c | 2 +- >> .../gcc.target/aarch64/atomic-op-char.c | 2 +- >> .../gcc.target/aarch64/atomic-op-consume.c | 2 +- >> .../gcc.target/aarch64/atomic-op-imm.c | 2 +- >> .../gcc.target/aarch64/atomic-op-int.c | 2 +- >> .../gcc.target/aarch64/atomic-op-long.c | 2 +- >> .../gcc.target/aarch64/atomic-op-relaxed.c | 2 +- >> .../gcc.target/aarch64/atomic-op-release.c | 2 +- >> .../gcc.target/aarch64/atomic-op-seq_cst.c | 2 +- >> .../gcc.target/aarch64/atomic-op-short.c | 2 +- >> .../aarch64/atomic_cmp_exchange_zero_reg_1.c | 2 +- >> .../atomic_cmp_exchange_zero_strong_1.c | 2 +- >> .../gcc.target/aarch64/sync-comp-swap.c | 2 +- >> .../gcc.target/aarch64/sync-op-acquire.c | 2 +- >> .../gcc.target/aarch64/sync-op-full.c | 2 +- >> libgcc/config/aarch64/lse.c | 280 ++++++++ >> gcc/config/aarch64/aarch64.opt | 4 + >> gcc/config/aarch64/atomics.md | 608 ++++++++++-------- >> gcc/config/aarch64/iterators.md | 8 +- >> gcc/config/aarch64/predicates.md | 12 + >> gcc/doc/invoke.texi | 14 +- >> libgcc/config.host | 4 + >> libgcc/config/aarch64/t-lse | 48 ++ >> 33 files changed, 1050 insertions(+), 577 deletions(-) >> create mode 100644 libgcc/config/aarch64/lse.c >> create mode 100644 libgcc/config/aarch64/t-lse >=20