From: Manos Anagnostakis
Date: Mon, 4 Dec 2023 21:43:29 +0200
Subject: Re: [PATCH v4] aarch64: New RTL optimization pass avoid-store-forwarding.
To: Manos Anagnostakis, gcc-patches@gcc.gnu.org, Philipp Tomsich, Manolis Tsamis, Richard Sandiford

On Mon, 4 Dec 2023 at 21:22, Richard Sandiford <richard.sandiford@arm.com> wrote:

> Manos Anagnostakis writes:
> > This is an RTL pass that detects store forwarding from stores to larger
> > loads (load pairs).
> >
> > This optimization is SPEC2017-driven and was found to be beneficial for
> > some benchmarks, through testing on ampere1/ampere1a machines.
> >
> > For example, it can transform cases like
> >
> >   str  d5, [sp, #320]
> >   fmul d5, d31, d29
> >   ldp  d31, d17, [sp, #312] # Large load from small store
> >
> > to
> >
> >   str  d5, [sp, #320]
> >   fmul d5, d31, d29
> >   ldr  d31, [sp, #312]
> >   ldr  d17, [sp, #320]
> >
> > Currently, the pass is disabled by default on all architectures and
> > enabled by a target-specific option.
> >
> > If deemed beneficial enough for a default, it will be enabled on
> > ampere1/ampere1a, or other architectures as well, without needing to be
> > turned on by this option.
> >
> > Bootstrapped and regtested on aarch64-linux.
> >
> > gcc/ChangeLog:
> >
> >       * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> >       * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> >       * config/aarch64/aarch64-protos.h
> >       (make_pass_avoid_store_forwarding): Declare.
> >       * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> >       (aarch64-store-forwarding-threshold): New param.
> >       * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
> >       * doc/invoke.texi: Document new option and new param.
> >       * config/aarch64/aarch64-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> >       * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> >       * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis
> > Co-Authored-By: Manolis Tsamis
> > Co-Authored-By: Philipp Tomsich
> > ---
> > Changes in v4:
> >       - I had problems making cselib_subst_to_values work correctly,
> >         so I used cselib_lookup to implement the exact same behaviour and
> >         record the store value at the time we iterate over it.
> >       - Removed the store/load_mem_addr check from is_forwarding as
> >         unnecessary.
> >       - The pass is called on all optimization levels right now.
> >       - The threshold check should remain as it is, as we only care about
> >         the front element of the list.  The comment above the check
> >         explains why a single if is enough.
>
> I still think this is structurally better as a while.  There's no reason
> in principle why we wouldn't want to record the stores in:
>
>   stp x0, x1, [x4, #8]
>   ldp x0, x1, [x4, #0]
>   ldp x2, x3, [x4, #16]
>
> and then the two stores should have the same distance value.
> I realise we don't do that yet, but still.
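For reference, the s/if/while/ change under discussion would look roughly
like this against the expiry check in scan_and_transform_bb_level (a sketch
only; the names are the ones used in the patch below):

      while (!store_exprs.empty ()
	     && (insn_cnt - store_exprs.front ().insn_cnt
		 > (unsigned int) aarch64_store_forwarding_threshold_param))
	store_exprs.pop_front ();

With a while loop, several recorded stores that end up past the threshold at
the same insn would all be dropped in one go, rather than one per scanned insn.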
Ah, you mean forwarding from stp.  I was a bit confused about what you meant
the previous time.  This was not initially meant for this patch, but I think
it wouldn't take long to implement that before pushing this.  It is your
call, of course, whether I should include it.

> > - The documentation changes requested.
> > - Adjusted a comment.
> >
> >  gcc/config.gcc                                |   1 +
> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> >  .../aarch64/aarch64-store-forwarding.cc       | 321 ++++++++++++++++++
> >  gcc/config/aarch64/aarch64.opt                |   9 +
> >  gcc/config/aarch64/t-aarch64                  |  10 +
> >  gcc/doc/invoke.texi                           |  11 +-
> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> >  10 files changed, 452 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >
> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > index 748430194f3..2ee3b61c4fa 100644
> > --- a/gcc/config.gcc
> > +++ b/gcc/config.gcc
> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >       cxx_target_objs="aarch64-c.o"
> >       d_target_objs="aarch64-d.o"
> >       extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > +     extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >       target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >       target_has_targetm_common=yes
> >       ;;
> > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> > index 6ace797b738..fa79e8adca8 100644
> > --- a/gcc/config/aarch64/aarch64-passes.def
> > +++ b/gcc/config/aarch64/aarch64-passes.def
> > @@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index d2718cc87b3..7d9dfa06af9 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1050,6 +1050,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
> >  rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >
> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >
> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > new file mode 100644
> > index 00000000000..ae3cbe519cd
> > --- /dev/null
> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > @@ -0,0 +1,321 @@
> > +/* Avoid store forwarding optimization pass.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   Contributed by VRULL GmbH.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify it
> > +   under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful, but
> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   General Public License for more details.
> > +
> > +   You should have received a copy of the GNU General Public License
> > +   along with GCC; see the file COPYING3.  If not see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#define IN_TARGET_CODE 1
> > +
> > +#include "config.h"
> > +#define INCLUDE_LIST
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "alias.h"
> > +#include "rtlanal.h"
> > +#include "tree-pass.h"
> > +#include "cselib.h"
> > +
> > +/* This is an RTL pass that detects store forwarding from stores to larger
> > +   loads (load pairs).  For example, it can transform cases like
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldp  d31, d17, [sp, #312] # Large load from small store
> > +
> > +   to
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldr  d31, [sp, #312]
> > +   ldr  d17, [sp, #320]
> > +
> > +   Design: The pass follows a straightforward design.  It starts by
> > +   initializing the alias analysis and the cselib.  Both of these are used
> > +   to find stores and larger loads with overlapping addresses, which are
> > +   candidates for store forwarding optimizations.  It then scans at basic
> > +   block level to find stores that forward to larger loads and handles them
> > +   accordingly as described in the above example.  Finally, the alias
> > +   analysis and the cselib library are closed.  */
> > +
> > +typedef struct
> > +{
> > +  rtx_insn *store_insn;
> > +  rtx store_mem_addr;
> > +  unsigned int insn_cnt;
> > +} store_info;
> > +
> > +typedef std::list<store_info> list_store_info;
> > +
> > +/* Statistics counters.  */
> > +static unsigned int stats_store_count = 0;
> > +static unsigned int stats_ldp_count = 0;
> > +static unsigned int stats_ssll_count = 0;
> > +static unsigned int stats_transformed_count = 0;
> > +
> > +/* Default.  */
> > +static rtx dummy;
> > +static bool is_load (rtx expr, rtx &op_1 = dummy);
> > +
> > +/* Return true if SET expression EXPR is a store; otherwise false.  */
> > +
> > +static bool
> > +is_store (rtx expr)
> > +{
> > +  return MEM_P (SET_DEST (expr));
> > +}
> > +
> > +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
> > +   contain the MEM operand of the load.  */
> > +
> > +static bool
> > +is_load (rtx expr, rtx &op_1)
> > +{
> > +  op_1 = SET_SRC (expr);
> > +
> > +  if (GET_CODE (op_1) == ZERO_EXTEND
> > +      || GET_CODE (op_1) == SIGN_EXTEND)
> > +    op_1 = XEXP (op_1, 0);
> > +
> > +  return MEM_P (op_1);
> > +}
> > +
> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> > +   STORE_MEM_ADDR.  */
> > +
> > +static bool
> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> > +{
> > +  /* Sometimes we do not have the proper value.  */
> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> > +    return false;
> > +
> > +  gcc_checking_assert (MEM_P (load_mem));
> > +
> > +  rtx load_mem_addr = get_addr (XEXP (load_mem, 0));
> > +  machine_mode load_mem_mode = GET_MODE (load_mem);
> > +  load_mem_addr = cselib_lookup (load_mem_addr, load_mem_mode, 1,
> > +                                 load_mem_mode)->val_rtx;
>
> Like I said in the previous review, it shouldn't be necessary to do any
> manual lookup on the load address.  rtx_equal_for_cselib_1 does the
> lookup itself.  Does that not work?

I thought you meant only that the if check was redundant here, which it was.
I'll reply on whether cselib can handle the load all by itself.

Thanks for the review!

Manos.

> The patch is OK with the four lines above deleted, if that works,
> and with s/if/while/.  But please reply if that combination doesn't work.
>
> Thanks,
> Richard
>
> > +  return rtx_equal_for_cselib_1 (store_mem_addr,
> > +                                 load_mem_addr,
> > +                                 store_mem_mode, 0);
> > +}
> > +
> > +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> > +   otherwise false.  STORE_EXPRS contains the stores.  */
> > +
> > +static bool
> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
> > +{
> > +  unsigned int load_count = 0;
> > +  bool forwarding = false;
> > +  rtx expr = PATTERN (insn);
> > +
> > +  if (GET_CODE (expr) != PARALLEL
> > +      || XVECLEN (expr, 0) != 2)
> > +    return false;
> > +
> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> > +    {
> > +      rtx op_1;
> > +      rtx out_exp = XVECEXP (expr, 0, i);
> > +
> > +      if (GET_CODE (out_exp) != SET)
> > +       continue;
> > +
> > +      if (!is_load (out_exp, op_1))
> > +       continue;
> > +
> > +      load_count++;
> > +
> > +      for (store_info str : store_exprs)
> > +       {
> > +         rtx store_insn = str.store_insn;
> > +
> > +         if (!is_forwarding (str.store_mem_addr, op_1,
> > +                             GET_MODE (SET_DEST (PATTERN (store_insn)))))
> > +           continue;
> > +
> > +         if (dump_file)
> > +           {
> > +             fprintf (dump_file,
> > +                      "Store forwarding to PARALLEL with loads:\n");
> > +             fprintf (dump_file, "  From: ");
> > +             print_rtl_single (dump_file, store_insn);
> > +             fprintf (dump_file, "  To: ");
> > +             print_rtl_single (dump_file, insn);
> > +           }
> > +
> > +         forwarding = true;
> > +       }
> > +    }
> > +
> > +  if (load_count == 2)
> > +    stats_ldp_count++;
> > +
> > +  return load_count == 2 && forwarding;
> > +}
> > +
> > +/* Break a load pair into its 2 distinct loads, except if the base source
> > +   address to load from is overwritten in the first load.  INSN should be the
> > +   PARALLEL of the load pair.  */
> > +
> > +static void
> > +break_ldp (rtx_insn *insn)
> > +{
> > +  rtx expr = PATTERN (insn);
> > +
> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> > +
> > +  rtx load_0 = XVECEXP (expr, 0, 0);
> > +  rtx load_1 = XVECEXP (expr, 0, 1);
> > +
> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > +
> > +  /* The base address was overwritten in the first load.  */
> > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> > +    return;
> > +
> > +  emit_insn_before (load_0, insn);
> > +  emit_insn_before (load_1, insn);
> > +  remove_insn (insn);
> > +
> > +  stats_transformed_count++;
> > +}
> > +
> > +static void
> > +scan_and_transform_bb_level ()
> > +{
> > +  rtx_insn *insn, *next;
> > +  basic_block bb;
> > +  FOR_EACH_BB_FN (bb, cfun)
> > +    {
> > +      list_store_info store_exprs;
> > +      unsigned int insn_cnt = 0;
> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> > +       {
> > +         next = NEXT_INSN (insn);
> > +
> > +         /* If we cross a CALL_P insn, clear the list, because the
> > +            small-store-to-large-load is unlikely to cause a performance
> > +            difference.  */
> > +         if (CALL_P (insn))
> > +           store_exprs.clear ();
> > +
> > +         if (!NONJUMP_INSN_P (insn))
> > +           continue;
> > +
> > +         cselib_process_insn (insn);
> > +
> > +         rtx expr = single_set (insn);
> > +
> > +         /* If a store is encountered, append it to the store_exprs list
> > +            to check it later.  */
> > +         if (expr && is_store (expr))
> > +           {
> > +             rtx store_mem = SET_DEST (expr);
> > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> > +             machine_mode store_mem_mode = GET_MODE (store_mem);
> > +             store_mem_addr = cselib_lookup (store_mem_addr,
> > +                                             store_mem_mode, 1,
> > +                                             store_mem_mode)->val_rtx;
> > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt++ });
> > +             stats_store_count++;
> > +           }
> > +
> > +         /* Check for small-store-to-large-load.  */
> > +         if (is_small_store_to_large_load (store_exprs, insn))
> > +           {
> > +             stats_ssll_count++;
> > +             break_ldp (insn);
> > +           }
> > +
> > +         /* Pop the first store from the list if its distance crosses the
> > +            maximum accepted threshold.  The list contains unique values
> > +            sorted in ascending order, meaning that only one distance can
> > +            be off at a time.  */
> > +         if (!store_exprs.empty ()
> > +             && (insn_cnt - store_exprs.front ().insn_cnt
> > +                 > (unsigned int) aarch64_store_forwarding_threshold_param))
> > +           store_exprs.pop_front ();
> > +       }
> > +    }
> > +}
> > +
> > +static void
> > +execute_avoid_store_forwarding ()
> > +{
> > +  init_alias_analysis ();
> > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> > +  scan_and_transform_bb_level ();
> > +  end_alias_analysis ();
> > +  cselib_finish ();
> > +  statistics_counter_event (cfun, "Number of stores identified: ",
> > +                           stats_store_count);
> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> > +                           stats_ldp_count);
> > +  statistics_counter_event (cfun,
> > +                           "Number of forwarding cases identified: ",
> > +                           stats_ssll_count);
> > +  statistics_counter_event (cfun, "Number of transformed cases: ",
> > +                           stats_transformed_count);
> > +}
> > +
> > +const pass_data pass_data_avoid_store_forwarding =
> > +{
> > +  RTL_PASS, /* type.  */
> > +  "avoid_store_forwarding", /* name.  */
> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > +  TV_NONE, /* tv_id.  */
> > +  0, /* properties_required.  */
> > +  0, /* properties_provided.  */
> > +  0, /* properties_destroyed.  */
> > +  0, /* todo_flags_start.  */
> > +  0 /* todo_flags_finish.  */
> > +};
> > +
> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return aarch64_flag_avoid_store_forwarding;
> > +    }
> > +
> > +  virtual unsigned int execute (function *)
> > +    {
> > +      execute_avoid_store_forwarding ();
> > +      return 0;
> > +    }
> > +
> > +}; // class pass_avoid_store_forwarding
> > +
> > +/* Create a new avoid store forwarding pass instance.  */
> > +
> > +rtl_opt_pass *
> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > +{
> > +  return new pass_avoid_store_forwarding (ctxt);
> > +}
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index f5a518202a1..e4498d53b46 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -304,6 +304,10 @@ moutline-atomics
> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >  Generate local calls to out-of-line atomic operations.
> >
> > +mavoid-store-forwarding
> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> > +Avoid store forwarding to load pairs.
> > +
> >  -param=aarch64-sve-compare-costs=
> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> >  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
> >
> >  EnumValue
> >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > +
> > +-param=aarch64-store-forwarding-threshold=
> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> > +Maximum instruction distance allowed between a store and a load pair for this to be
> > +considered a candidate to avoid when using -mavoid-store-forwarding.
> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > index a9a244ab6d6..7639b50358d 100644
> > --- a/gcc/config/aarch64/t-aarch64
> > +++ b/gcc/config/aarch64/t-aarch64
> > @@ -176,6 +176,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >               $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >
> > +aarch64-store-forwarding.o: \
> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > +    $(srcdir)/config/aarch64/aarch64-protos.h
> > +    $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > +            $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > +
> >  comma=,
> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 2b51ff304f6..39dbc04207e 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -798,7 +798,7 @@ Objective-C and Objective-C++ Dialects}.
> >  -moverride=@var{string}  -mverbose-cost-dump
> >  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > --moutline-atomics }
> > +-moutline-atomics -mavoid-store-forwarding}
> >
> >  @emph{Adapteva Epiphany Options}
> >  @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> > @@ -16738,6 +16738,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> >  source pointer is aligned to at least double the alignment of the type.
> >
> > +@item aarch64-store-forwarding-threshold
> > +Maximum allowed instruction distance between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +@option{-mavoid-store-forwarding}.
> > +
> >  @item aarch64-loop-vect-issue-rate-niters
> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >  rates into account when deciding whether a loop should be vectorized
> > @@ -20763,6 +20768,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
> >  the compiler from using floating-point and Advanced SIMD registers but will not
> >  impose any restrictions on the assembler.
> >
> > +@item -mavoid-store-forwarding
> > +@itemx -mno-avoid-store-forwarding
> > +Avoid store forwarding to load pairs.
> > +
> >  @opindex mlittle-endian
> >  @item -mlittle-endian
> >  Generate little-endian code.  This is the default when GCC is
> >  configured for an
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > new file mode 100644
> > index 00000000000..b77de6c64b6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Different address, same offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> > +  TYPE r, y; \
> > +  st_arr[0] = i; \
> > +  ld_arr[0] = dummy; \
> > +  r = st_arr_2[0]; \
> > +  y = st_arr_2[1]; \
> > +  return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > new file mode 100644
> > index 00000000000..f1b3a66abfd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, different offset, no overlap */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +  TYPE r, y; \
> > +  st_arr[0] = i; \
> > +  ld_arr[0] = dummy; \
> > +  r = st_arr[10]; \
> > +  y = st_arr[11]; \
> > +  return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > new file mode 100644
> > index 00000000000..8d5ce5cc87e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, same offset, overlap */
> > +
> > +#define LDP_SSLL_OVERLAP(TYPE) \
> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +  TYPE r, y; \
> > +  st_arr[0] = i; \
> > +  ld_arr[0] = dummy; \
> > +  r = st_arr[0]; \
> > +  y = st_arr[1]; \
> > +  return r + y; \
> > +}
> > +
> > +LDP_SSLL_OVERLAP(uint32_t)
> > +LDP_SSLL_OVERLAP(uint64_t)
> > +LDP_SSLL_OVERLAP(int32_t)
> > +LDP_SSLL_OVERLAP(int64_t)
> > +LDP_SSLL_OVERLAP(int)
> > +LDP_SSLL_OVERLAP(long)
> > +LDP_SSLL_OVERLAP(float)
> > +LDP_SSLL_OVERLAP(double)
> > +LDP_SSLL_OVERLAP(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > --
> > 2.41.0
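P.S. Illustrative only, not taken from the patch: with the series applied,
the new behaviour would be exercised by an invocation along the lines of

  gcc -O2 -mcpu=generic -mavoid-store-forwarding \
      --param=aarch64-store-forwarding-threshold=20 -S test.c

where the option and param names are the ones added above, 20 is the
documented default for the threshold, and test.c stands in for any of the
new testcases.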