From: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Philipp Tomsich <philipp.tomsich@vrull.eu>,
Manolis Tsamis <manolis.tsamis@vrull.eu>,
Richard Sandiford <richard.sandiford@arm.com>,
gcc-patches@gcc.gnu.org
Subject: Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
Date: Thu, 7 Dec 2023 12:27:56 +0200 [thread overview]
Message-ID: <CAL3A+7aL-S=gu=BJUmXfSvp_D8mF9hhTqHm_3aLRqaBykDMXpA@mail.gmail.com> (raw)
In-Reply-To: <CAFiYyc0HpArO03VkWrn+APkXZqQkBufFx5oC_xWSmk2Jv_Ow1Q@mail.gmail.com>
On Thu, Dec 7, 2023, 09:39 Richard Biener <richard.guenther@gmail.com> wrote:
> On Wed, Dec 6, 2023 at 6:15 PM Manos Anagnostakis
> <manos.anagnostakis@vrull.eu> wrote:
> >
> > Hi again,
> >
> > I went and tested the requested changes and found out the following:
> >
> > 1. The pass is currently increasing insn_cnt on a NONJUMP_INSN_P, which
> is a subset of NONDEBUG_INSN_P. I think there is no problem with depending
> on -g with the current version. Do you see something I don't or did you
> mean something else?
>
> It just occurred to me - thanks for double-checking (it wasn't obvious
> to me NONJUMP_INSN_P doesn't include DEBUG_INSNs ...).
>
> > 2. Not processing all instructions is not letting cselib record all the
> effects they have, thus it does not have updated information to find true
> forwardings at any given time. I can confirm this since I am witnessing
> many unexpected changes on the number of handled cases if I do this only
> for loads/stores.
>
> Ah, OK. I guess I don't fully get it, it seems you use cselib to
> compare addresses and while possibly not
> processing part of the address computation might break this other
> stmts inbetween should be uninteresting
> (at least I don't see you handling intermediate may-aliasing [small]
> stores to disable the splitting).
>
> So in the end it's a compile-time trade-off between relying solely on
> cselib or trying to optimize
> cselib use with DF for address computes?
>
> Richard.
>
I am not really familiar with the DF technique, but here is what we did:
at first we used a combination of alias analysis and cselib. While that
combination avoided false positives for stores whose address register was
changed in between, it also rejected some stores that were in fact valid
candidates for forwarding to an ldp.
In the end, using cselib alone with the correct lookups was the way to
handle both a register mismatch and an intermediate change to the address
value.
>
> > Thanks in advance and please let me know your thoughts on the above.
> > Manos.
> >
> > On Wed, Dec 6, 2023 at 5:10 PM Manos Anagnostakis <
> manos.anagnostakis@vrull.eu> wrote:
> >>
> >> Hi Richard,
> >>
> >> thanks for the useful comments.
> >>
> >> On Wed, Dec 6, 2023 at 4:32 PM Richard Biener <
> richard.guenther@gmail.com> wrote:
> >>>
> >>> On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> >>> <manos.anagnostakis@vrull.eu> wrote:
> >>> >
> >>> > This is an RTL pass that detects store forwarding from stores to
> larger loads (load pairs).
> >>> >
> >>> > This optimization is SPEC2017-driven and was found to be beneficial
> for some benchmarks,
> >>> > through testing on ampere1/ampere1a machines.
> >>> >
> >>> > For example, it can transform cases like
> >>> >
> >>> > str d5, [sp, #320]
> >>> > fmul d5, d31, d29
> >>> > ldp d31, d17, [sp, #312] # Large load from small store
> >>> >
> >>> > to
> >>> >
> >>> > str d5, [sp, #320]
> >>> > fmul d5, d31, d29
> >>> > ldr d31, [sp, #312]
> >>> > ldr d17, [sp, #320]
> >>> >
> >>> > Currently, the pass is disabled by default on all architectures and
> enabled by a target-specific option.
> >>> >
> >>> > If deemed beneficial enough for a default, it will be enabled on
> ampere1/ampere1a,
> >>> > or other architectures as well, without needing to be turned on by
> this option.
> >>>
> >>> What is aarch64-specific about the pass?
> >>
> >> The pass was designed to target load pairs, which are aarch64 specific,
> thus it cannot handle generic loads.
> >>>
> >>>
> >>> I see an increasingly large number of target specific passes pop up
> (probably
> >>> for the excuse we can generalize them if necessary). But GCC isn't
> LLVM
> >>> and this feels like getting out of hand?
> >>>
> >>> The x86 backend also has its store-forwarding "pass" as part of mdreorg
> >>> in ix86_split_stlf_stall_load.
> >>>
> >>> Richard.
> >>>
> >>> > Bootstrapped and regtested on aarch64-linux.
> >>> >
> >>> > gcc/ChangeLog:
> >>> >
> >>> > * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> >>> > * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New
> pass.
> >>> > * config/aarch64/aarch64-protos.h
> (make_pass_avoid_store_forwarding): Declare.
> >>> > * config/aarch64/aarch64.opt (mavoid-store-forwarding): New
> option.
> >>> > (aarch64-store-forwarding-threshold): New param.
> >>> > * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
> >>> > * doc/invoke.texi: Document new option and new param.
> >>> > * config/aarch64/aarch64-store-forwarding.cc: New file.
> >>> >
> >>> > gcc/testsuite/ChangeLog:
> >>> >
> >>> > * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> >>> > * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> >>> > * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >>> >
> >>> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> >>> > Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> >>> > Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> >>> > ---
> >>> > Changes in v6:
> >>> > - An obvious change. insn_cnt was incremented only on
> >>> > stores and not for every insn in the bb. Now restored.
> >>> >
> >>> > gcc/config.gcc | 1 +
> >>> > gcc/config/aarch64/aarch64-passes.def | 1 +
> >>> > gcc/config/aarch64/aarch64-protos.h | 1 +
> >>> > .../aarch64/aarch64-store-forwarding.cc | 318
> ++++++++++++++++++
> >>> > gcc/config/aarch64/aarch64.opt | 9 +
> >>> > gcc/config/aarch64/t-aarch64 | 10 +
> >>> > gcc/doc/invoke.texi | 11 +-
> >>> > .../aarch64/ldp_ssll_no_overlap_address.c | 33 ++
> >>> > .../aarch64/ldp_ssll_no_overlap_offset.c | 33 ++
> >>> > .../gcc.target/aarch64/ldp_ssll_overlap.c | 33 ++
> >>> > 10 files changed, 449 insertions(+), 1 deletion(-)
> >>> > create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> > create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> > create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> > create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> >
> >>> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> >>> > index 6450448f2f0..7c48429eb82 100644
> >>> > --- a/gcc/config.gcc
> >>> > +++ b/gcc/config.gcc
> >>> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >>> > cxx_target_objs="aarch64-c.o"
> >>> > d_target_objs="aarch64-d.o"
> >>> > extra_objs="aarch64-builtins.o aarch-common.o
> aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o
> aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o
> aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o
> falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> >>> > + extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >>> >
> target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >>> > target_has_targetm_common=yes
> >>> > ;;
> >>> > diff --git a/gcc/config/aarch64/aarch64-passes.def
> b/gcc/config/aarch64/aarch64-passes.def
> >>> > index 662a13fd5e6..94ced0aebf6 100644
> >>> > --- a/gcc/config/aarch64/aarch64-passes.def
> >>> > +++ b/gcc/config/aarch64/aarch64-passes.def
> >>> > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE
> (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
> >>> > INSERT_PASS_AFTER (pass_machine_reorg, 1,
> pass_tag_collision_avoidance);
> >>> > INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >>> > INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> >>> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> >>> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> >>> > index 60ff61f6d54..8f5f2ca4710 100644
> >>> > --- a/gcc/config/aarch64/aarch64-protos.h
> >>> > +++ b/gcc/config/aarch64/aarch64-protos.h
> >>> > @@ -1069,6 +1069,7 @@ rtl_opt_pass
> *make_pass_tag_collision_avoidance (gcc::context *);
> >>> > rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >>> > rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> >>> > rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
> >>> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >>> >
> >>> > poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >>> >
> >>> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc
> b/gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> > new file mode 100644
> >>> > index 00000000000..8a6faefd8c0
> >>> > --- /dev/null
> >>> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> > @@ -0,0 +1,318 @@
> >>> > +/* Avoid store forwarding optimization pass.
> >>> > + Copyright (C) 2023 Free Software Foundation, Inc.
> >>> > + Contributed by VRULL GmbH.
> >>> > +
> >>> > + This file is part of GCC.
> >>> > +
> >>> > + GCC is free software; you can redistribute it and/or modify it
> >>> > + under the terms of the GNU General Public License as published by
> >>> > + the Free Software Foundation; either version 3, or (at your
> option)
> >>> > + any later version.
> >>> > +
> >>> > + GCC is distributed in the hope that it will be useful, but
> >>> > + WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> >>> > + General Public License for more details.
> >>> > +
> >>> > + You should have received a copy of the GNU General Public License
> >>> > + along with GCC; see the file COPYING3. If not see
> >>> > + <http://www.gnu.org/licenses/>. */
> >>> > +
> >>> > +#define IN_TARGET_CODE 1
> >>> > +
> >>> > +#include "config.h"
> >>> > +#define INCLUDE_LIST
> >>> > +#include "system.h"
> >>> > +#include "coretypes.h"
> >>> > +#include "backend.h"
> >>> > +#include "rtl.h"
> >>> > +#include "alias.h"
> >>> > +#include "rtlanal.h"
> >>> > +#include "tree-pass.h"
> >>> > +#include "cselib.h"
> >>> > +
> >>> > +/* This is an RTL pass that detects store forwarding from stores to
> larger
> >>> > + loads (load pairs). For example, it can transform cases like
> >>> > +
> >>> > + str d5, [sp, #320]
> >>> > + fmul d5, d31, d29
> >>> > + ldp d31, d17, [sp, #312] # Large load from small store
> >>> > +
> >>> > + to
> >>> > +
> >>> > + str d5, [sp, #320]
> >>> > + fmul d5, d31, d29
> >>> > + ldr d31, [sp, #312]
> >>> > + ldr d17, [sp, #320]
> >>> > +
> >>> > + Design: The pass follows a straightforward design. It starts by
> >>> > + initializing the alias analysis and the cselib. Both of these
> are used to
> >>> > + find stores and larger loads with overlapping addresses, which
> are
> >>> > + candidates for store forwarding optimizations. It then scans on
> basic block
> >>> > + level to find stores that forward to larger loads and handles
> them
> >>> > + accordingly as described in the above example. Finally, the
> alias analysis
> >>> > + and the cselib library are closed. */
> >>> > +
> >>> > +typedef struct
> >>> > +{
> >>> > + rtx_insn *store_insn;
> >>> > + rtx store_mem_addr;
> >>> > + unsigned int insn_cnt;
> >>> > +} str_info;
> >>> > +
> >>> > +typedef std::list<str_info> list_store_info;
> >>> > +
> >>> > +/* Statistics counters. */
> >>> > +static unsigned int stats_store_count = 0;
> >>> > +static unsigned int stats_ldp_count = 0;
> >>> > +static unsigned int stats_ssll_count = 0;
> >>> > +static unsigned int stats_transformed_count = 0;
> >>> > +
> >>> > +/* Default. */
> >>> > +static rtx dummy;
> >>> > +static bool is_load (rtx expr, rtx &op_1=dummy);
> >>> > +
> >>> > +/* Return true if SET expression EXPR is a store; otherwise false.
> */
> >>> > +
> >>> > +static bool
> >>> > +is_store (rtx expr)
> >>> > +{
> >>> > + return MEM_P (SET_DEST (expr));
> >>> > +}
> >>> > +
> >>> > +/* Return true if SET expression EXPR is a load; otherwise false.
> OP_1 will
> >>> > + contain the MEM operand of the load. */
> >>> > +
> >>> > +static bool
> >>> > +is_load (rtx expr, rtx &op_1)
> >>> > +{
> >>> > + op_1 = SET_SRC (expr);
> >>> > +
> >>> > + if (GET_CODE (op_1) == ZERO_EXTEND
> >>> > + || GET_CODE (op_1) == SIGN_EXTEND)
> >>> > + op_1 = XEXP (op_1, 0);
> >>> > +
> >>> > + return MEM_P (op_1);
> >>> > +}
> >>> > +
> >>> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of
> LOAD_MEM;
> >>> > + otherwise false. STORE_MEM_MODE is the mode of the MEM rtx
> containing
> >>> > + STORE_MEM_ADDR. */
> >>> > +
> >>> > +static bool
> >>> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode
> store_mem_mode)
> >>> > +{
> >>> > + /* Sometimes we do not have the proper value. */
> >>> > + if (!CSELIB_VAL_PTR (store_mem_addr))
> >>> > + return false;
> >>> > +
> >>> > + gcc_checking_assert (MEM_P (load_mem));
> >>> > +
> >>> > + return rtx_equal_for_cselib_1 (store_mem_addr,
> >>> > + get_addr (XEXP (load_mem, 0)),
> >>> > + store_mem_mode, 0);
> >>> > +}
> >>> > +
> >>> > +/* Return true if INSN is a load pair, preceded by a store
> forwarding to it;
> >>> > + otherwise false. STORE_EXPRS contains the stores. */
> >>> > +
> >>> > +static bool
> >>> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn
> *insn)
> >>> > +{
> >>> > + unsigned int load_count = 0;
> >>> > + bool forwarding = false;
> >>> > + rtx expr = PATTERN (insn);
> >>> > +
> >>> > + if (GET_CODE (expr) != PARALLEL
> >>> > + || XVECLEN (expr, 0) != 2)
> >>> > + return false;
> >>> > +
> >>> > + for (int i = 0; i < XVECLEN (expr, 0); i++)
> >>> > + {
> >>> > + rtx op_1;
> >>> > + rtx out_exp = XVECEXP (expr, 0, i);
> >>> > +
> >>> > + if (GET_CODE (out_exp) != SET)
> >>> > + continue;
> >>> > +
> >>> > + if (!is_load (out_exp, op_1))
> >>> > + continue;
> >>> > +
> >>> > + load_count++;
> >>> > +
> >>> > + for (str_info str : store_exprs)
> >>> > + {
> >>> > + rtx store_insn = str.store_insn;
> >>> > +
> >>> > + if (!is_forwarding (str.store_mem_addr, op_1,
> >>> > + GET_MODE (SET_DEST (PATTERN
> (store_insn)))))
> >>> > + continue;
> >>> > +
> >>> > + if (dump_file)
> >>> > + {
> >>> > + fprintf (dump_file,
> >>> > + "Store forwarding to PARALLEL with loads:\n");
> >>> > + fprintf (dump_file, " From: ");
> >>> > + print_rtl_single (dump_file, store_insn);
> >>> > + fprintf (dump_file, " To: ");
> >>> > + print_rtl_single (dump_file, insn);
> >>> > + }
> >>> > +
> >>> > + forwarding = true;
> >>> > + }
> >>> > + }
> >>> > +
> >>> > + if (load_count == 2)
> >>> > + stats_ldp_count++;
> >>> > +
> >>> > + return load_count == 2 && forwarding;
> >>> > +}
> >>> > +
> >>> > +/* Break a load pair into its 2 distinct loads, except if the base
> source
> >>> > +   address to load from is overwritten in the first load.  INSN
> should be the
> >>> > + PARALLEL of the load pair. */
> >>> > +
> >>> > +static void
> >>> > +break_ldp (rtx_insn *insn)
> >>> > +{
> >>> > + rtx expr = PATTERN (insn);
> >>> > +
> >>> > + gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN
> (expr, 0) == 2);
> >>> > +
> >>> > + rtx load_0 = XVECEXP (expr, 0, 0);
> >>> > + rtx load_1 = XVECEXP (expr, 0, 1);
> >>> > +
> >>> > + gcc_checking_assert (is_load (load_0) && is_load (load_1));
> >>> > +
> >>> > +  /* The base address was overwritten in the first load.  */
> >>> > + if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> >>> > + return;
> >>> > +
> >>> > + emit_insn_before (load_0, insn);
> >>> > + emit_insn_before (load_1, insn);
> >>> > + remove_insn (insn);
> >>> > +
> >>> > + stats_transformed_count++;
> >>> > +}
> >>> > +
> >>> > +static void
> >>> > +scan_and_transform_bb_level ()
> >>> > +{
> >>> > + rtx_insn *insn, *next;
> >>> > + basic_block bb;
> >>> > + FOR_EACH_BB_FN (bb, cfun)
> >>> > + {
> >>> > + list_store_info store_exprs;
> >>> > + unsigned int insn_cnt = 0;
> >>> > + for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb));
> insn = next)
> >>> > + {
> >>> > + next = NEXT_INSN (insn);
> >>>
> >>> You probably want NONDEBUG here, otherwise insn_cnt will depend
> >>> on -g?
> >>>
> >>> > + /* If we cross a CALL_P insn, clear the list, because the
> >>> > + small-store-to-large-load is unlikely to cause
> performance
> >>> > + difference. */
> >>> > + if (CALL_P (insn))
> >>> > + store_exprs.clear ();
> >>> > +
> >>> > + if (!NONJUMP_INSN_P (insn))
> >>> > + continue;
> >>> > +
> >>> > + cselib_process_insn (insn);
> >>>
> >>> is it necessary to process each insn with cselib? Only loads & stores
> I guess?
> >>>
> >>> > + rtx expr = single_set (insn);
> >>> > +
> >>> > + /* If a store is encountered, append it to the store_exprs
> list to
> >>> > + check it later. */
> >>> > + if (expr && is_store (expr))
> >>> > + {
> >>> > + rtx store_mem = SET_DEST (expr);
> >>> > + rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> >>> > + machine_mode store_mem_mode = GET_MODE (store_mem);
> >>> > + store_mem_addr = cselib_lookup (store_mem_addr,
> >>> > + store_mem_mode, 1,
> >>> > +
> store_mem_mode)->val_rtx;
> >>> > + store_exprs.push_back ({ insn, store_mem_addr,
> insn_cnt });
> >>> > + stats_store_count++;
> >>> > + }
> >>> > +
> >>> > + /* Check for small-store-to-large-load. */
> >>> > + if (is_small_store_to_large_load (store_exprs, insn))
> >>> > + {
> >>> > + stats_ssll_count++;
> >>> > + break_ldp (insn);
> >>> > + }
> >>> > +
> >>> > +         /* Pop the first store from the list if its distance
> crosses the
> >>> > + maximum accepted threshold. The list contains unique
> values
> >>> > + sorted in ascending order, meaning that only one
> distance can be
> >>> > + off at a time. */
> >>> > + if (!store_exprs.empty ()
> >>> > + && (insn_cnt - store_exprs.front ().insn_cnt
> >>> > + > (unsigned int)
> aarch64_store_forwarding_threshold_param))
> >>> > + store_exprs.pop_front ();
> >>> > +
> >>> > + insn_cnt++;
> >>> > + }
> >>> > + }
> >>> > +}
> >>> > +
> >>> > +static void
> >>> > +execute_avoid_store_forwarding ()
> >>> > +{
> >>> > + init_alias_analysis ();
> >>> > + cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> >>> > + scan_and_transform_bb_level ();
> >>> > + end_alias_analysis ();
> >>> > + cselib_finish ();
> >>> > + statistics_counter_event (cfun, "Number of stores identified: ",
> >>> > + stats_store_count);
> >>> > + statistics_counter_event (cfun, "Number of load pairs identified:
> ",
> >>> > + stats_ldp_count);
> >>> > + statistics_counter_event (cfun,
> >>> > + "Number of forwarding cases identified:
> ",
> >>> > + stats_ssll_count);
> >>> > + statistics_counter_event (cfun, "Number of trasformed cases: ",
> >>> > + stats_transformed_count);
> >>> > +}
> >>> > +
> >>> > +const pass_data pass_data_avoid_store_forwarding =
> >>> > +{
> >>> > + RTL_PASS, /* type. */
> >>> > + "avoid_store_forwarding", /* name. */
> >>> > + OPTGROUP_NONE, /* optinfo_flags. */
> >>> > + TV_NONE, /* tv_id. */
> >>> > + 0, /* properties_required. */
> >>> > + 0, /* properties_provided. */
> >>> > + 0, /* properties_destroyed. */
> >>> > + 0, /* todo_flags_start. */
> >>> > + 0 /* todo_flags_finish. */
> >>> > +};
> >>> > +
> >>> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> >>> > +{
> >>> > +public:
> >>> > + pass_avoid_store_forwarding (gcc::context *ctxt)
> >>> > + : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> >>> > + {}
> >>> > +
> >>> > + /* opt_pass methods: */
> >>> > + virtual bool gate (function *)
> >>> > + {
> >>> > + return aarch64_flag_avoid_store_forwarding;
> >>> > + }
> >>> > +
> >>> > + virtual unsigned int execute (function *)
> >>> > + {
> >>> > + execute_avoid_store_forwarding ();
> >>> > + return 0;
> >>> > + }
> >>> > +
> >>> > +}; // class pass_avoid_store_forwarding
> >>> > +
> >>> > +/* Create a new avoid store forwarding pass instance. */
> >>> > +
> >>> > +rtl_opt_pass *
> >>> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> >>> > +{
> >>> > + return new pass_avoid_store_forwarding (ctxt);
> >>> > +}
> >>> > diff --git a/gcc/config/aarch64/aarch64.opt
> b/gcc/config/aarch64/aarch64.opt
> >>> > index f5a518202a1..e4498d53b46 100644
> >>> > --- a/gcc/config/aarch64/aarch64.opt
> >>> > +++ b/gcc/config/aarch64/aarch64.opt
> >>> > @@ -304,6 +304,10 @@ moutline-atomics
> >>> > Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >>> > Generate local calls to out-of-line atomic operations.
> >>> >
> >>> > +mavoid-store-forwarding
> >>> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0)
> Optimization
> >>> > +Avoid store forwarding to load pairs.
> >>> > +
> >>> > -param=aarch64-sve-compare-costs=
> >>> > Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1)
> IntegerRange(0, 1) Param
> >>> > When vectorizing for SVE, consider using unpacked vectors for
> smaller elements and use the cost model to pick the cheapest approach.
> Also use the cost model to choose between SVE and Advanced SIMD
> vectorization.
> >>> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never)
> Value(AARCH64_LDP_STP_POLICY_NEVER)
> >>> >
> >>> > EnumValue
> >>> > Enum(aarch64_ldp_stp_policy) String(aligned)
> Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> >>> > +
> >>> > +-param=aarch64-store-forwarding-threshold=
> >>> > +Target Joined UInteger
> Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> >>> > +Maximum instruction distance allowed between a store and a load
> pair for this to be
> >>> > +considered a candidate to avoid when using -mavoid-store-forwarding.
> >>> > diff --git a/gcc/config/aarch64/t-aarch64
> b/gcc/config/aarch64/t-aarch64
> >>> > index 0d96ae3d0b2..5676cdd9585 100644
> >>> > --- a/gcc/config/aarch64/t-aarch64
> >>> > +++ b/gcc/config/aarch64/t-aarch64
> >>> > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o:
> $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >>> > $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS)
> $(INCLUDES) \
> >>> > $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >>> >
> >>> > +aarch64-store-forwarding.o: \
> >>> > + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> >>> > + $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h
> $(RTL_BASE_H) \
> >>> > + dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H)
> $(RECOG_H) \
> >>> > + output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> >>> > + $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> >>> > + $(srcdir)/config/aarch64/aarch64-protos.h
> >>> > + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS)
> $(INCLUDES) \
> >>> > + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> >>> > +
> >>> > comma=,
> >>> > MULTILIB_OPTIONS = $(subst $(comma),/, $(patsubst %, mabi=%,
> $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >>> > MULTILIB_DIRNAMES = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> >>> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> >>> > index 32f535e1ed4..9bf3a83286a 100644
> >>> > --- a/gcc/doc/invoke.texi
> >>> > +++ b/gcc/doc/invoke.texi
> >>> > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
> >>> > -moverride=@var{string} -mverbose-cost-dump
> >>> > -mstack-protector-guard=@var{guard}
> -mstack-protector-guard-reg=@var{sysreg}
> >>> > -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> >>> > --moutline-atomics }
> >>> > +-moutline-atomics -mavoid-store-forwarding}
> >>> >
> >>> > @emph{Adapteva Epiphany Options}
> >>> > @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> >>> > @@ -16774,6 +16774,11 @@ With
> @option{--param=aarch64-stp-policy=never}, do not emit stp.
> >>> > With @option{--param=aarch64-stp-policy=aligned}, emit stp only if
> the
> >>> > source pointer is aligned to at least double the alignment of the
> type.
> >>> >
> >>> > +@item aarch64-store-forwarding-threshold
> >>> > +Maximum allowed instruction distance between a store and a load
> pair for
> >>> > +this to be considered a candidate to avoid when using
> >>> > +@option{-mavoid-store-forwarding}.
> >>> > +
> >>> > @item aarch64-loop-vect-issue-rate-niters
> >>> > The tuning for some AArch64 CPUs tries to take both latencies and
> issue
> >>> > rates into account when deciding whether a loop should be vectorized
> >>> > @@ -20857,6 +20862,10 @@ Generate code which uses only the
> general-purpose registers. This will prevent
> >>> > the compiler from using floating-point and Advanced SIMD registers
> but will not
> >>> > impose any restrictions on the assembler.
> >>> >
> >>> > +@item -mavoid-store-forwarding
> >>> > +@itemx -mno-avoid-store-forwarding
> >>> > +Avoid store forwarding to load pairs.
> >>> > +
> >>> > @opindex mlittle-endian
> >>> > @item -mlittle-endian
> >>> > Generate little-endian code. This is the default when GCC is
> configured for an
> >>> > diff --git
> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> > new file mode 100644
> >>> > index 00000000000..b77de6c64b6
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Different address, same offset, no overlap */
> >>> > +
> >>> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> >>> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr,
> TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> >>> > + TYPE r, y; \
> >>> > + st_arr[0] = i; \
> >>> > + ld_arr[0] = dummy; \
> >>> > + r = st_arr_2[0]; \
> >>> > + y = st_arr_2[1]; \
> >>> > + return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 }
> } */
> >>> > diff --git
> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> > new file mode 100644
> >>> > index 00000000000..f1b3a66abfd
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Same address, different offset, no overlap */
> >>> > +
> >>> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> >>> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr,
> TYPE i, TYPE dummy){ \
> >>> > + TYPE r, y; \
> >>> > + st_arr[0] = i; \
> >>> > + ld_arr[0] = dummy; \
> >>> > + r = st_arr[10]; \
> >>> > + y = st_arr[11]; \
> >>> > + return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 }
> } */
> >>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> > new file mode 100644
> >>> > index 00000000000..8d5ce5cc87e
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Same address, same offset, overlap */
> >>> > +
> >>> > +#define LDP_SSLL_OVERLAP(TYPE) \
> >>> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i,
> TYPE dummy){ \
> >>> > + TYPE r, y; \
> >>> > + st_arr[0] = i; \
> >>> > + ld_arr[0] = dummy; \
> >>> > + r = st_arr[0]; \
> >>> > + y = st_arr[1]; \
> >>> > + return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_OVERLAP(uint32_t)
> >>> > +LDP_SSLL_OVERLAP(uint64_t)
> >>> > +LDP_SSLL_OVERLAP(int32_t)
> >>> > +LDP_SSLL_OVERLAP(int64_t)
> >>> > +LDP_SSLL_OVERLAP(int)
> >>> > +LDP_SSLL_OVERLAP(long)
> >>> > +LDP_SSLL_OVERLAP(float)
> >>> > +LDP_SSLL_OVERLAP(double)
> >>> > +LDP_SSLL_OVERLAP(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 }
> } */
> >>> > --
> >>> > 2.41.0
> >>
> >> I will address the other two points in a separate version.
> >>
> >> --
> >> Manos Anagnostakis | Compiler Engineer |
> >> E: manos.anagnostakis@vrull.eu
> >>
> >> VRULL GmbH | Beatrixgasse 32 1030 Vienna |
> >> W: www.vrull.eu | LinkedIn
> >
> >
> >
> > --
> > Manos Anagnostakis | Compiler Engineer |
> > E: manos.anagnostakis@vrull.eu
> >
> > VRULL GmbH | Beatrixgasse 32 1030 Vienna |
> > W: www.vrull.eu | LinkedIn
>
Thread overview: 11+ messages (~2023-12-07 10:28 UTC)
2023-12-06 13:46 Manos Anagnostakis
2023-12-06 14:28 ` Richard Biener
2023-12-06 15:10 ` Manos Anagnostakis
2023-12-06 17:15 ` Manos Anagnostakis
2023-12-07 7:35 ` Richard Biener
2023-12-07 10:27 ` Manos Anagnostakis [this message]
2023-12-06 18:44 ` Philipp Tomsich
2023-12-07 7:31 ` Richard Biener
2023-12-07 12:20 ` Richard Sandiford
2023-12-07 14:09 ` Richard Biener
2023-12-08 15:47 ` Manos Anagnostakis