public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
@ 2023-12-06 13:46 Manos Anagnostakis
  2023-12-06 14:28 ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Manos Anagnostakis @ 2023-12-06 13:46 UTC (permalink / raw)
  To: gcc-patches
  Cc: Philipp Tomsich, Richard Sandiford, Manos Anagnostakis, Manolis Tsamis

This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).

This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
through testing on ampere1/ampere1a machines.

For example, it can transform cases like

str  d5, [sp, #320]
fmul d5, d31, d29
ldp  d31, d17, [sp, #312] # Large load from small store

to

str  d5, [sp, #320]
fmul d5, d31, d29
ldr  d31, [sp, #312]
ldr  d17, [sp, #320]

Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.

If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
or other architectures as well, without needing to be turned on by this option.

Bootstrapped and regtested on aarch64-linux.

gcc/ChangeLog:

        * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
        * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
        * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
        * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
	(aarch64-store-forwarding-threshold): New param.
        * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
        * doc/invoke.texi: Document new option and new param.
        * config/aarch64/aarch64-store-forwarding.cc: New file.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
        * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
        * gcc.target/aarch64/ldp_ssll_overlap.c: New test.

Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
---
Changes in v6:
	- An obvious change. insn_cnt was incremented only on
	  stores and not for every insn in the bb. Now restored.

 gcc/config.gcc                                |   1 +
 gcc/config/aarch64/aarch64-passes.def         |   1 +
 gcc/config/aarch64/aarch64-protos.h           |   1 +
 .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
 gcc/config/aarch64/aarch64.opt                |   9 +
 gcc/config/aarch64/t-aarch64                  |  10 +
 gcc/doc/invoke.texi                           |  11 +-
 .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
 .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
 .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
 10 files changed, 449 insertions(+), 1 deletion(-)
 create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 6450448f2f0..7c48429eb82 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -350,6 +350,7 @@ aarch64*-*-*)
 	cxx_target_objs="aarch64-c.o"
 	d_target_objs="aarch64-d.o"
 	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
+	extra_objs="${extra_objs} aarch64-store-forwarding.o"
 	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
 	target_has_targetm_common=yes
 	;;
diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
index 662a13fd5e6..94ced0aebf6 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -24,3 +24,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
 INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
 INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
 INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
+INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 60ff61f6d54..8f5f2ca4710 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
 rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
 rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
 rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
+rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);

 poly_uint64 aarch64_regmode_natural_size (machine_mode);

diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
new file mode 100644
index 00000000000..8a6faefd8c0
--- /dev/null
+++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
@@ -0,0 +1,318 @@
+/* Avoid store forwarding optimization pass.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   Contributed by VRULL GmbH.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#define INCLUDE_LIST
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "alias.h"
+#include "rtlanal.h"
+#include "tree-pass.h"
+#include "cselib.h"
+
+/* This is an RTL pass that detects store forwarding from stores to larger
+   loads (load pairs). For example, it can transform cases like
+
+   str  d5, [sp, #320]
+   fmul d5, d31, d29
+   ldp  d31, d17, [sp, #312] # Large load from small store
+
+   to
+
+   str  d5, [sp, #320]
+   fmul d5, d31, d29
+   ldr  d31, [sp, #312]
+   ldr  d17, [sp, #320]
+
+   Design: The pass follows a straightforward design.  It starts by
+   initializing the alias analysis and the cselib.  Both of these are used to
+   find stores and larger loads with overlapping addresses, which are
+   candidates for store forwarding optimizations.  It then scans on basic block
+   level to find stores that forward to larger loads and handles them
+   accordingly as described in the above example.  Finally, the alias analysis
+   and the cselib library are closed.  */
+
+typedef struct
+{
+  rtx_insn *store_insn;
+  rtx store_mem_addr;
+  unsigned int insn_cnt;
+} str_info;
+
+typedef std::list<str_info> list_store_info;
+
+/* Statistics counters.  */
+static unsigned int stats_store_count = 0;
+static unsigned int stats_ldp_count = 0;
+static unsigned int stats_ssll_count = 0;
+static unsigned int stats_transformed_count = 0;
+
+/* Default.  */
+static rtx dummy;
+static bool is_load (rtx expr, rtx &op_1=dummy);
+
+/* Return true if SET expression EXPR is a store; otherwise false.  */
+
+static bool
+is_store (rtx expr)
+{
+  return MEM_P (SET_DEST (expr));
+}
+
+/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
+   contain the MEM operand of the load.  */
+
+static bool
+is_load (rtx expr, rtx &op_1)
+{
+  op_1 = SET_SRC (expr);
+
+  if (GET_CODE (op_1) == ZERO_EXTEND
+      || GET_CODE (op_1) == SIGN_EXTEND)
+    op_1 = XEXP (op_1, 0);
+
+  return MEM_P (op_1);
+}
+
+/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
+   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
+   STORE_MEM_ADDR.  */
+
+static bool
+is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
+{
+  /* Sometimes we do not have the proper value.  */
+  if (!CSELIB_VAL_PTR (store_mem_addr))
+    return false;
+
+  gcc_checking_assert (MEM_P (load_mem));
+
+  return rtx_equal_for_cselib_1 (store_mem_addr,
+				 get_addr (XEXP (load_mem, 0)),
+				 store_mem_mode, 0);
+}
+
+/* Return true if INSN is a load pair, preceded by a store forwarding to it;
+   otherwise false.  STORE_EXPRS contains the stores.  */
+
+static bool
+is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
+{
+  unsigned int load_count = 0;
+  bool forwarding = false;
+  rtx expr = PATTERN (insn);
+
+  if (GET_CODE (expr) != PARALLEL
+      || XVECLEN (expr, 0) != 2)
+    return false;
+
+  for (int i = 0; i < XVECLEN (expr, 0); i++)
+    {
+      rtx op_1;
+      rtx out_exp = XVECEXP (expr, 0, i);
+
+      if (GET_CODE (out_exp) != SET)
+	continue;
+
+      if (!is_load (out_exp, op_1))
+	continue;
+
+      load_count++;
+
+      for (str_info str : store_exprs)
+	{
+	  rtx store_insn = str.store_insn;
+
+	  if (!is_forwarding (str.store_mem_addr, op_1,
+			      GET_MODE (SET_DEST (PATTERN (store_insn)))))
+	    continue;
+
+	  if (dump_file)
+	    {
+	      fprintf (dump_file,
+		       "Store forwarding to PARALLEL with loads:\n");
+	      fprintf (dump_file, "  From: ");
+	      print_rtl_single (dump_file, store_insn);
+	      fprintf (dump_file, "  To: ");
+	      print_rtl_single (dump_file, insn);
+	    }
+
+	  forwarding = true;
+       }
+    }
+
+  if (load_count == 2)
+    stats_ldp_count++;
+
+  return load_count == 2 && forwarding;
+}
+
+/* Break a load pair into its 2 distinct loads, except if the base source
+   address to load from is overwriten in the first load.  INSN should be the
+   PARALLEL of the load pair.  */
+
+static void
+break_ldp (rtx_insn *insn)
+{
+  rtx expr = PATTERN (insn);
+
+  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
+
+  rtx load_0 = XVECEXP (expr, 0, 0);
+  rtx load_1 = XVECEXP (expr, 0, 1);
+
+  gcc_checking_assert (is_load (load_0) && is_load (load_1));
+
+  /* The base address was overwriten in the first load.  */
+  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
+    return;
+
+  emit_insn_before (load_0, insn);
+  emit_insn_before (load_1, insn);
+  remove_insn (insn);
+
+  stats_transformed_count++;
+}
+
+static void
+scan_and_transform_bb_level ()
+{
+  rtx_insn *insn, *next;
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      list_store_info store_exprs;
+      unsigned int insn_cnt = 0;
+      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
+	{
+	  next = NEXT_INSN (insn);
+
+	  /* If we cross a CALL_P insn, clear the list, because the
+	     small-store-to-large-load is unlikely to cause performance
+	     difference.  */
+	  if (CALL_P (insn))
+	    store_exprs.clear ();
+
+	  if (!NONJUMP_INSN_P (insn))
+	    continue;
+
+	  cselib_process_insn (insn);
+
+	  rtx expr = single_set (insn);
+
+	  /* If a store is encountered, append it to the store_exprs list to
+	     check it later.  */
+	  if (expr && is_store (expr))
+	    {
+	      rtx store_mem = SET_DEST (expr);
+	      rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
+	      machine_mode store_mem_mode = GET_MODE (store_mem);
+	      store_mem_addr = cselib_lookup (store_mem_addr,
+					      store_mem_mode, 1,
+					      store_mem_mode)->val_rtx;
+	      store_exprs.push_back ({ insn, store_mem_addr, insn_cnt });
+	      stats_store_count++;
+	    }
+
+	  /* Check for small-store-to-large-load.  */
+	  if (is_small_store_to_large_load (store_exprs, insn))
+	    {
+	      stats_ssll_count++;
+	      break_ldp (insn);
+	    }
+
+	  /* Pop the first store from the list if it's distance crosses the
+	     maximum accepted threshold.  The list contains unique values
+	     sorted in ascending order, meaning that only one distance can be
+	     off at a time.  */
+	  if (!store_exprs.empty ()
+	      && (insn_cnt - store_exprs.front ().insn_cnt
+		 > (unsigned int) aarch64_store_forwarding_threshold_param))
+	    store_exprs.pop_front ();
+
+	  insn_cnt++;
+	}
+    }
+}
+
+static void
+execute_avoid_store_forwarding ()
+{
+  init_alias_analysis ();
+  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
+  scan_and_transform_bb_level ();
+  end_alias_analysis ();
+  cselib_finish ();
+  statistics_counter_event (cfun, "Number of stores identified: ",
+			    stats_store_count);
+  statistics_counter_event (cfun, "Number of load pairs identified: ",
+			    stats_ldp_count);
+  statistics_counter_event (cfun,
+			    "Number of forwarding cases identified: ",
+			    stats_ssll_count);
+  statistics_counter_event (cfun, "Number of trasformed cases: ",
+			    stats_transformed_count);
+}
+
+const pass_data pass_data_avoid_store_forwarding =
+{
+  RTL_PASS, /* type.  */
+  "avoid_store_forwarding", /* name.  */
+  OPTGROUP_NONE, /* optinfo_flags.  */
+  TV_NONE, /* tv_id.  */
+  0, /* properties_required.  */
+  0, /* properties_provided.  */
+  0, /* properties_destroyed.  */
+  0, /* todo_flags_start.  */
+  0 /* todo_flags_finish.  */
+};
+
+class pass_avoid_store_forwarding : public rtl_opt_pass
+{
+public:
+  pass_avoid_store_forwarding (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return aarch64_flag_avoid_store_forwarding;
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      execute_avoid_store_forwarding ();
+      return 0;
+    }
+
+}; // class pass_avoid_store_forwarding
+
+/* Create a new avoid store forwarding pass instance.  */
+
+rtl_opt_pass *
+make_pass_avoid_store_forwarding (gcc::context *ctxt)
+{
+  return new pass_avoid_store_forwarding (ctxt);
+}
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index f5a518202a1..e4498d53b46 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -304,6 +304,10 @@ moutline-atomics
 Target Var(aarch64_flag_outline_atomics) Init(2) Save
 Generate local calls to out-of-line atomic operations.

+mavoid-store-forwarding
+Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
+Avoid store forwarding to load pairs.
+
 -param=aarch64-sve-compare-costs=
 Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
 When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
@@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)

 EnumValue
 Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
+
+-param=aarch64-store-forwarding-threshold=
+Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
+Maximum instruction distance allowed between a store and a load pair for this to be
+considered a candidate to avoid when using -mavoid-store-forwarding.
diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
index 0d96ae3d0b2..5676cdd9585 100644
--- a/gcc/config/aarch64/t-aarch64
+++ b/gcc/config/aarch64/t-aarch64
@@ -194,6 +194,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
 	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
 		$(srcdir)/config/aarch64/aarch64-cc-fusion.cc

+aarch64-store-forwarding.o: \
+    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
+    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
+    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
+    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
+    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
+    $(srcdir)/config/aarch64/aarch64-protos.h
+	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+		$(srcdir)/config/aarch64/aarch64-store-forwarding.cc
+
 comma=,
 MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
 MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 32f535e1ed4..9bf3a83286a 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
 -moverride=@var{string}  -mverbose-cost-dump
 -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
 -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
--moutline-atomics }
+-moutline-atomics -mavoid-store-forwarding}

 @emph{Adapteva Epiphany Options}
 @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
@@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
 With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
 source pointer is aligned to at least double the alignment of the type.

+@item aarch64-store-forwarding-threshold
+Maximum allowed instruction distance between a store and a load pair for
+this to be considered a candidate to avoid when using
+@option{-mavoid-store-forwarding}.
+
 @item aarch64-loop-vect-issue-rate-niters
 The tuning for some AArch64 CPUs tries to take both latencies and issue
 rates into account when deciding whether a loop should be vectorized
@@ -20857,6 +20862,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
 the compiler from using floating-point and Advanced SIMD registers but will not
 impose any restrictions on the assembler.

+@item -mavoid-store-forwarding
+@itemx -mno-avoid-store-forwarding
+Avoid store forwarding to load pairs.
+
 @opindex mlittle-endian
 @item -mlittle-endian
 Generate little-endian code.  This is the default when GCC is configured for an
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
new file mode 100644
index 00000000000..b77de6c64b6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
@@ -0,0 +1,33 @@
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Different address, same offset, no overlap  */
+
+#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
+TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
+	TYPE r, y; \
+	st_arr[0] = i; \
+	ld_arr[0] = dummy; \
+	r = st_arr_2[0]; \
+	y = st_arr_2[1]; \
+	return r + y; \
+}
+
+LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
+LDP_SSLL_NO_OVERLAP_ADDRESS(int)
+LDP_SSLL_NO_OVERLAP_ADDRESS(long)
+LDP_SSLL_NO_OVERLAP_ADDRESS(float)
+LDP_SSLL_NO_OVERLAP_ADDRESS(double)
+LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
new file mode 100644
index 00000000000..f1b3a66abfd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
@@ -0,0 +1,33 @@
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Same address, different offset, no overlap  */
+
+#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
+TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
+	TYPE r, y; \
+	st_arr[0] = i; \
+	ld_arr[0] = dummy; \
+	r = st_arr[10]; \
+	y = st_arr[11]; \
+	return r + y; \
+}
+
+LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
+LDP_SSLL_NO_OVERLAP_OFFSET(int)
+LDP_SSLL_NO_OVERLAP_OFFSET(long)
+LDP_SSLL_NO_OVERLAP_OFFSET(float)
+LDP_SSLL_NO_OVERLAP_OFFSET(double)
+LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
new file mode 100644
index 00000000000..8d5ce5cc87e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
@@ -0,0 +1,33 @@
+/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
+
+#include <stdint.h>
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+/* Same address, same offset, overlap  */
+
+#define LDP_SSLL_OVERLAP(TYPE) \
+TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
+	TYPE r, y; \
+	st_arr[0] = i; \
+	ld_arr[0] = dummy; \
+	r = st_arr[0]; \
+	y = st_arr[1]; \
+	return r + y; \
+}
+
+LDP_SSLL_OVERLAP(uint32_t)
+LDP_SSLL_OVERLAP(uint64_t)
+LDP_SSLL_OVERLAP(int32_t)
+LDP_SSLL_OVERLAP(int64_t)
+LDP_SSLL_OVERLAP(int)
+LDP_SSLL_OVERLAP(long)
+LDP_SSLL_OVERLAP(float)
+LDP_SSLL_OVERLAP(double)
+LDP_SSLL_OVERLAP(v4si)
+
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
+/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
--
2.41.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-06 13:46 [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding Manos Anagnostakis
@ 2023-12-06 14:28 ` Richard Biener
  2023-12-06 15:10   ` Manos Anagnostakis
  2023-12-06 18:44   ` Philipp Tomsich
  0 siblings, 2 replies; 11+ messages in thread
From: Richard Biener @ 2023-12-06 14:28 UTC (permalink / raw)
  To: Manos Anagnostakis
  Cc: gcc-patches, Philipp Tomsich, Richard Sandiford, Manolis Tsamis

On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
<manos.anagnostakis@vrull.eu> wrote:
>
> This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
>
> This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
> through testing on ampere1/ampere1a machines.
>
> For example, it can transform cases like
>
> str  d5, [sp, #320]
> fmul d5, d31, d29
> ldp  d31, d17, [sp, #312] # Large load from small store
>
> to
>
> str  d5, [sp, #320]
> fmul d5, d31, d29
> ldr  d31, [sp, #312]
> ldr  d17, [sp, #320]
>
> Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.
>
> If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
> or other architectures as well, without needing to be turned on by this option.

What is aarch64-specific about the pass?

I see an increasingly large number of target specific passes pop up (probably
for the excuse we can generalize them if necessary).  But GCC isn't LLVM
and this feels like getting out of hand?

The x86 backend also has its store-forwarding "pass" as part of mdreorg
in ix86_split_stlf_stall_load.

Richard.

> Bootstrapped and regtested on aarch64-linux.
>
> gcc/ChangeLog:
>
>         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
>         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
>         * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
>         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
>         (aarch64-store-forwarding-threshold): New param.
>         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
>         * doc/invoke.texi: Document new option and new param.
>         * config/aarch64/aarch64-store-forwarding.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
>         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
>         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
>
> Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> ---
> Changes in v6:
>         - An obvious change. insn_cnt was incremented only on
>           stores and not for every insn in the bb. Now restored.
>
>  gcc/config.gcc                                |   1 +
>  gcc/config/aarch64/aarch64-passes.def         |   1 +
>  gcc/config/aarch64/aarch64-protos.h           |   1 +
>  .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
>  gcc/config/aarch64/aarch64.opt                |   9 +
>  gcc/config/aarch64/t-aarch64                  |  10 +
>  gcc/doc/invoke.texi                           |  11 +-
>  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
>  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
>  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
>  10 files changed, 449 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 6450448f2f0..7c48429eb82 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -350,6 +350,7 @@ aarch64*-*-*)
>         cxx_target_objs="aarch64-c.o"
>         d_target_objs="aarch64-d.o"
>         extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
>         target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
>         target_has_targetm_common=yes
>         ;;
> diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> index 662a13fd5e6..94ced0aebf6 100644
> --- a/gcc/config/aarch64/aarch64-passes.def
> +++ b/gcc/config/aarch64/aarch64-passes.def
> @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
>  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
>  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
>  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 60ff61f6d54..8f5f2ca4710 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
>  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
>  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
>  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
> +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
>
>  poly_uint64 aarch64_regmode_natural_size (machine_mode);
>
> diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> new file mode 100644
> index 00000000000..8a6faefd8c0
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> @@ -0,0 +1,318 @@
> +/* Avoid store forwarding optimization pass.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   Contributed by VRULL GmbH.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#define IN_TARGET_CODE 1
> +
> +#include "config.h"
> +#define INCLUDE_LIST
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "alias.h"
> +#include "rtlanal.h"
> +#include "tree-pass.h"
> +#include "cselib.h"
> +
> +/* This is an RTL pass that detects store forwarding from stores to larger
> +   loads (load pairs). For example, it can transform cases like
> +
> +   str  d5, [sp, #320]
> +   fmul d5, d31, d29
> +   ldp  d31, d17, [sp, #312] # Large load from small store
> +
> +   to
> +
> +   str  d5, [sp, #320]
> +   fmul d5, d31, d29
> +   ldr  d31, [sp, #312]
> +   ldr  d17, [sp, #320]
> +
> +   Design: The pass follows a straightforward design.  It starts by
> +   initializing the alias analysis and the cselib.  Both of these are used to
> +   find stores and larger loads with overlapping addresses, which are
> +   candidates for store forwarding optimizations.  It then scans on basic block
> +   level to find stores that forward to larger loads and handles them
> +   accordingly as described in the above example.  Finally, the alias analysis
> +   and the cselib library are closed.  */
> +
> +typedef struct
> +{
> +  rtx_insn *store_insn;
> +  rtx store_mem_addr;
> +  unsigned int insn_cnt;
> +} str_info;
> +
> +typedef std::list<str_info> list_store_info;
> +
> +/* Statistics counters.  */
> +static unsigned int stats_store_count = 0;
> +static unsigned int stats_ldp_count = 0;
> +static unsigned int stats_ssll_count = 0;
> +static unsigned int stats_transformed_count = 0;
> +
> +/* Default.  */
> +static rtx dummy;
> +static bool is_load (rtx expr, rtx &op_1=dummy);
> +
> +/* Return true if SET expression EXPR is a store; otherwise false.  */
> +
> +static bool
> +is_store (rtx expr)
> +{
> +  return MEM_P (SET_DEST (expr));
> +}
> +
> +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
> +   contain the MEM operand of the load.  */
> +
> +static bool
> +is_load (rtx expr, rtx &op_1)
> +{
> +  op_1 = SET_SRC (expr);
> +
> +  if (GET_CODE (op_1) == ZERO_EXTEND
> +      || GET_CODE (op_1) == SIGN_EXTEND)
> +    op_1 = XEXP (op_1, 0);
> +
> +  return MEM_P (op_1);
> +}
> +
> +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> +   STORE_MEM_ADDR.  */
> +
> +static bool
> +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> +{
> +  /* Sometimes we do not have the proper value.  */
> +  if (!CSELIB_VAL_PTR (store_mem_addr))
> +    return false;
> +
> +  gcc_checking_assert (MEM_P (load_mem));
> +
> +  return rtx_equal_for_cselib_1 (store_mem_addr,
> +                                get_addr (XEXP (load_mem, 0)),
> +                                store_mem_mode, 0);
> +}
> +
> +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> +   otherwise false.  STORE_EXPRS contains the stores.  */
> +
> +static bool
> +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
> +{
> +  unsigned int load_count = 0;
> +  bool forwarding = false;
> +  rtx expr = PATTERN (insn);
> +
> +  if (GET_CODE (expr) != PARALLEL
> +      || XVECLEN (expr, 0) != 2)
> +    return false;
> +
> +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> +    {
> +      rtx op_1;
> +      rtx out_exp = XVECEXP (expr, 0, i);
> +
> +      if (GET_CODE (out_exp) != SET)
> +       continue;
> +
> +      if (!is_load (out_exp, op_1))
> +       continue;
> +
> +      load_count++;
> +
> +      for (str_info str : store_exprs)
> +       {
> +         rtx store_insn = str.store_insn;
> +
> +         if (!is_forwarding (str.store_mem_addr, op_1,
> +                             GET_MODE (SET_DEST (PATTERN (store_insn)))))
> +           continue;
> +
> +         if (dump_file)
> +           {
> +             fprintf (dump_file,
> +                      "Store forwarding to PARALLEL with loads:\n");
> +             fprintf (dump_file, "  From: ");
> +             print_rtl_single (dump_file, store_insn);
> +             fprintf (dump_file, "  To: ");
> +             print_rtl_single (dump_file, insn);
> +           }
> +
> +         forwarding = true;
> +       }
> +    }
> +
> +  if (load_count == 2)
> +    stats_ldp_count++;
> +
> +  return load_count == 2 && forwarding;
> +}
> +
> +/* Break a load pair into its 2 distinct loads, except if the base source
> +   address to load from is overwriten in the first load.  INSN should be the
> +   PARALLEL of the load pair.  */
> +
> +static void
> +break_ldp (rtx_insn *insn)
> +{
> +  rtx expr = PATTERN (insn);
> +
> +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> +
> +  rtx load_0 = XVECEXP (expr, 0, 0);
> +  rtx load_1 = XVECEXP (expr, 0, 1);
> +
> +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> +
> +  /* The base address was overwriten in the first load.  */
> +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> +    return;
> +
> +  emit_insn_before (load_0, insn);
> +  emit_insn_before (load_1, insn);
> +  remove_insn (insn);
> +
> +  stats_transformed_count++;
> +}
> +
> +static void
> +scan_and_transform_bb_level ()
> +{
> +  rtx_insn *insn, *next;
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      list_store_info store_exprs;
> +      unsigned int insn_cnt = 0;
> +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> +       {
> +         next = NEXT_INSN (insn);

You probably want NONDEBUG here, otherwise insn_cnt will depend
on -g?

> +         /* If we cross a CALL_P insn, clear the list, because the
> +            small-store-to-large-load is unlikely to cause performance
> +            difference.  */
> +         if (CALL_P (insn))
> +           store_exprs.clear ();
> +
> +         if (!NONJUMP_INSN_P (insn))
> +           continue;
> +
> +         cselib_process_insn (insn);

is it necessary to process each insn with cselib?  Only loads & stores I guess?

> +         rtx expr = single_set (insn);
> +
> +         /* If a store is encountered, append it to the store_exprs list to
> +            check it later.  */
> +         if (expr && is_store (expr))
> +           {
> +             rtx store_mem = SET_DEST (expr);
> +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> +             machine_mode store_mem_mode = GET_MODE (store_mem);
> +             store_mem_addr = cselib_lookup (store_mem_addr,
> +                                             store_mem_mode, 1,
> +                                             store_mem_mode)->val_rtx;
> +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt });
> +             stats_store_count++;
> +           }
> +
> +         /* Check for small-store-to-large-load.  */
> +         if (is_small_store_to_large_load (store_exprs, insn))
> +           {
> +             stats_ssll_count++;
> +             break_ldp (insn);
> +           }
> +
> +         /* Pop the first store from the list if it's distance crosses the
> +            maximum accepted threshold.  The list contains unique values
> +            sorted in ascending order, meaning that only one distance can be
> +            off at a time.  */
> +         if (!store_exprs.empty ()
> +             && (insn_cnt - store_exprs.front ().insn_cnt
> +                > (unsigned int) aarch64_store_forwarding_threshold_param))
> +           store_exprs.pop_front ();
> +
> +         insn_cnt++;
> +       }
> +    }
> +}
> +
> +static void
> +execute_avoid_store_forwarding ()
> +{
> +  init_alias_analysis ();
> +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> +  scan_and_transform_bb_level ();
> +  end_alias_analysis ();
> +  cselib_finish ();
> +  statistics_counter_event (cfun, "Number of stores identified: ",
> +                           stats_store_count);
> +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> +                           stats_ldp_count);
> +  statistics_counter_event (cfun,
> +                           "Number of forwarding cases identified: ",
> +                           stats_ssll_count);
> +  statistics_counter_event (cfun, "Number of trasformed cases: ",
> +                           stats_transformed_count);
> +}
> +
> +const pass_data pass_data_avoid_store_forwarding =
> +{
> +  RTL_PASS, /* type.  */
> +  "avoid_store_forwarding", /* name.  */
> +  OPTGROUP_NONE, /* optinfo_flags.  */
> +  TV_NONE, /* tv_id.  */
> +  0, /* properties_required.  */
> +  0, /* properties_provided.  */
> +  0, /* properties_destroyed.  */
> +  0, /* todo_flags_start.  */
> +  0 /* todo_flags_finish.  */
> +};
> +
> +class pass_avoid_store_forwarding : public rtl_opt_pass
> +{
> +public:
> +  pass_avoid_store_forwarding (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      return aarch64_flag_avoid_store_forwarding;
> +    }
> +
> +  virtual unsigned int execute (function *)
> +    {
> +      execute_avoid_store_forwarding ();
> +      return 0;
> +    }
> +
> +}; // class pass_avoid_store_forwarding
> +
> +/* Create a new avoid store forwarding pass instance.  */
> +
> +rtl_opt_pass *
> +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> +{
> +  return new pass_avoid_store_forwarding (ctxt);
> +}
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index f5a518202a1..e4498d53b46 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -304,6 +304,10 @@ moutline-atomics
>  Target Var(aarch64_flag_outline_atomics) Init(2) Save
>  Generate local calls to out-of-line atomic operations.
>
> +mavoid-store-forwarding
> +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> +Avoid store forwarding to load pairs.
> +
>  -param=aarch64-sve-compare-costs=
>  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
>  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
>
>  EnumValue
>  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> +
> +-param=aarch64-store-forwarding-threshold=
> +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> +Maximum instruction distance allowed between a store and a load pair for this to be
> +considered a candidate to avoid when using -mavoid-store-forwarding.
> diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> index 0d96ae3d0b2..5676cdd9585 100644
> --- a/gcc/config/aarch64/t-aarch64
> +++ b/gcc/config/aarch64/t-aarch64
> @@ -194,6 +194,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
>         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
>
> +aarch64-store-forwarding.o: \
> +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> +    $(srcdir)/config/aarch64/aarch64-protos.h
> +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> +
>  comma=,
>  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
>  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 32f535e1ed4..9bf3a83286a 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
>  -moverride=@var{string}  -mverbose-cost-dump
>  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
>  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> --moutline-atomics }
> +-moutline-atomics -mavoid-store-forwarding}
>
>  @emph{Adapteva Epiphany Options}
>  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
> @@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
>  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
>  source pointer is aligned to at least double the alignment of the type.
>
> +@item aarch64-store-forwarding-threshold
> +Maximum allowed instruction distance between a store and a load pair for
> +this to be considered a candidate to avoid when using
> +@option{-mavoid-store-forwarding}.
> +
>  @item aarch64-loop-vect-issue-rate-niters
>  The tuning for some AArch64 CPUs tries to take both latencies and issue
>  rates into account when deciding whether a loop should be vectorized
> @@ -20857,6 +20862,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
>  the compiler from using floating-point and Advanced SIMD registers but will not
>  impose any restrictions on the assembler.
>
> +@item -mavoid-store-forwarding
> +@itemx -mno-avoid-store-forwarding
> +Avoid store forwarding to load pairs.
> +
>  @opindex mlittle-endian
>  @item -mlittle-endian
>  Generate little-endian code.  This is the default when GCC is configured for an
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> new file mode 100644
> index 00000000000..b77de6c64b6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Different address, same offset, no overlap  */
> +
> +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> +       TYPE r, y; \
> +       st_arr[0] = i; \
> +       ld_arr[0] = dummy; \
> +       r = st_arr_2[0]; \
> +       y = st_arr_2[1]; \
> +       return r + y; \
> +}
> +
> +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> new file mode 100644
> index 00000000000..f1b3a66abfd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Same address, different offset, no overlap  */
> +
> +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> +       TYPE r, y; \
> +       st_arr[0] = i; \
> +       ld_arr[0] = dummy; \
> +       r = st_arr[10]; \
> +       y = st_arr[11]; \
> +       return r + y; \
> +}
> +
> +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> new file mode 100644
> index 00000000000..8d5ce5cc87e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Same address, same offset, overlap  */
> +
> +#define LDP_SSLL_OVERLAP(TYPE) \
> +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> +       TYPE r, y; \
> +       st_arr[0] = i; \
> +       ld_arr[0] = dummy; \
> +       r = st_arr[0]; \
> +       y = st_arr[1]; \
> +       return r + y; \
> +}
> +
> +LDP_SSLL_OVERLAP(uint32_t)
> +LDP_SSLL_OVERLAP(uint64_t)
> +LDP_SSLL_OVERLAP(int32_t)
> +LDP_SSLL_OVERLAP(int64_t)
> +LDP_SSLL_OVERLAP(int)
> +LDP_SSLL_OVERLAP(long)
> +LDP_SSLL_OVERLAP(float)
> +LDP_SSLL_OVERLAP(double)
> +LDP_SSLL_OVERLAP(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> --
> 2.41.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-06 14:28 ` Richard Biener
@ 2023-12-06 15:10   ` Manos Anagnostakis
  2023-12-06 17:15     ` Manos Anagnostakis
  2023-12-06 18:44   ` Philipp Tomsich
  1 sibling, 1 reply; 11+ messages in thread
From: Manos Anagnostakis @ 2023-12-06 15:10 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, Philipp Tomsich, Richard Sandiford, Manolis Tsamis

[-- Attachment #1: Type: text/plain, Size: 27491 bytes --]

Hi Richard,

thanks for the useful comments.

On Wed, Dec 6, 2023 at 4:32 PM Richard Biener <richard.guenther@gmail.com>
wrote:

> On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> <manos.anagnostakis@vrull.eu> wrote:
> >
> > This is an RTL pass that detects store forwarding from stores to larger
> loads (load pairs).
> >
> > This optimization is SPEC2017-driven and was found to be beneficial for
> some benchmarks,
> > through testing on ampere1/ampere1a machines.
> >
> > For example, it can transform cases like
> >
> > str  d5, [sp, #320]
> > fmul d5, d31, d29
> > ldp  d31, d17, [sp, #312] # Large load from small store
> >
> > to
> >
> > str  d5, [sp, #320]
> > fmul d5, d31, d29
> > ldr  d31, [sp, #312]
> > ldr  d17, [sp, #320]
> >
> > Currently, the pass is disabled by default on all architectures and
> enabled by a target-specific option.
> >
> > If deemed beneficial enough for a default, it will be enabled on
> ampere1/ampere1a,
> > or other architectures as well, without needing to be turned on by this
> option.
>
> What is aarch64-specific about the pass?
>
The pass was designed to target load pairs, which are aarch64 specific,
thus it cannot handle generic loads.

>
> I see an increasingly large number of target specific passes pop up
> (probably
> for the excuse we can generalize them if necessary).  But GCC isn't LLVM
> and this feels like getting out of hand?
>
> The x86 backend also has its store-forwarding "pass" as part of mdreorg
> in ix86_split_stlf_stall_load.
>
> Richard.
>
> > Bootstrapped and regtested on aarch64-linux.
> >
> > gcc/ChangeLog:
> >
> >         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> >         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New
> pass.
> >         * config/aarch64/aarch64-protos.h
> (make_pass_avoid_store_forwarding): Declare.
> >         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New
> option.
> >         (aarch64-store-forwarding-threshold): New param.
> >         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
> >         * doc/invoke.texi: Document new option and new param.
> >         * config/aarch64/aarch64-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> >         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> >         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> > Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> > ---
> > Changes in v6:
> >         - An obvious change. insn_cnt was incremented only on
> >           stores and not for every insn in the bb. Now restored.
> >
> >  gcc/config.gcc                                |   1 +
> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> >  .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
> >  gcc/config/aarch64/aarch64.opt                |   9 +
> >  gcc/config/aarch64/t-aarch64                  |  10 +
> >  gcc/doc/invoke.texi                           |  11 +-
> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> >  10 files changed, 449 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >  create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >  create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >
> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > index 6450448f2f0..7c48429eb82 100644
> > --- a/gcc/config.gcc
> > +++ b/gcc/config.gcc
> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >         cxx_target_objs="aarch64-c.o"
> >         d_target_objs="aarch64-d.o"
> >         extra_objs="aarch64-builtins.o aarch-common.o
> aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o
> aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o
> aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o
> falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >         target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >         target_has_targetm_common=yes
> >         ;;
> > diff --git a/gcc/config/aarch64/aarch64-passes.def
> b/gcc/config/aarch64/aarch64-passes.def
> > index 662a13fd5e6..94ced0aebf6 100644
> > --- a/gcc/config/aarch64/aarch64-passes.def
> > +++ b/gcc/config/aarch64/aarch64-passes.def
> > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE
> (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> > index 60ff61f6d54..8f5f2ca4710 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance
> (gcc::context *);
> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >
> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >
> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc
> b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > new file mode 100644
> > index 00000000000..8a6faefd8c0
> > --- /dev/null
> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > @@ -0,0 +1,318 @@
> > +/* Avoid store forwarding optimization pass.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   Contributed by VRULL GmbH.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify it
> > +   under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful, but
> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   General Public License for more details.
> > +
> > +   You should have received a copy of the GNU General Public License
> > +   along with GCC; see the file COPYING3.  If not see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#define IN_TARGET_CODE 1
> > +
> > +#include "config.h"
> > +#define INCLUDE_LIST
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "alias.h"
> > +#include "rtlanal.h"
> > +#include "tree-pass.h"
> > +#include "cselib.h"
> > +
> > +/* This is an RTL pass that detects store forwarding from stores to
> larger
> > +   loads (load pairs). For example, it can transform cases like
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldp  d31, d17, [sp, #312] # Large load from small store
> > +
> > +   to
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldr  d31, [sp, #312]
> > +   ldr  d17, [sp, #320]
> > +
> > +   Design: The pass follows a straightforward design.  It starts by
> > +   initializing the alias analysis and the cselib.  Both of these are
> used to
> > +   find stores and larger loads with overlapping addresses, which are
> > +   candidates for store forwarding optimizations.  It then scans on
> basic block
> > +   level to find stores that forward to larger loads and handles them
> > +   accordingly as described in the above example.  Finally, the alias
> analysis
> > +   and the cselib library are closed.  */
> > +
> > +typedef struct
> > +{
> > +  rtx_insn *store_insn;
> > +  rtx store_mem_addr;
> > +  unsigned int insn_cnt;
> > +} str_info;
> > +
> > +typedef std::list<str_info> list_store_info;
> > +
> > +/* Statistics counters.  */
> > +static unsigned int stats_store_count = 0;
> > +static unsigned int stats_ldp_count = 0;
> > +static unsigned int stats_ssll_count = 0;
> > +static unsigned int stats_transformed_count = 0;
> > +
> > +/* Default.  */
> > +static rtx dummy;
> > +static bool is_load (rtx expr, rtx &op_1=dummy);
> > +
> > +/* Return true if SET expression EXPR is a store; otherwise false.  */
> > +
> > +static bool
> > +is_store (rtx expr)
> > +{
> > +  return MEM_P (SET_DEST (expr));
> > +}
> > +
> > +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1
> will
> > +   contain the MEM operand of the load.  */
> > +
> > +static bool
> > +is_load (rtx expr, rtx &op_1)
> > +{
> > +  op_1 = SET_SRC (expr);
> > +
> > +  if (GET_CODE (op_1) == ZERO_EXTEND
> > +      || GET_CODE (op_1) == SIGN_EXTEND)
> > +    op_1 = XEXP (op_1, 0);
> > +
> > +  return MEM_P (op_1);
> > +}
> > +
> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of
> LOAD_MEM;
> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx
> containing
> > +   STORE_MEM_ADDR.  */
> > +
> > +static bool
> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode
> store_mem_mode)
> > +{
> > +  /* Sometimes we do not have the proper value.  */
> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> > +    return false;
> > +
> > +  gcc_checking_assert (MEM_P (load_mem));
> > +
> > +  return rtx_equal_for_cselib_1 (store_mem_addr,
> > +                                get_addr (XEXP (load_mem, 0)),
> > +                                store_mem_mode, 0);
> > +}
> > +
> > +/* Return true if INSN is a load pair, preceded by a store forwarding
> to it;
> > +   otherwise false.  STORE_EXPRS contains the stores.  */
> > +
> > +static bool
> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn
> *insn)
> > +{
> > +  unsigned int load_count = 0;
> > +  bool forwarding = false;
> > +  rtx expr = PATTERN (insn);
> > +
> > +  if (GET_CODE (expr) != PARALLEL
> > +      || XVECLEN (expr, 0) != 2)
> > +    return false;
> > +
> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> > +    {
> > +      rtx op_1;
> > +      rtx out_exp = XVECEXP (expr, 0, i);
> > +
> > +      if (GET_CODE (out_exp) != SET)
> > +       continue;
> > +
> > +      if (!is_load (out_exp, op_1))
> > +       continue;
> > +
> > +      load_count++;
> > +
> > +      for (str_info str : store_exprs)
> > +       {
> > +         rtx store_insn = str.store_insn;
> > +
> > +         if (!is_forwarding (str.store_mem_addr, op_1,
> > +                             GET_MODE (SET_DEST (PATTERN
> (store_insn)))))
> > +           continue;
> > +
> > +         if (dump_file)
> > +           {
> > +             fprintf (dump_file,
> > +                      "Store forwarding to PARALLEL with loads:\n");
> > +             fprintf (dump_file, "  From: ");
> > +             print_rtl_single (dump_file, store_insn);
> > +             fprintf (dump_file, "  To: ");
> > +             print_rtl_single (dump_file, insn);
> > +           }
> > +
> > +         forwarding = true;
> > +       }
> > +    }
> > +
> > +  if (load_count == 2)
> > +    stats_ldp_count++;
> > +
> > +  return load_count == 2 && forwarding;
> > +}
> > +
> > +/* Break a load pair into its 2 distinct loads, except if the base
> source
> > +   address to load from is overwriten in the first load.  INSN should
> be the
> > +   PARALLEL of the load pair.  */
> > +
> > +static void
> > +break_ldp (rtx_insn *insn)
> > +{
> > +  rtx expr = PATTERN (insn);
> > +
> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0)
> == 2);
> > +
> > +  rtx load_0 = XVECEXP (expr, 0, 0);
> > +  rtx load_1 = XVECEXP (expr, 0, 1);
> > +
> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > +
> > +  /* The base address was overwriten in the first load.  */
> > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> > +    return;
> > +
> > +  emit_insn_before (load_0, insn);
> > +  emit_insn_before (load_1, insn);
> > +  remove_insn (insn);
> > +
> > +  stats_transformed_count++;
> > +}
> > +
> > +static void
> > +scan_and_transform_bb_level ()
> > +{
> > +  rtx_insn *insn, *next;
> > +  basic_block bb;
> > +  FOR_EACH_BB_FN (bb, cfun)
> > +    {
> > +      list_store_info store_exprs;
> > +      unsigned int insn_cnt = 0;
> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn =
> next)
> > +       {
> > +         next = NEXT_INSN (insn);
>
> You probably want NONDEBUG here, otherwise insn_cnt will depend
> on -g?
>
> > +         /* If we cross a CALL_P insn, clear the list, because the
> > +            small-store-to-large-load is unlikely to cause performance
> > +            difference.  */
> > +         if (CALL_P (insn))
> > +           store_exprs.clear ();
> > +
> > +         if (!NONJUMP_INSN_P (insn))
> > +           continue;
> > +
> > +         cselib_process_insn (insn);
>
> is it necessary to process each insn with cselib?  Only loads & stores I
> guess?
>
> > +         rtx expr = single_set (insn);
> > +
> > +         /* If a store is encountered, append it to the store_exprs
> list to
> > +            check it later.  */
> > +         if (expr && is_store (expr))
> > +           {
> > +             rtx store_mem = SET_DEST (expr);
> > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> > +             machine_mode store_mem_mode = GET_MODE (store_mem);
> > +             store_mem_addr = cselib_lookup (store_mem_addr,
> > +                                             store_mem_mode, 1,
> > +                                             store_mem_mode)->val_rtx;
> > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt });
> > +             stats_store_count++;
> > +           }
> > +
> > +         /* Check for small-store-to-large-load.  */
> > +         if (is_small_store_to_large_load (store_exprs, insn))
> > +           {
> > +             stats_ssll_count++;
> > +             break_ldp (insn);
> > +           }
> > +
> > +         /* Pop the first store from the list if it's distance crosses
> the
> > +            maximum accepted threshold.  The list contains unique values
> > +            sorted in ascending order, meaning that only one distance
> can be
> > +            off at a time.  */
> > +         if (!store_exprs.empty ()
> > +             && (insn_cnt - store_exprs.front ().insn_cnt
> > +                > (unsigned int)
> aarch64_store_forwarding_threshold_param))
> > +           store_exprs.pop_front ();
> > +
> > +         insn_cnt++;
> > +       }
> > +    }
> > +}
> > +
> > +static void
> > +execute_avoid_store_forwarding ()
> > +{
> > +  init_alias_analysis ();
> > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> > +  scan_and_transform_bb_level ();
> > +  end_alias_analysis ();
> > +  cselib_finish ();
> > +  statistics_counter_event (cfun, "Number of stores identified: ",
> > +                           stats_store_count);
> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> > +                           stats_ldp_count);
> > +  statistics_counter_event (cfun,
> > +                           "Number of forwarding cases identified: ",
> > +                           stats_ssll_count);
> > +  statistics_counter_event (cfun, "Number of trasformed cases: ",
> > +                           stats_transformed_count);
> > +}
> > +
> > +const pass_data pass_data_avoid_store_forwarding =
> > +{
> > +  RTL_PASS, /* type.  */
> > +  "avoid_store_forwarding", /* name.  */
> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > +  TV_NONE, /* tv_id.  */
> > +  0, /* properties_required.  */
> > +  0, /* properties_provided.  */
> > +  0, /* properties_destroyed.  */
> > +  0, /* todo_flags_start.  */
> > +  0 /* todo_flags_finish.  */
> > +};
> > +
> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return aarch64_flag_avoid_store_forwarding;
> > +    }
> > +
> > +  virtual unsigned int execute (function *)
> > +    {
> > +      execute_avoid_store_forwarding ();
> > +      return 0;
> > +    }
> > +
> > +}; // class pass_avoid_store_forwarding
> > +
> > +/* Create a new avoid store forwarding pass instance.  */
> > +
> > +rtl_opt_pass *
> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > +{
> > +  return new pass_avoid_store_forwarding (ctxt);
> > +}
> > diff --git a/gcc/config/aarch64/aarch64.opt
> b/gcc/config/aarch64/aarch64.opt
> > index f5a518202a1..e4498d53b46 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -304,6 +304,10 @@ moutline-atomics
> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >  Generate local calls to out-of-line atomic operations.
> >
> > +mavoid-store-forwarding
> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0)
> Optimization
> > +Avoid store forwarding to load pairs.
> > +
> >  -param=aarch64-sve-compare-costs=
> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1)
> IntegerRange(0, 1) Param
> >  When vectorizing for SVE, consider using unpacked vectors for smaller
> elements and use the cost model to pick the cheapest approach.  Also use
> the cost model to choose between SVE and Advanced SIMD vectorization.
> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never)
> Value(AARCH64_LDP_STP_POLICY_NEVER)
> >
> >  EnumValue
> >  Enum(aarch64_ldp_stp_policy) String(aligned)
> Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > +
> > +-param=aarch64-store-forwarding-threshold=
> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param)
> Init(20) Param
> > +Maximum instruction distance allowed between a store and a load pair
> for this to be
> > +considered a candidate to avoid when using -mavoid-store-forwarding.
> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > index 0d96ae3d0b2..5676cdd9585 100644
> > --- a/gcc/config/aarch64/t-aarch64
> > +++ b/gcc/config/aarch64/t-aarch64
> > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o:
> $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >
> > +aarch64-store-forwarding.o: \
> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h
> $(RTL_BASE_H) \
> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H)
> $(RECOG_H) \
> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > +    $(srcdir)/config/aarch64/aarch64-protos.h
> > +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > +
> >  comma=,
> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst
> $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 32f535e1ed4..9bf3a83286a 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
> >  -moverride=@var{string}  -mverbose-cost-dump
> >  -mstack-protector-guard=@var{guard}
> -mstack-protector-guard-reg=@var{sysreg}
> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > --moutline-atomics }
> > +-moutline-atomics -mavoid-store-forwarding}
> >
> >  @emph{Adapteva Epiphany Options}
> >  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
> > @@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never},
> do not emit stp.
> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> >  source pointer is aligned to at least double the alignment of the type.
> >
> > +@item aarch64-store-forwarding-threshold
> > +Maximum allowed instruction distance between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +@option{-mavoid-store-forwarding}.
> > +
> >  @item aarch64-loop-vect-issue-rate-niters
> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >  rates into account when deciding whether a loop should be vectorized
> > @@ -20857,6 +20862,10 @@ Generate code which uses only the
> general-purpose registers.  This will prevent
> >  the compiler from using floating-point and Advanced SIMD registers but
> will not
> >  impose any restrictions on the assembler.
> >
> > +@item -mavoid-store-forwarding
> > +@itemx -mno-avoid-store-forwarding
> > +Avoid store forwarding to load pairs.
> > +
> >  @opindex mlittle-endian
> >  @item -mlittle-endian
> >  Generate little-endian code.  This is the default when GCC is
> configured for an
> > diff --git
> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > new file mode 100644
> > index 00000000000..b77de6c64b6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Different address, same offset, no overlap  */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr,
> TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> > +       TYPE r, y; \
> > +       st_arr[0] = i; \
> > +       ld_arr[0] = dummy; \
> > +       r = st_arr_2[0]; \
> > +       y = st_arr_2[1]; \
> > +       return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git
> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > new file mode 100644
> > index 00000000000..f1b3a66abfd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, different offset, no overlap  */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE
> i, TYPE dummy){ \
> > +       TYPE r, y; \
> > +       st_arr[0] = i; \
> > +       ld_arr[0] = dummy; \
> > +       r = st_arr[10]; \
> > +       y = st_arr[11]; \
> > +       return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > new file mode 100644
> > index 00000000000..8d5ce5cc87e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, same offset, overlap  */
> > +
> > +#define LDP_SSLL_OVERLAP(TYPE) \
> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE
> dummy){ \
> > +       TYPE r, y; \
> > +       st_arr[0] = i; \
> > +       ld_arr[0] = dummy; \
> > +       r = st_arr[0]; \
> > +       y = st_arr[1]; \
> > +       return r + y; \
> > +}
> > +
> > +LDP_SSLL_OVERLAP(uint32_t)
> > +LDP_SSLL_OVERLAP(uint64_t)
> > +LDP_SSLL_OVERLAP(int32_t)
> > +LDP_SSLL_OVERLAP(int64_t)
> > +LDP_SSLL_OVERLAP(int)
> > +LDP_SSLL_OVERLAP(long)
> > +LDP_SSLL_OVERLAP(float)
> > +LDP_SSLL_OVERLAP(double)
> > +LDP_SSLL_OVERLAP(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > --
> > 2.41.0
>
I will address the other two points on a seperate version.

-- 
*Manos Anagnostakis | Compiler Engineer |*
E: manos.anagnostakis@vrull.eu <makeljana.shkurti@vrull.eu>

*VRULL GmbH *| Beatrixgasse 32 1030 Vienna |
 W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-06 15:10   ` Manos Anagnostakis
@ 2023-12-06 17:15     ` Manos Anagnostakis
  2023-12-07  7:35       ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Manos Anagnostakis @ 2023-12-06 17:15 UTC (permalink / raw)
  To: Richard Biener
  Cc: gcc-patches, Philipp Tomsich, Richard Sandiford, Manolis Tsamis

[-- Attachment #1: Type: text/plain, Size: 29370 bytes --]

Hi again,

I went and tested the requested changes and found out the following:

1. The pass is currently increasing insn_cnt on a NONJUMP_INSN_P, which is
a subset of NONDEBUG_INSN_P. I think there is no problem with depending on
-g with the current version. Do you see something I don't or did you mean
something else?
2. Not processing all instructions is not letting cselib record all the
effects they have, thus it does not have updated information to find true
forwardings at any given time. I can confirm this since I am witnessing
many unexpected changes on the number of handled cases if I do this only
for loads/stores.

Thanks in advance and please let me know your thoughts on the above.
Manos.

On Wed, Dec 6, 2023 at 5:10 PM Manos Anagnostakis <
manos.anagnostakis@vrull.eu> wrote:

> Hi Richard,
>
> thanks for the useful comments.
>
> On Wed, Dec 6, 2023 at 4:32 PM Richard Biener <richard.guenther@gmail.com>
> wrote:
>
>> On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
>> <manos.anagnostakis@vrull.eu> wrote:
>> >
>> > This is an RTL pass that detects store forwarding from stores to larger
>> loads (load pairs).
>> >
>> > This optimization is SPEC2017-driven and was found to be beneficial for
>> some benchmarks,
>> > through testing on ampere1/ampere1a machines.
>> >
>> > For example, it can transform cases like
>> >
>> > str  d5, [sp, #320]
>> > fmul d5, d31, d29
>> > ldp  d31, d17, [sp, #312] # Large load from small store
>> >
>> > to
>> >
>> > str  d5, [sp, #320]
>> > fmul d5, d31, d29
>> > ldr  d31, [sp, #312]
>> > ldr  d17, [sp, #320]
>> >
>> > Currently, the pass is disabled by default on all architectures and
>> enabled by a target-specific option.
>> >
>> > If deemed beneficial enough for a default, it will be enabled on
>> ampere1/ampere1a,
>> > or other architectures as well, without needing to be turned on by this
>> option.
>>
>> What is aarch64-specific about the pass?
>>
> The pass was designed to target load pairs, which are aarch64 specific,
> thus it cannot handle generic loads.
>
>>
>> I see an increasingly large number of target specific passes pop up
>> (probably
>> for the excuse we can generalize them if necessary).  But GCC isn't LLVM
>> and this feels like getting out of hand?
>>
>> The x86 backend also has its store-forwarding "pass" as part of mdreorg
>> in ix86_split_stlf_stall_load.
>>
>> Richard.
>>
>> > Bootstrapped and regtested on aarch64-linux.
>> >
>> > gcc/ChangeLog:
>> >
>> >         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
>> >         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New
>> pass.
>> >         * config/aarch64/aarch64-protos.h
>> (make_pass_avoid_store_forwarding): Declare.
>> >         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New
>> option.
>> >         (aarch64-store-forwarding-threshold): New param.
>> >         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
>> >         * doc/invoke.texi: Document new option and new param.
>> >         * config/aarch64/aarch64-store-forwarding.cc: New file.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
>> >         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
>> >         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
>> >
>> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
>> > Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
>> > Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
>> > ---
>> > Changes in v6:
>> >         - An obvious change. insn_cnt was incremented only on
>> >           stores and not for every insn in the bb. Now restored.
>> >
>> >  gcc/config.gcc                                |   1 +
>> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
>> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
>> >  .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
>> >  gcc/config/aarch64/aarch64.opt                |   9 +
>> >  gcc/config/aarch64/t-aarch64                  |  10 +
>> >  gcc/doc/invoke.texi                           |  11 +-
>> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
>> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
>> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
>> >  10 files changed, 449 insertions(+), 1 deletion(-)
>> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
>> >  create mode 100644
>> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>> >  create mode 100644
>> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>> >
>> > diff --git a/gcc/config.gcc b/gcc/config.gcc
>> > index 6450448f2f0..7c48429eb82 100644
>> > --- a/gcc/config.gcc
>> > +++ b/gcc/config.gcc
>> > @@ -350,6 +350,7 @@ aarch64*-*-*)
>> >         cxx_target_objs="aarch64-c.o"
>> >         d_target_objs="aarch64-d.o"
>> >         extra_objs="aarch64-builtins.o aarch-common.o
>> aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o
>> aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o
>> aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o
>> falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
>> > +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
>> >         target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc
>> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h
>> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
>> >         target_has_targetm_common=yes
>> >         ;;
>> > diff --git a/gcc/config/aarch64/aarch64-passes.def
>> b/gcc/config/aarch64/aarch64-passes.def
>> > index 662a13fd5e6..94ced0aebf6 100644
>> > --- a/gcc/config/aarch64/aarch64-passes.def
>> > +++ b/gcc/config/aarch64/aarch64-passes.def
>> > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE
>> (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
>> >  INSERT_PASS_AFTER (pass_machine_reorg, 1,
>> pass_tag_collision_avoidance);
>> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
>> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
>> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
>> > diff --git a/gcc/config/aarch64/aarch64-protos.h
>> b/gcc/config/aarch64/aarch64-protos.h
>> > index 60ff61f6d54..8f5f2ca4710 100644
>> > --- a/gcc/config/aarch64/aarch64-protos.h
>> > +++ b/gcc/config/aarch64/aarch64-protos.h
>> > @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance
>> (gcc::context *);
>> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
>> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
>> >  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
>> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
>> >
>> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
>> >
>> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc
>> b/gcc/config/aarch64/aarch64-store-forwarding.cc
>> > new file mode 100644
>> > index 00000000000..8a6faefd8c0
>> > --- /dev/null
>> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
>> > @@ -0,0 +1,318 @@
>> > +/* Avoid store forwarding optimization pass.
>> > +   Copyright (C) 2023 Free Software Foundation, Inc.
>> > +   Contributed by VRULL GmbH.
>> > +
>> > +   This file is part of GCC.
>> > +
>> > +   GCC is free software; you can redistribute it and/or modify it
>> > +   under the terms of the GNU General Public License as published by
>> > +   the Free Software Foundation; either version 3, or (at your option)
>> > +   any later version.
>> > +
>> > +   GCC is distributed in the hope that it will be useful, but
>> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
>> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> > +   General Public License for more details.
>> > +
>> > +   You should have received a copy of the GNU General Public License
>> > +   along with GCC; see the file COPYING3.  If not see
>> > +   <http://www.gnu.org/licenses/>.  */
>> > +
>> > +#define IN_TARGET_CODE 1
>> > +
>> > +#include "config.h"
>> > +#define INCLUDE_LIST
>> > +#include "system.h"
>> > +#include "coretypes.h"
>> > +#include "backend.h"
>> > +#include "rtl.h"
>> > +#include "alias.h"
>> > +#include "rtlanal.h"
>> > +#include "tree-pass.h"
>> > +#include "cselib.h"
>> > +
>> > +/* This is an RTL pass that detects store forwarding from stores to
>> larger
>> > +   loads (load pairs). For example, it can transform cases like
>> > +
>> > +   str  d5, [sp, #320]
>> > +   fmul d5, d31, d29
>> > +   ldp  d31, d17, [sp, #312] # Large load from small store
>> > +
>> > +   to
>> > +
>> > +   str  d5, [sp, #320]
>> > +   fmul d5, d31, d29
>> > +   ldr  d31, [sp, #312]
>> > +   ldr  d17, [sp, #320]
>> > +
>> > +   Design: The pass follows a straightforward design.  It starts by
>> > +   initializing the alias analysis and the cselib.  Both of these are
>> used to
>> > +   find stores and larger loads with overlapping addresses, which are
>> > +   candidates for store forwarding optimizations.  It then scans on
>> basic block
>> > +   level to find stores that forward to larger loads and handles them
>> > +   accordingly as described in the above example.  Finally, the alias
>> analysis
>> > +   and the cselib library are closed.  */
>> > +
>> > +typedef struct
>> > +{
>> > +  rtx_insn *store_insn;
>> > +  rtx store_mem_addr;
>> > +  unsigned int insn_cnt;
>> > +} str_info;
>> > +
>> > +typedef std::list<str_info> list_store_info;
>> > +
>> > +/* Statistics counters.  */
>> > +static unsigned int stats_store_count = 0;
>> > +static unsigned int stats_ldp_count = 0;
>> > +static unsigned int stats_ssll_count = 0;
>> > +static unsigned int stats_transformed_count = 0;
>> > +
>> > +/* Default.  */
>> > +static rtx dummy;
>> > +static bool is_load (rtx expr, rtx &op_1=dummy);
>> > +
>> > +/* Return true if SET expression EXPR is a store; otherwise false.  */
>> > +
>> > +static bool
>> > +is_store (rtx expr)
>> > +{
>> > +  return MEM_P (SET_DEST (expr));
>> > +}
>> > +
>> > +/* Return true if SET expression EXPR is a load; otherwise false.
>> OP_1 will
>> > +   contain the MEM operand of the load.  */
>> > +
>> > +static bool
>> > +is_load (rtx expr, rtx &op_1)
>> > +{
>> > +  op_1 = SET_SRC (expr);
>> > +
>> > +  if (GET_CODE (op_1) == ZERO_EXTEND
>> > +      || GET_CODE (op_1) == SIGN_EXTEND)
>> > +    op_1 = XEXP (op_1, 0);
>> > +
>> > +  return MEM_P (op_1);
>> > +}
>> > +
>> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of
>> LOAD_MEM;
>> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx
>> containing
>> > +   STORE_MEM_ADDR.  */
>> > +
>> > +static bool
>> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode
>> store_mem_mode)
>> > +{
>> > +  /* Sometimes we do not have the proper value.  */
>> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
>> > +    return false;
>> > +
>> > +  gcc_checking_assert (MEM_P (load_mem));
>> > +
>> > +  return rtx_equal_for_cselib_1 (store_mem_addr,
>> > +                                get_addr (XEXP (load_mem, 0)),
>> > +                                store_mem_mode, 0);
>> > +}
>> > +
>> > +/* Return true if INSN is a load pair, preceded by a store forwarding
>> to it;
>> > +   otherwise false.  STORE_EXPRS contains the stores.  */
>> > +
>> > +static bool
>> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn
>> *insn)
>> > +{
>> > +  unsigned int load_count = 0;
>> > +  bool forwarding = false;
>> > +  rtx expr = PATTERN (insn);
>> > +
>> > +  if (GET_CODE (expr) != PARALLEL
>> > +      || XVECLEN (expr, 0) != 2)
>> > +    return false;
>> > +
>> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
>> > +    {
>> > +      rtx op_1;
>> > +      rtx out_exp = XVECEXP (expr, 0, i);
>> > +
>> > +      if (GET_CODE (out_exp) != SET)
>> > +       continue;
>> > +
>> > +      if (!is_load (out_exp, op_1))
>> > +       continue;
>> > +
>> > +      load_count++;
>> > +
>> > +      for (str_info str : store_exprs)
>> > +       {
>> > +         rtx store_insn = str.store_insn;
>> > +
>> > +         if (!is_forwarding (str.store_mem_addr, op_1,
>> > +                             GET_MODE (SET_DEST (PATTERN
>> (store_insn)))))
>> > +           continue;
>> > +
>> > +         if (dump_file)
>> > +           {
>> > +             fprintf (dump_file,
>> > +                      "Store forwarding to PARALLEL with loads:\n");
>> > +             fprintf (dump_file, "  From: ");
>> > +             print_rtl_single (dump_file, store_insn);
>> > +             fprintf (dump_file, "  To: ");
>> > +             print_rtl_single (dump_file, insn);
>> > +           }
>> > +
>> > +         forwarding = true;
>> > +       }
>> > +    }
>> > +
>> > +  if (load_count == 2)
>> > +    stats_ldp_count++;
>> > +
>> > +  return load_count == 2 && forwarding;
>> > +}
>> > +
>> > +/* Break a load pair into its 2 distinct loads, except if the base
>> source
>> > +   address to load from is overwriten in the first load.  INSN should
>> be the
>> > +   PARALLEL of the load pair.  */
>> > +
>> > +static void
>> > +break_ldp (rtx_insn *insn)
>> > +{
>> > +  rtx expr = PATTERN (insn);
>> > +
>> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr,
>> 0) == 2);
>> > +
>> > +  rtx load_0 = XVECEXP (expr, 0, 0);
>> > +  rtx load_1 = XVECEXP (expr, 0, 1);
>> > +
>> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
>> > +
>> > +  /* The base address was overwriten in the first load.  */
>> > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
>> > +    return;
>> > +
>> > +  emit_insn_before (load_0, insn);
>> > +  emit_insn_before (load_1, insn);
>> > +  remove_insn (insn);
>> > +
>> > +  stats_transformed_count++;
>> > +}
>> > +
>> > +static void
>> > +scan_and_transform_bb_level ()
>> > +{
>> > +  rtx_insn *insn, *next;
>> > +  basic_block bb;
>> > +  FOR_EACH_BB_FN (bb, cfun)
>> > +    {
>> > +      list_store_info store_exprs;
>> > +      unsigned int insn_cnt = 0;
>> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn
>> = next)
>> > +       {
>> > +         next = NEXT_INSN (insn);
>>
>> You probably want NONDEBUG here, otherwise insn_cnt will depend
>> on -g?
>>
>> > +         /* If we cross a CALL_P insn, clear the list, because the
>> > +            small-store-to-large-load is unlikely to cause performance
>> > +            difference.  */
>> > +         if (CALL_P (insn))
>> > +           store_exprs.clear ();
>> > +
>> > +         if (!NONJUMP_INSN_P (insn))
>> > +           continue;
>> > +
>> > +         cselib_process_insn (insn);
>>
>> is it necessary to process each insn with cselib?  Only loads & stores I
>> guess?
>>
>> > +         rtx expr = single_set (insn);
>> > +
>> > +         /* If a store is encountered, append it to the store_exprs
>> list to
>> > +            check it later.  */
>> > +         if (expr && is_store (expr))
>> > +           {
>> > +             rtx store_mem = SET_DEST (expr);
>> > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
>> > +             machine_mode store_mem_mode = GET_MODE (store_mem);
>> > +             store_mem_addr = cselib_lookup (store_mem_addr,
>> > +                                             store_mem_mode, 1,
>> > +                                             store_mem_mode)->val_rtx;
>> > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt
>> });
>> > +             stats_store_count++;
>> > +           }
>> > +
>> > +         /* Check for small-store-to-large-load.  */
>> > +         if (is_small_store_to_large_load (store_exprs, insn))
>> > +           {
>> > +             stats_ssll_count++;
>> > +             break_ldp (insn);
>> > +           }
>> > +
>> > +         /* Pop the first store from the list if it's distance crosses
>> the
>> > +            maximum accepted threshold.  The list contains unique
>> values
>> > +            sorted in ascending order, meaning that only one distance
>> can be
>> > +            off at a time.  */
>> > +         if (!store_exprs.empty ()
>> > +             && (insn_cnt - store_exprs.front ().insn_cnt
>> > +                > (unsigned int)
>> aarch64_store_forwarding_threshold_param))
>> > +           store_exprs.pop_front ();
>> > +
>> > +         insn_cnt++;
>> > +       }
>> > +    }
>> > +}
>> > +
>> > +static void
>> > +execute_avoid_store_forwarding ()
>> > +{
>> > +  init_alias_analysis ();
>> > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
>> > +  scan_and_transform_bb_level ();
>> > +  end_alias_analysis ();
>> > +  cselib_finish ();
>> > +  statistics_counter_event (cfun, "Number of stores identified: ",
>> > +                           stats_store_count);
>> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
>> > +                           stats_ldp_count);
>> > +  statistics_counter_event (cfun,
>> > +                           "Number of forwarding cases identified: ",
>> > +                           stats_ssll_count);
>> > +  statistics_counter_event (cfun, "Number of trasformed cases: ",
>> > +                           stats_transformed_count);
>> > +}
>> > +
>> > +const pass_data pass_data_avoid_store_forwarding =
>> > +{
>> > +  RTL_PASS, /* type.  */
>> > +  "avoid_store_forwarding", /* name.  */
>> > +  OPTGROUP_NONE, /* optinfo_flags.  */
>> > +  TV_NONE, /* tv_id.  */
>> > +  0, /* properties_required.  */
>> > +  0, /* properties_provided.  */
>> > +  0, /* properties_destroyed.  */
>> > +  0, /* todo_flags_start.  */
>> > +  0 /* todo_flags_finish.  */
>> > +};
>> > +
>> > +class pass_avoid_store_forwarding : public rtl_opt_pass
>> > +{
>> > +public:
>> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
>> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
>> > +  {}
>> > +
>> > +  /* opt_pass methods: */
>> > +  virtual bool gate (function *)
>> > +    {
>> > +      return aarch64_flag_avoid_store_forwarding;
>> > +    }
>> > +
>> > +  virtual unsigned int execute (function *)
>> > +    {
>> > +      execute_avoid_store_forwarding ();
>> > +      return 0;
>> > +    }
>> > +
>> > +}; // class pass_avoid_store_forwarding
>> > +
>> > +/* Create a new avoid store forwarding pass instance.  */
>> > +
>> > +rtl_opt_pass *
>> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
>> > +{
>> > +  return new pass_avoid_store_forwarding (ctxt);
>> > +}
>> > diff --git a/gcc/config/aarch64/aarch64.opt
>> b/gcc/config/aarch64/aarch64.opt
>> > index f5a518202a1..e4498d53b46 100644
>> > --- a/gcc/config/aarch64/aarch64.opt
>> > +++ b/gcc/config/aarch64/aarch64.opt
>> > @@ -304,6 +304,10 @@ moutline-atomics
>> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
>> >  Generate local calls to out-of-line atomic operations.
>> >
>> > +mavoid-store-forwarding
>> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0)
>> Optimization
>> > +Avoid store forwarding to load pairs.
>> > +
>> >  -param=aarch64-sve-compare-costs=
>> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1)
>> IntegerRange(0, 1) Param
>> >  When vectorizing for SVE, consider using unpacked vectors for smaller
>> elements and use the cost model to pick the cheapest approach.  Also use
>> the cost model to choose between SVE and Advanced SIMD vectorization.
>> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never)
>> Value(AARCH64_LDP_STP_POLICY_NEVER)
>> >
>> >  EnumValue
>> >  Enum(aarch64_ldp_stp_policy) String(aligned)
>> Value(AARCH64_LDP_STP_POLICY_ALIGNED)
>> > +
>> > +-param=aarch64-store-forwarding-threshold=
>> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param)
>> Init(20) Param
>> > +Maximum instruction distance allowed between a store and a load pair
>> for this to be
>> > +considered a candidate to avoid when using -mavoid-store-forwarding.
>> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
>> > index 0d96ae3d0b2..5676cdd9585 100644
>> > --- a/gcc/config/aarch64/t-aarch64
>> > +++ b/gcc/config/aarch64/t-aarch64
>> > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o:
>> $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
>> >         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES)
>> \
>> >                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
>> >
>> > +aarch64-store-forwarding.o: \
>> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
>> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h
>> $(RTL_BASE_H) \
>> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H)
>> $(RECOG_H) \
>> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
>> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
>> > +    $(srcdir)/config/aarch64/aarch64-protos.h
>> > +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES)
>> \
>> > +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
>> > +
>> >  comma=,
>> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%,
>> $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
>> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
>> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> > index 32f535e1ed4..9bf3a83286a 100644
>> > --- a/gcc/doc/invoke.texi
>> > +++ b/gcc/doc/invoke.texi
>> > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
>> >  -moverride=@var{string}  -mverbose-cost-dump
>> >  -mstack-protector-guard=@var{guard}
>> -mstack-protector-guard-reg=@var{sysreg}
>> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
>> > --moutline-atomics }
>> > +-moutline-atomics -mavoid-store-forwarding}
>> >
>> >  @emph{Adapteva Epiphany Options}
>> >  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
>> > @@ -16774,6 +16774,11 @@ With
>> @option{--param=aarch64-stp-policy=never}, do not emit stp.
>> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
>> >  source pointer is aligned to at least double the alignment of the type.
>> >
>> > +@item aarch64-store-forwarding-threshold
>> > +Maximum allowed instruction distance between a store and a load pair
>> for
>> > +this to be considered a candidate to avoid when using
>> > +@option{-mavoid-store-forwarding}.
>> > +
>> >  @item aarch64-loop-vect-issue-rate-niters
>> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
>> >  rates into account when deciding whether a loop should be vectorized
>> > @@ -20857,6 +20862,10 @@ Generate code which uses only the
>> general-purpose registers.  This will prevent
>> >  the compiler from using floating-point and Advanced SIMD registers but
>> will not
>> >  impose any restrictions on the assembler.
>> >
>> > +@item -mavoid-store-forwarding
>> > +@itemx -mno-avoid-store-forwarding
>> > +Avoid store forwarding to load pairs.
>> > +
>> >  @opindex mlittle-endian
>> >  @item -mlittle-endian
>> >  Generate little-endian code.  This is the default when GCC is
>> configured for an
>> > diff --git
>> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>> > new file mode 100644
>> > index 00000000000..b77de6c64b6
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>> > @@ -0,0 +1,33 @@
>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
>> > +
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +/* Different address, same offset, no overlap  */
>> > +
>> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
>> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr,
>> TYPE *st_arr_2, TYPE i, TYPE dummy){ \
>> > +       TYPE r, y; \
>> > +       st_arr[0] = i; \
>> > +       ld_arr[0] = dummy; \
>> > +       r = st_arr_2[0]; \
>> > +       y = st_arr_2[1]; \
>> > +       return r + y; \
>> > +}
>> > +
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } }
>> */
>> > diff --git
>> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>> > new file mode 100644
>> > index 00000000000..f1b3a66abfd
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>> > @@ -0,0 +1,33 @@
>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
>> > +
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +/* Same address, different offset, no overlap  */
>> > +
>> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
>> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr,
>> TYPE i, TYPE dummy){ \
>> > +       TYPE r, y; \
>> > +       st_arr[0] = i; \
>> > +       ld_arr[0] = dummy; \
>> > +       r = st_arr[10]; \
>> > +       y = st_arr[11]; \
>> > +       return r + y; \
>> > +}
>> > +
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
>> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } }
>> */
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>> > new file mode 100644
>> > index 00000000000..8d5ce5cc87e
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>> > @@ -0,0 +1,33 @@
>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
>> > +
>> > +#include <stdint.h>
>> > +
>> > +typedef int v4si __attribute__ ((vector_size (16)));
>> > +
>> > +/* Same address, same offset, overlap  */
>> > +
>> > +#define LDP_SSLL_OVERLAP(TYPE) \
>> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE
>> dummy){ \
>> > +       TYPE r, y; \
>> > +       st_arr[0] = i; \
>> > +       ld_arr[0] = dummy; \
>> > +       r = st_arr[0]; \
>> > +       y = st_arr[1]; \
>> > +       return r + y; \
>> > +}
>> > +
>> > +LDP_SSLL_OVERLAP(uint32_t)
>> > +LDP_SSLL_OVERLAP(uint64_t)
>> > +LDP_SSLL_OVERLAP(int32_t)
>> > +LDP_SSLL_OVERLAP(int64_t)
>> > +LDP_SSLL_OVERLAP(int)
>> > +LDP_SSLL_OVERLAP(long)
>> > +LDP_SSLL_OVERLAP(float)
>> > +LDP_SSLL_OVERLAP(double)
>> > +LDP_SSLL_OVERLAP(v4si)
>> > +
>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } }
>> */
>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } }
>> */
>> > --
>> > 2.41.0
>>
> I will address the other two points on a seperate version.
>
> --
> *Manos Anagnostakis | Compiler Engineer |*
> E: manos.anagnostakis@vrull.eu <makeljana.shkurti@vrull.eu>
>
> *VRULL GmbH *| Beatrixgasse 32 1030 Vienna |
>  W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>
>


-- 
*Manos Anagnostakis | Compiler Engineer |*
E: manos.anagnostakis@vrull.eu <makeljana.shkurti@vrull.eu>

*VRULL GmbH *| Beatrixgasse 32 1030 Vienna |
 W: www.vrull.eu | LinkedIn <https://www.linkedin.com/company/vrull/>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-06 14:28 ` Richard Biener
  2023-12-06 15:10   ` Manos Anagnostakis
@ 2023-12-06 18:44   ` Philipp Tomsich
  2023-12-07  7:31     ` Richard Biener
  1 sibling, 1 reply; 11+ messages in thread
From: Philipp Tomsich @ 2023-12-06 18:44 UTC (permalink / raw)
  To: Richard Biener
  Cc: Manos Anagnostakis, gcc-patches, Richard Sandiford, Manolis Tsamis

On Wed, 6 Dec 2023 at 23:32, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> <manos.anagnostakis@vrull.eu> wrote:
> >
> > This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
> >
> > This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
> > through testing on ampere1/ampere1a machines.
> >
> > For example, it can transform cases like
> >
> > str  d5, [sp, #320]
> > fmul d5, d31, d29
> > ldp  d31, d17, [sp, #312] # Large load from small store
> >
> > to
> >
> > str  d5, [sp, #320]
> > fmul d5, d31, d29
> > ldr  d31, [sp, #312]
> > ldr  d17, [sp, #320]
> >
> > Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.
> >
> > If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
> > or other architectures as well, without needing to be turned on by this option.
>
> What is aarch64-specific about the pass?
>
> I see an increasingly large number of target specific passes pop up (probably
> for the excuse we can generalize them if necessary).  But GCC isn't LLVM
> and this feels like getting out of hand?

We had an OK from Richard Sandiford on the earlier (v5) version with
v6 just fixing an obvious bug... so I was about to merge this earlier
just when you commented.

Given that this had months of test exposure on our end, I would prefer
to move this forward for GCC14 in its current form.
The project of replacing architecture-specific store-forwarding passes
with a generalized infrastructure could then be addressed in the GCC15
timeframe (or beyond)?

--Philipp.

>
> The x86 backend also has its store-forwarding "pass" as part of mdreorg
> in ix86_split_stlf_stall_load.
>
> Richard.
>
> > Bootstrapped and regtested on aarch64-linux.
> >
> > gcc/ChangeLog:
> >
> >         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> >         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> >         * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
> >         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> >         (aarch64-store-forwarding-threshold): New param.
> >         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
> >         * doc/invoke.texi: Document new option and new param.
> >         * config/aarch64/aarch64-store-forwarding.cc: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> >         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> >         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >
> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> > Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> > ---
> > Changes in v6:
> >         - An obvious change. insn_cnt was incremented only on
> >           stores and not for every insn in the bb. Now restored.
> >
> >  gcc/config.gcc                                |   1 +
> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> >  .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
> >  gcc/config/aarch64/aarch64.opt                |   9 +
> >  gcc/config/aarch64/t-aarch64                  |  10 +
> >  gcc/doc/invoke.texi                           |  11 +-
> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> >  10 files changed, 449 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >
> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > index 6450448f2f0..7c48429eb82 100644
> > --- a/gcc/config.gcc
> > +++ b/gcc/config.gcc
> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >         cxx_target_objs="aarch64-c.o"
> >         d_target_objs="aarch64-d.o"
> >         extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >         target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >         target_has_targetm_common=yes
> >         ;;
> > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> > index 662a13fd5e6..94ced0aebf6 100644
> > --- a/gcc/config/aarch64/aarch64-passes.def
> > +++ b/gcc/config/aarch64/aarch64-passes.def
> > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 60ff61f6d54..8f5f2ca4710 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> >  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >
> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >
> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > new file mode 100644
> > index 00000000000..8a6faefd8c0
> > --- /dev/null
> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > @@ -0,0 +1,318 @@
> > +/* Avoid store forwarding optimization pass.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   Contributed by VRULL GmbH.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify it
> > +   under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful, but
> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   General Public License for more details.
> > +
> > +   You should have received a copy of the GNU General Public License
> > +   along with GCC; see the file COPYING3.  If not see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#define IN_TARGET_CODE 1
> > +
> > +#include "config.h"
> > +#define INCLUDE_LIST
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +#include "alias.h"
> > +#include "rtlanal.h"
> > +#include "tree-pass.h"
> > +#include "cselib.h"
> > +
> > +/* This is an RTL pass that detects store forwarding from stores to larger
> > +   loads (load pairs). For example, it can transform cases like
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldp  d31, d17, [sp, #312] # Large load from small store
> > +
> > +   to
> > +
> > +   str  d5, [sp, #320]
> > +   fmul d5, d31, d29
> > +   ldr  d31, [sp, #312]
> > +   ldr  d17, [sp, #320]
> > +
> > +   Design: The pass follows a straightforward design.  It starts by
> > +   initializing the alias analysis and the cselib.  Both of these are used to
> > +   find stores and larger loads with overlapping addresses, which are
> > +   candidates for store forwarding optimizations.  It then scans on basic block
> > +   level to find stores that forward to larger loads and handles them
> > +   accordingly as described in the above example.  Finally, the alias analysis
> > +   and the cselib library are closed.  */
> > +
> > +typedef struct
> > +{
> > +  rtx_insn *store_insn;
> > +  rtx store_mem_addr;
> > +  unsigned int insn_cnt;
> > +} str_info;
> > +
> > +typedef std::list<str_info> list_store_info;
> > +
> > +/* Statistics counters.  */
> > +static unsigned int stats_store_count = 0;
> > +static unsigned int stats_ldp_count = 0;
> > +static unsigned int stats_ssll_count = 0;
> > +static unsigned int stats_transformed_count = 0;
> > +
> > +/* Default.  */
> > +static rtx dummy;
> > +static bool is_load (rtx expr, rtx &op_1=dummy);
> > +
> > +/* Return true if SET expression EXPR is a store; otherwise false.  */
> > +
> > +static bool
> > +is_store (rtx expr)
> > +{
> > +  return MEM_P (SET_DEST (expr));
> > +}
> > +
> > +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
> > +   contain the MEM operand of the load.  */
> > +
> > +static bool
> > +is_load (rtx expr, rtx &op_1)
> > +{
> > +  op_1 = SET_SRC (expr);
> > +
> > +  if (GET_CODE (op_1) == ZERO_EXTEND
> > +      || GET_CODE (op_1) == SIGN_EXTEND)
> > +    op_1 = XEXP (op_1, 0);
> > +
> > +  return MEM_P (op_1);
> > +}
> > +
> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> > +   STORE_MEM_ADDR.  */
> > +
> > +static bool
> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> > +{
> > +  /* Sometimes we do not have the proper value.  */
> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> > +    return false;
> > +
> > +  gcc_checking_assert (MEM_P (load_mem));
> > +
> > +  return rtx_equal_for_cselib_1 (store_mem_addr,
> > +                                get_addr (XEXP (load_mem, 0)),
> > +                                store_mem_mode, 0);
> > +}
> > +
> > +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> > +   otherwise false.  STORE_EXPRS contains the stores.  */
> > +
> > +static bool
> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
> > +{
> > +  unsigned int load_count = 0;
> > +  bool forwarding = false;
> > +  rtx expr = PATTERN (insn);
> > +
> > +  if (GET_CODE (expr) != PARALLEL
> > +      || XVECLEN (expr, 0) != 2)
> > +    return false;
> > +
> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> > +    {
> > +      rtx op_1;
> > +      rtx out_exp = XVECEXP (expr, 0, i);
> > +
> > +      if (GET_CODE (out_exp) != SET)
> > +       continue;
> > +
> > +      if (!is_load (out_exp, op_1))
> > +       continue;
> > +
> > +      load_count++;
> > +
> > +      for (str_info str : store_exprs)
> > +       {
> > +         rtx store_insn = str.store_insn;
> > +
> > +         if (!is_forwarding (str.store_mem_addr, op_1,
> > +                             GET_MODE (SET_DEST (PATTERN (store_insn)))))
> > +           continue;
> > +
> > +         if (dump_file)
> > +           {
> > +             fprintf (dump_file,
> > +                      "Store forwarding to PARALLEL with loads:\n");
> > +             fprintf (dump_file, "  From: ");
> > +             print_rtl_single (dump_file, store_insn);
> > +             fprintf (dump_file, "  To: ");
> > +             print_rtl_single (dump_file, insn);
> > +           }
> > +
> > +         forwarding = true;
> > +       }
> > +    }
> > +
> > +  if (load_count == 2)
> > +    stats_ldp_count++;
> > +
> > +  return load_count == 2 && forwarding;
> > +}
> > +
> > +/* Break a load pair into its 2 distinct loads, except if the base source
> > +   address to load from is overwriten in the first load.  INSN should be the
> > +   PARALLEL of the load pair.  */
> > +
> > +static void
> > +break_ldp (rtx_insn *insn)
> > +{
> > +  rtx expr = PATTERN (insn);
> > +
> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> > +
> > +  rtx load_0 = XVECEXP (expr, 0, 0);
> > +  rtx load_1 = XVECEXP (expr, 0, 1);
> > +
> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > +
> > +  /* The base address was overwriten in the first load.  */
> > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> > +    return;
> > +
> > +  emit_insn_before (load_0, insn);
> > +  emit_insn_before (load_1, insn);
> > +  remove_insn (insn);
> > +
> > +  stats_transformed_count++;
> > +}
> > +
> > +static void
> > +scan_and_transform_bb_level ()
> > +{
> > +  rtx_insn *insn, *next;
> > +  basic_block bb;
> > +  FOR_EACH_BB_FN (bb, cfun)
> > +    {
> > +      list_store_info store_exprs;
> > +      unsigned int insn_cnt = 0;
> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> > +       {
> > +         next = NEXT_INSN (insn);
>
> You probably want NONDEBUG here, otherwise insn_cnt will depend
> on -g?
>
> > +         /* If we cross a CALL_P insn, clear the list, because the
> > +            small-store-to-large-load is unlikely to cause performance
> > +            difference.  */
> > +         if (CALL_P (insn))
> > +           store_exprs.clear ();
> > +
> > +         if (!NONJUMP_INSN_P (insn))
> > +           continue;
> > +
> > +         cselib_process_insn (insn);
>
> is it necessary to process each insn with cselib?  Only loads & stores I guess?
>
> > +         rtx expr = single_set (insn);
> > +
> > +         /* If a store is encountered, append it to the store_exprs list to
> > +            check it later.  */
> > +         if (expr && is_store (expr))
> > +           {
> > +             rtx store_mem = SET_DEST (expr);
> > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> > +             machine_mode store_mem_mode = GET_MODE (store_mem);
> > +             store_mem_addr = cselib_lookup (store_mem_addr,
> > +                                             store_mem_mode, 1,
> > +                                             store_mem_mode)->val_rtx;
> > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt });
> > +             stats_store_count++;
> > +           }
> > +
> > +         /* Check for small-store-to-large-load.  */
> > +         if (is_small_store_to_large_load (store_exprs, insn))
> > +           {
> > +             stats_ssll_count++;
> > +             break_ldp (insn);
> > +           }
> > +
> > +         /* Pop the first store from the list if it's distance crosses the
> > +            maximum accepted threshold.  The list contains unique values
> > +            sorted in ascending order, meaning that only one distance can be
> > +            off at a time.  */
> > +         if (!store_exprs.empty ()
> > +             && (insn_cnt - store_exprs.front ().insn_cnt
> > +                > (unsigned int) aarch64_store_forwarding_threshold_param))
> > +           store_exprs.pop_front ();
> > +
> > +         insn_cnt++;
> > +       }
> > +    }
> > +}
> > +
> > +static void
> > +execute_avoid_store_forwarding ()
> > +{
> > +  init_alias_analysis ();
> > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> > +  scan_and_transform_bb_level ();
> > +  end_alias_analysis ();
> > +  cselib_finish ();
> > +  statistics_counter_event (cfun, "Number of stores identified: ",
> > +                           stats_store_count);
> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> > +                           stats_ldp_count);
> > +  statistics_counter_event (cfun,
> > +                           "Number of forwarding cases identified: ",
> > +                           stats_ssll_count);
> > +  statistics_counter_event (cfun, "Number of trasformed cases: ",
> > +                           stats_transformed_count);
> > +}
> > +
> > +const pass_data pass_data_avoid_store_forwarding =
> > +{
> > +  RTL_PASS, /* type.  */
> > +  "avoid_store_forwarding", /* name.  */
> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > +  TV_NONE, /* tv_id.  */
> > +  0, /* properties_required.  */
> > +  0, /* properties_provided.  */
> > +  0, /* properties_destroyed.  */
> > +  0, /* todo_flags_start.  */
> > +  0 /* todo_flags_finish.  */
> > +};
> > +
> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return aarch64_flag_avoid_store_forwarding;
> > +    }
> > +
> > +  virtual unsigned int execute (function *)
> > +    {
> > +      execute_avoid_store_forwarding ();
> > +      return 0;
> > +    }
> > +
> > +}; // class pass_avoid_store_forwarding
> > +
> > +/* Create a new avoid store forwarding pass instance.  */
> > +
> > +rtl_opt_pass *
> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > +{
> > +  return new pass_avoid_store_forwarding (ctxt);
> > +}
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index f5a518202a1..e4498d53b46 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -304,6 +304,10 @@ moutline-atomics
> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >  Generate local calls to out-of-line atomic operations.
> >
> > +mavoid-store-forwarding
> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> > +Avoid store forwarding to load pairs.
> > +
> >  -param=aarch64-sve-compare-costs=
> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> >  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
> >
> >  EnumValue
> >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > +
> > +-param=aarch64-store-forwarding-threshold=
> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> > +Maximum instruction distance allowed between a store and a load pair for this to be
> > +considered a candidate to avoid when using -mavoid-store-forwarding.
> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > index 0d96ae3d0b2..5676cdd9585 100644
> > --- a/gcc/config/aarch64/t-aarch64
> > +++ b/gcc/config/aarch64/t-aarch64
> > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >
> > +aarch64-store-forwarding.o: \
> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > +    $(srcdir)/config/aarch64/aarch64-protos.h
> > +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > +
> >  comma=,
> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 32f535e1ed4..9bf3a83286a 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
> >  -moverride=@var{string}  -mverbose-cost-dump
> >  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > --moutline-atomics }
> > +-moutline-atomics -mavoid-store-forwarding}
> >
> >  @emph{Adapteva Epiphany Options}
> >  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
> > @@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> >  source pointer is aligned to at least double the alignment of the type.
> >
> > +@item aarch64-store-forwarding-threshold
> > +Maximum allowed instruction distance between a store and a load pair for
> > +this to be considered a candidate to avoid when using
> > +@option{-mavoid-store-forwarding}.
> > +
> >  @item aarch64-loop-vect-issue-rate-niters
> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >  rates into account when deciding whether a loop should be vectorized
> > @@ -20857,6 +20862,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
> >  the compiler from using floating-point and Advanced SIMD registers but will not
> >  impose any restrictions on the assembler.
> >
> > +@item -mavoid-store-forwarding
> > +@itemx -mno-avoid-store-forwarding
> > +Avoid store forwarding to load pairs.
> > +
> >  @opindex mlittle-endian
> >  @item -mlittle-endian
> >  Generate little-endian code.  This is the default when GCC is configured for an
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > new file mode 100644
> > index 00000000000..b77de6c64b6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Different address, same offset, no overlap  */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> > +       TYPE r, y; \
> > +       st_arr[0] = i; \
> > +       ld_arr[0] = dummy; \
> > +       r = st_arr_2[0]; \
> > +       y = st_arr_2[1]; \
> > +       return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > new file mode 100644
> > index 00000000000..f1b3a66abfd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, different offset, no overlap  */
> > +
> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +       TYPE r, y; \
> > +       st_arr[0] = i; \
> > +       ld_arr[0] = dummy; \
> > +       r = st_arr[10]; \
> > +       y = st_arr[11]; \
> > +       return r + y; \
> > +}
> > +
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > new file mode 100644
> > index 00000000000..8d5ce5cc87e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > +
> > +#include <stdint.h>
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +/* Same address, same offset, overlap  */
> > +
> > +#define LDP_SSLL_OVERLAP(TYPE) \
> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > +       TYPE r, y; \
> > +       st_arr[0] = i; \
> > +       ld_arr[0] = dummy; \
> > +       r = st_arr[0]; \
> > +       y = st_arr[1]; \
> > +       return r + y; \
> > +}
> > +
> > +LDP_SSLL_OVERLAP(uint32_t)
> > +LDP_SSLL_OVERLAP(uint64_t)
> > +LDP_SSLL_OVERLAP(int32_t)
> > +LDP_SSLL_OVERLAP(int64_t)
> > +LDP_SSLL_OVERLAP(int)
> > +LDP_SSLL_OVERLAP(long)
> > +LDP_SSLL_OVERLAP(float)
> > +LDP_SSLL_OVERLAP(double)
> > +LDP_SSLL_OVERLAP(v4si)
> > +
> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > --
> > 2.41.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-06 18:44   ` Philipp Tomsich
@ 2023-12-07  7:31     ` Richard Biener
  2023-12-07 12:20       ` Richard Sandiford
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2023-12-07  7:31 UTC (permalink / raw)
  To: Philipp Tomsich
  Cc: Manos Anagnostakis, gcc-patches, Richard Sandiford, Manolis Tsamis

On Wed, Dec 6, 2023 at 7:44 PM Philipp Tomsich <philipp.tomsich@vrull.eu> wrote:
>
> On Wed, 6 Dec 2023 at 23:32, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> > <manos.anagnostakis@vrull.eu> wrote:
> > >
> > > This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
> > >
> > > This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
> > > through testing on ampere1/ampere1a machines.
> > >
> > > For example, it can transform cases like
> > >
> > > str  d5, [sp, #320]
> > > fmul d5, d31, d29
> > > ldp  d31, d17, [sp, #312] # Large load from small store
> > >
> > > to
> > >
> > > str  d5, [sp, #320]
> > > fmul d5, d31, d29
> > > ldr  d31, [sp, #312]
> > > ldr  d17, [sp, #320]
> > >
> > > Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.
> > >
> > > If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
> > > or other architectures as well, without needing to be turned on by this option.
> >
> > What is aarch64-specific about the pass?
> >
> > I see an increasingly large number of target specific passes pop up (probably
> > for the excuse we can generalize them if necessary).  But GCC isn't LLVM
> > and this feels like getting out of hand?
>
> We had an OK from Richard Sandiford on the earlier (v5) version with
> v6 just fixing an obvious bug... so I was about to merge this earlier
> just when you commented.
>
> Given that this had months of test exposure on our end, I would prefer
> to move this forward for GCC14 in its current form.
> The project of replacing architecture-specific store-forwarding passes
> with a generalized infrastructure could then be addressed in the GCC15
> timeframe (or beyond)?

It's up to target maintainers, I just picked this pass (randomly) to make this
comment (of course also knowing that STLF fails are a common issue on
pipelined uarchs).

Richard.

>
> --Philipp.
>
> >
> > The x86 backend also has its store-forwarding "pass" as part of mdreorg
> > in ix86_split_stlf_stall_load.
> >
> > Richard.
> >
> > > Bootstrapped and regtested on aarch64-linux.
> > >
> > > gcc/ChangeLog:
> > >
> > >         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> > >         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> > >         * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
> > >         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> > >         (aarch64-store-forwarding-threshold): New param.
> > >         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
> > >         * doc/invoke.texi: Document new option and new param.
> > >         * config/aarch64/aarch64-store-forwarding.cc: New file.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> > >         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> > >         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> > >
> > > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> > > Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> > > Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> > > ---
> > > Changes in v6:
> > >         - An obvious change. insn_cnt was incremented only on
> > >           stores and not for every insn in the bb. Now restored.
> > >
> > >  gcc/config.gcc                                |   1 +
> > >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> > >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> > >  .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
> > >  gcc/config/aarch64/aarch64.opt                |   9 +
> > >  gcc/config/aarch64/t-aarch64                  |  10 +
> > >  gcc/doc/invoke.texi                           |  11 +-
> > >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> > >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> > >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> > >  10 files changed, 449 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > >
> > > diff --git a/gcc/config.gcc b/gcc/config.gcc
> > > index 6450448f2f0..7c48429eb82 100644
> > > --- a/gcc/config.gcc
> > > +++ b/gcc/config.gcc
> > > @@ -350,6 +350,7 @@ aarch64*-*-*)
> > >         cxx_target_objs="aarch64-c.o"
> > >         d_target_objs="aarch64-d.o"
> > >         extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> > > +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
> > >         target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> > >         target_has_targetm_common=yes
> > >         ;;
> > > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> > > index 662a13fd5e6..94ced0aebf6 100644
> > > --- a/gcc/config/aarch64/aarch64-passes.def
> > > +++ b/gcc/config/aarch64/aarch64-passes.def
> > > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
> > >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> > >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> > >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> > > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > > index 60ff61f6d54..8f5f2ca4710 100644
> > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> > >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> > >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> > >  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
> > > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> > >
> > >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> > >
> > > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > > new file mode 100644
> > > index 00000000000..8a6faefd8c0
> > > --- /dev/null
> > > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> > > @@ -0,0 +1,318 @@
> > > +/* Avoid store forwarding optimization pass.
> > > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > > +   Contributed by VRULL GmbH.
> > > +
> > > +   This file is part of GCC.
> > > +
> > > +   GCC is free software; you can redistribute it and/or modify it
> > > +   under the terms of the GNU General Public License as published by
> > > +   the Free Software Foundation; either version 3, or (at your option)
> > > +   any later version.
> > > +
> > > +   GCC is distributed in the hope that it will be useful, but
> > > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > +   General Public License for more details.
> > > +
> > > +   You should have received a copy of the GNU General Public License
> > > +   along with GCC; see the file COPYING3.  If not see
> > > +   <http://www.gnu.org/licenses/>.  */
> > > +
> > > +#define IN_TARGET_CODE 1
> > > +
> > > +#include "config.h"
> > > +#define INCLUDE_LIST
> > > +#include "system.h"
> > > +#include "coretypes.h"
> > > +#include "backend.h"
> > > +#include "rtl.h"
> > > +#include "alias.h"
> > > +#include "rtlanal.h"
> > > +#include "tree-pass.h"
> > > +#include "cselib.h"
> > > +
> > > +/* This is an RTL pass that detects store forwarding from stores to larger
> > > +   loads (load pairs). For example, it can transform cases like
> > > +
> > > +   str  d5, [sp, #320]
> > > +   fmul d5, d31, d29
> > > +   ldp  d31, d17, [sp, #312] # Large load from small store
> > > +
> > > +   to
> > > +
> > > +   str  d5, [sp, #320]
> > > +   fmul d5, d31, d29
> > > +   ldr  d31, [sp, #312]
> > > +   ldr  d17, [sp, #320]
> > > +
> > > +   Design: The pass follows a straightforward design.  It starts by
> > > +   initializing the alias analysis and the cselib.  Both of these are used to
> > > +   find stores and larger loads with overlapping addresses, which are
> > > +   candidates for store forwarding optimizations.  It then scans on basic block
> > > +   level to find stores that forward to larger loads and handles them
> > > +   accordingly as described in the above example.  Finally, the alias analysis
> > > +   and the cselib library are closed.  */
> > > +
> > > +typedef struct
> > > +{
> > > +  rtx_insn *store_insn;
> > > +  rtx store_mem_addr;
> > > +  unsigned int insn_cnt;
> > > +} str_info;
> > > +
> > > +typedef std::list<str_info> list_store_info;
> > > +
> > > +/* Statistics counters.  */
> > > +static unsigned int stats_store_count = 0;
> > > +static unsigned int stats_ldp_count = 0;
> > > +static unsigned int stats_ssll_count = 0;
> > > +static unsigned int stats_transformed_count = 0;
> > > +
> > > +/* Default.  */
> > > +static rtx dummy;
> > > +static bool is_load (rtx expr, rtx &op_1=dummy);
> > > +
> > > +/* Return true if SET expression EXPR is a store; otherwise false.  */
> > > +
> > > +static bool
> > > +is_store (rtx expr)
> > > +{
> > > +  return MEM_P (SET_DEST (expr));
> > > +}
> > > +
> > > +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
> > > +   contain the MEM operand of the load.  */
> > > +
> > > +static bool
> > > +is_load (rtx expr, rtx &op_1)
> > > +{
> > > +  op_1 = SET_SRC (expr);
> > > +
> > > +  if (GET_CODE (op_1) == ZERO_EXTEND
> > > +      || GET_CODE (op_1) == SIGN_EXTEND)
> > > +    op_1 = XEXP (op_1, 0);
> > > +
> > > +  return MEM_P (op_1);
> > > +}
> > > +
> > > +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> > > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> > > +   STORE_MEM_ADDR.  */
> > > +
> > > +static bool
> > > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> > > +{
> > > +  /* Sometimes we do not have the proper value.  */
> > > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> > > +    return false;
> > > +
> > > +  gcc_checking_assert (MEM_P (load_mem));
> > > +
> > > +  return rtx_equal_for_cselib_1 (store_mem_addr,
> > > +                                get_addr (XEXP (load_mem, 0)),
> > > +                                store_mem_mode, 0);
> > > +}
> > > +
> > > +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> > > +   otherwise false.  STORE_EXPRS contains the stores.  */
> > > +
> > > +static bool
> > > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
> > > +{
> > > +  unsigned int load_count = 0;
> > > +  bool forwarding = false;
> > > +  rtx expr = PATTERN (insn);
> > > +
> > > +  if (GET_CODE (expr) != PARALLEL
> > > +      || XVECLEN (expr, 0) != 2)
> > > +    return false;
> > > +
> > > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> > > +    {
> > > +      rtx op_1;
> > > +      rtx out_exp = XVECEXP (expr, 0, i);
> > > +
> > > +      if (GET_CODE (out_exp) != SET)
> > > +       continue;
> > > +
> > > +      if (!is_load (out_exp, op_1))
> > > +       continue;
> > > +
> > > +      load_count++;
> > > +
> > > +      for (str_info str : store_exprs)
> > > +       {
> > > +         rtx store_insn = str.store_insn;
> > > +
> > > +         if (!is_forwarding (str.store_mem_addr, op_1,
> > > +                             GET_MODE (SET_DEST (PATTERN (store_insn)))))
> > > +           continue;
> > > +
> > > +         if (dump_file)
> > > +           {
> > > +             fprintf (dump_file,
> > > +                      "Store forwarding to PARALLEL with loads:\n");
> > > +             fprintf (dump_file, "  From: ");
> > > +             print_rtl_single (dump_file, store_insn);
> > > +             fprintf (dump_file, "  To: ");
> > > +             print_rtl_single (dump_file, insn);
> > > +           }
> > > +
> > > +         forwarding = true;
> > > +       }
> > > +    }
> > > +
> > > +  if (load_count == 2)
> > > +    stats_ldp_count++;
> > > +
> > > +  return load_count == 2 && forwarding;
> > > +}
> > > +
> > > +/* Break a load pair into its 2 distinct loads, except if the base source
> > > +   address to load from is overwriten in the first load.  INSN should be the
> > > +   PARALLEL of the load pair.  */
> > > +
> > > +static void
> > > +break_ldp (rtx_insn *insn)
> > > +{
> > > +  rtx expr = PATTERN (insn);
> > > +
> > > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> > > +
> > > +  rtx load_0 = XVECEXP (expr, 0, 0);
> > > +  rtx load_1 = XVECEXP (expr, 0, 1);
> > > +
> > > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> > > +
> > > +  /* The base address was overwriten in the first load.  */
> > > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> > > +    return;
> > > +
> > > +  emit_insn_before (load_0, insn);
> > > +  emit_insn_before (load_1, insn);
> > > +  remove_insn (insn);
> > > +
> > > +  stats_transformed_count++;
> > > +}
> > > +
> > > +static void
> > > +scan_and_transform_bb_level ()
> > > +{
> > > +  rtx_insn *insn, *next;
> > > +  basic_block bb;
> > > +  FOR_EACH_BB_FN (bb, cfun)
> > > +    {
> > > +      list_store_info store_exprs;
> > > +      unsigned int insn_cnt = 0;
> > > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> > > +       {
> > > +         next = NEXT_INSN (insn);
> >
> > You probably want NONDEBUG here, otherwise insn_cnt will depend
> > on -g?
> >
> > > +         /* If we cross a CALL_P insn, clear the list, because the
> > > +            small-store-to-large-load is unlikely to cause performance
> > > +            difference.  */
> > > +         if (CALL_P (insn))
> > > +           store_exprs.clear ();
> > > +
> > > +         if (!NONJUMP_INSN_P (insn))
> > > +           continue;
> > > +
> > > +         cselib_process_insn (insn);
> >
> > is it necessary to process each insn with cselib?  Only loads & stores I guess?
> >
> > > +         rtx expr = single_set (insn);
> > > +
> > > +         /* If a store is encountered, append it to the store_exprs list to
> > > +            check it later.  */
> > > +         if (expr && is_store (expr))
> > > +           {
> > > +             rtx store_mem = SET_DEST (expr);
> > > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> > > +             machine_mode store_mem_mode = GET_MODE (store_mem);
> > > +             store_mem_addr = cselib_lookup (store_mem_addr,
> > > +                                             store_mem_mode, 1,
> > > +                                             store_mem_mode)->val_rtx;
> > > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt });
> > > +             stats_store_count++;
> > > +           }
> > > +
> > > +         /* Check for small-store-to-large-load.  */
> > > +         if (is_small_store_to_large_load (store_exprs, insn))
> > > +           {
> > > +             stats_ssll_count++;
> > > +             break_ldp (insn);
> > > +           }
> > > +
> > > +         /* Pop the first store from the list if it's distance crosses the
> > > +            maximum accepted threshold.  The list contains unique values
> > > +            sorted in ascending order, meaning that only one distance can be
> > > +            off at a time.  */
> > > +         if (!store_exprs.empty ()
> > > +             && (insn_cnt - store_exprs.front ().insn_cnt
> > > +                > (unsigned int) aarch64_store_forwarding_threshold_param))
> > > +           store_exprs.pop_front ();
> > > +
> > > +         insn_cnt++;
> > > +       }
> > > +    }
> > > +}
> > > +
> > > +static void
> > > +execute_avoid_store_forwarding ()
> > > +{
> > > +  init_alias_analysis ();
> > > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> > > +  scan_and_transform_bb_level ();
> > > +  end_alias_analysis ();
> > > +  cselib_finish ();
> > > +  statistics_counter_event (cfun, "Number of stores identified: ",
> > > +                           stats_store_count);
> > > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> > > +                           stats_ldp_count);
> > > +  statistics_counter_event (cfun,
> > > +                           "Number of forwarding cases identified: ",
> > > +                           stats_ssll_count);
> > > +  statistics_counter_event (cfun, "Number of trasformed cases: ",
> > > +                           stats_transformed_count);
> > > +}
> > > +
> > > +const pass_data pass_data_avoid_store_forwarding =
> > > +{
> > > +  RTL_PASS, /* type.  */
> > > +  "avoid_store_forwarding", /* name.  */
> > > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > > +  TV_NONE, /* tv_id.  */
> > > +  0, /* properties_required.  */
> > > +  0, /* properties_provided.  */
> > > +  0, /* properties_destroyed.  */
> > > +  0, /* todo_flags_start.  */
> > > +  0 /* todo_flags_finish.  */
> > > +};
> > > +
> > > +class pass_avoid_store_forwarding : public rtl_opt_pass
> > > +{
> > > +public:
> > > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> > > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > > +  {}
> > > +
> > > +  /* opt_pass methods: */
> > > +  virtual bool gate (function *)
> > > +    {
> > > +      return aarch64_flag_avoid_store_forwarding;
> > > +    }
> > > +
> > > +  virtual unsigned int execute (function *)
> > > +    {
> > > +      execute_avoid_store_forwarding ();
> > > +      return 0;
> > > +    }
> > > +
> > > +}; // class pass_avoid_store_forwarding
> > > +
> > > +/* Create a new avoid store forwarding pass instance.  */
> > > +
> > > +rtl_opt_pass *
> > > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> > > +{
> > > +  return new pass_avoid_store_forwarding (ctxt);
> > > +}
> > > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > > index f5a518202a1..e4498d53b46 100644
> > > --- a/gcc/config/aarch64/aarch64.opt
> > > +++ b/gcc/config/aarch64/aarch64.opt
> > > @@ -304,6 +304,10 @@ moutline-atomics
> > >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> > >  Generate local calls to out-of-line atomic operations.
> > >
> > > +mavoid-store-forwarding
> > > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> > > +Avoid store forwarding to load pairs.
> > > +
> > >  -param=aarch64-sve-compare-costs=
> > >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> > >  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> > > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
> > >
> > >  EnumValue
> > >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> > > +
> > > +-param=aarch64-store-forwarding-threshold=
> > > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> > > +Maximum instruction distance allowed between a store and a load pair for this to be
> > > +considered a candidate to avoid when using -mavoid-store-forwarding.
> > > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> > > index 0d96ae3d0b2..5676cdd9585 100644
> > > --- a/gcc/config/aarch64/t-aarch64
> > > +++ b/gcc/config/aarch64/t-aarch64
> > > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> > >         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > >                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> > >
> > > +aarch64-store-forwarding.o: \
> > > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> > > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> > > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> > > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> > > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> > > +    $(srcdir)/config/aarch64/aarch64-protos.h
> > > +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> > > +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> > > +
> > >  comma=,
> > >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> > >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 32f535e1ed4..9bf3a83286a 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
> > >  -moverride=@var{string}  -mverbose-cost-dump
> > >  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> > >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> > > --moutline-atomics }
> > > +-moutline-atomics -mavoid-store-forwarding}
> > >
> > >  @emph{Adapteva Epiphany Options}
> > >  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
> > > @@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
> > >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> > >  source pointer is aligned to at least double the alignment of the type.
> > >
> > > +@item aarch64-store-forwarding-threshold
> > > +Maximum allowed instruction distance between a store and a load pair for
> > > +this to be considered a candidate to avoid when using
> > > +@option{-mavoid-store-forwarding}.
> > > +
> > >  @item aarch64-loop-vect-issue-rate-niters
> > >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> > >  rates into account when deciding whether a loop should be vectorized
> > > @@ -20857,6 +20862,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
> > >  the compiler from using floating-point and Advanced SIMD registers but will not
> > >  impose any restrictions on the assembler.
> > >
> > > +@item -mavoid-store-forwarding
> > > +@itemx -mno-avoid-store-forwarding
> > > +Avoid store forwarding to load pairs.
> > > +
> > >  @opindex mlittle-endian
> > >  @item -mlittle-endian
> > >  Generate little-endian code.  This is the default when GCC is configured for an
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > > new file mode 100644
> > > index 00000000000..b77de6c64b6
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +/* Different address, same offset, no overlap  */
> > > +
> > > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> > > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> > > +       TYPE r, y; \
> > > +       st_arr[0] = i; \
> > > +       ld_arr[0] = dummy; \
> > > +       r = st_arr_2[0]; \
> > > +       y = st_arr_2[1]; \
> > > +       return r + y; \
> > > +}
> > > +
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> > > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > > new file mode 100644
> > > index 00000000000..f1b3a66abfd
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +/* Same address, different offset, no overlap  */
> > > +
> > > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> > > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > > +       TYPE r, y; \
> > > +       st_arr[0] = i; \
> > > +       ld_arr[0] = dummy; \
> > > +       r = st_arr[10]; \
> > > +       y = st_arr[11]; \
> > > +       return r + y; \
> > > +}
> > > +
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> > > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > > new file mode 100644
> > > index 00000000000..8d5ce5cc87e
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> > > @@ -0,0 +1,33 @@
> > > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> > > +
> > > +#include <stdint.h>
> > > +
> > > +typedef int v4si __attribute__ ((vector_size (16)));
> > > +
> > > +/* Same address, same offset, overlap  */
> > > +
> > > +#define LDP_SSLL_OVERLAP(TYPE) \
> > > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> > > +       TYPE r, y; \
> > > +       st_arr[0] = i; \
> > > +       ld_arr[0] = dummy; \
> > > +       r = st_arr[0]; \
> > > +       y = st_arr[1]; \
> > > +       return r + y; \
> > > +}
> > > +
> > > +LDP_SSLL_OVERLAP(uint32_t)
> > > +LDP_SSLL_OVERLAP(uint64_t)
> > > +LDP_SSLL_OVERLAP(int32_t)
> > > +LDP_SSLL_OVERLAP(int64_t)
> > > +LDP_SSLL_OVERLAP(int)
> > > +LDP_SSLL_OVERLAP(long)
> > > +LDP_SSLL_OVERLAP(float)
> > > +LDP_SSLL_OVERLAP(double)
> > > +LDP_SSLL_OVERLAP(v4si)
> > > +
> > > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> > > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> > > --
> > > 2.41.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-06 17:15     ` Manos Anagnostakis
@ 2023-12-07  7:35       ` Richard Biener
  2023-12-07 10:27         ` Manos Anagnostakis
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2023-12-07  7:35 UTC (permalink / raw)
  To: Manos Anagnostakis
  Cc: gcc-patches, Philipp Tomsich, Richard Sandiford, Manolis Tsamis

On Wed, Dec 6, 2023 at 6:15 PM Manos Anagnostakis
<manos.anagnostakis@vrull.eu> wrote:
>
> Hi again,
>
> I went and tested the requested changes and found out the following:
>
> 1. The pass is currently increasing insn_cnt on a NONJUMP_INSN_P, which is a subset of NONDEBUG_INSN_P. I think there is no problem with depending on -g with the current version. Do you see something I don't or did you mean something else?

It just occurred to me - thanks for double-checking (it wasn't obvious
to me NONJUMP_INSN_P doesn't include DEBUG_INSNs ...).

> 2. Not processing all instructions is not letting cselib record all the effects they have, thus it does not have updated information to find true forwardings at any given time. I can confirm this since I am witnessing many unexpected changes on the number of handled cases if I do this only for loads/stores.

Ah, OK.  I guess I don't fully get it, it seems you use cselib to
compare addresses and while possibly not
processing part of the address computation might break this other
stmts inbetween should be uninteresting
(at least I don't see you handling intermediate may-aliasing [small]
stores to disable the splitting).

So in the end it's a compile-time trade-off between relying solely on
cselib or trying to optimize
cselib use with DF for address computes?

Richard.

> Thanks in advance and please let me know your thoughts on the above.
> Manos.
>
> On Wed, Dec 6, 2023 at 5:10 PM Manos Anagnostakis <manos.anagnostakis@vrull.eu> wrote:
>>
>> Hi Richard,
>>
>> thanks for the useful comments.
>>
>> On Wed, Dec 6, 2023 at 4:32 PM Richard Biener <richard.guenther@gmail.com> wrote:
>>>
>>> On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
>>> <manos.anagnostakis@vrull.eu> wrote:
>>> >
>>> > This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
>>> >
>>> > This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
>>> > through testing on ampere1/ampere1a machines.
>>> >
>>> > For example, it can transform cases like
>>> >
>>> > str  d5, [sp, #320]
>>> > fmul d5, d31, d29
>>> > ldp  d31, d17, [sp, #312] # Large load from small store
>>> >
>>> > to
>>> >
>>> > str  d5, [sp, #320]
>>> > fmul d5, d31, d29
>>> > ldr  d31, [sp, #312]
>>> > ldr  d17, [sp, #320]
>>> >
>>> > Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.
>>> >
>>> > If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
>>> > or other architectures as well, without needing to be turned on by this option.
>>>
>>> What is aarch64-specific about the pass?
>>
>> The pass was designed to target load pairs, which are aarch64 specific, thus it cannot handle generic loads.
>>>
>>>
>>> I see an increasingly large number of target specific passes pop up (probably
>>> for the excuse we can generalize them if necessary).  But GCC isn't LLVM
>>> and this feels like getting out of hand?
>>>
>>> The x86 backend also has its store-forwarding "pass" as part of mdreorg
>>> in ix86_split_stlf_stall_load.
>>>
>>> Richard.
>>>
>>> > Bootstrapped and regtested on aarch64-linux.
>>> >
>>> > gcc/ChangeLog:
>>> >
>>> >         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
>>> >         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
>>> >         * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
>>> >         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
>>> >         (aarch64-store-forwarding-threshold): New param.
>>> >         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
>>> >         * doc/invoke.texi: Document new option and new param.
>>> >         * config/aarch64/aarch64-store-forwarding.cc: New file.
>>> >
>>> > gcc/testsuite/ChangeLog:
>>> >
>>> >         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
>>> >         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
>>> >         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
>>> >
>>> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
>>> > Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
>>> > Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
>>> > ---
>>> > Changes in v6:
>>> >         - An obvious change. insn_cnt was incremented only on
>>> >           stores and not for every insn in the bb. Now restored.
>>> >
>>> >  gcc/config.gcc                                |   1 +
>>> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
>>> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
>>> >  .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
>>> >  gcc/config/aarch64/aarch64.opt                |   9 +
>>> >  gcc/config/aarch64/t-aarch64                  |  10 +
>>> >  gcc/doc/invoke.texi                           |  11 +-
>>> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
>>> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
>>> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
>>> >  10 files changed, 449 insertions(+), 1 deletion(-)
>>> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
>>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>>> >
>>> > diff --git a/gcc/config.gcc b/gcc/config.gcc
>>> > index 6450448f2f0..7c48429eb82 100644
>>> > --- a/gcc/config.gcc
>>> > +++ b/gcc/config.gcc
>>> > @@ -350,6 +350,7 @@ aarch64*-*-*)
>>> >         cxx_target_objs="aarch64-c.o"
>>> >         d_target_objs="aarch64-d.o"
>>> >         extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
>>> > +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
>>> >         target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
>>> >         target_has_targetm_common=yes
>>> >         ;;
>>> > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
>>> > index 662a13fd5e6..94ced0aebf6 100644
>>> > --- a/gcc/config/aarch64/aarch64-passes.def
>>> > +++ b/gcc/config/aarch64/aarch64-passes.def
>>> > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
>>> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
>>> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
>>> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
>>> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
>>> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>>> > index 60ff61f6d54..8f5f2ca4710 100644
>>> > --- a/gcc/config/aarch64/aarch64-protos.h
>>> > +++ b/gcc/config/aarch64/aarch64-protos.h
>>> > @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
>>> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
>>> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
>>> >  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
>>> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
>>> >
>>> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
>>> >
>>> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
>>> > new file mode 100644
>>> > index 00000000000..8a6faefd8c0
>>> > --- /dev/null
>>> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
>>> > @@ -0,0 +1,318 @@
>>> > +/* Avoid store forwarding optimization pass.
>>> > +   Copyright (C) 2023 Free Software Foundation, Inc.
>>> > +   Contributed by VRULL GmbH.
>>> > +
>>> > +   This file is part of GCC.
>>> > +
>>> > +   GCC is free software; you can redistribute it and/or modify it
>>> > +   under the terms of the GNU General Public License as published by
>>> > +   the Free Software Foundation; either version 3, or (at your option)
>>> > +   any later version.
>>> > +
>>> > +   GCC is distributed in the hope that it will be useful, but
>>> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
>>> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> > +   General Public License for more details.
>>> > +
>>> > +   You should have received a copy of the GNU General Public License
>>> > +   along with GCC; see the file COPYING3.  If not see
>>> > +   <http://www.gnu.org/licenses/>.  */
>>> > +
>>> > +#define IN_TARGET_CODE 1
>>> > +
>>> > +#include "config.h"
>>> > +#define INCLUDE_LIST
>>> > +#include "system.h"
>>> > +#include "coretypes.h"
>>> > +#include "backend.h"
>>> > +#include "rtl.h"
>>> > +#include "alias.h"
>>> > +#include "rtlanal.h"
>>> > +#include "tree-pass.h"
>>> > +#include "cselib.h"
>>> > +
>>> > +/* This is an RTL pass that detects store forwarding from stores to larger
>>> > +   loads (load pairs). For example, it can transform cases like
>>> > +
>>> > +   str  d5, [sp, #320]
>>> > +   fmul d5, d31, d29
>>> > +   ldp  d31, d17, [sp, #312] # Large load from small store
>>> > +
>>> > +   to
>>> > +
>>> > +   str  d5, [sp, #320]
>>> > +   fmul d5, d31, d29
>>> > +   ldr  d31, [sp, #312]
>>> > +   ldr  d17, [sp, #320]
>>> > +
>>> > +   Design: The pass follows a straightforward design.  It starts by
>>> > +   initializing the alias analysis and the cselib.  Both of these are used to
>>> > +   find stores and larger loads with overlapping addresses, which are
>>> > +   candidates for store forwarding optimizations.  It then scans on basic block
>>> > +   level to find stores that forward to larger loads and handles them
>>> > +   accordingly as described in the above example.  Finally, the alias analysis
>>> > +   and the cselib library are closed.  */
>>> > +
>>> > +typedef struct
>>> > +{
>>> > +  rtx_insn *store_insn;
>>> > +  rtx store_mem_addr;
>>> > +  unsigned int insn_cnt;
>>> > +} str_info;
>>> > +
>>> > +typedef std::list<str_info> list_store_info;
>>> > +
>>> > +/* Statistics counters.  */
>>> > +static unsigned int stats_store_count = 0;
>>> > +static unsigned int stats_ldp_count = 0;
>>> > +static unsigned int stats_ssll_count = 0;
>>> > +static unsigned int stats_transformed_count = 0;
>>> > +
>>> > +/* Default.  */
>>> > +static rtx dummy;
>>> > +static bool is_load (rtx expr, rtx &op_1=dummy);
>>> > +
>>> > +/* Return true if SET expression EXPR is a store; otherwise false.  */
>>> > +
>>> > +static bool
>>> > +is_store (rtx expr)
>>> > +{
>>> > +  return MEM_P (SET_DEST (expr));
>>> > +}
>>> > +
>>> > +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
>>> > +   contain the MEM operand of the load.  */
>>> > +
>>> > +static bool
>>> > +is_load (rtx expr, rtx &op_1)
>>> > +{
>>> > +  op_1 = SET_SRC (expr);
>>> > +
>>> > +  if (GET_CODE (op_1) == ZERO_EXTEND
>>> > +      || GET_CODE (op_1) == SIGN_EXTEND)
>>> > +    op_1 = XEXP (op_1, 0);
>>> > +
>>> > +  return MEM_P (op_1);
>>> > +}
>>> > +
>>> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
>>> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
>>> > +   STORE_MEM_ADDR.  */
>>> > +
>>> > +static bool
>>> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
>>> > +{
>>> > +  /* Sometimes we do not have the proper value.  */
>>> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
>>> > +    return false;
>>> > +
>>> > +  gcc_checking_assert (MEM_P (load_mem));
>>> > +
>>> > +  return rtx_equal_for_cselib_1 (store_mem_addr,
>>> > +                                get_addr (XEXP (load_mem, 0)),
>>> > +                                store_mem_mode, 0);
>>> > +}
>>> > +
>>> > +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
>>> > +   otherwise false.  STORE_EXPRS contains the stores.  */
>>> > +
>>> > +static bool
>>> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
>>> > +{
>>> > +  unsigned int load_count = 0;
>>> > +  bool forwarding = false;
>>> > +  rtx expr = PATTERN (insn);
>>> > +
>>> > +  if (GET_CODE (expr) != PARALLEL
>>> > +      || XVECLEN (expr, 0) != 2)
>>> > +    return false;
>>> > +
>>> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
>>> > +    {
>>> > +      rtx op_1;
>>> > +      rtx out_exp = XVECEXP (expr, 0, i);
>>> > +
>>> > +      if (GET_CODE (out_exp) != SET)
>>> > +       continue;
>>> > +
>>> > +      if (!is_load (out_exp, op_1))
>>> > +       continue;
>>> > +
>>> > +      load_count++;
>>> > +
>>> > +      for (str_info str : store_exprs)
>>> > +       {
>>> > +         rtx store_insn = str.store_insn;
>>> > +
>>> > +         if (!is_forwarding (str.store_mem_addr, op_1,
>>> > +                             GET_MODE (SET_DEST (PATTERN (store_insn)))))
>>> > +           continue;
>>> > +
>>> > +         if (dump_file)
>>> > +           {
>>> > +             fprintf (dump_file,
>>> > +                      "Store forwarding to PARALLEL with loads:\n");
>>> > +             fprintf (dump_file, "  From: ");
>>> > +             print_rtl_single (dump_file, store_insn);
>>> > +             fprintf (dump_file, "  To: ");
>>> > +             print_rtl_single (dump_file, insn);
>>> > +           }
>>> > +
>>> > +         forwarding = true;
>>> > +       }
>>> > +    }
>>> > +
>>> > +  if (load_count == 2)
>>> > +    stats_ldp_count++;
>>> > +
>>> > +  return load_count == 2 && forwarding;
>>> > +}
>>> > +
>>> > +/* Break a load pair into its 2 distinct loads, except if the base source
>>> > +   address to load from is overwriten in the first load.  INSN should be the
>>> > +   PARALLEL of the load pair.  */
>>> > +
>>> > +static void
>>> > +break_ldp (rtx_insn *insn)
>>> > +{
>>> > +  rtx expr = PATTERN (insn);
>>> > +
>>> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
>>> > +
>>> > +  rtx load_0 = XVECEXP (expr, 0, 0);
>>> > +  rtx load_1 = XVECEXP (expr, 0, 1);
>>> > +
>>> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
>>> > +
>>> > +  /* The base address was overwriten in the first load.  */
>>> > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
>>> > +    return;
>>> > +
>>> > +  emit_insn_before (load_0, insn);
>>> > +  emit_insn_before (load_1, insn);
>>> > +  remove_insn (insn);
>>> > +
>>> > +  stats_transformed_count++;
>>> > +}
>>> > +
>>> > +static void
>>> > +scan_and_transform_bb_level ()
>>> > +{
>>> > +  rtx_insn *insn, *next;
>>> > +  basic_block bb;
>>> > +  FOR_EACH_BB_FN (bb, cfun)
>>> > +    {
>>> > +      list_store_info store_exprs;
>>> > +      unsigned int insn_cnt = 0;
>>> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
>>> > +       {
>>> > +         next = NEXT_INSN (insn);
>>>
>>> You probably want NONDEBUG here, otherwise insn_cnt will depend
>>> on -g?
>>>
>>> > +         /* If we cross a CALL_P insn, clear the list, because the
>>> > +            small-store-to-large-load is unlikely to cause performance
>>> > +            difference.  */
>>> > +         if (CALL_P (insn))
>>> > +           store_exprs.clear ();
>>> > +
>>> > +         if (!NONJUMP_INSN_P (insn))
>>> > +           continue;
>>> > +
>>> > +         cselib_process_insn (insn);
>>>
>>> is it necessary to process each insn with cselib?  Only loads & stores I guess?
>>>
>>> > +         rtx expr = single_set (insn);
>>> > +
>>> > +         /* If a store is encountered, append it to the store_exprs list to
>>> > +            check it later.  */
>>> > +         if (expr && is_store (expr))
>>> > +           {
>>> > +             rtx store_mem = SET_DEST (expr);
>>> > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
>>> > +             machine_mode store_mem_mode = GET_MODE (store_mem);
>>> > +             store_mem_addr = cselib_lookup (store_mem_addr,
>>> > +                                             store_mem_mode, 1,
>>> > +                                             store_mem_mode)->val_rtx;
>>> > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt });
>>> > +             stats_store_count++;
>>> > +           }
>>> > +
>>> > +         /* Check for small-store-to-large-load.  */
>>> > +         if (is_small_store_to_large_load (store_exprs, insn))
>>> > +           {
>>> > +             stats_ssll_count++;
>>> > +             break_ldp (insn);
>>> > +           }
>>> > +
>>> > +         /* Pop the first store from the list if it's distance crosses the
>>> > +            maximum accepted threshold.  The list contains unique values
>>> > +            sorted in ascending order, meaning that only one distance can be
>>> > +            off at a time.  */
>>> > +         if (!store_exprs.empty ()
>>> > +             && (insn_cnt - store_exprs.front ().insn_cnt
>>> > +                > (unsigned int) aarch64_store_forwarding_threshold_param))
>>> > +           store_exprs.pop_front ();
>>> > +
>>> > +         insn_cnt++;
>>> > +       }
>>> > +    }
>>> > +}
>>> > +
>>> > +static void
>>> > +execute_avoid_store_forwarding ()
>>> > +{
>>> > +  init_alias_analysis ();
>>> > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
>>> > +  scan_and_transform_bb_level ();
>>> > +  end_alias_analysis ();
>>> > +  cselib_finish ();
>>> > +  statistics_counter_event (cfun, "Number of stores identified: ",
>>> > +                           stats_store_count);
>>> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
>>> > +                           stats_ldp_count);
>>> > +  statistics_counter_event (cfun,
>>> > +                           "Number of forwarding cases identified: ",
>>> > +                           stats_ssll_count);
>>> > +  statistics_counter_event (cfun, "Number of trasformed cases: ",
>>> > +                           stats_transformed_count);
>>> > +}
>>> > +
>>> > +const pass_data pass_data_avoid_store_forwarding =
>>> > +{
>>> > +  RTL_PASS, /* type.  */
>>> > +  "avoid_store_forwarding", /* name.  */
>>> > +  OPTGROUP_NONE, /* optinfo_flags.  */
>>> > +  TV_NONE, /* tv_id.  */
>>> > +  0, /* properties_required.  */
>>> > +  0, /* properties_provided.  */
>>> > +  0, /* properties_destroyed.  */
>>> > +  0, /* todo_flags_start.  */
>>> > +  0 /* todo_flags_finish.  */
>>> > +};
>>> > +
>>> > +class pass_avoid_store_forwarding : public rtl_opt_pass
>>> > +{
>>> > +public:
>>> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
>>> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
>>> > +  {}
>>> > +
>>> > +  /* opt_pass methods: */
>>> > +  virtual bool gate (function *)
>>> > +    {
>>> > +      return aarch64_flag_avoid_store_forwarding;
>>> > +    }
>>> > +
>>> > +  virtual unsigned int execute (function *)
>>> > +    {
>>> > +      execute_avoid_store_forwarding ();
>>> > +      return 0;
>>> > +    }
>>> > +
>>> > +}; // class pass_avoid_store_forwarding
>>> > +
>>> > +/* Create a new avoid store forwarding pass instance.  */
>>> > +
>>> > +rtl_opt_pass *
>>> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
>>> > +{
>>> > +  return new pass_avoid_store_forwarding (ctxt);
>>> > +}
>>> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
>>> > index f5a518202a1..e4498d53b46 100644
>>> > --- a/gcc/config/aarch64/aarch64.opt
>>> > +++ b/gcc/config/aarch64/aarch64.opt
>>> > @@ -304,6 +304,10 @@ moutline-atomics
>>> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
>>> >  Generate local calls to out-of-line atomic operations.
>>> >
>>> > +mavoid-store-forwarding
>>> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
>>> > +Avoid store forwarding to load pairs.
>>> > +
>>> >  -param=aarch64-sve-compare-costs=
>>> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
>>> >  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
>>> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
>>> >
>>> >  EnumValue
>>> >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
>>> > +
>>> > +-param=aarch64-store-forwarding-threshold=
>>> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
>>> > +Maximum instruction distance allowed between a store and a load pair for this to be
>>> > +considered a candidate to avoid when using -mavoid-store-forwarding.
>>> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
>>> > index 0d96ae3d0b2..5676cdd9585 100644
>>> > --- a/gcc/config/aarch64/t-aarch64
>>> > +++ b/gcc/config/aarch64/t-aarch64
>>> > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
>>> >         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>>> >                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
>>> >
>>> > +aarch64-store-forwarding.o: \
>>> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
>>> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
>>> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
>>> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
>>> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
>>> > +    $(srcdir)/config/aarch64/aarch64-protos.h
>>> > +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>>> > +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
>>> > +
>>> >  comma=,
>>> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
>>> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
>>> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>>> > index 32f535e1ed4..9bf3a83286a 100644
>>> > --- a/gcc/doc/invoke.texi
>>> > +++ b/gcc/doc/invoke.texi
>>> > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
>>> >  -moverride=@var{string}  -mverbose-cost-dump
>>> >  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
>>> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
>>> > --moutline-atomics }
>>> > +-moutline-atomics -mavoid-store-forwarding}
>>> >
>>> >  @emph{Adapteva Epiphany Options}
>>> >  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
>>> > @@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
>>> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
>>> >  source pointer is aligned to at least double the alignment of the type.
>>> >
>>> > +@item aarch64-store-forwarding-threshold
>>> > +Maximum allowed instruction distance between a store and a load pair for
>>> > +this to be considered a candidate to avoid when using
>>> > +@option{-mavoid-store-forwarding}.
>>> > +
>>> >  @item aarch64-loop-vect-issue-rate-niters
>>> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
>>> >  rates into account when deciding whether a loop should be vectorized
>>> > @@ -20857,6 +20862,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
>>> >  the compiler from using floating-point and Advanced SIMD registers but will not
>>> >  impose any restrictions on the assembler.
>>> >
>>> > +@item -mavoid-store-forwarding
>>> > +@itemx -mno-avoid-store-forwarding
>>> > +Avoid store forwarding to load pairs.
>>> > +
>>> >  @opindex mlittle-endian
>>> >  @item -mlittle-endian
>>> >  Generate little-endian code.  This is the default when GCC is configured for an
>>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>>> > new file mode 100644
>>> > index 00000000000..b77de6c64b6
>>> > --- /dev/null
>>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
>>> > @@ -0,0 +1,33 @@
>>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
>>> > +
>>> > +#include <stdint.h>
>>> > +
>>> > +typedef int v4si __attribute__ ((vector_size (16)));
>>> > +
>>> > +/* Different address, same offset, no overlap  */
>>> > +
>>> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
>>> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
>>> > +       TYPE r, y; \
>>> > +       st_arr[0] = i; \
>>> > +       ld_arr[0] = dummy; \
>>> > +       r = st_arr_2[0]; \
>>> > +       y = st_arr_2[1]; \
>>> > +       return r + y; \
>>> > +}
>>> > +
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
>>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
>>> > +
>>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
>>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>>> > new file mode 100644
>>> > index 00000000000..f1b3a66abfd
>>> > --- /dev/null
>>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
>>> > @@ -0,0 +1,33 @@
>>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
>>> > +
>>> > +#include <stdint.h>
>>> > +
>>> > +typedef int v4si __attribute__ ((vector_size (16)));
>>> > +
>>> > +/* Same address, different offset, no overlap  */
>>> > +
>>> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
>>> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
>>> > +       TYPE r, y; \
>>> > +       st_arr[0] = i; \
>>> > +       ld_arr[0] = dummy; \
>>> > +       r = st_arr[10]; \
>>> > +       y = st_arr[11]; \
>>> > +       return r + y; \
>>> > +}
>>> > +
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
>>> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
>>> > +
>>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
>>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>>> > new file mode 100644
>>> > index 00000000000..8d5ce5cc87e
>>> > --- /dev/null
>>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>>> > @@ -0,0 +1,33 @@
>>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
>>> > +
>>> > +#include <stdint.h>
>>> > +
>>> > +typedef int v4si __attribute__ ((vector_size (16)));
>>> > +
>>> > +/* Same address, same offset, overlap  */
>>> > +
>>> > +#define LDP_SSLL_OVERLAP(TYPE) \
>>> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
>>> > +       TYPE r, y; \
>>> > +       st_arr[0] = i; \
>>> > +       ld_arr[0] = dummy; \
>>> > +       r = st_arr[0]; \
>>> > +       y = st_arr[1]; \
>>> > +       return r + y; \
>>> > +}
>>> > +
>>> > +LDP_SSLL_OVERLAP(uint32_t)
>>> > +LDP_SSLL_OVERLAP(uint64_t)
>>> > +LDP_SSLL_OVERLAP(int32_t)
>>> > +LDP_SSLL_OVERLAP(int64_t)
>>> > +LDP_SSLL_OVERLAP(int)
>>> > +LDP_SSLL_OVERLAP(long)
>>> > +LDP_SSLL_OVERLAP(float)
>>> > +LDP_SSLL_OVERLAP(double)
>>> > +LDP_SSLL_OVERLAP(v4si)
>>> > +
>>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
>>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
>>> > --
>>> > 2.41.0
>>
>> I will address the other two points on a seperate version.
>>
>> --
>> Manos Anagnostakis | Compiler Engineer |
>> E: manos.anagnostakis@vrull.eu
>>
>> VRULL GmbH | Beatrixgasse 32 1030 Vienna |
>>  W: www.vrull.eu | LinkedIn
>
>
>
> --
> Manos Anagnostakis | Compiler Engineer |
> E: manos.anagnostakis@vrull.eu
>
> VRULL GmbH | Beatrixgasse 32 1030 Vienna |
>  W: www.vrull.eu | LinkedIn

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-07  7:35       ` Richard Biener
@ 2023-12-07 10:27         ` Manos Anagnostakis
  0 siblings, 0 replies; 11+ messages in thread
From: Manos Anagnostakis @ 2023-12-07 10:27 UTC (permalink / raw)
  To: Richard Biener
  Cc: Philipp Tomsich, Manolis Tsamis, Richard Sandiford, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 32593 bytes --]

Στις Πέμ 7 Δεκ 2023, 09:39 ο χρήστης Richard Biener <
richard.guenther@gmail.com> έγραψε:

> On Wed, Dec 6, 2023 at 6:15 PM Manos Anagnostakis
> <manos.anagnostakis@vrull.eu> wrote:
> >
> > Hi again,
> >
> > I went and tested the requested changes and found out the following:
> >
> > 1. The pass is currently increasing insn_cnt on a NONJUMP_INSN_P, which
> is a subset of NONDEBUG_INSN_P. I think there is no problem with depending
> on -g with the current version. Do you see something I don't or did you
> mean something else?
>
> It just occurred to me - thanks for double-checking (it wasn't obvious
> to me NONJUMP_INSN_P doesn't include DEBUG_INSNs ...).
>
> > 2. Not processing all instructions is not letting cselib record all the
> effects they have, thus it does not have updated information to find true
> forwardings at any given time. I can confirm this since I am witnessing
> many unexpected changes on the number of handled cases if I do this only
> for loads/stores.
>
> Ah, OK.  I guess I don't fully get it, it seems you use cselib to
> compare addresses and while possibly not
> processing part of the address computation might break this other
> stmts inbetween should be uninteresting
> (at least I don't see you handling intermediate may-aliasing [small]
> stores to disable the splitting).
>
> So in the end it's a compile-time trade-off between relying solely on
> cselib or trying to optimize
> cselib use with DF for address computes?
>
> Richard.
>
I am not really familiar with the DF technique, but what we did is the
following:

At first we were using a combination of alias analysis with cselib, which
while trying to avoid false positives that had their memory-register
address changed inbetween, was rejecting part of stores from being
considered as candidates for a forwarding to an ldp.

Thus in the end using only the cselib with the correct lookups was the way
to address both the register difference and/or the intermediate change on
the address value.

>
> > Thanks in advance and please let me know your thoughts on the above.
> > Manos.
> >
> > On Wed, Dec 6, 2023 at 5:10 PM Manos Anagnostakis <
> manos.anagnostakis@vrull.eu> wrote:
> >>
> >> Hi Richard,
> >>
> >> thanks for the useful comments.
> >>
> >> On Wed, Dec 6, 2023 at 4:32 PM Richard Biener <
> richard.guenther@gmail.com> wrote:
> >>>
> >>> On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> >>> <manos.anagnostakis@vrull.eu> wrote:
> >>> >
> >>> > This is an RTL pass that detects store forwarding from stores to
> larger loads (load pairs).
> >>> >
> >>> > This optimization is SPEC2017-driven and was found to be beneficial
> for some benchmarks,
> >>> > through testing on ampere1/ampere1a machines.
> >>> >
> >>> > For example, it can transform cases like
> >>> >
> >>> > str  d5, [sp, #320]
> >>> > fmul d5, d31, d29
> >>> > ldp  d31, d17, [sp, #312] # Large load from small store
> >>> >
> >>> > to
> >>> >
> >>> > str  d5, [sp, #320]
> >>> > fmul d5, d31, d29
> >>> > ldr  d31, [sp, #312]
> >>> > ldr  d17, [sp, #320]
> >>> >
> >>> > Currently, the pass is disabled by default on all architectures and
> enabled by a target-specific option.
> >>> >
> >>> > If deemed beneficial enough for a default, it will be enabled on
> ampere1/ampere1a,
> >>> > or other architectures as well, without needing to be turned on by
> this option.
> >>>
> >>> What is aarch64-specific about the pass?
> >>
> >> The pass was designed to target load pairs, which are aarch64 specific,
> thus it cannot handle generic loads.
> >>>
> >>>
> >>> I see an increasingly large number of target specific passes pop up
> (probably
> >>> for the excuse we can generalize them if necessary).  But GCC isn't
> LLVM
> >>> and this feels like getting out of hand?
> >>>
> >>> The x86 backend also has its store-forwarding "pass" as part of mdreorg
> >>> in ix86_split_stlf_stall_load.
> >>>
> >>> Richard.
> >>>
> >>> > Bootstrapped and regtested on aarch64-linux.
> >>> >
> >>> > gcc/ChangeLog:
> >>> >
> >>> >         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> >>> >         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New
> pass.
> >>> >         * config/aarch64/aarch64-protos.h
> (make_pass_avoid_store_forwarding): Declare.
> >>> >         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New
> option.
> >>> >         (aarch64-store-forwarding-threshold): New param.
> >>> >         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o
> >>> >         * doc/invoke.texi: Document new option and new param.
> >>> >         * config/aarch64/aarch64-store-forwarding.cc: New file.
> >>> >
> >>> > gcc/testsuite/ChangeLog:
> >>> >
> >>> >         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> >>> >         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> >>> >         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >>> >
> >>> > Signed-off-by: Manos Anagnostakis <manos.anagnostakis@vrull.eu>
> >>> > Co-Authored-By: Manolis Tsamis <manolis.tsamis@vrull.eu>
> >>> > Co-Authored-By: Philipp Tomsich <philipp.tomsich@vrull.eu>
> >>> > ---
> >>> > Changes in v6:
> >>> >         - An obvious change. insn_cnt was incremented only on
> >>> >           stores and not for every insn in the bb. Now restored.
> >>> >
> >>> >  gcc/config.gcc                                |   1 +
> >>> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> >>> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> >>> >  .../aarch64/aarch64-store-forwarding.cc       | 318
> ++++++++++++++++++
> >>> >  gcc/config/aarch64/aarch64.opt                |   9 +
> >>> >  gcc/config/aarch64/t-aarch64                  |  10 +
> >>> >  gcc/doc/invoke.texi                           |  11 +-
> >>> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> >>> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> >>> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> >>> >  10 files changed, 449 insertions(+), 1 deletion(-)
> >>> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> >  create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> >  create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> >  create mode 100644
> gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> >
> >>> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> >>> > index 6450448f2f0..7c48429eb82 100644
> >>> > --- a/gcc/config.gcc
> >>> > +++ b/gcc/config.gcc
> >>> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >>> >         cxx_target_objs="aarch64-c.o"
> >>> >         d_target_objs="aarch64-d.o"
> >>> >         extra_objs="aarch64-builtins.o aarch-common.o
> aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o
> aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o
> aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o
> falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> >>> > +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >>> >
>  target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >>> >         target_has_targetm_common=yes
> >>> >         ;;
> >>> > diff --git a/gcc/config/aarch64/aarch64-passes.def
> b/gcc/config/aarch64/aarch64-passes.def
> >>> > index 662a13fd5e6..94ced0aebf6 100644
> >>> > --- a/gcc/config/aarch64/aarch64-passes.def
> >>> > +++ b/gcc/config/aarch64/aarch64-passes.def
> >>> > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE
> (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
> >>> >  INSERT_PASS_AFTER (pass_machine_reorg, 1,
> pass_tag_collision_avoidance);
> >>> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >>> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> >>> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> >>> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> >>> > index 60ff61f6d54..8f5f2ca4710 100644
> >>> > --- a/gcc/config/aarch64/aarch64-protos.h
> >>> > +++ b/gcc/config/aarch64/aarch64-protos.h
> >>> > @@ -1069,6 +1069,7 @@ rtl_opt_pass
> *make_pass_tag_collision_avoidance (gcc::context *);
> >>> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >>> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> >>> >  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
> >>> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >>> >
> >>> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >>> >
> >>> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc
> b/gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> > new file mode 100644
> >>> > index 00000000000..8a6faefd8c0
> >>> > --- /dev/null
> >>> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> > @@ -0,0 +1,318 @@
> >>> > +/* Avoid store forwarding optimization pass.
> >>> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> >>> > +   Contributed by VRULL GmbH.
> >>> > +
> >>> > +   This file is part of GCC.
> >>> > +
> >>> > +   GCC is free software; you can redistribute it and/or modify it
> >>> > +   under the terms of the GNU General Public License as published by
> >>> > +   the Free Software Foundation; either version 3, or (at your
> option)
> >>> > +   any later version.
> >>> > +
> >>> > +   GCC is distributed in the hope that it will be useful, but
> >>> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >>> > +   General Public License for more details.
> >>> > +
> >>> > +   You should have received a copy of the GNU General Public License
> >>> > +   along with GCC; see the file COPYING3.  If not see
> >>> > +   <http://www.gnu.org/licenses/>.  */
> >>> > +
> >>> > +#define IN_TARGET_CODE 1
> >>> > +
> >>> > +#include "config.h"
> >>> > +#define INCLUDE_LIST
> >>> > +#include "system.h"
> >>> > +#include "coretypes.h"
> >>> > +#include "backend.h"
> >>> > +#include "rtl.h"
> >>> > +#include "alias.h"
> >>> > +#include "rtlanal.h"
> >>> > +#include "tree-pass.h"
> >>> > +#include "cselib.h"
> >>> > +
> >>> > +/* This is an RTL pass that detects store forwarding from stores to
> larger
> >>> > +   loads (load pairs). For example, it can transform cases like
> >>> > +
> >>> > +   str  d5, [sp, #320]
> >>> > +   fmul d5, d31, d29
> >>> > +   ldp  d31, d17, [sp, #312] # Large load from small store
> >>> > +
> >>> > +   to
> >>> > +
> >>> > +   str  d5, [sp, #320]
> >>> > +   fmul d5, d31, d29
> >>> > +   ldr  d31, [sp, #312]
> >>> > +   ldr  d17, [sp, #320]
> >>> > +
> >>> > +   Design: The pass follows a straightforward design.  It starts by
> >>> > +   initializing the alias analysis and the cselib.  Both of these
> are used to
> >>> > +   find stores and larger loads with overlapping addresses, which
> are
> >>> > +   candidates for store forwarding optimizations.  It then scans on
> basic block
> >>> > +   level to find stores that forward to larger loads and handles
> them
> >>> > +   accordingly as described in the above example.  Finally, the
> alias analysis
> >>> > +   and the cselib library are closed.  */
> >>> > +
> >>> > +typedef struct
> >>> > +{
> >>> > +  rtx_insn *store_insn;
> >>> > +  rtx store_mem_addr;
> >>> > +  unsigned int insn_cnt;
> >>> > +} str_info;
> >>> > +
> >>> > +typedef std::list<str_info> list_store_info;
> >>> > +
> >>> > +/* Statistics counters.  */
> >>> > +static unsigned int stats_store_count = 0;
> >>> > +static unsigned int stats_ldp_count = 0;
> >>> > +static unsigned int stats_ssll_count = 0;
> >>> > +static unsigned int stats_transformed_count = 0;
> >>> > +
> >>> > +/* Default.  */
> >>> > +static rtx dummy;
> >>> > +static bool is_load (rtx expr, rtx &op_1=dummy);
> >>> > +
> >>> > +/* Return true if SET expression EXPR is a store; otherwise false.
> */
> >>> > +
> >>> > +static bool
> >>> > +is_store (rtx expr)
> >>> > +{
> >>> > +  return MEM_P (SET_DEST (expr));
> >>> > +}
> >>> > +
> >>> > +/* Return true if SET expression EXPR is a load; otherwise false.
> OP_1 will
> >>> > +   contain the MEM operand of the load.  */
> >>> > +
> >>> > +static bool
> >>> > +is_load (rtx expr, rtx &op_1)
> >>> > +{
> >>> > +  op_1 = SET_SRC (expr);
> >>> > +
> >>> > +  if (GET_CODE (op_1) == ZERO_EXTEND
> >>> > +      || GET_CODE (op_1) == SIGN_EXTEND)
> >>> > +    op_1 = XEXP (op_1, 0);
> >>> > +
> >>> > +  return MEM_P (op_1);
> >>> > +}
> >>> > +
> >>> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of
> LOAD_MEM;
> >>> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx
> containing
> >>> > +   STORE_MEM_ADDR.  */
> >>> > +
> >>> > +static bool
> >>> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode
> store_mem_mode)
> >>> > +{
> >>> > +  /* Sometimes we do not have the proper value.  */
> >>> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> >>> > +    return false;
> >>> > +
> >>> > +  gcc_checking_assert (MEM_P (load_mem));
> >>> > +
> >>> > +  return rtx_equal_for_cselib_1 (store_mem_addr,
> >>> > +                                get_addr (XEXP (load_mem, 0)),
> >>> > +                                store_mem_mode, 0);
> >>> > +}
> >>> > +
> >>> > +/* Return true if INSN is a load pair, preceded by a store
> forwarding to it;
> >>> > +   otherwise false.  STORE_EXPRS contains the stores.  */
> >>> > +
> >>> > +static bool
> >>> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn
> *insn)
> >>> > +{
> >>> > +  unsigned int load_count = 0;
> >>> > +  bool forwarding = false;
> >>> > +  rtx expr = PATTERN (insn);
> >>> > +
> >>> > +  if (GET_CODE (expr) != PARALLEL
> >>> > +      || XVECLEN (expr, 0) != 2)
> >>> > +    return false;
> >>> > +
> >>> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> >>> > +    {
> >>> > +      rtx op_1;
> >>> > +      rtx out_exp = XVECEXP (expr, 0, i);
> >>> > +
> >>> > +      if (GET_CODE (out_exp) != SET)
> >>> > +       continue;
> >>> > +
> >>> > +      if (!is_load (out_exp, op_1))
> >>> > +       continue;
> >>> > +
> >>> > +      load_count++;
> >>> > +
> >>> > +      for (str_info str : store_exprs)
> >>> > +       {
> >>> > +         rtx store_insn = str.store_insn;
> >>> > +
> >>> > +         if (!is_forwarding (str.store_mem_addr, op_1,
> >>> > +                             GET_MODE (SET_DEST (PATTERN
> (store_insn)))))
> >>> > +           continue;
> >>> > +
> >>> > +         if (dump_file)
> >>> > +           {
> >>> > +             fprintf (dump_file,
> >>> > +                      "Store forwarding to PARALLEL with loads:\n");
> >>> > +             fprintf (dump_file, "  From: ");
> >>> > +             print_rtl_single (dump_file, store_insn);
> >>> > +             fprintf (dump_file, "  To: ");
> >>> > +             print_rtl_single (dump_file, insn);
> >>> > +           }
> >>> > +
> >>> > +         forwarding = true;
> >>> > +       }
> >>> > +    }
> >>> > +
> >>> > +  if (load_count == 2)
> >>> > +    stats_ldp_count++;
> >>> > +
> >>> > +  return load_count == 2 && forwarding;
> >>> > +}
> >>> > +
> >>> > +/* Break a load pair into its 2 distinct loads, except if the base
> source
> >>> > +   address to load from is overwriten in the first load.  INSN
> should be the
> >>> > +   PARALLEL of the load pair.  */
> >>> > +
> >>> > +static void
> >>> > +break_ldp (rtx_insn *insn)
> >>> > +{
> >>> > +  rtx expr = PATTERN (insn);
> >>> > +
> >>> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN
> (expr, 0) == 2);
> >>> > +
> >>> > +  rtx load_0 = XVECEXP (expr, 0, 0);
> >>> > +  rtx load_1 = XVECEXP (expr, 0, 1);
> >>> > +
> >>> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> >>> > +
> >>> > +  /* The base address was overwriten in the first load.  */
> >>> > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> >>> > +    return;
> >>> > +
> >>> > +  emit_insn_before (load_0, insn);
> >>> > +  emit_insn_before (load_1, insn);
> >>> > +  remove_insn (insn);
> >>> > +
> >>> > +  stats_transformed_count++;
> >>> > +}
> >>> > +
> >>> > +static void
> >>> > +scan_and_transform_bb_level ()
> >>> > +{
> >>> > +  rtx_insn *insn, *next;
> >>> > +  basic_block bb;
> >>> > +  FOR_EACH_BB_FN (bb, cfun)
> >>> > +    {
> >>> > +      list_store_info store_exprs;
> >>> > +      unsigned int insn_cnt = 0;
> >>> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb));
> insn = next)
> >>> > +       {
> >>> > +         next = NEXT_INSN (insn);
> >>>
> >>> You probably want NONDEBUG here, otherwise insn_cnt will depend
> >>> on -g?
> >>>
> >>> > +         /* If we cross a CALL_P insn, clear the list, because the
> >>> > +            small-store-to-large-load is unlikely to cause
> performance
> >>> > +            difference.  */
> >>> > +         if (CALL_P (insn))
> >>> > +           store_exprs.clear ();
> >>> > +
> >>> > +         if (!NONJUMP_INSN_P (insn))
> >>> > +           continue;
> >>> > +
> >>> > +         cselib_process_insn (insn);
> >>>
> >>> is it necessary to process each insn with cselib?  Only loads & stores
> I guess?
> >>>
> >>> > +         rtx expr = single_set (insn);
> >>> > +
> >>> > +         /* If a store is encountered, append it to the store_exprs
> list to
> >>> > +            check it later.  */
> >>> > +         if (expr && is_store (expr))
> >>> > +           {
> >>> > +             rtx store_mem = SET_DEST (expr);
> >>> > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> >>> > +             machine_mode store_mem_mode = GET_MODE (store_mem);
> >>> > +             store_mem_addr = cselib_lookup (store_mem_addr,
> >>> > +                                             store_mem_mode, 1,
> >>> > +
>  store_mem_mode)->val_rtx;
> >>> > +             store_exprs.push_back ({ insn, store_mem_addr,
> insn_cnt });
> >>> > +             stats_store_count++;
> >>> > +           }
> >>> > +
> >>> > +         /* Check for small-store-to-large-load.  */
> >>> > +         if (is_small_store_to_large_load (store_exprs, insn))
> >>> > +           {
> >>> > +             stats_ssll_count++;
> >>> > +             break_ldp (insn);
> >>> > +           }
> >>> > +
> >>> > +         /* Pop the first store from the list if it's distance
> crosses the
> >>> > +            maximum accepted threshold.  The list contains unique
> values
> >>> > +            sorted in ascending order, meaning that only one
> distance can be
> >>> > +            off at a time.  */
> >>> > +         if (!store_exprs.empty ()
> >>> > +             && (insn_cnt - store_exprs.front ().insn_cnt
> >>> > +                > (unsigned int)
> aarch64_store_forwarding_threshold_param))
> >>> > +           store_exprs.pop_front ();
> >>> > +
> >>> > +         insn_cnt++;
> >>> > +       }
> >>> > +    }
> >>> > +}
> >>> > +
> >>> > +static void
> >>> > +execute_avoid_store_forwarding ()
> >>> > +{
> >>> > +  init_alias_analysis ();
> >>> > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> >>> > +  scan_and_transform_bb_level ();
> >>> > +  end_alias_analysis ();
> >>> > +  cselib_finish ();
> >>> > +  statistics_counter_event (cfun, "Number of stores identified: ",
> >>> > +                           stats_store_count);
> >>> > +  statistics_counter_event (cfun, "Number of load pairs identified:
> ",
> >>> > +                           stats_ldp_count);
> >>> > +  statistics_counter_event (cfun,
> >>> > +                           "Number of forwarding cases identified:
> ",
> >>> > +                           stats_ssll_count);
> >>> > +  statistics_counter_event (cfun, "Number of trasformed cases: ",
> >>> > +                           stats_transformed_count);
> >>> > +}
> >>> > +
> >>> > +const pass_data pass_data_avoid_store_forwarding =
> >>> > +{
> >>> > +  RTL_PASS, /* type.  */
> >>> > +  "avoid_store_forwarding", /* name.  */
> >>> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> >>> > +  TV_NONE, /* tv_id.  */
> >>> > +  0, /* properties_required.  */
> >>> > +  0, /* properties_provided.  */
> >>> > +  0, /* properties_destroyed.  */
> >>> > +  0, /* todo_flags_start.  */
> >>> > +  0 /* todo_flags_finish.  */
> >>> > +};
> >>> > +
> >>> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> >>> > +{
> >>> > +public:
> >>> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> >>> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> >>> > +  {}
> >>> > +
> >>> > +  /* opt_pass methods: */
> >>> > +  virtual bool gate (function *)
> >>> > +    {
> >>> > +      return aarch64_flag_avoid_store_forwarding;
> >>> > +    }
> >>> > +
> >>> > +  virtual unsigned int execute (function *)
> >>> > +    {
> >>> > +      execute_avoid_store_forwarding ();
> >>> > +      return 0;
> >>> > +    }
> >>> > +
> >>> > +}; // class pass_avoid_store_forwarding
> >>> > +
> >>> > +/* Create a new avoid store forwarding pass instance.  */
> >>> > +
> >>> > +rtl_opt_pass *
> >>> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> >>> > +{
> >>> > +  return new pass_avoid_store_forwarding (ctxt);
> >>> > +}
> >>> > diff --git a/gcc/config/aarch64/aarch64.opt
> b/gcc/config/aarch64/aarch64.opt
> >>> > index f5a518202a1..e4498d53b46 100644
> >>> > --- a/gcc/config/aarch64/aarch64.opt
> >>> > +++ b/gcc/config/aarch64/aarch64.opt
> >>> > @@ -304,6 +304,10 @@ moutline-atomics
> >>> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >>> >  Generate local calls to out-of-line atomic operations.
> >>> >
> >>> > +mavoid-store-forwarding
> >>> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0)
> Optimization
> >>> > +Avoid store forwarding to load pairs.
> >>> > +
> >>> >  -param=aarch64-sve-compare-costs=
> >>> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1)
> IntegerRange(0, 1) Param
> >>> >  When vectorizing for SVE, consider using unpacked vectors for
> smaller elements and use the cost model to pick the cheapest approach.
> Also use the cost model to choose between SVE and Advanced SIMD
> vectorization.
> >>> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never)
> Value(AARCH64_LDP_STP_POLICY_NEVER)
> >>> >
> >>> >  EnumValue
> >>> >  Enum(aarch64_ldp_stp_policy) String(aligned)
> Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> >>> > +
> >>> > +-param=aarch64-store-forwarding-threshold=
> >>> > +Target Joined UInteger
> Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> >>> > +Maximum instruction distance allowed between a store and a load
> pair for this to be
> >>> > +considered a candidate to avoid when using -mavoid-store-forwarding.
> >>> > diff --git a/gcc/config/aarch64/t-aarch64
> b/gcc/config/aarch64/t-aarch64
> >>> > index 0d96ae3d0b2..5676cdd9585 100644
> >>> > --- a/gcc/config/aarch64/t-aarch64
> >>> > +++ b/gcc/config/aarch64/t-aarch64
> >>> > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o:
> $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >>> >         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS)
> $(INCLUDES) \
> >>> >                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >>> >
> >>> > +aarch64-store-forwarding.o: \
> >>> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> >>> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h
> $(RTL_BASE_H) \
> >>> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H)
> $(RECOG_H) \
> >>> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> >>> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> >>> > +    $(srcdir)/config/aarch64/aarch64-protos.h
> >>> > +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS)
> $(INCLUDES) \
> >>> > +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> >>> > +
> >>> >  comma=,
> >>> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%,
> $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >>> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> >>> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> >>> > index 32f535e1ed4..9bf3a83286a 100644
> >>> > --- a/gcc/doc/invoke.texi
> >>> > +++ b/gcc/doc/invoke.texi
> >>> > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
> >>> >  -moverride=@var{string}  -mverbose-cost-dump
> >>> >  -mstack-protector-guard=@var{guard}
> -mstack-protector-guard-reg=@var{sysreg}
> >>> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> >>> > --moutline-atomics }
> >>> > +-moutline-atomics -mavoid-store-forwarding}
> >>> >
> >>> >  @emph{Adapteva Epiphany Options}
> >>> >  @gccoptlist{-mhalf-reg-file  -mprefer-short-insn-regs
> >>> > @@ -16774,6 +16774,11 @@ With
> @option{--param=aarch64-stp-policy=never}, do not emit stp.
> >>> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if
> the
> >>> >  source pointer is aligned to at least double the alignment of the
> type.
> >>> >
> >>> > +@item aarch64-store-forwarding-threshold
> >>> > +Maximum allowed instruction distance between a store and a load
> pair for
> >>> > +this to be considered a candidate to avoid when using
> >>> > +@option{-mavoid-store-forwarding}.
> >>> > +
> >>> >  @item aarch64-loop-vect-issue-rate-niters
> >>> >  The tuning for some AArch64 CPUs tries to take both latencies and
> issue
> >>> >  rates into account when deciding whether a loop should be vectorized
> >>> > @@ -20857,6 +20862,10 @@ Generate code which uses only the
> general-purpose registers.  This will prevent
> >>> >  the compiler from using floating-point and Advanced SIMD registers
> but will not
> >>> >  impose any restrictions on the assembler.
> >>> >
> >>> > +@item -mavoid-store-forwarding
> >>> > +@itemx -mno-avoid-store-forwarding
> >>> > +Avoid store forwarding to load pairs.
> >>> > +
> >>> >  @opindex mlittle-endian
> >>> >  @item -mlittle-endian
> >>> >  Generate little-endian code.  This is the default when GCC is
> configured for an
> >>> > diff --git
> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> > new file mode 100644
> >>> > index 00000000000..b77de6c64b6
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Different address, same offset, no overlap  */
> >>> > +
> >>> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> >>> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr,
> TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> >>> > +       TYPE r, y; \
> >>> > +       st_arr[0] = i; \
> >>> > +       ld_arr[0] = dummy; \
> >>> > +       r = st_arr_2[0]; \
> >>> > +       y = st_arr_2[1]; \
> >>> > +       return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 }
> } */
> >>> > diff --git
> a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> > new file mode 100644
> >>> > index 00000000000..f1b3a66abfd
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Same address, different offset, no overlap  */
> >>> > +
> >>> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> >>> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr,
> TYPE i, TYPE dummy){ \
> >>> > +       TYPE r, y; \
> >>> > +       st_arr[0] = i; \
> >>> > +       ld_arr[0] = dummy; \
> >>> > +       r = st_arr[10]; \
> >>> > +       y = st_arr[11]; \
> >>> > +       return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 }
> } */
> >>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> > new file mode 100644
> >>> > index 00000000000..8d5ce5cc87e
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Same address, same offset, overlap  */
> >>> > +
> >>> > +#define LDP_SSLL_OVERLAP(TYPE) \
> >>> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i,
> TYPE dummy){ \
> >>> > +       TYPE r, y; \
> >>> > +       st_arr[0] = i; \
> >>> > +       ld_arr[0] = dummy; \
> >>> > +       r = st_arr[0]; \
> >>> > +       y = st_arr[1]; \
> >>> > +       return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_OVERLAP(uint32_t)
> >>> > +LDP_SSLL_OVERLAP(uint64_t)
> >>> > +LDP_SSLL_OVERLAP(int32_t)
> >>> > +LDP_SSLL_OVERLAP(int64_t)
> >>> > +LDP_SSLL_OVERLAP(int)
> >>> > +LDP_SSLL_OVERLAP(long)
> >>> > +LDP_SSLL_OVERLAP(float)
> >>> > +LDP_SSLL_OVERLAP(double)
> >>> > +LDP_SSLL_OVERLAP(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 }
> } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 }
> } */
> >>> > --
> >>> > 2.41.0
> >>
> >> I will address the other two points on a seperate version.
> >>
> >> --
> >> Manos Anagnostakis | Compiler Engineer |
> >> E: manos.anagnostakis@vrull.eu
> >>
> >> VRULL GmbH | Beatrixgasse 32 1030 Vienna |
> >>  W: www.vrull.eu | LinkedIn
> >
> >
> >
> > --
> > Manos Anagnostakis | Compiler Engineer |
> > E: manos.anagnostakis@vrull.eu
> >
> > VRULL GmbH | Beatrixgasse 32 1030 Vienna |
> >  W: www.vrull.eu | LinkedIn
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-07  7:31     ` Richard Biener
@ 2023-12-07 12:20       ` Richard Sandiford
  2023-12-07 14:09         ` Richard Biener
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Sandiford @ 2023-12-07 12:20 UTC (permalink / raw)
  To: Richard Biener
  Cc: Philipp Tomsich, Manos Anagnostakis, gcc-patches, Manolis Tsamis

Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Dec 6, 2023 at 7:44 PM Philipp Tomsich <philipp.tomsich@vrull.eu> wrote:
>>
>> On Wed, 6 Dec 2023 at 23:32, Richard Biener <richard.guenther@gmail.com> wrote:
>> >
>> > On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
>> > <manos.anagnostakis@vrull.eu> wrote:
>> > >
>> > > This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
>> > >
>> > > This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
>> > > through testing on ampere1/ampere1a machines.
>> > >
>> > > For example, it can transform cases like
>> > >
>> > > str  d5, [sp, #320]
>> > > fmul d5, d31, d29
>> > > ldp  d31, d17, [sp, #312] # Large load from small store
>> > >
>> > > to
>> > >
>> > > str  d5, [sp, #320]
>> > > fmul d5, d31, d29
>> > > ldr  d31, [sp, #312]
>> > > ldr  d17, [sp, #320]
>> > >
>> > > Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.
>> > >
>> > > If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
>> > > or other architectures as well, without needing to be turned on by this option.
>> >
>> > What is aarch64-specific about the pass?
>> >
>> > I see an increasingly large number of target specific passes pop up (probably
>> > for the excuse we can generalize them if necessary).  But GCC isn't LLVM
>> > and this feels like getting out of hand?
>>
>> We had an OK from Richard Sandiford on the earlier (v5) version with
>> v6 just fixing an obvious bug... so I was about to merge this earlier
>> just when you commented.
>>
>> Given that this had months of test exposure on our end, I would prefer
>> to move this forward for GCC14 in its current form.
>> The project of replacing architecture-specific store-forwarding passes
>> with a generalized infrastructure could then be addressed in the GCC15
>> timeframe (or beyond)?
>
> It's up to target maintainers, I just picked this pass (randomly) to make this
> comment (of course also knowing that STLF fails are a common issue on
> pipelined uarchs).

I agree there's scope for making some of this target-independent.

One vague thing I've been wondering about is whether, for some passes
like these, we should use inheritance rather than target hooks.  So in
this case, the target-independent code would provide a framework for
iterating over the function and testing for forwarding, but the target
would ultimately decide what to do with that information.  This would
also make it easier for targets to add genuinely target-specific
information to the bookkeeping structures.

In case it sounds otherwise, that's supposed to be more than
just a structural C++-vs-C thing.  The idea is that we'd have
a pass for "resolving store forwarding-related problems",
but the specific goals would be mostly (or at least partially)
target-specific rather than target-independent.

I'd wondered the same thing about the early-ra pass that we're
adding for SME.  Some of the framework could be generalised and
made target-independent, but the main purpose of the pass (using
strided registers with certain patterns and constraints) is highly
target-specific.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-07 12:20       ` Richard Sandiford
@ 2023-12-07 14:09         ` Richard Biener
  2023-12-08 15:47           ` Manos Anagnostakis
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2023-12-07 14:09 UTC (permalink / raw)
  To: Richard Biener, Philipp Tomsich, Manos Anagnostakis, gcc-patches,
	Manolis Tsamis, richard.sandiford

On Thu, Dec 7, 2023 at 1:20 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Dec 6, 2023 at 7:44 PM Philipp Tomsich <philipp.tomsich@vrull.eu> wrote:
> >>
> >> On Wed, 6 Dec 2023 at 23:32, Richard Biener <richard.guenther@gmail.com> wrote:
> >> >
> >> > On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> >> > <manos.anagnostakis@vrull.eu> wrote:
> >> > >
> >> > > This is an RTL pass that detects store forwarding from stores to larger loads (load pairs).
> >> > >
> >> > > This optimization is SPEC2017-driven and was found to be beneficial for some benchmarks,
> >> > > through testing on ampere1/ampere1a machines.
> >> > >
> >> > > For example, it can transform cases like
> >> > >
> >> > > str  d5, [sp, #320]
> >> > > fmul d5, d31, d29
> >> > > ldp  d31, d17, [sp, #312] # Large load from small store
> >> > >
> >> > > to
> >> > >
> >> > > str  d5, [sp, #320]
> >> > > fmul d5, d31, d29
> >> > > ldr  d31, [sp, #312]
> >> > > ldr  d17, [sp, #320]
> >> > >
> >> > > Currently, the pass is disabled by default on all architectures and enabled by a target-specific option.
> >> > >
> >> > > If deemed beneficial enough for a default, it will be enabled on ampere1/ampere1a,
> >> > > or other architectures as well, without needing to be turned on by this option.
> >> >
> >> > What is aarch64-specific about the pass?
> >> >
> >> > I see an increasingly large number of target specific passes pop up (probably
> >> > for the excuse we can generalize them if necessary).  But GCC isn't LLVM
> >> > and this feels like getting out of hand?
> >>
> >> We had an OK from Richard Sandiford on the earlier (v5) version with
> >> v6 just fixing an obvious bug... so I was about to merge this earlier
> >> just when you commented.
> >>
> >> Given that this had months of test exposure on our end, I would prefer
> >> to move this forward for GCC14 in its current form.
> >> The project of replacing architecture-specific store-forwarding passes
> >> with a generalized infrastructure could then be addressed in the GCC15
> >> timeframe (or beyond)?
> >
> > It's up to target maintainers, I just picked this pass (randomly) to make this
> > comment (of course also knowing that STLF fails are a common issue on
> > pipelined uarchs).
>
> I agree there's scope for making some of this target-independent.
>
> One vague thing I've been wondering about is whether, for some passes
> like these, we should use inheritance rather than target hooks.  So in
> this case, the target-independent code would provide a framework for
> iterating over the function and testing for forwarding, but the target
> would ultimately decide what to do with that information.  This would
> also make it easier for targets to add genuinely target-specific
> information to the bookkeeping structures.
>
> In case it sounds otherwise, that's supposed to be more than
> just a structural C++-vs-C thing.  The idea is that we'd have
> a pass for "resolving store forwarding-related problems",
> but the specific goals would be mostly (or at least partially)
> target-specific rather than target-independent.

In some cases we've used target hooks for this, in this case it might
work as well.

> I'd wondered the same thing about the early-ra pass that we're
> adding for SME.  Some of the framework could be generalised and
> made target-independent, but the main purpose of the pass (using
> strided registers with certain patterns and constraints) is highly
> target-specific.

.. not sure about this one though.

Richard.

> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding.
  2023-12-07 14:09         ` Richard Biener
@ 2023-12-08 15:47           ` Manos Anagnostakis
  0 siblings, 0 replies; 11+ messages in thread
From: Manos Anagnostakis @ 2023-12-08 15:47 UTC (permalink / raw)
  To: Richard Biener, Richard Sandiford
  Cc: Philipp Tomsich, gcc-patches, Manolis Tsamis

[-- Attachment #1: Type: text/plain, Size: 4090 bytes --]

So is it OK for trunk as is in v6 with the generic changes added in GCC-15?

Manos.

Στις Πέμ 7 Δεκ 2023, 16:10 ο χρήστης Richard Biener <
richard.guenther@gmail.com> έγραψε:

> On Thu, Dec 7, 2023 at 1:20 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Richard Biener <richard.guenther@gmail.com> writes:
> > > On Wed, Dec 6, 2023 at 7:44 PM Philipp Tomsich <
> philipp.tomsich@vrull.eu> wrote:
> > >>
> > >> On Wed, 6 Dec 2023 at 23:32, Richard Biener <
> richard.guenther@gmail.com> wrote:
> > >> >
> > >> > On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> > >> > <manos.anagnostakis@vrull.eu> wrote:
> > >> > >
> > >> > > This is an RTL pass that detects store forwarding from stores to
> larger loads (load pairs).
> > >> > >
> > >> > > This optimization is SPEC2017-driven and was found to be
> beneficial for some benchmarks,
> > >> > > through testing on ampere1/ampere1a machines.
> > >> > >
> > >> > > For example, it can transform cases like
> > >> > >
> > >> > > str  d5, [sp, #320]
> > >> > > fmul d5, d31, d29
> > >> > > ldp  d31, d17, [sp, #312] # Large load from small store
> > >> > >
> > >> > > to
> > >> > >
> > >> > > str  d5, [sp, #320]
> > >> > > fmul d5, d31, d29
> > >> > > ldr  d31, [sp, #312]
> > >> > > ldr  d17, [sp, #320]
> > >> > >
> > >> > > Currently, the pass is disabled by default on all architectures
> and enabled by a target-specific option.
> > >> > >
> > >> > > If deemed beneficial enough for a default, it will be enabled on
> ampere1/ampere1a,
> > >> > > or other architectures as well, without needing to be turned on
> by this option.
> > >> >
> > >> > What is aarch64-specific about the pass?
> > >> >
> > >> > I see an increasingly large number of target specific passes pop up
> (probably
> > >> > for the excuse we can generalize them if necessary).  But GCC isn't
> LLVM
> > >> > and this feels like getting out of hand?
> > >>
> > >> We had an OK from Richard Sandiford on the earlier (v5) version with
> > >> v6 just fixing an obvious bug... so I was about to merge this earlier
> > >> just when you commented.
> > >>
> > >> Given that this had months of test exposure on our end, I would prefer
> > >> to move this forward for GCC14 in its current form.
> > >> The project of replacing architecture-specific store-forwarding passes
> > >> with a generalized infrastructure could then be addressed in the GCC15
> > >> timeframe (or beyond)?
> > >
> > > It's up to target maintainers, I just picked this pass (randomly) to
> make this
> > > comment (of course also knowing that STLF fails are a common issue on
> > > pipelined uarchs).
> >
> > I agree there's scope for making some of this target-independent.
> >
> > One vague thing I've been wondering about is whether, for some passes
> > like these, we should use inheritance rather than target hooks.  So in
> > this case, the target-independent code would provide a framework for
> > iterating over the function and testing for forwarding, but the target
> > would ultimately decide what to do with that information.  This would
> > also make it easier for targets to add genuinely target-specific
> > information to the bookkeeping structures.
> >
> > In case it sounds otherwise, that's supposed to be more than
> > just a structural C++-vs-C thing.  The idea is that we'd have
> > a pass for "resolving store forwarding-related problems",
> > but the specific goals would be mostly (or at least partially)
> > target-specific rather than target-independent.
>
> In some cases we've used target hooks for this, in this case it might
> work as well.
>
> > I'd wondered the same thing about the early-ra pass that we're
> > adding for SME.  Some of the framework could be generalised and
> > made target-independent, but the main purpose of the pass (using
> > strided registers with certain patterns and constraints) is highly
> > target-specific.
>
> .. not sure about this one though.
>
> Richard.
>
> > Thanks,
> > Richard
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-12-08 15:47 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-06 13:46 [PATCH v6] aarch64: New RTL optimization pass avoid-store-forwarding Manos Anagnostakis
2023-12-06 14:28 ` Richard Biener
2023-12-06 15:10   ` Manos Anagnostakis
2023-12-06 17:15     ` Manos Anagnostakis
2023-12-07  7:35       ` Richard Biener
2023-12-07 10:27         ` Manos Anagnostakis
2023-12-06 18:44   ` Philipp Tomsich
2023-12-07  7:31     ` Richard Biener
2023-12-07 12:20       ` Richard Sandiford
2023-12-07 14:09         ` Richard Biener
2023-12-08 15:47           ` Manos Anagnostakis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).