public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
@ 2023-07-13 14:13 Manolis Tsamis
  2023-07-13 15:05 ` Manolis Tsamis
  0 siblings, 1 reply; 13+ messages in thread
From: Manolis Tsamis @ 2023-07-13 14:13 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jeff Law, Philipp Tomsich, Richard Biener, Manolis Tsamis

This is a new RTL pass that tries to optimize memory offset calculations
by moving them from add immediate instructions to the memory loads/stores.
For example it can transform this:

  addi t4,sp,16
  add  t2,a6,t4
  shl  t3,t2,1
  ld   a2,0(t3)
  addi a2,1
  sd   a2,8(t2)

into the following (one instruction less):

  add  t2,a6,sp
  shl  t3,t2,1
  ld   a2,32(t3)
  addi a2,1
  sd   a2,24(t2)

Although there are places where this is done already, this pass is more
powerful and can handle the more difficult cases that are currently not
optimized. Also, it runs late enough and can optimize away unnecessary
stack pointer calculations.

gcc/ChangeLog:

	* Makefile.in: Add fold-mem-offsets.o.
	* passes.def: Schedule a new pass.
	* tree-pass.h (make_pass_fold_mem_offsets): Declare.
	* common.opt: New options.
	* doc/invoke.texi: Document new option.
	* fold-mem-offsets.cc: New file.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/fold-mem-offsets-1.c: New test.
	* gcc.target/riscv/fold-mem-offsets-2.c: New test.
	* gcc.target/riscv/fold-mem-offsets-3.c: New test.

Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
---

Changes in v3:
        - Added propagation for more codes:
          sub, neg, mul.
        - Added folding / elimination for sub and
          const int moves.
        - For the validity check of the generated addresses
          also test memory_address_addr_space_p.
        - Replaced GEN_INT with gen_int_mode.
        - Replaced some bitmap_head with auto_bitmap.
        - Refactor each phase into own function for readability.
        - Add dump details.
        - Replace rtx iteration with reg_mentioned_p.
        - Return early for codes that we can't propagate through.

 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   4 +
 gcc/doc/invoke.texi                           |   8 +
 gcc/fold-mem-offsets.cc                       | 749 ++++++++++++++++++
 gcc/passes.def                                |   1 +
 .../gcc.target/riscv/fold-mem-offsets-1.c     |  16 +
 .../gcc.target/riscv/fold-mem-offsets-2.c     |  24 +
 .../gcc.target/riscv/fold-mem-offsets-3.c     |  17 +
 gcc/tree-pass.h                               |   1 +
 9 files changed, 821 insertions(+)
 create mode 100644 gcc/fold-mem-offsets.cc
 create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index c478ec85201..6a5c2915133 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1430,6 +1430,7 @@ OBJS = \
 	fixed-value.o \
 	fold-const.o \
 	fold-const-call.o \
+	fold-mem-offsets.o \
 	function.o \
 	function-abi.o \
 	function-tests.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 25f650e2dae..901947f1db5 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1248,6 +1248,10 @@ fcprop-registers
 Common Var(flag_cprop_registers) Optimization
 Perform a register copy-propagation optimization pass.
 
+ffold-mem-offsets
+Target Bool Var(flag_fold_mem_offsets) Init(1)
+Fold instructions calculating memory offsets to the memory access instruction if possible.
+
 fcrossjumping
 Common Var(flag_crossjumping) Optimization
 Perform cross-jumping optimization.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index cbc1282c274..dc4e6922bb5 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -539,6 +539,7 @@ Objective-C and Objective-C++ Dialects}.
 -fauto-inc-dec  -fbranch-probabilities
 -fcaller-saves
 -fcombine-stack-adjustments  -fconserve-stack
+-ffold-mem-offsets
 -fcompare-elim  -fcprop-registers  -fcrossjumping
 -fcse-follow-jumps  -fcse-skip-blocks  -fcx-fortran-rules
 -fcx-limited-range
@@ -14293,6 +14294,13 @@ the comparison operation before register allocation is complete.
 
 Enabled at levels @option{-O1}, @option{-O2}, @option{-O3}, @option{-Os}.
 
+@opindex ffold-mem-offsets
+@item -ffold-mem-offsets
+@itemx -fno-fold-mem-offsets
+Try to eliminate add instructions by folding them in memory loads/stores.
+
+Enabled at levels @option{-O2}, @option{-O3}.
+
 @opindex fcprop-registers
 @item -fcprop-registers
 After register allocation and post-register allocation instruction splitting,
diff --git a/gcc/fold-mem-offsets.cc b/gcc/fold-mem-offsets.cc
new file mode 100644
index 00000000000..a27c9ab18a4
--- /dev/null
+++ b/gcc/fold-mem-offsets.cc
@@ -0,0 +1,749 @@
+/* Late RTL pass to fold memory offsets.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "rtl.h"
+#include "tree.h"
+#include "expr.h"
+#include "backend.h"
+#include "regs.h"
+#include "target.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "predict.h"
+#include "df.h"
+#include "tree-pass.h"
+#include "cfgrtl.h"
+
+/* This pass tries to optimize memory offset calculations by moving constants
+   from add instructions to the memory instructions (loads / stores).
+   For example it can transform code like this:
+
+     add  t4, sp, 16
+     add  t2, a6, t4
+     shl  t3, t2, 1
+     ld   a2, 0(t3)
+     add  a2, 1
+     sd   a2, 8(t2)
+
+   into the following (one instruction less):
+
+     add  t2, a6, sp
+     shl  t3, t2, 1
+     ld   a2, 32(t3)
+     add  a2, 1
+     sd   a2, 24(t2)
+
+   Although the previous passes try to emit efficient offset calculations
+   this pass is still beneficial because:
+
+    - The mechanisms that optimize memory offsets usually work with specific
+      patterns or have limitations.  This pass is designed to fold offsets
+      through complex calculations that affect multiple memory operations
+      and have partially overlapping calculations.
+
+    - There are cases where add instructions are introduced in late rtl passes
+      and the rest of the pipeline cannot eliminate them.  Arrays and structs
+      allocated on the stack can result in unwanted add instructions that
+      cannot be eliminated easily.
+
+   This pass works on a basic block level and consists of 4 phases:
+
+    - Phase 1 (Analysis): Find "foldable" instructions.
+      Foldable instructions are those that we know how to propagate
+      a constant addition through (add, shift, move, ...) and only have other
+      foldable instructions for uses.  In that phase a DFS traversal on the
+      definition tree is performed and foldable instructions are marked on
+      a bitmap.  The add immediate instructions that are reachable in this
+      DFS are candidates for folding since all the intermediate calculations
+      affected by them are also foldable.
+
+    - Phase 2 (Validity): Traverse and calculate the offsets that would result
+      from folding the add immediate instructions.  Check whether the
+      calculated offsets result in a valid instruction for the target.
+
+    - Phase 3 (Commit offsets): Traverse again.  It is now known which folds
+      are valid so at this point change the offsets in the memory instructions.
+
+    - Phase 4 (Commit instruction deletions): Scan all instructions and delete
+      or simplify (reduce to move) all add immediate instructions that were
+      folded.
+
+   This pass should run before hard register propagation because it creates
+   register moves that we expect to be eliminated.  */
+
+namespace {
+
+const pass_data pass_data_fold_mem =
+{
+  RTL_PASS, /* type */
+  "fold_mem_offsets", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_fold_mem_offsets : public rtl_opt_pass
+{
+public:
+  pass_fold_mem_offsets (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_fold_mem, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return flag_fold_mem_offsets && optimize >= 2;
+    }
+
+  virtual unsigned int execute (function *);
+}; // class pass_fold_mem_offsets
+
+/* Tracks which instructions can be reached through instructions that can
+   propagate offsets for folding.  */
+static bitmap_head can_fold_insns;
+
+/* Marks instructions that are currently eligible for folding.  */
+static bitmap_head candidate_fold_insns;
+
+/* Tracks instructions that cannot be folded because it turned out that
+   folding will result in creating an invalid memory instruction.
+   An instruction can be in both CANDIDATE_FOLD_INSNS and CANNOT_FOLD_INSNS
+   at the same time, in which case it is not legal to fold.  */
+static bitmap_head cannot_fold_insns;
+
+/* The number of instructions that were simplified or eliminated.  */
+static int stats_fold_count;
+
+enum fold_mem_phase
+{
+  FM_PHASE_ANALYSIS,
+  FM_PHASE_VALIDITY,
+  FM_PHASE_COMMIT_OFFSETS,
+  FM_PHASE_COMMIT_INSNS
+};
+
+/* Get the single reaching definition of an instruction inside a BB.
+   The definition is desired for REG used in INSN.
+   Return the definition insn or NULL if there's no definition with
+   the desired criteria.  */
+static rtx_insn*
+get_single_def_in_bb (rtx_insn *insn, rtx reg)
+{
+  df_ref use;
+  struct df_link *ref_chain, *ref_link;
+
+  FOR_EACH_INSN_USE (use, insn)
+    {
+      if (GET_CODE (DF_REF_REG (use)) == SUBREG)
+	return NULL;
+      if (REGNO (DF_REF_REG (use)) == REGNO (reg))
+	break;
+    }
+
+  if (!use)
+    return NULL;
+
+  ref_chain = DF_REF_CHAIN (use);
+
+  if (!ref_chain)
+    return NULL;
+
+  for (ref_link = ref_chain; ref_link; ref_link = ref_link->next)
+    {
+      /* Problem getting some definition for this instruction.  */
+      if (ref_link->ref == NULL)
+	return NULL;
+      if (DF_REF_INSN_INFO (ref_link->ref) == NULL)
+	return NULL;
+      if (global_regs[REGNO (reg)]
+	  && !set_of (reg, DF_REF_INSN (ref_link->ref)))
+	return NULL;
+    }
+
+  if (ref_chain->next)
+    return NULL;
+
+  rtx_insn* def = DF_REF_INSN (ref_chain->ref);
+
+  if (BLOCK_FOR_INSN (def) != BLOCK_FOR_INSN (insn))
+    return NULL;
+
+  if (DF_INSN_LUID (def) > DF_INSN_LUID (insn))
+    return NULL;
+
+  return def;
+}
+
+/* Get all uses of REG which is set in INSN.  Return the use list or NULL if a
+   use is missing / irregular.  If SUCCESS is not NULL then set it to false if
+   there are missing / irregular uses and true otherwise.  */
+static struct df_link*
+get_uses (rtx_insn *insn, rtx reg, bool* success)
+{
+  df_ref def;
+  struct df_link *ref_chain, *ref_link;
+
+  if (success)
+    *success = false;
+
+  FOR_EACH_INSN_DEF (def, insn)
+    if (REGNO (DF_REF_REG (def)) == REGNO (reg))
+      break;
+
+  if (!def)
+    return NULL;
+
+  ref_chain = DF_REF_CHAIN (def);
+
+  for (ref_link = ref_chain; ref_link; ref_link = ref_link->next)
+    {
+      /* Problem getting a use for this instruction.  */
+      if (ref_link->ref == NULL)
+	return NULL;
+      if (DF_REF_CLASS (ref_link->ref) != DF_REF_REGULAR)
+	return NULL;
+    }
+
+  if (success)
+    *success = true;
+
+  return ref_chain;
+}
+
+/* Function that computes the offset that would have to be added to all uses
+   of REG if the instructions marked in FOLDABLE_INSNS were to be eliminated.
+
+   If ANALYZE is true then mark in CAN_FOLD_INSNS which instructions
+   transitively only affect other instructions found in CAN_FOLD_INSNS.
+   If ANALYZE is false then compute the offset required for folding.  */
+static HOST_WIDE_INT
+fold_offsets (rtx_insn* insn, rtx reg, bool analyze, bitmap foldable_insns)
+{
+  rtx_insn* def = get_single_def_in_bb (insn, reg);
+
+  if (!def || GET_CODE (PATTERN (def)) != SET)
+    return 0;
+
+  rtx src = SET_SRC (PATTERN (def));
+  rtx dest = SET_DEST (PATTERN (def));
+
+  if (!REG_P (dest))
+    return 0;
+
+  enum rtx_code code = GET_CODE (src);
+
+  /* Keep these in sync with the switch below.  We need to early out because
+     otherwise a lot of unnecessary work is done in use analysis.  */
+  if (!(code == PLUS || code == MINUS || code == NEG || code == MULT
+	|| code == ASHIFT || code == REG || code == CONST_INT))
+    return 0;
+
+  unsigned int dest_regno = REGNO (dest);
+
+  /* We can only affect the values of GPR registers.  */
+  if (fixed_regs[dest_regno]
+      || !TEST_HARD_REG_BIT (reg_class_contents[GENERAL_REGS], dest_regno))
+    return 0;
+
+  if (analyze)
+    {
+      /* We only fold through instructions that are transitively used as
+	 memory addresses and do not have other uses.  Use the same logic
+	 from offset calculation to visit instructions that can propagate
+	 offsets and keep track of them in CAN_FOLD_INSNS.  */
+      bool success;
+      struct df_link *uses = get_uses (def, dest, &success), *ref_link;
+
+      if (!success)
+	return 0;
+
+      for (ref_link = uses; ref_link; ref_link = ref_link->next)
+	{
+	  rtx_insn* use = DF_REF_INSN (ref_link->ref);
+
+	  if (DEBUG_INSN_P (use))
+	    continue;
+
+	  /* Punt if the use is anything more complicated than a set
+	     (clobber, use, etc).  */
+	  if (!NONJUMP_INSN_P (use) || GET_CODE (PATTERN (use)) != SET)
+	    return 0;
+
+	  /* This use affects instructions outside of CAN_FOLD_INSNS.  */
+	  if (!bitmap_bit_p (&can_fold_insns, INSN_UID (use)))
+	    return 0;
+
+	  rtx use_set = PATTERN (use);
+
+	  /* Special case: A foldable memory store is not foldable if it
+	     mentions DEST outside of the address calculation.  */
+	  if (use_set && MEM_P (SET_DEST (use_set))
+	      && reg_mentioned_p (dest, SET_SRC (use_set)))
+	    return 0;
+	}
+
+      bitmap_set_bit (&can_fold_insns, INSN_UID (def));
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Instruction marked for propagation: ");
+	  print_rtl_single (dump_file, def);
+	}
+    }
+  else
+    {
+      /* We cannot propagate through this instruction.  */
+      if (!bitmap_bit_p (&can_fold_insns, INSN_UID (def)))
+	return 0;
+    }
+
+  switch (code)
+    {
+    case PLUS:
+      {
+	/* Propagate through add.  */
+	rtx arg1 = XEXP (src, 0);
+	rtx arg2 = XEXP (src, 1);
+	HOST_WIDE_INT offset = 0;
+
+	if (REG_P (arg1))
+	  offset += fold_offsets (def, arg1, analyze, foldable_insns);
+	else if (GET_CODE (arg1) == ASHIFT
+		 && REG_P (XEXP (arg1, 0))
+		 && CONST_INT_P (XEXP (arg1, 1)))
+	  {
+	    /* Handle R1 = (R2 << C) + ...  */
+	    HOST_WIDE_INT scale
+	      = (HOST_WIDE_INT_1U << INTVAL (XEXP (arg1, 1)));
+	    offset += scale * fold_offsets (def, XEXP (arg1, 0), analyze,
+					 foldable_insns);
+	  }
+	else if (GET_CODE (arg1) == PLUS
+		 && REG_P (XEXP (arg1, 0))
+		 && REG_P (XEXP (arg1, 1)))
+	  {
+	    /* Handle R1 = (R2 + R3) + ...  */
+	    offset += fold_offsets (def, XEXP (arg1, 0), analyze,
+				    foldable_insns);
+	    offset += fold_offsets (def, XEXP (arg1, 1), analyze,
+				    foldable_insns);
+	  }
+	else if (GET_CODE (arg1) == PLUS
+		 && GET_CODE (XEXP (arg1, 0)) == ASHIFT
+		 && REG_P (XEXP (XEXP (arg1, 0), 0))
+		 && CONST_INT_P (XEXP (XEXP (arg1, 0), 1))
+		 && REG_P (XEXP (arg1, 1)))
+	  {
+	    /* Handle R1 = ((R2 << C) + R3) + ...  */
+	    HOST_WIDE_INT scale
+	      = (HOST_WIDE_INT_1U << INTVAL (XEXP (XEXP (arg1, 0), 1)));
+	    offset += scale * fold_offsets (def, XEXP (XEXP (arg1, 0), 0),
+					    analyze, foldable_insns);
+	    offset += fold_offsets (def, XEXP (arg1, 1), analyze,
+				    foldable_insns);
+	  }
+	else
+	  return 0;
+
+	if (REG_P (arg2))
+	  offset += fold_offsets (def, arg2, analyze, foldable_insns);
+	else if (REG_P (arg1) && CONST_INT_P (arg2) && !analyze)
+	  {
+	    offset += INTVAL (arg2);
+	    /* This is a R1 = R2 + C instruction, candidate for folding.  */
+	    bitmap_set_bit (foldable_insns, INSN_UID (def));
+	  }
+	else
+	  return 0;
+
+	return offset;
+      }
+    case MINUS:
+      {
+	/* Propagate through minus.  */
+	rtx arg1 = XEXP (src, 0);
+	rtx arg2 = XEXP (src, 1);
+	HOST_WIDE_INT offset = 0;
+
+	if (REG_P (arg1))
+	  offset += fold_offsets (def, arg1, analyze, foldable_insns);
+	else
+	  return 0;
+
+	if (REG_P (arg2))
+	  offset -= fold_offsets (def, arg2, analyze, foldable_insns);
+	else if (REG_P (arg1) && CONST_INT_P (arg2) && !analyze)
+	  {
+	    offset -= INTVAL (arg2);
+	    /* This is a R1 = R2 - C instruction, candidate for folding.  */
+	    bitmap_set_bit (foldable_insns, INSN_UID (def));
+	  }
+	else
+	  return 0;
+
+	return offset;
+      }
+    case NEG:
+      {
+	/* Propagate through negation.  */
+	rtx arg1 = XEXP (src, 0);
+	if (REG_P (arg1))
+	  return -fold_offsets (def, arg1, analyze, foldable_insns);
+	else
+	  return 0;
+      }
+    case MULT:
+      {
+	/* Propagate through multiply by constant.  */
+	rtx arg1 = XEXP (src, 0);
+	rtx arg2 = XEXP (src, 1);
+
+	if (REG_P (arg1) && CONST_INT_P (arg2))
+	  {
+	    HOST_WIDE_INT scale = INTVAL (arg2);
+	    return scale * fold_offsets (def, arg1, analyze, foldable_insns);
+	  }
+
+	return 0;
+      }
+    case ASHIFT:
+      {
+	/* Propagate through shift left by constant.  */
+	rtx arg1 = XEXP (src, 0);
+	rtx arg2 = XEXP (src, 1);
+
+	if (REG_P (arg1) && CONST_INT_P (arg2))
+	  {
+	    HOST_WIDE_INT scale = (HOST_WIDE_INT_1U << INTVAL (arg2));
+	    return scale * fold_offsets (def, arg1, analyze, foldable_insns);
+	  }
+
+	return 0;
+      }
+    case REG:
+      /* Propagate through register move.  */
+      return fold_offsets (def, src, analyze, foldable_insns);
+    case CONST_INT:
+      {
+	/* R1 = C is candidate for folding.  */
+	if (!analyze)
+	  bitmap_set_bit (foldable_insns, INSN_UID (def));
+	return INTVAL (src);
+      }
+    default:
+      /* Cannot propagate.  */
+      return 0;
+    }
+}
+
+/* Test if INSN is a memory load / store that can have an offset folded to it.
+   Return true iff INSN is such an instruction and return through MEM_OUT,
+   REG_OUT and OFFSET_OUT the RTX that has a MEM code, the register that is
+   used as a base address and the offset accordingly.
+   All of the out pointers may be NULL in which case they will be ignored.  */
+bool
+get_fold_mem_root (rtx_insn* insn, rtx *mem_out, rtx *reg_out,
+		   HOST_WIDE_INT *offset_out)
+{
+  rtx set = single_set (insn);
+  rtx mem = NULL_RTX;
+
+  if (set != NULL_RTX)
+    {
+      rtx src = SET_SRC (set);
+      rtx dest = SET_DEST (set);
+
+      /* Don't fold when we have unspec / volatile.  */
+      if (GET_CODE (src) == UNSPEC
+	  || GET_CODE (src) == UNSPEC_VOLATILE
+	  || GET_CODE (dest) == UNSPEC
+	  || GET_CODE (dest) == UNSPEC_VOLATILE)
+	return false;
+
+      if (MEM_P (src))
+	mem = src;
+      else if (MEM_P (dest))
+	mem = dest;
+      else if ((GET_CODE (src) == SIGN_EXTEND
+		|| GET_CODE (src) == ZERO_EXTEND)
+	       && MEM_P (XEXP (src, 0)))
+	mem = XEXP (src, 0);
+    }
+
+  if (mem == NULL_RTX)
+    return false;
+
+  rtx mem_addr = XEXP (mem, 0);
+  rtx reg;
+  HOST_WIDE_INT offset;
+
+  if (REG_P (mem_addr))
+    {
+      reg = mem_addr;
+      offset = 0;
+    }
+  else if (GET_CODE (mem_addr) == PLUS
+	   && REG_P (XEXP (mem_addr, 0))
+	   && CONST_INT_P (XEXP (mem_addr, 1)))
+    {
+      reg = XEXP (mem_addr, 0);
+      offset = INTVAL (XEXP (mem_addr, 1));
+    }
+  else
+    return false;
+
+  if (mem_out)
+    *mem_out = mem;
+  if (reg_out)
+    *reg_out = reg;
+  if (offset_out)
+    *offset_out = offset;
+
+  return true;
+}
+
+/* If INSN is a root memory instruction then do a DFS traversal on its
+   definitions and find folding candidates.  */
+static void
+do_analysis (rtx_insn* insn)
+{
+  rtx reg;
+  if (!get_fold_mem_root (insn, NULL, &reg, NULL))
+    return;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Starting analysis from root: ");
+      print_rtl_single (dump_file, insn);
+    }
+
+  /* Analyse folding opportunities for this memory instruction.  */
+  bitmap_set_bit (&can_fold_insns, INSN_UID (insn));
+  fold_offsets (insn, reg, true, NULL);
+}
+
+/* If INSN is a root memory instruction then compute a potentially new offset
+   for it and test if the resulting instruction is valid.  */
+static void
+do_check_validity (rtx_insn* insn)
+{
+  rtx mem, reg;
+  HOST_WIDE_INT cur_offset;
+  if (!get_fold_mem_root (insn, &mem, &reg, &cur_offset))
+    return;
+
+  auto_bitmap fold_insns;
+  HOST_WIDE_INT new_offset
+    = cur_offset + fold_offsets (insn, reg, false, fold_insns);
+
+  /* Test if it is valid to change MEM's address offset to NEW_OFFSET.  */
+  rtx mem_addr = XEXP (mem, 0);
+  machine_mode mode = GET_MODE (mem_addr);
+  XEXP (mem, 0) = gen_rtx_PLUS (mode, reg, gen_int_mode (new_offset, mode));
+
+  bool illegal = insn_invalid_p (insn, false)
+		 || !memory_address_addr_space_p (mode, XEXP (mem, 0),
+						  MEM_ADDR_SPACE (mem));
+
+  /* Restore the instruction.  */
+  XEXP (mem, 0) = mem_addr;
+  INSN_CODE (insn) = -1;
+
+  if (illegal)
+    bitmap_ior_into (&cannot_fold_insns, fold_insns);
+  else
+    bitmap_ior_into (&candidate_fold_insns, fold_insns);
+}
+
+/* If INSN is a root memory instruction that was affected by any folding
+   then update its offset as necessary.  */
+static void
+do_commit_offset (rtx_insn* insn)
+{
+  rtx mem, reg;
+  HOST_WIDE_INT cur_offset;
+  if (!get_fold_mem_root (insn, &mem, &reg, &cur_offset))
+    return;
+
+  auto_bitmap fold_insns;
+  HOST_WIDE_INT new_offset
+    = cur_offset + fold_offsets (insn, reg, false, fold_insns);
+
+  /* If an change turned out illegal in the previous phase then legal
+     transformations that share calculations also become illegal.  */
+  if (bitmap_intersect_p (&cannot_fold_insns, fold_insns))
+    {
+      bitmap_ior_into (&cannot_fold_insns, fold_insns);
+      return;
+    }
+
+  if (new_offset == cur_offset)
+    return;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Memory offset changed from "
+	       HOST_WIDE_INT_PRINT_DEC " to " HOST_WIDE_INT_PRINT_DEC
+	       " for instruction:\n", cur_offset, new_offset);
+      print_rtl_single (dump_file, insn);
+    }
+
+  gcc_assert (!bitmap_empty_p (fold_insns));
+  machine_mode mode = GET_MODE (XEXP (mem, 0));
+  XEXP (mem, 0) = gen_rtx_PLUS (mode, reg, gen_int_mode (new_offset, mode));
+  df_insn_rescan (insn);
+}
+
+/* If INSN is a move / add instruction that was folded then replace its
+   constant part with zero.  */
+static void
+do_commit_insn (rtx_insn* insn)
+{
+  if (bitmap_bit_p (&candidate_fold_insns, INSN_UID (insn))
+      && !bitmap_bit_p (&cannot_fold_insns, INSN_UID (insn)))
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Instruction folded:");
+	  print_rtl_single (dump_file, insn);
+	}
+
+      stats_fold_count++;
+
+      rtx set = single_set (insn);
+      rtx dest = SET_DEST (set);
+      rtx src = SET_SRC (set);
+
+      /* Emit a move and let subsequent passes eliminate it if possible.  */
+      if (GET_CODE (src) == CONST_INT)
+	{
+	  /* INSN is R1 = C.
+	     Replace it with R1 = 0 because C was folded.  */
+	  rtx mov_rtx
+	    = gen_move_insn (dest, gen_int_mode (0, GET_MODE (dest)));
+	  df_insn_rescan (emit_insn_after (mov_rtx, insn));
+	}
+      else
+	{
+	  /* INSN is R1 = R2 + C.
+	     Replace it with R1 = R2 because C was folded.  */
+	  rtx arg1 = XEXP (src, 0);
+
+	  /* If the DEST == ARG1 then the move is a no-op.  */
+	  if (REGNO (dest) != REGNO (arg1))
+	    {
+	      gcc_checking_assert (GET_MODE (dest) == GET_MODE (arg1));
+	      rtx mov_rtx = gen_move_insn (dest, arg1);
+	      df_insn_rescan (emit_insn_after (mov_rtx, insn));
+	    }
+	}
+
+      /* Delete the original move / add instruction.  */
+      delete_insn (insn);
+    }
+}
+
+/* Wrapper function to perform the actions defined by PHASE for INSN.  */
+static void
+fold_mem_offsets_driver (rtx_insn* insn, int phase)
+{
+  switch (phase)
+    {
+      case FM_PHASE_ANALYSIS:
+	do_analysis (insn);
+	return;
+      case FM_PHASE_VALIDITY:
+	do_check_validity (insn);
+	return;
+      case FM_PHASE_COMMIT_OFFSETS:
+	do_commit_offset (insn);
+	return;
+      case FM_PHASE_COMMIT_INSNS:
+	do_commit_insn (insn);
+	return;
+    }
+}
+
+unsigned int
+pass_fold_mem_offsets::execute (function *fn)
+{
+  df_set_flags (DF_RD_PRUNE_DEAD_DEFS | DF_DEFER_INSN_RESCAN);
+  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
+  df_analyze ();
+
+  bitmap_initialize (&can_fold_insns, NULL);
+  bitmap_initialize (&candidate_fold_insns, NULL);
+  bitmap_initialize (&cannot_fold_insns, NULL);
+
+  stats_fold_count = 0;
+
+  basic_block bb;
+  rtx_insn *insn;
+  FOR_ALL_BB_FN (bb, fn)
+    {
+      /* There is a conflict between this pass and RISCV's shorten-memrefs
+	  pass.  For now disable folding if optimizing for size because
+	  otherwise this cancels the effects of shorten-memrefs.  */
+      if (optimize_bb_for_size_p (bb))
+	continue;
+
+      bitmap_clear (&can_fold_insns);
+      bitmap_clear (&candidate_fold_insns);
+      bitmap_clear (&cannot_fold_insns);
+
+      FOR_BB_INSNS (bb, insn)
+	fold_mem_offsets_driver (insn, FM_PHASE_ANALYSIS);
+
+      FOR_BB_INSNS (bb, insn)
+	fold_mem_offsets_driver (insn, FM_PHASE_VALIDITY);
+
+      FOR_BB_INSNS (bb, insn)
+	fold_mem_offsets_driver (insn, FM_PHASE_COMMIT_OFFSETS);
+
+      FOR_BB_INSNS (bb, insn)
+	fold_mem_offsets_driver (insn, FM_PHASE_COMMIT_INSNS);
+    }
+
+  statistics_counter_event (cfun, "Number of folded instructions",
+			    stats_fold_count);
+
+  bitmap_release (&can_fold_insns);
+  bitmap_release (&candidate_fold_insns);
+  bitmap_release (&cannot_fold_insns);
+
+  return 0;
+}
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_fold_mem_offsets (gcc::context *ctxt)
+{
+  return new pass_fold_mem_offsets (ctxt);
+}
diff --git a/gcc/passes.def b/gcc/passes.def
index faa5208b26b..59522d9a4c1 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -508,6 +508,7 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_peephole2);
 	  NEXT_PASS (pass_if_after_reload);
 	  NEXT_PASS (pass_regrename);
+	  NEXT_PASS (pass_fold_mem_offsets);
 	  NEXT_PASS (pass_cprop_hardreg);
 	  NEXT_PASS (pass_fast_rtl_dce);
 	  NEXT_PASS (pass_reorder_blocks);
diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
new file mode 100644
index 00000000000..574cc92b6ab
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfold-mem-offsets" } */
+
+void sink(int arr[2]);
+
+void
+foo(int a, int b, int i)
+{
+  int arr[2] = {a, b};
+  arr[i]++;
+  sink(arr);
+}
+
+// Should compile without negative memory offsets when using -mfold-mem-offsets
+/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
+/* { dg-final { scan-assembler-not "sw\t.*,-.*\\(.*\\)" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
new file mode 100644
index 00000000000..e6c251d3a3c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfold-mem-offsets" } */
+
+void sink(int arr[3]);
+
+void
+foo(int a, int b, int c, int i)
+{
+  int arr1[3] = {a, b, c};
+  int arr2[3] = {a, c, b};
+  int arr3[3] = {c, b, a};
+
+  arr1[i]++;
+  arr2[i]++;
+  arr3[i]++;
+  
+  sink(arr1);
+  sink(arr2);
+  sink(arr3);
+}
+
+// Should compile without negative memory offsets when using -mfold-mem-offsets
+/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
+/* { dg-final { scan-assembler-not "sw\t.*,-.*\\(.*\\)" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c
new file mode 100644
index 00000000000..8586d3e3a29
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfold-mem-offsets" } */
+
+void load(int arr[2]);
+
+int
+foo(long unsigned int i)
+{
+  int arr[2];
+  load(arr);
+
+  return arr[3 * i + 77];
+}
+
+// Should compile without negative memory offsets when using -mfold-mem-offsets
+/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
+/* { dg-final { scan-assembler-not "addi\t.*,.*,77" } } */
\ No newline at end of file
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 6cdaed7d4b2..6d93af8fd57 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -618,6 +618,7 @@ extern rtl_opt_pass *make_pass_sched_fusion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_peephole2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_regrename (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_fold_mem_offsets (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_cprop_hardreg (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_reorder_blocks (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_leaf_regs (gcc::context *ctxt);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-13 14:13 [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets Manolis Tsamis
@ 2023-07-13 15:05 ` Manolis Tsamis
  2023-07-14  5:35   ` Jeff Law
  0 siblings, 1 reply; 13+ messages in thread
From: Manolis Tsamis @ 2023-07-13 15:05 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jeff Law, Philipp Tomsich, Richard Biener, Jakub Jelinek

In this version I have made f-m-o able to also eliminate constant
moves in addition to the add constant instructions.
This increases the number of simplified/eliminated instructions and is
a good addition for RISC style ISAs where these are more common.

This has led to pr52146.c failing in x86, which I haven't been able to
find a way to fix.
This involves directly writing to a constant address with -mx32

The code
        movl    $-18874240, %eax
        movl    $0, (%eax)

is 'optimized' to
        movl    $0, %eax
        movl    $0, -18874240(%eax)

Which is actually
        movl    $0, -18874240

which is wrong per the ticket.
The fix for the ticket involved changes to legitimate_address_p which
f-m-o does call but it doesn't reject due to the existence of (%eax)
which in turn is actually zero.
I believe this is not strictly an f-m-o issue since the pass calls all
the required functions to test whether the newly synthesized memory
instruction is valid.

Any ideas on how to solve this issue is appreciated.

Manolis

On Thu, Jul 13, 2023 at 5:13 PM Manolis Tsamis <manolis.tsamis@vrull.eu> wrote:
>
> This is a new RTL pass that tries to optimize memory offset calculations
> by moving them from add immediate instructions to the memory loads/stores.
> For example it can transform this:
>
>   addi t4,sp,16
>   add  t2,a6,t4
>   shl  t3,t2,1
>   ld   a2,0(t3)
>   addi a2,1
>   sd   a2,8(t2)
>
> into the following (one instruction less):
>
>   add  t2,a6,sp
>   shl  t3,t2,1
>   ld   a2,32(t3)
>   addi a2,1
>   sd   a2,24(t2)
>
> Although there are places where this is done already, this pass is more
> powerful and can handle the more difficult cases that are currently not
> optimized. Also, it runs late enough and can optimize away unnecessary
> stack pointer calculations.
>
> gcc/ChangeLog:
>
>         * Makefile.in: Add fold-mem-offsets.o.
>         * passes.def: Schedule a new pass.
>         * tree-pass.h (make_pass_fold_mem_offsets): Declare.
>         * common.opt: New options.
>         * doc/invoke.texi: Document new option.
>         * fold-mem-offsets.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/fold-mem-offsets-1.c: New test.
>         * gcc.target/riscv/fold-mem-offsets-2.c: New test.
>         * gcc.target/riscv/fold-mem-offsets-3.c: New test.
>
> Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
> ---
>
> Changes in v3:
>         - Added propagation for more codes:
>           sub, neg, mul.
>         - Added folding / elimination for sub and
>           const int moves.
>         - For the validity check of the generated addresses
>           also test memory_address_addr_space_p.
>         - Replaced GEN_INT with gen_int_mode.
>         - Replaced some bitmap_head with auto_bitmap.
>         - Refactor each phase into own function for readability.
>         - Add dump details.
>         - Replace rtx iteration with reg_mentioned_p.
>         - Return early for codes that we can't propagate through.
>
>  gcc/Makefile.in                               |   1 +
>  gcc/common.opt                                |   4 +
>  gcc/doc/invoke.texi                           |   8 +
>  gcc/fold-mem-offsets.cc                       | 749 ++++++++++++++++++
>  gcc/passes.def                                |   1 +
>  .../gcc.target/riscv/fold-mem-offsets-1.c     |  16 +
>  .../gcc.target/riscv/fold-mem-offsets-2.c     |  24 +
>  .../gcc.target/riscv/fold-mem-offsets-3.c     |  17 +
>  gcc/tree-pass.h                               |   1 +
>  9 files changed, 821 insertions(+)
>  create mode 100644 gcc/fold-mem-offsets.cc
>  create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index c478ec85201..6a5c2915133 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1430,6 +1430,7 @@ OBJS = \
>         fixed-value.o \
>         fold-const.o \
>         fold-const-call.o \
> +       fold-mem-offsets.o \
>         function.o \
>         function-abi.o \
>         function-tests.o \
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 25f650e2dae..901947f1db5 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1248,6 +1248,10 @@ fcprop-registers
>  Common Var(flag_cprop_registers) Optimization
>  Perform a register copy-propagation optimization pass.
>
> +ffold-mem-offsets
> +Target Bool Var(flag_fold_mem_offsets) Init(1)
> +Fold instructions calculating memory offsets to the memory access instruction if possible.
> +
>  fcrossjumping
>  Common Var(flag_crossjumping) Optimization
>  Perform cross-jumping optimization.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index cbc1282c274..dc4e6922bb5 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -539,6 +539,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fauto-inc-dec  -fbranch-probabilities
>  -fcaller-saves
>  -fcombine-stack-adjustments  -fconserve-stack
> +-ffold-mem-offsets
>  -fcompare-elim  -fcprop-registers  -fcrossjumping
>  -fcse-follow-jumps  -fcse-skip-blocks  -fcx-fortran-rules
>  -fcx-limited-range
> @@ -14293,6 +14294,13 @@ the comparison operation before register allocation is complete.
>
>  Enabled at levels @option{-O1}, @option{-O2}, @option{-O3}, @option{-Os}.
>
> +@opindex ffold-mem-offsets
> +@item -ffold-mem-offsets
> +@itemx -fno-fold-mem-offsets
> +Try to eliminate add instructions by folding them in memory loads/stores.
> +
> +Enabled at levels @option{-O2}, @option{-O3}.
> +
>  @opindex fcprop-registers
>  @item -fcprop-registers
>  After register allocation and post-register allocation instruction splitting,
> diff --git a/gcc/fold-mem-offsets.cc b/gcc/fold-mem-offsets.cc
> new file mode 100644
> index 00000000000..a27c9ab18a4
> --- /dev/null
> +++ b/gcc/fold-mem-offsets.cc
> @@ -0,0 +1,749 @@
> +/* Late RTL pass to fold memory offsets.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "tm.h"
> +#include "rtl.h"
> +#include "tree.h"
> +#include "expr.h"
> +#include "backend.h"
> +#include "regs.h"
> +#include "target.h"
> +#include "memmodel.h"
> +#include "emit-rtl.h"
> +#include "insn-config.h"
> +#include "recog.h"
> +#include "predict.h"
> +#include "df.h"
> +#include "tree-pass.h"
> +#include "cfgrtl.h"
> +
> +/* This pass tries to optimize memory offset calculations by moving constants
> +   from add instructions to the memory instructions (loads / stores).
> +   For example it can transform code like this:
> +
> +     add  t4, sp, 16
> +     add  t2, a6, t4
> +     shl  t3, t2, 1
> +     ld   a2, 0(t3)
> +     add  a2, 1
> +     sd   a2, 8(t2)
> +
> +   into the following (one instruction less):
> +
> +     add  t2, a6, sp
> +     shl  t3, t2, 1
> +     ld   a2, 32(t3)
> +     add  a2, 1
> +     sd   a2, 24(t2)
> +
> +   Although the previous passes try to emit efficient offset calculations
> +   this pass is still beneficial because:
> +
> +    - The mechanisms that optimize memory offsets usually work with specific
> +      patterns or have limitations.  This pass is designed to fold offsets
> +      through complex calculations that affect multiple memory operations
> +      and have partially overlapping calculations.
> +
> +    - There are cases where add instructions are introduced in late rtl passes
> +      and the rest of the pipeline cannot eliminate them.  Arrays and structs
> +      allocated on the stack can result in unwanted add instructions that
> +      cannot be eliminated easily.
> +
> +   This pass works on a basic block level and consists of 4 phases:
> +
> +    - Phase 1 (Analysis): Find "foldable" instructions.
> +      Foldable instructions are those that we know how to propagate
> +      a constant addition through (add, shift, move, ...) and only have other
> +      foldable instructions for uses.  In that phase a DFS traversal on the
> +      definition tree is performed and foldable instructions are marked on
> +      a bitmap.  The add immediate instructions that are reachable in this
> +      DFS are candidates for folding since all the intermediate calculations
> +      affected by them are also foldable.
> +
> +    - Phase 2 (Validity): Traverse and calculate the offsets that would result
> +      from folding the add immediate instructions.  Check whether the
> +      calculated offsets result in a valid instruction for the target.
> +
> +    - Phase 3 (Commit offsets): Traverse again.  It is now known which folds
> +      are valid so at this point change the offsets in the memory instructions.
> +
> +    - Phase 4 (Commit instruction deletions): Scan all instructions and delete
> +      or simplify (reduce to move) all add immediate instructions that were
> +      folded.
> +
> +   This pass should run before hard register propagation because it creates
> +   register moves that we expect to be eliminated.  */
> +
> +namespace {
> +
> +const pass_data pass_data_fold_mem =
> +{
> +  RTL_PASS, /* type */
> +  "fold_mem_offsets", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_fold_mem_offsets : public rtl_opt_pass
> +{
> +public:
> +  pass_fold_mem_offsets (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_fold_mem, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      return flag_fold_mem_offsets && optimize >= 2;
> +    }
> +
> +  virtual unsigned int execute (function *);
> +}; // class pass_fold_mem_offsets
> +
> +/* Tracks which instructions can be reached through instructions that can
> +   propagate offsets for folding.  */
> +static bitmap_head can_fold_insns;
> +
> +/* Marks instructions that are currently eligible for folding.  */
> +static bitmap_head candidate_fold_insns;
> +
> +/* Tracks instructions that cannot be folded because it turned out that
> +   folding will result in creating an invalid memory instruction.
> +   An instruction can be in both CANDIDATE_FOLD_INSNS and CANNOT_FOLD_INSNS
> +   at the same time, in which case it is not legal to fold.  */
> +static bitmap_head cannot_fold_insns;
> +
> +/* The number of instructions that were simplified or eliminated.  */
> +static int stats_fold_count;
> +
> +enum fold_mem_phase
> +{
> +  FM_PHASE_ANALYSIS,
> +  FM_PHASE_VALIDITY,
> +  FM_PHASE_COMMIT_OFFSETS,
> +  FM_PHASE_COMMIT_INSNS
> +};
> +
> +/* Get the single reaching definition of an instruction inside a BB.
> +   The definition is desired for REG used in INSN.
> +   Return the definition insn or NULL if there's no definition with
> +   the desired criteria.  */
> +static rtx_insn*
> +get_single_def_in_bb (rtx_insn *insn, rtx reg)
> +{
> +  df_ref use;
> +  struct df_link *ref_chain, *ref_link;
> +
> +  FOR_EACH_INSN_USE (use, insn)
> +    {
> +      if (GET_CODE (DF_REF_REG (use)) == SUBREG)
> +       return NULL;
> +      if (REGNO (DF_REF_REG (use)) == REGNO (reg))
> +       break;
> +    }
> +
> +  if (!use)
> +    return NULL;
> +
> +  ref_chain = DF_REF_CHAIN (use);
> +
> +  if (!ref_chain)
> +    return NULL;
> +
> +  for (ref_link = ref_chain; ref_link; ref_link = ref_link->next)
> +    {
> +      /* Problem getting some definition for this instruction.  */
> +      if (ref_link->ref == NULL)
> +       return NULL;
> +      if (DF_REF_INSN_INFO (ref_link->ref) == NULL)
> +       return NULL;
> +      if (global_regs[REGNO (reg)]
> +         && !set_of (reg, DF_REF_INSN (ref_link->ref)))
> +       return NULL;
> +    }
> +
> +  if (ref_chain->next)
> +    return NULL;
> +
> +  rtx_insn* def = DF_REF_INSN (ref_chain->ref);
> +
> +  if (BLOCK_FOR_INSN (def) != BLOCK_FOR_INSN (insn))
> +    return NULL;
> +
> +  if (DF_INSN_LUID (def) > DF_INSN_LUID (insn))
> +    return NULL;
> +
> +  return def;
> +}
> +
> +/* Get all uses of REG which is set in INSN.  Return the use list or NULL if a
> +   use is missing / irregular.  If SUCCESS is not NULL then set it to false if
> +   there are missing / irregular uses and true otherwise.  */
> +static struct df_link*
> +get_uses (rtx_insn *insn, rtx reg, bool* success)
> +{
> +  df_ref def;
> +  struct df_link *ref_chain, *ref_link;
> +
> +  if (success)
> +    *success = false;
> +
> +  FOR_EACH_INSN_DEF (def, insn)
> +    if (REGNO (DF_REF_REG (def)) == REGNO (reg))
> +      break;
> +
> +  if (!def)
> +    return NULL;
> +
> +  ref_chain = DF_REF_CHAIN (def);
> +
> +  for (ref_link = ref_chain; ref_link; ref_link = ref_link->next)
> +    {
> +      /* Problem getting a use for this instruction.  */
> +      if (ref_link->ref == NULL)
> +       return NULL;
> +      if (DF_REF_CLASS (ref_link->ref) != DF_REF_REGULAR)
> +       return NULL;
> +    }
> +
> +  if (success)
> +    *success = true;
> +
> +  return ref_chain;
> +}
> +
> +/* Function that computes the offset that would have to be added to all uses
> +   of REG if the instructions marked in FOLDABLE_INSNS were to be eliminated.
> +
> +   If ANALYZE is true then mark in CAN_FOLD_INSNS which instructions
> +   transitively only affect other instructions found in CAN_FOLD_INSNS.
> +   If ANALYZE is false then compute the offset required for folding.  */
> +static HOST_WIDE_INT
> +fold_offsets (rtx_insn* insn, rtx reg, bool analyze, bitmap foldable_insns)
> +{
> +  rtx_insn* def = get_single_def_in_bb (insn, reg);
> +
> +  if (!def || GET_CODE (PATTERN (def)) != SET)
> +    return 0;
> +
> +  rtx src = SET_SRC (PATTERN (def));
> +  rtx dest = SET_DEST (PATTERN (def));
> +
> +  if (!REG_P (dest))
> +    return 0;
> +
> +  enum rtx_code code = GET_CODE (src);
> +
> +  /* Keep these in sync with the switch below.  We need to early out because
> +     otherwise a lot of unnecessary work is done in use analysis.  */
> +  if (!(code == PLUS || code == MINUS || code == NEG || code == MULT
> +       || code == ASHIFT || code == REG || code == CONST_INT))
> +    return 0;
> +
> +  unsigned int dest_regno = REGNO (dest);
> +
> +  /* We can only affect the values of GPR registers.  */
> +  if (fixed_regs[dest_regno]
> +      || !TEST_HARD_REG_BIT (reg_class_contents[GENERAL_REGS], dest_regno))
> +    return 0;
> +
> +  if (analyze)
> +    {
> +      /* We only fold through instructions that are transitively used as
> +        memory addresses and do not have other uses.  Use the same logic
> +        from offset calculation to visit instructions that can propagate
> +        offsets and keep track of them in CAN_FOLD_INSNS.  */
> +      bool success;
> +      struct df_link *uses = get_uses (def, dest, &success), *ref_link;
> +
> +      if (!success)
> +       return 0;
> +
> +      for (ref_link = uses; ref_link; ref_link = ref_link->next)
> +       {
> +         rtx_insn* use = DF_REF_INSN (ref_link->ref);
> +
> +         if (DEBUG_INSN_P (use))
> +           continue;
> +
> +         /* Punt if the use is anything more complicated than a set
> +            (clobber, use, etc).  */
> +         if (!NONJUMP_INSN_P (use) || GET_CODE (PATTERN (use)) != SET)
> +           return 0;
> +
> +         /* This use affects instructions outside of CAN_FOLD_INSNS.  */
> +         if (!bitmap_bit_p (&can_fold_insns, INSN_UID (use)))
> +           return 0;
> +
> +         rtx use_set = PATTERN (use);
> +
> +         /* Special case: A foldable memory store is not foldable if it
> +            mentions DEST outside of the address calculation.  */
> +         if (use_set && MEM_P (SET_DEST (use_set))
> +             && reg_mentioned_p (dest, SET_SRC (use_set)))
> +           return 0;
> +       }
> +
> +      bitmap_set_bit (&can_fold_insns, INSN_UID (def));
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "Instruction marked for propagation: ");
> +         print_rtl_single (dump_file, def);
> +       }
> +    }
> +  else
> +    {
> +      /* We cannot propagate through this instruction.  */
> +      if (!bitmap_bit_p (&can_fold_insns, INSN_UID (def)))
> +       return 0;
> +    }
> +
> +  switch (code)
> +    {
> +    case PLUS:
> +      {
> +       /* Propagate through add.  */
> +       rtx arg1 = XEXP (src, 0);
> +       rtx arg2 = XEXP (src, 1);
> +       HOST_WIDE_INT offset = 0;
> +
> +       if (REG_P (arg1))
> +         offset += fold_offsets (def, arg1, analyze, foldable_insns);
> +       else if (GET_CODE (arg1) == ASHIFT
> +                && REG_P (XEXP (arg1, 0))
> +                && CONST_INT_P (XEXP (arg1, 1)))
> +         {
> +           /* Handle R1 = (R2 << C) + ...  */
> +           HOST_WIDE_INT scale
> +             = (HOST_WIDE_INT_1U << INTVAL (XEXP (arg1, 1)));
> +           offset += scale * fold_offsets (def, XEXP (arg1, 0), analyze,
> +                                        foldable_insns);
> +         }
> +       else if (GET_CODE (arg1) == PLUS
> +                && REG_P (XEXP (arg1, 0))
> +                && REG_P (XEXP (arg1, 1)))
> +         {
> +           /* Handle R1 = (R2 + R3) + ...  */
> +           offset += fold_offsets (def, XEXP (arg1, 0), analyze,
> +                                   foldable_insns);
> +           offset += fold_offsets (def, XEXP (arg1, 1), analyze,
> +                                   foldable_insns);
> +         }
> +       else if (GET_CODE (arg1) == PLUS
> +                && GET_CODE (XEXP (arg1, 0)) == ASHIFT
> +                && REG_P (XEXP (XEXP (arg1, 0), 0))
> +                && CONST_INT_P (XEXP (XEXP (arg1, 0), 1))
> +                && REG_P (XEXP (arg1, 1)))
> +         {
> +           /* Handle R1 = ((R2 << C) + R3) + ...  */
> +           HOST_WIDE_INT scale
> +             = (HOST_WIDE_INT_1U << INTVAL (XEXP (XEXP (arg1, 0), 1)));
> +           offset += scale * fold_offsets (def, XEXP (XEXP (arg1, 0), 0),
> +                                           analyze, foldable_insns);
> +           offset += fold_offsets (def, XEXP (arg1, 1), analyze,
> +                                   foldable_insns);
> +         }
> +       else
> +         return 0;
> +
> +       if (REG_P (arg2))
> +         offset += fold_offsets (def, arg2, analyze, foldable_insns);
> +       else if (REG_P (arg1) && CONST_INT_P (arg2) && !analyze)
> +         {
> +           offset += INTVAL (arg2);
> +           /* This is a R1 = R2 + C instruction, candidate for folding.  */
> +           bitmap_set_bit (foldable_insns, INSN_UID (def));
> +         }
> +       else
> +         return 0;
> +
> +       return offset;
> +      }
> +    case MINUS:
> +      {
> +       /* Propagate through minus.  */
> +       rtx arg1 = XEXP (src, 0);
> +       rtx arg2 = XEXP (src, 1);
> +       HOST_WIDE_INT offset = 0;
> +
> +       if (REG_P (arg1))
> +         offset += fold_offsets (def, arg1, analyze, foldable_insns);
> +       else
> +         return 0;
> +
> +       if (REG_P (arg2))
> +         offset -= fold_offsets (def, arg2, analyze, foldable_insns);
> +       else if (REG_P (arg1) && CONST_INT_P (arg2) && !analyze)
> +         {
> +           offset -= INTVAL (arg2);
> +           /* This is a R1 = R2 - C instruction, candidate for folding.  */
> +           bitmap_set_bit (foldable_insns, INSN_UID (def));
> +         }
> +       else
> +         return 0;
> +
> +       return offset;
> +      }
> +    case NEG:
> +      {
> +       /* Propagate through negation.  */
> +       rtx arg1 = XEXP (src, 0);
> +       if (REG_P (arg1))
> +         return -fold_offsets (def, arg1, analyze, foldable_insns);
> +       else
> +         return 0;
> +      }
> +    case MULT:
> +      {
> +       /* Propagate through multiply by constant.  */
> +       rtx arg1 = XEXP (src, 0);
> +       rtx arg2 = XEXP (src, 1);
> +
> +       if (REG_P (arg1) && CONST_INT_P (arg2))
> +         {
> +           HOST_WIDE_INT scale = INTVAL (arg2);
> +           return scale * fold_offsets (def, arg1, analyze, foldable_insns);
> +         }
> +
> +       return 0;
> +      }
> +    case ASHIFT:
> +      {
> +       /* Propagate through shift left by constant.  */
> +       rtx arg1 = XEXP (src, 0);
> +       rtx arg2 = XEXP (src, 1);
> +
> +       if (REG_P (arg1) && CONST_INT_P (arg2))
> +         {
> +           HOST_WIDE_INT scale = (HOST_WIDE_INT_1U << INTVAL (arg2));
> +           return scale * fold_offsets (def, arg1, analyze, foldable_insns);
> +         }
> +
> +       return 0;
> +      }
> +    case REG:
> +      /* Propagate through register move.  */
> +      return fold_offsets (def, src, analyze, foldable_insns);
> +    case CONST_INT:
> +      {
> +       /* R1 = C is candidate for folding.  */
> +       if (!analyze)
> +         bitmap_set_bit (foldable_insns, INSN_UID (def));
> +       return INTVAL (src);
> +      }
> +    default:
> +      /* Cannot propagate.  */
> +      return 0;
> +    }
> +}
> +
> +/* Test if INSN is a memory load / store that can have an offset folded to it.
> +   Return true iff INSN is such an instruction and return through MEM_OUT,
> +   REG_OUT and OFFSET_OUT the RTX that has a MEM code, the register that is
> +   used as a base address and the offset accordingly.
> +   All of the out pointers may be NULL in which case they will be ignored.  */
> +bool
> +get_fold_mem_root (rtx_insn* insn, rtx *mem_out, rtx *reg_out,
> +                  HOST_WIDE_INT *offset_out)
> +{
> +  rtx set = single_set (insn);
> +  rtx mem = NULL_RTX;
> +
> +  if (set != NULL_RTX)
> +    {
> +      rtx src = SET_SRC (set);
> +      rtx dest = SET_DEST (set);
> +
> +      /* Don't fold when we have unspec / volatile.  */
> +      if (GET_CODE (src) == UNSPEC
> +         || GET_CODE (src) == UNSPEC_VOLATILE
> +         || GET_CODE (dest) == UNSPEC
> +         || GET_CODE (dest) == UNSPEC_VOLATILE)
> +       return false;
> +
> +      if (MEM_P (src))
> +       mem = src;
> +      else if (MEM_P (dest))
> +       mem = dest;
> +      else if ((GET_CODE (src) == SIGN_EXTEND
> +               || GET_CODE (src) == ZERO_EXTEND)
> +              && MEM_P (XEXP (src, 0)))
> +       mem = XEXP (src, 0);
> +    }
> +
> +  if (mem == NULL_RTX)
> +    return false;
> +
> +  rtx mem_addr = XEXP (mem, 0);
> +  rtx reg;
> +  HOST_WIDE_INT offset;
> +
> +  if (REG_P (mem_addr))
> +    {
> +      reg = mem_addr;
> +      offset = 0;
> +    }
> +  else if (GET_CODE (mem_addr) == PLUS
> +          && REG_P (XEXP (mem_addr, 0))
> +          && CONST_INT_P (XEXP (mem_addr, 1)))
> +    {
> +      reg = XEXP (mem_addr, 0);
> +      offset = INTVAL (XEXP (mem_addr, 1));
> +    }
> +  else
> +    return false;
> +
> +  if (mem_out)
> +    *mem_out = mem;
> +  if (reg_out)
> +    *reg_out = reg;
> +  if (offset_out)
> +    *offset_out = offset;
> +
> +  return true;
> +}
> +
> +/* If INSN is a root memory instruction then do a DFS traversal on its
> +   definitions and find folding candidates.  */
> +static void
> +do_analysis (rtx_insn* insn)
> +{
> +  rtx reg;
> +  if (!get_fold_mem_root (insn, NULL, &reg, NULL))
> +    return;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "Starting analysis from root: ");
> +      print_rtl_single (dump_file, insn);
> +    }
> +
> +  /* Analyse folding opportunities for this memory instruction.  */
> +  bitmap_set_bit (&can_fold_insns, INSN_UID (insn));
> +  fold_offsets (insn, reg, true, NULL);
> +}
> +
> +/* If INSN is a root memory instruction then compute a potentially new offset
> +   for it and test if the resulting instruction is valid.  */
> +static void
> +do_check_validity (rtx_insn* insn)
> +{
> +  rtx mem, reg;
> +  HOST_WIDE_INT cur_offset;
> +  if (!get_fold_mem_root (insn, &mem, &reg, &cur_offset))
> +    return;
> +
> +  auto_bitmap fold_insns;
> +  HOST_WIDE_INT new_offset
> +    = cur_offset + fold_offsets (insn, reg, false, fold_insns);
> +
> +  /* Test if it is valid to change MEM's address offset to NEW_OFFSET.  */
> +  rtx mem_addr = XEXP (mem, 0);
> +  machine_mode mode = GET_MODE (mem_addr);
> +  XEXP (mem, 0) = gen_rtx_PLUS (mode, reg, gen_int_mode (new_offset, mode));
> +
> +  bool illegal = insn_invalid_p (insn, false)
> +                || !memory_address_addr_space_p (mode, XEXP (mem, 0),
> +                                                 MEM_ADDR_SPACE (mem));
> +
> +  /* Restore the instruction.  */
> +  XEXP (mem, 0) = mem_addr;
> +  INSN_CODE (insn) = -1;
> +
> +  if (illegal)
> +    bitmap_ior_into (&cannot_fold_insns, fold_insns);
> +  else
> +    bitmap_ior_into (&candidate_fold_insns, fold_insns);
> +}
> +
> +/* If INSN is a root memory instruction that was affected by any folding
> +   then update its offset as necessary.  */
> +static void
> +do_commit_offset (rtx_insn* insn)
> +{
> +  rtx mem, reg;
> +  HOST_WIDE_INT cur_offset;
> +  if (!get_fold_mem_root (insn, &mem, &reg, &cur_offset))
> +    return;
> +
> +  auto_bitmap fold_insns;
> +  HOST_WIDE_INT new_offset
> +    = cur_offset + fold_offsets (insn, reg, false, fold_insns);
> +
> +  /* If an change turned out illegal in the previous phase then legal
> +     transformations that share calculations also become illegal.  */
> +  if (bitmap_intersect_p (&cannot_fold_insns, fold_insns))
> +    {
> +      bitmap_ior_into (&cannot_fold_insns, fold_insns);
> +      return;
> +    }
> +
> +  if (new_offset == cur_offset)
> +    return;
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Memory offset changed from "
> +              HOST_WIDE_INT_PRINT_DEC " to " HOST_WIDE_INT_PRINT_DEC
> +              " for instruction:\n", cur_offset, new_offset);
> +      print_rtl_single (dump_file, insn);
> +    }
> +
> +  gcc_assert (!bitmap_empty_p (fold_insns));
> +  machine_mode mode = GET_MODE (XEXP (mem, 0));
> +  XEXP (mem, 0) = gen_rtx_PLUS (mode, reg, gen_int_mode (new_offset, mode));
> +  df_insn_rescan (insn);
> +}
> +
> +/* If INSN is a move / add instruction that was folded then replace its
> +   constant part with zero.  */
> +static void
> +do_commit_insn (rtx_insn* insn)
> +{
> +  if (bitmap_bit_p (&candidate_fold_insns, INSN_UID (insn))
> +      && !bitmap_bit_p (&cannot_fold_insns, INSN_UID (insn)))
> +    {
> +      if (dump_file)
> +       {
> +         fprintf (dump_file, "Instruction folded:");
> +         print_rtl_single (dump_file, insn);
> +       }
> +
> +      stats_fold_count++;
> +
> +      rtx set = single_set (insn);
> +      rtx dest = SET_DEST (set);
> +      rtx src = SET_SRC (set);
> +
> +      /* Emit a move and let subsequent passes eliminate it if possible.  */
> +      if (GET_CODE (src) == CONST_INT)
> +       {
> +         /* INSN is R1 = C.
> +            Replace it with R1 = 0 because C was folded.  */
> +         rtx mov_rtx
> +           = gen_move_insn (dest, gen_int_mode (0, GET_MODE (dest)));
> +         df_insn_rescan (emit_insn_after (mov_rtx, insn));
> +       }
> +      else
> +       {
> +         /* INSN is R1 = R2 + C.
> +            Replace it with R1 = R2 because C was folded.  */
> +         rtx arg1 = XEXP (src, 0);
> +
> +         /* If the DEST == ARG1 then the move is a no-op.  */
> +         if (REGNO (dest) != REGNO (arg1))
> +           {
> +             gcc_checking_assert (GET_MODE (dest) == GET_MODE (arg1));
> +             rtx mov_rtx = gen_move_insn (dest, arg1);
> +             df_insn_rescan (emit_insn_after (mov_rtx, insn));
> +           }
> +       }
> +
> +      /* Delete the original move / add instruction.  */
> +      delete_insn (insn);
> +    }
> +}
> +
> +/* Wrapper function to perform the actions defined by PHASE for INSN.  */
> +static void
> +fold_mem_offsets_driver (rtx_insn* insn, int phase)
> +{
> +  switch (phase)
> +    {
> +      case FM_PHASE_ANALYSIS:
> +       do_analysis (insn);
> +       return;
> +      case FM_PHASE_VALIDITY:
> +       do_check_validity (insn);
> +       return;
> +      case FM_PHASE_COMMIT_OFFSETS:
> +       do_commit_offset (insn);
> +       return;
> +      case FM_PHASE_COMMIT_INSNS:
> +       do_commit_insn (insn);
> +       return;
> +    }
> +}
> +
> +unsigned int
> +pass_fold_mem_offsets::execute (function *fn)
> +{
> +  df_set_flags (DF_RD_PRUNE_DEAD_DEFS | DF_DEFER_INSN_RESCAN);
> +  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
> +  df_analyze ();
> +
> +  bitmap_initialize (&can_fold_insns, NULL);
> +  bitmap_initialize (&candidate_fold_insns, NULL);
> +  bitmap_initialize (&cannot_fold_insns, NULL);
> +
> +  stats_fold_count = 0;
> +
> +  basic_block bb;
> +  rtx_insn *insn;
> +  FOR_ALL_BB_FN (bb, fn)
> +    {
> +      /* There is a conflict between this pass and RISCV's shorten-memrefs
> +         pass.  For now disable folding if optimizing for size because
> +         otherwise this cancels the effects of shorten-memrefs.  */
> +      if (optimize_bb_for_size_p (bb))
> +       continue;
> +
> +      bitmap_clear (&can_fold_insns);
> +      bitmap_clear (&candidate_fold_insns);
> +      bitmap_clear (&cannot_fold_insns);
> +
> +      FOR_BB_INSNS (bb, insn)
> +       fold_mem_offsets_driver (insn, FM_PHASE_ANALYSIS);
> +
> +      FOR_BB_INSNS (bb, insn)
> +       fold_mem_offsets_driver (insn, FM_PHASE_VALIDITY);
> +
> +      FOR_BB_INSNS (bb, insn)
> +       fold_mem_offsets_driver (insn, FM_PHASE_COMMIT_OFFSETS);
> +
> +      FOR_BB_INSNS (bb, insn)
> +       fold_mem_offsets_driver (insn, FM_PHASE_COMMIT_INSNS);
> +    }
> +
> +  statistics_counter_event (cfun, "Number of folded instructions",
> +                           stats_fold_count);
> +
> +  bitmap_release (&can_fold_insns);
> +  bitmap_release (&candidate_fold_insns);
> +  bitmap_release (&cannot_fold_insns);
> +
> +  return 0;
> +}
> +
> +} // anon namespace
> +
> +rtl_opt_pass *
> +make_pass_fold_mem_offsets (gcc::context *ctxt)
> +{
> +  return new pass_fold_mem_offsets (ctxt);
> +}
> diff --git a/gcc/passes.def b/gcc/passes.def
> index faa5208b26b..59522d9a4c1 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -508,6 +508,7 @@ along with GCC; see the file COPYING3.  If not see
>           NEXT_PASS (pass_peephole2);
>           NEXT_PASS (pass_if_after_reload);
>           NEXT_PASS (pass_regrename);
> +         NEXT_PASS (pass_fold_mem_offsets);
>           NEXT_PASS (pass_cprop_hardreg);
>           NEXT_PASS (pass_fast_rtl_dce);
>           NEXT_PASS (pass_reorder_blocks);
> diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
> new file mode 100644
> index 00000000000..574cc92b6ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-1.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mfold-mem-offsets" } */
> +
> +void sink(int arr[2]);
> +
> +void
> +foo(int a, int b, int i)
> +{
> +  int arr[2] = {a, b};
> +  arr[i]++;
> +  sink(arr);
> +}
> +
> +// Should compile without negative memory offsets when using -mfold-mem-offsets
> +/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
> +/* { dg-final { scan-assembler-not "sw\t.*,-.*\\(.*\\)" } } */
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
> new file mode 100644
> index 00000000000..e6c251d3a3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-2.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mfold-mem-offsets" } */
> +
> +void sink(int arr[3]);
> +
> +void
> +foo(int a, int b, int c, int i)
> +{
> +  int arr1[3] = {a, b, c};
> +  int arr2[3] = {a, c, b};
> +  int arr3[3] = {c, b, a};
> +
> +  arr1[i]++;
> +  arr2[i]++;
> +  arr3[i]++;
> +
> +  sink(arr1);
> +  sink(arr2);
> +  sink(arr3);
> +}
> +
> +// Should compile without negative memory offsets when using -mfold-mem-offsets
> +/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
> +/* { dg-final { scan-assembler-not "sw\t.*,-.*\\(.*\\)" } } */
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c
> new file mode 100644
> index 00000000000..8586d3e3a29
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/fold-mem-offsets-3.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mfold-mem-offsets" } */
> +
> +void load(int arr[2]);
> +
> +int
> +foo(long unsigned int i)
> +{
> +  int arr[2];
> +  load(arr);
> +
> +  return arr[3 * i + 77];
> +}
> +
> +// Should compile without negative memory offsets when using -mfold-mem-offsets
> +/* { dg-final { scan-assembler-not "lw\t.*,-.*\\(.*\\)" } } */
> +/* { dg-final { scan-assembler-not "addi\t.*,.*,77" } } */
> \ No newline at end of file
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 6cdaed7d4b2..6d93af8fd57 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -618,6 +618,7 @@ extern rtl_opt_pass *make_pass_sched_fusion (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_peephole2 (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_if_after_reload (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_regrename (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_fold_mem_offsets (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_cprop_hardreg (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_reorder_blocks (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_leaf_regs (gcc::context *ctxt);
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-13 15:05 ` Manolis Tsamis
@ 2023-07-14  5:35   ` Jeff Law
  2023-07-18 17:15     ` Manolis Tsamis
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Law @ 2023-07-14  5:35 UTC (permalink / raw)
  To: Manolis Tsamis, gcc-patches
  Cc: Philipp Tomsich, Richard Biener, Jakub Jelinek



On 7/13/23 09:05, Manolis Tsamis wrote:
> In this version I have made f-m-o able to also eliminate constant
> moves in addition to the add constant instructions.
> This increases the number of simplified/eliminated instructions and is
> a good addition for RISC style ISAs where these are more common.
> 
> This has led to pr52146.c failing in x86, which I haven't been able to
> find a way to fix.
> This involves directly writing to a constant address with -mx32
> 
> The code
>          movl    $-18874240, %eax
>          movl    $0, (%eax)
> 
> is 'optimized' to
>          movl    $0, %eax
>          movl    $0, -18874240(%eax)
> 
> Which is actually
>          movl    $0, -18874240
> 
> which is wrong per the ticket.
> The fix for the ticket involved changes to legitimate_address_p which
> f-m-o does call but it doesn't reject due to the existence of (%eax)
> which in turn is actually zero.
> I believe this is not strictly an f-m-o issue since the pass calls all
> the required functions to test whether the newly synthesized memory
> instruction is valid.
> 
> Any ideas on how to solve this issue is appreciated.
I wonder if costing might be useful here.  I would expect the 2nd 
sequence is the most costly of the three if address costing models are 
reasonably accurate.

Another way would be to look at the length of the memory reference insn. 
  If it's larger, then it's likely more costly.

That's what I've got off the top of my head.

Jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-14  5:35   ` Jeff Law
@ 2023-07-18 17:15     ` Manolis Tsamis
  2023-07-18 18:01       ` Jeff Law
  0 siblings, 1 reply; 13+ messages in thread
From: Manolis Tsamis @ 2023-07-18 17:15 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, Philipp Tomsich, Richard Biener, Jakub Jelinek

On Fri, Jul 14, 2023 at 8:35 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 7/13/23 09:05, Manolis Tsamis wrote:
> > In this version I have made f-m-o able to also eliminate constant
> > moves in addition to the add constant instructions.
> > This increases the number of simplified/eliminated instructions and is
> > a good addition for RISC style ISAs where these are more common.
> >
> > This has led to pr52146.c failing in x86, which I haven't been able to
> > find a way to fix.
> > This involves directly writing to a constant address with -mx32
> >
> > The code
> >          movl    $-18874240, %eax
> >          movl    $0, (%eax)
> >
> > is 'optimized' to
> >          movl    $0, %eax
> >          movl    $0, -18874240(%eax)
> >
> > Which is actually
> >          movl    $0, -18874240
> >
> > which is wrong per the ticket.
> > The fix for the ticket involved changes to legitimate_address_p which
> > f-m-o does call but it doesn't reject due to the existence of (%eax)
> > which in turn is actually zero.
> > I believe this is not strictly an f-m-o issue since the pass calls all
> > the required functions to test whether the newly synthesized memory
> > instruction is valid.
> >
> > Any ideas on how to solve this issue is appreciated.
> I wonder if costing might be useful here.  I would expect the 2nd
> sequence is the most costly of the three if address costing models are
> reasonably accurate.
>
> Another way would be to look at the length of the memory reference insn.
>   If it's larger, then it's likely more costly.
>
> That's what I've got off the top of my head.
>

I could test whether the cost function prefers the version that we
want, but that would be a workaround I would like to avoid. It may
also be the case that this reproduces with a different sequence where
the unwanted code is actually more profitable.

I was trying to find out whether the original fix can be extended in a
way that solves this, because having an address that is reported as
legitimate but is actually not could also create issues elsewhere.
But I don't yet have a suggestion on how to fix it yet.

Thanks,
Manolis

> Jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-18 17:15     ` Manolis Tsamis
@ 2023-07-18 18:01       ` Jeff Law
  2023-07-18 23:42         ` Vineet Gupta
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Law @ 2023-07-18 18:01 UTC (permalink / raw)
  To: Manolis Tsamis
  Cc: gcc-patches, Philipp Tomsich, Richard Biener, Jakub Jelinek



On 7/18/23 11:15, Manolis Tsamis wrote:
> On Fri, Jul 14, 2023 at 8:35 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>
>>
>>
>> On 7/13/23 09:05, Manolis Tsamis wrote:
>>> In this version I have made f-m-o able to also eliminate constant
>>> moves in addition to the add constant instructions.
>>> This increases the number of simplified/eliminated instructions and is
>>> a good addition for RISC style ISAs where these are more common.
>>>
>>> This has led to pr52146.c failing in x86, which I haven't been able to
>>> find a way to fix.
>>> This involves directly writing to a constant address with -mx32
>>>
>>> The code
>>>           movl    $-18874240, %eax
>>>           movl    $0, (%eax)
>>>
>>> is 'optimized' to
>>>           movl    $0, %eax
>>>           movl    $0, -18874240(%eax)
>>>
>>> Which is actually
>>>           movl    $0, -18874240
>>>
>>> which is wrong per the ticket.
>>> The fix for the ticket involved changes to legitimate_address_p which
>>> f-m-o does call but it doesn't reject due to the existence of (%eax)
>>> which in turn is actually zero.
>>> I believe this is not strictly an f-m-o issue since the pass calls all
>>> the required functions to test whether the newly synthesized memory
>>> instruction is valid.
>>>
>>> Any ideas on how to solve this issue is appreciated.
>> I wonder if costing might be useful here.  I would expect the 2nd
>> sequence is the most costly of the three if address costing models are
>> reasonably accurate.
>>
>> Another way would be to look at the length of the memory reference insn.
>>    If it's larger, then it's likely more costly.
>>
>> That's what I've got off the top of my head.
>>
> 
> I could test whether the cost function prefers the version that we
> want, but that would be a workaround I would like to avoid. It may
> also be the case that this reproduces with a different sequence where
> the unwanted code is actually more profitable.
> 
> I was trying to find out whether the original fix can be extended in a
> way that solves this, because having an address that is reported as
> legitimate but is actually not could also create issues elsewhere.
> But I don't yet have a suggestion on how to fix it yet.
I was thinking a bit more about this yesterday, and even in the case 
where the new mem crosses a boundary thus making the memory load/store 
more expensive I think we're still OK.

The key is at worst we will have changed an earlier instruction like t = 
sp + <const> into t = sp which should reduce the cost of that earlier 
instruction.  And I would expect the vast majority of the time we 
completely eliminate that earlier instruction.

So this may ultimately be a non-issue.

Vineet @ Rivos has indicated he stumbled across an ICE with the V3 code. 
  Hopefully he'll get a testcase for that extracted shortly.

jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-18 18:01       ` Jeff Law
@ 2023-07-18 23:42         ` Vineet Gupta
  2023-07-19  4:31           ` Jeff Law
  0 siblings, 1 reply; 13+ messages in thread
From: Vineet Gupta @ 2023-07-18 23:42 UTC (permalink / raw)
  To: Jeff Law, Manolis Tsamis
  Cc: gcc-patches, Philipp Tomsich, Richard Biener, Jakub Jelinek,
	gnu-toolchain

[-- Attachment #1: Type: text/plain, Size: 3042 bytes --]

Hi Manolis,

On 7/18/23 11:01, Jeff Law via Gcc-patches wrote:
> Vineet @ Rivos has indicated he stumbled across an ICE with the V3 
> code.  Hopefully he'll get a testcase for that extracted shortly. 

Yeah, I was trying to build SPEC2017 with this patch and ran into ICE 
for several of them with -Ofast build: The reduced test from 455.nab is 
attached here.
The issue happens with v2 as well, so not something introduced by v3.

There's ICE in cprop_hardreg which immediately follows f-m-o.


The protagonist is ins 93 which starts off in combine as a simple set of 
a DF 0.

| sff.i.288r.combine:(insn 93 337 94 8 (set (reg/v:DF 236 [ e ])
| sff.i.288r.combine- (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 
190 {*movdf_hardfloat_rv64}

Subsequently reload transforms it into SP + offset

| sff.i.303r.reload:(insn 93 337 94 9 (set (mem/c:DF (plus:DI (reg/f:DI 
2 sp)
| sff.i.303r.reload- (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
| sff.i.303r.reload- (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190 
{*movdf_hardfloat_rv64}
| sff.i.303r.reload- (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])

It gets processed by f-m-o and lands in cprop_hardreg, where it triggers 
ICE.

| (insn 93 337 523 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
|                 (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
|         (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 -1
^^^
|      (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
|        (nil)))
| during RTL pass: cprop_hardreg

Here's my analysis:

f-m-o: do_check_validity() -> insn_invalid_p() tries to recog() a 
modified version of insn 93 (actually there is no change, so perhaps 
something we can optimize later). The corresponding md pattern 
movdf_hardfloat_rv64 no longer matches since it expects REG_P for 
operand0, while reload has converted it into SP + offset. f-m-o then 
does the right thing by invalidating INSN_CODE=-1 for a subsequent 
recog() to work correctly.
But it seems this -1 lingers into the next pass, and trips up 
copyprop_hardreg_forward_1() -> extract_constrain_insn()
So I don't know what the right fix here should be.

In a run with -fno-fold-mem-offsets, the same insn 93 is successfully 
grok'ed by cprop_hardreg,

| (insn 93 337 522 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
|                (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
|        (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190 
{*movdf_hardfloat_rv64}
^^^^^^^^^^^^^^^
|     (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
|        (nil)))

P.S. I wonder if it is a good idea in general to call recog() post 
reload since the insn could be changed sufficiently to no longer match 
the md patterns. Of course I don't know the answer.

P.S.2 When debugging code, I noticed a minor annoyance in the patch with 
the whole fold_mem_offsets_driver() switch-case indirection. It doesn't 
seem to be serving any purpose, and we could simply call corresponding 
do_* routines in execute () itself.

Thx,
-Vineet

[-- Attachment #2: sff.i --]
[-- Type: text/plain, Size: 690 bytes --]

a, c, d, g, h, i, j, k, m, p, q, r, aa, t, u, x;
double *f, *s;
double l, n, o, v, w;
b() {
  double e, ad, ae, af, ag, ah, ai, aj, am, an, ap, aq, ar, as, at, av, aw;
  for (; q; q = aa) {
    r = f[x];
    ah = f[r + 2] - g;
    af = f[0] - f[r];
    ag = 1 - f[r + 1];
    av = ae * af * ah * ai;
    aj = h - w * p * ah;
    am = o + av * af;
    an = j * o * av * ag;
    ap = (am - m) * ad * (k - an - n) * a - v * c;
    ar = (aj - l) * c;
    if (a)
      ;
    else
    az:
      switch (d) {
      case 1:
        e = aw * i;
        break;
      case 2:
        exit(1);
      }
    s[0] = ap;
    t += at * aq;
    u = as += at * ar;
    if (c)
      goto az;
  }
  return e;
}

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-18 23:42         ` Vineet Gupta
@ 2023-07-19  4:31           ` Jeff Law
  2023-07-19  8:08             ` Manolis Tsamis
  2023-07-20  6:18             ` Vineet Gupta
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff Law @ 2023-07-19  4:31 UTC (permalink / raw)
  To: Vineet Gupta, Manolis Tsamis
  Cc: gcc-patches, Philipp Tomsich, Richard Biener, Jakub Jelinek,
	gnu-toolchain



On 7/18/23 17:42, Vineet Gupta wrote:
> Hi Manolis,
> 
> On 7/18/23 11:01, Jeff Law via Gcc-patches wrote:
>> Vineet @ Rivos has indicated he stumbled across an ICE with the V3 
>> code.  Hopefully he'll get a testcase for that extracted shortly. 
> 
> Yeah, I was trying to build SPEC2017 with this patch and ran into ICE 
> for several of them with -Ofast build: The reduced test from 455.nab is 
> attached here.
> The issue happens with v2 as well, so not something introduced by v3.
> 
> There's ICE in cprop_hardreg which immediately follows f-m-o.
> 
> 
> The protagonist is ins 93 which starts off in combine as a simple set of 
> a DF 0.
> 
> | sff.i.288r.combine:(insn 93 337 94 8 (set (reg/v:DF 236 [ e ])
> | sff.i.288r.combine- (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 
> 190 {*movdf_hardfloat_rv64}
> 
> Subsequently reload transforms it into SP + offset
> 
> | sff.i.303r.reload:(insn 93 337 94 9 (set (mem/c:DF (plus:DI (reg/f:DI 
> 2 sp)
> | sff.i.303r.reload- (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
> | sff.i.303r.reload- (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190 
> {*movdf_hardfloat_rv64}
> | sff.i.303r.reload- (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
> 
> It gets processed by f-m-o and lands in cprop_hardreg, where it triggers 
> ICE.
> 
> | (insn 93 337 523 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
> |                 (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
> |         (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 -1
> ^^^
> |      (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
> |        (nil)))
> | during RTL pass: cprop_hardreg
> 
> Here's my analysis:
> 
> f-m-o: do_check_validity() -> insn_invalid_p() tries to recog() a 
> modified version of insn 93 (actually there is no change, so perhaps 
> something we can optimize later). The corresponding md pattern 
> movdf_hardfloat_rv64 no longer matches since it expects REG_P for 
> operand0, while reload has converted it into SP + offset. f-m-o then 
> does the right thing by invalidating INSN_CODE=-1 for a subsequent 
> recog() to work correctly.
> But it seems this -1 lingers into the next pass, and trips up 
> copyprop_hardreg_forward_1() -> extract_constrain_insn()
> So I don't know what the right fix here should be.
This is a bug in the RISC-V backend.  I actually fixed basically the 
same bug in another backend that was exposed by the f-m-o code.


> 
> In a run with -fno-fold-mem-offsets, the same insn 93 is successfully 
> grok'ed by cprop_hardreg,
> 
> | (insn 93 337 522 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
> |                (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
> |        (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190 
> {*movdf_hardfloat_rv64}
> ^^^^^^^^^^^^^^^
> |     (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
> |        (nil)))
> 
> P.S. I wonder if it is a good idea in general to call recog() post 
> reload since the insn could be changed sufficiently to no longer match 
> the md patterns. Of course I don't know the answer.
If this ever causes a problem, it's a backend bug.  It's that simple.

Conceptually it should always be safe to set INSN_CODE to -1 for any insn.

Odds are for this specific case in the RV backend, we just need a 
constraint to store 0.0 into a memory location.  That can actually be 
implemented as a store from x0 since 0.0 has the bit pattern 0x0.  This 
is probably a good thing to expose anyway as an optimization and can 
move forward independently of the f-m-o patch.



> 
> P.S.2 When debugging code, I noticed a minor annoyance in the patch with 
> the whole fold_mem_offsets_driver() switch-case indirection. It doesn't 
> seem to be serving any purpose, and we could simply call corresponding 
> do_* routines in execute () itself.
We were in the process of squashing some of this out of the 
implementation.   I hadn't looked at the V3 patch to see how much 
progress had been made on this yet.

Thanks for digging into this!

jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-19  4:31           ` Jeff Law
@ 2023-07-19  8:08             ` Manolis Tsamis
  2023-07-19 14:16               ` Jeff Law
  2023-07-20  6:18             ` Vineet Gupta
  1 sibling, 1 reply; 13+ messages in thread
From: Manolis Tsamis @ 2023-07-19  8:08 UTC (permalink / raw)
  To: Jeff Law
  Cc: Vineet Gupta, gcc-patches, Philipp Tomsich, Richard Biener,
	Jakub Jelinek, gnu-toolchain

Hi Vineet, Jeff,

On Wed, Jul 19, 2023 at 7:31 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 7/18/23 17:42, Vineet Gupta wrote:
> > Hi Manolis,
> >
> > On 7/18/23 11:01, Jeff Law via Gcc-patches wrote:
> >> Vineet @ Rivos has indicated he stumbled across an ICE with the V3
> >> code.  Hopefully he'll get a testcase for that extracted shortly.
> >
> > Yeah, I was trying to build SPEC2017 with this patch and ran into ICE
> > for several of them with -Ofast build: The reduced test from 455.nab is
> > attached here.
> > The issue happens with v2 as well, so not something introduced by v3.
> >
> > There's ICE in cprop_hardreg which immediately follows f-m-o.
> >
> >
> > The protagonist is ins 93 which starts off in combine as a simple set of
> > a DF 0.
> >
> > | sff.i.288r.combine:(insn 93 337 94 8 (set (reg/v:DF 236 [ e ])
> > | sff.i.288r.combine- (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11
> > 190 {*movdf_hardfloat_rv64}
> >
> > Subsequently reload transforms it into SP + offset
> >
> > | sff.i.303r.reload:(insn 93 337 94 9 (set (mem/c:DF (plus:DI (reg/f:DI
> > 2 sp)
> > | sff.i.303r.reload- (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
> > | sff.i.303r.reload- (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190
> > {*movdf_hardfloat_rv64}
> > | sff.i.303r.reload- (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
> >
> > It gets processed by f-m-o and lands in cprop_hardreg, where it triggers
> > ICE.
> >
> > | (insn 93 337 523 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
> > |                 (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
> > |         (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 -1
> > ^^^
> > |      (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
> > |        (nil)))
> > | during RTL pass: cprop_hardreg
> >
> > Here's my analysis:
> >
> > f-m-o: do_check_validity() -> insn_invalid_p() tries to recog() a
> > modified version of insn 93 (actually there is no change, so perhaps
> > something we can optimize later). The corresponding md pattern
> > movdf_hardfloat_rv64 no longer matches since it expects REG_P for
> > operand0, while reload has converted it into SP + offset. f-m-o then
> > does the right thing by invalidating INSN_CODE=-1 for a subsequent
> > recog() to work correctly.
> > But it seems this -1 lingers into the next pass, and trips up
> > copyprop_hardreg_forward_1() -> extract_constrain_insn()
> > So I don't know what the right fix here should be.
> This is a bug in the RISC-V backend.  I actually fixed basically the
> same bug in another backend that was exposed by the f-m-o code.
>

I stumbled upon the same thing when doing an aarch64 bootstrap build yesterday.
Given that this causes issues, maybe doing
  int icode = INSN_CODE (insn);
  ...
  INSN_CODE (insn) = icode;
Is a good option and should also be more performant.
Even with that I'm still getting a segfault while doing a bootstrap
build that I'm investigating.

>
> >
> > In a run with -fno-fold-mem-offsets, the same insn 93 is successfully
> > grok'ed by cprop_hardreg,
> >
> > | (insn 93 337 522 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
> > |                (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
> > |        (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190
> > {*movdf_hardfloat_rv64}
> > ^^^^^^^^^^^^^^^
> > |     (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
> > |        (nil)))
> >
> > P.S. I wonder if it is a good idea in general to call recog() post
> > reload since the insn could be changed sufficiently to no longer match
> > the md patterns. Of course I don't know the answer.
> If this ever causes a problem, it's a backend bug.  It's that simple.
>
> Conceptually it should always be safe to set INSN_CODE to -1 for any insn.
>
> Odds are for this specific case in the RV backend, we just need a
> constraint to store 0.0 into a memory location.  That can actually be
> implemented as a store from x0 since 0.0 has the bit pattern 0x0.  This
> is probably a good thing to expose anyway as an optimization and can
> move forward independently of the f-m-o patch.
>
>
>
> >
> > P.S.2 When debugging code, I noticed a minor annoyance in the patch with
> > the whole fold_mem_offsets_driver() switch-case indirection. It doesn't
> > seem to be serving any purpose, and we could simply call corresponding
> > do_* routines in execute () itself.
> We were in the process of squashing some of this out of the
> implementation.   I hadn't looked at the V3 patch to see how much
> progress had been made on this yet.
>

Thanks for pointing that out Vineet!
When I refactored the code in the separate do_* functions it never
occured to me that both the _driver function and the state enum are
now useless. I will remove all of this in the next iteration.

Manolis

> Thanks for digging into this!
>
> jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-19  8:08             ` Manolis Tsamis
@ 2023-07-19 14:16               ` Jeff Law
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Law @ 2023-07-19 14:16 UTC (permalink / raw)
  To: Manolis Tsamis
  Cc: Vineet Gupta, gcc-patches, Philipp Tomsich, Richard Biener,
	Jakub Jelinek, gnu-toolchain



On 7/19/23 02:08, Manolis Tsamis wrote:
de.
>>
> 
> I stumbled upon the same thing when doing an aarch64 bootstrap build yesterday.
> Given that this causes issues, maybe doing
>    int icode = INSN_CODE (insn);
>    ...
>    INSN_CODE (insn) = icode;
> Is a good option and should also be more performant.
I nearly suggested you do something like this, but ultimately it's a 
workaround for target bugs.  So part of me wants to keep it as is, but I 
can also understand the desire to make a chance like you've suggesting.


> Even with that I'm still getting a segfault while doing a bootstrap
> build that I'm investigating.
Sounds good.  I still need to drop the V3 into my tester and validate 
the m68k (and everything else) just works.  I'm slightly concerned about 
SH, but it's still failing even after taking the V2 out of the tester, 
so the SH issues are clearly unrelated to f-m-o.

I'll take Vineet's testcase and verify that we can just use an integer 
store to handle the 0.0 case.  As I mentioned, that's the right thing to 
do anyway from both a correctness and performance standpoint.  I'll also 
review the movsf pattern for the same problem/optimization.

jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-19  4:31           ` Jeff Law
  2023-07-19  8:08             ` Manolis Tsamis
@ 2023-07-20  6:18             ` Vineet Gupta
  2023-07-20 21:59               ` Jeff Law
  1 sibling, 1 reply; 13+ messages in thread
From: Vineet Gupta @ 2023-07-20  6:18 UTC (permalink / raw)
  To: Jeff Law, Manolis Tsamis
  Cc: gcc-patches, Philipp Tomsich, Richard Biener, Jakub Jelinek,
	gnu-toolchain



On 7/18/23 21:31, Jeff Law via Gcc-patches wrote:
>>
>> In a run with -fno-fold-mem-offsets, the same insn 93 is successfully 
>> grok'ed by cprop_hardreg,
>>
>> | (insn 93 337 522 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
>> |                (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
>> |        (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190 
>> {*movdf_hardfloat_rv64}
>> ^^^^^^^^^^^^^^^
>> |     (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
>> |        (nil)))
>>
>> P.S. I wonder if it is a good idea in general to call recog() post 
>> reload since the insn could be changed sufficiently to no longer 
>> match the md patterns. Of course I don't know the answer.
> If this ever causes a problem, it's a backend bug.  It's that simple.
>
> Conceptually it should always be safe to set INSN_CODE to -1 for any 
> insn.

Sure the -1 should be handled, but are you implying that f-mo- will 
always generate a valid combination and recog() failing is simply a bug 
in backend and/or f-m-o. If not, the -1 setting can potentially trigger 
an ICE in future.

>
> Odds are for this specific case in the RV backend, we just need a 
> constraint to store 0.0 into a memory location.  That can actually be 
> implemented as a store from x0 since 0.0 has the bit pattern 0x0.  
> This is probably a good thing to expose anyway as an optimization and 
> can move forward independently of the f-m-o patch.

I call dibs on this :-) Seems like an interesting little side project.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748

-Vineet

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-20  6:18             ` Vineet Gupta
@ 2023-07-20 21:59               ` Jeff Law
  2023-08-07 14:44                 ` Manolis Tsamis
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Law @ 2023-07-20 21:59 UTC (permalink / raw)
  To: Vineet Gupta, Manolis Tsamis
  Cc: gcc-patches, Philipp Tomsich, Richard Biener, Jakub Jelinek,
	gnu-toolchain



On 7/20/23 00:18, Vineet Gupta wrote:
> 
> 
> On 7/18/23 21:31, Jeff Law via Gcc-patches wrote:
>>>
>>> In a run with -fno-fold-mem-offsets, the same insn 93 is successfully 
>>> grok'ed by cprop_hardreg,
>>>
>>> | (insn 93 337 522 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
>>> |                (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
>>> |        (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190 
>>> {*movdf_hardfloat_rv64}
>>> ^^^^^^^^^^^^^^^
>>> |     (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
>>> |        (nil)))
>>>
>>> P.S. I wonder if it is a good idea in general to call recog() post 
>>> reload since the insn could be changed sufficiently to no longer 
>>> match the md patterns. Of course I don't know the answer.
>> If this ever causes a problem, it's a backend bug.  It's that simple.
>>
>> Conceptually it should always be safe to set INSN_CODE to -1 for any 
>> insn.
> 
> Sure the -1 should be handled, but are you implying that f-mo- will 
> always generate a valid combination and recog() failing is simply a bug 
> in backend and/or f-m-o. If not, the -1 setting can potentially trigger 
> an ICE in future.
A recog failure after setting INSN_CODE to -1 would always be an 
indicator of a target bug at the point where f-m-o runs.

In that would be generally true as well.  There are some very obscure 
exceptions and those exceptions are for narrow periods of time.



> 
>>
>> Odds are for this specific case in the RV backend, we just need a 
>> constraint to store 0.0 into a memory location.  That can actually be 
>> implemented as a store from x0 since 0.0 has the bit pattern 0x0. This 
>> is probably a good thing to expose anyway as an optimization and can 
>> move forward independently of the f-m-o patch.
> 
> I call dibs on this :-) Seems like an interesting little side project.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748
It's yours  :-)

jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-07-20 21:59               ` Jeff Law
@ 2023-08-07 14:44                 ` Manolis Tsamis
  2023-08-07 17:13                   ` Jeff Law
  0 siblings, 1 reply; 13+ messages in thread
From: Manolis Tsamis @ 2023-08-07 14:44 UTC (permalink / raw)
  To: Jeff Law
  Cc: Vineet Gupta, gcc-patches, Philipp Tomsich, Richard Biener,
	Jakub Jelinek, gnu-toolchain

I have sent out a new v4 of this
(https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626502.html).

In the new version I both restore the INSN_CODE as mentioned here and
I also call recog when I commit the offset change (because there may
be a change from no offset to some offsets).
I have also removed the driver function as per Vineet's suggestion.

Last thing I have fixed a nasty bug with REG_EQUIV's that resulted in
failing bootstrap on AArch64.

The offending sequence looked like this (in 'simplified RTX'):

(set (reg/f:DI 3 x3)
        (plus:DI (reg/f:DI 2 x2)
            (const_int 8 [0x8])))
(set (reg/f:DI 19 x19)
        (plus:DI (reg/f:DI 3 x3)
            (reg:DI 19 x19)))
...
(set (reg:TI 2 x2)
        (mem:TI (reg/f:DI 0 x0)))
     (expr_list:REG_DEAD (reg/f:DI 0 x0)
        (expr_list:REG_EQUIV (mem:TI (reg/f:DI 19 x19))
            (nil)))
(set (mem:TI (reg/f:DI 19 x19))
        (reg:TI 2 x2))
     (expr_list:REG_DEAD (reg/f:DI 19 x19)
        (expr_list:REG_DEAD (reg:TI 2 x2)
            (nil)))

Were the first instruction (x3 = x2 + 8) was folded in the address
calculation of the last one, resulting in (mem:TI (plus:DI (reg/f:DI
19 x19) (const_int 8 [0x8])), but then the previous REG_EQUIV was
incorrect due to the modified runtime value of x19.
For now I opted to treat REG_EQ/EQUIV notes as uses that we cannot
handle, so if any of them are involved we don't fold offsets. Although
it would may be possible to improve this in the future, I think it's
fine for now and the reduction of folded insns is a small %.
I have tested v4 with a number of benchmarks and large projects on
x64, aarch64 and riscv64 without observing any issues. The x86
testsuite issue still exists as I don't have a satisfactory solution
to it yet.

Any feedback for the changes is appreciated!

Thanks,
Manolis

On Fri, Jul 21, 2023 at 12:59 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 7/20/23 00:18, Vineet Gupta wrote:
> >
> >
> > On 7/18/23 21:31, Jeff Law via Gcc-patches wrote:
> >>>
> >>> In a run with -fno-fold-mem-offsets, the same insn 93 is successfully
> >>> grok'ed by cprop_hardreg,
> >>>
> >>> | (insn 93 337 522 11 (set (mem/c:DF (plus:DI (reg/f:DI 2 sp)
> >>> |                (const_int 8 [0x8])) [4 %sfp+-8 S8 A64])
> >>> |        (const_double:DF 0.0 [0x0.0p+0])) "sff.i":23:11 190
> >>> {*movdf_hardfloat_rv64}
> >>> ^^^^^^^^^^^^^^^
> >>> |     (expr_list:REG_EQUAL (const_double:DF 0.0 [0x0.0p+0])
> >>> |        (nil)))
> >>>
> >>> P.S. I wonder if it is a good idea in general to call recog() post
> >>> reload since the insn could be changed sufficiently to no longer
> >>> match the md patterns. Of course I don't know the answer.
> >> If this ever causes a problem, it's a backend bug.  It's that simple.
> >>
> >> Conceptually it should always be safe to set INSN_CODE to -1 for any
> >> insn.
> >
> > Sure the -1 should be handled, but are you implying that f-mo- will
> > always generate a valid combination and recog() failing is simply a bug
> > in backend and/or f-m-o. If not, the -1 setting can potentially trigger
> > an ICE in future.
> A recog failure after setting INSN_CODE to -1 would always be an
> indicator of a target bug at the point where f-m-o runs.
>
> In that would be generally true as well.  There are some very obscure
> exceptions and those exceptions are for narrow periods of time.
>
>
>
> >
> >>
> >> Odds are for this specific case in the RV backend, we just need a
> >> constraint to store 0.0 into a memory location.  That can actually be
> >> implemented as a store from x0 since 0.0 has the bit pattern 0x0. This
> >> is probably a good thing to expose anyway as an optimization and can
> >> move forward independently of the f-m-o patch.
> >
> > I call dibs on this :-) Seems like an interesting little side project.
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110748
> It's yours  :-)
>
> jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets.
  2023-08-07 14:44                 ` Manolis Tsamis
@ 2023-08-07 17:13                   ` Jeff Law
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Law @ 2023-08-07 17:13 UTC (permalink / raw)
  To: Manolis Tsamis
  Cc: Vineet Gupta, gcc-patches, Philipp Tomsich, Richard Biener,
	Jakub Jelinek, gnu-toolchain



On 8/7/23 08:44, Manolis Tsamis wrote:
> I have sent out a new v4 of this
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626502.html).
> 
> In the new version I both restore the INSN_CODE as mentioned here and
> I also call recog when I commit the offset change (because there may
> be a change from no offset to some offsets).
> I have also removed the driver function as per Vineet's suggestion.
> 
> Last thing I have fixed a nasty bug with REG_EQUIV's that resulted in
> failing bootstrap on AArch64.
> 
> The offending sequence looked like this (in 'simplified RTX'):
> 
> (set (reg/f:DI 3 x3)
>          (plus:DI (reg/f:DI 2 x2)
>              (const_int 8 [0x8])))
> (set (reg/f:DI 19 x19)
>          (plus:DI (reg/f:DI 3 x3)
>              (reg:DI 19 x19)))
> ...
> (set (reg:TI 2 x2)
>          (mem:TI (reg/f:DI 0 x0)))
>       (expr_list:REG_DEAD (reg/f:DI 0 x0)
>          (expr_list:REG_EQUIV (mem:TI (reg/f:DI 19 x19))
>              (nil)))
> (set (mem:TI (reg/f:DI 19 x19))
>          (reg:TI 2 x2))
>       (expr_list:REG_DEAD (reg/f:DI 19 x19)
>          (expr_list:REG_DEAD (reg:TI 2 x2)
>              (nil)))
> 
> Were the first instruction (x3 = x2 + 8) was folded in the address
> calculation of the last one, resulting in (mem:TI (plus:DI (reg/f:DI
> 19 x19) (const_int 8 [0x8])), but then the previous REG_EQUIV was
> incorrect due to the modified runtime value of x19.
> For now I opted to treat REG_EQ/EQUIV notes as uses that we cannot
> handle, so if any of them are involved we don't fold offsets. Although
> it would may be possible to improve this in the future, I think it's
> fine for now and the reduction of folded insns is a small %.
> I have tested v4 with a number of benchmarks and large projects on
> x64, aarch64 and riscv64 without observing any issues. The x86
> testsuite issue still exists as I don't have a satisfactory solution
> to it yet.
> 
> Any feedback for the changes is appreciated!
Thanks for the explanation.  You could just remove the REG_EQUIV note. 
It's not used much after allocation & reloading.  Or just ignoring those 
insns as you've done seems reasonable as well.  I doubt one approach is 
clear better than the other.

jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-08-07 17:13 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-13 14:13 [PATCH v3] Implement new RTL optimizations pass: fold-mem-offsets Manolis Tsamis
2023-07-13 15:05 ` Manolis Tsamis
2023-07-14  5:35   ` Jeff Law
2023-07-18 17:15     ` Manolis Tsamis
2023-07-18 18:01       ` Jeff Law
2023-07-18 23:42         ` Vineet Gupta
2023-07-19  4:31           ` Jeff Law
2023-07-19  8:08             ` Manolis Tsamis
2023-07-19 14:16               ` Jeff Law
2023-07-20  6:18             ` Vineet Gupta
2023-07-20 21:59               ` Jeff Law
2023-08-07 14:44                 ` Manolis Tsamis
2023-08-07 17:13                   ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).