[Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
@ 2024-05-30 19:51 Ajit Agarwal
  2024-05-30 21:34 ` Segher Boessenkool
  2024-05-31  9:53 ` Richard Sandiford
  0 siblings, 2 replies; 20+ messages in thread
From: Ajit Agarwal @ 2024-05-30 19:51 UTC (permalink / raw)
  To: Alex Coplan, Richard Sandiford, Kewen.Lin, Segher Boessenkool,
	Michael Meissner, Peter Bergner, David Edelsohn, gcc-patches

Hello All:

Common infrastructure using generic code for pair mem fusion of different
targets.

rs6000 target specific specific code implements virtual functions defined
by generic code.

Code is implemented with pure virtual functions to interface with target
code.

Target specific code are added in rs6000-mem-fusion.cc and additional virtual
function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.

Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.

Thanks & Regards
Ajit


aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion

Common infrastructure using generic code for pair mem fusion of different
targets.

rs6000 target specific specific code implements virtual functions defined
by generic code.

Code is implemented with pure virtual functions to interface with target
code.

Target specific code are added in rs6000-mem-fusion.cc and additional virtual
function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.

2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>

gcc/ChangeLog:

	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
	implementation of additional virtual functions added in pair_fusion
	struct.
	* config/rs6000/rs6000-passes.def: New mem fusion pass
	before pass_early_remat.
	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
	Add target specific implementation using pure virtual
	functions.
	* config.gcc: Add new object file.
	* config/rs6000/rs6000-protos.h: Add new prototype for mem
	fusion pass.
	* config/rs6000/t-rs6000: Add new rule.
	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
	from private.

gcc/testsuite/ChangeLog:

	* g++.target/powerpc/me-fusion.C: New test.
	* g++.target/powerpc/mem-fusion-1.C: New test.
	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc                                |   2 +
 gcc/config/aarch64/aarch64-ldp-fusion.cc      |  23 +
 gcc/config/rs6000/rs6000-mem-fusion.cc        | 629 ++++++++++++++++++
 gcc/config/rs6000/rs6000-passes.def           |   4 +-
 gcc/config/rs6000/rs6000-protos.h             |   1 +
 gcc/config/rs6000/t-rs6000                    |   5 +
 gcc/pair-fusion.cc                            |  18 +-
 gcc/pair-fusion.h                             |  20 +
 gcc/rtl-ssa/accesses.h                        |   2 +-
 .../g++.target/powerpc/mem-fusion-1.C         |  22 +
 gcc/testsuite/g++.target/powerpc/mem-fusion.C |  15 +
 .../gcc.target/powerpc/mma-builtin-1.c        |   4 +-
 12 files changed, 740 insertions(+), 5 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-mem-fusion.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index a37113bd00a..1beabc35d52 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -524,6 +524,7 @@ powerpc*-*-*)
 	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
+	extra_objs="${extra_objs} rs6000-mem-fusion.o"
 	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
 	extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
 	extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
@@ -560,6 +561,7 @@ rs6000*-*-*)
 	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
 	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
+	extra_objs="${extra_objs} rs6000-mem-fusion.o"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
 	;;
diff --git a/gcc/config/aarch64/aarch64-ldp-fusion.cc b/gcc/config/aarch64/aarch64-ldp-fusion.cc
index 0af927231d3..784cdc3937c 100644
--- a/gcc/config/aarch64/aarch64-ldp-fusion.cc
+++ b/gcc/config/aarch64/aarch64-ldp-fusion.cc
@@ -104,6 +104,29 @@ struct aarch64_pair_fusion : public pair_fusion
 				  bool load_p) override final;
 
   rtx destructure_pair (rtx regs[2], rtx pattern, bool load_p) override final;
+
+  bool should_handle_unordered_insns (rtl_ssa::insn_info *,
+				      rtl_ssa::insn_info *) override final
+  {
+    return true;
+  }
+
+  bool fuseable_store_p (rtl_ssa::insn_info *,
+			 rtl_ssa::insn_info *) override final
+  {
+    return true;
+  }
+
+  bool fuseable_load_p (rtl_ssa::insn_info *) override final
+  {
+    return true;
+  }
+
+  void set_multiword_subreg (rtl_ssa::insn_info *, rtl_ssa::insn_info *,
+			     bool) override final
+  {
+    return;
+  }
 };
 
 bool
diff --git a/gcc/config/rs6000/rs6000-mem-fusion.cc b/gcc/config/rs6000/rs6000-mem-fusion.cc
new file mode 100644
index 00000000000..d4291c855e7
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-mem-fusion.cc
@@ -0,0 +1,629 @@
+/* Subroutines used to perform adjacent load/store into
+   paired memory accesses for TARGET_POWER10 and TARGET_VSX.
+
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "df.h"
+#include "rtl.h"
+#include "rtl-iter.h"
+#include "rtl-ssa.h"
+#include "tree-pass.h"
+#include "ordered-hash-map.h"
+#include "pair-fusion.h"
+
+using namespace rtl_ssa;
+
+struct rs6000_pair_fusion : public pair_fusion
+{
+  bool fpsimd_op_p (rtx , machine_mode , bool)  override final
+  {
+    return false;
+  }
+
+  bool pair_mem_insn_p (rtx_insn *, bool &) override final
+  {
+    return false;
+  }
+
+  bool pair_mem_ok_with_policy (rtx, bool) override final
+  {
+    return true;
+  }
+
+  bool pair_operand_mode_ok_p (machine_mode mode) override final;
+
+  rtx gen_pair (rtx *pats, rtx writeback, bool load_p) override final;
+
+  bool pair_reg_operand_ok_p (bool, rtx, machine_mode) override final
+  {
+    return true;
+  }
+
+  int pair_mem_alias_check_limit () override final
+  {
+    return 0;
+  }
+
+  bool should_handle_writeback (enum writeback) override final
+  {
+    return false;
+  }
+
+  bool track_loads_p () override final
+  {
+    return true;
+  }
+
+  bool track_stores_p () override final
+  {
+    return true;
+  }
+
+  bool pair_mem_in_range_p (HOST_WIDE_INT) override final
+  {
+    return true;
+  }
+
+  rtx gen_promote_writeback_pair (rtx, rtx, rtx *, bool) override final
+  {
+    return NULL_RTX;
+  }
+
+  rtx destructure_pair (rtx_def **, rtx, bool) override final
+  {
+    return NULL_RTX;
+  }
+
+  bool should_handle_unordered_insns (insn_info *i1,
+				      insn_info *i2) override final
+  {
+    if (*i1 > *i2)
+      return false;
+
+    return true;
+  }
+
+  bool fuseable_store_p (insn_info *i1, insn_info *i2) override final;
+
+  bool fuseable_load_p (insn_info *insn) override final;
+
+  void set_multiword_subreg (insn_info *i1, insn_info *i2,
+			     bool load_p) override final;
+};
+
+bool
+rs6000_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
+{
+  return (ALTIVEC_OR_VSX_VECTOR_MODE (mode));
+
+}
+
+/* df_insn_rescan the unspec instruction where operands
+   are reversed given insn_info INFO.  */
+static void
+set_rescan (insn_info *info)
+{
+  for (auto def : info->defs())
+    {
+      auto set = dyn_cast<set_info *> (def);
+      if (set && set->has_any_uses ())
+	{
+	  for (auto use : set->all_uses())
+	    {
+	      if (use->insn () && use->insn ()->rtl ())
+		{
+		  rtx_insn *rtl_insn = use->insn ()->rtl ();
+		  rtx set = single_set (rtl_insn);
+
+		  if (set == NULL_RTX)
+		    return;
+
+		  rtx op0 = SET_SRC (set);
+		  if (GET_CODE (op0) != UNSPEC)
+		    return;
+
+		  use->set_is_live_out_use (true);
+		  df_insn_rescan (rtl_insn);
+		}
+	     }
+	}
+    }
+}
+
+/* df_insn_rescan the def instruction where operands
+   are reversed given INSN.  */
+static bool
+set_rescan_store (insn_info *insn)
+{
+  for (auto use : insn->uses())
+    {
+      auto def = use->def ();
+
+      if (def->insn ()->is_artificial())
+	return false;
+
+      if (def->insn () && def->insn ()->rtl ()
+	  && def->insn()->is_real() )
+	{
+	  rtx_insn *rtl_insn = def->insn ()->rtl ();
+	  rtx set = single_set (rtl_insn);
+
+	  if (set == NULL_RTX)
+	    return false;
+	  df_insn_rescan (rtl_insn);
+	}
+    }
+
+  return true;
+}
+
+ordered_hash_map<rtx_insn *, bool> insn_map;
+
+/* Return false if dependent def is load instruction given INSN otherwise
+   false.  */
+static bool
+feasible_store_p (rtx_insn *insn, bool immediate_dep)
+{
+  df_ref use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+
+  FOR_EACH_INSN_INFO_USE (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+
+      if (!def_link || !def_link->ref
+	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+
+      rtx_insn *select_insn2 = DF_REF_INSN (def_link->ref);
+
+      if (select_insn2 == NULL)
+	continue;
+
+      if (select_insn2 == insn)
+	return true;
+
+      while (def_link && def_link->ref)
+	{
+	  rtx set = single_set (select_insn2);
+	  rtx insn_set = single_set (insn);
+
+	  if (set != NULL_RTX && insn_set != NULL_RTX)
+	    {
+	      if (GET_MODE (SET_SRC (set)) != GET_MODE (SET_SRC (insn_set)))
+		{
+		  if (GET_MODE (SET_SRC (set)) == OOmode)
+		    return false;
+
+		  immediate_dep = false;
+		}
+		else
+		  {
+		    if (immediate_dep && MEM_P (SET_SRC (set)))
+		      return false;
+		  }
+
+		if (insn_map.get (select_insn2))
+		  return true;
+		else
+		  insn_map.put (select_insn2, true);
+
+		if (!feasible_store_p (select_insn2, immediate_dep))
+		  return false;
+	    }
+	  def_link = def_link->next;
+	}
+     }
+  return true;
+}
+
+/* Check for feasibility of store to be fuseable
+   or not. Return true if feasible otherwise false.  */
+static bool
+feasible_store_p (insn_info *insn)
+{
+  for (auto use : insn->uses ())
+    {
+      auto def = use->def ();
+
+      if (def->insn ()->is_artificial ())
+	return false;
+
+      if (def->insn () && def->insn ()->rtl ()
+	  && def->insn()->is_real ())
+	{
+	  rtx_insn *rtl_insn = def->insn ()->rtl ();
+	  rtx set = single_set (rtl_insn);
+
+	  if (set == NULL_RTX)
+	    return false;
+
+	  /* Return false if dependent def is load.  */
+	  if (rtl_insn && MEM_P (SET_SRC (set)))
+	    return false;
+
+	  /* Return false if dependent def is store.  */
+	  if (rtl_insn && MEM_P (SET_DEST (set)))
+	    return false;
+
+	  /* Return false if dependent def is parallel.  */
+	  if (GET_CODE (PATTERN (rtl_insn)) == PARALLEL)
+	    return false;
+
+	  rtx src = SET_SRC (set);
+	  rtx_code code = GET_CODE (src);
+
+	  /* Return false if dependent def is CONST_VECTOR or UNSPEC.  */
+	  if (code == CONST_VECTOR || code == UNSPEC)
+	    return false;
+
+	  /* Recursively check for dependent instruction is Load.  */
+	  if (!feasible_store_p (rtl_insn, true))
+	    return false;
+
+	  if (GET_RTX_CLASS (code) == RTX_TERNARY)
+	    return false;
+	}
+    }
+  return true;
+}
+
+/* Check if store can be fuseable or not.
+   Return true if fuseable otherwise false.  */
+bool
+rs6000_pair_fusion::fuseable_store_p (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl ();
+  rtx_insn *insn2 = i2->rtl ();
+
+  rtx body = PATTERN (insn1);
+  rtx src_exp = SET_SRC (body);
+  rtx insn2_body = PATTERN (insn2);
+  rtx insn2_src_exp = SET_SRC (insn2_body);
+
+  /* Return false if def and use count are not same.  */
+  if (REG_P (src_exp) &&
+      (DF_REG_DEF_COUNT (REGNO (src_exp)) != DF_REG_USE_COUNT (REGNO (src_exp))
+       || DF_REG_USE_COUNT (REGNO (src_exp)) > 1))
+    return false;
+
+  /* Return false if src of insn1 and src of ins2 are same.  */
+  if (src_exp == insn2_src_exp)
+    return false;
+
+  /* Return false if src of insn1 is subreg.  */
+  if (GET_CODE (src_exp) == SUBREG)
+    return false;
+
+  /* Return false if src of insn1 is TImode or TFmode.  */
+  if (GET_MODE (src_exp) == TImode || GET_MODE (src_exp) == TFmode)
+    return false;
+
+  for (rtx note = REG_NOTES (insn1); note; note = XEXP (note, 1))
+    if (REG_NOTE_KIND (note) == REG_EQUAL
+	|| REG_NOTE_KIND (note) == REG_EQUIV)
+      return false;
+
+  if (!feasible_store_p (i1))
+    return false;;
+
+  if (!feasible_store_p (i2))
+    return false;
+
+  return true;
+}
+
+/* Set subreg for def of store INSN given rtx SRC instruction.  */
+static void
+set_store_subreg (rtx_insn *insn, rtx src)
+{
+  rtx set = single_set (insn);
+  rtx src_exp = SET_SRC (set);
+  df_ref use;
+
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  FOR_EACH_INSN_INFO_USE (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+
+      if (!def_link || !def_link->ref
+	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+
+      while (def_link && def_link->ref)
+	{
+	  rtx *loc = DF_REF_LOC (def_link->ref);
+
+	  if (GET_MODE (*loc) == GET_MODE (src_exp))
+	      *loc = copy_rtx (src);
+
+	  def_link = def_link->next;
+	}
+    }
+}
+
+/* Generate store pair stxvp given rtx I1.  */
+static rtx
+rs6000_gen_store_pair (rtx i1)
+{
+  rtx src_exp = SET_SRC (i1);
+  rtx dest_exp = SET_DEST (i1);
+  rtx stxv;
+  PUT_MODE_RAW (src_exp, OOmode);
+  PUT_MODE_RAW (dest_exp, OOmode);
+  stxv = gen_rtx_SET (dest_exp, src_exp);
+  if (dump_file)
+    {
+      fprintf (dump_file, "Replacing stxv with stxvp  \n");
+      print_rtl_single (dump_file, stxv);
+    }
+  return stxv;
+}
+
+/* Check whether load can be fusable or not.
+   Return true if dependent use is UNSPEC otherwise false.  */
+bool
+rs6000_pair_fusion::fuseable_load_p (insn_info *info)
+{
+  rtx_insn *insn = info->rtl ();
+
+  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
+    if (REG_NOTE_KIND (note) == REG_EQUAL
+	|| REG_NOTE_KIND (note) == REG_EQUIV)
+      return false;
+
+  for (auto def : info->defs ())
+    {
+      auto set = dyn_cast<set_info *> (def);
+      if (set && set->has_any_uses ())
+	{
+	  for (auto use : set->all_uses())
+	    {
+	      if (use->insn ()->is_artificial ())
+		return false;
+
+	       insn_info *info = use->insn ();
+
+	       if (info
+		   && info->rtl ()
+		   && info->is_real ())
+		  {
+		    rtx_insn *rtl_insn = info->rtl ();
+		    rtx set = single_set (rtl_insn);
+
+		    if (set == NULL_RTX)
+		      return false;
+
+		    rtx op0 = SET_SRC (set);
+		    if (GET_CODE (op0) != UNSPEC)
+		      return false;
+		  }
+	      }
+	  }
+    }
+  return true;
+}
+
+/* Set subreg with use of INSN given SRC rtx instruction.  */
+static void set_load_subreg (rtx_insn *insn, rtx src)
+{
+  df_ref use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+
+      if (!def_link || !def_link->ref
+	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+
+      while (def_link && def_link->ref)
+	{
+	  rtx *loc = DF_REF_LOC (def_link->ref);
+	  *loc =  copy_rtx (src);
+	  def_link = def_link->next;
+	}
+     }
+}
+
+/* Set subreg for OO mode store pair to generate
+   registers in pairs given insn_info I1 and I2.  */
+static void set_multiword_subreg_store (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl ();
+  rtx_insn *insn2 = i2->rtl ();
+  rtx body = PATTERN (insn1);
+  rtx src_exp = SET_SRC (body);
+  rtx insn2_body = PATTERN (insn2);
+  rtx insn2_dest_exp = SET_DEST (insn2_body);
+  machine_mode mode = GET_MODE (src_exp);
+  int regoff;
+  rtx src;
+  rtx addr = XEXP (insn2_dest_exp, 0);
+
+  PUT_MODE_RAW (src_exp, OOmode);
+  if (GET_CODE (addr) == PLUS
+      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
+    regoff = 16;
+  else
+    regoff = 0;
+
+  src = simplify_gen_subreg (mode,
+			     src_exp, GET_MODE (src_exp),
+			     regoff);
+
+  set_store_subreg (insn1, src);
+
+  int regoff1 = 0;
+  rtx src1;
+
+  src1 = simplify_gen_subreg (mode,
+			      src_exp, GET_MODE (src_exp),
+			      regoff1);
+
+  set_store_subreg (insn2, src1);
+  set_rescan_store (i1);
+  set_rescan_store (i2);
+  df_insn_rescan (insn1);
+}
+
+/* Set subreg for OO mode pair load to generate
+   registers in pairs given insn_info I2 and I2.  */
+static void
+set_multiword_subreg_load (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl();
+  rtx_insn *insn2 = i2->rtl();
+  rtx body = PATTERN (insn1);
+  rtx dest_exp = SET_DEST (body);
+  rtx insn2_body = PATTERN (insn2);
+  machine_mode mode = GET_MODE (dest_exp);
+  PUT_MODE_RAW (dest_exp, OOmode);
+
+  rtx insn2_src_exp = SET_SRC (insn2_body);
+  int regoff = 0;
+  rtx src;
+
+  src = simplify_gen_subreg (mode,
+			     dest_exp, GET_MODE (dest_exp),
+			     regoff);
+
+  set_load_subreg (insn2, src);
+
+  int regoff1;
+  rtx src1;
+  rtx addr = XEXP (insn2_src_exp, 0);
+
+  if (GET_CODE (addr) == PLUS
+      && XEXP (addr, 1)
+      && CONST_INT_P (XEXP(addr, 1)))
+    regoff1 = 16;
+  else
+    regoff1 = 0;
+
+  src1 = simplify_gen_subreg (mode,
+			      dest_exp, GET_MODE (dest_exp),
+			      regoff1);
+
+  set_load_subreg (insn1, src1);
+  set_rescan (i1);
+  set_rescan (i2);
+  df_insn_rescan (insn1);
+}
+
+/* Set subreg for OO mode pair to generate sequential
+   registers given insn_info pairs I1, I2 and LOAD_P is true
+   iff load insn and false if store insn.  */
+void
+rs6000_pair_fusion::set_multiword_subreg(insn_info *i1, insn_info *i2,
+					      bool load_p)
+{
+  if (load_p)
+    set_multiword_subreg_load (i1, i2);
+  else
+    set_multiword_subreg_store (i1, i2);
+}
+
+/* Return load pair rtl instruction given rtx I1.  */
+static rtx
+rs6000_gen_load_pair (rtx i1)
+{
+  rtx src_exp = SET_SRC (i1);
+  rtx dest_exp = SET_DEST (i1);
+  rtx lxv;
+  PUT_MODE_RAW (src_exp, OOmode);
+  PUT_MODE_RAW (dest_exp, OOmode);
+  lxv = gen_rtx_SET (dest_exp, src_exp);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "lxv with lxvp ");
+      print_rtl_single (dump_file, lxv);
+    }
+
+  return lxv;
+}
+
+rtx
+rs6000_pair_fusion::gen_pair (rtx *pats, rtx writeback, bool load_p)
+{
+  if (load_p || writeback)
+    return rs6000_gen_load_pair (pats[0]);
+  else
+    return rs6000_gen_store_pair (pats[0]);
+}
+
+const pass_data pass_data_mem_fusion =
+{
+  RTL_PASS, /* type */
+  "mem_fusion", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_mem_fusion : public rtl_opt_pass
+{
+public:
+  pass_mem_fusion (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_mem_fusion, ctxt)
+  {}
+
+  opt_pass *clone () override { return new pass_mem_fusion (m_ctxt);}
+
+  /* opt_pass methods: */
+  bool gate (function *)
+    {
+      return (optimize > 0 && TARGET_VSX && TARGET_POWER10);
+    }
+
+  unsigned int execute (function *) final override
+    {
+      /* We use DF data flow because we change location rtx
+	 which is easier to find and modify.
+	 We use mix of rtl-ssa def-use and DF data flow
+	 where it is easier.  */
+      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+      df_analyze ();
+      df_set_flags (DF_DEFER_INSN_RESCAN);
+
+      rs6000_pair_fusion pass;
+      pass.run ();
+      return 0;
+    }
+}; // class pass_mem_fusion
+
+rtl_opt_pass *
+make_pass_mem_fusion (gcc::context *ctxt)
+{
+  return new pass_mem_fusion (ctxt);
+}
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index 46a0d0b8c56..0b48f57014d 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,7 +28,9 @@ along with GCC; see the file COPYING3.  If not see
      The power8 does not have instructions that automaticaly do the byte swaps
      for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
-
+  /* Pass to replace adjacent memory addresses lxv/stxv instruction with
+     lxvp/stxvp instruction.  */
+  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_mem_fusion);
   /* Pass to do the PCREL_OPT optimization that combines the load of an
      external symbol's address along with a single load or store using that
      address as a base register.  */
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index 09a57a806fa..1412b31c2eb 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -343,6 +343,7 @@ namespace gcc { class context; }
 class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
+extern rtl_opt_pass *make_pass_mem_fusion (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
index b3ce09d523b..df9b3a35b66 100644
--- a/gcc/config/rs6000/t-rs6000
+++ b/gcc/config/rs6000/t-rs6000
@@ -35,6 +35,11 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
 
+rs6000-mem-fusion.o: $(srcdir)/config/rs6000/rs6000-mem-fusion.cc
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+
+
 rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index 9f897ac04e2..2dbe9f854ef 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -312,7 +312,7 @@ static int
 encode_lfs (lfs_fields fields)
 {
   int size_log2 = exact_log2 (fields.size);
-  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
+  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 6);
   return ((int)fields.load_p << 3)
     | ((int)fields.fpsimd_p << 2)
     | (size_log2 - 2);
@@ -425,6 +425,9 @@ pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
   if (MEM_VOLATILE_P (mem))
     return;
 
+  if (load_p && !m_pass->fuseable_load_p (insn))
+    return;
+
   // Ignore writeback accesses if the hook says to do so.
   if (!m_pass->should_handle_writeback (writeback::EXISTING)
       && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
@@ -1821,6 +1824,7 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
 
   rtx reg_notes = combine_reg_notes (first, second, load_p);
 
+  m_pass->set_multiword_subreg (first, second, load_p);
   rtx pair_pat = m_pass->gen_pair (pats, writeback_effect, load_p);
   insn_change *pair_change = nullptr;
   auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
@@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
       reversed = true;
     }
 
+  if (!m_pass->should_handle_unordered_insns (i1, i2))
+    return false;
+
   rtx cand_mems[2];
   rtx reg_ops[2];
   rtx pats[2];
@@ -2411,6 +2418,15 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
       reg_ops[i] = XEXP (pats[i], !load_p);
     }
 
+  if (!load_p && !m_pass->fuseable_store_p (i1, i2))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "punting on store-mem-pairs due to non fuseable cand (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
   if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
     {
       if (dump_file)
diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
index 2a38dc8f743..5371e48d3e2 100644
--- a/gcc/pair-fusion.h
+++ b/gcc/pair-fusion.h
@@ -171,6 +171,26 @@ struct pair_fusion {
   virtual rtx gen_promote_writeback_pair (rtx wb_effect, rtx mem,
 					  rtx regs[2], bool load_p) = 0;
 
+  // Given insn_info pair I1 and I2, sets subreg with multiword registers
+  // to assign register pairs by allocators.
+  // LOAD_P is true iff the pair is a load.
+  virtual void set_multiword_subreg (rtl_ssa::insn_info *i1, rtl_ssa::insn_info *i2,
+				     bool load_p) = 0;
+
+  // Given insn_info pair I1 and I2, checks if pairs are feasible to perform
+  // store mem pairs.
+  // Return true if feasible to perform store mem pairs otherwise false.
+  virtual bool fuseable_store_p (rtl_ssa::insn_info *i1, rtl_ssa::insn_info *i2) = 0;
+
+  // Given insn_info pair I1 and I2, checks if pairs are feasible to perform
+  // load mem pairs.
+  // Return true if feasible to perform load mem pairs otherwise false.
+  virtual bool fuseable_load_p (rtl_ssa::insn_info *info) = 0;
+
+  // Given insn_info pair I1 and I2, return true if offsets are in order.
+  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
+					      rtl_ssa::insn_info *i2) = 0;
+
   void process_block (rtl_ssa::bb_info *bb);
   rtl_ssa::insn_info *find_trailing_add (rtl_ssa::insn_info *insns[2],
 					 const rtl_ssa::insn_range_info
diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
index 7d2916d00c2..3e800932d86 100644
--- a/gcc/rtl-ssa/accesses.h
+++ b/gcc/rtl-ssa/accesses.h
@@ -379,6 +379,7 @@ public:
   //
   // This routine is only meaningful when def () is nonnull.
   bool is_last_use () const;
+  void set_is_live_out_use (bool value) { m_is_live_out_use = value; }
 
   // Print a description of def () to PP.
   void print_def (pretty_printer *pp) const;
@@ -430,7 +431,6 @@ private:
   void record_reference (rtx_obj_reference, bool);
   void set_insn (insn_info *);
   void set_def (set_info *set) { m_def = set; }
-  void set_is_live_out_use (bool value) { m_is_live_out_use = value; }
   void copy_prev_from (use_info *);
   void copy_next_from (use_info *);
   void set_last_use (use_info *);
diff --git a/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C b/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
new file mode 100644
index 00000000000..d10ff0cdf36
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
+
+#include <altivec.h>
+	
+void
+foo2 ()
+{
+  __vector_quad *dst1;
+  __vector_quad *dst2;
+  vector unsigned char src;
+  __vector_quad acc;
+  vector unsigned char *ptr;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst1 = acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
+  *dst2 = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
diff --git a/gcc/testsuite/g++.target/powerpc/mem-fusion.C b/gcc/testsuite/g++.target/powerpc/mem-fusion.C
new file mode 100644
index 00000000000..c523572cf3c
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/mem-fusion.C
@@ -0,0 +1,15 @@
+/* { dg-do compile } */ 
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */ 
+
+#include <altivec.h>
+
+void
+foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
+{
+  __vector_quad acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
index 69ee826e1be..ae29127f954 100644
--- a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
@@ -258,8 +258,8 @@ foo13b (__vector_quad *dst, __vector_quad *src, vec_t *vec)
   dst[13] = acc;
 }
 
-/* { dg-final { scan-assembler-times {\mlxv\M} 40 } } */
-/* { dg-final { scan-assembler-times {\mlxvp\M} 12 } } */
+/* { dg-final { scan-assembler-times {\mlxv\M} 0 } } */
+/* { dg-final { scan-assembler-times {\mlxvp\M} 32 } } */
 /* { dg-final { scan-assembler-times {\mstxvp\M} 40 } } */
 /* { dg-final { scan-assembler-times {\mxxmfacc\M} 20 } } */
 /* { dg-final { scan-assembler-times {\mxxmtacc\M} 6 } } */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-30 19:51 [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion Ajit Agarwal
@ 2024-05-30 21:34 ` Segher Boessenkool
  2024-05-31  8:14   ` Richard Sandiford
  2024-05-31  9:53 ` Richard Sandiford
  1 sibling, 1 reply; 20+ messages in thread
From: Segher Boessenkool @ 2024-05-30 21:34 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Richard Sandiford, Kewen.Lin, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Hi!

On Fri, May 31, 2024 at 01:21:44AM +0530, Ajit Agarwal wrote:
> Code is implemented with pure virtual functions to interface with target
> code.

It's not a pure function.  A pure function -- by definition -- has no
side effects.  These things have side effects.

What you mean is this is *an implementation* for C++ functions without
a generic implementation.  An obfuscation some people (like me) would
say.  But please call things what they are!  So not "pure function".
That has a meaning, and this isn't it.

> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
> 	implementation of additional virtual functions added in pair_fusion
> 	struct.

This does not belong in this patch.  Do not send "rs6000" patches that
touch anything outside of config/rs6000/ and similar, certainly not in
config/something-else/!

This would be WAY easier to review (read: AT ALL POSSIBLE) if you
included some detailed rationale and design document.

Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-30 21:34 ` Segher Boessenkool
@ 2024-05-31  8:14   ` Richard Sandiford
  2024-05-31 14:19     ` Segher Boessenkool
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2024-05-31  8:14 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Ajit Agarwal, Alex Coplan, Kewen.Lin, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> On Fri, May 31, 2024 at 01:21:44AM +0530, Ajit Agarwal wrote:
>> Code is implemented with pure virtual functions to interface with target
>> code.
>
> It's not a pure function.  A pure function -- by definition -- has no
> side effects.  These things have side effects.
>
> What you mean is this is *an implementation* for C++ functions without
> a generic implementation.  An obfuscation some people (like me) would
> say.  But please call things what they are!  So not "pure function".
> That has a meaning, and this isn't it.

"pure virtual function" is an established term.  The "pure" modifies
"virtual", not "function".

The description is correct because the patch adds pure virtual functions
to the base class and expects the derived class to override and implement
them.

>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>> 	implementation of additional virtual functions added in pair_fusion
>> 	struct.
>
> This does not belong in this patch.  Do not send "rs6000" patches that
> touch anything outside of config/rs6000/ and similar, certainly not in
> config/something-else/!
>
> This would be WAY easier to review (read: AT ALL POSSIBLE) if you
> included some detailed rationale and design document.

Please don't shout.

I don't think this kind of aggressive review is helpful to the project.

Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-30 19:51 [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion Ajit Agarwal
  2024-05-30 21:34 ` Segher Boessenkool
@ 2024-05-31  9:53 ` Richard Sandiford
  2024-05-31 10:28   ` Richard Sandiford
                     ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Richard Sandiford @ 2024-05-31  9:53 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello All:
>
> Common infrastructure using generic code for pair mem fusion of different
> targets.
>
> rs6000 target specific specific code implements virtual functions defined
> by generic code.
>
> Code is implemented with pure virtual functions to interface with target
> code.
>
> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>
> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
>
> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>
> Common infrastructure using generic code for pair mem fusion of different
> targets.
>
> rs6000 target specific specific code implements virtual functions defined
> by generic code.
>
> Code is implemented with pure virtual functions to interface with target
> code.
>
> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>
> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
> 	implementation of additional virtual functions added in pair_fusion
> 	struct.
> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
> 	before pass_early_remat.
> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
> 	Add target specific implementation using pure virtual
> 	functions.
> 	* config.gcc: Add new object file.
> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
> 	fusion pass.
> 	* config/rs6000/t-rs6000: Add new rule.
> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
> 	from private.
>
> gcc/testsuite/ChangeLog:
>
> 	* g++.target/powerpc/me-fusion.C: New test.
> 	* g++.target/powerpc/mem-fusion-1.C: New test.
> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
> ---

This isn't a complete review, just some initial questions & comments
about selected parts.

> [...]
> +/* Check whether load can be fusable or not.
> +   Return true if dependent use is UNSPEC otherwise false.  */
> +bool
> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
> +{
> +  rtx_insn *insn = info->rtl ();
> +
> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
> +    if (REG_NOTE_KIND (note) == REG_EQUAL
> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
> +      return false;

It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
note.  What's the reason for doing this?  Are you trying to avoid
fusing pairs before reload that are equivalent to a MEM (i.e. have
a natural spill slot)?  I think Alex hit a similar situation.

> +
> +  for (auto def : info->defs ())
> +    {
> +      auto set = dyn_cast<set_info *> (def);
> +      if (set && set->has_any_uses ())
> +	{
> +	  for (auto use : set->all_uses())

Nit: has_any_uses isn't necessary: the inner loop will simply do nothing
in that case.  Also, we can/should restrict the scan to non-debug uses.

This can then be:

  for (auto def : info->defs ())
    if (auto set = dyn_cast<set_info *> (def))
      for (auto use : set->nondebug_insn_uses())

> +	    {
> +	      if (use->insn ()->is_artificial ())
> +		return false;
> +
> +	       insn_info *info = use->insn ();
> +
> +	       if (info
> +		   && info->rtl ()

This test shouldn't be necessary.

> +		   && info->is_real ())
> +		  {
> +		    rtx_insn *rtl_insn = info->rtl ();
> +		    rtx set = single_set (rtl_insn);
> +
> +		    if (set == NULL_RTX)
> +		      return false;
> +
> +		    rtx op0 = SET_SRC (set);
> +		    if (GET_CODE (op0) != UNSPEC)
> +		      return false;

What's the motivation for rejecting unspecs?  It's unsual to treat
all unspecs as a distinct group.

Also, using single_set means that the function still lets through
parallels of two sets in which the sources are unspecs.  Is that
intentional?

The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
need to be described in comments, so that other people coming to this
code later can understand the motivation.  The same thing applies to
other decisions in the patch.

> +		  }
> +	      }
> +	  }
> +    }
> +  return true;
> +}
> [...]
> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
> index 9f897ac04e2..2dbe9f854ef 100644
> --- a/gcc/pair-fusion.cc
> +++ b/gcc/pair-fusion.cc
> @@ -312,7 +312,7 @@ static int
>  encode_lfs (lfs_fields fields)
>  {
>    int size_log2 = exact_log2 (fields.size);
> -  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
> +  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 6);
>    return ((int)fields.load_p << 3)
>      | ((int)fields.fpsimd_p << 2)
>      | (size_log2 - 2);

The point of the assert is to make sure that "size_log2 - 2"
fits within 2 bits, and so does not interfere with the fpsimd_p
and load_p parts of the encoding.  If rs6000 needs size_log2 == 6,
the shift amounts should be increased by 1 to compensate.

If we do bump the shifts by 1, the new range can be:

  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 9);

> [...]
> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
> +					      rtl_ssa::insn_info *i2) = 0;
> +

This name seems a bit misleading.  The function is used in:

@@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
       reversed = true;
     }
 
+  if (!m_pass->should_handle_unordered_insns (i1, i2))
+    return false;
+
   rtx cand_mems[2];
   rtx reg_ops[2];
   rtx pats[2];

and so it acts as a general opt-out.  The insns aren't known to be unordered.

It looks like the rs6000 override requires the original insns to be
in offset order.  Could you say why that's necessary?  (Both in email
and as a comment in the code.)

If we do need a general opt-out like this, it should probably go at the
very start of try_fuse_pair, before even the dump message.  (Alternatively,
it could go after the dump message, but then the "return false" would need
a dump message of its own to explain the failure.)

Thanks,
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-31  9:53 ` Richard Sandiford
@ 2024-05-31 10:28   ` Richard Sandiford
  2024-05-31 13:54   ` Ajit Agarwal
  2024-06-02 13:16   ` Ajit Agarwal
  2 siblings, 0 replies; 20+ messages in thread
From: Richard Sandiford @ 2024-05-31 10:28 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Reviewing my review :)

Richard Sandiford <richard.sandiford@arm.com> writes:
>> +
>> +  for (auto def : info->defs ())
>> +    {
>> +      auto set = dyn_cast<set_info *> (def);
>> +      if (set && set->has_any_uses ())
>> +	{
>> +	  for (auto use : set->all_uses())
>
> Nit: has_any_uses isn't necessary: the inner loop will simply do nothing
> in that case.  Also, we can/should restrict the scan to non-debug uses.
>
> This can then be:
>
>   for (auto def : info->defs ())
>     if (auto set = dyn_cast<set_info *> (def))
>       for (auto use : set->nondebug_insn_uses())

I forgot the space before "()" in the line above.

>
>> +	    {
>> +	      if (use->insn ()->is_artificial ())
>> +		return false;
>> +
>> +	       insn_info *info = use->insn ();
>> +
>> +	       if (info
>> +		   && info->rtl ()
>
> This test shouldn't be necessary.
>
>> +		   && info->is_real ())
>> +		  {
>> +		    rtx_insn *rtl_insn = info->rtl ();
>> +		    rtx set = single_set (rtl_insn);
>> +
>> +		    if (set == NULL_RTX)
>> +		      return false;
>> +
>> +		    rtx op0 = SET_SRC (set);
>> +		    if (GET_CODE (op0) != UNSPEC)
>> +		      return false;
> [...]
> Also, using single_set means that the function still lets through
> parallels of two sets in which the sources are unspecs.  Is that
> intentional?

I got this wrong, sorry.  You return false for non-single_set,
so that particular problem doesn't arise.  But why do we want to
reject uses of registers that are set by parallel sets?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-31  9:53 ` Richard Sandiford
  2024-05-31 10:28   ` Richard Sandiford
@ 2024-05-31 13:54   ` Ajit Agarwal
  2024-05-31 14:38     ` Richard Sandiford
  2024-06-02 13:16   ` Ajit Agarwal
  2 siblings, 1 reply; 20+ messages in thread
From: Ajit Agarwal @ 2024-05-31 13:54 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 31/05/24 3:23 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello All:
>>
>> Common infrastructure using generic code for pair mem fusion of different
>> targets.
>>
>> rs6000 target specific specific code implements virtual functions defined
>> by generic code.
>>
>> Code is implemented with pure virtual functions to interface with target
>> code.
>>
>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>
>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>
>> Common infrastructure using generic code for pair mem fusion of different
>> targets.
>>
>> rs6000 target specific specific code implements virtual functions defined
>> by generic code.
>>
>> Code is implemented with pure virtual functions to interface with target
>> code.
>>
>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>
>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>> 	implementation of additional virtual functions added in pair_fusion
>> 	struct.
>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>> 	before pass_early_remat.
>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>> 	Add target specific implementation using pure virtual
>> 	functions.
>> 	* config.gcc: Add new object file.
>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>> 	fusion pass.
>> 	* config/rs6000/t-rs6000: Add new rule.
>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>> 	from private.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* g++.target/powerpc/me-fusion.C: New test.
>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>> ---
> 
> This isn't a complete review, just some initial questions & comments
> about selected parts.
> 
>> [...]
>> +/* Check whether load can be fusable or not.
>> +   Return true if dependent use is UNSPEC otherwise false.  */
>> +bool
>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>> +{
>> +  rtx_insn *insn = info->rtl ();
>> +
>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>> +      return false;
> 
> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
> note.  What's the reason for doing this?  Are you trying to avoid
> fusing pairs before reload that are equivalent to a MEM (i.e. have
> a natural spill slot)?  I think Alex hit a similar situation.
> 

We have used the above check because of some SPEC benchmarks failing with
with MEM pairs having REG_EQUAL/EQUIV notes.

By adding the checks the benchmarks passes and also it improves the
performance.

This checks were added during initial implementation of pair fusion
pass.

I will investigate further if this check is still required or not.

Sorry for the inconvenience caused.

>> +
>> +  for (auto def : info->defs ())
>> +    {
>> +      auto set = dyn_cast<set_info *> (def);
>> +      if (set && set->has_any_uses ())
>> +	{
>> +	  for (auto use : set->all_uses())
> 
> Nit: has_any_uses isn't necessary: the inner loop will simply do nothing
> in that case.  Also, we can/should restrict the scan to non-debug uses.
> 
> This can then be:
> 
>   for (auto def : info->defs ())
>     if (auto set = dyn_cast<set_info *> (def))
>       for (auto use : set->nondebug_insn_uses())
> 

Sure. I will change as above.

>> +	    {
>> +	      if (use->insn ()->is_artificial ())
>> +		return false;
>> +
>> +	       insn_info *info = use->insn ();
>> +
>> +	       if (info
>> +		   && info->rtl ()
> 
> This test shouldn't be necessary.
> 

Sure I will remove this check.

>> +		   && info->is_real ())
>> +		  {
>> +		    rtx_insn *rtl_insn = info->rtl ();
>> +		    rtx set = single_set (rtl_insn);
>> +
>> +		    if (set == NULL_RTX)
>> +		      return false;
>> +
>> +		    rtx op0 = SET_SRC (set);
>> +		    if (GET_CODE (op0) != UNSPEC)
>> +		      return false;
> 
> What's the motivation for rejecting unspecs?  It's unsual to treat
> all unspecs as a distinct group.
> 
> Also, using single_set means that the function still lets through
> parallels of two sets in which the sources are unspecs.  Is that
> intentional?
> 
> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
> need to be described in comments, so that other people coming to this
> code later can understand the motivation.  The same thing applies to
> other decisions in the patch.
> 

Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of load
in UNSPEC. Thats why this check is added.

>> +		  }
>> +	      }
>> +	  }
>> +    }
>> +  return true;
>> +}
>> [...]
>> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
>> index 9f897ac04e2..2dbe9f854ef 100644
>> --- a/gcc/pair-fusion.cc
>> +++ b/gcc/pair-fusion.cc
>> @@ -312,7 +312,7 @@ static int
>>  encode_lfs (lfs_fields fields)
>>  {
>>    int size_log2 = exact_log2 (fields.size);
>> -  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
>> +  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 6);
>>    return ((int)fields.load_p << 3)
>>      | ((int)fields.fpsimd_p << 2)
>>      | (size_log2 - 2);
> 
> The point of the assert is to make sure that "size_log2 - 2"
> fits within 2 bits, and so does not interfere with the fpsimd_p
> and load_p parts of the encoding.  If rs6000 needs size_log2 == 6,
> the shift amounts should be increased by 1 to compensate.
> 
> If we do bump the shifts by 1, the new range can be:
> 
>   gcc_checking_assert (size_log2 >= 2 && size_log2 <= 9);
> 

Sure I will make this change.

>> [...]
>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>> +					      rtl_ssa::insn_info *i2) = 0;
>> +
> 
> This name seems a bit misleading.  The function is used in:
> 
> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>        reversed = true;
>      }
>  
> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
> +    return false;
> +
>    rtx cand_mems[2];
>    rtx reg_ops[2];
>    rtx pats[2];
> 
> and so it acts as a general opt-out.  The insns aren't known to be unordered.
> 
> It looks like the rs6000 override requires the original insns to be
> in offset order.  Could you say why that's necessary?  (Both in email
> and as a comment in the code.)
>

Yes rs6000 requires the original load insns to be in offset order.
Some regression tests like vect-outer-4f fails if we do load pair
fusion with load offsets are not in offset order as this breaks lxvp 
semantics.
 
> If we do need a general opt-out like this, it should probably go at the
> very start of try_fuse_pair, before even the dump message.  (Alternatively,
> it could go after the dump message, but then the "return false" would need
> a dump message of its own to explain the failure.)
> 

Sure.

> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-31  8:14   ` Richard Sandiford
@ 2024-05-31 14:19     ` Segher Boessenkool
  0 siblings, 0 replies; 20+ messages in thread
From: Segher Boessenkool @ 2024-05-31 14:19 UTC (permalink / raw)
  To: Ajit Agarwal, Alex Coplan, Kewen.Lin, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

On Fri, May 31, 2024 at 09:14:21AM +0100, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > Hi!
> >
> > On Fri, May 31, 2024 at 01:21:44AM +0530, Ajit Agarwal wrote:
> >> Code is implemented with pure virtual functions to interface with target
> >> code.
> >
> > It's not a pure function.  A pure function -- by definition -- has no
> > side effects.  These things have side effects.
> >
> > What you mean is this is *an implementation* for C++ functions without
> > a generic implementation.  An obfuscation some people (like me) would
> > say.  But please call things what they are!  So not "pure function".
> > That has a meaning, and this isn't it.
> 
> "pure virtual function" is an established term.  The "pure" modifies
> "virtual", not "function".
> 
> The description is correct because the patch adds pure virtual functions
> to the base class and expects the derived class to override and implement
> them.

But this code -- the architecture implementation! -- certainly does
*not* add abstract functions: it should provide an implementation for
them, instead.  So the commit message is completely misleading :-(

And no, it is not an established term, not outside of the C++ world.
Which GCC agreed *not* to dive too much into.  You can only sanely write
most of compilers -- just like anything where algorithmics matter -- in
an imperative, procedural language.  Obfuscation (like action at a
distance like this, where the meaning of a program depends hugely on
knowledge of other parts of the program, far away!) is the devil.

Reading the program is at least 1000x more important than writing it.
Writing is easy.  Reading and understanding, not so much.

> >> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
> >> 	implementation of additional virtual functions added in pair_fusion
> >> 	struct.
> >
> > This does not belong in this patch.  Do not send "rs6000" patches that
> > touch anything outside of config/rs6000/ and similar, certainly not in
> > config/something-else/!
> >
> > This would be WAY easier to review (read: AT ALL POSSIBLE) if you
> > included some detailed rationale and design document.
> 
> Please don't shout.
> 
> I don't think this kind of aggressive review is helpful to the project.

And I don't think wasting people's time is helpful.

I don't shout.  If you read that as shouting, that's not my problem.

It is EMPHASIS (there are no small caps in email).  Do you prefer *fat
print* for that?  Or /slanted/?  Or **italics**?  _Underlined_ perhaps?

Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-31 13:54   ` Ajit Agarwal
@ 2024-05-31 14:38     ` Richard Sandiford
  2024-05-31 16:59       ` Ajit Agarwal
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2024-05-31 14:38 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> Hello All:
>>>
>>> Common infrastructure using generic code for pair mem fusion of different
>>> targets.
>>>
>>> rs6000 target specific specific code implements virtual functions defined
>>> by generic code.
>>>
>>> Code is implemented with pure virtual functions to interface with target
>>> code.
>>>
>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>
>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>
>>> Thanks & Regards
>>> Ajit
>>>
>>>
>>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>>
>>> Common infrastructure using generic code for pair mem fusion of different
>>> targets.
>>>
>>> rs6000 target specific specific code implements virtual functions defined
>>> by generic code.
>>>
>>> Code is implemented with pure virtual functions to interface with target
>>> code.
>>>
>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>
>>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>
>>> gcc/ChangeLog:
>>>
>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>> 	implementation of additional virtual functions added in pair_fusion
>>> 	struct.
>>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>>> 	before pass_early_remat.
>>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>> 	Add target specific implementation using pure virtual
>>> 	functions.
>>> 	* config.gcc: Add new object file.
>>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>>> 	fusion pass.
>>> 	* config/rs6000/t-rs6000: Add new rule.
>>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>> 	from private.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> 	* g++.target/powerpc/me-fusion.C: New test.
>>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>> ---
>> 
>> This isn't a complete review, just some initial questions & comments
>> about selected parts.
>> 
>>> [...]
>>> +/* Check whether load can be fusable or not.
>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>> +bool
>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>> +{
>>> +  rtx_insn *insn = info->rtl ();
>>> +
>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>>> +      return false;
>> 
>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>> note.  What's the reason for doing this?  Are you trying to avoid
>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>> a natural spill slot)?  I think Alex hit a similar situation.
>> 
>
> We have used the above check because of some SPEC benchmarks failing with
> with MEM pairs having REG_EQUAL/EQUIV notes.
>
> By adding the checks the benchmarks passes and also it improves the
> performance.
>
> This checks were added during initial implementation of pair fusion
> pass.
>
> I will investigate further if this check is still required or not.

Thanks.  If it does affect SPEC results, it would be good to look
at the underlying reason, as a justification for the check.

AIUI, the case Alex was due to the way that the RA recognises:

  (set (reg R) (mem address-of-a-stack-variable))
    REG_EQUIV: (mem address-of-a-stack-variable)

where the REG_EQUIV is either explicit or detected by the RA.
If R needs to be spilled, it can then be spilled to its existing
location on the stack.  And if R needs to be spilled in the
instruction above (because of register pressure before the first
use of R), the RA is able to delete the instruction.

But if that is the reason, the condition should be restricted
to cases in which the note is a memory.

I think Alex had tried something similar and found that it wasn't
always effective.

> [...]
>>> +		   && info->is_real ())
>>> +		  {
>>> +		    rtx_insn *rtl_insn = info->rtl ();
>>> +		    rtx set = single_set (rtl_insn);
>>> +
>>> +		    if (set == NULL_RTX)
>>> +		      return false;
>>> +
>>> +		    rtx op0 = SET_SRC (set);
>>> +		    if (GET_CODE (op0) != UNSPEC)
>>> +		      return false;
>> 
>> What's the motivation for rejecting unspecs?  It's unsual to treat
>> all unspecs as a distinct group.
>> 
>> Also, using single_set means that the function still lets through
>> parallels of two sets in which the sources are unspecs.  Is that
>> intentional?
>> 
>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>> need to be described in comments, so that other people coming to this
>> code later can understand the motivation.  The same thing applies to
>> other decisions in the patch.
>> 
>
> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of load
> in UNSPEC. Thats why this check is added.

Can you give an example (in terms of rtl)?

> [...]
>>> [...]
>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>> +					      rtl_ssa::insn_info *i2) = 0;
>>> +
>> 
>> This name seems a bit misleading.  The function is used in:
>> 
>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>        reversed = true;
>>      }
>>  
>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>> +    return false;
>> +
>>    rtx cand_mems[2];
>>    rtx reg_ops[2];
>>    rtx pats[2];
>> 
>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>> 
>> It looks like the rs6000 override requires the original insns to be
>> in offset order.  Could you say why that's necessary?  (Both in email
>> and as a comment in the code.)
>>
>
> Yes rs6000 requires the original load insns to be in offset order.
> Some regression tests like vect-outer-4f fails if we do load pair
> fusion with load offsets are not in offset order as this breaks lxvp 
> semantics.

How does it break the semantics though?  In principle, the generic code
only fuses if it has "proved" that the loads can happen in either order.
So it shouldn't matter which order the hardware does things in.

Could you give an example of the kind of situation that you want
to avoid, and why it generates the wrong result?

Thanks,
Richard


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-31 14:38     ` Richard Sandiford
@ 2024-05-31 16:59       ` Ajit Agarwal
  2024-06-02  5:52         ` Ajit Agarwal
  2024-06-03  8:37         ` Richard Sandiford
  0 siblings, 2 replies; 20+ messages in thread
From: Ajit Agarwal @ 2024-05-31 16:59 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 31/05/24 8:08 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> Hello All:
>>>>
>>>> Common infrastructure using generic code for pair mem fusion of different
>>>> targets.
>>>>
>>>> rs6000 target specific specific code implements virtual functions defined
>>>> by generic code.
>>>>
>>>> Code is implemented with pure virtual functions to interface with target
>>>> code.
>>>>
>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>
>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>>
>>>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>>>
>>>> Common infrastructure using generic code for pair mem fusion of different
>>>> targets.
>>>>
>>>> rs6000 target specific specific code implements virtual functions defined
>>>> by generic code.
>>>>
>>>> Code is implemented with pure virtual functions to interface with target
>>>> code.
>>>>
>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>
>>>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>> 	implementation of additional virtual functions added in pair_fusion
>>>> 	struct.
>>>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>>>> 	before pass_early_remat.
>>>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>> 	Add target specific implementation using pure virtual
>>>> 	functions.
>>>> 	* config.gcc: Add new object file.
>>>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>> 	fusion pass.
>>>> 	* config/rs6000/t-rs6000: Add new rule.
>>>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>> 	from private.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> 	* g++.target/powerpc/me-fusion.C: New test.
>>>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>>>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>> ---
>>>
>>> This isn't a complete review, just some initial questions & comments
>>> about selected parts.
>>>
>>>> [...]
>>>> +/* Check whether load can be fusable or not.
>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>> +bool
>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>> +{
>>>> +  rtx_insn *insn = info->rtl ();
>>>> +
>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>>>> +      return false;
>>>
>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>> note.  What's the reason for doing this?  Are you trying to avoid
>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>
>>
>> We have used the above check because of some SPEC benchmarks failing with
>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>
>> By adding the checks the benchmarks passes and also it improves the
>> performance.
>>
>> This checks were added during initial implementation of pair fusion
>> pass.
>>
>> I will investigate further if this check is still required or not.
> 
> Thanks.  If it does affect SPEC results, it would be good to look
> at the underlying reason, as a justification for the check.
> 
> AIUI, the case Alex was due to the way that the RA recognises:
> 
>   (set (reg R) (mem address-of-a-stack-variable))
>     REG_EQUIV: (mem address-of-a-stack-variable)
> 
> where the REG_EQUIV is either explicit or detected by the RA.
> If R needs to be spilled, it can then be spilled to its existing
> location on the stack.  And if R needs to be spilled in the
> instruction above (because of register pressure before the first
> use of R), the RA is able to delete the instruction.
> 
> But if that is the reason, the condition should be restricted
> to cases in which the note is a memory.
> 
> I think Alex had tried something similar and found that it wasn't
> always effective.
> 

Sure I will check.
>> [...]
>>>> +		   && info->is_real ())
>>>> +		  {
>>>> +		    rtx_insn *rtl_insn = info->rtl ();
>>>> +		    rtx set = single_set (rtl_insn);
>>>> +
>>>> +		    if (set == NULL_RTX)
>>>> +		      return false;
>>>> +
>>>> +		    rtx op0 = SET_SRC (set);
>>>> +		    if (GET_CODE (op0) != UNSPEC)
>>>> +		      return false;
>>>
>>> What's the motivation for rejecting unspecs?  It's unsual to treat
>>> all unspecs as a distinct group.
>>>
>>> Also, using single_set means that the function still lets through
>>> parallels of two sets in which the sources are unspecs.  Is that
>>> intentional?
>>>
>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>>> need to be described in comments, so that other people coming to this
>>> code later can understand the motivation.  The same thing applies to
>>> other decisions in the patch.
>>>
>>
>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of load
>> in UNSPEC. Thats why this check is added.
> 
> Can you give an example (in terms of rtl)?


(insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ])
        (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65]+0 S16 A8]))  {*movoo}
     (nil))
(insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ])
        (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ])
                (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65 + 32B]+0 S16 A8]))  {*movoo}
     (nil))
(insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ])
        (plus:DI (reg:DI 118 [ ivtmp.73 ])
            (const_int 64 [0x40]))) 66 {*adddi3}
     (nil))
(insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ])
        (unspec:V16QI [
                (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0)
                (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16)
                (reg:V16QI 300)
            ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
     (nil))

This is the example where UNSPEC is used with 256 bit OOmode.

> 
>> [...]
>>>> [...]
>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>> +					      rtl_ssa::insn_info *i2) = 0;
>>>> +
>>>
>>> This name seems a bit misleading.  The function is used in:
>>>
>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>        reversed = true;
>>>      }
>>>  
>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>> +    return false;
>>> +
>>>    rtx cand_mems[2];
>>>    rtx reg_ops[2];
>>>    rtx pats[2];
>>>
>>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>>>
>>> It looks like the rs6000 override requires the original insns to be
>>> in offset order.  Could you say why that's necessary?  (Both in email
>>> and as a comment in the code.)
>>>
>>
>> Yes rs6000 requires the original load insns to be in offset order.
>> Some regression tests like vect-outer-4f fails if we do load pair
>> fusion with load offsets are not in offset order as this breaks lxvp 
>> semantics.
> 
> How does it break the semantics though?  In principle, the generic code
> only fuses if it has "proved" that the loads can happen in either order.
> So it shouldn't matter which order the hardware does things in.
> 
> Could you give an example of the kind of situation that you want
> to avoid, and why it generates the wrong result?
>

(insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
     (nil))
(insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
     (nil))
(insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B] ])
        (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
                (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
     (nil))

insn 22 and insn 31 is merged in the failure case and breaks the code.

 
> Thanks,
> Richard
>

Thanks & Regards
Ajit
 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-31 16:59       ` Ajit Agarwal
@ 2024-06-02  5:52         ` Ajit Agarwal
  2024-06-03  8:37         ` Richard Sandiford
  1 sibling, 0 replies; 20+ messages in thread
From: Ajit Agarwal @ 2024-06-02  5:52 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 31/05/24 10:29 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 31/05/24 8:08 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> Hello All:
>>>>>
>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>> targets.
>>>>>
>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>> by generic code.
>>>>>
>>>>> Code is implemented with pure virtual functions to interface with target
>>>>> code.
>>>>>
>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>>
>>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>>
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>
>>>>>
>>>>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>>>>
>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>> targets.
>>>>>
>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>> by generic code.
>>>>>
>>>>> Code is implemented with pure virtual functions to interface with target
>>>>> code.
>>>>>
>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>>
>>>>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>>> 	implementation of additional virtual functions added in pair_fusion
>>>>> 	struct.
>>>>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>>>>> 	before pass_early_remat.
>>>>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>>> 	Add target specific implementation using pure virtual
>>>>> 	functions.
>>>>> 	* config.gcc: Add new object file.
>>>>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>>> 	fusion pass.
>>>>> 	* config/rs6000/t-rs6000: Add new rule.
>>>>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>>> 	from private.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>> 	* g++.target/powerpc/me-fusion.C: New test.
>>>>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>>>>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>>> ---
>>>>
>>>> This isn't a complete review, just some initial questions & comments
>>>> about selected parts.
>>>>
>>>>> [...]
>>>>> +/* Check whether load can be fusable or not.
>>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>>> +bool
>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>> +{
>>>>> +  rtx_insn *insn = info->rtl ();
>>>>> +
>>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>>>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>>>>> +      return false;
>>>>
>>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>>> note.  What's the reason for doing this?  Are you trying to avoid
>>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>>
>>>
>>> We have used the above check because of some SPEC benchmarks failing with
>>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>>
>>> By adding the checks the benchmarks passes and also it improves the
>>> performance.
>>>
>>> This checks were added during initial implementation of pair fusion
>>> pass.
>>>
>>> I will investigate further if this check is still required or not.
>>
>> Thanks.  If it does affect SPEC results, it would be good to look
>> at the underlying reason, as a justification for the check.
>>
>> AIUI, the case Alex was due to the way that the RA recognises:
>>
>>   (set (reg R) (mem address-of-a-stack-vare ciable))
>>     REG_EQUIV: (mem address-of-a-stack-variable)
>>
>> where the REG_EQUIV is either explicit or detected by the RA.
>> If R needs to be spilled, it can then be spilled to its existing
>> location on the stack.  And if R needs to be spilled in the
>> instruction above (because of register pressure before the first
>> use of R), the RA is able to delete the instruction.
>>
>> But if that is the reason, the condition should be restricted
>> to cases in which the note is a memory.
>>
>> I think Alex had tried something similar and found that it wasn't
>> always effective.
>>
> 
> Sure I will check.

Spec looks good without the above check of REG_EQUIV/REG_EQUAL.
I dont see the above condition in Spec benchmarks any more for 
MEM pairs.
Hence I will remove the above check from the patch and send
new patch without the above check.

Thanks & Regards
Ajit
>>> [...]
>>>>> +		   && info->is_real ())
>>>>> +		  {
>>>>> +		    rtx_insn *rtl_insn = info->rtl ();
>>>>> +		    rtx set = single_set (rtl_insn);
>>>>> +
>>>>> +		    if (set == NULL_RTX)
>>>>> +		      return false;
>>>>> +
>>>>> +		    rtx op0 = SET_SRC (set);
>>>>> +		    if (GET_CODE (op0) != UNSPEC)
>>>>> +		      return false;
>>>>
>>>> What's the motivation for rejecting unspecs?  It's unsual to treat
>>>> all unspecs as a distinct group.
>>>>
>>>> Also, using single_set means that the function still lets through
>>>> parallels of two sets in which the sources are unspecs.  Is that
>>>> intentional?
>>>>
>>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>>>> need to be described in comments, so that other people coming to this
>>>> code later can understand the motivation.  The same thing applies to
>>>> other decisions in the patch.
>>>>
>>>
>>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of load
>>> in UNSPEC. Thats why this check is added.
>>
>> Can you give an example (in terms of rtl)?
> 
> 
> (insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ])
>         (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65]+0 S16 A8]))  {*movoo}
>      (nil))
> (insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ])
>         (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ])
>                 (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65 + 32B]+0 S16 A8]))  {*movoo}
>      (nil))
> (insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ])
>         (plus:DI (reg:DI 118 [ ivtmp.73 ])
>             (const_int 64 [0x40]))) 66 {*adddi3}
>      (nil))
> (insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ])
>         (unspec:V16QI [
>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0)
>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16)
>                 (reg:V16QI 300)
>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>      (nil))
> 
> This is the example where UNSPEC is used with 256 bit OOmode.
> 
>>
>>> [...]
>>>>> [...]
>>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>>> +					      rtl_ssa::insn_info *i2) = 0;
>>>>> +
>>>>
>>>> This name seems a bit misleading.  The function is used in:
>>>>
>>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>        reversed = true;
>>>>      }
>>>>  
>>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>>> +    return false;
>>>> +
>>>>    rtx cand_mems[2];
>>>>    rtx reg_ops[2];
>>>>    rtx pats[2];
>>>>
>>>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>>>>
>>>> It looks like the rs6000 override requires the original insns to be
>>>> in offset order.  Could you say why that's necessary?  (Both in email
>>>> and as a comment in the code.)
>>>>
>>>
>>> Yes rs6000 requires the original load insns to be in offset order.
>>> Some regression tests like vect-outer-4f fails if we do load pair
>>> fusion with load offsets are not in offset order as this breaks lxvp 
>>> semantics.
>>
>> How does it break the semantics though?  In principle, the generic code
>> only fuses if it has "proved" that the loads can happen in either order.
>> So it shouldn't matter which order the hardware does things in.
>>
>> Could you give an example of the kind of situation that you want
>> to avoid, and why it generates the wrong result?
>>
> 
> (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
>      (nil))
> (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
>      (nil))
> 
> insn 22 and insn 31 is merged in the failure case and breaks the code.
> 
>  
>> Thanks,
>> Richard
>>
> 
> Thanks & Regards
> Ajit
>  

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-31  9:53 ` Richard Sandiford
  2024-05-31 10:28   ` Richard Sandiford
  2024-05-31 13:54   ` Ajit Agarwal
@ 2024-06-02 13:16   ` Ajit Agarwal
  2 siblings, 0 replies; 20+ messages in thread
From: Ajit Agarwal @ 2024-06-02 13:16 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 31/05/24 3:23 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello All:
>>
>> Common infrastructure using generic code for pair mem fusion of different
>> targets.
>>
>> rs6000 target specific specific code implements virtual functions defined
>> by generic code.
>>
>> Code is implemented with pure virtual functions to interface with target
>> code.
>>
>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>
>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>
>> Common infrastructure using generic code for pair mem fusion of different
>> targets.
>>
>> rs6000 target specific specific code implements virtual functions defined
>> by generic code.
>>
>> Code is implemented with pure virtual functions to interface with target
>> code.
>>
>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>
>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>> 	implementation of additional virtual functions added in pair_fusion
>> 	struct.
>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>> 	before pass_early_remat.
>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>> 	Add target specific implementation using pure virtual
>> 	functions.
>> 	* config.gcc: Add new object file.
>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>> 	fusion pass.
>> 	* config/rs6000/t-rs6000: Add new rule.
>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>> 	from private.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* g++.target/powerpc/me-fusion.C: New test.
>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>> ---
> 
> This isn't a complete review, just some initial questions & comments
> about selected parts.
> 
>> [...]
>> +/* Check whether load can be fusable or not.
>> +   Return true if dependent use is UNSPEC otherwise false.  */
>> +bool
>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>> +{
>> +  rtx_insn *insn = info->rtl ();
>> +
>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>> +      return false;
> 
> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
> note.  What's the reason for doing this?  Are you trying to avoid
> fusing pairs before reload that are equivalent to a MEM (i.e. have
> a natural spill slot)?  I think Alex hit a similar situation.
> 

Removed the above check and addressed in the new patch sent.
>> +
>> +  for (auto def : info->defs ())
>> +    {
>> +      auto set = dyn_cast<set_info *> (def);
>> +      if (set && set->has_any_uses ())
>> +	{
>> +	  for (auto use : set->all_uses())
> 
> Nit: has_any_uses isn't necessary: the inner loop will simply do nothing
> in that case.  Also, we can/should restrict the scan to non-debug uses.
> 
> This can then be:
> 
>   for (auto def : info->defs ())
>     if (auto set = dyn_cast<set_info *> (def))
>       for (auto use : set->nondebug_insn_uses())
> 
>> +	    {
>> +	      if (use->insn ()->is_artificial ())
>> +		return false;
>> +
>> +	       insn_info *info = use->insn ();
>> +
>> +	       if (info
>> +		   && info->rtl ()
> 
> This test shouldn't be necessary.
> 
>> +		   && info->is_real ())
>> +		  {
>> +		    rtx_insn *rtl_insn = info->rtl ();
>> +		    rtx set = single_set (rtl_insn);
>> +
>> +		    if (set == NULL_RTX)
>> +		      return false;
>> +
>> +		    rtx op0 = SET_SRC (set);
>> +		    if (GET_CODE (op0) != UNSPEC)
>> +		      return false;
> 
> What's the motivation for rejecting unspecs?  It's unsual to treat
> all unspecs as a distinct group.
> 
> Also, using single_set means that the function still lets through
> parallels of two sets in which the sources are unspecs.  Is that
> intentional?
> 
> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
> need to be described in comments, so that other people coming to this
> code later can understand the motivation.  The same thing applies to
> other decisions in the patch.
> 

Addressed description in code in new patch sent.

>> +		  }
>> +	      }
>> +	  }
>> +    }
>> +  return true;
>> +}
>> [...]
>> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
>> index 9f897ac04e2..2dbe9f854ef 100644
>> --- a/gcc/pair-fusion.cc
>> +++ b/gcc/pair-fusion.cc
>> @@ -312,7 +312,7 @@ static int
>>  encode_lfs (lfs_fields fields)
>>  {
>>    int size_log2 = exact_log2 (fields.size);
>> -  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
>> +  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 6);
>>    return ((int)fields.load_p << 3)
>>      | ((int)fields.fpsimd_p << 2)
>>      | (size_log2 - 2);
> 
> The point of the assert is to make sure that "size_log2 - 2"
> fits within 2 bits, and so does not interfere with the fpsimd_p
> and load_p parts of the encoding.  If rs6000 needs size_log2 == 6,
> the shift amounts should be increased by 1 to compensate.
> 
> If we do bump the shifts by 1, the new range can be:
> 
>   gcc_checking_assert (size_log2 >= 2 && size_log2 <= 9);
>

Addressed in the new patch sent.

 
>> [...]
>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>> +					      rtl_ssa::insn_info *i2) = 0;
>> +
> 
> This name seems a bit misleading.  The function is used in:
> 
> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>        reversed = true;
>      }
>  
> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
> +    return false;
> +
>    rtx cand_mems[2];
>    rtx reg_ops[2];
>    rtx pats[2];
> 
> and so it acts as a general opt-out.  The insns aren't known to be unordered.
> 
> It looks like the rs6000 override requires the original insns to be
> in offset order.  Could you say why that's necessary?  (Both in email
> and as a comment in the code.)
> 
> If we do need a general opt-out like this, it should probably go at the
> very start of try_fuse_pair, before even the dump message.  (Alternatively,
> it could go after the dump message, but then the "return false" would need
> a dump message of its own to explain the failure.)
>

Addressed comments in the code and move the check at the start of the try_fuse_pair
functionn in new patch sent.

 
> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-05-31 16:59       ` Ajit Agarwal
  2024-06-02  5:52         ` Ajit Agarwal
@ 2024-06-03  8:37         ` Richard Sandiford
  2024-06-03 11:05           ` Ajit Agarwal
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2024-06-03  8:37 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello Richard:
> On 31/05/24 8:08 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> Hello All:
>>>>>
>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>> targets.
>>>>>
>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>> by generic code.
>>>>>
>>>>> Code is implemented with pure virtual functions to interface with target
>>>>> code.
>>>>>
>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>>
>>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>>
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>
>>>>>
>>>>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>>>>
>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>> targets.
>>>>>
>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>> by generic code.
>>>>>
>>>>> Code is implemented with pure virtual functions to interface with target
>>>>> code.
>>>>>
>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>>
>>>>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>>> 	implementation of additional virtual functions added in pair_fusion
>>>>> 	struct.
>>>>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>>>>> 	before pass_early_remat.
>>>>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>>> 	Add target specific implementation using pure virtual
>>>>> 	functions.
>>>>> 	* config.gcc: Add new object file.
>>>>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>>> 	fusion pass.
>>>>> 	* config/rs6000/t-rs6000: Add new rule.
>>>>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>>> 	from private.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>> 	* g++.target/powerpc/me-fusion.C: New test.
>>>>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>>>>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>>> ---
>>>>
>>>> This isn't a complete review, just some initial questions & comments
>>>> about selected parts.
>>>>
>>>>> [...]
>>>>> +/* Check whether load can be fusable or not.
>>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>>> +bool
>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>> +{
>>>>> +  rtx_insn *insn = info->rtl ();
>>>>> +
>>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>>>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>>>>> +      return false;
>>>>
>>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>>> note.  What's the reason for doing this?  Are you trying to avoid
>>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>>
>>>
>>> We have used the above check because of some SPEC benchmarks failing with
>>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>>
>>> By adding the checks the benchmarks passes and also it improves the
>>> performance.
>>>
>>> This checks were added during initial implementation of pair fusion
>>> pass.
>>>
>>> I will investigate further if this check is still required or not.
>> 
>> Thanks.  If it does affect SPEC results, it would be good to look
>> at the underlying reason, as a justification for the check.
>> 
>> AIUI, the case Alex was due to the way that the RA recognises:
>> 
>>   (set (reg R) (mem address-of-a-stack-variable))
>>     REG_EQUIV: (mem address-of-a-stack-variable)
>> 
>> where the REG_EQUIV is either explicit or detected by the RA.
>> If R needs to be spilled, it can then be spilled to its existing
>> location on the stack.  And if R needs to be spilled in the
>> instruction above (because of register pressure before the first
>> use of R), the RA is able to delete the instruction.
>> 
>> But if that is the reason, the condition should be restricted
>> to cases in which the note is a memory.
>> 
>> I think Alex had tried something similar and found that it wasn't
>> always effective.
>> 
>
> Sure I will check.
>>> [...]
>>>>> +		   && info->is_real ())
>>>>> +		  {
>>>>> +		    rtx_insn *rtl_insn = info->rtl ();
>>>>> +		    rtx set = single_set (rtl_insn);
>>>>> +
>>>>> +		    if (set == NULL_RTX)
>>>>> +		      return false;
>>>>> +
>>>>> +		    rtx op0 = SET_SRC (set);
>>>>> +		    if (GET_CODE (op0) != UNSPEC)
>>>>> +		      return false;
>>>>
>>>> What's the motivation for rejecting unspecs?  It's unsual to treat
>>>> all unspecs as a distinct group.
>>>>
>>>> Also, using single_set means that the function still lets through
>>>> parallels of two sets in which the sources are unspecs.  Is that
>>>> intentional?
>>>>
>>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>>>> need to be described in comments, so that other people coming to this
>>>> code later can understand the motivation.  The same thing applies to
>>>> other decisions in the patch.
>>>>
>>>
>>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of load
>>> in UNSPEC. Thats why this check is added.
>> 
>> Can you give an example (in terms of rtl)?
>
>
> (insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ])
>         (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65]+0 S16 A8]))  {*movoo}
>      (nil))
> (insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ])
>         (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ])
>                 (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65 + 32B]+0 S16 A8]))  {*movoo}
>      (nil))
> (insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ])
>         (plus:DI (reg:DI 118 [ ivtmp.73 ])
>             (const_int 64 [0x40]))) 66 {*adddi3}
>      (nil))
> (insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ])
>         (unspec:V16QI [
>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0)
>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16)
>                 (reg:V16QI 300)
>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>      (nil))
>
> This is the example where UNSPEC is used with 256 bit OOmode.

But what (conceptually) makes the vperm instruction interesting?
I.e. what is it about the instruction that makes you want to look for
it?

Is it that the instruction uses both halves of the fused load?  If so,
I don't think the test in the patch achieves that.  ISTM that the patch
checks both loads individually, and so would also match cases where one
load is used by one vperm and the other load is used by a different vperm.
Is that the intention?

It would also match cases where the vperm inputs are the other around,
so that the first operand is loaded from a higher address than the
second operand.  Is that ok, or is it a problem?

Similarly, there are other instructions that have unspecs, such as:

(define_insn "*insert4b_internal"
  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
	(unspec:V16QI [(match_operand:V4SI 1 "vsx_register_operand" "wa")
		       (match_operand:V16QI 2 "vsx_register_operand" "0")
		       (match_operand:QI 3 "const_0_to_12_operand" "n")]
		   UNSPEC_XXINSERTW))]

(to pick an arbitrary example).  It looks like the patch would also
accept uses by this instruction.  Is that intentional?

If it is intentional, what distinguishes things like vperm and xxinsertw
(and all other unspecs) from plain addition?

  [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
        (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]

This is why the intention behind the patch is important.  As it stands,
it isn't clear what criteria the patch is using to distinguish "valid"
fuse candidates from "invalid" ones.

>>>>> [...]
>>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>>> +					      rtl_ssa::insn_info *i2) = 0;
>>>>> +
>>>>
>>>> This name seems a bit misleading.  The function is used in:
>>>>
>>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>        reversed = true;
>>>>      }
>>>>  
>>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>>> +    return false;
>>>> +
>>>>    rtx cand_mems[2];
>>>>    rtx reg_ops[2];
>>>>    rtx pats[2];
>>>>
>>>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>>>>
>>>> It looks like the rs6000 override requires the original insns to be
>>>> in offset order.  Could you say why that's necessary?  (Both in email
>>>> and as a comment in the code.)
>>>>
>>>
>>> Yes rs6000 requires the original load insns to be in offset order.
>>> Some regression tests like vect-outer-4f fails if we do load pair
>>> fusion with load offsets are not in offset order as this breaks lxvp 
>>> semantics.
>> 
>> How does it break the semantics though?  In principle, the generic code
>> only fuses if it has "proved" that the loads can happen in either order.
>> So it shouldn't matter which order the hardware does things in.
>> 
>> Could you give an example of the kind of situation that you want
>> to avoid, and why it generates the wrong result?
>>
>
> (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>      (nil))
> (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
>      (nil))
> (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B] ])
>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>                 (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
>      (nil))
>
> insn 22 and insn 31 is merged in the failure case and breaks the code.

What specifically goes wrong though?  This is just a sequence of loads
from the same base pointer, with no interdependencies, so it should be
possible to do the loads in any order.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-06-03  8:37         ` Richard Sandiford
@ 2024-06-03 11:05           ` Ajit Agarwal
  2024-06-03 11:33             ` Richard Sandiford
  0 siblings, 1 reply; 20+ messages in thread
From: Ajit Agarwal @ 2024-06-03 11:05 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 03/06/24 2:07 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>> On 31/05/24 8:08 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> On 31/05/24 3:23 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> Hello All:
>>>>>>
>>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>>> targets.
>>>>>>
>>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>>> by generic code.
>>>>>>
>>>>>> Code is implemented with pure virtual functions to interface with target
>>>>>> code.
>>>>>>
>>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>>>
>>>>>> Bootstrapped and regtested for aarch64-linux-gnu and powerpc64-linux-gnu.
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Ajit
>>>>>>
>>>>>>
>>>>>> aarch64, rs6000, middle-end: Add implementation for different targets for pair mem fusion
>>>>>>
>>>>>> Common infrastructure using generic code for pair mem fusion of different
>>>>>> targets.
>>>>>>
>>>>>> rs6000 target specific specific code implements virtual functions defined
>>>>>> by generic code.
>>>>>>
>>>>>> Code is implemented with pure virtual functions to interface with target
>>>>>> code.
>>>>>>
>>>>>> Target specific code are added in rs6000-mem-fusion.cc and additional virtual
>>>>>> function implementation required for rs6000 are added in aarch64-ldp-fusion.cc.
>>>>>>
>>>>>> 2024-05-31  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>> 	* config/aarch64/aarch64-ldp-fusion.cc: Add target specific
>>>>>> 	implementation of additional virtual functions added in pair_fusion
>>>>>> 	struct.
>>>>>> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
>>>>>> 	before pass_early_remat.
>>>>>> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
>>>>>> 	Add target specific implementation using pure virtual
>>>>>> 	functions.
>>>>>> 	* config.gcc: Add new object file.
>>>>>> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
>>>>>> 	fusion pass.
>>>>>> 	* config/rs6000/t-rs6000: Add new rule.
>>>>>> 	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
>>>>>> 	from private.
>>>>>>
>>>>>> gcc/testsuite/ChangeLog:
>>>>>>
>>>>>> 	* g++.target/powerpc/me-fusion.C: New test.
>>>>>> 	* g++.target/powerpc/mem-fusion-1.C: New test.
>>>>>> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
>>>>>> ---
>>>>>
>>>>> This isn't a complete review, just some initial questions & comments
>>>>> about selected parts.
>>>>>
>>>>>> [...]
>>>>>> +/* Check whether load can be fusable or not.
>>>>>> +   Return true if dependent use is UNSPEC otherwise false.  */
>>>>>> +bool
>>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>>> +{
>>>>>> +  rtx_insn *insn = info->rtl ();
>>>>>> +
>>>>>> +  for (rtx note = REG_NOTES (insn); note; note = XEXP (note, 1))
>>>>>> +    if (REG_NOTE_KIND (note) == REG_EQUAL
>>>>>> +	|| REG_NOTE_KIND (note) == REG_EQUIV)
>>>>>> +      return false;
>>>>>
>>>>> It's unusual to punt on an optimisation because of a REG_EQUAL/EQUIV
>>>>> note.  What's the reason for doing this?  Are you trying to avoid
>>>>> fusing pairs before reload that are equivalent to a MEM (i.e. have
>>>>> a natural spill slot)?  I think Alex hit a similar situation.
>>>>>
>>>>
>>>> We have used the above check because of some SPEC benchmarks failing with
>>>> with MEM pairs having REG_EQUAL/EQUIV notes.
>>>>
>>>> By adding the checks the benchmarks passes and also it improves the
>>>> performance.
>>>>
>>>> This checks were added during initial implementation of pair fusion
>>>> pass.
>>>>
>>>> I will investigate further if this check is still required or not.
>>>
>>> Thanks.  If it does affect SPEC results, it would be good to look
>>> at the underlying reason, as a justification for the check.
>>>
>>> AIUI, the case Alex was due to the way that the RA recognises:
>>>
>>>   (set (reg R) (mem address-of-a-stack-variable))
>>>     REG_EQUIV: (mem address-of-a-stack-variable)
>>>
>>> where the REG_EQUIV is either explicit or detected by the RA.
>>> If R needs to be spilled, it can then be spilled to its existing
>>> location on the stack.  And if R needs to be spilled in the
>>> instruction above (because of register pressure before the first
>>> use of R), the RA is able to delete the instruction.
>>>
>>> But if that is the reason, the condition should be restricted
>>> to cases in which the note is a memory.
>>>
>>> I think Alex had tried something similar and found that it wasn't
>>> always effective.
>>>
>>
>> Sure I will check.
>>>> [...]
>>>>>> +		   && info->is_real ())
>>>>>> +		  {
>>>>>> +		    rtx_insn *rtl_insn = info->rtl ();
>>>>>> +		    rtx set = single_set (rtl_insn);
>>>>>> +
>>>>>> +		    if (set == NULL_RTX)
>>>>>> +		      return false;
>>>>>> +
>>>>>> +		    rtx op0 = SET_SRC (set);
>>>>>> +		    if (GET_CODE (op0) != UNSPEC)
>>>>>> +		      return false;
>>>>>
>>>>> What's the motivation for rejecting unspecs?  It's unsual to treat
>>>>> all unspecs as a distinct group.
>>>>>
>>>>> Also, using single_set means that the function still lets through
>>>>> parallels of two sets in which the sources are unspecs.  Is that
>>>>> intentional?
>>>>>
>>>>> The reasons behind things like the REG_EQUAL/EQUIV and UNSPEC decisions
>>>>> need to be described in comments, so that other people coming to this
>>>>> code later can understand the motivation.  The same thing applies to
>>>>> other decisions in the patch.
>>>>>
>>>>
>>>> Adjacent load pair fusion with 256 bit OOmode is seen and valid with use of load
>>>> in UNSPEC. Thats why this check is added.
>>>
>>> Can you give an example (in terms of rtl)?
>>
>>
>> (insn 23 72 25 4 (set (reg:OO 124 [ vect__1.24 ])
>>         (mem:OO (reg:DI 118 [ ivtmp.73 ]) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65]+0 S16 A8]))  {*movoo}
>>      (nil))
>> (insn 25 23 103 4 (set (reg:OO 174 [ vect__1.26 ])
>>         (mem:OO (plus:DI (reg:DI 118 [ ivtmp.73 ])
>>                 (const_int 32 [0x20])) [0 MEM <vector(16) unsigned char> [(unsigned char *)_65 + 32B]+0 S16 A8]))  {*movoo}
>>      (nil))
>> (insn 103 25 27 4 (set (reg:DI 118 [ ivtmp.73 ])
>>         (plus:DI (reg:DI 118 [ ivtmp.73 ])
>>             (const_int 64 [0x40]))) 66 {*adddi3}
>>      (nil))
>> (insn 27 103 29 4 (set (reg:V16QI 176 [ vect_perm_even_123 ])
>>         (unspec:V16QI [
>>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 0)
>>                 (subreg:V16QI (reg:OO 124 [ vect__1.24 ]) 16)
>>                 (reg:V16QI 300)
>>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>>      (nil))
>>
>> This is the example where UNSPEC is used with 256 bit OOmode.
> 
> But what (conceptually) makes the vperm instruction interesting?
> I.e. what is it about the instruction that makes you want to look for
> it?
> 
The example above used UNSPEC with vperm but this patch handles
all the variants of UNSPEC instructions.

> Is it that the instruction uses both halves of the fused load?  If so,
> I don't think the test in the patch achieves that.  ISTM that the patch
> checks both loads individually, and so would also match cases where one
> load is used by one vperm and the other load is used by a different vperm.
> Is that the intention?
>

Yes the patch doesn't assumes that UNSPEC instruction uses both
halves of load. The patch checks both load individually and can
handles where one load used by one vperm and other load is
used by different vperm.

 
> It would also match cases where the vperm inputs are the other around,
> so that the first operand is loaded from a higher address than the
> second operand.  Is that ok, or is it a problem?
>

Yes the patch can handle this case also.


 
> Similarly, there are other instructions that have unspecs, such as:
> 
> (define_insn "*insert4b_internal"
>   [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
> 	(unspec:V16QI [(match_operand:V4SI 1 "vsx_register_operand" "wa")
> 		       (match_operand:V16QI 2 "vsx_register_operand" "0")
> 		       (match_operand:QI 3 "const_0_to_12_operand" "n")]
> 		   UNSPEC_XXINSERTW))]
> 
> (to pick an arbitrary example).  It looks like the patch would also
> accept uses by this instruction.  Is that intentional?
>

Yes patch can accept above UNSPEC.
The patch can handle all variants of UNSPEC instructions.

 
> If it is intentional, what distinguishes things like vperm and xxinsertw
> (and all other unspecs) from plain addition?
> 
>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>

Plain addition are not supported currently.
We have not seen many cases with plain addition and this patch
will not accept plain addition.

 
> This is why the intention behind the patch is important.  As it stands,
> it isn't clear what criteria the patch is using to distinguish "valid"
> fuse candidates from "invalid" ones.
>

Intention behind this patch all variants of UNSPEC instructions are
supported and uses without UNSPEC are not supported in this patch.

 
>>>>>> [...]
>>>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>>>> +					      rtl_ssa::insn_info *i2) = 0;
>>>>>> +
>>>>>
>>>>> This name seems a bit misleading.  The function is used in:
>>>>>
>>>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>        reversed = true;
>>>>>      }
>>>>>  
>>>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>>>> +    return false;
>>>>> +
>>>>>    rtx cand_mems[2];
>>>>>    rtx reg_ops[2];
>>>>>    rtx pats[2];
>>>>>
>>>>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>>>>>
>>>>> It looks like the rs6000 override requires the original insns to be
>>>>> in offset order.  Could you say why that's necessary?  (Both in email
>>>>> and as a comment in the code.)
>>>>>
>>>>
>>>> Yes rs6000 requires the original load insns to be in offset order.
>>>> Some regression tests like vect-outer-4f fails if we do load pair
>>>> fusion with load offsets are not in offset order as this breaks lxvp 
>>>> semantics.
>>>
>>> How does it break the semantics though?  In principle, the generic code
>>> only fuses if it has "proved" that the loads can happen in either order.
>>> So it shouldn't matter which order the hardware does things in.
>>>
>>> Could you give an example of the kind of situation that you want
>>> to avoid, and why it generates the wrong result?
>>>
>>
>> (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B] ])
>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>                 (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
>>      (nil))
>>
>> insn 22 and insn 31 is merged in the failure case and breaks the code.
> 
> What specifically goes wrong though?  This is just a sequence of loads
> from the same base pointer, with no interdependencies, so it should be
> possible to do the loads in any order.
>

Here in fuse_pair we set first and second based  as follows:

  insn_info *first = (*i1 < *i2) ? i1 : i2;
  insn_info *second = (first == i1) ? i2 : i1;

This makes higher offset with first and lower offset with second.
if (*i1 > *i2).

and in set_multiword_subreg interface we pass first and second.
Hence above code breaks because subreg offsets with 256 bits are not set properly.

If we pass i1 and i2 in set_multiword_subreg (i1, i2, load_p)
in fuse_pair should_handle_unordered_insns is not required in try_fuse_pair.

I will send the patch by removing the above interface check 
in try_fuse_pair and pass i1 and i2 in set_multiword_subreg

This works fine.

Sorry for the inconvenience caused.
 
> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-06-03 11:05           ` Ajit Agarwal
@ 2024-06-03 11:33             ` Richard Sandiford
  2024-06-03 13:47               ` Ajit Agarwal
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2024-06-03 11:33 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> [...]
>> If it is intentional, what distinguishes things like vperm and xxinsertw
>> (and all other unspecs) from plain addition?
>> 
>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>
>
> Plain addition are not supported currently.
> We have not seen many cases with plain addition and this patch
> will not accept plain addition.
>
>  
>> This is why the intention behind the patch is important.  As it stands,
>> it isn't clear what criteria the patch is using to distinguish "valid"
>> fuse candidates from "invalid" ones.
>>
>
> Intention behind this patch all variants of UNSPEC instructions are
> supported and uses without UNSPEC are not supported in this patch.

But why make the distinction this way though?  UNSPEC is a very
GCC-specific concept.  Whether something is an UNSPEC or some other
RTL code depends largely on historical accident.  E.g. we have specific
codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
for VEC_PERM (even for VEC_PERM_EXPR exists in gimple).

It seems unlikely that GCC's choice about whether to represent something
as an UNSPEC or as another RTL code lines up neatly with the kind of
codegen decisions that a good assembly programmer would make.

I suppose another way of asking is to turn this around and say: what
kind of uses are you trying to exclude?  Presumably things are worse
if you remove this function override.  But what makes them worse?
What kind of uses cause the regression?

>>>>>>> [...]
>>>>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>>>>> +					      rtl_ssa::insn_info *i2) = 0;
>>>>>>> +
>>>>>>
>>>>>> This name seems a bit misleading.  The function is used in:
>>>>>>
>>>>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>        reversed = true;
>>>>>>      }
>>>>>>  
>>>>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>>>>> +    return false;
>>>>>> +
>>>>>>    rtx cand_mems[2];
>>>>>>    rtx reg_ops[2];
>>>>>>    rtx pats[2];
>>>>>>
>>>>>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>>>>>>
>>>>>> It looks like the rs6000 override requires the original insns to be
>>>>>> in offset order.  Could you say why that's necessary?  (Both in email
>>>>>> and as a comment in the code.)
>>>>>>
>>>>>
>>>>> Yes rs6000 requires the original load insns to be in offset order.
>>>>> Some regression tests like vect-outer-4f fails if we do load pair
>>>>> fusion with load offsets are not in offset order as this breaks lxvp 
>>>>> semantics.
>>>>
>>>> How does it break the semantics though?  In principle, the generic code
>>>> only fuses if it has "proved" that the loads can happen in either order.
>>>> So it shouldn't matter which order the hardware does things in.
>>>>
>>>> Could you give an example of the kind of situation that you want
>>>> to avoid, and why it generates the wrong result?
>>>>
>>>
>>> (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B] ])
>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>                 (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>>      (nil))
>>> (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B] ])
>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>                 (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>>      (nil))
>>> (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B] ])
>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>>      (nil))
>>> (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B] ])
>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>                 (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
>>>      (nil))
>>> (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B] ])
>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>                 (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
>>>      (nil))
>>>
>>> insn 22 and insn 31 is merged in the failure case and breaks the code.
>> 
>> What specifically goes wrong though?  This is just a sequence of loads
>> from the same base pointer, with no interdependencies, so it should be
>> possible to do the loads in any order.
>>
>
> Here in fuse_pair we set first and second based  as follows:
>
>   insn_info *first = (*i1 < *i2) ? i1 : i2;
>   insn_info *second = (first == i1) ? i2 : i1;
>
> This makes higher offset with first and lower offset with second.
> if (*i1 > *i2).
>
> and in set_multiword_subreg interface we pass first and second.
> Hence above code breaks because subreg offsets with 256 bits are not set properly.
>
> If we pass i1 and i2 in set_multiword_subreg (i1, i2, load_p)
> in fuse_pair should_handle_unordered_insns is not required in try_fuse_pair.
>
> I will send the patch by removing the above interface check 
> in try_fuse_pair and pass i1 and i2 in set_multiword_subreg

Thanks for looking into it.

I think it'd be better to resolve the unspec discussion before
posting another version of the patch though.

Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-06-03 11:33             ` Richard Sandiford
@ 2024-06-03 13:47               ` Ajit Agarwal
  2024-06-03 14:17                 ` Richard Sandiford
  0 siblings, 1 reply; 20+ messages in thread
From: Ajit Agarwal @ 2024-06-03 13:47 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 03/06/24 5:03 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> [...]
>>> If it is intentional, what distinguishes things like vperm and xxinsertw
>>> (and all other unspecs) from plain addition?
>>>
>>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>>
>>
>> Plain addition are not supported currently.
>> We have not seen many cases with plain addition and this patch
>> will not accept plain addition.
>>
>>  
>>> This is why the intention behind the patch is important.  As it stands,
>>> it isn't clear what criteria the patch is using to distinguish "valid"
>>> fuse candidates from "invalid" ones.
>>>
>>
>> Intention behind this patch all variants of UNSPEC instructions are
>> supported and uses without UNSPEC are not supported in this patch.
> 
> But why make the distinction this way though?  UNSPEC is a very
> GCC-specific concept.  Whether something is an UNSPEC or some other
> RTL code depends largely on historical accident.  E.g. we have specific
> codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
> for VEC_PERM (even for VEC_PERM_EXPR exists in gimple).
> 
> It seems unlikely that GCC's choice about whether to represent something
> as an UNSPEC or as another RTL code lines up neatly with the kind of
> codegen decisions that a good assembly programmer would make.
> 
> I suppose another way of asking is to turn this around and say: what
> kind of uses are you trying to exclude?  Presumably things are worse
> if you remove this function override.  But what makes them worse?
> What kind of uses cause the regression?
> 

Uses of fused load where load with low address uses are modified with load with high address uses.

Similarly load with high address uses are modified with load low address
uses.

This is the semantics of lxvp instructions which can occur through
UNSPEC uses otherwise it breaks the functionality and seen failure
in almost all vect regressions and SPEC benchmarks.


>>>>>>>> [...]
>>>>>>>> +  // Given insn_info pair I1 and I2, return true if offsets are in order.
>>>>>>>> +  virtual bool should_handle_unordered_insns (rtl_ssa::insn_info *i1,
>>>>>>>> +					      rtl_ssa::insn_info *i2) = 0;
>>>>>>>> +
>>>>>>>
>>>>>>> This name seems a bit misleading.  The function is used in:
>>>>>>>
>>>>>>> @@ -2401,6 +2405,9 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>>>>>>>        reversed = true;
>>>>>>>      }
>>>>>>>  
>>>>>>> +  if (!m_pass->should_handle_unordered_insns (i1, i2))
>>>>>>> +    return false;
>>>>>>> +
>>>>>>>    rtx cand_mems[2];
>>>>>>>    rtx reg_ops[2];
>>>>>>>    rtx pats[2];
>>>>>>>
>>>>>>> and so it acts as a general opt-out.  The insns aren't known to be unordered.
>>>>>>>
>>>>>>> It looks like the rs6000 override requires the original insns to be
>>>>>>> in offset order.  Could you say why that's necessary?  (Both in email
>>>>>>> and as a comment in the code.)
>>>>>>>
>>>>>>
>>>>>> Yes rs6000 requires the original load insns to be in offset order.
>>>>>> Some regression tests like vect-outer-4f fails if we do load pair
>>>>>> fusion with load offsets are not in offset order as this breaks lxvp 
>>>>>> semantics.
>>>>>
>>>>> How does it break the semantics though?  In principle, the generic code
>>>>> only fuses if it has "proved" that the loads can happen in either order.
>>>>> So it shouldn't matter which order the hardware does things in.
>>>>>
>>>>> Could you give an example of the kind of situation that you want
>>>>> to avoid, and why it generates the wrong result?
>>>>>
>>>>
>>>> (insn 31 62 32 2 (set (reg:V16QI 177 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B] ])
>>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>>                 (const_int 64 [0x40])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 64B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>>>      (nil))
>>>> (insn 32 31 16 2 (set (reg:V16QI 178 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B] ])
>>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>>                 (const_int 80 [0x50])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 80B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>>>      (nil))
>>>> (insn 16 32 21 2 (set (reg:V16QI 159 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B] ])
>>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 16B]+0 S16 A16]))  {vsx_movv16qi_64bit}
>>>>      (nil))
>>>> (insn 21 16 22 2 (set (reg:V16QI 165 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B] ])
>>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>>                 (const_int 32 [0x20])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 32B]+0 S16 A16])) {vsx_movv16qi_64bit}
>>>>      (nil))
>>>> (insn 22 21 37 2 (set (reg:V16QI 166 [ MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B] ])
>>>>         (mem:V16QI (plus:DI (reg/f:DI 121 [ vectp.62 ])
>>>>                 (const_int 48 [0x30])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)vectp.62_36 + 48B]+0 S16 A16])) {vsx_movv16qi_64bit}
>>>>      (nil))
>>>>
>>>> insn 22 and insn 31 is merged in the failure case and breaks the code.
>>>
>>> What specifically goes wrong though?  This is just a sequence of loads
>>> from the same base pointer, with no interdependencies, so it should be
>>> possible to do the loads in any order.
>>>
>>
>> Here in fuse_pair we set first and second based  as follows:
>>
>>   insn_info *first = (*i1 < *i2) ? i1 : i2;
>>   insn_info *second = (first == i1) ? i2 : i1;
>>
>> This makes higher offset with first and lower offset with second.
>> if (*i1 > *i2).
>>
>> and in set_multiword_subreg interface we pass first and second.
>> Hence above code breaks because subreg offsets with 256 bits are not set properly.
>>
>> If we pass i1 and i2 in set_multiword_subreg (i1, i2, load_p)
>> in fuse_pair should_handle_unordered_insns is not required in try_fuse_pair.
>>
>> I will send the patch by removing the above interface check 
>> in try_fuse_pair and pass i1 and i2 in set_multiword_subreg
> 
> Thanks for looking into it.
> 
> I think it'd be better to resolve the unspec discussion before
> posting another version of the patch though.
> 
Sure.
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-06-03 13:47               ` Ajit Agarwal
@ 2024-06-03 14:17                 ` Richard Sandiford
  2024-06-03 14:34                   ` Ajit Agarwal
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2024-06-03 14:17 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> On 03/06/24 5:03 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> [...]
>>>> If it is intentional, what distinguishes things like vperm and xxinsertw
>>>> (and all other unspecs) from plain addition?
>>>>
>>>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>>>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>>> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>>>
>>>
>>> Plain addition are not supported currently.
>>> We have not seen many cases with plain addition and this patch
>>> will not accept plain addition.
>>>
>>>  
>>>> This is why the intention behind the patch is important.  As it stands,
>>>> it isn't clear what criteria the patch is using to distinguish "valid"
>>>> fuse candidates from "invalid" ones.
>>>>
>>>
>>> Intention behind this patch all variants of UNSPEC instructions are
>>> supported and uses without UNSPEC are not supported in this patch.
>> 
>> But why make the distinction this way though?  UNSPEC is a very
>> GCC-specific concept.  Whether something is an UNSPEC or some other
>> RTL code depends largely on historical accident.  E.g. we have specific
>> codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
>> for VEC_PERM (even for VEC_PERM_EXPR exists in gimple).
>> 
>> It seems unlikely that GCC's choice about whether to represent something
>> as an UNSPEC or as another RTL code lines up neatly with the kind of
>> codegen decisions that a good assembly programmer would make.
>> 
>> I suppose another way of asking is to turn this around and say: what
>> kind of uses are you trying to exclude?  Presumably things are worse
>> if you remove this function override.  But what makes them worse?
>> What kind of uses cause the regression?
>> 
>
> Uses of fused load where load with low address uses are modified with load with high address uses.
>
> Similarly load with high address uses are modified with load low address
> uses.

It sounds like something is going wrong the subreg updates.
Can you give an example of where this occurs?  For instance...

> This is the semantics of lxvp instructions which can occur through
> UNSPEC uses otherwise it breaks the functionality and seen failure
> in almost all vect regressions and SPEC benchmarks.

...could you take one of the simpler vect regressions, show the before
and after RTL, and why the transformation is wrong?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-06-03 14:17                 ` Richard Sandiford
@ 2024-06-03 14:34                   ` Ajit Agarwal
  2024-06-03 14:54                     ` Richard Sandiford
  0 siblings, 1 reply; 20+ messages in thread
From: Ajit Agarwal @ 2024-06-03 14:34 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 03/06/24 7:47 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> On 03/06/24 5:03 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> [...]
>>>>> If it is intentional, what distinguishes things like vperm and xxinsertw
>>>>> (and all other unspecs) from plain addition?
>>>>>
>>>>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>>>>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>>>> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>>>>
>>>>
>>>> Plain addition are not supported currently.
>>>> We have not seen many cases with plain addition and this patch
>>>> will not accept plain addition.
>>>>
>>>>  
>>>>> This is why the intention behind the patch is important.  As it stands,
>>>>> it isn't clear what criteria the patch is using to distinguish "valid"
>>>>> fuse candidates from "invalid" ones.
>>>>>
>>>>
>>>> Intention behind this patch all variants of UNSPEC instructions are
>>>> supported and uses without UNSPEC are not supported in this patch.
>>>
>>> But why make the distinction this way though?  UNSPEC is a very
>>> GCC-specific concept.  Whether something is an UNSPEC or some other
>>> RTL code depends largely on historical accident.  E.g. we have specific
>>> codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
>>> for VEC_PERM (even for VEC_PERM_EXPR exists in gimple).
>>>
>>> It seems unlikely that GCC's choice about whether to represent something
>>> as an UNSPEC or as another RTL code lines up neatly with the kind of
>>> codegen decisions that a good assembly programmer would make.
>>>
>>> I suppose another way of asking is to turn this around and say: what
>>> kind of uses are you trying to exclude?  Presumably things are worse
>>> if you remove this function override.  But what makes them worse?
>>> What kind of uses cause the regression?
>>>
>>
>> Uses of fused load where load with low address uses are modified with load with high address uses.
>>
>> Similarly load with high address uses are modified with load low address
>> uses.
> 
> It sounds like something is going wrong the subreg updates.
> Can you give an example of where this occurs?  For instance...
> 
>> This is the semantics of lxvp instructions which can occur through
>> UNSPEC uses otherwise it breaks the functionality and seen failure
>> in almost all vect regressions and SPEC benchmarks.
> 
> ...could you take one of the simpler vect regressions, show the before
> and after RTL, and why the transformation is wrong?
>

Before the change:

(insn 32 30 103 5 (set (reg:V16QI 127 [ _32 ])
        (mem:V16QI (reg:DI 130 [ ivtmp.37 ]) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
     (nil))
(insn 103 32 135 5 (set (reg:V16QI 173 [ _32 ])
        (mem:V16QI (plus:DI (reg:DI 130 [ ivtmp.37 ])
                (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
     (nil))
(insn 135 103 34 5 (set (reg:DI 155)
        (plus:DI (reg:DI 130 [ ivtmp.37 ])
            (const_int 16 [0x10]))) 66 {*adddi3}
     (nil))
(insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
        (unspec:V16QI [
                (reg:V16QI 127 [ _32 ]) repeated x2
                (reg:V16QI 152)
            ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
     (expr_list:REG_DEAD (reg:V16QI 127 [ _32 ])
        (nil)))
(insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
        (unspec:V16QI [
                (reg:V16QI 173 [ _32 ]) repeated x2
                (reg:V16QI 152)
            ] UNSPEC_VPERM)) 
 {altivec_vperm_v16qi_direct}


After the change:

(insn 103 30 135 5 (set (reg:OO 127 [ _32 ])
        (mem:OO (reg:DI 130 [ ivtmp.37 ]) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {*movoo}
     (nil))
(insn 135 103 34 5 (set (reg:DI 155)
        (plus:DI (reg:DI 130 [ ivtmp.37 ])
            (const_int 16 [0x10]))) 66 {*adddi3}
     (nil))
(insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
        (unspec:V16QI [
                (subreg:V16QI (reg:OO 127 [ _32 ]) 16)
                (subreg:V16QI (reg:OO 127 [ _32 ]) 16)
                (reg:V16QI 152)
            ] UNSPEC_VPERM)) {altivec_vperm_v16qi_direct}
     (expr_list:REG_DEAD (reg:OO 127 [ _32 ])
        (nil)))
(insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
        (unspec:V16QI [
                (subreg:V16QI (reg:OO 127 [ _32 ]) 0)
                (subreg:V16QI (reg:OO 127 [ _32 ]) 0)
                (reg:V16QI 152)
            ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}

After the change the tests passes.
 
> Thanks,
> Richard

Thanks & Regards
Ajit

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-06-03 14:34                   ` Ajit Agarwal
@ 2024-06-03 14:54                     ` Richard Sandiford
  2024-06-03 15:58                       ` Ajit Agarwal
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Sandiford @ 2024-06-03 14:54 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello Richard:
>
> On 03/06/24 7:47 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> On 03/06/24 5:03 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> [...]
>>>>>> If it is intentional, what distinguishes things like vperm and xxinsertw
>>>>>> (and all other unspecs) from plain addition?
>>>>>>
>>>>>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>>>>>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>>>>> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>>>>>
>>>>>
>>>>> Plain addition are not supported currently.
>>>>> We have not seen many cases with plain addition and this patch
>>>>> will not accept plain addition.
>>>>>
>>>>>  
>>>>>> This is why the intention behind the patch is important.  As it stands,
>>>>>> it isn't clear what criteria the patch is using to distinguish "valid"
>>>>>> fuse candidates from "invalid" ones.
>>>>>>
>>>>>
>>>>> Intention behind this patch all variants of UNSPEC instructions are
>>>>> supported and uses without UNSPEC are not supported in this patch.
>>>>
>>>> But why make the distinction this way though?  UNSPEC is a very
>>>> GCC-specific concept.  Whether something is an UNSPEC or some other
>>>> RTL code depends largely on historical accident.  E.g. we have specific
>>>> codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
>>>> for VEC_PERM (even for VEC_PERM_EXPR exists in gimple).
>>>>
>>>> It seems unlikely that GCC's choice about whether to represent something
>>>> as an UNSPEC or as another RTL code lines up neatly with the kind of
>>>> codegen decisions that a good assembly programmer would make.
>>>>
>>>> I suppose another way of asking is to turn this around and say: what
>>>> kind of uses are you trying to exclude?  Presumably things are worse
>>>> if you remove this function override.  But what makes them worse?
>>>> What kind of uses cause the regression?
>>>>
>>>
>>> Uses of fused load where load with low address uses are modified with load with high address uses.
>>>
>>> Similarly load with high address uses are modified with load low address
>>> uses.
>> 
>> It sounds like something is going wrong the subreg updates.
>> Can you give an example of where this occurs?  For instance...
>> 
>>> This is the semantics of lxvp instructions which can occur through
>>> UNSPEC uses otherwise it breaks the functionality and seen failure
>>> in almost all vect regressions and SPEC benchmarks.
>> 
>> ...could you take one of the simpler vect regressions, show the before
>> and after RTL, and why the transformation is wrong?
>
> Before the change:
>
> (insn 32 30 103 5 (set (reg:V16QI 127 [ _32 ])
>         (mem:V16QI (reg:DI 130 [ ivtmp.37 ]) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
>      (nil))
> (insn 103 32 135 5 (set (reg:V16QI 173 [ _32 ])
>         (mem:V16QI (plus:DI (reg:DI 130 [ ivtmp.37 ])
>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
>      (nil))
> (insn 135 103 34 5 (set (reg:DI 155)
>         (plus:DI (reg:DI 130 [ ivtmp.37 ])
>             (const_int 16 [0x10]))) 66 {*adddi3}
>      (nil))
> (insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
>         (unspec:V16QI [
>                 (reg:V16QI 127 [ _32 ]) repeated x2
>                 (reg:V16QI 152)
>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>      (expr_list:REG_DEAD (reg:V16QI 127 [ _32 ])
>         (nil)))
> (insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
>         (unspec:V16QI [
>                 (reg:V16QI 173 [ _32 ]) repeated x2
>                 (reg:V16QI 152)
>             ] UNSPEC_VPERM)) 
>  {altivec_vperm_v16qi_direct}
>
>
> After the change:
>
> (insn 103 30 135 5 (set (reg:OO 127 [ _32 ])
>         (mem:OO (reg:DI 130 [ ivtmp.37 ]) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {*movoo}
>      (nil))
> (insn 135 103 34 5 (set (reg:DI 155)
>         (plus:DI (reg:DI 130 [ ivtmp.37 ])
>             (const_int 16 [0x10]))) 66 {*adddi3}
>      (nil))
> (insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
>         (unspec:V16QI [
>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 16)
>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 16)
>                 (reg:V16QI 152)
>             ] UNSPEC_VPERM)) {altivec_vperm_v16qi_direct}
>      (expr_list:REG_DEAD (reg:OO 127 [ _32 ])
>         (nil)))
> (insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
>         (unspec:V16QI [
>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 0)
>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 0)
>                 (reg:V16QI 152)
>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>
> After the change the tests passes.

But isn't this an example of the optimisation working on unspecs,
and working correctly?

I meant instead: could you give an example of the vect regressions
that you saw with the unspec test removed?  You mentioned that many
vect tests regressed without the unspec test, so it would be helpful
to see these failures in action.  That is, it'd be helpful to take
a compiler that doesn't have the unspec tests and show:

- the relevant rtl of one of the failing tests before the pass runs
  (when the rtl is still correct)

- the relevant rtl of one of the failing tests after the pass runs
  (when the rtl is now incorrect)

- the reason why the rtl after the pass is wrong

Thanks,
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-06-03 14:54                     ` Richard Sandiford
@ 2024-06-03 15:58                       ` Ajit Agarwal
  2024-06-04 16:00                         ` Ajit Agarwal
  0 siblings, 1 reply; 20+ messages in thread
From: Ajit Agarwal @ 2024-06-03 15:58 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 03/06/24 8:24 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>>
>> On 03/06/24 7:47 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> On 03/06/24 5:03 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>> [...]
>>>>>>> If it is intentional, what distinguishes things like vperm and xxinsertw
>>>>>>> (and all other unspecs) from plain addition?
>>>>>>>
>>>>>>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>>>>>>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>>>>>> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>>>>>>
>>>>>>
>>>>>> Plain addition are not supported currently.
>>>>>> We have not seen many cases with plain addition and this patch
>>>>>> will not accept plain addition.
>>>>>>
>>>>>>  
>>>>>>> This is why the intention behind the patch is important.  As it stands,
>>>>>>> it isn't clear what criteria the patch is using to distinguish "valid"
>>>>>>> fuse candidates from "invalid" ones.
>>>>>>>
>>>>>>
>>>>>> Intention behind this patch all variants of UNSPEC instructions are
>>>>>> supported and uses without UNSPEC are not supported in this patch.
>>>>>
>>>>> But why make the distinction this way though?  UNSPEC is a very
>>>>> GCC-specific concept.  Whether something is an UNSPEC or some other
>>>>> RTL code depends largely on historical accident.  E.g. we have specific
>>>>> codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
>>>>> for VEC_PERM (even for VEC_PERM_EXPR exists in gimple).
>>>>>
>>>>> It seems unlikely that GCC's choice about whether to represent something
>>>>> as an UNSPEC or as another RTL code lines up neatly with the kind of
>>>>> codegen decisions that a good assembly programmer would make.
>>>>>
>>>>> I suppose another way of asking is to turn this around and say: what
>>>>> kind of uses are you trying to exclude?  Presumably things are worse
>>>>> if you remove this function override.  But what makes them worse?
>>>>> What kind of uses cause the regression?
>>>>>
>>>>
>>>> Uses of fused load where load with low address uses are modified with load with high address uses.
>>>>
>>>> Similarly load with high address uses are modified with load low address
>>>> uses.
>>>
>>> It sounds like something is going wrong the subreg updates.
>>> Can you give an example of where this occurs?  For instance...
>>>
>>>> This is the semantics of lxvp instructions which can occur through
>>>> UNSPEC uses otherwise it breaks the functionality and seen failure
>>>> in almost all vect regressions and SPEC benchmarks.
>>>
>>> ...could you take one of the simpler vect regressions, show the before
>>> and after RTL, and why the transformation is wrong?
>>
>> Before the change:
>>
>> (insn 32 30 103 5 (set (reg:V16QI 127 [ _32 ])
>>         (mem:V16QI (reg:DI 130 [ ivtmp.37 ]) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 103 32 135 5 (set (reg:V16QI 173 [ _32 ])
>>         (mem:V16QI (plus:DI (reg:DI 130 [ ivtmp.37 ])
>>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
>>      (nil))
>> (insn 135 103 34 5 (set (reg:DI 155)
>>         (plus:DI (reg:DI 130 [ ivtmp.37 ])
>>             (const_int 16 [0x10]))) 66 {*adddi3}
>>      (nil))
>> (insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
>>         (unspec:V16QI [
>>                 (reg:V16QI 127 [ _32 ]) repeated x2
>>                 (reg:V16QI 152)
>>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>>      (expr_list:REG_DEAD (reg:V16QI 127 [ _32 ])
>>         (nil)))
>> (insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
>>         (unspec:V16QI [
>>                 (reg:V16QI 173 [ _32 ]) repeated x2
>>                 (reg:V16QI 152)
>>             ] UNSPEC_VPERM)) 
>>  {altivec_vperm_v16qi_direct}
>>
>>
>> After the change:
>>
>> (insn 103 30 135 5 (set (reg:OO 127 [ _32 ])
>>         (mem:OO (reg:DI 130 [ ivtmp.37 ]) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {*movoo}
>>      (nil))
>> (insn 135 103 34 5 (set (reg:DI 155)
>>         (plus:DI (reg:DI 130 [ ivtmp.37 ])
>>             (const_int 16 [0x10]))) 66 {*adddi3}
>>      (nil))
>> (insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
>>         (unspec:V16QI [
>>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 16)
>>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 16)
>>                 (reg:V16QI 152)
>>             ] UNSPEC_VPERM)) {altivec_vperm_v16qi_direct}
>>      (expr_list:REG_DEAD (reg:OO 127 [ _32 ])
>>         (nil)))
>> (insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
>>         (unspec:V16QI [
>>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 0)
>>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 0)
>>                 (reg:V16QI 152)
>>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>>
>> After the change the tests passes.
> 
> But isn't this an example of the optimisation working on unspecs,
> and working correctly?
> 

Yes this is working fine.

> I meant instead: could you give an example of the vect regressions
> that you saw with the unspec test removed?  You mentioned that many
> vect tests regressed without the unspec test, so it would be helpful
> to see these failures in action.  That is, it'd be helpful to take
> a compiler that doesn't have the unspec tests and show:
> 
> - the relevant rtl of one of the failing tests before the pass runs
>   (when the rtl is still correct)
> 
> - the relevant rtl of one of the failing tests after the pass runs
>   (when the rtl is now incorrect)
> 
> - the reason why the rtl after the pass is wrong
>

I meant to say this is the semantics of lxvp instructions which can occur through
UNSPEC uses. If we dont use above semantics in UNSPEC the vect regressions and spec 
fails functionality.

I will find a test without UNSPEC and let you know.

Thanks & Regards
Ajit
 
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion
  2024-06-03 15:58                       ` Ajit Agarwal
@ 2024-06-04 16:00                         ` Ajit Agarwal
  0 siblings, 0 replies; 20+ messages in thread
From: Ajit Agarwal @ 2024-06-04 16:00 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 03/06/24 9:28 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 03/06/24 8:24 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> Hello Richard:
>>>
>>> On 03/06/24 7:47 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> On 03/06/24 5:03 pm, Richard Sandiford wrote:
>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>> [...]
>>>>>>>> If it is intentional, what distinguishes things like vperm and xxinsertw
>>>>>>>> (and all other unspecs) from plain addition?
>>>>>>>>
>>>>>>>>   [(set (match_operand:VSX_F 0 "vsx_register_operand" "=wa")
>>>>>>>>         (plus:VSX_F (match_operand:VSX_F 1 "vsx_register_operand" "wa")
>>>>>>>> 		    (match_operand:VSX_F 2 "vsx_register_operand" "wa")))]
>>>>>>>>
>>>>>>>
>>>>>>> Plain addition are not supported currently.
>>>>>>> We have not seen many cases with plain addition and this patch
>>>>>>> will not accept plain addition.
>>>>>>>
>>>>>>>  
>>>>>>>> This is why the intention behind the patch is important.  As it stands,
>>>>>>>> it isn't clear what criteria the patch is using to distinguish "valid"
>>>>>>>> fuse candidates from "invalid" ones.
>>>>>>>>
>>>>>>>
>>>>>>> Intention behind this patch all variants of UNSPEC instructions are
>>>>>>> supported and uses without UNSPEC are not supported in this patch.
>>>>>>
>>>>>> But why make the distinction this way though?  UNSPEC is a very
>>>>>> GCC-specific concept.  Whether something is an UNSPEC or some other
>>>>>> RTL code depends largely on historical accident.  E.g. we have specific
>>>>>> codes for VEC_SELECT, VEC_MERGE, and VEC_DUPLICATE, but don't have one
>>>>>> for VEC_PERM (even for VEC_PERM_EXPR exists in gimple).
>>>>>>
>>>>>> It seems unlikely that GCC's choice about whether to represent something
>>>>>> as an UNSPEC or as another RTL code lines up neatly with the kind of
>>>>>> codegen decisions that a good assembly programmer would make.
>>>>>>
>>>>>> I suppose another way of asking is to turn this around and say: what
>>>>>> kind of uses are you trying to exclude?  Presumably things are worse
>>>>>> if you remove this function override.  But what makes them worse?
>>>>>> What kind of uses cause the regression?
>>>>>>
>>>>>
>>>>> Uses of fused load where load with low address uses are modified with load with high address uses.
>>>>>
>>>>> Similarly load with high address uses are modified with load low address
>>>>> uses.
>>>>
>>>> It sounds like something is going wrong the subreg updates.
>>>> Can you give an example of where this occurs?  For instance...
>>>>
>>>>> This is the semantics of lxvp instructions which can occur through
>>>>> UNSPEC uses otherwise it breaks the functionality and seen failure
>>>>> in almost all vect regressions and SPEC benchmarks.
>>>>
>>>> ...could you take one of the simpler vect regressions, show the before
>>>> and after RTL, and why the transformation is wrong?
>>>
>>> Before the change:
>>>
>>> (insn 32 30 103 5 (set (reg:V16QI 127 [ _32 ])
>>>         (mem:V16QI (reg:DI 130 [ ivtmp.37 ]) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
>>>      (nil))
>>> (insn 103 32 135 5 (set (reg:V16QI 173 [ _32 ])
>>>         (mem:V16QI (plus:DI (reg:DI 130 [ ivtmp.37 ])
>>>                 (const_int 16 [0x10])) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {vsx_movv16qi_64bit}
>>>      (nil))
>>> (insn 135 103 34 5 (set (reg:DI 155)
>>>         (plus:DI (reg:DI 130 [ ivtmp.37 ])
>>>             (const_int 16 [0x10]))) 66 {*adddi3}
>>>      (nil))
>>> (insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
>>>         (unspec:V16QI [
>>>                 (reg:V16QI 127 [ _32 ]) repeated x2
>>>                 (reg:V16QI 152)
>>>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>>>      (expr_list:REG_DEAD (reg:V16QI 127 [ _32 ])
>>>         (nil)))
>>> (insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
>>>         (unspec:V16QI [
>>>                 (reg:V16QI 173 [ _32 ]) repeated x2
>>>                 (reg:V16QI 152)
>>>             ] UNSPEC_VPERM)) 
>>>  {altivec_vperm_v16qi_direct}
>>>
>>>
>>> After the change:
>>>
>>> (insn 103 30 135 5 (set (reg:OO 127 [ _32 ])
>>>         (mem:OO (reg:DI 130 [ ivtmp.37 ]) [1 MEM <vector(8) short unsigned int> [(short unsigned int *)_55]+0 S16 A128])) {*movoo}
>>>      (nil))
>>> (insn 135 103 34 5 (set (reg:DI 155)
>>>         (plus:DI (reg:DI 130 [ ivtmp.37 ])
>>>             (const_int 16 [0x10]))) 66 {*adddi3}
>>>      (nil))
>>> (insn 34 135 104 5 (set (reg:V16QI 143 [ _27 ])
>>>         (unspec:V16QI [
>>>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 16)
>>>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 16)
>>>                 (reg:V16QI 152)
>>>             ] UNSPEC_VPERM)) {altivec_vperm_v16qi_direct}
>>>      (expr_list:REG_DEAD (reg:OO 127 [ _32 ])
>>>         (nil)))
>>> (insn 104 34 35 5 (set (reg:V16QI 174 [ _27 ])
>>>         (unspec:V16QI [
>>>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 0)
>>>                 (subreg:V16QI (reg:OO 127 [ _32 ]) 0)
>>>                 (reg:V16QI 152)
>>>             ] UNSPEC_VPERM))  {altivec_vperm_v16qi_direct}
>>>
>>> After the change the tests passes.
>>
>> But isn't this an example of the optimisation working on unspecs,
>> and working correctly?
>>
> 
> Yes this is working fine.
> 
>> I meant instead: could you give an example of the vect regressions
>> that you saw with the unspec test removed?  You mentioned that many
>> vect tests regressed without the unspec test, so it would be helpful
>> to see these failures in action.  That is, it'd be helpful to take
>> a compiler that doesn't have the unspec tests and show:
>>
>> - the relevant rtl of one of the failing tests before the pass runs
>>   (when the rtl is still correct)
>>
>> - the relevant rtl of one of the failing tests after the pass runs
>>   (when the rtl is now incorrect)
>>
>> - the reason why the rtl after the pass is wrong
>>
> 
> I meant to say this is the semantics of lxvp instructions which can occur through
> UNSPEC uses. If we dont use above semantics in UNSPEC the vect regressions and spec 
> fails functionality.
> 
> I will find a test without UNSPEC and let you know.
>

I have fixed all the issues with all RTL uses of fused load other than UNSPEC.
With the fixes all RTL codes uses of fused load are supported along with
UNSPEC.

Thanks for suggesting to use all RTL for fused load along with UNSPEC.

I will separate patch with all the fixes soon.

Thanks & Regards
Ajit 
> Thanks & Regards
> Ajit
>  
>> Thanks,
>> Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-06-04 16:01 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-30 19:51 [Patch, rs6000, aarch64, middle-end] Add implementation for different targets for pair mem fusion Ajit Agarwal
2024-05-30 21:34 ` Segher Boessenkool
2024-05-31  8:14   ` Richard Sandiford
2024-05-31 14:19     ` Segher Boessenkool
2024-05-31  9:53 ` Richard Sandiford
2024-05-31 10:28   ` Richard Sandiford
2024-05-31 13:54   ` Ajit Agarwal
2024-05-31 14:38     ` Richard Sandiford
2024-05-31 16:59       ` Ajit Agarwal
2024-06-02  5:52         ` Ajit Agarwal
2024-06-03  8:37         ` Richard Sandiford
2024-06-03 11:05           ` Ajit Agarwal
2024-06-03 11:33             ` Richard Sandiford
2024-06-03 13:47               ` Ajit Agarwal
2024-06-03 14:17                 ` Richard Sandiford
2024-06-03 14:34                   ` Ajit Agarwal
2024-06-03 14:54                     ` Richard Sandiford
2024-06-03 15:58                       ` Ajit Agarwal
2024-06-04 16:00                         ` Ajit Agarwal
2024-06-02 13:16   ` Ajit Agarwal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).