public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
@ 2024-06-06  5:55 Ajit Agarwal
  2024-06-06  8:58 ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-06  5:55 UTC (permalink / raw)
  To: Alex Coplan, Richard Sandiford, Kewen.Lin, Segher Boessenkool,
	Michael Meissner, Peter Bergner, David Edelsohn, gcc-patches

Hello All:

All comments are addressed.

Common infrastructure using generic code for pair mem fusion of different
targets.

rs6000 target specific code implement virtual functions defined by generic code.

Target specific code are added in rs6000-mem-fusion.cc.

Tested for powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000, middle-end: Add implementation for different targets for pair mem fusion

Common infrastructure using generic code for pair mem fusion of different
targets.

rs6000 target specific code implement virtual functions defined by generic code.

Target specific code are added in rs6000-mem-fusion.cc.

2024-06-06  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>

gcc/ChangeLog:

	* config/rs6000/rs6000-passes.def: New mem fusion pass
	before pass_early_remat.
	* pair-fusion.h: Add additional pure virtual function
	required for rs6000 target implementation.
	* pair-fusion.cc: Use of virtual functions for additional
	virtual function addded for rs6000 target.
	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
	Add target specific implementation for generic pure virtual
	functions.
	* config.gcc: Add new object file.
	* config/rs6000/rs6000-protos.h: Add new prototype for mem
	fusion pass.
	* config/rs6000/t-rs6000: Add new rule.
	* rtl-ssa/accesses.h: Moved set_is_live_out_use as public
	from private.

gcc/testsuite/ChangeLog:

	* g++.target/powerpc/mem-fusion.C: New test.
	* g++.target/powerpc/mem-fusion-1.C: New test.
	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc                                |   2 +
 gcc/config/rs6000/rs6000-mem-fusion.cc        | 645 ++++++++++++++++++
 gcc/config/rs6000/rs6000-passes.def           |   4 +-
 gcc/config/rs6000/rs6000-protos.h             |   1 +
 gcc/config/rs6000/t-rs6000                    |   5 +
 gcc/pair-fusion.cc                            |  23 +-
 gcc/pair-fusion.h                             |  16 +
 gcc/rtl-ssa/accesses.h                        |   4 +-
 .../g++.target/powerpc/mem-fusion-1.C         |  22 +
 gcc/testsuite/g++.target/powerpc/mem-fusion.C |  15 +
 .../gcc.target/powerpc/mma-builtin-1.c        |   4 +-
 11 files changed, 731 insertions(+), 10 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-mem-fusion.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index e500ba63e32..348308b2e93 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -524,6 +524,7 @@ powerpc*-*-*)
 	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
+	extra_objs="${extra_objs} rs6000-mem-fusion.o"
 	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
 	extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
 	extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
@@ -560,6 +561,7 @@ rs6000*-*-*)
 	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
 	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
+	extra_objs="${extra_objs} rs6000-mem-fusion.o"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
 	;;
diff --git a/gcc/config/rs6000/rs6000-mem-fusion.cc b/gcc/config/rs6000/rs6000-mem-fusion.cc
new file mode 100644
index 00000000000..bdc08062534
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-mem-fusion.cc
@@ -0,0 +1,645 @@
+/* Subroutines used to perform adjacent load/store into
+   paired memory accesses for TARGET_POWER10 and TARGET_VSX.
+
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "df.h"
+#include "rtl.h"
+#include "rtl-iter.h"
+#include "rtl-ssa.h"
+#include "tree-pass.h"
+#include "ordered-hash-map.h"
+#include "pair-fusion.h"
+
+using namespace rtl_ssa;
+
+struct rs6000_pair_fusion : public pair_fusion
+{
+  bool fpsimd_op_p (rtx , machine_mode , bool)  override final
+  {
+    return false;
+  }
+
+  bool pair_mem_insn_p (rtx_insn *, bool &) override final
+  {
+    return false;
+  }
+
+  bool pair_mem_ok_with_policy (rtx, bool) override final
+  {
+    return true;
+  }
+
+  bool pair_operand_mode_ok_p (machine_mode mode) override final;
+
+  rtx gen_pair (rtx *pats, rtx writeback, bool load_p) override final;
+
+  bool pair_reg_operand_ok_p (bool, rtx, machine_mode) override final
+  {
+    return true;
+  }
+
+  int pair_mem_alias_check_limit () override final
+  {
+    return 0;
+  }
+
+  bool should_handle_writeback (enum writeback_type) override final
+  {
+    return false;
+  }
+
+  bool track_loads_p () override final
+  {
+    return true;
+  }
+
+  bool track_stores_p () override final
+  {
+    return true;
+  }
+
+  bool pair_mem_in_range_p (HOST_WIDE_INT) override final
+  {
+    return true;
+  }
+
+  rtx gen_promote_writeback_pair (rtx, rtx, rtx *, bool) override final
+  {
+    return NULL_RTX;
+  }
+
+  rtx destructure_pair (rtx_def **, rtx, bool) override final
+  {
+    return NULL_RTX;
+  }
+
+  bool fuseable_store_p (insn_info *i1, insn_info *i2) override final;
+
+  bool fuseable_load_p (insn_info *insn) override final;
+
+  void set_multiword_subreg (insn_info *i1, insn_info *i2,
+			     bool load_p) override final;
+};
+
+bool
+rs6000_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
+{
+  return (ALTIVEC_OR_VSX_VECTOR_MODE (mode));
+
+}
+
+// df_insn_rescan dependent instruction where operands
+// are reversed given insn_info INFO.
+static void
+set_rescan_load (insn_info *info)
+{
+  df_ref use;
+  df_insn_info *iinfo = DF_INSN_INFO_GET (info->rtl ());
+  FOR_EACH_INSN_INFO_DEF (use, iinfo)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+
+      if (!def_link || !def_link->ref
+	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+
+      while (def_link && def_link->ref)
+	{
+	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
+	  rtx set = single_set (insn);
+	  if (set == NULL_RTX)
+	    return;
+	  df_insn_rescan (insn);
+	  def_link = def_link->next;
+	}
+     }
+}
+
+// df_insn_rescan the def instruction where operands are reversed given INSN.
+static bool
+set_rescan_store (insn_info *insn)
+{
+  for (auto use : insn->uses())
+    {
+      auto def = use->def ();
+
+      if (def->insn ()->is_artificial())
+	return false;
+
+      if (def->insn () && def->insn ()->rtl ()
+	  && def->insn()->is_real() )
+	{
+	  rtx_insn *rtl_insn = def->insn ()->rtl ();
+	  rtx set = single_set (rtl_insn);
+
+	  if (set == NULL_RTX)
+	    return false;
+	  df_insn_rescan (rtl_insn);
+	}
+    }
+
+  return true;
+}
+
+ordered_hash_map<rtx_insn *, bool> insn_map;
+
+// Return false if dependent def is load instruction given INSN otherwise
+// false.
+static bool
+feasible_store_p (rtx_insn *insn, bool immediate_dep)
+{
+  df_ref use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+
+  FOR_EACH_INSN_INFO_USE (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+
+      if (!def_link || !def_link->ref
+	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+
+      rtx_insn *select_insn2 = DF_REF_INSN (def_link->ref);
+
+      if (select_insn2 == NULL)
+	continue;
+
+      if (select_insn2 == insn)
+	return true;
+
+      while (def_link && def_link->ref)
+	{
+	  rtx set = single_set (select_insn2);
+	  rtx insn_set = single_set (insn);
+
+	  if (set != NULL_RTX && insn_set != NULL_RTX)
+	    {
+	      if (GET_MODE (SET_SRC (set)) != GET_MODE (SET_SRC (insn_set)))
+		{
+		  if (GET_MODE (SET_SRC (set)) == OOmode)
+		    return false;
+
+		  immediate_dep = false;
+		}
+		else
+		  {
+		    if (immediate_dep && MEM_P (SET_SRC (set)))
+		      return false;
+		  }
+
+		if (insn_map.get (select_insn2))
+		  return true;
+		else
+		  insn_map.put (select_insn2, true);
+
+		if (!feasible_store_p (select_insn2, immediate_dep))
+		  return false;
+	    }
+	  def_link = def_link->next;
+	}
+     }
+  return true;
+}
+
+// Check for feasibility of store to be fuseable or not. Return true if
+// feasible otherwise false.
+static bool
+feasible_store_p (insn_info *insn)
+{
+  for (auto use : insn->uses ())
+    {
+      auto def = use->def ();
+
+      if (def->insn ()->is_artificial ())
+	return false;
+
+      if (def->insn () && def->insn ()->rtl ()
+	  && def->insn()->is_real ())
+	{
+	  rtx_insn *rtl_insn = def->insn ()->rtl ();
+	  rtx set = single_set (rtl_insn);
+
+	  if (set == NULL_RTX)
+	    return false;
+
+	  // Return false if dependent def is load.
+	  if (rtl_insn && MEM_P (SET_SRC (set)))
+	    return false;
+
+	  // Return false if dependent def is store.
+	  if (rtl_insn && MEM_P (SET_DEST (set)))
+	    return false;
+
+	  // Return false if dependent def is parallel.
+	  if (GET_CODE (PATTERN (rtl_insn)) == PARALLEL)
+	    return false;
+
+	  rtx src = SET_SRC (set);
+	  rtx_code code = GET_CODE (src);
+
+	  // Return false if dependent def is CONST_VECTOR or UNSPEC.
+	  if (code == CONST_VECTOR || code == UNSPEC)
+	    return false;
+
+	  // Recursively check for dependent instruction is Load.
+	  if (!feasible_store_p (rtl_insn, true))
+	    return false;
+
+	  // Return false if source operand of dependent def instruction
+	  //  has OOmode set through load.
+	  if (GET_RTX_CLASS (code) == RTX_BIN_ARITH
+	      || GET_RTX_CLASS (code) == RTX_COMM_ARITH)
+	    {
+	      rtx src_operand0 = XEXP (src, 0);
+	      if (GET_CODE (src_operand0) == SUBREG
+		  && GET_MODE (SUBREG_REG (src_operand0)) == OOmode)
+		return false;
+	     }
+
+	  if (GET_RTX_CLASS (code) == RTX_TERNARY)
+	    return false;
+	}
+    }
+  return true;
+}
+
+// Check if store can be fuseable or not.  Return true if fuseable otherwise
+// false.
+bool
+rs6000_pair_fusion::fuseable_store_p (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl ();
+  rtx_insn *insn2 = i2->rtl ();
+
+  rtx body = PATTERN (insn1);
+  rtx src_exp = SET_SRC (body);
+  rtx insn2_body = PATTERN (insn2);
+  rtx insn2_src_exp = SET_SRC (insn2_body);
+
+  // Return false if def and use count are not same.
+  if (REG_P (src_exp) &&
+      (DF_REG_DEF_COUNT (REGNO (src_exp)) != DF_REG_USE_COUNT (REGNO (src_exp))
+       || DF_REG_USE_COUNT (REGNO (src_exp)) > 1))
+    return false;
+
+  // Return false if src of insn1 and src of ins2 are same.
+  if (src_exp == insn2_src_exp)
+    return false;
+
+  // Return false if src of insn1 is subreg.
+  if (GET_CODE (src_exp) == SUBREG)
+    return false;
+
+  // Return false if src of insn1 is TImode or TFmode.
+  if (GET_MODE (src_exp) == TImode || GET_MODE (src_exp) == TFmode)
+    return false;
+
+  if (!feasible_store_p (i1))
+    return false;;
+
+  if (!feasible_store_p (i2))
+    return false;
+
+  return true;
+}
+
+// Set subreg for def of store INSN given rtx SRC instruction.
+static void
+set_store_subreg (rtx_insn *insn, rtx src)
+{
+  rtx set = single_set (insn);
+  rtx src_exp = SET_SRC (set);
+  df_ref use;
+
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  FOR_EACH_INSN_INFO_USE (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+
+      if (!def_link || !def_link->ref
+	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+
+      while (def_link && def_link->ref)
+	{
+	  rtx *loc = DF_REF_LOC (def_link->ref);
+
+	  if (GET_MODE (*loc) == GET_MODE (src_exp))
+	      *loc = copy_rtx (src);
+
+	  def_link = def_link->next;
+	}
+    }
+}
+
+// Generate store pair stxvp given rtx I1.
+static rtx
+rs6000_gen_store_pair (rtx i1)
+{
+  rtx src_exp = SET_SRC (i1);
+  rtx dest_exp = SET_DEST (i1);
+  rtx stxv;
+  PUT_MODE_RAW (src_exp, OOmode);
+  PUT_MODE_RAW (dest_exp, OOmode);
+  stxv = gen_rtx_SET (dest_exp, src_exp);
+  if (dump_file)
+    {
+      fprintf (dump_file, "Replacing stxv with stxvp  \n");
+      print_rtl_single (dump_file, stxv);
+    }
+  return stxv;
+}
+
+// Check whether load can be fusable or not.
+// Return true if fuseable otherwise false.
+bool
+rs6000_pair_fusion::fuseable_load_p (insn_info *info)
+{
+  for (auto def : info->defs())
+    {
+      auto set = dyn_cast<set_info *> (def);
+      for (auto use1 : set->nondebug_insn_uses ())
+	use1->set_is_live_out_use (true);
+    }
+
+  rtx_insn *rtl_insn = info ->rtl ();
+  rtx body = PATTERN (rtl_insn);
+  rtx dest_exp = SET_DEST (body);
+
+  if (REG_P (dest_exp) &&
+      (DF_REG_DEF_COUNT (REGNO (dest_exp)) > 1
+       || DF_REG_EQ_USE_COUNT (REGNO (dest_exp)) > 0))
+    return  false;
+
+  rtx addr = XEXP (SET_SRC (body), 0);
+
+  if (GET_CODE (addr) == PLUS
+      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
+    {
+      if (INTVAL (XEXP (addr, 1)) == -16)
+	return false;
+  }
+
+  df_ref use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+
+      if (!def_link || !def_link->ref
+	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+
+      while (def_link && def_link->ref)
+	{
+	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
+	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
+	    return false;
+
+	  rtx set = single_set (insn);
+	  if (set == NULL_RTX)
+	    return false;
+
+	  rtx op0 = SET_SRC (set);
+	  rtx_code code = GET_CODE (op0);
+
+	  // This check is added as register pairs are not generated
+	  // by RA for neg:V2DF (fma: V2DF (reg1)
+	  //                  (reg2)
+	  //                  (neg:V2DF (reg3)))
+	  if (GET_RTX_CLASS (code) == RTX_UNARY)
+	    return false;
+
+	  def_link = def_link->next;
+	}
+     }
+  return true;
+}
+
+// Set subreg with use of INSN given SRC rtx instruction.
+static void
+set_load_subreg (rtx_insn *insn, rtx src)
+{
+  df_ref use;
+  df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  FOR_EACH_INSN_INFO_DEF (use, insn_info)
+    {
+      struct df_link *def_link = DF_REF_CHAIN (use);
+
+      if (!def_link || !def_link->ref
+	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
+	continue;
+
+      while (def_link && def_link->ref)
+	{
+	  rtx *loc = DF_REF_LOC (def_link->ref);
+	  *loc =  copy_rtx (src);
+	  def_link = def_link->next;
+	}
+     }
+}
+
+// Set subreg for OO mode store pair to generate registers in pairs
+// given insn_info I1 and I2.
+static void
+set_multiword_subreg_store (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl ();
+  rtx_insn *insn2 = i2->rtl ();
+  rtx body = PATTERN (insn1);
+  rtx src_exp = SET_SRC (body);
+  rtx insn2_body = PATTERN (insn2);
+  rtx insn2_dest_exp = SET_DEST (insn2_body);
+  machine_mode mode = GET_MODE (src_exp);
+  int regoff;
+  rtx src;
+  rtx addr = XEXP (insn2_dest_exp, 0);
+
+  PUT_MODE_RAW (src_exp, OOmode);
+  if (GET_CODE (addr) == PLUS
+      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
+    regoff = 16;
+  else
+    regoff = 0;
+
+  src = simplify_gen_subreg (mode,
+			     src_exp, GET_MODE (src_exp),
+			     regoff);
+
+  set_store_subreg (insn1, src);
+
+  int regoff1 = 0;
+  rtx src1;
+
+  src1 = simplify_gen_subreg (mode,
+			      src_exp, GET_MODE (src_exp),
+			      regoff1);
+
+  set_store_subreg (insn2, src1);
+  set_rescan_store (i1);
+  set_rescan_store (i2);
+  df_insn_rescan (insn1);
+}
+
+// Set subreg for OO mode pair load to generate registers in pairs given
+// insn_info I2 and I2.
+static void
+set_multiword_subreg_load (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl();
+  rtx_insn *insn2 = i2->rtl();
+  rtx body = PATTERN (insn1);
+  rtx dest_exp = SET_DEST (body);
+  rtx insn2_body = PATTERN (insn2);
+  machine_mode mode = GET_MODE (dest_exp);
+  PUT_MODE_RAW (dest_exp, OOmode);
+
+  rtx insn2_src_exp = SET_SRC (insn2_body);
+  int regoff = 0;
+  rtx src;
+
+  src = simplify_gen_subreg (mode,
+			     dest_exp, GET_MODE (dest_exp),
+			     regoff);
+
+  set_load_subreg (insn2, src);
+
+  int regoff1;
+  rtx src1;
+  rtx addr = XEXP (insn2_src_exp, 0);
+
+  if (GET_CODE (addr) == PLUS
+      && XEXP (addr, 1)
+      && CONST_INT_P (XEXP(addr, 1)))
+    regoff1 = 16;
+  else
+    regoff1 = 0;
+
+  src1 = simplify_gen_subreg (mode,
+			      dest_exp, GET_MODE (dest_exp),
+			      regoff1);
+  if (src1)
+    set_load_subreg (insn1, src1);
+
+  set_rescan_load (i1);
+  set_rescan_load (i2);
+  df_insn_rescan (insn1);
+}
+
+// Set subreg for OO mode pair to generate sequential registers given
+// insn_info pairs I1, I2 and LOAD_P is true iff load insn and false
+// if store insn.
+void
+rs6000_pair_fusion::set_multiword_subreg (insn_info *i1, insn_info *i2,
+					  bool load_p)
+{
+  if (load_p)
+    set_multiword_subreg_load (i1, i2);
+  else
+    set_multiword_subreg_store (i1, i2);
+}
+
+// Return load pair given rtx I1.
+static rtx
+rs6000_gen_load_pair (rtx i1)
+{
+  rtx src_exp = SET_SRC (i1);
+  rtx dest_exp = SET_DEST (i1);
+  rtx lxv;
+  PUT_MODE_RAW (src_exp, OOmode);
+  PUT_MODE_RAW (dest_exp, OOmode);
+  lxv = gen_rtx_SET (dest_exp, src_exp);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "lxv with lxvp ");
+      print_rtl_single (dump_file, lxv);
+    }
+
+  return lxv;
+}
+
+rtx
+rs6000_pair_fusion::gen_pair (rtx *pats, rtx writeback, bool load_p)
+{
+  if (load_p || writeback)
+    return rs6000_gen_load_pair (pats[0]);
+  else
+    return rs6000_gen_store_pair (pats[0]);
+}
+
+const pass_data pass_data_mem_fusion =
+{
+  RTL_PASS, /* type */
+  "mem_fusion", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_mem_fusion : public rtl_opt_pass
+{
+public:
+  pass_mem_fusion (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_mem_fusion, ctxt)
+  {}
+
+  opt_pass *clone () override { return new pass_mem_fusion (m_ctxt);}
+
+  /* opt_pass methods: */
+  bool gate (function *)
+    {
+      return (optimize > 0 && TARGET_VSX && TARGET_POWER10);
+    }
+
+  unsigned int execute (function *) final override
+    {
+      /* We use DF data flow because we change location rtx
+	 which is easier to find and modify.
+	 We use mix of rtl-ssa def-use and DF data flow
+	 where it is easier.  */
+      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+      df_analyze ();
+      df_set_flags (DF_DEFER_INSN_RESCAN);
+
+      rs6000_pair_fusion pass;
+      pass.run ();
+      return 0;
+    }
+}; // class pass_mem_fusion
+
+rtl_opt_pass *
+make_pass_mem_fusion (gcc::context *ctxt)
+{
+  return new pass_mem_fusion (ctxt);
+}
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index 46a0d0b8c56..0b48f57014d 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,7 +28,9 @@ along with GCC; see the file COPYING3.  If not see
      The power8 does not have instructions that automaticaly do the byte swaps
      for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
-
+  /* Pass to replace adjacent memory addresses lxv/stxv instruction with
+     lxvp/stxvp instruction.  */
+  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_mem_fusion);
   /* Pass to do the PCREL_OPT optimization that combines the load of an
      external symbol's address along with a single load or store using that
      address as a base register.  */
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index 09a57a806fa..1412b31c2eb 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -343,6 +343,7 @@ namespace gcc { class context; }
 class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
+extern rtl_opt_pass *make_pass_mem_fusion (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
index b3ce09d523b..df9b3a35b66 100644
--- a/gcc/config/rs6000/t-rs6000
+++ b/gcc/config/rs6000/t-rs6000
@@ -35,6 +35,11 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
 
+rs6000-mem-fusion.o: $(srcdir)/config/rs6000/rs6000-mem-fusion.cc
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+
+
 rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index 26b2284ed37..3d278cb3bd9 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -312,9 +312,9 @@ static int
 encode_lfs (lfs_fields fields)
 {
   int size_log2 = exact_log2 (fields.size);
-  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
-  return ((int)fields.load_p << 3)
-    | ((int)fields.fpsimd_p << 2)
+  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 9);
+  return ((int)fields.load_p << 4)
+    | ((int)fields.fpsimd_p << 3)
     | (size_log2 - 2);
 }
 
@@ -322,8 +322,8 @@ encode_lfs (lfs_fields fields)
 static lfs_fields
 decode_lfs (int lfs)
 {
-  bool load_p = (lfs & (1 << 3));
-  bool fpsimd_p = (lfs & (1 << 2));
+  bool load_p = (lfs & (1 << 4));
+  bool fpsimd_p = (lfs & (1 << 3));
   unsigned size = 1U << ((lfs & 3) + 2);
   return { load_p, fpsimd_p, size };
 }
@@ -425,6 +425,9 @@ pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
   if (MEM_VOLATILE_P (mem))
     return;
 
+  if (load_p && !m_pass->fuseable_load_p (insn))
+    return;
+
   // Ignore writeback accesses if the hook says to do so.
   if (!m_pass->should_handle_writeback (writeback_type::EXISTING)
       && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
@@ -1821,6 +1824,7 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
 
   rtx reg_notes = combine_reg_notes (first, second, load_p);
 
+  m_pass->set_multiword_subreg (i1, i2, load_p);
   rtx pair_pat = m_pass->gen_pair (pats, writeback_effect, load_p);
   insn_change *pair_change = nullptr;
   auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
@@ -2411,6 +2415,15 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
       reg_ops[i] = XEXP (pats[i], !load_p);
     }
 
+  if (!load_p && !m_pass->fuseable_store_p (i1, i2))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "punting on store-mem-pairs due to non fuseable cand (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
   if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
     {
       if (dump_file)
diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
index 45e4edceecb..8d4dc14d932 100644
--- a/gcc/pair-fusion.h
+++ b/gcc/pair-fusion.h
@@ -171,6 +171,22 @@ struct pair_fusion {
   virtual rtx gen_promote_writeback_pair (rtx wb_effect, rtx mem,
 					  rtx regs[2], bool load_p) = 0;
 
+  // Given insn_info pair I1 and I2, sets subreg with multiword registers
+  // to assign register pairs by allocators.
+  // LOAD_P is true iff the pair is a load.
+  virtual void set_multiword_subreg (rtl_ssa::insn_info *i1, rtl_ssa::insn_info *i2,
+				     bool load_p) = 0;
+
+  // Given insn_info pair I1 and I2, checks if pairs are feasible to perform
+  // store mem pairs.
+  // Return true if feasible to perform store mem pairs otherwise false.
+  virtual bool fuseable_store_p (rtl_ssa::insn_info *i1, rtl_ssa::insn_info *i2) = 0;
+
+  // Given insn_info pair I1 and I2, checks if pairs are feasible to perform
+  // load mem pairs.
+  // Return true if feasible to perform load mem pairs otherwise false.
+  virtual bool fuseable_load_p (rtl_ssa::insn_info *info) = 0;
+
   void process_block (rtl_ssa::bb_info *bb);
   rtl_ssa::insn_info *find_trailing_add (rtl_ssa::insn_info *insns[2],
 					 const rtl_ssa::insn_range_info
diff --git a/gcc/rtl-ssa/accesses.h b/gcc/rtl-ssa/accesses.h
index 7d2916d00c2..e3e2addef37 100644
--- a/gcc/rtl-ssa/accesses.h
+++ b/gcc/rtl-ssa/accesses.h
@@ -358,7 +358,7 @@ public:
   use_info *next_any_insn_use () const;
 
   // Return the next use by a debug instruction, or null if none.
-  // This is only valid if is_in_debug_insn ().
+  // This is only valid if if is_in_debug_insn ().
   use_info *next_debug_insn_use () const;
 
   // Return the previous use by a phi node in the list, or null if none.
@@ -379,6 +379,7 @@ public:
   //
   // This routine is only meaningful when def () is nonnull.
   bool is_last_use () const;
+  void set_is_live_out_use (bool value) { m_is_live_out_use = value; }
 
   // Print a description of def () to PP.
   void print_def (pretty_printer *pp) const;
@@ -430,7 +431,6 @@ private:
   void record_reference (rtx_obj_reference, bool);
   void set_insn (insn_info *);
   void set_def (set_info *set) { m_def = set; }
-  void set_is_live_out_use (bool value) { m_is_live_out_use = value; }
   void copy_prev_from (use_info *);
   void copy_next_from (use_info *);
   void set_last_use (use_info *);
diff --git a/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C b/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
new file mode 100644
index 00000000000..d10ff0cdf36
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
+
+#include <altivec.h>
+	
+void
+foo2 ()
+{
+  __vector_quad *dst1;
+  __vector_quad *dst2;
+  vector unsigned char src;
+  __vector_quad acc;
+  vector unsigned char *ptr;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst1 = acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
+  *dst2 = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
diff --git a/gcc/testsuite/g++.target/powerpc/mem-fusion.C b/gcc/testsuite/g++.target/powerpc/mem-fusion.C
new file mode 100644
index 00000000000..c523572cf3c
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/mem-fusion.C
@@ -0,0 +1,15 @@
+/* { dg-do compile } */ 
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */ 
+
+#include <altivec.h>
+
+void
+foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
+{
+  __vector_quad acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
index 69ee826e1be..ae29127f954 100644
--- a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
@@ -258,8 +258,8 @@ foo13b (__vector_quad *dst, __vector_quad *src, vec_t *vec)
   dst[13] = acc;
 }
 
-/* { dg-final { scan-assembler-times {\mlxv\M} 40 } } */
-/* { dg-final { scan-assembler-times {\mlxvp\M} 12 } } */
+/* { dg-final { scan-assembler-times {\mlxv\M} 0 } } */
+/* { dg-final { scan-assembler-times {\mlxvp\M} 32 } } */
 /* { dg-final { scan-assembler-times {\mstxvp\M} 40 } } */
 /* { dg-final { scan-assembler-times {\mxxmfacc\M} 20 } } */
 /* { dg-final { scan-assembler-times {\mxxmtacc\M} 6 } } */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-06  5:55 [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion Ajit Agarwal
@ 2024-06-06  8:58 ` Richard Sandiford
  2024-06-06 13:53   ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-06  8:58 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Hi,

Just some comments on the fuseable_load_p part, since that's what
we were discussing last time.

It looks like this now relies on:

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> +      /* We use DF data flow because we change location rtx
> +	 which is easier to find and modify.
> +	 We use mix of rtl-ssa def-use and DF data flow
> +	 where it is easier.  */
> +      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
> +      df_analyze ();
> +      df_set_flags (DF_DEFER_INSN_RESCAN);

But please don't do this!  For one thing, building DU/UD chains
as well as rtl-ssa is really expensive in terms of compile time.
But more importantly, modifications need to happen via rtl-ssa
to ensure that the IL is kept up-to-date.  If we don't do that,
later fuse attempts will be based on stale data and so could
generate incorrect code.

> +// Check whether load can be fusable or not.
> +// Return true if fuseable otherwise false.
> +bool
> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
> +{
> +  for (auto def : info->defs())
> +    {
> +      auto set = dyn_cast<set_info *> (def);
> +      for (auto use1 : set->nondebug_insn_uses ())
> +	use1->set_is_live_out_use (true);
> +    }

What was the reason for adding this loop?

> +
> +  rtx_insn *rtl_insn = info ->rtl ();
> +  rtx body = PATTERN (rtl_insn);
> +  rtx dest_exp = SET_DEST (body);
> +
> +  if (REG_P (dest_exp) &&
> +      (DF_REG_DEF_COUNT (REGNO (dest_exp)) > 1

The rtl-ssa way of checking this is:

  crtl->ssa->is_single_dominating_def (...)

> +       || DF_REG_EQ_USE_COUNT (REGNO (dest_exp)) > 0))
> +    return  false;

Why are uses in notes a problem?  In the worst case, we should just be
able to remove the note instead.

> +
> +  rtx addr = XEXP (SET_SRC (body), 0);
> +
> +  if (GET_CODE (addr) == PLUS
> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
> +    {
> +      if (INTVAL (XEXP (addr, 1)) == -16)
> +	return false;
> +  }

What's special about -16?

> +
> +  df_ref use;
> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
> +    {
> +      struct df_link *def_link = DF_REF_CHAIN (use);
> +
> +      if (!def_link || !def_link->ref
> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
> +	continue;
> +
> +      while (def_link && def_link->ref)
> +	{
> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
> +	    return false;

Why do you need to skip PARALLELs?

> +
> +	  rtx set = single_set (insn);
> +	  if (set == NULL_RTX)
> +	    return false;
> +
> +	  rtx op0 = SET_SRC (set);
> +	  rtx_code code = GET_CODE (op0);
> +
> +	  // This check is added as register pairs are not generated
> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
> +	  //                  (reg2)
> +	  //                  (neg:V2DF (reg3)))
> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
> +	    return false;

What's special about (neg (fma ...))?

> +
> +	  def_link = def_link->next;
> +	}
> +     }
> +  return true;
> +}

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-06  8:58 ` Richard Sandiford
@ 2024-06-06 13:53   ` Ajit Agarwal
  2024-06-06 14:33     ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-06 13:53 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 06/06/24 2:28 pm, Richard Sandiford wrote:
> Hi,
> 
> Just some comments on the fuseable_load_p part, since that's what
> we were discussing last time.
> 
> It looks like this now relies on:
> 
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> +      /* We use DF data flow because we change location rtx
>> +	 which is easier to find and modify.
>> +	 We use mix of rtl-ssa def-use and DF data flow
>> +	 where it is easier.  */
>> +      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>> +      df_analyze ();
>> +      df_set_flags (DF_DEFER_INSN_RESCAN);
> 
> But please don't do this!  For one thing, building DU/UD chains
> as well as rtl-ssa is really expensive in terms of compile time.
> But more importantly, modifications need to happen via rtl-ssa
> to ensure that the IL is kept up-to-date.  If we don't do that,
> later fuse attempts will be based on stale data and so could
> generate incorrect code.
> 

Sure I have made changes to use only rtl-ssa and not to use
UD/DU chains. I will send the changes in separate subsequent
patch.

>> +// Check whether load can be fusable or not.
>> +// Return true if fuseable otherwise false.
>> +bool
>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>> +{
>> +  for (auto def : info->defs())
>> +    {
>> +      auto set = dyn_cast<set_info *> (def);
>> +      for (auto use1 : set->nondebug_insn_uses ())
>> +	use1->set_is_live_out_use (true);
>> +    }
> 
> What was the reason for adding this loop?
>

The purpose of adding is to avoid assert failure in gcc/rtl-ssa/changes.cc:252

 
>> +
>> +  rtx_insn *rtl_insn = info ->rtl ();
>> +  rtx body = PATTERN (rtl_insn);
>> +  rtx dest_exp = SET_DEST (body);
>> +
>> +  if (REG_P (dest_exp) &&
>> +      (DF_REG_DEF_COUNT (REGNO (dest_exp)) > 1
> 
> The rtl-ssa way of checking this is:
> 
>   crtl->ssa->is_single_dominating_def (...)
> 
>> +       || DF_REG_EQ_USE_COUNT (REGNO (dest_exp)) > 0))
>> +    return  false;
> 
> Why are uses in notes a problem?  In the worst case, we should just be
> able to remove the note instead.
>

We can remove this and its no more required. I will make this
change in subsequent patches.
 
>> +
>> +  rtx addr = XEXP (SET_SRC (body), 0);
>> +
>> +  if (GET_CODE (addr) == PLUS
>> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>> +    {
>> +      if (INTVAL (XEXP (addr, 1)) == -16)
>> +	return false;
>> +  }
> 
> What's special about -16?
> 

The tests like libgomp/for-8 fails with fused load with offset -16 and 0.
Thats why I have added this check.


>> +
>> +  df_ref use;
>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>> +    {
>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>> +
>> +      if (!def_link || !def_link->ref
>> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
>> +	continue;
>> +
>> +      while (def_link && def_link->ref)
>> +	{
>> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
>> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
>> +	    return false;
> 
> Why do you need to skip PARALLELs?
>

vec_select with parallel give failures final.cc "can't split-up with subreg 128 (reg OO"
Thats why I have added this.

 
>> +
>> +	  rtx set = single_set (insn);
>> +	  if (set == NULL_RTX)
>> +	    return false;
>> +
>> +	  rtx op0 = SET_SRC (set);
>> +	  rtx_code code = GET_CODE (op0);
>> +
>> +	  // This check is added as register pairs are not generated
>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>> +	  //                  (reg2)
>> +	  //                  (neg:V2DF (reg3)))
>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>> +	    return false;
> 
> What's special about (neg (fma ...))?
>

I am not sure why register allocator fails allocating register pairs with
NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
Unary operation with fma operand. 
 
>> +
>> +	  def_link = def_link->next;
>> +	}
>> +     }
>> +  return true;
>> +}
> 
> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-06 13:53   ` Ajit Agarwal
@ 2024-06-06 14:33     ` Richard Sandiford
  2024-06-06 21:17       ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-06 14:33 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> On 06/06/24 2:28 pm, Richard Sandiford wrote:
>> Hi,
>> 
>> Just some comments on the fuseable_load_p part, since that's what
>> we were discussing last time.
>> 
>> It looks like this now relies on:
>> 
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> +      /* We use DF data flow because we change location rtx
>>> +	 which is easier to find and modify.
>>> +	 We use mix of rtl-ssa def-use and DF data flow
>>> +	 where it is easier.  */
>>> +      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>>> +      df_analyze ();
>>> +      df_set_flags (DF_DEFER_INSN_RESCAN);
>> 
>> But please don't do this!  For one thing, building DU/UD chains
>> as well as rtl-ssa is really expensive in terms of compile time.
>> But more importantly, modifications need to happen via rtl-ssa
>> to ensure that the IL is kept up-to-date.  If we don't do that,
>> later fuse attempts will be based on stale data and so could
>> generate incorrect code.
>> 
>
> Sure I have made changes to use only rtl-ssa and not to use
> UD/DU chains. I will send the changes in separate subsequent
> patch.

Thanks.  Before you send the patch though:

>>> +// Check whether load can be fusable or not.
>>> +// Return true if fuseable otherwise false.
>>> +bool
>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>> +{
>>> +  for (auto def : info->defs())
>>> +    {
>>> +      auto set = dyn_cast<set_info *> (def);
>>> +      for (auto use1 : set->nondebug_insn_uses ())
>>> +	use1->set_is_live_out_use (true);
>>> +    }
>> 
>> What was the reason for adding this loop?
>>
>
> The purpose of adding is to avoid assert failure in gcc/rtl-ssa/changes.cc:252

That assert is making sure that we don't delete a definition of a
register (or memory) while a real insn still uses it.  If the assert
is firing then something has gone wrong.

Live-out uses are a particular kind of use that occur at the end of
basic blocks.  It's incorrect to mark normal insn uses as live-out.

When an assert fails, it's important to understand why the failure
occurs, rather than brute-force the assert condition to true.

>>> [...]
>>> +
>>> +  rtx addr = XEXP (SET_SRC (body), 0);
>>> +
>>> +  if (GET_CODE (addr) == PLUS
>>> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>>> +    {
>>> +      if (INTVAL (XEXP (addr, 1)) == -16)
>>> +	return false;
>>> +  }
>> 
>> What's special about -16?
>> 
>
> The tests like libgomp/for-8 fails with fused load with offset -16 and 0.
> Thats why I have added this check.

But why does it fail though?  It sounds like the testcase is pointing
out a problem in the pass (or perhaps elsewhere).  It's important that
we try to understand and fix the underlying problem.

>>> +
>>> +  df_ref use;
>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>> +    {
>>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>>> +
>>> +      if (!def_link || !def_link->ref
>>> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>> +	continue;
>>> +
>>> +      while (def_link && def_link->ref)
>>> +	{
>>> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>> +	    return false;
>> 
>> Why do you need to skip PARALLELs?
>>
>
> vec_select with parallel give failures final.cc "can't split-up with subreg 128 (reg OO"
> Thats why I have added this.

But in (vec_select ... (parallel ...)), the parallel won't be the 
PATTERN (insn).  It'll instead be a suboperand of the vec_select.

Here too it's important to understand why the final.cc failure occurs
and what the correct fix is.

>>> +
>>> +	  rtx set = single_set (insn);
>>> +	  if (set == NULL_RTX)
>>> +	    return false;
>>> +
>>> +	  rtx op0 = SET_SRC (set);
>>> +	  rtx_code code = GET_CODE (op0);
>>> +
>>> +	  // This check is added as register pairs are not generated
>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>> +	  //                  (reg2)
>>> +	  //                  (neg:V2DF (reg3)))
>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>> +	    return false;
>> 
>> What's special about (neg (fma ...))?
>>
>
> I am not sure why register allocator fails allocating register pairs with
> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
> Unary operation with fma operand. 

I don't think it'll be specific to (neg (fma ...)).  Here too I think the
test showed up a problem with the pass and we should try to understand
and fix it.

More generally, it seems like you've run the testsuite, seen failures,
isolated a particular change that was wrong in some way, and then added
checks for that particular bit of input rtl.  But that isn't how the
testsuite is meant to be used.

The code needs to make sense from first principles, with comments to
explain decisions that are nonobvious.  (Well, to some extent anyway :))
The testsuite exists to verify the code.  If a testsuite failure occurs,
we need to understand what went wrong in the test, why exactly it happened,
which part of the code was at fault, and how that code should be fixed.

In other words, getting clean test results isn't an end in itself.
The testsuite is instead a tool for pointing out things that were
overlooked.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-06 14:33     ` Richard Sandiford
@ 2024-06-06 21:17       ` Ajit Agarwal
  2024-06-06 22:54         ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-06 21:17 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 06/06/24 8:03 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> On 06/06/24 2:28 pm, Richard Sandiford wrote:
>>> Hi,
>>>
>>> Just some comments on the fuseable_load_p part, since that's what
>>> we were discussing last time.
>>>
>>> It looks like this now relies on:
>>>
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> +      /* We use DF data flow because we change location rtx
>>>> +	 which is easier to find and modify.
>>>> +	 We use mix of rtl-ssa def-use and DF data flow
>>>> +	 where it is easier.  */
>>>> +      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>>>> +      df_analyze ();
>>>> +      df_set_flags (DF_DEFER_INSN_RESCAN);
>>>
>>> But please don't do this!  For one thing, building DU/UD chains
>>> as well as rtl-ssa is really expensive in terms of compile time.
>>> But more importantly, modifications need to happen via rtl-ssa
>>> to ensure that the IL is kept up-to-date.  If we don't do that,
>>> later fuse attempts will be based on stale data and so could
>>> generate incorrect code.
>>>
>>
>> Sure I have made changes to use only rtl-ssa and not to use
>> UD/DU chains. I will send the changes in separate subsequent
>> patch.
> 
> Thanks.  Before you send the patch though:
> 
>>>> +// Check whether load can be fusable or not.
>>>> +// Return true if fuseable otherwise false.
>>>> +bool
>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>> +{
>>>> +  for (auto def : info->defs())
>>>> +    {
>>>> +      auto set = dyn_cast<set_info *> (def);
>>>> +      for (auto use1 : set->nondebug_insn_uses ())
>>>> +	use1->set_is_live_out_use (true);
>>>> +    }
>>>
>>> What was the reason for adding this loop?
>>>
>>
>> The purpose of adding is to avoid assert failure in gcc/rtl-ssa/changes.cc:252
> 
> That assert is making sure that we don't delete a definition of a
> register (or memory) while a real insn still uses it.  If the assert
> is firing then something has gone wrong.
> 
> Live-out uses are a particular kind of use that occur at the end of
> basic blocks.  It's incorrect to mark normal insn uses as live-out.
> 
> When an assert fails, it's important to understand why the failure
> occurs, rather than brute-force the assert condition to true.
> 

The above assert failure occurs when there is a debug insn and its
use is not live-out.

>>>> [...]
>>>> +
>>>> +  rtx addr = XEXP (SET_SRC (body), 0);
>>>> +
>>>> +  if (GET_CODE (addr) == PLUS
>>>> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>>>> +    {
>>>> +      if (INTVAL (XEXP (addr, 1)) == -16)
>>>> +	return false;
>>>> +  }
>>>
>>> What's special about -16?
>>>
>>
>> The tests like libgomp/for-8 fails with fused load with offset -16 and 0.
>> Thats why I have added this check.
> 
> But why does it fail though?  It sounds like the testcase is pointing
> out a problem in the pass (or perhaps elsewhere).  It's important that
> we try to understand and fix the underlying problem.
> 

This check is not required anymore and will remove from subsequent patches.
>>>> +
>>>> +  df_ref use;
>>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>>> +    {
>>>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>>>> +
>>>> +      if (!def_link || !def_link->ref
>>>> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>>> +	continue;
>>>> +
>>>> +      while (def_link && def_link->ref)
>>>> +	{
>>>> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>>> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>>> +	    return false;
>>>
>>> Why do you need to skip PARALLELs?
>>>
>>
>> vec_select with parallel give failures final.cc "can't split-up with subreg 128 (reg OO"
>> Thats why I have added this.
> 
> But in (vec_select ... (parallel ...)), the parallel won't be the 
> PATTERN (insn).  It'll instead be a suboperand of the vec_select.
> 
> Here too it's important to understand why the final.cc failure occurs
> and what the correct fix is.
> 

subreg with vec_select operand already exists before fusion pass.
We overwrite them with subreg 128 bits from 256 OO mode operand.
Due to this in final.cc we couldnt splt at line 2807 and bails
out fatal_insn.

Currently we dont support already existing subreg vector operand
to generate register pairs.
We should bail out from fusion pass in this case.
>>>> +
>>>> +	  rtx set = single_set (insn);
>>>> +	  if (set == NULL_RTX)
>>>> +	    return false;
>>>> +
>>>> +	  rtx op0 = SET_SRC (set);
>>>> +	  rtx_code code = GET_CODE (op0);
>>>> +
>>>> +	  // This check is added as register pairs are not generated
>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>> +	  //                  (reg2)
>>>> +	  //                  (neg:V2DF (reg3)))
>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>> +	    return false;
>>>
>>> What's special about (neg (fma ...))?
>>>
>>
>> I am not sure why register allocator fails allocating register pairs with
>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>> Unary operation with fma operand. 
> 

For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
set correctly. 
IRA marked them spill candidates as spill priority is zero.

Due to this LRA reload pass couldn't allocate register pairs.
> I don't think it'll be specific to (neg (fma ...)).  Here too I think the
> test showed up a problem with the pass and we should try to understand
> and fix it.
> 
> More generally, it seems like you've run the testsuite, seen failures,
> isolated a particular change that was wrong in some way, and then added
> checks for that particular bit of input rtl.  But that isn't how the
> testsuite is meant to be used.
> 
> The code needs to make sense from first principles, with comments to
> explain decisions that are nonobvious.  (Well, to some extent anyway :))
> The testsuite exists to verify the code.  If a testsuite failure occurs,
> we need to understand what went wrong in the test, why exactly it happened,
> which part of the code was at fault, and how that code should be fixed.
> 
> In other words, getting clean test results isn't an end in itself.
> The testsuite is instead a tool for pointing out things that were
> overlooked.
> 
> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-06 21:17       ` Ajit Agarwal
@ 2024-06-06 22:54         ` Richard Sandiford
  2024-06-07  7:35           ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-06 22:54 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> On 06/06/24 8:03 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> On 06/06/24 2:28 pm, Richard Sandiford wrote:
>>>> Hi,
>>>>
>>>> Just some comments on the fuseable_load_p part, since that's what
>>>> we were discussing last time.
>>>>
>>>> It looks like this now relies on:
>>>>
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> +      /* We use DF data flow because we change location rtx
>>>>> +	 which is easier to find and modify.
>>>>> +	 We use mix of rtl-ssa def-use and DF data flow
>>>>> +	 where it is easier.  */
>>>>> +      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>>>>> +      df_analyze ();
>>>>> +      df_set_flags (DF_DEFER_INSN_RESCAN);
>>>>
>>>> But please don't do this!  For one thing, building DU/UD chains
>>>> as well as rtl-ssa is really expensive in terms of compile time.
>>>> But more importantly, modifications need to happen via rtl-ssa
>>>> to ensure that the IL is kept up-to-date.  If we don't do that,
>>>> later fuse attempts will be based on stale data and so could
>>>> generate incorrect code.
>>>>
>>>
>>> Sure I have made changes to use only rtl-ssa and not to use
>>> UD/DU chains. I will send the changes in separate subsequent
>>> patch.
>> 
>> Thanks.  Before you send the patch though:
>> 
>>>>> +// Check whether load can be fusable or not.
>>>>> +// Return true if fuseable otherwise false.
>>>>> +bool
>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>> +{
>>>>> +  for (auto def : info->defs())
>>>>> +    {
>>>>> +      auto set = dyn_cast<set_info *> (def);
>>>>> +      for (auto use1 : set->nondebug_insn_uses ())
>>>>> +	use1->set_is_live_out_use (true);
>>>>> +    }
>>>>
>>>> What was the reason for adding this loop?
>>>>
>>>
>>> The purpose of adding is to avoid assert failure in gcc/rtl-ssa/changes.cc:252
>> 
>> That assert is making sure that we don't delete a definition of a
>> register (or memory) while a real insn still uses it.  If the assert
>> is firing then something has gone wrong.
>> 
>> Live-out uses are a particular kind of use that occur at the end of
>> basic blocks.  It's incorrect to mark normal insn uses as live-out.
>> 
>> When an assert fails, it's important to understand why the failure
>> occurs, rather than brute-force the assert condition to true.
>> 
>
> The above assert failure occurs when there is a debug insn and its
> use is not live-out.

Uses in debug insns are never live-out uses.

It sounds like the bug is that we're failing to update all debug uses of
the original register.  We need to do that, or "reset" the debug insn if
substitution fails for some reason.

See fixup_debug_uses for what the target-independent part of the pass
does for debug insns that are affected by movement.  Hopefully the
update needed here will be simpler than that.

>>>>> [...]
>>>>> +
>>>>> +  rtx addr = XEXP (SET_SRC (body), 0);
>>>>> +
>>>>> +  if (GET_CODE (addr) == PLUS
>>>>> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>>>>> +    {
>>>>> +      if (INTVAL (XEXP (addr, 1)) == -16)
>>>>> +	return false;
>>>>> +  }
>>>>
>>>> What's special about -16?
>>>>
>>>
>>> The tests like libgomp/for-8 fails with fused load with offset -16 and 0.
>>> Thats why I have added this check.
>> 
>> But why does it fail though?  It sounds like the testcase is pointing
>> out a problem in the pass (or perhaps elsewhere).  It's important that
>> we try to understand and fix the underlying problem.
>> 
>
> This check is not required anymore and will remove from subsequent patches.

OK, great.

>>>>> +
>>>>> +  df_ref use;
>>>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>>>> +    {
>>>>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>>>>> +
>>>>> +      if (!def_link || !def_link->ref
>>>>> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>>>> +	continue;
>>>>> +
>>>>> +      while (def_link && def_link->ref)
>>>>> +	{
>>>>> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>>>> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>>>> +	    return false;
>>>>
>>>> Why do you need to skip PARALLELs?
>>>>
>>>
>>> vec_select with parallel give failures final.cc "can't split-up with subreg 128 (reg OO"
>>> Thats why I have added this.
>> 
>> But in (vec_select ... (parallel ...)), the parallel won't be the 
>> PATTERN (insn).  It'll instead be a suboperand of the vec_select.
>> 
>> Here too it's important to understand why the final.cc failure occurs
>> and what the correct fix is.
>> 
>
> subreg with vec_select operand already exists before fusion pass.
> We overwrite them with subreg 128 bits from 256 OO mode operand.

But why is that wrong?  What was the full rtl of the subreg before the
pass runs, what did the subreg look like after the pass, and why is the
change not correct?

In general, there are two main ways that an rtl change can be incorrect:

(1) The new rtl isn't well-formed (such as (subreg (subreg X A) B)).
    In this case, the new rtl makes no inherent sense when viewed
    in isolation: it isn't necessary to see the old rtl to tell that
    the new rtl is wrong.

(2) The new rtl is well-formed (i.e. makes inherent sense when viewed in
    isolation) but it does not have the same semantics as the old rtl.
    In other words, the new rtl is describing a different operation
    from the old rtl.

I think we need to talk about it in those terms, rather than where
the eventual ICE occurs.

> Due to this in final.cc we couldnt splt at line 2807 and bails
> out fatal_insn.
>
> Currently we dont support already existing subreg vector operand
> to generate register pairs.
> We should bail out from fusion pass in this case.
>>>>> +
>>>>> +	  rtx set = single_set (insn);
>>>>> +	  if (set == NULL_RTX)
>>>>> +	    return false;
>>>>> +
>>>>> +	  rtx op0 = SET_SRC (set);
>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>> +
>>>>> +	  // This check is added as register pairs are not generated
>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>> +	  //                  (reg2)
>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>> +	    return false;
>>>>
>>>> What's special about (neg (fma ...))?
>>>>
>>>
>>> I am not sure why register allocator fails allocating register pairs with
>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>> Unary operation with fma operand. 
>> 
>
> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
> set correctly. 
> IRA marked them spill candidates as spill priority is zero.
>
> Due to this LRA reload pass couldn't allocate register pairs.

I think this is just restating the symptom though.  I suppose the same
kind of questions apply here too: what was the instruction before the
pass runs, what was the instruction after the pass runs, and why is
the rtl change incorrect (by the meaning above)?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-06 22:54         ` Richard Sandiford
@ 2024-06-07  7:35           ` Ajit Agarwal
  2024-06-07  8:22             ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-07  7:35 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 07/06/24 4:24 am, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> On 06/06/24 8:03 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> On 06/06/24 2:28 pm, Richard Sandiford wrote:
>>>>> Hi,
>>>>>
>>>>> Just some comments on the fuseable_load_p part, since that's what
>>>>> we were discussing last time.
>>>>>
>>>>> It looks like this now relies on:
>>>>>
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> +      /* We use DF data flow because we change location rtx
>>>>>> +	 which is easier to find and modify.
>>>>>> +	 We use mix of rtl-ssa def-use and DF data flow
>>>>>> +	 where it is easier.  */
>>>>>> +      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>>>>>> +      df_analyze ();
>>>>>> +      df_set_flags (DF_DEFER_INSN_RESCAN);
>>>>>
>>>>> But please don't do this!  For one thing, building DU/UD chains
>>>>> as well as rtl-ssa is really expensive in terms of compile time.
>>>>> But more importantly, modifications need to happen via rtl-ssa
>>>>> to ensure that the IL is kept up-to-date.  If we don't do that,
>>>>> later fuse attempts will be based on stale data and so could
>>>>> generate incorrect code.
>>>>>
>>>>
>>>> Sure I have made changes to use only rtl-ssa and not to use
>>>> UD/DU chains. I will send the changes in separate subsequent
>>>> patch.
>>>
>>> Thanks.  Before you send the patch though:
>>>
>>>>>> +// Check whether load can be fusable or not.
>>>>>> +// Return true if fuseable otherwise false.
>>>>>> +bool
>>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>>> +{
>>>>>> +  for (auto def : info->defs())
>>>>>> +    {
>>>>>> +      auto set = dyn_cast<set_info *> (def);
>>>>>> +      for (auto use1 : set->nondebug_insn_uses ())
>>>>>> +	use1->set_is_live_out_use (true);
>>>>>> +    }
>>>>>
>>>>> What was the reason for adding this loop?
>>>>>
>>>>
>>>> The purpose of adding is to avoid assert failure in gcc/rtl-ssa/changes.cc:252
>>>
>>> That assert is making sure that we don't delete a definition of a
>>> register (or memory) while a real insn still uses it.  If the assert
>>> is firing then something has gone wrong.
>>>
>>> Live-out uses are a particular kind of use that occur at the end of
>>> basic blocks.  It's incorrect to mark normal insn uses as live-out.
>>>
>>> When an assert fails, it's important to understand why the failure
>>> occurs, rather than brute-force the assert condition to true.
>>>
>>
>> The above assert failure occurs when there is a debug insn and its
>> use is not live-out.
> 
> Uses in debug insns are never live-out uses.
> 
> It sounds like the bug is that we're failing to update all debug uses of
> the original register.  We need to do that, or "reset" the debug insn if
> substitution fails for some reason.
> 
> See fixup_debug_uses for what the target-independent part of the pass
> does for debug insns that are affected by movement.  Hopefully the
> update needed here will be simpler than that.
> 

Sure. Thanks.

>>>>>> [...]
>>>>>> +
>>>>>> +  rtx addr = XEXP (SET_SRC (body), 0);
>>>>>> +
>>>>>> +  if (GET_CODE (addr) == PLUS
>>>>>> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>>>>>> +    {
>>>>>> +      if (INTVAL (XEXP (addr, 1)) == -16)
>>>>>> +	return false;
>>>>>> +  }
>>>>>
>>>>> What's special about -16?
>>>>>
>>>>
>>>> The tests like libgomp/for-8 fails with fused load with offset -16 and 0.
>>>> Thats why I have added this check.
>>>
>>> But why does it fail though?  It sounds like the testcase is pointing
>>> out a problem in the pass (or perhaps elsewhere).  It's important that
>>> we try to understand and fix the underlying problem.
>>>
>>
>> This check is not required anymore and will remove from subsequent patches.
> 
> OK, great.
> 

Thanks.

>>>>>> +
>>>>>> +  df_ref use;
>>>>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>>>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>>>>> +    {
>>>>>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>>>>>> +
>>>>>> +      if (!def_link || !def_link->ref
>>>>>> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>>>>> +	continue;
>>>>>> +
>>>>>> +      while (def_link && def_link->ref)
>>>>>> +	{
>>>>>> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>>>>> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>>>>> +	    return false;
>>>>>
>>>>> Why do you need to skip PARALLELs?
>>>>>
>>>>
>>>> vec_select with parallel give failures final.cc "can't split-up with subreg 128 (reg OO"
>>>> Thats why I have added this.
>>>
>>> But in (vec_select ... (parallel ...)), the parallel won't be the 
>>> PATTERN (insn).  It'll instead be a suboperand of the vec_select.
>>>
>>> Here too it's important to understand why the final.cc failure occurs
>>> and what the correct fix is.
>>>
>>
>> subreg with vec_select operand already exists before fusion pass.
>> We overwrite them with subreg 128 bits from 256 OO mode operand.
> 
> But why is that wrong?  What was the full rtl of the subreg before the
> pass runs, what did the subreg look like after the pass, and why is the
> change not correct?
> 
> In general, there are two main ways that an rtl change can be incorrect:
> 
> (1) The new rtl isn't well-formed (such as (subreg (subreg X A) B)).
>     In this case, the new rtl makes no inherent sense when viewed
>     in isolation: it isn't necessary to see the old rtl to tell that
>     the new rtl is wrong.
> 
> (2) The new rtl is well-formed (i.e. makes inherent sense when viewed in
>     isolation) but it does not have the same semantics as the old rtl.
>     In other words, the new rtl is describing a different operation
>     from the old rtl.
> 
> I think we need to talk about it in those terms, rather than where
> the eventual ICE occurs.
> 
Before the fusion.
old rtl looks like this:

(vec_select:HI (subreg:V8HI (reg:V16QI 125 [ vect__29.38 ]) 0)

After the fusion
new rtl looks like this:

(vec_select:HI (subreg:V16QI (reg:OO 125 [ vect__29.38 ]) 16)

new rtl is not well formed.

Thats why its failing.

reg:v16QI 125 is the destination of the load that needs to be fused.

>> Due to this in final.cc we couldnt splt at line 2807 and bails
>> out fatal_insn.
>>
>> Currently we dont support already existing subreg vector operand
>> to generate register pairs.
>> We should bail out from fusion pass in this case.
>>>>>> +
>>>>>> +	  rtx set = single_set (insn);
>>>>>> +	  if (set == NULL_RTX)
>>>>>> +	    return false;
>>>>>> +
>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>> +
>>>>>> +	  // This check is added as register pairs are not generated
>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>> +	  //                  (reg2)
>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>> +	    return false;
>>>>>
>>>>> What's special about (neg (fma ...))?
>>>>>
>>>>
>>>> I am not sure why register allocator fails allocating register pairs with
>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>> Unary operation with fma operand. 
>>>
>>
>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>> set correctly. 
>> IRA marked them spill candidates as spill priority is zero.
>>
>> Due to this LRA reload pass couldn't allocate register pairs.
> 
> I think this is just restating the symptom though.  I suppose the same
> kind of questions apply here too: what was the instruction before the
> pass runs, what was the instruction after the pass runs, and why is
> the rtl change incorrect (by the meaning above)?
> 

Original case where we dont do load fusion, spill happens, in that
case we dont require sequential register pairs to be generated for 2 loads
for. Hence it worked.

rtl change is correct and there is no error.

for load fusion spill happens and we dont generate sequential register pairs
because pf spill candidate and lxvp gives incorrect results as sequential register
pairs are required for lxvp.


> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-07  7:35           ` Ajit Agarwal
@ 2024-06-07  8:22             ` Richard Sandiford
  2024-06-07 10:41               ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-07  8:22 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>> +
>>>>>>> +  df_ref use;
>>>>>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>>>>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>>>>>> +    {
>>>>>>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>>>>>>> +
>>>>>>> +      if (!def_link || !def_link->ref
>>>>>>> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>>>>>> +	continue;
>>>>>>> +
>>>>>>> +      while (def_link && def_link->ref)
>>>>>>> +	{
>>>>>>> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>>>>>> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>>>>>> +	    return false;
>>>>>>
>>>>>> Why do you need to skip PARALLELs?
>>>>>>
>>>>>
>>>>> vec_select with parallel give failures final.cc "can't split-up with subreg 128 (reg OO"
>>>>> Thats why I have added this.
>>>>
>>>> But in (vec_select ... (parallel ...)), the parallel won't be the 
>>>> PATTERN (insn).  It'll instead be a suboperand of the vec_select.
>>>>
>>>> Here too it's important to understand why the final.cc failure occurs
>>>> and what the correct fix is.
>>>>
>>>
>>> subreg with vec_select operand already exists before fusion pass.
>>> We overwrite them with subreg 128 bits from 256 OO mode operand.
>> 
>> But why is that wrong?  What was the full rtl of the subreg before the
>> pass runs, what did the subreg look like after the pass, and why is the
>> change not correct?
>> 
>> In general, there are two main ways that an rtl change can be incorrect:
>> 
>> (1) The new rtl isn't well-formed (such as (subreg (subreg X A) B)).
>>     In this case, the new rtl makes no inherent sense when viewed
>>     in isolation: it isn't necessary to see the old rtl to tell that
>>     the new rtl is wrong.
>> 
>> (2) The new rtl is well-formed (i.e. makes inherent sense when viewed in
>>     isolation) but it does not have the same semantics as the old rtl.
>>     In other words, the new rtl is describing a different operation
>>     from the old rtl.
>> 
>> I think we need to talk about it in those terms, rather than where
>> the eventual ICE occurs.
>> 
> Before the fusion.
> old rtl looks like this:
>
> (vec_select:HI (subreg:V8HI (reg:V16QI 125 [ vect__29.38 ]) 0)
>
> After the fusion
> new rtl looks like this:
>
> (vec_select:HI (subreg:V16QI (reg:OO 125 [ vect__29.38 ]) 16)
>
> new rtl is not well formed.
>
> Thats why its failing.
>
> reg:v16QI 125 is the destination of the load that needs to be fused.

This indicates that there's a bug in the substitution code.

It's probably better to create a fresh OO register, rather than
change an existing 128-bit register to 256 bits.  If we do that,
and if reg:V16QI 125 is the destination of the second load
(which I assume it is from the 16 offset in the subreg),
then the new RTL should be:

  (vec_select:HI (subreg:V8HI (reg:OO NEW_REG) 16) ...)

It's possible to get this by using insn_propagation to replace
(reg:V16QI 125) with (subreg:V16QI (reg:OO NEW_REG) 16).
insn_propagation should then take care of the rest.

There are no existing rtl-ssa routines for handling new registers
though.  (The idea was to add things as the need arose.)

>>> Due to this in final.cc we couldnt splt at line 2807 and bails
>>> out fatal_insn.
>>>
>>> Currently we dont support already existing subreg vector operand
>>> to generate register pairs.
>>> We should bail out from fusion pass in this case.
>>>>>>> +
>>>>>>> +	  rtx set = single_set (insn);
>>>>>>> +	  if (set == NULL_RTX)
>>>>>>> +	    return false;
>>>>>>> +
>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>> +
>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>> +	  //                  (reg2)
>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>> +	    return false;
>>>>>>
>>>>>> What's special about (neg (fma ...))?
>>>>>>
>>>>>
>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>> Unary operation with fma operand. 
>>>>
>>>
>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>> set correctly. 
>>> IRA marked them spill candidates as spill priority is zero.
>>>
>>> Due to this LRA reload pass couldn't allocate register pairs.
>> 
>> I think this is just restating the symptom though.  I suppose the same
>> kind of questions apply here too: what was the instruction before the
>> pass runs, what was the instruction after the pass runs, and why is
>> the rtl change incorrect (by the meaning above)?
>> 
>
> Original case where we dont do load fusion, spill happens, in that
> case we dont require sequential register pairs to be generated for 2 loads
> for. Hence it worked.
>
> rtl change is correct and there is no error.
>
> for load fusion spill happens and we dont generate sequential register pairs
> because pf spill candidate and lxvp gives incorrect results as sequential register
> pairs are required for lxvp.

Can you go into more detail?  How is the lxvp represented?  And how do
we end up not getting a sequential register pair?  What does the rtl
look like (before and after things have gone wrong)?

It seems like either the rtl is not describing the result of the fusion
correctly or there is some problem in the .md description of lxvp.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-07  8:22             ` Richard Sandiford
@ 2024-06-07 10:41               ` Ajit Agarwal
  2024-06-10  8:42                 ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-07 10:41 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 07/06/24 1:52 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>> +
>>>>>>>> +  df_ref use;
>>>>>>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>>>>>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>>>>>>> +    {
>>>>>>>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>>>>>>>> +
>>>>>>>> +      if (!def_link || !def_link->ref
>>>>>>>> +	  || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>>>>>>> +	continue;
>>>>>>>> +
>>>>>>>> +      while (def_link && def_link->ref)
>>>>>>>> +	{
>>>>>>>> +	  rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>>>>>>> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>>>>>>> +	    return false;
>>>>>>>
>>>>>>> Why do you need to skip PARALLELs?
>>>>>>>
>>>>>>
>>>>>> vec_select with parallel give failures final.cc "can't split-up with subreg 128 (reg OO"
>>>>>> Thats why I have added this.
>>>>>
>>>>> But in (vec_select ... (parallel ...)), the parallel won't be the 
>>>>> PATTERN (insn).  It'll instead be a suboperand of the vec_select.
>>>>>
>>>>> Here too it's important to understand why the final.cc failure occurs
>>>>> and what the correct fix is.
>>>>>
>>>>
>>>> subreg with vec_select operand already exists before fusion pass.
>>>> We overwrite them with subreg 128 bits from 256 OO mode operand.
>>>
>>> But why is that wrong?  What was the full rtl of the subreg before the
>>> pass runs, what did the subreg look like after the pass, and why is the
>>> change not correct?
>>>
>>> In general, there are two main ways that an rtl change can be incorrect:
>>>
>>> (1) The new rtl isn't well-formed (such as (subreg (subreg X A) B)).
>>>     In this case, the new rtl makes no inherent sense when viewed
>>>     in isolation: it isn't necessary to see the old rtl to tell that
>>>     the new rtl is wrong.
>>>
>>> (2) The new rtl is well-formed (i.e. makes inherent sense when viewed in
>>>     isolation) but it does not have the same semantics as the old rtl.
>>>     In other words, the new rtl is describing a different operation
>>>     from the old rtl.
>>>
>>> I think we need to talk about it in those terms, rather than where
>>> the eventual ICE occurs.
>>>
>> Before the fusion.
>> old rtl looks like this:
>>
>> (vec_select:HI (subreg:V8HI (reg:V16QI 125 [ vect__29.38 ]) 0)
>>
>> After the fusion
>> new rtl looks like this:
>>
>> (vec_select:HI (subreg:V16QI (reg:OO 125 [ vect__29.38 ]) 16)
>>
>> new rtl is not well formed.
>>
>> Thats why its failing.
>>
>> reg:v16QI 125 is the destination of the load that needs to be fused.
> 
> This indicates that there's a bug in the substitution code.
> 
> It's probably better to create a fresh OO register, rather than
> change an existing 128-bit register to 256 bits.  If we do that,
> and if reg:V16QI 125 is the destination of the second load
> (which I assume it is from the 16 offset in the subreg),
> then the new RTL should be:
> 
>   (vec_select:HI (subreg:V8HI (reg:OO NEW_REG) 16) ...)
> 
> It's possible to get this by using insn_propagation to replace
> (reg:V16QI 125) with (subreg:V16QI (reg:OO NEW_REG) 16).
> insn_propagation should then take care of the rest.
> 
> There are no existing rtl-ssa routines for handling new registers
> though.  (The idea was to add things as the need arose.)
> 

Sure I will do that. Thanks.

>>>> Due to this in final.cc we couldnt splt at line 2807 and bails
>>>> out fatal_insn.
>>>>
>>>> Currently we dont support already existing subreg vector operand
>>>> to generate register pairs.
>>>> We should bail out from fusion pass in this case.
>>>>>>>> +
>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>> +	    return false;
>>>>>>>> +
>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>> +
>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>> +	  //                  (reg2)
>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>> +	    return false;
>>>>>>>
>>>>>>> What's special about (neg (fma ...))?
>>>>>>>
>>>>>>
>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>> Unary operation with fma operand. 
>>>>>
>>>>
>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>> set correctly. 
>>>> IRA marked them spill candidates as spill priority is zero.
>>>>
>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>
>>> I think this is just restating the symptom though.  I suppose the same
>>> kind of questions apply here too: what was the instruction before the
>>> pass runs, what was the instruction after the pass runs, and why is
>>> the rtl change incorrect (by the meaning above)?
>>>
>>
>> Original case where we dont do load fusion, spill happens, in that
>> case we dont require sequential register pairs to be generated for 2 loads
>> for. Hence it worked.
>>
>> rtl change is correct and there is no error.
>>
>> for load fusion spill happens and we dont generate sequential register pairs
>> because pf spill candidate and lxvp gives incorrect results as sequential register
>> pairs are required for lxvp.
> 
> Can you go into more detail?  How is the lxvp represented?  And how do
> we end up not getting a sequential register pair?  What does the rtl
> look like (before and after things have gone wrong)?
> 
> It seems like either the rtl is not describing the result of the fusion
> correctly or there is some problem in the .md description of lxvp.
> 

After fusion pass:

(insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
        (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
                (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
     (nil))
(insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
        (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
                (reg:V2DF 44 12 [3119])
                (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
     (nil))

In LRA reload.

(insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
        (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
     (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
        (nil)))
(insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
        (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
                (reg:V2DF 4283 [3119])
                (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
     (nil))


In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.

lxvp vsx0, 0(r1).

It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.

Without load fusion since 2 loads exists and 2 loads need not require sequential registers
hence it worked but with load fusion and using lxvp it requires sequential register pairs.
  
> Thanks,
> Richard

Thanks & regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-07 10:41               ` Ajit Agarwal
@ 2024-06-10  8:42                 ` Richard Sandiford
  2024-06-10  9:15                   ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-10  8:42 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>> +
>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>> +	    return false;
>>>>>>>>> +
>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>> +
>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>> +	  //                  (reg2)
>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>> +	    return false;
>>>>>>>>
>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>
>>>>>>>
>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>> Unary operation with fma operand. 
>>>>>>
>>>>>
>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>> set correctly. 
>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>
>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>
>>>> I think this is just restating the symptom though.  I suppose the same
>>>> kind of questions apply here too: what was the instruction before the
>>>> pass runs, what was the instruction after the pass runs, and why is
>>>> the rtl change incorrect (by the meaning above)?
>>>>
>>>
>>> Original case where we dont do load fusion, spill happens, in that
>>> case we dont require sequential register pairs to be generated for 2 loads
>>> for. Hence it worked.
>>>
>>> rtl change is correct and there is no error.
>>>
>>> for load fusion spill happens and we dont generate sequential register pairs
>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>> pairs are required for lxvp.
>> 
>> Can you go into more detail?  How is the lxvp represented?  And how do
>> we end up not getting a sequential register pair?  What does the rtl
>> look like (before and after things have gone wrong)?
>> 
>> It seems like either the rtl is not describing the result of the fusion
>> correctly or there is some problem in the .md description of lxvp.
>> 
>
> After fusion pass:
>
> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>      (nil))
> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>                 (reg:V2DF 44 12 [3119])
>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>      (nil))
>
> In LRA reload.
>
> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>         (nil)))
> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>                 (reg:V2DF 4283 [3119])
>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>      (nil))
>
>
> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>
> lxvp vsx0, 0(r1).
>
> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>
> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
> hence it worked but with load fusion and using lxvp it requires sequential register pairs.

Do you mean that this is a performance regression?  I.e. the fact that
lxvp requires sequential registers causes extra spilling, due to having
less allocation freedom?

Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
above looks wrong in principle (although I've no idea if the REG_EQUIV
is correct in this context).  What does the allocated code look like,
and why is it wrong?

If (reg:OO 2561) is spilled and then one half of it used, only that half
needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
for insn 2412 on its own, only the second half of the register needs to be
loaded from memory.

Richard


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-10  8:42                 ` Richard Sandiford
@ 2024-06-10  9:15                   ` Ajit Agarwal
  2024-06-10  9:22                     ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-10  9:15 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 10/06/24 2:12 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>> +
>>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>>> +	    return false;
>>>>>>>>>> +
>>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>>> +
>>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>> +	  //                  (reg2)
>>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>> +	    return false;
>>>>>>>>>
>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>>> Unary operation with fma operand. 
>>>>>>>
>>>>>>
>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>>> set correctly. 
>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>
>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>
>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>> kind of questions apply here too: what was the instruction before the
>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>> the rtl change incorrect (by the meaning above)?
>>>>>
>>>>
>>>> Original case where we dont do load fusion, spill happens, in that
>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>> for. Hence it worked.
>>>>
>>>> rtl change is correct and there is no error.
>>>>
>>>> for load fusion spill happens and we dont generate sequential register pairs
>>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>>> pairs are required for lxvp.
>>>
>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>> we end up not getting a sequential register pair?  What does the rtl
>>> look like (before and after things have gone wrong)?
>>>
>>> It seems like either the rtl is not describing the result of the fusion
>>> correctly or there is some problem in the .md description of lxvp.
>>>
>>
>> After fusion pass:
>>
>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>      (nil))
>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>                 (reg:V2DF 44 12 [3119])
>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>      (nil))
>>
>> In LRA reload.
>>
>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>         (nil)))
>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>                 (reg:V2DF 4283 [3119])
>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>>      (nil))
>>
>>
>> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
>> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>>
>> lxvp vsx0, 0(r1).
>>
>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>>
>> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
>> hence it worked but with load fusion and using lxvp it requires sequential register pairs.
> 
> Do you mean that this is a performance regression?  I.e. the fact that
> lxvp requires sequential registers causes extra spilling, due to having
> less allocation freedom?
> 
> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
> above looks wrong in principle (although I've no idea if the REG_EQUIV
> is correct in this context).  What does the allocated code look like,
> and why is it wrong?
> 
> If (reg:OO 2561) is spilled and then one half of it used, only that half
> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
> for insn 2412 on its own, only the second half of the register needs to be
> loaded from memory.
>

This is bwaves spec 2017 benchmark. Spill happening in register allocator
could be because of less registers available in order to generate
sequential registers for lxvp.

Because of spill sequential registers are not generated and breaks the
correctness. REG_EQUIV is generated by IRA.

Allocated code because of spill doesn't generate sequential registers.
In LRA reload because of spill marked for IRA adjust to another
register causing not to generate sequential registers.

reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
loaded from memory instead of generating sequential registers.

Other reasons for spilling because of long live range for reg:OO 2572.

We can add heuristics in fusion code not to fuse for longer live
ranges that should solve the problem.

> Richard
> 

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-10  9:15                   ` Ajit Agarwal
@ 2024-06-10  9:22                     ` Richard Sandiford
  2024-06-10  9:43                       ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-10  9:22 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>> +
>>>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>>>> +	    return false;
>>>>>>>>>>> +
>>>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>>>> +
>>>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>> +	  //                  (reg2)
>>>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>> +	    return false;
>>>>>>>>>>
>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>
>>>>>>>
>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>>>> set correctly. 
>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>
>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>
>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>
>>>>>
>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>>> for. Hence it worked.
>>>>>
>>>>> rtl change is correct and there is no error.
>>>>>
>>>>> for load fusion spill happens and we dont generate sequential register pairs
>>>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>>>> pairs are required for lxvp.
>>>>
>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>> we end up not getting a sequential register pair?  What does the rtl
>>>> look like (before and after things have gone wrong)?
>>>>
>>>> It seems like either the rtl is not describing the result of the fusion
>>>> correctly or there is some problem in the .md description of lxvp.
>>>>
>>>
>>> After fusion pass:
>>>
>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>      (nil))
>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>                 (reg:V2DF 44 12 [3119])
>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>      (nil))
>>>
>>> In LRA reload.
>>>
>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>>         (nil)))
>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>                 (reg:V2DF 4283 [3119])
>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>>>      (nil))
>>>
>>>
>>> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
>>> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>>>
>>> lxvp vsx0, 0(r1).
>>>
>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>>>
>>> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
>>> hence it worked but with load fusion and using lxvp it requires sequential register pairs.
>> 
>> Do you mean that this is a performance regression?  I.e. the fact that
>> lxvp requires sequential registers causes extra spilling, due to having
>> less allocation freedom?
>> 
>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>> is correct in this context).  What does the allocated code look like,
>> and why is it wrong?
>> 
>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>> for insn 2412 on its own, only the second half of the register needs to be
>> loaded from memory.
>>
>
> This is bwaves spec 2017 benchmark. Spill happening in register allocator
> could be because of less registers available in order to generate
> sequential registers for lxvp.
>
> Because of spill sequential registers are not generated and breaks the
> correctness. REG_EQUIV is generated by IRA.
>
> Allocated code because of spill doesn't generate sequential registers.
> In LRA reload because of spill marked for IRA adjust to another
> register causing not to generate sequential registers.
>
> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
> loaded from memory instead of generating sequential registers.
>
> Other reasons for spilling because of long live range for reg:OO 2572.
>
> We can add heuristics in fusion code not to fuse for longer live
> ranges that should solve the problem.

This doesn't describe the real problem though.  It's natural for
registers to be spilled sometimes.  That would happen for OOmode
registers even if we didn't form them in the fusion pass.  And it's
normal GCC semantics that the two hard registers allocated to an OOmode
pseudo are consecutive.

Like I asked above, please show the allocated code (including spills
and reloads) and explain why it's wrong.

Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-10  9:22                     ` Richard Sandiford
@ 2024-06-10  9:43                       ` Ajit Agarwal
  2024-06-10  9:50                         ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-10  9:43 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 10/06/24 2:52 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>> +
>>>>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>> +	  //                  (reg2)
>>>>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>
>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>
>>>>>>>>
>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>>>>> set correctly. 
>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>
>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>
>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>
>>>>>>
>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>>>> for. Hence it worked.
>>>>>>
>>>>>> rtl change is correct and there is no error.
>>>>>>
>>>>>> for load fusion spill happens and we dont generate sequential register pairs
>>>>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>>>>> pairs are required for lxvp.
>>>>>
>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>> look like (before and after things have gone wrong)?
>>>>>
>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>
>>>>
>>>> After fusion pass:
>>>>
>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>      (nil))
>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>                 (reg:V2DF 44 12 [3119])
>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>      (nil))
>>>>
>>>> In LRA reload.
>>>>
>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>>>         (nil)))
>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>                 (reg:V2DF 4283 [3119])
>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>>>>      (nil))
>>>>
>>>>
>>>> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
>>>> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>>>>
>>>> lxvp vsx0, 0(r1).
>>>>
>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>>>>
>>>> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
>>>> hence it worked but with load fusion and using lxvp it requires sequential register pairs.
>>>
>>> Do you mean that this is a performance regression?  I.e. the fact that
>>> lxvp requires sequential registers causes extra spilling, due to having
>>> less allocation freedom?
>>>
>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>> is correct in this context).  What does the allocated code look like,
>>> and why is it wrong?
>>>
>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>> for insn 2412 on its own, only the second half of the register needs to be
>>> loaded from memory.
>>>
>>
>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>> could be because of less registers available in order to generate
>> sequential registers for lxvp.
>>
>> Because of spill sequential registers are not generated and breaks the
>> correctness. REG_EQUIV is generated by IRA.
>>
>> Allocated code because of spill doesn't generate sequential registers.
>> In LRA reload because of spill marked for IRA adjust to another
>> register causing not to generate sequential registers.
>>
>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>> loaded from memory instead of generating sequential registers.
>>
>> Other reasons for spilling because of long live range for reg:OO 2572.
>>
>> We can add heuristics in fusion code not to fuse for longer live
>> ranges that should solve the problem.
> 
> This doesn't describe the real problem though.  It's natural for
> registers to be spilled sometimes.  That would happen for OOmode
> registers even if we didn't form them in the fusion pass.  And it's
> normal GCC semantics that the two hard registers allocated to an OOmode
> pseudo are consecutive.
> 
> Like I asked above, please show the allocated code (including spills
> and reloads) and explain why it's wrong.
> 

After LRA reload:

(insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
        (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
                (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
     (nil))
(insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
        (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
                (reg:V2DF 44 12 [3119])
                (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
     (nil))

(insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
        (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
                (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
                (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
     (nil))

In the above allocated code it assign registers 51 and 47 and they are not sequential.

> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-10  9:43                       ` Ajit Agarwal
@ 2024-06-10  9:50                         ` Richard Sandiford
  2024-06-10  9:57                           ` Ajit Agarwal
  2024-06-10 10:13                           ` Ajit Agarwal
  0 siblings, 2 replies; 35+ messages in thread
From: Richard Sandiford @ 2024-06-10  9:50 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello Richard:
>
> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>> +	  //                  (reg2)
>>>>>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>
>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>>>>>> set correctly. 
>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>
>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>
>>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>
>>>>>>>
>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>>>>> for. Hence it worked.
>>>>>>>
>>>>>>> rtl change is correct and there is no error.
>>>>>>>
>>>>>>> for load fusion spill happens and we dont generate sequential register pairs
>>>>>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>>>>>> pairs are required for lxvp.
>>>>>>
>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>> look like (before and after things have gone wrong)?
>>>>>>
>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>
>>>>>
>>>>> After fusion pass:
>>>>>
>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>      (nil))
>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>      (nil))
>>>>>
>>>>> In LRA reload.
>>>>>
>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>>>>         (nil)))
>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>>                 (reg:V2DF 4283 [3119])
>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>>>>>      (nil))
>>>>>
>>>>>
>>>>> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
>>>>> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>>>>>
>>>>> lxvp vsx0, 0(r1).
>>>>>
>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>>>>>
>>>>> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
>>>>> hence it worked but with load fusion and using lxvp it requires sequential register pairs.
>>>>
>>>> Do you mean that this is a performance regression?  I.e. the fact that
>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>> less allocation freedom?
>>>>
>>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>> is correct in this context).  What does the allocated code look like,
>>>> and why is it wrong?
>>>>
>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>>> for insn 2412 on its own, only the second half of the register needs to be
>>>> loaded from memory.
>>>>
>>>
>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>> could be because of less registers available in order to generate
>>> sequential registers for lxvp.
>>>
>>> Because of spill sequential registers are not generated and breaks the
>>> correctness. REG_EQUIV is generated by IRA.
>>>
>>> Allocated code because of spill doesn't generate sequential registers.
>>> In LRA reload because of spill marked for IRA adjust to another
>>> register causing not to generate sequential registers.
>>>
>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>> loaded from memory instead of generating sequential registers.
>>>
>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>
>>> We can add heuristics in fusion code not to fuse for longer live
>>> ranges that should solve the problem.
>> 
>> This doesn't describe the real problem though.  It's natural for
>> registers to be spilled sometimes.  That would happen for OOmode
>> registers even if we didn't form them in the fusion pass.  And it's
>> normal GCC semantics that the two hard registers allocated to an OOmode
>> pseudo are consecutive.
>> 
>> Like I asked above, please show the allocated code (including spills
>> and reloads) and explain why it's wrong.
>> 
>
> After LRA reload:
>
> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>      (nil))
> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>                 (reg:V2DF 44 12 [3119])
>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>      (nil))
>
> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>      (nil))
>
> In the above allocated code it assign registers 51 and 47 and they are not sequential.

The reload for 2412 looks valid.  What was the original pre-reload
version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?

Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-10  9:50                         ` Richard Sandiford
@ 2024-06-10  9:57                           ` Ajit Agarwal
  2024-06-10 10:28                             ` Richard Sandiford
  2024-06-10 10:13                           ` Ajit Agarwal
  1 sibling, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-10  9:57 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 10/06/24 3:20 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>>
>> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>>> +	  //                  (reg2)
>>>>>>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>>>>>>> set correctly. 
>>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>>
>>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>>
>>>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>>>>>> for. Hence it worked.
>>>>>>>>
>>>>>>>> rtl change is correct and there is no error.
>>>>>>>>
>>>>>>>> for load fusion spill happens and we dont generate sequential register pairs
>>>>>>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>>>>>>> pairs are required for lxvp.
>>>>>>>
>>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>>> look like (before and after things have gone wrong)?
>>>>>>>
>>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>>
>>>>>>
>>>>>> After fusion pass:
>>>>>>
>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>      (nil))
>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>      (nil))
>>>>>>
>>>>>> In LRA reload.
>>>>>>
>>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>>>>>         (nil)))
>>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>>>                 (reg:V2DF 4283 [3119])
>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>>>>>>      (nil))
>>>>>>
>>>>>>
>>>>>> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
>>>>>> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>>>>>>
>>>>>> lxvp vsx0, 0(r1).
>>>>>>
>>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>>>>>>
>>>>>> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
>>>>>> hence it worked but with load fusion and using lxvp it requires sequential register pairs.
>>>>>
>>>>> Do you mean that this is a performance regression?  I.e. the fact that
>>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>>> less allocation freedom?
>>>>>
>>>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>>> is correct in this context).  What does the allocated code look like,
>>>>> and why is it wrong?
>>>>>
>>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>>>> for insn 2412 on its own, only the second half of the register needs to be
>>>>> loaded from memory.
>>>>>
>>>>
>>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>>> could be because of less registers available in order to generate
>>>> sequential registers for lxvp.
>>>>
>>>> Because of spill sequential registers are not generated and breaks the
>>>> correctness. REG_EQUIV is generated by IRA.
>>>>
>>>> Allocated code because of spill doesn't generate sequential registers.
>>>> In LRA reload because of spill marked for IRA adjust to another
>>>> register causing not to generate sequential registers.
>>>>
>>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>>> loaded from memory instead of generating sequential registers.
>>>>
>>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>>
>>>> We can add heuristics in fusion code not to fuse for longer live
>>>> ranges that should solve the problem.
>>>
>>> This doesn't describe the real problem though.  It's natural for
>>> registers to be spilled sometimes.  That would happen for OOmode
>>> registers even if we didn't form them in the fusion pass.  And it's
>>> normal GCC semantics that the two hard registers allocated to an OOmode
>>> pseudo are consecutive.
>>>
>>> Like I asked above, please show the allocated code (including spills
>>> and reloads) and explain why it's wrong.
>>>
>>
>> After LRA reload:
>>
>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>      (nil))
>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>                 (reg:V2DF 44 12 [3119])
>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>      (nil))
>>
>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>      (nil))
>>
>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
> 
> The reload for 2412 looks valid.  What was the original pre-reload
> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
> 

This is preload version of 2473:

(insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
        (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
                (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
                (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
     (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
        (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
            (nil))))

insn 2472 is replaced with 9299 after reload.

> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-10  9:50                         ` Richard Sandiford
  2024-06-10  9:57                           ` Ajit Agarwal
@ 2024-06-10 10:13                           ` Ajit Agarwal
  1 sibling, 0 replies; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-10 10:13 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 10/06/24 3:20 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>>
>> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>>> +	  //                  (reg2)
>>>>>>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>>>>>>> set correctly. 
>>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>>
>>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>>
>>>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>>>>>> for. Hence it worked.
>>>>>>>>
>>>>>>>> rtl change is correct and there is no error.
>>>>>>>>
>>>>>>>> for load fusion spill happens and we dont generate sequential register pairs
>>>>>>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>>>>>>> pairs are required for lxvp.
>>>>>>>
>>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>>> look like (before and after things have gone wrong)?
>>>>>>>
>>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>>
>>>>>>
>>>>>> After fusion pass:
>>>>>>
>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>      (nil))
>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>      (nil))
>>>>>>
>>>>>> In LRA reload.
>>>>>>
>>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>>>>>         (nil)))
>>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>>>                 (reg:V2DF 4283 [3119])
>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>>>>>>      (nil))
>>>>>>
>>>>>>
>>>>>> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
>>>>>> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>>>>>>
>>>>>> lxvp vsx0, 0(r1).
>>>>>>
>>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>>>>>>
>>>>>> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
>>>>>> hence it worked but with load fusion and using lxvp it requires sequential register pairs.
>>>>>
>>>>> Do you mean that this is a performance regression?  I.e. the fact that
>>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>>> less allocation freedom?
>>>>>
>>>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>>> is correct in this context).  What does the allocated code look like,
>>>>> and why is it wrong?
>>>>>
>>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>>>> for insn 2412 on its own, only the second half of the register needs to be
>>>>> loaded from memory.
>>>>>
>>>>
>>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>>> could be because of less registers available in order to generate
>>>> sequential registers for lxvp.
>>>>
>>>> Because of spill sequential registers are not generated and breaks the
>>>> correctness. REG_EQUIV is generated by IRA.
>>>>
>>>> Allocated code because of spill doesn't generate sequential registers.
>>>> In LRA reload because of spill marked for IRA adjust to another
>>>> register causing not to generate sequential registers.
>>>>
>>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>>> loaded from memory instead of generating sequential registers.
>>>>
>>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>>
>>>> We can add heuristics in fusion code not to fuse for longer live
>>>> ranges that should solve the problem.
>>>
>>> This doesn't describe the real problem though.  It's natural for
>>> registers to be spilled sometimes.  That would happen for OOmode
>>> registers even if we didn't form them in the fusion pass.  And it's
>>> normal GCC semantics that the two hard registers allocated to an OOmode
>>> pseudo are consecutive.
>>>
>>> Like I asked above, please show the allocated code (including spills
>>> and reloads) and explain why it's wrong.
>>>
>>
>> After LRA reload:
>>
>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>      (nil))
>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>                 (reg:V2DF 44 12 [3119])
>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>      (nil))
>>
>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>      (nil))
>>
>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
> 
> The reload for 2412 looks valid.  What was the original pre-reload
> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
> 

This is preload version of 2473:

(insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
        (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
                (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
                (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
     (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
        (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
            (nil))))

insn 2472 is replaced with 9299 after reload.

> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-10  9:57                           ` Ajit Agarwal
@ 2024-06-10 10:28                             ` Richard Sandiford
  2024-06-11 10:39                               ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-10 10:28 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> On 10/06/24 3:20 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>>>> +	  //                  (reg2)
>>>>>>>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>>>>>>>> set correctly. 
>>>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>>>
>>>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>>>
>>>>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>>>>>>> for. Hence it worked.
>>>>>>>>>
>>>>>>>>> rtl change is correct and there is no error.
>>>>>>>>>
>>>>>>>>> for load fusion spill happens and we dont generate sequential register pairs
>>>>>>>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>>>>>>>> pairs are required for lxvp.
>>>>>>>>
>>>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>>>> look like (before and after things have gone wrong)?
>>>>>>>>
>>>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>>>
>>>>>>>
>>>>>>> After fusion pass:
>>>>>>>
>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>      (nil))
>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>      (nil))
>>>>>>>
>>>>>>> In LRA reload.
>>>>>>>
>>>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>>>>>>         (nil)))
>>>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>>>>                 (reg:V2DF 4283 [3119])
>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>>>>>>>      (nil))
>>>>>>>
>>>>>>>
>>>>>>> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
>>>>>>> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>>>>>>>
>>>>>>> lxvp vsx0, 0(r1).
>>>>>>>
>>>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>>>>>>>
>>>>>>> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
>>>>>>> hence it worked but with load fusion and using lxvp it requires sequential register pairs.
>>>>>>
>>>>>> Do you mean that this is a performance regression?  I.e. the fact that
>>>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>>>> less allocation freedom?
>>>>>>
>>>>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>>>> is correct in this context).  What does the allocated code look like,
>>>>>> and why is it wrong?
>>>>>>
>>>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>>>>> for insn 2412 on its own, only the second half of the register needs to be
>>>>>> loaded from memory.
>>>>>>
>>>>>
>>>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>>>> could be because of less registers available in order to generate
>>>>> sequential registers for lxvp.
>>>>>
>>>>> Because of spill sequential registers are not generated and breaks the
>>>>> correctness. REG_EQUIV is generated by IRA.
>>>>>
>>>>> Allocated code because of spill doesn't generate sequential registers.
>>>>> In LRA reload because of spill marked for IRA adjust to another
>>>>> register causing not to generate sequential registers.
>>>>>
>>>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>>>> loaded from memory instead of generating sequential registers.
>>>>>
>>>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>>>
>>>>> We can add heuristics in fusion code not to fuse for longer live
>>>>> ranges that should solve the problem.
>>>>
>>>> This doesn't describe the real problem though.  It's natural for
>>>> registers to be spilled sometimes.  That would happen for OOmode
>>>> registers even if we didn't form them in the fusion pass.  And it's
>>>> normal GCC semantics that the two hard registers allocated to an OOmode
>>>> pseudo are consecutive.
>>>>
>>>> Like I asked above, please show the allocated code (including spills
>>>> and reloads) and explain why it's wrong.
>>>>
>>>
>>> After LRA reload:
>>>
>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>      (nil))
>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>                 (reg:V2DF 44 12 [3119])
>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>      (nil))
>>>
>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>      (nil))
>>>
>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>> 
>> The reload for 2412 looks valid.  What was the original pre-reload
>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>> 
>
> This is preload version of 2473:
>
> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>             (nil))))
>
> insn 2472 is replaced with 9299 after reload.

You'd have to check the dumps to be sure, but I think 9299 is instead
generated as an input reload of 2412, rather than being a replacement
of insn 2472.  That is, LRA needs to reload (subreg:V2DF (reg:OO 2572) 16)
from memory for insn 2412.  It can use the destination of insn 2412 (r51)
as a temporary to do that.  It doesn't need to load the other half of
reg:OO 2572 for this instruction.  That in itself looks ok.

So it looks like the problem is specific to insn 2473.  Perhaps LRA
thinks that r47 already contains the low half of (reg:OO 2572),
left behind by some previous instruction not shown above?
If LRA is wrong about that -- if r47 doesn't already contain the
low half of (reg:OO 2572) -- then there's a bug somewhere.
But we need to track down and fix the bug rather than try to sidestep
it in the fusion pass.

Richard


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-10 10:28                             ` Richard Sandiford
@ 2024-06-11 10:39                               ` Ajit Agarwal
  2024-06-11 11:06                                 ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 10:39 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 10/06/24 3:58 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> On 10/06/24 3:20 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> On 10/06/24 2:52 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> On 10/06/24 2:12 pm, Richard Sandiford wrote:
>>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	  rtx set = single_set (insn);
>>>>>>>>>>>>>>>> +	  if (set == NULL_RTX)
>>>>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	  rtx op0 = SET_SRC (set);
>>>>>>>>>>>>>>>> +	  rtx_code code = GET_CODE (op0);
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	  // This check is added as register pairs are not generated
>>>>>>>>>>>>>>>> +	  // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>>>>>>>>>>>>> +	  //                  (reg2)
>>>>>>>>>>>>>>>> +	  //                  (neg:V2DF (reg3)))
>>>>>>>>>>>>>>>> +	  if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>>>>>>>>>>>>> +	    return false;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What's special about (neg (fma ...))?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not sure why register allocator fails allocating register pairs with
>>>>>>>>>>>>>> NEG Unary operation with fma operand. I have not debugged register allocator why the NEG
>>>>>>>>>>>>>> Unary operation with fma operand. 
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
>>>>>>>>>>>> set correctly. 
>>>>>>>>>>>> IRA marked them spill candidates as spill priority is zero.
>>>>>>>>>>>>
>>>>>>>>>>>> Due to this LRA reload pass couldn't allocate register pairs.
>>>>>>>>>>>
>>>>>>>>>>> I think this is just restating the symptom though.  I suppose the same
>>>>>>>>>>> kind of questions apply here too: what was the instruction before the
>>>>>>>>>>> pass runs, what was the instruction after the pass runs, and why is
>>>>>>>>>>> the rtl change incorrect (by the meaning above)?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Original case where we dont do load fusion, spill happens, in that
>>>>>>>>>> case we dont require sequential register pairs to be generated for 2 loads
>>>>>>>>>> for. Hence it worked.
>>>>>>>>>>
>>>>>>>>>> rtl change is correct and there is no error.
>>>>>>>>>>
>>>>>>>>>> for load fusion spill happens and we dont generate sequential register pairs
>>>>>>>>>> because pf spill candidate and lxvp gives incorrect results as sequential register
>>>>>>>>>> pairs are required for lxvp.
>>>>>>>>>
>>>>>>>>> Can you go into more detail?  How is the lxvp represented?  And how do
>>>>>>>>> we end up not getting a sequential register pair?  What does the rtl
>>>>>>>>> look like (before and after things have gone wrong)?
>>>>>>>>>
>>>>>>>>> It seems like either the rtl is not describing the result of the fusion
>>>>>>>>> correctly or there is some problem in the .md description of lxvp.
>>>>>>>>>
>>>>>>>>
>>>>>>>> After fusion pass:
>>>>>>>>
>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>      (nil))
>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>      (nil))
>>>>>>>>
>>>>>>>> In LRA reload.
>>>>>>>>
>>>>>>>> (insn 2472 2461 2412 161 (set (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>         (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])) "shell_lam.fppized.f":238:72 2187 {*movoo}
>>>>>>>>      (expr_list:REG_EQUIV (mem:OO (reg:DI 4260 [orig:1285 ivtmp.886 ] [1285]) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+0 S16 A64])
>>>>>>>>         (nil)))
>>>>>>>> (insn 2412 2472 2477 161 (set (reg:V2DF 240 [ vect__302.545 ])
>>>>>>>>         (neg:V2DF (fma:V2DF (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 16)
>>>>>>>>                 (reg:V2DF 4283 [3119])
>>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 16)))))  {*vsx_nfmsv2df4}
>>>>>>>>      (nil))
>>>>>>>>
>>>>>>>>
>>>>>>>> In LRA reload sequential registers are not generated as r2572 is splled and move to spill location
>>>>>>>> in stack and subsequent uses loads from stack. Hence sequential registers pairs are not generated.
>>>>>>>>
>>>>>>>> lxvp vsx0, 0(r1).
>>>>>>>>
>>>>>>>> It loads from from r1+0 into vsx0 and vsx1 and appropriate uses use sequential register pairs.
>>>>>>>>
>>>>>>>> Without load fusion since 2 loads exists and 2 loads need not require sequential registers
>>>>>>>> hence it worked but with load fusion and using lxvp it requires sequential register pairs.
>>>>>>>
>>>>>>> Do you mean that this is a performance regression?  I.e. the fact that
>>>>>>> lxvp requires sequential registers causes extra spilling, due to having
>>>>>>> less allocation freedom?
>>>>>>>
>>>>>>> Or is it a correctness problem?  If so, what is it?  Nothing in the rtl
>>>>>>> above looks wrong in principle (although I've no idea if the REG_EQUIV
>>>>>>> is correct in this context).  What does the allocated code look like,
>>>>>>> and why is it wrong?
>>>>>>>
>>>>>>> If (reg:OO 2561) is spilled and then one half of it used, only that half
>>>>>>> needs to be loaded from the spill slot.  E.g. if (reg:OO 2561) is reloaded
>>>>>>> for insn 2412 on its own, only the second half of the register needs to be
>>>>>>> loaded from memory.
>>>>>>>
>>>>>>
>>>>>> This is bwaves spec 2017 benchmark. Spill happening in register allocator
>>>>>> could be because of less registers available in order to generate
>>>>>> sequential registers for lxvp.
>>>>>>
>>>>>> Because of spill sequential registers are not generated and breaks the
>>>>>> correctness. REG_EQUIV is generated by IRA.
>>>>>>
>>>>>> Allocated code because of spill doesn't generate sequential registers.
>>>>>> In LRA reload because of spill marked for IRA adjust to another
>>>>>> register causing not to generate sequential registers.
>>>>>>
>>>>>> reg:OO 2572 is spilled not reg:OO 2561. Because of spilled its
>>>>>> loaded from memory instead of generating sequential registers.
>>>>>>
>>>>>> Other reasons for spilling because of long live range for reg:OO 2572.
>>>>>>
>>>>>> We can add heuristics in fusion code not to fuse for longer live
>>>>>> ranges that should solve the problem.
>>>>>
>>>>> This doesn't describe the real problem though.  It's natural for
>>>>> registers to be spilled sometimes.  That would happen for OOmode
>>>>> registers even if we didn't form them in the fusion pass.  And it's
>>>>> normal GCC semantics that the two hard registers allocated to an OOmode
>>>>> pseudo are consecutive.
>>>>>
>>>>> Like I asked above, please show the allocated code (including spills
>>>>> and reloads) and explain why it's wrong.
>>>>>
>>>>
>>>> After LRA reload:
>>>>
>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>      (nil))
>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>                 (reg:V2DF 44 12 [3119])
>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>      (nil))
>>>>
>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>      (nil))
>>>>
>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>
>>> The reload for 2412 looks valid.  What was the original pre-reload
>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>
>>
>> This is preload version of 2473:
>>
>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>             (nil))))
>>
>> insn 2472 is replaced with 9299 after reload.
> 
> You'd have to check the dumps to be sure, but I think 9299 is instead
> generated as an input reload of 2412, rather than being a replacement
> of insn 2472.  T

Yes it is generated for 2412. The predecessor of 2412 is load from
plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).

This is not correct as we are not generating lxvp and it is 
normal load lxv.
As normal load is generated in predecessor insn of 2412 with
plus constant offset it breaks the correctness.

hat is, LRA needs to reload (subreg:V2DF (reg:OO 2572) 16)
> from memory for insn 2412.  It can use the destination of insn 2412 (r51)
> as a temporary to do that.  It doesn't need to load the other half of
> reg:OO 2572 for this instruction.  That in itself looks ok.
> 
> So it looks like the problem is specific to insn 2473.  Perhaps LRA
> thinks that r47 already contains the low half of (reg:OO 2572),
> left behind by some previous instruction not shown above?
> If LRA is wrong about that -- if r47 doesn't already contain the
> low half of (reg:OO 2572) -- then there's a bug somewhere.
> But we need to track down and fix the bug rather than try to sidestep
> it in the fusion pass.
> 

Similarly for 2473 normal load with 0 offset are generated in predecessor
insn as we are generating subreg:V2DF (reg OO 2572) 0 in 2473. As we are not
generating lxvp this is not correct and breaks the code.

Above code is valid if we are generating lxvp that generates
sequential registers, but we are not geneating lxvp and normal
load is generated and this breaks the code.
> Richard
> 

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 10:39                               ` Ajit Agarwal
@ 2024-06-11 11:06                                 ` Richard Sandiford
  2024-06-11 11:26                                   ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-11 11:06 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> After LRA reload:
>>>>>
>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>      (nil))
>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>      (nil))
>>>>>
>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>      (nil))
>>>>>
>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>
>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>
>>>
>>> This is preload version of 2473:
>>>
>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>             (nil))))
>>>
>>> insn 2472 is replaced with 9299 after reload.
>> 
>> You'd have to check the dumps to be sure, but I think 9299 is instead
>> generated as an input reload of 2412, rather than being a replacement
>> of insn 2472.  T
>
> Yes it is generated for 2412. The predecessor of 2412 is load from
> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>
> This is not correct as we are not generating lxvp and it is 
> normal load lxv.
> As normal load is generated in predecessor insn of 2412 with
> plus constant offset it breaks the correctness.

Not using lxvp is a deliberate choice though.

If a (reg:OO R) has been spilled, there's no requirement for LRA
to load both halves of R when only one half is needed.  LRA just
loads what it needs into whichever registers happen to be free.

If the reload of R instead used lxvp, LRA would be forced to free
up another register for the other half of R, even though that value
would never be used.

>> That is, LRA needs to reload (subreg:V2DF (reg:OO 2572) 16)
>> from memory for insn 2412.  It can use the destination of insn 2412 (r51)
>> as a temporary to do that.  It doesn't need to load the other half of
>> reg:OO 2572 for this instruction.  That in itself looks ok.
>> 
>> So it looks like the problem is specific to insn 2473.  Perhaps LRA
>> thinks that r47 already contains the low half of (reg:OO 2572),
>> left behind by some previous instruction not shown above?
>> If LRA is wrong about that -- if r47 doesn't already contain the
>> low half of (reg:OO 2572) -- then there's a bug somewhere.
>> But we need to track down and fix the bug rather than try to sidestep
>> it in the fusion pass.
>> 
>
> Similarly for 2473 normal load with 0 offset are generated in predecessor
> insn as we are generating subreg:V2DF (reg OO 2572) 0 in 2473. As we are not
> generating lxvp this is not correct and breaks the code.

That too sounds ok, for the reasons above.

> Above code is valid if we are generating lxvp that generates
> sequential registers, but we are not geneating lxvp and normal
> load is generated and this breaks the code.

I think you said earlier that the code is miscompiled (fails at
runtime).  If that's due to an RA issue, then presumably there is
an instruction that, after RA, is reading the wrong value.  In other
words, there's presumably a register input somewhere that has the wrong
contents.  Have you isolated which instruction and register that is?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 11:06                                 ` Richard Sandiford
@ 2024-06-11 11:26                                   ` Ajit Agarwal
  2024-06-11 11:30                                     ` Ajit Agarwal
  2024-06-11 11:43                                     ` Ajit Agarwal
  0 siblings, 2 replies; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 11:26 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 11/06/24 4:36 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> After LRA reload:
>>>>>>
>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>      (nil))
>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>      (nil))
>>>>>>
>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>      (nil))
>>>>>>
>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>
>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>
>>>>
>>>> This is preload version of 2473:
>>>>
>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>             (nil))))
>>>>
>>>> insn 2472 is replaced with 9299 after reload.
>>>
>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>> generated as an input reload of 2412, rather than being a replacement
>>> of insn 2472.  T
>>
>> Yes it is generated for 2412. The predecessor of 2412 is load from
>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>
>> This is not correct as we are not generating lxvp and it is 
>> normal load lxv.
>> As normal load is generated in predecessor insn of 2412 with
>> plus constant offset it breaks the correctness.
> 
> Not using lxvp is a deliberate choice though.
> 
> If a (reg:OO R) has been spilled, there's no requirement for LRA
> to load both halves of R when only one half is needed.  LRA just
> loads what it needs into whichever registers happen to be free.
> 
> If the reload of R instead used lxvp, LRA would be forced to free
> up another register for the other half of R, even though that value
> would never be used.
> 

If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
will be from plus offset 16 instead it should be loaded value 
from zero offset. As in load fusion pass we are replacing
(reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
is from plus 16 offsets and thats why its breaking the correctness.

Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
and loaded value is from 16 offset instead its loading from zero
offset and thats why we are breaking the correctness.

To generate lxvp this is the semantics of replacing in load fusion pass. 
>>> That is, LRA needs to reload (subreg:V2DF (reg:OO 2572) 16)
>>> from memory for insn 2412.  It can use the destination of insn 2412 (r51)
>>> as a temporary to do that.  It doesn't need to load the other half of
>>> reg:OO 2572 for this instruction.  That in itself looks ok.
>>>
>>> So it looks like the problem is specific to insn 2473.  Perhaps LRA
>>> thinks that r47 already contains the low half of (reg:OO 2572),
>>> left behind by some previous instruction not shown above?
>>> If LRA is wrong about that -- if r47 doesn't already contain the
>>> low half of (reg:OO 2572) -- then there's a bug somewhere.
>>> But we need to track down and fix the bug rather than try to sidestep
>>> it in the fusion pass.
>>>
>>
>> Similarly for 2473 normal load with 0 offset are generated in predecessor
>> insn as we are generating subreg:V2DF (reg OO 2572) 0 in 2473. As we are not
>> generating lxvp this is not correct and breaks the code.
> 
> That too sounds ok, for the reasons above.
> 
>> Above code is valid if we are generating lxvp that generates
>> sequential registers, but we are not geneating lxvp and normal
>> load is generated and this breaks the code.
> 
> I think you said earlier that the code is miscompiled (fails at
> runtime).  If that's due to an RA issue, then presumably there is
> an instruction that, after RA, is reading the wrong value.  In other
> words, there's presumably a register input somewhere that has the wrong
> contents.  Have you isolated which instruction and register that is?
> 
The code is compiled and run successfully but we get miscompare
error.

> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 11:26                                   ` Ajit Agarwal
@ 2024-06-11 11:30                                     ` Ajit Agarwal
  2024-06-11 11:45                                       ` Richard Sandiford
  2024-06-11 11:43                                     ` Ajit Agarwal
  1 sibling, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 11:30 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 11/06/24 4:56 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>> After LRA reload:
>>>>>>>
>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>      (nil))
>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>      (nil))
>>>>>>>
>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>      (nil))
>>>>>>>
>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>
>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>
>>>>>
>>>>> This is preload version of 2473:
>>>>>
>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>             (nil))))
>>>>>
>>>>> insn 2472 is replaced with 9299 after reload.
>>>>
>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>> generated as an input reload of 2412, rather than being a replacement
>>>> of insn 2472.  T
>>>
>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>
>>> This is not correct as we are not generating lxvp and it is 
>>> normal load lxv.
>>> As normal load is generated in predecessor insn of 2412 with
>>> plus constant offset it breaks the correctness.
>>
>> Not using lxvp is a deliberate choice though.
>>
>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>> to load both halves of R when only one half is needed.  LRA just
>> loads what it needs into whichever registers happen to be free.
>>
>> If the reload of R instead used lxvp, LRA would be forced to free
>> up another register for the other half of R, even though that value
>> would never be used.
>>
> 
> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
> will be from plus offset 16 instead it should be loaded value 
> from zero offset. As in load fusion pass we are replacing
> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
> is from plus 16 offsets and thats why its breaking the correctness.
> 
> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
> and loaded value is from 16 offset instead its loading from zero
> offset and thats why we are breaking the correctness.
> 

If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
will be from plus offset 16 instead it should be loaded value 
from zero offset. As in load fusion pass we are replacing
(reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
is from plus 16 offsets instead it should load from zero offset.
Thats why its breaking the correctness.
 
Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
and loaded value is from 16 offset instead its loading from zero
offset and thats why we are breaking the correctness.

Thanks & Regards
Ajit
> To generate lxvp this is the semantics of replacing in load fusion pass. 
>>>> That is, LRA needs to reload (subreg:V2DF (reg:OO 2572) 16)
>>>> from memory for insn 2412.  It can use the destination of insn 2412 (r51)
>>>> as a temporary to do that.  It doesn't need to load the other half of
>>>> reg:OO 2572 for this instruction.  That in itself looks ok.
>>>>
>>>> So it looks like the problem is specific to insn 2473.  Perhaps LRA
>>>> thinks that r47 already contains the low half of (reg:OO 2572),
>>>> left behind by some previous instruction not shown above?
>>>> If LRA is wrong about that -- if r47 doesn't already contain the
>>>> low half of (reg:OO 2572) -- then there's a bug somewhere.
>>>> But we need to track down and fix the bug rather than try to sidestep
>>>> it in the fusion pass.
>>>>
>>>
>>> Similarly for 2473 normal load with 0 offset are generated in predecessor
>>> insn as we are generating subreg:V2DF (reg OO 2572) 0 in 2473. As we are not
>>> generating lxvp this is not correct and breaks the code.
>>
>> That too sounds ok, for the reasons above.
>>
>>> Above code is valid if we are generating lxvp that generates
>>> sequential registers, but we are not geneating lxvp and normal
>>> load is generated and this breaks the code.
>>
>> I think you said earlier that the code is miscompiled (fails at
>> runtime).  If that's due to an RA issue, then presumably there is
>> an instruction that, after RA, is reading the wrong value.  In other
>> words, there's presumably a register input somewhere that has the wrong
>> contents.  Have you isolated which instruction and register that is?
>>
> The code is compiled and run successfully but we get miscompare
> error.
> 
>> Thanks,
>> Richard
> 
> Thanks & Regards
> Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 11:26                                   ` Ajit Agarwal
  2024-06-11 11:30                                     ` Ajit Agarwal
@ 2024-06-11 11:43                                     ` Ajit Agarwal
  1 sibling, 0 replies; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 11:43 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 11/06/24 4:56 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>> After LRA reload:
>>>>>>>
>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>      (nil))
>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>      (nil))
>>>>>>>
>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>      (nil))
>>>>>>>
>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>
>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>
>>>>>
>>>>> This is preload version of 2473:
>>>>>
>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>             (nil))))
>>>>>
>>>>> insn 2472 is replaced with 9299 after reload.
>>>>
>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>> generated as an input reload of 2412, rather than being a replacement
>>>> of insn 2472.  T
>>>
>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>
>>> This is not correct as we are not generating lxvp and it is 
>>> normal load lxv.
>>> As normal load is generated in predecessor insn of 2412 with
>>> plus constant offset it breaks the correctness.
>>
>> Not using lxvp is a deliberate choice though.
>>
>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>> to load both halves of R when only one half is needed.  LRA just
>> loads what it needs into whichever registers happen to be free.
>>
>> If the reload of R instead used lxvp, LRA would be forced to free
>> up another register for the other half of R, even though that value
>> would never be used.
>>
> 
> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
> will be from plus offset 16 instead it should be loaded value 
> from zero offset. As in load fusion pass we are replacing
> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
> is from plus 16 offsets and thats why its breaking the correctness.
> 
> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
> and loaded value is from 16 offset instead its loading from zero
> offset and thats why we are breaking the correctness.
> 
If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
will be from plus offset 16 instead it should be loaded value 
from zero offset. As in load fusion pass we are replacing
(reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
is from plus 16 offsets instead it should load from zero offset.
Thats why its breaking the correctness.
 
Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
and loaded value is from 16 offset instead its loading from zero
offset and thats why we are breaking the correctness.

Thanks & Regards
Ajit

> To generate lxvp this is the semantics of replacing in load fusion pass. 
>>>> That is, LRA needs to reload (subreg:V2DF (reg:OO 2572) 16)
>>>> from memory for insn 2412.  It can use the destination of insn 2412 (r51)
>>>> as a temporary to do that.  It doesn't need to load the other half of
>>>> reg:OO 2572 for this instruction.  That in itself looks ok.
>>>>
>>>> So it looks like the problem is specific to insn 2473.  Perhaps LRA
>>>> thinks that r47 already contains the low half of (reg:OO 2572),
>>>> left behind by some previous instruction not shown above?
>>>> If LRA is wrong about that -- if r47 doesn't already contain the
>>>> low half of (reg:OO 2572) -- then there's a bug somewhere.
>>>> But we need to track down and fix the bug rather than try to sidestep
>>>> it in the fusion pass.
>>>>
>>>
>>> Similarly for 2473 normal load with 0 offset are generated in predecessor
>>> insn as we are generating subreg:V2DF (reg OO 2572) 0 in 2473. As we are not
>>> generating lxvp this is not correct and breaks the code.
>>
>> That too sounds ok, for the reasons above.
>>
>>> Above code is valid if we are generating lxvp that generates
>>> sequential registers, but we are not geneating lxvp and normal
>>> load is generated and this breaks the code.
>>
>> I think you said earlier that the code is miscompiled (fails at
>> runtime).  If that's due to an RA issue, then presumably there is
>> an instruction that, after RA, is reading the wrong value.  In other
>> words, there's presumably a register input somewhere that has the wrong
>> contents.  Have you isolated which instruction and register that is?
>>
> The code is compiled and run successfully but we get miscompare
> error.
> 
>> Thanks,
>> Richard
> 
> Thanks & Regards
> Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 11:30                                     ` Ajit Agarwal
@ 2024-06-11 11:45                                       ` Richard Sandiford
  2024-06-11 11:59                                         ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-11 11:45 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello Richard:
> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>> Hello Richard:
>> 
>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>> After LRA reload:
>>>>>>>>
>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>      (nil))
>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>      (nil))
>>>>>>>>
>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>>      (nil))
>>>>>>>>
>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>>
>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>>
>>>>>>
>>>>>> This is preload version of 2473:
>>>>>>
>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>             (nil))))
>>>>>>
>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>
>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>> of insn 2472.  T
>>>>
>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>
>>>> This is not correct as we are not generating lxvp and it is 
>>>> normal load lxv.
>>>> As normal load is generated in predecessor insn of 2412 with
>>>> plus constant offset it breaks the correctness.
>>>
>>> Not using lxvp is a deliberate choice though.
>>>
>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>> to load both halves of R when only one half is needed.  LRA just
>>> loads what it needs into whichever registers happen to be free.
>>>
>>> If the reload of R instead used lxvp, LRA would be forced to free
>>> up another register for the other half of R, even though that value
>>> would never be used.
>>>
>> 
>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>> will be from plus offset 16 instead it should be loaded value 
>> from zero offset. As in load fusion pass we are replacing
>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>> is from plus 16 offsets and thats why its breaking the correctness.
>> 
>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>> and loaded value is from 16 offset instead its loading from zero
>> offset and thats why we are breaking the correctness.
>> 
>
> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
> will be from plus offset 16 instead it should be loaded value 
> from zero offset. As in load fusion pass we are replacing
> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
> is from plus 16 offsets instead it should load from zero offset.
> Thats why its breaking the correctness.
>  
> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
> and loaded value is from 16 offset instead its loading from zero
> offset and thats why we are breaking the correctness.

I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always

(a) the first hard register in (reg:OO R), when the whole of R
    is stored in hard registers
(b) at address offset 0 from the start of (reg:OO R), when R is
    spilled to memory

Similarly, (subreg:V2DI (reg:OO R) 16) is always

(c) the second hard register in (reg:OO R), when the whole of R
    is stored in hard registers
(d) at address offset 16 from the start of (reg:OO R), when R is
    spilled to memory

Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 11:45                                       ` Richard Sandiford
@ 2024-06-11 11:59                                         ` Ajit Agarwal
  2024-06-11 12:42                                           ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 11:59 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 11/06/24 5:15 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>> Hello Richard:
>>>
>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>> After LRA reload:
>>>>>>>>>
>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>      (nil))
>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>>      (nil))
>>>>>>>>>
>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>>>      (nil))
>>>>>>>>>
>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>>>
>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>>>
>>>>>>>
>>>>>>> This is preload version of 2473:
>>>>>>>
>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>             (nil))))
>>>>>>>
>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>
>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>> of insn 2472.  T
>>>>>
>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>
>>>>> This is not correct as we are not generating lxvp and it is 
>>>>> normal load lxv.
>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>> plus constant offset it breaks the correctness.
>>>>
>>>> Not using lxvp is a deliberate choice though.
>>>>
>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>> to load both halves of R when only one half is needed.  LRA just
>>>> loads what it needs into whichever registers happen to be free.
>>>>
>>>> If the reload of R instead used lxvp, LRA would be forced to free
>>>> up another register for the other half of R, even though that value
>>>> would never be used.
>>>>
>>>
>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>> will be from plus offset 16 instead it should be loaded value 
>>> from zero offset. As in load fusion pass we are replacing
>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>> is from plus 16 offsets and thats why its breaking the correctness.
>>>
>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>> and loaded value is from 16 offset instead its loading from zero
>>> offset and thats why we are breaking the correctness.
>>>
>>
>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>> will be from plus offset 16 instead it should be loaded value 
>> from zero offset. As in load fusion pass we are replacing
>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>> is from plus 16 offsets instead it should load from zero offset.
>> Thats why its breaking the correctness.
>>  
>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>> and loaded value is from 16 offset instead its loading from zero
>> offset and thats why we are breaking the correctness.
> 
> I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always
> 
> (a) the first hard register in (reg:OO R), when the whole of R
>     is stored in hard registers
> (b) at address offset 0 from the start of (reg:OO R), when R is
>     spilled to memory
> 
> Similarly, (subreg:V2DI (reg:OO R) 16) is always
> 
> (c) the second hard register in (reg:OO R), when the whole of R
>     is stored in hard registers
> (d) at address offset 16 from the start of (reg:OO R), when R is
>     spilled to memory
>

Yes but we are replacing use of loaded value from plus 16 offset
with subreg (reg OO ) 0 and similarly we are replacing use of loaded
value from 0 offset with subreg (reg OO ) 16 as we are swapping
the use operand.

When it is spilled its vice versa subreg (reg OO ) 16 should be
loaded from 0 offset and subreg (reg OO) 0 should be loaded
from 16 offset as we are swapping the use operand.

This is the semantics of lxvp.

> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 11:59                                         ` Ajit Agarwal
@ 2024-06-11 12:42                                           ` Richard Sandiford
  2024-06-11 12:57                                             ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-11 12:42 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello Richard:
>
> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> Hello Richard:
>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>> Hello Richard:
>>>>
>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>> After LRA reload:
>>>>>>>>>>
>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>      (nil))
>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>      (nil))
>>>>>>>>>>
>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>      (nil))
>>>>>>>>>>
>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>>>>
>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is preload version of 2473:
>>>>>>>>
>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>>             (nil))))
>>>>>>>>
>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>
>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>> of insn 2472.  T
>>>>>>
>>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>>
>>>>>> This is not correct as we are not generating lxvp and it is 
>>>>>> normal load lxv.
>>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>>> plus constant offset it breaks the correctness.
>>>>>
>>>>> Not using lxvp is a deliberate choice though.
>>>>>
>>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>>> to load both halves of R when only one half is needed.  LRA just
>>>>> loads what it needs into whichever registers happen to be free.
>>>>>
>>>>> If the reload of R instead used lxvp, LRA would be forced to free
>>>>> up another register for the other half of R, even though that value
>>>>> would never be used.
>>>>>
>>>>
>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>> will be from plus offset 16 instead it should be loaded value 
>>>> from zero offset. As in load fusion pass we are replacing
>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>> is from plus 16 offsets and thats why its breaking the correctness.
>>>>
>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>> and loaded value is from 16 offset instead its loading from zero
>>>> offset and thats why we are breaking the correctness.
>>>>
>>>
>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>> will be from plus offset 16 instead it should be loaded value 
>>> from zero offset. As in load fusion pass we are replacing
>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>> is from plus 16 offsets instead it should load from zero offset.
>>> Thats why its breaking the correctness.
>>>  
>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>> and loaded value is from 16 offset instead its loading from zero
>>> offset and thats why we are breaking the correctness.
>> 
>> I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always
>> 
>> (a) the first hard register in (reg:OO R), when the whole of R
>>     is stored in hard registers
>> (b) at address offset 0 from the start of (reg:OO R), when R is
>>     spilled to memory
>> 
>> Similarly, (subreg:V2DI (reg:OO R) 16) is always
>> 
>> (c) the second hard register in (reg:OO R), when the whole of R
>>     is stored in hard registers
>> (d) at address offset 16 from the start of (reg:OO R), when R is
>>     spilled to memory
>>
>
> Yes but we are replacing use of loaded value from plus 16 offset
> with subreg (reg OO ) 0 and similarly we are replacing use of loaded
> value from 0 offset with subreg (reg OO ) 16 as we are swapping
> the use operand.
>
> When it is spilled its vice versa subreg (reg OO ) 16 should be
> loaded from 0 offset and subreg (reg OO) 0 should be loaded
> from 16 offset as we are swapping the use operand.
>
> This is the semantics of lxvp.

Hmm, OK.  Does that mean that:

   lxvp A,B

loads A+1 from B and A from B+16?  (I couldn't find an online
description of the instruction btw -- is there one?)

If lxvp does behave like that, it shouldn't be described as a normal move:

  [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]

since the rules I listed above are target-independent.

Alternatively, if lxvp is described as a normal move,
with the 128-bit pieces swapped compared to GCC's normal order,
TARGET_CAN_CHANGE_MODE_CLASS must prevent subregs that select
one half of the register (or less).  But that would defeat exactly
the optimisation that you're trying to do.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 12:42                                           ` Richard Sandiford
@ 2024-06-11 12:57                                             ` Ajit Agarwal
  2024-06-11 13:37                                               ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 12:57 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 11/06/24 6:12 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>>
>> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> Hello Richard:
>>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>>> Hello Richard:
>>>>>
>>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>> After LRA reload:
>>>>>>>>>>>
>>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>>      (nil))
>>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>      (nil))
>>>>>>>>>>>
>>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>      (nil))
>>>>>>>>>>>
>>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>>>>>
>>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is preload version of 2473:
>>>>>>>>>
>>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>>>             (nil))))
>>>>>>>>>
>>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>>
>>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>>> of insn 2472.  T
>>>>>>>
>>>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>>>
>>>>>>> This is not correct as we are not generating lxvp and it is 
>>>>>>> normal load lxv.
>>>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>>>> plus constant offset it breaks the correctness.
>>>>>>
>>>>>> Not using lxvp is a deliberate choice though.
>>>>>>
>>>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>>>> to load both halves of R when only one half is needed.  LRA just
>>>>>> loads what it needs into whichever registers happen to be free.
>>>>>>
>>>>>> If the reload of R instead used lxvp, LRA would be forced to free
>>>>>> up another register for the other half of R, even though that value
>>>>>> would never be used.
>>>>>>
>>>>>
>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>> from zero offset. As in load fusion pass we are replacing
>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>> is from plus 16 offsets and thats why its breaking the correctness.
>>>>>
>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>> offset and thats why we are breaking the correctness.
>>>>>
>>>>
>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>> will be from plus offset 16 instead it should be loaded value 
>>>> from zero offset. As in load fusion pass we are replacing
>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>> is from plus 16 offsets instead it should load from zero offset.
>>>> Thats why its breaking the correctness.
>>>>  
>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>> and loaded value is from 16 offset instead its loading from zero
>>>> offset and thats why we are breaking the correctness.
>>>
>>> I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always
>>>
>>> (a) the first hard register in (reg:OO R), when the whole of R
>>>     is stored in hard registers
>>> (b) at address offset 0 from the start of (reg:OO R), when R is
>>>     spilled to memory
>>>
>>> Similarly, (subreg:V2DI (reg:OO R) 16) is always
>>>
>>> (c) the second hard register in (reg:OO R), when the whole of R
>>>     is stored in hard registers
>>> (d) at address offset 16 from the start of (reg:OO R), when R is
>>>     spilled to memory
>>>
>>
>> Yes but we are replacing use of loaded value from plus 16 offset
>> with subreg (reg OO ) 0 and similarly we are replacing use of loaded
>> value from 0 offset with subreg (reg OO ) 16 as we are swapping
>> the use operand.
>>
>> When it is spilled its vice versa subreg (reg OO ) 16 should be
>> loaded from 0 offset and subreg (reg OO) 0 should be loaded
>> from 16 offset as we are swapping the use operand.
>>
>> This is the semantics of lxvp.
> 
> Hmm, OK.  Does that mean that:
> 
>    lxvp A,B
> 
> loads A+1 from B and A from B+16?  (I couldn't find an online
> description of the instruction btw -- is there one?)
> 

Yes thats correct. Even I didn't find online document that
describes the same.

> If lxvp does behave like that, it shouldn't be described as a normal move:
> 
>   [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
> 	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
> 
> since the rules I listed above are target-independent.
>

Rules for load for lxvp should as normal, but how the use
of lxvp defines should be different for lxvp than normal
load.
 
> Alternatively, if lxvp is described as a normal move,
> with the 128-bit pieces swapped compared to GCC's normal order,
> TARGET_CAN_CHANGE_MODE_CLASS must prevent subregs that select
> one half of the register (or less).  But that would defeat exactly
> the optimisation that you're trying to do.
> 

Sorry, I didn't get how TARGET_CAN_CHANGE_MODE_CLASS prevents
subreg that selects one half of the registers. 

I didn't see such cases.
 > Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 12:57                                             ` Ajit Agarwal
@ 2024-06-11 13:37                                               ` Richard Sandiford
  2024-06-11 14:40                                                 ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-11 13:37 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello Richard:
> On 11/06/24 6:12 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> Hello Richard:
>>>
>>> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> Hello Richard:
>>>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>>>> Hello Richard:
>>>>>>
>>>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>> After LRA reload:
>>>>>>>>>>>>
>>>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>>>      (nil))
>>>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>
>>>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>
>>>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>>>>>>
>>>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is preload version of 2473:
>>>>>>>>>>
>>>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>>>>             (nil))))
>>>>>>>>>>
>>>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>>>
>>>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>>>> of insn 2472.  T
>>>>>>>>
>>>>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>>>>
>>>>>>>> This is not correct as we are not generating lxvp and it is 
>>>>>>>> normal load lxv.
>>>>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>>>>> plus constant offset it breaks the correctness.
>>>>>>>
>>>>>>> Not using lxvp is a deliberate choice though.
>>>>>>>
>>>>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>>>>> to load both halves of R when only one half is needed.  LRA just
>>>>>>> loads what it needs into whichever registers happen to be free.
>>>>>>>
>>>>>>> If the reload of R instead used lxvp, LRA would be forced to free
>>>>>>> up another register for the other half of R, even though that value
>>>>>>> would never be used.
>>>>>>>
>>>>>>
>>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>>> from zero offset. As in load fusion pass we are replacing
>>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>>> is from plus 16 offsets and thats why its breaking the correctness.
>>>>>>
>>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>>> offset and thats why we are breaking the correctness.
>>>>>>
>>>>>
>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>> from zero offset. As in load fusion pass we are replacing
>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>> is from plus 16 offsets instead it should load from zero offset.
>>>>> Thats why its breaking the correctness.
>>>>>  
>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>> offset and thats why we are breaking the correctness.
>>>>
>>>> I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always
>>>>
>>>> (a) the first hard register in (reg:OO R), when the whole of R
>>>>     is stored in hard registers
>>>> (b) at address offset 0 from the start of (reg:OO R), when R is
>>>>     spilled to memory
>>>>
>>>> Similarly, (subreg:V2DI (reg:OO R) 16) is always
>>>>
>>>> (c) the second hard register in (reg:OO R), when the whole of R
>>>>     is stored in hard registers
>>>> (d) at address offset 16 from the start of (reg:OO R), when R is
>>>>     spilled to memory
>>>>
>>>
>>> Yes but we are replacing use of loaded value from plus 16 offset
>>> with subreg (reg OO ) 0 and similarly we are replacing use of loaded
>>> value from 0 offset with subreg (reg OO ) 16 as we are swapping
>>> the use operand.
>>>
>>> When it is spilled its vice versa subreg (reg OO ) 16 should be
>>> loaded from 0 offset and subreg (reg OO) 0 should be loaded
>>> from 16 offset as we are swapping the use operand.
>>>
>>> This is the semantics of lxvp.
>> 
>> Hmm, OK.  Does that mean that:
>> 
>>    lxvp A,B
>> 
>> loads A+1 from B and A from B+16?  (I couldn't find an online
>> description of the instruction btw -- is there one?)
>> 
>
> Yes thats correct. Even I didn't find online document that
> describes the same.

Thanks, I think I get it now.

>> If lxvp does behave like that, it shouldn't be described as a normal move:
>> 
>>   [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
>> 	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
>> 
>> since the rules I listed above are target-independent.
>>
>
> Rules for load for lxvp should as normal, but how the use
> of lxvp defines should be different for lxvp than normal
> load.

I don't think that's true though.  Because of the rules above,
GCC expects that:

  (set (reg:OO R) (mem:OO ADDR))

will load ADDR+0 into R and ADDR+16 into R+1.  Anything else will
cause exactly the kind of mix-up that you're seeing here.

In the worst case, the load should be described as:

  (set (reg:OO R)
       (unspec:OO [(mem:OO ADDR)] UNSPEC_LXVP))

although there are other alternatives that could be used if OOmode
weren't defined as an OPAQUE_MODE.

>> Alternatively, if lxvp is described as a normal move,
>> with the 128-bit pieces swapped compared to GCC's normal order,
>> TARGET_CAN_CHANGE_MODE_CLASS must prevent subregs that select
>> one half of the register (or less).  But that would defeat exactly
>> the optimisation that you're trying to do.
>> 
>
> Sorry, I didn't get how TARGET_CAN_CHANGE_MODE_CLASS prevents
> subreg that selects one half of the registers. 

It doesn't prevent the subregs at such, but it stops the mode
change from happening in certain registers.

For example, if we have (subreg:V2DI (reg:OO R) 0), and if
TARGET_CAN_CHANGE_MODE_CLASS disallows changes between V2DI
and OO in all register classes that can hold OO, then the register
allocator would handle the subreg by spilling register R to memory
and loading a V2DI piece of it from there.

This would allow the port to store OOmode in an unusual order.
But like I say, it would also defeat exactly what you're trying
to do here.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 13:37                                               ` Richard Sandiford
@ 2024-06-11 14:40                                                 ` Ajit Agarwal
  2024-06-11 15:29                                                   ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 14:40 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 11/06/24 7:07 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>> On 11/06/24 6:12 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> Hello Richard:
>>>>
>>>> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> Hello Richard:
>>>>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>>>>> Hello Richard:
>>>>>>>
>>>>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>>> After LRA reload:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>>
>>>>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>>>>>>>
>>>>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is preload version of 2473:
>>>>>>>>>>>
>>>>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>>>>>             (nil))))
>>>>>>>>>>>
>>>>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>>>>
>>>>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>>>>> of insn 2472.  T
>>>>>>>>>
>>>>>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>>>>>
>>>>>>>>> This is not correct as we are not generating lxvp and it is 
>>>>>>>>> normal load lxv.
>>>>>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>>>>>> plus constant offset it breaks the correctness.
>>>>>>>>
>>>>>>>> Not using lxvp is a deliberate choice though.
>>>>>>>>
>>>>>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>>>>>> to load both halves of R when only one half is needed.  LRA just
>>>>>>>> loads what it needs into whichever registers happen to be free.
>>>>>>>>
>>>>>>>> If the reload of R instead used lxvp, LRA would be forced to free
>>>>>>>> up another register for the other half of R, even though that value
>>>>>>>> would never be used.
>>>>>>>>
>>>>>>>
>>>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>>>> from zero offset. As in load fusion pass we are replacing
>>>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>>>> is from plus 16 offsets and thats why its breaking the correctness.
>>>>>>>
>>>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>>>> offset and thats why we are breaking the correctness.
>>>>>>>
>>>>>>
>>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>>> from zero offset. As in load fusion pass we are replacing
>>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>>> is from plus 16 offsets instead it should load from zero offset.
>>>>>> Thats why its breaking the correctness.
>>>>>>  
>>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>>> offset and thats why we are breaking the correctness.
>>>>>
>>>>> I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always
>>>>>
>>>>> (a) the first hard register in (reg:OO R), when the whole of R
>>>>>     is stored in hard registers
>>>>> (b) at address offset 0 from the start of (reg:OO R), when R is
>>>>>     spilled to memory
>>>>>
>>>>> Similarly, (subreg:V2DI (reg:OO R) 16) is always
>>>>>
>>>>> (c) the second hard register in (reg:OO R), when the whole of R
>>>>>     is stored in hard registers
>>>>> (d) at address offset 16 from the start of (reg:OO R), when R is
>>>>>     spilled to memory
>>>>>
>>>>
>>>> Yes but we are replacing use of loaded value from plus 16 offset
>>>> with subreg (reg OO ) 0 and similarly we are replacing use of loaded
>>>> value from 0 offset with subreg (reg OO ) 16 as we are swapping
>>>> the use operand.
>>>>
>>>> When it is spilled its vice versa subreg (reg OO ) 16 should be
>>>> loaded from 0 offset and subreg (reg OO) 0 should be loaded
>>>> from 16 offset as we are swapping the use operand.
>>>>
>>>> This is the semantics of lxvp.
>>>
>>> Hmm, OK.  Does that mean that:
>>>
>>>    lxvp A,B
>>>
>>> loads A+1 from B and A from B+16?  (I couldn't find an online
>>> description of the instruction btw -- is there one?)
>>>
>>
>> Yes thats correct. Even I didn't find online document that
>> describes the same.
> 
> Thanks, I think I get it now.
> 

Thanks a lot. Can I know what should we be doing with neg (fma)
correctness failures with load fusion.

>>> If lxvp does behave like that, it shouldn't be described as a normal move:
>>>
>>>   [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
>>> 	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
>>>
>>> since the rules I listed above are target-independent.
>>>
>>
>> Rules for load for lxvp should as normal, but how the use
>> of lxvp defines should be different for lxvp than normal
>> load.
> 
> I don't think that's true though.  Because of the rules above,
> GCC expects that:
> 
>   (set (reg:OO R) (mem:OO ADDR))
> 
> will load ADDR+0 into R and ADDR+16 into R+1.  Anything else will
> cause exactly the kind of mix-up that you're seeing here.
> 
> In the worst case, the load should be described as:
> 
>   (set (reg:OO R)
>        (unspec:OO [(mem:OO ADDR)] UNSPEC_LXVP))
> 
> although there are other alternatives that could be used if OOmode
> weren't defined as an OPAQUE_MODE.
> 

If not described as above breaks the correctness.

>>> Alternatively, if lxvp is described as a normal move,
>>> with the 128-bit pieces swapped compared to GCC's normal order,
>>> TARGET_CAN_CHANGE_MODE_CLASS must prevent subregs that select
>>> one half of the register (or less).  But that would defeat exactly
>>> the optimisation that you're trying to do.
>>>
>>
>> Sorry, I didn't get how TARGET_CAN_CHANGE_MODE_CLASS prevents
>> subreg that selects one half of the registers. 
> 
> It doesn't prevent the subregs at such, but it stops the mode
> change from happening in certain registers.
> 
> For example, if we have (subreg:V2DI (reg:OO R) 0), and if
> TARGET_CAN_CHANGE_MODE_CLASS disallows changes between V2DI
> and OO in all register classes that can hold OO, then the register
> allocator would handle the subreg by spilling register R to memory
> and loading a V2DI piece of it from there.
> 
> This would allow the port to store OOmode in an unusual order.
> But like I say, it would also defeat exactly what you're trying
> to do here.
> 
Sure. Thanks.

> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 14:40                                                 ` Ajit Agarwal
@ 2024-06-11 15:29                                                   ` Richard Sandiford
  2024-06-11 15:55                                                     ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-11 15:29 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> On 11/06/24 7:07 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> Hello Richard:
>>> On 11/06/24 6:12 pm, Richard Sandiford wrote:
>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> Hello Richard:
>>>>>
>>>>> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>> Hello Richard:
>>>>>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>>>>>> Hello Richard:
>>>>>>>>
>>>>>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>>>> After LRA reload:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This is preload version of 2473:
>>>>>>>>>>>>
>>>>>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>>>>>>             (nil))))
>>>>>>>>>>>>
>>>>>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>>>>>
>>>>>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>>>>>> of insn 2472.  T
>>>>>>>>>>
>>>>>>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>>>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>>>>>>
>>>>>>>>>> This is not correct as we are not generating lxvp and it is 
>>>>>>>>>> normal load lxv.
>>>>>>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>>>>>>> plus constant offset it breaks the correctness.
>>>>>>>>>
>>>>>>>>> Not using lxvp is a deliberate choice though.
>>>>>>>>>
>>>>>>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>>>>>>> to load both halves of R when only one half is needed.  LRA just
>>>>>>>>> loads what it needs into whichever registers happen to be free.
>>>>>>>>>
>>>>>>>>> If the reload of R instead used lxvp, LRA would be forced to free
>>>>>>>>> up another register for the other half of R, even though that value
>>>>>>>>> would never be used.
>>>>>>>>>
>>>>>>>>
>>>>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>>>>> from zero offset. As in load fusion pass we are replacing
>>>>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>>>>> is from plus 16 offsets and thats why its breaking the correctness.
>>>>>>>>
>>>>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>>>>> offset and thats why we are breaking the correctness.
>>>>>>>>
>>>>>>>
>>>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>>>> from zero offset. As in load fusion pass we are replacing
>>>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>>>> is from plus 16 offsets instead it should load from zero offset.
>>>>>>> Thats why its breaking the correctness.
>>>>>>>  
>>>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>>>> offset and thats why we are breaking the correctness.
>>>>>>
>>>>>> I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always
>>>>>>
>>>>>> (a) the first hard register in (reg:OO R), when the whole of R
>>>>>>     is stored in hard registers
>>>>>> (b) at address offset 0 from the start of (reg:OO R), when R is
>>>>>>     spilled to memory
>>>>>>
>>>>>> Similarly, (subreg:V2DI (reg:OO R) 16) is always
>>>>>>
>>>>>> (c) the second hard register in (reg:OO R), when the whole of R
>>>>>>     is stored in hard registers
>>>>>> (d) at address offset 16 from the start of (reg:OO R), when R is
>>>>>>     spilled to memory
>>>>>>
>>>>>
>>>>> Yes but we are replacing use of loaded value from plus 16 offset
>>>>> with subreg (reg OO ) 0 and similarly we are replacing use of loaded
>>>>> value from 0 offset with subreg (reg OO ) 16 as we are swapping
>>>>> the use operand.
>>>>>
>>>>> When it is spilled its vice versa subreg (reg OO ) 16 should be
>>>>> loaded from 0 offset and subreg (reg OO) 0 should be loaded
>>>>> from 16 offset as we are swapping the use operand.
>>>>>
>>>>> This is the semantics of lxvp.
>>>>
>>>> Hmm, OK.  Does that mean that:
>>>>
>>>>    lxvp A,B
>>>>
>>>> loads A+1 from B and A from B+16?  (I couldn't find an online
>>>> description of the instruction btw -- is there one?)
>>>>
>>>
>>> Yes thats correct. Even I didn't find online document that
>>> describes the same.
>> 
>> Thanks, I think I get it now.
>> 
>
> Thanks a lot. Can I know what should we be doing with neg (fma)
> correctness failures with load fusion.

I think it would involve:

- describing lxvp and stxvp as unspec patterns, as I mentioned
  in the previous reply

- making plain movoo split loads and stores into individual
  lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
  but internally swap the registers after load and before store.)
  That is, movoo should load the lower-numbered register from the
  lower address and the higher-numbered register from the higher
  address, and likewise for stores.

- make the fusion pass replace the first load result with
  (subreg:V2DI (reg:OO R) 16) and the second load result with
  (subreg:V2DI (reg:OO R) 0), as I think it already does.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 15:29                                                   ` Richard Sandiford
@ 2024-06-11 15:55                                                     ` Ajit Agarwal
  2024-06-11 16:11                                                       ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 15:55 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 11/06/24 8:59 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> On 11/06/24 7:07 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> Hello Richard:
>>>> On 11/06/24 6:12 pm, Richard Sandiford wrote:
>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> Hello Richard:
>>>>>>
>>>>>> On 11/06/24 5:15 pm, Richard Sandiford wrote:
>>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>> Hello Richard:
>>>>>>>> On 11/06/24 4:56 pm, Ajit Agarwal wrote:
>>>>>>>>> Hello Richard:
>>>>>>>>>
>>>>>>>>> On 11/06/24 4:36 pm, Richard Sandiford wrote:
>>>>>>>>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>>>>>>>>>>> After LRA reload:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (insn 9299 2472 2412 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>>>>>         (mem:V2DF (plus:DI (reg:DI 8 8 [orig:1285 ivtmp.886 ] [1285])
>>>>>>>>>>>>>>>                 (const_int 16 [0x10])) [1 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4188]+16 S16 A64])) "shell_lam.fppized.f":238:72 1190 {vsx_movv2df_64bit}
>>>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>>>> (insn 2412 9299 2477 187 (set (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240])
>>>>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 39 7 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050]+16 ])
>>>>>>>>>>>>>>>                 (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 51 19 [orig:240 vect__302.545 ] [240]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (insn 2473 9311 9312 187 (set (reg:V2DF 38 6 [orig:905 vect__302.545 ] [905])
>>>>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 44 12 [3119])
>>>>>>>>>>>>>>>                 (reg:V2DF 38 6 [orig:2561 MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ] [2561])
>>>>>>>>>>>>>>>                 (neg:V2DF (reg:V2DF 47 15 [5266]))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>>>>      (nil))
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the above allocated code it assign registers 51 and 47 and they are not sequential.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The reload for 2412 looks valid.  What was the original pre-reload
>>>>>>>>>>>>>> version of insn 2473?  Also, what happened to insn 2472?  Was it deleted?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is preload version of 2473:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (insn 2473 2396 2478 161 (set (reg:V2DF 905 [ vect__302.545 ])
>>>>>>>>>>>>>         (neg:V2DF (fma:V2DF (reg:V2DF 4283 [3119])
>>>>>>>>>>>>>                 (subreg:V2DF (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ]) 0)
>>>>>>>>>>>>>                 (neg:V2DF (subreg:V2DF (reg:OO 2572 [ vect__300.543_236 ]) 0))))) {*vsx_nfmsv2df4}
>>>>>>>>>>>>>      (expr_list:REG_DEAD (reg:OO 2572 [ vect__300.543_236 ])
>>>>>>>>>>>>>         (expr_list:REG_DEAD (reg:OO 2561 [ MEM <vector(2) real(kind=8)> [(real(kind=8) *)_4050] ])
>>>>>>>>>>>>>             (nil))))
>>>>>>>>>>>>>
>>>>>>>>>>>>> insn 2472 is replaced with 9299 after reload.
>>>>>>>>>>>>
>>>>>>>>>>>> You'd have to check the dumps to be sure, but I think 9299 is instead
>>>>>>>>>>>> generated as an input reload of 2412, rather than being a replacement
>>>>>>>>>>>> of insn 2472.  T
>>>>>>>>>>>
>>>>>>>>>>> Yes it is generated for 2412. The predecessor of 2412 is load from
>>>>>>>>>>> plus offset as in 2412 we have subreg:V2DF (reg OO 2572) 16).
>>>>>>>>>>>
>>>>>>>>>>> This is not correct as we are not generating lxvp and it is 
>>>>>>>>>>> normal load lxv.
>>>>>>>>>>> As normal load is generated in predecessor insn of 2412 with
>>>>>>>>>>> plus constant offset it breaks the correctness.
>>>>>>>>>>
>>>>>>>>>> Not using lxvp is a deliberate choice though.
>>>>>>>>>>
>>>>>>>>>> If a (reg:OO R) has been spilled, there's no requirement for LRA
>>>>>>>>>> to load both halves of R when only one half is needed.  LRA just
>>>>>>>>>> loads what it needs into whichever registers happen to be free.
>>>>>>>>>>
>>>>>>>>>> If the reload of R instead used lxvp, LRA would be forced to free
>>>>>>>>>> up another register for the other half of R, even though that value
>>>>>>>>>> would never be used.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>>>>>> from zero offset. As in load fusion pass we are replacing
>>>>>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>>>>>> is from plus 16 offsets and thats why its breaking the correctness.
>>>>>>>>>
>>>>>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>>>>>> offset and thats why we are breaking the correctness.
>>>>>>>>>
>>>>>>>>
>>>>>>>> If a (reg:OO R ) 16 is loaded when it is spilled then loaded value
>>>>>>>> will be from plus offset 16 instead it should be loaded value 
>>>>>>>> from zero offset. As in load fusion pass we are replacing
>>>>>>>> (reg:V2DI R) with subreg (reg:OO R) 16 and hence loaded value
>>>>>>>> is from plus 16 offsets instead it should load from zero offset.
>>>>>>>> Thats why its breaking the correctness.
>>>>>>>>  
>>>>>>>> Similarly we are replacing (reg:V2DI R) 16 with subreg (reg:OO R) 0
>>>>>>>> and loaded value is from 16 offset instead its loading from zero
>>>>>>>> offset and thats why we are breaking the correctness.
>>>>>>>
>>>>>>> I don't understand, sorry.  (subreg:V2DI (reg:OO R) 0) is always
>>>>>>>
>>>>>>> (a) the first hard register in (reg:OO R), when the whole of R
>>>>>>>     is stored in hard registers
>>>>>>> (b) at address offset 0 from the start of (reg:OO R), when R is
>>>>>>>     spilled to memory
>>>>>>>
>>>>>>> Similarly, (subreg:V2DI (reg:OO R) 16) is always
>>>>>>>
>>>>>>> (c) the second hard register in (reg:OO R), when the whole of R
>>>>>>>     is stored in hard registers
>>>>>>> (d) at address offset 16 from the start of (reg:OO R), when R is
>>>>>>>     spilled to memory
>>>>>>>
>>>>>>
>>>>>> Yes but we are replacing use of loaded value from plus 16 offset
>>>>>> with subreg (reg OO ) 0 and similarly we are replacing use of loaded
>>>>>> value from 0 offset with subreg (reg OO ) 16 as we are swapping
>>>>>> the use operand.
>>>>>>
>>>>>> When it is spilled its vice versa subreg (reg OO ) 16 should be
>>>>>> loaded from 0 offset and subreg (reg OO) 0 should be loaded
>>>>>> from 16 offset as we are swapping the use operand.
>>>>>>
>>>>>> This is the semantics of lxvp.
>>>>>
>>>>> Hmm, OK.  Does that mean that:
>>>>>
>>>>>    lxvp A,B
>>>>>
>>>>> loads A+1 from B and A from B+16?  (I couldn't find an online
>>>>> description of the instruction btw -- is there one?)
>>>>>
>>>>
>>>> Yes thats correct. Even I didn't find online document that
>>>> describes the same.
>>>
>>> Thanks, I think I get it now.
>>>
>>
>> Thanks a lot. Can I know what should we be doing with neg (fma)
>> correctness failures with load fusion.
> 
> I think it would involve:
> 
> - describing lxvp and stxvp as unspec patterns, as I mentioned
>   in the previous reply
> 
> - making plain movoo split loads and stores into individual
>   lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
>   but internally swap the registers after load and before store.)
>   That is, movoo should load the lower-numbered register from the
>   lower address and the higher-numbered register from the higher
>   address, and likewise for stores.
> 

Would you mind elaborating the above.

> - make the fusion pass replace the first load result with
>   (subreg:V2DI (reg:OO R) 16) and the second load result with
>   (subreg:V2DI (reg:OO R) 0), as I think it already does.
> 
> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 15:55                                                     ` Ajit Agarwal
@ 2024-06-11 16:11                                                       ` Richard Sandiford
  2024-06-11 19:05                                                         ` Ajit Agarwal
  0 siblings, 1 reply; 35+ messages in thread
From: Richard Sandiford @ 2024-06-11 16:11 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>> Thanks a lot. Can I know what should we be doing with neg (fma)
>>> correctness failures with load fusion.
>> 
>> I think it would involve:
>> 
>> - describing lxvp and stxvp as unspec patterns, as I mentioned
>>   in the previous reply
>> 
>> - making plain movoo split loads and stores into individual
>>   lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
>>   but internally swap the registers after load and before store.)
>>   That is, movoo should load the lower-numbered register from the
>>   lower address and the higher-numbered register from the higher
>>   address, and likewise for stores.
>> 
>
> Would you mind elaborating the above.

I think movoo should use rs6000_split_multireg_move for all alternatives,
like movxo does.  movoo should split into 2 V1TI loads/stores and movxo
should split into 4 V1TI loads/stores.  lxvp and stxvp would be
independent patterns of the form:

  (set ...
       (unspec [...] UNSPEC_FOO))

---

rs6000_split_multireg_move has:

  /* The __vector_pair and __vector_quad modes are multi-register
     modes, so if we have to load or store the registers, we have to be
     careful to properly swap them if we're in little endian mode
     below.  This means the last register gets the first memory
     location.  We also need to be careful of using the right register
     numbers if we are splitting XO to OO.  */

But I don't see how this can work reliably if we allow the kind of
subregs that you want to create here.  The register order is the opposite
from the one that GCC expects.

This is more a question for the PowerPC maintainers though.

And this is one of the (admittedly many) times when I wish GCC's
subreg model was more like LLVM's. :)

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 16:11                                                       ` Richard Sandiford
@ 2024-06-11 19:05                                                         ` Ajit Agarwal
  2024-06-11 21:32                                                           ` Richard Sandiford
  0 siblings, 1 reply; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-11 19:05 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 11/06/24 9:41 pm, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>> Thanks a lot. Can I know what should we be doing with neg (fma)
>>>> correctness failures with load fusion.
>>>
>>> I think it would involve:
>>>
>>> - describing lxvp and stxvp as unspec patterns, as I mentioned
>>>   in the previous reply
>>>
>>> - making plain movoo split loads and stores into individual
>>>   lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
>>>   but internally swap the registers after load and before store.)
>>>   That is, movoo should load the lower-numbered register from the
>>>   lower address and the higher-numbered register from the higher
>>>   address, and likewise for stores.
>>>
>>
>> Would you mind elaborating the above.
> 
> I think movoo should use rs6000_split_multireg_move for all alternatives,
> like movxo does.  movoo should split into 2 V1TI loads/stores and movxo
> should split into 4 V1TI loads/stores.  lxvp and stxvp would be
> independent patterns of the form:
> 
>   (set ...
>        (unspec [...] UNSPEC_FOO))
> 
> ---
> 

In load fusion pass I generate the above pattern for adjacent merge
pairs.

> rs6000_split_multireg_move has:
> 
>   /* The __vector_pair and __vector_quad modes are multi-register
>      modes, so if we have to load or store the registers, we have to be
>      careful to properly swap them if we're in little endian mode
>      below.  This means the last register gets the first memory
>      location.  We also need to be careful of using the right register
>      numbers if we are splitting XO to OO.  */
> 
> But I don't see how this can work reliably if we allow the kind of
> subregs that you want to create here.  The register order is the opposite
> from the one that GCC expects.
> 
> This is more a question for the PowerPC maintainers though.
>

Above unspec pattern generated and modified the movoo pattern to accept
the above spec it goes through the rs6000_split_multireg_move
it splits into 2 VITI loads and generate consecutive loads with sequential
registers. In load_fusion pass I generate the subreg along with load results 
subreg (reg OO R) 16 and subreg (reg OO R) 0.

But it doesnt generate lxvp instruction. If above unspec instruction
pattern and write separate pattern in md file to generate lxvp instead of
normal movoo, then it won't go through rs6000_split_multireg_move

> And this is one of the (admittedly many) times when I wish GCC's
> subreg model was more like LLVM's. :)
> 
> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 19:05                                                         ` Ajit Agarwal
@ 2024-06-11 21:32                                                           ` Richard Sandiford
  2024-06-12 12:15                                                             ` Ajit Agarwal
  2024-06-12 12:53                                                             ` Ajit Agarwal
  0 siblings, 2 replies; 35+ messages in thread
From: Richard Sandiford @ 2024-06-11 21:32 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches

Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
> Hello Richard:
>
> On 11/06/24 9:41 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>> Thanks a lot. Can I know what should we be doing with neg (fma)
>>>>> correctness failures with load fusion.
>>>>
>>>> I think it would involve:
>>>>
>>>> - describing lxvp and stxvp as unspec patterns, as I mentioned
>>>>   in the previous reply
>>>>
>>>> - making plain movoo split loads and stores into individual
>>>>   lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
>>>>   but internally swap the registers after load and before store.)
>>>>   That is, movoo should load the lower-numbered register from the
>>>>   lower address and the higher-numbered register from the higher
>>>>   address, and likewise for stores.
>>>>
>>>
>>> Would you mind elaborating the above.
>> 
>> I think movoo should use rs6000_split_multireg_move for all alternatives,
>> like movxo does.  movoo should split into 2 V1TI loads/stores and movxo
>> should split into 4 V1TI loads/stores.  lxvp and stxvp would be
>> independent patterns of the form:
>> 
>>   (set ...
>>        (unspec [...] UNSPEC_FOO))
>> 
>> ---
>> 
>
> In load fusion pass I generate the above pattern for adjacent merge
> pairs.
>
>> rs6000_split_multireg_move has:
>> 
>>   /* The __vector_pair and __vector_quad modes are multi-register
>>      modes, so if we have to load or store the registers, we have to be
>>      careful to properly swap them if we're in little endian mode
>>      below.  This means the last register gets the first memory
>>      location.  We also need to be careful of using the right register
>>      numbers if we are splitting XO to OO.  */
>> 
>> But I don't see how this can work reliably if we allow the kind of
>> subregs that you want to create here.  The register order is the opposite
>> from the one that GCC expects.
>> 
>> This is more a question for the PowerPC maintainers though.
>>
>
> Above unspec pattern generated and modified the movoo pattern to accept
> the above spec it goes through the rs6000_split_multireg_move
> it splits into 2 VITI loads and generate consecutive loads with sequential
> registers. In load_fusion pass I generate the subreg along with load results 
> subreg (reg OO R) 16 and subreg (reg OO R) 0.
>
> But it doesnt generate lxvp instruction. If above unspec instruction
> pattern and write separate pattern in md file to generate lxvp instead of
> normal movoo, then it won't go through rs6000_split_multireg_move

I don't understand the last bit, sorry.  Under the scheme I described,
lxvp should be generated only through an unspec (and no other way).
Same for stxvp.  The fusion pass should generate those unspecs.

If the fusion pass has generated the code correctly, the lxvp unspec
will remain throughout compilation, unless all uses of it are later
deleted as dead.

The movoo rtl pattern should continue to be:

  [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]

But movoo should generate individual loads, stores and moves.  By design,
it should never generate lxvp or stxvp.

This means that, if a fused load is spilled, the sequence will be
something like:

  lxvp ...   // original fused load (unspec)
  ...
  stxv ...   // store one half to the stack (split from movoo)
  stxv ...   // store the other half to the stack (split from movoo)

Then insns that use the pair will load whichever half they need
from the stack.

I realise that isn't great, but it should at least be correct.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 21:32                                                           ` Richard Sandiford
@ 2024-06-12 12:15                                                             ` Ajit Agarwal
  2024-06-12 12:53                                                             ` Ajit Agarwal
  1 sibling, 0 replies; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-12 12:15 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 12/06/24 3:02 am, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>>
>> On 11/06/24 9:41 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> Thanks a lot. Can I know what should we be doing with neg (fma)
>>>>>> correctness failures with load fusion.
>>>>>
>>>>> I think it would involve:
>>>>>
>>>>> - describing lxvp and stxvp as unspec patterns, as I mentioned
>>>>>   in the previous reply
>>>>>
>>>>> - making plain movoo split loads and stores into individual
>>>>>   lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
>>>>>   but internally swap the registers after load and before store.)
>>>>>   That is, movoo should load the lower-numbered register from the
>>>>>   lower address and the higher-numbered register from the higher
>>>>>   address, and likewise for stores.
>>>>>
>>>>
>>>> Would you mind elaborating the above.
>>>
>>> I think movoo should use rs6000_split_multireg_move for all alternatives,
>>> like movxo does.  movoo should split into 2 V1TI loads/stores and movxo
>>> should split into 4 V1TI loads/stores.  lxvp and stxvp would be
>>> independent patterns of the form:
>>>
>>>   (set ...
>>>        (unspec [...] UNSPEC_FOO))
>>>
>>> ---
>>>
>>
>> In load fusion pass I generate the above pattern for adjacent merge
>> pairs.
>>
>>> rs6000_split_multireg_move has:
>>>
>>>   /* The __vector_pair and __vector_quad modes are multi-register
>>>      modes, so if we have to load or store the registers, we have to be
>>>      careful to properly swap them if we're in little endian mode
>>>      below.  This means the last register gets the first memory
>>>      location.  We also need to be careful of using the right register
>>>      numbers if we are splitting XO to OO.  */
>>>
>>> But I don't see how this can work reliably if we allow the kind of
>>> subregs that you want to create here.  The register order is the opposite
>>> from the one that GCC expects.
>>>
>>> This is more a question for the PowerPC maintainers though.
>>>
>>
>> Above unspec pattern generated and modified the movoo pattern to accept
>> the above spec it goes through the rs6000_split_multireg_move
>> it splits into 2 VITI loads and generate consecutive loads with sequential
>> registers. In load_fusion pass I generate the subreg along with load results 
>> subreg (reg OO R) 16 and subreg (reg OO R) 0.
>>
>> But it doesnt generate lxvp instruction. If above unspec instruction
>> pattern and write separate pattern in md file to generate lxvp instead of
>> normal movoo, then it won't go through rs6000_split_multireg_move
> 
> I don't understand the last bit, sorry.  Under the scheme I described,
> lxvp should be generated only through an unspec (and no other way).
> Same for stxvp.  The fusion pass should generate those unspecs.
> 
> If the fusion pass has generated the code correctly, the lxvp unspec
> will remain throughout compilation, unless all uses of it are later
> deleted as dead.
> 
> The movoo rtl pattern should continue to be:
> 
>   [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
> 	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
> 
> But movoo should generate individual loads, stores and moves.  By design,
> it should never generate lxvp or stxvp.
> 
> This means that, if a fused load is spilled, the sequence will be
> something like:
> 
>   lxvp ...   // original fused load (unspec)
>   ...
>   stxv ...   // store one half to the stack (split from movoo)
>   stxv ...   // store the other half to the stack (split from movoo)
> 
> Then insns that use the pair will load whichever half they need
> from the stack.
> 
> I realise that isn't great, but it should at least be correct.
> 

Thanks a lot. It worked.

> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion
  2024-06-11 21:32                                                           ` Richard Sandiford
  2024-06-12 12:15                                                             ` Ajit Agarwal
@ 2024-06-12 12:53                                                             ` Ajit Agarwal
  1 sibling, 0 replies; 35+ messages in thread
From: Ajit Agarwal @ 2024-06-12 12:53 UTC (permalink / raw)
  To: Alex Coplan, Kewen.Lin, Segher Boessenkool, Michael Meissner,
	Peter Bergner, David Edelsohn, gcc-patches, richard.sandiford

Hello Richard:

On 12/06/24 3:02 am, Richard Sandiford wrote:
> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>> Hello Richard:
>>
>> On 11/06/24 9:41 pm, Richard Sandiford wrote:
>>> Ajit Agarwal <aagarwa1@linux.ibm.com> writes:
>>>>>> Thanks a lot. Can I know what should we be doing with neg (fma)
>>>>>> correctness failures with load fusion.
>>>>>
>>>>> I think it would involve:
>>>>>
>>>>> - describing lxvp and stxvp as unspec patterns, as I mentioned
>>>>>   in the previous reply
>>>>>
>>>>> - making plain movoo split loads and stores into individual
>>>>>   lxv and stxvs.  (Or, alternative, it could use lxvp and stxvp,
>>>>>   but internally swap the registers after load and before store.)
>>>>>   That is, movoo should load the lower-numbered register from the
>>>>>   lower address and the higher-numbered register from the higher
>>>>>   address, and likewise for stores.
>>>>>
>>>>
>>>> Would you mind elaborating the above.
>>>
>>> I think movoo should use rs6000_split_multireg_move for all alternatives,
>>> like movxo does.  movoo should split into 2 V1TI loads/stores and movxo
>>> should split into 4 V1TI loads/stores.  lxvp and stxvp would be
>>> independent patterns of the form:
>>>
>>>   (set ...
>>>        (unspec [...] UNSPEC_FOO))
>>>
>>> ---
>>>
>>
>> In load fusion pass I generate the above pattern for adjacent merge
>> pairs.
>>
>>> rs6000_split_multireg_move has:
>>>
>>>   /* The __vector_pair and __vector_quad modes are multi-register
>>>      modes, so if we have to load or store the registers, we have to be
>>>      careful to properly swap them if we're in little endian mode
>>>      below.  This means the last register gets the first memory
>>>      location.  We also need to be careful of using the right register
>>>      numbers if we are splitting XO to OO.  */
>>>
>>> But I don't see how this can work reliably if we allow the kind of
>>> subregs that you want to create here.  The register order is the opposite
>>> from the one that GCC expects.
>>>
>>> This is more a question for the PowerPC maintainers though.
>>>
>>
>> Above unspec pattern generated and modified the movoo pattern to accept
>> the above spec it goes through the rs6000_split_multireg_move
>> it splits into 2 VITI loads and generate consecutive loads with sequential
>> registers. In load_fusion pass I generate the subreg along with load results 
>> subreg (reg OO R) 16 and subreg (reg OO R) 0.
>>
>> But it doesnt generate lxvp instruction. If above unspec instruction
>> pattern and write separate pattern in md file to generate lxvp instead of
>> normal movoo, then it won't go through rs6000_split_multireg_move
> 
> I don't understand the last bit, sorry.  Under the scheme I described,
> lxvp should be generated only through an unspec (and no other way).
> Same for stxvp.  The fusion pass should generate those unspecs.
> 
> If the fusion pass has generated the code correctly, the lxvp unspec
> will remain throughout compilation, unless all uses of it are later
> deleted as dead.
> 
> The movoo rtl pattern should continue to be:
> 
>   [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
> 	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
> 
> But movoo should generate individual loads, stores and moves.  By design,
> it should never generate lxvp or stxvp.
> 
> This means that, if a fused load is spilled, the sequence will be
> something like:
> 
>   lxvp ...   // original fused load (unspec)
>   ...
>   stxv ...   // store one half to the stack (split from movoo)
>   stxv ...   // store the other half to the stack (split from movoo)
> 
> Then insns that use the pair will load whichever half they need
> from the stack.
> 
> I realise that isn't great, but it should at least be correct.
> 

Thanks a lot. It worked.

> Thanks,
> Richard

Thanks & Regards
Ajit

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2024-06-12 13:38 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-06  5:55 [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion Ajit Agarwal
2024-06-06  8:58 ` Richard Sandiford
2024-06-06 13:53   ` Ajit Agarwal
2024-06-06 14:33     ` Richard Sandiford
2024-06-06 21:17       ` Ajit Agarwal
2024-06-06 22:54         ` Richard Sandiford
2024-06-07  7:35           ` Ajit Agarwal
2024-06-07  8:22             ` Richard Sandiford
2024-06-07 10:41               ` Ajit Agarwal
2024-06-10  8:42                 ` Richard Sandiford
2024-06-10  9:15                   ` Ajit Agarwal
2024-06-10  9:22                     ` Richard Sandiford
2024-06-10  9:43                       ` Ajit Agarwal
2024-06-10  9:50                         ` Richard Sandiford
2024-06-10  9:57                           ` Ajit Agarwal
2024-06-10 10:28                             ` Richard Sandiford
2024-06-11 10:39                               ` Ajit Agarwal
2024-06-11 11:06                                 ` Richard Sandiford
2024-06-11 11:26                                   ` Ajit Agarwal
2024-06-11 11:30                                     ` Ajit Agarwal
2024-06-11 11:45                                       ` Richard Sandiford
2024-06-11 11:59                                         ` Ajit Agarwal
2024-06-11 12:42                                           ` Richard Sandiford
2024-06-11 12:57                                             ` Ajit Agarwal
2024-06-11 13:37                                               ` Richard Sandiford
2024-06-11 14:40                                                 ` Ajit Agarwal
2024-06-11 15:29                                                   ` Richard Sandiford
2024-06-11 15:55                                                     ` Ajit Agarwal
2024-06-11 16:11                                                       ` Richard Sandiford
2024-06-11 19:05                                                         ` Ajit Agarwal
2024-06-11 21:32                                                           ` Richard Sandiford
2024-06-12 12:15                                                             ` Ajit Agarwal
2024-06-12 12:53                                                             ` Ajit Agarwal
2024-06-11 11:43                                     ` Ajit Agarwal
2024-06-10 10:13                           ` Ajit Agarwal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).