public inbox for gcc-patches@gcc.gnu.org
* [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
@ 2023-10-07 19:04 Ajit Agarwal
  2023-10-15 12:13 ` [PING ^0][PATCH " Ajit Agarwal
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Ajit Agarwal @ 2023-10-07 19:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: Segher Boessenkool, Peter Bergner, Kewen.Lin

Hello All:

This patch adds a new pass to replace vector loads lxv from contiguous addresses with the MMA
instruction lxvp.  This patch addresses one regression failure on the ARM architecture.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit


rs6000: Add new pass for replacement of contiguous lxv with lxvp.

New pass to replace lxv loads from contiguous addresses with lxvp.  The pass
is registered after the ree RTL pass.
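
For illustration, here is a minimal sketch of the pattern the pass targets
(it mirrors the new testcase; the function name and option spelling are only
for this example and are not part of the patch):

#include <altivec.h>

/* Illustrative only: with -mcpu=power10 -O2 -mmma the loads of ptr[0]
   and ptr[1] below are emitted as two lxv instructions that use the
   same base register at offsets 0 and 16, and the new pass rewrites
   that pair into a single lxvp.  */
void
example (__vector_quad *dst, vector unsigned char *ptr,
         vector unsigned char src)
{
  __vector_quad acc;
  __builtin_mma_xvf32ger (&acc, src, ptr[0]);
  __builtin_mma_xvf32gerpp (&acc, src, ptr[1]);
  *dst = acc;
}

When the replacement triggers, it should be visible in the pass's RTL dump
(-fdump-rtl-vecload) through the "Replacing lxv ... with lxvp ..." note the
pass prints.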

2023-10-07  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>

gcc/ChangeLog:

	* config/rs6000/rs6000-passes.def: Registered vecload pass.
	* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
	* config.gcc: Add new object file.
	* config/rs6000/rs6000-protos.h: Add new prototype for vecload
	pass.
	* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
	* config/rs6000/t-rs6000: Add new rule.

gcc/testsuite/ChangeLog:

	* g++.target/powerpc/vecload.C: New test.
---
 gcc/config.gcc                             |   4 +-
 gcc/config/rs6000/rs6000-passes.def        |   1 +
 gcc/config/rs6000/rs6000-protos.h          |   2 +
 gcc/config/rs6000/rs6000-vecload-opt.cc    | 234 +++++++++++++++++++++
 gcc/config/rs6000/rs6000.cc                |   3 +-
 gcc/config/rs6000/t-rs6000                 |   4 +
 gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
 7 files changed, 260 insertions(+), 3 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C

diff --git a/gcc/config.gcc b/gcc/config.gcc
index ee46d96bf62..482ab094b89 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -515,7 +515,7 @@ or1k*-*-*)
 	;;
 powerpc*-*-*)
 	cpu_type=rs6000
-	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
 	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -552,7 +552,7 @@ riscv*)
 	;;
 rs6000*-*-*)
 	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
-	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..9ecf8ce6a9c 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
      The power8 does not have instructions that automaticaly do the byte swaps
      for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
 
   /* Pass to do the PCREL_OPT optimization that combines the load of an
      external symbol's address along with a single load or store using that
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index f70118ea40f..9c44bae33d3 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
 extern bool gpr_or_gpr_p (rtx, rtx);
 extern bool direct_move_p (rtx, rtx);
 extern bool quad_address_p (rtx, machine_mode, bool);
+extern bool mode_supports_dq_form (machine_mode);
 extern bool quad_load_store_p (rtx, rtx);
 extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
 extern void expand_fusion_gpr_load (rtx *);
@@ -344,6 +345,7 @@ class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
+extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
new file mode 100644
index 00000000000..63ee733af89
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
@@ -0,0 +1,234 @@
+/* Subroutines used to replace lxv with lxvp
+   for p10 little-endian VSX code.
+   Copyright (C) 2020-2023 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "tree.h"
+#include "memmodel.h"
+#include "df.h"
+#include "tm_p.h"
+#include "ira.h"
+#include "print-tree.h"
+#include "varasm.h"
+#include "explow.h"
+#include "expr.h"
+#include "output.h"
+#include "tree-pass.h"
+#include "regs.h"
+#include "rtx-vector-builder.h"
+#include "rs6000-protos.h"
+
+static inline bool
+quad_address_offset_p (HOST_WIDE_INT offset)
+{
+  return (IN_RANGE (offset, -32768, 32767) && ((offset) & 0xf) == 0);
+}
+
+/* Replace identified lxv with lxvp.  */
+static void
+replace_lxv_with_lxvp (rtx_insn *insn1, rtx_insn *insn2)
+{
+  rtx body = PATTERN (insn1);
+  rtx src_exp = SET_SRC (body);
+  rtx dest_exp = SET_DEST (body);
+  rtx lxv;
+  rtx insn2_body = PATTERN (insn2);
+  rtx insn2_dest_exp = SET_DEST (insn2_body);
+  unsigned int regno = REGNO (dest_exp);
+
+  if (regno > REGNO (insn2_dest_exp))
+    {
+      df_set_regs_ever_live (REGNO (dest_exp), false);
+      df_set_regs_ever_live (REGNO (insn2_dest_exp), true);
+      SET_REGNO (dest_exp, REGNO (insn2_dest_exp));
+      dest_exp->used = 1;
+      df_set_regs_ever_live (REGNO (insn2_dest_exp), false);
+      df_set_regs_ever_live (regno, true);
+      SET_REGNO (insn2_dest_exp, regno);
+      insn2_dest_exp->used = 1;
+    }
+  rtx opnd = gen_rtx_REG (OOmode, REGNO (dest_exp));
+  PUT_MODE (src_exp, OOmode);
+  lxv = gen_movoo (opnd, src_exp);
+  rtx_insn *new_insn = emit_insn_before (lxv,  insn1);
+  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn1));
+  df_insn_rescan (new_insn);
+  
+  if (dump_file)
+    {
+      unsigned int new_uid = INSN_UID (new_insn);
+      fprintf (dump_file, "Replacing lxv %d with lxvp  %d\n",
+			  INSN_UID (insn1), new_uid);
+    }
+  df_insn_delete (insn1);
+  remove_insn (insn1);
+  df_insn_delete (insn2);
+  remove_insn (insn2);
+  insn1->set_deleted ();
+  insn2->set_deleted ();
+}
+
+/* Identify lxv instructions that load from contiguous addresses
+   and replace them with the MMA instruction lxvp.  */
+unsigned int
+rs6000_analyze_vecload (function *fun)
+{
+  basic_block bb;
+  rtx_insn *insn, *curr_insn = 0;
+  rtx_insn *insn1 = 0, *insn2 = 0;
+  bool first_vec_insn = false;
+  unsigned int offset = 0;
+  unsigned int regno = 0;
+
+  FOR_ALL_BB_FN (bb, fun)
+    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
+    {
+      if (NONDEBUG_INSN_P (insn) && GET_CODE (PATTERN (insn)) == SET)
+	{
+	  rtx set = single_set (insn);
+	  rtx src = SET_SRC (set);
+	  machine_mode mode = GET_MODE (SET_DEST (set));
+	  bool dest_fp_p, dest_vmx_p, dest_vsx_p = false;
+	  rtx dest = SET_DEST (PATTERN (insn));
+	  int dest_regno;
+
+	  if (REG_P (dest))
+	    {
+	      dest_regno = REGNO (dest);
+	      dest_fp_p = FP_REGNO_P (dest_regno);
+	      dest_vmx_p = ALTIVEC_REGNO_P (dest_regno);
+	      dest_vsx_p = dest_fp_p | dest_vmx_p;
+	    }
+	   else
+	     {
+	       dest_regno = -1;
+	       dest_fp_p = dest_vmx_p = dest_vsx_p = false;
+	     }
+
+	   if (TARGET_VSX && TARGET_MMA && dest_vsx_p)
+	     {
+	       if (mode_supports_dq_form (mode)
+		   && dest_regno >= 0 && MEM_P (src)
+		   && quad_address_p (XEXP (src, 0), mode, true))
+		  {
+		    if (first_vec_insn)
+		      {
+			rtx addr = XEXP (src, 0);
+			insn2 = insn;
+
+			if (GET_CODE (addr) != PLUS)
+			  return false;
+
+			rtx op0 = XEXP (addr, 0);
+			if (!REG_P (op0) || !INT_REG_OK_FOR_BASE_P (op0, true))
+			  return false;
+
+			rtx op1 = XEXP (addr, 1);
+			if (!CONST_INT_P (op1))
+			  return false;
+
+			mem_attrs attrs (*get_mem_attrs (src));
+			bool reg_attrs_found = false;
+
+			if (REG_P (dest) && REG_ATTRS (dest))
+			  {
+			    poly_int64 off = REG_ATTRS (dest)->offset;
+			    if (known_ge (off, 0))
+			      reg_attrs_found = true;
+			  }
+			if ((attrs.offset_known_p && known_ge (attrs.offset, 0))
+			     && reg_attrs_found
+			     &&  quad_address_offset_p (INTVAL (op1))
+			     && (regno == REGNO (op0))
+			     && ((INTVAL (op1) - offset) == 16))
+			   {
+			     replace_lxv_with_lxvp (insn1, insn2);
+			     return true;
+			   }
+			}
+			if (REG_P (XEXP (src, 0))
+			    && GET_CODE (XEXP (src, 0)) != PLUS)
+			  {
+			    mem_attrs attrs (*get_mem_attrs (src));
+			    if (attrs.offset_known_p)
+			      offset = attrs.offset;
+			    if (offset == 0 && REG_P (dest) && REG_ATTRS (dest))
+			      offset = REG_ATTRS (dest)->offset;
+			    regno = REGNO (XEXP (src,0));
+			    first_vec_insn = true;
+			    insn1 = insn;
+			 }
+		     }
+	       }
+	}
+    }
+  return false;
+}
+
+const pass_data pass_data_analyze_vecload =
+{
+  RTL_PASS, /* type */
+  "vecload", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_analyze_vecload : public rtl_opt_pass
+{
+public:
+  pass_analyze_vecload(gcc::context *ctxt)
+    : rtl_opt_pass(pass_data_analyze_vecload, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (optimize > 0 && TARGET_VSX);
+    }
+
+  virtual unsigned int execute (function *fun)
+    {
+      return rs6000_analyze_vecload (fun);
+    }
+
+  opt_pass *clone ()
+    {
+      return new pass_analyze_vecload (m_ctxt);
+    }
+
+}; // class pass_analyze_vecload
+
+rtl_opt_pass *
+make_pass_analyze_vecload (gcc::context *ctxt)
+{
+  return new pass_analyze_vecload (ctxt);
+}
+
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index cc9253bb040..dba545271e0 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -387,7 +387,7 @@ mode_supports_vmx_dform (machine_mode mode)
 /* Return true if we have D-form addressing in VSX registers.  This addressing
    is more limited than normal d-form addressing in that the offset must be
    aligned on a 16-byte boundary.  */
-static inline bool
+bool
 mode_supports_dq_form (machine_mode mode)
 {
   return ((reg_addr[mode].addr_mask[RELOAD_REG_ANY] & RELOAD_REG_QUAD_OFFSET)
@@ -1178,6 +1178,7 @@ static bool rs6000_secondary_reload_move (enum rs6000_reg_type,
 					  secondary_reload_info *,
 					  bool);
 rtl_opt_pass *make_pass_analyze_swaps (gcc::context*);
+rtl_opt_pass *make_pass_analyze_vecload (gcc::context*);
 
 /* Hash table stuff for keeping track of TOC entries.  */
 
diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
index f183b42ce1d..da7ae26e88b 100644
--- a/gcc/config/rs6000/t-rs6000
+++ b/gcc/config/rs6000/t-rs6000
@@ -47,6 +47,10 @@ rs6000-builtin.o: $(srcdir)/config/rs6000/rs6000-builtin.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
 
+rs6000-vecload-opt.o: $(srcdir)/config/rs6000/rs6000-vecload-opt.cc
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+
 build/rs6000-gen-builtins.o: $(srcdir)/config/rs6000/rs6000-gen-builtins.cc
 build/rbtree.o: $(srcdir)/config/rs6000/rbtree.cc
 
diff --git a/gcc/testsuite/g++.target/powerpc/vecload.C b/gcc/testsuite/g++.target/powerpc/vecload.C
new file mode 100644
index 00000000000..f1689ad6522
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/vecload.C
@@ -0,0 +1,15 @@
+/* { dg-do compile } */ 
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2 -mmma" } */ 
+
+#include <altivec.h>
+
+void
+foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
+{
+  __vector_quad acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
-- 
2.39.3


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PING ^0][PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-10-07 19:04 [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp Ajit Agarwal
@ 2023-10-15 12:13 ` Ajit Agarwal
  2023-10-23  8:32   ` [PING ^1][PATCH " Ajit Agarwal
  2023-11-24  9:31 ` [PATCH " Kewen.Lin
  2023-11-28  7:05 ` Michael Meissner
  2 siblings, 1 reply; 24+ messages in thread
From: Ajit Agarwal @ 2023-10-15 12:13 UTC (permalink / raw)
  To: gcc-patches; +Cc: Segher Boessenkool, Kewen.Lin, Peter Bergner

Hello All:

Please review.

Thanks & Regards
Ajit


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PING ^1][PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-10-15 12:13 ` [PING ^0][PATCH " Ajit Agarwal
@ 2023-10-23  8:32   ` Ajit Agarwal
  2023-11-10  7:04     ` [PING ^2][PATCH " Ajit Agarwal
  0 siblings, 1 reply; 24+ messages in thread
From: Ajit Agarwal @ 2023-10-23  8:32 UTC (permalink / raw)
  To: gcc-patches; +Cc: Kewen.Lin, Segher Boessenkool, Peter Bergner



Ping ^1.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PING ^2][PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-10-23  8:32   ` [PING ^1][PATCH " Ajit Agarwal
@ 2023-11-10  7:04     ` Ajit Agarwal
  0 siblings, 0 replies; 24+ messages in thread
From: Ajit Agarwal @ 2023-11-10  7:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: Kewen.Lin, Segher Boessenkool, Peter Bergner

Ping ^2.



* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-10-07 19:04 [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp Ajit Agarwal
  2023-10-15 12:13 ` [PING ^0][PATCH " Ajit Agarwal
@ 2023-11-24  9:31 ` Kewen.Lin
  2023-11-28  4:34   ` Michael Meissner
  2023-12-01  9:10   ` Ajit Agarwal
  2023-11-28  7:05 ` Michael Meissner
  2 siblings, 2 replies; 24+ messages in thread
From: Kewen.Lin @ 2023-11-24  9:31 UTC (permalink / raw)
  To: GCC Patches
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner, Michael Meissner

Hi Ajit,

Don't forget to CC David (CC-ed) :), some comments are inlined below.

on 2023/10/8 03:04, Ajit Agarwal wrote:
> Hello All:
> 
> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
> lxvp.

IMHO the current binding of lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong; only
Power10 and VSX are required, and these instructions should perform well without
MMA support.  So a patch to separate their support from MMA seems like it should
go first.

> This patch addresses one regressions failure in ARM architecture.

Could you explain this?  I don't see any test case for this.

> Bootstrapped and regtested with powepc64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> 
> rs6000: Add new pass for replacement of contiguous lxv with lxvp.
> 
> New pass to replace contiguous addresses lxv with lxvp. This pass
> is registered after ree rtl pass.> 
> 2023-10-07  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> 
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000-passes.def: Registered vecload pass.
> 	* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
> 	* config.gcc: Add new executable.
> 	* config/rs6000/rs6000-protos.h: Add new prototype for vecload
> 	pass.
> 	* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
> 	* config/rs6000/t-rs6000: Add new rule.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* g++.target/powerpc/vecload.C: New test.
> ---
>  gcc/config.gcc                             |   4 +-
>  gcc/config/rs6000/rs6000-passes.def        |   1 +
>  gcc/config/rs6000/rs6000-protos.h          |   2 +
>  gcc/config/rs6000/rs6000-vecload-opt.cc    | 234 +++++++++++++++++++++
>  gcc/config/rs6000/rs6000.cc                |   3 +-
>  gcc/config/rs6000/t-rs6000                 |   4 +
>  gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
>  7 files changed, 260 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
> 
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index ee46d96bf62..482ab094b89 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -515,7 +515,7 @@ or1k*-*-*)
>  	;;
>  powerpc*-*-*)
>  	cpu_type=rs6000
> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>  	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>  	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
> @@ -552,7 +552,7 @@ riscv*)
>  	;;
>  rs6000*-*-*)
>  	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
> index ca899d5f7af..9ecf8ce6a9c 100644
> --- a/gcc/config/rs6000/rs6000-passes.def
> +++ b/gcc/config/rs6000/rs6000-passes.def
> @@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
>       The power8 does not have instructions that automaticaly do the byte swaps
>       for loads and stores.  */
>    INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
> +  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
>  
>    /* Pass to do the PCREL_OPT optimization that combines the load of an
>       external symbol's address along with a single load or store using that
> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
> index f70118ea40f..9c44bae33d3 100644
> --- a/gcc/config/rs6000/rs6000-protos.h
> +++ b/gcc/config/rs6000/rs6000-protos.h
> @@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
>  extern bool gpr_or_gpr_p (rtx, rtx);
>  extern bool direct_move_p (rtx, rtx);
>  extern bool quad_address_p (rtx, machine_mode, bool);
> +extern bool mode_supports_dq_form (machine_mode);
>  extern bool quad_load_store_p (rtx, rtx);
>  extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
>  extern void expand_fusion_gpr_load (rtx *);
> @@ -344,6 +345,7 @@ class rtl_opt_pass;
>  
>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
> +extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
>  extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
> diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
> new file mode 100644
> index 00000000000..63ee733af89
> --- /dev/null
> +++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
> @@ -0,0 +1,234 @@
> +/* Subroutines used to replace lxv with lxvp
> +   for p10 little-endian VSX code.
> +   Copyright (C) 2020-2023 Free Software Foundation, Inc.
> +   Contributed by Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>.
> +

Some comments here describing how this is implemented would be appreciated;
they help future readers of the code as well as the reviewers.  :)

> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published
> +   by the Free Software Foundation; either version 3, or (at your
> +   option) any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but WITHOUT
> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> +   License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#define IN_TARGET_CODE 1
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "tree.h"
> +#include "memmodel.h"
> +#include "df.h"
> +#include "tm_p.h"
> +#include "ira.h"
> +#include "print-tree.h"
> +#include "varasm.h"
> +#include "explow.h"
> +#include "expr.h"
> +#include "output.h"
> +#include "tree-pass.h"
> +#include "regs.h"
> +#include "rtx-vector-builder.h"
> +#include "rs6000-protos.h"

Nit: Some of the included headers look unnecessary, at least
rtx-vector-builder.h; could you check and keep only the ones that are needed?

> +
> +static inline bool
> +quad_address_offset_p (HOST_WIDE_INT offset)
> +{
> +  return (IN_RANGE (offset, -32768, 32767) && ((offset) & 0xf) == 0);
> +}

This same implementation is in rs6000-internal.h, so just include it?

> +
> +/* Replace identified lxv with lxvp.  */
> +static void
> +replace_lxv_with_lxvp (rtx_insn *insn1, rtx_insn *insn2)
> +{
> +  rtx body = PATTERN (insn1);
> +  rtx src_exp = SET_SRC (body);
> +  rtx dest_exp = SET_DEST (body);
> +  rtx lxv;
> +  rtx insn2_body = PATTERN (insn2);
> +  rtx insn2_dest_exp = SET_DEST (insn2_body);
> +  unsigned int regno = REGNO (dest_exp);

Nit: It seems clearer to use body{1,2}, src{1,2}, dest{1,2} etc.

> +
> +  if (regno > REGNO (insn2_dest_exp))
> +    {
> +      df_set_regs_ever_live (REGNO (dest_exp), false);
> +      df_set_regs_ever_live (REGNO (insn2_dest_exp), true);
> +      SET_REGNO (dest_exp, REGNO (insn2_dest_exp));
> +      dest_exp->used = 1;
> +      df_set_regs_ever_live (REGNO (insn2_dest_exp), false);
> +      df_set_regs_ever_live (regno, true);
> +      SET_REGNO (insn2_dest_exp, regno);
> +      insn2_dest_exp->used = 1;
> +    }

Why do we need this change?  A descriptive comment for the above hunk would be
appreciated.

> +  rtx opnd = gen_rtx_REG (OOmode, REGNO (dest_exp));
> +  PUT_MODE (src_exp, OOmode);
> +  lxv = gen_movoo (opnd, src_exp);
> +  rtx_insn *new_insn = emit_insn_before (lxv,  insn1);
> +  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn1));
> +  df_insn_rescan (new_insn);
> +  
> +  if (dump_file)
> +    {
> +      unsigned int new_uid = INSN_UID (new_insn);
> +      fprintf (dump_file, "Replacing lxv %d with lxvp  %d\n",
> +			  INSN_UID (insn1), new_uid);
> +    }
> +  df_insn_delete (insn1);
> +  remove_insn (insn1);
> +  df_insn_delete (insn2);
> +  remove_insn (insn2);
> +  insn1->set_deleted ();
> +  insn2->set_deleted ();
> +}
> +
> +/* Identify lxv instruction that are candidate of continguous
> +   addresses and replace them with mma instruction lxvp.  */
> +unsigned int
> +rs6000_analyze_vecload (function *fun)
> +{
> +  basic_block bb;
> +  rtx_insn *insn, *curr_insn = 0;
> +  rtx_insn *insn1 = 0, *insn2 = 0;
> +  bool first_vec_insn = false;
> +  unsigned int offset = 0;
> +  unsigned int regno = 0;
> +
> +  FOR_ALL_BB_FN (bb, fun)
> +    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
> +    {
> +      if (NONDEBUG_INSN_P (insn) && GET_CODE (PATTERN (insn)) == SET)
> +	{
> +	  rtx set = single_set (insn);
> +	  rtx src = SET_SRC (set);
> +	  machine_mode mode = GET_MODE (SET_DEST (set));
> +	  bool dest_fp_p, dest_vmx_p, dest_vsx_p = false;
> +	  rtx dest = SET_DEST (PATTERN (insn));
> +	  int dest_regno;
> +
> +	  if (REG_P (dest))
> +	    {
> +	      dest_regno = REGNO (dest);
> +	      dest_fp_p = FP_REGNO_P (dest_regno);
> +	      dest_vmx_p = ALTIVEC_REGNO_P (dest_regno);
> +	      dest_vsx_p = dest_fp_p | dest_vmx_p;

Nit: it looks like dest_vsx_p is enough; there is no need for separate bools
for dest_vmx_p and dest_fp_p.

> +	    }
> +	   else
> +	     {
> +	       dest_regno = -1;
> +	       dest_fp_p = dest_vmx_p = dest_vsx_p = false;

Just "continue;" for !REG_P (dest) as the below code only takes
care of "dest_regno >= 0".

> +	     }
> +
> +	   if (TARGET_VSX && TARGET_MMA && dest_vsx_p)

Nit: Duplicated condition, TARGET_MMA requires TARGET_VSX.

> +	     {
> +	       if (mode_supports_dq_form (mode)
> +		   && dest_regno >= 0 && MEM_P (src)

dest_regno >= 0 becomes useless if we continue early;

> +		   && quad_address_p (XEXP (src, 0), mode, true))
> +		  {
> +		    if (first_vec_insn)
> +		      {
> +			rtx addr = XEXP (src, 0);
> +			insn2 = insn;
> +
> +			if (GET_CODE (addr) != PLUS)
> +			  return false;

This looks unexpected to me: when we encounter one possible candidate as the
first vec insn, it can still be unrelated to the others, and if we return early
here we can miss candidates that come later.

The same applies to the other "return false"s below.
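
To illustrate the idea (a sketch only, reusing the names from the patch),
skipping just the current candidate keeps the scan going:

  if (GET_CODE (addr) != PLUS)
    {
      /* This pair doesn't match; reset the state and look at the
	 remaining insns instead of giving up on the whole function.  */
      first_vec_insn = false;
      continue;
    }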

> +
> +			rtx op0 = XEXP (addr, 0);
> +			if (!REG_P (op0) || !INT_REG_OK_FOR_BASE_P (op0, true))
> +			  return false;
> +
> +			rtx op1 = XEXP (addr, 1);
> +			if (!CONST_INT_P (op1))
> +			  return false;
> +
> +			mem_attrs attrs (*get_mem_attrs (src));
> +			bool reg_attrs_found = false;
> +
> +			if (REG_P (dest) && REG_ATTRS (dest))
> +			  {
> +			    poly_int64 off = REG_ATTRS (dest)->offset;
> +			    if (known_ge (off, 0))
> +			      reg_attrs_found = true;
> +			  }
> +			if ((attrs.offset_known_p && known_ge (attrs.offset, 0))
> +			     && reg_attrs_found
> +			     &&  quad_address_offset_p (INTVAL (op1))
> +			     && (regno == REGNO (op0))
> +			     && ((INTVAL (op1) - offset) == 16))
> +			   {
> +			     replace_lxv_with_lxvp (insn1, insn2);
> +			     return true;

This function's return value is used directly as the return value of
pass_analyze_vecload::execute, but the return value of pass...::execute is for
TODO flags, so "return true" appends a flag to it.  If you want to record
whether this function actually changed something, you can add a "bool changed"
variable in pass_analyze_vecload::execute and keep the return value as 0 there.
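
A minimal sketch of that shape (assuming rs6000_analyze_vecload is changed to
return a plain bool):

  virtual unsigned int execute (function *fun)
    {
      bool changed = rs6000_analyze_vecload (fun);
      if (dump_file && changed)
	fprintf (dump_file, "vecload: replaced adjacent lxv insns with lxvp\n");
      /* No extra TODOs; TODO_df_finish already comes from todo_flags_finish.  */
      return 0;
    }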

> +			   }

Here it seems to me that this handling only takes care of a first load with
indirect addressing and a second load with DQ-form addressing, but we should
be able to handle two DQ-form or two X-form addresses as well if we know they
are adjacent as expected.

Besides, it seems a bad idea to put this pass after reload: once register
allocation has finished, the pairing has to be restricted by the register
numbers (I didn't see any checking on the reg No. relationship for pairing,
btw.)

Looking forward to the comments from Segher/David/Peter/Mike etc.

> +			}
> +			if (REG_P (XEXP (src, 0))

Formatting issue, wrong indentation for this "if" hunk.

> +			    && GET_CODE (XEXP (src, 0)) != PLUS)
> +			  {
> +			    mem_attrs attrs (*get_mem_attrs (src));
> +			    if (attrs.offset_known_p)
> +			      offset = attrs.offset;
> +			    if (offset == 0 && REG_P (dest) && REG_ATTRS (dest))
> +			      offset = REG_ATTRS (dest)->offset;
> +			    regno = REGNO (XEXP (src,0));
> +			    first_vec_insn = true;
> +			    insn1 = insn;
> +			 }
> +		     }
> +	       }
> +	}
> +    }
> +  return false;
> +}
> +
> +const pass_data pass_data_analyze_vecload =
> +{
> +  RTL_PASS, /* type */
> +  "vecload", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_analyze_vecload : public rtl_opt_pass
> +{
> +public:
> +  pass_analyze_vecload(gcc::context *ctxt)
> +    : rtl_opt_pass(pass_data_analyze_vecload, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      return (optimize > 0 && TARGET_VSX);
> +    }
> +

It should be gated with Power10 as well.
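
For example (sketch only):

  virtual bool gate (function *)
    {
      /* lxvp/stxvp are ISA 3.1 instructions, so require Power10 on top
	 of VSX.  */
      return optimize > 0 && TARGET_POWER10 && TARGET_VSX;
    }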

> +  virtual unsigned int execute (function *fun)
> +    {
> +      return rs6000_analyze_vecload (fun);

As with the comment on "return true" in rs6000_analyze_vecload, the return
value of this function is for TODO flags.

"...
The return value contains TODOs to execute in addition to those in
TODO_flags_finish.   */"

> +    }
> +
> +  opt_pass *clone ()
> +    {
> +      return new pass_analyze_vecload (m_ctxt);
> +    }

I guess this was copied from the rs6000 swap pass, but the clone here isn't
actually needed since there is only one instance.

> +
> +}; // class pass_analyze_vecload
> +
> +rtl_opt_pass *
> +make_pass_analyze_vecload (gcc::context *ctxt)
> +{
> +  return new pass_analyze_vecload (ctxt);
> +}
> +
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index cc9253bb040..dba545271e0 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -387,7 +387,7 @@ mode_supports_vmx_dform (machine_mode mode)
>  /* Return true if we have D-form addressing in VSX registers.  This addressing
>     is more limited than normal d-form addressing in that the offset must be
>     aligned on a 16-byte boundary.  */
> -static inline bool
> +bool
>  mode_supports_dq_form (machine_mode mode)
>  {
>    return ((reg_addr[mode].addr_mask[RELOAD_REG_ANY] & RELOAD_REG_QUAD_OFFSET)
> @@ -1178,6 +1178,7 @@ static bool rs6000_secondary_reload_move (enum rs6000_reg_type,
>  					  secondary_reload_info *,
>  					  bool);
>  rtl_opt_pass *make_pass_analyze_swaps (gcc::context*);
> +rtl_opt_pass *make_pass_analyze_vecload (gcc::context*);
>  
>  /* Hash table stuff for keeping track of TOC entries.  */
>  
> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
> index f183b42ce1d..da7ae26e88b 100644
> --- a/gcc/config/rs6000/t-rs6000
> +++ b/gcc/config/rs6000/t-rs6000
> @@ -47,6 +47,10 @@ rs6000-builtin.o: $(srcdir)/config/rs6000/rs6000-builtin.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
>  
> +rs6000-vecload-opt.o: $(srcdir)/config/rs6000/rs6000-vecload-opt.cc
> +	$(COMPILE) $<
> +	$(POSTCOMPILE)
> +
>  build/rs6000-gen-builtins.o: $(srcdir)/config/rs6000/rs6000-gen-builtins.cc
>  build/rbtree.o: $(srcdir)/config/rs6000/rbtree.cc
>  
> diff --git a/gcc/testsuite/g++.target/powerpc/vecload.C b/gcc/testsuite/g++.target/powerpc/vecload.C
> new file mode 100644
> index 00000000000..f1689ad6522
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/powerpc/vecload.C
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */ 
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-mdejagnu-cpu=power10 -O2 -mmma" } */ 
> +
> +#include <altivec.h>
> +
> +void
> +foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
> +{
> +  __vector_quad acc;
> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
> +  *dst = acc;
> +}
> +/* { dg-final { scan-assembler {\mlxvp\M} } } */

Maybe some more test cases like the ones below can be considered to make this
pass better:

void
foo1 (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
{
  __vector_quad acc;
  __builtin_mma_xvf32ger(&acc, src, ptr[1]);
  __builtin_mma_xvf32gerpp(&acc, src, ptr[2]);
  *dst = acc;
}

void
foo2 (__vector_quad *dst1, __vector_quad *dst2, vector unsigned char *ptr, vector unsigned char src)
{
  __vector_quad acc;
  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
  *dst1 = acc;
  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
  *dst2 = acc;
}

BR,
Kewen


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-11-24  9:31 ` [PATCH " Kewen.Lin
@ 2023-11-28  4:34   ` Michael Meissner
  2023-11-28  9:33     ` Kewen.Lin
  2023-12-01  9:10   ` Ajit Agarwal
  1 sibling, 1 reply; 24+ messages in thread
From: Michael Meissner @ 2023-11-28  4:34 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner

On Fri, Nov 24, 2023 at 05:31:20PM +0800, Kewen.Lin wrote:
> Hi Ajit,
> 
> Don't forget to CC David (CC-ed) :), some comments are inlined below.
> 
> on 2023/10/8 03:04, Ajit Agarwal wrote:
> > Hello All:
> > 
> > This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
> > lxvp.
> 
> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
> Power10 and VSX required, these instructions should perform well without MMA support.
> So one patch to separate their support from MMA seems to go first.

I tend to agree with you, but I recall the decision being made because at the
time, vector pairs and vector quads were only used with MMA.  We now have
various attempts to improve things for using vector pairs in non-MMA code.  In
my patches, I kept the MMA requirement, but if we decide to make it ISA 3.1
only, it is fairly straightforward to look at all of the TARGET_MMA tests.

Back in the GCC 13 days, it was useful that -mmma controlled vector pair.
There was an issue where, if we enabled memcpy to use store vector pair, it
would lead to a slowdown.  When I was doing the tests, it was easy to use
-mno-mma to stop memcpy from using load/store vector pair, since GCC doesn't
generate code that uses MMA without the built-ins.


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-10-07 19:04 [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp Ajit Agarwal
  2023-10-15 12:13 ` [PING ^0][PATCH " Ajit Agarwal
  2023-11-24  9:31 ` [PATCH " Kewen.Lin
@ 2023-11-28  7:05 ` Michael Meissner
  2023-11-28  9:44   ` Kewen.Lin
  2 siblings, 1 reply; 24+ messages in thread
From: Michael Meissner @ 2023-11-28  7:05 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: gcc-patches, Segher Boessenkool, Peter Bergner, Kewen.Lin,
	David Edelsohn

I tried using this patch to compare with the vector size attribute patch I
posted.  I could not build it as a cross compiler on my x86_64 because the
assembler gives the following error:

Error: operand out of domain (11 is not a multiple of 2) for
std_stacktrace-elf.o.  If you look at the assembler output, it has combined an
lxv 11 and an lxv 12 into:

        lxvp 11,0(9)

The PowerPC architecture requires that the load vector pair and store vector
pair instructions only load/store even/odd register pairs, not odd/even pairs.
Unfortunately, it means that this optimization will match less often.
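
For illustration, this is the kind of extra constraint the pass would need
before pairing, using the dest1/dest2 naming suggested earlier (a sketch that
glosses over the exact mapping from GCC register numbers to VSR numbers):

  /* lxvp/stxvp only accept an even/odd VSR pair, so require the two
     destinations to form such a pair before fusing the loads.  */
  if ((REGNO (dest1) & 1) != 0
      || REGNO (dest2) != REGNO (dest1) + 1)
    continue;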

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-11-28  4:34   ` Michael Meissner
@ 2023-11-28  9:33     ` Kewen.Lin
  0 siblings, 0 replies; 24+ messages in thread
From: Kewen.Lin @ 2023-11-28  9:33 UTC (permalink / raw)
  To: Michael Meissner
  Cc: GCC Patches, Segher Boessenkool, David Edelsohn, Peter Bergner,
	Ajit Agarwal

Hi Mike,

on 2023/11/28 12:34, Michael Meissner wrote:
> On Fri, Nov 24, 2023 at 05:31:20PM +0800, Kewen.Lin wrote:
>> Hi Ajit,
>>
>> Don't forget to CC David (CC-ed) :), some comments are inlined below.
>>
>> on 2023/10/8 03:04, Ajit Agarwal wrote:
>>> Hello All:
>>>
>>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>>> lxvp.
>>
>> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
>> Power10 and VSX required, these instructions should perform well without MMA support.
>> So one patch to separate their support from MMA seems to go first.
> 
> I tend to agree with you, but I recall the decision being made because at the
> time, vector pairs and vector quads were only used with MMA.  We now have
> various attempts to improve things for using vector pairs for non-MMA code. In

Thanks for the comments!  Yeah, so this seems like a good time to separate it
from MMA support.

> my patches, I keeped the MMA requirement, but if we decide to make it ISA 3.1
> only if is fairly straight forward to look at all of the TARGET_MMA tests.
> 
> Now in the GCC 13 days, it was useful that -mmma controlled vector pair.  There
> was an issue if we enabled memcpy to use store vector pair, it would lead to
> one slow down.  When I was doing the tests, it was easy to use -mno-mma and it
> would stop memcpy from using load/store vector pair since GCC doesn't generate
> code to use MMA without using the built-ins.

OK, maybe an option specific to store vector pair can be added to cover that
disablement need.

BR,
Kewen


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-11-28  7:05 ` Michael Meissner
@ 2023-11-28  9:44   ` Kewen.Lin
  2023-11-28 15:41     ` Michael Meissner
  2023-12-01  9:13     ` Ajit Agarwal
  0 siblings, 2 replies; 24+ messages in thread
From: Kewen.Lin @ 2023-11-28  9:44 UTC (permalink / raw)
  To: Michael Meissner
  Cc: Ajit Agarwal, gcc-patches, David Edelsohn, Segher Boessenkool,
	Peter Bergner

on 2023/11/28 15:05, Michael Meissner wrote:
> I tried using this patch to compare with the vector size attribute patch I
> posted.  I could not build it as a cross compiler on my x86_64 because the
> assembler gives the following error:
> 
> Error: operand out of domain (11 is not a multiple of 2) for
> std_stacktrace-elf.o.  If you look at the assembler, it has combined a lxvp 11
> and lxvp 12 into:
> 
>         lxvp 11,0(9)
> 
> The powerpc architecture requires that registers that are loaded with load
> vector pair and stored with store vector point instructions only load/store
> even/odd register pairs, and not odd/even pairs.  Unfortunately, it will mean
> that this optimization will match less often.
> 

Yes, the current implementation needs some refinements, as per the comments in [1]:

> Besides, it seems a bad idea to put this pass after reload? as register allocation
> finishes, this pairing has to be restricted by the reg No. (I didn't see any
> checking on the reg No. relationship for paring btw.)
> 
> Looking forward to the comments from Segher/David/Peter/Mike etc.

I wonder if we should consider running such a pass before reload instead.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638070.html

BR,
Kewen


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-11-28  9:44   ` Kewen.Lin
@ 2023-11-28 15:41     ` Michael Meissner
  2023-11-29 14:10       ` Ajit Agarwal
  2023-12-01  9:13     ` Ajit Agarwal
  1 sibling, 1 reply; 24+ messages in thread
From: Michael Meissner @ 2023-11-28 15:41 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, Ajit Agarwal, gcc-patches, David Edelsohn,
	Segher Boessenkool, Peter Bergner

On Tue, Nov 28, 2023 at 05:44:43PM +0800, Kewen.Lin wrote:
> on 2023/11/28 15:05, Michael Meissner wrote:
> > I tried using this patch to compare with the vector size attribute patch I
> > posted.  I could not build it as a cross compiler on my x86_64 because the
> > assembler gives the following error:
> > 
> > Error: operand out of domain (11 is not a multiple of 2) for
> > std_stacktrace-elf.o.  If you look at the assembler, it has combined a lxvp 11
> > and lxvp 12 into:
> > 
> >         lxvp 11,0(9)
> > 
> > The powerpc architecture requires that registers that are loaded with load
> > vector pair and stored with store vector point instructions only load/store
> > even/odd register pairs, and not odd/even pairs.  Unfortunately, it will mean
> > that this optimization will match less often.
> > 
> 
> Yes, the current implementation need some refinements, as comments in [1]:
> 
> > Besides, it seems a bad idea to put this pass after reload? as register allocation
> > finishes, this pairing has to be restricted by the reg No. (I didn't see any
> > checking on the reg No. relationship for paring btw.)
> > 
> > Looking forward to the comments from Segher/David/Peter/Mike etc.
> 
> I wonder if we should consider running such pass before reload instead.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638070.html
> 
> BR,
> Kewen

If I add code to check if the target register is even, then the following
fails:

/home/meissner/fsf-src/work148-ajit/libquadmath/math/erfq.c: In function ‘erfcq’:
/home/meissner/fsf-src/work148-ajit/libquadmath/math/erfq.c:943:1: error: insn does not satisfy its constraints:
  943 | }
      | ^
(insn 1087 1939 1088 66 (set (reg/v:KF 74 10 [orig:643 y ] [643])
        (fma:KF (reg/v:KF 64 0 [orig:153 z ] [153])
            (reg/v:KF 65 1 [orig:639 y ] [639])
            (reg:KF 76 12 [orig:642 MEM[(const _Float128 *)p_276 + 16B] ] [642]))) "/home/meissner/fsf-src/work148-ajit/libquadmath/math/erfq.c":112:9 1004 {fmakf4_hw}
     (expr_list:REG_DEAD (reg/v:KF 65 1 [orig:639 y ] [639])
        (nil)))

In particular, the IEEE 128-bit arithmetic functions require Altivec registers.
So we would need to make sure the new insns all meet their constraints.

I tend to think that it would be desirable to do it before reload.  But then we
will need to check if extra moves are generated.  I suspect we will need
Peter's patch to allow 128-bit types that are subregs of OOmode.  I.e., the
code generated would change:

	(set (reg:MODE1 tmp-reg)
	     (mem ...+8))

	(set (reg:MODE2 tmp-reg+1)
	     (mem ...))

to:

	(set (reg:OO vp-reg)
	     (mem ...))

	(set (reg:MODE1 tmp-reg)
	     (subreg:MODE1 (reg:OO vp-reg 0)))

	(set (reg:MODE2 tmp-reg+1)
	     (subreg:MODE2 (reg:OO vp-reg 16)))

Note, I may have the offsets and register numbers backwards in terms of endian.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-11-28 15:41     ` Michael Meissner
@ 2023-11-29 14:10       ` Ajit Agarwal
  0 siblings, 0 replies; 24+ messages in thread
From: Ajit Agarwal @ 2023-11-29 14:10 UTC (permalink / raw)
  To: Michael Meissner, Kewen.Lin, gcc-patches, David Edelsohn,
	Segher Boessenkool, Peter Bergner

Hello All:

I am working on fixing the below issues and incorporating comments from Kewen and
Michael.

Thanks & Regards
Ajit

On 28/11/23 9:11 pm, Michael Meissner wrote:
> On Tue, Nov 28, 2023 at 05:44:43PM +0800, Kewen.Lin wrote:
>> on 2023/11/28 15:05, Michael Meissner wrote:
>>> I tried using this patch to compare with the vector size attribute patch I
>>> posted.  I could not build it as a cross compiler on my x86_64 because the
>>> assembler gives the following error:
>>>
>>> Error: operand out of domain (11 is not a multiple of 2) for
>>> std_stacktrace-elf.o.  If you look at the assembler, it has combined a lxvp 11
>>> and lxvp 12 into:
>>>
>>>         lxvp 11,0(9)
>>>
>>> The powerpc architecture requires that registers that are loaded with load
>>> vector pair and stored with store vector point instructions only load/store
>>> even/odd register pairs, and not odd/even pairs.  Unfortunately, it will mean
>>> that this optimization will match less often.
>>>
>>
>> Yes, the current implementation need some refinements, as comments in [1]:
>>
>>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>>> checking on the reg No. relationship for paring btw.)
>>>
>>> Looking forward to the comments from Segher/David/Peter/Mike etc.
>>
>> I wonder if we should consider running such pass before reload instead.
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638070.html
>>
>> BR,
>> Kewen
> 
> If I add code to check if the target register is even, then the following
> fails:
> 
> /home/meissner/fsf-src/work148-ajit/libquadmath/math/erfq.c: In function ‘erfcq’:
> /home/meissner/fsf-src/work148-ajit/libquadmath/math/erfq.c:943:1: error: insn does not satisfy its constraints:
>   943 | }
>       | ^
> (insn 1087 1939 1088 66 (set (reg/v:KF 74 10 [orig:643 y ] [643])
>         (fma:KF (reg/v:KF 64 0 [orig:153 z ] [153])
>             (reg/v:KF 65 1 [orig:639 y ] [639])
>             (reg:KF 76 12 [orig:642 MEM[(const _Float128 *)p_276 + 16B] ] [642]))) "/home/meissner/fsf-src/work148-ajit/libquadmath/math/erfq.c":112:9 1004 {fmakf4_hw}
>      (expr_list:REG_DEAD (reg/v:KF 65 1 [orig:639 y ] [639])
>         (nil)))
> 
> In particular, the IEEE 128-bit arithmetic functions require Altivec registers.
> So we would need to make sure the new insns all meet their constraints.
> 
> I tend to think that it would be desirable to do it before reload.  But then we
> will need to check if extra moves are generated.  I suspect we will need
> Peter's patch to allow 128-bit types that are subregs of OOmode.  I.e., the
> code generated would change:
> 
> 	(set (reg:MODE1 tmp-reg)
> 	     (mem ...+8))
> 
> 	(set (reg:MODE2 tmp-reg+1)
> 	     (mem ...))
> 
> to:
> 
> 	(set (reg:OO vp-reg)
> 	     (mem ...))
> 
> 	(set (reg:MODE1 tmp-reg)
> 	     (subreg:MODE1 (reg:OO vp-reg 0)))
> 
> 	(set (reg:MODE2 tmp-reg+1)
> 	     (subreg:MODE2 (reg:OO vp-reg 16)))
> 
> Note, I may have the offsets and register numbers backwards in terms of endian.
> 


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-11-24  9:31 ` [PATCH " Kewen.Lin
  2023-11-28  4:34   ` Michael Meissner
@ 2023-12-01  9:10   ` Ajit Agarwal
  2023-12-04  2:01     ` Kewen.Lin
  1 sibling, 1 reply; 24+ messages in thread
From: Ajit Agarwal @ 2023-12-01  9:10 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner, Michael Meissner

Hello Kewen:

On 24/11/23 3:01 pm, Kewen.Lin wrote:
> Hi Ajit,
> 
> Don't forget to CC David (CC-ed) :), some comments are inlined below.
> 
> on 2023/10/8 03:04, Ajit Agarwal wrote:
>> Hello All:
>>
>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>> lxvp.
> 
> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
> Power10 and VSX required, these instructions should perform well without MMA support.
> So one patch to separate their support from MMA seems to go first.
> 

I will make the changes for Power10 and VSX.

>> This patch addresses one regressions failure in ARM architecture.
> 
> Could you explain this?  I don't see any test case for this.

I submitted v1 of the patch and there were regression failures reported by
Linaro.  I have fixed them in version v2.
> 
>> Bootstrapped and regtested with powepc64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>>
>> rs6000: Add new pass for replacement of contiguous lxv with lxvp.
>>
>> New pass to replace contiguous addresses lxv with lxvp. This pass
>> is registered after ree rtl pass.> 
>> 2023-10-07  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000-passes.def: Registered vecload pass.
>> 	* config/rs6000/rs6000-vecload-opt.cc: Add new pass.
>> 	* config.gcc: Add new executable.
>> 	* config/rs6000/rs6000-protos.h: Add new prototype for vecload
>> 	pass.
>> 	* config/rs6000/rs6000.cc: Add new prototype for vecload pass.
>> 	* config/rs6000/t-rs6000: Add new rule.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* g++.target/powerpc/vecload.C: New test.
>> ---
>>  gcc/config.gcc                             |   4 +-
>>  gcc/config/rs6000/rs6000-passes.def        |   1 +
>>  gcc/config/rs6000/rs6000-protos.h          |   2 +
>>  gcc/config/rs6000/rs6000-vecload-opt.cc    | 234 +++++++++++++++++++++
>>  gcc/config/rs6000/rs6000.cc                |   3 +-
>>  gcc/config/rs6000/t-rs6000                 |   4 +
>>  gcc/testsuite/g++.target/powerpc/vecload.C |  15 ++
>>  7 files changed, 260 insertions(+), 3 deletions(-)
>>  create mode 100644 gcc/config/rs6000/rs6000-vecload-opt.cc
>>  create mode 100644 gcc/testsuite/g++.target/powerpc/vecload.C
>>
>> diff --git a/gcc/config.gcc b/gcc/config.gcc
>> index ee46d96bf62..482ab094b89 100644
>> --- a/gcc/config.gcc
>> +++ b/gcc/config.gcc
>> @@ -515,7 +515,7 @@ or1k*-*-*)
>>  	;;
>>  powerpc*-*-*)
>>  	cpu_type=rs6000
>> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>>  	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
>> @@ -552,7 +552,7 @@ riscv*)
>>  	;;
>>  rs6000*-*-*)
>>  	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
>> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-vecload-opt.o"
>>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
>> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
>> index ca899d5f7af..9ecf8ce6a9c 100644
>> --- a/gcc/config/rs6000/rs6000-passes.def
>> +++ b/gcc/config/rs6000/rs6000-passes.def
>> @@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
>>       The power8 does not have instructions that automaticaly do the byte swaps
>>       for loads and stores.  */
>>    INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
>> +  INSERT_PASS_AFTER (pass_ree, 1, pass_analyze_vecload);
>>  
>>    /* Pass to do the PCREL_OPT optimization that combines the load of an
>>       external symbol's address along with a single load or store using that
>> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
>> index f70118ea40f..9c44bae33d3 100644
>> --- a/gcc/config/rs6000/rs6000-protos.h
>> +++ b/gcc/config/rs6000/rs6000-protos.h
>> @@ -91,6 +91,7 @@ extern int mems_ok_for_quad_peep (rtx, rtx);
>>  extern bool gpr_or_gpr_p (rtx, rtx);
>>  extern bool direct_move_p (rtx, rtx);
>>  extern bool quad_address_p (rtx, machine_mode, bool);
>> +extern bool mode_supports_dq_form (machine_mode);
>>  extern bool quad_load_store_p (rtx, rtx);
>>  extern bool fusion_gpr_load_p (rtx, rtx, rtx, rtx);
>>  extern void expand_fusion_gpr_load (rtx *);
>> @@ -344,6 +345,7 @@ class rtl_opt_pass;
>>  
>>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
>>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
>> +extern rtl_opt_pass *make_pass_analyze_vecload (gcc::context *);
>>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
>>  extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
>> diff --git a/gcc/config/rs6000/rs6000-vecload-opt.cc b/gcc/config/rs6000/rs6000-vecload-opt.cc
>> new file mode 100644
>> index 00000000000..63ee733af89
>> --- /dev/null
>> +++ b/gcc/config/rs6000/rs6000-vecload-opt.cc
>> @@ -0,0 +1,234 @@
>> +/* Subroutines used to replace lxv with lxvp
>> +   for p10 little-endian VSX code.
>> +   Copyright (C) 2020-2023 Free Software Foundation, Inc.
>> +   Contributed by Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>.
>> +
> 
> Some comments here describing how is this implemented are appreciated,
> it helps code readers in future and also the reviewers.  :)
> 
>> +   This file is part of GCC.
>> +
>> +   GCC is free software; you can redistribute it and/or modify it
>> +   under the terms of the GNU General Public License as published
>> +   by the Free Software Foundation; either version 3, or (at your
>> +   option) any later version.
>> +
>> +   GCC is distributed in the hope that it will be useful, but WITHOUT
>> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
>> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
>> +   License for more details.
>> +
>> +   You should have received a copy of the GNU General Public License
>> +   along with GCC; see the file COPYING3.  If not see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#define IN_TARGET_CODE 1
>> +
>> +#include "config.h"
>> +#include "system.h"
>> +#include "coretypes.h"
>> +#include "backend.h"
>> +#include "rtl.h"
>> +#include "tree.h"
>> +#include "memmodel.h"
>> +#include "df.h"
>> +#include "tm_p.h"
>> +#include "ira.h"
>> +#include "print-tree.h"
>> +#include "varasm.h"
>> +#include "explow.h"
>> +#include "expr.h"
>> +#include "output.h"
>> +#include "tree-pass.h"
>> +#include "regs.h"
>> +#include "rtx-vector-builder.h"
>> +#include "rs6000-protos.h"
> 
> Nit: Some included header looks useless, at least the rtx-vector-builder.h,
> could you have a check and leave those necessary only.
> 

I will make the change accordingly.
>> +
>> +static inline bool
>> +quad_address_offset_p (HOST_WIDE_INT offset)
>> +{
>> +  return (IN_RANGE (offset, -32768, 32767) && ((offset) & 0xf) == 0);
>> +}
> 
> This same implementation is in rs6000-internal.h, so just include it?
> 
>> +
>> +/* Replace identified lxv with lxvp.  */
>> +static void
>> +replace_lxv_with_lxvp (rtx_insn *insn1, rtx_insn *insn2)
>> +{
>> +  rtx body = PATTERN (insn1);
>> +  rtx src_exp = SET_SRC (body);
>> +  rtx dest_exp = SET_DEST (body);
>> +  rtx lxv;
>> +  rtx insn2_body = PATTERN (insn2);
>> +  rtx insn2_dest_exp = SET_DEST (insn2_body);
>> +  unsigned int regno = REGNO (dest_exp);
> 
> Nit: It seems more clear using body{1,2}, src{1,2}, dest{1,2} etc.
> 

I will make the changes.

>> +
>> +  if (regno > REGNO (insn2_dest_exp))
>> +    {
>> +      df_set_regs_ever_live (REGNO (dest_exp), false);
>> +      df_set_regs_ever_live (REGNO (insn2_dest_exp), true);
>> +      SET_REGNO (dest_exp, REGNO (insn2_dest_exp));
>> +      dest_exp->used = 1;
>> +      df_set_regs_ever_live (REGNO (insn2_dest_exp), false);
>> +      df_set_regs_ever_live (regno, true);
>> +      SET_REGNO (insn2_dest_exp, regno);
>> +      insn2_dest_exp->used = 1;
>> +    }
> 
> Why do we need this change?  I think a descriptive comment is appreciated
> for the above hunk.
> 

I will add a descriptive comment.

>> +  rtx opnd = gen_rtx_REG (OOmode, REGNO (dest_exp));
>> +  PUT_MODE (src_exp, OOmode);
>> +  lxv = gen_movoo (opnd, src_exp);
>> +  rtx_insn *new_insn = emit_insn_before (lxv,  insn1);
>> +  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn1));
>> +  df_insn_rescan (new_insn);
>> +  
>> +  if (dump_file)
>> +    {
>> +      unsigned int new_uid = INSN_UID (new_insn);
>> +      fprintf (dump_file, "Replacing lxv %d with lxvp  %d\n",
>> +			  INSN_UID (insn1), new_uid);
>> +    }
>> +  df_insn_delete (insn1);
>> +  remove_insn (insn1);
>> +  df_insn_delete (insn2);
>> +  remove_insn (insn2);
>> +  insn1->set_deleted ();
>> +  insn2->set_deleted ();
>> +}
>> +
>> +/* Identify lxv instruction that are candidate of continguous
>> +   addresses and replace them with mma instruction lxvp.  */
>> +unsigned int
>> +rs6000_analyze_vecload (function *fun)
>> +{
>> +  basic_block bb;
>> +  rtx_insn *insn, *curr_insn = 0;
>> +  rtx_insn *insn1 = 0, *insn2 = 0;
>> +  bool first_vec_insn = false;
>> +  unsigned int offset = 0;
>> +  unsigned int regno = 0;
>> +
>> +  FOR_ALL_BB_FN (bb, fun)
>> +    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
>> +    {
>> +      if (NONDEBUG_INSN_P (insn) && GET_CODE (PATTERN (insn)) == SET)
>> +	{
>> +	  rtx set = single_set (insn);
>> +	  rtx src = SET_SRC (set);
>> +	  machine_mode mode = GET_MODE (SET_DEST (set));
>> +	  bool dest_fp_p, dest_vmx_p, dest_vsx_p = false;
>> +	  rtx dest = SET_DEST (PATTERN (insn));
>> +	  int dest_regno;
>> +
>> +	  if (REG_P (dest))
>> +	    {
>> +	      dest_regno = REGNO (dest);
>> +	      dest_fp_p = FP_REGNO_P (dest_regno);
>> +	      dest_vmx_p = ALTIVEC_REGNO_P (dest_regno);
>> +	      dest_vsx_p = dest_fp_p | dest_vmx_p;
> 
> Nit: it looks dest_vsx_p is enough, don't need separate bool for
> dest_vmx_p and dest_fp_p.
> 
>> +	    }
>> +	   else
>> +	     {
>> +	       dest_regno = -1;
>> +	       dest_fp_p = dest_vmx_p = dest_vsx_p = false;
> 
> Just "continue;" for !REG_P (dest) as the below code only takes
> care of "dest_regno >= 0".
> 
>> +	     }
>> +
>> +	   if (TARGET_VSX && TARGET_MMA && dest_vsx_p)
> 
> Nit: Duplicated condition, TARGET_MMA requires TARGET_VSX.
> 
>> +	     {
>> +	       if (mode_supports_dq_form (mode)
>> +		   && dest_regno >= 0 && MEM_P (src)
> 
> dest_regno >= 0 becomes useless if we continue early;
> 
>> +		   && quad_address_p (XEXP (src, 0), mode, true))
>> +		  {
>> +		    if (first_vec_insn)
>> +		      {
>> +			rtx addr = XEXP (src, 0);
>> +			insn2 = insn;
>> +
>> +			if (GET_CODE (addr) != PLUS)
>> +			  return false;
> 
> This looks unexpected to me, when we encounter one possible candidate as
> first vec insn, it can be still unrelated to the others, if we directly
> return early, we can miss some candidates behind.
>

I will make the changes accordingly.
 
> This is also applied for other "return false"s below.
> 
>> +
>> +			rtx op0 = XEXP (addr, 0);
>> +			if (!REG_P (op0) || !INT_REG_OK_FOR_BASE_P (op0, true))
>> +			  return false;
>> +
>> +			rtx op1 = XEXP (addr, 1);
>> +			if (!CONST_INT_P (op1))
>> +			  return false;
>> +
>> +			mem_attrs attrs (*get_mem_attrs (src));
>> +			bool reg_attrs_found = false;
>> +
>> +			if (REG_P (dest) && REG_ATTRS (dest))
>> +			  {
>> +			    poly_int64 off = REG_ATTRS (dest)->offset;
>> +			    if (known_ge (off, 0))
>> +			      reg_attrs_found = true;
>> +			  }
>> +			if ((attrs.offset_known_p && known_ge (attrs.offset, 0))
>> +			     && reg_attrs_found
>> +			     &&  quad_address_offset_p (INTVAL (op1))
>> +			     && (regno == REGNO (op0))
>> +			     && ((INTVAL (op1) - offset) == 16))
>> +			   {
>> +			     replace_lxv_with_lxvp (insn1, insn2);
>> +			     return true;
> 
> This function is directly used for return in pass_analyze_vecload::execute,
> the return value of pass...::execute is for TODO flags, "return true" is
> appending flag to it.  If you want to record some information like if this
> function changes something actually, you can add one variable "bool changed"
> in pass_analyze_vecload::execute and keep the return value as 0 there.
> 

I will make the changes accordingly.

>> +			   }
> 
> Here it seems to me that this handlings only takes care of the first indirect
> addressing and the second DQ-form addressing, but we should be able to handle
> two DQ-form or two X-form as well if we know they are adjacent as expected.
>

I will make the change.

 
> Besides, it seems a bad idea to put this pass after reload? as register allocation
> finishes, this pairing has to be restricted by the reg No. (I didn't see any
> checking on the reg No. relationship for paring btw.)
>

Running the vecload pass before the reload pass deletes one of the lxv insns
and replaces it with lxvp.  This then fails in the reload pass while freeing
reg_eqivs: ira populates them, the vecload pass deletes some of the insns, and
when reload tries to free an insn that the vecload pass has already deleted, it
segfaults.

Moving the vecload pass before ira will not give us register pairs for lxvp out
of ira, and that will be a problem.

Running it after the reload pass is the only solution I see, as ira and the
reload pass make the register pairs, and generating lxvp in the vecload pass is
then straightforward.

Please suggest.
 
> Looking forward to the comments from Segher/David/Peter/Mike etc.
> 
>> +			}
>> +			if (REG_P (XEXP (src, 0))
> 
> Formatting issue, wrong indentation for this "if" hunk.
> 
>> +			    && GET_CODE (XEXP (src, 0)) != PLUS)
>> +			  {
>> +			    mem_attrs attrs (*get_mem_attrs (src));
>> +			    if (attrs.offset_known_p)
>> +			      offset = attrs.offset;
>> +			    if (offset == 0 && REG_P (dest) && REG_ATTRS (dest))
>> +			      offset = REG_ATTRS (dest)->offset;
>> +			    regno = REGNO (XEXP (src,0));
>> +			    first_vec_insn = true;
>> +			    insn1 = insn;
>> +			 }
>> +		     }
>> +	       }
>> +	}
>> +    }
>> +  return false;
>> +}
>> +
>> +const pass_data pass_data_analyze_vecload =
>> +{
>> +  RTL_PASS, /* type */
>> +  "vecload", /* name */
>> +  OPTGROUP_NONE, /* optinfo_flags */
>> +  TV_NONE, /* tv_id */
>> +  0, /* properties_required */
>> +  0, /* properties_provided */
>> +  0, /* properties_destroyed */
>> +  0, /* todo_flags_start */
>> +  TODO_df_finish, /* todo_flags_finish */
>> +};
>> +
>> +class pass_analyze_vecload : public rtl_opt_pass
>> +{
>> +public:
>> +  pass_analyze_vecload(gcc::context *ctxt)
>> +    : rtl_opt_pass(pass_data_analyze_vecload, ctxt)
>> +  {}
>> +
>> +  /* opt_pass methods: */
>> +  virtual bool gate (function *)
>> +    {
>> +      return (optimize > 0 && TARGET_VSX);
>> +    }
>> +
> 
> It should be gated with Power10 as well.
> 
>> +  virtual unsigned int execute (function *fun)
>> +    {
>> +      return rs6000_analyze_vecload (fun);
> 
> As the comment on "return true" in rs6000_analyze_vecload, the return value
> of this function is for TODO flags.
> 
> "...
> The return value contains TODOs to execute in addition to those in
> TODO_flags_finish.   */"
> 
>> +    }
>> +
>> +  opt_pass *clone ()
>> +    {
>> +      return new pass_analyze_vecload (m_ctxt);
>> +    }
> 
> I guess this was copied from rs6000 swap pass, but
> the clone here isn't needed actually since it's only
> one instance.
> 

I will make the changes.

>> +
>> +}; // class pass_analyze_vecload
>> +
>> +rtl_opt_pass *
>> +make_pass_analyze_vecload (gcc::context *ctxt)
>> +{
>> +  return new pass_analyze_vecload (ctxt);
>> +}
>> +
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index cc9253bb040..dba545271e0 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -387,7 +387,7 @@ mode_supports_vmx_dform (machine_mode mode)
>>  /* Return true if we have D-form addressing in VSX registers.  This addressing
>>     is more limited than normal d-form addressing in that the offset must be
>>     aligned on a 16-byte boundary.  */
>> -static inline bool
>> +bool
>>  mode_supports_dq_form (machine_mode mode)
>>  {
>>    return ((reg_addr[mode].addr_mask[RELOAD_REG_ANY] & RELOAD_REG_QUAD_OFFSET)
>> @@ -1178,6 +1178,7 @@ static bool rs6000_secondary_reload_move (enum rs6000_reg_type,
>>  					  secondary_reload_info *,
>>  					  bool);
>>  rtl_opt_pass *make_pass_analyze_swaps (gcc::context*);
>> +rtl_opt_pass *make_pass_analyze_vecload (gcc::context*);
>>  
>>  /* Hash table stuff for keeping track of TOC entries.  */
>>  
>> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
>> index f183b42ce1d..da7ae26e88b 100644
>> --- a/gcc/config/rs6000/t-rs6000
>> +++ b/gcc/config/rs6000/t-rs6000
>> @@ -47,6 +47,10 @@ rs6000-builtin.o: $(srcdir)/config/rs6000/rs6000-builtin.cc
>>  	$(COMPILE) $<
>>  	$(POSTCOMPILE)
>>  
>> +rs6000-vecload-opt.o: $(srcdir)/config/rs6000/rs6000-vecload-opt.cc
>> +	$(COMPILE) $<
>> +	$(POSTCOMPILE)
>> +
>>  build/rs6000-gen-builtins.o: $(srcdir)/config/rs6000/rs6000-gen-builtins.cc
>>  build/rbtree.o: $(srcdir)/config/rs6000/rbtree.cc
>>  
>> diff --git a/gcc/testsuite/g++.target/powerpc/vecload.C b/gcc/testsuite/g++.target/powerpc/vecload.C
>> new file mode 100644
>> index 00000000000..f1689ad6522
>> --- /dev/null
>> +++ b/gcc/testsuite/g++.target/powerpc/vecload.C
>> @@ -0,0 +1,15 @@
>> +/* { dg-do compile } */ 
>> +/* { dg-require-effective-target powerpc_p9vector_ok } */
>> +/* { dg-options "-mdejagnu-cpu=power10 -O2 -mmma" } */ 
>> +
>> +#include <altivec.h>
>> +
>> +void
>> +foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
>> +{
>> +  __vector_quad acc;
>> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>> +  *dst = acc;
>> +}
>> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
> 
> Maybe some more test cases as below can be considered and make this pass better:
> 
> void
> foo1 (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
> {
>   __vector_quad acc;
>   __builtin_mma_xvf32ger(&acc, src, ptr[1]);
>   __builtin_mma_xvf32gerpp(&acc, src, ptr[2]);
>   *dst = acc;
> }
> 
> void
> foo2 (__vector_quad *dst1, __vector_quad *dst2, vector unsigned char *ptr, vector unsigned char src)
> {
>   __vector_quad acc;
>   __builtin_mma_xvf32ger(&acc, src, ptr[0]);
>   __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
>   *dst1 = acc;
>   __builtin_mma_xvf32ger(&acc, src, ptr[2]);
>   __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
>   *dst2 = acc;
> }
>

I will add the above tests.

Thanks & Regards
Ajit
 
> BR,
> Kewen


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-11-28  9:44   ` Kewen.Lin
  2023-11-28 15:41     ` Michael Meissner
@ 2023-12-01  9:13     ` Ajit Agarwal
  1 sibling, 0 replies; 24+ messages in thread
From: Ajit Agarwal @ 2023-12-01  9:13 UTC (permalink / raw)
  To: Kewen.Lin, Michael Meissner
  Cc: gcc-patches, David Edelsohn, Segher Boessenkool, Peter Bergner



On 28/11/23 3:14 pm, Kewen.Lin wrote:
> on 2023/11/28 15:05, Michael Meissner wrote:
>> I tried using this patch to compare with the vector size attribute patch I
>> posted.  I could not build it as a cross compiler on my x86_64 because the
>> assembler gives the following error:
>>
>> Error: operand out of domain (11 is not a multiple of 2) for
>> std_stacktrace-elf.o.  If you look at the assembler, it has combined a lxvp 11
>> and lxvp 12 into:
>>
>>         lxvp 11,0(9)
>>
>> The powerpc architecture requires that registers that are loaded with load
>> vector pair and stored with store vector point instructions only load/store
>> even/odd register pairs, and not odd/even pairs.  Unfortunately, it will mean
>> that this optimization will match less often.
>>
> 
> Yes, the current implementation need some refinements, as comments in [1]:
> 
>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>> checking on the reg No. relationship for paring btw.)
>>
>> Looking forward to the comments from Segher/David/Peter/Mike etc.
> 
> I wonder if we should consider running such pass before reload instead.

Running the vecload pass before the reload pass deletes one of the lxv insns
and replaces it with lxvp.  This then fails in the reload pass while freeing
reg_eqivs: ira populates them, the vecload pass deletes some of the insns, and
when reload tries to free an insn that the vecload pass has already deleted, it
segfaults.

Moving the vecload pass before ira will not give us register pairs for lxvp out
of ira, and that will be a problem.

Running it after the reload pass is the only solution I see, as ira and the
reload pass make the register pairs, and generating lxvp in the vecload pass is
then straightforward.

Thanks & Regards
Ajit
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638070.html
> 
> BR,
> Kewen


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-01  9:10   ` Ajit Agarwal
@ 2023-12-04  2:01     ` Kewen.Lin
  2023-12-05 13:43       ` Ajit Agarwal
  0 siblings, 1 reply; 24+ messages in thread
From: Kewen.Lin @ 2023-12-04  2:01 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

Hi Ajit,

on 2023/12/1 17:10, Ajit Agarwal wrote:
> Hello Kewen:
> 
> On 24/11/23 3:01 pm, Kewen.Lin wrote:
>> Hi Ajit,
>>
>> Don't forget to CC David (CC-ed) :), some comments are inlined below.
>>
>> on 2023/10/8 03:04, Ajit Agarwal wrote:
>>> Hello All:
>>>
>>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>>> lxvp.
>>
>> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
>> Power10 and VSX required, these instructions should perform well without MMA support.
>> So one patch to separate their support from MMA seems to go first.
>>
> 
> I will make the changes for Power10 and VSX.
> 
>>> This patch addresses one regressions failure in ARM architecture.
>>
>> Could you explain this?  I don't see any test case for this.
> 
> I have submitted v1 of the patch and there were regressions failure for Linaro.
> I have fixed in version V2.

OK, thanks for clarifying.  So some unexpected changes to generic code in v1
caused the failure exposed on arm.

> 
>  
>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>> checking on the reg No. relationship for paring btw.)
>>
> 
> Adding before reload pass deletes one of the lxv and replaced with lxvp. This
> fails in reload pass while freeing reg_eqivs as ira populates them and then

I can't find reg_eqivs; I guess you meant reg_equivs and that you moved this
pass right before pass_reload (between pass_ira and pass_reload)?  IMHO that's
unexpected, as those two passes are closely correlated.  I was expecting to put
it somewhere before ira.

> vecload pass deletes some of insns and while freeing in reload pass as insn
> is already deleted in vecload pass reload pass segfaults.
> 
> Moving vecload pass before ira will not make register pairs with lxvp and
> in ira and that will be a problem.

Could you elaborate on the obstacle to moving such a pass before pass_ira?

Based on the status quo, lxvp is bundled with OOmode, so I'd expect we can
generate an OOmode move (load) and use its components via unspec (or subreg
with Peter's patch) to replace all the previous uses; it looks doable to me.

> 
> Making after reload pass is the only solution I see as ira and reload pass
> makes register pairs and vecload pass will be easier with generation of
> lxvp.
> 
> Please suggest.
>  
>> Looking forward to the comments from Segher/David/Peter/Mike etc.

Still looking forward. :)

BR,
Kewen


* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-04  2:01     ` Kewen.Lin
@ 2023-12-05 13:43       ` Ajit Agarwal
  2023-12-05 18:01         ` Ajit Agarwal
  0 siblings, 1 reply; 24+ messages in thread
From: Ajit Agarwal @ 2023-12-05 13:43 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

Hello Kewen:

On 04/12/23 7:31 am, Kewen.Lin wrote:
> Hi Ajit,
> 
> on 2023/12/1 17:10, Ajit Agarwal wrote:
>> Hello Kewen:
>>
>> On 24/11/23 3:01 pm, Kewen.Lin wrote:
>>> Hi Ajit,
>>>
>>> Don't forget to CC David (CC-ed) :), some comments are inlined below.
>>>
>>> on 2023/10/8 03:04, Ajit Agarwal wrote:
>>>> Hello All:
>>>>
>>>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>>>> lxvp.
>>>
>>> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
>>> Power10 and VSX required, these instructions should perform well without MMA support.
>>> So one patch to separate their support from MMA seems to go first.
>>>
>>
>> I will make the changes for Power10 and VSX.
>>
>>>> This patch addresses one regressions failure in ARM architecture.
>>>
>>> Could you explain this?  I don't see any test case for this.
>>
>> I have submitted v1 of the patch and there were regressions failure for Linaro.
>> I have fixed in version V2.
> 
> OK, thanks for clarifying.  So some unexpected changes on generic code in v1
> caused the failure exposed on arm.
> 
>>
>>  
>>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>>> checking on the reg No. relationship for paring btw.)
>>>
>>
>> Adding before reload pass deletes one of the lxv and replaced with lxvp. This
>> fails in reload pass while freeing reg_eqivs as ira populates them and then
> 
> I can't find reg_eqivs, I guessed you meant reg_equivs and moved this pass right before
> pass_reload (between pass_ira and pass_reload)?  IMHO it's unexpected as those two passes
> are closely correlated.  I was expecting to put it somewhere before ira.

Yes, they are tied together, so moving the pass to just before reload will not work.

> 
>> vecload pass deletes some of insns and while freeing in reload pass as insn
>> is already deleted in vecload pass reload pass segfaults.
>>
>> Moving vecload pass before ira will not make register pairs with lxvp and
>> in ira and that will be a problem.
> 
> Could you elaborate the obstacle for moving such pass before pass_ira?
> 
> Basing on the status quo, the lxvp is bundled with OOmode, then I'd expect
> we can generate OOmode move (load) and use the components with unspec (or
> subreg with Peter's patch) to replace all the previous use places, it looks
> doable to me.

When the pass runs before ira, I do delete the offset lxv, generate the lxvp, and replace
all the uses.  But ira then does not allocate the two halves as a register pair; it picks
an arbitrary second register, so we cannot actually emit lxvp.

For example, with the pass before ira, one lxv gets register 32 while its partner is given
register 45.

Thanks & Regards
Ajit
> 

>>
>> Making after reload pass is the only solution I see as ira and reload pass
>> makes register pairs and vecload pass will be easier with generation of
>> lxvp.
>>
>> Please suggest.
>>  
>>> Looking forward to the comments from Segher/David/Peter/Mike etc.
> 
> Still looking forward. :)
> 
> BR,
> Kewen

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-05 13:43       ` Ajit Agarwal
@ 2023-12-05 18:01         ` Ajit Agarwal
  2023-12-06  2:22           ` Kewen.Lin
  0 siblings, 1 reply; 24+ messages in thread
From: Ajit Agarwal @ 2023-12-05 18:01 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

Hello Kewen:


On 05/12/23 7:13 pm, Ajit Agarwal wrote:
> Hello Kewen:
> 
> On 04/12/23 7:31 am, Kewen.Lin wrote:
>> Hi Ajit,
>>
>> on 2023/12/1 17:10, Ajit Agarwal wrote:
>>> Hello Kewen:
>>>
>>> On 24/11/23 3:01 pm, Kewen.Lin wrote:
>>>> Hi Ajit,
>>>>
>>>> Don't forget to CC David (CC-ed) :), some comments are inlined below.
>>>>
>>>> on 2023/10/8 03:04, Ajit Agarwal wrote:
>>>>> Hello All:
>>>>>
>>>>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>>>>> lxvp.
>>>>
>>>> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
>>>> Power10 and VSX required, these instructions should perform well without MMA support.
>>>> So one patch to separate their support from MMA seems to go first.
>>>>
>>>
>>> I will make the changes for Power10 and VSX.
>>>
>>>>> This patch addresses one regressions failure in ARM architecture.
>>>>
>>>> Could you explain this?  I don't see any test case for this.
>>>
>>> I have submitted v1 of the patch and there were regressions failure for Linaro.
>>> I have fixed in version V2.
>>
>> OK, thanks for clarifying.  So some unexpected changes on generic code in v1
>> caused the failure exposed on arm.
>>
>>>
>>>  
>>>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>>>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>>>> checking on the reg No. relationship for paring btw.)
>>>>
>>>
>>> Adding before reload pass deletes one of the lxv and replaced with lxvp. This
>>> fails in reload pass while freeing reg_eqivs as ira populates them and then
>>
>> I can't find reg_eqivs, I guessed you meant reg_equivs and moved this pass right before
>> pass_reload (between pass_ira and pass_reload)?  IMHO it's unexpected as those two passes
>> are closely correlated.  I was expecting to put it somewhere before ira.
> 
> Yes they are tied together and moving before reload will not work.
> 
>>
>>> vecload pass deletes some of insns and while freeing in reload pass as insn
>>> is already deleted in vecload pass reload pass segfaults.
>>>
>>> Moving vecload pass before ira will not make register pairs with lxvp and
>>> in ira and that will be a problem.
>>
>> Could you elaborate the obstacle for moving such pass before pass_ira?
>>
>> Basing on the status quo, the lxvp is bundled with OOmode, then I'd expect
>> we can generate OOmode move (load) and use the components with unspec (or
>> subreg with Peter's patch) to replace all the previous use places, it looks
>> doable to me.
> 
> Moving before ira passes, we delete the offset lxv and generate lxvp and replace all
> the uses, that I am doing. But the offset lxvp register generated by ira are not
> register pair and generate random register and hence we cannot generate lxvp.
> 
> For example one lxv is generated with register 32 and other pair is generated
> with register 45 by ira if we move it before ira passes.

It generates the following.
	lxvp %vs32,0(%r4)
        xvf32ger 0,%vs34,%vs32
        xvf32gerpp 0,%vs34,%vs45
        xxmfacc 0
        stxvp %vs2,0(%r3)
        stxvp %vs0,32(%r3)
        blr


Instead of vs33, ira picks vs45 if we move the pass before pass_ira.
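
For reference, the source shape being discussed here is roughly the following (my own reduced
guess assuming the usual MMA built-ins and -mcpu=power10, not necessarily the exact vecload.C
test from the patch):

#include <altivec.h>

typedef vector unsigned char vec_t;

/* Two contiguous 16-byte loads from PTR feed one accumulator; the
   vecload pass wants to turn the two lxv insns into a single lxvp.  */
void
foo (__vector_quad *dst, vec_t *ptr, vec_t src)
{
  __vector_quad acc;
  __builtin_mma_xvf32ger (&acc, src, ptr[0]);
  __builtin_mma_xvf32gerpp (&acc, src, ptr[1]);
  *dst = acc;
}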

Thanks & Regards
Ajit

 
> Thanks & Regards
> Ajit
>>
> 
>>>
>>> Making after reload pass is the only solution I see as ira and reload pass
>>> makes register pairs and vecload pass will be easier with generation of
>>> lxvp.
>>>
>>> Please suggest.
>>>  
>>>> Looking forward to the comments from Segher/David/Peter/Mike etc.
>>
>> Still looking forward. :)
>>
>> BR,
>> Kewen

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-05 18:01         ` Ajit Agarwal
@ 2023-12-06  2:22           ` Kewen.Lin
  2023-12-06  5:09             ` Michael Meissner
  2023-12-07 11:01             ` Ajit Agarwal
  0 siblings, 2 replies; 24+ messages in thread
From: Kewen.Lin @ 2023-12-06  2:22 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

on 2023/12/6 02:01, Ajit Agarwal wrote:
> Hello Kewen:
> 
> 
> On 05/12/23 7:13 pm, Ajit Agarwal wrote:
>> Hello Kewen:
>>
>> On 04/12/23 7:31 am, Kewen.Lin wrote:
>>> Hi Ajit,
>>>
>>> on 2023/12/1 17:10, Ajit Agarwal wrote:
>>>> Hello Kewen:
>>>>
>>>> On 24/11/23 3:01 pm, Kewen.Lin wrote:
>>>>> Hi Ajit,
>>>>>
>>>>> Don't forget to CC David (CC-ed) :), some comments are inlined below.
>>>>>
>>>>> on 2023/10/8 03:04, Ajit Agarwal wrote:
>>>>>> Hello All:
>>>>>>
>>>>>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>>>>>> lxvp.
>>>>>
>>>>> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
>>>>> Power10 and VSX required, these instructions should perform well without MMA support.
>>>>> So one patch to separate their support from MMA seems to go first.
>>>>>
>>>>
>>>> I will make the changes for Power10 and VSX.
>>>>
>>>>>> This patch addresses one regressions failure in ARM architecture.
>>>>>
>>>>> Could you explain this?  I don't see any test case for this.
>>>>
>>>> I have submitted v1 of the patch and there were regressions failure for Linaro.
>>>> I have fixed in version V2.
>>>
>>> OK, thanks for clarifying.  So some unexpected changes on generic code in v1
>>> caused the failure exposed on arm.
>>>
>>>>
>>>>  
>>>>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>>>>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>>>>> checking on the reg No. relationship for paring btw.)
>>>>>
>>>>
>>>> Adding before reload pass deletes one of the lxv and replaced with lxvp. This
>>>> fails in reload pass while freeing reg_eqivs as ira populates them and then
>>>
>>> I can't find reg_eqivs, I guessed you meant reg_equivs and moved this pass right before
>>> pass_reload (between pass_ira and pass_reload)?  IMHO it's unexpected as those two passes
>>> are closely correlated.  I was expecting to put it somewhere before ira.
>>
>> Yes they are tied together and moving before reload will not work.
>>
>>>
>>>> vecload pass deletes some of insns and while freeing in reload pass as insn
>>>> is already deleted in vecload pass reload pass segfaults.
>>>>
>>>> Moving vecload pass before ira will not make register pairs with lxvp and
>>>> in ira and that will be a problem.
>>>
>>> Could you elaborate the obstacle for moving such pass before pass_ira?
>>>
>>> Basing on the status quo, the lxvp is bundled with OOmode, then I'd expect
>>> we can generate OOmode move (load) and use the components with unspec (or
>>> subreg with Peter's patch) to replace all the previous use places, it looks
>>> doable to me.
>>
>> Moving before ira passes, we delete the offset lxv and generate lxvp and replace all
>> the uses, that I am doing. But the offset lxvp register generated by ira are not
>> register pair and generate random register and hence we cannot generate lxvp.
>>
>> For example one lxv is generated with register 32 and other pair is generated
>> with register 45 by ira if we move it before ira passes.
> 
> It generates the following.
> 	lxvp %vs32,0(%r4)
>         xvf32ger 0,%vs34,%vs32
>         xvf32gerpp 0,%vs34,%vs45

What do the RTL insns for these instructions look like?

I'd expect you to use UNSPEC_MMA_EXTRACT to extract the V16QI halves from the result of lxvp;
the current define_insn_and_split "*vsx_disassemble_pair" should be able to take care of it
from there (e.g. reg and regoff).

BR,
Kewen

>         xxmfacc 0
>         stxvp %vs2,0(%r3)
>         stxvp %vs0,32(%r3)
>         blr
> 
> 
> Instead of vs33 ira generates vs45 if we move before pass_ira.
> 
> Thanks & Regards
> Ajit
> 
>  
>> Thanks & Regards
>> Ajit
>>>
>>
>>>>
>>>> Making after reload pass is the only solution I see as ira and reload pass
>>>> makes register pairs and vecload pass will be easier with generation of
>>>> lxvp.
>>>>
>>>> Please suggest.
>>>>  
>>>>> Looking forward to the comments from Segher/David/Peter/Mike etc.
>>>
>>> Still looking forward. :)
>>>
>>> BR,
>>> Kewen


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-06  2:22           ` Kewen.Lin
@ 2023-12-06  5:09             ` Michael Meissner
  2023-12-07  7:14               ` Kewen.Lin
  2023-12-07 11:01             ` Ajit Agarwal
  1 sibling, 1 reply; 24+ messages in thread
From: Michael Meissner @ 2023-12-06  5:09 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Ajit Agarwal, Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

On Wed, Dec 06, 2023 at 10:22:57AM +0800, Kewen.Lin wrote:
> I'd expect you use UNSPEC_MMA_EXTRACT to extract V16QI from the result of lxvp,
> the current define_insn_and_split "*vsx_disassemble_pair" should be able to take
> care of it further (eg: reg and regoff).
> 
> BR,
> Kewen

With Peter's subreg patch, UNSPEC_MMA_EXTRACT would produce two moves with
SUBREGs:

For an FMA-type loop such as:

union vector_hack2 {
  vector  unsigned char vuc[2];
  vector double v[2];
};

static void
use_mma_ld_st_normal_no_unroll (double * __restrict__ r,
				const double * __restrict__ a,
				const double * __restrict__ b,
				size_t num)
{
  __vector_pair * __restrict__ v_r = ( __vector_pair * __restrict__) r;
  const __vector_pair * __restrict__ v_a = (const __vector_pair * __restrict__) a;
  const __vector_pair * __restrict__ v_b = (const __vector_pair * __restrict__) b;
  size_t num_vector = num / (2 * (sizeof (vector double) / sizeof (double)));
  size_t num_scalar = num % (2 * (sizeof (vector double) / sizeof (double)));
  size_t i;
  union vector_hack2 a_union;
  union vector_hack2 b_union;
  union vector_hack2 r_union;
  vector double a_hi, a_lo;
  vector double b_hi, b_lo;
  vector double r_hi, r_lo;
  union vector_hack result_hi, result_lo;

#pragma GCC unroll 0
  for (i = 0; i < num_vector; i++)
    {
      __builtin_vsx_disassemble_pair (&a_union.vuc, &v_a[i]);
      __builtin_vsx_disassemble_pair (&b_union.vuc, &v_b[i]);
      __builtin_vsx_disassemble_pair (&r_union.vuc, &v_r[i]);

      a_hi = a_union.v[0];
      b_hi = b_union.v[0];
      r_hi = r_union.v[0];

      a_lo = a_union.v[1];
      b_lo = b_union.v[1];
      r_lo = r_union.v[1];

      result_hi.v = (a_hi * b_hi) + r_hi;
      result_lo.v = (a_lo * b_lo) + r_lo;

      __builtin_vsx_build_pair (&v_r[i], result_hi.vuc, result_lo.vuc);
    }

  if (num_scalar)
    {
      r += num_vector * (2 * (sizeof (vector double) / sizeof (double)));
      a += num_vector * (2 * (sizeof (vector double) / sizeof (double)));
      b += num_vector * (2 * (sizeof (vector double) / sizeof (double)));

#pragma GCC unroll 0
      for (i = 0; i < num_scalar; i++)
        r[i] += (a[i] * b[i]);
    }

  return;
}

Peter's code would produce the following in the inner loop:

(insn 16 15 19 4 (set (reg:OO 133 [ _43 ])
        (mem:OO (plus:DI (reg/v/f:DI 150 [ a ])
                (reg:DI 143 [ ivtmp.1088 ])) [6 MEM[(__vector_pair *)a_30(D) + ivtmp.1088_88 * 1]+0 S32 A128])) "p10-fma.h":3285:1 2181 {*movoo}
     (nil))
(insn 19 16 22 4 (set (reg:OO 136 [ _48 ])
        (mem:OO (plus:DI (reg/v/f:DI 151 [ b ])
                (reg:DI 143 [ ivtmp.1088 ])) [6 MEM[(__vector_pair *)b_31(D) + ivtmp.1088_88 * 1]+0 S32 A128])) "p10-fma.h":3285:1 2181 {*movoo}
     (nil))
(insn 22 19 25 4 (set (reg:OO 139 [ _53 ])
        (mem:OO (plus:DI (reg/v/f:DI 149 [ r ])
                (reg:DI 143 [ ivtmp.1088 ])) [6 MEM[(__vector_pair *)r_29(D) + ivtmp.1088_88 * 1]+0 S32 A128])) "p10-fma.h":3285:1 2181 {*movoo}
     (nil))
(insn 25 22 26 4 (set (reg:V2DF 117 [ _6 ])
        (fma:V2DF (subreg:V2DF (reg:OO 136 [ _48 ]) 16)
            (subreg:V2DF (reg:OO 133 [ _43 ]) 16)
            (subreg:V2DF (reg:OO 139 [ _53 ]) 16))) "p10-fma.h":3319:35 1265 {*vsx_fmav2df4}
     (nil))
(insn 26 25 27 4 (set (reg:V2DF 118 [ _8 ])
        (fma:V2DF (subreg:V2DF (reg:OO 136 [ _48 ]) 0)
            (subreg:V2DF (reg:OO 133 [ _43 ]) 0)
            (subreg:V2DF (reg:OO 139 [ _53 ]) 0))) "p10-fma.h":3320:35 1265 {*vsx_fmav2df4}
     (expr_list:REG_DEAD (reg:OO 139 [ _53 ])
        (expr_list:REG_DEAD (reg:OO 136 [ _48 ])
            (expr_list:REG_DEAD (reg:OO 133 [ _43 ])
                (nil)))))
(insn 27 26 28 4 (set (reg:OO 142 [ _59 ])
        (unspec:OO [
                (subreg:V16QI (reg:V2DF 117 [ _6 ]) 0)
                (subreg:V16QI (reg:V2DF 118 [ _8 ]) 0)
            ] UNSPEC_VSX_ASSEMBLE)) 2183 {*vsx_assemble_pair}
     (expr_list:REG_DEAD (reg:V2DF 118 [ _8 ])
        (expr_list:REG_DEAD (reg:V2DF 117 [ _6 ])
            (nil))))

Now in theory you could also get rid of the UNSPEC_VSX_ASSEMBLE by using SUBREGs.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-06  5:09             ` Michael Meissner
@ 2023-12-07  7:14               ` Kewen.Lin
  0 siblings, 0 replies; 24+ messages in thread
From: Kewen.Lin @ 2023-12-07  7:14 UTC (permalink / raw)
  To: Michael Meissner
  Cc: Ajit Agarwal, Segher Boessenkool, David Edelsohn, Peter Bergner,
	GCC Patches

on 2023/12/6 13:09, Michael Meissner wrote:
> On Wed, Dec 06, 2023 at 10:22:57AM +0800, Kewen.Lin wrote:
>> I'd expect you use UNSPEC_MMA_EXTRACT to extract V16QI from the result of lxvp,
>> the current define_insn_and_split "*vsx_disassemble_pair" should be able to take
>> care of it further (eg: reg and regoff).
>>
>> BR,
>> Kewen
> 
> With Peter's subreg patch, UNSPEC_MMA_EXTRACT would produce two move with
> eSUBREGs:

From the details below, I think you mean that even with Peter's subreg patch, which was
intended to get rid of UNSPEC_MMA_EXTRACT for OOmode, we could still end up with
sub-optimal moves?

The proposed subreg and the current UNSPEC_MMA_EXTRACT unspec are alternative ways to extract
a component from the result of lxvp.  Since the latest trunk still uses UNSPEC_MMA_EXTRACT,
I suggested it in my reply to Ajit.

> 
> For a FMA type loop such as:
> 
> union vector_hack2 {
>   vector  unsigned char vuc[2];
>   vector double v[2];
> };
> 
> static void
> use_mma_ld_st_normal_no_unroll (double * __restrict__ r,
> 				const double * __restrict__ a,
> 				const double * __restrict__ b,
> 				size_t num)
> {
>   __vector_pair * __restrict__ v_r = ( __vector_pair * __restrict__) r;
>   const __vector_pair * __restrict__ v_a = (const __vector_pair * __restrict__) a;
>   const __vector_pair * __restrict__ v_b = (const __vector_pair * __restrict__) b;
>   size_t num_vector = num / (2 * (sizeof (vector double) / sizeof (double)));
>   size_t num_scalar = num % (2 * (sizeof (vector double) / sizeof (double)));
>   size_t i;
>   union vector_hack2 a_union;
>   union vector_hack2 b_union;
>   union vector_hack2 r_union;
>   vector double a_hi, a_lo;
>   vector double b_hi, b_lo;
>   vector double r_hi, r_lo;
>   union vector_hack result_hi, result_lo;
> 
> #pragma GCC unroll 0
>   for (i = 0; i < num_vector; i++)
>     {
>       __builtin_vsx_disassemble_pair (&a_union.vuc, &v_a[i]);
>       __builtin_vsx_disassemble_pair (&b_union.vuc, &v_b[i]);
>       __builtin_vsx_disassemble_pair (&r_union.vuc, &v_r[i]);
> 
>       a_hi = a_union.v[0];
>       b_hi = b_union.v[0];
>       r_hi = r_union.v[0];
> 
>       a_lo = a_union.v[1];
>       b_lo = b_union.v[1];
>       r_lo = r_union.v[1];
> 
>       result_hi.v = (a_hi * b_hi) + r_hi;
>       result_lo.v = (a_lo * b_lo) + r_lo;
> 
>       __builtin_vsx_build_pair (&v_r[i], result_hi.vuc, result_lo.vuc);
>     }
> 
>   if (num_scalar)
>     {
>       r += num_vector * (2 * (sizeof (vector double) / sizeof (double)));
>       a += num_vector * (2 * (sizeof (vector double) / sizeof (double)));
>       b += num_vector * (2 * (sizeof (vector double) / sizeof (double)));
> 
> #pragma GCC unroll 0
>       for (i = 0; i < num_scalar; i++)
>  r[i] += (a[i] * b[i]);
>     }
> 
>   return;
> }
> 
> Peter's code would produce the following in the inner loop:
> 
> (insn 16 15 19 4 (set (reg:OO 133 [ _43 ])
>         (mem:OO (plus:DI (reg/v/f:DI 150 [ a ])
>                 (reg:DI 143 [ ivtmp.1088 ])) [6 MEM[(__vector_pair *)a_30(D) + ivtmp.1088_88 * 1]+0 S32 A128])) "p10-fma.h":3285:1 2181 {*movoo}
>      (nil))
> (insn 19 16 22 4 (set (reg:OO 136 [ _48 ])
>         (mem:OO (plus:DI (reg/v/f:DI 151 [ b ])
>                 (reg:DI 143 [ ivtmp.1088 ])) [6 MEM[(__vector_pair *)b_31(D) + ivtmp.1088_88 * 1]+0 S32 A128])) "p10-fma.h":3285:1 2181 {*movoo}
>      (nil))
> (insn 22 19 25 4 (set (reg:OO 139 [ _53 ])
>         (mem:OO (plus:DI (reg/v/f:DI 149 [ r ])
>                 (reg:DI 143 [ ivtmp.1088 ])) [6 MEM[(__vector_pair *)r_29(D) + ivtmp.1088_88 * 1]+0 S32 A128])) "p10-fma.h":3285:1 2181 {*movoo}
>      (nil))
> (insn 25 22 26 4 (set (reg:V2DF 117 [ _6 ])
>         (fma:V2DF (subreg:V2DF (reg:OO 136 [ _48 ]) 16)
>             (subreg:V2DF (reg:OO 133 [ _43 ]) 16)
>             (subreg:V2DF (reg:OO 139 [ _53 ]) 16))) "p10-fma.h":3319:35 1265 {*vsx_fmav2df4}
>      (nil))
> (insn 26 25 27 4 (set (reg:V2DF 118 [ _8 ])
>         (fma:V2DF (subreg:V2DF (reg:OO 136 [ _48 ]) 0)
>             (subreg:V2DF (reg:OO 133 [ _43 ]) 0)
>             (subreg:V2DF (reg:OO 139 [ _53 ]) 0))) "p10-fma.h":3320:35 1265 {*vsx_fmav2df4}
>      (expr_list:REG_DEAD (reg:OO 139 [ _53 ])
>         (expr_list:REG_DEAD (reg:OO 136 [ _48 ])
>             (expr_list:REG_DEAD (reg:OO 133 [ _43 ])
>                 (nil)))))
> (insn 27 26 28 4 (set (reg:OO 142 [ _59 ])
>         (unspec:OO [
>                 (subreg:V16QI (reg:V2DF 117 [ _6 ]) 0)
>                 (subreg:V16QI (reg:V2DF 118 [ _8 ]) 0)
>             ] UNSPEC_VSX_ASSEMBLE)) 2183 {*vsx_assemble_pair}
>      (expr_list:REG_DEAD (reg:V2DF 118 [ _8 ])
>         (expr_list:REG_DEAD (reg:V2DF 117 [ _6 ])
>             (nil))))
> 
> Now in theory you could get ride of the UNSPEC_VSX_ASSEMBLE also using SUBREG's.

Agreed, it looks doable; this comment seems more relevant to Peter's subreg patch. :)

BR,
Kewen

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-06  2:22           ` Kewen.Lin
  2023-12-06  5:09             ` Michael Meissner
@ 2023-12-07 11:01             ` Ajit Agarwal
  2023-12-08  8:01               ` Ajit Agarwal
  1 sibling, 1 reply; 24+ messages in thread
From: Ajit Agarwal @ 2023-12-07 11:01 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

Hello Kewen:

On 06/12/23 7:52 am, Kewen.Lin wrote:
> on 2023/12/6 02:01, Ajit Agarwal wrote:
>> Hello Kewen:
>>
>>
>> On 05/12/23 7:13 pm, Ajit Agarwal wrote:
>>> Hello Kewen:
>>>
>>> On 04/12/23 7:31 am, Kewen.Lin wrote:
>>>> Hi Ajit,
>>>>
>>>> on 2023/12/1 17:10, Ajit Agarwal wrote:
>>>>> Hello Kewen:
>>>>>
>>>>> On 24/11/23 3:01 pm, Kewen.Lin wrote:
>>>>>> Hi Ajit,
>>>>>>
>>>>>> Don't forget to CC David (CC-ed) :), some comments are inlined below.
>>>>>>
>>>>>> on 2023/10/8 03:04, Ajit Agarwal wrote:
>>>>>>> Hello All:
>>>>>>>
>>>>>>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>>>>>>> lxvp.
>>>>>>
>>>>>> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
>>>>>> Power10 and VSX required, these instructions should perform well without MMA support.
>>>>>> So one patch to separate their support from MMA seems to go first.
>>>>>>
>>>>>
>>>>> I will make the changes for Power10 and VSX.
>>>>>
>>>>>>> This patch addresses one regressions failure in ARM architecture.
>>>>>>
>>>>>> Could you explain this?  I don't see any test case for this.
>>>>>
>>>>> I have submitted v1 of the patch and there were regressions failure for Linaro.
>>>>> I have fixed in version V2.
>>>>
>>>> OK, thanks for clarifying.  So some unexpected changes on generic code in v1
>>>> caused the failure exposed on arm.
>>>>
>>>>>
>>>>>  
>>>>>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>>>>>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>>>>>> checking on the reg No. relationship for paring btw.)
>>>>>>
>>>>>
>>>>> Adding before reload pass deletes one of the lxv and replaced with lxvp. This
>>>>> fails in reload pass while freeing reg_eqivs as ira populates them and then
>>>>
>>>> I can't find reg_eqivs, I guessed you meant reg_equivs and moved this pass right before
>>>> pass_reload (between pass_ira and pass_reload)?  IMHO it's unexpected as those two passes
>>>> are closely correlated.  I was expecting to put it somewhere before ira.
>>>
>>> Yes they are tied together and moving before reload will not work.
>>>
>>>>
>>>>> vecload pass deletes some of insns and while freeing in reload pass as insn
>>>>> is already deleted in vecload pass reload pass segfaults.
>>>>>
>>>>> Moving vecload pass before ira will not make register pairs with lxvp and
>>>>> in ira and that will be a problem.
>>>>
>>>> Could you elaborate the obstacle for moving such pass before pass_ira?
>>>>
>>>> Basing on the status quo, the lxvp is bundled with OOmode, then I'd expect
>>>> we can generate OOmode move (load) and use the components with unspec (or
>>>> subreg with Peter's patch) to replace all the previous use places, it looks
>>>> doable to me.
>>>
>>> Moving before ira passes, we delete the offset lxv and generate lxvp and replace all
>>> the uses, that I am doing. But the offset lxvp register generated by ira are not
>>> register pair and generate random register and hence we cannot generate lxvp.
>>>
>>> For example one lxv is generated with register 32 and other pair is generated
>>> with register 45 by ira if we move it before ira passes.
>>
>> It generates the following.
>> 	lxvp %vs32,0(%r4)
>>         xvf32ger 0,%vs34,%vs32
>>         xvf32gerpp 0,%vs34,%vs45
> 
> What do the RTL insns for these insns look like?
> 
> I'd expect you use UNSPEC_MMA_EXTRACT to extract V16QI from the result of lxvp,
> the current define_insn_and_split "*vsx_disassemble_pair" should be able to take
> care of it further (eg: reg and regoff).
> 

Yes, with UNSPEC_MMA_EXTRACT ira and reload allocate a proper register pair for lxvp instead
of an arbitrary register.  But an extra move gets generated.

I am working further on this and will send a new version of the patch with all the
comments incorporated.

Thanks & Regards
Ajit
> BR,
> Kewen
> 
>>         xxmfacc 0
>>         stxvp %vs2,0(%r3)
>>         stxvp %vs0,32(%r3)
>>         blr
>>
>>
>> Instead of vs33 ira generates vs45 if we move before pass_ira.
>>
>> Thanks & Regards
>> Ajit
>>
>>  
>>> Thanks & Regards
>>> Ajit
>>>>
>>>
>>>>>
>>>>> Making after reload pass is the only solution I see as ira and reload pass
>>>>> makes register pairs and vecload pass will be easier with generation of
>>>>> lxvp.
>>>>>
>>>>> Please suggest.
>>>>>  
>>>>>> Looking forward to the comments from Segher/David/Peter/Mike etc.
>>>>
>>>> Still looking forward. :)
>>>>
>>>> BR,
>>>> Kewen
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-07 11:01             ` Ajit Agarwal
@ 2023-12-08  8:01               ` Ajit Agarwal
  2023-12-08  9:51                 ` Kewen.Lin
  2023-12-12  6:28                 ` Kewen.Lin
  0 siblings, 2 replies; 24+ messages in thread
From: Ajit Agarwal @ 2023-12-08  8:01 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

Hello Kewen:

On 07/12/23 4:31 pm, Ajit Agarwal wrote:
> Hello Kewen:
> 
> On 06/12/23 7:52 am, Kewen.Lin wrote:
>> on 2023/12/6 02:01, Ajit Agarwal wrote:
>>> Hello Kewen:
>>>
>>>
>>> On 05/12/23 7:13 pm, Ajit Agarwal wrote:
>>>> Hello Kewen:
>>>>
>>>> On 04/12/23 7:31 am, Kewen.Lin wrote:
>>>>> Hi Ajit,
>>>>>
>>>>> on 2023/12/1 17:10, Ajit Agarwal wrote:
>>>>>> Hello Kewen:
>>>>>>
>>>>>> On 24/11/23 3:01 pm, Kewen.Lin wrote:
>>>>>>> Hi Ajit,
>>>>>>>
>>>>>>> Don't forget to CC David (CC-ed) :), some comments are inlined below.
>>>>>>>
>>>>>>> on 2023/10/8 03:04, Ajit Agarwal wrote:
>>>>>>>> Hello All:
>>>>>>>>
>>>>>>>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>>>>>>>> lxvp.
>>>>>>>
>>>>>>> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
>>>>>>> Power10 and VSX required, these instructions should perform well without MMA support.
>>>>>>> So one patch to separate their support from MMA seems to go first.
>>>>>>>
>>>>>>
>>>>>> I will make the changes for Power10 and VSX.
>>>>>>
>>>>>>>> This patch addresses one regressions failure in ARM architecture.
>>>>>>>
>>>>>>> Could you explain this?  I don't see any test case for this.
>>>>>>
>>>>>> I have submitted v1 of the patch and there were regressions failure for Linaro.
>>>>>> I have fixed in version V2.
>>>>>
>>>>> OK, thanks for clarifying.  So some unexpected changes on generic code in v1
>>>>> caused the failure exposed on arm.
>>>>>
>>>>>>
>>>>>>  
>>>>>>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>>>>>>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>>>>>>> checking on the reg No. relationship for paring btw.)
>>>>>>>
>>>>>>
>>>>>> Adding before reload pass deletes one of the lxv and replaced with lxvp. This
>>>>>> fails in reload pass while freeing reg_eqivs as ira populates them and then
>>>>>
>>>>> I can't find reg_eqivs, I guessed you meant reg_equivs and moved this pass right before
>>>>> pass_reload (between pass_ira and pass_reload)?  IMHO it's unexpected as those two passes
>>>>> are closely correlated.  I was expecting to put it somewhere before ira.
>>>>
>>>> Yes they are tied together and moving before reload will not work.
>>>>
>>>>>
>>>>>> vecload pass deletes some of insns and while freeing in reload pass as insn
>>>>>> is already deleted in vecload pass reload pass segfaults.
>>>>>>
>>>>>> Moving vecload pass before ira will not make register pairs with lxvp and
>>>>>> in ira and that will be a problem.
>>>>>
>>>>> Could you elaborate the obstacle for moving such pass before pass_ira?
>>>>>
>>>>> Basing on the status quo, the lxvp is bundled with OOmode, then I'd expect
>>>>> we can generate OOmode move (load) and use the components with unspec (or
>>>>> subreg with Peter's patch) to replace all the previous use places, it looks
>>>>> doable to me.
>>>>
>>>> Moving before ira passes, we delete the offset lxv and generate lxvp and replace all
>>>> the uses, that I am doing. But the offset lxvp register generated by ira are not
>>>> register pair and generate random register and hence we cannot generate lxvp.
>>>>
>>>> For example one lxv is generated with register 32 and other pair is generated
>>>> with register 45 by ira if we move it before ira passes.
>>>
>>> It generates the following.
>>> 	lxvp %vs32,0(%r4)
>>>         xvf32ger 0,%vs34,%vs32
>>>         xvf32gerpp 0,%vs34,%vs45
>>
>> What do the RTL insns for these insns look like?
>>
>> I'd expect you use UNSPEC_MMA_EXTRACT to extract V16QI from the result of lxvp,
>> the current define_insn_and_split "*vsx_disassemble_pair" should be able to take
>> care of it further (eg: reg and regoff).
>>
> 
> Yes with UNSPEC_MMA_EXTRACT it generates lxvp with register pair instead of random
> register by ira and reload pass. But there is an extra moves that gets generated.
> 

With UNSPEC_MMA_EXTRACT I could generate the register pair, but the code below is
functionally incorrect.

        lxvp %vs0,0(%r4)
        xxlor %vs32,%vs0,%vs0
        xvf32ger 0,%vs34,%vs32
        xvf32gerpp 0,%vs34,%vs33
        xxmfacc 0
        stxvp %vs2,0(%r3)
        stxvp %vs0,32(%r3)
        blr


Here is the RTL Code:

(insn 19 4 20 2 (set (reg:OO 124 [ *ptr_4(D) ])
        (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
     (nil))
(insn 20 19 9 2 (set (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
        (subreg:V16QI (reg:OO 124 [ *ptr_4(D) ]) 0)) -1
     (nil))
(insn 9 20 11 2 (set (reg:XO 119 [ _7 ])
        (unspec:XO [
                (reg/v:V16QI 123 [ src ])
                (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
            ] UNSPEC_MMA_XVF32GER)) 2195 {mma_xvf32ger}
     (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
        (nil)))
(insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
        (unspec:XO [
                (reg:XO 119 [ _7 ])
                (reg/v:V16QI 123 [ src ])
                (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
            ] UNSPEC_MMA_XVF32GERPP)) 2209 {mma_xvf32gerpp}
     (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
        (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
            (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
                (nil)))))
(insn 12 11 18 2 (set (mem:XO (reg:DI 126) [1 *dst_10(D)+0 S64 A128])
        (reg:XO 120 [ _9 ])) "../gcc/testsuite/g++.target/powerpc/vecload.C":13:8 2182 {*movxo}
     (expr_list:REG_DEAD (reg:DI 126)
        (expr_list:REG_DEAD (reg:XO 120 [ _9 ])
            (nil))))
(note 18 12 0 NOTE_INSN_DELETED)

The live ranges of r124 and r129 conflict, so ira assigns them different registers, which
does not serve our purpose.

Making r124 and r129 the same pseudo does not help either, since ira cannot allocate a
register for a pseudo that would need both OOmode and V16QImode.

Running this pass before pass_ira hits the issues above; they go away when the pass runs
after reload.

Please suggest.

Thanks & Regards
Ajit

> I am working further on this and send the new version of the patch with all the
> comments incorporated.
> 
> Thanks & Regards
> Ajit
>> BR,
>> Kewen
>>
>>>         xxmfacc 0
>>>         stxvp %vs2,0(%r3)
>>>         stxvp %vs0,32(%r3)
>>>         blr
>>>
>>>
>>> Instead of vs33 ira generates vs45 if we move before pass_ira.
>>>
>>> Thanks & Regards
>>> Ajit
>>>
>>>  
>>>> Thanks & Regards
>>>> Ajit
>>>>>
>>>>
>>>>>>
>>>>>> Making after reload pass is the only solution I see as ira and reload pass
>>>>>> makes register pairs and vecload pass will be easier with generation of
>>>>>> lxvp.
>>>>>>
>>>>>> Please suggest.
>>>>>>  
>>>>>>> Looking forward to the comments from Segher/David/Peter/Mike etc.
>>>>>
>>>>> Still looking forward. :)
>>>>>
>>>>> BR,
>>>>> Kewen
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-08  8:01               ` Ajit Agarwal
@ 2023-12-08  9:51                 ` Kewen.Lin
  2023-12-12  6:28                 ` Kewen.Lin
  1 sibling, 0 replies; 24+ messages in thread
From: Kewen.Lin @ 2023-12-08  9:51 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches, Vladimir Makarov,
	Richard Sandiford, Richard Biener, Jeff Law, Jakub Jelinek

Hi Ajit,

on 2023/12/8 16:01, Ajit Agarwal wrote:
> Hello Kewen:
> 
> On 07/12/23 4:31 pm, Ajit Agarwal wrote:
>> Hello Kewen:
>>
>> On 06/12/23 7:52 am, Kewen.Lin wrote:
>>> on 2023/12/6 02:01, Ajit Agarwal wrote:
>>>> Hello Kewen:
>>>>
>>>>
>>>> On 05/12/23 7:13 pm, Ajit Agarwal wrote:
>>>>> Hello Kewen:
>>>>>
>>>>> On 04/12/23 7:31 am, Kewen.Lin wrote:
>>>>>> Hi Ajit,
>>>>>>
>>>>>> on 2023/12/1 17:10, Ajit Agarwal wrote:
>>>>>>> Hello Kewen:
>>>>>>>
>>>>>>> On 24/11/23 3:01 pm, Kewen.Lin wrote:
>>>>>>>> Hi Ajit,
>>>>>>>>
>>>>>>>> Don't forget to CC David (CC-ed) :), some comments are inlined below.
>>>>>>>>
>>>>>>>> on 2023/10/8 03:04, Ajit Agarwal wrote:
>>>>>>>>> Hello All:
>>>>>>>>>
>>>>>>>>> This patch add new pass to replace contiguous addresses vector load lxv with mma instruction
>>>>>>>>> lxvp.
>>>>>>>>
>>>>>>>> IMHO the current binding lxvp (and lxvpx, stxvp{x,}) to MMA looks wrong, it's only
>>>>>>>> Power10 and VSX required, these instructions should perform well without MMA support.
>>>>>>>> So one patch to separate their support from MMA seems to go first.
>>>>>>>>
>>>>>>>
>>>>>>> I will make the changes for Power10 and VSX.
>>>>>>>
>>>>>>>>> This patch addresses one regressions failure in ARM architecture.
>>>>>>>>
>>>>>>>> Could you explain this?  I don't see any test case for this.
>>>>>>>
>>>>>>> I have submitted v1 of the patch and there were regressions failure for Linaro.
>>>>>>> I have fixed in version V2.
>>>>>>
>>>>>> OK, thanks for clarifying.  So some unexpected changes on generic code in v1
>>>>>> caused the failure exposed on arm.
>>>>>>
>>>>>>>
>>>>>>>  
>>>>>>>> Besides, it seems a bad idea to put this pass after reload? as register allocation
>>>>>>>> finishes, this pairing has to be restricted by the reg No. (I didn't see any
>>>>>>>> checking on the reg No. relationship for paring btw.)
>>>>>>>>
>>>>>>>
>>>>>>> Adding before reload pass deletes one of the lxv and replaced with lxvp. This
>>>>>>> fails in reload pass while freeing reg_eqivs as ira populates them and then
>>>>>>
>>>>>> I can't find reg_eqivs, I guessed you meant reg_equivs and moved this pass right before
>>>>>> pass_reload (between pass_ira and pass_reload)?  IMHO it's unexpected as those two passes
>>>>>> are closely correlated.  I was expecting to put it somewhere before ira.
>>>>>
>>>>> Yes they are tied together and moving before reload will not work.
>>>>>
>>>>>>
>>>>>>> vecload pass deletes some of insns and while freeing in reload pass as insn
>>>>>>> is already deleted in vecload pass reload pass segfaults.
>>>>>>>
>>>>>>> Moving vecload pass before ira will not make register pairs with lxvp and
>>>>>>> in ira and that will be a problem.
>>>>>>
>>>>>> Could you elaborate the obstacle for moving such pass before pass_ira?
>>>>>>
>>>>>> Basing on the status quo, the lxvp is bundled with OOmode, then I'd expect
>>>>>> we can generate OOmode move (load) and use the components with unspec (or
>>>>>> subreg with Peter's patch) to replace all the previous use places, it looks
>>>>>> doable to me.
>>>>>
>>>>> Moving before ira passes, we delete the offset lxv and generate lxvp and replace all
>>>>> the uses, that I am doing. But the offset lxvp register generated by ira are not
>>>>> register pair and generate random register and hence we cannot generate lxvp.
>>>>>
>>>>> For example one lxv is generated with register 32 and other pair is generated
>>>>> with register 45 by ira if we move it before ira passes.
>>>>
>>>> It generates the following.
>>>> 	lxvp %vs32,0(%r4)
>>>>         xvf32ger 0,%vs34,%vs32
>>>>         xvf32gerpp 0,%vs34,%vs45
>>>
>>> What do the RTL insns for these insns look like?
>>>
>>> I'd expect you use UNSPEC_MMA_EXTRACT to extract V16QI from the result of lxvp,
>>> the current define_insn_and_split "*vsx_disassemble_pair" should be able to take
>>> care of it further (eg: reg and regoff).
>>>
>>
>> Yes with UNSPEC_MMA_EXTRACT it generates lxvp with register pair instead of random
>> register by ira and reload pass. But there is an extra moves that gets generated.
>>
> 
> With UNSPEC_MMA_EXTRACT I could generate the register pair but functionally here is the
> below code which is incorrect.
> 
>  l	lxvp %vs0,0(%r4)
>         xxlor %vs32,%vs0,%vs0
>         xvf32ger 0,%vs34,%vs32
>         xvf32gerpp 0,%vs34,%vs33
>         xxmfacc 0
>         stxvp %vs2,0(%r3)
>         stxvp %vs0,32(%r3)
>         blr
> 
> 
> Here is the RTL Code:
> 
> (insn 19 4 20 2 (set (reg:OO 124 [ *ptr_4(D) ])
>         (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
>      (nil))
> (insn 20 19 9 2 (set (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
>         (subreg:V16QI (reg:OO 124 [ *ptr_4(D) ]) 0)) -1
>      (nil))
> (insn 9 20 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
>             ] UNSPEC_MMA_XVF32GER)) 2195 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])

Thanks for trying this and the update!

I think the functionality issue is that this "reg:V16QI 125" is never defined; it seems that
you don't emit explicit code to extract this component from "reg:OO 124"?

>             ] UNSPEC_MMA_XVF32GERPP)) 2209 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>                 (nil)))))
> (insn 12 11 18 2 (set (mem:XO (reg:DI 126) [1 *dst_10(D)+0 S64 A128])
>         (reg:XO 120 [ _9 ])) "../gcc/testsuite/g++.target/powerpc/vecload.C":13:8 2182 {*movxo}
>      (expr_list:REG_DEAD (reg:DI 126)
>         (expr_list:REG_DEAD (reg:XO 120 [ _9 ])
>             (nil))))
> (note 18 12 0 NOTE_INSN_DELETED)
> 
> r124 and r129 conflicts live range amd ira generates different registers which will not
> serve our purpose.

This conflict issue is tough, but IMHO it only causes unnecessary register moves; the
functionality should still be fine.  I'm not sure whether there is existing practice for this
kind of issue; one immature idea is to use (subreg:V16QI (reg:OO 124) 0/1) directly in the
uses such as UNSPEC_MMA_XVF32GER and UNSPEC_MMA_XVF32GERPP, and fix them up when the subreg
is dropped.
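
To illustrate that idea with a hypothetical, untested sketch: rewriting a use so that it reads
the matching half of the OOmode pseudo through a subreg could be as small as the following,
where the byte offsets 0 and 16 for the two halves are an assumption of the example:

/* Hypothetical sketch: replace OLD_REG in USE_INSN with a V16QI
   subreg of the OOmode pseudo OO_REG at byte OFFSET (0 or 16).  */
static void
replace_use_with_oo_subreg (rtx_insn *use_insn, rtx old_reg,
                            rtx oo_reg, poly_uint64 offset)
{
  rtx half = gen_rtx_SUBREG (V16QImode, oo_reg, offset);
  validate_replace_rtx (old_reg, half, use_insn);
}

Whether dropping those subregs later reintroduces the extra moves is exactly the open question.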

CC more experts for advice!

> 
> Making r124 and r129 as same will not allocate register by ira as r124 could have both OOmode
> and V16QImode.
> 
> Doing this pass before ira_pass has such above issues and we could solve them after making
> after reload pass.
> 
> Please suggest.
> 
> Thanks & Regards
> Ajit
> 
>> I am working further on this and send the new version of the patch with all the
>> comments incorporated.
>>
>> Thanks & Regards
>> Ajit
>>> BR,
>>> Kewen
>>>
>>>>         xxmfacc 0
>>>>         stxvp %vs2,0(%r3)
>>>>         stxvp %vs0,32(%r3)
>>>>         blr
>>>>
>>>>
>>>> Instead of vs33 ira generates vs45 if we move before pass_ira.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>>  
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>>
>>>>>
>>>>>>>
>>>>>>> Making after reload pass is the only solution I see as ira and reload pass
>>>>>>> makes register pairs and vecload pass will be easier with generation of
>>>>>>> lxvp.
>>>>>>>
>>>>>>> Please suggest.
>>>>>>>  
>>>>>>>> Looking forward to the comments from Segher/David/Peter/Mike etc.
>>>>>>
>>>>>> Still looking forward. :)
>>>>>>
>>>>>> BR,
>>>>>> Kewen
>>>


BR,
Kewen

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-08  8:01               ` Ajit Agarwal
  2023-12-08  9:51                 ` Kewen.Lin
@ 2023-12-12  6:28                 ` Kewen.Lin
  2023-12-12  7:38                   ` Ajit Agarwal
  1 sibling, 1 reply; 24+ messages in thread
From: Kewen.Lin @ 2023-12-12  6:28 UTC (permalink / raw)
  To: Ajit Agarwal
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

Hi Ajit,

on 2023/12/8 16:01, Ajit Agarwal wrote:
> Hello Kewen:
> 

[snip...]

> With UNSPEC_MMA_EXTRACT I could generate the register pair but functionally here is the
> below code which is incorrect.
> 
>  l	lxvp %vs0,0(%r4)
>         xxlor %vs32,%vs0,%vs0
>         xvf32ger 0,%vs34,%vs32
>         xvf32gerpp 0,%vs34,%vs33
>         xxmfacc 0
>         stxvp %vs2,0(%r3)
>         stxvp %vs0,32(%r3)
>         blr
> 
> 
> Here is the RTL Code:
> 
> (insn 19 4 20 2 (set (reg:OO 124 [ *ptr_4(D) ])
>         (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
>      (nil))
> (insn 20 19 9 2 (set (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
>         (subreg:V16QI (reg:OO 124 [ *ptr_4(D) ]) 0)) -1
>      (nil))
> (insn 9 20 11 2 (set (reg:XO 119 [ _7 ])
>         (unspec:XO [
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
>             ] UNSPEC_MMA_XVF32GER)) 2195 {mma_xvf32ger}
>      (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
>         (nil)))
> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>         (unspec:XO [
>                 (reg:XO 119 [ _7 ])
>                 (reg/v:V16QI 123 [ src ])
>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>             ] UNSPEC_MMA_XVF32GERPP)) 2209 {mma_xvf32gerpp}
>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>                 (nil)))))
> (insn 12 11 18 2 (set (mem:XO (reg:DI 126) [1 *dst_10(D)+0 S64 A128])
>         (reg:XO 120 [ _9 ])) "../gcc/testsuite/g++.target/powerpc/vecload.C":13:8 2182 {*movxo}
>      (expr_list:REG_DEAD (reg:DI 126)
>         (expr_list:REG_DEAD (reg:XO 120 [ _9 ])
>             (nil))))
> (note 18 12 0 NOTE_INSN_DELETED)
> 
> r124 and r129 conflicts live range amd ira generates different registers which will not
> serve our purpose.
> 
> Making r124 and r129 as same will not allocate register by ira as r124 could have both OOmode
> and V16QImode.
> 
> Doing this pass before ira_pass has such above issues and we could solve them after making
> after reload pass.

Could you also attach your latest WIP patch?  I'm going to look into the extra move issue with it.

Thanks!

BR,
Kewen

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp
  2023-12-12  6:28                 ` Kewen.Lin
@ 2023-12-12  7:38                   ` Ajit Agarwal
  0 siblings, 0 replies; 24+ messages in thread
From: Ajit Agarwal @ 2023-12-12  7:38 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Segher Boessenkool, David Edelsohn, Peter Bergner,
	Michael Meissner, GCC Patches

Hello Kewen:

On 12/12/23 11:58 am, Kewen.Lin wrote:
> Hi Ajit,
> 
> on 2023/12/8 16:01, Ajit Agarwal wrote:
>> Hello Kewen:
>>
> 
> [snip...]
> 
>> With UNSPEC_MMA_EXTRACT I could generate the register pair but functionally here is the
>> below code which is incorrect.
>>
>>  l	lxvp %vs0,0(%r4)
>>         xxlor %vs32,%vs0,%vs0
>>         xvf32ger 0,%vs34,%vs32
>>         xvf32gerpp 0,%vs34,%vs33
>>         xxmfacc 0
>>         stxvp %vs2,0(%r3)
>>         stxvp %vs0,32(%r3)
>>         blr
>>
>>
>> Here is the RTL Code:
>>
>> (insn 19 4 20 2 (set (reg:OO 124 [ *ptr_4(D) ])
>>         (mem:OO (reg/v/f:DI 122 [ ptr ]) [0 *ptr_4(D)+0 S16 A128])) -1
>>      (nil))
>> (insn 20 19 9 2 (set (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
>>         (subreg:V16QI (reg:OO 124 [ *ptr_4(D) ]) 0)) -1
>>      (nil))
>> (insn 9 20 11 2 (set (reg:XO 119 [ _7 ])
>>         (unspec:XO [
>>                 (reg/v:V16QI 123 [ src ])
>>                 (reg:V16QI 129 [orig:124 *ptr_4(D) ] [124])
>>             ] UNSPEC_MMA_XVF32GER)) 2195 {mma_xvf32ger}
>>      (expr_list:REG_DEAD (reg:OO 124 [ *ptr_4(D) ])
>>         (nil)))
>> (insn 11 9 12 2 (set (reg:XO 120 [ _9 ])
>>         (unspec:XO [
>>                 (reg:XO 119 [ _7 ])
>>                 (reg/v:V16QI 123 [ src ])
>>                 (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>>             ] UNSPEC_MMA_XVF32GERPP)) 2209 {mma_xvf32gerpp}
>>      (expr_list:REG_DEAD (reg:V16QI 125 [ MEM[(__vector unsigned char *)ptr_4(D) + 16B] ])
>>         (expr_list:REG_DEAD (reg/v:V16QI 123 [ src ])
>>             (expr_list:REG_DEAD (reg:XO 119 [ _7 ])
>>                 (nil)))))
>> (insn 12 11 18 2 (set (mem:XO (reg:DI 126) [1 *dst_10(D)+0 S64 A128])
>>         (reg:XO 120 [ _9 ])) "../gcc/testsuite/g++.target/powerpc/vecload.C":13:8 2182 {*movxo}
>>      (expr_list:REG_DEAD (reg:DI 126)
>>         (expr_list:REG_DEAD (reg:XO 120 [ _9 ])
>>             (nil))))
>> (note 18 12 0 NOTE_INSN_DELETED)
>>
>> r124 and r129 conflicts live range amd ira generates different registers which will not
>> serve our purpose.
>>
>> Making r124 and r129 as same will not allocate register by ira as r124 could have both OOmode
>> and V16QImode.
>>
>> Doing this pass before ira_pass has such above issues and we could solve them after making
>> after reload pass.
> 
> Could you also attach your latest WIP patch?  I'm going to look into the extra move issue with it.
>

I have changed the IRA register allocator so that it forms the register pair; with that
change no extra move is generated, and the code now uses a proper register pair.

Your earlier idea of using a V16QI SUBREG of (reg:OO 124) at the use sites also does not,
by itself, produce the register pair.  I had to change the IRA register allocator to form
the pair, and with that no extra moves are generated either.  I am testing the fix and will
send it for review very soon.

Thanks for the help and suggestions.

Thanks & Regards
Ajit
 
> Thanks!
> 
> BR,
> Kewen

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-12-12  7:38 UTC | newest]

Thread overview: 24+ messages
2023-10-07 19:04 [PATCH v2] rs6000: Add new pass for replacement of contiguous addresses vector load lxv with lxvp Ajit Agarwal
2023-10-15 12:13 ` [PING ^0][PATCH " Ajit Agarwal
2023-10-23  8:32   ` [PING ^1][PATCH " Ajit Agarwal
2023-11-10  7:04     ` [PING ^2][PATCH " Ajit Agarwal
2023-11-24  9:31 ` [PATCH " Kewen.Lin
2023-11-28  4:34   ` Michael Meissner
2023-11-28  9:33     ` Kewen.Lin
2023-12-01  9:10   ` Ajit Agarwal
2023-12-04  2:01     ` Kewen.Lin
2023-12-05 13:43       ` Ajit Agarwal
2023-12-05 18:01         ` Ajit Agarwal
2023-12-06  2:22           ` Kewen.Lin
2023-12-06  5:09             ` Michael Meissner
2023-12-07  7:14               ` Kewen.Lin
2023-12-07 11:01             ` Ajit Agarwal
2023-12-08  8:01               ` Ajit Agarwal
2023-12-08  9:51                 ` Kewen.Lin
2023-12-12  6:28                 ` Kewen.Lin
2023-12-12  7:38                   ` Ajit Agarwal
2023-11-28  7:05 ` Michael Meissner
2023-11-28  9:44   ` Kewen.Lin
2023-11-28 15:41     ` Michael Meissner
2023-11-29 14:10       ` Ajit Agarwal
2023-12-01  9:13     ` Ajit Agarwal
