public inbox for gcc-patches@gcc.gnu.org
* Repost [PATCH 0/6] PowerPC Future patches
@ 2024-01-05 23:27 Michael Meissner
  2024-01-05 23:35 ` Repost [PATCH 1/6] Add -mcpu=future Michael Meissner
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-05 23:27 UTC (permalink / raw)
  To: gcc-patches, Michael Meissner, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

I posted these patches on October 18th, 2023, and I never received any feedback
on the changes.  What changes do I need to make with these patches to get them
into GCC 14?

This patch is very preliminary support for a potential new feature to the
PowerPC that extends the current power10 MMA architecture.  This feature may or
may not be present in any specific future PowerPC processor.

In the current MMA subsystem for Power10, there are 8 512-bit accumulator
registers.  These accumulators are each tied to sets of 4 FPR registers.  When
you issue a prime instruction, it makes sure the accumulator is a copy of the 4
FPR registers the accumulator is tied to.  When you issue a deprime
instruction, it makes sure that the accumulator data content is logically
copied to the matching FPR register.
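
As a rough illustration of the tying described above (a sketch only, not code
from GCC): accumulator n is tied to registers 4n through 4n+3 in the 0..31
range.  The function and constant names below are hypothetical.

```c
#include <assert.h>

enum { NUM_ACCUMULATORS = 8, REGS_PER_ACCUMULATOR = 4 };

/* First of the four registers (numbered 0..31) that accumulator ACC is
   tied to.  */
int
acc_first_reg (int acc)
{
  assert (acc >= 0 && acc < NUM_ACCUMULATORS);
  return acc * REGS_PER_ACCUMULATOR;
}

/* Last of the four registers that accumulator ACC is tied to.  */
int
acc_last_reg (int acc)
{
  return acc_first_reg (acc) + REGS_PER_ACCUMULATOR - 1;
}
```

For example, accumulator 3 would cover registers 12 through 15, and the eight
accumulators together cover all of 0..31.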

In the potential dense math system, the accumulators are moved to separate
registers called dense math registers (DM registers or DMR).  The DMRs are then
extended to 1,024 bits and new instructions will be added to deal with all
1,024 bits of the DMRs.

If you take existing MMA code, it will work as long as you don't do anything
with accumulators, and you follow the rules in the ISA 3.1 documentation for
using the MMA subsystem.

These patches add support for the 512-bit accumulators within the dense math
system, and for allocation of the 1,024-bit DMRs.  At this time, no additional
built-in functions will be added to support any dense math features other than
doing data movement between the DMRs and the VSX registers.  Before we can look
at adding any new dense math support other than data movement, we need the GCC
compiler to be able to allocate and use these DMRs.

There are 6 patches in this patch set:

1) The first patch just adds -mcpu=future as an option to add new support.
This is similar to the -mcpu=future that we did before power10 was announced.

2) The second patch enables GCC to use the load and store vector pair
instructions to optimize memory copy operations in the compiler.  For power10,
we needed to just stay with normal vector load/stores for memory copy
operations.

3) The third patch enables 512-bit accumulators that are located within DMRs
instead of the FPRs.  This patch enables the register allocation, but it does
not move the existing MMA to use these registers.

4) The fourth patch switches the MMA subsystem to use 512-bit accumulators
within DMRs if you use -mcpu=future.

5) The fifth patch switches the names of the MMA instructions to use the dense
math equivalent name if -mcpu=future.

6) The sixth patch enables using the full 1,024-bit DMRs.  Right now, all you
can do with DMRs is move a VSX register to a DMR register, and move a DMR
register to a VSX register.

In terms of changes, these patches now use the wD constraint for accumulators.
If you compile with -mcpu=power10, the wD constraint will match the equivalent
FPR register that overlaps with the accumulator.  If you compile with
-mcpu=future, the wD constraint will match the DMR register and not the FPR
register.

These patches also modify the print_operand %A output modifier to print out
DMR register numbers if -mcpu=future, and continue to print out the FPR
register number divided by 4 for -mcpu=power10.

In general, if you only use the built-in functions, things work between the two
systems.  If you use extended asm, you will likely need to modify the code.
Going forward, hopefully if you modify your code to use the wD constraint and
%A output modifier, you can write code that switches more easily between the
two systems.

Again, these are preliminary patches for a potential future machine.  Things
will likely change in terms of implementation and usage over time.

Originally these patches were submitted in November 2022:
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/605581.html

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Repost [PATCH 1/6] Add -mcpu=future
  2024-01-05 23:27 Repost [PATCH 0/6] PowerPC Future patches Michael Meissner
@ 2024-01-05 23:35 ` Michael Meissner
  2024-01-19 18:43   ` Ping " Michael Meissner
                     ` (2 more replies)
  2024-01-05 23:37 ` Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair Michael Meissner
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-05 23:35 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

This patch implements support for a potential future PowerPC cpu.  Features
added with -mcpu=future may or may not be added to new PowerPC processors.

This patch adds support for the -mcpu=future option.  If you use -mcpu=future,
the macro _ARCH_PWR_FUTURE is defined, and the assembler .machine directive
"future" is used.  Future patches in this series will add support for new
instructions that may be present in future PowerPC processors.
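
As an illustration only (not part of the patch), user code could test for the
new macro in the same way it tests for _ARCH_PWR10 today.  The diff in this
message defines the macro as _ARCH_PWR_FUTURE; the function name below is
hypothetical.

```c
#include <string.h>

/* Name of the newest PowerPC architecture level that the compiler
   advertised through its predefined macros.  */
const char *
ppc_arch_level (void)
{
#if defined (_ARCH_PWR_FUTURE)
  return "future";
#elif defined (_ARCH_PWR10)
  return "power10";
#else
  return "older or non-PowerPC";
#endif
}
```

Because -mcpu=future includes the power10 ISA flags, the check for the newer
macro has to come first.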

This particular patch does not add any new features.  It exists as groundwork
for future patches to support a possible future PowerPC processor.

This patch does not implement any differences in tuning when -mcpu=future is
used compared to -mcpu=power10.  If -mcpu=future is used, GCC will use power10
tuning.  If you explicitly use -mtune=future, you will get a warning that
-mtune=future is not supported, and default tuning will be set for power10.
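
The fallback can be pictured with the following cut-down sketch.  The enum,
table, and function names here are hypothetical stand-ins, not the real GCC
data structures; the actual logic is in the rs6000.cc hunk of this patch.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for GCC's processor enum and table.  */
enum processor { PROC_POWER9, PROC_POWER10, PROC_FUTURE };

static const enum processor processor_table[] = {
  PROC_POWER9, PROC_POWER10, PROC_FUTURE
};

/* Return the table index of PROC, or -1 if it is not present.  */
int
cpu_index_lookup (enum processor proc)
{
  size_t n = sizeof processor_table / sizeof processor_table[0];
  for (size_t i = 0; i < n; i++)
    if (processor_table[i] == proc)
      return (int) i;
  return -1;
}

/* Pick the tuning index.  "future" has no tuning of its own yet, so it
   falls back to power10; every other processor tunes for itself.  */
int
tune_index_for (enum processor proc)
{
  return cpu_index_lookup (proc == PROC_FUTURE ? PROC_POWER10 : proc);
}
```

The sketch leaves out the one-time warning that the patch issues when
-mtune=future is given explicitly.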

The patches have been tested on both little and big endian systems.  Can I check
this patch into the master branch?

2024-01-05   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
	_ARCH_PWR_FUTURE if -mcpu=future.
	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): New macro.
	(POWERPC_MASKS): Add -mcpu=future support.
	* config/rs6000/rs6000-opts.h (enum processor_type): Add
	PROCESSOR_FUTURE.
	* config/rs6000/rs6000-tables.opt: Regenerate.
	* config/rs6000/rs6000.cc (rs600_cpu_index_lookup): New helper
	function.
	(rs6000_option_override_internal): Make -mcpu=future set
	-mtune=power10.  If the user explicitly uses -mtune=future, give a
	warning and reset the tuning to power10.
	(rs6000_option_override_internal): Use power10 costs for future
	machine.
	(rs6000_machine_from_flags): Add support for -mcpu=future.
	(rs6000_opt_masks): Likewise.
	* config/rs6000/rs6000.h (ASM_CPU_SUPPORT): Likewise.
	* config/rs6000/rs6000.md (cpu attribute): Likewise.
	* config/rs6000/rs6000.opt (-mfuture): New undocumented debug switch.
	* doc/invoke.texi (IBM RS/6000 and PowerPC Options): Document -mcpu=future.
---
 gcc/config/rs6000/rs6000-c.cc       |  2 +
 gcc/config/rs6000/rs6000-cpus.def   |  6 +++
 gcc/config/rs6000/rs6000-opts.h     |  4 +-
 gcc/config/rs6000/rs6000-tables.opt |  3 ++
 gcc/config/rs6000/rs6000.cc         | 58 ++++++++++++++++++++++++-----
 gcc/config/rs6000/rs6000.h          |  1 +
 gcc/config/rs6000/rs6000.md         |  2 +-
 gcc/config/rs6000/rs6000.opt        |  4 ++
 gcc/doc/invoke.texi                 |  2 +-
 9 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
index ce0b14a8d37..f2fb5bef678 100644
--- a/gcc/config/rs6000/rs6000-c.cc
+++ b/gcc/config/rs6000/rs6000-c.cc
@@ -447,6 +447,8 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT flags)
     rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR9");
   if ((flags & OPTION_MASK_POWER10) != 0)
     rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR10");
+  if ((flags & OPTION_MASK_FUTURE) != 0)
+    rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR_FUTURE");
   if ((flags & OPTION_MASK_SOFT_FLOAT) != 0)
     rs6000_define_or_undefine_macro (define_p, "_SOFT_FLOAT");
   if ((flags & OPTION_MASK_RECIP_PRECISION) != 0)
diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
index d28cc87eb2a..8754635f3d9 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -88,6 +88,10 @@
 				 | OPTION_MASK_POWER10			\
 				 | OTHER_POWER10_MASKS)
 
+/* Flags for a potential future processor that may or may not be delivered.  */
+#define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
+				 | OPTION_MASK_FUTURE)
+
 /* Flags that need to be turned off if -mno-power9-vector.  */
 #define OTHER_P9_VECTOR_MASKS	(OPTION_MASK_FLOAT128_HW		\
 				 | OPTION_MASK_P9_MINMAX)
@@ -135,6 +139,7 @@
 				 | OPTION_MASK_LOAD_VECTOR_PAIR		\
 				 | OPTION_MASK_POWER10			\
 				 | OPTION_MASK_P10_FUSION		\
+				 | OPTION_MASK_FUTURE			\
 				 | OPTION_MASK_HTM			\
 				 | OPTION_MASK_ISEL			\
 				 | OPTION_MASK_MFCRF			\
@@ -267,3 +272,4 @@ RS6000_CPU ("powerpc64", PROCESSOR_POWERPC64, OPTION_MASK_PPC_GFXOPT
 RS6000_CPU ("powerpc64le", PROCESSOR_POWER8, MASK_POWERPC64
 	    | ISA_2_7_MASKS_SERVER | OPTION_MASK_HTM)
 RS6000_CPU ("rs64", PROCESSOR_RS64A, OPTION_MASK_PPC_GFXOPT | MASK_POWERPC64)
+RS6000_CPU ("future", PROCESSOR_FUTURE, MASK_POWERPC64 | ISA_FUTURE_MASKS)
diff --git a/gcc/config/rs6000/rs6000-opts.h b/gcc/config/rs6000/rs6000-opts.h
index 33fd0efc936..25890ae3034 100644
--- a/gcc/config/rs6000/rs6000-opts.h
+++ b/gcc/config/rs6000/rs6000-opts.h
@@ -67,7 +67,9 @@ enum processor_type
    PROCESSOR_MPCCORE,
    PROCESSOR_CELL,
    PROCESSOR_PPCA2,
-   PROCESSOR_TITAN
+   PROCESSOR_TITAN,
+
+   PROCESSOR_FUTURE
 };
 
 
diff --git a/gcc/config/rs6000/rs6000-tables.opt b/gcc/config/rs6000/rs6000-tables.opt
index 65f46709716..97fa98a2e65 100644
--- a/gcc/config/rs6000/rs6000-tables.opt
+++ b/gcc/config/rs6000/rs6000-tables.opt
@@ -197,3 +197,6 @@ Enum(rs6000_cpu_opt_value) String(powerpc64le) Value(55)
 EnumValue
 Enum(rs6000_cpu_opt_value) String(rs64) Value(56)
 
+EnumValue
+Enum(rs6000_cpu_opt_value) String(future) Value(57)
+
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 5a7e00b03d1..bc509399cf6 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1809,6 +1809,18 @@ rs6000_cpu_name_lookup (const char *name)
   return -1;
 }
 
+/* Look up the index for a specific processor.  */
+
+static int
+rs600_cpu_index_lookup (enum processor_type processor)
+{
+  for (size_t i = 0; i < ARRAY_SIZE (processor_target_table); i++)
+    if (processor_target_table[i].processor == processor)
+      return i;
+
+  return -1;
+}
+
 \f
 /* Return number of consecutive hard regs needed starting at reg REGNO
    to hold something of mode MODE.
@@ -3756,23 +3768,45 @@ rs6000_option_override_internal (bool global_init_p)
     rs6000_isa_flags &= ~OPTION_MASK_POWERPC64;
 #endif
 
+  /* At the moment, we don't have explicit -mtune=future support.  If the user
+     explicitly tried to use -mtune=future, give a warning.  If not, use the
+     power10 tuning until future tuning is added.  */
   if (rs6000_tune_index >= 0)
-    tune_index = rs6000_tune_index;
+    {
+      enum processor_type cur_proc
+	= processor_target_table[rs6000_tune_index].processor;
+
+      if (cur_proc == PROCESSOR_FUTURE)
+	{
+	  static bool issued_future_tune_warning = false;
+	  if (!issued_future_tune_warning)
+	    {
+	      issued_future_tune_warning = true;
+	      warning (0, "%qs is not currently supported", "-mtune=future");
+	    }
+
+	  rs6000_tune_index = rs600_cpu_index_lookup (PROCESSOR_POWER10);
+	}
+      tune_index = rs6000_tune_index;
+    }
   else if (cpu_index >= 0)
-    rs6000_tune_index = tune_index = cpu_index;
+    {
+      enum processor_type cur_cpu
+	= processor_target_table[cpu_index].processor;
+
+      rs6000_tune_index = tune_index
+	= (cur_cpu == PROCESSOR_FUTURE
+	   ? rs600_cpu_index_lookup (PROCESSOR_POWER10)
+	   : cpu_index);
+    }
   else
     {
-      size_t i;
       enum processor_type tune_proc
 	= (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
 
-      tune_index = -1;
-      for (i = 0; i < ARRAY_SIZE (processor_target_table); i++)
-	if (processor_target_table[i].processor == tune_proc)
-	  {
-	    tune_index = i;
-	    break;
-	  }
+      tune_index = rs600_cpu_index_lookup (tune_proc == PROCESSOR_FUTURE
+					   ? PROCESSOR_POWER10
+					   : tune_proc);
     }
 
   if (cpu_index >= 0)
@@ -4785,6 +4819,7 @@ rs6000_option_override_internal (bool global_init_p)
 	break;
 
       case PROCESSOR_POWER10:
+      case PROCESSOR_FUTURE:
 	rs6000_cost = &power10_cost;
 	break;
 
@@ -5944,6 +5979,8 @@ rs6000_machine_from_flags (void)
   /* Disable the flags that should never influence the .machine selection.  */
   flags &= ~(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_PPC_GPOPT | OPTION_MASK_ISEL);
 
+  if ((flags & (ISA_FUTURE_MASKS & ~ISA_3_1_MASKS_SERVER)) != 0)
+    return "future";
   if ((flags & (ISA_3_1_MASKS_SERVER & ~ISA_3_0_MASKS_SERVER)) != 0)
     return "power10";
   if ((flags & (ISA_3_0_MASKS_SERVER & ~ISA_2_7_MASKS_SERVER)) != 0)
@@ -24500,6 +24537,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
   { "float128-hardware",	OPTION_MASK_FLOAT128_HW,	false, true  },
   { "fprnd",			OPTION_MASK_FPRND,		false, true  },
   { "power10",			OPTION_MASK_POWER10,		false, true  },
+  { "future",			OPTION_MASK_FUTURE,		false, true  },
   { "hard-dfp",			OPTION_MASK_DFP,		false, true  },
   { "htm",			OPTION_MASK_HTM,		false, true  },
   { "isel",			OPTION_MASK_ISEL,		false, true  },
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 2291fe8d3a3..43209f9a6e7 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -163,6 +163,7 @@
   mcpu=e5500: -me5500; \
   mcpu=e6500: -me6500; \
   mcpu=titan: -mtitan; \
+  mcpu=future: -mfuture; \
   !mcpu*: %{mpower9-vector: -mpower9; \
 	    mpower8-vector|mcrypto|mdirect-move|mhtm: -mpower8; \
 	    mvsx: -mpower7; \
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 969d34b69e6..a125fd8fc99 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -351,7 +351,7 @@ (define_attr "cpu"
    ppc403,ppc405,ppc440,ppc476,
    ppc8540,ppc8548,ppce300c2,ppce300c3,ppce500mc,ppce500mc64,ppce5500,ppce6500,
    power4,power5,power6,power7,power8,power9,power10,
-   rs64a,mpccore,cell,ppca2,titan"
+   rs64a,mpccore,cell,ppca2,titan,future"
   (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
 
 ;; The ISA we implement.
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index 60b923f5e4b..775ba830eac 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -628,6 +628,10 @@ mieee128-constant
 Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
 Generate (do not generate) code that uses the LXVKQ instruction.
 
+mfuture
+Target Undocumented Mask(FUTURE) Var(rs6000_isa_flags)
+Generate (do not generate) future instructions.
+
 ; Documented parameters
 
 -param=rs6000-vect-unroll-limit=
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d71583853f0..0e817ee923a 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -30423,7 +30423,7 @@ Supported values for @var{cpu_type} are @samp{401}, @samp{403},
 @samp{titan}, @samp{power3}, @samp{power4}, @samp{power5}, @samp{power5+},
 @samp{power6}, @samp{power6x}, @samp{power7}, @samp{power8},
 @samp{power9}, @samp{power10}, @samp{powerpc}, @samp{powerpc64},
-@samp{powerpc64le}, @samp{rs64}, and @samp{native}.
+@samp{powerpc64le}, @samp{rs64}, @samp{future}, and @samp{native}.
 
 @option{-mcpu=powerpc}, @option{-mcpu=powerpc64}, and
 @option{-mcpu=powerpc64le} specify pure 32-bit PowerPC (either
-- 
2.43.0


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair.
  2024-01-05 23:27 Repost [PATCH 0/6] PowerPC Future patches Michael Meissner
  2024-01-05 23:35 ` Repost [PATCH 1/6] Add -mcpu=future Michael Meissner
@ 2024-01-05 23:37 ` Michael Meissner
  2024-01-19 18:44   ` Ping " Michael Meissner
  2024-01-23  8:54   ` Repost " Kewen.Lin
  2024-01-05 23:38 ` Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers Michael Meissner
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-05 23:37 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

This patch re-enables generating load and store vector pair instructions when
doing certain memory copy operations when -mcpu=future is used.

During power10 development, it was determined that using store vector pair
instructions was problematic in a few cases, so we disabled generating load
and store vector pair instructions for memory copy operations by default.  This
patch re-enables generating these instructions if -mcpu=future is used.
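
As an illustration (not a test case from this patch), a fixed-size copy like
the one below is the sort of memory copy operation that GCC may expand inline.
Whether load and store vector pair instructions are actually used depends on
the GCC version, the optimization level, and the -mcpu and
-mblock-ops-vector-pair settings, so the generated assembly should be
inspected rather than assumed.  The struct and function names are
hypothetical.

```c
#include <string.h>

/* A 64-byte block: large enough that the compiler may expand the copy
   inline instead of calling memcpy.  */
struct block { unsigned char bytes[64]; };

/* Copy one block.  The C semantics are the same whichever instructions
   the compiler picks for the copy.  */
void
copy_block (struct block *dst, const struct block *src)
{
  memcpy (dst, src, sizeof *dst);
}
```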

The patches have been tested on both little and big endian systems.  Can I check
this patch into the master branch?

2024-01-05   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add
	-mblock-ops-vector-pair.
	(POWERPC_MASKS): Likewise.
---
 gcc/config/rs6000/rs6000-cpus.def | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
index 8754635f3d9..b6cd6d8cc84 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -90,6 +90,7 @@
 
 /* Flags for a potential future processor that may or may not be delivered.  */
 #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
+				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
 				 | OPTION_MASK_FUTURE)
 
 /* Flags that need to be turned off if -mno-power9-vector.  */
@@ -127,6 +128,7 @@
 
 /* Mask of all options to set the default isa flags based on -mcpu=<xxx>.  */
 #define POWERPC_MASKS		(OPTION_MASK_ALTIVEC			\
+				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
 				 | OPTION_MASK_CMPB			\
 				 | OPTION_MASK_CRYPTO			\
 				 | OPTION_MASK_DFP			\
-- 
2.43.0


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.
  2024-01-05 23:27 Repost [PATCH 0/6] PowerPC Future patches Michael Meissner
  2024-01-05 23:35 ` Repost [PATCH 1/6] Add -mcpu=future Michael Meissner
  2024-01-05 23:37 ` Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair Michael Meissner
@ 2024-01-05 23:38 ` Michael Meissner
  2024-01-19 18:46   ` Ping " Michael Meissner
  2024-01-25  9:28   ` Repost " Kewen.Lin
  2024-01-05 23:39 ` Repost [PATCH 4/6] PowerPC: Make MMA insns support " Michael Meissner
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-05 23:38 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

The MMA subsystem added the notion of accumulator registers as an optional
feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
the traditional floating point registers 0..31, but logically the accumulator
registers were separate from the FPR registers.  In ISA 3.1, it was anticipated
that in future systems, the accumulator registers may not overlap with the FPR
registers.  This patch adds the support for dense math registers as separate
registers.

This particular patch does not change the MMA support to use the accumulators
within the dense math registers.  This patch just adds the basic support for
having separate DMRs.  The next patch will switch the MMA support to use the
accumulators if -mcpu=future is used.

For testing purposes, I added an undocumented option '-mdense-math' to enable
or disable the dense math support.

This patch adds a new constraint (wD).  If MMA is selected but dense math is
not selected (i.e. -mcpu=power10), the wD constraint will allow access to
accumulators that overlap with the VSX vector registers 0..31.  If both MMA and
dense math are selected (i.e. -mcpu=future), the wD constraint will only allow
dense math registers.

This patch modifies the existing %A output modifier.  If MMA is selected but
dense math is not selected, then %A output modifier converts the VSX register
number to the accumulator number, by dividing it by 4.  If both MMA and dense
math are selected, then %A will map the separate DMR registers into 0..7.
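
The mapping that %A performs can be sketched as follows.  This is a simplified
model, not the print_operand code: it assumes register numbers relative to the
start of each register file (0..31 for the VSX registers that back power10
accumulators, 0..7 for DMRs), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Return the accumulator number that would be printed for REGNO.  */
int
accumulator_number (int regno, bool dense_math)
{
  if (dense_math)
    {
      assert (regno >= 0 && regno < 8);
      return regno;		/* DMRs already map onto 0..7.  */
    }

  /* Without dense math, an accumulator starts at every fourth register.  */
  assert (regno >= 0 && regno < 32 && regno % 4 == 0);
  return regno / 4;
}
```

In both modes the result is in 0..7, which is why asm templates written with
%A do not need to change between the two systems.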

The intention is that user code using extended asm can be modified to run on
both MMA without dense math and MMA with dense math:

    1)	If possible, don't use extended asm, but instead use the MMA built-in
	functions;

    2)	If you do need to write extended asm, change the d constraints
	targeting accumulators to wD;

    3)	Only use the built-in zero, assemble and disassemble functions to
	move data between vector quad types and dense math accumulators.
	I.e. do not use the xxmfacc, xxmtacc, and xxsetaccz instructions
	directly in the extended asm code.  The reason is these instructions
	assume there is a 1-to-1 correspondence between 4 adjacent FPR
	registers and an accumulator that overlaps with those registers.  With
	accumulators now being separate registers, there no longer is a 1-to-1
	correspondence.
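
A minimal sketch of guideline 1 and guideline 3, using only the existing MMA
built-in functions to zero an accumulator and copy it out, with no extended
asm.  The function name is hypothetical, and the non-MMA branch is only a
portable stand-in so that the example builds on compilers where __MMA__ is not
defined.

```c
#include <string.h>

_Static_assert (sizeof (float) == 4, "a 4x4 float tile must be 64 bytes");

/* Set TILE to all zeros by way of an accumulator.  */
void
zero_tile (float tile[4][4])
{
#if defined (__MMA__)
  __vector_quad acc;

  /* Zero the accumulator, then copy its 64 bytes out to memory.  The
     compiler decides which instructions implement each step.  */
  __builtin_mma_xxsetaccz (&acc);
  __builtin_mma_disassemble_acc (tile, &acc);
#else
  /* Stand-in for targets without MMA support.  */
  memset (tile, 0, 16 * sizeof (float));
#endif
}
```

Since the source never names xxsetaccz or xxmfacc itself, the same code should
not care whether the accumulator lives in FPRs or in a DMR.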

It is possible that the mangling for DMRs and the GDB register numbers may
change in the future.

2024-01-05   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/constraints.md (wD constraint): New constraint.
	* config/rs6000/mma.md (UNSPEC_DM_ASSEMBLE_ACC): New unspec.
	(movxo): Convert into define_expand.
	(movxo_vsx): Version of movxo where accumulators overlap with VSX vector
	registers 0..31.
	(movxo_dm): Version of movxo that supports separate dense math
	accumulators.
	(mma_assemble_acc): Add dense math support to define_expand.
	(mma_assemble_acc_vsx): Rename from mma_assemble_acc, and restrict it to
	non dense math systems.
	(mma_assemble_acc_dm): Dense math version of mma_assemble_acc.
	(mma_disassemble_acc): Add dense math support to define_expand.
	(mma_disassemble_acc_vsx): Rename from mma_disassemble_acc, and restrict
	it to non dense math systems.
	(mma_disassemble_acc_dm): Dense math version of mma_disassemble_acc.
	* config/rs6000/predicates.md (dmr_operand): New predicate.
	(accumulator_operand): Likewise.
	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add -mdense-math.
	(POWERPC_MASKS): Likewise.
	* config/rs6000/rs6000.cc (enum rs6000_reg_type): Add DMR_REG_TYPE.
	(enum rs6000_reload_reg_type): Add RELOAD_REG_DMR.
	(LAST_RELOAD_REG_CLASS): Add support for DMR registers and the wD
	constraint.
	(reload_reg_map): Likewise.
	(rs6000_reg_names): Likewise.
	(alt_reg_names): Likewise.
	(rs6000_hard_regno_nregs_internal): Likewise.
	(rs6000_hard_regno_mode_ok_uncached): Likewise.
	(rs6000_debug_reg_global): Likewise.
	(rs6000_setup_reg_addr_masks): Likewise.
	(rs6000_init_hard_regno_mode_ok): Likewise.
	(rs6000_option_override_internal): Add checking for -mdense-math.
	(rs6000_secondary_reload_memory): Add support for DMR registers.
	(rs6000_secondary_reload_simple_move): Likewise.
	(rs6000_preferred_reload_class): Likewise.
	(rs6000_secondary_reload_class): Likewise.
	(print_operand): Make %A handle both FPRs and DMRs.
	(rs6000_dmr_register_move_cost): New helper function.
	(rs6000_register_move_cost): Add support for DMR registers.
	(rs6000_memory_move_cost): Likewise.
	(rs6000_compute_pressure_classes): Likewise.
	(rs6000_debugger_regno): Likewise.
	(rs6000_opt_masks): Add -mdense-math.
	(rs6000_split_multireg_move): Add support for DMRs.
	* config/rs6000/rs6000.h (UNITS_PER_DMR_WORD): New macro.
	(FIRST_PSEUDO_REGISTER): Update for DMRs.
	(FIXED_REGISTERS): Add DMRs.
	(CALL_REALLY_USED_REGISTERS): Likewise.
	(REG_ALLOC_ORDER): Likewise.
	(enum reg_class): Add DM_REGS.
	(REG_CLASS_NAMES): Likewise.
	(REG_CLASS_CONTENTS): Likewise.
	* config/rs6000/rs6000.md (FIRST_DMR_REGNO): New constant.
	(LAST_DMR_REGNO): Likewise.
	(isa attribute): Add 'dm' and 'not_dm' attributes.
	(enabled attribute): Support 'dm' and 'not_dm' attributes.
	* config/rs6000/rs6000.opt (-mdense-math): New switch.
	* doc/md.texi (PowerPC constraints): Document wD constraint.
---
 gcc/config/rs6000/constraints.md  |   3 +
 gcc/config/rs6000/mma.md          | 115 ++++++++++++------
 gcc/config/rs6000/predicates.md   |  32 +++++
 gcc/config/rs6000/rs6000-cpus.def |   2 +
 gcc/config/rs6000/rs6000.cc       | 189 ++++++++++++++++++++++++++----
 gcc/config/rs6000/rs6000.h        |  38 +++++-
 gcc/config/rs6000/rs6000.md       |  12 +-
 gcc/config/rs6000/rs6000.opt      |   4 +
 gcc/doc/md.texi                   |   7 ++
 9 files changed, 343 insertions(+), 59 deletions(-)

diff --git a/gcc/config/rs6000/constraints.md b/gcc/config/rs6000/constraints.md
index c99997bf82b..614e431c085 100644
--- a/gcc/config/rs6000/constraints.md
+++ b/gcc/config/rs6000/constraints.md
@@ -107,6 +107,9 @@ (define_constraint "wB"
        (match_test "TARGET_P8_VECTOR")
        (match_operand 0 "s5bit_cint_operand")))
 
+(define_register_constraint "wD" "rs6000_constraints[RS6000_CONSTRAINT_wD]"
+  "Accumulator register.")
+
 (define_constraint "wE"
   "@internal Vector constant that can be loaded with the XXSPLTIB instruction."
   (match_test "xxspltib_constant_nosplit (op, mode)"))
diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index 6a7d8a836db..bb898919ab5 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -91,6 +91,7 @@ (define_c_enum "unspec"
    UNSPEC_MMA_XVI8GER4SPP
    UNSPEC_MMA_XXMFACC
    UNSPEC_MMA_XXMTACC
+   UNSPEC_DM_ASSEMBLE_ACC
   ])
 
 (define_c_enum "unspecv"
@@ -321,7 +322,9 @@ (define_insn_and_split "*movoo"
    (set_attr "length" "*,8,*,8,8")
    (set_attr "isa" "lxvp,*,stxvp,*,*")])
 \f
-;; Vector quad support.  XOmode can only live in FPRs.
+;; Vector quad support.  Under the original MMA, XOmode can only live in VSX
+;; vector registers 0..31.  With dense math, XOmode can live in either VSX
+;; registers (0..63) or DMR registers.
 (define_expand "movxo"
   [(set (match_operand:XO 0 "nonimmediate_operand")
 	(match_operand:XO 1 "input_operand"))]
@@ -346,10 +349,10 @@ (define_expand "movxo"
     gcc_assert (false);
 })
 
-(define_insn_and_split "*movxo"
+(define_insn_and_split "*movxo_nodm"
   [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
 	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
-  "TARGET_MMA
+  "TARGET_MMA && !TARGET_DENSE_MATH
    && (gpc_reg_operand (operands[0], XOmode)
        || gpc_reg_operand (operands[1], XOmode))"
   "@
@@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
    (set_attr "length" "*,*,16")
    (set_attr "max_prefixed_insns" "2,2,*")])
 
+(define_insn_and_split "*movxo_dm"
+  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
+	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]
+  "TARGET_DENSE_MATH
+   && (gpc_reg_operand (operands[0], XOmode)
+       || gpc_reg_operand (operands[1], XOmode))"
+  "@
+   #
+   #
+   #
+   dmxxinstdmr512 %0,%1,%Y1,0
+   dmmr %0,%1
+   dmxxextfdmr512 %0,%Y0,%1,0"
+  "&& reload_completed
+   && !dmr_operand (operands[0], XOmode)
+   && !dmr_operand (operands[1], XOmode)"
+  [(const_int 0)]
+{
+  rs6000_split_multireg_move (operands[0], operands[1]);
+  DONE;
+}
+  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
+   (set_attr "length" "*,*,16,*,*,*")
+   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
+
 (define_expand "vsx_assemble_pair"
   [(match_operand:OO 0 "vsx_register_operand")
    (match_operand:V16QI 1 "mma_assemble_input_operand")
@@ -433,25 +461,38 @@ (define_insn_and_split "*vsx_disassemble_pair"
 })
 
 (define_expand "mma_assemble_acc"
-  [(match_operand:XO 0 "fpr_reg_operand")
+  [(match_operand:XO 0 "register_operand")
    (match_operand:V16QI 1 "mma_assemble_input_operand")
    (match_operand:V16QI 2 "mma_assemble_input_operand")
    (match_operand:V16QI 3 "mma_assemble_input_operand")
    (match_operand:V16QI 4 "mma_assemble_input_operand")]
   "TARGET_MMA"
 {
-  rtx src = gen_rtx_UNSPEC_VOLATILE (XOmode,
-			    	     gen_rtvec (4, operands[1], operands[2],
-				       		operands[3], operands[4]),
-			    	     UNSPECV_MMA_ASSEMBLE);
-  emit_move_insn (operands[0], src);
+  rtx op0 = operands[0];
+  rtx op1 = operands[1];
+  rtx op2 = operands[2];
+  rtx op3 = operands[3];
+  rtx op4 = operands[4];
+
+  if (TARGET_DENSE_MATH)
+    {
+      rtx vpair1 = gen_reg_rtx (OOmode);
+      rtx vpair2 = gen_reg_rtx (OOmode);
+      emit_insn (gen_vsx_assemble_pair (vpair1, op1, op2));
+      emit_insn (gen_vsx_assemble_pair (vpair2, op3, op4));
+      emit_insn (gen_mma_assemble_acc_dm (op0, vpair1, vpair2));
+    }
+
+  else
+    emit_insn (gen_mma_assemble_acc_vsx (op0, op1, op2, op3, op4));
+
   DONE;
 })
 
 ;; We cannot update the four output registers atomically, so mark the output
-;; as an early clobber so we don't accidentally clobber the input operands.  */
+;; as an early clobber so we don't accidentally clobber the input operands.
 
-(define_insn_and_split "*mma_assemble_acc"
+(define_insn_and_split "mma_assemble_acc_vsx"
   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
 	(unspec_volatile:XO
 	  [(match_operand:V16QI 1 "mma_assemble_input_operand" "mwa")
@@ -459,7 +500,7 @@ (define_insn_and_split "*mma_assemble_acc"
 	   (match_operand:V16QI 3 "mma_assemble_input_operand" "mwa")
 	   (match_operand:V16QI 4 "mma_assemble_input_operand" "mwa")]
 	  UNSPECV_MMA_ASSEMBLE))]
-  "TARGET_MMA
+  "TARGET_MMA && !TARGET_DENSE_MATH
    && fpr_reg_operand (operands[0], XOmode)"
   "#"
   "&& reload_completed"
@@ -473,28 +514,31 @@ (define_insn_and_split "*mma_assemble_acc"
   DONE;
 })
 
+;; On a system with dense math, we build the accumulators from two vector
+;; pairs.
+
+(define_insn "mma_assemble_acc_dm"
+ [(set (match_operand:XO 0 "dmr_operand" "=wD")
+       (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
+		   (match_operand:OO 2 "vsx_register_operand" "wa")]
+		  UNSPEC_DM_ASSEMBLE_ACC))]
+ "TARGET_MMA && TARGET_DENSE_MATH"
+ "dmxxinstdmr512 %0,%1,%2,0"
+ [(set_attr "type" "mma")])
+
 (define_expand "mma_disassemble_acc"
-  [(match_operand:V16QI 0 "mma_disassemble_output_operand")
-   (match_operand:XO 1 "fpr_reg_operand")
-   (match_operand 2 "const_0_to_3_operand")]
-  "TARGET_MMA"
-{
-  rtx src;
-  int regoff = INTVAL (operands[2]);
-  src = gen_rtx_UNSPEC (V16QImode,
-			gen_rtvec (2, operands[1], GEN_INT (regoff)),
-			UNSPEC_MMA_EXTRACT);
-  emit_move_insn (operands[0], src);
-  DONE;
-})
+  [(set (match_operand:V16QI 0 "register_operand")
+	(unspec:V16QI [(match_operand:XO 1 "register_operand")
+		       (match_operand 2 "const_0_to_3_operand")]
+		      UNSPEC_MMA_EXTRACT))]
+  "TARGET_MMA")
 
-(define_insn_and_split "*mma_disassemble_acc"
+(define_insn_and_split "*mma_disassemble_acc_vsx"
   [(set (match_operand:V16QI 0 "mma_disassemble_output_operand" "=mwa")
-       (unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
-		      (match_operand 2 "const_0_to_3_operand")]
+	(unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
+		       (match_operand 2 "const_0_to_3_operand")]
 		      UNSPEC_MMA_EXTRACT))]
-  "TARGET_MMA
-   && fpr_reg_operand (operands[1], XOmode)"
+  "TARGET_MMA"
   "#"
   "&& reload_completed"
   [(const_int 0)]
@@ -506,9 +550,14 @@ (define_insn_and_split "*mma_disassemble_acc"
   DONE;
 })
 
-;; MMA instructions that do not use their accumulators as an input, still
-;; must not allow their vector operands to overlap the registers used by
-;; the accumulator.  We enforce this by marking the output as early clobber.
+(define_insn "*mma_disassemble_acc_dm"
+  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
+	(unspec:V16QI [(match_operand:XO 1 "dmr_operand" "wD")
+		       (match_operand 2 "const_0_to_3_operand")]
+		      UNSPEC_MMA_EXTRACT))]
+  "TARGET_DENSE_MATH"
+  "dmxxextfdmr256 %0,%1,2"
+  [(set_attr "type" "mma")])
 
 (define_insn "mma_<acc>"
   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
index d23ce9a77a3..3040dcd50a3 100644
--- a/gcc/config/rs6000/predicates.md
+++ b/gcc/config/rs6000/predicates.md
@@ -186,6 +186,38 @@ (define_predicate "vlogical_operand"
   return VLOGICAL_REGNO_P (REGNO (op));
 })
 
+;; Return 1 if op is a DMR register
+(define_predicate "dmr_operand"
+  (match_operand 0 "register_operand")
+{
+  if (!REG_P (op))
+    return 0;
+
+  if (!HARD_REGISTER_P (op))
+    return 1;
+
+  return DMR_REGNO_P (REGNO (op));
+})
+
+;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
+;; overlap with the FPRs, while on systems with dense math, the accumulators
+;; are separate dense math registers and do not overlap with the FPR
+;; registers.
+(define_predicate "accumulator_operand"
+  (match_operand 0 "register_operand")
+{
+  if (!REG_P (op))
+    return 0;
+
+  if (!HARD_REGISTER_P (op))
+    return 1;
+
+  int r = REGNO (op);
+  return (TARGET_DENSE_MATH
+	  ? DMR_REGNO_P (r)
+	  : FP_REGNO_P (r) && (r & 3) == 0);
+})
+
 ;; Return 1 if op is the carry register.
 (define_predicate "ca_operand"
   (match_operand 0 "register_operand")
diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
index b6cd6d8cc84..4621b97b522 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -91,6 +91,7 @@
 /* Flags for a potential future processor that may or may not be delivered.  */
 #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
 				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
+				 | OPTION_MASK_DENSE_MATH		\
 				 | OPTION_MASK_FUTURE)
 
 /* Flags that need to be turned off if -mno-power9-vector.  */
@@ -134,6 +135,7 @@
 				 | OPTION_MASK_DFP			\
 				 | OPTION_MASK_DIRECT_MOVE		\
 				 | OPTION_MASK_DLMZB			\
+				 | OPTION_MASK_DENSE_MATH		\
 				 | OPTION_MASK_EFFICIENT_UNALIGNED_VSX	\
 				 | OPTION_MASK_FLOAT128_HW		\
 				 | OPTION_MASK_FLOAT128_KEYWORD		\
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index bc509399cf6..83e32f7a43a 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -290,7 +290,8 @@ enum rs6000_reg_type {
   ALTIVEC_REG_TYPE,
   FPR_REG_TYPE,
   SPR_REG_TYPE,
-  CR_REG_TYPE
+  CR_REG_TYPE,
+  DMR_REG_TYPE
 };
 
 /* Map register class to register type.  */
@@ -304,22 +305,23 @@ static enum rs6000_reg_type reg_class_to_reg_type[N_REG_CLASSES];
 
 
 /* Register classes we care about in secondary reload or go if legitimate
-   address.  We only need to worry about GPR, FPR, and Altivec registers here,
-   along an ANY field that is the OR of the 3 register classes.  */
+   address.  We only need to worry about GPR, FPR, Altivec, and DMR registers
+   here, along with an ANY field that is the OR of the 4 register classes.  */
 
 enum rs6000_reload_reg_type {
   RELOAD_REG_GPR,			/* General purpose registers.  */
   RELOAD_REG_FPR,			/* Traditional floating point regs.  */
   RELOAD_REG_VMX,			/* Altivec (VMX) registers.  */
-  RELOAD_REG_ANY,			/* OR of GPR, FPR, Altivec masks.  */
+  RELOAD_REG_DMR,			/* DMR registers.  */
+  RELOAD_REG_ANY,			/* OR of GPR/FPR/VMX/DMR masks.  */
   N_RELOAD_REG
 };
 
-/* For setting up register classes, loop through the 3 register classes mapping
+/* For setting up register classes, loop through the 4 register classes mapping
    into real registers, and skip the ANY class, which is just an OR of the
    bits.  */
 #define FIRST_RELOAD_REG_CLASS	RELOAD_REG_GPR
-#define LAST_RELOAD_REG_CLASS	RELOAD_REG_VMX
+#define LAST_RELOAD_REG_CLASS	RELOAD_REG_DMR
 
 /* Map reload register type to a register in the register class.  */
 struct reload_reg_map_type {
@@ -331,6 +333,7 @@ static const struct reload_reg_map_type reload_reg_map[N_RELOAD_REG] = {
   { "Gpr",	FIRST_GPR_REGNO },	/* RELOAD_REG_GPR.  */
   { "Fpr",	FIRST_FPR_REGNO },	/* RELOAD_REG_FPR.  */
   { "VMX",	FIRST_ALTIVEC_REGNO },	/* RELOAD_REG_VMX.  */
+  { "DMR",	FIRST_DMR_REGNO },	/* RELOAD_REG_DMR.  */
   { "Any",	-1 },			/* RELOAD_REG_ANY.  */
 };
 
@@ -1224,6 +1227,8 @@ char rs6000_reg_names[][8] =
       "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",
   /* vrsave vscr sfp */
       "vrsave", "vscr", "sfp",
+  /* DMRs */
+      "0", "1", "2", "3", "4", "5", "6", "7",
 };
 
 #ifdef TARGET_REGNAMES
@@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
   "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
   /* vrsave vscr sfp */
   "vrsave", "vscr", "sfp",
+  /* DMRs */
+  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
 };
 #endif
 
@@ -1846,6 +1853,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
   else if (ALTIVEC_REGNO_P (regno))
     reg_size = UNITS_PER_ALTIVEC_WORD;
 
+  else if (DMR_REGNO_P (regno))
+    reg_size = UNITS_PER_DMR_WORD;
+
   else
     reg_size = UNITS_PER_WORD;
 
@@ -1867,9 +1877,36 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
   if (mode == OOmode)
     return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
 
-  /* MMA accumulator modes need FPR registers divisible by 4.  */
+  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
+     by 4.
+
+     If dense math is enabled, allow all VSX registers plus the DMR registers.
+     We need to make sure we don't cross the boundary between FPRs and
+     traditional Altivec registers.  */
   if (mode == XOmode)
-    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
+    {
+      if (TARGET_MMA && !TARGET_DENSE_MATH)
+	return (FP_REGNO_P (regno) && (regno & 3) == 0);
+
+      else if (TARGET_DENSE_MATH)
+	{
+	  if (DMR_REGNO_P (regno))
+	    return 1;
+
+	  if (FP_REGNO_P (regno))
+	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
+
+	  if (ALTIVEC_REGNO_P (regno))
+	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
+	}
+
+      else
+	return 0;
+    }
+
+  /* No modes other than XOmode can go in DMRs.  */
+  if (DMR_REGNO_P (regno))
+    return 0;
 
   /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
      register combinations, and use PTImode where we need to deal with quad
@@ -2312,6 +2349,7 @@ rs6000_debug_reg_global (void)
   rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
 			  LAST_ALTIVEC_REGNO,
 			  "vs");
+  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");
   rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
   rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
   rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
@@ -2332,6 +2370,7 @@ rs6000_debug_reg_global (void)
 	   "wr reg_class = %s\n"
 	   "wx reg_class = %s\n"
 	   "wA reg_class = %s\n"
+	   "wD reg_class = %s\n"
 	   "\n",
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
@@ -2339,7 +2378,8 @@ rs6000_debug_reg_global (void)
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
-	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
+	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
+	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
 
   nl = "\n";
   for (m = 0; m < NUM_MACHINE_MODES; ++m)
@@ -2636,6 +2676,21 @@ rs6000_setup_reg_addr_masks (void)
 	  addr_mask = 0;
 	  reg = reload_reg_map[rc].reg;
 
+	  /* Special case DMR registers.  */
+	  if (rc == RELOAD_REG_DMR)
+	    {
+	      if (TARGET_DENSE_MATH && m2 == XOmode)
+		{
+		  addr_mask = RELOAD_REG_VALID;
+		  reg_addr[m].addr_mask[rc] = addr_mask;
+		  any_addr_mask |= addr_mask;
+		}
+	      else
+		reg_addr[m].addr_mask[rc] = 0;
+
+	      continue;
+	    }
+
 	  /* Can mode values go in the GPR/FPR/Altivec registers?  */
 	  if (reg >= 0 && rs6000_hard_regno_mode_ok_p[m][reg])
 	    {
@@ -2790,6 +2845,9 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
   for (r = CR1_REGNO; r <= CR7_REGNO; ++r)
     rs6000_regno_regclass[r] = CR_REGS;
 
+  for (r = FIRST_DMR_REGNO; r <= LAST_DMR_REGNO; ++r)
+    rs6000_regno_regclass[r] = DM_REGS;
+
   rs6000_regno_regclass[LR_REGNO] = LINK_REGS;
   rs6000_regno_regclass[CTR_REGNO] = CTR_REGS;
   rs6000_regno_regclass[CA_REGNO] = NO_REGS;
@@ -2814,6 +2872,7 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
   reg_class_to_reg_type[(int)LINK_OR_CTR_REGS] = SPR_REG_TYPE;
   reg_class_to_reg_type[(int)CR_REGS] = CR_REG_TYPE;
   reg_class_to_reg_type[(int)CR0_REGS] = CR_REG_TYPE;
+  reg_class_to_reg_type[(int)DM_REGS] = DMR_REG_TYPE;
 
   if (TARGET_VSX)
     {
@@ -3000,6 +3059,13 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
   if (TARGET_DIRECT_MOVE_128)
     rs6000_constraints[RS6000_CONSTRAINT_we] = VSX_REGS;
 
+  /* Support for the accumulator registers, either FPR registers (aka original
+     mma) or DMR registers (dense math).  */
+  if (TARGET_DENSE_MATH)
+    rs6000_constraints[RS6000_CONSTRAINT_wD] = DM_REGS;
+  else if (TARGET_MMA)
+    rs6000_constraints[RS6000_CONSTRAINT_wD] = FLOAT_REGS;
+
   /* Set up the reload helper and direct move functions.  */
   if (TARGET_VSX || TARGET_ALTIVEC)
     {
@@ -4496,6 +4562,14 @@ rs6000_option_override_internal (bool global_init_p)
   if (!TARGET_PCREL && TARGET_PCREL_OPT)
     rs6000_isa_flags &= ~OPTION_MASK_PCREL_OPT;
 
+  /* Dense math requires MMA.  */
+  if (TARGET_DENSE_MATH && !TARGET_MMA)
+    {
+      if ((rs6000_isa_flags_explicit & OPTION_MASK_DENSE_MATH) != 0)
+	error ("%qs requires %qs", "-mdense-math", "-mmma");
+      rs6000_isa_flags &= ~OPTION_MASK_DENSE_MATH;
+    }
+
   if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
     rs6000_print_isa_options (stderr, 0, "after subtarget", rs6000_isa_flags);
 
@@ -12408,6 +12482,11 @@ rs6000_secondary_reload_memory (rtx addr,
     addr_mask = (reg_addr[mode].addr_mask[RELOAD_REG_VMX]
 		 & ~RELOAD_REG_AND_M16);
 
+  /* DMR registers use VSX registers, and need to generate some extra
+     instructions.  */
+  else if (rclass == DM_REGS)
+    return 2;
+
   /* If the register allocator hasn't made up its mind yet on the register
      class to use, settle on defaults to use.  */
   else if (rclass == NO_REGS)
@@ -12736,6 +12815,13 @@ rs6000_secondary_reload_simple_move (enum rs6000_reg_type to_type,
 	       || (to_type == SPR_REG_TYPE && from_type == GPR_REG_TYPE)))
     return true;
 
+  /* We can transfer between VSX registers and DMR registers without needing
+     extra registers.  */
+  if (TARGET_DENSE_MATH && mode == XOmode
+      && ((to_type == DMR_REG_TYPE && from_type == VSX_REG_TYPE)
+	  || (to_type == VSX_REG_TYPE && from_type == DMR_REG_TYPE)))
+    return true;
+
   return false;
 }
 
@@ -13430,6 +13516,10 @@ rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
   machine_mode mode = GET_MODE (x);
   bool is_constant = CONSTANT_P (x);
 
+  /* DMR registers can't be loaded or stored.  */
+  if (rclass == DM_REGS)
+    return NO_REGS;
+
   /* If a mode can't go in FPR/ALTIVEC/VSX registers, don't return a preferred
      reload class for it.  */
   if ((rclass == ALTIVEC_REGS || rclass == VSX_REGS)
@@ -13526,7 +13616,7 @@ rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
 	return VSX_REGS;
 
       if (mode == XOmode)
-	return FLOAT_REGS;
+	return TARGET_DENSE_MATH ? VSX_REGS : FLOAT_REGS;
 
       if (GET_MODE_CLASS (mode) == MODE_INT)
 	return GENERAL_REGS;
@@ -13651,6 +13741,11 @@ rs6000_secondary_reload_class (enum reg_class rclass, machine_mode mode,
   else
     regno = -1;
 
+  /* DMR registers don't have loads or stores.  We have to go through the VSX
+     registers to load XOmode (vector quad).  */
+  if (TARGET_DENSE_MATH && rclass == DM_REGS)
+    return VSX_REGS;
+
   /* If we have VSX register moves, prefer moving scalar values between
      Altivec registers and GPR by going via an FPR (and then via memory)
      instead of reloading the secondary memory address for Altivec moves.  */
@@ -14164,8 +14259,14 @@ print_operand (FILE *file, rtx x, int code)
 	 output_operand.  */
 
     case 'A':
-      /* Write the MMA accumulator number associated with VSX register X.  */
-      if (!REG_P (x) || !FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
+      /* Write the MMA accumulator number associated with VSX register X.  On
+	 dense math systems, only allow DMR accumulators, not accumulators
+	 overlapping with the FPR registers.  */
+      if (!REG_P (x))
+	output_operand_lossage ("invalid %%A value");
+      else if (TARGET_DENSE_MATH && DMR_REGNO_P (REGNO (x)))
+	fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
+      else if (!FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
 	output_operand_lossage ("invalid %%A value");
       else
 	fprintf (file, "%d", (REGNO (x) - FIRST_FPR_REGNO) / 4);
@@ -22830,6 +22931,31 @@ rs6000_debug_address_cost (rtx x, machine_mode mode,
 }
 
 
+/* Subroutine to determine the move cost of dense math registers.  If we are
+   moving to/from VSX registers, the cost is either 1 move (for 512-bit
+   accumulators) or 2 moves (for 1,024-bit DMRs).  If we are moving to
+   anything else, such as GPR registers, make the cost very high.  */
+
+static int
+rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
+{
+  const int reg_move_base = 2;
+  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
+			  & reg_class_contents[VSX_REGS]);
+
+  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))
+    {
+      /* __vector_quad (i.e. XOmode) is transferred in 1 instruction.  */
+      if (mode == XOmode)
+	return reg_move_base;
+
+      else
+	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+    }
+
+  return 1000 * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+}
+
 /* A C expression returning the cost of moving data from a register of class
    CLASS1 to one of CLASS2.  */
 
@@ -22843,17 +22969,28 @@ rs6000_register_move_cost (machine_mode mode,
   if (TARGET_DEBUG_COST)
     dbg_cost_ctrl++;
 
+  HARD_REG_SET to_vsx, from_vsx;
+  to_vsx = reg_class_contents[to] & reg_class_contents[VSX_REGS];
+  from_vsx = reg_class_contents[from] & reg_class_contents[VSX_REGS];
+
+  /* Special case DMR registers, that can only move to/from VSX registers.  */
+  if (from == DM_REGS && to == DM_REGS)
+    ret = 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+
+  else if (from == DM_REGS)
+    ret = rs6000_dmr_register_move_cost (mode, to);
+
+  else if (to == DM_REGS)
+    ret = rs6000_dmr_register_move_cost (mode, from);
+
   /* If we have VSX, we can easily move between FPR or Altivec registers,
      otherwise we can only easily move within classes.
      Do this first so we give best-case answers for union classes
      containing both gprs and vsx regs.  */
-  HARD_REG_SET to_vsx, from_vsx;
-  to_vsx = reg_class_contents[to] & reg_class_contents[VSX_REGS];
-  from_vsx = reg_class_contents[from] & reg_class_contents[VSX_REGS];
-  if (!hard_reg_set_empty_p (to_vsx)
-      && !hard_reg_set_empty_p (from_vsx)
-      && (TARGET_VSX
-	  || hard_reg_set_intersect_p (to_vsx, from_vsx)))
+  else if (!hard_reg_set_empty_p (to_vsx)
+	   && !hard_reg_set_empty_p (from_vsx)
+	   && (TARGET_VSX
+	       || hard_reg_set_intersect_p (to_vsx, from_vsx)))
     {
       int reg = FIRST_FPR_REGNO;
       if (TARGET_VSX
@@ -22948,6 +23085,9 @@ rs6000_memory_move_cost (machine_mode mode, reg_class_t rclass,
     ret = 4 * hard_regno_nregs (32, mode);
   else if (reg_classes_intersect_p (rclass, ALTIVEC_REGS))
     ret = 4 * hard_regno_nregs (FIRST_ALTIVEC_REGNO, mode);
+  else if (reg_classes_intersect_p (rclass, DM_REGS))
+    ret = (rs6000_dmr_register_move_cost (mode, VSX_REGS)
+	   + rs6000_memory_move_cost (mode, VSX_REGS, false));
   else
     ret = 4 + rs6000_register_move_cost (mode, rclass, GENERAL_REGS);
 
@@ -24156,6 +24296,8 @@ rs6000_compute_pressure_classes (enum reg_class *pressure_classes)
       if (TARGET_HARD_FLOAT)
 	pressure_classes[n++] = FLOAT_REGS;
     }
+  if (TARGET_DENSE_MATH)
+    pressure_classes[n++] = DM_REGS;
   pressure_classes[n++] = CR_REGS;
   pressure_classes[n++] = SPECIAL_REGS;
 
@@ -24320,6 +24462,10 @@ rs6000_debugger_regno (unsigned int regno, unsigned int format)
     return 67;
   if (regno == 64)
     return 64;
+  /* XXX: This is a guess.  The GCC register number for FIRST_DMR_REGNO is 111,
+     but the frame pointer regnum uses that.  */
+  if (DMR_REGNO_P (regno))
+    return regno - FIRST_DMR_REGNO + 112;
 
   gcc_unreachable ();
 }
@@ -24531,6 +24677,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
   { "crypto",			OPTION_MASK_CRYPTO,		false, true  },
   { "direct-move",		OPTION_MASK_DIRECT_MOVE,	false, true  },
   { "dlmzb",			OPTION_MASK_DLMZB,		false, true  },
+  { "dense-math",		OPTION_MASK_DENSE_MATH,		false, true  },
   { "efficient-unaligned-vsx",	OPTION_MASK_EFFICIENT_UNALIGNED_VSX,
 								false, true  },
   { "float128",			OPTION_MASK_FLOAT128_KEYWORD,	false, true  },
@@ -27620,7 +27767,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 		      || XINT (src, 1) == UNSPECV_MMA_ASSEMBLE);
 	  gcc_assert (REG_P (dst));
 	  if (GET_MODE (src) == XOmode)
-	    gcc_assert (FP_REGNO_P (REGNO (dst)));
+	    gcc_assert ((TARGET_DENSE_MATH
+			 ? VSX_REGNO_P (REGNO (dst))
+			 : FP_REGNO_P (REGNO (dst))));
 	  if (GET_MODE (src) == OOmode)
 	    gcc_assert (VSX_REGNO_P (REGNO (dst)));
 
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 43209f9a6e7..22efac4a80c 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -660,6 +660,7 @@ extern unsigned char rs6000_recip_bits[];
 #define UNITS_PER_FP_WORD 8
 #define UNITS_PER_ALTIVEC_WORD 16
 #define UNITS_PER_VSX_WORD 16
+#define UNITS_PER_DMR_WORD 128
 
 /* Type used for ptrdiff_t, as a string used in a declaration.  */
 #define PTRDIFF_TYPE "int"
@@ -787,7 +788,7 @@ enum data_align { align_abi, align_opt, align_both };
    Another pseudo (not included in DWARF_FRAME_REGISTERS) is soft frame
    pointer, which is eventually eliminated in favor of SP or FP.  */
 
-#define FIRST_PSEUDO_REGISTER 111
+#define FIRST_PSEUDO_REGISTER 119
 
 /* Use standard DWARF numbering for DWARF debugging information.  */
 #define DEBUGGER_REGNO(REGNO) rs6000_debugger_regno ((REGNO), 0)
@@ -824,7 +825,9 @@ enum data_align { align_abi, align_opt, align_both };
    /* cr0..cr7 */				   \
    0, 0, 0, 0, 0, 0, 0, 0,			   \
    /* vrsave vscr sfp */			   \
-   1, 1, 1					   \
+   1, 1, 1,					   \
+   /* DMR registers.  */			   \
+   0, 0, 0, 0, 0, 0, 0, 0			   \
 }
 
 /* Like `CALL_USED_REGISTERS' except this macro doesn't require that
@@ -848,7 +851,9 @@ enum data_align { align_abi, align_opt, align_both };
    /* cr0..cr7 */				   \
    1, 1, 0, 0, 0, 1, 1, 1,			   \
    /* vrsave vscr sfp */			   \
-   0, 0, 0					   \
+   0, 0, 0,					   \
+   /* DMR registers.  */			   \
+   0, 0, 0, 0, 0, 0, 0, 0			   \
 }
 
 #define TOTAL_ALTIVEC_REGS	(LAST_ALTIVEC_REGNO - FIRST_ALTIVEC_REGNO + 1)
@@ -885,6 +890,7 @@ enum data_align { align_abi, align_opt, align_both };
 	v2		(not saved; incoming vector arg reg; return value)
 	v19 - v14	(not saved or used for anything)
 	v31 - v20	(saved; order given to save least number)
+	dmr0 - dmr7	(not saved)
 	vrsave, vscr	(fixed)
 	sfp		(fixed)
 */
@@ -927,6 +933,9 @@ enum data_align { align_abi, align_opt, align_both };
    66,								\
    83, 82, 81, 80, 79, 78,					\
    95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84,		\
+   /* DMR registers.  */					\
+   111, 112, 113, 114, 115, 116, 117, 118,			\
+   /* Vrsave, vscr, sfp.  */					\
    108, 109,							\
    110								\
 }
@@ -953,6 +962,9 @@ enum data_align { align_abi, align_opt, align_both };
 /* True if register is a VSX register.  */
 #define VSX_REGNO_P(N) (FP_REGNO_P (N) || ALTIVEC_REGNO_P (N))
 
+/* True if register is a DMR register.  */
+#define DMR_REGNO_P(N) ((N) >= FIRST_DMR_REGNO && (N) <= LAST_DMR_REGNO)
+
 /* Alternate name for any vector register supporting floating point, no matter
    which instruction set(s) are available.  */
 #define VFLOAT_REGNO_P(N) \
@@ -1088,6 +1100,7 @@ enum reg_class
   FLOAT_REGS,
   ALTIVEC_REGS,
   VSX_REGS,
+  DM_REGS,
   VRSAVE_REGS,
   VSCR_REGS,
   GEN_OR_FLOAT_REGS,
@@ -1117,6 +1130,7 @@ enum reg_class
   "FLOAT_REGS",								\
   "ALTIVEC_REGS",							\
   "VSX_REGS",								\
+  "DM_REGS",								\
   "VRSAVE_REGS",							\
   "VSCR_REGS",								\
   "GEN_OR_FLOAT_REGS",							\
@@ -1151,6 +1165,8 @@ enum reg_class
   { 0x00000000, 0x00000000, 0xffffffff, 0x00000000 },			\
   /* VSX_REGS.  */							\
   { 0x00000000, 0xffffffff, 0xffffffff, 0x00000000 },			\
+  /* DM_REGS.  */							\
+  { 0x00000000, 0x00000000, 0x00000000, 0x007f8000 },			\
   /* VRSAVE_REGS.  */							\
   { 0x00000000, 0x00000000, 0x00000000, 0x00001000 },			\
   /* VSCR_REGS.  */							\
@@ -1178,7 +1194,7 @@ enum reg_class
   /* CA_REGS.  */							\
   { 0x00000000, 0x00000000, 0x00000000, 0x00000004 },			\
   /* ALL_REGS.  */							\
-  { 0xffffffff, 0xffffffff, 0xffffffff, 0x00007fff }			\
+  { 0xffffffff, 0xffffffff, 0xffffffff, 0x007fffff }			\
 }
 
 /* The same information, inverted:
@@ -1202,6 +1218,7 @@ enum r6000_reg_class_enum {
   RS6000_CONSTRAINT_wr,		/* GPR register if 64-bit  */
   RS6000_CONSTRAINT_wx,		/* FPR register for STFIWX */
   RS6000_CONSTRAINT_wA,		/* BASE_REGS if 64-bit.  */
+  RS6000_CONSTRAINT_wD,		/* Accumulator regs if MMA/Dense Math.  */
   RS6000_CONSTRAINT_MAX
 };
 
@@ -2078,7 +2095,16 @@ extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
   &rs6000_reg_names[108][0],	/* vrsave  */				\
   &rs6000_reg_names[109][0],	/* vscr  */				\
 									\
-  &rs6000_reg_names[110][0]	/* sfp  */				\
+  &rs6000_reg_names[110][0],	/* sfp  */				\
+									\
+  &rs6000_reg_names[111][0],	/* dmr0  */				\
+  &rs6000_reg_names[112][0],	/* dmr1  */				\
+  &rs6000_reg_names[113][0],	/* dmr2  */				\
+  &rs6000_reg_names[114][0],	/* dmr3  */				\
+  &rs6000_reg_names[115][0],	/* dmr4  */				\
+  &rs6000_reg_names[116][0],	/* dmr5  */				\
+  &rs6000_reg_names[117][0],	/* dmr6  */				\
+  &rs6000_reg_names[118][0],	/* dmr7  */				\
 }
 
 /* Table of additional register names to use in user input.  */
@@ -2132,6 +2158,8 @@ extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
   {"vs52", 84}, {"vs53", 85}, {"vs54", 86}, {"vs55", 87},	\
   {"vs56", 88}, {"vs57", 89}, {"vs58", 90}, {"vs59", 91},	\
   {"vs60", 92}, {"vs61", 93}, {"vs62", 94}, {"vs63", 95},	\
+  {"dmr0", 111}, {"dmr1", 112}, {"dmr2", 113}, {"dmr3", 114},	\
+  {"dmr4", 115}, {"dmr5", 116}, {"dmr6", 117}, {"dmr7", 118},	\
 }
 
 /* This is how to output an element of a case-vector that is relative.  */
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index a125fd8fc99..72af3e6ef70 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -51,6 +51,8 @@ (define_constants
    (VRSAVE_REGNO		108)
    (VSCR_REGNO			109)
    (FRAME_POINTER_REGNUM	110)
+   (FIRST_DMR_REGNO		111)
+   (LAST_DMR_REGNO		118)
   ])
 
 ;;
@@ -355,7 +357,7 @@ (define_attr "cpu"
   (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
 
 ;; The ISA we implement.
-(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp"
+(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp,dm,not_dm"
   (const_string "any"))
 
 ;; Is this alternative enabled for the current CPU/ISA/etc.?
@@ -411,6 +413,14 @@ (define_attr "enabled" ""
      (and (eq_attr "isa" "stxvp")
 	  (match_test "TARGET_STORE_VECTOR_PAIR"))
      (const_int 1)
+
+     (and (eq_attr "isa" "dm")
+	  (match_test "TARGET_DENSE_MATH"))
+     (const_int 1)
+
+     (and (eq_attr "isa" "not_dm")
+	  (match_test "!TARGET_DENSE_MATH"))
+     (const_int 1)
     ] (const_int 0)))
 
 ;; If this instruction is microcoded on the CELL processor
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index 775ba830eac..70913d88d39 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -632,6 +632,10 @@ mfuture
 Target Undocumented Mask(FUTURE) Var(rs6000_isa_flags)
 Generate (do not generate) future instructions.
 
+mdense-math
+Target Undocumented Mask(DENSE_MATH) Var(rs6000_isa_flags)
+Generate (do not generate) dense math instructions.
+
 ; Documented parameters
 
 -param=rs6000-vect-unroll-limit=
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 47a87d6ceec..2d7674e85b3 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -3440,6 +3440,13 @@ Like @code{d}, if @option{-mpowerpc-gfxopt} is used; otherwise, @code{NO_REGS}.
 @item wA
 Like @code{b}, if @option{-mpowerpc64} is used; otherwise, @code{NO_REGS}.
 
+@item wD
+Accumulator register if @option{-mmma} is used; otherwise,
+@code{NO_REGS}.  If @option{-mdense-math} is used, the accumulator
+register will be in the dense math register set.  If
+@option{-mno-dense-math} is used, the accumulator register will
+overlap with the VSX vector registers 0..31.
+
 @item wB
 Signed 5-bit constant integer that can be loaded into an Altivec register.
 
-- 
2.43.0


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Repost [PATCH 4/6] PowerPC: Make MMA insns support DMR registers.
  2024-01-05 23:27 Repost [PATCH 0/6] PowerPC Future patches Michael Meissner
                   ` (2 preceding siblings ...)
  2024-01-05 23:38 ` Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers Michael Meissner
@ 2024-01-05 23:39 ` Michael Meissner
  2024-01-19 18:47   ` Ping " Michael Meissner
  2024-02-04  3:21   ` Repost " Kewen.Lin
  2024-01-05 23:40 ` Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations Michael Meissner
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-05 23:39 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

This patch changes the MMA instructions to use either FPR registers
(-mcpu=power10) or DMRs (-mcpu=future).  In this patch, the existing MMA
instruction names are used.
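
As a rough illustration of what "either FPR registers or DMRs" means for the
accumulator number printed by the %A operand modifier, here is a small
standalone C model.  It is only a sketch, not code from the patch:
accumulator_number is a hypothetical helper, the DMR range 111..118 is taken
from the rs6000.md hunk in patch 3/6, and the FPR range 32..63 is assumed from
the existing rs6000 register numbering.

```c
#include <stdbool.h>

/* GCC hard register numbers.  The DMR range comes from patch 3/6; the FPR
   range is an assumption based on the existing rs6000 port.  */
enum
{
  FIRST_FPR_REGNO = 32,
  LAST_FPR_REGNO = 63,
  FIRST_DMR_REGNO = 111,
  LAST_DMR_REGNO = 118
};

/* Hypothetical model of the %A mapping: return the accumulator number for
   hard register REGNO, or -1 if REGNO cannot hold an accumulator.  With
   dense math only DMRs are accepted; without it, only FPRs whose number is
   divisible by 4.  */
int
accumulator_number (int regno, bool dense_math)
{
  if (dense_math)
    {
      if (regno >= FIRST_DMR_REGNO && regno <= LAST_DMR_REGNO)
	return regno - FIRST_DMR_REGNO;
      return -1;
    }

  if (regno < FIRST_FPR_REGNO || regno > LAST_FPR_REGNO || (regno % 4) != 0)
    return -1;

  return (regno - FIRST_FPR_REGNO) / 4;
}
```

In this model, register 36 (the FPR numbered 4) is accumulator 1 without
dense math and is rejected with dense math, while register 113 is accumulator
2 only when dense math is enabled.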

A macro (__PPC_DMR__) is defined if the MMA instructions use the DMRs.
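
User code could test the macro roughly as follows.  This is a minimal sketch:
mma_uses_dmr and mma_accumulator_home are hypothetical helper names, and the
"FPR" answer is only meaningful when MMA itself is enabled.

```c
#include <stdbool.h>

/* Report whether the compiler says the MMA accumulators live in DMRs.
   __PPC_DMR__ is the macro this patch defines; everything else in this
   example is made up for illustration.  */
bool
mma_uses_dmr (void)
{
#ifdef __PPC_DMR__
  return true;
#else
  return false;
#endif
}

/* Name the register file backing the accumulators.  */
const char *
mma_accumulator_home (void)
{
  return mma_uses_dmr () ? "DMR" : "FPR";
}
```

Because the check is resolved by the preprocessor, code that relies on
accumulators overlapping the FPRs can be kept out of -mcpu=future builds at
compile time.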

The patches have been tested on both little and big endian systems.  Can I check
it into the master branch?

2024-01-05   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/mma.md (mma_<acc>): New define_expand to handle
	mma_<acc> for dense math and non dense math.
	(mma_<acc> insn): Restrict to non dense math.
	(mma_xxsetaccz): Convert to define_expand to handle non dense math and
	dense math.
	(mma_xxsetaccz_vsx): Rename from mma_xxsetaccz and restrict usage to non
	dense math.
	(mma_xxsetaccz_dm): Dense math version of mma_xxsetaccz.
	(mma_<vv>): Add support for dense math.
	(mma_<avv>): Likewise.
	(mma_<pv>): Likewise.
	(mma_<apv>): Likewise.
	(mma_<vvi4i4i8>): Likewise.
	(mma_<avvi4i4i8>): Likewise.
	(mma_<vvi4i4i2>): Likewise.
	(mma_<avvi4i4i2>): Likewise.
	(mma_<vvi4i4>): Likewise.
	(mma_<avvi4i4>): Likewise.
	(mma_<pvi4i2>): Likewise.
	(mma_<apvi4i2>): Likewise.
	(mma_<vvi4i4i4>): Likewise.
	(mma_<avvi4i4i4>): Likewise.
	* config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
	__PPC_DMR__ if we have dense math instructions.
	* config/rs6000/rs6000.cc (print_operand): Make %A handle only DMRs if
	dense math and only FPRs if not dense math.
	(rs6000_split_multireg_move): Do not generate the xxmtacc instruction to
	prime the DMR registers or the xxmfacc instruction to de-prime
	them if we have dense math register support.
---
 gcc/config/rs6000/mma.md      | 247 +++++++++++++++++++++-------------
 gcc/config/rs6000/rs6000-c.cc |   3 +
 gcc/config/rs6000/rs6000.cc   |  35 ++---
 3 files changed, 176 insertions(+), 109 deletions(-)

diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index bb898919ab5..525a85146ff 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -559,190 +559,249 @@ (define_insn "*mma_disassemble_acc_dm"
   "dmxxextfdmr256 %0,%1,2"
   [(set_attr "type" "mma")])
 
-(define_insn "mma_<acc>"
+;; MMA instructions that do not use their accumulators as an input, still must
+;; not allow their vector operands to overlap the registers used by the
+;; accumulator.  We enforce this by marking the output as early clobber.  If we
+;; have dense math, we don't need the whole prime/de-prime action, so just make
+;; these instructions be NOPs.
+
+(define_expand "mma_<acc>"
+  [(set (match_operand:XO 0 "register_operand")
+	(unspec:XO [(match_operand:XO 1 "register_operand")]
+		   MMA_ACC))]
+  "TARGET_MMA"
+{
+  if (TARGET_DENSE_MATH)
+    {
+      if (!rtx_equal_p (operands[0], operands[1]))
+	emit_move_insn (operands[0], operands[1]);
+      DONE;
+    }
+
+  /* Generate the prime/de-prime code.  */
+})
+
+(define_insn "*mma_<acc>"
   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
 	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")]
 		    MMA_ACC))]
-  "TARGET_MMA"
+  "TARGET_MMA && !TARGET_DENSE_MATH"
   "<acc> %A0"
   [(set_attr "type" "mma")])
 
 ;; We can't have integer constants in XOmode so we wrap this in an
-;; UNSPEC_VOLATILE.
+;; UNSPEC_VOLATILE for the non-dense math case.  For dense math, we don't need
+;; to disable optimization and we can do a normal UNSPEC.
 
-(define_insn "mma_xxsetaccz"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
+(define_expand "mma_xxsetaccz"
+  [(set (match_operand:XO 0 "register_operand")
 	(unspec_volatile:XO [(const_int 0)]
 			    UNSPECV_MMA_XXSETACCZ))]
   "TARGET_MMA"
+{
+  if (TARGET_DENSE_MATH)
+    {
+      emit_insn (gen_mma_xxsetaccz_dm (operands[0]));
+      DONE;
+    }
+})
+
+(define_insn "*mma_xxsetaccz_vsx"
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
+	(unspec_volatile:XO [(const_int 0)]
+			    UNSPECV_MMA_XXSETACCZ))]
+  "TARGET_MMA && !TARGET_DENSE_MATH"
   "xxsetaccz %A0"
   [(set_attr "type" "mma")])
 
+
+(define_insn "mma_xxsetaccz_dm"
+  [(set (match_operand:XO 0 "dmr_operand" "=wD")
+	(unspec:XO [(const_int 0)]
+		   UNSPECV_MMA_XXSETACCZ))]
+  "TARGET_DENSE_MATH"
+  "dmsetdmrz %0"
+  [(set_attr "type" "mma")])
+
 (define_insn "mma_<vv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_VV))]
   "TARGET_MMA"
   "<vv> %A0,%x1,%x2"
-  [(set_attr "type" "mma")])
+  [(set_attr "type" "mma")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_AVV))]
   "TARGET_MMA"
   "<avv> %A0,%x2,%x3"
-  [(set_attr "type" "mma")])
+  [(set_attr "type" "mma")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<pv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_PV))]
   "TARGET_MMA"
   "<pv> %A0,%x1,%x2"
-  [(set_attr "type" "mma")])
+  [(set_attr "type" "mma")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<apv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:OO 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:OO 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_APV))]
   "TARGET_MMA"
   "<apv> %A0,%x2,%x3"
-  [(set_attr "type" "mma")])
+  [(set_attr "type" "mma")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<vvi4i4i8>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "u8bit_cint_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "u8bit_cint_operand" "n,n,n")]
 		    MMA_VVI4I4I8))]
   "TARGET_MMA"
   "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avvi4i4i8>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 6 "u8bit_cint_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 6 "u8bit_cint_operand" "n,n,n")]
 		    MMA_AVVI4I4I8))]
   "TARGET_MMA"
   "<avvi4i4i8> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<vvi4i4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
 		    MMA_VVI4I4I2))]
   "TARGET_MMA"
   "<vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avvi4i4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 6 "const_0_to_3_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 6 "const_0_to_3_operand" "n,n,n")]
 		    MMA_AVVI4I4I2))]
   "TARGET_MMA"
   "<avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<vvi4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")]
 		    MMA_VVI4I4))]
   "TARGET_MMA"
   "<vvi4i4> %A0,%x1,%x2,%3,%4"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avvi4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
 		    MMA_AVVI4I4))]
   "TARGET_MMA"
   "<avvi4i4> %A0,%x2,%x3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<pvi4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_3_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_3_operand" "n,n,n")]
 		    MMA_PVI4I2))]
   "TARGET_MMA"
   "<pvi4i2> %A0,%x1,%x2,%3,%4"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<apvi4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:OO 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:OO 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
 		    MMA_APVI4I2))]
   "TARGET_MMA"
   "<apvi4i2> %A0,%x2,%x3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<vvi4i4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
 		    MMA_VVI4I4I4))]
   "TARGET_MMA"
   "<vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avvi4i4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 6 "const_0_to_15_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 6 "const_0_to_15_operand" "n,n,n")]
 		    MMA_AVVI4I4I4))]
   "TARGET_MMA"
   "<avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
index f2fb5bef678..4342620f87f 100644
--- a/gcc/config/rs6000/rs6000-c.cc
+++ b/gcc/config/rs6000/rs6000-c.cc
@@ -600,6 +600,9 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT flags)
   /* Tell the user if we support the MMA instructions.  */
   if ((flags & OPTION_MASK_MMA) != 0)
     rs6000_define_or_undefine_macro (define_p, "__MMA__");
+  /* Tell the user if we support the dense math instructions.  */
+  if ((flags & OPTION_MASK_DENSE_MATH) != 0)
+    rs6000_define_or_undefine_macro (define_p, "__PPC_DMR__");
   /* Whether pc-relative code is being generated.  */
   if ((flags & OPTION_MASK_PCREL) != 0)
     rs6000_define_or_undefine_macro (define_p, "__PCREL__");
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 83e32f7a43a..59517c8608d 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -14264,8 +14264,13 @@ print_operand (FILE *file, rtx x, int code)
 	 overlapping with the FPR registers.  */
       if (!REG_P (x))
 	output_operand_lossage ("invalid %%A value");
-      else if (TARGET_DENSE_MATH && DMR_REGNO_P (REGNO (x)))
-	fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
+      else if (TARGET_DENSE_MATH)
+	{
+	  if (DMR_REGNO_P (REGNO (x)))
+	    fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
+	  else
+	    output_operand_lossage ("%%A operand is not a DMR");
+	}
       else if (!FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
 	output_operand_lossage ("invalid %%A value");
       else
@@ -27719,7 +27724,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
 	  /* If we are reading an accumulator register, we have to
 	     deprime it before we can access it.  */
-	  if (TARGET_MMA
+	  if (TARGET_MMA && !TARGET_DENSE_MATH
 	      && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
 	    emit_insn (gen_mma_xxmfacc (src, src));
 
@@ -27751,9 +27756,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 	      emit_insn (gen_rtx_SET (dst2, src2));
 	    }
 
-	  /* If we are writing an accumulator register, we have to
-	     prime it after we've written it.  */
-	  if (TARGET_MMA
+	  /* If we are writing an accumulator register that overlaps with the
+	     FPR registers, we have to prime it after we've written it.  */
+	  if (TARGET_MMA && !TARGET_DENSE_MATH
 	      && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
 	    emit_insn (gen_mma_xxmtacc (dst, dst));
 
@@ -27822,9 +27827,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 	      emit_insn (gen_rtx_SET (dst_i, op));
 	    }
 
-	  /* We are writing an accumulator register, so we have to
-	     prime it after we've written it.  */
-	  if (GET_MODE (src) == XOmode)
+	  /* On systems without dense math where accumulators overlap with the
+	     vector registers, we have to prime it after we've written it.  */
+	  if (GET_MODE (src) == XOmode && !TARGET_DENSE_MATH)
 	    emit_insn (gen_mma_xxmtacc (dst, dst));
 
 	  return;
@@ -27835,9 +27840,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
   if (REG_P (src) && REG_P (dst) && (REGNO (src) < REGNO (dst)))
     {
-      /* If we are reading an accumulator register, we have to
-	 deprime it before we can access it.  */
-      if (TARGET_MMA
+      /* If we are reading an accumulator register and we don't have dense
+	 math, we have to deprime it before we can access it.  */
+      if (TARGET_MMA && !TARGET_DENSE_MATH
 	  && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
 	emit_insn (gen_mma_xxmfacc (src, src));
 
@@ -27865,7 +27870,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
       /* If we are writing an accumulator register, we have to
 	 prime it after we've written it.  */
-      if (TARGET_MMA
+      if (TARGET_MMA && !TARGET_DENSE_MATH
 	  && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
 	emit_insn (gen_mma_xxmtacc (dst, dst));
     }
@@ -28002,7 +28007,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
       /* If we are reading an accumulator register, we have to
 	 deprime it before we can access it.  */
-      if (TARGET_MMA && REG_P (src)
+      if (TARGET_MMA && !TARGET_DENSE_MATH && REG_P (src)
 	  && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
 	emit_insn (gen_mma_xxmfacc (src, src));
 
@@ -28034,7 +28039,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
       /* If we are writing an accumulator register, we have to
 	 prime it after we've written it.  */
-      if (TARGET_MMA && REG_P (dst)
+      if (TARGET_MMA && !TARGET_DENSE_MATH && REG_P (dst)
 	  && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
 	emit_insn (gen_mma_xxmtacc (dst, dst));
 
-- 
2.43.0


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations.
  2024-01-05 23:27 Repost [PATCH 0/6] PowerPC Future patches Michael Meissner
                   ` (3 preceding siblings ...)
  2024-01-05 23:39 ` Repost [PATCH 4/6] PowerPC: Make MMA insns support " Michael Meissner
@ 2024-01-05 23:40 ` Michael Meissner
  2024-01-19 18:48   ` Ping " Michael Meissner
  2024-02-04  5:47   ` Repost " Kewen.Lin
  2024-01-05 23:42 ` Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers Michael Meissner
  2024-02-08 18:22 ` Repost [PATCH 0/6] PowerPC Future patches Segher Boessenkool
  6 siblings, 2 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-05 23:40 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

This patch changes the assembler instruction names for MMA instructions from
the original names used in power10 to the new names used with the dense math
system.  For example, xvf64gerpp becomes dmxvf64gerpp.  The assembler will emit
the same bits for either spelling.
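
The renaming rule can be sketched in plain C.  This is only an illustration of
the spelling change visible in the int attributes of this patch (plain forms
gain a "dm" prefix, prefixed "pm" forms gain "dm" after the "pm"); the helper
name dense_math_name is hypothetical and is not part of the patch or of GCC.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: build the dense math spelling
   of a power10 MMA mnemonic.  Plain forms gain a "dm" prefix; prefixed forms
   ("pm...") keep "pm" and gain "dm" right after it.  Returns 0 on success
   and -1 if BUF is too small to hold the result.  */
static int
dense_math_name (const char *mma_name, char *buf, size_t buflen)
{
  const char *prefix = "";
  const char *rest = mma_name;

  assert (mma_name != NULL && buf != NULL);

  if (strncmp (mma_name, "pm", 2) == 0)
    {
      prefix = "pm";
      rest = mma_name + 2;
    }

  int n = snprintf (buf, buflen, "%sdm%s", prefix, rest);
  return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

Under this sketch, "xvf64gerpp" maps to "dmxvf64gerpp" and "pmxvf64gerpp" maps
to "pmdmxvf64gerpp", matching the strings in the new *_dm attributes.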

This patch has been tested on both little and big endian systems.  Can I check
it into the master branch?

2024-01-05   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/mma.md (vvi4i4i8_dm): New int attribute.
	(avvi4i4i8_dm): Likewise.
	(vvi4i4i2_dm): Likewise.
	(avvi4i4i2_dm): Likewise.
	(vvi4i4_dm): Likewise.
	(avvi4i4_dm): Likewise.
	(pvi4i2_dm): Likewise.
	(apvi4i2_dm): Likewise.
	(vvi4i4i4_dm): Likewise.
	(avvi4i4i4_dm): Likewise.
	(mma_<vv>): Add support for running on DMF systems, generating the dense
	math instruction and using the dense math accumulators.
	(mma_<avv>): Likewise.
	(mma_<pv>): Likewise.
	(mma_<apv>): Likewise.
	(mma_<vvi4i4i8>): Likewise.
	(mma_<avvi4i4i8>): Likewise.
	(mma_<vvi4i4i2>): Likewise.
	(mma_<avvi4i4i2>): Likewise.
	(mma_<vvi4i4>): Likewise.
	(mma_<avvi4i4>): Likewise.
	(mma_<pvi4i2>): Likewise.
	(mma_<apvi4i2>): Likewise.
	(mma_<vvi4i4i4>): Likewise.
	(mma_<avvi4i4i4>): Likewise.

gcc/testsuite/

	* gcc.target/powerpc/dm-double-test.c: New test.
	* lib/target-supports.exp
	(check_effective_target_powerpc_dense_math_ok): New target test.
---
 gcc/config/rs6000/mma.md                      |  98 +++++++--
 .../gcc.target/powerpc/dm-double-test.c       | 194 ++++++++++++++++++
 gcc/testsuite/lib/target-supports.exp         |  19 ++
 3 files changed, 299 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-double-test.c

diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index 525a85146ff..f06e6bbb184 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -227,13 +227,22 @@ (define_int_attr apv		[(UNSPEC_MMA_XVF64GERPP		"xvf64gerpp")
 
 (define_int_attr vvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8		"pmxvi4ger8")])
 
+(define_int_attr vvi4i4i8_dm	[(UNSPEC_MMA_PMXVI4GER8		"pmdmxvi4ger8")])
+
 (define_int_attr avvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8PP	"pmxvi4ger8pp")])
 
+(define_int_attr avvi4i4i8_dm	[(UNSPEC_MMA_PMXVI4GER8PP	"pmdmxvi4ger8pp")])
+
 (define_int_attr vvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2	"pmxvi16ger2")
 				 (UNSPEC_MMA_PMXVI16GER2S	"pmxvi16ger2s")
 				 (UNSPEC_MMA_PMXVF16GER2	"pmxvf16ger2")
 				 (UNSPEC_MMA_PMXVBF16GER2	"pmxvbf16ger2")])
 
+(define_int_attr vvi4i4i2_dm	[(UNSPEC_MMA_PMXVI16GER2	"pmdmxvi16ger2")
+				 (UNSPEC_MMA_PMXVI16GER2S	"pmdmxvi16ger2s")
+				 (UNSPEC_MMA_PMXVF16GER2	"pmdmxvf16ger2")
+				 (UNSPEC_MMA_PMXVBF16GER2	"pmdmxvbf16ger2")])
+
 (define_int_attr avvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2PP	"pmxvi16ger2pp")
 				 (UNSPEC_MMA_PMXVI16GER2SPP	"pmxvi16ger2spp")
 				 (UNSPEC_MMA_PMXVF16GER2PP	"pmxvf16ger2pp")
@@ -245,25 +254,54 @@ (define_int_attr avvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2PP	"pmxvi16ger2pp")
 				 (UNSPEC_MMA_PMXVBF16GER2NP	"pmxvbf16ger2np")
 				 (UNSPEC_MMA_PMXVBF16GER2NN	"pmxvbf16ger2nn")])
 
+(define_int_attr avvi4i4i2_dm	[(UNSPEC_MMA_PMXVI16GER2PP	"pmdmxvi16ger2pp")
+				 (UNSPEC_MMA_PMXVI16GER2SPP	"pmdmxvi16ger2spp")
+				 (UNSPEC_MMA_PMXVF16GER2PP	"pmdmxvf16ger2pp")
+				 (UNSPEC_MMA_PMXVF16GER2PN	"pmdmxvf16ger2pn")
+				 (UNSPEC_MMA_PMXVF16GER2NP	"pmdmxvf16ger2np")
+				 (UNSPEC_MMA_PMXVF16GER2NN	"pmdmxvf16ger2nn")
+				 (UNSPEC_MMA_PMXVBF16GER2PP	"pmdmxvbf16ger2pp")
+				 (UNSPEC_MMA_PMXVBF16GER2PN	"pmdmxvbf16ger2pn")
+				 (UNSPEC_MMA_PMXVBF16GER2NP	"pmdmxvbf16ger2np")
+				 (UNSPEC_MMA_PMXVBF16GER2NN	"pmdmxvbf16ger2nn")])
+
 (define_int_attr vvi4i4		[(UNSPEC_MMA_PMXVF32GER		"pmxvf32ger")])
 
+(define_int_attr vvi4i4_dm	[(UNSPEC_MMA_PMXVF32GER		"pmdmxvf32ger")])
+
 (define_int_attr avvi4i4	[(UNSPEC_MMA_PMXVF32GERPP	"pmxvf32gerpp")
 				 (UNSPEC_MMA_PMXVF32GERPN	"pmxvf32gerpn")
 				 (UNSPEC_MMA_PMXVF32GERNP	"pmxvf32gernp")
 				 (UNSPEC_MMA_PMXVF32GERNN	"pmxvf32gernn")])
 
+(define_int_attr avvi4i4_dm	[(UNSPEC_MMA_PMXVF32GERPP	"pmdmxvf32gerpp")
+				 (UNSPEC_MMA_PMXVF32GERPN	"pmdmxvf32gerpn")
+				 (UNSPEC_MMA_PMXVF32GERNP	"pmdmxvf32gernp")
+				 (UNSPEC_MMA_PMXVF32GERNN	"pmdmxvf32gernn")])
+
 (define_int_attr pvi4i2		[(UNSPEC_MMA_PMXVF64GER		"pmxvf64ger")])
 
+(define_int_attr pvi4i2_dm	[(UNSPEC_MMA_PMXVF64GER		"pmdmxvf64ger")])
+
 (define_int_attr apvi4i2	[(UNSPEC_MMA_PMXVF64GERPP	"pmxvf64gerpp")
 				 (UNSPEC_MMA_PMXVF64GERPN	"pmxvf64gerpn")
 				 (UNSPEC_MMA_PMXVF64GERNP	"pmxvf64gernp")
 				 (UNSPEC_MMA_PMXVF64GERNN	"pmxvf64gernn")])
 
+(define_int_attr apvi4i2_dm	[(UNSPEC_MMA_PMXVF64GERPP	"pmdmxvf64gerpp")
+				 (UNSPEC_MMA_PMXVF64GERPN	"pmdmxvf64gerpn")
+				 (UNSPEC_MMA_PMXVF64GERNP	"pmdmxvf64gernp")
+				 (UNSPEC_MMA_PMXVF64GERNN	"pmdmxvf64gernn")])
+
 (define_int_attr vvi4i4i4	[(UNSPEC_MMA_PMXVI8GER4		"pmxvi8ger4")])
 
+(define_int_attr vvi4i4i4_dm	[(UNSPEC_MMA_PMXVI8GER4		"pmdmxvi8ger4")])
+
 (define_int_attr avvi4i4i4	[(UNSPEC_MMA_PMXVI8GER4PP	"pmxvi8ger4pp")
 				 (UNSPEC_MMA_PMXVI8GER4SPP	"pmxvi8ger4spp")])
 
+(define_int_attr avvi4i4i4_dm	[(UNSPEC_MMA_PMXVI8GER4PP	"pmdmxvi8ger4pp")
+				 (UNSPEC_MMA_PMXVI8GER4SPP	"pmdmxvi8ger4spp")])
 
 ;; Vector pair support.  OOmode can only live in VSRs.
 (define_expand "movoo"
@@ -629,7 +667,10 @@ (define_insn "mma_<vv>"
 		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_VV))]
   "TARGET_MMA"
-  "<vv> %A0,%x1,%x2"
+  "@
+   dm<vv> %A0,%x1,%x2
+   <vv> %A0,%x1,%x2
+   <vv> %A0,%x1,%x2"
   [(set_attr "type" "mma")
    (set_attr "isa" "dm,not_dm,not_dm")])
 
@@ -650,7 +691,10 @@ (define_insn "mma_<pv>"
 		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_PV))]
   "TARGET_MMA"
-  "<pv> %A0,%x1,%x2"
+  "@
+   dm<pv> %A0,%x1,%x2
+   <pv> %A0,%x1,%x2
+   <pv> %A0,%x1,%x2"
   [(set_attr "type" "mma")
    (set_attr "isa" "dm,not_dm,not_dm")])
 
@@ -661,7 +705,10 @@ (define_insn "mma_<apv>"
 		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_APV))]
   "TARGET_MMA"
-  "<apv> %A0,%x2,%x3"
+  "@
+   dm<apv> %A0,%x2,%x3
+   <apv> %A0,%x2,%x3
+   <apv> %A0,%x2,%x3"
   [(set_attr "type" "mma")
    (set_attr "isa" "dm,not_dm,not_dm")])
 
@@ -674,7 +721,10 @@ (define_insn "mma_<vvi4i4i8>"
 		    (match_operand:SI 5 "u8bit_cint_operand" "n,n,n")]
 		    MMA_VVI4I4I8))]
   "TARGET_MMA"
-  "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
+  "@
+   <vvi4i4i8_dm> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i8> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -703,7 +753,10 @@ (define_insn "mma_<vvi4i4i2>"
 		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
 		    MMA_VVI4I4I2))]
   "TARGET_MMA"
-  "<vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
+  "@
+   <vvi4i4i2_dm> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i2> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -718,7 +771,10 @@ (define_insn "mma_<avvi4i4i2>"
 		    (match_operand:SI 6 "const_0_to_3_operand" "n,n,n")]
 		    MMA_AVVI4I4I2))]
   "TARGET_MMA"
-  "<avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
+  "@
+   <avvi4i4i2_dm> %A0,%x2,%x3,%4,%5,%6
+   <avvi4i4i2> %A0,%x2,%x3,%4,%5,%6
+   <avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -731,7 +787,10 @@ (define_insn "mma_<vvi4i4>"
 		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")]
 		    MMA_VVI4I4))]
   "TARGET_MMA"
-  "<vvi4i4> %A0,%x1,%x2,%3,%4"
+  "@
+   <vvi4i4_dm> %A0,%x1,%x2,%3,%4
+   <vvi4i4> %A0,%x1,%x2,%3,%4
+   <vvi4i4> %A0,%x1,%x2,%3,%4"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -745,7 +804,10 @@ (define_insn "mma_<avvi4i4>"
 		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
 		    MMA_AVVI4I4))]
   "TARGET_MMA"
-  "<avvi4i4> %A0,%x2,%x3,%4,%5"
+  "@
+   <avvi4i4_dm> %A0,%x2,%x3,%4,%5
+   <avvi4i4> %A0,%x2,%x3,%4,%5
+   <avvi4i4> %A0,%x2,%x3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -758,7 +820,10 @@ (define_insn "mma_<pvi4i2>"
 		    (match_operand:SI 4 "const_0_to_3_operand" "n,n,n")]
 		    MMA_PVI4I2))]
   "TARGET_MMA"
-  "<pvi4i2> %A0,%x1,%x2,%3,%4"
+  "@
+   <pvi4i2_dm> %A0,%x1,%x2,%3,%4
+   <pvi4i2> %A0,%x1,%x2,%3,%4
+   <pvi4i2> %A0,%x1,%x2,%3,%4"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -772,7 +837,10 @@ (define_insn "mma_<apvi4i2>"
 		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
 		    MMA_APVI4I2))]
   "TARGET_MMA"
-  "<apvi4i2> %A0,%x2,%x3,%4,%5"
+  "@
+   <apvi4i2_dm> %A0,%x2,%x3,%4,%5
+   <apvi4i2> %A0,%x2,%x3,%4,%5
+   <apvi4i2> %A0,%x2,%x3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -786,7 +854,10 @@ (define_insn "mma_<vvi4i4i4>"
 		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
 		    MMA_VVI4I4I4))]
   "TARGET_MMA"
-  "<vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
+  "@
+   <vvi4i4i4_dm> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i4> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -801,7 +872,10 @@ (define_insn "mma_<avvi4i4i4>"
 		    (match_operand:SI 6 "const_0_to_15_operand" "n,n,n")]
 		    MMA_AVVI4I4I4))]
   "TARGET_MMA"
-  "<avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
+  "@
+   <avvi4i4i4_dm> %A0,%x2,%x3,%4,%5,%6
+   <avvi4i4i4> %A0,%x2,%x3,%4,%5,%6
+   <avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
diff --git a/gcc/testsuite/gcc.target/powerpc/dm-double-test.c b/gcc/testsuite/gcc.target/powerpc/dm-double-test.c
new file mode 100644
index 00000000000..66c19779585
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/dm-double-test.c
@@ -0,0 +1,194 @@
+/* Test derived from mma-double-1.c, modified for dense math.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_dense_math_ok } */
+/* { dg-options "-mdejagnu-cpu=future -O2" } */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <altivec.h>
+
+typedef unsigned char vec_t __attribute__ ((vector_size (16)));
+typedef double v4sf_t __attribute__ ((vector_size (16)));
+#define SAVE_ACC(ACC, ldc, J)  \
+	  __builtin_mma_disassemble_acc (result, ACC); \
+	  rowC = (v4sf_t *) &CO[0*ldc+J]; \
+          rowC[0] += result[0]; \
+          rowC = (v4sf_t *) &CO[1*ldc+J]; \
+          rowC[0] += result[1]; \
+          rowC = (v4sf_t *) &CO[2*ldc+J]; \
+          rowC[0] += result[2]; \
+          rowC = (v4sf_t *) &CO[3*ldc+J]; \
+	  rowC[0] += result[3];
+
+void
+DM (int m, int n, int k, double *A, double *B, double *C)
+{
+  __vector_quad acc0, acc1, acc2, acc3, acc4, acc5, acc6, acc7;
+  v4sf_t result[4];
+  v4sf_t *rowC;
+  for (int l = 0; l < n; l += 4)
+    {
+      double *CO;
+      double *AO;
+      AO = A;
+      CO = C;
+      C += m * 4;
+      for (int j = 0; j < m; j += 16)
+	{
+	  double *BO = B;
+	  __builtin_mma_xxsetaccz (&acc0);
+	  __builtin_mma_xxsetaccz (&acc1);
+	  __builtin_mma_xxsetaccz (&acc2);
+	  __builtin_mma_xxsetaccz (&acc3);
+	  __builtin_mma_xxsetaccz (&acc4);
+	  __builtin_mma_xxsetaccz (&acc5);
+	  __builtin_mma_xxsetaccz (&acc6);
+	  __builtin_mma_xxsetaccz (&acc7);
+	  unsigned long i;
+
+	  for (i = 0; i < k; i++)
+	    {
+	      vec_t *rowA = (vec_t *) & AO[i * 16];
+	      __vector_pair rowB;
+	      vec_t *rb = (vec_t *) & BO[i * 4];
+	      __builtin_mma_assemble_pair (&rowB, rb[1], rb[0]);
+	      __builtin_mma_xvf64gerpp (&acc0, rowB, rowA[0]);
+	      __builtin_mma_xvf64gerpp (&acc1, rowB, rowA[1]);
+	      __builtin_mma_xvf64gerpp (&acc2, rowB, rowA[2]);
+	      __builtin_mma_xvf64gerpp (&acc3, rowB, rowA[3]);
+	      __builtin_mma_xvf64gerpp (&acc4, rowB, rowA[4]);
+	      __builtin_mma_xvf64gerpp (&acc5, rowB, rowA[5]);
+	      __builtin_mma_xvf64gerpp (&acc6, rowB, rowA[6]);
+	      __builtin_mma_xvf64gerpp (&acc7, rowB, rowA[7]);
+	    }
+	  SAVE_ACC (&acc0, m, 0);
+	  SAVE_ACC (&acc2, m, 4);
+	  SAVE_ACC (&acc1, m, 2);
+	  SAVE_ACC (&acc3, m, 6);
+	  SAVE_ACC (&acc4, m, 8);
+	  SAVE_ACC (&acc6, m, 12);
+	  SAVE_ACC (&acc5, m, 10);
+	  SAVE_ACC (&acc7, m, 14);
+	  AO += k * 16;
+	  BO += k * 4;
+	  CO += 16;
+	}
+      B += k * 4;
+    }
+}
+
+void
+init (double *matrix, int row, int column)
+{
+  for (int j = 0; j < column; j++)
+    {
+      for (int i = 0; i < row; i++)
+	{
+	  matrix[j * row + i] = (i * 16 + 2 + j) / 0.123;
+	}
+    }
+}
+
+void
+init0 (double *matrix, double *matrix1, int row, int column)
+{
+  for (int j = 0; j < column; j++)
+    for (int i = 0; i < row; i++)
+      matrix[j * row + i] = matrix1[j * row + i] = 0;
+}
+
+
+void
+print (const char *name, const double *matrix, int row, int column)
+{
+  printf ("Matrix %s has %d rows and %d columns:\n", name, row, column);
+  for (int i = 0; i < row; i++)
+    {
+      for (int j = 0; j < column; j++)
+	{
+	  printf ("%f ", matrix[j * row + i]);
+	}
+      printf ("\n");
+    }
+  printf ("\n");
+}
+
+int
+main (int argc, char *argv[])
+{
+  int rowsA, colsB, common;
+  int i, j, k;
+  int ret = 0;
+
+  for (int t = 16; t <= 128; t += 16)
+    {
+      for (int t1 = 4; t1 <= 16; t1 += 4)
+	{
+	  rowsA = t;
+	  colsB = t1;
+	  common = 1;
+	  /* printf ("Running test for rows = %d,cols = %d\n", t, t1); */
+	  double A[rowsA * common];
+	  double B[common * colsB];
+	  double C[rowsA * colsB];
+	  double D[rowsA * colsB];
+
+
+	  init (A, rowsA, common);
+	  init (B, common, colsB);
+	  init0 (C, D, rowsA, colsB);
+	  DM (rowsA, colsB, common, A, B, C);
+
+	  for (i = 0; i < colsB; i++)
+	    {
+	      for (j = 0; j < rowsA; j++)
+		{
+		  D[i * rowsA + j] = 0;
+		  for (k = 0; k < common; k++)
+		    {
+		      D[i * rowsA + j] +=
+			A[k * rowsA + j] * B[k + common * i];
+		    }
+		}
+	    }
+	  for (i = 0; i < colsB; i++)
+	    {
+	      for (j = 0; j < rowsA; j++)
+		{
+		  for (k = 0; k < common; k++)
+		    {
+		      if (D[i * rowsA + j] != C[i * rowsA + j])
+			{
+			  printf ("Error %d,%d,%d\n",i,j,k);
+			  ret++;
+			}
+		    }
+		}
+	    }
+	  if (ret)
+	    {
+	      print ("A", A, rowsA, common);
+	      print ("B", B, common, colsB);
+	      print ("C", C, rowsA, colsB);
+	      print ("D", D, rowsA, colsB);
+	    }
+	}
+    }
+  
+#ifdef VERBOSE
+  if (ret)
+    printf ("DM double test fail: %d errors\n",ret);
+  else
+    printf ("DM double test success: 0 DM errors\n");
+#else
+  if (ret)
+    abort();
+#endif
+      
+  return ret;
+}
+
+/* { dg-final { scan-assembler {\mdmsetdmrz\M}      } } */
+/* { dg-final { scan-assembler {\mdmxvf64gerpp\M}   } } */
+/* { dg-final { scan-assembler {\mdmxxextfdmr512\M} } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 1b4a3fb18df..2dec3682a2f 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -7101,6 +7101,25 @@ proc check_effective_target_power10_ok { } {
     }
 }
 
+# Return 1 if this is a PowerPC target supporting -mcpu=future or -mdense-math
+# which enables the dense math operations.
+proc check_effective_target_powerpc_dense_math_ok { } {
+	return [check_no_compiler_messages_nocache powerpc_dense_math_ok assembly {
+		__vector_quad vq;
+		void test (void)
+		{
+		#ifndef __PPC_DMR__
+		#error "target does not have dense math support."
+		#else
+		/* Make sure we have dense math support.  */
+		  __vector_quad dmr;
+		  __asm__ ("dmsetaccz %A0" : "=wD" (dmr));
+		  vq = dmr;
+		#endif
+		}
+	} "-mcpu=future"]
+}
+
 # Return 1 if this is a PowerPC target supporting -mfloat128 via either
 # software emulation on power7/power8 systems or hardware support on power9.
 
-- 
2.43.0


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

* Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.
  2024-01-05 23:27 Repost [PATCH 0/6] PowerPC Future patches Michael Meissner
                   ` (4 preceding siblings ...)
  2024-01-05 23:40 ` Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations Michael Meissner
@ 2024-01-05 23:42 ` Michael Meissner
  2024-01-19 18:49   ` Ping " Michael Meissner
  2024-02-05  3:58   ` Repost " Kewen.Lin
  2024-02-08 18:22 ` Repost [PATCH 0/6] PowerPC Future patches Segher Boessenkool
  6 siblings, 2 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-05 23:42 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

This is a preliminary patch to add the full 1,024-bit dense math registers
(DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto the top of the
DMR registers.
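
As a rough illustration of that mapping (not part of the patch), the following
portable C sketch models one DMR as two 512-bit halves and treats an
accumulator access as touching only the upper half.  The names dmr_model,
acc_write, acc_read and acc_demo are hypothetical, and taking "top" to mean the
half selected by index 0 is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { HALF_BYTES = 64 };	/* 512 bits.  */

/* Hypothetical software model of one 1,024-bit dense math register.  */
typedef struct
{
  uint8_t half[2][HALF_BYTES];	/* half[0] = upper 512 bits, half[1] = lower.  */
} dmr_model;

/* Model a 512-bit accumulator write: only the upper half is changed.  */
static void
acc_write (dmr_model *d, const uint8_t *src)
{
  memcpy (d->half[0], src, HALF_BYTES);
}

/* Model a 512-bit accumulator read: only the upper half is visible.  */
static void
acc_read (const dmr_model *d, uint8_t *dst)
{
  memcpy (dst, d->half[0], HALF_BYTES);
}

/* Return 1 if an accumulator write reads back unchanged and leaves the
   lower half of the modeled DMR untouched.  */
static int
acc_demo (void)
{
  dmr_model d;
  uint8_t in[HALF_BYTES], out[HALF_BYTES];

  memset (&d, 0, sizeof d);
  memset (in, 0xAA, sizeof in);
  acc_write (&d, in);
  acc_read (&d, out);
  assert (sizeof d == 2 * HALF_BYTES);
  return memcmp (in, out, HALF_BYTES) == 0 && d.half[1][0] == 0;
}
```

In this model, existing 512-bit accumulator code never observes the lower half,
which is why only the new 1,024-bit type needs to address both halves.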

This patch only adds the new 1,024 bit register support.  It does not add
support for any instructions that need 1,024 bit registers instead of 512 bit
registers.

I used the new mode 'TDOmode' as the opaque mode for 1,024-bit registers.  The
'wD' constraint added in previous patches is used for these registers.  I added
support to load and store DMRs via the VSX registers, since there are no
load/store dense math instructions.  I added the new keyword '__dmr' to create
1,024-bit types that can be loaded into DMRs.  At present, I don't have aliases
for __dmr512 and __dmr1024 that we've discussed internally.
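
The load/store path through the VSX registers can be sketched in portable C (a
model only, not the patch's code).  On a load, each 512-bit half of memory is
copied into a temporary standing in for four VSX registers and then inserted
into one half of the DMR; a store does the reverse with extracts.  All names
here are hypothetical.  The 64-byte split of the 128-byte memory image, and
which half comes first for each endianness, are assumptions of the model.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { XO_BYTES = 64, TDO_BYTES = 128 };

/* Hypothetical models: half[0] is the upper 512 bits, half[1] the lower.  */
typedef struct { uint8_t half[2][XO_BYTES]; } dmr_model;
/* Stand-in for four consecutive 128-bit VSX registers.  */
typedef struct { uint8_t bytes[XO_BYTES]; } vsx_quad_model;

/* Model of a 512-bit insert into the selected half of a DMR.  */
static void
insert512 (dmr_model *d, const vsx_quad_model *v, int which)
{
  memcpy (d->half[which], v->bytes, XO_BYTES);
}

/* Model of a 512-bit extract from the selected half of a DMR.  */
static void
extract512 (vsx_quad_model *v, const dmr_model *d, int which)
{
  memcpy (v->bytes, d->half[which], XO_BYTES);
}

/* Load: memory -> VSX temporary -> DMR, one half at a time.  */
static void
dmr_load (dmr_model *d, const uint8_t *mem, int big_endian)
{
  vsx_quad_model tmp;
  size_t upper_off = big_endian ? 0 : XO_BYTES;
  size_t lower_off = big_endian ? XO_BYTES : 0;

  memcpy (tmp.bytes, mem + upper_off, XO_BYTES);
  insert512 (d, &tmp, 0);
  memcpy (tmp.bytes, mem + lower_off, XO_BYTES);
  insert512 (d, &tmp, 1);
}

/* Store: DMR -> VSX temporary -> memory, one half at a time.  */
static void
dmr_store (uint8_t *mem, const dmr_model *d, int big_endian)
{
  vsx_quad_model tmp;
  size_t upper_off = big_endian ? 0 : XO_BYTES;
  size_t lower_off = big_endian ? XO_BYTES : 0;

  extract512 (&tmp, d, 0);
  memcpy (mem + upper_off, tmp.bytes, XO_BYTES);
  extract512 (&tmp, d, 1);
  memcpy (mem + lower_off, tmp.bytes, XO_BYTES);
}

/* Return 1 if a load followed by a store reproduces the memory image.  */
static int
dmr_roundtrip (int big_endian)
{
  uint8_t in[TDO_BYTES], out[TDO_BYTES];
  dmr_model d;

  for (int i = 0; i < TDO_BYTES; i++)
    in[i] = (uint8_t) i;
  memset (out, 0, sizeof out);
  dmr_load (&d, in, big_endian);
  dmr_store (out, &d, big_endian);
  return memcmp (in, out, TDO_BYTES) == 0;
}
```

The single temporary mirrors the idea of a scratch VSX quad being reused for
both halves, so only one 512-bit scratch area is needed per transfer.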

The patches have been tested on both little and big endian systems.  Can I check
them into the master branch?

2024-01-05   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/mma.md (UNSPEC_DM_INSERT512_UPPER): New unspec.
	(UNSPEC_DM_INSERT512_LOWER): Likewise.
	(UNSPEC_DM_EXTRACT512): Likewise.
	(UNSPEC_DMR_RELOAD_FROM_MEMORY): Likewise.
	(UNSPEC_DMR_RELOAD_TO_MEMORY): Likewise.
	(movtdo): New define_expand and define_insn_and_split to implement 1,024
	bit DMR registers.
	(movtdo_insert512_upper): New insn.
	(movtdo_insert512_lower): Likewise.
	(movtdo_extract512): Likewise.
	(reload_dmr_from_memory): Likewise.
	(reload_dmr_to_memory): Likewise.
	* config/rs6000/rs6000-builtin.cc (rs6000_type_string): Add DMR
	support.
	(rs6000_init_builtins): Add support for __dmr keyword.
	* config/rs6000/rs6000-call.cc (rs6000_return_in_memory): Add support
	for TDOmode.
	(rs6000_function_arg): Likewise.
	* config/rs6000/rs6000-modes.def (TDOmode): New mode.
	* config/rs6000/rs6000.cc (rs6000_hard_regno_nregs_internal): Add
	support for TDOmode.
	(rs6000_hard_regno_mode_ok_uncached): Likewise.
	(rs6000_hard_regno_mode_ok): Likewise.
	(rs6000_modes_tieable_p): Likewise.
	(rs6000_debug_reg_global): Likewise.
	(rs6000_setup_reg_addr_masks): Likewise.
	(rs6000_init_hard_regno_mode_ok): Add support for TDOmode.  Setup reload
	hooks for DMR mode.
	(reg_offset_addressing_ok_p): Add support for TDOmode.
	(rs6000_emit_move): Likewise.
	(rs6000_secondary_reload_simple_move): Likewise.
	(rs6000_secondary_reload_class): Likewise.
	(rs6000_mangle_type): Add mangling for __dmr type.
	(rs6000_dmr_register_move_cost): Add support for TDOmode.
	(rs6000_split_multireg_move): Likewise.
	(rs6000_invalid_conversion): Likewise.
	* config/rs6000/rs6000.h (VECTOR_ALIGNMENT_P): Add TDOmode.
	(enum rs6000_builtin_type_index): Add DMR type nodes.
	(dmr_type_node): Likewise.
	(ptr_dmr_type_node): Likewise.

gcc/testsuite/

	* gcc.target/powerpc/dm-1024bit.c: New test.
---
 gcc/config/rs6000/mma.md                      | 152 ++++++++++++++++++
 gcc/config/rs6000/rs6000-builtin.cc           |  13 ++
 gcc/config/rs6000/rs6000-call.cc              |  13 +-
 gcc/config/rs6000/rs6000-modes.def            |   4 +
 gcc/config/rs6000/rs6000.cc                   | 135 ++++++++++++----
 gcc/config/rs6000/rs6000.h                    |   7 +-
 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c |  63 ++++++++
 7 files changed, 351 insertions(+), 36 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c

diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index f06e6bbb184..37de9030903 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -92,6 +92,11 @@ (define_c_enum "unspec"
    UNSPEC_MMA_XXMFACC
    UNSPEC_MMA_XXMTACC
    UNSPEC_DM_ASSEMBLE_ACC
+   UNSPEC_DM_INSERT512_UPPER
+   UNSPEC_DM_INSERT512_LOWER
+   UNSPEC_DM_EXTRACT512
+   UNSPEC_DMR_RELOAD_FROM_MEMORY
+   UNSPEC_DMR_RELOAD_TO_MEMORY
   ])
 
 (define_c_enum "unspecv"
@@ -879,3 +884,150 @@ (define_insn "mma_<avvi4i4i4>"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
+
+\f
+;; TDOmode (i.e. __dmr).
+(define_expand "movtdo"
+  [(set (match_operand:TDO 0 "nonimmediate_operand")
+	(match_operand:TDO 1 "input_operand"))]
+  "TARGET_DENSE_MATH"
+{
+  rs6000_emit_move (operands[0], operands[1], TDOmode);
+  DONE;
+})
+
+(define_insn_and_split "*movtdo"
+  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
+	(match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
+  "TARGET_DENSE_MATH
+   && (gpc_reg_operand (operands[0], TDOmode)
+       || gpc_reg_operand (operands[1], TDOmode))"
+  "@
+   #
+   #
+   #
+   #
+   dmmr %0,%1
+   #"
+  "&& reload_completed
+   && (!dmr_operand (operands[0], TDOmode) || !dmr_operand (operands[1], TDOmode))"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  rtx op1 = operands[1];
+
+  if (REG_P (op0) && REG_P (op1))
+    {
+      int regno0 = REGNO (op0);
+      int regno1 = REGNO (op1);
+
+      if (DMR_REGNO_P (regno0) && VSX_REGNO_P (regno1))
+	{
+	  rtx op1_upper = gen_rtx_REG (XOmode, regno1);
+	  rtx op1_lower = gen_rtx_REG (XOmode, regno1 + 4);
+	  emit_insn (gen_movtdo_insert512_upper (op0, op1_upper));
+	  emit_insn (gen_movtdo_insert512_lower (op0, op0, op1_lower));
+	  DONE;
+	}
+
+      else if (VSX_REGNO_P (regno0) && DMR_REGNO_P (regno1))
+	{
+	  rtx op0_upper = gen_rtx_REG (XOmode, regno0);
+	  rtx op0_lower = gen_rtx_REG (XOmode, regno0 + 4);
+	  emit_insn (gen_movtdo_extract512 (op0_upper, op1, const0_rtx));
+	  emit_insn (gen_movtdo_extract512 (op0_lower, op1, const1_rtx));
+	  DONE;
+	}
+    }
+
+  rs6000_split_multireg_move (operands[0], operands[1]);
+  DONE;
+}
+  [(set_attr "type" "vecload,vecstore,vecmove,vecmove,vecmove,vecmove")
+   (set_attr "length" "*,*,32,8,*,8")
+   (set_attr "max_prefixed_insns" "4,4,*,*,*,*")])
+
+;; Move from VSX registers to DMR registers via two insert 512 bit
+;; instructions.
+(define_insn "movtdo_insert512_upper"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:XO 1 "vsx_register_operand" "wa")]
+		    UNSPEC_DM_INSERT512_UPPER))]
+  "TARGET_DENSE_MATH"
+  "dmxxinstdmr512 %0,%1,%Y1,0"
+  [(set_attr "type" "mma")])
+
+(define_insn "movtdo_insert512_lower"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "0")
+		     (match_operand:XO 2 "vsx_register_operand" "wa")]
+		    UNSPEC_DM_INSERT512_LOWER))]
+  "TARGET_DENSE_MATH"
+  "dmxxinstdmr512 %0,%2,%Y2,1"
+  [(set_attr "type" "mma")])
+
+;; Move from DMR registers to VSX registers via two extract 512 bit
+;; instructions.
+(define_insn "movtdo_extract512"
+  [(set (match_operand:XO 0 "vsx_register_operand" "=wa")
+	(unspec:XO [(match_operand:TDO 1 "dmr_operand" "wD")
+		    (match_operand 2 "const_0_to_1_operand" "n")]
+		   UNSPEC_DM_EXTRACT512))]
+  "TARGET_DENSE_MATH"
+  "dmxxextfdmr512 %0,%Y0,%1,%2"
+  [(set_attr "type" "mma")])
+
+;; Reload DMR registers from memory
+(define_insn_and_split "reload_dmr_from_memory"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:TDO 1 "memory_operand" "m")]
+		    UNSPEC_DMR_RELOAD_FROM_MEMORY))
+   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
+  "TARGET_DENSE_MATH"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx dest = operands[0];
+  rtx src = operands[1];
+  rtx tmp = operands[2];
+  rtx mem_upper = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 0 : 64);
+  rtx mem_lower = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 64 : 0);
+
+  emit_move_insn (tmp, mem_upper);
+  emit_insn (gen_movtdo_insert512_upper (dest, tmp));
+
+  emit_move_insn (tmp, mem_lower);
+  emit_insn (gen_movtdo_insert512_lower (dest, dest, tmp));
+  DONE;
+}
+  [(set_attr "length" "16")
+   (set_attr "max_prefixed_insns" "2")
+   (set_attr "type" "vecload")])
+
+;; Reload dense math registers to memory
+(define_insn_and_split "reload_dmr_to_memory"
+  [(set (match_operand:TDO 0 "memory_operand" "=m")
+	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "wD")]
+		    UNSPEC_DMR_RELOAD_TO_MEMORY))
+   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
+  "TARGET_DENSE_MATH"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx dest = operands[0];
+  rtx src = operands[1];
+  rtx tmp = operands[2];
+  rtx mem_upper = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 0 : 64);
+  rtx mem_lower = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 64 : 0);
+
+  emit_insn (gen_movtdo_extract512 (tmp, src, const0_rtx));
+  emit_move_insn (mem_upper, tmp);
+
+  emit_insn (gen_movtdo_extract512 (tmp, src, const1_rtx));
+  emit_move_insn (mem_lower, tmp);
+  DONE;
+}
+  [(set_attr "length" "16")
+   (set_attr "max_prefixed_insns" "2")])
diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
index 6698274031b..54868d2009c 100644
--- a/gcc/config/rs6000/rs6000-builtin.cc
+++ b/gcc/config/rs6000/rs6000-builtin.cc
@@ -495,6 +495,8 @@ const char *rs6000_type_string (tree type_node)
     return "__vector_pair";
   else if (type_node == vector_quad_type_node)
     return "__vector_quad";
+  else if (type_node == dmr_type_node)
+    return "__dmr";
 
   return "unknown";
 }
@@ -781,6 +783,17 @@ rs6000_init_builtins (void)
   t = build_qualified_type (vector_quad_type_node, TYPE_QUAL_CONST);
   ptr_vector_quad_type_node = build_pointer_type (t);
 
+  dmr_type_node = make_node (OPAQUE_TYPE);
+  SET_TYPE_MODE (dmr_type_node, TDOmode);
+  TYPE_SIZE (dmr_type_node) = bitsize_int (GET_MODE_BITSIZE (TDOmode));
+  TYPE_PRECISION (dmr_type_node) = GET_MODE_BITSIZE (TDOmode);
+  TYPE_SIZE_UNIT (dmr_type_node) = size_int (GET_MODE_SIZE (TDOmode));
+  SET_TYPE_ALIGN (dmr_type_node, 512);
+  TYPE_USER_ALIGN (dmr_type_node) = 0;
+  lang_hooks.types.register_builtin_type (dmr_type_node, "__dmr");
+  t = build_qualified_type (dmr_type_node, TYPE_QUAL_CONST);
+  ptr_dmr_type_node = build_pointer_type (t);
+
   tdecl = add_builtin_type ("__bool char", bool_char_type_node);
   TYPE_NAME (bool_char_type_node) = tdecl;
 
diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
index 8c590903c86..6e2465204cf 100644
--- a/gcc/config/rs6000/rs6000-call.cc
+++ b/gcc/config/rs6000/rs6000-call.cc
@@ -437,7 +437,8 @@ rs6000_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
   if (cfun
       && !cfun->machine->mma_return_type_error
       && TREE_TYPE (cfun->decl) == fntype
-      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode))
+      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode
+	  || TYPE_MODE (type) == TDOmode))
     {
       /* Record we have now handled function CFUN, so the next time we
 	 are called, we do not re-report the same error.  */
@@ -1641,6 +1642,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
       return NULL_RTX;
     }
 
+  if (mode == TDOmode)
+    {
+      if (TYPE_CANONICAL (type) != NULL_TREE)
+	type = TYPE_CANONICAL (type);
+      error ("invalid use of dense math operand of type %qs as a function "
+	     "parameter",
+	     IDENTIFIER_POINTER (DECL_NAME (TYPE_NAME (type))));
+      return NULL_RTX;
+    }
+
   /* Return a marker to indicate whether CR1 needs to set or clear the
      bit that V.4 uses to say fp args were passed in registers.
      Assume that we don't need the marker for software floating point,
diff --git a/gcc/config/rs6000/rs6000-modes.def b/gcc/config/rs6000/rs6000-modes.def
index 094b246c834..60ebb363196 100644
--- a/gcc/config/rs6000/rs6000-modes.def
+++ b/gcc/config/rs6000/rs6000-modes.def
@@ -86,3 +86,7 @@ PARTIAL_INT_MODE (TI, 128, PTI);
 /* Modes used by __vector_pair and __vector_quad.  */
 OPAQUE_MODE (OO, 32);
 OPAQUE_MODE (XO, 64);
+
+/* Modes used by __dmr.  */
+OPAQUE_MODE (TDO, 128);
+
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 59517c8608d..aed4b72c4ea 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1846,7 +1846,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
      128-bit floating point that can go in vector registers, which has VSX
      memory addressing.  */
   if (FP_REGNO_P (regno))
-    reg_size = (VECTOR_MEM_VSX_P (mode) || VECTOR_ALIGNMENT_P (mode)
+    reg_size = (VECTOR_MEM_VSX_P (mode)
+		|| VECTOR_ALIGNMENT_P (mode)
+		|| mode == TDOmode
 		? UNITS_PER_VSX_WORD
 		: UNITS_PER_FP_WORD);
 
@@ -1880,9 +1882,9 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
   /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
      by 4.
 
-     If dense math is enabled, allow all VSX registers plus the DMR registers.
-     We need to make sure we don't cross between the boundary of FPRs and
-     traditional Altiviec registers.  */
+     If dense math is enabled, allow all VSX registers plus the dense math
+     registers.  We need to make sure we don't cross between the boundary of
+     FPRs and traditional Altivec registers.  */
   if (mode == XOmode)
     {
       if (TARGET_MMA && !TARGET_DENSE_MATH)
@@ -1904,7 +1906,27 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
 	return 0;
     }
 
-  /* No other types other than XOmode can go in DMRs.  */
+  /* Dense math register modes need DMR registers or VSX registers divisible by
+     2.  We need to make sure we don't cross between the boundary of FPRs and
+     traditional Altivec registers.  */
+  if (mode == TDOmode)
+    {
+      if (!TARGET_DENSE_MATH)
+	return 0;
+
+      if (DMR_REGNO_P (regno))
+	return 1;
+
+      if (FP_REGNO_P (regno))
+	return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 7);
+
+      if (ALTIVEC_REGNO_P (regno))
+	return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 7);
+
+      return 0;
+    }
+
+  /* No other types other than XOmode or TDOmode can go in DMRs.  */
   if (DMR_REGNO_P (regno))
     return 0;
 
@@ -2012,9 +2034,11 @@ rs6000_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
    GPR registers, and TImode can go in any GPR as well as VSX registers (PR
    57744).
 
-   Similarly, don't allow OOmode (vector pair, restricted to even VSX
-   registers) or XOmode (vector quad, restricted to FPR registers divisible
-   by 4) to tie with other modes.
+   Similarly, don't allow OOmode (vector pair), XOmode (vector quad), or
+   TDOmode (dmr register) to pair with anything else.  Vector pairs are
+   restricted to even/odd VSX registers.  Without dense math, vector quads are
+   limited to FPR registers divisible by 4.  With dense math, vector quads are
+   limited to even VSX registers or DMR registers.
 
    Altivec/VSX vector tests were moved ahead of scalar float mode, so that IEEE
    128-bit floating point on VSX systems ties with other vectors.  */
@@ -2023,7 +2047,8 @@ static bool
 rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2)
 {
   if (mode1 == PTImode || mode1 == OOmode || mode1 == XOmode
-      || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode)
+      || mode1 == TDOmode || mode2 == PTImode || mode2 == OOmode
+      || mode2 == XOmode || mode2 == TDOmode)
     return mode1 == mode2;
 
   if (ALTIVEC_OR_VSX_VECTOR_MODE (mode1))
@@ -2314,6 +2339,7 @@ rs6000_debug_reg_global (void)
     V4DFmode,
     OOmode,
     XOmode,
+    TDOmode,
     CCmode,
     CCUNSmode,
     CCEQmode,
@@ -2679,7 +2705,7 @@ rs6000_setup_reg_addr_masks (void)
 	  /* Special case DMR registers.  */
 	  if (rc == RELOAD_REG_DMR)
 	    {
-	      if (TARGET_DENSE_MATH && m2 == XOmode)
+	      if (TARGET_DENSE_MATH && (m2 == XOmode || m2 == TDOmode))
 		{
 		  addr_mask = RELOAD_REG_VALID;
 		  reg_addr[m].addr_mask[rc] = addr_mask;
@@ -2786,12 +2812,14 @@ rs6000_setup_reg_addr_masks (void)
 
 	  /* Vector pairs can do both indexed and offset loads if the
 	     instructions are enabled, otherwise they can only do offset loads
-	     since it will be broken into two vector moves.  Vector quads can
-	     only do offset loads.  If the user restricted generation of either
-	     of the LXVP or STXVP instructions, do not allow indexed mode so
-	     that we can split the load/store.  */
+	     since it will be broken into two vector moves.  If the user
+	     restricted generation of either of the LXVP or STXVP instructions,
+	     do not allow indexed mode so that we can split the load/store.
+
+	     Vector quads and dense math 1,024 bit registers can only do offset
+	     loads.  */
 	  else if ((addr_mask != 0) && TARGET_MMA
-		   && (m2 == OOmode || m2 == XOmode))
+		   && (m2 == OOmode || m2 == XOmode || m2 == TDOmode))
 	    {
 	      addr_mask |= RELOAD_REG_OFFSET;
 	      if (rc == RELOAD_REG_FPR || rc == RELOAD_REG_VMX)
@@ -3021,6 +3049,14 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
       rs6000_vector_align[XOmode] = 512;
     }
 
+  /* Add support for 1,024 bit DMR registers.  */
+  if (TARGET_DENSE_MATH)
+    {
+      rs6000_vector_unit[TDOmode] = VECTOR_NONE;
+      rs6000_vector_mem[TDOmode] = VECTOR_VSX;
+      rs6000_vector_align[TDOmode] = 512;
+    }
+
   /* Register class constraints for the constraints that depend on compile
      switches. When the VSX code was added, different constraints were added
      based on the type (DFmode, V2DFmode, V4SFmode).  For the vector types, all
@@ -3234,6 +3270,12 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
 	}
     }
 
+  if (TARGET_DENSE_MATH)
+    {
+      reg_addr[TDOmode].reload_load = CODE_FOR_reload_dmr_from_memory;
+      reg_addr[TDOmode].reload_store = CODE_FOR_reload_dmr_to_memory;
+    }
+
   /* Precalculate HARD_REGNO_NREGS.  */
   for (r = 0; HARD_REGISTER_NUM_P (r); ++r)
     for (m = 0; m < NUM_MACHINE_MODES; ++m)
@@ -8800,12 +8842,15 @@ reg_offset_addressing_ok_p (machine_mode mode)
 	return mode_supports_dq_form (mode);
       break;
 
-      /* The vector pair/quad types support offset addressing if the
-	 underlying vectors support offset addressing.  */
+      /* The vector pair/quad types and the dense math types support offset
+	 addressing if the underlying vectors support offset addressing.  */
     case E_OOmode:
     case E_XOmode:
       return TARGET_MMA;
 
+    case E_TDOmode:
+      return TARGET_DENSE_MATH;
+
     case E_SDmode:
       /* If we can do direct load/stores of SDmode, restrict it to reg+reg
 	 addressing for the LFIWZX and STFIWX instructions.  */
@@ -11354,6 +11399,12 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
 	       (mode == OOmode) ? "__vector_pair" : "__vector_quad");
       break;
 
+    case E_TDOmode:
+      if (CONST_INT_P (operands[1]))
+	error ("%qs is an opaque type, and you cannot set it to constants",
+	       "__dmr");
+      break;
+
     case E_SImode:
     case E_DImode:
       /* Use default pattern for address of ELF small data */
@@ -12817,7 +12868,7 @@ rs6000_secondary_reload_simple_move (enum rs6000_reg_type to_type,
 
   /* We can transfer between VSX registers and DMR registers without needing
      extra registers.  */
-  if (TARGET_DENSE_MATH && mode == XOmode
+  if (TARGET_DENSE_MATH && (mode == XOmode || mode == TDOmode)
       && ((to_type == DMR_REG_TYPE && from_type == VSX_REG_TYPE)
 	  || (to_type == VSX_REG_TYPE && from_type == DMR_REG_TYPE)))
     return true;
@@ -13618,6 +13669,9 @@ rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
       if (mode == XOmode)
 	return TARGET_DENSE_MATH ? VSX_REGS : FLOAT_REGS;
 
+      if (mode == TDOmode)
+	return VSX_REGS;
+
       if (GET_MODE_CLASS (mode) == MODE_INT)
 	return GENERAL_REGS;
     }
@@ -13741,8 +13795,9 @@ rs6000_secondary_reload_class (enum reg_class rclass, machine_mode mode,
   else
     regno = -1;
 
-  /* DMR registers don't have loads or stores.  We have to go through the VSX
-     registers to load XOmode (vector quad).  */
+  /* Dense math registers don't have loads or stores.  We have to go through
+     the VSX registers to load XOmode (vector quad) and TDOmode (dmr 1024
+     bit).  */
   if (TARGET_DENSE_MATH && rclass == DM_REGS)
     return VSX_REGS;
 
@@ -20830,6 +20885,8 @@ rs6000_mangle_type (const_tree type)
     return "u13__vector_pair";
   if (type == vector_quad_type_node)
     return "u13__vector_quad";
+  if (type == dmr_type_node)
+    return "u5__dmr";
 
   /* For all other types, use the default mangling.  */
   return NULL;
@@ -22954,6 +23011,10 @@ rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
       if (mode == XOmode)
 	return reg_move_base;
 
+      /* __dmr (i.e. TDOmode) is transferred in 2 instructions.  */
+      else if (mode == TDOmode)
+	return reg_move_base * 2;
+
       else
 	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
     }
@@ -27651,9 +27712,10 @@ rs6000_split_multireg_move (rtx dst, rtx src)
   mode = GET_MODE (dst);
   nregs = hard_regno_nregs (reg, mode);
 
-  /* If we have a vector quad register for MMA, and this is a load or store,
-     see if we can use vector paired load/stores.  */
-  if (mode == XOmode && TARGET_MMA
+  /* If we have a vector quad register for MMA or DMR register for dense math,
+     and this is a load or store, see if we can use vector paired
+     load/stores.  */
+  if ((mode == XOmode || mode == TDOmode) && TARGET_MMA
       && (MEM_P (dst) || MEM_P (src)))
     {
       reg_mode = OOmode;
@@ -27661,7 +27723,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
     }
   /* If we have a vector pair/quad mode, split it into two/four separate
      vectors.  */
-  else if (mode == OOmode || mode == XOmode)
+  else if (mode == OOmode || mode == XOmode || mode == TDOmode)
     reg_mode = V1TImode;
   else if (FP_REGNO_P (reg))
     reg_mode = DECIMAL_FLOAT_MODE_P (mode) ? DDmode :
@@ -27707,13 +27769,13 @@ rs6000_split_multireg_move (rtx dst, rtx src)
       return;
     }
 
-  /* The __vector_pair and __vector_quad modes are multi-register
-     modes, so if we have to load or store the registers, we have to be
-     careful to properly swap them if we're in little endian mode
-     below.  This means the last register gets the first memory
-     location.  We also need to be careful of using the right register
-     numbers if we are splitting XO to OO.  */
-  if (mode == OOmode || mode == XOmode)
+  /* The __vector_pair, __vector_quad, and __dmr modes are multi-register
+     modes, so if we have to load or store the registers, we have to be careful
+     to properly swap them if we're in little endian mode below.  This means
+     the last register gets the first memory location.  We also need to be
+     careful of using the right register numbers if we are splitting XO to
+     OO.  */
+  if (mode == OOmode || mode == XOmode || mode == TDOmode)
     {
       nregs = hard_regno_nregs (reg, mode);
       int reg_mode_nregs = hard_regno_nregs (reg, reg_mode);
@@ -27850,7 +27912,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 	 overlap.  */
       int i;
       /* XO/OO are opaque so cannot use subregs. */
-      if (mode == OOmode || mode == XOmode )
+      if (mode == OOmode || mode == XOmode || mode == TDOmode)
 	{
 	  for (i = nregs - 1; i >= 0; i--)
 	    {
@@ -28024,7 +28086,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 	    continue;
 
 	  /* XO/OO are opaque so cannot use subregs. */
-	  if (mode == OOmode || mode == XOmode )
+	  if (mode == OOmode || mode == XOmode || mode == TDOmode)
 	    {
 	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
 	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
@@ -29006,7 +29068,8 @@ rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
 
   if (frommode != tomode)
     {
-      /* Do not allow conversions to/from XOmode and OOmode types.  */
+      /* Do not allow conversions to/from XOmode, OOmode, and TDOmode
+	 types.  */
       if (frommode == XOmode)
 	return N_("invalid conversion from type %<__vector_quad%>");
       if (tomode == XOmode)
@@ -29015,6 +29078,10 @@ rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
 	return N_("invalid conversion from type %<__vector_pair%>");
       if (tomode == OOmode)
 	return N_("invalid conversion to type %<__vector_pair%>");
+      if (frommode == TDOmode)
+	return N_("invalid conversion from type %<__dmr%>");
+      if (tomode == TDOmode)
+	return N_("invalid conversion to type %<__dmr%>");
     }
 
   /* Conversion allowed.  */
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 22efac4a80c..9711777b5cd 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -1004,7 +1004,8 @@ enum data_align { align_abi, align_opt, align_both };
 /* Modes that are not vectors, but require vector alignment.  Treat these like
    vectors in terms of loads and stores.  */
 #define VECTOR_ALIGNMENT_P(MODE)					\
-  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode)
+  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode	\
+   || (MODE) == TDOmode)
 
 #define ALTIVEC_VECTOR_MODE(MODE)					\
   ((MODE) == V16QImode							\
@@ -2293,6 +2294,7 @@ enum rs6000_builtin_type_index
   RS6000_BTI_const_str,		 /* pointer to const char * */
   RS6000_BTI_vector_pair,	 /* unsigned 256-bit types (vector pair).  */
   RS6000_BTI_vector_quad,	 /* unsigned 512-bit types (vector quad).  */
+  RS6000_BTI_dmr,		 /* unsigned 1,024-bit types (dmr).  */
   RS6000_BTI_const_ptr_void,     /* const pointer to void */
   RS6000_BTI_ptr_V16QI,
   RS6000_BTI_ptr_V1TI,
@@ -2331,6 +2333,7 @@ enum rs6000_builtin_type_index
   RS6000_BTI_ptr_dfloat128,
   RS6000_BTI_ptr_vector_pair,
   RS6000_BTI_ptr_vector_quad,
+  RS6000_BTI_ptr_dmr,
   RS6000_BTI_ptr_long_long,
   RS6000_BTI_ptr_long_long_unsigned,
   RS6000_BTI_MAX
@@ -2388,6 +2391,7 @@ enum rs6000_builtin_type_index
 #define const_str_type_node		 (rs6000_builtin_types[RS6000_BTI_const_str])
 #define vector_pair_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_pair])
 #define vector_quad_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_quad])
+#define dmr_type_node			 (rs6000_builtin_types[RS6000_BTI_dmr])
 #define pcvoid_type_node		 (rs6000_builtin_types[RS6000_BTI_const_ptr_void])
 #define ptr_V16QI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V16QI])
 #define ptr_V1TI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V1TI])
@@ -2426,6 +2430,7 @@ enum rs6000_builtin_type_index
 #define ptr_dfloat128_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dfloat128])
 #define ptr_vector_pair_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_pair])
 #define ptr_vector_quad_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_quad])
+#define ptr_dmr_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dmr])
 #define ptr_long_long_integer_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_long_long])
 #define ptr_long_long_unsigned_type_node (rs6000_builtin_types[RS6000_BTI_ptr_long_long_unsigned])
 
diff --git a/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
new file mode 100644
index 00000000000..0a9884ddf63
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_dense_math_ok } */
+/* { dg-options "-mdejagnu-cpu=future -O2" } */
+
+/* Test basic load/store for __dmr type.  */
+
+#ifndef CONSTRAINT
+#if defined(USE_D)
+#define CONSTRAINT "d"
+
+#elif defined(USE_V)
+#define CONSTRAINT "v"
+
+#elif defined(USE_WA)
+#define CONSTRAINT "wa"
+
+#else
+#define CONSTRAINT "wD"
+#endif
+#endif
+const char constraint[] = CONSTRAINT;
+
+void foo_mem_asm (__dmr *p, __dmr *q)
+{
+  /* 4 LXVP instructions.  */
+  __dmr vq = *p;
+
+  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
+  __asm__ ("# foo (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
+  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
+
+  /* 4 STXVP instructions.  */
+  *q = vq;
+}
+
+void foo_mem_asm2 (__dmr *p, __dmr *q)
+{
+  /* 4 LXVP instructions.  */
+  __dmr vq = *p;
+  __dmr vq2;
+  __dmr vq3;
+
+  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
+  __asm__ ("# foo1 (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
+  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
+
+  vq2 = vq;
+  __asm__ ("# foo2 (wa) %0" : "+wa" (vq2));
+
+  /* 4 STXVP instructions.  */
+  *q = vq2;
+}
+
+void foo_mem (__dmr *p, __dmr *q)
+{
+  /* 4 LXVP, 4 STXVP instructions, no DMR transfer.  */
+  *q = *p;
+}
+
+/* { dg-final { scan-assembler-times {\mdmxxextfdmr512\M}  4 } } */
+/* { dg-final { scan-assembler-times {\mdmxxinstdmr512\M}  4 } } */
+/* { dg-final { scan-assembler-times {\mlxvp\M}           12 } } */
+/* { dg-final { scan-assembler-times {\mstxvp\M}          12 } } */
-- 
2.43.0


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

* Ping [PATCH 1/6] Add -mcpu=future
  2024-01-05 23:35 ` Repost [PATCH 1/6] Add -mcpu=future Michael Meissner
@ 2024-01-19 18:43   ` Michael Meissner
  2024-01-23  8:44   ` Repost " Kewen.Lin
  2024-02-08 20:10   ` Segher Boessenkool
  2 siblings, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-19 18:43 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

Ping

| Date: Fri, 5 Jan 2024 18:35:37 -0500
| From: Michael Meissner <meissner@linux.ibm.com>
| Subject: Repost [PATCH 1/6] Add -mcpu=future
| Message-ID: <ZZiSSeBqMdd64W7V@cowardly-lion.the-meissners.org>

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641961.html

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

* Ping [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair.
  2024-01-05 23:37 ` Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair Michael Meissner
@ 2024-01-19 18:44   ` Michael Meissner
  2024-01-23  8:54   ` Repost " Kewen.Lin
  1 sibling, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-19 18:44 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

Ping

| Date: Fri, 5 Jan 2024 18:37:17 -0500
| From: Michael Meissner <meissner@linux.ibm.com>
| Subject: Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair.
| Message-ID: <ZZiSrcdY46vL40E4@cowardly-lion.the-meissners.org>

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641962.html

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Ping [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.
  2024-01-05 23:38 ` Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers Michael Meissner
@ 2024-01-19 18:46   ` Michael Meissner
  2024-01-25  9:28   ` Repost " Kewen.Lin
  1 sibling, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-19 18:46 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

Ping

| Date: Fri, 5 Jan 2024 18:38:23 -0500
| From: Michael Meissner <meissner@linux.ibm.com>
| Subject: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.
| Message-ID: <ZZiS7-05Y1n48bjk@cowardly-lion.the-meissners.org>

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641963.html

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Ping [PATCH 4/6] PowerPC: Make MMA insns support DMR registers.
  2024-01-05 23:39 ` Repost [PATCH 4/6] PowerPC: Make MMA insns support " Michael Meissner
@ 2024-01-19 18:47   ` Michael Meissner
  2024-02-04  3:21   ` Repost " Kewen.Lin
  1 sibling, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-19 18:47 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

Ping

| Date: Fri, 5 Jan 2024 18:39:55 -0500
| From: Michael Meissner <meissner@linux.ibm.com>
| Subject: Repost [PATCH 4/6] PowerPC: Make MMA insns support DMR registers.
| Message-ID: <ZZiTS0adUUPx7wjY@cowardly-lion.the-meissners.org>

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641964.html

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Ping [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations.
  2024-01-05 23:40 ` Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations Michael Meissner
@ 2024-01-19 18:48   ` Michael Meissner
  2024-02-04  5:47   ` Repost " Kewen.Lin
  1 sibling, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-19 18:48 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

Ping

| Date: Fri, 5 Jan 2024 18:40:58 -0500
| From: Michael Meissner <meissner@linux.ibm.com>
| Subject: Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations.
| Message-ID: <ZZiTiojbYNzVvJEV@cowardly-lion.the-meissners.org>

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641965.html

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Ping [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.
  2024-01-05 23:42 ` Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers Michael Meissner
@ 2024-01-19 18:49   ` Michael Meissner
  2024-02-05  3:58   ` Repost " Kewen.Lin
  1 sibling, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-01-19 18:49 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner

Ping

| Date: Fri, 5 Jan 2024 18:42:02 -0500
| From: Michael Meissner <meissner@linux.ibm.com>
| Subject: Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.
| Message-ID: <ZZiTyrsBFO92FG84@cowardly-lion.the-meissners.org>

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641966.html

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-01-05 23:35 ` Repost [PATCH 1/6] Add -mcpu=future Michael Meissner
  2024-01-19 18:43   ` Ping " Michael Meissner
@ 2024-01-23  8:44   ` Kewen.Lin
  2024-02-06  6:01     ` Michael Meissner
  2024-02-08 20:10   ` Segher Boessenkool
  2 siblings, 1 reply; 36+ messages in thread
From: Kewen.Lin @ 2024-01-23  8:44 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Segher Boessenkool, David Edelsohn, Peter Bergner

Hi Mike,

on 2024/1/6 07:35, Michael Meissner wrote:
> This patch implements support for a potential future PowerPC cpu.  Features
> added with -mcpu=future, may or may not be added to new PowerPC processors.
> 
> This patch adds support for the -mcpu=future option.  If you use -mcpu=future,
> the macro __ARCH_PWR_FUTURE__ is defined, and the assembler .machine directive
> "future" is used.  Future patches in this series will add support for new
> instructions that may be present in future PowerPC processors.
> 
> This particular patch does not add any new features.  It exists as groundwork
> for future patches to support a possible PowerPC processor in the future.
> 
> This patch does not implement any differences in tuning when -mcpu=future is
> used compared to -mcpu=power10.  If -mcpu=future is used, GCC will use power10
> tuning.  If you explicitly use -mtune=future, you will get a warning that
> -mtune=future is not supported, and default tuning will be set for power10.
> 
> The patches have been tested on both little and big endian systems.  Can I check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  <meissner@linux.ibm.com>
> 
> gcc/
> 
> 	* config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
> 	__ARCH_PWR_FUTURE__ if -mcpu=future.
> 	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): New macro.
> 	(POWERPC_MASKS): Add -mcpu=future support.
> 	* config/rs6000/rs6000-opts.h (enum processor_type): Add
> 	PROCESSOR_FUTURE.
> 	* config/rs6000/rs6000-tables.opt: Regenerate.
> 	* config/rs6000/rs6000.cc (rs600_cpu_index_lookup): New helper
> 	function.
> 	(rs6000_option_override_internal): Make -mcpu=future set
> 	-mtune=power10.  If the user explicitly uses -mtune=future, give a
> 	warning and reset the tuning to power10.
> 	(rs6000_option_override_internal): Use power10 costs for future
> 	machine.
> 	(rs6000_machine_from_flags): Add support for -mcpu=future.
> 	(rs6000_opt_masks): Likewise.
> 	* config/rs6000/rs6000.h (ASM_CPU_SUPPORT): Likewise.
> 	* config/rs6000/rs6000.md (cpu attribute): Likewise.
> 	* config/rs6000/rs6000.opt (-mfuture): New undocumented debug switch.
> 	* doc/invoke.texi (IBM RS/6000 and PowerPC Options): Document -mcpu=future.
> ---
>  gcc/config/rs6000/rs6000-c.cc       |  2 +
>  gcc/config/rs6000/rs6000-cpus.def   |  6 +++
>  gcc/config/rs6000/rs6000-opts.h     |  4 +-
>  gcc/config/rs6000/rs6000-tables.opt |  3 ++
>  gcc/config/rs6000/rs6000.cc         | 58 ++++++++++++++++++++++++-----
>  gcc/config/rs6000/rs6000.h          |  1 +
>  gcc/config/rs6000/rs6000.md         |  2 +-
>  gcc/config/rs6000/rs6000.opt        |  4 ++
>  gcc/doc/invoke.texi                 |  2 +-
>  9 files changed, 69 insertions(+), 13 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
> index ce0b14a8d37..f2fb5bef678 100644
> --- a/gcc/config/rs6000/rs6000-c.cc
> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -447,6 +447,8 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT flags)
>      rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR9");
>    if ((flags & OPTION_MASK_POWER10) != 0)
>      rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR10");
> +  if ((flags & OPTION_MASK_FUTURE) != 0)
> +    rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR_FUTURE");
>    if ((flags & OPTION_MASK_SOFT_FLOAT) != 0)
>      rs6000_define_or_undefine_macro (define_p, "_SOFT_FLOAT");
>    if ((flags & OPTION_MASK_RECIP_PRECISION) != 0)
> diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
> index d28cc87eb2a..8754635f3d9 100644
> --- a/gcc/config/rs6000/rs6000-cpus.def
> +++ b/gcc/config/rs6000/rs6000-cpus.def
> @@ -88,6 +88,10 @@
>  				 | OPTION_MASK_POWER10			\
>  				 | OTHER_POWER10_MASKS)
>  
> +/* Flags for a potential future processor that may or may not be delivered.  */
> +#define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
> +				 | OPTION_MASK_FUTURE)
> +

Nit: The name "ISA_FUTURE_MASKS_SERVER" seems more accurate, as it's built
from ISA_3_1_MASKS_**SERVER** ...

>  /* Flags that need to be turned off if -mno-power9-vector.  */
>  #define OTHER_P9_VECTOR_MASKS	(OPTION_MASK_FLOAT128_HW		\
>  				 | OPTION_MASK_P9_MINMAX)
> @@ -135,6 +139,7 @@
>  				 | OPTION_MASK_LOAD_VECTOR_PAIR		\
>  				 | OPTION_MASK_POWER10			\
>  				 | OPTION_MASK_P10_FUSION		\
> +				 | OPTION_MASK_FUTURE			\
>  				 | OPTION_MASK_HTM			\
>  				 | OPTION_MASK_ISEL			\
>  				 | OPTION_MASK_MFCRF			\
> @@ -267,3 +272,4 @@ RS6000_CPU ("powerpc64", PROCESSOR_POWERPC64, OPTION_MASK_PPC_GFXOPT
>  RS6000_CPU ("powerpc64le", PROCESSOR_POWER8, MASK_POWERPC64
>  	    | ISA_2_7_MASKS_SERVER | OPTION_MASK_HTM)
>  RS6000_CPU ("rs64", PROCESSOR_RS64A, OPTION_MASK_PPC_GFXOPT | MASK_POWERPC64)
> +RS6000_CPU ("future", PROCESSOR_FUTURE, MASK_POWERPC64 | ISA_FUTURE_MASKS)

..., and then this needs to be updated accordingly.

> diff --git a/gcc/config/rs6000/rs6000-opts.h b/gcc/config/rs6000/rs6000-opts.h
> index 33fd0efc936..25890ae3034 100644
> --- a/gcc/config/rs6000/rs6000-opts.h
> +++ b/gcc/config/rs6000/rs6000-opts.h
> @@ -67,7 +67,9 @@ enum processor_type
>     PROCESSOR_MPCCORE,
>     PROCESSOR_CELL,
>     PROCESSOR_PPCA2,
> -   PROCESSOR_TITAN
> +   PROCESSOR_TITAN,
> +

Nit: unintentional empty line?

> +   PROCESSOR_FUTURE
>  };
>  
>  
> diff --git a/gcc/config/rs6000/rs6000-tables.opt b/gcc/config/rs6000/rs6000-tables.opt
> index 65f46709716..97fa98a2e65 100644
> --- a/gcc/config/rs6000/rs6000-tables.opt
> +++ b/gcc/config/rs6000/rs6000-tables.opt
> @@ -197,3 +197,6 @@ Enum(rs6000_cpu_opt_value) String(powerpc64le) Value(55)
>  EnumValue
>  Enum(rs6000_cpu_opt_value) String(rs64) Value(56)
>  
> +EnumValue
> +Enum(rs6000_cpu_opt_value) String(future) Value(57)
> +
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 5a7e00b03d1..bc509399cf6 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1809,6 +1809,18 @@ rs6000_cpu_name_lookup (const char *name)
>    return -1;
>  }
>  
> +/* Look up the index for a specific processor.  */
> +
> +static int
> +rs600_cpu_index_lookup (enum processor_type processor)

s/rs600_cpu_index_lookup/rs6000_cpu_index_lookup/

> +{
> +  for (size_t i = 0; i < ARRAY_SIZE (processor_target_table); i++)
> +    if (processor_target_table[i].processor == processor)
> +      return i;
> +
> +  return -1;
> +}

Nit: Since this is always called with a valid enum processor_type, I think it
should never return -1?  If so, it may be clearer to use gcc_unreachable (), or
to start with an index of -1, break on a match, and assert that it's not -1.
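
For illustration, here is a minimal standalone sketch of the second variant
(start at -1, break on a match, assert afterwards).  The enum, the table, and
the function name are stand-ins invented for this example, not the real rs6000
definitions, and plain assert stands in for gcc_assert:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for GCC's enum processor_type.  */
enum ex_processor { EX_POWER9, EX_POWER10, EX_FUTURE };

/* Hypothetical stand-in for processor_target_table.  */
struct ex_cpu_entry
{
  const char *name;
  enum ex_processor processor;
};

static const struct ex_cpu_entry ex_cpu_table[] = {
  { "power9", EX_POWER9 },
  { "power10", EX_POWER10 },
  { "future", EX_FUTURE },
};

#define EX_CPU_TABLE_SIZE (sizeof ex_cpu_table / sizeof ex_cpu_table[0])

/* Return the table index for PROCESSOR.  Every valid enumerator is
   expected to have an entry, so a miss is treated as a bug.  */
static int
ex_cpu_index_lookup (enum ex_processor processor)
{
  int index = -1;

  for (size_t i = 0; i < EX_CPU_TABLE_SIZE; i++)
    if (ex_cpu_table[i].processor == processor)
      {
	index = (int) i;
	break;
      }

  assert (index != -1);
  return index;
}
```

With this shape, callers never need to handle a -1 result.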

> +
>  \f
>  /* Return number of consecutive hard regs needed starting at reg REGNO
>     to hold something of mode MODE.
> @@ -3756,23 +3768,45 @@ rs6000_option_override_internal (bool global_init_p)
>      rs6000_isa_flags &= ~OPTION_MASK_POWERPC64;
>  #endif
>  
> +  /* At the moment, we don't have explict -mtune=future support.  If the user

Nit: s/explict/explicit/

> +     explicitly tried to use -mtune=future, give a warning.  If not, use the

Nit: s/tried/tries/?

> +     power10 tuning until future tuning is added.  */
>    if (rs6000_tune_index >= 0)
> -    tune_index = rs6000_tune_index;
> +    {
> +      enum processor_type cur_proc
> +	= processor_target_table[rs6000_tune_index].processor;
> +
> +      if (cur_proc == PROCESSOR_FUTURE)
> +	{
> +	  static bool issued_future_tune_warning = false;
> +	  if (!issued_future_tune_warning)
> +	    {
> +	      issued_future_tune_warning = true;

This seems to ensure that we only warn once, but I noticed that in rs6000/
only some OPT_Wpsabi-related warnings do it this way.  If we don't restrict
it like this, how many times would it warn for a tiny simple case?

> +	      warning (0, "%qs is not currently supported", "-mtune=future");
> +	    }
> +
> +	  rs6000_tune_index = rs600_cpu_index_lookup (PROCESSOR_POWER10);
> +	}
> +      tune_index = rs6000_tune_index;
> +    }
>    else if (cpu_index >= 0)
> -    rs6000_tune_index = tune_index = cpu_index;
> +    {
> +      enum processor_type cur_cpu
> +	= processor_target_table[cpu_index].processor;
> +
> +      rs6000_tune_index = tune_index
> +	= (cur_cpu == PROCESSOR_FUTURE
> +	   ? rs600_cpu_index_lookup (PROCESSOR_POWER10)

s/rs600_cpu_index_lookup/rs6000_cpu_index_lookup/

> +	   : cpu_index);
> +    }
>    else
>      {
> -      size_t i;
>        enum processor_type tune_proc
>  	= (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
>  
> -      tune_index = -1;
> -      for (i = 0; i < ARRAY_SIZE (processor_target_table); i++)
> -	if (processor_target_table[i].processor == tune_proc)
> -	  {
> -	    tune_index = i;
> -	    break;
> -	  }
> +      tune_index = rs600_cpu_index_lookup (tune_proc == PROCESSOR_FUTURE
> +					   ? PROCESSOR_POWER10
> +					   : tune_proc);

This part looks useless, as tune_proc can never be PROCESSOR_FUTURE.

>      }

Maybe re-structure the above into:

bool explicit_tune = false;
if (rs6000_tune_index >= 0)
  {
    tune_index = rs6000_tune_index;
    explicit_tune = true;
  }
else if (cpu_index >= 0)
  // as before
  rs6000_tune_index = tune_index = cpu_index;
else
  {
   //as before
   ...
  }

// Check tune_index here instead.

if (processor_target_table[tune_index].processor == PROCESSOR_FUTURE)
  {
    tune_index = rs6000_cpu_index_lookup (PROCESSOR_POWER10);
    if (explicit_tune)
      warn ...
  }

// as before
rs6000_tune = processor_target_table[tune_index].processor;
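
The restructuring above can also be sketched as a small standalone function.
This is only a sketch under assumptions: the enum, the table, and the names
prefixed with ex_ are invented for the example, the option values are passed
in as table indices (-1 meaning "not given"), and the warning is reduced to a
flag the caller can act on:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for enum processor_type and
   processor_target_table.  */
enum ex_processor { EX_POWER9, EX_POWER10, EX_FUTURE };

static const enum ex_processor ex_table[]
  = { EX_POWER9, EX_POWER10, EX_FUTURE };

#define EX_TABLE_SIZE (sizeof ex_table / sizeof ex_table[0])

static int
ex_index_lookup (enum ex_processor processor)
{
  for (size_t i = 0; i < EX_TABLE_SIZE; i++)
    if (ex_table[i] == processor)
      return (int) i;

  assert (!"processor missing from table");
  return -1;
}

/* TUNE_OPT and CPU_OPT are table indices, or -1 if the option was not
   given.  *WARN is set only when an explicit tune request for the
   future processor had to be replaced by power10.  */
static int
ex_pick_tune_index (int tune_opt, int cpu_opt, int default_index, bool *warn)
{
  bool explicit_tune = false;
  int tune_index;

  if (tune_opt >= 0)
    {
      tune_index = tune_opt;
      explicit_tune = true;
    }
  else if (cpu_opt >= 0)
    tune_index = cpu_opt;
  else
    tune_index = default_index;

  assert (tune_index >= 0 && (size_t) tune_index < EX_TABLE_SIZE);

  /* Single place where "future" falls back to power10 tuning.  */
  *warn = false;
  if (ex_table[tune_index] == EX_FUTURE)
    {
      tune_index = ex_index_lookup (EX_POWER10);
      *warn = explicit_tune;
    }

  return tune_index;
}
```

The point of this shape is that the fallback and the warning decision live in
one place, after the index has been chosen, rather than in each branch.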

>  
>    if (cpu_index >= 0)
> @@ -4785,6 +4819,7 @@ rs6000_option_override_internal (bool global_init_p)
>  	break;
>  
>        case PROCESSOR_POWER10:
> +      case PROCESSOR_FUTURE:
>  	rs6000_cost = &power10_cost;
>  	break;
>  
> @@ -5944,6 +5979,8 @@ rs6000_machine_from_flags (void)
>    /* Disable the flags that should never influence the .machine selection.  */
>    flags &= ~(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_PPC_GPOPT | OPTION_MASK_ISEL);
>  
> +  if ((flags & (ISA_FUTURE_MASKS & ~ISA_3_1_MASKS_SERVER)) != 0)
> +    return "future";
>    if ((flags & (ISA_3_1_MASKS_SERVER & ~ISA_3_0_MASKS_SERVER)) != 0)
>      return "power10";
>    if ((flags & (ISA_3_0_MASKS_SERVER & ~ISA_2_7_MASKS_SERVER)) != 0)
> @@ -24500,6 +24537,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
>    { "float128-hardware",	OPTION_MASK_FLOAT128_HW,	false, true  },
>    { "fprnd",			OPTION_MASK_FPRND,		false, true  },
>    { "power10",			OPTION_MASK_POWER10,		false, true  },
> +  { "future",			OPTION_MASK_FUTURE,		false, true  },
>    { "hard-dfp",			OPTION_MASK_DFP,		false, true  },
>    { "htm",			OPTION_MASK_HTM,		false, true  },
>    { "isel",			OPTION_MASK_ISEL,		false, true  },
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 2291fe8d3a3..43209f9a6e7 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -163,6 +163,7 @@
>    mcpu=e5500: -me5500; \
>    mcpu=e6500: -me6500; \
>    mcpu=titan: -mtitan; \
> +  mcpu=future: -mfuture; \
>    !mcpu*: %{mpower9-vector: -mpower9; \
>  	    mpower8-vector|mcrypto|mdirect-move|mhtm: -mpower8; \
>  	    mvsx: -mpower7; \

I think we should also update asm_names in driver-rs6000.cc.

The others look good to me, thanks!

BR,
Kewen

> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index 969d34b69e6..a125fd8fc99 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -351,7 +351,7 @@ (define_attr "cpu"
>     ppc403,ppc405,ppc440,ppc476,
>     ppc8540,ppc8548,ppce300c2,ppce300c3,ppce500mc,ppce500mc64,ppce5500,ppce6500,
>     power4,power5,power6,power7,power8,power9,power10,
> -   rs64a,mpccore,cell,ppca2,titan"
> +   rs64a,mpccore,cell,ppca2,titan,future"
>    (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
>  
>  ;; The ISA we implement.
> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
> index 60b923f5e4b..775ba830eac 100644
> --- a/gcc/config/rs6000/rs6000.opt
> +++ b/gcc/config/rs6000/rs6000.opt
> @@ -628,6 +628,10 @@ mieee128-constant
>  Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
>  Generate (do not generate) code that uses the LXVKQ instruction.
>  
> +mfuture
> +Target Undocumented Mask(FUTURE) Var(rs6000_isa_flags)
> +Generate (do not generate) future instructions.
> +
>  ; Documented parameters
>  
>  -param=rs6000-vect-unroll-limit=
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index d71583853f0..0e817ee923a 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -30423,7 +30423,7 @@ Supported values for @var{cpu_type} are @samp{401}, @samp{403},
>  @samp{titan}, @samp{power3}, @samp{power4}, @samp{power5}, @samp{power5+},
>  @samp{power6}, @samp{power6x}, @samp{power7}, @samp{power8},
>  @samp{power9}, @samp{power10}, @samp{powerpc}, @samp{powerpc64},
> -@samp{powerpc64le}, @samp{rs64}, and @samp{native}.
> +@samp{powerpc64le}, @samp{rs64}, @samp{future}, and @samp{native}.
>  
>  @option{-mcpu=powerpc}, @option{-mcpu=powerpc64}, and
>  @option{-mcpu=powerpc64le} specify pure 32-bit PowerPC (either



* Re: Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair.
  2024-01-05 23:37 ` Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair Michael Meissner
  2024-01-19 18:44   ` Ping " Michael Meissner
@ 2024-01-23  8:54   ` Kewen.Lin
  1 sibling, 0 replies; 36+ messages in thread
From: Kewen.Lin @ 2024-01-23  8:54 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Segher Boessenkool, David Edelsohn, Peter Bergner

on 2024/1/6 07:37, Michael Meissner wrote:
> This patch re-enables generating load and store vector pair instructions when
> doing certain memory copy operations when -mcpu=future is used.
> 
> During power10 development, it was determined that using store vector pair
> instructions was problematic in a few cases, so we disabled generating load
> and store vector pair instructions for memory operations by default.  This patch
> re-enables generating these instructions if -mcpu=future is used.
> 
> The patches have been tested on both little and big endian systems.  Can I check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  <meissner@linux.ibm.com>
> 
> gcc/
> 
> 	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add
> 	-mblock-ops-vector-pair.

Nit: s/-mblock-ops-vector-pair/OPTION_MASK_BLOCK_OPS_VECTOR_PAIR/

> 	(POWERPC_MASKS): Likewise.
> ---
>  gcc/config/rs6000/rs6000-cpus.def | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
> index 8754635f3d9..b6cd6d8cc84 100644
> --- a/gcc/config/rs6000/rs6000-cpus.def
> +++ b/gcc/config/rs6000/rs6000-cpus.def
> @@ -90,6 +90,7 @@
>  
>  /* Flags for a potential future processor that may or may not be delivered.  */
>  #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
> +				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
>  				 | OPTION_MASK_FUTURE)


OK with incorporating change s/ISA_FUTURE_MASKS/ISA_FUTURE_MASKS_SERVER/.  Thanks!

BR,
Kewen

>  
>  /* Flags that need to be turned off if -mno-power9-vector.  */
> @@ -127,6 +128,7 @@
>  
>  /* Mask of all options to set the default isa flags based on -mcpu=<xxx>.  */
>  #define POWERPC_MASKS		(OPTION_MASK_ALTIVEC			\
> +				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
>  				 | OPTION_MASK_CMPB			\
>  				 | OPTION_MASK_CRYPTO			\
>  				 | OPTION_MASK_DFP			\



* Re: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.
  2024-01-05 23:38 ` Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers Michael Meissner
  2024-01-19 18:46   ` Ping " Michael Meissner
@ 2024-01-25  9:28   ` Kewen.Lin
  2024-02-07  0:06     ` Michael Meissner
  1 sibling, 1 reply; 36+ messages in thread
From: Kewen.Lin @ 2024-01-25  9:28 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Segher Boessenkool, David Edelsohn, Peter Bergner

Hi Mike,

on 2024/1/6 07:38, Michael Meissner wrote:
> The MMA subsystem added the notion of accumulator registers as an optional
> feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
> the traditional floating point registers 0..31, but logically the accumulator
> registers were separate from the FPR registers.  In ISA 3.1, it was anticipated

Using "VSX registers 0..31" rather than "traditional floating point registers
0..31" seems clearer, since floating point registers imply 64-bit registers.

> that in future systems, the accumulator registers may not overlap with the FPR
> registers.  This patch adds the support for dense math registers as separate
> registers.
> 
> This particular patch does not change the MMA support to use the accumulators
> within the dense math registers.  This patch just adds the basic support for
> having separate DMRs.  The next patch will switch the MMA support to use the
> accumulators if -mcpu=future is used.
> 
> For testing purposes, I added an undocumented option '-mdense-math' to enable
> or disable the dense math support.

Can we avoid this and use one macro for it instead?  As you might have noticed,
some previous temporary options like -mpower{8,9}-vector cause ICEs due to
unexpected combinations, and we are going to neuter them, so let's try our best
to avoid adding another one if possible.  I guess one macro TARGET_DENSE_MATH,
defined as TARGET_FUTURE && TARGET_MMA, matches all the places it is used?
Specifying -mcpu=future would then enable it, and -mcpu=power10 would disable
it.
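
A rough standalone sketch of that idea follows.  The mask values and the EX_
names are made up for the example and are not the real rs6000 option masks;
only the "FUTURE && MMA" shape is taken from the suggestion above:

```c
#include <assert.h>

/* Hypothetical flag bits standing in for the real ISA option masks.  */
#define EX_MASK_MMA	0x1u
#define EX_MASK_FUTURE	0x2u

#define EX_TARGET_MMA(flags)	(((flags) & EX_MASK_MMA) != 0)
#define EX_TARGET_FUTURE(flags)	(((flags) & EX_MASK_FUTURE) != 0)

/* Derived macro: no separate user-visible switch is needed.  */
#define EX_TARGET_DENSE_MATH(flags) \
  (EX_TARGET_FUTURE (flags) && EX_TARGET_MMA (flags))

/* Small usage example covering the three interesting combinations.  */
static void
ex_dense_math_checks (void)
{
  const unsigned power10 = EX_MASK_MMA;
  const unsigned future = EX_MASK_MMA | EX_MASK_FUTURE;
  const unsigned future_no_mma = EX_MASK_FUTURE;

  assert (!EX_TARGET_DENSE_MATH (power10));
  assert (EX_TARGET_DENSE_MATH (future));
  assert (!EX_TARGET_DENSE_MATH (future_no_mma));
}
```

Because the macro is derived, there is no way to request dense math without
the two features it depends on.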

> 
> This patch adds a new constraint (wD).  If MMA is selected but dense math is
> not selected (i.e. -mcpu=power10), the wD constraint will allow access to
> accumulators that overlap with the VSX vector registers 0..31.  If both MMA and

Sorry for nitpicking, but "VSX registers 0..31" is more accurate.

> dense math are selected (i.e. -mcpu=future), the wD constraint will only allow
> dense math registers.
> 
> This patch modifies the existing %A output modifier.  If MMA is selected but
> dense math is not selected, then %A output modifier converts the VSX register
> number to the accumulator number, by dividing it by 4.  If both MMA and dense
> math are selected, then %A will map the separate DMR registers into 0..7.
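
To make the %A mapping described above concrete, here is a small standalone
sketch.  The value of EX_FIRST_DMR_REGNO is a placeholder chosen for the
example (the real FIRST_DMR_REGNO is defined by the patch), and the
multiple-of-4 check in the non-dense-math case is an assumption based on each
accumulator overlapping four adjacent registers:

```c
#include <assert.h>
#include <stdbool.h>

#define EX_FIRST_DMR_REGNO 100	/* Placeholder, not the real value.  */
#define EX_NUM_ACCUMULATORS 8

/* Map a register number to the accumulator number 0..7 that %A would
   print.  Without dense math, REGNO is a VSX register number 0..31 and
   the accumulator number is REGNO / 4.  With dense math, REGNO is a
   DMR register number and the result is its offset from the first
   DMR.  */
static int
ex_accumulator_number (int regno, bool dense_math)
{
  int acc;

  if (dense_math)
    acc = regno - EX_FIRST_DMR_REGNO;
  else
    {
      /* Assumed: an accumulator starts at a register number that is a
	 multiple of 4.  */
      assert (regno % 4 == 0);
      acc = regno / 4;
    }

  assert (acc >= 0 && acc < EX_NUM_ACCUMULATORS);
  return acc;
}
```

Either way the assembler sees an accumulator number in 0..7, which is why the
same extended asm template can serve both configurations.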
> 
> The intention is that user code using extended asm can be modified to run on
> both MMA without dense math and MMA with dense math:
> 
>     1)	If possible, don't use extended asm, but instead use the MMA built-in
> 	functions;
> 
>     2)	If you do need to write extended asm, change the d constraints
> 	targeting accumulators to use wD;
> 
>     3)	Only use the built-in zero, assemble and disassemble functions to
> 	move data between vector quad types and dense math accumulators.
> 	I.e. do not use the xxmfacc, xxmtacc, and xxsetaccz directly in the
> 	extended asm code.  The reason is these instructions assume there is a
> 	1-to-1 correspondence between 4 adjacent FPR registers and an
> 	accumulator that overlaps with those instructions.  With accumulators
> 	now being separate registers, there no longer is a 1-to-1
> 	correspondence.
> 
> It is possible that the mangling for DMRs and the GDB register numbers may
> change in the future.
> 
> 2024-01-05   Michael Meissner  <meissner@linux.ibm.com>
> 
> gcc/
> 
> 	* config/rs6000/constraints.md (wD constraint): New constraint.
> 	* config/rs6000/mma.md (UNSPEC_DM_ASSEMBLE_ACC): New unspec.
> 	(movxo): Convert into define_expand.
> 	(movxo_vsx): Version of movxo where accumulators overlap with VSX vector
> 	registers 0..31.
> 	(movxo_dm): Version of movxo that supports separate dense math
> 	accumulators.
> 	(mma_assemble_acc): Add dense math support to define_expand.
> 	(mma_assemble_acc_vsx): Rename from mma_assemble_acc, and restrict it to
> 	non dense math systems.
> 	(mma_assemble_acc_dm): Dense math version of mma_assemble_acc.
> 	(mma_disassemble_acc): Add dense math support to define_expand.
> 	(mma_disassemble_acc_vsx): Rename from mma_disassemble_acc, and restrict
> 	it to non dense math systems.
> 	(mma_disassemble_acc_dm): Dense math version of mma_disassemble_acc.
> 	* config/rs6000/predicates.md (dmr_operand): New predicate.
> 	(accumulator_operand): Likewise.
> 	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add -mdense-math.
> 	(POWERPC_MASKS): Likewise.
> 	* config/rs6000/rs6000.cc (enum rs6000_reg_type): Add DMR_REG_TYPE.
> 	(enum rs6000_reload_reg_type): Add RELOAD_REG_DMR.
> 	(LAST_RELOAD_REG_CLASS): Add support for DMR registers and the wD
> 	constraint.
> 	(reload_reg_map): Likewise.
> 	(rs6000_reg_names): Likewise.
> 	(alt_reg_names): Likewise.
> 	(rs6000_hard_regno_nregs_internal): Likewise.
> 	(rs6000_hard_regno_mode_ok_uncached): Likewise.
> 	(rs6000_debug_reg_global): Likewise.
> 	(rs6000_setup_reg_addr_masks): Likewise.
> 	(rs6000_init_hard_regno_mode_ok): Likewise.
> 	(rs6000_option_override_internal): Add checking for -mdense-math.
> 	(rs6000_secondary_reload_memory): Add support for DMR registers.
> 	(rs6000_secondary_reload_simple_move): Likewise.
> 	(rs6000_preferred_reload_class): Likewise.
> 	(rs6000_secondary_reload_class): Likewise.
> 	(print_operand): Make %A handle both FPRs and DMRs.
> 	(rs6000_dmr_register_move_cost): New helper function.
> 	(rs6000_register_move_cost): Add support for DMR registers.
> 	(rs6000_memory_move_cost): Likewise.
> 	(rs6000_compute_pressure_classes): Likewise.
> 	(rs6000_debugger_regno): Likewise.
> 	(rs6000_opt_masks): Add -mdense-math.
> 	(rs6000_split_multireg_move): Add support for DMRs.
> 	* config/rs6000/rs6000.h (UNITS_PER_DMR_WORD): New macro.
> 	(FIRST_PSEUDO_REGISTER): Update for DMRs.
> 	(FIXED_REGISTERS): Add DMRs.
> 	(CALL_REALLY_USED_REGISTERS): Likewise.
> 	(REG_ALLOC_ORDER): Likewise.
> 	(enum reg_class): Add DM_REGS.
> 	(REG_CLASS_NAMES): Likewise.
> 	(REG_CLASS_CONTENTS): Likewise.
> 	* config/rs6000/rs6000.md (FIRST_DMR_REGNO): New constant.
> 	(LAST_DMR_REGNO): Likewise.
> 	(isa attribute): Add 'dm' and 'not_dm' attributes.
> 	(enabled attribute): Support 'dm' and 'not_dm' attributes.
> 	* config/rs6000/rs6000.opt (-mdense-math): New switch.
> 	* doc/md.texi (PowerPC constraints): Document wD constraint.
> ---
>  gcc/config/rs6000/constraints.md  |   3 +
>  gcc/config/rs6000/mma.md          | 115 ++++++++++++------
>  gcc/config/rs6000/predicates.md   |  32 +++++
>  gcc/config/rs6000/rs6000-cpus.def |   2 +
>  gcc/config/rs6000/rs6000.cc       | 189 ++++++++++++++++++++++++++----
>  gcc/config/rs6000/rs6000.h        |  38 +++++-
>  gcc/config/rs6000/rs6000.md       |  12 +-
>  gcc/config/rs6000/rs6000.opt      |   4 +
>  gcc/doc/md.texi                   |   7 ++
>  9 files changed, 343 insertions(+), 59 deletions(-)
> 
> diff --git a/gcc/config/rs6000/constraints.md b/gcc/config/rs6000/constraints.md
> index c99997bf82b..614e431c085 100644
> --- a/gcc/config/rs6000/constraints.md
> +++ b/gcc/config/rs6000/constraints.md
> @@ -107,6 +107,9 @@ (define_constraint "wB"
>         (match_test "TARGET_P8_VECTOR")
>         (match_operand 0 "s5bit_cint_operand")))
>  
> +(define_register_constraint "wD" "rs6000_constraints[RS6000_CONSTRAINT_wD]"
> +  "Accumulator register.")
> +
>  (define_constraint "wE"
>    "@internal Vector constant that can be loaded with the XXSPLTIB instruction."
>    (match_test "xxspltib_constant_nosplit (op, mode)"))
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index 6a7d8a836db..bb898919ab5 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -91,6 +91,7 @@ (define_c_enum "unspec"
>     UNSPEC_MMA_XVI8GER4SPP
>     UNSPEC_MMA_XXMFACC
>     UNSPEC_MMA_XXMTACC
> +   UNSPEC_DM_ASSEMBLE_ACC

The other UNSPEC.*ASSEMBLE names, like UNSPECV_MMA_ASSEMBLE, don't have an _ACC
suffix; it's better to stay consistent unless the suffix distinguishes something.

>    ])
>  
>  (define_c_enum "unspecv"
> @@ -321,7 +322,9 @@ (define_insn_and_split "*movoo"
>     (set_attr "length" "*,8,*,8,8")
>     (set_attr "isa" "lxvp,*,stxvp,*,*")])
>  \f
> -;; Vector quad support.  XOmode can only live in FPRs.
> +;; Vector quad support.  Under the original MMA, XOmode can only live in VSX
> +;; vector registers 0..31.  With dense math, XOmode can live in either VSX

Nit: s/vector//

> +;; registers (0..63) or DMR registers.
>  (define_expand "movxo"
>    [(set (match_operand:XO 0 "nonimmediate_operand")
>  	(match_operand:XO 1 "input_operand"))]
> @@ -346,10 +349,10 @@ (define_expand "movxo"
>      gcc_assert (false);
>  })
>  
> -(define_insn_and_split "*movxo"
> +(define_insn_and_split "*movxo_nodm"
>    [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
>  	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
> -  "TARGET_MMA
> +  "TARGET_MMA && !TARGET_DENSE_MATH
>     && (gpc_reg_operand (operands[0], XOmode)
>         || gpc_reg_operand (operands[1], XOmode))"
>    "@
> @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
>     (set_attr "length" "*,*,16")
>     (set_attr "max_prefixed_insns" "2,2,*")])
>  
> +(define_insn_and_split "*movxo_dm"
> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
> +	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]

Why not adopt ZwO rather than QwO?

> +  "TARGET_DENSE_MATH
> +   && (gpc_reg_operand (operands[0], XOmode)
> +       || gpc_reg_operand (operands[1], XOmode))"
> +  "@
> +   #
> +   #
> +   #
> +   dmxxinstdmr512 %0,%1,%Y1,0
> +   dmmr %0,%1
> +   dmxxextfdmr512 %0,%Y0,%1,0"
> +  "&& reload_completed
> +   && !dmr_operand (operands[0], XOmode)
> +   && !dmr_operand (operands[1], XOmode)"
> +  [(const_int 0)]
> +{
> +  rs6000_split_multireg_move (operands[0], operands[1]);
> +  DONE;
> +}
> +  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
> +   (set_attr "length" "*,*,16,*,*,*")
> +   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
> +
>  (define_expand "vsx_assemble_pair"
>    [(match_operand:OO 0 "vsx_register_operand")
>     (match_operand:V16QI 1 "mma_assemble_input_operand")
> @@ -433,25 +461,38 @@ (define_insn_and_split "*vsx_disassemble_pair"
>  })
>  
>  (define_expand "mma_assemble_acc"
> -  [(match_operand:XO 0 "fpr_reg_operand")
> +  [(match_operand:XO 0 "register_operand")

Maybe use the newly introduced accumulator_operand?

>     (match_operand:V16QI 1 "mma_assemble_input_operand")
>     (match_operand:V16QI 2 "mma_assemble_input_operand")
>     (match_operand:V16QI 3 "mma_assemble_input_operand")
>     (match_operand:V16QI 4 "mma_assemble_input_operand")]
>    "TARGET_MMA"
>  {
> -  rtx src = gen_rtx_UNSPEC_VOLATILE (XOmode,
> -			    	     gen_rtvec (4, operands[1], operands[2],
> -				       		operands[3], operands[4]),
> -			    	     UNSPECV_MMA_ASSEMBLE);
> -  emit_move_insn (operands[0], src);
> +  rtx op0 = operands[0];
> +  rtx op1 = operands[1];
> +  rtx op2 = operands[2];
> +  rtx op3 = operands[3];
> +  rtx op4 = operands[4];
> +
> +  if (TARGET_DENSE_MATH)
> +    {
> +      rtx vpair1 = gen_reg_rtx (OOmode);
> +      rtx vpair2 = gen_reg_rtx (OOmode);
> +      emit_insn (gen_vsx_assemble_pair (vpair1, op1, op2));
> +      emit_insn (gen_vsx_assemble_pair (vpair2, op3, op4));
> +      emit_insn (gen_mma_assemble_acc_dm (op0, vpair1, vpair2));
> +    }
> +
> +  else
> +    emit_insn (gen_mma_assemble_acc_vsx (op0, op1, op2, op3, op4));
> +
>    DONE;
>  })
>  
>  ;; We cannot update the four output registers atomically, so mark the output
> -;; as an early clobber so we don't accidentally clobber the input operands.  */
> +;; as an early clobber so we don't accidentally clobber the input operands.
>  
> -(define_insn_and_split "*mma_assemble_acc"
> +(define_insn_and_split "mma_assemble_acc_vsx"

Nit: since we use "*_nodm" above, it seems better to name this
"mma_assemble_acc_nodm" to keep the same style.

>    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
>  	(unspec_volatile:XO
>  	  [(match_operand:V16QI 1 "mma_assemble_input_operand" "mwa")
> @@ -459,7 +500,7 @@ (define_insn_and_split "*mma_assemble_acc"
>  	   (match_operand:V16QI 3 "mma_assemble_input_operand" "mwa")
>  	   (match_operand:V16QI 4 "mma_assemble_input_operand" "mwa")]
>  	  UNSPECV_MMA_ASSEMBLE))]
> -  "TARGET_MMA
> +  "TARGET_MMA && !TARGET_DENSE_MATH
>     && fpr_reg_operand (operands[0], XOmode)"
>    "#"
>    "&& reload_completed"
> @@ -473,28 +514,31 @@ (define_insn_and_split "*mma_assemble_acc"
>    DONE;
>  })
>  
> +;; On a system with dense math, we build the accumulators from two vector
> +;; pairs.
> +
> +(define_insn "mma_assemble_acc_dm"
> + [(set (match_operand:XO 0 "dmr_operand" "=wD")
> +       (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
> +		   (match_operand:OO 2 "vsx_register_operand" "wa")]
> +		  UNSPEC_DM_ASSEMBLE_ACC))]
> + "TARGET_MMA && TARGET_DENSE_MATH"

Nit: the TARGET_MMA check is redundant here.

> + "dmxxinstdmr512 %0,%1,%2,0"
> + [(set_attr "type" "mma")])
> +
>  (define_expand "mma_disassemble_acc"
> -  [(match_operand:V16QI 0 "mma_disassemble_output_operand")
> -   (match_operand:XO 1 "fpr_reg_operand")
> -   (match_operand 2 "const_0_to_3_operand")]
> -  "TARGET_MMA"
> -{
> -  rtx src;
> -  int regoff = INTVAL (operands[2]);
> -  src = gen_rtx_UNSPEC (V16QImode,
> -			gen_rtvec (2, operands[1], GEN_INT (regoff)),
> -			UNSPEC_MMA_EXTRACT);
> -  emit_move_insn (operands[0], src);
> -  DONE;
> -})
> +  [(set (match_operand:V16QI 0 "register_operand")
> +	(unspec:V16QI [(match_operand:XO 1 "register_operand")

s/register_operand/accumulator_operand/?

> +		       (match_operand 2 "const_0_to_3_operand")]
> +		      UNSPEC_MMA_EXTRACT))]
> +  "TARGET_MMA")
>  
> -(define_insn_and_split "*mma_disassemble_acc"
> +(define_insn_and_split "*mma_disassemble_acc_vsx"
>    [(set (match_operand:V16QI 0 "mma_disassemble_output_operand" "=mwa")
> -       (unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
> -		      (match_operand 2 "const_0_to_3_operand")]
> +	(unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
> +		       (match_operand 2 "const_0_to_3_operand")]
>  		      UNSPEC_MMA_EXTRACT))]
> -  "TARGET_MMA
> -   && fpr_reg_operand (operands[1], XOmode)"
> +  "TARGET_MMA"

Do we still expect to see this pattern if TARGET_DENSE_MATH?
If not, we should guard the condition with !TARGET_DENSE_MATH.

>    "#"
>    "&& reload_completed"
>    [(const_int 0)]
> @@ -506,9 +550,14 @@ (define_insn_and_split "*mma_disassemble_acc"
>    DONE;
>  })
>  
> -;; MMA instructions that do not use their accumulators as an input, still
> -;; must not allow their vector operands to overlap the registers used by
> -;; the accumulator.  We enforce this by marking the output as early clobber.
> +(define_insn "*mma_disassemble_acc_dm"
> +  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
> +	(unspec:V16QI [(match_operand:XO 1 "dmr_operand" "wD")
> +		       (match_operand 2 "const_0_to_3_operand")]
> +		      UNSPEC_MMA_EXTRACT))]
> +  "TARGET_DENSE_MATH"
> +  "dmxxextfdmr256 %0,%1,2"
> +  [(set_attr "type" "mma")])
>  
>  (define_insn "mma_<acc>"
>    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
> index d23ce9a77a3..3040dcd50a3 100644
> --- a/gcc/config/rs6000/predicates.md
> +++ b/gcc/config/rs6000/predicates.md
> @@ -186,6 +186,38 @@ (define_predicate "vlogical_operand"
>    return VLOGICAL_REGNO_P (REGNO (op));
>  })
>  
> +;; Return 1 if op is a DMR register
> +(define_predicate "dmr_operand"
> +  (match_operand 0 "register_operand")
> +{
> +  if (!REG_P (op))
> +    return 0;
> +
> +  if (!HARD_REGISTER_P (op))
> +    return 1;
> +
> +  return DMR_REGNO_P (REGNO (op));
> +})
> +
> +;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
> +;; overlap with the FPRs, while on systems with dense math, the accumulators
> +;; are separate dense math registers and do not overlap with the FPR
> +;; registers..

Nit: an unexpected "."?

> +(define_predicate "accumulator_operand"
> +  (match_operand 0 "register_operand")
> +{

fpr_reg_operand also checks for subregs; should we check for them here too?

> +  if (!REG_P (op))
> +    return 0;
> +
> +  if (!HARD_REGISTER_P (op))
> +    return 1;
> +
> +  int r = REGNO (op);
> +  return (TARGET_DENSE_MATH
> +	  ? DMR_REGNO_P (r)
> +	  : FP_REGNO_P (r) && (r & 3) == 0);
> +})
> +
>  ;; Return 1 if op is the carry register.
>  (define_predicate "ca_operand"
>    (match_operand 0 "register_operand")
> diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
> index b6cd6d8cc84..4621b97b522 100644
> --- a/gcc/config/rs6000/rs6000-cpus.def
> +++ b/gcc/config/rs6000/rs6000-cpus.def
> @@ -91,6 +91,7 @@
>  /* Flags for a potential future processor that may or may not be delivered.  */
>  #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
>  				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
> +				 | OPTION_MASK_DENSE_MATH		\
>  				 | OPTION_MASK_FUTURE)
>  
>  /* Flags that need to be turned off if -mno-power9-vector.  */
> @@ -134,6 +135,7 @@
>  				 | OPTION_MASK_DFP			\
>  				 | OPTION_MASK_DIRECT_MOVE		\
>  				 | OPTION_MASK_DLMZB			\
> +				 | OPTION_MASK_DENSE_MATH		\
>  				 | OPTION_MASK_EFFICIENT_UNALIGNED_VSX	\
>  				 | OPTION_MASK_FLOAT128_HW		\
>  				 | OPTION_MASK_FLOAT128_KEYWORD		\
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index bc509399cf6..83e32f7a43a 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -290,7 +290,8 @@ enum rs6000_reg_type {
>    ALTIVEC_REG_TYPE,
>    FPR_REG_TYPE,
>    SPR_REG_TYPE,
> -  CR_REG_TYPE
> +  CR_REG_TYPE,
> +  DMR_REG_TYPE
>  };
>  
>  /* Map register class to register type.  */
> @@ -304,22 +305,23 @@ static enum rs6000_reg_type reg_class_to_reg_type[N_REG_CLASSES];
>  
>  
>  /* Register classes we care about in secondary reload or go if legitimate
> -   address.  We only need to worry about GPR, FPR, and Altivec registers here,
> -   along an ANY field that is the OR of the 3 register classes.  */
> +   address.  We only need to worry about GPR, FPR, Altivec, and DMR registers
> +   here, along an ANY field that is the OR of the 4 register classes.  */
>  
>  enum rs6000_reload_reg_type {
>    RELOAD_REG_GPR,			/* General purpose registers.  */
>    RELOAD_REG_FPR,			/* Traditional floating point regs.  */
>    RELOAD_REG_VMX,			/* Altivec (VMX) registers.  */
> -  RELOAD_REG_ANY,			/* OR of GPR, FPR, Altivec masks.  */
> +  RELOAD_REG_DMR,			/* DMR registers.  */
> +  RELOAD_REG_ANY,			/* OR of GPR/FPR/VMX/DMR masks.  */
>    N_RELOAD_REG
>  };
>  
> -/* For setting up register classes, loop through the 3 register classes mapping
> +/* For setting up register classes, loop through the 4 register classes mapping
>     into real registers, and skip the ANY class, which is just an OR of the
>     bits.  */
>  #define FIRST_RELOAD_REG_CLASS	RELOAD_REG_GPR
> -#define LAST_RELOAD_REG_CLASS	RELOAD_REG_VMX
> +#define LAST_RELOAD_REG_CLASS	RELOAD_REG_DMR
>  
>  /* Map reload register type to a register in the register class.  */
>  struct reload_reg_map_type {
> @@ -331,6 +333,7 @@ static const struct reload_reg_map_type reload_reg_map[N_RELOAD_REG] = {
>    { "Gpr",	FIRST_GPR_REGNO },	/* RELOAD_REG_GPR.  */
>    { "Fpr",	FIRST_FPR_REGNO },	/* RELOAD_REG_FPR.  */
>    { "VMX",	FIRST_ALTIVEC_REGNO },	/* RELOAD_REG_VMX.  */
> +  { "DMR",	FIRST_DMR_REGNO },	/* RELOAD_REG_DMR.  */
>    { "Any",	-1 },			/* RELOAD_REG_ANY.  */
>  };
>  
> @@ -1224,6 +1227,8 @@ char rs6000_reg_names[][8] =
>        "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",
>    /* vrsave vscr sfp */
>        "vrsave", "vscr", "sfp",
> +  /* DMRs */
> +      "0", "1", "2", "3", "4", "5", "6", "7",
>  };
>  
>  #ifdef TARGET_REGNAMES
> @@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
>    "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
>    /* vrsave vscr sfp */
>    "vrsave", "vscr", "sfp",
> +  /* DMRs */
> +  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",

This should be without the "r": in my testing, gas doesn't recognize %dmr0 but
it does recognize %dm0.

>  };
>  #endif
>  
> @@ -1846,6 +1853,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
>    else if (ALTIVEC_REGNO_P (regno))
>      reg_size = UNITS_PER_ALTIVEC_WORD;
>  
> +  else if (DMR_REGNO_P (regno))
> +    reg_size = UNITS_PER_DMR_WORD;
> +
>    else
>      reg_size = UNITS_PER_WORD;
>  
> @@ -1867,9 +1877,36 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
>    if (mode == OOmode)
>      return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
>  
> -  /* MMA accumulator modes need FPR registers divisible by 4.  */
> +  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
> +     by 4.
> +
> +     If dense math is enabled, allow all VSX registers plus the DMR registers.
> +     We need to make sure we don't cross between the boundary of FPRs and
> +     traditional Altiviec registers.  */
>    if (mode == XOmode)
> -    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
> +    {
> +      if (TARGET_MMA && !TARGET_DENSE_MATH)
> +	return (FP_REGNO_P (regno) && (regno & 3) == 0);
> +
> +      else if (TARGET_DENSE_MATH)
> +	{
> +	  if (DMR_REGNO_P (regno))
> +	    return 1;
> +
> +	  if (FP_REGNO_P (regno))
> +	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
> +
> +	  if (ALTIVEC_REGNO_P (regno))
> +	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
> +	}

I may be missing something, but I couldn't find which section of the RFC states
this restriction.  Could you please point it out for me?  Thanks!
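
To make the question concrete, here is a minimal standalone sketch of what the
quoted XOmode checks accept.  It is not the GCC code: xo_mode_ok_sketch and
regno_in_range are made-up names, the DMR range (111..118) is taken from the
rs6000.md hunk in this patch, and the FPR (32..63) and Altivec (64..95) ranges
are assumptions about the rs6000 hard register numbering.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed hard register numbering.  The DMR range comes from the patch's
   rs6000.md hunk; the FPR and Altivec ranges are assumptions.  */
enum
{
  FIRST_FPR_REGNO = 32, LAST_FPR_REGNO = 63,
  FIRST_ALTIVEC_REGNO = 64, LAST_ALTIVEC_REGNO = 95,
  FIRST_DMR_REGNO = 111, LAST_DMR_REGNO = 118
};

static bool
regno_in_range (int regno, int first, int last)
{
  return regno >= first && regno <= last;
}

/* Hypothetical stand-in for the XOmode arm quoted above.  Registers outside
   the FPR, Altivec and DMR ranges are simply rejected here, which simplifies
   the fall-through in the real function.  */
static bool
xo_mode_ok_sketch (int regno, bool target_mma, bool target_dense_math)
{
  assert (regno >= 0);

  /* power10 style: accumulators live in FPRs divisible by 4.  */
  if (target_mma && !target_dense_math)
    return (regno_in_range (regno, FIRST_FPR_REGNO, LAST_FPR_REGNO)
            && (regno & 3) == 0);

  if (target_dense_math)
    {
      if (regno_in_range (regno, FIRST_DMR_REGNO, LAST_DMR_REGNO))
        return true;

      /* Even register, and the 4-register group must end inside the FPRs.  */
      if (regno_in_range (regno, FIRST_FPR_REGNO, LAST_FPR_REGNO))
        return (regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3;

      /* Same rule for the traditional Altivec half of the VSX registers.  */
      if (regno_in_range (regno, FIRST_ALTIVEC_REGNO, LAST_ALTIVEC_REGNO))
        return (regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3;
    }

  return false;
}
```

Under these assumptions, with dense math enabled register 60 is accepted but
62 is rejected, since a four-register value starting at 62 would run from the
FPRs into the Altivec registers.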

> +
> +      else
> +	return 0;
> +    }
> +
> +  /* No other types other than XOmode can go in DMRs.  */
> +  if (DMR_REGNO_P (regno))
> +    return 0;
>  
>    /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
>       register combinations, and use PTImode where we need to deal with quad
> @@ -2312,6 +2349,7 @@ rs6000_debug_reg_global (void)
>    rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
>  			  LAST_ALTIVEC_REGNO,
>  			  "vs");
> +  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");

Nit: Like above, use 'dm'.

>    rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
>    rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
>    rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
> @@ -2332,6 +2370,7 @@ rs6000_debug_reg_global (void)
>  	   "wr reg_class = %s\n"
>  	   "wx reg_class = %s\n"
>  	   "wA reg_class = %s\n"
> +	   "wD reg_class = %s\n"
>  	   "\n",
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
> @@ -2339,7 +2378,8 @@ rs6000_debug_reg_global (void)
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
> -	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
> +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
> +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
> 

snip ...

> +/* Subroutine to determine the move cost of dense math registers.  If we are
> +   moving to/from VSX_REGISTER registers, the cost is either 1 move (for
> +   512-bit accumulators) or 2 moves (for 1,024 dmr registers).  If we are
> +   moving to anything else like GPR registers, make the cost very high.  */
> +
> +static int
> +rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
> +{
> +  const int reg_move_base = 2;
> +  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
> +			  & reg_class_contents[VSX_REGS]);
> +
> +  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))

Can we just use reg_classes_intersect_p (rclass, VSX_REGS)?
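
To illustrate why the two forms should agree, here is a minimal standalone
sketch using a toy two-word bit set in place of HARD_REG_SET.  All names in it
(reg_set_sketch, make_range, nonempty_intersection_p, sets_intersect_p) are
made up for the example and are not GCC APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for HARD_REG_SET: 128 bits, enough for the 119 hard
   registers implied by LAST_DMR_REGNO == 118.  */
#define SET_WORDS 2
typedef struct { uint64_t word[SET_WORDS]; } reg_set_sketch;

/* Build the set of registers FIRST..LAST inclusive.  */
static reg_set_sketch
make_range (int first, int last)
{
  assert (first >= 0 && last < 64 * SET_WORDS);

  reg_set_sketch s = { { 0, 0 } };
  for (int r = first; r <= last; r++)
    s.word[r / 64] |= (uint64_t) 1 << (r % 64);
  return s;
}

/* Form used in the patch: materialize the intersection, then test that it
   is not empty.  */
static bool
nonempty_intersection_p (reg_set_sketch a, reg_set_sketch b)
{
  reg_set_sketch both;
  for (int i = 0; i < SET_WORDS; i++)
    both.word[i] = a.word[i] & b.word[i];

  for (int i = 0; i < SET_WORDS; i++)
    if (both.word[i] != 0)
      return true;
  return false;
}

/* Suggested form: ask directly whether the two sets share a register.  */
static bool
sets_intersect_p (reg_set_sketch a, reg_set_sketch b)
{
  for (int i = 0; i < SET_WORDS; i++)
    if ((a.word[i] & b.word[i]) != 0)
      return true;
  return false;
}
```

Both helpers return the same answer for any pair of sets; the second just
avoids the temporary.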

> +    {
> +      /* __vector_quad (i.e. XOmode) is tranfered in 1 instruction.  */
> +      if (mode == XOmode)
> +	return reg_move_base;
> +
> +      else
> +	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);

I guess this "else" arm is for TDOmode, which belongs to that patch.

> +    }
> +
> +  return 1000 * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
> +}
> +
>  /* A C expression returning the cost of moving data from a register of class
>     CLASS1 to one of CLASS2.  */
>  
> @@ -22843,17 +22969,28 @@ rs6000_register_move_cost (machine_mode mode,
>    if (TARGET_DEBUG_COST)
>      dbg_cost_ctrl++;
>  

snip ...

>  /* Table of additional register names to use in user input.  */
> @@ -2132,6 +2158,8 @@ extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
>    {"vs52", 84}, {"vs53", 85}, {"vs54", 86}, {"vs55", 87},	\
>    {"vs56", 88}, {"vs57", 89}, {"vs58", 90}, {"vs59", 91},	\
>    {"vs60", 92}, {"vs61", 93}, {"vs62", 94}, {"vs63", 95},	\
> +  {"dmr0", 111}, {"dmr1", 112}, {"dmr2", 113}, {"dmr3", 114},	\
> +  {"dmr4", 115}, {"dmr5", 116}, {"dmr6", 117}, {"dmr7", 118},	\

Nit: maybe s/dmr/dm/ to align with the previous regnames.

>  }
>  
>  /* This is how to output an element of a case-vector that is relative.  */
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index a125fd8fc99..72af3e6ef70 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -51,6 +51,8 @@ (define_constants
>     (VRSAVE_REGNO		108)
>     (VSCR_REGNO			109)
>     (FRAME_POINTER_REGNUM	110)
> +   (FIRST_DMR_REGNO		111)
> +   (LAST_DMR_REGNO		118)
>    ])
>  
>  ;;
> @@ -355,7 +357,7 @@ (define_attr "cpu"
>    (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
>  
>  ;; The ISA we implement.
> -(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp"
> +(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp,dm,not_dm"

Nit: s/not_dm/nodm/ to align with some previous wording.

BR,
Kewen



* Re: Repost [PATCH 4/6] PowerPC: Make MMA insns support DMR registers.
  2024-01-05 23:39 ` Repost [PATCH 4/6] PowerPC: Make MMA insns support " Michael Meissner
  2024-01-19 18:47   ` Ping " Michael Meissner
@ 2024-02-04  3:21   ` Kewen.Lin
  2024-02-07  3:31     ` Michael Meissner
  1 sibling, 1 reply; 36+ messages in thread
From: Kewen.Lin @ 2024-02-04  3:21 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Segher Boessenkool, David Edelsohn, Peter Bergner

Hi Mike,

on 2024/1/6 07:39, Michael Meissner wrote:
> This patch changes the MMA instructions to use either FPR registers
> (-mcpu=power10) or DMRs (-mcpu=future).  In this patch, the existing MMA
> instruction names are used.
> 
> A macro (__PPC_DMR__) is defined if the MMA instructions use the DMRs.
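
As an illustration of how user code might key off this macro, here is a
minimal sketch.  The function name accumulator_kind and the returned strings
are made up for the example; __MMA__ and __PPC_DMR__ are the macros handled
in the rs6000-c.cc hunk of this patch.

```c
/* Report, at compile time, where the MMA accumulators live according to
   the predefined macros.  On a compiler that defines neither macro this
   falls through to the last branch.  */
static const char *
accumulator_kind (void)
{
#if defined (__PPC_DMR__)
  return "dense math registers";
#elif defined (__MMA__)
  return "FPR-backed accumulators";
#else
  return "no MMA support";
#endif
}
```
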
> 
> The patches have been tested on both little and big endian systems.  Can I check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  <meissner@linux.ibm.com>
> 
> gcc/
> 
> 	* config/rs6000/mma.md (mma_<acc>): New define_expand to handle
> 	mma_<acc> for dense math and non dense math.
> 	(mma_<acc> insn): Restrict to non dense math.
> 	(mma_xxsetaccz): Convert to define_expand to handle non dense math and
> 	dense math.
> 	(mma_xxsetaccz_vsx): Rename from mma_xxsetaccz and restrict usage to non
> 	dense math.
> 	(mma_xxsetaccz_dm): Dense math version of mma_xxsetaccz.
> 	(mma_<vv>): Add support for dense math.
> 	(mma_<avv>): Likewise.
> 	(mma_<pv>): Likewise.
> 	(mma_<apv>): Likewise.
> 	(mma_<vvi4i4i8>): Likewise.
> 	(mma_<avvi4i4i8>): Likewise.
> 	(mma_<vvi4i4i2>): Likewise.
> 	(mma_<avvi4i4i2>): Likewise.
> 	(mma_<vvi4i4>): Likewise.
> 	(mma_<avvi4i4>): Likewise.
> 	(mma_<pvi4i2>): Likewise.
> 	(mma_<apvi4i2>): Likewise.
> 	(mma_<vvi4i4i4>): Likewise.
> 	(mma_<avvi4i4i4>): Likewise.
> 	* config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
> 	__PPC_DMR__ if we have dense math instructions.
> 	* config/rs6000/rs6000.cc (print_operand): Make %A handle only DMRs if
> 	dense math and only FPRs if not dense math.
> 	(rs6000_split_multireg_move): Do not generate the xxmtacc instruction to
> 	prime the DMR registers or the xxmfacc instruction to de-prime
> 	instructions if we have dense math register support.
> ---
>  gcc/config/rs6000/mma.md      | 247 +++++++++++++++++++++-------------
>  gcc/config/rs6000/rs6000-c.cc |   3 +
>  gcc/config/rs6000/rs6000.cc   |  35 ++---
>  3 files changed, 176 insertions(+), 109 deletions(-)
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index bb898919ab5..525a85146ff 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -559,190 +559,249 @@ (define_insn "*mma_disassemble_acc_dm"
>    "dmxxextfdmr256 %0,%1,2"
>    [(set_attr "type" "mma")])
>  
> -(define_insn "mma_<acc>"
> +;; MMA instructions that do not use their accumulators as an input, still must
> +;; not allow their vector operands to overlap the registers used by the
> +;; accumulator.  We enforce this by marking the output as early clobber.  If we
> +;; have dense math, we don't need the whole prime/de-prime action, so just make
> +;; thse instructions be NOPs.

typo: thse.

> +
> +(define_expand "mma_<acc>"
> +  [(set (match_operand:XO 0 "register_operand")
> +	(unspec:XO [(match_operand:XO 1 "register_operand")]

s/register_operand/accumulator_operand/?

> +		   MMA_ACC))]
> +  "TARGET_MMA"
> +{
> +  if (TARGET_DENSE_MATH)
> +    {
> +      if (!rtx_equal_p (operands[0], operands[1]))
> +	emit_move_insn (operands[0], operands[1]);
> +      DONE;
> +    }
> +
> +  /* Generate the prime/de-prime code.  */
> +})
> +
> +(define_insn "*mma_<acc>"

It may be better to name this "*mma_<acc>_nodm"?

>    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
>  	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")]
>  		    MMA_ACC))]
> -  "TARGET_MMA"
> +  "TARGET_MMA && !TARGET_DENSE_MATH"

"TARGET_MMA && !TARGET_DENSE_MATH" is used in many places (for example in the
changes to rs6000_split_multireg_move in this patch, and in some places in the
previous patches).  Maybe we can introduce a macro named TARGET_MMA_NODM as
shorthand for it?
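
A minimal standalone sketch of that idea follows.  The two bool variables are
stand-ins for the real option flags (in GCC, TARGET_MMA and TARGET_DENSE_MATH
come from the option machinery), and the enum and accumulator_home are made up
for the example.

```c
#include <stdbool.h>

/* Stand-ins for the real target flags, so the sketch compiles alone.  */
static bool target_mma;
static bool target_dense_math;
#define TARGET_MMA target_mma
#define TARGET_DENSE_MATH target_dense_math

/* Proposed shorthand: MMA is enabled and the accumulators are still tied
   to the FPRs.  */
#define TARGET_MMA_NODM (TARGET_MMA && !TARGET_DENSE_MATH)

enum acc_home { ACC_NONE, ACC_IN_FPR, ACC_IN_DMR };

/* Hypothetical helper showing how a condition reads with the shorthand.  */
static enum acc_home
accumulator_home (void)
{
  if (TARGET_DENSE_MATH)
    return ACC_IN_DMR;
  if (TARGET_MMA_NODM)
    return ACC_IN_FPR;
  return ACC_NONE;
}
```

The insn conditions in mma.md could then be written as "TARGET_MMA_NODM"
instead of spelling out both flags each time.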

>    "<acc> %A0"
>    [(set_attr "type" "mma")])
>  
>  ;; We can't have integer constants in XOmode so we wrap this in an
> -;; UNSPEC_VOLATILE.
> +;; UNSPEC_VOLATILE for the non-dense math case.  For dense math, we don't need
> +;; to disable optimization and we can do a normal UNSPEC.
>  
> -(define_insn "mma_xxsetaccz"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
> +(define_expand "mma_xxsetaccz"
> +  [(set (match_operand:XO 0 "register_operand")

s/register_operand/accumulator_operand/?

>  	(unspec_volatile:XO [(const_int 0)]
>  			    UNSPECV_MMA_XXSETACCZ))]
>    "TARGET_MMA"
> +{
> +  if (TARGET_DENSE_MATH)
> +    {
> +      emit_insn (gen_mma_xxsetaccz_dm (operands[0]));
> +      DONE;
> +    }
> +})
> +
> +(define_insn "*mma_xxsetaccz_vsx"

s/vsx/nodm/

> +  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
> +	(unspec_volatile:XO [(const_int 0)]
> +			    UNSPECV_MMA_XXSETACCZ))]
> +  "TARGET_MMA && !TARGET_DENSE_MATH"
>    "xxsetaccz %A0"
>    [(set_attr "type" "mma")])
>  
> +
> +(define_insn "mma_xxsetaccz_dm"
> +  [(set (match_operand:XO 0 "dmr_operand" "=wD")
> +	(unspec:XO [(const_int 0)]
> +		   UNSPECV_MMA_XXSETACCZ))]
> +  "TARGET_DENSE_MATH"
> +  "dmsetdmrz %0"
> +  [(set_attr "type" "mma")])
> +
>  (define_insn "mma_<vv>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
>  		    MMA_VV))]
>    "TARGET_MMA"
>    "<vv> %A0,%x1,%x2"
> -  [(set_attr "type" "mma")])
> +  [(set_attr "type" "mma")
> +   (set_attr "isa" "dm,not_dm,not_dm")])

As suggested for the previous patches, s/not_dm/nodm/.

The rest looks good to me, thanks!

BR,
Kewen

>  
>  (define_insn "mma_<avv>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
>  		    MMA_AVV))]
>    "TARGET_MMA"
>    "<avv> %A0,%x2,%x3"
> -  [(set_attr "type" "mma")])
> +  [(set_attr "type" "mma")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<pv>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
>  		    MMA_PV))]
>    "TARGET_MMA"
>    "<pv> %A0,%x1,%x2"
> -  [(set_attr "type" "mma")])
> +  [(set_attr "type" "mma")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<apv>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
> -		    (match_operand:OO 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
> +		    (match_operand:OO 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
>  		    MMA_APV))]
>    "TARGET_MMA"
>    "<apv> %A0,%x2,%x3"
> -  [(set_attr "type" "mma")])
> +  [(set_attr "type" "mma")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<vvi4i4i8>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 5 "u8bit_cint_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 5 "u8bit_cint_operand" "n,n,n")]
>  		    MMA_VVI4I4I8))]
>    "TARGET_MMA"
>    "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<avvi4i4i8>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 6 "u8bit_cint_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 6 "u8bit_cint_operand" "n,n,n")]
>  		    MMA_AVVI4I4I8))]
>    "TARGET_MMA"
>    "<avvi4i4i8> %A0,%x2,%x3,%4,%5,%6"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<vvi4i4i2>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
>  		    MMA_VVI4I4I2))]
>    "TARGET_MMA"
>    "<vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<avvi4i4i2>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 6 "const_0_to_3_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 6 "const_0_to_3_operand" "n,n,n")]
>  		    MMA_AVVI4I4I2))]
>    "TARGET_MMA"
>    "<avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<vvi4i4>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")]
>  		    MMA_VVI4I4))]
>    "TARGET_MMA"
>    "<vvi4i4> %A0,%x1,%x2,%3,%4"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<avvi4i4>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
>  		    MMA_AVVI4I4))]
>    "TARGET_MMA"
>    "<avvi4i4> %A0,%x2,%x3,%4,%5"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<pvi4i2>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 4 "const_0_to_3_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 4 "const_0_to_3_operand" "n,n,n")]
>  		    MMA_PVI4I2))]
>    "TARGET_MMA"
>    "<pvi4i2> %A0,%x1,%x2,%3,%4"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<apvi4i2>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
> -		    (match_operand:OO 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
> +		    (match_operand:OO 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
>  		    MMA_APVI4I2))]
>    "TARGET_MMA"
>    "<apvi4i2> %A0,%x2,%x3,%4,%5"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<vvi4i4i4>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
>  		    MMA_VVI4I4I4))]
>    "TARGET_MMA"
>    "<vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
>  
>  (define_insn "mma_<avvi4i4i4>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> -	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
> -		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
> -		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
> -		    (match_operand:SI 6 "const_0_to_15_operand" "n,n")]
> +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> +	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
> +		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
> +		    (match_operand:SI 6 "const_0_to_15_operand" "n,n,n")]
>  		    MMA_AVVI4I4I4))]
>    "TARGET_MMA"
>    "<avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
>    [(set_attr "type" "mma")
> -   (set_attr "prefixed" "yes")])
> +   (set_attr "prefixed" "yes")
> +   (set_attr "isa" "dm,not_dm,not_dm")])
> diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
> index f2fb5bef678..4342620f87f 100644
> --- a/gcc/config/rs6000/rs6000-c.cc
> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -600,6 +600,9 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT flags)
>    /* Tell the user if we support the MMA instructions.  */
>    if ((flags & OPTION_MASK_MMA) != 0)
>      rs6000_define_or_undefine_macro (define_p, "__MMA__");
> +  /* Tell the user if we support the dense math instructions.  */
> +  if ((flags & OPTION_MASK_DENSE_MATH) != 0)
> +    rs6000_define_or_undefine_macro (define_p, "__PPC_DMR__");
>    /* Whether pc-relative code is being generated.  */
>    if ((flags & OPTION_MASK_PCREL) != 0)
>      rs6000_define_or_undefine_macro (define_p, "__PCREL__");
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 83e32f7a43a..59517c8608d 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -14264,8 +14264,13 @@ print_operand (FILE *file, rtx x, int code)
>  	 overlapping with the FPR registers.  */
>        if (!REG_P (x))
>  	output_operand_lossage ("invalid %%A value");
> -      else if (TARGET_DENSE_MATH && DMR_REGNO_P (REGNO (x)))
> -	fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
> +      else if (TARGET_DENSE_MATH)
> +	{
> +	  if (DMR_REGNO_P (REGNO (x)))
> +	    fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
> +	  else
> +	    output_operand_lossage ("%%A operand is not a DMR");
> +	}
>        else if (!FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
>  	output_operand_lossage ("invalid %%A value");
>        else
> @@ -27719,7 +27724,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  
>  	  /* If we are reading an accumulator register, we have to
>  	     deprime it before we can access it.  */
> -	  if (TARGET_MMA
> +	  if (TARGET_MMA && !TARGET_DENSE_MATH
>  	      && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
>  	    emit_insn (gen_mma_xxmfacc (src, src));
>  
> @@ -27751,9 +27756,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	      emit_insn (gen_rtx_SET (dst2, src2));
>  	    }
>  
> -	  /* If we are writing an accumulator register, we have to
> -	     prime it after we've written it.  */
> -	  if (TARGET_MMA
> +	  /* If we are writing an accumulator register that overlaps with the
> +	     FPR registers, we have to prime it after we've written it.  */
> +	  if (TARGET_MMA && !TARGET_DENSE_MATH
>  	      && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
>  	    emit_insn (gen_mma_xxmtacc (dst, dst));
>  
> @@ -27822,9 +27827,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	      emit_insn (gen_rtx_SET (dst_i, op));
>  	    }
>  
> -	  /* We are writing an accumulator register, so we have to
> -	     prime it after we've written it.  */
> -	  if (GET_MODE (src) == XOmode)
> +	  /* On systems without dense math where accumulators overlap with the
> +	     vector registers, we have to prime it after we've written it.  */
> +	  if (GET_MODE (src) == XOmode && !TARGET_DENSE_MATH)
>  	    emit_insn (gen_mma_xxmtacc (dst, dst));
>  
>  	  return;
> @@ -27835,9 +27840,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  
>    if (REG_P (src) && REG_P (dst) && (REGNO (src) < REGNO (dst)))
>      {
> -      /* If we are reading an accumulator register, we have to
> -	 deprime it before we can access it.  */
> -      if (TARGET_MMA
> +      /* If we are reading an accumulator register and we don't have dense
> +	 math, we have to deprime it before we can access it.  */
> +      if (TARGET_MMA && !TARGET_DENSE_MATH
>  	  && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
>  	emit_insn (gen_mma_xxmfacc (src, src));
>  
> @@ -27865,7 +27870,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  
>        /* If we are writing an accumulator register, we have to
>  	 prime it after we've written it.  */
> -      if (TARGET_MMA
> +      if (TARGET_MMA && !TARGET_DENSE_MATH
>  	  && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
>  	emit_insn (gen_mma_xxmtacc (dst, dst));
>      }
> @@ -28002,7 +28007,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  
>        /* If we are reading an accumulator register, we have to
>  	 deprime it before we can access it.  */
> -      if (TARGET_MMA && REG_P (src)
> +      if (TARGET_MMA && !TARGET_DENSE_MATH && REG_P (src)
>  	  && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
>  	emit_insn (gen_mma_xxmfacc (src, src));
>  
> @@ -28034,7 +28039,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  
>        /* If we are writing an accumulator register, we have to
>  	 prime it after we've written it.  */
> -      if (TARGET_MMA && REG_P (dst)
> +      if (TARGET_MMA && !TARGET_DENSE_MATH && REG_P (dst)
>  	  && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
>  	emit_insn (gen_mma_xxmtacc (dst, dst));
>  

* Re: Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations.
  2024-01-05 23:40 ` Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations Michael Meissner
  2024-01-19 18:48   ` Ping " Michael Meissner
@ 2024-02-04  5:47   ` Kewen.Lin
  2024-02-07 20:01     ` Michael Meissner
  1 sibling, 1 reply; 36+ messages in thread
From: Kewen.Lin @ 2024-02-04  5:47 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Segher Boessenkool, David Edelsohn, Peter Bergner

Hi Mike,

on 2024/1/6 07:40, Michael Meissner wrote:
> This patch changes the assembler instruction names for MMA instructions from
> the original name used in power10 to the new name when used with the dense math
> system.  I.e. xvf64gerpp becomes dmxvf64gerpp.  The assembler will emit the
> same bits for either spelling.
> 
> The patches have been tested on both little and big endian systems.  Can I check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  <meissner@linux.ibm.com>
> 
> gcc/
> 
> 	* config/rs6000/mma.md (vvi4i4i8_dm): New int attribute.
> 	(avvi4i4i8_dm): Likewise.
> 	(vvi4i4i2_dm): Likewise.
> 	(avvi4i4i2_dm): Likewise.
> 	(vvi4i4_dm): Likewise.
> 	(avvi4i4_dm): Likewise.
> 	(pvi4i2_dm): Likewise.
> 	(apvi4i2_dm): Likewise.
> 	(vvi4i4i4_dm): Likewise.
> 	(avvi4i4i4_dm): Likewise.
> 	(mma_<vv>): Add support for running on DMF systems, generating the dense
> 	math instruction and using the dense math accumulators.
> 	(mma_<avv>): Likewise.
> 	(mma_<pv>): Likewise.
> 	(mma_<apv>): Likewise.
> 	(mma_<vvi4i4i8>): Likewise.
> 	(mma_<avvi4i4i8>): Likewise.
> 	(mma_<vvi4i4i2>): Likewise.
> 	(mma_<avvi4i4i2>): Likewise.
> 	(mma_<vvi4i4>): Likewise.
> 	(mma_<avvi4i4>): Likewise.
> 	(mma_<pvi4i2>): Likewise.
> 	(mma_<apvi4i2>): Likewise.
> 	(mma_<vvi4i4i4>): Likewise.
> 	(mma_<avvi4i4i4>): Likewise.
> 
> gcc/testsuite/
> 
> 	* gcc.target/powerpc/dm-double-test.c: New test.
> 	* lib/target-supports.exp
> 	(check_effective_target_powerpc_dense_math_ok): New target test.
> ---
>  gcc/config/rs6000/mma.md                      |  98 +++++++--
>  .../gcc.target/powerpc/dm-double-test.c       | 194 ++++++++++++++++++
>  gcc/testsuite/lib/target-supports.exp         |  19 ++
>  3 files changed, 299 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-double-test.c
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index 525a85146ff..f06e6bbb184 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -227,13 +227,22 @@ (define_int_attr apv		[(UNSPEC_MMA_XVF64GERPP		"xvf64gerpp")
>  
>  (define_int_attr vvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8		"pmxvi4ger8")])
>  
> +(define_int_attr vvi4i4i8_dm	[(UNSPEC_MMA_PMXVI4GER8		"pmdmxvi4ger8")])

Can we update vvi4i4i8 to

(define_int_attr vvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8		"xvi4ger8")])

so that vvi4i4i8_dm does not need to be introduced; its uses would then look like:

-  "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
+  "@
+   pmdm<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5
+   pm<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5
+   pm<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"

and 

- define_insn "mma_<vvi4i4i8>"
+ define_insn "mma_pm<vvi4i4i8>"

(or update its use in the corresponding bif expander field)

?

This comment also applies to the other iterator changes.

> +
>  (define_int_attr avvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8PP	"pmxvi4ger8pp")])
>  
> +(define_int_attr avvi4i4i8_dm	[(UNSPEC_MMA_PMXVI4GER8PP	"pmdmxvi4ger8pp")])
> +
>  (define_int_attr vvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2	"pmxvi16ger2")
>  				 (UNSPEC_MMA_PMXVI16GER2S	"pmxvi16ger2s")
>  				 (UNSPEC_MMA_PMXVF16GER2	"pmxvf16ger2")
>  				 (UNSPEC_MMA_PMXVBF16GER2	"pmxvbf16ger2")])
>  
> +(define_int_attr vvi4i4i2_dm	[(UNSPEC_MMA_PMXVI16GER2	"pmdmxvi16ger2")
> +				 (UNSPEC_MMA_PMXVI16GER2S	"pmdmxvi16ger2s")
> +				 (UNSPEC_MMA_PMXVF16GER2	"pmdmxvf16ger2")
> +				 (UNSPEC_MMA_PMXVBF16GER2	"pmdmxvbf16ger2")])
> +
>  (define_int_attr avvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2PP	"pmxvi16ger2pp")
>  				 (UNSPEC_MMA_PMXVI16GER2SPP	"pmxvi16ger2spp")
>  				 (UNSPEC_MMA_PMXVF16GER2PP	"pmxvf16ger2pp")
> @@ -245,25 +254,54 @@ (define_int_attr avvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2PP	"pmxvi16ger2pp")
>  				 (UNSPEC_MMA_PMXVBF16GER2NP	"pmxvbf16ger2np")
>  				 (UNSPEC_MMA_PMXVBF16GER2NN	"pmxvbf16ger2nn")])
>  
> +(define_int_attr avvi4i4i2_dm	[(UNSPEC_MMA_PMXVI16GER2PP	"pmdmxvi16ger2pp")
> +				 (UNSPEC_MMA_PMXVI16GER2SPP	"pmdmxvi16ger2spp")
> +				 (UNSPEC_MMA_PMXVF16GER2PP	"pmdmxvf16ger2pp")
> +				 (UNSPEC_MMA_PMXVF16GER2PN	"pmdmxvf16ger2pn")
> +				 (UNSPEC_MMA_PMXVF16GER2NP	"pmdmxvf16ger2np")
> +				 (UNSPEC_MMA_PMXVF16GER2NN	"pmdmxvf16ger2nn")
> +				 (UNSPEC_MMA_PMXVBF16GER2PP	"pmdmxvbf16ger2pp")
> +				 (UNSPEC_MMA_PMXVBF16GER2PN	"pmdmxvbf16ger2pn")
> +				 (UNSPEC_MMA_PMXVBF16GER2NP	"pmdmxvbf16ger2np")
> +				 (UNSPEC_MMA_PMXVBF16GER2NN	"pmdmxvbf16ger2nn")])
> +
>  (define_int_attr vvi4i4		[(UNSPEC_MMA_PMXVF32GER		"pmxvf32ger")])
>  
> +(define_int_attr vvi4i4_dm	[(UNSPEC_MMA_PMXVF32GER		"pmdmxvf32ger")])
> +
>  (define_int_attr avvi4i4	[(UNSPEC_MMA_PMXVF32GERPP	"pmxvf32gerpp")
>  				 (UNSPEC_MMA_PMXVF32GERPN	"pmxvf32gerpn")
>  				 (UNSPEC_MMA_PMXVF32GERNP	"pmxvf32gernp")
>  				 (UNSPEC_MMA_PMXVF32GERNN	"pmxvf32gernn")])
>  
> +(define_int_attr avvi4i4_dm	[(UNSPEC_MMA_PMXVF32GERPP	"pmdmxvf32gerpp")
> +				 (UNSPEC_MMA_PMXVF32GERPN	"pmdmxvf32gerpn")
> +				 (UNSPEC_MMA_PMXVF32GERNP	"pmdmxvf32gernp")
> +				 (UNSPEC_MMA_PMXVF32GERNN	"pmdmxvf32gernn")])
> +
>  (define_int_attr pvi4i2		[(UNSPEC_MMA_PMXVF64GER		"pmxvf64ger")])
>  
> +(define_int_attr pvi4i2_dm	[(UNSPEC_MMA_PMXVF64GER		"pmdmxvf64ger")])
> +
>  (define_int_attr apvi4i2	[(UNSPEC_MMA_PMXVF64GERPP	"pmxvf64gerpp")
>  				 (UNSPEC_MMA_PMXVF64GERPN	"pmxvf64gerpn")
>  				 (UNSPEC_MMA_PMXVF64GERNP	"pmxvf64gernp")
>  				 (UNSPEC_MMA_PMXVF64GERNN	"pmxvf64gernn")])
>  
> +(define_int_attr apvi4i2_dm	[(UNSPEC_MMA_PMXVF64GERPP	"pmdmxvf64gerpp")
> +				 (UNSPEC_MMA_PMXVF64GERPN	"pmdmxvf64gerpn")
> +				 (UNSPEC_MMA_PMXVF64GERNP	"pmdmxvf64gernp")
> +				 (UNSPEC_MMA_PMXVF64GERNN	"pmdmxvf64gernn")])
> +
>  (define_int_attr vvi4i4i4	[(UNSPEC_MMA_PMXVI8GER4		"pmxvi8ger4")])
>  
> +(define_int_attr vvi4i4i4_dm	[(UNSPEC_MMA_PMXVI8GER4		"pmdmxvi8ger4")])
> +
>  (define_int_attr avvi4i4i4	[(UNSPEC_MMA_PMXVI8GER4PP	"pmxvi8ger4pp")
>  				 (UNSPEC_MMA_PMXVI8GER4SPP	"pmxvi8ger4spp")])
>  
> +(define_int_attr avvi4i4i4_dm	[(UNSPEC_MMA_PMXVI8GER4PP	"pmdmxvi8ger4pp")
> +				 (UNSPEC_MMA_PMXVI8GER4SPP	"pmdmxvi8ger4spp")])
>  
>  ;; Vector pair support.  OOmode can only live in VSRs.
>  (define_expand "movoo"
> @@ -629,7 +667,10 @@ (define_insn "mma_<vv>"
>  		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
>  		    MMA_VV))]
>    "TARGET_MMA"
> -  "<vv> %A0,%x1,%x2"
> +  "@
> +   dm<vv> %A0,%x1,%x2
> +   <vv> %A0,%x1,%x2
> +   <vv> %A0,%x1,%x2"
>    [(set_attr "type" "mma")
>     (set_attr "isa" "dm,not_dm,not_dm")])
>  
> @@ -650,7 +691,10 @@ (define_insn "mma_<pv>"
>  		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
>  		    MMA_PV))]
>    "TARGET_MMA"
> -  "<pv> %A0,%x1,%x2"
> +  "@
> +   dm<pv> %A0,%x1,%x2
> +   <pv> %A0,%x1,%x2
> +   <pv> %A0,%x1,%x2"
>    [(set_attr "type" "mma")
>     (set_attr "isa" "dm,not_dm,not_dm")])
>  
> @@ -661,7 +705,10 @@ (define_insn "mma_<apv>"
>  		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
>  		    MMA_APV))]
>    "TARGET_MMA"
> -  "<apv> %A0,%x2,%x3"
> +  "@
> +   dm<apv> %A0,%x2,%x3
> +   <apv> %A0,%x2,%x3
> +   <apv> %A0,%x2,%x3"
>    [(set_attr "type" "mma")
>     (set_attr "isa" "dm,not_dm,not_dm")])
>  
> @@ -674,7 +721,10 @@ (define_insn "mma_<vvi4i4i8>"
>  		    (match_operand:SI 5 "u8bit_cint_operand" "n,n,n")]
>  		    MMA_VVI4I4I8))]
>    "TARGET_MMA"
> -  "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
> +  "@
> +   dm<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5

typo?  I think you meant <vvi4i4i8_dm>, but it doesn't matter any more with the
above suggestion.

> +   <vvi4i4i8> %A0,%x1,%x2,%3,%4,%5
> +   <vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> @@ -703,7 +753,10 @@ (define_insn "mma_<vvi4i4i2>"
>  		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
>  		    MMA_VVI4I4I2))]
>    "TARGET_MMA"
> -  "<vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
> +  "@
> +   <vvi4i4i2_dm> %A0,%x1,%x2,%3,%4,%5
> +   <vvi4i4i2> %A0,%x1,%x2,%3,%4,%5
> +   <vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> @@ -718,7 +771,10 @@ (define_insn "mma_<avvi4i4i2>"
>  		    (match_operand:SI 6 "const_0_to_3_operand" "n,n,n")]
>  		    MMA_AVVI4I4I2))]
>    "TARGET_MMA"
> -  "<avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
> +  "@
> +   <avvi4i4i2_dm> %A0,%x2,%x3,%4,%5,%6
> +   <avvi4i4i2> %A0,%x2,%x3,%4,%5,%6
> +   <avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> @@ -731,7 +787,10 @@ (define_insn "mma_<vvi4i4>"
>  		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")]
>  		    MMA_VVI4I4))]
>    "TARGET_MMA"
> -  "<vvi4i4> %A0,%x1,%x2,%3,%4"
> +  "@
> +   <vvi4i4_dm> %A0,%x1,%x2,%3,%4
> +   <vvi4i4> %A0,%x1,%x2,%3,%4
> +   <vvi4i4> %A0,%x1,%x2,%3,%4"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> @@ -745,7 +804,10 @@ (define_insn "mma_<avvi4i4>"
>  		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
>  		    MMA_AVVI4I4))]
>    "TARGET_MMA"
> -  "<avvi4i4> %A0,%x2,%x3,%4,%5"
> +  "@
> +   <avvi4i4_dm> %A0,%x2,%x3,%4,%5
> +   <avvi4i4> %A0,%x2,%x3,%4,%5
> +   <avvi4i4> %A0,%x2,%x3,%4,%5"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> @@ -758,7 +820,10 @@ (define_insn "mma_<pvi4i2>"
>  		    (match_operand:SI 4 "const_0_to_3_operand" "n,n,n")]
>  		    MMA_PVI4I2))]
>    "TARGET_MMA"
> -  "<pvi4i2> %A0,%x1,%x2,%3,%4"
> +  "@
> +   <pvi4i2_dm> %A0,%x1,%x2,%3,%4
> +   <pvi4i2> %A0,%x1,%x2,%3,%4
> +   <pvi4i2> %A0,%x1,%x2,%3,%4"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> @@ -772,7 +837,10 @@ (define_insn "mma_<apvi4i2>"
>  		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
>  		    MMA_APVI4I2))]
>    "TARGET_MMA"
> -  "<apvi4i2> %A0,%x2,%x3,%4,%5"
> +  "@
> +   <apvi4i2_dm> %A0,%x2,%x3,%4,%5
> +   <apvi4i2> %A0,%x2,%x3,%4,%5
> +   <apvi4i2> %A0,%x2,%x3,%4,%5"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> @@ -786,7 +854,10 @@ (define_insn "mma_<vvi4i4i4>"
>  		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
>  		    MMA_VVI4I4I4))]
>    "TARGET_MMA"
> -  "<vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
> +  "@
> +   <vvi4i4i4_dm> %A0,%x1,%x2,%3,%4,%5
> +   <vvi4i4i4> %A0,%x1,%x2,%3,%4,%5
> +   <vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> @@ -801,7 +872,10 @@ (define_insn "mma_<avvi4i4i4>"
>  		    (match_operand:SI 6 "const_0_to_15_operand" "n,n,n")]
>  		    MMA_AVVI4I4I4))]
>    "TARGET_MMA"
> -  "<avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
> +  "@
> +   <avvi4i4i4_dm> %A0,%x2,%x3,%4,%5,%6
> +   <avvi4i4i4> %A0,%x2,%x3,%4,%5,%6
> +   <avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> diff --git a/gcc/testsuite/gcc.target/powerpc/dm-double-test.c b/gcc/testsuite/gcc.target/powerpc/dm-double-test.c
> new file mode 100644
> index 00000000000..66c19779585
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/dm-double-test.c
> @@ -0,0 +1,194 @@
> +/* Test derived from mma-double-1.c, modified for dense math.  */
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_dense_math_ok } */
> +/* { dg-options "-mdejagnu-cpu=future -O2" } */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <altivec.h>
> +
> +typedef unsigned char vec_t __attribute__ ((vector_size (16)));
> +typedef double v4sf_t __attribute__ ((vector_size (16)));
> +#define SAVE_ACC(ACC, ldc, J)  \
> +	  __builtin_mma_disassemble_acc (result, ACC); \
> +	  rowC = (v4sf_t *) &CO[0*ldc+J]; \
> +          rowC[0] += result[0]; \
> +          rowC = (v4sf_t *) &CO[1*ldc+J]; \
> +          rowC[0] += result[1]; \
> +          rowC = (v4sf_t *) &CO[2*ldc+J]; \
> +          rowC[0] += result[2]; \
> +          rowC = (v4sf_t *) &CO[3*ldc+J]; \
> +	  rowC[0] += result[3];
> +
> +void
> +DM (int m, int n, int k, double *A, double *B, double *C)
> +{
> +  __vector_quad acc0, acc1, acc2, acc3, acc4, acc5, acc6, acc7;
> +  v4sf_t result[4];
> +  v4sf_t *rowC;
> +  for (int l = 0; l < n; l += 4)
> +    {
> +      double *CO;
> +      double *AO;
> +      AO = A;
> +      CO = C;
> +      C += m * 4;
> +      for (int j = 0; j < m; j += 16)
> +	{
> +	  double *BO = B;
> +	  __builtin_mma_xxsetaccz (&acc0);
> +	  __builtin_mma_xxsetaccz (&acc1);
> +	  __builtin_mma_xxsetaccz (&acc2);
> +	  __builtin_mma_xxsetaccz (&acc3);
> +	  __builtin_mma_xxsetaccz (&acc4);
> +	  __builtin_mma_xxsetaccz (&acc5);
> +	  __builtin_mma_xxsetaccz (&acc6);
> +	  __builtin_mma_xxsetaccz (&acc7);
> +	  unsigned long i;
> +
> +	  for (i = 0; i < k; i++)
> +	    {
> +	      vec_t *rowA = (vec_t *) & AO[i * 16];
> +	      __vector_pair rowB;
> +	      vec_t *rb = (vec_t *) & BO[i * 4];
> +	      __builtin_mma_assemble_pair (&rowB, rb[1], rb[0]);
> +	      __builtin_mma_xvf64gerpp (&acc0, rowB, rowA[0]);
> +	      __builtin_mma_xvf64gerpp (&acc1, rowB, rowA[1]);
> +	      __builtin_mma_xvf64gerpp (&acc2, rowB, rowA[2]);
> +	      __builtin_mma_xvf64gerpp (&acc3, rowB, rowA[3]);
> +	      __builtin_mma_xvf64gerpp (&acc4, rowB, rowA[4]);
> +	      __builtin_mma_xvf64gerpp (&acc5, rowB, rowA[5]);
> +	      __builtin_mma_xvf64gerpp (&acc6, rowB, rowA[6]);
> +	      __builtin_mma_xvf64gerpp (&acc7, rowB, rowA[7]);
> +	    }
> +	  SAVE_ACC (&acc0, m, 0);
> +	  SAVE_ACC (&acc2, m, 4);
> +	  SAVE_ACC (&acc1, m, 2);
> +	  SAVE_ACC (&acc3, m, 6);
> +	  SAVE_ACC (&acc4, m, 8);
> +	  SAVE_ACC (&acc6, m, 12);
> +	  SAVE_ACC (&acc5, m, 10);
> +	  SAVE_ACC (&acc7, m, 14);
> +	  AO += k * 16;
> +	  BO += k * 4;
> +	  CO += 16;
> +	}
> +      B += k * 4;
> +    }
> +}
> +
> +void
> +init (double *matrix, int row, int column)
> +{
> +  for (int j = 0; j < column; j++)
> +    {
> +      for (int i = 0; i < row; i++)
> +	{
> +	  matrix[j * row + i] = (i * 16 + 2 + j) / 0.123;
> +	}
> +    }
> +}
> +
> +void
> +init0 (double *matrix, double *matrix1, int row, int column)
> +{
> +  for (int j = 0; j < column; j++)
> +    for (int i = 0; i < row; i++)
> +      matrix[j * row + i] = matrix1[j * row + i] = 0;
> +}
> +
> +
> +void
> +print (const char *name, const double *matrix, int row, int column)
> +{
> +  printf ("Matrix %s has %d rows and %d columns:\n", name, row, column);
> +  for (int i = 0; i < row; i++)
> +    {
> +      for (int j = 0; j < column; j++)
> +	{
> +	  printf ("%f ", matrix[j * row + i]);
> +	}
> +      printf ("\n");
> +    }
> +  printf ("\n");
> +}
> +
> +int
> +main (int argc, char *argv[])
> +{
> +  int rowsA, colsB, common;
> +  int i, j, k;
> +  int ret = 0;
> +
> +  for (int t = 16; t <= 128; t += 16)
> +    {
> +      for (int t1 = 4; t1 <= 16; t1 += 4)
> +	{
> +	  rowsA = t;
> +	  colsB = t1;
> +	  common = 1;
> +	  /* printf ("Running test for rows = %d,cols = %d\n", t, t1); */
> +	  double A[rowsA * common];
> +	  double B[common * colsB];
> +	  double C[rowsA * colsB];
> +	  double D[rowsA * colsB];
> +
> +
> +	  init (A, rowsA, common);
> +	  init (B, common, colsB);
> +	  init0 (C, D, rowsA, colsB);
> +	  DM (rowsA, colsB, common, A, B, C);
> +
> +	  for (i = 0; i < colsB; i++)
> +	    {
> +	      for (j = 0; j < rowsA; j++)
> +		{
> +		  D[i * rowsA + j] = 0;
> +		  for (k = 0; k < common; k++)
> +		    {
> +		      D[i * rowsA + j] +=
> +			A[k * rowsA + j] * B[k + common * i];
> +		    }
> +		}
> +	    }
> +	  for (i = 0; i < colsB; i++)
> +	    {
> +	      for (j = 0; j < rowsA; j++)
> +		{
> +		  for (k = 0; k < common; k++)
> +		    {
> +		      if (D[i * rowsA + j] != C[i * rowsA + j])
> +			{
> +			  printf ("Error %d,%d,%d\n",i,j,k);
> +			  ret++;
> +			}
> +		    }
> +		}
> +	    }
> +	  if (ret)
> +	    {
> +	      print ("A", A, rowsA, common);
> +	      print ("B", B, common, colsB);
> +	      print ("C", C, rowsA, colsB);
> +	      print ("D", D, rowsA, colsB);
> +	    }
> +	}
> +    }
> +  
> +#ifdef VERBOSE
> +  if (ret)
> +    printf ("DM double test fail: %d errors\n",ret);
> +  else
> +    printf ("DM double test success: 0 DM errors\n");
> +#else
> +  if (ret)
> +    abort();
> +#endif
> +      
> +  return ret;
> +}
> +
> +/* { dg-final { scan-assembler {\mdmsetdmrz\M}      } } */
> +/* { dg-final { scan-assembler {\mdmxvf64gerpp\M}   } } */
> +/* { dg-final { scan-assembler {\mdmxxextfdmr512\M} } } */
> +
> diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
> index 1b4a3fb18df..2dec3682a2f 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -7101,6 +7101,25 @@ proc check_effective_target_power10_ok { } {
>      }
>  }
>  
> +# Return 1 if this is a PowerPC target supporting -mcpu=future or -mdense-math

s/ or -mdense-math//

The others look good to me, thanks!

BR,
Kewen

> +# which enables the dense math operations.
> +proc check_effective_target_powerpc_dense_math_ok { } {
> +	return [check_no_compiler_messages_nocache powerpc_dense_math_ok assembly {
> +		__vector_quad vq;
> +		void test (void)
> +		{
> +		#ifndef __PPC_DMR__
> +		#error "target does not have dense math support."
> +		#else
> +		/* Make sure we have dense math support.  */
> +		  __vector_quad dmr;
> +		  __asm__ ("dmsetaccz %A0" : "=wD" (dmr));
> +		  vq = dmr;
> +		#endif
> +		}
> +	} "-mcpu=future"]
> +}
> +
>  # Return 1 if this is a PowerPC target supporting -mfloat128 via either
>  # software emulation on power7/power8 systems or hardware support on power9.
>  



* Re: Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.
  2024-01-05 23:42 ` Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers Michael Meissner
  2024-01-19 18:49   ` Ping " Michael Meissner
@ 2024-02-05  3:58   ` Kewen.Lin
  2024-02-08  0:35     ` Michael Meissner
  1 sibling, 1 reply; 36+ messages in thread
From: Kewen.Lin @ 2024-02-05  3:58 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Segher Boessenkool, David Edelsohn, Peter Bergner

Hi Mike,

on 2024/1/6 07:42, Michael Meissner wrote:
> This patch is a preliminary patch to add the full 1,024 bit dense math registers
> (DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto the top of the
> DMR register.
> 
> This patch only adds the new 1,024 bit register support.  It does not add
> support for any instructions that need 1,024 bit registers instead of 512 bit
> registers.
> 
> I used the new mode 'TDOmode' to be the opaque mode used for 1,204 bit

typo: 1,204

> registers.  The 'wD' constraint added in previous patches is used for these
> registers.  I added support to do load and store of DMRs via the VSX registers,
> since there are no load/store dense math instructions.  I added the new keyword
> '__dmr' to create 1,024 bit types that can be loaded into DMRs.  At present, I
> don't have aliases for __dmr512 and __dmr1024 that we've discussed internally.
> 
> The patches have been tested on both little and big endian systems.  Can I check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  <meissner@linux.ibm.com>
> 
> gcc/
> 
> 	* config/rs6000/mma.md (UNSPEC_DM_INSERT512_UPPER): New unspec.
> 	(UNSPEC_DM_INSERT512_LOWER): Likewise.
> 	(UNSPEC_DM_EXTRACT512): Likewise.
> 	(UNSPEC_DMR_RELOAD_FROM_MEMORY): Likewise.
> 	(UNSPEC_DMR_RELOAD_TO_MEMORY): Likewise.
> 	(movtdo): New define_expand and define_insn_and_split to implement 1,024
> 	bit DMR registers.
> 	(movtdo_insert512_upper): New insn.
> 	(movtdo_insert512_lower): Likewise.
> 	(movtdo_extract512): Likewise.
> 	(reload_dmr_from_memory): Likewise.
> 	(reload_dmr_to_memory): Likewise.
> 	* config/rs6000/rs6000-builtin.cc (rs6000_type_string): Add DMR
> 	support.
> 	(rs6000_init_builtins): Add support for __dmr keyword.
> 	* config/rs6000/rs6000-call.cc (rs6000_return_in_memory): Add support
> 	for TDOmode.
> 	(rs6000_function_arg): Likewise.
> 	* config/rs6000/rs6000-modes.def (TDOmode): New mode.
> 	* config/rs6000/rs6000.cc (rs6000_hard_regno_nregs_internal): Add
> 	support for TDOmode.
> 	(rs6000_hard_regno_mode_ok_uncached): Likewise.
> 	(rs6000_hard_regno_mode_ok): Likewise.
> 	(rs6000_modes_tieable_p): Likewise.
> 	(rs6000_debug_reg_global): Likewise.
> 	(rs6000_setup_reg_addr_masks): Likewise.
> 	(rs6000_init_hard_regno_mode_ok): Add support for TDOmode.  Setup reload
> 	hooks for DMR mode.
> 	(reg_offset_addressing_ok_p): Add support for TDOmode.
> 	(rs6000_emit_move): Likewise.
> 	(rs6000_secondary_reload_simple_move): Likewise.
> 	(rs6000_secondary_reload_class): Likewise.
> 	(rs6000_mangle_type): Add mangling for __dmr type.
> 	(rs6000_dmr_register_move_cost): Add support for TDOmode.
> 	(rs6000_split_multireg_move): Likewise.
> 	(rs6000_invalid_conversion): Likewise.
> 	* config/rs6000/rs6000.h (VECTOR_ALIGNMENT_P): Add TDOmode.
> 	(enum rs6000_builtin_type_index): Add DMR type nodes.
> 	(dmr_type_node): Likewise.
> 	(ptr_dmr_type_node): Likewise.
> 
> gcc/testsuite/
> 
> 	* gcc.target/powerpc/dm-1024bit.c: New test.
> ---
>  gcc/config/rs6000/mma.md                      | 152 ++++++++++++++++++
>  gcc/config/rs6000/rs6000-builtin.cc           |  13 ++
>  gcc/config/rs6000/rs6000-call.cc              |  13 +-
>  gcc/config/rs6000/rs6000-modes.def            |   4 +
>  gcc/config/rs6000/rs6000.cc                   | 135 ++++++++++++----
>  gcc/config/rs6000/rs6000.h                    |   7 +-
>  gcc/testsuite/gcc.target/powerpc/dm-1024bit.c |  63 ++++++++
>  7 files changed, 351 insertions(+), 36 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index f06e6bbb184..37de9030903 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -92,6 +92,11 @@ (define_c_enum "unspec"
>     UNSPEC_MMA_XXMFACC
>     UNSPEC_MMA_XXMTACC
>     UNSPEC_DM_ASSEMBLE_ACC
> +   UNSPEC_DM_INSERT512_UPPER
> +   UNSPEC_DM_INSERT512_LOWER
> +   UNSPEC_DM_EXTRACT512
> +   UNSPEC_DMR_RELOAD_FROM_MEMORY
> +   UNSPEC_DMR_RELOAD_TO_MEMORY
>    ])
>  
>  (define_c_enum "unspecv"
> @@ -879,3 +884,150 @@ (define_insn "mma_<avvi4i4i4>"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> +
> +\f
> +;; TDOmode (i.e. __dmr).
> +(define_expand "movtdo"
> +  [(set (match_operand:TDO 0 "nonimmediate_operand")
> +	(match_operand:TDO 1 "input_operand"))]
> +  "TARGET_DENSE_MATH"
> +{
> +  rs6000_emit_move (operands[0], operands[1], TDOmode);
> +  DONE;
> +})
> +
> +(define_insn_and_split "*movtdo"
> +  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
> +	(match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
> +  "TARGET_DENSE_MATH
> +   && (gpc_reg_operand (operands[0], TDOmode)
> +       || gpc_reg_operand (operands[1], TDOmode))"
> +  "@
> +   #
> +   #
> +   #
> +   #
> +   dmmr %0,%1
> +   #"
> +  "&& reload_completed
> +   && (!dmr_operand (operands[0], TDOmode) || !dmr_operand (operands[1], TDOmode))"
> +  [(const_int 0)]
> +{
> +  rtx op0 = operands[0];
> +  rtx op1 = operands[1];
> +
> +  if (REG_P (op0) && REG_P (op1))
> +    {
> +      int regno0 = REGNO (op0);
> +      int regno1 = REGNO (op1);
> +
> +      if (DMR_REGNO_P (regno0) && VSX_REGNO_P (regno1))
> +	{
> +	  rtx op1_upper = gen_rtx_REG (XOmode, regno1);
> +	  rtx op1_lower = gen_rtx_REG (XOmode, regno1 + 4);
> +	  emit_insn (gen_movtdo_insert512_upper (op0, op1_upper));
> +	  emit_insn (gen_movtdo_insert512_lower (op0, op0, op1_lower));
> +	  DONE;
> +	}
> +
> +      else if (VSX_REGNO_P (regno0) && DMR_REGNO_P (regno1))
> +	{
> +	  rtx op0_upper = gen_rtx_REG (XOmode, regno0);
> +	  rtx op0_lower = gen_rtx_REG (XOmode, regno0 + 4);
> +	  emit_insn (gen_movtdo_extract512 (op0_upper, op1, const0_rtx));
> +	  emit_insn (gen_movtdo_extract512 (op0_lower, op1, const1_rtx));
> +	  DONE;
> +	}

Add an assertion like gcc_assert (VSX_REGNO_P (regno0) && VSX_REGNO_P (regno1))?

> +    }
> +
> +  rs6000_split_multireg_move (operands[0], operands[1]);
> +  DONE;
> +}
> +  [(set_attr "type" "vecload,vecstore,vecmove,vecmove,vecmove,vecmove")
> +   (set_attr "length" "*,*,32,8,*,8")
> +   (set_attr "max_prefixed_insns" "4,4,*,*,*,*")])
> +
> +;; Move from VSX registers to DMR registers via two insert 512 bit
> +;; instructions.
> +(define_insn "movtdo_insert512_upper"
> +  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
> +	(unspec:TDO [(match_operand:XO 1 "vsx_register_operand" "wa")]
> +		    UNSPEC_DM_INSERT512_UPPER))]
> +  "TARGET_DENSE_MATH"
> +  "dmxxinstdmr512 %0,%1,%Y1,0"
> +  [(set_attr "type" "mma")])
> +
> +(define_insn "movtdo_insert512_lower"
> +  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
> +	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "0")
> +		     (match_operand:XO 2 "vsx_register_operand" "wa")]
> +		    UNSPEC_DM_INSERT512_LOWER))]
> +  "TARGET_DENSE_MATH"
> +  "dmxxinstdmr512 %0,%2,%Y2,1"
> +  [(set_attr "type" "mma")])
> +
> +;; Move from DMR registers to VSX registers via two extract 512 bit
> +;; instructions.
> +(define_insn "movtdo_extract512"
> +  [(set (match_operand:XO 0 "vsx_register_operand" "=wa")
> +	(unspec:XO [(match_operand:TDO 1 "dmr_operand" "wD")
> +		    (match_operand 2 "const_0_to_1_operand" "n")]
> +		   UNSPEC_DM_EXTRACT512))]
> +  "TARGET_DENSE_MATH"
> +  "dmxxextfdmr512 %0,%Y0,%1,%2"
> +  [(set_attr "type" "mma")])
> +
> +;; Reload DMR registers from memory
> +(define_insn_and_split "reload_dmr_from_memory"
> +  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
> +	(unspec:TDO [(match_operand:TDO 1 "memory_operand" "m")]
> +		    UNSPEC_DMR_RELOAD_FROM_MEMORY))
> +   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
> +  "TARGET_DENSE_MATH"
> +  "#"
> +  "&& reload_completed"
> +  [(const_int 0)]
> +{
> +  rtx dest = operands[0];
> +  rtx src = operands[1];
> +  rtx tmp = operands[2];
> +  rtx mem_upper = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
> +  rtx mem_lower = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);

I think the offset should be 64 rather than 32.
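
For reference, the arithmetic behind this: OPAQUE_MODE (XO, 64) and OPAQUE_MODE (TDO, 128) give the sizes in bytes, so the second 512-bit half of a __dmr memory image starts at byte 64. A minimal standalone sketch in plain C (not GCC internals; the helper name and the big_endian flag are made up for illustration):

```c
#include <stdbool.h>

/* Sizes in bytes, matching OPAQUE_MODE (XO, 64) and OPAQUE_MODE (TDO, 128)
   in rs6000-modes.def.  */
enum { XO_BYTES = 64, TDO_BYTES = 128 };

/* Hypothetical helper: byte offset of the upper (UPPER == true) or lower
   512-bit half within a 1,024-bit memory image.  It mirrors the
   "BYTES_BIG_ENDIAN ? 0 : N" selection in the splitter, where N is the
   size of one XOmode half.  */
static unsigned
dmr_half_offset (bool upper, bool big_endian)
{
  unsigned second_half = TDO_BYTES - XO_BYTES;	/* 64, not 32.  */
  return (upper == big_endian) ? 0 : second_half;
}
```

With an offset of 32, the two XOmode accesses would overlap and the last 32 bytes would never be touched.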

> +
> +  emit_move_insn (tmp, mem_upper);
> +  emit_insn (gen_movtdo_insert512_upper (dest, tmp));
> +
> +  emit_move_insn (tmp, mem_lower);
> +  emit_insn (gen_movtdo_insert512_lower (dest, dest, tmp));
> +  DONE;
> +}
> +  [(set_attr "length" "16")
> +   (set_attr "max_prefixed_insns" "2")
> +   (set_attr "type" "vecload")])
> +
> +;; Reload dense math registers to memory
> +(define_insn_and_split "reload_dmr_to_memory"
> +  [(set (match_operand:TDO 0 "memory_operand" "=m")
> +	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "wD")]
> +		    UNSPEC_DMR_RELOAD_TO_MEMORY))
> +   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
> +  "TARGET_DENSE_MATH"
> +  "#"
> +  "&& reload_completed"
> +  [(const_int 0)]
> +{
> +  rtx dest = operands[0];
> +  rtx src = operands[1];
> +  rtx tmp = operands[2];
> +  rtx mem_upper = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
> +  rtx mem_lower = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);

Ditto.

> +
> +  emit_insn (gen_movtdo_extract512 (tmp, src, const0_rtx));
> +  emit_move_insn (mem_upper, tmp);
> +
> +  emit_insn (gen_movtdo_extract512 (tmp, src, const1_rtx));
> +  emit_move_insn (mem_lower, tmp);
> +  DONE;
> +}
> +  [(set_attr "length" "16")
> +   (set_attr "max_prefixed_insns" "2")])
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
> index 6698274031b..54868d2009c 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -495,6 +495,8 @@ const char *rs6000_type_string (tree type_node)
>      return "__vector_pair";
>    else if (type_node == vector_quad_type_node)
>      return "__vector_quad";
> +  else if (type_node == dmr_type_node)
> +    return "__dmr";
>  
>    return "unknown";
>  }
> @@ -781,6 +783,17 @@ rs6000_init_builtins (void)
>    t = build_qualified_type (vector_quad_type_node, TYPE_QUAL_CONST);
>    ptr_vector_quad_type_node = build_pointer_type (t);
>  
> +  dmr_type_node = make_node (OPAQUE_TYPE);
> +  SET_TYPE_MODE (dmr_type_node, TDOmode);
> +  TYPE_SIZE (dmr_type_node) = bitsize_int (GET_MODE_BITSIZE (TDOmode));
> +  TYPE_PRECISION (dmr_type_node) = GET_MODE_BITSIZE (TDOmode);
> +  TYPE_SIZE_UNIT (dmr_type_node) = size_int (GET_MODE_SIZE (TDOmode));
> +  SET_TYPE_ALIGN (dmr_type_node, 512);

Why not 1024?

> +  TYPE_USER_ALIGN (dmr_type_node) = 0;
> +  lang_hooks.types.register_builtin_type (dmr_type_node, "__dmr");
> +  t = build_qualified_type (dmr_type_node, TYPE_QUAL_CONST);
> +  ptr_dmr_type_node = build_pointer_type (t);
> +
>    tdecl = add_builtin_type ("__bool char", bool_char_type_node);
>    TYPE_NAME (bool_char_type_node) = tdecl;
>  
> diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
> index 8c590903c86..6e2465204cf 100644
> --- a/gcc/config/rs6000/rs6000-call.cc
> +++ b/gcc/config/rs6000/rs6000-call.cc
> @@ -437,7 +437,8 @@ rs6000_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
>    if (cfun
>        && !cfun->machine->mma_return_type_error
>        && TREE_TYPE (cfun->decl) == fntype
> -      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode))
> +      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode
> +	  || TYPE_MODE (type) == TDOmode))

Maybe just use OPAQUE_MODE_P (TYPE_MODE (type)) to cover all of these mode checks?

So far only rs6000 defines an OPAQUE_MODE.  If we are worried that some generic opaque modes
may show up some day, we can add an assertion somewhere to guarantee it, or add a macro like
OPAQUE_MMA_MODE_P to ensure it only matches {OO,XO,TDO}mode.
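
A rough sketch of what such a macro could look like. This is standalone C with a stand-in enum rather than GCC's real machine_mode, and OPAQUE_MMA_MODE_P is only the name proposed above, not an existing macro:

```c
/* Stand-in for GCC's machine_mode enumeration; only the modes needed for
   the illustration are listed, with an SK_ prefix to mark them as fake.  */
enum machine_mode_sketch
{
  SK_V1TImode,
  SK_OOmode,
  SK_XOmode,
  SK_TDOmode
};

/* Hypothetical predicate: true only for the opaque modes used by
   __vector_pair, __vector_quad and __dmr.  */
#define OPAQUE_MMA_MODE_P(MODE) \
  ((MODE) == SK_OOmode || (MODE) == SK_XOmode || (MODE) == SK_TDOmode)
```

In the real tree the condition in rs6000_return_in_memory would then read OPAQUE_MMA_MODE_P (TYPE_MODE (type)).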

>      {
>        /* Record we have now handled function CFUN, so the next time we
>  	 are called, we do not re-report the same error.  */
> @@ -1641,6 +1642,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
>        return NULL_RTX;
>      }
>  
> +  if (mode == TDOmode)
> +    {
> +      if (TYPE_CANONICAL (type) != NULL_TREE)
> +	type = TYPE_CANONICAL (type);
> +      error ("invalid use of dense math operand of type %qs as a function "
> +	     "parameter",
> +	     IDENTIFIER_POINTER (DECL_NAME (TYPE_NAME (type))));
> +      return NULL_RTX;
> +    }

Can we merge this hunk into the hunk above for OOmode and XOmode?  Then the TYPE_CANONICAL code
can be shared and is easier to maintain.  IMHO, a dense math operand is also an MMA operand, so the
existing error message still works.  If we do want to call out dense math operands, we can use
(mode == TDOmode) ? "dense math" : "MMA" for the part of the string that differs.
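
To make the shape of the merged message concrete, here is a small standalone sketch. It uses snprintf instead of GCC's error (), plain quotes instead of %qs, and a made-up helper name; the is_tdo flag stands for the mode == TDOmode test:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the shared diagnostic: only the
   "dense math" / "MMA" part differs between the two cases.  */
static int
format_param_error (char *buf, size_t len, bool is_tdo, const char *type_name)
{
  return snprintf (buf, len,
		   "invalid use of %s operand of type '%s' as a function "
		   "parameter",
		   is_tdo ? "dense math" : "MMA", type_name);
}
```

For example, format_param_error (buf, sizeof buf, true, "__dmr") yields the dense math wording, and passing false with "__vector_quad" yields the MMA wording.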

> +
>    /* Return a marker to indicate whether CR1 needs to set or clear the
>       bit that V.4 uses to say fp args were passed in registers.
>       Assume that we don't need the marker for software floating point,
> diff --git a/gcc/config/rs6000/rs6000-modes.def b/gcc/config/rs6000/rs6000-modes.def
> index 094b246c834..60ebb363196 100644
> --- a/gcc/config/rs6000/rs6000-modes.def
> +++ b/gcc/config/rs6000/rs6000-modes.def
> @@ -86,3 +86,7 @@ PARTIAL_INT_MODE (TI, 128, PTI);
>  /* Modes used by __vector_pair and __vector_quad.  */
>  OPAQUE_MODE (OO, 32);
>  OPAQUE_MODE (XO, 64);
> +
> +/* Modes used by __dmr.  */

Nit: s/Modes/Mode/

> +OPAQUE_MODE (TDO, 128);
> +

I assume "TD" stands for something, but it is not obvious to me what.
Could we also add a comment explaining it?

> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 59517c8608d..aed4b72c4ea 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1846,7 +1846,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
>       128-bit floating point that can go in vector registers, which has VSX
>       memory addressing.  */
>    if (FP_REGNO_P (regno))
> -    reg_size = (VECTOR_MEM_VSX_P (mode) || VECTOR_ALIGNMENT_P (mode)
> +    reg_size = (VECTOR_MEM_VSX_P (mode)
> +		|| VECTOR_ALIGNMENT_P (mode)
> +		|| mode == TDOmode

This change is redundant, since this patch already makes VECTOR_ALIGNMENT_P cover TDOmode.

>  		? UNITS_PER_VSX_WORD
>  		: UNITS_PER_FP_WORD);
>  
> @@ -1880,9 +1882,9 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
>    /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
>       by 4.
>  
> -     If dense math is enabled, allow all VSX registers plus the DMR registers.
> -     We need to make sure we don't cross between the boundary of FPRs and
> -     traditional Altiviec registers.  */
> +     If dense math is enabled, allow all VSX registers plus the dense math
> +     registers.  We need to make sure we don't cross between the boundary of
> +     FPRs and traditional Altiviec registers.  */
>    if (mode == XOmode)
>      {
>        if (TARGET_MMA && !TARGET_DENSE_MATH)
> @@ -1904,7 +1906,27 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
>  	return 0;
>      }
>  
> -  /* No other types other than XOmode can go in DMRs.  */
> +  /* Dense math register modes need DMR registers or VSX registers divisible by
> +     2.  We need to make sure we don't cross between the boundary of FPRs and
> +     traditional Altiviec registers.  */
> +  if (mode == TDOmode)
> +    {
> +      if (!TARGET_DENSE_MATH)
> +	return 0;
> +
> +      if (DMR_REGNO_P (regno))
> +	return 1;
> +
> +      if (FP_REGNO_P (regno))
> +	return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 7);
> +

Like the comment on XOmode (in one previous patch), this restriction looks too
strict.  Isn't it fine to cross the FPR and Altivec register boundary, since XAp
and XBp are separate operands in the DM 512 insert/extract instructions?
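
To show the difference in concrete register numbers, here is a standalone sketch comparing the check in the patch with a relaxed variant. The register numbers (FPRs at 32..63, Altivec registers at 64..95) are my understanding of rs6000.h and should be treated as assumptions, and both function names are made up:

```c
#include <stdbool.h>

/* Assumed hard register numbering, for illustration only.  */
enum { FIRST_FPR = 32, LAST_FPR = 63, FIRST_ALTIVEC = 64, LAST_ALTIVEC = 95 };

/* The check as written in the patch: an even register, with all 8
   registers inside either the FPR half or the Altivec half.  */
static bool
tdo_regno_ok_strict (int regno)
{
  if (regno >= FIRST_FPR && regno <= LAST_FPR)
    return (regno & 1) == 0 && regno <= LAST_FPR - 7;
  if (regno >= FIRST_ALTIVEC && regno <= LAST_ALTIVEC)
    return (regno & 1) == 0 && regno <= LAST_ALTIVEC - 7;
  return false;
}

/* Hypothetical relaxed check: an even register, with all 8 registers
   anywhere inside the 64 VSX registers.  */
static bool
tdo_regno_ok_relaxed (int regno)
{
  return (regno >= FIRST_FPR
	  && (regno & 1) == 0
	  && regno <= LAST_ALTIVEC - 7);
}
```

Under these assumptions, register 58 (which would span 58..65) is rejected by the strict check and accepted by the relaxed one. Whether the relaxed form is actually safe is the open question here.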

> +      if (ALTIVEC_REGNO_P (regno))
> +	return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 7);
> +
> +      return 0;
> +    }
> +
> +  /* No other types other than XOmode or TDOmode can go in DMRs.  */
>    if (DMR_REGNO_P (regno))
>      return 0;
>  
> @@ -2012,9 +2034,11 @@ rs6000_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
>     GPR registers, and TImode can go in any GPR as well as VSX registers (PR
>     57744).
>  
> -   Similarly, don't allow OOmode (vector pair, restricted to even VSX
> -   registers) or XOmode (vector quad, restricted to FPR registers divisible
> -   by 4) to tie with other modes.
> +   Similarly, don't allow OOmode (vector pair), XOmode (vector quad), or
> +   TDOmode (dmr register) to pair with anything else.  Vector pairs are
> +   restricted to even/odd VSX registers.  Without dense math, vector quads are
> +   limited to FPR registers divisible by 4.  With dense math, vector quads are
> +   limited to even VSX registers or DMR registers.
>  
>     Altivec/VSX vector tests were moved ahead of scalar float mode, so that IEEE
>     128-bit floating point on VSX systems ties with other vectors.  */
> @@ -2023,7 +2047,8 @@ static bool
>  rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2)
>  {
>    if (mode1 == PTImode || mode1 == OOmode || mode1 == XOmode
> -      || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode)
> +      || mode1 == TDOmode || mode2 == PTImode || mode2 == OOmode
> +      || mode2 == XOmode || mode2 == TDOmode)
>      return mode1 == mode2;
>  
>    if (ALTIVEC_OR_VSX_VECTOR_MODE (mode1))
> @@ -2314,6 +2339,7 @@ rs6000_debug_reg_global (void)
>      V4DFmode,
>      OOmode,
>      XOmode,
> +    TDOmode,
>      CCmode,
>      CCUNSmode,
>      CCEQmode,
> @@ -2679,7 +2705,7 @@ rs6000_setup_reg_addr_masks (void)
>  	  /* Special case DMR registers.  */
>  	  if (rc == RELOAD_REG_DMR)
>  	    {
> -	      if (TARGET_DENSE_MATH && m2 == XOmode)
> +	      if (TARGET_DENSE_MATH && (m2 == XOmode || m2 == TDOmode))
>  		{
>  		  addr_mask = RELOAD_REG_VALID;
>  		  reg_addr[m].addr_mask[rc] = addr_mask;
> @@ -2786,12 +2812,14 @@ rs6000_setup_reg_addr_masks (void)
>  
>  	  /* Vector pairs can do both indexed and offset loads if the
>  	     instructions are enabled, otherwise they can only do offset loads
> -	     since it will be broken into two vector moves.  Vector quads can
> -	     only do offset loads.  If the user restricted generation of either
> -	     of the LXVP or STXVP instructions, do not allow indexed mode so
> -	     that we can split the load/store.  */
> +	     since it will be broken into two vector moves.  If the user
> +	     restricted generation of either of the LXVP or STXVP instructions,
> +	     do not allow indexed mode so that we can split the load/store.
> +
> +	     Vector quads and dense math 1,024 bit registers can only do offset
> +	     loads.  */
>  	  else if ((addr_mask != 0) && TARGET_MMA
> -		   && (m2 == OOmode || m2 == XOmode))
> +		   && (m2 == OOmode || m2 == XOmode || m2 == TDOmode))
>  	    {
>  	      addr_mask |= RELOAD_REG_OFFSET;
>  	      if (rc == RELOAD_REG_FPR || rc == RELOAD_REG_VMX)
> @@ -3021,6 +3049,14 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
>        rs6000_vector_align[XOmode] = 512;
>      }
>  
> +  /* Add support for 1,024 bit DMR registers.  */
> +  if (TARGET_DENSE_MATH)
> +    {
> +      rs6000_vector_unit[TDOmode] = VECTOR_NONE;
> +      rs6000_vector_mem[TDOmode] = VECTOR_VSX;
> +      rs6000_vector_align[TDOmode] = 512;
> +    }
> +
>    /* Register class constraints for the constraints that depend on compile
>       switches. When the VSX code was added, different constraints were added
>       based on the type (DFmode, V2DFmode, V4SFmode).  For the vector types, all
> @@ -3234,6 +3270,12 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
>  	}
>      }
>  
> +  if (TARGET_DENSE_MATH)
> +    {
> +      reg_addr[TDOmode].reload_load = CODE_FOR_reload_dmr_from_memory;
> +      reg_addr[TDOmode].reload_store = CODE_FOR_reload_dmr_to_memory;
> +    }
> +
>    /* Precalculate HARD_REGNO_NREGS.  */
>    for (r = 0; HARD_REGISTER_NUM_P (r); ++r)
>      for (m = 0; m < NUM_MACHINE_MODES; ++m)
> @@ -8800,12 +8842,15 @@ reg_offset_addressing_ok_p (machine_mode mode)
>  	return mode_supports_dq_form (mode);
>        break;
>  
> -      /* The vector pair/quad types support offset addressing if the
> -	 underlying vectors support offset addressing.  */
> +      /* The vector pair/quad types and the dense math types support offset

Nit: s/types/type/

> +	 addressing if the underlying vectors support offset addressing.  */
>      case E_OOmode:
>      case E_XOmode:
>        return TARGET_MMA;
>  
> +    case E_TDOmode:
> +      return TARGET_DENSE_MATH;
> +
>      case E_SDmode:
>        /* If we can do direct load/stores of SDmode, restrict it to reg+reg
>  	 addressing for the LFIWZX and STFIWX instructions.  */
> @@ -11354,6 +11399,12 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>  	       (mode == OOmode) ? "__vector_pair" : "__vector_quad");
>        break;
>  
> +    case E_TDOmode:
> +      if (CONST_INT_P (operands[1]))
> +	error ("%qs is an opaque type, and you cannot set it to constants",
> +	       "__dmr");
> +      break;
> +
>      case E_SImode:
>      case E_DImode:
>        /* Use default pattern for address of ELF small data */
> @@ -12817,7 +12868,7 @@ rs6000_secondary_reload_simple_move (enum rs6000_reg_type to_type,
>  
>    /* We can transfer between VSX registers and DMR registers without needing
>       extra registers.  */
> -  if (TARGET_DENSE_MATH && mode == XOmode
> +  if (TARGET_DENSE_MATH && (mode == XOmode || mode == TDOmode)
>        && ((to_type == DMR_REG_TYPE && from_type == VSX_REG_TYPE)
>  	  || (to_type == VSX_REG_TYPE && from_type == DMR_REG_TYPE)))
>      return true;
> @@ -13618,6 +13669,9 @@ rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
>        if (mode == XOmode)
>  	return TARGET_DENSE_MATH ? VSX_REGS : FLOAT_REGS;
>  

Nit: Should update the comments above:

   /* For the vector pair and vector quad modes, prefer their natural register
      (VSX or FPR) rather than GPR registers.  For other integer types, prefer
      the GPR registers.  */

> +      if (mode == TDOmode)
> +	return VSX_REGS;
> +
>        if (GET_MODE_CLASS (mode) == MODE_INT)
>  	return GENERAL_REGS;
>      }
> @@ -13741,8 +13795,9 @@ rs6000_secondary_reload_class (enum reg_class rclass, machine_mode mode,
>    else
>      regno = -1;
>  
> -  /* DMR registers don't have loads or stores.  We have to go through the VSX
> -     registers to load XOmode (vector quad).  */
> +  /* Dense math registers don't have loads or stores.  We have to go through
> +     the VSX registers to load XOmode (vector quad) and TDOmode (dmr 1024
> +     bit).  */
>    if (TARGET_DENSE_MATH && rclass == DM_REGS)
>      return VSX_REGS;
>  
> @@ -20830,6 +20885,8 @@ rs6000_mangle_type (const_tree type)
>      return "u13__vector_pair";
>    if (type == vector_quad_type_node)
>      return "u13__vector_quad";
> +  if (type == dmr_type_node)
> +    return "u5__dmr";
>  
>    /* For all other types, use the default mangling.  */
>    return NULL;
> @@ -22954,6 +23011,10 @@ rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
>        if (mode == XOmode)
>  	return reg_move_base;
>  
> +      /* __dmr (i.e. TDOmode) is transferred in 2 instructions.  */
> +      else if (mode == TDOmode)
> +	return reg_move_base * 2;
> +
>        else
>  	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
>      }
> @@ -27651,9 +27712,10 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>    mode = GET_MODE (dst);
>    nregs = hard_regno_nregs (reg, mode);
>  
> -  /* If we have a vector quad register for MMA, and this is a load or store,
> -     see if we can use vector paired load/stores.  */
> -  if (mode == XOmode && TARGET_MMA
> +  /* If we have a vector quad register for MMA or DMR register for dense math,
> +     and this is a load or store, see if we can use vector paired
> +     load/stores.  */
> +  if ((mode == XOmode || mode == TDOmode) && TARGET_MMA
>        && (MEM_P (dst) || MEM_P (src)))
>      {
>        reg_mode = OOmode;
> @@ -27661,7 +27723,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>      }
>    /* If we have a vector pair/quad mode, split it into two/four separate
>       vectors.  */

Nit: The above comments need to be updated.

> -  else if (mode == OOmode || mode == XOmode)
> +  else if (mode == OOmode || mode == XOmode || mode == TDOmode)
>      reg_mode = V1TImode;
>    else if (FP_REGNO_P (reg))
>      reg_mode = DECIMAL_FLOAT_MODE_P (mode) ? DDmode :
> @@ -27707,13 +27769,13 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>        return;
>      }
>  
> -  /* The __vector_pair and __vector_quad modes are multi-register
> -     modes, so if we have to load or store the registers, we have to be
> -     careful to properly swap them if we're in little endian mode
> -     below.  This means the last register gets the first memory
> -     location.  We also need to be careful of using the right register
> -     numbers if we are splitting XO to OO.  */
> -  if (mode == OOmode || mode == XOmode)
> +  /* The __vector_pair, __vector_quad, and __dmr modes are multi-register
> +     modes, so if we have to load or store the registers, we have to be careful
> +     to properly swap them if we're in little endian mode below.  This means
> +     the last register gets the first memory location.  We also need to be
> +     careful of using the right register numbers if we are splitting XO to
> +     OO.  */
> +  if (mode == OOmode || mode == XOmode || mode == TDOmode)
>      {
>        nregs = hard_regno_nregs (reg, mode);
>        int reg_mode_nregs = hard_regno_nregs (reg, reg_mode);
> @@ -27850,7 +27912,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	 overlap.  */
>        int i;
>        /* XO/OO are opaque so cannot use subregs. */
> -      if (mode == OOmode || mode == XOmode )
> +      if (mode == OOmode || mode == XOmode || mode == TDOmode)
>  	{
>  	  for (i = nregs - 1; i >= 0; i--)
>  	    {
> @@ -28024,7 +28086,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	    continue;
>  
>  	  /* XO/OO are opaque so cannot use subregs. */
> -	  if (mode == OOmode || mode == XOmode )
> +	  if (mode == OOmode || mode == XOmode || mode == TDOmode)
>  	    {
>  	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
>  	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
> @@ -29006,7 +29068,8 @@ rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
>  
>    if (frommode != tomode)
>      {
> -      /* Do not allow conversions to/from XOmode and OOmode types.  */
> +      /* Do not allow conversions to/from XOmode, OOmode, and TDOmode
> +	 types.  */
>        if (frommode == XOmode)
>  	return N_("invalid conversion from type %<__vector_quad%>");
>        if (tomode == XOmode)
> @@ -29015,6 +29078,10 @@ rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
>  	return N_("invalid conversion from type %<__vector_pair%>");
>        if (tomode == OOmode)
>  	return N_("invalid conversion to type %<__vector_pair%>");
> +      if (frommode == TDOmode)
> +	return N_("invalid conversion from type %<__dmr%>");
> +      if (tomode == TDOmode)
> +	return N_("invalid conversion to type %<__dmr%>");
>      }
>  
>    /* Conversion allowed.  */
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 22efac4a80c..9711777b5cd 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -1004,7 +1004,8 @@ enum data_align { align_abi, align_opt, align_both };
>  /* Modes that are not vectors, but require vector alignment.  Treat these like
>     vectors in terms of loads and stores.  */
>  #define VECTOR_ALIGNMENT_P(MODE)					\
> -  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode)
> +  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode	\
> +   || (MODE) == TDOmode)
>  
>  #define ALTIVEC_VECTOR_MODE(MODE)					\
>    ((MODE) == V16QImode							\
> @@ -2293,6 +2294,7 @@ enum rs6000_builtin_type_index
>    RS6000_BTI_const_str,		 /* pointer to const char * */
>    RS6000_BTI_vector_pair,	 /* unsigned 256-bit types (vector pair).  */
>    RS6000_BTI_vector_quad,	 /* unsigned 512-bit types (vector quad).  */
> +  RS6000_BTI_dmr,		 /* unsigned 1,024-bit types (dmr).  */
>    RS6000_BTI_const_ptr_void,     /* const pointer to void */
>    RS6000_BTI_ptr_V16QI,
>    RS6000_BTI_ptr_V1TI,
> @@ -2331,6 +2333,7 @@ enum rs6000_builtin_type_index
>    RS6000_BTI_ptr_dfloat128,
>    RS6000_BTI_ptr_vector_pair,
>    RS6000_BTI_ptr_vector_quad,
> +  RS6000_BTI_ptr_dmr,
>    RS6000_BTI_ptr_long_long,
>    RS6000_BTI_ptr_long_long_unsigned,
>    RS6000_BTI_MAX
> @@ -2388,6 +2391,7 @@ enum rs6000_builtin_type_index
>  #define const_str_type_node		 (rs6000_builtin_types[RS6000_BTI_const_str])
>  #define vector_pair_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_pair])
>  #define vector_quad_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_quad])
> +#define dmr_type_node			 (rs6000_builtin_types[RS6000_BTI_dmr])
>  #define pcvoid_type_node		 (rs6000_builtin_types[RS6000_BTI_const_ptr_void])
>  #define ptr_V16QI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V16QI])
>  #define ptr_V1TI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V1TI])
> @@ -2426,6 +2430,7 @@ enum rs6000_builtin_type_index
>  #define ptr_dfloat128_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dfloat128])
>  #define ptr_vector_pair_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_pair])
>  #define ptr_vector_quad_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_quad])
> +#define ptr_dmr_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dmr])
>  #define ptr_long_long_integer_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_long_long])
>  #define ptr_long_long_unsigned_type_node (rs6000_builtin_types[RS6000_BTI_ptr_long_long_unsigned])
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
> new file mode 100644
> index 00000000000..0a9884ddf63
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
> @@ -0,0 +1,63 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_dense_math_ok } */
> +/* { dg-options "-mdejagnu-cpu=future -O2" } */
> +
> +/* Test basic load/store for __dmr type.  */
> +
> +#ifndef CONSTRAINT
> +#if defined(USE_D)
> +#define CONSTRAINT "d"
> +
> +#elif defined(USE_V)
> +#define CONSTRAINT "v"
> +
> +#elif defined(USE_WA)
> +#define CONSTRAINT "wa"
> +
> +#else
> +#define CONSTRAINT "wD"
> +#endif
> +#endif
> +const char constraint[] = CONSTRAINT;
> +
> +void foo_mem_asm (__dmr *p, __dmr *q)
> +{
> +  /* 2 LXVP instructions.  */

Nit: s/2/4/

> +  __dmr vq = *p;
> +
> +  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
> +  __asm__ ("# foo (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
> +  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
> +
> +  /* 2 STXVP instructions.  */

Ditto.

> +  *q = vq;
> +}
> +
> +void foo_mem_asm2 (__dmr *p, __dmr *q)
> +{
> +  /* 2 LXVP instructions.  */

Ditto.

> +  __dmr vq = *p;
> +  __dmr vq2;
> +  __dmr vq3;

Nit: vq3 is unused.

> +
> +  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
> +  __asm__ ("# foo1 (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
> +  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
> +
> +  vq2 = vq;
> +  __asm__ ("# foo2 (wa) %0" : "+wa" (vq2));
> +
> +  /* 2 STXVP instructions.  */

Nit: s/2/4/

> +  *q = vq2;
> +}
> +
> +void foo_mem (__dmr *p, __dmr *q)
> +{
> +  /* 2 LXVP, 2 STXVP instructions, no DMR transfer.  */

Ditto.
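
The counts follow from the sizes: a __dmr value is 1,024 bits, LXVP/STXVP move 256 bits each, and the DM insert/extract instructions move 512 bits each. A small standalone sketch of that arithmetic (the names are made up for illustration):

```c
enum { DMR_BITS = 1024, VECTOR_PAIR_BITS = 256, XO_BITS = 512 };

/* Number of LXVP (or STXVP) instructions needed to load (or store)
   one __dmr value.  */
static int
pair_insns_per_dmr (void)
{
  return DMR_BITS / VECTOR_PAIR_BITS;
}

/* Number of DMXXINSTDMR512 (or DMXXEXTFDMR512) instructions needed to
   move one __dmr value between VSX registers and a DMR.  */
static int
dm512_insns_per_dmr (void)
{
  return DMR_BITS / XO_BITS;
}
```

This agrees with the dg-final counts at the end of the test: three functions with one load and one store each give 3 * 4 = 12 LXVP and 12 STXVP, and the two functions using the default "wD" constraint give 2 * 2 = 4 insert and 4 extract instructions.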

> +  *q = *p;
> +}
> +
> +/* { dg-final { scan-assembler-times {\mdmxxextfdmr512\M}  4 } } */
> +/* { dg-final { scan-assembler-times {\mdmxxinstdmr512\M}  4 } } */
> +/* { dg-final { scan-assembler-times {\mlxvp\M}           12 } } */
> +/* { dg-final { scan-assembler-times {\mstxvp\M}          12 } } */


The others look good to me, thanks!

BR,
Kewen



* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-01-23  8:44   ` Repost " Kewen.Lin
@ 2024-02-06  6:01     ` Michael Meissner
  2024-02-07  9:21       ` Kewen.Lin
  2024-02-08 18:35       ` Segher Boessenkool
  0 siblings, 2 replies; 36+ messages in thread
From: Michael Meissner @ 2024-02-06  6:01 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Peter Bergner

On Tue, Jan 23, 2024 at 04:44:32PM +0800, Kewen.Lin wrote:
> > --- a/gcc/config/rs6000/rs6000-cpus.def
> > +++ b/gcc/config/rs6000/rs6000-cpus.def
> > @@ -88,6 +88,10 @@
> >  				 | OPTION_MASK_POWER10			\
> >  				 | OTHER_POWER10_MASKS)
> >  
> > +/* Flags for a potential future processor that may or may not be delivered.  */
> > +#define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
> > +				 | OPTION_MASK_FUTURE)
> > +
> 
> Nit: Named as "ISA_FUTURE_MASKS_SERVER" seems more accurate as it's constituted
> with ISA_3_1_MASKS_**SERVER** ...

Well, the _SERVER suffix dates back to the power7 days, when we still had to
support the E500 in the main rs6000 tree.  But I will change it to be more
consistent in future patches.

> ..., then this need to be updated accordingly.
> 
> > diff --git a/gcc/config/rs6000/rs6000-opts.h b/gcc/config/rs6000/rs6000-opts.h
> > index 33fd0efc936..25890ae3034 100644
> > --- a/gcc/config/rs6000/rs6000-opts.h
> > +++ b/gcc/config/rs6000/rs6000-opts.h
> > @@ -67,7 +67,9 @@ enum processor_type
> >     PROCESSOR_MPCCORE,
> >     PROCESSOR_CELL,
> >     PROCESSOR_PPCA2,
> > -   PROCESSOR_TITAN
> > +   PROCESSOR_TITAN,
> > +
> 
> Nit: unintentional empty line?
> 
> > +   PROCESSOR_FUTURE
> >  };

It was more as a separation.  The MPCCORE, CELL, PPCA2, and TITAN are rather
old processors.  I don't recall why we kept them after the POWER<x>.

Logically we should re-order the list and move MPCCORE, etc. earlier, but I
will delete the blank line in future patches.

> > +static int
> > +rs600_cpu_index_lookup (enum processor_type processor)
> 
> s/rs600_cpu_index_lookup/rs6000_cpu_index_lookup/

I'm going to redo it, and eliminate rs600_cpu_index_lookup.  Thanks for
catching the spelling of rs600 instead of rs6000.

> > +{
> > +  for (size_t i = 0; i < ARRAY_SIZE (processor_target_table); i++)
> > +    if (processor_target_table[i].processor == processor)
> > +      return i;
> > +
> > +  return -1;
> > +}
> 
> Nit: Since this is given with a valid enum processor_type, I think it should
> never return -1?  If so, may be more clear with gcc_unreachable () or adjust
> with initial -1, break when hits and assert it's not -1.

As I said, in looking at it, I think I will rewrite the code that uses it to
call rs6000_cpu_name_lookup instead.

> > +
> >  \f
> >  /* Return number of consecutive hard regs needed starting at reg REGNO
> >     to hold something of mode MODE.
> > @@ -3756,23 +3768,45 @@ rs6000_option_override_internal (bool global_init_p)
> >      rs6000_isa_flags &= ~OPTION_MASK_POWERPC64;
> >  #endif
> >  
> > +  /* At the moment, we don't have explict -mtune=future support.  If the user
> 
> Nit: s/explict/explicit/

Thanks.

> 
> > +     explicitly tried to use -mtune=future, give a warning.  If not, use the
> 
> Nit: s/tried/tries/?

Thanks.  I will reword the comment.

> > +     power10 tuning until future tuning is added.  */
> >    if (rs6000_tune_index >= 0)
> > -    tune_index = rs6000_tune_index;
> > +    {
> > +      enum processor_type cur_proc
> > +	= processor_target_table[rs6000_tune_index].processor;
> > +
> > +      if (cur_proc == PROCESSOR_FUTURE)
> > +	{
> > +	  static bool issued_future_tune_warning = false;
> > +	  if (!issued_future_tune_warning)
> > +	    {
> > +	      issued_future_tune_warning = true;
> 
> This seems to ensure we only warn this once, but I noticed that in rs6000/
> only some OPT_Wpsabi related warnings adopt this way, I wonder if we don't
> restrict it like this, for a tiny simple case, how many times it would warn?

In a simple case, you would only get the warning once.  But if you use
__attribute__((__target__(...))) or #pragma target ... you might see it more
than once.

> > +	      warning (0, "%qs is not currently supported", "-mtune=future");
> > +	    }
> > +
> > +	  rs6000_tune_index = rs600_cpu_index_lookup (PROCESSOR_POWER10);
> > +	}
> > +      tune_index = rs6000_tune_index;
> > +    }
> >    else if (cpu_index >= 0)
> > -    rs6000_tune_index = tune_index = cpu_index;
> > +    {
> > +      enum processor_type cur_cpu
> > +	= processor_target_table[cpu_index].processor;
> > +
> > +      rs6000_tune_index = tune_index
> > +	= (cur_cpu == PROCESSOR_FUTURE
> > +	   ? rs600_cpu_index_lookup (PROCESSOR_POWER10)
> 
> s/rs600_cpu_index_lookup/rs6000_cpu_index_lookup/

See above.

> > +	   : cpu_index);
> > +    }
> >    else
> >      {
> > -      size_t i;
> >        enum processor_type tune_proc
> >  	= (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
> >  
> > -      tune_index = -1;
> > -      for (i = 0; i < ARRAY_SIZE (processor_target_table); i++)
> > -	if (processor_target_table[i].processor == tune_proc)
> > -	  {
> > -	    tune_index = i;
> > -	    break;
> > -	  }
> > +      tune_index = rs600_cpu_index_lookup (tune_proc == PROCESSOR_FUTURE
> > +					   ? PROCESSOR_POWER10
> > +					   : tune_proc);
> 
> This part looks useless, as tune_proc is impossible to be PROCESSOR_FUTURE.

Well, in theory, you could configure the compiler with --with-cpu=future or
--with-tune=future.

> >      }
> 
> Maybe re-structure the above into:
> 
> bool explicit_tune = false;
> if (rs6000_tune_index >= 0)
>   {
>     tune_index = rs6000_tune_index;
>     explicit_tune = true;
>   }
> else if (cpu_index >= 0)
>   // as before
>   rs6000_tune_index = tune_index = cpu_index;
> else
>   {
>    //as before
>    ...
>   }
> 
> // Check tune_index here instead.
> 
> if (processor_target_table[tune_index].processor == PROCESSOR_FUTURE)
>   {
>     tune_index = rs6000_cpu_index_lookup (PROCESSOR_POWER10);
>     if (explicit_tune)
>       warn ...
>   }
> 
> // as before
> rs6000_tune = processor_target_table[tune_index].processor;
> 
> >  
> >    if (cpu_index >= 0)
> > @@ -4785,6 +4819,7 @@ rs6000_option_override_internal (bool global_init_p)
> >  	break;
> >  
> >        case PROCESSOR_POWER10:
> > +      case PROCESSOR_FUTURE:
> >  	rs6000_cost = &power10_cost;
> >  	break;
> >  
> > @@ -5944,6 +5979,8 @@ rs6000_machine_from_flags (void)
> >    /* Disable the flags that should never influence the .machine selection.  */
> >    flags &= ~(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_PPC_GPOPT | OPTION_MASK_ISEL);
> >  
> > +  if ((flags & (ISA_FUTURE_MASKS & ~ISA_3_1_MASKS_SERVER)) != 0)
> > +    return "future";
> >    if ((flags & (ISA_3_1_MASKS_SERVER & ~ISA_3_0_MASKS_SERVER)) != 0)
> >      return "power10";
> >    if ((flags & (ISA_3_0_MASKS_SERVER & ~ISA_2_7_MASKS_SERVER)) != 0)
> > @@ -24500,6 +24537,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
> >    { "float128-hardware",	OPTION_MASK_FLOAT128_HW,	false, true  },
> >    { "fprnd",			OPTION_MASK_FPRND,		false, true  },
> >    { "power10",			OPTION_MASK_POWER10,		false, true  },
> > +  { "future",			OPTION_MASK_FUTURE,		false, true  },
> >    { "hard-dfp",			OPTION_MASK_DFP,		false, true  },
> >    { "htm",			OPTION_MASK_HTM,		false, true  },
> >    { "isel",			OPTION_MASK_ISEL,		false, true  },
> > diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> > index 2291fe8d3a3..43209f9a6e7 100644
> > --- a/gcc/config/rs6000/rs6000.h
> > +++ b/gcc/config/rs6000/rs6000.h
> > @@ -163,6 +163,7 @@
> >    mcpu=e5500: -me5500; \
> >    mcpu=e6500: -me6500; \
> >    mcpu=titan: -mtitan; \
> > +  mcpu=future: -mfuture; \
> >    !mcpu*: %{mpower9-vector: -mpower9; \
> >  	    mpower8-vector|mcrypto|mdirect-move|mhtm: -mpower8; \
> >  	    mvsx: -mpower7; \
> 
> I think we should also update asm_names in driver-rs6000.cc.

Ok.  Though the driver-rs6000.cc stuff won't kick in until we have a real
system that matches "future".

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

* Re: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.
  2024-01-25  9:28   ` Repost " Kewen.Lin
@ 2024-02-07  0:06     ` Michael Meissner
  2024-02-07  9:38       ` Kewen.Lin
  0 siblings, 1 reply; 36+ messages in thread
From: Michael Meissner @ 2024-02-07  0:06 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Peter Bergner

On Thu, Jan 25, 2024 at 05:28:49PM +0800, Kewen.Lin wrote:
> Hi Mike,
> 
> on 2024/1/6 07:38, Michael Meissner wrote:
> > The MMA subsystem added the notion of accumulator registers as an optional
> > feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
> > the traditional floating point registers 0..31, but logically the accumulator
> > registers were separate from the FPR registers.  In ISA 3.1, it was anticipated
> 
> Using VSX register 0..31 rather than traditional floating point registers 0..31
> seems more clear, since floating point registers imply 64 bit long registers.

Ok.

> > that in future systems, the accumulator registers may no overlap with the FPR
> > registers.  This patch adds the support for dense math registers as separate
> > registers.
> > 
> > This particular patch does not change the MMA support to use the accumulators
> > within the dense math registers.  This patch just adds the basic support for
> > having separate DMRs.  The next patch will switch the MMA support to use the
> > accumulators if -mcpu=future is used.
> > 
> > For testing purposes, I added an undocumented option '-mdense-math' to enable
> > or disable the dense math support.
> 
> Can we avoid this and use one macro for it instead?  As you might have noticed
> that some previous temporary options like -mpower{8,9}-vector cause ICEs due to
> some unexpected combination and we are going to neuter them, so let's try our
> best to avoid it if possible.  I guess one macro TARGET_DENSE_MATH defined by
> TARGET_FUTURE && TARGET_MMA matches all use places? and specifying -mcpu=future
> can enable it while -mcpu=power10 can disable it.

That depends on whether there will be other things added in the future power
that are not in the MMA+ instruction set.

But I can switch to defining TARGET_DENSE_MATH as a test of TARGET_FUTURE and
TARGET_MMA.  That way if/when a new cpu comes out, we will just have to change
the definition of TARGET_DENSE_MATH and not all of the uses.

I will also add TARGET_MMA_NO_DENSE_MATH to handle the existing MMA code for
assemble and disassemble when we don't have dense math instructions.
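
Roughly, the two macros would look like the sketch below.  This is untested
and only a minimal illustration: the target_future and target_mma variables
are stand-ins I made up so the fragment compiles on its own; in the real port
TARGET_FUTURE and TARGET_MMA come from the option masks.

```c
#include <stdbool.h>

/* Stand-ins for the real target flags (hypothetical; used here only so the
   sketch is self-contained).  */
static bool target_future;
static bool target_mma;

#define TARGET_FUTURE	target_future
#define TARGET_MMA	target_mma

/* Dense math is enabled by -mcpu=future together with MMA.  If a later cpu
   changes the rule, only this definition needs to change.  */
#define TARGET_DENSE_MATH		(TARGET_FUTURE && TARGET_MMA)

/* Power10-style MMA, where the accumulators overlap VSX registers 0..31.  */
#define TARGET_MMA_NO_DENSE_MATH	(TARGET_MMA && !TARGET_DENSE_MATH)
```

With that, patterns for the existing assemble/disassemble code test
TARGET_MMA_NO_DENSE_MATH and the new DMR patterns test TARGET_DENSE_MATH.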

> > 
> > This patch adds a new constraint (wD).  If MMA is selected but dense math is
> > not selected (i.e. -mcpu=power10), the wD constraint will allow access to
> > accumulators that overlap with the VSX vector registers 0..31.  If both MMA and
> 
> Sorry for nitpicking, it's more accurate with "VSX registers 0..31".

Ok.

> > diff --git a/gcc/config/rs6000/constraints.md b/gcc/config/rs6000/constraints.md
> > index c99997bf82b..614e431c085 100644
> > --- a/gcc/config/rs6000/constraints.md
> > +++ b/gcc/config/rs6000/constraints.md
> > @@ -107,6 +107,9 @@ (define_constraint "wB"
> >         (match_test "TARGET_P8_VECTOR")
> >         (match_operand 0 "s5bit_cint_operand")))
> >  
> > +(define_register_constraint "wD" "rs6000_constraints[RS6000_CONSTRAINT_wD]"
> > +  "Accumulator register.")
> > +
> >  (define_constraint "wE"
> >    "@internal Vector constant that can be loaded with the XXSPLTIB instruction."
> >    (match_test "xxspltib_constant_nosplit (op, mode)"))
> > diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> > index 6a7d8a836db..bb898919ab5 100644
> > --- a/gcc/config/rs6000/mma.md
> > +++ b/gcc/config/rs6000/mma.md
> > @@ -91,6 +91,7 @@ (define_c_enum "unspec"
> >     UNSPEC_MMA_XVI8GER4SPP
> >     UNSPEC_MMA_XXMFACC
> >     UNSPEC_MMA_XXMTACC
> > +   UNSPEC_DM_ASSEMBLE_ACC
> 
> The other UNSPEC.*ASSEMBLE like UNSPECV_MMA_ASSEMBLE don't have _ACC suffix,
> it's better to keep consistent if this suffix doesn't distinguish something.

Ok.

> >    ])
> >  
> >  (define_c_enum "unspecv"
> > @@ -321,7 +322,9 @@ (define_insn_and_split "*movoo"
> >     (set_attr "length" "*,8,*,8,8")
> >     (set_attr "isa" "lxvp,*,stxvp,*,*")])
> >  \f
> > -;; Vector quad support.  XOmode can only live in FPRs.
> > +;; Vector quad support.  Under the original MMA, XOmode can only live in VSX
> > +;; vector registers 0..31.  With dense math, XOmode can live in either VSX
> 
> Nit: s/vector//

Ok.

> > +;; registers (0..63) or DMR registers.
> >  (define_expand "movxo"
> >    [(set (match_operand:XO 0 "nonimmediate_operand")
> >  	(match_operand:XO 1 "input_operand"))]
> > @@ -346,10 +349,10 @@ (define_expand "movxo"
> >      gcc_assert (false);
> >  })
> >  
> > -(define_insn_and_split "*movxo"
> > +(define_insn_and_split "*movxo_nodm"
> >    [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
> >  	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
> > -  "TARGET_MMA
> > +  "TARGET_MMA && !TARGET_DENSE_MATH
> >     && (gpc_reg_operand (operands[0], XOmode)
> >         || gpc_reg_operand (operands[1], XOmode))"
> >    "@
> > @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
> >     (set_attr "length" "*,*,16")
> >     (set_attr "max_prefixed_insns" "2,2,*")])
> >  
> > +(define_insn_and_split "*movxo_dm"
> > +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
> > +	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]
> 
> Why not adopt ZwO rather than QwO?

You have to split the address into 2 addresses for loading or storing vector
pairs (or 4 addresses for loading or storing vectors).  Z would allow
register+register addresses, and you wouldn't be able to create the second
address by adding 128 to it.  Hence it uses 'Q' for register-only addresses and
'wO' for d-form addresses.
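
As a toy illustration of the difference (this is not GCC's RTL, and the
struct and function names are invented for the example), the second address
can be formed for a register-only or register+offset address, but not for a
register+register address without a scratch register.  The sketch also
ignores the offset range limits of real d-form instructions.

```c
#include <stdbool.h>

/* Toy model of the three address forms that matter here.  */
enum addr_kind
{
  ADDR_REG,		/* (reg): what 'Q' accepts.  */
  ADDR_REG_OFFSET,	/* (reg + const): d-form.  */
  ADDR_REG_REG		/* (reg + reg): x-form, which 'Z' would also allow.  */
};

struct addr
{
  enum addr_kind kind;
  int base;		/* Base register number.  */
  int index;		/* Index register number, ADDR_REG_REG only.  */
  long offset;		/* Displacement, ADDR_REG_OFFSET only.  */
};

/* Form the address of the next piece of a multi-register load or store,
   DELTA bytes past A.  Return false if that cannot be expressed without
   a scratch register.  */
static bool
addr_next_piece (const struct addr *a, long delta, struct addr *out)
{
  switch (a->kind)
    {
    case ADDR_REG:
      out->kind = ADDR_REG_OFFSET;
      out->base = a->base;
      out->index = -1;
      out->offset = delta;
      return true;

    case ADDR_REG_OFFSET:
      *out = *a;
      out->offset += delta;
      return true;

    case ADDR_REG_REG:
      /* There is no (reg + reg + const) form, so we cannot split this.  */
      return false;
    }

  return false;
}
```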

> 
> > +  "TARGET_DENSE_MATH
> > +   && (gpc_reg_operand (operands[0], XOmode)
> > +       || gpc_reg_operand (operands[1], XOmode))"
> > +  "@
> > +   #
> > +   #
> > +   #
> > +   dmxxinstdmr512 %0,%1,%Y1,0
> > +   dmmr %0,%1
> > +   dmxxextfdmr512 %0,%Y0,%1,0"
> > +  "&& reload_completed
> > +   && !dmr_operand (operands[0], XOmode)
> > +   && !dmr_operand (operands[1], XOmode)"
> > +  [(const_int 0)]
> > +{
> > +  rs6000_split_multireg_move (operands[0], operands[1]);
> > +  DONE;
> > +}
> > +  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
> > +   (set_attr "length" "*,*,16,*,*,*")
> > +   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
> > +
> >  (define_expand "vsx_assemble_pair"
> >    [(match_operand:OO 0 "vsx_register_operand")
> >     (match_operand:V16QI 1 "mma_assemble_input_operand")
> > @@ -433,25 +461,38 @@ (define_insn_and_split "*vsx_disassemble_pair"
> >  })
> >  
> >  (define_expand "mma_assemble_acc"
> > -  [(match_operand:XO 0 "fpr_reg_operand")
> > +  [(match_operand:XO 0 "register_operand")
> 
> Maybe use the newly introduced accumulator_operand?

Ok.

> 
> >     (match_operand:V16QI 1 "mma_assemble_input_operand")
> >     (match_operand:V16QI 2 "mma_assemble_input_operand")
> >     (match_operand:V16QI 3 "mma_assemble_input_operand")
> >     (match_operand:V16QI 4 "mma_assemble_input_operand")]
> >    "TARGET_MMA"
> >  {
> > -  rtx src = gen_rtx_UNSPEC_VOLATILE (XOmode,
> > -			    	     gen_rtvec (4, operands[1], operands[2],
> > -				       		operands[3], operands[4]),
> > -			    	     UNSPECV_MMA_ASSEMBLE);
> > -  emit_move_insn (operands[0], src);
> > +  rtx op0 = operands[0];
> > +  rtx op1 = operands[1];
> > +  rtx op2 = operands[2];
> > +  rtx op3 = operands[3];
> > +  rtx op4 = operands[4];
> > +
> > +  if (TARGET_DENSE_MATH)
> > +    {
> > +      rtx vpair1 = gen_reg_rtx (OOmode);
> > +      rtx vpair2 = gen_reg_rtx (OOmode);
> > +      emit_insn (gen_vsx_assemble_pair (vpair1, op1, op2));
> > +      emit_insn (gen_vsx_assemble_pair (vpair2, op3, op4));
> > +      emit_insn (gen_mma_assemble_acc_dm (op0, vpair1, vpair2));
> > +    }
> > +
> > +  else
> > +    emit_insn (gen_mma_assemble_acc_vsx (op0, op1, op2, op3, op4));
> > +
> >    DONE;
> >  })
> >  
> >  ;; We cannot update the four output registers atomically, so mark the output
> > -;; as an early clobber so we don't accidentally clobber the input operands.  */
> > +;; as an early clobber so we don't accidentally clobber the input operands.
> >  
> > -(define_insn_and_split "*mma_assemble_acc"
> > +(define_insn_and_split "mma_assemble_acc_vsx"
> 
> Nit: since we use "*_nodm" above, it seems better to name it with
> "mma_assemble_acc_nodm" which has the same style?

Ok.

> >    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> >  	(unspec_volatile:XO
> >  	  [(match_operand:V16QI 1 "mma_assemble_input_operand" "mwa")
> > @@ -459,7 +500,7 @@ (define_insn_and_split "*mma_assemble_acc"
> >  	   (match_operand:V16QI 3 "mma_assemble_input_operand" "mwa")
> >  	   (match_operand:V16QI 4 "mma_assemble_input_operand" "mwa")]
> >  	  UNSPECV_MMA_ASSEMBLE))]
> > -  "TARGET_MMA
> > +  "TARGET_MMA && !TARGET_DENSE_MATH
> >     && fpr_reg_operand (operands[0], XOmode)"
> >    "#"
> >    "&& reload_completed"
> > @@ -473,28 +514,31 @@ (define_insn_and_split "*mma_assemble_acc"
> >    DONE;
> >  })
> >  
> > +;; On a system with dense math, we build the accumulators from two vector
> > +;; pairs.
> > +
> > +(define_insn "mma_assemble_acc_dm"
> > + [(set (match_operand:XO 0 "dmr_operand" "=wD")
> > +       (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
> > +		   (match_operand:OO 2 "vsx_register_operand" "wa")]
> > +		  UNSPEC_DM_ASSEMBLE_ACC))]
> > + "TARGET_MMA && TARGET_DENSE_MATH"
> 
> Nit: redundant TARGET_MMA checking.

Ok.

> > + "dmxxinstdmr512 %0,%1,%2,0"
> > + [(set_attr "type" "mma")])
> > +
> >  (define_expand "mma_disassemble_acc"
> > -  [(match_operand:V16QI 0 "mma_disassemble_output_operand")
> > -   (match_operand:XO 1 "fpr_reg_operand")
> > -   (match_operand 2 "const_0_to_3_operand")]
> > -  "TARGET_MMA"
> > -{
> > -  rtx src;
> > -  int regoff = INTVAL (operands[2]);
> > -  src = gen_rtx_UNSPEC (V16QImode,
> > -			gen_rtvec (2, operands[1], GEN_INT (regoff)),
> > -			UNSPEC_MMA_EXTRACT);
> > -  emit_move_insn (operands[0], src);
> > -  DONE;
> > -})
> > +  [(set (match_operand:V16QI 0 "register_operand")
> > +	(unspec:V16QI [(match_operand:XO 1 "register_operand")
> 
> s/register_operand/accumulator_operand/?

Ok

> 
> > +		       (match_operand 2 "const_0_to_3_operand")]
> > +		      UNSPEC_MMA_EXTRACT))]
> > +  "TARGET_MMA")
> >  
> > -(define_insn_and_split "*mma_disassemble_acc"
> > +(define_insn_and_split "*mma_disassemble_acc_vsx"
> >    [(set (match_operand:V16QI 0 "mma_disassemble_output_operand" "=mwa")
> > -       (unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
> > -		      (match_operand 2 "const_0_to_3_operand")]
> > +	(unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
> > +		       (match_operand 2 "const_0_to_3_operand")]
> >  		      UNSPEC_MMA_EXTRACT))]
> > -  "TARGET_MMA
> > -   && fpr_reg_operand (operands[1], XOmode)"
> > +  "TARGET_MMA"
> 
> Do we still expect to see this pattern if TARGET_DENSE_MATH?
> If no, we should guard the condition with !TARGET_DENSE_MATH.

Ok.
> 
> >    "#"
> >    "&& reload_completed"
> >    [(const_int 0)]
> > @@ -506,9 +550,14 @@ (define_insn_and_split "*mma_disassemble_acc"
> >    DONE;
> >  })
> >  
> > -;; MMA instructions that do not use their accumulators as an input, still
> > -;; must not allow their vector operands to overlap the registers used by
> > -;; the accumulator.  We enforce this by marking the output as early clobber.
> > +(define_insn "*mma_disassemble_acc_dm"
> > +  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
> > +	(unspec:V16QI [(match_operand:XO 1 "dmr_operand" "wD")
> > +		       (match_operand 2 "const_0_to_3_operand")]
> > +		      UNSPEC_MMA_EXTRACT))]
> > +  "TARGET_DENSE_MATH"
> > +  "dmxxextfdmr256 %0,%1,2"
> > +  [(set_attr "type" "mma")])
> >  
> >  (define_insn "mma_<acc>"
> >    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> > diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
> > index d23ce9a77a3..3040dcd50a3 100644
> > --- a/gcc/config/rs6000/predicates.md
> > +++ b/gcc/config/rs6000/predicates.md
> > @@ -186,6 +186,38 @@ (define_predicate "vlogical_operand"
> >    return VLOGICAL_REGNO_P (REGNO (op));
> >  })
> >  
> > +;; Return 1 if op is a DMR register
> > +(define_predicate "dmr_operand"
> > +  (match_operand 0 "register_operand")
> > +{
> > +  if (!REG_P (op))
> > +    return 0;
> > +
> > +  if (!HARD_REGISTER_P (op))
> > +    return 1;
> > +
> > +  return DMR_REGNO_P (REGNO (op));
> > +})
> > +
> > +;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
> > +;; overlap with the FPRs, while on systems with dense math, the accumulators
> > +;; are separate dense math registers and do not overlap with the FPR
> > +;; registers..
> 
> Nit: an unexpected "."?
> 
> > +(define_predicate "accumulator_operand"
> > +  (match_operand 0 "register_operand")
> > +{
> 
> fpr_reg_operand checks for subreg as well, should we check for it here as well?
> 
> > +  if (!REG_P (op))
> > +    return 0;
> > +
> > +  if (!HARD_REGISTER_P (op))
> > +    return 1;
> > +
> > +  int r = REGNO (op);
> > +  return (TARGET_DENSE_MATH
> > +	  ? DMR_REGNO_P (r)
> > +	  : FP_REGNO_P (r) && (r & 3) == 0);
> > +})
> > +
> >  ;; Return 1 if op is the carry register.
> >  (define_predicate "ca_operand"
> >    (match_operand 0 "register_operand")
> > diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
> > index b6cd6d8cc84..4621b97b522 100644
> > --- a/gcc/config/rs6000/rs6000-cpus.def
> > +++ b/gcc/config/rs6000/rs6000-cpus.def
> > @@ -91,6 +91,7 @@
> >  /* Flags for a potential future processor that may or may not be delivered.  */
> >  #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
> >  				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
> > +				 | OPTION_MASK_DENSE_MATH		\
> >  				 | OPTION_MASK_FUTURE)
> >  
> >  /* Flags that need to be turned off if -mno-power9-vector.  */
> > @@ -134,6 +135,7 @@
> >  				 | OPTION_MASK_DFP			\
> >  				 | OPTION_MASK_DIRECT_MOVE		\
> >  				 | OPTION_MASK_DLMZB			\
> > +				 | OPTION_MASK_DENSE_MATH		\
> >  				 | OPTION_MASK_EFFICIENT_UNALIGNED_VSX	\
> >  				 | OPTION_MASK_FLOAT128_HW		\
> >  				 | OPTION_MASK_FLOAT128_KEYWORD		\
> > diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> > index bc509399cf6..83e32f7a43a 100644
> > --- a/gcc/config/rs6000/rs6000.cc
> > +++ b/gcc/config/rs6000/rs6000.cc
> > @@ -290,7 +290,8 @@ enum rs6000_reg_type {
> >    ALTIVEC_REG_TYPE,
> >    FPR_REG_TYPE,
> >    SPR_REG_TYPE,
> > -  CR_REG_TYPE
> > +  CR_REG_TYPE,
> > +  DMR_REG_TYPE
> >  };
> >  
> >  /* Map register class to register type.  */
> > @@ -304,22 +305,23 @@ static enum rs6000_reg_type reg_class_to_reg_type[N_REG_CLASSES];
> >  
> >  
> >  /* Register classes we care about in secondary reload or go if legitimate
> > -   address.  We only need to worry about GPR, FPR, and Altivec registers here,
> > -   along an ANY field that is the OR of the 3 register classes.  */
> > +   address.  We only need to worry about GPR, FPR, Altivec, and DMR registers
> > +   here, along an ANY field that is the OR of the 4 register classes.  */
> >  
> >  enum rs6000_reload_reg_type {
> >    RELOAD_REG_GPR,			/* General purpose registers.  */
> >    RELOAD_REG_FPR,			/* Traditional floating point regs.  */
> >    RELOAD_REG_VMX,			/* Altivec (VMX) registers.  */
> > -  RELOAD_REG_ANY,			/* OR of GPR, FPR, Altivec masks.  */
> > +  RELOAD_REG_DMR,			/* DMR registers.  */
> > +  RELOAD_REG_ANY,			/* OR of GPR/FPR/VMX/DMR masks.  */
> >    N_RELOAD_REG
> >  };
> >  
> > -/* For setting up register classes, loop through the 3 register classes mapping
> > +/* For setting up register classes, loop through the 4 register classes mapping
> >     into real registers, and skip the ANY class, which is just an OR of the
> >     bits.  */
> >  #define FIRST_RELOAD_REG_CLASS	RELOAD_REG_GPR
> > -#define LAST_RELOAD_REG_CLASS	RELOAD_REG_VMX
> > +#define LAST_RELOAD_REG_CLASS	RELOAD_REG_DMR
> >  
> >  /* Map reload register type to a register in the register class.  */
> >  struct reload_reg_map_type {
> > @@ -331,6 +333,7 @@ static const struct reload_reg_map_type reload_reg_map[N_RELOAD_REG] = {
> >    { "Gpr",	FIRST_GPR_REGNO },	/* RELOAD_REG_GPR.  */
> >    { "Fpr",	FIRST_FPR_REGNO },	/* RELOAD_REG_FPR.  */
> >    { "VMX",	FIRST_ALTIVEC_REGNO },	/* RELOAD_REG_VMX.  */
> > +  { "DMR",	FIRST_DMR_REGNO },	/* RELOAD_REG_DMR.  */
> >    { "Any",	-1 },			/* RELOAD_REG_ANY.  */
> >  };
> >  
> > @@ -1224,6 +1227,8 @@ char rs6000_reg_names[][8] =
> >        "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",
> >    /* vrsave vscr sfp */
> >        "vrsave", "vscr", "sfp",
> > +  /* DMRs */
> > +      "0", "1", "2", "3", "4", "5", "6", "7",
> >  };
> >  
> >  #ifdef TARGET_REGNAMES
> > @@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
> >    "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
> >    /* vrsave vscr sfp */
> >    "vrsave", "vscr", "sfp",
> > +  /* DMRs */
> > +  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
> 
> Should be without "r" here, as tested gas doesn't recognize %dmr0 but it does
> recognize %dm0.
> 
> >  };
> >  #endif
> >  
> > @@ -1846,6 +1853,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
> >    else if (ALTIVEC_REGNO_P (regno))
> >      reg_size = UNITS_PER_ALTIVEC_WORD;
> >  
> > +  else if (DMR_REGNO_P (regno))
> > +    reg_size = UNITS_PER_DMR_WORD;
> > +
> >    else
> >      reg_size = UNITS_PER_WORD;
> >  
> > @@ -1867,9 +1877,36 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
> >    if (mode == OOmode)
> >      return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
> >  
> > -  /* MMA accumulator modes need FPR registers divisible by 4.  */
> > +  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
> > +     by 4.
> > +
> > +     If dense math is enabled, allow all VSX registers plus the DMR registers.
> > +     We need to make sure we don't cross between the boundary of FPRs and
> > +     traditional Altiviec registers.  */
> >    if (mode == XOmode)
> > -    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
> > +    {
> > +      if (TARGET_MMA && !TARGET_DENSE_MATH)
> > +	return (FP_REGNO_P (regno) && (regno & 3) == 0);
> > +
> > +      else if (TARGET_DENSE_MATH)
> > +	{
> > +	  if (DMR_REGNO_P (regno))
> > +	    return 1;
> > +
> > +	  if (FP_REGNO_P (regno))
> > +	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
> > +
> > +	  if (ALTIVEC_REGNO_P (regno))
> > +	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
> > +	}
> 
> I could miss something, I didn't find which section of RFC indicates this
> restriction, could you please point out for me?  Thanks!
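
To make the cases concrete, here is a standalone model of what the XOmode
check in the hunk above accepts.  It is only a sketch: the helper names are
made up, and unlike the real function it simply rejects anything that matches
none of the cases.  The register numbers follow the port (vs0..vs31 are
32..63, vs32..vs63 are 64..95, and this patch puts the DMRs at 111..118).
The "- 3" limits keep a value that occupies four VSX registers from crossing
the FPR/Altivec boundary.

```c
#include <stdbool.h>

enum
{
  FIRST_FPR = 32,	/* vs0 */
  LAST_FPR = 63,	/* vs31 */
  FIRST_ALTIVEC = 64,	/* vs32 */
  LAST_ALTIVEC = 95,	/* vs63 */
  FIRST_DMR = 111,
  LAST_DMR = 118
};

static bool
in_range (int regno, int first, int last)
{
  return regno >= first && regno <= last;
}

/* Model of the XOmode arm of the hard-regno-mode-ok check.  MMA and
   DENSE_MATH stand in for TARGET_MMA and TARGET_DENSE_MATH.  */
static bool
xo_mode_ok (int regno, bool mma, bool dense_math)
{
  /* Power10 MMA: only FPRs whose number is divisible by 4.  */
  if (mma && !dense_math)
    return in_range (regno, FIRST_FPR, LAST_FPR) && (regno & 3) == 0;

  if (dense_math)
    {
      /* Any DMR can hold an accumulator.  */
      if (in_range (regno, FIRST_DMR, LAST_DMR))
	return true;

      /* Even VSX registers, as long as the 4 registers stay on one side
	 of the FPR/Altivec boundary.  */
      if (in_range (regno, FIRST_FPR, LAST_FPR))
	return (regno & 1) == 0 && regno <= LAST_FPR - 3;

      if (in_range (regno, FIRST_ALTIVEC, LAST_ALTIVEC))
	return (regno & 1) == 0 && regno <= LAST_ALTIVEC - 3;
    }

  return false;
}
```
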
> 
> > +
> > +      else
> > +	return 0;
> > +    }
> > +
> > +  /* No other types other than XOmode can go in DMRs.  */
> > +  if (DMR_REGNO_P (regno))
> > +    return 0;
> >  
> >    /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
> >       register combinations, and use PTImode where we need to deal with quad
> > @@ -2312,6 +2349,7 @@ rs6000_debug_reg_global (void)
> >    rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
> >  			  LAST_ALTIVEC_REGNO,
> >  			  "vs");
> > +  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");
> 
> Nit: Like above, use 'dm'.
> 
> >    rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
> >    rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
> >    rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
> > @@ -2332,6 +2370,7 @@ rs6000_debug_reg_global (void)
> >  	   "wr reg_class = %s\n"
> >  	   "wx reg_class = %s\n"
> >  	   "wA reg_class = %s\n"
> > +	   "wD reg_class = %s\n"
> >  	   "\n",
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
> > @@ -2339,7 +2378,8 @@ rs6000_debug_reg_global (void)
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
> > -	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
> > +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
> > +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
> > 
> 
> snip ...
> 
> > +/* Subroutine to determine the move cost of dense math registers.  If we are
> > +   moving to/from VSX_REGISTER registers, the cost is either 1 move (for
> > +   512-bit accumulators) or 2 moves (for 1,024 dmr registers).  If we are
> > +   moving to anything else like GPR registers, make the cost very high.  */
> > +
> > +static int
> > +rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
> > +{
> > +  const int reg_move_base = 2;
> > +  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
> > +			  & reg_class_contents[VSX_REGS]);
> > +
> > +  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))
> 
> Can we just use reg_classes_intersect_p (rclass, VSX_REGS)?
> 
> > +    {
> > +      /* __vector_quad (i.e. XOmode) is tranfered in 1 instruction.  */
> > +      if (mode == XOmode)
> > +	return reg_move_base;
> > +
> > +      else
> > +	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
> 
> I guess this "else" arm is for TDOmode, which belongs to that patch.
> 
> > +    }
> > +
> > +  return 1000 * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
> > +}
> > +
> >  /* A C expression returning the cost of moving data from a register of class
> >     CLASS1 to one of CLASS2.  */
> >  
> > @@ -22843,17 +22969,28 @@ rs6000_register_move_cost (machine_mode mode,
> >    if (TARGET_DEBUG_COST)
> >      dbg_cost_ctrl++;
> >  
> 
> snip ...
> 
> >  /* Table of additional register names to use in user input.  */
> > @@ -2132,6 +2158,8 @@ extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
> >    {"vs52", 84}, {"vs53", 85}, {"vs54", 86}, {"vs55", 87},	\
> >    {"vs56", 88}, {"vs57", 89}, {"vs58", 90}, {"vs59", 91},	\
> >    {"vs60", 92}, {"vs61", 93}, {"vs62", 94}, {"vs63", 95},	\
> > +  {"dmr0", 111}, {"dmr1", 112}, {"dmr2", 113}, {"dmr3", 114},	\
> > +  {"dmr4", 115}, {"dmr5", 116}, {"dmr6", 117}, {"dmr7", 118},	\
> 
> Nit: maybe s/dmr/dm/ to align the previous regnames.
> 
> >  }
> >  
> >  /* This is how to output an element of a case-vector that is relative.  */
> > diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> > index a125fd8fc99..72af3e6ef70 100644
> > --- a/gcc/config/rs6000/rs6000.md
> > +++ b/gcc/config/rs6000/rs6000.md
> > @@ -51,6 +51,8 @@ (define_constants
> >     (VRSAVE_REGNO		108)
> >     (VSCR_REGNO			109)
> >     (FRAME_POINTER_REGNUM	110)
> > +   (FIRST_DMR_REGNO		111)
> > +   (LAST_DMR_REGNO		118)
> >    ])
> >  
> >  ;;
> > @@ -355,7 +357,7 @@ (define_attr "cpu"
> >    (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
> >  
> >  ;; The ISA we implement.
> > -(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp"
> > +(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp,dm,not_dm"
> 
> Nit: s/not_dm/nodm/ to align with some previous wording.
> 
> BR,
> Kewen
> 

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

* Re: Repost [PATCH 4/6] PowerPC: Make MMA insns support DMR registers.
  2024-02-04  3:21   ` Repost " Kewen.Lin
@ 2024-02-07  3:31     ` Michael Meissner
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-02-07  3:31 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Peter Bergner

On Sun, Feb 04, 2024 at 11:21:49AM +0800, Kewen.Lin wrote:
> Hi Mike,
> 
> > --- a/gcc/config/rs6000/mma.md
> > +++ b/gcc/config/rs6000/mma.md
> > @@ -559,190 +559,249 @@ (define_insn "*mma_disassemble_acc_dm"
> >    "dmxxextfdmr256 %0,%1,2"
> >    [(set_attr "type" "mma")])
> >  
> > -(define_insn "mma_<acc>"
> > +;; MMA instructions that do not use their accumulators as an input, still must
> > +;; not allow their vector operands to overlap the registers used by the
> > +;; accumulator.  We enforce this by marking the output as early clobber.  If we
> > +;; have dense math, we don't need the whole prime/de-prime action, so just make
> > +;; thse instructions be NOPs.
> 
> typo: thse.

Ok.

> > +
> > +(define_expand "mma_<acc>"
> > +  [(set (match_operand:XO 0 "register_operand")
> > +	(unspec:XO [(match_operand:XO 1 "register_operand")]
> 
> s/register_operand/accumulator_operand/?

Ok.

> > +		   MMA_ACC))]
> > +  "TARGET_MMA"
> > +{
> > +  if (TARGET_DENSE_MATH)
> > +    {
> > +      if (!rtx_equal_p (operands[0], operands[1]))
> > +	emit_move_insn (operands[0], operands[1]);
> > +      DONE;
> > +    }
> > +
> > +  /* Generate the prime/de-prime code.  */
> > +})
> > +
> > +(define_insn "*mma_<acc>"
> 
> May be better to name with "*mma_<acc>_nodm"?

Ok.

> >    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> >  	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")]
> >  		    MMA_ACC))]
> > -  "TARGET_MMA"
> > +  "TARGET_MMA && !TARGET_DENSE_MATH"
> 
> I found that "TARGET_MMA && !TARGET_DENSE_MATH" is used much (like changes in function
> rs6000_split_multireg_move in this patch and some places in previous patches), maybe we
> can introduce a macro named as TARGET_MMA_NODM short for it?

As I said in the message about the last patch, I added
TARGET_MMA_NO_DENSE_MATH.

> >    "<acc> %A0"
> >    [(set_attr "type" "mma")])
> >  
> >  ;; We can't have integer constants in XOmode so we wrap this in an
> > -;; UNSPEC_VOLATILE.
> > +;; UNSPEC_VOLATILE for the non-dense math case.  For dense math, we don't need
> > +;; to disable optimization and we can do a normal UNSPEC.
> >  
> > -(define_insn "mma_xxsetaccz"
> > -  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
> > +(define_expand "mma_xxsetaccz"
> > +  [(set (match_operand:XO 0 "register_operand")
> 
> s/register_operand/accumulator_operand/?

Ok.

> >  	(unspec_volatile:XO [(const_int 0)]
> >  			    UNSPECV_MMA_XXSETACCZ))]
> >    "TARGET_MMA"
> > +{
> > +  if (TARGET_DENSE_MATH)
> > +    {
> > +      emit_insn (gen_mma_xxsetaccz_dm (operands[0]));
> > +      DONE;
> > +    }
> > +})
> > +
> > +(define_insn "*mma_xxsetaccz_vsx"
> 
> s/vsx/nodm/

Ok.

> > +  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
> > +	(unspec_volatile:XO [(const_int 0)]
> > +			    UNSPECV_MMA_XXSETACCZ))]
> > +  "TARGET_MMA && !TARGET_DENSE_MATH"
> >    "xxsetaccz %A0"
> >    [(set_attr "type" "mma")])
> >  
> > +
> > +(define_insn "mma_xxsetaccz_dm"
> > +  [(set (match_operand:XO 0 "dmr_operand" "=wD")
> > +	(unspec:XO [(const_int 0)]
> > +		   UNSPECV_MMA_XXSETACCZ))]
> > +  "TARGET_DENSE_MATH"
> > +  "dmsetdmrz %0"
> > +  [(set_attr "type" "mma")])
> > +
> >  (define_insn "mma_<vv>"
> > -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> > -	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
> > -		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
> > +  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
> > +	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
> > +		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
> >  		    MMA_VV))]
> >    "TARGET_MMA"
> >    "<vv> %A0,%x1,%x2"
> > -  [(set_attr "type" "mma")])
> > +  [(set_attr "type" "mma")
> > +   (set_attr "isa" "dm,not_dm,not_dm")])
> 
> Like what's suggested in previous patches, s/not_dm/nodm/

Ok.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-02-06  6:01     ` Michael Meissner
@ 2024-02-07  9:21       ` Kewen.Lin
  2024-02-07 19:58         ` Michael Meissner
  2024-02-08 18:42         ` Segher Boessenkool
  2024-02-08 18:35       ` Segher Boessenkool
  1 sibling, 2 replies; 36+ messages in thread
From: Kewen.Lin @ 2024-02-07  9:21 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Segher Boessenkool, David Edelsohn, Peter Bergner

on 2024/2/6 14:01, Michael Meissner wrote:
> On Tue, Jan 23, 2024 at 04:44:32PM +0800, Kewen.Lin wrote:
...
>>> diff --git a/gcc/config/rs6000/rs6000-opts.h b/gcc/config/rs6000/rs6000-opts.h
>>> index 33fd0efc936..25890ae3034 100644
>>> --- a/gcc/config/rs6000/rs6000-opts.h
>>> +++ b/gcc/config/rs6000/rs6000-opts.h
>>> @@ -67,7 +67,9 @@ enum processor_type
>>>     PROCESSOR_MPCCORE,
>>>     PROCESSOR_CELL,
>>>     PROCESSOR_PPCA2,
>>> -   PROCESSOR_TITAN
>>> +   PROCESSOR_TITAN,
>>> +
>>
>> Nit: unintentional empty line?
>>
>>> +   PROCESSOR_FUTURE
>>>  };
> 
> It was more as a separation.  The MPCCORE, CELL, PPCA2, and TITAN are rather
> old processors.  I don't recall why we kept them after the POWER<x>.
> 
> Logically we should re-order the list and move MPCCORE, etc. earlier, but I
> will delete the blank line in future patches.

Thanks for clarifying.  The re-ordering can be done in a separate patch; in
this context, a one-line comment would be better than a blank line. :)

...

>>> +     power10 tuning until future tuning is added.  */
>>>    if (rs6000_tune_index >= 0)
>>> -    tune_index = rs6000_tune_index;
>>> +    {
>>> +      enum processor_type cur_proc
>>> +	= processor_target_table[rs6000_tune_index].processor;
>>> +
>>> +      if (cur_proc == PROCESSOR_FUTURE)
>>> +	{
>>> +	  static bool issued_future_tune_warning = false;
>>> +	  if (!issued_future_tune_warning)
>>> +	    {
>>> +	      issued_future_tune_warning = true;
>>
>> This seems to ensure we only warn this once, but I noticed that in rs6000/
>> only some OPT_Wpsabi related warnings adopt this way, I wonder if we don't
>> restrict it like this, for a tiny simple case, how many times it would warn?
> 
> In a simple case, you would only get the warning once.  But if you use
> __attribute__((__target__(...))) or #pragma target ... you might see it more
> than once.

OK, considering we only get this warning once for a simple case, I'm inclined
not to keep a static variable for it; that matches what we currently do when
emitting option conflict errors.  But I'm fine with either.
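
For comparison, the two variants look roughly like this as a standalone
sketch.  The function names, the counter, and the warning text are all made
up for illustration; the real code would call warning () with whatever
message the patch uses.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts warnings so the two variants can be compared (illustration only).  */
static int warnings_emitted;

static void
warn_future_tune (void)
{
  /* Placeholder text, not the message from the patch.  */
  fprintf (stderr, "warning: future tuning is not available yet\n");
  warnings_emitted++;
}

/* Variant 1: guard with a static flag, so the warning is issued at most
   once per compilation, even with target attributes or pragmas.  */
static void
check_tune_warn_once (bool tune_is_future)
{
  static bool issued_future_tune_warning = false;

  if (tune_is_future && !issued_future_tune_warning)
    {
      issued_future_tune_warning = true;
      warn_future_tune ();
    }
}

/* Variant 2: no flag, so the warning is issued every time the options
   are processed.  */
static void
check_tune_warn_always (bool tune_is_future)
{
  if (tune_is_future)
    warn_future_tune ();
}
```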


>>>    else
>>>      {
>>> -      size_t i;
>>>        enum processor_type tune_proc
>>>  	= (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
>>>  
>>> -      tune_index = -1;
>>> -      for (i = 0; i < ARRAY_SIZE (processor_target_table); i++)
>>> -	if (processor_target_table[i].processor == tune_proc)
>>> -	  {
>>> -	    tune_index = i;
>>> -	    break;
>>> -	  }
>>> +      tune_index = rs600_cpu_index_lookup (tune_proc == PROCESSOR_FUTURE
>>> +					   ? PROCESSOR_POWER10
>>> +					   : tune_proc);
>>
>> This part looks useless, as tune_proc is impossible to be PROCESSOR_FUTURE.
> 
> Well in theory, you could configure the compiler with --with-cpu=future or
> --with-tune=future.

Sorry for the possible confusion here, the "tune_proc" that I referred to is
the variable in the above else branch:

   enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);

It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
chance to be PROCESSOR_FUTURE, so the checking "tune_proc == PROCESSOR_FUTURE"
is useless.

That's why I suggested the flow below: it does one final check after those
branches, which looks a bit clearer IMHO.

> 
>>>      }
>>
>> Maybe re-structure the above into:
>>
>> bool explicit_tune = false;
>> if (rs6000_tune_index >= 0)
>>   {
>>     tune_index = rs6000_tune_index;
>>     explicit_tune = true;
>>   }
>> else if (cpu_index >= 0)
>>   // as before
>>   rs6000_tune_index = tune_index = cpu_index;
>> else
>>   {
>>    //as before
>>    ...
>>   }
>>
>> // Check tune_index here instead.
>>
>> if (processor_target_table[tune_index].processor == PROCESSOR_FUTURE)
>>   {
>>     tune_index = rs6000_cpu_index_lookup (PROCESSOR_POWER10);
>>     if (explicit_tune)
>>       warn ...
>>   }
>>
>> // as before
>> rs6000_tune = processor_target_table[tune_index].processor;
>>
>>>  


BR,
Kewen


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.
  2024-02-07  0:06     ` Michael Meissner
@ 2024-02-07  9:38       ` Kewen.Lin
  2024-02-08  0:26         ` Michael Meissner
  0 siblings, 1 reply; 36+ messages in thread
From: Kewen.Lin @ 2024-02-07  9:38 UTC (permalink / raw)
  To: Michael Meissner
  Cc: Segher Boessenkool, David Edelsohn, gcc-patches, Peter Bergner

on 2024/2/7 08:06, Michael Meissner wrote:
> On Thu, Jan 25, 2024 at 05:28:49PM +0800, Kewen.Lin wrote:
>> Hi Mike,
>>
>> on 2024/1/6 07:38, Michael Meissner wrote:
>>> The MMA subsystem added the notion of accumulator registers as an optional
>>> feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
>>> the traditional floating point registers 0..31, but logically the accumulator
>>> registers were separate from the FPR registers.  In ISA 3.1, it was anticipated
>>
>> Using VSX register 0..31 rather than traditional floating point registers 0..31
>> seems more clear, since floating point registers imply 64 bit long registers.
> 
> Ok.
> 
>>> that in future systems, the accumulator registers may not overlap with the FPR
>>> registers.  This patch adds the support for dense math registers as separate
>>> registers.
>>>
>>> This particular patch does not change the MMA support to use the accumulators
>>> within the dense math registers.  This patch just adds the basic support for
>>> having separate DMRs.  The next patch will switch the MMA support to use the
>>> accumulators if -mcpu=future is used.
>>>
>>> For testing purposes, I added an undocumented option '-mdense-math' to enable
>>> or disable the dense math support.
>>
>> Can we avoid this and use one macro for it instead?  As you might have noticed
>> that some previous temporary options like -mpower{8,9}-vector cause ICEs due to
>> some unexpected combination and we are going to neuter them, so let's try our
>> best to avoid it if possible.  I guess one macro TARGET_DENSE_MATH defined by
>> TARGET_FUTURE && TARGET_MMA matches all use places? and specifying -mcpu=future
>> can enable it while -mcpu=power10 can disable it.
> 
> That depends on whether there will be other things added in the future power
> that are not in the MMA+ instruction set.
> 
> But I can switch to defining TARGET_DENSE_MATH to testing TARGET_FUTURE and
> TARGET_MMA.  That way if/when a new cpu comes out, we will just have to change
> the definition of TARGET_DENSE_MATH and not all of the uses.

Yes, that's what I expected.  Thanks!

> 
> I will also add TARGET_MMA_NO_DENSE_MATH to handle the existing MMA code for
> assemble and disassemble when we don't have dense math instructions.

Nice, I also found that having such a macro helps when reviewing a later patch,
so I suggested something similar there.
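
To make the two conditions concrete, here is a minimal stand-alone sketch
of how the macros could relate.  The plain bool flags and the set_target
helper are hypothetical stand-ins for this illustration; in GCC the
conditions would come from the option masks, and the final definitions
may differ.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real target flags; in GCC these are
   derived from the option masks, not from plain globals.  */
static bool target_future;
static bool target_mma;

/* Dense math is on only when both the future cpu and MMA are enabled.  */
#define TARGET_DENSE_MATH        (target_future && target_mma)

/* MMA without dense math: the power10 case, where the accumulators
   overlap with VSX registers 0..31.  */
#define TARGET_MMA_NO_DENSE_MATH (target_mma && !TARGET_DENSE_MATH)

/* Hypothetical helper for trying out the flag combinations.  */
static void
set_target (bool future, bool mma)
{
  target_future = future;
  target_mma = mma;
}
```

If a later cpu changes what enables dense math, only the definition of
TARGET_DENSE_MATH would need to change, not its uses.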

>>> -(define_insn_and_split "*movxo"
>>> +(define_insn_and_split "*movxo_nodm"
>>>    [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
>>>  	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
>>> -  "TARGET_MMA
>>> +  "TARGET_MMA && !TARGET_DENSE_MATH
>>>     && (gpc_reg_operand (operands[0], XOmode)
>>>         || gpc_reg_operand (operands[1], XOmode))"
>>>    "@
>>> @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
>>>     (set_attr "length" "*,*,16")
>>>     (set_attr "max_prefixed_insns" "2,2,*")])
>>>  
>>> +(define_insn_and_split "*movxo_dm"
>>> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
>>> +	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]
>>
>> Why not adopt ZwO rather than QwO?
> 
> You have to split the address into 2 addresses for loading or storing vector
> pairs (or 4 addresses for loading or storing vectors).  Z would allow
> register+register addresses, and you wouldn't be able to create the second 
> address by adding 128 to it.  Hence it uses 'Q' for register only and 'wO' for
> d-form addresses.

Thanks for clarifying.  But without this patch the define_insn_and_split *movxo
adopts "ZwO", IMHO it would mean the current "*movxo" define_insn_and_split have
been problematic?  I thought adjust_address can ensure the new address would be
still valid after adjusting 128 offset, could you double check?

> 
>>
>>> +  "TARGET_DENSE_MATH
>>> +   && (gpc_reg_operand (operands[0], XOmode)
>>> +       || gpc_reg_operand (operands[1], XOmode))"
>>> +  "@
>>> +   #
>>> +   #
>>> +   #
>>> +   dmxxinstdmr512 %0,%1,%Y1,0
>>> +   dmmr %0,%1
>>> +   dmxxextfdmr512 %0,%Y0,%1,0"
>>> +  "&& reload_completed
>>> +   && !dmr_operand (operands[0], XOmode)
>>> +   && !dmr_operand (operands[1], XOmode)"
>>> +  [(const_int 0)]
>>> +{
>>> +  rs6000_split_multireg_move (operands[0], operands[1]);
>>> +  DONE;
>>> +}
>>> +  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
>>> +   (set_attr "length" "*,*,16,*,*,*")
>>> +   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
>>> +

...

>>> +;; Return 1 if op is a DMR register
>>> +(define_predicate "dmr_operand"
>>> +  (match_operand 0 "register_operand")
>>> +{
>>> +  if (!REG_P (op))
>>> +    return 0;
>>> +
>>> +  if (!HARD_REGISTER_P (op))
>>> +    return 1;
>>> +
>>> +  return DMR_REGNO_P (REGNO (op));
>>> +})
>>> +
>>> +;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
>>> +;; overlap with the FPRs, while on systems with dense math, the accumulators
>>> +;; are separate dense math registers and do not overlap with the FPR
>>> +;; registers..
>>
>> Nit: an unexpected "."?
>>
>>> +(define_predicate "accumulator_operand"
>>> +  (match_operand 0 "register_operand")
>>> +{
>>
>> fpr_reg_operand checks for subreg as well, should we check for it here as well?
>>
>>>  #ifdef TARGET_REGNAMES
>>> @@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
>>>    "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
>>>    /* vrsave vscr sfp */
>>>    "vrsave", "vscr", "sfp",
>>> +  /* DMRs */
>>> +  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
>>
>> Should be without "r" here, as tested gas doesn't recognize %dmr0 but it does
>> recognize %dm0.

I guess a reply was missing on this part (and on some later ones)?  Just want
to make sure nothing was missed; hope this helps.  :)

>>
>>>  };
>>>  #endif
>>>  
>>> @@ -1846,6 +1853,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
>>>    else if (ALTIVEC_REGNO_P (regno))
>>>      reg_size = UNITS_PER_ALTIVEC_WORD;
>>>  
>>> +  else if (DMR_REGNO_P (regno))
>>> +    reg_size = UNITS_PER_DMR_WORD;
>>> +
>>>    else
>>>      reg_size = UNITS_PER_WORD;
>>>  
>>> @@ -1867,9 +1877,36 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
>>>    if (mode == OOmode)
>>>      return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
>>>  
>>> -  /* MMA accumulator modes need FPR registers divisible by 4.  */
>>> +  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
>>> +     by 4.
>>> +
>>> +     If dense math is enabled, allow all VSX registers plus the DMR registers.
>>> +     We need to make sure we don't cross between the boundary of FPRs and
>>> +     traditional Altivec registers.  */
>>>    if (mode == XOmode)
>>> -    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
>>> +    {
>>> +      if (TARGET_MMA && !TARGET_DENSE_MATH)
>>> +	return (FP_REGNO_P (regno) && (regno & 3) == 0);
>>> +
>>> +      else if (TARGET_DENSE_MATH)
>>> +	{
>>> +	  if (DMR_REGNO_P (regno))
>>> +	    return 1;
>>> +
>>> +	  if (FP_REGNO_P (regno))
>>> +	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
>>> +
>>> +	  if (ALTIVEC_REGNO_P (regno))
>>> +	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
>>> +	}
>>
>> I could miss something, I didn't find which section of RFC indicates this
>> restriction, could you please point out for me?  Thanks!
>>
>>> +
>>> +      else
>>> +	return 0;
>>> +    }
>>> +
>>> +  /* No other types other than XOmode can go in DMRs.  */
>>> +  if (DMR_REGNO_P (regno))
>>> +    return 0;
>>>  
>>>    /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
>>>       register combinations, and use PTImode where we need to deal with quad
>>> @@ -2312,6 +2349,7 @@ rs6000_debug_reg_global (void)
>>>    rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
>>>  			  LAST_ALTIVEC_REGNO,
>>>  			  "vs");
>>> +  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");
>>
>> Nit: Like above, use 'dm'.
>>
>>>    rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
>>>    rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
>>>    rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
>>> @@ -2332,6 +2370,7 @@ rs6000_debug_reg_global (void)
>>>  	   "wr reg_class = %s\n"
>>>  	   "wx reg_class = %s\n"
>>>  	   "wA reg_class = %s\n"
>>> +	   "wD reg_class = %s\n"
>>>  	   "\n",
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
>>> @@ -2339,7 +2378,8 @@ rs6000_debug_reg_global (void)
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
>>> -	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
>>> +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
>>> +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
>>>
>>
>> snip ...
>>
>>> +/* Subroutine to determine the move cost of dense math registers.  If we are
>>> +   moving to/from VSX_REGISTER registers, the cost is either 1 move (for
>>> +   512-bit accumulators) or 2 moves (for 1,024 dmr registers).  If we are
>>> +   moving to anything else like GPR registers, make the cost very high.  */
>>> +
>>> +static int
>>> +rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
>>> +{
>>> +  const int reg_move_base = 2;
>>> +  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
>>> +			  & reg_class_contents[VSX_REGS]);
>>> +
>>> +  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))
>>
>> Can we just use reg_classes_intersect_p (rclass, VSX_REGS)?
>>


BR,
Kewen


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-02-07  9:21       ` Kewen.Lin
@ 2024-02-07 19:58         ` Michael Meissner
  2024-02-20 10:35           ` Kewen.Lin
  2024-02-08 18:42         ` Segher Boessenkool
  1 sibling, 1 reply; 36+ messages in thread
From: Michael Meissner @ 2024-02-07 19:58 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Peter Bergner

On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
> on 2024/2/6 14:01, Michael Meissner wrote:
> Sorry for the possible confusion here, the "tune_proc" that I referred to is
> the variable in the above else branch:
> 
>    enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
> 
> It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
> chance to be PROCESSOR_FUTURE, so the checking "tune_proc == PROCESSOR_FUTURE"
> is useless.

PROCESSOR_DEFAULT can be PROCESSOR_FUTURE if somebody configures GCC with
--with-cpu=future.  While in general it shouldn't occur, it is helpful to
consider all of the corner cases.
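
As a rough illustration of that corner case, here is a reduced,
stand-alone sketch of the lookup plus fallback.  The three-entry table,
the cpu_entry struct and the default_tune_index helper are hypothetical
stand-ins, not the real rs6000 definitions.

```c
/* Hypothetical, much reduced stand-ins for the rs6000 definitions.  */
enum processor_type { PROCESSOR_POWER9, PROCESSOR_POWER10, PROCESSOR_FUTURE };

struct cpu_entry { const char *name; enum processor_type processor; };

static const struct cpu_entry processor_target_table[] = {
  { "power9", PROCESSOR_POWER9 },
  { "power10", PROCESSOR_POWER10 },
  { "future", PROCESSOR_FUTURE },
};

#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))

/* Return the table index for PROCESSOR, or -1 if it is not listed.  */
static int
cpu_index_lookup (enum processor_type processor)
{
  for (int i = 0; i < (int) ARRAY_SIZE (processor_target_table); i++)
    if (processor_target_table[i].processor == processor)
      return i;

  return -1;
}

/* Pick the tuning index for a configured default.  There is no separate
   "future" tuning in this sketch, so it falls back to power10.  */
static int
default_tune_index (enum processor_type configured_default)
{
  return cpu_index_lookup (configured_default == PROCESSOR_FUTURE
			   ? PROCESSOR_POWER10
			   : configured_default);
}
```

With a default of PROCESSOR_FUTURE the sketch returns the power10 index;
any other default is looked up unchanged.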

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations.
  2024-02-04  5:47   ` Repost " Kewen.Lin
@ 2024-02-07 20:01     ` Michael Meissner
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-02-07 20:01 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Peter Bergner

On Sun, Feb 04, 2024 at 01:47:12PM +0800, Kewen.Lin wrote:
> > diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> > index 525a85146ff..f06e6bbb184 100644
> > --- a/gcc/config/rs6000/mma.md
> > +++ b/gcc/config/rs6000/mma.md
> > @@ -227,13 +227,22 @@ (define_int_attr apv		[(UNSPEC_MMA_XVF64GERPP		"xvf64gerpp")
> >  
> >  (define_int_attr vvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8		"pmxvi4ger8")])
> >  
> > +(define_int_attr vvi4i4i8_dm	[(UNSPEC_MMA_PMXVI4GER8		"pmdmxvi4ger8")])
> 
> Can we update vvi4i4i8 to
> 
> (define_int_attr vvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8		"xvi4ger8")])
> 
> by avoiding to introduce vvi4i4i8_dm, then its use places would be like:
> 
> -  "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
> +  "@
> +   pmdm<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5
> +   pm<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5
> +   pm<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
> 
> and 
> 
> - define_insn "mma_<vvi4i4i8>"
> + define_insn "mma_pm<vvi4i4i8>"
> 
> (or updating its use in corresponding bif expander field)

Yes I can do that.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.
  2024-02-07  9:38       ` Kewen.Lin
@ 2024-02-08  0:26         ` Michael Meissner
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-02-08  0:26 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, Segher Boessenkool, David Edelsohn,
	gcc-patches, Peter Bergner

On Wed, Feb 07, 2024 at 05:38:46PM +0800, Kewen.Lin wrote:
> >>> -(define_insn_and_split "*movxo"
> >>> +(define_insn_and_split "*movxo_nodm"
> >>>    [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
> >>>  	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
> >>> -  "TARGET_MMA
> >>> +  "TARGET_MMA && !TARGET_DENSE_MATH
> >>>     && (gpc_reg_operand (operands[0], XOmode)
> >>>         || gpc_reg_operand (operands[1], XOmode))"
> >>>    "@
> >>> @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
> >>>     (set_attr "length" "*,*,16")
> >>>     (set_attr "max_prefixed_insns" "2,2,*")])
> >>>  
> >>> +(define_insn_and_split "*movxo_dm"
> >>> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
> >>> +	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]
> >>
> >> Why not adopt ZwO rather than QwO?
> > 
> > You have to split the address into 2 addresses for loading or storing vector
> > pairs (or 4 addresses for loading or storing vectors).  Z would allow
> > register+register addresses, and you wouldn't be able to create the second 
> > > address by adding 128 to it.  Hence it uses 'Q' for register only and 'wO' for
> > d-form addresses.
> 
> Thanks for clarifying.  But without this patch the define_insn_and_split *movxo
> adopts "ZwO", IMHO it would mean the current "*movxo" define_insn_and_split have
> been problematic?  I thought adjust_address can ensure the new address would be
> still valid after adjusting 128 offset, could you double check?

Well it is more of a theoretical bug.  Using 'Z' is wrong as I said because after
register allocation you can't split an x-form (register+register) address.
Using 'Q' would not allow reg+reg but would allow reg, which can be split
because the 2nd address will be a d-form (reg+offset).

But in practice, it won't be an issue since rs6000_setup_reg_addr_masks won't
allow reg+reg addresses for TDOmode.
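
To spell out the reasoning, here is a toy model of the three address
forms.  It is only a sketch of the argument above: the address struct
and offset_address are hypothetical, offset range limits are ignored,
and it does not describe what adjust_address actually does.

```c
#include <stdbool.h>

/* Toy model of the address forms discussed above.  */
enum addr_form
{
  ADDR_REG,		/* register only ('Q')  */
  ADDR_REG_OFFSET,	/* d-form: register + constant offset  */
  ADDR_REG_REG		/* x-form: register + register  */
};

struct address
{
  enum addr_form form;
  int base;		/* base register number  */
  int index;		/* index register number, x-form only  */
  long offset;		/* constant offset, d-form only  */
};

/* Try to form ADDR + DELTA without using a new register.  This works
   for register-only and d-form addresses, where the result is a d-form
   address.  An x-form address has no constant field to fold DELTA
   into, so the function reports failure.  */
static bool
offset_address (const struct address *addr, long delta, struct address *out)
{
  if (addr->form == ADDR_REG_REG)
    return false;

  out->form = ADDR_REG_OFFSET;
  out->base = addr->base;
  out->index = -1;
  out->offset = (addr->form == ADDR_REG_OFFSET ? addr->offset : 0) + delta;
  return true;
}
```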

> >>>  #ifdef TARGET_REGNAMES
> >>> @@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
> >>>    "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
> >>>    /* vrsave vscr sfp */
> >>>    "vrsave", "vscr", "sfp",
> >>> +  /* DMRs */
> >>> +  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
> >>
> >> Should be without "r" here, as tested gas doesn't recognize %dmr0 but it does
> >> recognize %dm0.
> 
> I guess a reply was missing on this part (and on some later ones)?  Just want
> to make sure nothing was missed; hope this helps.  :)

I missed this on the first round of comments, but if gas doesn't like %dmr2 we
should use %dm2.

Thanks for catching this.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.
  2024-02-05  3:58   ` Repost " Kewen.Lin
@ 2024-02-08  0:35     ` Michael Meissner
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Meissner @ 2024-02-08  0:35 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, gcc-patches, Segher Boessenkool,
	David Edelsohn, Peter Bergner

On Mon, Feb 05, 2024 at 11:58:31AM +0800, Kewen.Lin wrote:
> Hi Mike,

I will comment on about 1/2 of the things, and come back with the other
comments.

> on 2024/1/6 07:42, Michael Meissner wrote:
> > This patch is a preliminary patch to add the full 1,024 bit dense math register
> > (DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto the top of the
> > DMR register.
> > 
> > This patch only adds the new 1,024 bit register support.  It does not add
> > support for any instructions that need 1,024 bit registers instead of 512 bit
> > registers.
> > 
> > I used the new mode 'TDOmode' to be the opaque mode used for 1,204 bit
> 
> typo: 1,204

Thanks.

> > +(define_insn_and_split "*movtdo"
> > +  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
> > +	(match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
> > +  "TARGET_DENSE_MATH
> > +   && (gpc_reg_operand (operands[0], TDOmode)
> > +       || gpc_reg_operand (operands[1], TDOmode))"
> > +  "@
> > +   #
> > +   #
> > +   #
> > +   #
> > +   dmmr %0,%1
> > +   #"
> > +  "&& reload_completed
> > +   && (!dmr_operand (operands[0], TDOmode) || !dmr_operand (operands[1], TDOmode))"
> > +  [(const_int 0)]
> > +{
> > +  rtx op0 = operands[0];
> > +  rtx op1 = operands[1];
> > +
> > +  if (REG_P (op0) && REG_P (op1))
> > +    {
> > +      int regno0 = REGNO (op0);
> > +      int regno1 = REGNO (op1);
> > +
> > +      if (DMR_REGNO_P (regno0) && VSX_REGNO_P (regno1))
> > +	{
> > +	  rtx op1_upper = gen_rtx_REG (XOmode, regno1);
> > +	  rtx op1_lower = gen_rtx_REG (XOmode, regno1 + 4);
> > +	  emit_insn (gen_movtdo_insert512_upper (op0, op1_upper));
> > +	  emit_insn (gen_movtdo_insert512_lower (op0, op0, op1_lower));
> > +	  DONE;
> > +	}
> > +
> > +      else if (VSX_REGNO_P (regno0) && DMR_REGNO_P (regno1))
> > +	{
> > +	  rtx op0_upper = gen_rtx_REG (XOmode, regno0);
> > +	  rtx op0_lower = gen_rtx_REG (XOmode, regno0 + 4);
> > +	  emit_insn (gen_movtdo_extract512 (op0_upper, op1, const0_rtx));
> > +	  emit_insn (gen_movtdo_extract512 (op0_lower, op1, const1_rtx));
> > +	  DONE;
> > +	}
> 
> Add an assertion like gcc_assert (VSX_REGNO_P (regno0) && VSX_REGNO_P (regno1))?

Ok.

> > +
> > +;; Reload DMR registers from memory
> > +(define_insn_and_split "reload_dmr_from_memory"
> > +  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
> > +	(unspec:TDO [(match_operand:TDO 1 "memory_operand" "m")]
> > +		    UNSPEC_DMR_RELOAD_FROM_MEMORY))
> > +   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
> > +  "TARGET_DENSE_MATH"
> > +  "#"
> > +  "&& reload_completed"
> > +  [(const_int 0)]
> > +{
> > +  rtx dest = operands[0];
> > +  rtx src = operands[1];
> > +  rtx tmp = operands[2];
> > +  rtx mem_upper = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
> > +  rtx mem_lower = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);
> 
> I think the offset should be 64 rather than 32.

Good catch, thanks.
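
For the record, the arithmetic behind the corrected offset, as a small
stand-alone sketch.  The sizes come from the OPAQUE_MODE definitions in
rs6000-modes.def; the two helper functions are hypothetical and only
mirror the BYTES_BIG_ENDIAN selection in the split.

```c
/* Sizes in bytes, taken from OPAQUE_MODE (XO, 64) and
   OPAQUE_MODE (TDO, 128) in rs6000-modes.def.  */
enum { XO_BYTES = 64, TDO_BYTES = 128 };

/* Hypothetical helpers: byte offset of each 512-bit half within the
   1,024-bit memory image, following the BYTES_BIG_ENDIAN selection
   used in the split.  */
static int
tdo_upper_offset (int bytes_big_endian)
{
  return bytes_big_endian ? 0 : XO_BYTES;
}

static int
tdo_lower_offset (int bytes_big_endian)
{
  return bytes_big_endian ? XO_BYTES : 0;
}
```

With 32 instead of 64, the two 64-byte accesses would cover bytes 0..63
and 32..95, so they would overlap and the last 32 bytes would never be
touched.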

> > +
> > +  emit_move_insn (tmp, mem_upper);
> > +  emit_insn (gen_movtdo_insert512_upper (dest, tmp));
> > +
> > +  emit_move_insn (tmp, mem_lower);
> > +  emit_insn (gen_movtdo_insert512_lower (dest, dest, tmp));
> > +  DONE;
> > +}
> > +  [(set_attr "length" "16")
> > +   (set_attr "max_prefixed_insns" "2")
> > +   (set_attr "type" "vecload")])
> > +
> > +;; Reload dense math registers to memory
> > +(define_insn_and_split "reload_dmr_to_memory"
> > +  [(set (match_operand:TDO 0 "memory_operand" "=m")
> > +	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "wD")]
> > +		    UNSPEC_DMR_RELOAD_TO_MEMORY))
> > +   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
> > +  "TARGET_DENSE_MATH"
> > +  "#"
> > +  "&& reload_completed"
> > +  [(const_int 0)]
> > +{
> > +  rtx dest = operands[0];
> > +  rtx src = operands[1];
> > +  rtx tmp = operands[2];
> > +  rtx mem_upper = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
> > +  rtx mem_lower = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);
> 
> Ditto.

Yep.

> > diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
> > index 6698274031b..54868d2009c 100644
> > --- a/gcc/config/rs6000/rs6000-builtin.cc
> > +++ b/gcc/config/rs6000/rs6000-builtin.cc
> > @@ -495,6 +495,8 @@ const char *rs6000_type_string (tree type_node)
> >      return "__vector_pair";
> >    else if (type_node == vector_quad_type_node)
> >      return "__vector_quad";
> > +  else if (type_node == dmr_type_node)
> > +    return "__dmr";
> >  
> >    return "unknown";
> >  }
> > @@ -781,6 +783,17 @@ rs6000_init_builtins (void)
> >    t = build_qualified_type (vector_quad_type_node, TYPE_QUAL_CONST);
> >    ptr_vector_quad_type_node = build_pointer_type (t);
> >  
> > +  dmr_type_node = make_node (OPAQUE_TYPE);
> > +  SET_TYPE_MODE (dmr_type_node, TDOmode);
> > +  TYPE_SIZE (dmr_type_node) = bitsize_int (GET_MODE_BITSIZE (TDOmode));
> > +  TYPE_PRECISION (dmr_type_node) = GET_MODE_BITSIZE (TDOmode);
> > +  TYPE_SIZE_UNIT (dmr_type_node) = size_int (GET_MODE_SIZE (TDOmode));
> > +  SET_TYPE_ALIGN (dmr_type_node, 512);
> 
> why not 1024?

Since we don't have a 1,024 bit load/store and have to use multiple vector pair
or vector load/stores, there is no reason to ask for a 1,024 alignment.  In
addition, I would worry that having a larger alignment might be an issue with
the stack, since I don't believe we have support for aligning the stack to
1,024 bit boundaries.

> > --- a/gcc/config/rs6000/rs6000-call.cc
> > +++ b/gcc/config/rs6000/rs6000-call.cc
> > @@ -437,7 +437,8 @@ rs6000_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
> >    if (cfun
> >        && !cfun->machine->mma_return_type_error
> >        && TREE_TYPE (cfun->decl) == fntype
> > -      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode))
> > +      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode
> > +	  || TYPE_MODE (type) == TDOmode))
> 
> May be just with OPAQUE_MODE_P (TYPE_MODE (type)) for all the cases on type mode.

Basically I forgot about using OPAQUE_MODE in this case.  Using OPAQUE_MODE is
better.

> So far only rs6000 defines OPAQUE_MODE, if we are worried that there are some generic opaque modes
> some day, we can probably add one assertion somewhere to guarantee it.  Or add one macro like
> OPAQUE_MMA_MODE_P to ensure it only matches {OO,XO,TDO}mode.
> 
> >      {
> >        /* Record we have now handled function CFUN, so the next time we
> >  	 are called, we do not re-report the same error.  */
> > @@ -1641,6 +1642,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
> >        return NULL_RTX;
> >      }
> >  
> > +  if (mode == TDOmode)
> > +    {
> > +      if (TYPE_CANONICAL (type) != NULL_TREE)
> > +	type = TYPE_CANONICAL (type);
> > +      error ("invalid use of dense math operand of type %qs as a function "
> > +	     "parameter",
> > +	     IDENTIFIER_POINTER (DECL_NAME (TYPE_NAME (type))));
> > +      return NULL_RTX;
> > +    }
> 
> Can we merge this hunk into the above hunk for OOmode and XOmode?  Then the code with TYPE_CANONICAL
> can be shared and better to maintain.  IMHO, this dense math operand is also MMA operand so the above
> error message still works, if it's desired to note this dense math operand then we can use
> (mode == TDOmode)? "dense math": "MMA" for the different string part.

I will need to look into this later.

> > +
> >    /* Return a marker to indicate whether CR1 needs to set or clear the
> >       bit that V.4 uses to say fp args were passed in registers.
> >       Assume that we don't need the marker for software floating point,
> > diff --git a/gcc/config/rs6000/rs6000-modes.def b/gcc/config/rs6000/rs6000-modes.def
> > index 094b246c834..60ebb363196 100644
> > --- a/gcc/config/rs6000/rs6000-modes.def
> > +++ b/gcc/config/rs6000/rs6000-modes.def
> > @@ -86,3 +86,7 @@ PARTIAL_INT_MODE (TI, 128, PTI);
> >  /* Modes used by __vector_pair and __vector_quad.  */
> >  OPAQUE_MODE (OO, 32);
> >  OPAQUE_MODE (XO, 64);
> > +
> > +/* Modes used by __dmr.  */
> 
> Nit: s/Modes/Mode/
> 
> > +OPAQUE_MODE (TDO, 128);
> > +
> 
> I assumed that "TD" stands for something but I have no idea (at least not obvious to me),
> could we also put some comments for it?

Basically Segher and I went back and forth on the names.  I would have to dig
into my notes what TDO stands for.

> > diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> > index 59517c8608d..aed4b72c4ea 100644
> > --- a/gcc/config/rs6000/rs6000.cc
> > +++ b/gcc/config/rs6000/rs6000.cc
> > @@ -1846,7 +1846,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
> >       128-bit floating point that can go in vector registers, which has VSX
> >       memory addressing.  */
> >    if (FP_REGNO_P (regno))
> > -    reg_size = (VECTOR_MEM_VSX_P (mode) || VECTOR_ALIGNMENT_P (mode)
> > +    reg_size = (VECTOR_MEM_VSX_P (mode)
> > +		|| VECTOR_ALIGNMENT_P (mode)
> > +		|| mode == TDOmode
> 
> Redundant change, since VECTOR_ALIGNMENT_P considers TDOmode as this patch changes.

Ok.

And I'll get back to the rest of the comments shortly.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Repost [PATCH 0/6] PowerPC Future patches
  2024-01-05 23:27 Repost [PATCH 0/6] PowerPC Future patches Michael Meissner
                   ` (5 preceding siblings ...)
  2024-01-05 23:42 ` Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers Michael Meissner
@ 2024-02-08 18:22 ` Segher Boessenkool
  6 siblings, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2024-02-08 18:22 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Kewen.Lin, David Edelsohn, Peter Bergner

Hi!

On Fri, Jan 05, 2024 at 06:27:05PM -0500, Michael Meissner wrote:
> In the current MMA subsystem for Power10, there are 8 512-bit accumulator
> registers.  These accumulators are each tied to sets of 4 FPR registers.  When

Four VSX registers -- the FP registers are only a 64 bit part of each of
those.  Please do not call those VSX registers "FPRs".  They are not.

> These patches add support for the 512-bit accumulators within the dense math
> system, and for allocation of the 1,024-bit DMRs.  At this time, no additional
> built-in functions will be done to support any dense math features other than
> doing data movement between the DMRs and the VSX registers.  Before we can look
> at adding any new dense math support other than data movement, we need the GCC
> compiler to be able to allocate and use these DMRs.

Okido.

> If you compile with -mcpu=power10, the wD constraint will match the equivalent
> FPR register that overlaps with the accumulator.  If you compile with
> -mcpu=future, the wD constraint will match the DMR register and not the FPR
> register.
> 
> These patches also modifies the print_operand %A output modifier to print out
> DMR register numbers if -mcpu=future, and continue to print out the FPR
> register number divided by 4 for -mcpu=power10.

Yup.  Unfortunately that is the best we can do probably.  It _feels_
fragile, but it will probably be okay in practice.

> Going forward, hopefully if you modify your code to use the wD constraint and
> %A output modifier, you can write code that switches more easily between the
> two systems.

But it will never become completely transparent.  Luckily the old thing
will over time fade into the background.

So, please post the -mcpu=future patches in a separate series, first.
I'll comment on that patch in a minute, you'll probably want to take
those comments into consideration before posting that series ;-)


Segher

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-02-06  6:01     ` Michael Meissner
  2024-02-07  9:21       ` Kewen.Lin
@ 2024-02-08 18:35       ` Segher Boessenkool
  1 sibling, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2024-02-08 18:35 UTC (permalink / raw)
  To: Michael Meissner, Kewen.Lin, gcc-patches, David Edelsohn, Peter Bergner

On Tue, Feb 06, 2024 at 01:01:52AM -0500, Michael Meissner wrote:
> > Nit: Named as "ISA_FUTURE_MASKS_SERVER" seems more accurate as it's constituted
> > with ISA_3_1_MASKS_**SERVER** ...
> 
> Well the _SERVER stuff was due to the power7 days when we still had to support
> the E500 in the main rs6000 tree.  But I will change it to be more consistent
> in the future patches.

"_SERVER" still is a good shortish name for the server systems ;-)

> > > @@ -67,7 +67,9 @@ enum processor_type
> > >     PROCESSOR_MPCCORE,
> > >     PROCESSOR_CELL,
> > >     PROCESSOR_PPCA2,
> > > -   PROCESSOR_TITAN
> > > +   PROCESSOR_TITAN,
> > > +
> > 
> > Nit: unintentional empty line?
> > 
> > > +   PROCESSOR_FUTURE
> > >  };
> 
> It was more as a separation.  The MPCCORE, CELL, PPCA2, and TITAN are rather
> old processors.  I don't recall why we kept them after the POWER<x>.

Please don't add random separations.

> Logically we should re-order the list and move MPCCORE, etc. earlier, but I
> will delete the blank line in future patches.

Don't randomly reorder, either.

_FUTURE should be added after POWER11.

> > I think we should also update asm_names in driver-rs6000.cc.
> 
> Ok.  Though the driver-rs6000.cc stuff won't kick in until we have a real
> system that matches "future".

Or when during development you have that faked.  You did test it, right?
:-)


Segher

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-02-07  9:21       ` Kewen.Lin
  2024-02-07 19:58         ` Michael Meissner
@ 2024-02-08 18:42         ` Segher Boessenkool
  1 sibling, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2024-02-08 18:42 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: Michael Meissner, gcc-patches, David Edelsohn, Peter Bergner

On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
> on 2024/2/6 14:01, Michael Meissner wrote:
> > It was more as a separation.  The MPCCORE, CELL, PPCA2, and TITAN are rather
> > old processors.

I'll probably remove Titan soonish, btw.  We have adjusted code around
it for what, fifteen years?  But the hardware never materialized.  There
are more silly things in our backend, but this one takes the prize.

> OK, considering we only get this warning once for a simple case, I'm inclined
> not to keep a static variable for it, it's the same as what we do currently
> for option conflict errors emission.  But I'm fine for either.

Whatever is easiest.


Segher

* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-01-05 23:35 ` Repost [PATCH 1/6] Add -mcpu=future Michael Meissner
  2024-01-19 18:43   ` Ping " Michael Meissner
  2024-01-23  8:44   ` Repost " Kewen.Lin
@ 2024-02-08 20:10   ` Segher Boessenkool
  2 siblings, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2024-02-08 20:10 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Kewen.Lin, David Edelsohn, Peter Bergner

On Fri, Jan 05, 2024 at 06:35:37PM -0500, Michael Meissner wrote:
> 	* config/rs6000/rs6000.opt (-mfuture): New undocumented debug switch.

No.  Never ever use a flag that does what -mcpu=<smth> should do.  We're
still trying to recover from previous such mistakes.  Don't add more
please.

> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -447,6 +447,8 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT flags)
>      rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR9");
>    if ((flags & OPTION_MASK_POWER10) != 0)
>      rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR10");
> +  if ((flags & OPTION_MASK_FUTURE) != 0)
> +    rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR_FUTURE");

if ((((a & B) != 0) != 0) != 0)  ?  You can do just
if (a & B)

Yes, existing code already does the silly thing, but just fix it then,
don't add more :-)

(And no   if ((a & B))   either please).
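
To illustrate the point about the redundant comparison, here is a small
standalone C sketch.  It is not the GCC source: MASK_B and both helper names
are made up for illustration, with MASK_B standing in for a mask such as
OPTION_MASK_FUTURE.  It only shows that the two spellings take the same branch.

```c
#include <stdbool.h>

/* Hypothetical mask bit, standing in for something like OPTION_MASK_FUTURE.  */
#define MASK_B 0x10UL

/* Verbose spelling, as the existing code writes it.  */
static bool
has_flag_verbose (unsigned long flags)
{
  if ((flags & MASK_B) != 0)
    return true;
  return false;
}

/* Short spelling: a nonzero value already counts as true in a condition.  */
static bool
has_flag_short (unsigned long flags)
{
  if (flags & MASK_B)
    return true;
  return false;
}
```

Both helpers agree for every input, so the shorter form loses nothing.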

> +static int
> rs6000_cpu_index_lookup (enum processor_type processor)
> +{
> +  for (size_t i = 0; i < ARRAY_SIZE (processor_target_table); i++)
> +    if (processor_target_table[i].processor == processor)
> +      return i;
> +
> +  return -1;
> +}

"int i" please, not "size_t".  This has nothing to do with object sizes.
The loop counter will always be a small number.
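
A minimal standalone sketch of such a lookup with an "int" counter might look
like the following.  The enum values, the table contents and the names
target_table and cpu_index_lookup are assumptions made for this example; they
only mimic the shape of processor_target_table and are not the rs6000 code.

```c
#include <stddef.h>

/* Hypothetical processor list; PROC_UNLISTED is deliberately not in the table.  */
enum processor_type { PROC_POWER10, PROC_POWER11, PROC_FUTURE, PROC_UNLISTED };

struct target_entry
{
  const char *name;
  enum processor_type processor;
};

/* Illustrative stand-in for processor_target_table.  */
static const struct target_entry target_table[] = {
  { "power10", PROC_POWER10 },
  { "power11", PROC_POWER11 },
  { "future", PROC_FUTURE },
};

#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))

/* Return the table index for PROCESSOR, or -1 if it is not listed.
   The counter is a plain int: the table is small and the result is an int.  */
static int
cpu_index_lookup (enum processor_type processor)
{
  for (int i = 0; i < (int) ARRAY_SIZE (target_table); i++)
    if (target_table[i].processor == processor)
      return i;

  return -1;
}
```

The cast on ARRAY_SIZE keeps the comparison between two ints, which avoids a
signed/unsigned comparison warning.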

> +  /* At the moment, we don't have explicit -mtune=future support.  If the user

"At the moment" is out of date almost as soon as you write it.  It is
better to avoid such terms ;-)

> +     explicitly tried to use -mtune=future, give a warning.  If not, use the
> +     power10 tuning until future tuning is added.  */

There should be Power11 tuning now, please use that?

So please post this -- as a separate series, and not as a single patch --
after fixing the things Ke Wen pointed out.  Thanks!


Segher

* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-02-07 19:58         ` Michael Meissner
@ 2024-02-20 10:35           ` Kewen.Lin
  2024-02-21  7:19             ` Michael Meissner
  2024-02-23 17:57             ` Segher Boessenkool
  0 siblings, 2 replies; 36+ messages in thread
From: Kewen.Lin @ 2024-02-20 10:35 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Segher Boessenkool, Peter Bergner, David Edelsohn

Hi Mike,

Sorry for the late reply (just back from vacation).

on 2024/2/8 03:58, Michael Meissner wrote:
> On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
>> on 2024/2/6 14:01, Michael Meissner wrote:
>> Sorry for the possible confusion here, the "tune_proc" that I referred to is
>> the variable in the above else branch:
>>
>>    enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
>>
>> It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
>> chance to be PROCESSOR_FUTURE, so the checking "tune_proc == PROCESSOR_FUTURE"
>> is useless.
> 
> PROCESSOR_DEFAULT can be PROCESSOR_FUTURE if somebody configures GCC with
> --with-cpu=future.  While in general it shouldn't occur, it is helpful to
> consider all of the corner cases.

But that does not sound right; I think you meant TARGET_CPU_DEFAULT instead?

On one local ppc64le machine I tried to configure with --with-cpu=power10,
I got {,OPTION_}TARGET_CPU_DEFAULT "power10" but PROCESSOR_DEFAULT is still
PROCESSOR_POWER7 (PROCESSOR_DEFAULT64 is PROCESSOR_POWER8).  I think these
PROCESSOR_DEFAULT{,64} are defined by various headers:

$ grep -r "define PROCESSOR_DEFAULT" gcc/config/rs6000/
gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER8
gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT  PROCESSOR_PPC7400
gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT64  PROCESSOR_POWER4
gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC7450
gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT   PROCESSOR_PPC603
gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT64 PROCESSOR_RS64A
gcc/config/rs6000/vxworks.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC604

, and they are unlikely to be updated later, no?

btw, configuring with --with-cpu=future means cpu_index is never negative, so in

  ...
  else if (cpu_index >= 0)
    rs6000_tune_index = tune_index = cpu_index;
  else
    ...

there is no chance to enter the "else" arm; that arm only takes
effect when no cpu/tune is given (neither -m{cpu,tune} nor --with-cpu=).

BR,
Kewen


* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-02-20 10:35           ` Kewen.Lin
@ 2024-02-21  7:19             ` Michael Meissner
  2024-02-26 10:46               ` Kewen.Lin
  2024-02-23 17:57             ` Segher Boessenkool
  1 sibling, 1 reply; 36+ messages in thread
From: Michael Meissner @ 2024-02-21  7:19 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Michael Meissner, gcc-patches, Segher Boessenkool, Peter Bergner,
	David Edelsohn

On Tue, Feb 20, 2024 at 06:35:34PM +0800, Kewen.Lin wrote:
> Hi Mike,
> 
> Sorry for late reply (just back from vacation).
> 
> on 2024/2/8 03:58, Michael Meissner wrote:
> > On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
> >> on 2024/2/6 14:01, Michael Meissner wrote:
> >> Sorry for the possible confusion here, the "tune_proc" that I referred to is
> >> the variable in the above else branch:
> >>
> >>    enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
> >>
> >> It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
> >> chance to be PROCESSOR_FUTURE, so the checking "tune_proc == PROCESSOR_FUTURE"
> >> is useless.
> > 
> > PROCESSOR_DEFAULT can be PROCESSOR_FUTURE if somebody configures GCC with
> > --with-cpu=future.  While in general it shouldn't occur, it is helpful to
> > consider all of the corner cases.
> 
> But it sounds not true, I think you meant TARGET_CPU_DEFAULT instead?
> 
> On one local ppc64le machine I tried to configure with --with-cpu=power10,
> I got {,OPTION_}TARGET_CPU_DEFAULT "power10" but PROCESSOR_DEFAULT is still
> PROCESSOR_POWER7 (PROCESSOR_DEFAULT64 is PROCESSOR_POWER8).  I think these
> PROCESSOR_DEFAULT{,64} are defined by various headers:

Yes, I was mistaken.  You are correct that TARGET_CPU_DEFAULT is what gets set.
I will change the comments.

> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER8
> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT  PROCESSOR_PPC7400
> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT64  PROCESSOR_POWER4
> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC7450
> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT   PROCESSOR_PPC603
> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT64 PROCESSOR_RS64A
> gcc/config/rs6000/vxworks.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC604
> 
> , and they are unlikely to be updated later, no?
> 
> btw, the given --with-cpu=future will make cpu_index never negative so
> 
>   ...
>   else if (cpu_index >= 0)
>     rs6000_tune_index = tune_index = cpu_index;
>   else
>     ... 
> 
> so there is no chance to enter "else" arm, that is, that arm only takes
> effect when no cpu/tune is given (neither -m{cpu,tune} nor --with-cpu=).

Note, this is existing code.  I didn't modify it.  If we want to change it, we
should do it as another patch.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-02-20 10:35           ` Kewen.Lin
  2024-02-21  7:19             ` Michael Meissner
@ 2024-02-23 17:57             ` Segher Boessenkool
  1 sibling, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2024-02-23 17:57 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: Michael Meissner, gcc-patches, Peter Bergner, David Edelsohn

On Tue, Feb 20, 2024 at 06:35:34PM +0800, Kewen.Lin wrote:
> on 2024/2/8 03:58, Michael Meissner wrote:
> $ grep -r "define PROCESSOR_DEFAULT" gcc/config/rs6000/
> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER8
> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT  PROCESSOR_PPC7400
> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT64  PROCESSOR_POWER4
> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC7450
> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT   PROCESSOR_PPC603
> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT64 PROCESSOR_RS64A
> gcc/config/rs6000/vxworks.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC604
> 
> , and they are unlikely to be updated later, no?

In most cases this would be an ABI change.  Almost never an acceptable
thing to do.


Segher

* Re: Repost [PATCH 1/6] Add -mcpu=future
  2024-02-21  7:19             ` Michael Meissner
@ 2024-02-26 10:46               ` Kewen.Lin
  0 siblings, 0 replies; 36+ messages in thread
From: Kewen.Lin @ 2024-02-26 10:46 UTC (permalink / raw)
  To: Michael Meissner
  Cc: gcc-patches, Peter Bergner, Segher Boessenkool, David Edelsohn

on 2024/2/21 15:19, Michael Meissner wrote:
> On Tue, Feb 20, 2024 at 06:35:34PM +0800, Kewen.Lin wrote:
>> Hi Mike,
>>
>> Sorry for late reply (just back from vacation).
>>
>> on 2024/2/8 03:58, Michael Meissner wrote:
>>> On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
>>>> on 2024/2/6 14:01, Michael Meissner wrote:
>>>> Sorry for the possible confusion here, the "tune_proc" that I referred to is
>>>> the variable in the above else branch:
>>>>
>>>>    enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
>>>>
>>>> It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
>>>> chance to be PROCESSOR_FUTURE, so the checking "tune_proc == PROCESSOR_FUTURE"
>>>> is useless.
>>>
>>> PROCESSOR_DEFAULT can be PROCESSOR_FUTURE if somebody configures GCC with
>>> --with-cpu=future.  While in general it shouldn't occur, it is helpful to
>>> consider all of the corner cases.
>>
>> But it sounds not true, I think you meant TARGET_CPU_DEFAULT instead?
>>
>> On one local ppc64le machine I tried to configure with --with-cpu=power10,
>> I got {,OPTION_}TARGET_CPU_DEFAULT "power10" but PROCESSOR_DEFAULT is still
>> PROCESSOR_POWER7 (PROCESSOR_DEFAULT64 is PROCESSOR_POWER8).  I think these
>> PROCESSOR_DEFAULT{,64} are defined by various headers:
> 
> Yes, I was mistaken.  You are correct TARGET_CPU_DEFAULT is set.  I will change
> the comments.

Thanks!

> 
>> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
>> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
>> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER8
>> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT  PROCESSOR_PPC7400
>> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT64  PROCESSOR_POWER4
>> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC7450
>> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT   PROCESSOR_PPC603
>> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT64 PROCESSOR_RS64A
>> gcc/config/rs6000/vxworks.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC604
>>
>> , and they are unlikely to be updated later, no?
>>
>> btw, the given --with-cpu=future will make cpu_index never negative so
>>
>>   ...
>>   else if (cpu_index >= 0)
>>     rs6000_tune_index = tune_index = cpu_index;
>>   else
>>     ... 
>>
>> so there is no chance to enter "else" arm, that is, that arm only takes
>> effect when no cpu/tune is given (neither -m{cpu,tune} nor --with-cpu=).
> 
> Note, this is existing code.  I didn't modify it.  If we want to change it, we
> should do it as another patch.

Yes, I agree.  Just to clarify, I didn't suggest changing it; I suggested
keeping the existing arms almost as they are, since the "else" arm needs no
changes.  So instead of updating the "if" and "else if" arms for the "future"
cpu type, it seems a bit clearer to just check for it afterwards, i.e.:

----

bool explicit_tune = false;
if (rs6000_tune_index >= 0)
  {
    tune_index = rs6000_tune_index;
    explicit_tune = true;
  }
else if (cpu_index >= 0)
  // as before
  rs6000_tune_index = tune_index = cpu_index;
else
  {
   //as before
   ...
  }

// Check tune_index here instead.

if (processor_target_table[tune_index].processor == PROCESSOR_FUTURE)
  {
    tune_index = rs6000_cpu_index_lookup (PROCESSOR_POWER10);
    if (explicit_tune)
      warn ...
  }

// as before
rs6000_tune = processor_target_table[tune_index].processor;

----

, copied from previous comment: https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643681.html
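
For readers who want to try the control flow outside of GCC, the outline above
can be approximated by the following standalone sketch.  Everything in it is an
assumption for illustration: the three-entry table, the names choose_tune and
struct tune_choice, and the "warn" field that stands in for the actual warning
call.  It is not the rs6000 implementation.

```c
#include <stdbool.h>

/* Hypothetical processors, indexed by their position in the table below.  */
enum processor_type { PROC_POWER8, PROC_POWER10, PROC_FUTURE };

static const enum processor_type table[] = {
  PROC_POWER8, PROC_POWER10, PROC_FUTURE
};

/* Return the table index of P, or -1 if it is not present.  */
static int
lookup (enum processor_type p)
{
  for (int i = 0; i < (int) (sizeof (table) / sizeof (table[0])); i++)
    if (table[i] == p)
      return i;
  return -1;
}

struct tune_choice
{
  int tune_index;   /* Index finally used for tuning.  */
  bool warn;        /* True if an explicit "future" tune was replaced.  */
};

/* Pick the tune index: explicit -mtune first, then -mcpu, then the default
   (DEFAULT_INDEX is assumed to be a valid index).  Only afterwards check
   for "future" and fall back to power10 tuning.  */
static struct tune_choice
choose_tune (int explicit_tune_index, int cpu_index, int default_index)
{
  struct tune_choice r = { default_index, false };
  bool explicit_tune = false;

  if (explicit_tune_index >= 0)
    {
      r.tune_index = explicit_tune_index;
      explicit_tune = true;
    }
  else if (cpu_index >= 0)
    r.tune_index = cpu_index;

  /* Single check after the arms, as suggested above.  */
  if (table[r.tune_index] == PROC_FUTURE)
    {
      r.tune_index = lookup (PROC_POWER10);
      r.warn = explicit_tune;
    }

  return r;
}
```

With this shape the "future" handling lives in one place, and the three
existing arms stay as they were.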

BR,
Kewen


Thread overview: 36+ messages
2024-01-05 23:27 Repost [PATCH 0/6] PowerPC Future patches Michael Meissner
2024-01-05 23:35 ` Repost [PATCH 1/6] Add -mcpu=future Michael Meissner
2024-01-19 18:43   ` Ping " Michael Meissner
2024-01-23  8:44   ` Repost " Kewen.Lin
2024-02-06  6:01     ` Michael Meissner
2024-02-07  9:21       ` Kewen.Lin
2024-02-07 19:58         ` Michael Meissner
2024-02-20 10:35           ` Kewen.Lin
2024-02-21  7:19             ` Michael Meissner
2024-02-26 10:46               ` Kewen.Lin
2024-02-23 17:57             ` Segher Boessenkool
2024-02-08 18:42         ` Segher Boessenkool
2024-02-08 18:35       ` Segher Boessenkool
2024-02-08 20:10   ` Segher Boessenkool
2024-01-05 23:37 ` Repost [PATCH 2/6] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair Michael Meissner
2024-01-19 18:44   ` Ping " Michael Meissner
2024-01-23  8:54   ` Repost " Kewen.Lin
2024-01-05 23:38 ` Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers Michael Meissner
2024-01-19 18:46   ` Ping " Michael Meissner
2024-01-25  9:28   ` Repost " Kewen.Lin
2024-02-07  0:06     ` Michael Meissner
2024-02-07  9:38       ` Kewen.Lin
2024-02-08  0:26         ` Michael Meissner
2024-01-05 23:39 ` Repost [PATCH 4/6] PowerPC: Make MMA insns support " Michael Meissner
2024-01-19 18:47   ` Ping " Michael Meissner
2024-02-04  3:21   ` Repost " Kewen.Lin
2024-02-07  3:31     ` Michael Meissner
2024-01-05 23:40 ` Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations Michael Meissner
2024-01-19 18:48   ` Ping " Michael Meissner
2024-02-04  5:47   ` Repost " Kewen.Lin
2024-02-07 20:01     ` Michael Meissner
2024-01-05 23:42 ` Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers Michael Meissner
2024-01-19 18:49   ` Ping " Michael Meissner
2024-02-05  3:58   ` Repost " Kewen.Lin
2024-02-08  0:35     ` Michael Meissner
2024-02-08 18:22 ` Repost [PATCH 0/6] PowerPC Future patches Segher Boessenkool
