public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/8] PowerPC future support for Dense Math
@ 2023-02-03 21:16 Michael Meissner
  2023-02-03 21:21 ` [PATCH 1/8] PowerPC: Add -mcpu=future Michael Meissner
                   ` (8 more replies)
  0 siblings, 9 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:16 UTC
  To: gcc-patches, Michael Meissner, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

These patches were originally posted on November 10th.  Segher has asked that I
repost them.  These patches are somewhat changed since the original posting to
address some of the comments.

https://gcc.gnu.org/pipermail/gcc-patches/2022-November/605581.html

In the first patch (adding -mcpu=future), I have removed the code that made
-mtune=future act as -mtune=power10.  Instead, I went through all of the places
that look at the tuning (mostly in power10.md and rs6000.cc) and added future
as an option.  At a later time, we will provide a separate tuning file for
future (or whatever the new name will be if the instructions are added
officially), but for now this will suffice.

In patch #3, I fixed the incorrect opcode for clearing a dense math register
that Peter had pointed out.  I was using the name based on the existing clear
instruction instead of the name of the new instruction.

In patch #6, I fixed the code, relying on the changes for setting the precision
field to 16 bits.  Since that patch will not be able to go into GCC 13 at
present, we might skip that support for now.  The important thing for existing
users of the MMA code is that the support for accumulators living in separate
dense math registers, rather than overlapping the FPRs, does go in.  We can
probably delay the 1,024-bit register support, or implement it in a different
fashion.

In the insn names, I tried to switch to using _vsx instead of _fpr for the
existing MMA support instructions.  I also tried to clear up the comments to
specify ISA 3.1 instead of power10 when talking about the existing MMA
support.

The following is from the original posting (slightly modified):

This patch is very preliminary support for a potential new feature to the
PowerPC that extends the current power10 MMA architecture.  This feature may or
may not be present in any specific future PowerPC processor.

In the current MMA subsystem for Power10, there are 8 512-bit accumulator
registers.  These accumulators are each tied to sets of 4 FPR registers.  When
you issue a prime instruction, it makes sure the accumulator is a copy of the 4
FPR registers the accumulator is tied to.  When you issue a deprime
instruction, it makes sure that the accumulator data content is logically
copied to the matching FPR registers.
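
As a rough illustration of the overlap described above, the following C model
treats each accumulator as shadowing four consecutive 128-bit registers.  The
type and function names (vsx_reg, accumulator, model_prime, model_deprime) are
invented for this sketch and are not GCC or ISA identifiers.  It models only
the data flow, not the actual instructions.

```c
#include <stdint.h>
#include <string.h>

#define NUM_ACC 8          /* 8 accumulators on power10 */
#define REGS_PER_ACC 4     /* each tied to 4 FPR registers */

typedef struct { uint8_t bytes[16]; } vsx_reg;               /* 128 bits */
typedef struct { vsx_reg row[REGS_PER_ACC]; } accumulator;   /* 512 bits */

static vsx_reg fpr[NUM_ACC * REGS_PER_ACC];  /* registers 0..31 */
static accumulator acc[NUM_ACC];

/* Prime: make accumulator N a copy of registers 4*N .. 4*N+3.  */
static void model_prime (unsigned n)
{
  memcpy (&acc[n], &fpr[n * REGS_PER_ACC], sizeof (accumulator));
}

/* Deprime: copy accumulator N back to registers 4*N .. 4*N+3.  */
static void model_deprime (unsigned n)
{
  memcpy (&fpr[n * REGS_PER_ACC], &acc[n], sizeof (accumulator));
}
```

In this model, accumulator 2 is tied to registers 8 through 11, so any code
that touches those registers between a prime and a deprime would be working on
stale or soon-to-be-overwritten data.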

In the potential dense math system, the accumulators are moved to separate
registers called dense math registers (DM registers or DMR).  The DMRs are then
extended to 1,024 bits and new instructions will be added to deal with all
1,024 bits of the DMRs.

If you take existing MMA code, it will work as long as you don't do anything
with accumulators, and you follow the rules in the ISA 3.1 documentation for
using the MMA subsystem.

These patches add support for the 512-bit accumulators within the dense math
system, and for allocation of the 1,024-bit DMRs.  At this time, no additional
built-in functions will be added to support any dense math features other than
data movement between the DMRs and the VSX registers.  Before we can look
at adding any new dense math support other than data movement, we need the GCC
compiler to be able to allocate and use these DMRs.

There are 8 patches in this patch set:

1) The first patch just adds -mcpu=future as an option to add new support.
This is similar to the -mcpu=future that we did before power10 was announced.

2) The second patch enables GCC to use the load and store vector pair
instructions to optimize memory copy operations in the compiler.  For power10,
we needed to just stay with normal vector load/stores for memory copy
operations.

3) The third patch enables 512-bit accumulators stored in DMRs.  This patch
enables the register allocation, but it does not move the existing MMA code to
use these registers.

4) The fourth patch switches the MMA subsystem to use 512-bit accumulators
within DMRs if you use -mcpu=future.

5) The fifth patch switches the names of the MMA instructions to use the dense
math equivalent name if -mcpu=future.

6) The sixth patch enables using the full 1,024-bit DMRs.  Right now, all you
can do with DMRs is move a VSX register to a DMR register, or move a DMR
register to a VSX register.  [As I mentioned above, this patch is problematic
as is at the moment.]

7) The seventh patch is not DMR related.  It adds support for variants of the
load/store vector with length instruction that may be added in future PowerPC
processors.  These variants eliminate having to shift the byte length left by
56 bits.

8) The eighth patch is also not DMR related.  It adds support for a saturating
subtract operation that may be added to future PowerPC processors.
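
To make items 7 and 8 a little more concrete, the sketch below shows in plain C
what the surrounding code has to compute.  The existing load/store vector with
length instructions read the byte count from the most significant byte of a
64-bit register, which is where the shift by 56 comes from.  The saturating
subtract is shown as an unsigned operation that clamps at zero; the signedness
is an assumption, since the description above does not specify it.  Both
function names are made up for illustration.

```c
#include <stdint.h>

/* Value the existing instructions expect in the length register:
   the byte count sits in the top 8 bits of the 64-bit register.  */
static uint64_t length_operand_shifted (uint64_t nbytes)
{
  return nbytes << 56;
}

/* Unsigned saturating subtract: a - b, clamped to 0 instead of wrapping.
   (Assumed semantics, for illustration only.)  */
static uint64_t saturating_sub_u64 (uint64_t a, uint64_t b)
{
  return a > b ? a - b : 0;
}
```

With the variants described in item 7, the compiler could pass the plain byte
count and skip the equivalent of length_operand_shifted.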

In terms of changes, we now use the wD constraint for accumulators.  If you
compile with -mcpu=power10, the wD constraint will match the equivalent VSX
register (0..31) that overlaps with the accumulator.  If you compile with
-mcpu=future, the wD constraint will match the DMR register and not the FPR
register.

This patch also modifies the print_operand %A output modifier to print out DMR
register numbers if -mcpu=future, and continue to print out the FPR register
number divided by 4 for -mcpu=power10.
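
The numbering rule for %A can be modelled in a few lines of C.  This is not the
code in print_operand; acc_number_for_print and its parameters are
hypothetical, and register numbers are taken relative to the first register of
the relevant register file.

```c
#include <stdbool.h>

/* Hypothetical model of the number that %A prints.
   regno: index within the FPR file (power10) or the DMR file (future).  */
static unsigned acc_number_for_print (unsigned regno, bool dense_math)
{
  if (dense_math)
    return regno;        /* DMRs are numbered directly: 0..7 */
  return regno / 4;      /* FPR 0, 4, 8, ..., 28 -> accumulator 0..7 */
}
```

Either way the assembler sees an accumulator number in the range 0..7, which
is why the same template can serve both -mcpu=power10 and -mcpu=future.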

In general, if you only use the built-in functions, things work between the two
systems.  If you use extended asm, you will likely need to modify the code.
Going forward, hopefully if you modify your code to use the wD constraint and
%A output modifier, you can write code that switches more easily between the
two systems.

Again, these are preliminary patches for a potential future machine.  Things
will likely change in terms of implementation and usage over time.

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* [PATCH 1/8] PowerPC: Add -mcpu=future.
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
@ 2023-02-03 21:21 ` Michael Meissner
  2023-02-03 21:23 ` [PATCH 1/8] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair Michael Meissner
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:21 UTC
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

These patches implement support for potential future PowerPC cpus.  At this
time, features enabled with -mcpu=future may or may not be in actual PowerPCs
that will be delivered in the future.

This patch adds support for the -mcpu=future and -mtune=future options.
If you use -mcpu=future, the macro _ARCH_PWR_FUTURE is defined, and the
assembler .machine directive "future" is used.  Future patches in this
series will add support for new instructions that may be present in future
PowerPC processors.
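
As a small sketch of how user code might key off the new macro: the spelling
_ARCH_PWR_FUTURE is taken from the rs6000-c.cc hunk of this patch, and
_ARCH_PWR10 is the existing power10 macro.  The helper target_name is made up
for illustration.  Since ISA_FUTURE_MASKS in this patch includes the ISA 3.1
server masks, both macros would presumably be defined under -mcpu=future, so
the future test comes first.

```c
/* Select a code path at compile time based on the target CPU macros.  */
static const char *target_name (void)
{
#if defined (_ARCH_PWR_FUTURE)
  return "future";
#elif defined (_ARCH_PWR10)
  return "power10";
#else
  return "other";
#endif
}
```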

At the moment, we do not have any differences in tuning between power10 and
future.  It is anticipated that we may change the tuning characteristics for
-mtune=future at a later time.

The patches have been tested on the following platforms.  I added the patches
for PR target/107299 that I submitted on November 2nd before doing the builds so
that GCC would build on systems using IEEE 128-bit long double.
	* https://gcc.gnu.org/pipermail/gcc-patches/2022-November/604834.html

There were no regressions when doing bootstrap builds and running the regression
tests:

    1)	Power10 LE using --with-cpu=power10 --with-long-double-format=ieee;
    2)	Power10 LE using --with-cpu=power10 --with-long-double-format=ibm;
    3)	Power9 LE using --with-cpu=power9 --with-long-double-format=ibm; and
    4)	Power8 BE using --with-cpu=power8 (both 32-bit & 64-bit tested).

Can I check this patch into the GCC 13 master branch?

Note, I will be on vacation from Tuesday February 7th through Tuesday February
14th.

2023-02-03   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/power10.md (power10-load): Temporarily treat
	-mcpu=future the same as -mcpu=power10.
	(power10-fused-load): Likewise.
	(power10-prefixed-load): Likewise.
	(power10-load-update): Likewise.
	(power10-fpload-double): Likewise.
	(power10-prefixed-fpload-double): Likewise.
	(power10-fpload-update-double): Likewise.
	(power10-fpload-single): Likewise.
	(power10-fpload-update-single): Likewise.
	(power10-vecload): Likewise.
	(power10-vecload-pair): Likewise.
	(power10-store): Likewise.
	(power10-fused-store): Likewise.
	(power10-prefixed-store): Likewise.
	(power10-store-update): Likewise.
	(power10-vecstore-pair): Likewise.
	(power10-larx): Likewise.
	(power10-lq): Likewise.
	(power10-stcx): Likewise.
	(power10-stq): Likewise.
	(power10-sync): Likewise.
	(power10-alu): Likewise.
	(power10-fused_alu): Likewise.
	(power10-paddi): Likewise.
	(power10-rot): Likewise.
	(power10-rot-compare): Likewise.
	(power10-alu2): Likewise.
	(power10-cmp): Likewise.
	(power10-two): Likewise.
	(power10-three): Likewise.
	(power10-mul): Likewise.
	(power10-mul-compare): Likewise.
	(power10-div): Likewise.
	(power10-div-compare): Likewise.
	(power10-crlogical): Likewise.
	(power10-mfcrf): Likewise.
	(power10-mfcr): Likewise.
	(power10-mtcr): Likewise.
	(power10-mtjmpr): Likewise.
	(power10-mfjmpr): Likewise.
	(power10-fpsimple): Likewise.
	(power10-fp): Likewise.
	(power10-fpcompare): Likewise.
	(power10-sdiv): Likewise.
	(power10-ddiv): Likewise.
	(power10-sqrt): Likewise.
	(power10-dsqrt): Likewise.
	(power10-vec-2cyc): Likewise.
	(power10-fused-vec): Likewise.
	(power10-veccmp): Likewise.
	(power10-vecsimple): Likewise.
	(power10-vecnormal): Likewise.
	(power10-qp): Likewise.
	(power10-vecperm): Likewise.
	(power10-vecperm-compare): Likewise.
	(power10-prefixed-vecperm): Likewise.
	(power10-veccomplex): Likewise.
	(power10-vecfdiv): Likewise.
	(power10-vecdiv): Likewise.
	(power10-qpdiv): Likewise.
	(power10-qpmul): Likewise.
	(power10-mtvsr): Likewise.
	(power10-mfvsr): Likewise.
	(power10-branch): Likewise.
	(power10-fused-branch): Likewise.
	(power10-crypto): Likewise.
	(power10-htm): Likewise.
	(power10-dfp): Likewise.
	(power10-dfpq): Likewise.
	(power10-mma): Likewise.
	(power10-prefixed-mma): Likewise.
	* config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
	_ARCH_PWR_FUTURE if -mcpu=future.
	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): New macro.
	(POWERPC_MASKS): Add -mcpu=future.
	* config/rs6000/rs6000-opts.h (enum processor_type): Add
	PROCESSOR_FUTURE.
	* config/rs6000/rs6000-tables.opt: Regenerate.
	* config/rs6000/rs6000.cc (future_cost): Add -mcpu=future support.
	Make -mtune=future act like -mtune=power10 for now.
	(rs6000_option_override_internal): Likewise.
	(rs6000_machine_from_flags): Likewise.
	(rs6000_reassociation_width): Likewise.
	(rs6000_adjust_cost): Likewise.
	(rs6000_issue_rate): Likewise.
	(rs6000_sched_reorder): Likewise.
	(rs6000_sched_reorder2): Likewise.
	(rs6000_register_move_cost): Likewise.
	(rs6000_opt_masks): Add -mfuture.
	* config/rs6000/rs6000.h (ASM_CPU_SUPPORT): Likewise.
	* config/rs6000/rs6000.md (cpu attribute): Add -mcpu=future support.
	* config/rs6000/rs6000.opt (-mfuture): New undocumented debug switch.
	* doc/invoke.texi (IBM RS/6000 and PowerPC Options): Document -mcpu=future.
---
 gcc/config/rs6000/power10.md        | 142 ++++++++++++++--------------
 gcc/config/rs6000/rs6000-c.cc       |   2 +
 gcc/config/rs6000/rs6000-cpus.def   |   6 ++
 gcc/config/rs6000/rs6000-opts.h     |   4 +-
 gcc/config/rs6000/rs6000-tables.opt |   3 +
 gcc/config/rs6000/rs6000.cc         |  51 ++++++++--
 gcc/config/rs6000/rs6000.h          |   1 +
 gcc/config/rs6000/rs6000.md         |   2 +-
 gcc/config/rs6000/rs6000.opt        |   4 +
 gcc/doc/invoke.texi                 |   2 +-
 10 files changed, 137 insertions(+), 80 deletions(-)

diff --git a/gcc/config/rs6000/power10.md b/gcc/config/rs6000/power10.md
index 8e1d4e1afc6..caed2d53668 100644
--- a/gcc/config/rs6000/power10.md
+++ b/gcc/config/rs6000/power10.md
@@ -97,12 +97,12 @@ (define_insn_reservation "power10-load" 4
        (eq_attr "update" "no")
        (eq_attr "size" "!128")
        (eq_attr "prefixed" "no")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,LU_power10")
 
 (define_insn_reservation "power10-fused-load" 4
   (and (eq_attr "type" "fused_load_cmpi,fused_addis_load,fused_load_load")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10")
 
 (define_insn_reservation "power10-prefixed-load" 4
@@ -110,13 +110,13 @@ (define_insn_reservation "power10-prefixed-load" 4
        (eq_attr "update" "no")
        (eq_attr "size" "!128")
        (eq_attr "prefixed" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10")
 
 (define_insn_reservation "power10-load-update" 4
   (and (eq_attr "type" "load")
        (eq_attr "update" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10+SXU_power10")
 
 (define_insn_reservation "power10-fpload-double" 4
@@ -124,7 +124,7 @@ (define_insn_reservation "power10-fpload-double" 4
        (eq_attr "update" "no")
        (eq_attr "size" "64")
        (eq_attr "prefixed" "no")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,LU_power10")
 
 (define_insn_reservation "power10-prefixed-fpload-double" 4
@@ -132,14 +132,14 @@ (define_insn_reservation "power10-prefixed-fpload-double" 4
        (eq_attr "update" "no")
        (eq_attr "size" "64")
        (eq_attr "prefixed" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10")
 
 (define_insn_reservation "power10-fpload-update-double" 4
   (and (eq_attr "type" "fpload")
        (eq_attr "update" "yes")
        (eq_attr "size" "64")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10+SXU_power10")
 
 ; SFmode loads are cracked and have additional 3 cycles over DFmode
@@ -148,27 +148,27 @@ (define_insn_reservation "power10-fpload-single" 7
   (and (eq_attr "type" "fpload")
        (eq_attr "update" "no")
        (eq_attr "size" "32")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10")
 
 (define_insn_reservation "power10-fpload-update-single" 7
   (and (eq_attr "type" "fpload")
        (eq_attr "update" "yes")
        (eq_attr "size" "32")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10+SXU_power10")
 
 (define_insn_reservation "power10-vecload" 4
   (and (eq_attr "type" "vecload")
        (eq_attr "size" "!256")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,LU_power10")
 
 ; lxvp
 (define_insn_reservation "power10-vecload-pair" 4
   (and (eq_attr "type" "vecload")
        (eq_attr "size" "256")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10+SXU_power10")
 
 ; Store Unit
@@ -178,12 +178,12 @@ (define_insn_reservation "power10-store" 0
        (eq_attr "prefixed" "no")
        (eq_attr "size" "!128")
        (eq_attr "size" "!256")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,STU_power10")
 
 (define_insn_reservation "power10-fused-store" 0
   (and (eq_attr "type" "fused_store_store")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,STU_power10")
 
 (define_insn_reservation "power10-prefixed-store" 0
@@ -191,52 +191,52 @@ (define_insn_reservation "power10-prefixed-store" 0
        (eq_attr "prefixed" "yes")
        (eq_attr "size" "!128")
        (eq_attr "size" "!256")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,STU_power10")
 
 ; Update forms have 2 cycle latency for updated addr reg
 (define_insn_reservation "power10-store-update" 2
   (and (eq_attr "type" "store,fpstore")
        (eq_attr "update" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,STU_power10")
 
 ; stxvp
 (define_insn_reservation "power10-vecstore-pair" 0
   (and (eq_attr "type" "vecstore")
        (eq_attr "size" "256")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,stu0_power10+stu1_power10")
 
 (define_insn_reservation "power10-larx" 4
   (and (eq_attr "type" "load_l")
        (eq_attr "size" "!128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,LU_power10")
 
 ; All load quad forms
 (define_insn_reservation "power10-lq" 4
   (and (eq_attr "type" "load,load_l")
        (eq_attr "size" "128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,LU_power10+SXU_power10")
 
 (define_insn_reservation "power10-stcx" 0
   (and (eq_attr "type" "store_c")
        (eq_attr "size" "!128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,STU_power10")
 
 ; All store quad forms
 (define_insn_reservation "power10-stq" 0
   (and (eq_attr "type" "store,store_c")
        (eq_attr "size" "128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,stu0_power10+stu1_power10")
 
 (define_insn_reservation "power10-sync" 1
   (and (eq_attr "type" "sync,isync")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,STU_power10")
 
 
@@ -248,7 +248,7 @@ (define_insn_reservation "power10-sync" 1
 (define_insn_reservation "power10-alu" 2
   (and (eq_attr "type" "add,exts,integer,logical,isel")
        (eq_attr "prefixed" "no")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 ; 4 cycle CR latency
 (define_bypass 4 "power10-alu"
@@ -256,28 +256,28 @@ (define_bypass 4 "power10-alu"
 
 (define_insn_reservation "power10-fused_alu" 2
   (and (eq_attr "type" "fused_arith_logical,fused_cmp_isel,fused_carry")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 
 ; paddi
 (define_insn_reservation "power10-paddi" 2
   (and (eq_attr "type" "add")
        (eq_attr "prefixed" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 
 ; Rotate/shift (non-record form)
 (define_insn_reservation "power10-rot" 2
   (and (eq_attr "type" "insert,shift")
        (eq_attr "dot" "no")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 ; Record form rotate/shift
 (define_insn_reservation "power10-rot-compare" 3
   (and (eq_attr "type" "insert,shift")
        (eq_attr "dot" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 ; 5 cycle CR latency
 (define_bypass 5 "power10-rot-compare"
@@ -285,7 +285,7 @@ (define_bypass 5 "power10-rot-compare"
 
 (define_insn_reservation "power10-alu2" 3
   (and (eq_attr "type" "cntlz,popcnt,trap")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 ; 5 cycle CR latency
 (define_bypass 5 "power10-alu2"
@@ -293,24 +293,24 @@ (define_bypass 5 "power10-alu2"
 
 (define_insn_reservation "power10-cmp" 2
   (and (eq_attr "type" "cmp")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 ; Treat 'two' and 'three' types as 2 or 3 way cracked
 (define_insn_reservation "power10-two" 4
   (and (eq_attr "type" "two")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 
 (define_insn_reservation "power10-three" 6
   (and (eq_attr "type" "three")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_all_power10,EXU_power10")
 
 (define_insn_reservation "power10-mul" 5
   (and (eq_attr "type" "mul")
        (eq_attr "dot" "no")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 ; 4 cycle MUL->MUL latency
 (define_bypass 4 "power10-mul"
@@ -319,7 +319,7 @@ (define_bypass 4 "power10-mul"
 (define_insn_reservation "power10-mul-compare" 5
   (and (eq_attr "type" "mul")
        (eq_attr "dot" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 ; 4 cycle MUL->MUL latency
 (define_bypass 4 "power10-mul-compare"
@@ -331,13 +331,13 @@ (define_bypass 7 "power10-mul-compare"
 (define_insn_reservation "power10-div" 12
   (and (eq_attr "type" "div")
        (eq_attr "dot" "no")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-div-compare" 12
   (and (eq_attr "type" "div")
        (eq_attr "dot" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 ; 14 cycle CR latency
 (define_bypass 14 "power10-div-compare"
@@ -345,34 +345,34 @@ (define_bypass 14 "power10-div-compare"
 
 (define_insn_reservation "power10-crlogical" 2
   (and (eq_attr "type" "cr_logical")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-mfcrf" 2
   (and (eq_attr "type" "mfcrf")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-mfcr" 3
   (and (eq_attr "type" "mfcr")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 
 ; Should differentiate between 1 cr field and > 1 since target of > 1 cr
 ; is cracked
 (define_insn_reservation "power10-mtcr" 3
   (and (eq_attr "type" "mtcr")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-mtjmpr" 3
   (and (eq_attr "type" "mtjmpr")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-mfjmpr" 2
   (and (eq_attr "type" "mfjmpr")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 
@@ -380,126 +380,126 @@ (define_insn_reservation "power10-mfjmpr" 2
 
 (define_insn_reservation "power10-fpsimple" 3
   (and (eq_attr "type" "fpsimple")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-fp" 5
   (and (eq_attr "type" "fp,dmul")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-fpcompare" 3
   (and (eq_attr "type" "fpcompare")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-sdiv" 22
   (and (eq_attr "type" "sdiv")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-ddiv" 27
   (and (eq_attr "type" "ddiv")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-sqrt" 26
   (and (eq_attr "type" "ssqrt")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-dsqrt" 36
   (and (eq_attr "type" "dsqrt")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-vec-2cyc" 2
   (and (eq_attr "type" "vecmove,veclogical,vecexts,veccmpfx")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-fused-vec" 2
   (and (eq_attr "type" "fused_vector")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 
 (define_insn_reservation "power10-veccmp" 3
   (and (eq_attr "type" "veccmp")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-vecsimple" 2
   (and (eq_attr "type" "vecsimple")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-vecnormal" 5
   (and (eq_attr "type" "vecfloat,vecdouble")
        (eq_attr "size" "!128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-qp" 12
   (and (eq_attr "type" "vecfloat,vecdouble")
        (eq_attr "size" "128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-vecperm" 3
   (and (eq_attr "type" "vecperm")
        (eq_attr "prefixed" "no")
        (eq_attr "dot" "no")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-vecperm-compare" 3
   (and (eq_attr "type" "vecperm")
        (eq_attr "dot" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 
 (define_insn_reservation "power10-prefixed-vecperm" 3
   (and (eq_attr "type" "vecperm")
        (eq_attr "prefixed" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 
 (define_insn_reservation "power10-veccomplex" 6
   (and (eq_attr "type" "veccomplex")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-vecfdiv" 24
   (and (eq_attr "type" "vecfdiv")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-vecdiv" 27
   (and (eq_attr "type" "vecdiv")
        (eq_attr "size" "!128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-qpdiv" 56
   (and (eq_attr "type" "vecdiv")
        (eq_attr "size" "128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-qpmul" 24
   (and (eq_attr "type" "qmul")
        (eq_attr "size" "128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-mtvsr" 2
   (and (eq_attr "type" "mtvsr")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-mfvsr" 2
   (and (eq_attr "type" "mfvsr")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 
@@ -507,26 +507,26 @@ (define_insn_reservation "power10-mfvsr" 2
 ; Branch is 2 cycles, grouped with STU for issue
 (define_insn_reservation "power10-branch" 2
   (and (eq_attr "type" "jmpreg,branch")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,STU_power10")
 
 (define_insn_reservation "power10-fused-branch" 3
   (and (eq_attr "type" "fused_mtbc")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,STU_power10")
 
 
 ; Crypto
 (define_insn_reservation "power10-crypto" 4
   (and (eq_attr "type" "crypto")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 
 ; HTM
 (define_insn_reservation "power10-htm" 2
   (and (eq_attr "type" "htmsimple,htm")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 
@@ -535,26 +535,26 @@ (define_insn_reservation "power10-htm" 2
 (define_insn_reservation "power10-dfp" 12
   (and (eq_attr "type" "dfp")
        (eq_attr "size" "!128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_power10")
 
 (define_insn_reservation "power10-dfpq" 12
   (and (eq_attr "type" "dfp")
        (eq_attr "size" "128")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_power10")
 
 ; MMA
 (define_insn_reservation "power10-mma" 9
   (and (eq_attr "type" "mma")
        (eq_attr "prefixed" "no")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_any_power10,EXU_super_power10")
 
 (define_insn_reservation "power10-prefixed-mma" 9
   (and (eq_attr "type" "mma")
        (eq_attr "prefixed" "yes")
-       (eq_attr "cpu" "power10"))
+       (eq_attr "cpu" "power10,future"))
   "DU_even_power10,EXU_super_power10")
 ; 4 cycle MMA->MMA latency
 (define_bypass 4 "power10-mma,power10-prefixed-mma"
diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
index 8555174d36e..2803014f2b6 100644
--- a/gcc/config/rs6000/rs6000-c.cc
+++ b/gcc/config/rs6000/rs6000-c.cc
@@ -447,6 +447,8 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT flags)
     rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR9");
   if ((flags & OPTION_MASK_POWER10) != 0)
     rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR10");
+  if ((flags & OPTION_MASK_FUTURE) != 0)
+    rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR_FUTURE");
   if ((flags & OPTION_MASK_SOFT_FLOAT) != 0)
     rs6000_define_or_undefine_macro (define_p, "_SOFT_FLOAT");
   if ((flags & OPTION_MASK_RECIP_PRECISION) != 0)
diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
index 4f350da378c..deb4ea1c980 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -86,6 +86,10 @@
 				 | OPTION_MASK_POWER10			\
 				 | OTHER_POWER10_MASKS)
 
+/* Flags for a potential future processor that may or may not be delivered.  */
+#define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
+				 | OPTION_MASK_FUTURE)
+
 /* Flags that need to be turned off if -mno-power9-vector.  */
 #define OTHER_P9_VECTOR_MASKS	(OPTION_MASK_FLOAT128_HW		\
 				 | OPTION_MASK_P9_MINMAX)
@@ -132,6 +136,7 @@
 				 | OPTION_MASK_FPRND			\
 				 | OPTION_MASK_POWER10			\
 				 | OPTION_MASK_P10_FUSION		\
+				 | OPTION_MASK_FUTURE			\
 				 | OPTION_MASK_HTM			\
 				 | OPTION_MASK_ISEL			\
 				 | OPTION_MASK_MFCRF			\
@@ -263,3 +268,4 @@ RS6000_CPU ("powerpc64", PROCESSOR_POWERPC64, OPTION_MASK_PPC_GFXOPT
 RS6000_CPU ("powerpc64le", PROCESSOR_POWER8, MASK_POWERPC64
 	    | ISA_2_7_MASKS_SERVER | OPTION_MASK_HTM)
 RS6000_CPU ("rs64", PROCESSOR_RS64A, OPTION_MASK_PPC_GFXOPT | MASK_POWERPC64)
+RS6000_CPU ("future", PROCESSOR_FUTURE, MASK_POWERPC64 | ISA_FUTURE_MASKS)
diff --git a/gcc/config/rs6000/rs6000-opts.h b/gcc/config/rs6000/rs6000-opts.h
index 8040cfdc06e..f56f01d6fa5 100644
--- a/gcc/config/rs6000/rs6000-opts.h
+++ b/gcc/config/rs6000/rs6000-opts.h
@@ -67,7 +67,9 @@ enum processor_type
    PROCESSOR_MPCCORE,
    PROCESSOR_CELL,
    PROCESSOR_PPCA2,
-   PROCESSOR_TITAN
+   PROCESSOR_TITAN,
+
+   PROCESSOR_FUTURE
 };
 
 
diff --git a/gcc/config/rs6000/rs6000-tables.opt b/gcc/config/rs6000/rs6000-tables.opt
index b82f8205fa1..3ff28e39f6c 100644
--- a/gcc/config/rs6000/rs6000-tables.opt
+++ b/gcc/config/rs6000/rs6000-tables.opt
@@ -197,3 +197,6 @@ Enum(rs6000_cpu_opt_value) String(powerpc64le) Value(55)
 EnumValue
 Enum(rs6000_cpu_opt_value) String(rs64) Value(56)
 
+EnumValue
+Enum(rs6000_cpu_opt_value) String(future) Value(57)
+
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 7e76c37fdab..ed26e2755cc 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1085,6 +1085,27 @@ struct processor_costs power10_cost = {
   COSTS_N_INSNS (2),	/* SF->DF convert */
 };
 
+/* Instruction costs on Future processors.  At the moment, this is a copy of
+   the power10 costs, but it is expected to change over time.  */
+static const
+struct processor_costs future_cost = {
+  COSTS_N_INSNS (2),	/* mulsi */
+  COSTS_N_INSNS (2),	/* mulsi_const */
+  COSTS_N_INSNS (2),	/* mulsi_const9 */
+  COSTS_N_INSNS (2),	/* muldi */
+  COSTS_N_INSNS (6),	/* divsi */
+  COSTS_N_INSNS (6),	/* divdi */
+  COSTS_N_INSNS (2),	/* fp */
+  COSTS_N_INSNS (2),	/* dmul */
+  COSTS_N_INSNS (11),	/* sdiv */
+  COSTS_N_INSNS (13),	/* ddiv */
+  128,			/* cache line size */
+  32,			/* l1 cache */
+  512,			/* l2 cache */
+  16,			/* prefetch streams */
+  COSTS_N_INSNS (2),	/* SF->DF convert */
+};
+
 /* Instruction costs on POWER A2 processors.  */
 static const
 struct processor_costs ppca2_cost = {
@@ -4430,6 +4451,7 @@ rs6000_option_override_internal (bool global_init_p)
 			&& rs6000_tune != PROCESSOR_POWER8
 			&& rs6000_tune != PROCESSOR_POWER9
 			&& rs6000_tune != PROCESSOR_POWER10
+			&& rs6000_tune != PROCESSOR_FUTURE
 			&& rs6000_tune != PROCESSOR_PPCA2
 			&& rs6000_tune != PROCESSOR_CELL
 			&& rs6000_tune != PROCESSOR_PPC476);
@@ -4444,6 +4466,7 @@ rs6000_option_override_internal (bool global_init_p)
 				 || rs6000_tune == PROCESSOR_POWER8
 				 || rs6000_tune == PROCESSOR_POWER9
 				 || rs6000_tune == PROCESSOR_POWER10
+				 || rs6000_tune == PROCESSOR_FUTURE
 				 || rs6000_tune == PROCESSOR_PPCE500MC
 				 || rs6000_tune == PROCESSOR_PPCE500MC64
 				 || rs6000_tune == PROCESSOR_PPCE5500
@@ -4746,6 +4769,10 @@ rs6000_option_override_internal (bool global_init_p)
 	rs6000_cost = &power10_cost;
 	break;
 
+      case PROCESSOR_FUTURE:
+	rs6000_cost = &future_cost;
+	break;
+
       case PROCESSOR_PPCA2:
 	rs6000_cost = &ppca2_cost;
 	break;
@@ -5902,6 +5929,8 @@ rs6000_machine_from_flags (void)
   /* Disable the flags that should never influence the .machine selection.  */
   flags &= ~(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_PPC_GPOPT | OPTION_MASK_ISEL);
 
+  if ((flags & (ISA_FUTURE_MASKS & ~ISA_3_1_MASKS_SERVER)) != 0)
+    return "future";
   if ((flags & (ISA_3_1_MASKS_SERVER & ~ISA_3_0_MASKS_SERVER)) != 0)
     return "power10";
   if ((flags & (ISA_3_0_MASKS_SERVER & ~ISA_2_7_MASKS_SERVER)) != 0)
@@ -10113,6 +10142,7 @@ rs6000_reassociation_width (unsigned int opc ATTRIBUTE_UNUSED,
     case PROCESSOR_POWER8:
     case PROCESSOR_POWER9:
     case PROCESSOR_POWER10:
+    case PROCESSOR_FUTURE:
       if (DECIMAL_FLOAT_MODE_P (mode))
 	return 1;
       if (VECTOR_MODE_P (mode))
@@ -17912,7 +17942,8 @@ rs6000_adjust_cost (rtx_insn *insn, int dep_type, rtx_insn *dep_insn, int cost,
 
 	/* Separate a load from a narrower, dependent store.  */
 	if ((rs6000_sched_groups || rs6000_tune == PROCESSOR_POWER9
-	     || rs6000_tune == PROCESSOR_POWER10)
+	     || rs6000_tune == PROCESSOR_POWER10
+	     || rs6000_tune == PROCESSOR_FUTURE)
 	    && GET_CODE (PATTERN (insn)) == SET
 	    && GET_CODE (PATTERN (dep_insn)) == SET
 	    && MEM_P (XEXP (PATTERN (insn), 1))
@@ -17951,6 +17982,7 @@ rs6000_adjust_cost (rtx_insn *insn, int dep_type, rtx_insn *dep_insn, int cost,
 		 || rs6000_tune == PROCESSOR_POWER8
 		 || rs6000_tune == PROCESSOR_POWER9
 		 || rs6000_tune == PROCESSOR_POWER10
+		 || rs6000_tune == PROCESSOR_FUTURE
                  || rs6000_tune == PROCESSOR_CELL)
                 && recog_memoized (dep_insn)
                 && (INSN_CODE (dep_insn) >= 0))
@@ -18525,6 +18557,7 @@ rs6000_issue_rate (void)
   case PROCESSOR_POWER9:
     return 6;
   case PROCESSOR_POWER10:
+  case PROCESSOR_FUTURE:
     return 8;
   default:
     return 1;
@@ -19240,8 +19273,10 @@ rs6000_sched_reorder (FILE *dump ATTRIBUTE_UNUSED, int sched_verbose,
   if (rs6000_tune == PROCESSOR_POWER6)
     load_store_pendulum = 0;
 
-  /* Do Power10 dependent reordering.  */
-  if (rs6000_tune == PROCESSOR_POWER10 && last_scheduled_insn)
+  /* Do Power10 dependent reordering.  For now, assume "future" has the same
+     dependent reordering as power10.  */
+  if ((rs6000_tune == PROCESSOR_POWER10
+       || rs6000_tune == PROCESSOR_FUTURE) && last_scheduled_insn)
     power10_sched_reorder (ready, n_ready - 1);
 
   return rs6000_issue_rate ();
@@ -19265,8 +19300,10 @@ rs6000_sched_reorder2 (FILE *dump, int sched_verbose, rtx_insn **ready,
       && recog_memoized (last_scheduled_insn) >= 0)
     return power9_sched_reorder2 (ready, *pn_ready - 1);
 
-  /* Do Power10 dependent reordering.  */
-  if (rs6000_tune == PROCESSOR_POWER10 && last_scheduled_insn)
+  /* Do Power10 dependent reordering.  For now, assume "future" has the same
+     dependent reordering as power10.  */
+  if ((rs6000_tune == PROCESSOR_POWER10
+       || rs6000_tune == PROCESSOR_FUTURE) && last_scheduled_insn)
     return power10_sched_reorder (ready, *pn_ready - 1);
 
   return cached_can_issue_more;
@@ -22481,7 +22518,8 @@ rs6000_register_move_cost (machine_mode mode,
 		 allocation a move within the same class might turn
 		 out to be a nop.  */
 	      if (rs6000_tune == PROCESSOR_POWER9
-		  || rs6000_tune == PROCESSOR_POWER10)
+		  || rs6000_tune == PROCESSOR_POWER10
+		  || rs6000_tune == PROCESSOR_FUTURE)
 		ret = 3 * hard_regno_nregs (FIRST_GPR_REGNO, mode);
 	      else
 		ret = 4 * hard_regno_nregs (FIRST_GPR_REGNO, mode);
@@ -24139,6 +24177,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
   { "float128-hardware",	OPTION_MASK_FLOAT128_HW,	false, true  },
   { "fprnd",			OPTION_MASK_FPRND,		false, true  },
   { "power10",			OPTION_MASK_POWER10,		false, true  },
+  { "future",			OPTION_MASK_FUTURE,		false, true  },
   { "hard-dfp",			OPTION_MASK_DFP,		false, true  },
   { "htm",			OPTION_MASK_HTM,		false, true  },
   { "isel",			OPTION_MASK_ISEL,		false, true  },
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 3503614efbd..44fa355a061 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -163,6 +163,7 @@
   mcpu=e5500: -me5500; \
   mcpu=e6500: -me6500; \
   mcpu=titan: -mtitan; \
+  mcpu=future: -mfuture; \
   !mcpu*: %{mpower9-vector: -mpower9; \
 	    mpower8-vector|mcrypto|mdirect-move|mhtm: -mpower8; \
 	    mvsx: -mpower7; \
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 4a7812fa592..5f933bede93 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -350,7 +350,7 @@ (define_attr "cpu"
    ppc403,ppc405,ppc440,ppc476,
    ppc8540,ppc8548,ppce300c2,ppce300c3,ppce500mc,ppce500mc64,ppce5500,ppce6500,
    power4,power5,power6,power7,power8,power9,power10,
-   rs64a,mpccore,cell,ppca2,titan"
+   rs64a,mpccore,cell,ppca2,titan,future"
   (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
 
 ;; The ISA we implement.
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index bde6d3ff664..04532a774b9 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -620,6 +620,10 @@ mieee128-constant
 Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
 Generate (do not generate) code that uses the LXVKQ instruction.
 
+mfuture
+Target Undocumented Mask(FUTURE) Var(rs6000_isa_flags)
+Generate (do not generate) future instructions.
+
 ; Documented parameters
 
 -param=rs6000-vect-unroll-limit=
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 1eda0e0396b..696f941c1c5 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -29209,7 +29209,7 @@ Supported values for @var{cpu_type} are @samp{401}, @samp{403},
 @samp{titan}, @samp{power3}, @samp{power4}, @samp{power5}, @samp{power5+},
 @samp{power6}, @samp{power6x}, @samp{power7}, @samp{power8},
 @samp{power9}, @samp{power10}, @samp{powerpc}, @samp{powerpc64},
-@samp{powerpc64le}, @samp{rs64}, and @samp{native}.
+@samp{powerpc64le}, @samp{rs64}, @samp{future}, and @samp{native}.
 
 @option{-mcpu=powerpc}, @option{-mcpu=powerpc64}, and
 @option{-mcpu=powerpc64le} specify pure 32-bit PowerPC (either
-- 
2.39.1


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* [PATCH 1/8] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
  2023-02-03 21:21 ` [PATCH 1/8] PowerPC: Add -mcpu=future Michael Meissner
@ 2023-02-03 21:23 ` Michael Meissner
  2023-02-03 21:25 ` [PATCH 2/8] PowerPC: Add support for accumulators in DMR registers Michael Meissner
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:23 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

This patch enables generating load and store vector pair instructions when
doing certain memory copy operations when -mcpu=future is used.  In doing tests
on power10, it was determined that using these instructions was problematic
in a few cases, so we disabled generating them by default.  This patch
re-enables generating these instructions if -mcpu=future is used.
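
As a rough illustration (a sketch of mine, not part of the patch), the kind of
code this option is aimed at is a fixed-size block copy like the one below.
The struct name and its 64-byte size are hypothetical.  Whether the compiler
actually emits load/store vector pair instructions for it depends on the target
options and on the block move expansion heuristics, so this only shows the
shape of the source code involved.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 64-byte block; the name and size are for illustration only.  */
struct block64
{
  unsigned char bytes[64];
};

/* A fixed-size copy that the compiler may expand inline.  When
   -mblock-ops-vector-pair is in effect, the expansion is allowed to use
   load/store vector pair instructions; otherwise other instructions are
   used.  The observable result is the same either way.  */
void
copy_block (struct block64 *dst, const struct block64 *src)
{
  /* memcpy requires that the two objects do not overlap.  */
  assert (dst != src);
  memcpy (dst, src, sizeof (*dst));
}
```

Comparing the assembly produced with -mcpu=power10 against the assembly
produced with -mcpu=future is one way to see which expansion was chosen.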

The patches have been tested on the following platforms.  I added the patches
for PR target/107299 that I submitted on November 2nd before doing the builds so
that GCC would build on systems using IEEE 128-bit long double.
    *	https://gcc.gnu.org/pipermail/gcc-patches/2022-November/604834.html

There were no regressions with doing bootstrap builds and running the regression
tests:

    1)	Power10 LE using --with-cpu=power10 --with-long-double-format=ieee;
    2)	Power10 LE using --with-cpu=power10 --with-long-double-format=ibm;
    3)	Power9 LE using --with-cpu=power9 --with-long-double-format=ibm; and
    4)	Power8 BE using --with-cpu=power8 (both 32-bit & 64-bit tested).

Note, I will be on vacation from Tuesday February 7th through Tuesday February
14th.

Can I check this patch into the GCC 13 master branch?

2023-02-03   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add
	-mblock-ops-vector-pair.
	(POWERPC_MASKS): Likewise.
---
 gcc/config/rs6000/rs6000-cpus.def | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
index deb4ea1c980..b9a4d9ad76e 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -88,6 +88,7 @@
 
 /* Flags for a potential future processor that may or may not be delivered.  */
 #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
+				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
 				 | OPTION_MASK_FUTURE)
 
 /* Flags that need to be turned off if -mno-power9-vector.  */
@@ -125,6 +126,7 @@
 
 /* Mask of all options to set the default isa flags based on -mcpu=<xxx>.  */
 #define POWERPC_MASKS		(OPTION_MASK_ALTIVEC			\
+				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
 				 | OPTION_MASK_CMPB			\
 				 | OPTION_MASK_CRYPTO			\
 				 | OPTION_MASK_DFP			\
-- 
2.39.1


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* [PATCH 2/8] PowerPC: Add support for accumulators in DMR registers.
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
  2023-02-03 21:21 ` [PATCH 1/8] PowerPC: Add -mcpu=future Michael Meissner
  2023-02-03 21:23 ` [PATCH 1/8] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair Michael Meissner
@ 2023-02-03 21:25 ` Michael Meissner
  2023-02-03 21:27 ` [PATCH 3/8] PowerPC: Make MMA insns support " Michael Meissner
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:25 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

The MMA subsystem added the notion of accumulator registers as an optional
feature of ISA 3.1.  In ISA 3.1, these accumulators overlapped with the VSX
vector registers 0..31, but logically the accumulator registers were separate
from the FPR registers.  In ISA 3.1, it was anticipated that in future systems,
the accumulator registers may not overlap with the FPR registers.  This patch
adds the support for dense math registers as separate registers.

These changes are preliminary.  They are expected to change over time.

This particular patch does not change the MMA support to use the accumulators
within the dense math registers.  This patch just adds the basic support for
having separate DMRs.  The next patch will switch the MMA support to use the
accumulators if -mcpu=future is used.

For testing purposes, I added an undocumented option '-mdense-math' to enable
or disable the dense math support.

This patch adds a new constraint (wD).  If MMA is selected but dense math is
not selected (i.e. -mcpu=power10), the wD constraint will allow access to
accumulators that overlap with the VSX vector registers 0..31.  If both MMA and
dense math are selected (i.e. -mcpu=future), the wD constraint will only allow
dense math registers.

This patch modifies the existing %A output modifier.  If MMA is selected but
dense math is not selected, then the %A output modifier converts the VSX register
number to the accumulator number, by dividing it by 4.  If both MMA and dense
math are selected, then %A will map the separate DMR registers into 0..7.
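
The arithmetic behind the two %A cases can be modelled as below.  This is a
hypothetical stand-alone sketch, not the code in print_operand: the function
names are made up, and first_dmr_regno merely stands in for FIRST_DMR_REGNO.
Treating the dense math case as an offset from the first DMR is an assumption
based on the description above.

```c
#include <assert.h>

/* Without dense math: an accumulator overlaps four adjacent VSX registers
   in the range 0..31, so VSX register 4*n names accumulator n.  */
int
acc_number_from_vsx (int vsx_regno)
{
  assert (vsx_regno >= 0 && vsx_regno <= 31 && vsx_regno % 4 == 0);
  return vsx_regno / 4;
}

/* With dense math: the DMRs are separate registers, so the accumulator
   number is modelled here as the offset from the first DMR (assumption).
   The result must fall in 0..7.  */
int
acc_number_from_dmr (int dmr_regno, int first_dmr_regno)
{
  int n = dmr_regno - first_dmr_regno;
  assert (n >= 0 && n <= 7);
  return n;
}
```

In both cases the assembler sees an accumulator number in 0..7, which is why
the same extended asm template can be used on either kind of system.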

The intention is that user code using extended asm can be modified to run on
both MMA without dense math and MMA with dense math:

    1)	If possible, don't use extended asm, but instead use the MMA built-in
	functions;

    2)	If you do need to write extended asm, change the d constraints
	targeting accumulators to use wD;

    3)	Only use the built-in zero, assemble and disassemble functions to
	create and move data between vector quad types and dense math
	accumulators.  I.e. do not use the xxmfacc, xxmtacc, and xxsetaccz
	instructions directly in the extended asm code.  The reason is these
	instructions assume there is a 1-to-1 correspondence between 4
	adjacent FPR registers and an accumulator that overlaps with those
	registers.  With accumulators now being separate registers, there is
	no longer a 1-to-1 correspondence.
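
A minimal sketch of the first and third recommendations follows.  It assumes
GCC's documented MMA built-ins (__builtin_mma_xxsetaccz,
__builtin_mma_xvf32gerpp and __builtin_mma_disassemble_acc) and the __MMA__
predefined macro.  The function name is hypothetical, and the scalar fallback
exists only so the example also builds where MMA is not enabled.  No extended
asm and no xxmfacc/xxmtacc/xxsetaccz appear in the source, so the compiler is
free to place the accumulator wherever the target requires.

```c
#if defined (__MMA__)
#include <altivec.h>
#endif

/* Return the sum of all 16 elements of the 4x4 outer product of a and b.
   Summing every element keeps the result independent of how the rows of
   the accumulator are laid out after disassembly.  */
float
outer_product_sum (const float a[4], const float b[4])
{
  float sum = 0.0f;

#if defined (__MMA__)
  /* MMA path: only built-in functions touch the accumulator.  */
  __vector_quad acc;
  __vector float va = { a[0], a[1], a[2], a[3] };
  __vector float vb = { b[0], b[1], b[2], b[3] };
  __vector float rows[4];

  __builtin_mma_xxsetaccz (&acc);		/* Zero the accumulator.  */
  __builtin_mma_xvf32gerpp (&acc, (__vector unsigned char) va,
			    (__vector unsigned char) vb);
  __builtin_mma_disassemble_acc (rows, &acc);	/* Accumulator -> vectors.  */

  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      sum += rows[i][j];
#else
  /* Portable fallback used when MMA is not enabled.  */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      sum += a[i] * b[j];
#endif

  return sum;
}
```

Since the sum of an outer product is the product of the two element sums,
calling this with a = {1, 2, 3, 4} and b = {1, 1, 1, 1} gives 10 * 4 = 40 on
either path.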

It is possible that the mangling for DMRs and the GDB register numbers may
change in the future.

The patches have been tested on the following platforms.  I added the patches
for PR target/107299 that I submitted on November 2nd before doing the builds so
that GCC would build on systems using IEEE 128-bit long double.
    *	https://gcc.gnu.org/pipermail/gcc-patches/2022-November/604834.html

There were no regressions with doing bootstrap builds and running the regression
tests:

    1)	Power10 LE using --with-cpu=power10 --with-long-double-format=ieee;
    2)	Power10 LE using --with-cpu=power10 --with-long-double-format=ibm;
    3)	Power9 LE using --with-cpu=power9 --with-long-double-format=ibm; and
    4)	Power8 BE using --with-cpu=power8 (both 32-bit & 64-bit tested).

Can I check this patch into the GCC 13 master branch?

Note, I will be on vacation from Tuesday February 7th through Tuesday February
14th.

2023-02-03   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/constraints.md (wD constraint): New constraint.
	* config/rs6000/mma.md (UNSPEC_DM_ASSEMBLE_ACC): New unspec.
	(movxo): Convert into define_expand.
	(movxo_vsx): Version of movxo where accumulators overlap with VSX vector
	registers 0..31.
	(movxo_dm): Version of movxo that supports separate dense math
	accumulators.
	(mma_assemble_acc): Add dense math support to define_expand.
	(mma_assemble_acc_vsx): Rename from mma_assemble_acc, and restrict it to
	non dense math systems.
	(mma_assemble_acc_dm): Dense math version of mma_assemble_acc.
	(mma_disassemble_acc): Add dense math support to define_expand.
	(mma_disassemble_acc_vsx): Rename from mma_disassemble_acc, and restrict
	it to non dense math systems.
	(mma_disassemble_acc_dm): Dense math version of mma_disassemble_acc.
	* config/rs6000/predicates.md (dmr_operand): New predicate.
	(accumulator_operand): Likewise.
	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add -mdense-math.
	(POWERPC_MASKS): Likewise.
	* config/rs6000/rs6000.cc (enum rs6000_reg_type): Add DMR_REG_TYPE.
	(enum rs6000_reload_reg_type): Add RELOAD_REG_DMR.
	(LAST_RELOAD_REG_CLASS): Add support for DMR registers and the wD
	constraint.
	(reload_reg_map): Likewise.
	(rs6000_reg_names): Likewise.
	(alt_reg_names): Likewise.
	(rs6000_hard_regno_nregs_internal): Likewise.
	(rs6000_hard_regno_mode_ok_uncached): Likewise.
	(rs6000_debug_reg_global): Likewise.
	(rs6000_setup_reg_addr_masks): Likewise.
	(rs6000_init_hard_regno_mode_ok): Likewise.
	(rs6000_option_override_internal): Add checking for -mdense-math.
	(rs6000_secondary_reload_memory): Add support for DMR registers.
	(rs6000_secondary_reload_simple_move): Likewise.
	(rs6000_preferred_reload_class): Likewise.
	(rs6000_secondary_reload_class): Likewise.
	(print_operand): Make %A handle both FPRs and DMRs.
	(rs6000_dmr_register_move_cost): New helper function.
	(rs6000_register_move_cost): Add support for DMR registers.
	(rs6000_memory_move_cost): Likewise.
	(rs6000_compute_pressure_classes): Likewise.
	(rs6000_debugger_regno): Likewise.
	(rs6000_opt_masks): Add -mdense-math.
	(rs6000_split_multireg_move): Add support for DMRs.
	* config/rs6000/rs6000.h (UNITS_PER_DMR_WORD): New macro.
	(FIRST_PSEUDO_REGISTER): Update for DMRs.
	(FIXED_REGISTERS): Add DMRs.
	(CALL_REALLY_USED_REGISTERS): Likewise.
	(REG_ALLOC_ORDER): Likewise.
	(enum reg_class): Add DM_REGS.
	(REG_CLASS_NAMES): Likewise.
	(REG_CLASS_CONTENTS): Likewise.
	* config/rs6000/rs6000.md (FIRST_DMR_REGNO): New constant.
	(LAST_DMR_REGNO): Likewise.
	(isa attribute): Add 'dm' and 'not_dm' attributes.
	(enabled attribute): Support 'dm' and 'not_dm' attributes.
	* config/rs6000/rs6000.opt (-mdense-math): New switch.
	* doc/md.texi (PowerPC constraints): Document wD constraint.
---
 gcc/config/rs6000/constraints.md  |   3 +
 gcc/config/rs6000/mma.md          | 115 +++++++++++++-----
 gcc/config/rs6000/predicates.md   |  32 +++++
 gcc/config/rs6000/rs6000-cpus.def |   2 +
 gcc/config/rs6000/rs6000.cc       | 193 ++++++++++++++++++++++++++----
 gcc/config/rs6000/rs6000.h        |  38 +++++-
 gcc/config/rs6000/rs6000.md       |  12 +-
 gcc/config/rs6000/rs6000.opt      |   4 +
 gcc/doc/md.texi                   |   7 ++
 9 files changed, 345 insertions(+), 61 deletions(-)

diff --git a/gcc/config/rs6000/constraints.md b/gcc/config/rs6000/constraints.md
index c4a6ccf4efb..218e41d82a8 100644
--- a/gcc/config/rs6000/constraints.md
+++ b/gcc/config/rs6000/constraints.md
@@ -107,6 +107,9 @@ (define_constraint "wB"
        (match_test "TARGET_P8_VECTOR")
        (match_operand 0 "s5bit_cint_operand")))
 
+(define_register_constraint "wD" "rs6000_constraints[RS6000_CONSTRAINT_wD]"
+  "Accumulator register.")
+
 (define_constraint "wE"
   "@internal Vector constant that can be loaded with the XXSPLTIB instruction."
   (match_test "xxspltib_constant_nosplit (op, mode)"))
diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index d36dc13872b..59ca6835f7c 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -91,6 +91,7 @@ (define_c_enum "unspec"
    UNSPEC_MMA_XVI8GER4SPP
    UNSPEC_MMA_XXMFACC
    UNSPEC_MMA_XXMTACC
+   UNSPEC_DM_ASSEMBLE_ACC
   ])
 
 (define_c_enum "unspecv"
@@ -314,7 +315,9 @@ (define_insn_and_split "*movoo"
    (set_attr "length" "*,*,8")])
 
 \f
-;; Vector quad support.  XOmode can only live in FPRs.
+;; Vector quad support.  Under the original MMA, XOmode can only live in VSX
+;; vector registers 0..31.  With dense math, XOmode can live in either VSX
+;; registers (0..63) or DMR registers.
 (define_expand "movxo"
   [(set (match_operand:XO 0 "nonimmediate_operand")
 	(match_operand:XO 1 "input_operand"))]
@@ -339,10 +342,10 @@ (define_expand "movxo"
     gcc_assert (false);
 })
 
-(define_insn_and_split "*movxo"
+(define_insn_and_split "*movxo_vsx"
   [(set (match_operand:XO 0 "nonimmediate_operand" "=d,m,d")
 	(match_operand:XO 1 "input_operand" "m,d,d"))]
-  "TARGET_MMA
+  "TARGET_MMA && !TARGET_DENSE_MATH
    && (gpc_reg_operand (operands[0], XOmode)
        || gpc_reg_operand (operands[1], XOmode))"
   "@
@@ -359,6 +362,31 @@ (define_insn_and_split "*movxo"
    (set_attr "length" "*,*,16")
    (set_attr "max_prefixed_insns" "2,2,*")])
 
+(define_insn_and_split "*movxo_dm"
+  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,m, wa,wD,wD,wa")
+	(match_operand:XO 1 "input_operand"          "m,wa,wa,wa,wD,wD"))]
+  "TARGET_DENSE_MATH
+   && (gpc_reg_operand (operands[0], XOmode)
+       || gpc_reg_operand (operands[1], XOmode))"
+  "@
+   #
+   #
+   #
+   dmxxinstdmr512 %0,%1,%Y1,0
+   dmmr %0,%1
+   dmxxextfdmr512 %0,%Y0,%1,0"
+  "&& reload_completed
+   && !dmr_operand (operands[0], XOmode)
+   && !dmr_operand (operands[1], XOmode)"
+  [(const_int 0)]
+{
+  rs6000_split_multireg_move (operands[0], operands[1]);
+  DONE;
+}
+  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
+   (set_attr "length" "*,*,16,*,*,*")
+   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
+
 (define_expand "vsx_assemble_pair"
   [(match_operand:OO 0 "vsx_register_operand")
    (match_operand:V16QI 1 "mma_assemble_input_operand")
@@ -426,25 +454,38 @@ (define_insn_and_split "*vsx_disassemble_pair"
 })
 
 (define_expand "mma_assemble_acc"
-  [(match_operand:XO 0 "fpr_reg_operand")
+  [(match_operand:XO 0 "register_operand")
    (match_operand:V16QI 1 "mma_assemble_input_operand")
    (match_operand:V16QI 2 "mma_assemble_input_operand")
    (match_operand:V16QI 3 "mma_assemble_input_operand")
    (match_operand:V16QI 4 "mma_assemble_input_operand")]
   "TARGET_MMA"
 {
-  rtx src = gen_rtx_UNSPEC_VOLATILE (XOmode,
-			    	     gen_rtvec (4, operands[1], operands[2],
-				       		operands[3], operands[4]),
-			    	     UNSPECV_MMA_ASSEMBLE);
-  emit_move_insn (operands[0], src);
+  rtx op0 = operands[0];
+  rtx op1 = operands[1];
+  rtx op2 = operands[2];
+  rtx op3 = operands[3];
+  rtx op4 = operands[4];
+
+  if (TARGET_DENSE_MATH)
+    {
+      rtx vpair1 = gen_reg_rtx (OOmode);
+      rtx vpair2 = gen_reg_rtx (OOmode);
+      emit_insn (gen_vsx_assemble_pair (vpair1, op1, op2));
+      emit_insn (gen_vsx_assemble_pair (vpair2, op3, op4));
+      emit_insn (gen_mma_assemble_acc_dm (op0, vpair1, vpair2));
+    }
+
+  else
+    emit_insn (gen_mma_assemble_acc_vsx (op0, op1, op2, op3, op4));
+
   DONE;
 })
 
 ;; We cannot update the four output registers atomically, so mark the output
-;; as an early clobber so we don't accidentally clobber the input operands.  */
+;; as an early clobber so we don't accidentally clobber the input operands.
 
-(define_insn_and_split "*mma_assemble_acc"
+(define_insn_and_split "mma_assemble_acc_vsx"
   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
 	(unspec_volatile:XO
 	  [(match_operand:V16QI 1 "mma_assemble_input_operand" "mwa")
@@ -452,7 +493,7 @@ (define_insn_and_split "*mma_assemble_acc"
 	   (match_operand:V16QI 3 "mma_assemble_input_operand" "mwa")
 	   (match_operand:V16QI 4 "mma_assemble_input_operand" "mwa")]
 	  UNSPECV_MMA_ASSEMBLE))]
-  "TARGET_MMA
+  "TARGET_MMA && !TARGET_DENSE_MATH
    && fpr_reg_operand (operands[0], XOmode)"
   "#"
   "&& reload_completed"
@@ -466,28 +507,31 @@ (define_insn_and_split "*mma_assemble_acc"
   DONE;
 })
 
+;; On a system with dense math, we build the accumulators from two vector
+;; pairs.
+
+(define_insn "mma_assemble_acc_dm"
+ [(set (match_operand:XO 0 "dmr_operand" "=wD")
+       (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
+		   (match_operand:OO 2 "vsx_register_operand" "wa")]
+		  UNSPEC_DM_ASSEMBLE_ACC))]
+ "TARGET_MMA && TARGET_DENSE_MATH"
+ "dmxxinstdmr512 %0,%1,%2,0"
+ [(set_attr "type" "mma")])
+
 (define_expand "mma_disassemble_acc"
-  [(match_operand:V16QI 0 "mma_disassemble_output_operand")
-   (match_operand:XO 1 "fpr_reg_operand")
-   (match_operand 2 "const_0_to_3_operand")]
-  "TARGET_MMA"
-{
-  rtx src;
-  int regoff = INTVAL (operands[2]);
-  src = gen_rtx_UNSPEC (V16QImode,
-			gen_rtvec (2, operands[1], GEN_INT (regoff)),
-			UNSPEC_MMA_EXTRACT);
-  emit_move_insn (operands[0], src);
-  DONE;
-})
+  [(set (match_operand:V16QI 0 "register_operand")
+	(unspec:V16QI [(match_operand:XO 1 "register_operand")
+		       (match_operand 2 "const_0_to_3_operand")]
+		      UNSPEC_MMA_EXTRACT))]
+  "TARGET_MMA")
 
-(define_insn_and_split "*mma_disassemble_acc"
+(define_insn_and_split "*mma_disassemble_acc_vsx"
   [(set (match_operand:V16QI 0 "mma_disassemble_output_operand" "=mwa")
-       (unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
-		      (match_operand 2 "const_0_to_3_operand")]
+	(unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
+		       (match_operand 2 "const_0_to_3_operand")]
 		      UNSPEC_MMA_EXTRACT))]
-  "TARGET_MMA
-   && fpr_reg_operand (operands[1], XOmode)"
+  "TARGET_MMA"
   "#"
   "&& reload_completed"
   [(const_int 0)]
@@ -499,9 +543,14 @@ (define_insn_and_split "*mma_disassemble_acc"
   DONE;
 })
 
-;; MMA instructions that do not use their accumulators as an input, still
-;; must not allow their vector operands to overlap the registers used by
-;; the accumulator.  We enforce this by marking the output as early clobber.
+(define_insn "*mma_disassemble_acc_dm"
+  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
+	(unspec:V16QI [(match_operand:XO 1 "dmr_operand" "wD")
+		       (match_operand 2 "const_0_to_3_operand")]
+		      UNSPEC_MMA_EXTRACT))]
+  "TARGET_DENSE_MATH"
+  "dmxxextfdmr256 %0,%1,2"
+  [(set_attr "type" "mma")])
 
 (define_insn "mma_<acc>"
   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
index 52c65534e51..4ac9afd2c11 100644
--- a/gcc/config/rs6000/predicates.md
+++ b/gcc/config/rs6000/predicates.md
@@ -186,6 +186,38 @@ (define_predicate "vlogical_operand"
   return VLOGICAL_REGNO_P (REGNO (op));
 })
 
+;; Return 1 if op is a DMR register
+(define_predicate "dmr_operand"
+  (match_operand 0 "register_operand")
+{
+  if (!REG_P (op))
+    return 0;
+
+  if (!HARD_REGISTER_P (op))
+    return 1;
+
+  return DMR_REGNO_P (REGNO (op));
+})
+
+;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
+;; overlap with the FPRs, while on systems with dense math, the accumulators
+;; are separate dense math registers and do not overlap with the FPR
+;; registers.
+(define_predicate "accumulator_operand"
+  (match_operand 0 "register_operand")
+{
+  if (!REG_P (op))
+    return 0;
+
+  if (!HARD_REGISTER_P (op))
+    return 1;
+
+  int r = REGNO (op);
+  return (TARGET_DENSE_MATH
+	  ? DMR_REGNO_P (r)
+	  : FP_REGNO_P (r) && (r & 3) == 0);
+})
+
 ;; Return 1 if op is the carry register.
 (define_predicate "ca_operand"
   (match_operand 0 "register_operand")
diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
index b9a4d9ad76e..a4cce08d727 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -89,6 +89,7 @@
 /* Flags for a potential future processor that may or may not be delivered.  */
 #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
 				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
+				 | OPTION_MASK_DENSE_MATH		\
 				 | OPTION_MASK_FUTURE)
 
 /* Flags that need to be turned off if -mno-power9-vector.  */
@@ -132,6 +133,7 @@
 				 | OPTION_MASK_DFP			\
 				 | OPTION_MASK_DIRECT_MOVE		\
 				 | OPTION_MASK_DLMZB			\
+				 | OPTION_MASK_DENSE_MATH		\
 				 | OPTION_MASK_EFFICIENT_UNALIGNED_VSX	\
 				 | OPTION_MASK_FLOAT128_HW		\
 				 | OPTION_MASK_FLOAT128_KEYWORD		\
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index ed26e2755cc..8ecb3021ff9 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -290,7 +290,8 @@ enum rs6000_reg_type {
   ALTIVEC_REG_TYPE,
   FPR_REG_TYPE,
   SPR_REG_TYPE,
-  CR_REG_TYPE
+  CR_REG_TYPE,
+  DMR_REG_TYPE
 };
 
 /* Map register class to register type.  */
@@ -304,22 +305,23 @@ static enum rs6000_reg_type reg_class_to_reg_type[N_REG_CLASSES];
 
 
 /* Register classes we care about in secondary reload or go if legitimate
-   address.  We only need to worry about GPR, FPR, and Altivec registers here,
-   along an ANY field that is the OR of the 3 register classes.  */
+   address.  We only need to worry about GPR, FPR, Altivec, and DMR registers
+   here, along an ANY field that is the OR of the 4 register classes.  */
 
 enum rs6000_reload_reg_type {
   RELOAD_REG_GPR,			/* General purpose registers.  */
   RELOAD_REG_FPR,			/* Traditional floating point regs.  */
   RELOAD_REG_VMX,			/* Altivec (VMX) registers.  */
-  RELOAD_REG_ANY,			/* OR of GPR, FPR, Altivec masks.  */
+  RELOAD_REG_DMR,			/* DMR registers.  */
+  RELOAD_REG_ANY,			/* OR of GPR/FPR/VMX/DMR masks.  */
   N_RELOAD_REG
 };
 
-/* For setting up register classes, loop through the 3 register classes mapping
+/* For setting up register classes, loop through the 4 register classes mapping
    into real registers, and skip the ANY class, which is just an OR of the
    bits.  */
 #define FIRST_RELOAD_REG_CLASS	RELOAD_REG_GPR
-#define LAST_RELOAD_REG_CLASS	RELOAD_REG_VMX
+#define LAST_RELOAD_REG_CLASS	RELOAD_REG_DMR
 
 /* Map reload register type to a register in the register class.  */
 struct reload_reg_map_type {
@@ -331,6 +333,7 @@ static const struct reload_reg_map_type reload_reg_map[N_RELOAD_REG] = {
   { "Gpr",	FIRST_GPR_REGNO },	/* RELOAD_REG_GPR.  */
   { "Fpr",	FIRST_FPR_REGNO },	/* RELOAD_REG_FPR.  */
   { "VMX",	FIRST_ALTIVEC_REGNO },	/* RELOAD_REG_VMX.  */
+  { "DMR",	FIRST_DMR_REGNO },	/* RELOAD_REG_DMR.  */
   { "Any",	-1 },			/* RELOAD_REG_ANY.  */
 };
 
@@ -1244,6 +1247,8 @@ char rs6000_reg_names[][8] =
       "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",
   /* vrsave vscr sfp */
       "vrsave", "vscr", "sfp",
+  /* DMRs */
+      "0", "1", "2", "3", "4", "5", "6", "7",
 };
 
 #ifdef TARGET_REGNAMES
@@ -1270,6 +1275,8 @@ static const char alt_reg_names[][8] =
   "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
   /* vrsave vscr sfp */
   "vrsave", "vscr", "sfp",
+  /* DMRs */
+  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
 };
 #endif
 
@@ -1841,6 +1848,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
   else if (ALTIVEC_REGNO_P (regno))
     reg_size = UNITS_PER_ALTIVEC_WORD;
 
+  else if (DMR_REGNO_P (regno))
+    reg_size = UNITS_PER_DMR_WORD;
+
   else
     reg_size = UNITS_PER_WORD;
 
@@ -1862,9 +1872,36 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
   if (mode == OOmode)
     return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
 
-  /* MMA accumulator modes need FPR registers divisible by 4.  */
+  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
+     by 4.
+
+     If dense math is enabled, allow all VSX registers plus the DMR registers.
+     We need to make sure we don't cross the boundary between FPRs and
+     traditional Altivec registers.  */
   if (mode == XOmode)
-    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
+    {
+      if (TARGET_MMA && !TARGET_DENSE_MATH)
+	return (FP_REGNO_P (regno) && (regno & 3) == 0);
+
+      else if (TARGET_DENSE_MATH)
+	{
+	  if (DMR_REGNO_P (regno))
+	    return 1;
+
+	  if (FP_REGNO_P (regno))
+	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
+
+	  if (ALTIVEC_REGNO_P (regno))
+	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
+	}
+
+      else
+	return 0;
+    }
+
+  /* No modes other than XOmode can go in DMRs.  */
+  if (DMR_REGNO_P (regno))
+    return 0;
 
   /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
      register combinations, and use PTImode where we need to deal with quad
@@ -2307,6 +2344,7 @@ rs6000_debug_reg_global (void)
   rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
 			  LAST_ALTIVEC_REGNO,
 			  "vs");
+  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");
   rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
   rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
   rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
@@ -2327,6 +2365,7 @@ rs6000_debug_reg_global (void)
 	   "wr reg_class = %s\n"
 	   "wx reg_class = %s\n"
 	   "wA reg_class = %s\n"
+	   "wD reg_class = %s\n"
 	   "\n",
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
@@ -2334,7 +2373,8 @@ rs6000_debug_reg_global (void)
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
-	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
+	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
+	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
 
   nl = "\n";
   for (m = 0; m < NUM_MACHINE_MODES; ++m)
@@ -2631,6 +2671,21 @@ rs6000_setup_reg_addr_masks (void)
 	  addr_mask = 0;
 	  reg = reload_reg_map[rc].reg;
 
+	  /* Special case DMR registers.  */
+	  if (rc == RELOAD_REG_DMR)
+	    {
+	      if (TARGET_DENSE_MATH && m2 == XOmode)
+		{
+		  addr_mask = RELOAD_REG_VALID;
+		  reg_addr[m].addr_mask[rc] = addr_mask;
+		  any_addr_mask |= addr_mask;
+		}
+	      else
+		reg_addr[m].addr_mask[rc] = 0;
+
+	      continue;
+	    }
+
 	  /* Can mode values go in the GPR/FPR/Altivec registers?  */
 	  if (reg >= 0 && rs6000_hard_regno_mode_ok_p[m][reg])
 	    {
@@ -2726,8 +2781,8 @@ rs6000_setup_reg_addr_masks (void)
 
 	  /* Vector pairs can do both indexed and offset loads if the
 	     instructions are enabled, otherwise they can only do offset loads
-	     since it will be broken into two vector moves.  Vector quads can
-	     only do offset loads.  */
+	     since it will be broken into two vector moves.  Vector quads and
+	     1,024 bit DMR values can only do offset loads.  */
 	  else if ((addr_mask != 0) && TARGET_MMA
 		   && (m2 == OOmode || m2 == XOmode))
 	    {
@@ -2781,6 +2836,9 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
   for (r = CR1_REGNO; r <= CR7_REGNO; ++r)
     rs6000_regno_regclass[r] = CR_REGS;
 
+  for (r = FIRST_DMR_REGNO; r <= LAST_DMR_REGNO; ++r)
+    rs6000_regno_regclass[r] = DM_REGS;
+
   rs6000_regno_regclass[LR_REGNO] = LINK_REGS;
   rs6000_regno_regclass[CTR_REGNO] = CTR_REGS;
   rs6000_regno_regclass[CA_REGNO] = NO_REGS;
@@ -2805,6 +2863,7 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
   reg_class_to_reg_type[(int)LINK_OR_CTR_REGS] = SPR_REG_TYPE;
   reg_class_to_reg_type[(int)CR_REGS] = CR_REG_TYPE;
   reg_class_to_reg_type[(int)CR0_REGS] = CR_REG_TYPE;
+  reg_class_to_reg_type[(int)DM_REGS] = DMR_REG_TYPE;
 
   if (TARGET_VSX)
     {
@@ -2991,6 +3050,13 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
   if (TARGET_DIRECT_MOVE_128)
     rs6000_constraints[RS6000_CONSTRAINT_we] = VSX_REGS;
 
+  /* Support for the accumulator registers, either FPR registers (aka original
+     mma) or DMR registers (dense math).  */
+  if (TARGET_DENSE_MATH)
+    rs6000_constraints[RS6000_CONSTRAINT_wD] = DM_REGS;
+  else if (TARGET_MMA)
+    rs6000_constraints[RS6000_CONSTRAINT_wD] = FLOAT_REGS;
+
   /* Set up the reload helper and direct move functions.  */
   if (TARGET_VSX || TARGET_ALTIVEC)
     {
@@ -4441,6 +4507,14 @@ rs6000_option_override_internal (bool global_init_p)
   if (!TARGET_PCREL && TARGET_PCREL_OPT)
     rs6000_isa_flags &= ~OPTION_MASK_PCREL_OPT;
 
+  /* Dense math requires MMA.  */
+  if (TARGET_DENSE_MATH && !TARGET_MMA)
+    {
+      if ((rs6000_isa_flags_explicit & OPTION_MASK_DENSE_MATH) != 0)
+	error ("%qs requires %qs", "-mdense-math", "-mmma");
+      rs6000_isa_flags &= ~OPTION_MASK_DENSE_MATH;
+    }
+
   if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
     rs6000_print_isa_options (stderr, 0, "after subtarget", rs6000_isa_flags);
 
@@ -12054,6 +12128,11 @@ rs6000_secondary_reload_memory (rtx addr,
     addr_mask = (reg_addr[mode].addr_mask[RELOAD_REG_VMX]
 		 & ~RELOAD_REG_AND_M16);
 
+  /* DMR registers use VSX registers, and need to generate some extra
+     instructions.  */
+  else if (rclass == DM_REGS)
+    return 2;
+
   /* If the register allocator hasn't made up its mind yet on the register
      class to use, settle on defaults to use.  */
   else if (rclass == NO_REGS)
@@ -12382,6 +12461,13 @@ rs6000_secondary_reload_simple_move (enum rs6000_reg_type to_type,
 	       || (to_type == SPR_REG_TYPE && from_type == GPR_REG_TYPE)))
     return true;
 
+  /* We can transfer between VSX registers and DMR registers without needing
+     extra registers.  */
+  if (TARGET_DENSE_MATH && mode == XOmode
+      && ((to_type == DMR_REG_TYPE && from_type == VSX_REG_TYPE)
+	  || (to_type == VSX_REG_TYPE && from_type == DMR_REG_TYPE)))
+    return true;
+
   return false;
 }
 
@@ -13076,6 +13162,10 @@ rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
   machine_mode mode = GET_MODE (x);
   bool is_constant = CONSTANT_P (x);
 
+  /* DMR registers can't be loaded or stored.  */
+  if (rclass == DM_REGS)
+    return NO_REGS;
+
   /* If a mode can't go in FPR/ALTIVEC/VSX registers, don't return a preferred
      reload class for it.  */
   if ((rclass == ALTIVEC_REGS || rclass == VSX_REGS)
@@ -13172,7 +13262,7 @@ rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
 	return VSX_REGS;
 
       if (mode == XOmode)
-	return FLOAT_REGS;
+	return TARGET_DENSE_MATH ? VSX_REGS : FLOAT_REGS;
 
       if (GET_MODE_CLASS (mode) == MODE_INT)
 	return GENERAL_REGS;
@@ -13297,6 +13387,11 @@ rs6000_secondary_reload_class (enum reg_class rclass, machine_mode mode,
   else
     regno = -1;
 
+  /* DMR registers don't have loads or stores.  We have to go through the VSX
+     registers to load XOmode (vector quad).  */
+  if (TARGET_DENSE_MATH && rclass == DM_REGS)
+    return VSX_REGS;
+
   /* If we have VSX register moves, prefer moving scalar values between
      Altivec registers and GPR by going via an FPR (and then via memory)
      instead of reloading the secondary memory address for Altivec moves.  */
@@ -13810,8 +13905,14 @@ print_operand (FILE *file, rtx x, int code)
 	 output_operand.  */
 
     case 'A':
-      /* Write the MMA accumulator number associated with VSX register X.  */
-      if (!REG_P (x) || !FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
+      /* Write the MMA accumulator number associated with VSX register X.  On
+	 dense math systems, only allow DMR accumulators, not accumulators
+	 overlapping with the FPR registers.  */
+      if (!REG_P (x))
+	output_operand_lossage ("invalid %%A value");
+      else if (TARGET_DENSE_MATH && DMR_REGNO_P (REGNO (x)))
+	fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
+      else if (!FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
 	output_operand_lossage ("invalid %%A value");
       else
 	fprintf (file, "%d", (REGNO (x) - FIRST_FPR_REGNO) / 4);
@@ -22470,6 +22571,31 @@ rs6000_debug_address_cost (rtx x, machine_mode mode,
 }
 
 
+/* Subroutine to determine the move cost of dense math registers.  If we are
+   moving to/from VSX registers, the cost is either 1 move (for 512-bit
+   accumulators) or 2 moves (for 1,024-bit DMR registers).  If we are moving
+   to anything else like GPR registers, make the cost very high.  */
+
+static int
+rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
+{
+  const int reg_move_base = 2;
+  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
+			  & reg_class_contents[VSX_REGS]);
+
+  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))
+    {
+      /* __vector_quad (i.e. XOmode) is transferred in 1 instruction.  */
+      if (mode == XOmode)
+	return reg_move_base;
+
+      else
+	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+    }
+
+  return 1000 * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+}
+
 /* A C expression returning the cost of moving data from a register of class
    CLASS1 to one of CLASS2.  */
 
@@ -22483,17 +22609,28 @@ rs6000_register_move_cost (machine_mode mode,
   if (TARGET_DEBUG_COST)
     dbg_cost_ctrl++;
 
+  HARD_REG_SET to_vsx, from_vsx;
+  to_vsx = reg_class_contents[to] & reg_class_contents[VSX_REGS];
+  from_vsx = reg_class_contents[from] & reg_class_contents[VSX_REGS];
+
+  /* Special case DMR registers, that can only move to/from VSX registers.  */
+  if (from == DM_REGS && to == DM_REGS)
+    ret = 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+
+  else if (from == DM_REGS)
+    ret = rs6000_dmr_register_move_cost (mode, to);
+
+  else if (to == DM_REGS)
+    ret = rs6000_dmr_register_move_cost (mode, from);
+
   /* If we have VSX, we can easily move between FPR or Altivec registers,
      otherwise we can only easily move within classes.
      Do this first so we give best-case answers for union classes
      containing both gprs and vsx regs.  */
-  HARD_REG_SET to_vsx, from_vsx;
-  to_vsx = reg_class_contents[to] & reg_class_contents[VSX_REGS];
-  from_vsx = reg_class_contents[from] & reg_class_contents[VSX_REGS];
-  if (!hard_reg_set_empty_p (to_vsx)
-      && !hard_reg_set_empty_p (from_vsx)
-      && (TARGET_VSX
-	  || hard_reg_set_intersect_p (to_vsx, from_vsx)))
+  else if (!hard_reg_set_empty_p (to_vsx)
+	   && !hard_reg_set_empty_p (from_vsx)
+	   && (TARGET_VSX
+	       || hard_reg_set_intersect_p (to_vsx, from_vsx)))
     {
       int reg = FIRST_FPR_REGNO;
       if (TARGET_VSX
@@ -22589,6 +22726,9 @@ rs6000_memory_move_cost (machine_mode mode, reg_class_t rclass,
     ret = 4 * hard_regno_nregs (32, mode);
   else if (reg_classes_intersect_p (rclass, ALTIVEC_REGS))
     ret = 4 * hard_regno_nregs (FIRST_ALTIVEC_REGNO, mode);
+  else if (reg_classes_intersect_p (rclass, DM_REGS))
+    ret = (rs6000_dmr_register_move_cost (mode, VSX_REGS)
+	   + rs6000_memory_move_cost (mode, VSX_REGS, false));
   else
     ret = 4 + rs6000_register_move_cost (mode, rclass, GENERAL_REGS);
 
@@ -23797,6 +23937,8 @@ rs6000_compute_pressure_classes (enum reg_class *pressure_classes)
       if (TARGET_HARD_FLOAT)
 	pressure_classes[n++] = FLOAT_REGS;
     }
+  if (TARGET_DENSE_MATH)
+    pressure_classes[n++] = DM_REGS;
   pressure_classes[n++] = CR_REGS;
   pressure_classes[n++] = SPECIAL_REGS;
 
@@ -23961,6 +24103,10 @@ rs6000_debugger_regno (unsigned int regno, unsigned int format)
     return 67;
   if (regno == 64)
     return 64;
+  /* XXX: This is a guess.  The GCC register number for FIRST_DMR_REGNO is 111,
+     but the frame pointer regnum uses that.  */
+  if (DMR_REGNO_P (regno))
+    return regno - FIRST_DMR_REGNO + 112;
 
   gcc_unreachable ();
 }
@@ -24171,6 +24317,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
   { "crypto",			OPTION_MASK_CRYPTO,		false, true  },
   { "direct-move",		OPTION_MASK_DIRECT_MOVE,	false, true  },
   { "dlmzb",			OPTION_MASK_DLMZB,		false, true  },
+  { "dense-math",		OPTION_MASK_DENSE_MATH,		false, true  },
   { "efficient-unaligned-vsx",	OPTION_MASK_EFFICIENT_UNALIGNED_VSX,
 								false, true  },
   { "float128",			OPTION_MASK_FLOAT128_KEYWORD,	false, true  },
@@ -27257,7 +27404,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 		      || XINT (src, 1) == UNSPECV_MMA_ASSEMBLE);
 	  gcc_assert (REG_P (dst));
 	  if (GET_MODE (src) == XOmode)
-	    gcc_assert (FP_REGNO_P (REGNO (dst)));
+	    gcc_assert ((TARGET_DENSE_MATH
+			 ? VSX_REGNO_P (REGNO (dst))
+			 : FP_REGNO_P (REGNO (dst))));
 	  if (GET_MODE (src) == OOmode)
 	    gcc_assert (VSX_REGNO_P (REGNO (dst)));
 
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 44fa355a061..c034b9ed179 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -662,6 +662,7 @@ extern unsigned char rs6000_recip_bits[];
 #define UNITS_PER_FP_WORD 8
 #define UNITS_PER_ALTIVEC_WORD 16
 #define UNITS_PER_VSX_WORD 16
+#define UNITS_PER_DMR_WORD 128
 
 /* Type used for ptrdiff_t, as a string used in a declaration.  */
 #define PTRDIFF_TYPE "int"
@@ -789,7 +790,7 @@ enum data_align { align_abi, align_opt, align_both };
    Another pseudo (not included in DWARF_FRAME_REGISTERS) is soft frame
    pointer, which is eventually eliminated in favor of SP or FP.  */
 
-#define FIRST_PSEUDO_REGISTER 111
+#define FIRST_PSEUDO_REGISTER 119
 
 /* Use standard DWARF numbering for DWARF debugging information.  */
 #define DEBUGGER_REGNO(REGNO) rs6000_debugger_regno ((REGNO), 0)
@@ -826,7 +827,9 @@ enum data_align { align_abi, align_opt, align_both };
    /* cr0..cr7 */				   \
    0, 0, 0, 0, 0, 0, 0, 0,			   \
    /* vrsave vscr sfp */			   \
-   1, 1, 1					   \
+   1, 1, 1,					   \
+   /* DMR registers.  */			   \
+   0, 0, 0, 0, 0, 0, 0, 0			   \
 }
 
 /* Like `CALL_USED_REGISTERS' except this macro doesn't require that
@@ -850,7 +853,9 @@ enum data_align { align_abi, align_opt, align_both };
    /* cr0..cr7 */				   \
    1, 1, 0, 0, 0, 1, 1, 1,			   \
    /* vrsave vscr sfp */			   \
-   0, 0, 0					   \
+   0, 0, 0,					   \
+   /* DMR registers.  */			   \
+   0, 0, 0, 0, 0, 0, 0, 0			   \
 }
 
 #define TOTAL_ALTIVEC_REGS	(LAST_ALTIVEC_REGNO - FIRST_ALTIVEC_REGNO + 1)
@@ -887,6 +892,7 @@ enum data_align { align_abi, align_opt, align_both };
 	v2		(not saved; incoming vector arg reg; return value)
 	v19 - v14	(not saved or used for anything)
 	v31 - v20	(saved; order given to save least number)
+	dmr0 - dmr7	(not saved)
 	vrsave, vscr	(fixed)
 	sfp		(fixed)
 */
@@ -929,6 +935,9 @@ enum data_align { align_abi, align_opt, align_both };
    66,								\
    83, 82, 81, 80, 79, 78,					\
    95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84,		\
+   /* DMR registers.  */					\
+   111, 112, 113, 114, 115, 116, 117, 118,			\
+   /* Vrsave, vscr, sfp.  */					\
    108, 109,							\
    110								\
 }
@@ -955,6 +964,9 @@ enum data_align { align_abi, align_opt, align_both };
 /* True if register is a VSX register.  */
 #define VSX_REGNO_P(N) (FP_REGNO_P (N) || ALTIVEC_REGNO_P (N))
 
+/* True if register is a DMR register.  */
+#define DMR_REGNO_P(N) ((N) >= FIRST_DMR_REGNO && (N) <= LAST_DMR_REGNO)
+
 /* Alternate name for any vector register supporting floating point, no matter
    which instruction set(s) are available.  */
 #define VFLOAT_REGNO_P(N) \
@@ -1090,6 +1102,7 @@ enum reg_class
   FLOAT_REGS,
   ALTIVEC_REGS,
   VSX_REGS,
+  DM_REGS,
   VRSAVE_REGS,
   VSCR_REGS,
   GEN_OR_FLOAT_REGS,
@@ -1119,6 +1132,7 @@ enum reg_class
   "FLOAT_REGS",								\
   "ALTIVEC_REGS",							\
   "VSX_REGS",								\
+  "DM_REGS",								\
   "VRSAVE_REGS",							\
   "VSCR_REGS",								\
   "GEN_OR_FLOAT_REGS",							\
@@ -1153,6 +1167,8 @@ enum reg_class
   { 0x00000000, 0x00000000, 0xffffffff, 0x00000000 },			\
   /* VSX_REGS.  */							\
   { 0x00000000, 0xffffffff, 0xffffffff, 0x00000000 },			\
+  /* DM_REGS.  */							\
+  { 0x00000000, 0x00000000, 0x00000000, 0x007f8000 },			\
   /* VRSAVE_REGS.  */							\
   { 0x00000000, 0x00000000, 0x00000000, 0x00001000 },			\
   /* VSCR_REGS.  */							\
@@ -1180,7 +1196,7 @@ enum reg_class
   /* CA_REGS.  */							\
   { 0x00000000, 0x00000000, 0x00000000, 0x00000004 },			\
   /* ALL_REGS.  */							\
-  { 0xffffffff, 0xffffffff, 0xffffffff, 0x00007fff }			\
+  { 0xffffffff, 0xffffffff, 0xffffffff, 0x007fffff }			\
 }
 
 /* The same information, inverted:
@@ -1204,6 +1220,7 @@ enum r6000_reg_class_enum {
   RS6000_CONSTRAINT_wr,		/* GPR register if 64-bit  */
   RS6000_CONSTRAINT_wx,		/* FPR register for STFIWX */
   RS6000_CONSTRAINT_wA,		/* BASE_REGS if 64-bit.  */
+  RS6000_CONSTRAINT_wD,		/* Accumulator regs if MMA/Dense Math.  */
   RS6000_CONSTRAINT_MAX
 };
 
@@ -2077,7 +2094,16 @@ extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
   &rs6000_reg_names[108][0],	/* vrsave  */				\
   &rs6000_reg_names[109][0],	/* vscr  */				\
 									\
-  &rs6000_reg_names[110][0]	/* sfp  */				\
+  &rs6000_reg_names[110][0],	/* sfp  */				\
+									\
+  &rs6000_reg_names[111][0],	/* dmr0  */				\
+  &rs6000_reg_names[112][0],	/* dmr1  */				\
+  &rs6000_reg_names[113][0],	/* dmr2  */				\
+  &rs6000_reg_names[114][0],	/* dmr3  */				\
+  &rs6000_reg_names[115][0],	/* dmr4  */				\
+  &rs6000_reg_names[116][0],	/* dmr5  */				\
+  &rs6000_reg_names[117][0],	/* dmr6  */				\
+  &rs6000_reg_names[118][0],	/* dmr7  */				\
 }
 
 /* Table of additional register names to use in user input.  */
@@ -2131,6 +2157,8 @@ extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
   {"vs52", 84}, {"vs53", 85}, {"vs54", 86}, {"vs55", 87},	\
   {"vs56", 88}, {"vs57", 89}, {"vs58", 90}, {"vs59", 91},	\
   {"vs60", 92}, {"vs61", 93}, {"vs62", 94}, {"vs63", 95},	\
+  {"dmr0", 111}, {"dmr1", 112}, {"dmr2", 113}, {"dmr3", 114},	\
+  {"dmr4", 115}, {"dmr5", 116}, {"dmr6", 117}, {"dmr7", 118},	\
 }
 
 /* This is how to output an element of a case-vector that is relative.  */
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 5f933bede93..ee7651d9b43 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -51,6 +51,8 @@ (define_constants
    (VRSAVE_REGNO		108)
    (VSCR_REGNO			109)
    (FRAME_POINTER_REGNUM	110)
+   (FIRST_DMR_REGNO		111)
+   (LAST_DMR_REGNO		118)
   ])
 
 ;;
@@ -354,7 +356,7 @@ (define_attr "cpu"
   (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
 
 ;; The ISA we implement.
-(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10"
+(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,dm,not_dm"
   (const_string "any"))
 
 ;; Is this alternative enabled for the current CPU/ISA/etc.?
@@ -402,6 +404,14 @@ (define_attr "enabled" ""
      (and (eq_attr "isa" "p10")
 	  (match_test "TARGET_POWER10"))
      (const_int 1)
+
+     (and (eq_attr "isa" "dm")
+	  (match_test "TARGET_DENSE_MATH"))
+     (const_int 1)
+
+     (and (eq_attr "isa" "not_dm")
+	  (match_test "!TARGET_DENSE_MATH"))
+     (const_int 1)
     ] (const_int 0)))
 
 ;; If this instruction is microcoded on the CELL processor
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index 04532a774b9..e45faf4a4ef 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -624,6 +624,10 @@ mfuture
 Target Undocumented Mask(FUTURE) Var(rs6000_isa_flags)
 Generate (do not generate) future instructions.
 
+mdense-math
+Target Undocumented Mask(DENSE_MATH) Var(rs6000_isa_flags)
+Generate (do not generate) dense math instructions.
+
 ; Documented parameters
 
 -param=rs6000-vect-unroll-limit=
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 7235d34c4b3..f8a02b25772 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -3264,6 +3264,13 @@ Like @code{d}, if @option{-mpowerpc-gfxopt} is used; otherwise, @code{NO_REGS}.
 @item wA
 Like @code{b}, if @option{-mpowerpc64} is used; otherwise, @code{NO_REGS}.
 
+@item wD
+Accumulator register if @option{-mmma} is used; otherwise,
+@code{NO_REGS}.  If @option{-mdense-math} is used, the accumulator
+register will be in the dense math register set.  If
+@option{-mno-dense-math} is used, the accumulator register will
+overlap with the VSX vector registers 0..31.
+
 @item wB
 Signed 5-bit constant integer that can be loaded into an Altivec register.
 
-- 
2.39.1


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* [PATCH 3/8] PowerPC: Make MMA insns support DMR registers.
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
                   ` (2 preceding siblings ...)
  2023-02-03 21:25 ` [PATCH 2/8] PowerPC: Add support for accumulators in DMR registers Michael Meissner
@ 2023-02-03 21:27 ` Michael Meissner
  2023-02-03 21:29 ` [PATCH 4/8] PowerPC: Switch to dense math names for all MMA operations Michael Meissner
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:27 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

This patch changes the MMA instructions to use either FPR registers
(-mcpu=power10) or DMRs (-mcpu=future).  In this patch, the existing MMA
instruction names are used.
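
As a rough illustration of the %A operand change listed in the ChangeLog below
(only DMRs if dense math, only FPRs otherwise), here is a minimal sketch of the
register-number to accumulator-number mapping.  The function name
accumulator_number and the dense_math flag are hypothetical and are not part of
the patch.  The DMR numbers 111..118 come from the rs6000.md constants added in
patch 2/8; the FPR range 32..63 is my assumption about the existing constants.

```c
/* Hard register numbers.  The DMR values match the constants added to
   rs6000.md in patch 2/8; the FPR values are an assumption.  */
enum {
  FIRST_FPR_REGNO = 32,
  LAST_FPR_REGNO = 63,
  FIRST_DMR_REGNO = 111,
  LAST_DMR_REGNO = 118
};

/* Hypothetical helper: return the accumulator number that %A would print
   for hard register REGNO, or -1 if REGNO is not a valid accumulator.  */
static int
accumulator_number (int regno, int dense_math)
{
  if (dense_math)
    {
      /* Dense math: accumulators are the separate DMR registers.  */
      if (regno >= FIRST_DMR_REGNO && regno <= LAST_DMR_REGNO)
        return regno - FIRST_DMR_REGNO;
      return -1;
    }

  /* Original MMA: an accumulator overlaps 4 consecutive FPRs, so the
     register number must be an FPR that is divisible by 4.  */
  if (regno < FIRST_FPR_REGNO || regno > LAST_FPR_REGNO || (regno % 4) != 0)
    return -1;

  return (regno - FIRST_FPR_REGNO) / 4;
}
```

With these assumptions, register 36 maps to accumulator 1 when dense math is
off, and register 111 maps to accumulator 0 when dense math is on.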

A macro (__PPC_DMR__) is defined if the MMA instructions use the DMRs.
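
For example, user code could test the macro along these lines.  This is only a
sketch: the function name accumulator_home is hypothetical, and I am assuming
that __MMA__ remains defined whenever -mmma is in effect.

```c
/* Hypothetical helper: describe where MMA accumulators live for the
   options this file was compiled with.  */
static const char *
accumulator_home (void)
{
#if defined (__PPC_DMR__)
  /* Dense math: accumulators are in the separate DMR registers.  */
  return "dmr";
#elif defined (__MMA__)
  /* Original MMA: accumulators overlap the FPR registers.  */
  return "fpr";
#else
  /* No MMA support enabled (or not a PowerPC target).  */
  return "none";
#endif
}
```

Code that still needs explicit prime/de-prime handling for power10 could use
the same test to skip it when the DMRs are in use.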

The patches have been tested on the following platforms.  I added the patches
for PR target/107299 that I submitted on November 2nd before doing the builds so
that GCC would build on systems using IEEE 128-bit long double.
    *	https://gcc.gnu.org/pipermail/gcc-patches/2022-November/604834.html

There were no regressions when doing bootstrap builds and running the
regression tests:

    1)	Power10 LE using --with-cpu=power10 --with-long-double-format=ieee;
    2)	Power10 LE using --with-cpu=power10 --with-long-double-format=ibm;
    3)	Power9 LE using --with-cpu=power9 --with-long-double-format=ibm; and
    4)	Power8 BE using --with-cpu=power8 (both 32-bit & 64-bit tested).

Note, I will be on vacation from Tuesday February 7th through Tuesday February
14th.

Can I check this patch into the GCC 13 master branch?

2023-02-03   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/mma.md (mma_<acc>): New define_expand to handle
	mma_<acc> for dense math and non dense math.
	(mma_<acc> insn): Restrict to non dense math.
	(mma_xxsetaccz): Convert to define_expand to handle non dense math and
	dense math.
	(mma_xxsetaccz_vsx): Rename from mma_xxsetaccz and restrict usage to non
	dense math.
	(mma_xxsetaccz_dm): Dense math version of mma_xxsetaccz.
	(mma_<vv>): Add support for dense math.
	(mma_<avv>): Likewise.
	(mma_<pv>): Likewise.
	(mma_<apv>): Likewise.
	(mma_<vvi4i4i8>): Likewise.
	(mma_<avvi4i4i8>): Likewise.
	(mma_<vvi4i4i2>): Likewise.
	(mma_<avvi4i4i2>): Likewise.
	(mma_<vvi4i4>): Likewise.
	(mma_<avvi4i4>): Likewise.
	(mma_<pvi4i2>): Likewise.
	(mma_<apvi4i2>): Likewise.
	(mma_<vvi4i4i4>): Likewise.
	(mma_<avvi4i4i4>): Likewise.
	* config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
	__PPC_DMR__ if we have dense math instructions.
	* config/rs6000/rs6000.cc (print_operand): Make %A handle only DMRs if
	dense math and only FPRs if not dense math.
	(rs6000_split_multireg_move): Do not generate the xxmtacc instruction to
	prime the DMR registers or the xxmfacc instruction to de-prime them if we
	have dense math register support.
---
 gcc/config/rs6000/mma.md      | 247 +++++++++++++++++++++-------------
 gcc/config/rs6000/rs6000-c.cc |   3 +
 gcc/config/rs6000/rs6000.cc   |  35 ++---
 3 files changed, 176 insertions(+), 109 deletions(-)

diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index 59ca6835f7c..9e3feb3ea54 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -552,190 +552,249 @@ (define_insn "*mma_disassemble_acc_dm"
   "dmxxextfdmr256 %0,%1,2"
   [(set_attr "type" "mma")])
 
-(define_insn "mma_<acc>"
+;; MMA instructions that do not use their accumulators as an input still must
+;; not allow their vector operands to overlap the registers used by the
+;; accumulator.  We enforce this by marking the output as early clobber.  If we
+;; have dense math, we don't need the whole prime/de-prime action, so just make
+;; these instructions be NOPs.
+
+(define_expand "mma_<acc>"
+  [(set (match_operand:XO 0 "register_operand")
+	(unspec:XO [(match_operand:XO 1 "register_operand")]
+		   MMA_ACC))]
+  "TARGET_MMA"
+{
+  if (TARGET_DENSE_MATH)
+    {
+      if (!rtx_equal_p (operands[0], operands[1]))
+	emit_move_insn (operands[0], operands[1]);
+      DONE;
+    }
+
+  /* Generate the prime/de-prime code.  */
+})
+
+(define_insn "*mma_<acc>"
   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
 	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")]
 		    MMA_ACC))]
-  "TARGET_MMA"
+  "TARGET_MMA && !TARGET_DENSE_MATH"
   "<acc> %A0"
   [(set_attr "type" "mma")])
 
 ;; We can't have integer constants in XOmode so we wrap this in an
-;; UNSPEC_VOLATILE.
+;; UNSPEC_VOLATILE for the non-dense math case.  For dense math, we don't need
+;; to disable optimization and we can do a normal UNSPEC.
 
-(define_insn "mma_xxsetaccz"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
+(define_expand "mma_xxsetaccz"
+  [(set (match_operand:XO 0 "register_operand")
 	(unspec_volatile:XO [(const_int 0)]
 			    UNSPECV_MMA_XXSETACCZ))]
   "TARGET_MMA"
+{
+  if (TARGET_DENSE_MATH)
+    {
+      emit_insn (gen_mma_xxsetaccz_dm (operands[0]));
+      DONE;
+    }
+})
+
+(define_insn "*mma_xxsetaccz_vsx"
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
+	(unspec_volatile:XO [(const_int 0)]
+			    UNSPECV_MMA_XXSETACCZ))]
+  "TARGET_MMA && !TARGET_DENSE_MATH"
   "xxsetaccz %A0"
   [(set_attr "type" "mma")])
 
+
+(define_insn "mma_xxsetaccz_dm"
+  [(set (match_operand:XO 0 "dmr_operand" "=wD")
+	(unspec:XO [(const_int 0)]
+		   UNSPECV_MMA_XXSETACCZ))]
+  "TARGET_DENSE_MATH"
+  "dmsetdmrz %0"
+  [(set_attr "type" "mma")])
+
 (define_insn "mma_<vv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_VV))]
   "TARGET_MMA"
   "<vv> %A0,%x1,%x2"
-  [(set_attr "type" "mma")])
+  [(set_attr "type" "mma")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_AVV))]
   "TARGET_MMA"
   "<avv> %A0,%x2,%x3"
-  [(set_attr "type" "mma")])
+  [(set_attr "type" "mma")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<pv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_PV))]
   "TARGET_MMA"
   "<pv> %A0,%x1,%x2"
-  [(set_attr "type" "mma")])
+  [(set_attr "type" "mma")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<apv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:OO 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:OO 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_APV))]
   "TARGET_MMA"
   "<apv> %A0,%x2,%x3"
-  [(set_attr "type" "mma")])
+  [(set_attr "type" "mma")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<vvi4i4i8>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "u8bit_cint_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "u8bit_cint_operand" "n,n,n")]
 		    MMA_VVI4I4I8))]
   "TARGET_MMA"
   "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avvi4i4i8>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 6 "u8bit_cint_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 6 "u8bit_cint_operand" "n,n,n")]
 		    MMA_AVVI4I4I8))]
   "TARGET_MMA"
   "<avvi4i4i8> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<vvi4i4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
 		    MMA_VVI4I4I2))]
   "TARGET_MMA"
   "<vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avvi4i4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 6 "const_0_to_3_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 6 "const_0_to_3_operand" "n,n,n")]
 		    MMA_AVVI4I4I2))]
   "TARGET_MMA"
   "<avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<vvi4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")]
 		    MMA_VVI4I4))]
   "TARGET_MMA"
   "<vvi4i4> %A0,%x1,%x2,%3,%4"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avvi4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
 		    MMA_AVVI4I4))]
   "TARGET_MMA"
   "<avvi4i4> %A0,%x2,%x3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<pvi4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_3_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_3_operand" "n,n,n")]
 		    MMA_PVI4I2))]
   "TARGET_MMA"
   "<pvi4i2> %A0,%x1,%x2,%3,%4"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<apvi4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:OO 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:OO 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
 		    MMA_APVI4I2))]
   "TARGET_MMA"
   "<apvi4i2> %A0,%x2,%x3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<vvi4i4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
 		    MMA_VVI4I4I4))]
   "TARGET_MMA"
   "<vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
 
 (define_insn "mma_<avvi4i4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
-		    (match_operand:SI 6 "const_0_to_15_operand" "n,n")]
+  [(set (match_operand:XO 0 "accumulator_operand" "=wD,&d,&d")
+	(unspec:XO [(match_operand:XO 1 "accumulator_operand" "0,0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")
+		    (match_operand:SI 6 "const_0_to_15_operand" "n,n,n")]
 		    MMA_AVVI4I4I4))]
   "TARGET_MMA"
   "<avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
-   (set_attr "prefixed" "yes")])
+   (set_attr "prefixed" "yes")
+   (set_attr "isa" "dm,not_dm,not_dm")])
diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
index 2803014f2b6..baf1f4dc92b 100644
--- a/gcc/config/rs6000/rs6000-c.cc
+++ b/gcc/config/rs6000/rs6000-c.cc
@@ -600,6 +600,9 @@ rs6000_target_modify_macros (bool define_p, HOST_WIDE_INT flags)
   /* Tell the user if we support the MMA instructions.  */
   if ((flags & OPTION_MASK_MMA) != 0)
     rs6000_define_or_undefine_macro (define_p, "__MMA__");
+  /* Tell the user if we support the dense math instructions.  */
+  if ((flags & OPTION_MASK_DENSE_MATH) != 0)
+    rs6000_define_or_undefine_macro (define_p, "__PPC_DMR__");
   /* Whether pc-relative code is being generated.  */
   if ((flags & OPTION_MASK_PCREL) != 0)
     rs6000_define_or_undefine_macro (define_p, "__PCREL__");
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 8ecb3021ff9..c8f05f6f2d7 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -13910,8 +13910,13 @@ print_operand (FILE *file, rtx x, int code)
 	 overlapping with the FPR registers.  */
       if (!REG_P (x))
 	output_operand_lossage ("invalid %%A value");
-      else if (TARGET_DENSE_MATH && DMR_REGNO_P (REGNO (x)))
-	fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
+      else if (TARGET_DENSE_MATH)
+	{
+	  if (DMR_REGNO_P (REGNO (x)))
+	    fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
+	  else
+	    output_operand_lossage ("%%A operand is not a DMR");
+	}
       else if (!FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
 	output_operand_lossage ("invalid %%A value");
       else
@@ -27356,7 +27361,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
 	  /* If we are reading an accumulator register, we have to
 	     deprime it before we can access it.  */
-	  if (TARGET_MMA
+	  if (TARGET_MMA && !TARGET_DENSE_MATH
 	      && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
 	    emit_insn (gen_mma_xxmfacc (src, src));
 
@@ -27388,9 +27393,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 	      emit_insn (gen_rtx_SET (dst2, src2));
 	    }
 
-	  /* If we are writing an accumulator register, we have to
-	     prime it after we've written it.  */
-	  if (TARGET_MMA
+	  /* If we are writing an accumulator register that overlaps with the
+	     FPR registers, we have to prime it after we've written it.  */
+	  if (TARGET_MMA && !TARGET_DENSE_MATH
 	      && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
 	    emit_insn (gen_mma_xxmtacc (dst, dst));
 
@@ -27459,9 +27464,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 	      emit_insn (gen_rtx_SET (dst_i, op));
 	    }
 
-	  /* We are writing an accumulator register, so we have to
-	     prime it after we've written it.  */
-	  if (GET_MODE (src) == XOmode)
+	  /* On systems without dense math where accumulators overlap with the
+	     vector registers, we have to prime it after we've written it.  */
+	  if (GET_MODE (src) == XOmode && !TARGET_DENSE_MATH)
 	    emit_insn (gen_mma_xxmtacc (dst, dst));
 
 	  return;
@@ -27472,9 +27477,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
   if (REG_P (src) && REG_P (dst) && (REGNO (src) < REGNO (dst)))
     {
-      /* If we are reading an accumulator register, we have to
-	 deprime it before we can access it.  */
-      if (TARGET_MMA
+      /* If we are reading an accumulator register and we don't have dense
+	 math, we have to deprime it before we can access it.  */
+      if (TARGET_MMA && !TARGET_DENSE_MATH
 	  && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
 	emit_insn (gen_mma_xxmfacc (src, src));
 
@@ -27502,7 +27507,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
       /* If we are writing an accumulator register, we have to
 	 prime it after we've written it.  */
-      if (TARGET_MMA
+      if (TARGET_MMA && !TARGET_DENSE_MATH
 	  && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
 	emit_insn (gen_mma_xxmtacc (dst, dst));
     }
@@ -27639,7 +27644,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
       /* If we are reading an accumulator register, we have to
 	 deprime it before we can access it.  */
-      if (TARGET_MMA && REG_P (src)
+      if (TARGET_MMA && !TARGET_DENSE_MATH && REG_P (src)
 	  && GET_MODE (src) == XOmode && FP_REGNO_P (REGNO (src)))
 	emit_insn (gen_mma_xxmfacc (src, src));
 
@@ -27671,7 +27676,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 
       /* If we are writing an accumulator register, we have to
 	 prime it after we've written it.  */
-      if (TARGET_MMA && REG_P (dst)
+      if (TARGET_MMA && !TARGET_DENSE_MATH && REG_P (dst)
 	  && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
 	emit_insn (gen_mma_xxmtacc (dst, dst));
 
-- 
2.39.1


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* [PATCH 4/8] PowerPC: Switch to dense math names for all MMA operations
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
                   ` (3 preceding siblings ...)
  2023-02-03 21:27 ` [PATCH 3/8] PowerPC: Make MMA insns support " Michael Meissner
@ 2023-02-03 21:29 ` Michael Meissner
  2023-02-03 21:33 ` [PATCH 6/8] PowerPC: Add support for 1,024 bit DMR registers Michael Meissner
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:29 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

This patch changes the assembler instruction names emitted for MMA instructions
from the original power10 names to the new names used with the dense math
system.  For example, xvf64gerpp becomes dmxvf64gerpp.  The assembler will emit
the same bits for either spelling.

The patches have been tested on the following platforms.  I added the patches
for PR target/107299 that I submitted on November 2nd before doing the builds so
that GCC would build on systems using IEEE 128-bit long double.
    *	https://gcc.gnu.org/pipermail/gcc-patches/2022-November/604834.html

There were no regressions when doing bootstrap builds and running the regression
tests:

    1)	Power10 LE using --with-cpu=power10 --with-long-double-format=ieee;
    2)	Power10 LE using --with-cpu=power10 --with-long-double-format=ibm;
    3)	Power9 LE using --with-cpu=power9 --with-long-double-format=ibm; and
    4)	Power8 BE using --with-cpu=power8 (both 32-bit & 64-bit tested).

Note, I will be on vacation from Tuesday February 7th through Tuesday February
14th.

Can I check this patch into the GCC 13 master branch?

2023-02-03   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/mma.md (vvi4i4i8_dm): New int attribute.
	(avvi4i4i8_dm): Likewise.
	(vvi4i4i2_dm): Likewise.
	(avvi4i4i2_dm): Likewise.
	(vvi4i4_dm): Likewise.
	(avvi4i4_dm): Likewise.
	(pvi4i2_dm): Likewise.
	(apvi4i2_dm): Likewise.
	(vvi4i4i4_dm): Likewise.
	(avvi4i4i4_dm): Likewise.
	(mma_<vv>): Add support for running on DMF systems, generating the dense
	math instruction and using the dense math accumulators.
	(mma_<avv>): Likewise.
	(mma_<pv>): Likewise.
	(mma_<apv>): Likewise.
	(mma_<vvi4i4i8>): Likewise.
	(mma_<avvi4i4i8>): Likewise.
	(mma_<vvi4i4i2>): Likewise.
	(mma_<avvi4i4i2>): Likewise.
	(mma_<vvi4i4>): Likewise.
	(mma_<avvi4i4>): Likewise.
	(mma_<pvi4i2>): Likewise.
	(mma_<apvi4i2>): Likewise.
	(mma_<vvi4i4i4>): Likewise.
	(mma_<avvi4i4i4>): Likewise.

gcc/testsuite/

	* gcc.target/powerpc/dm-double-test.c: New test.
	* lib/target-supports.exp
	(check_effective_target_powerpc_dense_math_ok): New target test.
---
 gcc/config/rs6000/mma.md                      |  98 +++++++--
 .../gcc.target/powerpc/dm-double-test.c       | 194 ++++++++++++++++++
 gcc/testsuite/lib/target-supports.exp         |  19 ++
 3 files changed, 299 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-double-test.c

diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index 9e3feb3ea54..411e2345291 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -227,13 +227,22 @@ (define_int_attr apv		[(UNSPEC_MMA_XVF64GERPP		"xvf64gerpp")
 
 (define_int_attr vvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8		"pmxvi4ger8")])
 
+(define_int_attr vvi4i4i8_dm	[(UNSPEC_MMA_PMXVI4GER8		"pmdmxvi4ger8")])
+
 (define_int_attr avvi4i4i8	[(UNSPEC_MMA_PMXVI4GER8PP	"pmxvi4ger8pp")])
 
+(define_int_attr avvi4i4i8_dm	[(UNSPEC_MMA_PMXVI4GER8PP	"pmdmxvi4ger8pp")])
+
 (define_int_attr vvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2	"pmxvi16ger2")
 				 (UNSPEC_MMA_PMXVI16GER2S	"pmxvi16ger2s")
 				 (UNSPEC_MMA_PMXVF16GER2	"pmxvf16ger2")
 				 (UNSPEC_MMA_PMXVBF16GER2	"pmxvbf16ger2")])
 
+(define_int_attr vvi4i4i2_dm	[(UNSPEC_MMA_PMXVI16GER2	"pmdmxvi16ger2")
+				 (UNSPEC_MMA_PMXVI16GER2S	"pmdmxvi16ger2s")
+				 (UNSPEC_MMA_PMXVF16GER2	"pmdmxvf16ger2")
+				 (UNSPEC_MMA_PMXVBF16GER2	"pmdmxvbf16ger2")])
+
 (define_int_attr avvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2PP	"pmxvi16ger2pp")
 				 (UNSPEC_MMA_PMXVI16GER2SPP	"pmxvi16ger2spp")
 				 (UNSPEC_MMA_PMXVF16GER2PP	"pmxvf16ger2pp")
@@ -245,25 +254,54 @@ (define_int_attr avvi4i4i2	[(UNSPEC_MMA_PMXVI16GER2PP	"pmxvi16ger2pp")
 				 (UNSPEC_MMA_PMXVBF16GER2NP	"pmxvbf16ger2np")
 				 (UNSPEC_MMA_PMXVBF16GER2NN	"pmxvbf16ger2nn")])
 
+(define_int_attr avvi4i4i2_dm	[(UNSPEC_MMA_PMXVI16GER2PP	"pmdmxvi16ger2pp")
+				 (UNSPEC_MMA_PMXVI16GER2SPP	"pmdmxvi16ger2spp")
+				 (UNSPEC_MMA_PMXVF16GER2PP	"pmdmxvf16ger2pp")
+				 (UNSPEC_MMA_PMXVF16GER2PN	"pmdmxvf16ger2pn")
+				 (UNSPEC_MMA_PMXVF16GER2NP	"pmdmxvf16ger2np")
+				 (UNSPEC_MMA_PMXVF16GER2NN	"pmdmxvf16ger2nn")
+				 (UNSPEC_MMA_PMXVBF16GER2PP	"pmdmxvbf16ger2pp")
+				 (UNSPEC_MMA_PMXVBF16GER2PN	"pmdmxvbf16ger2pn")
+				 (UNSPEC_MMA_PMXVBF16GER2NP	"pmdmxvbf16ger2np")
+				 (UNSPEC_MMA_PMXVBF16GER2NN	"pmdmxvbf16ger2nn")])
+
 (define_int_attr vvi4i4		[(UNSPEC_MMA_PMXVF32GER		"pmxvf32ger")])
 
+(define_int_attr vvi4i4_dm	[(UNSPEC_MMA_PMXVF32GER		"pmdmxvf32ger")])
+
 (define_int_attr avvi4i4	[(UNSPEC_MMA_PMXVF32GERPP	"pmxvf32gerpp")
 				 (UNSPEC_MMA_PMXVF32GERPN	"pmxvf32gerpn")
 				 (UNSPEC_MMA_PMXVF32GERNP	"pmxvf32gernp")
 				 (UNSPEC_MMA_PMXVF32GERNN	"pmxvf32gernn")])
 
+(define_int_attr avvi4i4_dm	[(UNSPEC_MMA_PMXVF32GERPP	"pmdmxvf32gerpp")
+				 (UNSPEC_MMA_PMXVF32GERPN	"pmdmxvf32gerpn")
+				 (UNSPEC_MMA_PMXVF32GERNP	"pmdmxvf32gernp")
+				 (UNSPEC_MMA_PMXVF32GERNN	"pmdmxvf32gernn")])
+
 (define_int_attr pvi4i2		[(UNSPEC_MMA_PMXVF64GER		"pmxvf64ger")])
 
+(define_int_attr pvi4i2_dm	[(UNSPEC_MMA_PMXVF64GER		"pmdmxvf64ger")])
+
 (define_int_attr apvi4i2	[(UNSPEC_MMA_PMXVF64GERPP	"pmxvf64gerpp")
 				 (UNSPEC_MMA_PMXVF64GERPN	"pmxvf64gerpn")
 				 (UNSPEC_MMA_PMXVF64GERNP	"pmxvf64gernp")
 				 (UNSPEC_MMA_PMXVF64GERNN	"pmxvf64gernn")])
 
+(define_int_attr apvi4i2_dm	[(UNSPEC_MMA_PMXVF64GERPP	"pmdmxvf64gerpp")
+				 (UNSPEC_MMA_PMXVF64GERPN	"pmdmxvf64gerpn")
+				 (UNSPEC_MMA_PMXVF64GERNP	"pmdmxvf64gernp")
+				 (UNSPEC_MMA_PMXVF64GERNN	"pmdmxvf64gernn")])
+
 (define_int_attr vvi4i4i4	[(UNSPEC_MMA_PMXVI8GER4		"pmxvi8ger4")])
 
+(define_int_attr vvi4i4i4_dm	[(UNSPEC_MMA_PMXVI8GER4		"pmdmxvi8ger4")])
+
 (define_int_attr avvi4i4i4	[(UNSPEC_MMA_PMXVI8GER4PP	"pmxvi8ger4pp")
 				 (UNSPEC_MMA_PMXVI8GER4SPP	"pmxvi8ger4spp")])
 
+(define_int_attr avvi4i4i4_dm	[(UNSPEC_MMA_PMXVI8GER4PP	"pmdmxvi8ger4pp")
+				 (UNSPEC_MMA_PMXVI8GER4SPP	"pmdmxvi8ger4spp")])
 
 ;; Vector pair support.  OOmode can only live in VSRs.
 (define_expand "movoo"
@@ -622,7 +660,10 @@ (define_insn "mma_<vv>"
 		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_VV))]
   "TARGET_MMA"
-  "<vv> %A0,%x1,%x2"
+  "@
+   dm<vv> %A0,%x1,%x2
+   <vv> %A0,%x1,%x2
+   <vv> %A0,%x1,%x2"
   [(set_attr "type" "mma")
    (set_attr "isa" "dm,not_dm,not_dm")])
 
@@ -643,7 +684,10 @@ (define_insn "mma_<pv>"
 		    (match_operand:V16QI 2 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_PV))]
   "TARGET_MMA"
-  "<pv> %A0,%x1,%x2"
+  "@
+   dm<pv> %A0,%x1,%x2
+   <pv> %A0,%x1,%x2
+   <pv> %A0,%x1,%x2"
   [(set_attr "type" "mma")
    (set_attr "isa" "dm,not_dm,not_dm")])
 
@@ -654,7 +698,10 @@ (define_insn "mma_<apv>"
 		    (match_operand:V16QI 3 "vsx_register_operand" "wa,v,?wa")]
 		    MMA_APV))]
   "TARGET_MMA"
-  "<apv> %A0,%x2,%x3"
+  "@
+   dm<apv> %A0,%x2,%x3
+   <apv> %A0,%x2,%x3
+   <apv> %A0,%x2,%x3"
   [(set_attr "type" "mma")
    (set_attr "isa" "dm,not_dm,not_dm")])
 
@@ -667,7 +714,10 @@ (define_insn "mma_<vvi4i4i8>"
 		    (match_operand:SI 5 "u8bit_cint_operand" "n,n,n")]
 		    MMA_VVI4I4I8))]
   "TARGET_MMA"
-  "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
+  "@
+   <vvi4i4i8_dm> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i8> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -696,7 +746,10 @@ (define_insn "mma_<vvi4i4i2>"
 		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
 		    MMA_VVI4I4I2))]
   "TARGET_MMA"
-  "<vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
+  "@
+   <vvi4i4i2_dm> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i2> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -711,7 +764,10 @@ (define_insn "mma_<avvi4i4i2>"
 		    (match_operand:SI 6 "const_0_to_3_operand" "n,n,n")]
 		    MMA_AVVI4I4I2))]
   "TARGET_MMA"
-  "<avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
+  "@
+   <avvi4i4i2_dm> %A0,%x2,%x3,%4,%5,%6
+   <avvi4i4i2> %A0,%x2,%x3,%4,%5,%6
+   <avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -724,7 +780,10 @@ (define_insn "mma_<vvi4i4>"
 		    (match_operand:SI 4 "const_0_to_15_operand" "n,n,n")]
 		    MMA_VVI4I4))]
   "TARGET_MMA"
-  "<vvi4i4> %A0,%x1,%x2,%3,%4"
+  "@
+   <vvi4i4_dm> %A0,%x1,%x2,%3,%4
+   <vvi4i4> %A0,%x1,%x2,%3,%4
+   <vvi4i4> %A0,%x1,%x2,%3,%4"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -738,7 +797,10 @@ (define_insn "mma_<avvi4i4>"
 		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
 		    MMA_AVVI4I4))]
   "TARGET_MMA"
-  "<avvi4i4> %A0,%x2,%x3,%4,%5"
+  "@
+   <avvi4i4_dm> %A0,%x2,%x3,%4,%5
+   <avvi4i4> %A0,%x2,%x3,%4,%5
+   <avvi4i4> %A0,%x2,%x3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -751,7 +813,10 @@ (define_insn "mma_<pvi4i2>"
 		    (match_operand:SI 4 "const_0_to_3_operand" "n,n,n")]
 		    MMA_PVI4I2))]
   "TARGET_MMA"
-  "<pvi4i2> %A0,%x1,%x2,%3,%4"
+  "@
+   <pvi4i2_dm> %A0,%x1,%x2,%3,%4
+   <pvi4i2> %A0,%x1,%x2,%3,%4
+   <pvi4i2> %A0,%x1,%x2,%3,%4"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -765,7 +830,10 @@ (define_insn "mma_<apvi4i2>"
 		    (match_operand:SI 5 "const_0_to_3_operand" "n,n,n")]
 		    MMA_APVI4I2))]
   "TARGET_MMA"
-  "<apvi4i2> %A0,%x2,%x3,%4,%5"
+  "@
+   <apvi4i2_dm> %A0,%x2,%x3,%4,%5
+   <apvi4i2> %A0,%x2,%x3,%4,%5
+   <apvi4i2> %A0,%x2,%x3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -779,7 +847,10 @@ (define_insn "mma_<vvi4i4i4>"
 		    (match_operand:SI 5 "const_0_to_15_operand" "n,n,n")]
 		    MMA_VVI4I4I4))]
   "TARGET_MMA"
-  "<vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
+  "@
+   <vvi4i4i4_dm> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i4> %A0,%x1,%x2,%3,%4,%5
+   <vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
@@ -794,7 +865,10 @@ (define_insn "mma_<avvi4i4i4>"
 		    (match_operand:SI 6 "const_0_to_15_operand" "n,n,n")]
 		    MMA_AVVI4I4I4))]
   "TARGET_MMA"
-  "<avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
+  "@
+   <avvi4i4i4_dm> %A0,%x2,%x3,%4,%5,%6
+   <avvi4i4i4> %A0,%x2,%x3,%4,%5,%6
+   <avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
diff --git a/gcc/testsuite/gcc.target/powerpc/dm-double-test.c b/gcc/testsuite/gcc.target/powerpc/dm-double-test.c
new file mode 100644
index 00000000000..66c19779585
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/dm-double-test.c
@@ -0,0 +1,194 @@
+/* Test derived from mma-double-1.c, modified for dense math.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_dense_math_ok } */
+/* { dg-options "-mdejagnu-cpu=future -O2" } */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <altivec.h>
+
+typedef unsigned char vec_t __attribute__ ((vector_size (16)));
+typedef double v4sf_t __attribute__ ((vector_size (16)));
+#define SAVE_ACC(ACC, ldc, J)  \
+	  __builtin_mma_disassemble_acc (result, ACC); \
+	  rowC = (v4sf_t *) &CO[0*ldc+J]; \
+          rowC[0] += result[0]; \
+          rowC = (v4sf_t *) &CO[1*ldc+J]; \
+          rowC[0] += result[1]; \
+          rowC = (v4sf_t *) &CO[2*ldc+J]; \
+          rowC[0] += result[2]; \
+          rowC = (v4sf_t *) &CO[3*ldc+J]; \
+	  rowC[0] += result[3];
+
+void
+DM (int m, int n, int k, double *A, double *B, double *C)
+{
+  __vector_quad acc0, acc1, acc2, acc3, acc4, acc5, acc6, acc7;
+  v4sf_t result[4];
+  v4sf_t *rowC;
+  for (int l = 0; l < n; l += 4)
+    {
+      double *CO;
+      double *AO;
+      AO = A;
+      CO = C;
+      C += m * 4;
+      for (int j = 0; j < m; j += 16)
+	{
+	  double *BO = B;
+	  __builtin_mma_xxsetaccz (&acc0);
+	  __builtin_mma_xxsetaccz (&acc1);
+	  __builtin_mma_xxsetaccz (&acc2);
+	  __builtin_mma_xxsetaccz (&acc3);
+	  __builtin_mma_xxsetaccz (&acc4);
+	  __builtin_mma_xxsetaccz (&acc5);
+	  __builtin_mma_xxsetaccz (&acc6);
+	  __builtin_mma_xxsetaccz (&acc7);
+	  unsigned long i;
+
+	  for (i = 0; i < k; i++)
+	    {
+	      vec_t *rowA = (vec_t *) & AO[i * 16];
+	      __vector_pair rowB;
+	      vec_t *rb = (vec_t *) & BO[i * 4];
+	      __builtin_mma_assemble_pair (&rowB, rb[1], rb[0]);
+	      __builtin_mma_xvf64gerpp (&acc0, rowB, rowA[0]);
+	      __builtin_mma_xvf64gerpp (&acc1, rowB, rowA[1]);
+	      __builtin_mma_xvf64gerpp (&acc2, rowB, rowA[2]);
+	      __builtin_mma_xvf64gerpp (&acc3, rowB, rowA[3]);
+	      __builtin_mma_xvf64gerpp (&acc4, rowB, rowA[4]);
+	      __builtin_mma_xvf64gerpp (&acc5, rowB, rowA[5]);
+	      __builtin_mma_xvf64gerpp (&acc6, rowB, rowA[6]);
+	      __builtin_mma_xvf64gerpp (&acc7, rowB, rowA[7]);
+	    }
+	  SAVE_ACC (&acc0, m, 0);
+	  SAVE_ACC (&acc2, m, 4);
+	  SAVE_ACC (&acc1, m, 2);
+	  SAVE_ACC (&acc3, m, 6);
+	  SAVE_ACC (&acc4, m, 8);
+	  SAVE_ACC (&acc6, m, 12);
+	  SAVE_ACC (&acc5, m, 10);
+	  SAVE_ACC (&acc7, m, 14);
+	  AO += k * 16;
+	  BO += k * 4;
+	  CO += 16;
+	}
+      B += k * 4;
+    }
+}
+
+void
+init (double *matrix, int row, int column)
+{
+  for (int j = 0; j < column; j++)
+    {
+      for (int i = 0; i < row; i++)
+	{
+	  matrix[j * row + i] = (i * 16 + 2 + j) / 0.123;
+	}
+    }
+}
+
+void
+init0 (double *matrix, double *matrix1, int row, int column)
+{
+  for (int j = 0; j < column; j++)
+    for (int i = 0; i < row; i++)
+      matrix[j * row + i] = matrix1[j * row + i] = 0;
+}
+
+
+void
+print (const char *name, const double *matrix, int row, int column)
+{
+  printf ("Matrix %s has %d rows and %d columns:\n", name, row, column);
+  for (int i = 0; i < row; i++)
+    {
+      for (int j = 0; j < column; j++)
+	{
+	  printf ("%f ", matrix[j * row + i]);
+	}
+      printf ("\n");
+    }
+  printf ("\n");
+}
+
+int
+main (int argc, char *argv[])
+{
+  int rowsA, colsB, common;
+  int i, j, k;
+  int ret = 0;
+
+  for (int t = 16; t <= 128; t += 16)
+    {
+      for (int t1 = 4; t1 <= 16; t1 += 4)
+	{
+	  rowsA = t;
+	  colsB = t1;
+	  common = 1;
+	  /* printf ("Running test for rows = %d,cols = %d\n", t, t1); */
+	  double A[rowsA * common];
+	  double B[common * colsB];
+	  double C[rowsA * colsB];
+	  double D[rowsA * colsB];
+
+
+	  init (A, rowsA, common);
+	  init (B, common, colsB);
+	  init0 (C, D, rowsA, colsB);
+	  DM (rowsA, colsB, common, A, B, C);
+
+	  for (i = 0; i < colsB; i++)
+	    {
+	      for (j = 0; j < rowsA; j++)
+		{
+		  D[i * rowsA + j] = 0;
+		  for (k = 0; k < common; k++)
+		    {
+		      D[i * rowsA + j] +=
+			A[k * rowsA + j] * B[k + common * i];
+		    }
+		}
+	    }
+	  for (i = 0; i < colsB; i++)
+	    {
+	      for (j = 0; j < rowsA; j++)
+		{
+		  for (k = 0; k < common; k++)
+		    {
+		      if (D[i * rowsA + j] != C[i * rowsA + j])
+			{
+			  printf ("Error %d,%d,%d\n",i,j,k);
+			  ret++;
+			}
+		    }
+		}
+	    }
+	  if (ret)
+	    {
+	      print ("A", A, rowsA, common);
+	      print ("B", B, common, colsB);
+	      print ("C", C, rowsA, colsB);
+	      print ("D", D, rowsA, colsB);
+	    }
+	}
+    }
+  
+#ifdef VERBOSE
+  if (ret)
+    printf ("DM double test fail: %d errors\n",ret);
+  else
+    printf ("DM double test success: 0 DM errors\n");
+#else
+  if (ret)
+    abort();
+#endif
+      
+  return ret;
+}
+
+/* { dg-final { scan-assembler {\mdmsetdmrz\M}      } } */
+/* { dg-final { scan-assembler {\mdmxvf64gerpp\M}   } } */
+/* { dg-final { scan-assembler {\mdmxxextfdmr512\M} } } */
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 227e3004077..9586ed3ae47 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -6581,6 +6581,25 @@ proc check_effective_target_power10_ok { } {
     }
 }
 
+# Return 1 if this is a PowerPC target supporting -mcpu=future or -mdense-math
+# which enables the dense math operations.
+proc check_effective_target_powerpc_dense_math_ok { } {
+	return [check_no_compiler_messages_nocache powerpc_dense_math_ok assembly {
+		__vector_quad vq;
+		void test (void)
+		{
+		#ifndef __PPC_DMR__
+		#error "target does not have dense math support."
+		#else
+		/* Make sure we have dense math support.  */
+		  __vector_quad dmr;
+		  __asm__ ("dmsetaccz %A0" : "=wD" (dmr));
+		  vq = dmr;
+		#endif
+		}
+	} "-mcpu=future"]
+}
+
 # Return 1 if this is a PowerPC target supporting -mfloat128 via either
 # software emulation on power7/power8 systems or hardware support on power9.
 
-- 
2.39.1


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* [PATCH 6/8] PowerPC: Add support for 1,024 bit DMR registers.
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
                   ` (4 preceding siblings ...)
  2023-02-03 21:29 ` [PATCH 4/8] PowerPC: Switch to dense math names for all MMA operations Michael Meissner
@ 2023-02-03 21:33 ` Michael Meissner
  2023-02-03 21:36 ` [PATCH 7/8] Support load/store vector with right length Michael Meissner
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:33 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

This is a preliminary patch to add the full 1,024 bit dense math registers
(DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto the top of the
DMR register.

This patch only adds the new 1,024 bit register support.  It does not add
support for any instructions that need 1,024 bit registers instead of 512 bit
registers.

I used the new mode 'TDOmode' as the opaque mode for 1,024 bit
registers.  The 'wD' constraint added in previous patches is used for these
registers.  I added support to do load and store of DMRs via the VSX registers,
since there are no load/store dense math instructions.  I added the new keyword
'__dmr' to create 1,024 bit types that can be loaded into DMRs.  At present, I
don't have aliases for __dmr512 and __dmr1024 that we've discussed internally.

The patches have been tested on the following platforms.  I added the patches
for PR target/107299 that I submitted on November 2nd before doing the builds so
that GCC would build on systems using IEEE 128-bit long double.
    *	https://gcc.gnu.org/pipermail/gcc-patches/2022-November/604834.html

Note this patch requires the patch posted on February 2nd, 2023 to bump up the
precision size to 16 bits.  To get this into GCC 13, I will have to revise this
patch.

| Date: Thu, 2 Feb 2023 12:38:30 -0500
| Subject: [PATCH] Bump up precision size to 16 bits.
| Message-ID: <Y9v1FvWk30MUvi4Z@toto.the-meissners.org>
| https://gcc.gnu.org/pipermail/gcc-patches/2023-February/611198.html

There were no regressions when doing bootstrap builds and running the regression
tests, provided the above patch for the precision size has been installed:

    1)	Power10 LE using --with-cpu=power10 --with-long-double-format=ieee;
    2)	Power10 LE using --with-cpu=power10 --with-long-double-format=ibm;
    3)	Power9 LE using --with-cpu=power9 --with-long-double-format=ibm; and
    4)	Power8 BE using --with-cpu=power8 (both 32-bit & 64-bit tested).

Note, I will be on vacation from Tuesday February 7th through Tuesday February
14th.

Can I check this patch into the GCC 13 master branch?

2023-02-03   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/mma.md (UNSPEC_DM_INSERT512_UPPER): New unspec.
	(UNSPEC_DM_INSERT512_LOWER): Likewise.
	(UNSPEC_DM_EXTRACT512): Likewise.
	(UNSPEC_DMR_RELOAD_FROM_MEMORY): Likewise.
	(UNSPEC_DMR_RELOAD_TO_MEMORY): Likewise.
	(movtdo): New define_expand and define_insn_and_split to implement 1,024
	bit DMR registers.
	(movtdo_insert512_upper): New insn.
	(movtdo_insert512_lower): Likewise.
	(movtdo_extract512): Likewise.
	(reload_dmr_from_memory): Likewise.
	(reload_dmr_to_memory): Likewise.
	* config/rs6000/rs6000-builtin.cc (rs6000_type_string): Add DMR
	support.
	(rs6000_init_builtins): Add support for __dmr keyword.
	* config/rs6000/rs6000-call.cc (rs6000_return_in_memory): Add support
	for TDOmode.
	(rs6000_function_arg): Likewise.
	* config/rs6000/rs6000-modes.def (TDOmode): New mode.
	* config/rs6000/rs6000.cc (rs6000_hard_regno_nregs_internal): Add
	support for TDOmode.
	(rs6000_hard_regno_mode_ok_uncached): Likewise.
	(rs6000_hard_regno_mode_ok): Likewise.
	(rs6000_modes_tieable_p): Likewise.
	(rs6000_debug_reg_global): Likewise.
	(rs6000_setup_reg_addr_masks): Likewise.
	(rs6000_init_hard_regno_mode_ok): Add support for TDOmode.  Setup reload
	hooks for DMR mode.
	(reg_offset_addressing_ok_p): Add support for TDOmode.
	(rs6000_emit_move): Likewise.
	(rs6000_secondary_reload_simple_move): Likewise.
	(rs6000_secondary_reload_class): Likewise.
	(rs6000_mangle_type): Add mangling for __dmr type.
	(rs6000_dmr_register_move_cost): Add support for TDOmode.
	(rs6000_split_multireg_move): Likewise.
	(rs6000_invalid_conversion): Likewise.
	* config/rs6000/rs6000.h (VECTOR_ALIGNMENT_P): Add TDOmode.
	(enum rs6000_builtin_type_index): Add DMR type nodes.
	(dmr_type_node): Likewise.
	(ptr_dmr_type_node): Likewise.

gcc/testsuite/

	* gcc.target/powerpc/dm-1024bit.c: New test.
---
 gcc/config/rs6000/mma.md                      | 152 ++++++++++++++++++
 gcc/config/rs6000/rs6000-builtin.cc           |  13 ++
 gcc/config/rs6000/rs6000-call.cc              |  13 +-
 gcc/config/rs6000/rs6000-modes.def            |   4 +
 gcc/config/rs6000/rs6000.cc                   | 125 ++++++++++----
 gcc/config/rs6000/rs6000.h                    |   7 +-
 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c |  63 ++++++++
 7 files changed, 345 insertions(+), 32 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c

diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index 411e2345291..0233c7b304a 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -92,6 +92,11 @@ (define_c_enum "unspec"
    UNSPEC_MMA_XXMFACC
    UNSPEC_MMA_XXMTACC
    UNSPEC_DM_ASSEMBLE_ACC
+   UNSPEC_DM_INSERT512_UPPER
+   UNSPEC_DM_INSERT512_LOWER
+   UNSPEC_DM_EXTRACT512
+   UNSPEC_DMR_RELOAD_FROM_MEMORY
+   UNSPEC_DMR_RELOAD_TO_MEMORY
   ])
 
 (define_c_enum "unspecv"
@@ -872,3 +877,150 @@ (define_insn "mma_<avvi4i4i4>"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
+
+\f
+;; TDOmode (i.e. __dmr).
+(define_expand "movtdo"
+  [(set (match_operand:TDO 0 "nonimmediate_operand")
+	(match_operand:TDO 1 "input_operand"))]
+  "TARGET_DENSE_MATH"
+{
+  rs6000_emit_move (operands[0], operands[1], TDOmode);
+  DONE;
+})
+
+(define_insn_and_split "*movtdo"
+  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
+	(match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
+  "TARGET_DENSE_MATH
+   && (gpc_reg_operand (operands[0], TDOmode)
+       || gpc_reg_operand (operands[1], TDOmode))"
+  "@
+   #
+   #
+   #
+   #
+   dmmr %0,%1
+   #"
+  "&& reload_completed
+   && (!dmr_operand (operands[0], TDOmode) || !dmr_operand (operands[1], TDOmode))"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  rtx op1 = operands[1];
+
+  if (REG_P (op0) && REG_P (op1))
+    {
+      int regno0 = REGNO (op0);
+      int regno1 = REGNO (op1);
+
+      if (DMR_REGNO_P (regno0) && VSX_REGNO_P (regno1))
+	{
+	  rtx op1_upper = gen_rtx_REG (XOmode, regno1);
+	  rtx op1_lower = gen_rtx_REG (XOmode, regno1 + 4);
+	  emit_insn (gen_movtdo_insert512_upper (op0, op1_upper));
+	  emit_insn (gen_movtdo_insert512_lower (op0, op0, op1_lower));
+	  DONE;
+	}
+
+      else if (VSX_REGNO_P (regno0) && DMR_REGNO_P (regno1))
+	{
+	  rtx op0_upper = gen_rtx_REG (XOmode, regno0);
+	  rtx op0_lower = gen_rtx_REG (XOmode, regno0 + 4);
+	  emit_insn (gen_movtdo_extract512 (op0_upper, op1, const0_rtx));
+	  emit_insn (gen_movtdo_extract512 (op0_lower, op1, const1_rtx));
+	  DONE;
+	}
+    }
+
+  rs6000_split_multireg_move (operands[0], operands[1]);
+  DONE;
+}
+  [(set_attr "type" "vecload,vecstore,vecmove,vecmove,vecmove,vecmove")
+   (set_attr "length" "*,*,32,8,*,8")
+   (set_attr "max_prefixed_insns" "4,4,*,*,*,*")])
+
+;; Move from VSX registers to DMR registers via two insert 512 bit
+;; instructions.
+(define_insn "movtdo_insert512_upper"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:XO 1 "vsx_register_operand" "wa")]
+		    UNSPEC_DM_INSERT512_UPPER))]
+  "TARGET_DENSE_MATH"
+  "dmxxinstdmr512 %0,%1,%Y1,0"
+  [(set_attr "type" "mma")])
+
+(define_insn "movtdo_insert512_lower"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "0")
+		     (match_operand:XO 2 "vsx_register_operand" "wa")]
+		    UNSPEC_DM_INSERT512_LOWER))]
+  "TARGET_DENSE_MATH"
+  "dmxxinstdmr512 %0,%2,%Y2,1"
+  [(set_attr "type" "mma")])
+
+;; Move from DMR registers to VSX registers via two extract 512 bit
+;; instructions.
+(define_insn "movtdo_extract512"
+  [(set (match_operand:XO 0 "vsx_register_operand" "=wa")
+	(unspec:XO [(match_operand:TDO 1 "dmr_operand" "wD")
+		    (match_operand 2 "const_0_to_1_operand" "n")]
+		   UNSPEC_DM_EXTRACT512))]
+  "TARGET_DENSE_MATH"
+  "dmxxextfdmr512 %0,%Y0,%1,%2"
+  [(set_attr "type" "mma")])
+
+;; Reload DMR registers from memory
+(define_insn_and_split "reload_dmr_from_memory"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:TDO 1 "memory_operand" "m")]
+		    UNSPEC_DMR_RELOAD_FROM_MEMORY))
+   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
+  "TARGET_DENSE_MATH"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx dest = operands[0];
+  rtx src = operands[1];
+  rtx tmp = operands[2];
+  rtx mem_upper = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
+  rtx mem_lower = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);
+
+  emit_move_insn (tmp, mem_upper);
+  emit_insn (gen_movtdo_insert512_upper (dest, tmp));
+
+  emit_move_insn (tmp, mem_lower);
+  emit_insn (gen_movtdo_insert512_lower (dest, dest, tmp));
+  DONE;
+}
+  [(set_attr "length" "16")
+   (set_attr "max_prefixed_insns" "2")
+   (set_attr "type" "vecload")])
+
+;; Reload dense math registers to memory
+(define_insn_and_split "reload_dmr_to_memory"
+  [(set (match_operand:TDO 0 "memory_operand" "=m")
+	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "wD")]
+		    UNSPEC_DMR_RELOAD_TO_MEMORY))
+   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
+  "TARGET_DENSE_MATH"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx dest = operands[0];
+  rtx src = operands[1];
+  rtx tmp = operands[2];
+  rtx mem_upper = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
+  rtx mem_lower = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);
+
+  emit_insn (gen_movtdo_extract512 (tmp, src, const0_rtx));
+  emit_move_insn (mem_upper, tmp);
+
+  emit_insn (gen_movtdo_extract512 (tmp, src, const1_rtx));
+  emit_move_insn (mem_lower, tmp);
+  DONE;
+}
+  [(set_attr "length" "16")
+   (set_attr "max_prefixed_insns" "2")])
diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
index 737a5c42bfb..d971cf90e51 100644
--- a/gcc/config/rs6000/rs6000-builtin.cc
+++ b/gcc/config/rs6000/rs6000-builtin.cc
@@ -495,6 +495,8 @@ const char *rs6000_type_string (tree type_node)
     return "__vector_pair";
   else if (type_node == vector_quad_type_node)
     return "__vector_quad";
+  else if (type_node == dmr_type_node)
+    return "__dmr";
 
   return "unknown";
 }
@@ -781,6 +783,17 @@ rs6000_init_builtins (void)
   t = build_qualified_type (vector_quad_type_node, TYPE_QUAL_CONST);
   ptr_vector_quad_type_node = build_pointer_type (t);
 
+  dmr_type_node = make_node (OPAQUE_TYPE);
+  SET_TYPE_MODE (dmr_type_node, TDOmode);
+  TYPE_SIZE (dmr_type_node) = bitsize_int (GET_MODE_BITSIZE (TDOmode));
+  TYPE_PRECISION (dmr_type_node) = GET_MODE_BITSIZE (TDOmode);
+  TYPE_SIZE_UNIT (dmr_type_node) = size_int (GET_MODE_SIZE (TDOmode));
+  SET_TYPE_ALIGN (dmr_type_node, 512);
+  TYPE_USER_ALIGN (dmr_type_node) = 0;
+  lang_hooks.types.register_builtin_type (dmr_type_node, "__dmr");
+  t = build_qualified_type (dmr_type_node, TYPE_QUAL_CONST);
+  ptr_dmr_type_node = build_pointer_type (t);
+
   tdecl = add_builtin_type ("__bool char", bool_char_type_node);
   TYPE_NAME (bool_char_type_node) = tdecl;
 
diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
index 214613e083e..dcf5b470766 100644
--- a/gcc/config/rs6000/rs6000-call.cc
+++ b/gcc/config/rs6000/rs6000-call.cc
@@ -437,7 +437,8 @@ rs6000_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
   if (cfun
       && !cfun->machine->mma_return_type_error
       && TREE_TYPE (cfun->decl) == fntype
-      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode))
+      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode
+	  || TYPE_MODE (type) == TDOmode))
     {
       /* Record we have now handled function CFUN, so the next time we
 	 are called, we do not re-report the same error.  */
@@ -1641,6 +1642,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
       return NULL_RTX;
     }
 
+  if (mode == TDOmode)
+    {
+      if (TYPE_CANONICAL (type) != NULL_TREE)
+	type = TYPE_CANONICAL (type);
+      error ("invalid use of dense math operand of type %qs as a function "
+	     "parameter",
+	     IDENTIFIER_POINTER (DECL_NAME (TYPE_NAME (type))));
+      return NULL_RTX;
+    }
+
   /* Return a marker to indicate whether CR1 needs to set or clear the
      bit that V.4 uses to say fp args were passed in registers.
      Assume that we don't need the marker for software floating point,
diff --git a/gcc/config/rs6000/rs6000-modes.def b/gcc/config/rs6000/rs6000-modes.def
index 73dfde5c6e7..d36bde9d2a0 100644
--- a/gcc/config/rs6000/rs6000-modes.def
+++ b/gcc/config/rs6000/rs6000-modes.def
@@ -86,3 +86,7 @@ PARTIAL_INT_MODE (TI, 128, PTI);
 /* Modes used by __vector_pair and __vector_quad.  */
 OPAQUE_MODE (OO, 32);
 OPAQUE_MODE (XO, 64);
+
+/* Modes used by __dmr.  */
+OPAQUE_MODE (TDO, 128);
+
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index c8f05f6f2d7..58ee643260f 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1841,7 +1841,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
      128-bit floating point that can go in vector registers, which has VSX
      memory addressing.  */
   if (FP_REGNO_P (regno))
-    reg_size = (VECTOR_MEM_VSX_P (mode) || VECTOR_ALIGNMENT_P (mode)
+    reg_size = (VECTOR_MEM_VSX_P (mode)
+		|| VECTOR_ALIGNMENT_P (mode)
+		|| mode == TDOmode
 		? UNITS_PER_VSX_WORD
 		: UNITS_PER_FP_WORD);
 
@@ -1875,9 +1877,9 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
   /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
      by 4.
 
-     If dense math is enabled, allow all VSX registers plus the DMR registers.
-     We need to make sure we don't cross between the boundary of FPRs and
-     traditional Altiviec registers.  */
+     If dense math is enabled, allow all VSX registers plus the dense math
+     registers.  We need to make sure we don't cross between the boundary of
+     FPRs and traditional Altivec registers.  */
   if (mode == XOmode)
     {
       if (TARGET_MMA && !TARGET_DENSE_MATH)
@@ -1899,7 +1901,27 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
 	return 0;
     }
 
-  /* No other types other than XOmode can go in DMRs.  */
+  /* Dense math register modes need DMR registers or VSX registers divisible by
+     2.  We need to make sure we don't cross between the boundary of FPRs and
+     traditional Altivec registers.  */
+  if (mode == TDOmode)
+    {
+      if (!TARGET_DENSE_MATH)
+	return 0;
+
+      if (DMR_REGNO_P (regno))
+	return 1;
+
+      if (FP_REGNO_P (regno))
+	return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 7);
+
+      if (ALTIVEC_REGNO_P (regno))
+	return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 7);
+
+      return 0;
+    }
+
+  /* No other types other than XOmode or TDOmode can go in DMRs.  */
   if (DMR_REGNO_P (regno))
     return 0;
 
@@ -2007,9 +2029,11 @@ rs6000_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
    GPR registers, and TImode can go in any GPR as well as VSX registers (PR
    57744).
 
-   Similarly, don't allow OOmode (vector pair, restricted to even VSX
-   registers) or XOmode (vector quad, restricted to FPR registers divisible
-   by 4) to tie with other modes.
+   Similarly, don't allow OOmode (vector pair), XOmode (vector quad), or
+   TDOmode (dmr register) to pair with anything else.  Vector pairs are
+   restricted to even/odd VSX registers.  Without dense math, vector quads are
+   limited to FPR registers divisible by 4.  With dense math, vector quads are
+   limited to even VSX registers or DMR registers.
 
    Altivec/VSX vector tests were moved ahead of scalar float mode, so that IEEE
    128-bit floating point on VSX systems ties with other vectors.  */
@@ -2018,7 +2042,8 @@ static bool
 rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2)
 {
   if (mode1 == PTImode || mode1 == OOmode || mode1 == XOmode
-      || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode)
+      || mode1 == TDOmode || mode2 == PTImode || mode2 == OOmode
+      || mode2 == XOmode || mode2 == TDOmode)
     return mode1 == mode2;
 
   if (ALTIVEC_OR_VSX_VECTOR_MODE (mode1))
@@ -2309,6 +2334,7 @@ rs6000_debug_reg_global (void)
     V4DFmode,
     OOmode,
     XOmode,
+    TDOmode,
     CCmode,
     CCUNSmode,
     CCEQmode,
@@ -2674,7 +2700,7 @@ rs6000_setup_reg_addr_masks (void)
 	  /* Special case DMR registers.  */
 	  if (rc == RELOAD_REG_DMR)
 	    {
-	      if (TARGET_DENSE_MATH && m2 == XOmode)
+	      if (TARGET_DENSE_MATH && (m2 == XOmode || m2 == TDOmode))
 		{
 		  addr_mask = RELOAD_REG_VALID;
 		  reg_addr[m].addr_mask[rc] = addr_mask;
@@ -2784,7 +2810,7 @@ rs6000_setup_reg_addr_masks (void)
 	     since it will be broken into two vector moves.  Vector quads and
 	     1,024 bit DMR values can only do offset loads.  */
 	  else if ((addr_mask != 0) && TARGET_MMA
-		   && (m2 == OOmode || m2 == XOmode))
+		   && (m2 == OOmode || m2 == XOmode || m2 == TDOmode))
 	    {
 	      addr_mask |= RELOAD_REG_OFFSET;
 	      if (rc == RELOAD_REG_FPR || rc == RELOAD_REG_VMX)
@@ -3012,6 +3038,14 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
       rs6000_vector_align[XOmode] = 512;
     }
 
+  /* Add support for 1,024 bit DMR registers.  */
+  if (TARGET_DENSE_MATH)
+    {
+      rs6000_vector_unit[TDOmode] = VECTOR_NONE;
+      rs6000_vector_mem[TDOmode] = VECTOR_VSX;
+      rs6000_vector_align[TDOmode] = 512;
+    }
+
   /* Register class constraints for the constraints that depend on compile
      switches. When the VSX code was added, different constraints were added
      based on the type (DFmode, V2DFmode, V4SFmode).  For the vector types, all
@@ -3225,6 +3259,12 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
 	}
     }
 
+  if (TARGET_DENSE_MATH)
+    {
+      reg_addr[TDOmode].reload_load = CODE_FOR_reload_dmr_from_memory;
+      reg_addr[TDOmode].reload_store = CODE_FOR_reload_dmr_to_memory;
+    }
+
   /* Precalculate HARD_REGNO_NREGS.  */
   for (r = 0; HARD_REGISTER_NUM_P (r); ++r)
     for (m = 0; m < NUM_MACHINE_MODES; ++m)
@@ -8722,12 +8762,15 @@ reg_offset_addressing_ok_p (machine_mode mode)
 	return mode_supports_dq_form (mode);
       break;
 
-      /* The vector pair/quad types support offset addressing if the
-	 underlying vectors support offset addressing.  */
+      /* The vector pair/quad types and the dense math types support offset
+	 addressing if the underlying vectors support offset addressing.  */
     case E_OOmode:
     case E_XOmode:
       return TARGET_MMA;
 
+    case E_TDOmode:
+      return TARGET_DENSE_MATH;
+
     case E_SDmode:
       /* If we can do direct load/stores of SDmode, restrict it to reg+reg
 	 addressing for the LFIWZX and STFIWX instructions.  */
@@ -11009,6 +11052,12 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
 	       (mode == OOmode) ? "__vector_pair" : "__vector_quad");
       break;
 
+    case E_TDOmode:
+      if (CONST_INT_P (operands[1]))
+	error ("%qs is an opaque type, and you cannot set it to constants",
+	       "__dmr");
+      break;
+
     case E_SImode:
     case E_DImode:
       /* Use default pattern for address of ELF small data */
@@ -12463,7 +12512,7 @@ rs6000_secondary_reload_simple_move (enum rs6000_reg_type to_type,
 
   /* We can transfer between VSX registers and DMR registers without needing
      extra registers.  */
-  if (TARGET_DENSE_MATH && mode == XOmode
+  if (TARGET_DENSE_MATH && (mode == XOmode || mode == TDOmode)
       && ((to_type == DMR_REG_TYPE && from_type == VSX_REG_TYPE)
 	  || (to_type == VSX_REG_TYPE && from_type == DMR_REG_TYPE)))
     return true;
@@ -13264,6 +13313,9 @@ rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
       if (mode == XOmode)
 	return TARGET_DENSE_MATH ? VSX_REGS : FLOAT_REGS;
 
+      if (mode == TDOmode)
+	return VSX_REGS;
+
       if (GET_MODE_CLASS (mode) == MODE_INT)
 	return GENERAL_REGS;
     }
@@ -13387,8 +13439,9 @@ rs6000_secondary_reload_class (enum reg_class rclass, machine_mode mode,
   else
     regno = -1;
 
-  /* DMR registers don't have loads or stores.  We have to go through the VSX
-     registers to load XOmode (vector quad).  */
+  /* Dense math registers don't have loads or stores.  We have to go through
+     the VSX registers to load XOmode (vector quad) and TDOmode (dmr 1024
+     bit).  */
   if (TARGET_DENSE_MATH && rclass == DM_REGS)
     return VSX_REGS;
 
@@ -20471,6 +20524,8 @@ rs6000_mangle_type (const_tree type)
     return "u13__vector_pair";
   if (type == vector_quad_type_node)
     return "u13__vector_quad";
+  if (type == dmr_type_node)
+    return "u5__dmr";
 
   /* For all other types, use the default mangling.  */
   return NULL;
@@ -22594,6 +22649,10 @@ rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
       if (mode == XOmode)
 	return reg_move_base;
 
+      /* __dmr (i.e. TDOmode) is transferred in 2 instructions.  */
+      else if (mode == TDOmode)
+	return reg_move_base * 2;
+
       else
 	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
     }
@@ -27288,9 +27347,10 @@ rs6000_split_multireg_move (rtx dst, rtx src)
   mode = GET_MODE (dst);
   nregs = hard_regno_nregs (reg, mode);
 
-  /* If we have a vector quad register for MMA, and this is a load or store,
-     see if we can use vector paired load/stores.  */
-  if (mode == XOmode && TARGET_MMA
+  /* If we have a vector quad register for MMA or DMR register for dense math,
+     and this is a load or store, see if we can use vector paired
+     load/stores.  */
+  if ((mode == XOmode || mode == TDOmode) && TARGET_MMA
       && (MEM_P (dst) || MEM_P (src)))
     {
       reg_mode = OOmode;
@@ -27298,7 +27358,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
     }
   /* If we have a vector pair/quad mode, split it into two/four separate
      vectors.  */
-  else if (mode == OOmode || mode == XOmode)
+  else if (mode == OOmode || mode == XOmode || mode == TDOmode)
     reg_mode = V1TImode;
   else if (FP_REGNO_P (reg))
     reg_mode = DECIMAL_FLOAT_MODE_P (mode) ? DDmode :
@@ -27344,13 +27404,13 @@ rs6000_split_multireg_move (rtx dst, rtx src)
       return;
     }
 
-  /* The __vector_pair and __vector_quad modes are multi-register
-     modes, so if we have to load or store the registers, we have to be
-     careful to properly swap them if we're in little endian mode
-     below.  This means the last register gets the first memory
-     location.  We also need to be careful of using the right register
-     numbers if we are splitting XO to OO.  */
-  if (mode == OOmode || mode == XOmode)
+  /* The __vector_pair, __vector_quad, and __dmr modes are multi-register
+     modes, so if we have to load or store the registers, we have to be careful
+     to properly swap them if we're in little endian mode below.  This means
+     the last register gets the first memory location.  We also need to be
+     careful of using the right register numbers if we are splitting XO to
+     OO.  */
+  if (mode == OOmode || mode == XOmode || mode == TDOmode)
     {
       nregs = hard_regno_nregs (reg, mode);
       int reg_mode_nregs = hard_regno_nregs (reg, reg_mode);
@@ -27487,7 +27547,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 	 overlap.  */
       int i;
       /* XO/OO are opaque so cannot use subregs. */
-      if (mode == OOmode || mode == XOmode )
+      if (mode == OOmode || mode == XOmode || mode == TDOmode)
 	{
 	  for (i = nregs - 1; i >= 0; i--)
 	    {
@@ -27661,7 +27721,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
 	    continue;
 
 	  /* XO/OO are opaque so cannot use subregs. */
-	  if (mode == OOmode || mode == XOmode )
+	  if (mode == OOmode || mode == XOmode || mode == TDOmode)
 	    {
 	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
 	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
@@ -28641,7 +28701,8 @@ rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
 
   if (frommode != tomode)
     {
-      /* Do not allow conversions to/from XOmode and OOmode types.  */
+      /* Do not allow conversions to/from XOmode, OOmode, and TDOmode
+	 types.  */
       if (frommode == XOmode)
 	return N_("invalid conversion from type %<__vector_quad%>");
       if (tomode == XOmode)
@@ -28650,6 +28711,10 @@ rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
 	return N_("invalid conversion from type %<__vector_pair%>");
       if (tomode == OOmode)
 	return N_("invalid conversion to type %<__vector_pair%>");
+      if (frommode == TDOmode)
+	return N_("invalid conversion from type %<__dmr%>");
+      if (tomode == TDOmode)
+	return N_("invalid conversion to type %<__dmr%>");
     }
 
   /* Conversion allowed.  */
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index c034b9ed179..823de897603 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -1006,7 +1006,8 @@ enum data_align { align_abi, align_opt, align_both };
 /* Modes that are not vectors, but require vector alignment.  Treat these like
    vectors in terms of loads and stores.  */
 #define VECTOR_ALIGNMENT_P(MODE)					\
-  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode)
+  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode	\
+   || (MODE) == TDOmode)
 
 #define ALTIVEC_VECTOR_MODE(MODE)					\
   ((MODE) == V16QImode							\
@@ -2292,6 +2293,7 @@ enum rs6000_builtin_type_index
   RS6000_BTI_const_str,		 /* pointer to const char * */
   RS6000_BTI_vector_pair,	 /* unsigned 256-bit types (vector pair).  */
   RS6000_BTI_vector_quad,	 /* unsigned 512-bit types (vector quad).  */
+  RS6000_BTI_dmr,		 /* unsigned 1,024-bit types (dmr).  */
   RS6000_BTI_const_ptr_void,     /* const pointer to void */
   RS6000_BTI_ptr_V16QI,
   RS6000_BTI_ptr_V1TI,
@@ -2330,6 +2332,7 @@ enum rs6000_builtin_type_index
   RS6000_BTI_ptr_dfloat128,
   RS6000_BTI_ptr_vector_pair,
   RS6000_BTI_ptr_vector_quad,
+  RS6000_BTI_ptr_dmr,
   RS6000_BTI_ptr_long_long,
   RS6000_BTI_ptr_long_long_unsigned,
   RS6000_BTI_MAX
@@ -2387,6 +2390,7 @@ enum rs6000_builtin_type_index
 #define const_str_type_node		 (rs6000_builtin_types[RS6000_BTI_const_str])
 #define vector_pair_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_pair])
 #define vector_quad_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_quad])
+#define dmr_type_node			 (rs6000_builtin_types[RS6000_BTI_dmr])
 #define pcvoid_type_node		 (rs6000_builtin_types[RS6000_BTI_const_ptr_void])
 #define ptr_V16QI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V16QI])
 #define ptr_V1TI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V1TI])
@@ -2425,6 +2429,7 @@ enum rs6000_builtin_type_index
 #define ptr_dfloat128_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dfloat128])
 #define ptr_vector_pair_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_pair])
 #define ptr_vector_quad_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_quad])
+#define ptr_dmr_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dmr])
 #define ptr_long_long_integer_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_long_long])
 #define ptr_long_long_unsigned_type_node (rs6000_builtin_types[RS6000_BTI_ptr_long_long_unsigned])
 
diff --git a/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
new file mode 100644
index 00000000000..0a9884ddf63
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_dense_math_ok } */
+/* { dg-options "-mdejagnu-cpu=future -O2" } */
+
+/* Test basic load/store for __dmr type.  */
+
+#ifndef CONSTRAINT
+#if defined(USE_D)
+#define CONSTRAINT "d"
+
+#elif defined(USE_V)
+#define CONSTRAINT "v"
+
+#elif defined(USE_WA)
+#define CONSTRAINT "wa"
+
+#else
+#define CONSTRAINT "wD"
+#endif
+#endif
+const char constraint[] = CONSTRAINT;
+
+void foo_mem_asm (__dmr *p, __dmr *q)
+{
+  /* 2 LXVP instructions.  */
+  __dmr vq = *p;
+
+  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
+  __asm__ ("# foo (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
+  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
+
+  /* 2 STXVP instructions.  */
+  *q = vq;
+}
+
+void foo_mem_asm2 (__dmr *p, __dmr *q)
+{
+  /* 2 LXVP instructions.  */
+  __dmr vq = *p;
+  __dmr vq2;
+  __dmr vq3;
+
+  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
+  __asm__ ("# foo1 (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
+  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
+
+  vq2 = vq;
+  __asm__ ("# foo2 (wa) %0" : "+wa" (vq2));
+
+  /* 2 STXVP instructions.  */
+  *q = vq2;
+}
+
+void foo_mem (__dmr *p, __dmr *q)
+{
+  /* 2 LXVP, 2 STXVP instructions, no DMR transfer.  */
+  *q = *p;
+}
+
+/* { dg-final { scan-assembler-times {\mdmxxextfdmr512\M}  4 } } */
+/* { dg-final { scan-assembler-times {\mdmxxinstdmr512\M}  4 } } */
+/* { dg-final { scan-assembler-times {\mlxvp\M}           12 } } */
+/* { dg-final { scan-assembler-times {\mstxvp\M}          12 } } */
-- 
2.39.1


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com

* [PATCH 7/8] Support load/store vector with right length.
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
                   ` (5 preceding siblings ...)
  2023-02-03 21:33 ` [PATCH 6/8] PowerPC: Add support for 1,024 bit DMR registers Michael Meissner
@ 2023-02-03 21:36 ` Michael Meissner
  2023-02-03 21:37 ` [PATCH 8/8] Add saturating subtract built-ins Michael Meissner
  2023-02-06  7:25 ` [PATCH 0/8] PowerPC future support for Dense Math Richard Biener
  8 siblings, 0 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:36 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

This patch adds support for new instructions that may be added to the PowerPC
architecture in the future to enhance the load and store vector with length
instructions.

The current instructions (lxvl, lxvll, stxvl, and stxvll) are inconvenient to use
since the count for the number of bytes must be in the top 8 bits of the GPR
register, instead of the bottom 8 bits.  This meant that code generating these
instructions typically had to do a shift left by 56 bits to get the count into
the right position.  In a future version of the PowerPC architecture, new
variants of these instructions might be added that expect the count to be in
the bottom 8 bits of the GPR register.  These patches add this support to GCC
if the user uses the -mcpu=future option.
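
To make the difference between the two conventions concrete, here is a small C
sketch.  The helper names are hypothetical and only model where the byte count
sits in the 64-bit GPR; they do not emit or emulate the actual instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Existing lxvl/stxvl convention: the byte count must be placed in the
   top 8 bits of the 64-bit GPR, so the caller shifts left by 56.  */
static uint64_t
length_for_lxvl (uint64_t nbytes)
{
  assert (nbytes <= 0xff);	/* The count field is only 8 bits wide.  */
  return nbytes << 56;
}

/* Possible future convention: the count is read from the bottom 8 bits,
   so the value can be used as-is with no shift.  */
static uint64_t
length_for_lxvrl (uint64_t nbytes)
{
  assert (nbytes <= 0xff);
  return nbytes;
}

/* What each convention reads back out of the GPR.  */
static unsigned
count_seen_by_lxvl (uint64_t gpr)
{
  return (unsigned) (gpr >> 56);
}

static unsigned
count_seen_by_lxvrl (uint64_t gpr)
{
  return (unsigned) (gpr & 0xff);
}
```

For a 7 byte transfer, the existing convention needs the register to hold
0x0700000000000000, while the bottom-8-bits convention simply needs 7.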

I discovered that the code in rs6000-string.cc that generates lxvl/stxvl for
ISA 3.1 (or their future variants) would generate these instructions on 32-bit.
However, the patterns for these instructions are only enabled on 64-bit systems.
So I added a check for 64-bit support before generating the instructions.

I tested this patch on a little endian power10 system with long double using
the traditional IBM double double format.  Assuming the other 6 patches for
-mcpu=future are checked in (or at least the first patch), can I check this
patch into the master branch for GCC 13?

Note, I will be on vacation from Tuesday February 7th through Tuesday February
14th.

2023-02-03   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/rs6000-string.cc (expand_block_move): Do not generate
	lxvl and stxvl on 32-bit.
	* config/rs6000/vsx.md (lxvl): If -mcpu=future, generate the lxvl with
	the shift count automatically used in the insn.
	(lxvrl): New insn for -mcpu=future.
	(lxvrll): Likewise.
	(stxvl): If -mcpu=future, generate the stxvl with the shift count
	automatically used in the insn.
	(stxvrl): New insn for -mcpu=future.
	(stxvrll): Likewise.

gcc/testsuite/

	* gcc.target/powerpc/lxvrl.c: New test.
	* lib/target-supports.exp (check_effective_target_powerpc_future_ok):
	New effective target.
---
 gcc/config/rs6000/rs6000-string.cc       |   1 +
 gcc/config/rs6000/vsx.md                 | 122 +++++++++++++++++++----
 gcc/testsuite/gcc.target/powerpc/lxvrl.c |  32 ++++++
 gcc/testsuite/lib/target-supports.exp    |  16 ++-
 4 files changed, 148 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/lxvrl.c

diff --git a/gcc/config/rs6000/rs6000-string.cc b/gcc/config/rs6000/rs6000-string.cc
index 75e6f8803a5..9b2f1b83b22 100644
--- a/gcc/config/rs6000/rs6000-string.cc
+++ b/gcc/config/rs6000/rs6000-string.cc
@@ -2811,6 +2811,7 @@ expand_block_move (rtx operands[], bool might_overlap)
 	  gen_func.mov = gen_vsx_movv2di_64bit;
 	}
       else if (TARGET_BLOCK_OPS_UNALIGNED_VSX
+	       && TARGET_POWERPC64
 	       && TARGET_POWER10 && bytes < 16
 	       && orig_bytes > 16
 	       && !(bytes == 1 || bytes == 2
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 0865608f94a..1ab8dc373c0 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5582,20 +5582,32 @@ (define_expand "first_mismatch_or_eos_index_<mode>"
   DONE;
 })
 
-;; Load VSX Vector with Length
+;; Load VSX Vector with Length.  If we have lxvrl, we don't have to do an
+;; explicit shift left into a pseudo.
 (define_expand "lxvl"
-  [(set (match_dup 3)
-        (ashift:DI (match_operand:DI 2 "register_operand")
-                   (const_int 56)))
-   (set (match_operand:V16QI 0 "vsx_register_operand")
-	(unspec:V16QI
-	 [(match_operand:DI 1 "gpc_reg_operand")
-          (mem:V16QI (match_dup 1))
-	  (match_dup 3)]
-	 UNSPEC_LXVL))]
+  [(use (match_operand:V16QI 0 "vsx_register_operand"))
+   (use (match_operand:DI 1 "gpc_reg_operand"))
+   (use (match_operand:DI 2 "gpc_reg_operand"))]
   "TARGET_P9_VECTOR && TARGET_64BIT"
 {
-  operands[3] = gen_reg_rtx (DImode);
+  rtx shift_len = gen_rtx_ASHIFT (DImode, operands[2], GEN_INT (56));
+  rtx len;
+
+  if (TARGET_FUTURE)
+    len = shift_len;
+  else
+    {
+      len = gen_reg_rtx (DImode);
+      emit_insn (gen_rtx_SET (len, shift_len));
+    }
+
+  rtx dest = operands[0];
+  rtx addr = operands[1];
+  rtx mem = gen_rtx_MEM (V16QImode, addr);
+  rtvec rv = gen_rtvec (3, addr, mem, len);
+  rtx lxvl = gen_rtx_UNSPEC (V16QImode, rv, UNSPEC_LXVL);
+  emit_insn (gen_rtx_SET (dest, lxvl));
+  DONE;
 })
 
 (define_insn "*lxvl"
@@ -5619,6 +5631,34 @@ (define_insn "lxvll"
   "lxvll %x0,%1,%2"
   [(set_attr "type" "vecload")])
 
+;; For lxvrl and lxvrll, use the combiner to eliminate the shift.  The
+;; define_expand for lxvl will already incorporate the shift in generating the
+;; insn.  The lxvll built-in function required the user to have already done
+;; the shift.  Defining lxvrll this way will optimize cases where the user has
+;; done the shift immediately before the built-in.
+(define_insn "*lxvrl"
+  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
+	(unspec:V16QI
+	 [(match_operand:DI 1 "gpc_reg_operand" "b")
+	  (mem:V16QI (match_dup 1))
+	  (ashift:DI (match_operand:DI 2 "register_operand" "r")
+		     (const_int 56))]
+	 UNSPEC_LXVL))]
+  "TARGET_FUTURE && TARGET_64BIT"
+  "lxvrl %x0,%1,%2"
+  [(set_attr "type" "vecload")])
+
+(define_insn "*lxvrll"
+  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
+	(unspec:V16QI [(match_operand:DI 1 "gpc_reg_operand" "b")
+                       (mem:V16QI (match_dup 1))
+		       (ashift:DI (match_operand:DI 2 "register_operand" "r")
+				  (const_int 56))]
+		      UNSPEC_LXVLL))]
+  "TARGET_FUTURE"
+  "lxvrll %x0,%1,%2"
+  [(set_attr "type" "vecload")])
+
 ;; Expand for builtin xl_len_r
 (define_expand "xl_len_r"
   [(match_operand:V16QI 0 "vsx_register_operand")
@@ -5650,18 +5690,29 @@ (define_insn "stxvll"
 
 ;; Store VSX Vector with Length
 (define_expand "stxvl"
-  [(set (match_dup 3)
-	(ashift:DI (match_operand:DI 2 "register_operand")
-		   (const_int 56)))
-   (set (mem:V16QI (match_operand:DI 1 "gpc_reg_operand"))
-	(unspec:V16QI
-	 [(match_operand:V16QI 0 "vsx_register_operand")
-	  (mem:V16QI (match_dup 1))
-	  (match_dup 3)]
-	 UNSPEC_STXVL))]
+  [(use (match_operand:V16QI 0 "vsx_register_operand"))
+   (use (match_operand:DI 1 "gpc_reg_operand"))
+   (use (match_operand:DI 2 "gpc_reg_operand"))]
   "TARGET_P9_VECTOR && TARGET_64BIT"
 {
-  operands[3] = gen_reg_rtx (DImode);
+  rtx shift_len = gen_rtx_ASHIFT (DImode, operands[2], GEN_INT (56));
+  rtx len;
+
+  if (TARGET_FUTURE)
+    len = shift_len;
+  else
+    {
+      len = gen_reg_rtx (DImode);
+      emit_insn (gen_rtx_SET (len, shift_len));
+    }
+
+  rtx src = operands[0];
+  rtx addr = operands[1];
+  rtx mem = gen_rtx_MEM (V16QImode, addr);
+  rtvec rv = gen_rtvec (3, src, mem, len);
+  rtx stxvl = gen_rtx_UNSPEC (V16QImode, rv, UNSPEC_STXVL);
+  emit_insn (gen_rtx_SET (mem, stxvl));
+  DONE;
 })
 
 ;; Define optab for vector access with length vectorization exploitation.
@@ -5705,6 +5756,35 @@ (define_insn "*stxvl"
   "stxvl %x0,%1,%2"
   [(set_attr "type" "vecstore")])
 
+;; For stxvrl and stxvrll, use the combiner to eliminate the shift.  The
+;; define_expand for stxvl will already incorporate the shift in generating the
+;; insn.  The stxvll built-in function required the user to have already done
+;; the shift.  Defining stxvrll this way will optimize cases where the user
+;; has done the shift immediately before the built-in.
+
+(define_insn "*stxvrl"
+  [(set (mem:V16QI (match_operand:DI 1 "gpc_reg_operand" "b"))
+	(unspec:V16QI
+	 [(match_operand:V16QI 0 "vsx_register_operand" "wa")
+	  (mem:V16QI (match_dup 1))
+	  (ashift:DI (match_operand:DI 2 "register_operand" "r")
+		     (const_int 56))]
+	 UNSPEC_STXVL))]
+  "TARGET_FUTURE && TARGET_64BIT"
+  "stxvrl %x0,%1,%2"
+  [(set_attr "type" "vecstore")])
+
+(define_insn "*stxvrll"
+  [(set (mem:V16QI (match_operand:DI 1 "gpc_reg_operand" "b"))
+	(unspec:V16QI [(match_operand:V16QI 0 "vsx_register_operand" "wa")
+		       (mem:V16QI (match_dup 1))
+		       (ashift:DI (match_operand:DI 2 "register_operand" "r")
+				  (const_int 56))]
+	              UNSPEC_STXVLL))]
+  "TARGET_FUTURE"
+  "stxvrll %x0,%1,%2"
+  [(set_attr "type" "vecstore")])
+
 ;; Expand for builtin xst_len_r
 (define_expand "xst_len_r"
   [(match_operand:V16QI 0 "vsx_register_operand" "=wa")
diff --git a/gcc/testsuite/gcc.target/powerpc/lxvrl.c b/gcc/testsuite/gcc.target/powerpc/lxvrl.c
new file mode 100644
index 00000000000..71854c50c91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/lxvrl.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_future_ok } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-options "-mdejagnu-cpu=future -O2" } */
+
+/* Test whether the lxvrl and stxvrl instructions are generated for
+   -mcpu=future on memory copy operations.  */
+
+#ifndef VSIZE
+#define VSIZE 2
+#endif
+
+#ifndef LSIZE
+#define LSIZE 5
+#endif
+
+struct foo {
+  vector unsigned char vc[VSIZE];
+  unsigned char leftover[LSIZE];
+};
+
+void memcpy_ptr (struct foo *p, struct foo *q)
+{
+  __builtin_memcpy ((void *) p,		/* lxvrl and stxvrl.  */
+		    (void *) q,
+		    (sizeof (vector unsigned char) * VSIZE) + LSIZE);
+}
+
+/* { dg-final { scan-assembler     {\mlxvrl\M}  } } */
+/* { dg-final { scan-assembler     {\mstxvrl\M} } } */
+/* { dg-final { scan-assembler-not {\mlxvl\M}   } } */
+/* { dg-final { scan-assembler-not {\mstxvl\M}  } } */
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 9586ed3ae47..47adf407f83 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -6581,8 +6581,8 @@ proc check_effective_target_power10_ok { } {
     }
 }
 
-# Return 1 if this is a PowerPC target supporting -mcpu=future or -mdense-math
-# which enables the dense math operations.
+# Return 1 if this is a PowerPC target supporting -mcpu=future which enables
+# the dense math operations.
 proc check_effective_target_powerpc_dense_math_ok { } {
 	return [check_no_compiler_messages_nocache powerpc_dense_math_ok assembly {
 		__vector_quad vq;
@@ -6600,6 +6600,18 @@ proc check_effective_target_powerpc_dense_math_ok { } {
 	} "-mcpu=future"]
 }
 
+# Return 1 if this is a PowerPC target supporting -mcpu=future which enables
+# the saturating subtract instruction.
+proc check_effective_target_powerpc_future_ok { } {
+	return [check_no_compiler_messages powerpc_future_ok object {
+	    #ifndef _ARCH_PWR_FUTURE
+	    #error "not -mcpu=future"
+	    #else
+	    int dummy;
+	    #endif
+	} "-mcpu=future"]
+}
+
 # Return 1 if this is a PowerPC target supporting -mfloat128 via either
 # software emulation on power7/power8 systems or hardware support on power9.
 
-- 
2.39.1


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* [PATCH 8/8] Add saturating subtract built-ins.
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
                   ` (6 preceding siblings ...)
  2023-02-03 21:36 ` [PATCH 7/8] Support load/store vector with right length Michael Meissner
@ 2023-02-03 21:37 ` Michael Meissner
  2023-02-06  7:25 ` [PATCH 0/8] PowerPC future support for Dense Math Richard Biener
  8 siblings, 0 replies; 11+ messages in thread
From: Michael Meissner @ 2023-02-03 21:37 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

This patch adds built-in functions for a saturating subtract instruction that
may be added to a future PowerPC processor.  Note, if the instruction is added,
the names of the built-in functions may change before GCC 13 is released.  If
the names change, we will submit a patch changing them.

I also added support for providing dense math built-in functions, even though
at present, we have not added any new built-in functions for dense math.  It is
likely we will want to add new dense math built-in functions as the dense math
support is fleshed out.
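
As a rough illustration of what the new built-ins are meant to compute, the
sketch below models the 32-bit case in portable C, following the wording of
the extend.texi hunk in this patch (a negative difference becomes 0).  The
names model_saturate_subtract32 and saturate_subtract32 are hypothetical and
not part of the patch, and the clamp to INT_MAX on positive overflow is an
assumption of the sketch, since the documentation text only describes the
floor at 0.  The wrapper uses the built-in only when _ARCH_PWR_FUTURE (the
macro tested in the target-supports.exp hunk) is defined.

```c
#include <limits.h>
#include <stdint.h>

/* Portable model of the documented behaviour: a - b, but 0 when the
   difference would be negative.  The clamp to INT_MAX is an assumption
   of this sketch; the documentation only describes the floor at 0.  */
static int
model_saturate_subtract32 (int a, int b)
{
  /* Widen first so the subtraction itself cannot overflow.  */
  int64_t diff = (int64_t) a - (int64_t) b;

  if (diff < 0)
    return 0;
  if (diff > INT_MAX)
    return INT_MAX;
  return (int) diff;
}

/* Hypothetical wrapper: use the built-in when compiling with
   -mcpu=future, otherwise fall back to the portable model.  */
static int
saturate_subtract32 (int a, int b)
{
#ifdef _ARCH_PWR_FUTURE
  return __builtin_saturate_subtract32 (a, b);
#else
  return model_saturate_subtract32 (a, b);
#endif
}
```

The model is only meant as a reading aid for the documentation text; whether
the instruction behaves the same way at the upper bound would have to be
checked against the final instruction definition.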

I tested this patch on a little endian power10 system with long double using
the traditional IBM double double format.  Assuming the other 6 patches for
-mcpu=future are checked in (or at least the first patch), can I check this
patch into the master branch for GCC 13?

Note, I will be on vacation from Tuesday February 7th through Tuesday February
14th.

2023-02-03   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/rs6000-builtin.cc (rs6000_invalid_builtin): Add support
	for flagging invalid use of future built-in functions.
	(rs6000_builtin_is_supported): Add support for future built-in
	functions.
	* config/rs6000/rs6000-builtins.def (__builtin_saturate_subtract32): New
	built-in function for -mcpu=future.
	(__builtin_saturate_subtract64): Likewise.
	* config/rs6000/rs6000-gen-builtins.cc (enum bif_stanza): Add stanzas
	for -mcpu=future built-ins.
	(stanza_map): Likewise.
	(enable_string): Likewise.
	(struct attrinfo): Likewise.
	(parse_bif_attrs): Likewise.
	(write_decls): Likewise.
	* config/rs6000/rs6000.md (sat_sub<mode>3): Add saturating subtract
	built-in insn declarations.
	(sat_sub<mode>3_dot): Likewise.
	(sat_sub<mode>3_dot2): Likewise.
	* doc/extend.texi (Future PowerPC built-ins): New section.

gcc/testsuite/

	* gcc.target/powerpc/subfus-1.c: New test.
	* gcc.target/powerpc/subfus-2.c: Likewise.
---
 gcc/config/rs6000/rs6000-builtin.cc         | 17 ++++++
 gcc/config/rs6000/rs6000-builtins.def       | 11 ++++
 gcc/config/rs6000/rs6000-gen-builtins.cc    | 35 ++++++++++--
 gcc/config/rs6000/rs6000.md                 | 60 +++++++++++++++++++++
 gcc/doc/extend.texi                         | 24 +++++++++
 gcc/testsuite/gcc.target/powerpc/subfus-1.c | 32 +++++++++++
 gcc/testsuite/gcc.target/powerpc/subfus-2.c | 32 +++++++++++
 gcc/testsuite/lib/target-supports.exp       | 16 +++++-
 8 files changed, 220 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/subfus-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/subfus-2.c

diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
index d971cf90e51..b9b0b2d52d0 100644
--- a/gcc/config/rs6000/rs6000-builtin.cc
+++ b/gcc/config/rs6000/rs6000-builtin.cc
@@ -139,6 +139,17 @@ rs6000_invalid_builtin (enum rs6000_gen_builtins fncode)
     case ENB_MMA:
       error ("%qs requires the %qs option", name, "-mmma");
       break;
+    case ENB_FUTURE:
+      error ("%qs requires the %qs option", name, "-mcpu=future");
+      break;
+    case ENB_FUTURE_64:
+      error ("%qs requires the %qs option and either the %qs or %qs option",
+	     name, "-mcpu=future", "-m64", "-mpowerpc64");
+      break;
+    case ENB_DM:
+      error ("%qs requires the %qs or %qs options", name, "-mcpu=future",
+	     "-mdense-math");
+      break;
     default:
     case ENB_ALWAYS:
       gcc_unreachable ();
@@ -194,6 +205,12 @@ rs6000_builtin_is_supported (enum rs6000_gen_builtins fncode)
       return TARGET_HTM;
     case ENB_MMA:
       return TARGET_MMA;
+    case ENB_FUTURE:
+      return TARGET_FUTURE;
+    case ENB_FUTURE_64:
+      return TARGET_FUTURE && TARGET_POWERPC64;
+    case ENB_DM:
+      return TARGET_DENSE_MATH;
     default:
       gcc_unreachable ();
     }
diff --git a/gcc/config/rs6000/rs6000-builtins.def b/gcc/config/rs6000/rs6000-builtins.def
index e0d9f5adc97..8b73e994558 100644
--- a/gcc/config/rs6000/rs6000-builtins.def
+++ b/gcc/config/rs6000/rs6000-builtins.def
@@ -139,6 +139,8 @@
 ;   endian   Needs special handling for endianness
 ;   ibmld    Restrict usage to the case when TFmode is IBM-128
 ;   ibm128   Restrict usage to the case where __ibm128 is supported or if ibmld
+;   future   Restrict usage to future instructions
+;   dm       Restrict usage to dense math
 ;
 ; Each attribute corresponds to extra processing required when
 ; the built-in is expanded.  All such special processing should
@@ -4108,3 +4110,12 @@
 
   void __builtin_vsx_stxvp (v256, unsigned long, const v256 *);
     STXVP nothing {mma,pair}
+
+[future]
+  const signed int __builtin_saturate_subtract32 (signed int, signed int);
+  SAT_SUBSI sat_subsi3 {}
+
+[future-64]
+  const signed long __builtin_saturate_subtract64 (signed long, signed long);
+  SAT_SUBDI sat_subdi3 {}
+
diff --git a/gcc/config/rs6000/rs6000-gen-builtins.cc b/gcc/config/rs6000/rs6000-gen-builtins.cc
index a2f442ed90d..daf7fff079e 100644
--- a/gcc/config/rs6000/rs6000-gen-builtins.cc
+++ b/gcc/config/rs6000/rs6000-gen-builtins.cc
@@ -233,6 +233,9 @@ enum bif_stanza
  BSTZ_P10,
  BSTZ_P10_64,
  BSTZ_MMA,
+ BSTZ_FUTURE,
+ BSTZ_FUTURE_64,
+ BSTZ_DM,
  NUMBIFSTANZAS
 };
 
@@ -266,7 +269,10 @@ static stanza_entry stanza_map[NUMBIFSTANZAS] =
     { "htm",		BSTZ_HTM	},
     { "power10",	BSTZ_P10	},
     { "power10-64",	BSTZ_P10_64	},
-    { "mma",		BSTZ_MMA	}
+    { "mma",		BSTZ_MMA	},
+    { "future",		BSTZ_FUTURE	},
+    { "future-64",	BSTZ_FUTURE_64	},
+    { "dm",		BSTZ_DM		},
   };
 
 static const char *enable_string[NUMBIFSTANZAS] =
@@ -291,7 +297,10 @@ static const char *enable_string[NUMBIFSTANZAS] =
     "ENB_HTM",
     "ENB_P10",
     "ENB_P10_64",
-    "ENB_MMA"
+    "ENB_MMA",
+    "ENB_FUTURE",
+    "ENB_FUTURE_64",
+    "ENB_DM",
   };
 
 /* Function modifiers provide special handling for const, pure, and fpmath
@@ -395,6 +404,8 @@ struct attrinfo
   bool isendian;
   bool isibmld;
   bool isibm128;
+  bool isfuture;
+  bool isdm;
 };
 
 /* Fields associated with a function prototype (bif or overload).  */
@@ -1477,7 +1488,8 @@ parse_bif_attrs (attrinfo *attrptr)
 	"ldvec = %d, stvec = %d, reve = %d, pred = %d, htm = %d, "
 	"htmspr = %d, htmcr = %d, mma = %d, quad = %d, pair = %d, "
 	"mmaint = %d, no32bit = %d, 32bit = %d, cpu = %d, ldstmask = %d, "
-	"lxvrse = %d, lxvrze = %d, endian = %d, ibmdld = %d, ibm128 = %d.\n",
+	"lxvrse = %d, lxvrze = %d, endian = %d, ibmdld = %d, ibm128 = %d, "
+	"future = %d, dm = %d.\n",
 	attrptr->isinit, attrptr->isset, attrptr->isextract,
 	attrptr->isnosoft, attrptr->isldvec, attrptr->isstvec,
 	attrptr->isreve, attrptr->ispred, attrptr->ishtm, attrptr->ishtmspr,
@@ -1485,7 +1497,7 @@ parse_bif_attrs (attrinfo *attrptr)
 	attrptr->ismmaint, attrptr->isno32bit, attrptr->is32bit,
 	attrptr->iscpu, attrptr->isldstmask, attrptr->islxvrse,
 	attrptr->islxvrze, attrptr->isendian, attrptr->isibmld,
-	attrptr->isibm128);
+	attrptr->isibm128, attrptr->isfuture, attrptr->isdm);
 #endif
 
   return PC_OK;
@@ -2257,7 +2269,10 @@ write_decls (void)
   fprintf (header_file, "  ENB_HTM,\n");
   fprintf (header_file, "  ENB_P10,\n");
   fprintf (header_file, "  ENB_P10_64,\n");
-  fprintf (header_file, "  ENB_MMA\n");
+  fprintf (header_file, "  ENB_MMA,\n");
+  fprintf (header_file, "  ENB_FUTURE,\n");
+  fprintf (header_file, "  ENB_FUTURE_64,\n");
+  fprintf (header_file, "  ENB_DM\n");
   fprintf (header_file, "};\n\n");
 
   fprintf (header_file, "#define PPC_MAXRESTROPNDS 3\n");
@@ -2301,6 +2316,8 @@ write_decls (void)
   fprintf (header_file, "#define bif_endian_bit\t\t(0x00200000)\n");
   fprintf (header_file, "#define bif_ibmld_bit\t\t(0x00400000)\n");
   fprintf (header_file, "#define bif_ibm128_bit\t\t(0x00800000)\n");
+  fprintf (header_file, "#define bif_future_bit\t\t(0x01000000)\n");
+  fprintf (header_file, "#define bif_dm_bit\t\t(0x02000000)\n");
   fprintf (header_file, "\n");
   fprintf (header_file,
 	   "#define bif_is_init(x)\t\t((x).bifattrs & bif_init_bit)\n");
@@ -2350,6 +2367,10 @@ write_decls (void)
 	   "#define bif_is_ibmld(x)\t((x).bifattrs & bif_ibmld_bit)\n");
   fprintf (header_file,
 	   "#define bif_is_ibm128(x)\t((x).bifattrs & bif_ibm128_bit)\n");
+  fprintf (header_file,
+	   "#define bif_is_future(x)\t((x).bifattrs & bif_future_bit)\n");
+  fprintf (header_file,
+	   "#define bif_is_dm(x)\t((x).bifattrs & bif_dm_bit)\n");
   fprintf (header_file, "\n");
 
   fprintf (header_file,
@@ -2548,6 +2569,10 @@ write_bif_static_init (void)
 	fprintf (init_file, " | bif_ibmld_bit");
       if (bifp->attrs.isibm128)
 	fprintf (init_file, " | bif_ibm128_bit");
+      if (bifp->attrs.isfuture)
+	fprintf (init_file, " | bif_future_bit");
+      if (bifp->attrs.isdm)
+	fprintf (init_file, " | bif_dm_bit");
       fprintf (init_file, ",\n");
       fprintf (init_file, "      /* restr_opnd */\t{%d, %d, %d},\n",
 	       bifp->proto.restr_opnd[0], bifp->proto.restr_opnd[1],
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index ee7651d9b43..f9e231c16be 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -15654,6 +15654,66 @@ (define_insn "hashchk"
 }
   [(set_attr "type" "load")])
 \f
+;; Signed saturation.
+
+;; The subfus instruction is defined as: SUBFUS RT,L,RA,RB.  The extended
+;; mnemonics that we use (subdus and subwus) have the arguments RA and RB
+;; reversed (so it becomes a subtract instead of subtract from).
+
+(define_insn "sat_sub<mode>3"
+  [(set (match_operand:GPR 0 "gpc_reg_operand" "=r")
+	(ss_minus:GPR (match_operand:GPR 1 "gpc_reg_operand" "r")
+		      (match_operand:GPR 2 "gpc_reg_operand" "r")))]
+  "TARGET_FUTURE"
+  "sub<wd>us %0,%1,%2"
+  [(set_attr "type" "add")])
+
+(define_insn_and_split "*sat_sub<mode>3_dot"
+  [(set (match_operand:CC 3 "cc_reg_operand" "=x,?y")
+	(compare:CC (ss_minus:GPR (match_operand:GPR 1 "gpc_reg_operand" "r,r")
+				  (match_operand:GPR 2 "gpc_reg_operand" "r,r"))
+		    (const_int 0)))
+   (clobber (match_scratch:GPR 0 "=r,r"))]
+  "TARGET_FUTURE"
+  "@
+   sub<wd>us. %0,%1,%2
+   #"
+  "&& reload_completed && cc_reg_not_cr0_operand (operands[3], CCmode)"
+  [(set (match_dup 0)
+	(ss_minus:GPR (match_dup 1)
+		      (match_dup 2)))
+   (set (match_dup 3)
+	(compare:CC (match_dup 0)
+		    (const_int 0)))]
+  ""
+  [(set_attr "type" "add")
+   (set_attr "dot" "yes")
+   (set_attr "length" "4,8")])
+
+(define_insn_and_split "*sat_sub<mode>3_dot2"
+  [(set (match_operand:CC 3 "cc_reg_operand" "=x,?y")
+	(compare:CC (ss_minus:GPR (match_operand:GPR 1 "gpc_reg_operand" "r,r")
+				  (match_operand:GPR 2 "gpc_reg_operand" "r,r"))
+		    (const_int 0)))
+   (set (match_operand:GPR 0 "gpc_reg_operand" "=r,r")
+	(ss_minus:GPR (match_dup 1)
+		      (match_dup 2)))]
+  "TARGET_FUTURE"
+  "@
+   sub<wd>us. %0,%1,%2
+   #"
+  "&& reload_completed && cc_reg_not_cr0_operand (operands[3], CCmode)"
+  [(set (match_dup 0)
+	(ss_minus:GPR (match_dup 1)
+		      (match_dup 2)))
+   (set (match_dup 3)
+	(compare:CC (match_dup 0)
+		    (const_int 0)))]
+  ""
+  [(set_attr "type" "add")
+   (set_attr "dot" "yes")
+   (set_attr "length" "4,8")])
+\f
 
 (include "sync.md")
 (include "vector.md")
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 5a026c4b48c..a6d25b5b618 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -17839,6 +17839,7 @@ Disable global interrupt.
 * Basic PowerPC Built-in Functions Available on ISA 2.07::
 * Basic PowerPC Built-in Functions Available on ISA 3.0::
 * Basic PowerPC Built-in Functions Available on ISA 3.1::
+* Basic Built-in Functions that may be available on future PowerPCs::
 @end menu
 
 This section describes PowerPC built-in functions that do not require
@@ -18496,6 +18497,29 @@ ISA 3.1 @code{stxvrbx}, @code{stxvrhx}, @code{stxvrwx}, and @code{stxvrdx}
 instructions.
 @findex vec_xst_trunc
 
+@node Basic Built-in Functions that may be available on future PowerPCs
+@subsubsection Potential future PowerPC Built-in Functions
+
+The built-in functions described in this section may be available on
+future PowerPC processors.  At present, these built-ins exist to
+allow testing of new instructions.  There is no guarantee that
+these instructions will actually be implemented.
+
+The following built-in functions are available on Linux 64-bit systems
+that use a potential future instruction set (@option{-mcpu=future}):
+
+@table @code
+@item int __builtin_saturate_subtract32 (int, int)
+Subtract the second operand from the first operand.  If the value
+would be less than 0, then the result is 0 instead of the negative
+value of the subtraction.
+
+@item long __builtin_saturate_subtract64 (long, long)
+Subtract the second operand from the first operand.  If the value
+would be less than 0, then the result is 0 instead of the negative
+value of the subtraction.
+@end table
+
 @node PowerPC AltiVec/VSX Built-in Functions
 @subsection PowerPC AltiVec/VSX Built-in Functions
 
diff --git a/gcc/testsuite/gcc.target/powerpc/subfus-1.c b/gcc/testsuite/gcc.target/powerpc/subfus-1.c
new file mode 100644
index 00000000000..535e7f8483d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/subfus-1.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_future_ok } */
+/* { dg-options "-mdejagnu-cpu=future -O2" } */
+
+/* Test whether the saturating subtract built-in generates subwus for 32-bit
+   subtracts.  */
+
+int do_sat_int  (int  a, int  b)
+{
+  return __builtin_saturate_subtract32 (a, b);		/* subwus  */
+}
+
+int do_sat_int_dot  (int  a, int  b, int  *p)
+{
+  int  r = __builtin_saturate_subtract32 (a, b);	/* subwus.  */
+  if (r == 0)
+    *p = 0;
+
+  return r;
+}
+
+void do_sat_int_dot2  (int  a, int  b, int  *p, int *q)
+{
+  if (__builtin_saturate_subtract32 (a, b))		/* subwus.  */
+    *p = 0;
+
+  *q = a + b;
+  return;
+}
+
+/* { dg-final { scan-assembler     {\msubwus\M} } } */
+/* { dg-final { scan-assembler-not {\msubf\M}   } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/subfus-2.c b/gcc/testsuite/gcc.target/powerpc/subfus-2.c
new file mode 100644
index 00000000000..b68e66dd2b0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/subfus-2.c
@@ -0,0 +1,32 @@
+/* { dg-do compile { target lp64 } } */
+/* { dg-require-effective-target powerpc_future_ok } */
+/* { dg-options "-mdejagnu-cpu=future -O2" } */
+
+/* Test whether the saturating subtract built-in generates subdus for 64-bit
+   subtracts.  */
+
+long do_sat_long  (long  a, long  b)
+{
+  return __builtin_saturate_subtract64 (a, b);		/* subdus  */
+}
+
+long do_sat_long_dot  (long  a, long  b, long  *p)
+{
+  long  r = __builtin_saturate_subtract64 (a, b);	/* subdus.  */
+  if (r == 0)
+    *p = 0;
+
+  return r;
+}
+
+void do_sat_long_dot2  (long  a, long  b, long  *p, long *q)
+{
+  if (__builtin_saturate_subtract64 (a, b))		/* subdus.  */
+    *p = 0;
+
+  *q = a + b;
+  return;
+}
+
+/* { dg-final { scan-assembler     {\msubdus\M} } } */
+/* { dg-final { scan-assembler-not {\msubf\M}   } } */
-- 
2.39.1


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meissner@linux.ibm.com


* Re: [PATCH 0/8] PowerPC future support for Dense Math
  2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
                   ` (7 preceding siblings ...)
  2023-02-03 21:37 ` [PATCH 8/8] Add saturating subtract built-ins Michael Meissner
@ 2023-02-06  7:25 ` Richard Biener
  2023-02-06 18:22   ` Peter Bergner
  8 siblings, 1 reply; 11+ messages in thread
From: Richard Biener @ 2023-02-06  7:25 UTC (permalink / raw)
  To: Michael Meissner, gcc-patches, Segher Boessenkool, Kewen.Lin,
	David Edelsohn, Peter Bergner, Will Schmidt

On Fri, Feb 3, 2023 at 10:16 PM Michael Meissner via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> These patches were originally posted on November 10th.  Segher has asked that I
> repost them.  These patches are somewhat changed since the original posting to
> address some of the comments.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-November/605581.html
>
> In the first patch (adding -mcpu=future), I have taken out the code of making
> -mtune=future act as -mtune=power10.  Instead I went through all of the places
> that look at the tuning (mostly in power10.md and rs6000.cc), and added future
> as an option.  Obviously at a later time, we will provide a separate tuning
> file for future (or whatever the new name will be if the instructions are added
> officially).  But for now, it will suffice.
>
> In patch #3, I fixed the opcode for clearing a dense math register that Peter
> had noticed.  I was using the name based on the existing clear instruction,
> instead of the new instruction.
>
> In patch #6, I fixed the code, relying on the changes for setting the precision
> field to 16 bits.  Since that patch will not be able to go into GCC 13 at
> present, we might skip that support for now.  The important thing for existing
> users of the MMA code is the support for accumulators being in the separate
> dense math registers rather than overlapping does need to go in, and we can
> probably delay the 1,024 bit register support, or implement in a different
> fashion.
>
> In the insn names, I tried to switch to using _vsx instead of _fpr for the
> existing MMA support instructions.  I also tried to clear up the comments to
> specify ISA 3.1 instead of power10 when talking about the existing MMA
> support.
>
> The following is from the original posting (slightly modified):
>
> This patch is very preliminary support for a potential new feature to the
> PowerPC that extends the current power10 MMA architecture.  This feature may or
> may not be present in any specific future PowerPC processor.
>
> In the current MMA subsystem for Power10, there are 8 512-bit accumulator
> registers.  These accumulators are each tied to sets of 4 FPR registers.  When
> you issue a prime instruction, it makes sure the accumulator is a copy of the 4
> FPR registers the accumulator is tied to.  When you issue a deprime
> instruction, it makes sure that the accumulator data content is logically
> copied to the matching FPR register.
>
> In the potential dense math system, the accumulators are moved to separate
> registers called dense math registers (DM registers or DMR).  The DMRs are then
> extended to 1,024 bits and new instructions will be added to deal with all
> 1,024 bits of the DMRs.
>
> If you take existing MMA code, it will work as long as you don't do anything
> with accumulators, and you follow the rules in the ISA 3.1 documentation for
> using the MMA subsystem.
>
> These patches add support for the 512-bit accumulators within the dense math
> system, and for allocation of the 1,024-bit DMRs.  At this time, no additional
> built-in functions will be done to support any dense math features other than
> doing data movement between the DMRs and the VSX registers.  Before we can look
> at adding any new dense math support other than data movement, we need the GCC
> compiler to be able to allocate and use these DMRs.
>
> There are 8 patches in this patch set:
>
> 1) The first patch just adds -mcpu=future as an option to add new support.
> This is similar to the -mcpu=future that we did before power10 was announced.
>
> 2) The second patch enables GCC to use the load and store vector pair
> instructions to optimize memory copy operations in the compiler.  For power10,
> we needed to just stay with normal vector load/stores for memory copy
> operations.
>
> 3) The third patch enables 512-bit accumulators store in DMRs.  This patch
> enables the register allocation, but it does not move the existing MMA to use
> these registers.
>
> 4) The fourth patch switches the MMA subsystem to use 512-bit accumulators
> within DMRs if you use -mcpu=future.
>
> 5) The fifth patch switches the names of the MMA instructions to use the dense
> math equivalent name if -mcpu=future.
>
> 6) The sixth patch enables using the full 1,024-bit DMRs.  Right now, all you
> can do with DMRs is move a VSX register to a DMR register, and to move a DMR
> register to a VSX register.  [As I mentioned above, at the moment, this patch
> is problematical as is]
>
> 7) The seventh patch is not DMR related.  It adds support for variants of the
> load/store vector with length instruction that may be added in future PowerPC
> processors.  These variants eliminate having to shift the byte length left by
> 56 bits.
>
> 8) The eighth patch is also not DMR related.  It adds support for a saturating
> subtract operation that may be added to future PowerPC processors.
>
> In terms of changes, we now use the wD constraint for accumulators.  If you
> compile with -mcpu=power10, the wD constraint will match the equivalent VSX
> register (0..31) that overlaps with the accumulator.  If you compile with
> -mcpu=future, the wD constraint will match the DMR register and not the FPR
> register.
>
> This patch also modifies the print_operand %A output modifier to print out DMR
> register numbers if -mcpu=future, and continue to print out the FPR register
> number divided by 4 for -mcpu=power10.
>
> In general, if you only use the built-in functions, things work between the two
> systems.  If you use extended asm, you will likely need to modify the code.
> Going forward, hopefully if you modify your code to use the wD constraint and
> %A output modifier, you can write code that switches more easily between the
> two systems.
>
> Again, these are preliminary patches for a potential future machine.  Things
> will likely change in terms of implementation and usage over time.

May I ask to consider delaying this to stage1 exactly because of this
last reason?

Richard.

>
> --
> Michael Meissner, IBM
> PO Box 98, Ayer, Massachusetts, USA, 01432
> email: meissner@linux.ibm.com


* Re: [PATCH 0/8] PowerPC future support for Dense Math
  2023-02-06  7:25 ` [PATCH 0/8] PowerPC future support for Dense Math Richard Biener
@ 2023-02-06 18:22   ` Peter Bergner
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Bergner @ 2023-02-06 18:22 UTC (permalink / raw)
  To: Richard Biener, Michael Meissner, gcc-patches,
	Segher Boessenkool, Kewen.Lin, David Edelsohn, Will Schmidt

On 2/6/23 1:25 AM, Richard Biener wrote:
> May I ask to consider delaying this to stage1 exactly because of this
> last reason?

That is our plan.  We're just still working through the review so it's
ready when stage1 opens up.

Peter




Thread overview: 11+ messages
2023-02-03 21:16 [PATCH 0/8] PowerPC future support for Dense Math Michael Meissner
2023-02-03 21:21 ` [PATCH 1/8] PowerPC: Add -mcpu=future Michael Meissner
2023-02-03 21:23 ` [PATCH 1/8] PowerPC: Make -mcpu=future enable -mblock-ops-vector-pair Michael Meissner
2023-02-03 21:25 ` [PATCH 2/8] PowerPC: Add support for accumulators in DMR registers Michael Meissner
2023-02-03 21:27 ` [PATCH 3/8] PowerPC: Make MMA insns support " Michael Meissner
2023-02-03 21:29 ` [PATCH 4/8] PowerPC: Switch to dense math names for all MMA operations Michael Meissner
2023-02-03 21:33 ` [PATCH 6/8] PowerPC: Add support for 1,024 bit DMR registers Michael Meissner
2023-02-03 21:36 ` [PATCH 7/8] Support load/store vector with right length Michael Meissner
2023-02-03 21:37 ` [PATCH 8/8] Add saturating subtract built-ins Michael Meissner
2023-02-06  7:25 ` [PATCH 0/8] PowerPC future support for Dense Math Richard Biener
2023-02-06 18:22   ` Peter Bergner
