[PATCH v2 0/2] RISC-V improve stack/array access by constant mat tweak

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH v2 0/2] RISC-V improve stack/array access by constant mat tweak
@ 2024-05-13 18:49 Vineet Gupta
  2024-05-13 18:49 ` [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
  2024-05-13 18:49 ` [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] Vineet Gupta
  0 siblings, 2 replies; 19+ messages in thread
From: Vineet Gupta @ 2024-05-13 18:49 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jeff Law, kito.cheng, Palmer Dabbelt, Christoph Müllner,
	Robin Dapp, gnu-toolchain, Vineet Gupta

Hi,

This set of patches help improve stack/array accesses by improving
constant materialization. Details are in respective patches.

The first patch is the main change which improves SPEC cactu by 10%.

As discussed/agreed for v1 [1], I've dropped the splitter variant for
stack accesses.

I also have a few follow-ups which I come back to seperately.

Thx,
-Vineet

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647874.html

Vineet Gupta (2):
  RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  RISC-V: avoid LUI based const mat in prologue/epilogue expansion
    [PR/105733]

 gcc/config/riscv/constraints.md               |  6 ++
 gcc/config/riscv/predicates.md                |  6 ++
 gcc/config/riscv/riscv-protos.h               |  3 +
 gcc/config/riscv/riscv.cc                     | 85 +++++++++++++++++--
 gcc/config/riscv/riscv.h                      | 22 +++++
 gcc/config/riscv/riscv.md                     | 40 +++++++++
 gcc/testsuite/gcc.target/riscv/pr105733.c     | 15 ++++
 .../riscv/rvv/autovec/vls/spill-1.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-2.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-3.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-4.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-5.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-6.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-7.c           |  4 +-
 .../gcc.target/riscv/sum-of-two-s12-const-1.c | 45 ++++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-2.c | 15 ++++
 .../gcc.target/riscv/sum-of-two-s12-const-3.c | 22 +++++
 17 files changed, 266 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/pr105733.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  2024-05-13 18:49 [PATCH v2 0/2] RISC-V improve stack/array access by constant mat tweak Vineet Gupta
@ 2024-05-13 18:49 ` Vineet Gupta
  2024-05-13 20:15   ` Jeff Law
  2024-05-13 21:13   ` Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265]) Vineet Gupta
  2024-05-13 18:49 ` [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] Vineet Gupta
  1 sibling, 2 replies; 19+ messages in thread
From: Vineet Gupta @ 2024-05-13 18:49 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jeff Law, kito.cheng, Palmer Dabbelt, Christoph Müllner,
	Robin Dapp, gnu-toolchain, Vineet Gupta

Apologies for the delay in getting this out. Needed to fix one ICE
with glibc build and fresh round of testing: both testsuite and SPEC
runs (which are similar to v1 in terms of Cactu gains, but some more minor
regressions elsewhere gcc). Again those seem so small that IMHO this
should still go in.

I'll investigate those next as well as an existing weirdnes in glibc tempnam
which I spotted during the debugging.

Changes since v1 [1]
 - Tighten the main conditition to avoid stack regs as destination
   (to avoid making them potentially unaligned with -2047 addend:
    this might be OK execution/ABI wise, but undesirable/ugly still
    specially when coming from compiler codegen).
 - Ensure that first alternative is always split
 - Remove "&& 1" from split condition. That was tripping up glibc build
   with illegal operands `add s0, s0, 2048`.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647877.html

---

... if the constant can be represented as sum of two S12 values.
The two S12 values could instead be fused with subsequent ADD insn.
The helps
 - avoid an additional LUI insn
 - side benefits of not clobbering a reg

e.g.
                            w/o patch             w/ patch
long                  |                     |
plus(unsigned long i) |	li	a5,4096     |
{                     |	addi	a5,a5,-2032 | addi a0, a0, 2047
   return i + 2064;   |	add	a0,a0,a5    | addi a0, a0, 17
}                     | 	ret         | ret

NOTE: In theory not having const in a standalone reg might seem less
      CSE friendly, but for workloads in consideration these mat are
      from very late LRA reloads and follow on GCSE is not doing much
      currently.

The real benefit however is seen in base+offset computation for array
accesses and especially for stack accesses which are finalized late in
optim pipeline, during LRA register allocation. Often the finalized
offsets trigger LRA reloads resulting in mind boggling repetition of
exact same insn sequence including LUI based constant materialization.

This shaves off 290 billion dynamic instrustions (QEMU icounts) in
SPEC 2017 Cactu benchmark which is over 10% of workload. In the rest of
suite, there additional 10 billion shaved, with both gains and losses
in indiv workloads as is usual with compiler changes.

 500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
 500.perlbench_r-1 |    740,383,419,739 |   739,280,308,163 |
 500.perlbench_r-2 |    692,074,638,817 |   691,118,734,547 |
 502.gcc_r-0       |    190,820,141,435 |   190,857,065,988 |
 502.gcc_r-1       |    225,747,660,839 |   225,809,444,357 | <- -0.02%
 502.gcc_r-2       |    220,370,089,641 |   220,406,367,876 | <- -0.03%
 502.gcc_r-3       |    179,111,460,458 |   179,135,609,723 | <- -0.02%
 502.gcc_r-4       |    219,301,546,340 |   219,320,416,956 | <- -0.01%
 503.bwaves_r-0    |    278,733,324,691 |   278,733,323,575 | <- -0.01%
 503.bwaves_r-1    |    442,397,521,282 |   442,397,519,616 |
 503.bwaves_r-2    |    344,112,218,206 |   344,112,216,760 |
 503.bwaves_r-3    |    417,561,469,153 |   417,561,467,597 |
 505.mcf_r         |    669,319,257,525 |   669,318,763,084 |
 507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
 508.namd_r        |  1,855,884,342,110 | 1,855,881,110,934 |
 510.parest_r      |  1,654,525,521,053 | 1,654,402,859,174 |
 511.povray_r      |  2,990,146,655,619 | 2,990,060,324,589 |
 519.lbm_r         |  1,158,337,294,525 | 1,158,337,294,529 |
 520.omnetpp_r     |  1,021,765,791,283 | 1,026,165,661,394 |
 521.wrf_r         |  1,715,955,652,503 | 1,714,352,737,385 |
 523.xalancbmk_r   |    849,846,008,075 |   849,836,851,752 |
 525.x264_r-0      |    277,801,762,763 |   277,488,776,427 |
 525.x264_r-1      |    927,281,789,540 |   926,751,516,742 |
 525.x264_r-2      |    915,352,631,375 |   914,667,785,953 |
 526.blender_r     |  1,652,839,180,887 | 1,653,260,825,512 |
 527.cam4_r        |  1,487,053,494,925 | 1,484,526,670,770 |
 531.deepsjeng_r   |  1,641,969,526,837 | 1,642,126,598,866 |
 538.imagick_r     |  2,098,016,546,691 | 2,097,997,929,125 |
 541.leela_r       |  1,983,557,323,877 | 1,983,531,314,526 |
 544.nab_r         |  1,516,061,611,233 | 1,516,061,407,715 |
 548.exchange2_r   |  2,072,594,330,215 | 2,072,591,648,318 |
 549.fotonik3d_r   |  1,001,499,307,366 | 1,001,478,944,189 |
 554.roms_r        |  1,028,799,739,111 | 1,028,780,904,061 |
 557.xz_r-0        |    363,827,039,684 |   363,057,014,260 |
 557.xz_r-1        |    906,649,112,601 |   905,928,888,732 |
 557.xz_r-2        |    509,023,898,187 |   508,140,356,932 |
 997.specrand_fr   |        402,535,577 |       403,052,561 |
 999.specrand_ir   |        402,535,577 |       403,052,561 |

This should still be considered damage control as the real/deeper fix
would be to reduce number of LRA reloads or CSE/anchor those during
LRA constraint sub-pass (re)runs (thats a different PR/114729.

Implementation Details (for posterity)
--------------------------------------
 - basic idea is to have a splitter selected via a new predicate for constant
   being possible sum of two S12 and provide the transform.
   This is however a 2 -> 2 transform which combine can't handle.
   So we specify it using a define_insn_and_split.

 - the initial loose "i" constraint caused LRA to accept invalid insns thus
   needing a tighter new constraint as well.

 - An additional fallback alternate with catch-all "r" register
   constraint also needed to allow any "reloads" that LRA might
   require for ADDI with const larger than S12.

Testing
--------
This is testsuite clean (rv64 only).
I'll rely on post-commit CI multlib run for any possible fallout for
other setups such as rv32.

|                                               |         gcc |          g++ |     gfortran |
| rv64imafdc_zba_zbb_zbs_zicond/  lp64d/ medlow |  41 /    17 |    8 /     3 |    7 /     2 |
| rv64imafdc_zba_zbb_zbs_zicond/  lp64d/ medlow |  41 /    17 |    8 /     3 |    7 /     2 |

I also threw this into a buildroot run, it obviously boots Linux to
userspace. bloat-o-meter on glibc and kernel show overall decrease in
staic instruction counts with some minor spot increases.
These are generally in the category of

 - LUI + ADDI are 2 byte each vs. two ADD being 4 byte each.
 - Knock on effects due to inlining changes.
 - Sometimes the slightly shorter 2-insn seq in a mult-exit function
   can cause in-place epilogue duplication (vs. a jump back).
   This is slightly larger but more efficient in execution.
In summary nothing to fret about.

| linux/scripts/bloat-o-meter build-gcc-240131/target/lib/libc.so.6 \
         build-gcc-240131-new-splitter-1-variant/target/lib/libc.so.6
|
| add/remove: 0/0 grow/shrink: 21/49 up/down: 520/-3056 (-2536)
| Function                                     old     new   delta
| getnameinfo                                 2756    2892    +136
...
| tempnam                                      136     144      +8
| padzero                                      276     284      +8
...
| __GI___register_printf_specifier             284     280      -4
| __EI_xdr_array                               468     464      -4
| try_file_lock                                268     260      -8
| pthread_create@GLIBC_2                      3520    3508     -12
| __pthread_create_2_1                        3520    3508     -12
...
| _nss_files_setnetgrent                       932     904     -28
| _nss_dns_gethostbyaddr2_r                   1524    1480     -44
| build_trtable                               3312    3208    -104
| printf_positional                          25000   22580   -2420
| Total: Before=2107999, After=2105463, chg -0.12%

gcc/ChangeLog:
	* config/riscv/riscv.h: New macros to check for sum of two S12
	range.
	* config/riscv/constraints.md: New constraint.
	* config/riscv/predicates.md: New Predicate.
	* config/riscv/riscv.md: New splitter.
	* config/riscv/riscv.cc (riscv_reg_frame_related): New helper.
	* config/riscv/riscv-protos.h: New helper prototype.

gcc/testsuite/ChangeLog:
	* gcc.target/riscv/sum-of-two-s12-const-1.c: New test: checks
	for new patterns output.
	* gcc.target/riscv/sum-of-two-s12-const-2.c: Ditto.
	* gcc.target/riscv/sum-of-two-s12-const-3.c: New test: should not
	ICE.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
---
 gcc/config/riscv/constraints.md               |  6 +++
 gcc/config/riscv/predicates.md                |  6 +++
 gcc/config/riscv/riscv-protos.h               |  1 +
 gcc/config/riscv/riscv.cc                     | 11 +++++
 gcc/config/riscv/riscv.h                      | 15 +++++++
 gcc/config/riscv/riscv.md                     | 40 +++++++++++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-1.c | 45 +++++++++++++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-2.c | 15 +++++++
 .../gcc.target/riscv/sum-of-two-s12-const-3.c | 22 +++++++++
 9 files changed, 161 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c

diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index a590df545d7d..a9ee346af6f0 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -80,6 +80,12 @@
   (and (match_code "const_int")
        (match_test "LUI_OPERAND (ival)")))
 
+(define_constraint "MiG"
+  "const can be represented as sum of any S12 values."
+  (and (match_code "const_int")
+       (ior (match_test "IN_RANGE (ival,  2048,  4094)")
+	    (match_test "IN_RANGE (ival, -4096, -2049)"))))
+
 (define_constraint "Ds3"
   "@internal
    1, 2 or 3 immediate"
diff --git a/gcc/config/riscv/predicates.md b/gcc/config/riscv/predicates.md
index e7d797d4dbf0..8948fbfc3631 100644
--- a/gcc/config/riscv/predicates.md
+++ b/gcc/config/riscv/predicates.md
@@ -428,6 +428,12 @@
 	return true;
 })
 
+(define_predicate "const_two_s12"
+  (match_code "const_int")
+{
+  return SUM_OF_TWO_S12 (INTVAL (op));
+})
+
 ;; CORE-V Predicates:
 (define_predicate "immediate_register_operand"
   (ior (match_operand 0 "register_operand")
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index e5aebf3fc3d5..706dc204e643 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -165,6 +165,7 @@ extern bool riscv_shamt_matches_mask_p (int, HOST_WIDE_INT);
 extern void riscv_subword_address (rtx, rtx *, rtx *, rtx *, rtx *);
 extern void riscv_lshift_subword (machine_mode, rtx, rtx, rtx *);
 extern enum memmodel riscv_union_memmodels (enum memmodel, enum memmodel);
+extern bool riscv_reg_frame_related (rtx);
 
 /* Routines implemented in riscv-c.cc.  */
 void riscv_cpu_cpp_builtins (cpp_reader *);
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index a1e5a014bedf..4067505270e1 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -7145,6 +7145,17 @@ riscv_can_eliminate (const int from ATTRIBUTE_UNUSED, const int to)
   return (to == HARD_FRAME_POINTER_REGNUM || to == STACK_POINTER_REGNUM);
 }
 
+/* Helper to determine if reg X pertains to stack.  */
+bool
+riscv_reg_frame_related (rtx x)
+{
+  return REG_P (x)
+	 && (REGNO (x) == FRAME_POINTER_REGNUM
+	     || REGNO (x) == HARD_FRAME_POINTER_REGNUM
+	     || REGNO (x) == ARG_POINTER_REGNUM
+	     || REGNO (x) == VIRTUAL_STACK_VARS_REGNUM);
+}
+
 /* Implement INITIAL_ELIMINATION_OFFSET.  FROM is either the frame pointer
    or argument pointer.  TO is either the stack pointer or hard frame
    pointer.  */
diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index 58d0b09bf7d9..0d27c0d378df 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -626,6 +626,21 @@ enum reg_class
   (((VALUE) | ((1UL<<31) - IMM_REACH)) == ((1UL<<31) - IMM_REACH)	\
    || ((VALUE) | ((1UL<<31) - IMM_REACH)) + IMM_REACH == 0)
 
+/* True if a VALUE (constant) can be expressed as sum of two S12 constants
+   (in range -2048 to 2047).
+   Range check logic:
+     from: min S12 + 1 (or -1 depending on what side of zero)
+       to: two times the min S12 value (to max out S12 bits).  */
+
+#define SUM_OF_TWO_S12_N(VALUE)						\
+  (((VALUE) >= (-2048 * 2)) && ((VALUE) <= (-2048 - 1)))
+
+#define SUM_OF_TWO_S12_P(VALUE)						\
+  (((VALUE) >= (2047 + 1)) && ((VALUE) <= (2047 * 2)))
+
+#define SUM_OF_TWO_S12(VALUE)						\
+  (SUM_OF_TWO_S12_N (VALUE) || SUM_OF_TWO_S12_P (VALUE))
+
 /* If this is a single bit mask, then we can load it with bseti.  Special
    handling of SImode 0x80000000 on RV64 is done in riscv_build_integer_1. */
 #define SINGLE_BIT_MASK_OPERAND(VALUE)					\
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index 4d6de9925572..f5dac342033e 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -703,6 +703,46 @@
   [(set_attr "type" "arith")
    (set_attr "mode" "DI")])
 
+;; Special case of adding a reg and constant if latter is sum of two S12
+;; values (in range -2048 to 2047). Avoid materialized the const and fuse
+;; into the add (with an additional add for 2nd value). Makes a 3 insn
+;; sequence into 2 insn.
+
+(define_insn_and_split "*add<mode>3_const_sum_of_two_s12"
+  [(set (match_operand:P	 0 "register_operand" "=r,r")
+	(plus:P (match_operand:P 1 "register_operand" " r,r")
+		(match_operand:P 2 "const_two_s12"    " MiG,r")))]
+  "!riscv_reg_frame_related (operands[0])"
+{
+  /* operand matching MiG constraint is always meant to be split.  */
+  if (which_alternative == 0)
+    return "#";
+  else
+    return "add %0,%1,%2";
+}
+  ""
+  [(set (match_dup 0)
+	(plus:P (match_dup 1) (match_dup 3)))
+   (set (match_dup 0)
+	(plus:P (match_dup 0) (match_dup 4)))]
+{
+  int val = INTVAL (operands[2]);
+  if (SUM_OF_TWO_S12_P (val))
+    {
+       operands[3] = GEN_INT (2047);
+       operands[4] = GEN_INT (val - 2047);
+    }
+  else if (SUM_OF_TWO_S12_N (val))
+    {
+       operands[3] = GEN_INT (-2048);
+       operands[4] = GEN_INT (val + 2048);
+    }
+  else
+      gcc_unreachable ();
+}
+  [(set_attr "type" "arith")
+   (set_attr "mode" "<P:MODE>")])
+
 (define_expand "addv<mode>4"
   [(set (match_operand:GPR           0 "register_operand" "=r,r")
 	(plus:GPR (match_operand:GPR 1 "register_operand" " r,r")
diff --git a/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
new file mode 100644
index 000000000000..4d6d135de5f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
@@ -0,0 +1,45 @@
+// TBD: This doesn't quite work for rv32 yet
+/* { dg-do compile } */
+/* { dg-options { -march=rv64gcv -mabi=lp64d } } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-Os" "-Oz" } } */
+
+/* Ensure that gcc doesn't generate standlone li reg, 4096.  */
+long
+plus1(unsigned long i)
+{
+   return i + 2048;
+}
+
+long
+plus2(unsigned long i)
+{
+   return i + 4094;
+}
+
+long
+plus3(unsigned long i)
+{
+   return i + 2064;
+}
+
+/* Ensure that gcc doesn't generate standlone li reg, -4096.  */
+long
+minus1(unsigned long i)
+{
+   return i - 4096;
+}
+
+long
+minus2(unsigned long i)
+{
+   return i - 2049;
+}
+
+long
+minus3(unsigned long i)
+{
+   return i - 2064;
+}
+
+/* { dg-final { scan-assembler-not {li\t[a-x0-9]+,-4096} } } */
+/* { dg-final { scan-assembler-not {li\t[a-x0-9]+,4096} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
new file mode 100644
index 000000000000..9343b43c3106
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
@@ -0,0 +1,15 @@
+/* Reduced from glibc/stdio-common/tempnam.c.
+   Can't have invalid insn in final output:
+      add s0, sp, 2048    */
+
+/* { dg-do compile } */
+/* { dg-options { -march=rv64gcv -mabi=lp64d -O2 } } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "O1" "-Og" "-Os" "-Oz" } } */
+
+int a() {
+  char b[4096];
+  if (a(b))
+    a(b);
+}
+
+/* { dg-final { scan-assembler-not {add\t[a-x0-9]+,sp,2048} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c
new file mode 100644
index 000000000000..5dcab52c2610
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c
@@ -0,0 +1,22 @@
+/* Reduced version of c-c++-common/torture/builtin-convertvector-1.c.  */
+/* This should NOT ICE */
+
+/* { dg-do compile } */
+
+typedef long b __attribute__((vector_size(256 * sizeof(long))));
+typedef double c __attribute__((vector_size(256 * sizeof(double))));
+int d;
+void e(b *f, c *g) { *g = __builtin_convertvector(*f, c); }
+void h() {
+  struct {
+    b i;
+  } j;
+  union {
+    c i;
+    double a[6];
+  } k;
+  e(&j.i, &k.i);
+  if (k.a[d])
+    for (;;)
+      ;
+}
-- 
2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-13 18:49 [PATCH v2 0/2] RISC-V improve stack/array access by constant mat tweak Vineet Gupta
  2024-05-13 18:49 ` [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
@ 2024-05-13 18:49 ` Vineet Gupta
  2024-05-13 20:28   ` Jeff Law
  1 sibling, 1 reply; 19+ messages in thread
From: Vineet Gupta @ 2024-05-13 18:49 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jeff Law, kito.cheng, Palmer Dabbelt, Christoph Müllner,
	Robin Dapp, gnu-toolchain, Vineet Gupta

If the constant used for stack offset can be expressed as sum of two S12
values, the constant need not be materialized (in a reg) and instead the
two S12 bits can be added to instructions involved with frame pointer.
This avoids burning a register and more importantly can often get down
to be 2 insn vs. 3.

The prev patches to generally avoid LUI based const materialization didn't
fix this PR and need this directed fix in funcion prologue/epilogue
expansion.

This fix doesn't move the neddle for SPEC, at all, but it is still a
win considering gcc generates one insn fewer than llvm for the test ;-)

   gcc-13.1 release   |      gcc 230823     |                   |
                      |    g6619b3d4c15c    |   This patch      |  clang/llvm
---------------------------------------------------------------------------------
li      t0,-4096     | li    t0,-4096      | addi  sp,sp,-2048 | addi sp,sp,-2048
addi    t0,t0,2016   | addi  t0,t0,2032    | add   sp,sp,-16   | addi sp,sp,-32
li      a4,4096      | add   sp,sp,t0      | add   a5,sp,a0    | add  a1,sp,16
add     sp,sp,t0     | addi  a5,sp,-2032   | sb    zero,0(a5)  | add  a0,a0,a1
li      a5,-4096     | add   a0,a5,a0      | addi  sp,sp,2032  | sb   zero,0(a0)
addi    a4,a4,-2032  | li    t0, 4096      | addi  sp,sp,32    | addi sp,sp,2032
add     a4,a4,a5     | sb    zero,2032(a0) | ret               | addi sp,sp,48
addi    a5,sp,16     | addi  t0,t0,-2032   |                   | ret
add     a5,a4,a5     | add   sp,sp,t0      |
add     a0,a5,a0     | ret                 |
li      t0,4096      |
sd      a5,8(sp)     |
sb      zero,2032(a0)|
addi    t0,t0,-2016  |
add     sp,sp,t0     |
ret                  |

gcc/ChangeLog:
	PR target/105733
	* config/riscv/riscv.h: New macros for with aligned offsets.
	* config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New
	function to split a sum of two s12 values into constituents.
	(riscv_expand_prologue): Handle offset being sum of two S12.
	(riscv_expand_epilogue): Ditto.
	* config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New.

gcc/testsuite/ChangeLog:
	* gcc.target/riscv/pr105733.c: New Test.
	* gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not
	expect LUI 4096.
	* gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
---
 gcc/config/riscv/riscv-protos.h               |  2 +
 gcc/config/riscv/riscv.cc                     | 74 +++++++++++++++++--
 gcc/config/riscv/riscv.h                      |  7 ++
 gcc/testsuite/gcc.target/riscv/pr105733.c     | 15 ++++
 .../riscv/rvv/autovec/vls/spill-1.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-2.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-3.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-4.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-5.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-6.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-7.c           |  4 +-
 11 files changed, 105 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/pr105733.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 706dc204e643..6da6ae4d041f 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -166,6 +166,8 @@ extern void riscv_subword_address (rtx, rtx *, rtx *, rtx *, rtx *);
 extern void riscv_lshift_subword (machine_mode, rtx, rtx, rtx *);
 extern enum memmodel riscv_union_memmodels (enum memmodel, enum memmodel);
 extern bool riscv_reg_frame_related (rtx);
+extern void riscv_split_sum_of_two_s12 (HOST_WIDE_INT, HOST_WIDE_INT *,
+					HOST_WIDE_INT *);
 
 /* Routines implemented in riscv-c.cc.  */
 void riscv_cpu_cpp_builtins (cpp_reader *);
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 4067505270e1..4b742489b272 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -4063,6 +4063,32 @@ riscv_split_doubleword_move (rtx dest, rtx src)
        riscv_emit_move (riscv_subword (dest, true), riscv_subword (src, true));
      }
 }
+
+/* Constant VAL is known to be sum of two S12 constants.  Break it into
+   comprising BASE and OFF.
+   Numerically S12 is -2048 to 2047, however it uses the more conservative
+   range -2048 to 2032 as offsets pertain to stack related registers.  */
+
+void
+riscv_split_sum_of_two_s12 (HOST_WIDE_INT val, HOST_WIDE_INT *base,
+			    HOST_WIDE_INT *off)
+{
+  if (SUM_OF_TWO_S12_N (val))
+    {
+      *base = -2048;
+      *off = val - (-2048);
+    }
+  else if (SUM_OF_TWO_S12_P_ALGN (val))
+    {
+      *base = 2032;
+      *off = val - 2032;
+    }
+  else
+    {
+      gcc_unreachable ();
+    }
+}
+
 \f
 /* Return the appropriate instructions to move SRC into DEST.  Assume
    that SRC is operand 1 and DEST is operand 0.  */
@@ -7852,6 +7878,17 @@ riscv_expand_prologue (void)
 				GEN_INT (-constant_frame));
 	  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
 	}
+      else if (SUM_OF_TWO_S12_ALGN (-constant_frame))
+	{
+	  HOST_WIDE_INT base, off;
+	  riscv_split_sum_of_two_s12 (-constant_frame, &base, &off);
+	  insn = gen_add3_insn (stack_pointer_rtx, stack_pointer_rtx,
+				GEN_INT (base));
+	  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
+	  insn = gen_add3_insn (stack_pointer_rtx, stack_pointer_rtx,
+				GEN_INT (off));
+	  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
+	}
       else
 	{
 	  riscv_emit_move (RISCV_PROLOGUE_TEMP (Pmode), GEN_INT (-constant_frame));
@@ -8074,14 +8111,26 @@ riscv_expand_epilogue (int style)
 	}
       else
 	{
-	  if (!SMALL_OPERAND (adjust_offset.to_constant ()))
+	  HOST_WIDE_INT adj_off_value = adjust_offset.to_constant ();
+	  if (SMALL_OPERAND (adj_off_value))
+	    {
+	      adjust = GEN_INT (adj_off_value);
+	    }
+	  else if (SUM_OF_TWO_S12_ALGN (adj_off_value))
+	    {
+	      HOST_WIDE_INT base, off;
+	      riscv_split_sum_of_two_s12 (adj_off_value, &base, &off);
+	      insn = gen_add3_insn (stack_pointer_rtx, hard_frame_pointer_rtx,
+					GEN_INT (base));
+	      RTX_FRAME_RELATED_P (insn) = 1;
+	      adjust = GEN_INT (off);
+	    }
+	  else
 	    {
 	      riscv_emit_move (RISCV_PROLOGUE_TEMP (Pmode),
-			       GEN_INT (adjust_offset.to_constant ()));
+			       GEN_INT (adj_off_value));
 	      adjust = RISCV_PROLOGUE_TEMP (Pmode);
 	    }
-	  else
-	    adjust = GEN_INT (adjust_offset.to_constant ());
 	}
 
       insn = emit_insn (
@@ -8148,10 +8197,21 @@ riscv_expand_epilogue (int style)
 
       /* Get an rtx for STEP1 that we can add to BASE.
 	 Skip if adjust equal to zero.  */
-      if (step1.to_constant () != 0)
+      HOST_WIDE_INT step1_value = step1.to_constant ();
+      if (step1_value != 0)
 	{
-	  rtx adjust = GEN_INT (step1.to_constant ());
-	  if (!SMALL_OPERAND (step1.to_constant ()))
+	  rtx adjust = GEN_INT (step1_value);
+	  if (SUM_OF_TWO_S12_ALGN (step1_value))
+	    {
+	      HOST_WIDE_INT base, off;
+	      riscv_split_sum_of_two_s12 (step1_value, &base, &off);
+	      insn = emit_insn (gen_add3_insn (stack_pointer_rtx,
+						stack_pointer_rtx,
+						GEN_INT (base)));
+	      RTX_FRAME_RELATED_P (insn) = 1;
+	      adjust = GEN_INT (off);
+	    }
+	  else if (!SMALL_OPERAND (step1_value))
 	    {
 	      riscv_emit_move (RISCV_PROLOGUE_TEMP (Pmode), adjust);
 	      adjust = RISCV_PROLOGUE_TEMP (Pmode);
diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index 0d27c0d378df..d6b14c4d6205 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -641,6 +641,13 @@ enum reg_class
 #define SUM_OF_TWO_S12(VALUE)						\
   (SUM_OF_TWO_S12_N (VALUE) || SUM_OF_TWO_S12_P (VALUE))
 
+/* Variant with first value 8 byte aligned if involving stack regs.  */
+#define SUM_OF_TWO_S12_P_ALGN(VALUE)				\
+  (((VALUE) >= (2032 + 1)) && ((VALUE) <= (2032 * 2)))
+
+#define SUM_OF_TWO_S12_ALGN(VALUE)				\
+  (SUM_OF_TWO_S12_N (VALUE) || SUM_OF_TWO_S12_P_ALGN (VALUE))
+
 /* If this is a single bit mask, then we can load it with bseti.  Special
    handling of SImode 0x80000000 on RV64 is done in riscv_build_integer_1. */
 #define SINGLE_BIT_MASK_OPERAND(VALUE)					\
diff --git a/gcc/testsuite/gcc.target/riscv/pr105733.c b/gcc/testsuite/gcc.target/riscv/pr105733.c
new file mode 100644
index 000000000000..6156c36dc7ef
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/pr105733.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options { -march=rv64gcv -mabi=lp64d } } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-Os" "-Oz" } } */
+
+#define BUF_SIZE 2064
+
+void
+foo(unsigned long i)
+{
+    volatile char buf[BUF_SIZE];
+
+    buf[i] = 0;
+}
+
+/* { dg-final { scan-assembler-not {li\t[a-x0-9]+,4096} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-1.c
index b64c73f34f13..6afcf1db593b 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-1.c
@@ -129,5 +129,5 @@ spill_12 (int8_t *in, int8_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-2.c
index 8fcdca705384..544e8628a27b 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-2.c
@@ -120,5 +120,5 @@ spill_11 (int16_t *in, int16_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-3.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-3.c
index ca296ce02d66..4bfeb07e9aca 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-3.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-3.c
@@ -111,5 +111,5 @@ spill_10 (int32_t *in, int32_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-4.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-4.c
index ef61d9a2c0c3..1faf31ffd8e0 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-4.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-4.c
@@ -102,5 +102,5 @@ spill_9 (int64_t *in, int64_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-5.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-5.c
index 150135a91103..0c8dccc518e3 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-5.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-5.c
@@ -120,5 +120,5 @@ spill_11 (_Float16 *in, _Float16 *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-6.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-6.c
index c5d2d0194348..8bf53b84d1cd 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-6.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-6.c
@@ -111,5 +111,5 @@ spill_10 (float *in, float *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-7.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-7.c
index 70ca683908db..e3980a295406 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-7.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-7.c
@@ -102,5 +102,5 @@ spill_9 (int64_t *in, int64_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  2024-05-13 18:49 ` [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
@ 2024-05-13 20:15   ` Jeff Law
  2024-05-14 18:38     ` [COMMITTED] " Vineet Gupta
  2024-05-13 21:13   ` Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265]) Vineet Gupta
  1 sibling, 1 reply; 19+ messages in thread
From: Jeff Law @ 2024-05-13 20:15 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain

On 5/13/24 12:49 PM, Vineet Gupta wrote:
> Apologies for the delay in getting this out. Needed to fix one ICE
> with glibc build and fresh round of testing: both testsuite and SPEC
> runs (which are similar to v1 in terms of Cactu gains, but some more minor
> regressions elsewhere gcc). Again those seem so small that IMHO this
> should still go in.
> 
> I'll investigate those next as well as an existing weirdnes in glibc tempnam
> which I spotted during the debugging.
> 
> Changes since v1 [1]
>   - Tighten the main conditition to avoid stack regs as destination
>     (to avoid making them potentially unaligned with -2047 addend:
>      this might be OK execution/ABI wise, but undesirable/ugly still
>      specially when coming from compiler codegen).
>   - Ensure that first alternative is always split
>   - Remove "&& 1" from split condition. That was tripping up glibc build
>     with illegal operands `add s0, s0, 2048`.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647877.html
> 
>   
> +;; Special case of adding a reg and constant if latter is sum of two S12
> +;; values (in range -2048 to 2047). Avoid materialized the const and fuse
> +;; into the add (with an additional add for 2nd value). Makes a 3 insn
> +;; sequence into 2 insn.
> +
> +(define_insn_and_split "*add<mode>3_const_sum_of_two_s12"
> +  [(set (match_operand:P	 0 "register_operand" "=r,r")
> +	(plus:P (match_operand:P 1 "register_operand" " r,r")
> +		(match_operand:P 2 "const_two_s12"    " MiG,r")))]
> +  "!riscv_reg_frame_related (operands[0])"
So that !riscv_reg_frame_related is my only concern with this patch. 
It's a destination, so it *may* be OK.

If it were a source operand, then we'd have to worry about cases where 
it was a pseudo with the same value as sp/fp/argp and subsequent copy 
propagation replacing the pseudo with sp/fp/argp causing the insn to no 
longer match.

Similarly if it were a source operand we'd have to worry about cases 
where the pseudo had a registered (or discoverable) equivalence to 
sp/fp/argp plus an offset.  IRA/LRA can replace the use with its 
equivalence in some of those cases which would have potentially caused 
headaches.

But as a destination we really just have to worry about generation in 
the prologue/epilogue and for alloca calls.  Those should be the only 
places that set one of those special registers.  They're constrained 
enough that I think we'll be OK.

I'm very slightly worried about hard register cprop, but I think it 
should be safe these days WRT those special registers in the unlikely 
event it found an opportunity to propagate them.

So a tentative OK.  If we find this tidibit is problematical in the 
future, then what I would suggest is we allow those special registers 
and dial-back the aggressiveness on the range of allowed constants. 
That would allow the first instruction in the sequence to never create a 
mis-aligned sp.  But again, that's only if we need to revisit.

Please wait for CI to report back sane results :-)

Jeff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-13 18:49 ` [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] Vineet Gupta
@ 2024-05-13 20:28   ` Jeff Law
  2024-05-14  0:54     ` Patrick O'Neill
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Law @ 2024-05-13 20:28 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain



On 5/13/24 12:49 PM, Vineet Gupta wrote:
> If the constant used for stack offset can be expressed as sum of two S12
> values, the constant need not be materialized (in a reg) and instead the
> two S12 bits can be added to instructions involved with frame pointer.
> This avoids burning a register and more importantly can often get down
> to be 2 insn vs. 3.
> 
> The prev patches to generally avoid LUI based const materialization didn't
> fix this PR and need this directed fix in funcion prologue/epilogue
> expansion.
> 
> This fix doesn't move the neddle for SPEC, at all, but it is still a
> win considering gcc generates one insn fewer than llvm for the test ;-)
> 
>     gcc-13.1 release   |      gcc 230823     |                   |
>                        |    g6619b3d4c15c    |   This patch      |  clang/llvm
> ---------------------------------------------------------------------------------
> li      t0,-4096     | li    t0,-4096      | addi  sp,sp,-2048 | addi sp,sp,-2048
> addi    t0,t0,2016   | addi  t0,t0,2032    | add   sp,sp,-16   | addi sp,sp,-32
> li      a4,4096      | add   sp,sp,t0      | add   a5,sp,a0    | add  a1,sp,16
> add     sp,sp,t0     | addi  a5,sp,-2032   | sb    zero,0(a5)  | add  a0,a0,a1
> li      a5,-4096     | add   a0,a5,a0      | addi  sp,sp,2032  | sb   zero,0(a0)
> addi    a4,a4,-2032  | li    t0, 4096      | addi  sp,sp,32    | addi sp,sp,2032
> add     a4,a4,a5     | sb    zero,2032(a0) | ret               | addi sp,sp,48
> addi    a5,sp,16     | addi  t0,t0,-2032   |                   | ret
> add     a5,a4,a5     | add   sp,sp,t0      |
> add     a0,a5,a0     | ret                 |
> li      t0,4096      |
> sd      a5,8(sp)     |
> sb      zero,2032(a0)|
> addi    t0,t0,-2016  |
> add     sp,sp,t0     |
> ret                  |
> 
> gcc/ChangeLog:
> 	PR target/105733
> 	* config/riscv/riscv.h: New macros for with aligned offsets.
> 	* config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New
> 	function to split a sum of two s12 values into constituents.
> 	(riscv_expand_prologue): Handle offset being sum of two S12.
> 	(riscv_expand_epilogue): Ditto.
> 	* config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New.
> 
> gcc/testsuite/ChangeLog:
> 	* gcc.target/riscv/pr105733.c: New Test.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not
> 	expect LUI 4096.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto.
> 


> @@ -8074,14 +8111,26 @@ riscv_expand_epilogue (int style)
>   	}
>         else
>   	{
> -	  if (!SMALL_OPERAND (adjust_offset.to_constant ()))
> +	  HOST_WIDE_INT adj_off_value = adjust_offset.to_constant ();
> +	  if (SMALL_OPERAND (adj_off_value))
> +	    {
> +	      adjust = GEN_INT (adj_off_value);
> +	    }
> +	  else if (SUM_OF_TWO_S12_ALGN (adj_off_value))
> +	    {
> +	      HOST_WIDE_INT base, off;
> +	      riscv_split_sum_of_two_s12 (adj_off_value, &base, &off);
> +	      insn = gen_add3_insn (stack_pointer_rtx, hard_frame_pointer_rtx,
> +					GEN_INT (base));
> +	      RTX_FRAME_RELATED_P (insn) = 1;
> +	      adjust = GEN_INT (off);
> +	    }
So this was the hunk that we identified internally as causing problems 
with libgomp's testsuite.  We never fully chased it down as this hunk 
didn't seem terribly important performance wise -- we just set it aside. 
  The thing is it looked basically correct to me.  So the failure was 
certainly unexpected, but it was consistent.

So I think the question is whether or not the CI system runs the libgomp 
testsuite, particularly in the rv64 linux configuration.  If it does, 
and it passes, then we're good.  I'm still finding my way around the 
configuration, so I don't know if the CI system Edwin & Patrick have 
built tests libgomp or not.

If it isn't run, then we'll need to do a run to test that.  I'm set up 
here to do that if needed.   I can just drop this version into our 
internal tree, trigger an internal CI run and see if it complains :-)

If it does complain, then we know where to start investigations.




Jeff


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265])
  2024-05-13 18:49 ` [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
  2024-05-13 20:15   ` Jeff Law
@ 2024-05-13 21:13   ` Vineet Gupta
  2024-05-13 22:47     ` Jeff Law
  1 sibling, 1 reply; 19+ messages in thread
From: Vineet Gupta @ 2024-05-13 21:13 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jeff Law, kito.cheng, Palmer Dabbelt, Christoph Müllner,
	Robin Dapp, gnu-toolchain, Greg McGary

On 5/13/24 11:49, Vineet Gupta wrote:
>  500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
>  500.perlbench_r-1 |    740,383,419,739 |   739,280,308,163 |
>  500.perlbench_r-2 |    692,074,638,817 |   691,118,734,547 |
>  502.gcc_r-0       |    190,820,141,435 |   190,857,065,988 |
>  502.gcc_r-1       |    225,747,660,839 |   225,809,444,357 | <- -0.02%
>  502.gcc_r-2       |    220,370,089,641 |   220,406,367,876 | <- -0.03%
>  502.gcc_r-3       |    179,111,460,458 |   179,135,609,723 | <- -0.02%
>  502.gcc_r-4       |    219,301,546,340 |   219,320,416,956 | <- -0.01%
>  503.bwaves_r-0    |    278,733,324,691 |   278,733,323,575 | <- -0.01%
>  503.bwaves_r-1    |    442,397,521,282 |   442,397,519,616 |
>  503.bwaves_r-2    |    344,112,218,206 |   344,112,216,760 |
>  503.bwaves_r-3    |    417,561,469,153 |   417,561,467,597 |
>  505.mcf_r         |    669,319,257,525 |   669,318,763,084 |
>  507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%

The small gcc regression seems like a tooling issue of some sort.
Looking at the topblocks, the insn sequences are exactly the same, only
the counts differ and its not obvious why.
Here's for gcc_r-1.


    > Block 0 @ 0x170ca, 12 insns, 87854493 times, 0.47%:

    00000000000170ca <find_base_term>:
       170ca:    7179                    add    sp,sp,-48
       170cc:    ec26                    sd    s1,24(sp)
       170ce:    e84a                    sd    s2,16(sp)
       170d0:    e44e                    sd    s3,8(sp)
       170d2:    f406                    sd    ra,40(sp)
       170d4:    f022                    sd    s0,32(sp)
       170d6:    84aa                    mv    s1,a0
       170d8:    03200913              li    s2,50
       170dc:    03d00993              li    s3,61
       170e0:    8526                    mv    a0,s1
       170e2:    001cd097              auipc    ra,0x1cd
       170e6:    bac080e7              jalr    -1108(ra) # 1e3c8e
    <ix86_delegitimize_address.lto_priv.0>

    > Block 1 @ 0x706d0a, 3 insns, 274713936 times, 0.37%:
    >  Block 2 @ 0x1e3c8e, 9 insns, 88507109 times, 0.35%:
    ...

    < Block 0 @ 0x170ca, 12 insns, 87869602 times, 0.47%:
    < Block 1 @ 0x706d42, 3 insns, 274608893 times, 0.36%:
    < Block 2 @ 0x1e3c94, 9 insns, 88526354 times, 0.35%:


FWIW, Greg internally has been looking at some of this and found some
issues in the bbv tooling, but I wish all of this was  shared/upstream
(QEMU bbv plugin) for people to compare notes and not discover/fix the
same issues over and again.

Thx,
-Vineet

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265])
  2024-05-13 21:13   ` Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265]) Vineet Gupta
@ 2024-05-13 22:47     ` Jeff Law
  2024-05-13 23:08       ` Vineet Gupta
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Law @ 2024-05-13 22:47 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain, Greg McGary



On 5/13/24 3:13 PM, Vineet Gupta wrote:
> On 5/13/24 11:49, Vineet Gupta wrote:
>>   500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
>>   500.perlbench_r-1 |    740,383,419,739 |   739,280,308,163 |
>>   500.perlbench_r-2 |    692,074,638,817 |   691,118,734,547 |
>>   502.gcc_r-0       |    190,820,141,435 |   190,857,065,988 |
>>   502.gcc_r-1       |    225,747,660,839 |   225,809,444,357 | <- -0.02%
>>   502.gcc_r-2       |    220,370,089,641 |   220,406,367,876 | <- -0.03%
>>   502.gcc_r-3       |    179,111,460,458 |   179,135,609,723 | <- -0.02%
>>   502.gcc_r-4       |    219,301,546,340 |   219,320,416,956 | <- -0.01%
>>   503.bwaves_r-0    |    278,733,324,691 |   278,733,323,575 | <- -0.01%
>>   503.bwaves_r-1    |    442,397,521,282 |   442,397,519,616 |
>>   503.bwaves_r-2    |    344,112,218,206 |   344,112,216,760 |
>>   503.bwaves_r-3    |    417,561,469,153 |   417,561,467,597 |
>>   505.mcf_r         |    669,319,257,525 |   669,318,763,084 |
>>   507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
> 
> The small gcc regression seems like a tooling issue of some sort.
> Looking at the topblocks, the insn sequences are exactly the same, only
> the counts differ and its not obvious why.
> Here's for gcc_r-1.
> 
> 
>      > Block 0 @ 0x170ca, 12 insns, 87854493 times, 0.47%:
> 
>      00000000000170ca <find_base_term>:
>         170ca:    7179                    add    sp,sp,-48
>         170cc:    ec26                    sd    s1,24(sp)
>         170ce:    e84a                    sd    s2,16(sp)
>         170d0:    e44e                    sd    s3,8(sp)
>         170d2:    f406                    sd    ra,40(sp)
>         170d4:    f022                    sd    s0,32(sp)
>         170d6:    84aa                    mv    s1,a0
>         170d8:    03200913              li    s2,50
>         170dc:    03d00993              li    s3,61
>         170e0:    8526                    mv    a0,s1
>         170e2:    001cd097              auipc    ra,0x1cd
>         170e6:    bac080e7              jalr    -1108(ra) # 1e3c8e
>      <ix86_delegitimize_address.lto_priv.0>
> 
>      > Block 1 @ 0x706d0a, 3 insns, 274713936 times, 0.37%:
>      >  Block 2 @ 0x1e3c8e, 9 insns, 88507109 times, 0.35%:
>      ...
> 
>      < Block 0 @ 0x170ca, 12 insns, 87869602 times, 0.47%:
>      < Block 1 @ 0x706d42, 3 insns, 274608893 times, 0.36%:
>      < Block 2 @ 0x1e3c94, 9 insns, 88526354 times, 0.35%:
> 
> 
> FWIW, Greg internally has been looking at some of this and found some
> issues in the bbv tooling, but I wish all of this was  shared/upstream
> (QEMU bbv plugin) for people to compare notes and not discover/fix the
> same issues over and again.
Yea, we all meant to coordinate on those plugins.  The one we've got had 
some problems with hash collisions and when there's a hash collision it 
just produces total junk data.  I chased a few of these down and fixed 
them about a year ago.

The other thing is qemu will split up blocks based on its internal 
notion of a translation page.   So if you're looking at block level data 
you'll stumble over that as well.  This aspect is the most troublesome 
problem I'm aware of right now.





Jeff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265])
  2024-05-13 22:47     ` Jeff Law
@ 2024-05-13 23:08       ` Vineet Gupta
  2024-05-14 22:12         ` Palmer Dabbelt
  0 siblings, 1 reply; 19+ messages in thread
From: Vineet Gupta @ 2024-05-13 23:08 UTC (permalink / raw)
  To: Jeff Law, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain, Greg McGary



On 5/13/24 15:47, Jeff Law wrote:
>> On 5/13/24 11:49, Vineet Gupta wrote:
>>>   500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
>>>   500.perlbench_r-1 |    740,383,419,739 |   739,280,308,163 |
>>>   500.perlbench_r-2 |    692,074,638,817 |   691,118,734,547 |
>>>   502.gcc_r-0       |    190,820,141,435 |   190,857,065,988 |
>>>   502.gcc_r-1       |    225,747,660,839 |   225,809,444,357 | <- -0.02%
>>>   502.gcc_r-2       |    220,370,089,641 |   220,406,367,876 | <- -0.03%
>>>   502.gcc_r-3       |    179,111,460,458 |   179,135,609,723 | <- -0.02%
>>>   502.gcc_r-4       |    219,301,546,340 |   219,320,416,956 | <- -0.01%
>>>   503.bwaves_r-0    |    278,733,324,691 |   278,733,323,575 | <- -0.01%
>>>   503.bwaves_r-1    |    442,397,521,282 |   442,397,519,616 |
>>>   503.bwaves_r-2    |    344,112,218,206 |   344,112,216,760 |
>>>   503.bwaves_r-3    |    417,561,469,153 |   417,561,467,597 |
>>>   505.mcf_r         |    669,319,257,525 |   669,318,763,084 |
>>>   507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
>> The small gcc regression seems like a tooling issue of some sort.
>> Looking at the topblocks, the insn sequences are exactly the same, only
>> the counts differ and its not obvious why.
>> Here's for gcc_r-1.
>>
>>
>>      > Block 0 @ 0x170ca, 12 insns, 87854493 times, 0.47%:
>>
>>      00000000000170ca <find_base_term>:
>>         170ca:    7179                    add    sp,sp,-48
>>         170cc:    ec26                    sd    s1,24(sp)
>>         170ce:    e84a                    sd    s2,16(sp)
>>         170d0:    e44e                    sd    s3,8(sp)
>>         170d2:    f406                    sd    ra,40(sp)
>>         170d4:    f022                    sd    s0,32(sp)
>>         170d6:    84aa                    mv    s1,a0
>>         170d8:    03200913              li    s2,50
>>         170dc:    03d00993              li    s3,61
>>         170e0:    8526                    mv    a0,s1
>>         170e2:    001cd097              auipc    ra,0x1cd
>>         170e6:    bac080e7              jalr    -1108(ra) # 1e3c8e
>>      <ix86_delegitimize_address.lto_priv.0>
>>
>>      > Block 1 @ 0x706d0a, 3 insns, 274713936 times, 0.37%:
>>      >  Block 2 @ 0x1e3c8e, 9 insns, 88507109 times, 0.35%:
>>      ...
>>
>>      < Block 0 @ 0x170ca, 12 insns, 87869602 times, 0.47%:
>>      < Block 1 @ 0x706d42, 3 insns, 274608893 times, 0.36%:
>>      < Block 2 @ 0x1e3c94, 9 insns, 88526354 times, 0.35%:
>>
>>
>> FWIW, Greg internally has been looking at some of this and found some
>> issues in the bbv tooling, but I wish all of this was  shared/upstream
>> (QEMU bbv plugin) for people to compare notes and not discover/fix the
>> same issues over and again.
> Yea, we all meant to coordinate on those plugins.  The one we've got had 
> some problems with hash collisions and when there's a hash collision it 
> just produces total junk data.  I chased a few of these down and fixed 
> them about a year ago.
>
> The other thing is qemu will split up blocks based on its internal 
> notion of a translation page.   So if you're looking at block level data 
> you'll stumble over that as well.  This aspect is the most troublesome 
> problem I'm aware of right now.

And these two are exactly what Greg fixed, among others :-)

-Vineet

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-13 20:28   ` Jeff Law
@ 2024-05-14  0:54     ` Patrick O'Neill
  2024-05-14  3:36       ` Jeff Law
  0 siblings, 1 reply; 19+ messages in thread
From: Patrick O'Neill @ 2024-05-14  0:54 UTC (permalink / raw)
  To: Jeff Law, Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain


On 5/13/24 13:28, Jeff Law wrote:
>
>
> On 5/13/24 12:49 PM, Vineet Gupta wrote:
>> If the constant used for stack offset can be expressed as sum of two S12
>> values, the constant need not be materialized (in a reg) and instead the
>> two S12 bits can be added to instructions involved with frame pointer.
>> This avoids burning a register and more importantly can often get down
>> to be 2 insn vs. 3.
>>
>> The prev patches to generally avoid LUI based const materialization 
>> didn't
>> fix this PR and need this directed fix in funcion prologue/epilogue
>> expansion.
>>
>> This fix doesn't move the neddle for SPEC, at all, but it is still a
>> win considering gcc generates one insn fewer than llvm for the test ;-)
>>
>>     gcc-13.1 release   |      gcc 230823     | |
>>                        |    g6619b3d4c15c    |   This patch |  
>> clang/llvm
>> --------------------------------------------------------------------------------- 
>>
>> li      t0,-4096     | li    t0,-4096      | addi  sp,sp,-2048 | addi 
>> sp,sp,-2048
>> addi    t0,t0,2016   | addi  t0,t0,2032    | add   sp,sp,-16   | addi 
>> sp,sp,-32
>> li      a4,4096      | add   sp,sp,t0      | add   a5,sp,a0    | add  
>> a1,sp,16
>> add     sp,sp,t0     | addi  a5,sp,-2032   | sb    zero,0(a5)  | add  
>> a0,a0,a1
>> li      a5,-4096     | add   a0,a5,a0      | addi  sp,sp,2032  | sb   
>> zero,0(a0)
>> addi    a4,a4,-2032  | li    t0, 4096      | addi  sp,sp,32    | addi 
>> sp,sp,2032
>> add     a4,a4,a5     | sb    zero,2032(a0) | ret               | addi 
>> sp,sp,48
>> addi    a5,sp,16     | addi  t0,t0,-2032   |                   | ret
>> add     a5,a4,a5     | add   sp,sp,t0      |
>> add     a0,a5,a0     | ret                 |
>> li      t0,4096      |
>> sd      a5,8(sp)     |
>> sb      zero,2032(a0)|
>> addi    t0,t0,-2016  |
>> add     sp,sp,t0     |
>> ret                  |
>>
>> gcc/ChangeLog:
>>     PR target/105733
>>     * config/riscv/riscv.h: New macros for with aligned offsets.
>>     * config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New
>>     function to split a sum of two s12 values into constituents.
>>     (riscv_expand_prologue): Handle offset being sum of two S12.
>>     (riscv_expand_epilogue): Ditto.
>>     * config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New.
>>
>> gcc/testsuite/ChangeLog:
>>     * gcc.target/riscv/pr105733.c: New Test.
>>     * gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not
>>     expect LUI 4096.
>>     * gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto.
>>     * gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto.
>>     * gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto.
>>     * gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto.
>>     * gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto.
>>     * gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto.
>>
>
>
>> @@ -8074,14 +8111,26 @@ riscv_expand_epilogue (int style)
>>       }
>>         else
>>       {
>> -      if (!SMALL_OPERAND (adjust_offset.to_constant ()))
>> +      HOST_WIDE_INT adj_off_value = adjust_offset.to_constant ();
>> +      if (SMALL_OPERAND (adj_off_value))
>> +        {
>> +          adjust = GEN_INT (adj_off_value);
>> +        }
>> +      else if (SUM_OF_TWO_S12_ALGN (adj_off_value))
>> +        {
>> +          HOST_WIDE_INT base, off;
>> +          riscv_split_sum_of_two_s12 (adj_off_value, &base, &off);
>> +          insn = gen_add3_insn (stack_pointer_rtx, 
>> hard_frame_pointer_rtx,
>> +                    GEN_INT (base));
>> +          RTX_FRAME_RELATED_P (insn) = 1;
>> +          adjust = GEN_INT (off);
>> +        }
> So this was the hunk that we identified internally as causing problems 
> with libgomp's testsuite.  We never fully chased it down as this hunk 
> didn't seem terribly important performance wise -- we just set it 
> aside.  The thing is it looked basically correct to me.  So the 
> failure was certainly unexpected, but it was consistent.
>
> So I think the question is whether or not the CI system runs the 
> libgomp testsuite, particularly in the rv64 linux configuration. If it 
> does, and it passes, then we're good.  I'm still finding my way around 
> the configuration, so I don't know if the CI system Edwin & Patrick 
> have built tests libgomp or not.

I poked around the .sum files in pre/postcommit and we do run tests like:

PASS: c-c++-common/gomp/affinity-2.c  (test for errors, line 45)

I'm not familar with libgomp so I don't know if that's the same libgomp 
tests you're referring to.

Patrick

>
> If it isn't run, then we'll need to do a run to test that.  I'm set up 
> here to do that if needed.   I can just drop this version into our 
> internal tree, trigger an internal CI run and see if it complains :-)
>
> If it does complain, then we know where to start investigations.
>
>
>
>
> Jeff
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-14  0:54     ` Patrick O'Neill
@ 2024-05-14  3:36       ` Jeff Law
  2024-05-14 14:51         ` Patrick O'Neill
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Law @ 2024-05-14  3:36 UTC (permalink / raw)
  To: Patrick O'Neill, Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain



On 5/13/24 6:54 PM, Patrick O'Neill wrote:
> 
> On 5/13/24 13:28, Jeff Law wrote:
>>
>>
>> On 5/13/24 12:49 PM, Vineet Gupta wrote:
>>> If the constant used for stack offset can be expressed as sum of two S12
>>> values, the constant need not be materialized (in a reg) and instead the
>>> two S12 bits can be added to instructions involved with frame pointer.
>>> This avoids burning a register and more importantly can often get down
>>> to be 2 insn vs. 3.
>>>
>>> The prev patches to generally avoid LUI based const materialization 
>>> didn't
>>> fix this PR and need this directed fix in funcion prologue/epilogue
>>> expansion.
>>>
>>> This fix doesn't move the neddle for SPEC, at all, but it is still a
>>> win considering gcc generates one insn fewer than llvm for the test ;-)
>>>
>>>     gcc-13.1 release   |      gcc 230823     | |
>>>                        |    g6619b3d4c15c    |   This patch | clang/llvm
>>> ---------------------------------------------------------------------------------
>>> li      t0,-4096     | li    t0,-4096      | addi  sp,sp,-2048 | addi 
>>> sp,sp,-2048
>>> addi    t0,t0,2016   | addi  t0,t0,2032    | add   sp,sp,-16   | addi 
>>> sp,sp,-32
>>> li      a4,4096      | add   sp,sp,t0      | add   a5,sp,a0    | add 
>>> a1,sp,16
>>> add     sp,sp,t0     | addi  a5,sp,-2032   | sb    zero,0(a5)  | add 
>>> a0,a0,a1
>>> li      a5,-4096     | add   a0,a5,a0      | addi  sp,sp,2032  | sb 
>>> zero,0(a0)
>>> addi    a4,a4,-2032  | li    t0, 4096      | addi  sp,sp,32    | addi 
>>> sp,sp,2032
>>> add     a4,a4,a5     | sb    zero,2032(a0) | ret               | addi 
>>> sp,sp,48
>>> addi    a5,sp,16     | addi  t0,t0,-2032   |                   | ret
>>> add     a5,a4,a5     | add   sp,sp,t0      |
>>> add     a0,a5,a0     | ret                 |
>>> li      t0,4096      |
>>> sd      a5,8(sp)     |
>>> sb      zero,2032(a0)|
>>> addi    t0,t0,-2016  |
>>> add     sp,sp,t0     |
>>> ret                  |
>>>
>>> gcc/ChangeLog:
>>>     PR target/105733
>>>     * config/riscv/riscv.h: New macros for with aligned offsets.
>>>     * config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New
>>>     function to split a sum of two s12 values into constituents.
>>>     (riscv_expand_prologue): Handle offset being sum of two S12.
>>>     (riscv_expand_epilogue): Ditto.
>>>     * config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New.
>>>
>>> gcc/testsuite/ChangeLog:
>>>     * gcc.target/riscv/pr105733.c: New Test.
>>>     * gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not
>>>     expect LUI 4096.
>>>     * gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto.
>>>     * gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto.
>>>     * gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto.
>>>     * gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto.
>>>     * gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto.
>>>     * gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto.
>>>
>>
>>
>>> @@ -8074,14 +8111,26 @@ riscv_expand_epilogue (int style)
>>>       }
>>>         else
>>>       {
>>> -      if (!SMALL_OPERAND (adjust_offset.to_constant ()))
>>> +      HOST_WIDE_INT adj_off_value = adjust_offset.to_constant ();
>>> +      if (SMALL_OPERAND (adj_off_value))
>>> +        {
>>> +          adjust = GEN_INT (adj_off_value);
>>> +        }
>>> +      else if (SUM_OF_TWO_S12_ALGN (adj_off_value))
>>> +        {
>>> +          HOST_WIDE_INT base, off;
>>> +          riscv_split_sum_of_two_s12 (adj_off_value, &base, &off);
>>> +          insn = gen_add3_insn (stack_pointer_rtx, 
>>> hard_frame_pointer_rtx,
>>> +                    GEN_INT (base));
>>> +          RTX_FRAME_RELATED_P (insn) = 1;
>>> +          adjust = GEN_INT (off);
>>> +        }
>> So this was the hunk that we identified internally as causing problems 
>> with libgomp's testsuite.  We never fully chased it down as this hunk 
>> didn't seem terribly important performance wise -- we just set it 
>> aside.  The thing is it looked basically correct to me.  So the 
>> failure was certainly unexpected, but it was consistent.
>>
>> So I think the question is whether or not the CI system runs the 
>> libgomp testsuite, particularly in the rv64 linux configuration. If it 
>> does, and it passes, then we're good.  I'm still finding my way around 
>> the configuration, so I don't know if the CI system Edwin & Patrick 
>> have built tests libgomp or not.
> 
> I poked around the .sum files in pre/postcommit and we do run tests like:
> 
> PASS: c-c++-common/gomp/affinity-2.c  (test for errors, line 45)
I was able to find the summary info:

> Tests that now fail, but worked before (15 tests):
> libgomp: libgomp.fortran/simd7.f90   -O0  execution test
> libgomp: libgomp.fortran/task2.f90   -O0  execution test
> libgomp: libgomp.fortran/vla2.f90   -O0  execution test
> libgomp: libgomp.fortran/vla3.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
> libgomp: libgomp.fortran/vla3.f90   -O3 -g  execution test
> libgomp: libgomp.fortran/vla4.f90   -O1  execution test
> libgomp: libgomp.fortran/vla4.f90   -O2  execution test
> libgomp: libgomp.fortran/vla4.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
> libgomp: libgomp.fortran/vla4.f90   -O3 -g  execution test
> libgomp: libgomp.fortran/vla4.f90   -Os  execution test
> libgomp: libgomp.fortran/vla5.f90   -O1  execution test
> libgomp: libgomp.fortran/vla5.f90   -O2  execution test
> libgomp: libgomp.fortran/vla5.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
> libgomp: libgomp.fortran/vla5.f90   -O3 -g  execution test
> libgomp: libgomp.fortran/vla5.f90   -Os  execution test

So if you could check on those, it'd be appreciated.

Jeff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-14  3:36       ` Jeff Law
@ 2024-05-14 14:51         ` Patrick O'Neill
  2024-05-14 14:54           ` Jeff Law
  2024-05-14 15:44           ` Jeff Law
  0 siblings, 2 replies; 19+ messages in thread
From: Patrick O'Neill @ 2024-05-14 14:51 UTC (permalink / raw)
  To: Jeff Law, Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain


On 5/13/24 20:36, Jeff Law wrote:
>
>
> On 5/13/24 6:54 PM, Patrick O'Neill wrote:
>>
>> On 5/13/24 13:28, Jeff Law wrote:
>>>
>>>
>>> On 5/13/24 12:49 PM, Vineet Gupta wrote:
>>>> If the constant used for stack offset can be expressed as sum of 
>>>> two S12
>>>> values, the constant need not be materialized (in a reg) and 
>>>> instead the
>>>> two S12 bits can be added to instructions involved with frame pointer.
>>>> This avoids burning a register and more importantly can often get down
>>>> to be 2 insn vs. 3.
>>>>
>>>> The prev patches to generally avoid LUI based const materialization 
>>>> didn't
>>>> fix this PR and need this directed fix in funcion prologue/epilogue
>>>> expansion.
>>>>
>>>> This fix doesn't move the neddle for SPEC, at all, but it is still a
>>>> win considering gcc generates one insn fewer than llvm for the test 
>>>> ;-)
>>>>
>>>>     gcc-13.1 release   |      gcc 230823     | |
>>>>                        |    g6619b3d4c15c    |   This patch | 
>>>> clang/llvm
>>>> --------------------------------------------------------------------------------- 
>>>>
>>>> li      t0,-4096     | li    t0,-4096      | addi sp,sp,-2048 | 
>>>> addi sp,sp,-2048
>>>> addi    t0,t0,2016   | addi  t0,t0,2032    | add sp,sp,-16   | addi 
>>>> sp,sp,-32
>>>> li      a4,4096      | add   sp,sp,t0      | add a5,sp,a0    | add 
>>>> a1,sp,16
>>>> add     sp,sp,t0     | addi  a5,sp,-2032   | sb zero,0(a5)  | add 
>>>> a0,a0,a1
>>>> li      a5,-4096     | add   a0,a5,a0      | addi sp,sp,2032  | sb 
>>>> zero,0(a0)
>>>> addi    a4,a4,-2032  | li    t0, 4096      | addi sp,sp,32    | 
>>>> addi sp,sp,2032
>>>> add     a4,a4,a5     | sb    zero,2032(a0) | ret               | 
>>>> addi sp,sp,48
>>>> addi    a5,sp,16     | addi  t0,t0,-2032 |                   | ret
>>>> add     a5,a4,a5     | add   sp,sp,t0      |
>>>> add     a0,a5,a0     | ret                 |
>>>> li      t0,4096      |
>>>> sd      a5,8(sp)     |
>>>> sb      zero,2032(a0)|
>>>> addi    t0,t0,-2016  |
>>>> add     sp,sp,t0     |
>>>> ret                  |
>>>>
>>>> gcc/ChangeLog:
>>>>     PR target/105733
>>>>     * config/riscv/riscv.h: New macros for with aligned offsets.
>>>>     * config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New
>>>>     function to split a sum of two s12 values into constituents.
>>>>     (riscv_expand_prologue): Handle offset being sum of two S12.
>>>>     (riscv_expand_epilogue): Ditto.
>>>>     * config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>     * gcc.target/riscv/pr105733.c: New Test.
>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not
>>>>     expect LUI 4096.
>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto.
>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto.
>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto.
>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto.
>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto.
>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto.
>>>>
>>>
>>>
>>>> @@ -8074,14 +8111,26 @@ riscv_expand_epilogue (int style)
>>>>       }
>>>>         else
>>>>       {
>>>> -      if (!SMALL_OPERAND (adjust_offset.to_constant ()))
>>>> +      HOST_WIDE_INT adj_off_value = adjust_offset.to_constant ();
>>>> +      if (SMALL_OPERAND (adj_off_value))
>>>> +        {
>>>> +          adjust = GEN_INT (adj_off_value);
>>>> +        }
>>>> +      else if (SUM_OF_TWO_S12_ALGN (adj_off_value))
>>>> +        {
>>>> +          HOST_WIDE_INT base, off;
>>>> +          riscv_split_sum_of_two_s12 (adj_off_value, &base, &off);
>>>> +          insn = gen_add3_insn (stack_pointer_rtx, 
>>>> hard_frame_pointer_rtx,
>>>> +                    GEN_INT (base));
>>>> +          RTX_FRAME_RELATED_P (insn) = 1;
>>>> +          adjust = GEN_INT (off);
>>>> +        }
>>> So this was the hunk that we identified internally as causing 
>>> problems with libgomp's testsuite.  We never fully chased it down as 
>>> this hunk didn't seem terribly important performance wise -- we just 
>>> set it aside.  The thing is it looked basically correct to me.  So 
>>> the failure was certainly unexpected, but it was consistent.
>>>
>>> So I think the question is whether or not the CI system runs the 
>>> libgomp testsuite, particularly in the rv64 linux configuration. If 
>>> it does, and it passes, then we're good. I'm still finding my way 
>>> around the configuration, so I don't know if the CI system Edwin & 
>>> Patrick have built tests libgomp or not.
>>
>> I poked around the .sum files in pre/postcommit and we do run tests 
>> like:
>>
>> PASS: c-c++-common/gomp/affinity-2.c  (test for errors, line 45)
> I was able to find the summary info:
>
>> Tests that now fail, but worked before (15 tests):
>> libgomp: libgomp.fortran/simd7.f90   -O0  execution test
>> libgomp: libgomp.fortran/task2.f90   -O0  execution test
>> libgomp: libgomp.fortran/vla2.f90   -O0  execution test
>> libgomp: libgomp.fortran/vla3.f90   -O3 -fomit-frame-pointer 
>> -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>> libgomp: libgomp.fortran/vla3.f90   -O3 -g  execution test
>> libgomp: libgomp.fortran/vla4.f90   -O1  execution test
>> libgomp: libgomp.fortran/vla4.f90   -O2  execution test
>> libgomp: libgomp.fortran/vla4.f90   -O3 -fomit-frame-pointer 
>> -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>> libgomp: libgomp.fortran/vla4.f90   -O3 -g  execution test
>> libgomp: libgomp.fortran/vla4.f90   -Os  execution test
>> libgomp: libgomp.fortran/vla5.f90   -O1  execution test
>> libgomp: libgomp.fortran/vla5.f90   -O2  execution test
>> libgomp: libgomp.fortran/vla5.f90   -O3 -fomit-frame-pointer 
>> -funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>> libgomp: libgomp.fortran/vla5.f90   -O3 -g  execution test
>> libgomp: libgomp.fortran/vla5.f90   -Os  execution test
>
> So if you could check on those, it'd be appreciated.

I checked rv64gcv linux and those do not currently run in CI.

Patrick


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-14 14:51         ` Patrick O'Neill
@ 2024-05-14 14:54           ` Jeff Law
  2024-05-14 15:44           ` Jeff Law
  1 sibling, 0 replies; 19+ messages in thread
From: Jeff Law @ 2024-05-14 14:54 UTC (permalink / raw)
  To: Patrick O'Neill, Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain



On 5/14/24 8:51 AM, Patrick O'Neill wrote:
> 
> On 5/13/24 20:36, Jeff Law wrote:
>>
>>
>> On 5/13/24 6:54 PM, Patrick O'Neill wrote:
>>>
>>> On 5/13/24 13:28, Jeff Law wrote:
>>>>
>>>>
>>>> On 5/13/24 12:49 PM, Vineet Gupta wrote:
>>>>> If the constant used for stack offset can be expressed as sum of 
>>>>> two S12
>>>>> values, the constant need not be materialized (in a reg) and 
>>>>> instead the
>>>>> two S12 bits can be added to instructions involved with frame pointer.
>>>>> This avoids burning a register and more importantly can often get down
>>>>> to be 2 insn vs. 3.
>>>>>
>>>>> The prev patches to generally avoid LUI based const materialization 
>>>>> didn't
>>>>> fix this PR and need this directed fix in funcion prologue/epilogue
>>>>> expansion.
>>>>>
>>>>> This fix doesn't move the neddle for SPEC, at all, but it is still a
>>>>> win considering gcc generates one insn fewer than llvm for the 
>>>>> test ;-)
>>>>>
>>>>>     gcc-13.1 release   |      gcc 230823     | |
>>>>>                        |    g6619b3d4c15c    |   This patch | 
>>>>> clang/llvm
>>>>> ---------------------------------------------------------------------------------
>>>>> li      t0,-4096     | li    t0,-4096      | addi sp,sp,-2048 | 
>>>>> addi sp,sp,-2048
>>>>> addi    t0,t0,2016   | addi  t0,t0,2032    | add sp,sp,-16   | addi 
>>>>> sp,sp,-32
>>>>> li      a4,4096      | add   sp,sp,t0      | add a5,sp,a0    | add 
>>>>> a1,sp,16
>>>>> add     sp,sp,t0     | addi  a5,sp,-2032   | sb zero,0(a5)  | add 
>>>>> a0,a0,a1
>>>>> li      a5,-4096     | add   a0,a5,a0      | addi sp,sp,2032  | sb 
>>>>> zero,0(a0)
>>>>> addi    a4,a4,-2032  | li    t0, 4096      | addi sp,sp,32    | 
>>>>> addi sp,sp,2032
>>>>> add     a4,a4,a5     | sb    zero,2032(a0) | ret               | 
>>>>> addi sp,sp,48
>>>>> addi    a5,sp,16     | addi  t0,t0,-2032 |                   | ret
>>>>> add     a5,a4,a5     | add   sp,sp,t0      |
>>>>> add     a0,a5,a0     | ret                 |
>>>>> li      t0,4096      |
>>>>> sd      a5,8(sp)     |
>>>>> sb      zero,2032(a0)|
>>>>> addi    t0,t0,-2016  |
>>>>> add     sp,sp,t0     |
>>>>> ret                  |
>>>>>
>>>>> gcc/ChangeLog:
>>>>>     PR target/105733
>>>>>     * config/riscv/riscv.h: New macros for with aligned offsets.
>>>>>     * config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New
>>>>>     function to split a sum of two s12 values into constituents.
>>>>>     (riscv_expand_prologue): Handle offset being sum of two S12.
>>>>>     (riscv_expand_epilogue): Ditto.
>>>>>     * config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>     * gcc.target/riscv/pr105733.c: New Test.
>>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not
>>>>>     expect LUI 4096.
>>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto.
>>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto.
>>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto.
>>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto.
>>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto.
>>>>>     * gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto.
>>>>>
>>>>
>>>>
>>>>> @@ -8074,14 +8111,26 @@ riscv_expand_epilogue (int style)
>>>>>       }
>>>>>         else
>>>>>       {
>>>>> -      if (!SMALL_OPERAND (adjust_offset.to_constant ()))
>>>>> +      HOST_WIDE_INT adj_off_value = adjust_offset.to_constant ();
>>>>> +      if (SMALL_OPERAND (adj_off_value))
>>>>> +        {
>>>>> +          adjust = GEN_INT (adj_off_value);
>>>>> +        }
>>>>> +      else if (SUM_OF_TWO_S12_ALGN (adj_off_value))
>>>>> +        {
>>>>> +          HOST_WIDE_INT base, off;
>>>>> +          riscv_split_sum_of_two_s12 (adj_off_value, &base, &off);
>>>>> +          insn = gen_add3_insn (stack_pointer_rtx, 
>>>>> hard_frame_pointer_rtx,
>>>>> +                    GEN_INT (base));
>>>>> +          RTX_FRAME_RELATED_P (insn) = 1;
>>>>> +          adjust = GEN_INT (off);
>>>>> +        }
>>>> So this was the hunk that we identified internally as causing 
>>>> problems with libgomp's testsuite.  We never fully chased it down as 
>>>> this hunk didn't seem terribly important performance wise -- we just 
>>>> set it aside.  The thing is it looked basically correct to me.  So 
>>>> the failure was certainly unexpected, but it was consistent.
>>>>
>>>> So I think the question is whether or not the CI system runs the 
>>>> libgomp testsuite, particularly in the rv64 linux configuration. If 
>>>> it does, and it passes, then we're good. I'm still finding my way 
>>>> around the configuration, so I don't know if the CI system Edwin & 
>>>> Patrick have built tests libgomp or not.
>>>
>>> I poked around the .sum files in pre/postcommit and we do run tests 
>>> like:
>>>
>>> PASS: c-c++-common/gomp/affinity-2.c  (test for errors, line 45)
>> I was able to find the summary info:
>>
>>> Tests that now fail, but worked before (15 tests):
>>> libgomp: libgomp.fortran/simd7.f90   -O0  execution test
>>> libgomp: libgomp.fortran/task2.f90   -O0  execution test
>>> libgomp: libgomp.fortran/vla2.f90   -O0  execution test
>>> libgomp: libgomp.fortran/vla3.f90   -O3 -fomit-frame-pointer - 
>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>> libgomp: libgomp.fortran/vla3.f90   -O3 -g  execution test
>>> libgomp: libgomp.fortran/vla4.f90   -O1  execution test
>>> libgomp: libgomp.fortran/vla4.f90   -O2  execution test
>>> libgomp: libgomp.fortran/vla4.f90   -O3 -fomit-frame-pointer - 
>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>> libgomp: libgomp.fortran/vla4.f90   -O3 -g  execution test
>>> libgomp: libgomp.fortran/vla4.f90   -Os  execution test
>>> libgomp: libgomp.fortran/vla5.f90   -O1  execution test
>>> libgomp: libgomp.fortran/vla5.f90   -O2  execution test
>>> libgomp: libgomp.fortran/vla5.f90   -O3 -fomit-frame-pointer - 
>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>> libgomp: libgomp.fortran/vla5.f90   -O3 -g  execution test
>>> libgomp: libgomp.fortran/vla5.f90   -Os  execution test
>>
>> So if you could check on those, it'd be appreciated.
> 
> I checked rv64gcv linux and those do not currently run in CI.
OK.  I'll Vineet's patch to an internal branch and force our CI to run 
on it.   Stay tuned for results.

jeff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-14 14:51         ` Patrick O'Neill
  2024-05-14 14:54           ` Jeff Law
@ 2024-05-14 15:44           ` Jeff Law
  2024-05-14 16:36             ` Vineet Gupta
  2024-05-20 16:44             ` epilogue expansion alloca codepath (was Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]) Vineet Gupta
  1 sibling, 2 replies; 19+ messages in thread
From: Jeff Law @ 2024-05-14 15:44 UTC (permalink / raw)
  To: Patrick O'Neill, Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain



On 5/14/24 8:51 AM, Patrick O'Neill wrote:
> 

>> I was able to find the summary info:
>>
>>> Tests that now fail, but worked before (15 tests):
>>> libgomp: libgomp.fortran/simd7.f90   -O0  execution test
>>> libgomp: libgomp.fortran/task2.f90   -O0  execution test
>>> libgomp: libgomp.fortran/vla2.f90   -O0  execution test
>>> libgomp: libgomp.fortran/vla3.f90   -O3 -fomit-frame-pointer - 
>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>> libgomp: libgomp.fortran/vla3.f90   -O3 -g  execution test
>>> libgomp: libgomp.fortran/vla4.f90   -O1  execution test
>>> libgomp: libgomp.fortran/vla4.f90   -O2  execution test
>>> libgomp: libgomp.fortran/vla4.f90   -O3 -fomit-frame-pointer - 
>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>> libgomp: libgomp.fortran/vla4.f90   -O3 -g  execution test
>>> libgomp: libgomp.fortran/vla4.f90   -Os  execution test
>>> libgomp: libgomp.fortran/vla5.f90   -O1  execution test
>>> libgomp: libgomp.fortran/vla5.f90   -O2  execution test
>>> libgomp: libgomp.fortran/vla5.f90   -O3 -fomit-frame-pointer - 
>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>> libgomp: libgomp.fortran/vla5.f90   -O3 -g  execution test
>>> libgomp: libgomp.fortran/vla5.f90   -Os  execution test
>>
>> So if you could check on those, it'd be appreciated.
> 
> I checked rv64gcv linux and those do not currently run in CI.
So just ran with Vineet's patch in our CI system.  His patch is still 
triggering those regressions.  So we need to get that resolved before 
that second patch can go in.

jeff


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-14 15:44           ` Jeff Law
@ 2024-05-14 16:36             ` Vineet Gupta
  2024-05-14 17:06               ` Jeff Law
  2024-05-20 16:44             ` epilogue expansion alloca codepath (was Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]) Vineet Gupta
  1 sibling, 1 reply; 19+ messages in thread
From: Vineet Gupta @ 2024-05-14 16:36 UTC (permalink / raw)
  To: Jeff Law, Patrick O'Neill, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain



On 5/14/24 08:44, Jeff Law wrote:
> On 5/14/24 8:51 AM, Patrick O'Neill wrote:
>>> I was able to find the summary info:
>>>
>>>> Tests that now fail, but worked before (15 tests):
>>>> libgomp: libgomp.fortran/simd7.f90   -O0  execution test
>>>> libgomp: libgomp.fortran/task2.f90   -O0  execution test
>>>> libgomp: libgomp.fortran/vla2.f90   -O0  execution test
>>>> libgomp: libgomp.fortran/vla3.f90   -O3 -fomit-frame-pointer - 
>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>> libgomp: libgomp.fortran/vla3.f90   -O3 -g  execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -O1  execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -O2  execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -O3 -fomit-frame-pointer - 
>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -O3 -g  execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -Os  execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -O1  execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -O2  execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -O3 -fomit-frame-pointer - 
>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -O3 -g  execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -Os  execution test
>>> So if you could check on those, it'd be appreciated.
>> I checked rv64gcv linux and those do not currently run in CI.
> So just ran with Vineet's patch in our CI system.  His patch is still 
> triggering those regressions.  So we need to get that resolved before 
> that second patch can go in.

And just for reproducibility what exact --with-arch build is this from ?

-Vineet

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-05-14 16:36             ` Vineet Gupta
@ 2024-05-14 17:06               ` Jeff Law
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff Law @ 2024-05-14 17:06 UTC (permalink / raw)
  To: Vineet Gupta, Patrick O'Neill, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain



On 5/14/24 10:36 AM, Vineet Gupta wrote:
> 
> 
> On 5/14/24 08:44, Jeff Law wrote:
>> On 5/14/24 8:51 AM, Patrick O'Neill wrote:
>>>> I was able to find the summary info:
>>>>
>>>>> Tests that now fail, but worked before (15 tests):
>>>>> libgomp: libgomp.fortran/simd7.f90   -O0  execution test
>>>>> libgomp: libgomp.fortran/task2.f90   -O0  execution test
>>>>> libgomp: libgomp.fortran/vla2.f90   -O0  execution test
>>>>> libgomp: libgomp.fortran/vla3.f90   -O3 -fomit-frame-pointer -
>>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>>> libgomp: libgomp.fortran/vla3.f90   -O3 -g  execution test
>>>>> libgomp: libgomp.fortran/vla4.f90   -O1  execution test
>>>>> libgomp: libgomp.fortran/vla4.f90   -O2  execution test
>>>>> libgomp: libgomp.fortran/vla4.f90   -O3 -fomit-frame-pointer -
>>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>>> libgomp: libgomp.fortran/vla4.f90   -O3 -g  execution test
>>>>> libgomp: libgomp.fortran/vla4.f90   -Os  execution test
>>>>> libgomp: libgomp.fortran/vla5.f90   -O1  execution test
>>>>> libgomp: libgomp.fortran/vla5.f90   -O2  execution test
>>>>> libgomp: libgomp.fortran/vla5.f90   -O3 -fomit-frame-pointer -
>>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>>> libgomp: libgomp.fortran/vla5.f90   -O3 -g  execution test
>>>>> libgomp: libgomp.fortran/vla5.f90   -Os  execution test
>>>> So if you could check on those, it'd be appreciated.
>>> I checked rv64gcv linux and those do not currently run in CI.
>> So just ran with Vineet's patch in our CI system.  His patch is still
>> triggering those regressions.  So we need to get that resolved before
>> that second patch can go in.
> 
> And just for reproducibility what exact --with-arch build is this from ?
This run was with "--with-arch=rv64gc_zba_zbb_zbc_zbkb_zbs_zfa_zicond"

I think we likely saw it without zbkb & zfa when we first looked at this 
a few months back.

jeff


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [COMMITTED] RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  2024-05-13 20:15   ` Jeff Law
@ 2024-05-14 18:38     ` Vineet Gupta
  0 siblings, 0 replies; 19+ messages in thread
From: Vineet Gupta @ 2024-05-14 18:38 UTC (permalink / raw)
  To: gcc-patches; +Cc: Vineet Gupta, Edwin Lu

... if the constant can be represented as sum of two S12 values.
The two S12 values could instead be fused with subsequent ADD insn.
The helps
 - avoid an additional LUI insn
 - side benefits of not clobbering a reg

e.g.
                            w/o patch             w/ patch
long                  |                     |
plus(unsigned long i) |	li	a5,4096     |
{                     |	addi	a5,a5,-2032 | addi a0, a0, 2047
   return i + 2064;   |	add	a0,a0,a5    | addi a0, a0, 17
}                     | 	ret         | ret

NOTE: In theory not having const in a standalone reg might seem less
      CSE friendly, but for workloads in consideration these mat are
      from very late LRA reloads and follow on GCSE is not doing much
      currently.

The real benefit however is seen in base+offset computation for array
accesses and especially for stack accesses which are finalized late in
optim pipeline, during LRA register allocation. Often the finalized
offsets trigger LRA reloads resulting in mind boggling repetition of
exact same insn sequence including LUI based constant materialization.

This shaves off 290 billion dynamic instrustions (QEMU icounts) in
SPEC 2017 Cactu benchmark which is over 10% of workload. In the rest of
suite, there additional 10 billion shaved, with both gains and losses
in indiv workloads as is usual with compiler changes.

 500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
 500.perlbench_r-1 |    740,383,419,739 |   739,280,308,163 |
 500.perlbench_r-2 |    692,074,638,817 |   691,118,734,547 |
 502.gcc_r-0       |    190,820,141,435 |   190,857,065,988 |
 502.gcc_r-1       |    225,747,660,839 |   225,809,444,357 | <- -0.02%
 502.gcc_r-2       |    220,370,089,641 |   220,406,367,876 | <- -0.03%
 502.gcc_r-3       |    179,111,460,458 |   179,135,609,723 | <- -0.02%
 502.gcc_r-4       |    219,301,546,340 |   219,320,416,956 | <- -0.01%
 503.bwaves_r-0    |    278,733,324,691 |   278,733,323,575 | <- -0.01%
 503.bwaves_r-1    |    442,397,521,282 |   442,397,519,616 |
 503.bwaves_r-2    |    344,112,218,206 |   344,112,216,760 |
 503.bwaves_r-3    |    417,561,469,153 |   417,561,467,597 |
 505.mcf_r         |    669,319,257,525 |   669,318,763,084 |
 507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
 508.namd_r        |  1,855,884,342,110 | 1,855,881,110,934 |
 510.parest_r      |  1,654,525,521,053 | 1,654,402,859,174 |
 511.povray_r      |  2,990,146,655,619 | 2,990,060,324,589 |
 519.lbm_r         |  1,158,337,294,525 | 1,158,337,294,529 |
 520.omnetpp_r     |  1,021,765,791,283 | 1,026,165,661,394 |
 521.wrf_r         |  1,715,955,652,503 | 1,714,352,737,385 |
 523.xalancbmk_r   |    849,846,008,075 |   849,836,851,752 |
 525.x264_r-0      |    277,801,762,763 |   277,488,776,427 |
 525.x264_r-1      |    927,281,789,540 |   926,751,516,742 |
 525.x264_r-2      |    915,352,631,375 |   914,667,785,953 |
 526.blender_r     |  1,652,839,180,887 | 1,653,260,825,512 |
 527.cam4_r        |  1,487,053,494,925 | 1,484,526,670,770 |
 531.deepsjeng_r   |  1,641,969,526,837 | 1,642,126,598,866 |
 538.imagick_r     |  2,098,016,546,691 | 2,097,997,929,125 |
 541.leela_r       |  1,983,557,323,877 | 1,983,531,314,526 |
 544.nab_r         |  1,516,061,611,233 | 1,516,061,407,715 |
 548.exchange2_r   |  2,072,594,330,215 | 2,072,591,648,318 |
 549.fotonik3d_r   |  1,001,499,307,366 | 1,001,478,944,189 |
 554.roms_r        |  1,028,799,739,111 | 1,028,780,904,061 |
 557.xz_r-0        |    363,827,039,684 |   363,057,014,260 |
 557.xz_r-1        |    906,649,112,601 |   905,928,888,732 |
 557.xz_r-2        |    509,023,898,187 |   508,140,356,932 |
 997.specrand_fr   |        402,535,577 |       403,052,561 |
 999.specrand_ir   |        402,535,577 |       403,052,561 |

This should still be considered damage control as the real/deeper fix
would be to reduce number of LRA reloads or CSE/anchor those during
LRA constraint sub-pass (re)runs (thats a different PR/114729.

Implementation Details (for posterity)
--------------------------------------
 - basic idea is to have a splitter selected via a new predicate for constant
   being possible sum of two S12 and provide the transform.
   This is however a 2 -> 2 transform which combine can't handle.
   So we specify it using a define_insn_and_split.

 - the initial loose "i" constraint caused LRA to accept invalid insns thus
   needing a tighter new constraint as well.

 - An additional fallback alternate with catch-all "r" register
   constraint also needed to allow any "reloads" that LRA might
   require for ADDI with const larger than S12.

Testing
--------
This is testsuite clean (rv64 only).
I'll rely on post-commit CI multlib run for any possible fallout for
other setups such as rv32.

|                                               |         gcc |          g++ |     gfortran |
| rv64imafdc_zba_zbb_zbs_zicond/  lp64d/ medlow |  41 /    17 |    8 /     3 |    7 /     2 |
| rv64imafdc_zba_zbb_zbs_zicond/  lp64d/ medlow |  41 /    17 |    8 /     3 |    7 /     2 |

I also threw this into a buildroot run, it obviously boots Linux to
userspace. bloat-o-meter on glibc and kernel show overall decrease in
staic instruction counts with some minor spot increases.
These are generally in the category of

 - LUI + ADDI are 2 byte each vs. two ADD being 4 byte each.
 - Knock on effects due to inlining changes.
 - Sometimes the slightly shorter 2-insn seq in a mult-exit function
   can cause in-place epilogue duplication (vs. a jump back).
   This is slightly larger but more efficient in execution.
In summary nothing to fret about.

| linux/scripts/bloat-o-meter build-gcc-240131/target/lib/libc.so.6 \
         build-gcc-240131-new-splitter-1-variant/target/lib/libc.so.6
|
| add/remove: 0/0 grow/shrink: 21/49 up/down: 520/-3056 (-2536)
| Function                                     old     new   delta
| getnameinfo                                 2756    2892    +136
...
| tempnam                                      136     144      +8
| padzero                                      276     284      +8
...
| __GI___register_printf_specifier             284     280      -4
| __EI_xdr_array                               468     464      -4
| try_file_lock                                268     260      -8
| pthread_create@GLIBC_2                      3520    3508     -12
| __pthread_create_2_1                        3520    3508     -12
...
| _nss_files_setnetgrent                       932     904     -28
| _nss_dns_gethostbyaddr2_r                   1524    1480     -44
| build_trtable                               3312    3208    -104
| printf_positional                          25000   22580   -2420
| Total: Before=2107999, After=2105463, chg -0.12%

Caveat:
------
Jeff noted during v2 review that the operand0 constraint !riscv_reg_frame_related
could potentially cause issues with hard reg cprop in future. If that
trips things up we will have to loosen the constraint while dialing down
the const range to (-2048 to 2032) as opposed to fll S12 range of
(-2048 to 2047) to keep stack regs aligned.

gcc/ChangeLog:
	* config/riscv/riscv.h: New macros to check for sum of two S12
	range.
	* config/riscv/constraints.md: New constraint.
	* config/riscv/predicates.md: New Predicate.
	* config/riscv/riscv.md: New splitter.
	* config/riscv/riscv.cc (riscv_reg_frame_related): New helper.
	* config/riscv/riscv-protos.h: New helper prototype.

gcc/testsuite/ChangeLog:
	* gcc.target/riscv/sum-of-two-s12-const-1.c: New test: checks
	for new patterns output.
	* gcc.target/riscv/sum-of-two-s12-const-2.c: Ditto.
	* gcc.target/riscv/sum-of-two-s12-const-3.c: New test: should not
	ICE.

Tested-by: Edwin Lu <ewlu@rivosinc.com> # pre-commit-CI #1520
Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
---
 gcc/config/riscv/constraints.md               |  6 +++
 gcc/config/riscv/predicates.md                |  6 +++
 gcc/config/riscv/riscv-protos.h               |  1 +
 gcc/config/riscv/riscv.cc                     | 11 +++++
 gcc/config/riscv/riscv.h                      | 15 +++++++
 gcc/config/riscv/riscv.md                     | 40 +++++++++++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-1.c | 45 +++++++++++++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-2.c | 15 +++++++
 .../gcc.target/riscv/sum-of-two-s12-const-3.c | 22 +++++++++
 9 files changed, 161 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c

diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index a590df545d7d..a9ee346af6f0 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -80,6 +80,12 @@
   (and (match_code "const_int")
        (match_test "LUI_OPERAND (ival)")))
 
+(define_constraint "MiG"
+  "const can be represented as sum of any S12 values."
+  (and (match_code "const_int")
+       (ior (match_test "IN_RANGE (ival,  2048,  4094)")
+	    (match_test "IN_RANGE (ival, -4096, -2049)"))))
+
 (define_constraint "Ds3"
   "@internal
    1, 2 or 3 immediate"
diff --git a/gcc/config/riscv/predicates.md b/gcc/config/riscv/predicates.md
index e7d797d4dbf0..8948fbfc3631 100644
--- a/gcc/config/riscv/predicates.md
+++ b/gcc/config/riscv/predicates.md
@@ -428,6 +428,12 @@
 	return true;
 })
 
+(define_predicate "const_two_s12"
+  (match_code "const_int")
+{
+  return SUM_OF_TWO_S12 (INTVAL (op));
+})
+
 ;; CORE-V Predicates:
 (define_predicate "immediate_register_operand"
   (ior (match_operand 0 "register_operand")
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 255fd6a0de97..5c8a52b78a22 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -165,6 +165,7 @@ extern bool riscv_shamt_matches_mask_p (int, HOST_WIDE_INT);
 extern void riscv_subword_address (rtx, rtx *, rtx *, rtx *, rtx *);
 extern void riscv_lshift_subword (machine_mode, rtx, rtx, rtx *);
 extern enum memmodel riscv_union_memmodels (enum memmodel, enum memmodel);
+extern bool riscv_reg_frame_related (rtx);
 
 /* Routines implemented in riscv-c.cc.  */
 void riscv_cpu_cpp_builtins (cpp_reader *);
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index a1e5a014bedf..4067505270e1 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -7145,6 +7145,17 @@ riscv_can_eliminate (const int from ATTRIBUTE_UNUSED, const int to)
   return (to == HARD_FRAME_POINTER_REGNUM || to == STACK_POINTER_REGNUM);
 }
 
+/* Helper to determine if reg X pertains to stack.  */
+bool
+riscv_reg_frame_related (rtx x)
+{
+  return REG_P (x)
+	 && (REGNO (x) == FRAME_POINTER_REGNUM
+	     || REGNO (x) == HARD_FRAME_POINTER_REGNUM
+	     || REGNO (x) == ARG_POINTER_REGNUM
+	     || REGNO (x) == VIRTUAL_STACK_VARS_REGNUM);
+}
+
 /* Implement INITIAL_ELIMINATION_OFFSET.  FROM is either the frame pointer
    or argument pointer.  TO is either the stack pointer or hard frame
    pointer.  */
diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index 58d0b09bf7d9..0d27c0d378df 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -626,6 +626,21 @@ enum reg_class
   (((VALUE) | ((1UL<<31) - IMM_REACH)) == ((1UL<<31) - IMM_REACH)	\
    || ((VALUE) | ((1UL<<31) - IMM_REACH)) + IMM_REACH == 0)
 
+/* True if a VALUE (constant) can be expressed as sum of two S12 constants
+   (in range -2048 to 2047).
+   Range check logic:
+     from: min S12 + 1 (or -1 depending on what side of zero)
+       to: two times the min S12 value (to max out S12 bits).  */
+
+#define SUM_OF_TWO_S12_N(VALUE)						\
+  (((VALUE) >= (-2048 * 2)) && ((VALUE) <= (-2048 - 1)))
+
+#define SUM_OF_TWO_S12_P(VALUE)						\
+  (((VALUE) >= (2047 + 1)) && ((VALUE) <= (2047 * 2)))
+
+#define SUM_OF_TWO_S12(VALUE)						\
+  (SUM_OF_TWO_S12_N (VALUE) || SUM_OF_TWO_S12_P (VALUE))
+
 /* If this is a single bit mask, then we can load it with bseti.  Special
    handling of SImode 0x80000000 on RV64 is done in riscv_build_integer_1. */
 #define SINGLE_BIT_MASK_OPERAND(VALUE)					\
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index c45b1129b0a0..893040f28541 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -703,6 +703,46 @@
   [(set_attr "type" "arith")
    (set_attr "mode" "DI")])
 
+;; Special case of adding a reg and constant if latter is sum of two S12
+;; values (in range -2048 to 2047). Avoid materialized the const and fuse
+;; into the add (with an additional add for 2nd value). Makes a 3 insn
+;; sequence into 2 insn.
+
+(define_insn_and_split "*add<mode>3_const_sum_of_two_s12"
+  [(set (match_operand:P	 0 "register_operand" "=r,r")
+	(plus:P (match_operand:P 1 "register_operand" " r,r")
+		(match_operand:P 2 "const_two_s12"    " MiG,r")))]
+  "!riscv_reg_frame_related (operands[0])"
+{
+  /* operand matching MiG constraint is always meant to be split.  */
+  if (which_alternative == 0)
+    return "#";
+  else
+    return "add %0,%1,%2";
+}
+  ""
+  [(set (match_dup 0)
+	(plus:P (match_dup 1) (match_dup 3)))
+   (set (match_dup 0)
+	(plus:P (match_dup 0) (match_dup 4)))]
+{
+  int val = INTVAL (operands[2]);
+  if (SUM_OF_TWO_S12_P (val))
+    {
+       operands[3] = GEN_INT (2047);
+       operands[4] = GEN_INT (val - 2047);
+    }
+  else if (SUM_OF_TWO_S12_N (val))
+    {
+       operands[3] = GEN_INT (-2048);
+       operands[4] = GEN_INT (val + 2048);
+    }
+  else
+      gcc_unreachable ();
+}
+  [(set_attr "type" "arith")
+   (set_attr "mode" "<P:MODE>")])
+
 (define_expand "addv<mode>4"
   [(set (match_operand:GPR           0 "register_operand" "=r,r")
 	(plus:GPR (match_operand:GPR 1 "register_operand" " r,r")
diff --git a/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
new file mode 100644
index 000000000000..4d6d135de5f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
@@ -0,0 +1,45 @@
+// TBD: This doesn't quite work for rv32 yet
+/* { dg-do compile } */
+/* { dg-options { -march=rv64gcv -mabi=lp64d } } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-Os" "-Oz" } } */
+
+/* Ensure that gcc doesn't generate standlone li reg, 4096.  */
+long
+plus1(unsigned long i)
+{
+   return i + 2048;
+}
+
+long
+plus2(unsigned long i)
+{
+   return i + 4094;
+}
+
+long
+plus3(unsigned long i)
+{
+   return i + 2064;
+}
+
+/* Ensure that gcc doesn't generate standlone li reg, -4096.  */
+long
+minus1(unsigned long i)
+{
+   return i - 4096;
+}
+
+long
+minus2(unsigned long i)
+{
+   return i - 2049;
+}
+
+long
+minus3(unsigned long i)
+{
+   return i - 2064;
+}
+
+/* { dg-final { scan-assembler-not {li\t[a-x0-9]+,-4096} } } */
+/* { dg-final { scan-assembler-not {li\t[a-x0-9]+,4096} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
new file mode 100644
index 000000000000..9343b43c3106
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
@@ -0,0 +1,15 @@
+/* Reduced from glibc/stdio-common/tempnam.c.
+   Can't have invalid insn in final output:
+      add s0, sp, 2048    */
+
+/* { dg-do compile } */
+/* { dg-options { -march=rv64gcv -mabi=lp64d -O2 } } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "O1" "-Og" "-Os" "-Oz" } } */
+
+int a() {
+  char b[4096];
+  if (a(b))
+    a(b);
+}
+
+/* { dg-final { scan-assembler-not {add\t[a-x0-9]+,sp,2048} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c
new file mode 100644
index 000000000000..5dcab52c2610
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-3.c
@@ -0,0 +1,22 @@
+/* Reduced version of c-c++-common/torture/builtin-convertvector-1.c.  */
+/* This should NOT ICE */
+
+/* { dg-do compile } */
+
+typedef long b __attribute__((vector_size(256 * sizeof(long))));
+typedef double c __attribute__((vector_size(256 * sizeof(double))));
+int d;
+void e(b *f, c *g) { *g = __builtin_convertvector(*f, c); }
+void h() {
+  struct {
+    b i;
+  } j;
+  union {
+    c i;
+    double a[6];
+  } k;
+  e(&j.i, &k.i);
+  if (k.a[d])
+    for (;;)
+      ;
+}
-- 
2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265])
  2024-05-13 23:08       ` Vineet Gupta
@ 2024-05-14 22:12         ` Palmer Dabbelt
  2024-05-14 23:12           ` Vineet Gupta
  0 siblings, 1 reply; 19+ messages in thread
From: Palmer Dabbelt @ 2024-05-14 22:12 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: jeffreyalaw, gcc-patches, Kito Cheng, christoph.muellner,
	Robin Dapp, gnu-toolchain, Greg McGary

On Mon, 13 May 2024 16:08:21 PDT (-0700), Vineet Gupta wrote:
>
>
> On 5/13/24 15:47, Jeff Law wrote:
>>> On 5/13/24 11:49, Vineet Gupta wrote:
>>>>   500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
>>>>   500.perlbench_r-1 |    740,383,419,739 |   739,280,308,163 |
>>>>   500.perlbench_r-2 |    692,074,638,817 |   691,118,734,547 |
>>>>   502.gcc_r-0       |    190,820,141,435 |   190,857,065,988 |
>>>>   502.gcc_r-1       |    225,747,660,839 |   225,809,444,357 | <- -0.02%
>>>>   502.gcc_r-2       |    220,370,089,641 |   220,406,367,876 | <- -0.03%
>>>>   502.gcc_r-3       |    179,111,460,458 |   179,135,609,723 | <- -0.02%
>>>>   502.gcc_r-4       |    219,301,546,340 |   219,320,416,956 | <- -0.01%
>>>>   503.bwaves_r-0    |    278,733,324,691 |   278,733,323,575 | <- -0.01%
>>>>   503.bwaves_r-1    |    442,397,521,282 |   442,397,519,616 |
>>>>   503.bwaves_r-2    |    344,112,218,206 |   344,112,216,760 |
>>>>   503.bwaves_r-3    |    417,561,469,153 |   417,561,467,597 |
>>>>   505.mcf_r         |    669,319,257,525 |   669,318,763,084 |
>>>>   507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
>>> The small gcc regression seems like a tooling issue of some sort.
>>> Looking at the topblocks, the insn sequences are exactly the same, only
>>> the counts differ and its not obvious why.
>>> Here's for gcc_r-1.
>>>
>>>
>>>      > Block 0 @ 0x170ca, 12 insns, 87854493 times, 0.47%:
>>>
>>>      00000000000170ca <find_base_term>:
>>>         170ca:    7179                    add    sp,sp,-48
>>>         170cc:    ec26                    sd    s1,24(sp)
>>>         170ce:    e84a                    sd    s2,16(sp)
>>>         170d0:    e44e                    sd    s3,8(sp)
>>>         170d2:    f406                    sd    ra,40(sp)
>>>         170d4:    f022                    sd    s0,32(sp)
>>>         170d6:    84aa                    mv    s1,a0
>>>         170d8:    03200913              li    s2,50
>>>         170dc:    03d00993              li    s3,61
>>>         170e0:    8526                    mv    a0,s1
>>>         170e2:    001cd097              auipc    ra,0x1cd
>>>         170e6:    bac080e7              jalr    -1108(ra) # 1e3c8e
>>>      <ix86_delegitimize_address.lto_priv.0>
>>>
>>>      > Block 1 @ 0x706d0a, 3 insns, 274713936 times, 0.37%:
>>>      >  Block 2 @ 0x1e3c8e, 9 insns, 88507109 times, 0.35%:
>>>      ...
>>>
>>>      < Block 0 @ 0x170ca, 12 insns, 87869602 times, 0.47%:
>>>      < Block 1 @ 0x706d42, 3 insns, 274608893 times, 0.36%:
>>>      < Block 2 @ 0x1e3c94, 9 insns, 88526354 times, 0.35%:
>>>
>>>
>>> FWIW, Greg internally has been looking at some of this and found some
>>> issues in the bbv tooling, but I wish all of this was  shared/upstream
>>> (QEMU bbv plugin) for people to compare notes and not discover/fix the
>>> same issues over and again.
>> Yea, we all meant to coordinate on those plugins.  The one we've got had
>> some problems with hash collisions and when there's a hash collision it
>> just produces total junk data.  I chased a few of these down and fixed
>> them about a year ago.
>>
>> The other thing is qemu will split up blocks based on its internal
>> notion of a translation page.   So if you're looking at block level data
>> you'll stumble over that as well.  This aspect is the most troublesome
>> problem I'm aware of right now.
>
> And these two are exactly what Greg fixed, among others :-)

IIRC the plan was for Jeff to send his version to the QEMU lists so we 
can talk about it over there.  Do you want us to just send Greg's 
version instead?  It's all based on the same original patch from the 
QEMU lists, just with possibly-different set of fixes.

>
> -Vineet

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265])
  2024-05-14 22:12         ` Palmer Dabbelt
@ 2024-05-14 23:12           ` Vineet Gupta
  0 siblings, 0 replies; 19+ messages in thread
From: Vineet Gupta @ 2024-05-14 23:12 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: jeffreyalaw, gcc-patches, Kito Cheng, christoph.muellner,
	Robin Dapp, gnu-toolchain, Greg McGary



On 5/14/24 15:12, Palmer Dabbelt wrote:
> On Mon, 13 May 2024 16:08:21 PDT (-0700), Vineet Gupta wrote:
>>
>> On 5/13/24 15:47, Jeff Law wrote:
>>>> On 5/13/24 11:49, Vineet Gupta wrote:
>>>>>   500.perlbench_r-0 |  1,214,534,029,025 | 1,212,887,959,387 |
>>>>>   500.perlbench_r-1 |    740,383,419,739 |   739,280,308,163 |
>>>>>   500.perlbench_r-2 |    692,074,638,817 |   691,118,734,547 |
>>>>>   502.gcc_r-0       |    190,820,141,435 |   190,857,065,988 |
>>>>>   502.gcc_r-1       |    225,747,660,839 |   225,809,444,357 | <- -0.02%
>>>>>   502.gcc_r-2       |    220,370,089,641 |   220,406,367,876 | <- -0.03%
>>>>>   502.gcc_r-3       |    179,111,460,458 |   179,135,609,723 | <- -0.02%
>>>>>   502.gcc_r-4       |    219,301,546,340 |   219,320,416,956 | <- -0.01%
>>>>>   503.bwaves_r-0    |    278,733,324,691 |   278,733,323,575 | <- -0.01%
>>>>>   503.bwaves_r-1    |    442,397,521,282 |   442,397,519,616 |
>>>>>   503.bwaves_r-2    |    344,112,218,206 |   344,112,216,760 |
>>>>>   503.bwaves_r-3    |    417,561,469,153 |   417,561,467,597 |
>>>>>   505.mcf_r         |    669,319,257,525 |   669,318,763,084 |
>>>>>   507.cactuBSSN_r   |  2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10%
>>>> The small gcc regression seems like a tooling issue of some sort.
>>>> Looking at the topblocks, the insn sequences are exactly the same, only
>>>> the counts differ and its not obvious why.
>>>> Here's for gcc_r-1.
>>>>
>>>>
>>>>      > Block 0 @ 0x170ca, 12 insns, 87854493 times, 0.47%:
>>>>
>>>>      00000000000170ca <find_base_term>:
>>>>         170ca:    7179                    add    sp,sp,-48
>>>>         170cc:    ec26                    sd    s1,24(sp)
>>>>         170ce:    e84a                    sd    s2,16(sp)
>>>>         170d0:    e44e                    sd    s3,8(sp)
>>>>         170d2:    f406                    sd    ra,40(sp)
>>>>         170d4:    f022                    sd    s0,32(sp)
>>>>         170d6:    84aa                    mv    s1,a0
>>>>         170d8:    03200913              li    s2,50
>>>>         170dc:    03d00993              li    s3,61
>>>>         170e0:    8526                    mv    a0,s1
>>>>         170e2:    001cd097              auipc    ra,0x1cd
>>>>         170e6:    bac080e7              jalr    -1108(ra) # 1e3c8e
>>>>      <ix86_delegitimize_address.lto_priv.0>
>>>>
>>>>      > Block 1 @ 0x706d0a, 3 insns, 274713936 times, 0.37%:
>>>>      >  Block 2 @ 0x1e3c8e, 9 insns, 88507109 times, 0.35%:
>>>>      ...
>>>>
>>>>      < Block 0 @ 0x170ca, 12 insns, 87869602 times, 0.47%:
>>>>      < Block 1 @ 0x706d42, 3 insns, 274608893 times, 0.36%:
>>>>      < Block 2 @ 0x1e3c94, 9 insns, 88526354 times, 0.35%:
>>>>
>>>>
>>>> FWIW, Greg internally has been looking at some of this and found some
>>>> issues in the bbv tooling, but I wish all of this was  shared/upstream
>>>> (QEMU bbv plugin) for people to compare notes and not discover/fix the
>>>> same issues over and again.
>>> Yea, we all meant to coordinate on those plugins.  The one we've got had
>>> some problems with hash collisions and when there's a hash collision it
>>> just produces total junk data.  I chased a few of these down and fixed
>>> them about a year ago.
>>>
>>> The other thing is qemu will split up blocks based on its internal
>>> notion of a translation page.   So if you're looking at block level data
>>> you'll stumble over that as well.  This aspect is the most troublesome
>>> problem I'm aware of right now.
>> And these two are exactly what Greg fixed, among others :-)
> IIRC the plan was for Jeff to send his version to the QEMU lists so we 
> can talk about it over there.  Do you want us to just send Greg's 
> version instead?  It's all based on the same original patch from the 
> QEMU lists, just with possibly-different set of fixes.

FWIW Last year ? I did send out a cleanedup version of plugins but it
seems that got lost in other mayhem.

-Vineet

^ permalink raw reply	[flat|nested] 19+ messages in thread

* epilogue expansion alloca codepath (was Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733])
  2024-05-14 15:44           ` Jeff Law
  2024-05-14 16:36             ` Vineet Gupta
@ 2024-05-20 16:44             ` Vineet Gupta
  1 sibling, 0 replies; 19+ messages in thread
From: Vineet Gupta @ 2024-05-20 16:44 UTC (permalink / raw)
  To: Jeff Law, Patrick O'Neill, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, Christoph Müllner, Robin Dapp,
	gnu-toolchain

On 5/14/24 08:44, Jeff Law wrote:
>>> I was able to find the summary info:
>>>
>>>> Tests that now fail, but worked before (15 tests):
>>>> libgomp: libgomp.fortran/simd7.f90   -O0  execution test
>>>> libgomp: libgomp.fortran/task2.f90   -O0  execution test
>>>> libgomp: libgomp.fortran/vla2.f90   -O0  execution test
>>>> libgomp: libgomp.fortran/vla3.f90   -O3 -fomit-frame-pointer - 
>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>> libgomp: libgomp.fortran/vla3.f90   -O3 -g  execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -O1  execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -O2  execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -O3 -fomit-frame-pointer - 
>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -O3 -g  execution test
>>>> libgomp: libgomp.fortran/vla4.f90   -Os  execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -O1  execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -O2  execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -O3 -fomit-frame-pointer - 
>>>> funroll-loops -fpeel-loops -ftracer -finline-functions execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -O3 -g  execution test
>>>> libgomp: libgomp.fortran/vla5.f90   -Os  execution test
>>> So if you could check on those, it'd be appreciated.
>> I checked rv64gcv linux and those do not currently run in CI.
> So just ran with Vineet's patch in our CI system.  His patch is still 
> triggering those regressions.  So we need to get that resolved before 
> that second patch can go in.

So CI pointed to a lone Fortran execute failure which is very likely
also causing above.

    FAIL: gfortran.dg/PR93963.f90 -O0 execution test

Turns out the alloca codepath in epilogue expansion is simply busted -
I'm surprised that we only see 1 failure in the entire testsuite run
(libgomp run aside).

> -      if (!SMALL_OPERAND (adjust_offset.to_constant ()))
> +      HOST_WIDE_INT adj_off_value = adjust_offset.to_constant ();
> +      if (SMALL_OPERAND (adj_off_value))
> +        {
> +          adjust = GEN_INT (adj_off_value);
> +        }
> +      else if (SUM_OF_TWO_S12_ALGN (adj_off_value))
> +        {
> +          HOST_WIDE_INT base, off;
> +          riscv_split_sum_of_two_s12 (adj_off_value, &base, &off);
> +          insn = gen_add3_insn (stack_pointer_rtx,
hard_frame_pointer_rtx,
> +                    GEN_INT (base));

1. Missing gen_insn( ) here causes the insn to be overwritten by the
subsequent emit_insn below.

2. In sum of two s12 logic first insn is sp = xx + 2048, with the
additional insn expected to be of form
    sp = sp + off corresponding to
       stack_pointer_rtx+ stack_pointer_rtx+ off
    which the following emit_insn () below is not.
...

>        insn = emit_insn ( gen_add3_insn (stack_pointer_rtx,
hard_frame_pointer_rtx,
>                                                                adjust));

3. Yet another issue had to do with which insn should the dwarf be
attached -and the adj needed adjusting :-)
So In v3 I've split this patch into the main prologue/epilogue change
and one from the alloca one - which depending on reviewer guidance
(aesthetics, ugliness, trying to keep uniformity etc ? can be decided upon.

Thx,
-Vineet

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-05-20 16:44 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-13 18:49 [PATCH v2 0/2] RISC-V improve stack/array access by constant mat tweak Vineet Gupta
2024-05-13 18:49 ` [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
2024-05-13 20:15   ` Jeff Law
2024-05-14 18:38     ` [COMMITTED] " Vineet Gupta
2024-05-13 21:13   ` Follow up #1 (was Re: [PATCH v2 1/2] RISC-V: avoid LUI based const materialization ... [part of PR/106265]) Vineet Gupta
2024-05-13 22:47     ` Jeff Law
2024-05-13 23:08       ` Vineet Gupta
2024-05-14 22:12         ` Palmer Dabbelt
2024-05-14 23:12           ` Vineet Gupta
2024-05-13 18:49 ` [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] Vineet Gupta
2024-05-13 20:28   ` Jeff Law
2024-05-14  0:54     ` Patrick O'Neill
2024-05-14  3:36       ` Jeff Law
2024-05-14 14:51         ` Patrick O'Neill
2024-05-14 14:54           ` Jeff Law
2024-05-14 15:44           ` Jeff Law
2024-05-14 16:36             ` Vineet Gupta
2024-05-14 17:06               ` Jeff Law
2024-05-20 16:44             ` epilogue expansion alloca codepath (was Re: [PATCH v2 2/2] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]) Vineet Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).