public inbox for gcc-patches@gcc.gnu.org
* [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak
@ 2024-03-16 17:35 Vineet Gupta
  2024-03-16 17:35 ` [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Vineet Gupta @ 2024-03-16 17:35 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jeff Law, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp,
	Vineet Gupta

Hi,

This set of patches (for gcc-15) helps improve stack/array accesses
by improving constant materialization. Details are in the respective
patches.

The first patch is the main change, which improves SPEC Cactu by 10%.

There is at least one more pending change along the same theme
(riscv_add_offset/riscv_legitimize_address), but that one is still
misbehaving and is currently being debugged.

Thx,
-Vineet

Vineet Gupta (3):
  RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  RISC-V: avoid LUI based const mat: keep stack offsets aligned
  RISC-V: avoid LUI based const mat in prologue/epilogue expansion
    [PR/105733]

 gcc/config/riscv/constraints.md               | 12 +++
 gcc/config/riscv/predicates.md                | 12 +++
 gcc/config/riscv/riscv-protos.h               |  3 +
 gcc/config/riscv/riscv.cc                     | 91 +++++++++++++++++--
 gcc/config/riscv/riscv.h                      | 20 ++++
 gcc/config/riscv/riscv.md                     | 65 +++++++++++++
 gcc/testsuite/gcc.target/riscv/pr105733.c     | 15 +++
 .../riscv/rvv/autovec/vls/spill-1.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-2.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-3.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-4.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-5.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-6.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-7.c           |  4 +-
 .../gcc.target/riscv/sum-of-two-s12-const-1.c | 45 +++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-2.c | 22 +++++
 16 files changed, 292 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/pr105733.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  2024-03-16 17:35 [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Vineet Gupta
@ 2024-03-16 17:35 ` Vineet Gupta
  2024-03-16 20:28   ` Jeff Law
  2024-03-16 17:35 ` [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned Vineet Gupta
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Vineet Gupta @ 2024-03-16 17:35 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jeff Law, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp,
	Vineet Gupta

... if the constant can be represented as a sum of two S12 values.
The two S12 values can instead be fused with the subsequent ADD insn.
This helps:
 - avoid an additional LUI insn
 - avoid clobbering a reg (a side benefit)

e.g.
                            w/o patch             w/ patch
long                  |                     |
plus(unsigned long i) |	li	a5,4096     |
{                     |	addi	a5,a5,-2032 | addi a0, a0, 2047
   return i + 2064;   |	add	a0,a0,a5    | addi a0, a0, 17
}                     | 	ret         | ret

NOTE: In theory not having the const in a standalone reg might seem less
      CSE friendly, but for the workloads under consideration these
      materializations come from very late LRA reloads, and the follow-on
      GCSE is currently not doing much.
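The value ranges the patch targets can be sketched in plain C (the helper names below are illustrative, not the in-tree API; the ranges mirror the SUM_OF_TWO_S12_* macros the patch adds):

```c
#include <assert.h>
#include <stdbool.h>

/* A value fits a signed 12-bit immediate: -2048 .. 2047.  */
static bool s12_p (long v)
{
  return v >= -2048 && v <= 2047;
}

/* True if V is expressible as a sum of two S12 immediates but is not
   itself S12 (a single ADDI already handles the plain S12 case).
   Positive range: 2048 .. 4094.  Negative range: -4096 .. -2049.  */
static bool sum_of_two_s12 (long v)
{
  return (v >= 2047 + 1 && v <= 2047 * 2)
	 || (v <= -2048 - 1 && v >= -2048 * 2);
}
```

For example, 2064 (the constant in the asm table above) falls in the positive window, so it can be handled as two back-to-back ADDIs instead of LUI + ADDI + ADD.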

The real benefit, however, is seen in base+offset computation for array
accesses, and especially for stack accesses, which are finalized late in
the optimization pipeline during LRA register allocation. Often the
finalized offsets trigger LRA reloads, resulting in mind-boggling
repetition of the exact same insn sequence, including the LUI based
constant materialization.

This shaves off 290 billion dynamic instructions (QEMU icounts) in the
SPEC 2017 Cactu benchmark, which is over 10% of that workload. In the rest
of the suite an additional 10 billion are shaved, with both gains and
losses in individual workloads, as is usual with compiler changes.

1,214,172,747,858 | 1,212,530,889,930 |  0.14%  <-
  740,618,658,160 |   739,515,555,243 |  0.15%  <-
  692,128,469,436 |   691,175,291,593 |  0.14%  <-
  190,811,495,310 |   190,854,437,311 | -0.02%
  225,751,808,189 |   225,818,881,019 | -0.03%
  220,364,237,915 |   220,405,868,038 | -0.02%
  179,157,361,261 |   179,186,059,187 | -0.02%
  219,313,921,837 |   219,337,232,829 | -0.01%
  278,733,265,382 |   278,733,264,380 |  0.00%
  442,397,437,578 |   442,397,436,026 |  0.00%
  344,112,144,242 |   344,112,142,910 |  0.00%
  417,561,390,319 |   417,561,388,877 |  0.00%
  669,319,255,911 |   669,318,761,114 |  0.00%
2,852,781,330,203 | 2,564,750,360,011 | 10.10% <--
1,855,884,349,001 | 1,855,881,122,280 |  0.00%
1,653,856,904,409 | 1,653,733,205,926 |  0.01%
2,974,347,350,401 | 2,974,174,999,505 |  0.01%
1,158,337,609,995 | 1,158,337,608,826 |  0.00%
1,021,799,191,806 | 1,021,425,634,319 |  0.04%
1,716,448,413,824 | 1,714,912,501,736 |  0.09%
  849,330,283,510 |   849,321,128,484 |  0.00%
  276,556,968,005 |   276,232,659,282 |  0.12%  <-
  913,352,953,686 |   912,822,592,719 |  0.06%
  903,792,550,116 |   903,107,637,917 |  0.08%
1,654,904,617,482 | 1,655,327,175,238 | -0.03%
1,495,841,421,311 | 1,493,418,761,599 |  0.16%  <-
1,641,969,530,523 | 1,642,126,596,979 | -0.01%
2,546,170,307,385 | 2,546,151,690,122 |  0.00%
1,983,551,321,388 | 1,983,525,296,345 |  0.00%
1,516,077,036,742 | 1,516,076,833,481 |  0.00%
2,055,868,055,165 | 2,055,865,373,175 |  0.00%
1,000,621,723,807 | 1,000,601,360,859 |  0.00%
1,024,149,313,694 | 1,024,127,472,779 |  0.00%
  363,827,037,541 |   363,057,012,239 |  0.21%  <-
  906,649,110,459 |   905,928,886,712 |  0.08%
  509,023,896,044 |   508,140,354,911 |  0.17%  <-
      403,238,430 |       403,238,485 |  0.00%
      403,238,430 |       403,238,485 |  0.00%

This should still be considered damage control, as the real/deeper fix
would be to reduce the number of LRA reloads, or to CSE/anchor those
during the LRA constraint sub-pass (re)runs.

Implementation Details (for posterity)
--------------------------------------
 - The basic idea is to have a splitter, selected via a new predicate for
   a constant being a possible sum of two S12 values, provide the transform.
   This is however a 2 -> 2 transform which combine can't handle,
   so we specify it using a define_insn_and_split.

 - The initial loose "i" constraint caused LRA to accept invalid insns,
   thus needing a tighter new constraint as well.

 - An additional fallback alternative with a catch-all "r" register
   constraint is also needed to allow any "reloads" that LRA might
   require for an ADDI with a const larger than S12.
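The split performed by the define_insn_and_split body can be modeled as a small standalone C helper (hypothetical name; the in-tree logic lives in the splitter's preparation statements):

```c
#include <assert.h>

/* Break VAL, assumed to be a sum of two S12 values (2048 .. 4094 or
   -4096 .. -2049), into BASE + OFF, each of which fits one ADDI.
   Positive values peel off 2047 first; negative ones peel off -2048,
   matching the splitter in riscv.md.  */
static void split_sum_of_two_s12 (long val, long *base, long *off)
{
  *base = (val > 0) ? 2047 : -2048;
  *off = val - *base;
}
```

So `i + 2064` becomes `addi a0,a0,2047; addi a0,a0,17`, exactly the sequence in the example above.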

Testing
--------
This is testsuite clean (rv64 only).
I'll rely on the post-commit CI multilib run for any possible fallout in
other setups such as rv32.

|                            |          gcc |          g++ |     gfortran |
| rv64imafdc_zba_zbb_zbs_zicond/  lp64d/ medlow |  181 /    44 |    4 /     1 |   72 /    12 |
| rv64imafdc_zba_zbb_zbs_zicond/  lp64d/ medlow |  181 /    44 |    4 /     1 |   73 /    13 |
|

I also threw this into a buildroot run; it boots Linux to userspace.
bloat-o-meter on glibc and the kernel shows an overall decrease in
static instruction counts with some minor spot increases.
These are generally in the category of

 - LUI + ADDI (compressed forms) are 2 bytes each vs. the two replacement
   ADDs being 4 bytes each.
 - Knock-on effects due to inlining changes.
 - Sometimes the slightly shorter 2-insn seq in a multi-exit function
   can cause in-place epilogue duplication (vs. a jump back).
   This is slightly larger but more efficient in execution.
In summary, nothing to fret about.

| linux/scripts/bloat-o-meter build-gcc-240131/target/lib/libc.so.6 \
         build-gcc-240131-new-splitter-1-variant/target/lib/libc.so.6
|
| add/remove: 0/0 grow/shrink: 21/49 up/down: 520/-3056 (-2536)
| Function                                     old     new   delta
| getnameinfo                                 2756    2892    +136
...
| tempnam                                      136     144      +8
| padzero                                      276     284      +8
...
| __GI___register_printf_specifier             284     280      -4
| __EI_xdr_array                               468     464      -4
| try_file_lock                                268     260      -8
| pthread_create@GLIBC_2                      3520    3508     -12
| __pthread_create_2_1                        3520    3508     -12
...
| _nss_files_setnetgrent                       932     904     -28
| _nss_dns_gethostbyaddr2_r                   1524    1480     -44
| build_trtable                               3312    3208    -104
| printf_positional                          25000   22580   -2420
| Total: Before=2107999, After=2105463, chg -0.12%

gcc/ChangeLog:
	PR target/106265 (partially)
	* config/riscv/riscv.h: New macros to check for sum of two S12
	range.
	* config/riscv/constraints.md: New constraint.
	* config/riscv/predicates.md: New Predicate.
	* config/riscv/riscv.md: New splitter to not materialize
	constant in desired range.

gcc/testsuite/ChangeLog:
	* gcc.target/riscv/sum-of-two-s12-const-1.c: New test: checks for
	new patterns output.
	* gcc.target/riscv/sum-of-two-s12-const-2.c: New test: should not
	ICE.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
---
 gcc/config/riscv/constraints.md               |  6 +++
 gcc/config/riscv/predicates.md                |  6 +++
 gcc/config/riscv/riscv.h                      | 15 +++++++
 gcc/config/riscv/riscv.md                     | 34 ++++++++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-1.c | 45 +++++++++++++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-2.c | 22 +++++++++
 6 files changed, 128 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c

diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index 41acaea04eba..435461180c7e 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -80,6 +80,12 @@
   (and (match_code "const_int")
        (match_test "LUI_OPERAND (ival)")))
 
+(define_constraint "MiG"
+  "const can be represented as sum of any S12 values."
+  (and (match_code "const_int")
+       (ior (match_test "IN_RANGE (ival,  2048,  4094)")
+	    (match_test "IN_RANGE (ival, -4096, -2049)"))))
+
 (define_constraint "Ds3"
   "@internal
    1, 2 or 3 immediate"
diff --git a/gcc/config/riscv/predicates.md b/gcc/config/riscv/predicates.md
index 6c87a7bd1f49..89490339c2da 100644
--- a/gcc/config/riscv/predicates.md
+++ b/gcc/config/riscv/predicates.md
@@ -420,6 +420,12 @@
 	return true;
 })
 
+(define_predicate "const_two_s12"
+  (match_code "const_int")
+{
+  return SUM_OF_TWO_S12 (INTVAL (op));
+})
+
 ;; CORE-V Predicates:
 (define_predicate "immediate_register_operand"
   (ior (match_operand 0 "register_operand")
diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index da089a03e9d1..817661058896 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -624,6 +624,21 @@ enum reg_class
   (((VALUE) | ((1UL<<31) - IMM_REACH)) == ((1UL<<31) - IMM_REACH)	\
    || ((VALUE) | ((1UL<<31) - IMM_REACH)) + IMM_REACH == 0)
 
+/* True if a VALUE (constant) can be expressed as sum of two S12 constants
+   (in range -2048 to 2047).
+   Range check logic:
+     from: min S12 + 1 (or -1 depending on what side of zero)
+       to: two times the min S12 value (to max out S12 bits).  */
+
+#define SUM_OF_TWO_S12_N(VALUE)						\
+  (((VALUE) >= (-2048 * 2)) && ((VALUE) <= (-2048 - 1)))
+
+#define SUM_OF_TWO_S12_P(VALUE)						\
+  (((VALUE) >= ( 2047 + 1)) && ((VALUE) <= ( 2047 * 2)))
+
+#define SUM_OF_TWO_S12(VALUE)						\
+  (SUM_OF_TWO_S12_N (VALUE) || SUM_OF_TWO_S12_P (VALUE))
+
 /* If this is a single bit mask, then we can load it with bseti.  Special
    handling of SImode 0x80000000 on RV64 is done in riscv_build_integer_1. */
 #define SINGLE_BIT_MASK_OPERAND(VALUE)					\
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index b16ed97909c0..79fe861ef91f 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -751,6 +751,40 @@
   [(set_attr "type" "arith")
    (set_attr "mode" "DI")])
 
+;; Special case of adding a reg and a constant if the latter is a sum of two
+;; S12 values (in range -2048 to 2047). Avoid materializing the const and fuse
+;; it into the add (with an additional add for the 2nd value). Turns a 3 insn
+;; sequence into 2 insns.
+
+(define_insn_and_split "*add<mode>3_const_sum_of_two_s12"
+  [(set (match_operand:P	 0 "register_operand" "=r,r")
+	(plus:P (match_operand:P 1 "register_operand" " r,r")
+		(match_operand:P 2 "const_two_s12"    " MiG,r")))]
+  ""
+  "add %0,%1,%2"
+  ""
+  [(set (match_dup 0)
+	(plus:P (match_dup 1) (match_dup 3)))
+   (set (match_dup 0)
+	(plus:P (match_dup 0) (match_dup 4)))]
+{
+  int val = INTVAL (operands[2]);
+  if (SUM_OF_TWO_S12_P (val))
+    {
+       operands[3] = GEN_INT (2047);
+       operands[4] = GEN_INT (val - 2047);
+    }
+  else if (SUM_OF_TWO_S12_N (val))
+    {
+       operands[3] = GEN_INT (-2048);
+       operands[4] = GEN_INT (val + 2048);
+    }
+  else
+      gcc_unreachable ();
+}
+  [(set_attr "type" "arith")
+   (set_attr "mode" "<P:MODE>")])
+
 (define_expand "addv<mode>4"
   [(set (match_operand:GPR           0 "register_operand" "=r,r")
 	(plus:GPR (match_operand:GPR 1 "register_operand" " r,r")
diff --git a/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
new file mode 100644
index 000000000000..4d6d135de5f5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
@@ -0,0 +1,45 @@
+// TBD: This doesn't quite work for rv32 yet
+/* { dg-do compile } */
+/* { dg-options { -march=rv64gcv -mabi=lp64d } } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-Os" "-Oz" } } */
+
+/* Ensure that gcc doesn't generate standalone li reg, 4096.  */
+long
+plus1(unsigned long i)
+{
+   return i + 2048;
+}
+
+long
+plus2(unsigned long i)
+{
+   return i + 4094;
+}
+
+long
+plus3(unsigned long i)
+{
+   return i + 2064;
+}
+
+/* Ensure that gcc doesn't generate standalone li reg, -4096.  */
+long
+minus1(unsigned long i)
+{
+   return i - 4096;
+}
+
+long
+minus2(unsigned long i)
+{
+   return i - 2049;
+}
+
+long
+minus3(unsigned long i)
+{
+   return i - 2064;
+}
+
+/* { dg-final { scan-assembler-not {li\t[a-x0-9]+,-4096} } } */
+/* { dg-final { scan-assembler-not {li\t[a-x0-9]+,4096} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
new file mode 100644
index 000000000000..5dcab52c2610
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
@@ -0,0 +1,22 @@
+/* Reduced version of c-c++-common/torture/builtin-convertvector-1.c.  */
+/* This should NOT ICE */
+
+/* { dg-do compile } */
+
+typedef long b __attribute__((vector_size(256 * sizeof(long))));
+typedef double c __attribute__((vector_size(256 * sizeof(double))));
+int d;
+void e(b *f, c *g) { *g = __builtin_convertvector(*f, c); }
+void h() {
+  struct {
+    b i;
+  } j;
+  union {
+    c i;
+    double a[6];
+  } k;
+  e(&j.i, &k.i);
+  if (k.a[d])
+    for (;;)
+      ;
+}
-- 
2.34.1



* [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-16 17:35 [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Vineet Gupta
  2024-03-16 17:35 ` [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
@ 2024-03-16 17:35 ` Vineet Gupta
  2024-03-16 20:21   ` Jeff Law
  2024-03-16 17:35 ` [gcc-15 3/3] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] Vineet Gupta
  2024-03-19  4:41 ` [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Jeff Law
  3 siblings, 1 reply; 28+ messages in thread
From: Vineet Gupta @ 2024-03-16 17:35 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jeff Law, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp,
	Vineet Gupta

Noticed that the new sum-of-two-S12 splitter was generating the following:

| 059930 <tempnam>:
|   59930:	add	sp,sp,-16
|   59934:	lui	t0,0xfffff
|   59938:	sd	s0,0(sp)
|   5993c:	sd	ra,8(sp)
|   59940:	add	sp,sp,t0
|   59944:	add	s0,sp,2047  <----
|   59948:	mv	a2,a0
|   5994c:	mv	a3,a1
|   59950:	mv	a0,sp
|   59954:	li	a4,1
|   59958:	lui	a1,0x1
|   5995c:	add	s0,s0,1     <---
|   59960:	jal	59a3c

S0 here becomes unaligned (transitively, via SP), which is undesirable as
well as incorrect:
 - the ABI requires the stack to be 8 byte aligned
 - the asm code looks weird and unexpected
 - to the user it might falsely seem like a compiler bug even when it is not,
   especially when staring at asm while debugging an unrelated issue.

Fix this by using the 2032+addend idiom when handling register operands
related to the stack. This only affects positive S12 values; negative
values are already -2048 based.

Unfortunately the implementation requires making a copy of the splitter,
since it needs different variants of the predicate and constraint, which
can't be done conditionally in the same MD pattern (a constraint with a
restricted range prevents LRA from allowing such an insn despite the new
predicate).

With the patch, we get following correct code instead:

| ..
| 59944:	add	s0,sp,2032
| ..
| 5995c:	add	s0,s0,16
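The aligned variant only changes the positive base: 2032 (a multiple of 16) instead of 2047, so the intermediate value stays stack-aligned. A C sketch of this alternative split (hypothetical name, mirroring the MiA constraint range):

```c
#include <assert.h>

/* Aligned split for stack-related registers: the positive base is
   2032 (16-byte aligned) rather than 2047, so the value after the
   first ADDI keeps the stack alignment.  The negative base, -2048,
   is already aligned and is unchanged from the generic splitter.  */
static void split_sum_of_two_s12_algn (long val, long *base, long *off)
{
  *base = (val > 0) ? 2032 : -2048;
  *off = val - *base;
}
```

With this, the 2064 offset from the tempnam disassembly splits as 2032 + 32 rather than 2047 + 17.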

gcc/ChangeLog:
	* config/riscv/riscv.h: Add alignment arg to new macros.
	* config/riscv/constraints.md: Variant of new constraint.
	* config/riscv/predicates.md: Variant of new predicate.
	* config/riscv/riscv.md: Variant of new splitter which offsets
	off of 2032 (vs. 2047).
	Gate existing splitter behind riscv_reg_frame_related.
	* config/riscv/riscv.cc (riscv_reg_frame_related): New helper to
	conditionalize the existing and new splitters.
	* config/riscv/riscv-protos.h: Add new prototype.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
---
 gcc/config/riscv/constraints.md |  6 ++++++
 gcc/config/riscv/predicates.md  |  8 ++++++-
 gcc/config/riscv/riscv-protos.h |  1 +
 gcc/config/riscv/riscv.cc       | 11 ++++++++++
 gcc/config/riscv/riscv.h        | 15 ++++++++-----
 gcc/config/riscv/riscv.md       | 37 ++++++++++++++++++++++++++++++---
 6 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
index 435461180c7e..a9446c54ee45 100644
--- a/gcc/config/riscv/constraints.md
+++ b/gcc/config/riscv/constraints.md
@@ -86,6 +86,12 @@
        (ior (match_test "IN_RANGE (ival,  2048,  4094)")
 	    (match_test "IN_RANGE (ival, -4096, -2049)"))))
 
+(define_constraint "MiA"
+  "const can be represented as sum of two S12 values with first aligned."
+  (and (match_code "const_int")
+       (ior (match_test "IN_RANGE (ival,  2033,  4064)")
+	    (match_test "IN_RANGE (ival, -4096, -2049)"))))
+
 (define_constraint "Ds3"
   "@internal
    1, 2 or 3 immediate"
diff --git a/gcc/config/riscv/predicates.md b/gcc/config/riscv/predicates.md
index 89490339c2da..56f9919daafa 100644
--- a/gcc/config/riscv/predicates.md
+++ b/gcc/config/riscv/predicates.md
@@ -423,7 +423,13 @@
 (define_predicate "const_two_s12"
   (match_code "const_int")
 {
-  return SUM_OF_TWO_S12 (INTVAL (op));
+  return SUM_OF_TWO_S12 (INTVAL (op), false);
+})
+
+(define_predicate "const_two_s12_algn"
+  (match_code "const_int")
+{
+  return SUM_OF_TWO_S12 (INTVAL (op), true);
 })
 
 ;; CORE-V Predicates:
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index b87355938052..f9e407bf5768 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -164,6 +164,7 @@ extern bool riscv_shamt_matches_mask_p (int, HOST_WIDE_INT);
 extern void riscv_subword_address (rtx, rtx *, rtx *, rtx *, rtx *);
 extern void riscv_lshift_subword (machine_mode, rtx, rtx, rtx *);
 extern enum memmodel riscv_union_memmodels (enum memmodel, enum memmodel);
+extern bool riscv_reg_frame_related (rtx);
 
 /* Routines implemented in riscv-c.cc.  */
 void riscv_cpu_cpp_builtins (cpp_reader *);
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 680c4a728e92..38aebefa2590 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -6667,6 +6667,17 @@ riscv_can_eliminate (const int from ATTRIBUTE_UNUSED, const int to)
   return (to == HARD_FRAME_POINTER_REGNUM || to == STACK_POINTER_REGNUM);
 }
 
+/* Helper to determine if reg X pertains to the stack.  */
+bool
+riscv_reg_frame_related (rtx x)
+{
+  return REG_P (x)
+	 && (REGNO (x) == FRAME_POINTER_REGNUM
+	     || REGNO (x) == HARD_FRAME_POINTER_REGNUM
+	     || REGNO (x) == ARG_POINTER_REGNUM
+	     || REGNO (x) == VIRTUAL_STACK_VARS_REGNUM);
+}
+
 /* Implement INITIAL_ELIMINATION_OFFSET.  FROM is either the frame pointer
    or argument pointer.  TO is either the stack pointer or hard frame
    pointer.  */
diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
index 817661058896..00964ccd81db 100644
--- a/gcc/config/riscv/riscv.h
+++ b/gcc/config/riscv/riscv.h
@@ -633,11 +633,16 @@ enum reg_class
 #define SUM_OF_TWO_S12_N(VALUE)						\
   (((VALUE) >= (-2048 * 2)) && ((VALUE) <= (-2048 - 1)))
 
-#define SUM_OF_TWO_S12_P(VALUE)						\
-  (((VALUE) >= ( 2047 + 1)) && ((VALUE) <= ( 2047 * 2)))
-
-#define SUM_OF_TWO_S12(VALUE)						\
-  (SUM_OF_TWO_S12_N (VALUE) || SUM_OF_TWO_S12_P (VALUE))
+/* Variant with first value 8 byte aligned if involving stack regs.  */
+#define SUM_OF_TWO_S12_P(VALUE, WANT_ALIGN)				\
+  ((WANT_ALIGN)								\
+    ? (((VALUE) >= (2032 + 1)) && ((VALUE) <= (2032 * 2)))		\
+	: ((VALUE >= (2047 + 1)) && ((VALUE) <= (2047 * 2))))
+
+#define SUM_OF_TWO_S12(VALUE, WANT_ALIGN)				\
+  (SUM_OF_TWO_S12_N (VALUE)						\
+   || ((SUM_OF_TWO_S12_P (VALUE, false) && !(WANT_ALIGN))		\
+	|| (SUM_OF_TWO_S12_P (VALUE, true) && (WANT_ALIGN))))
 
 /* If this is a single bit mask, then we can load it with bseti.  Special
    handling of SImode 0x80000000 on RV64 is done in riscv_build_integer_1. */
diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
index 79fe861ef91f..cc8c3c653f3e 100644
--- a/gcc/config/riscv/riscv.md
+++ b/gcc/config/riscv/riscv.md
@@ -760,16 +760,17 @@
   [(set (match_operand:P	 0 "register_operand" "=r,r")
 	(plus:P (match_operand:P 1 "register_operand" " r,r")
 		(match_operand:P 2 "const_two_s12"    " MiG,r")))]
-  ""
+  "!riscv_reg_frame_related (operands[0])
+    && !riscv_reg_frame_related (operands[1])"
   "add %0,%1,%2"
-  ""
+  "&& 1"
   [(set (match_dup 0)
 	(plus:P (match_dup 1) (match_dup 3)))
    (set (match_dup 0)
 	(plus:P (match_dup 0) (match_dup 4)))]
 {
   int val = INTVAL (operands[2]);
-  if (SUM_OF_TWO_S12_P (val))
+  if (SUM_OF_TWO_S12_P (val, false))
     {
        operands[3] = GEN_INT (2047);
        operands[4] = GEN_INT (val - 2047);
@@ -785,6 +786,36 @@
   [(set_attr "type" "arith")
    (set_attr "mode" "<P:MODE>")])
 
+(define_insn_and_split "*add<mode>3_const_sum_of_two_s12_stack"
+  [(set (match_operand:P	 0 "register_operand"   "=r,r")
+	(plus:P (match_operand:P 1 "register_operand"   " r,r")
+		(match_operand:P 2 "const_two_s12_algn" " MiA,r")))]
+  "riscv_reg_frame_related (operands[0])
+   || riscv_reg_frame_related (operands[1])"
+  "add %0,%1,%2"
+  "&& 1"
+  [(set (match_dup 0)
+	(plus:P (match_dup 1) (match_dup 3)))
+   (set (match_dup 0)
+	(plus:P (match_dup 0) (match_dup 4)))]
+{
+  int val = INTVAL (operands[2]);
+  if (SUM_OF_TWO_S12_P (val, true))
+    {
+       operands[3] = GEN_INT (2032);
+       operands[4] = GEN_INT (val - 2032);
+    }
+  else if (SUM_OF_TWO_S12_N (val))
+    {
+       operands[3] = GEN_INT (-2048);
+       operands[4] = GEN_INT (val + 2048);
+    }
+  else
+      gcc_unreachable ();
+}
+  [(set_attr "type" "arith")
+   (set_attr "mode" "<P:MODE>")])
+
 (define_expand "addv<mode>4"
   [(set (match_operand:GPR           0 "register_operand" "=r,r")
 	(plus:GPR (match_operand:GPR 1 "register_operand" " r,r")
-- 
2.34.1



* [gcc-15 3/3] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-03-16 17:35 [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Vineet Gupta
  2024-03-16 17:35 ` [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
  2024-03-16 17:35 ` [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned Vineet Gupta
@ 2024-03-16 17:35 ` Vineet Gupta
  2024-03-16 20:27   ` Jeff Law
  2024-03-19  4:41 ` [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Jeff Law
  3 siblings, 1 reply; 28+ messages in thread
From: Vineet Gupta @ 2024-03-16 17:35 UTC (permalink / raw)
  To: gcc-patches
  Cc: Jeff Law, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp,
	Vineet Gupta

If the constant used for a stack offset can be expressed as a sum of two
S12 values, the constant need not be materialized (in a reg); instead the
two S12 values can be added directly in the instructions involving the
frame pointer. This avoids burning a register and, more importantly, can
often get the sequence down to 2 insns vs. 3.

The previous patches, which generally avoid LUI based const
materialization, didn't fix this PR; it needs this directed fix in the
function prologue/epilogue expansion.

This fix doesn't move the needle for SPEC at all, but it is still a
win considering gcc generates one insn fewer than llvm for the test ;-)

   gcc-13.1 release   |      gcc 230823     |                   |
                      |    g6619b3d4c15c    |   This patch      |  clang/llvm
---------------------------------------------------------------------------------
li      t0,-4096     | li    t0,-4096      | addi  sp,sp,-2048 | addi sp,sp,-2048
addi    t0,t0,2016   | addi  t0,t0,2032    | add   sp,sp,-16   | addi sp,sp,-32
li      a4,4096      | add   sp,sp,t0      | add   a5,sp,a0    | add  a1,sp,16
add     sp,sp,t0     | addi  a5,sp,-2032   | sb    zero,0(a5)  | add  a0,a0,a1
li      a5,-4096     | add   a0,a5,a0      | addi  sp,sp,2032  | sb   zero,0(a0)
addi    a4,a4,-2032  | li    t0, 4096      | addi  sp,sp,32    | addi sp,sp,2032
add     a4,a4,a5     | sb    zero,2032(a0) | ret               | addi sp,sp,48
addi    a5,sp,16     | addi  t0,t0,-2032   |                   | ret
add     a5,a4,a5     | add   sp,sp,t0      |
add     a0,a5,a0     | ret                 |
li      t0,4096      |
sd      a5,8(sp)     |
sb      zero,2032(a0)|
addi    t0,t0,-2016  |
add     sp,sp,t0     |
ret                  |
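The prologue/epilogue path above boils down to adjusting SP with two ADDIs whenever the frame size is a sum of two S12 values, using the aligned split so SP stays aligned in between. A toy model in C (illustrative names, not the in-tree API):

```c
#include <assert.h>

/* Model of the two-ADDI stack adjustment: VAL is the (sum-of-two-S12)
   adjustment, negative in the prologue, positive in the epilogue.
   The first ADDI uses the aligned base so SP never goes through an
   unaligned intermediate value.  */
static long adjust_sp (long sp, long val)
{
  long base = (val > 0) ? 2032 : -2048;
  sp += base;              /* first ADDI: SP still 16-byte aligned */
  assert (sp % 16 == 0);
  sp += val - base;        /* second ADDI: remainder of the offset */
  return sp;
}
```

E.g. a 2064-byte frame expands as `addi sp,sp,-2048; addi sp,sp,-16` in the prologue, matching the third column of the table above.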

gcc/ChangeLog:
	PR target/105733
	* config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New
	function to split a sum of two s12 values into constituents.
	(riscv_expand_prologue): Handle offset being sum of two S12.
	(riscv_expand_epilogue): Ditto.
	* config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New.

gcc/testsuite/ChangeLog:
	* gcc.target/riscv/pr105733.c: New Test.
	* gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not
	expect LUI 4096.
	* gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto.
	* gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
---
 gcc/config/riscv/riscv-protos.h               |  2 +
 gcc/config/riscv/riscv.cc                     | 80 +++++++++++++++++--
 gcc/testsuite/gcc.target/riscv/pr105733.c     | 15 ++++
 .../riscv/rvv/autovec/vls/spill-1.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-2.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-3.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-4.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-5.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-6.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-7.c           |  4 +-
 10 files changed, 104 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/pr105733.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index f9e407bf5768..8b3f7ce181cc 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -165,6 +165,8 @@ extern void riscv_subword_address (rtx, rtx *, rtx *, rtx *, rtx *);
 extern void riscv_lshift_subword (machine_mode, rtx, rtx, rtx *);
 extern enum memmodel riscv_union_memmodels (enum memmodel, enum memmodel);
 extern bool riscv_reg_frame_related (rtx);
+extern void riscv_split_sum_of_two_s12 (HOST_WIDE_INT, bool,
+					HOST_WIDE_INT *, HOST_WIDE_INT *);
 
 /* Routines implemented in riscv-c.cc.  */
 void riscv_cpu_cpp_builtins (cpp_reader *);
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 38aebefa2590..b4a7947433a7 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -3810,6 +3810,38 @@ riscv_split_doubleword_move (rtx dest, rtx src)
        riscv_emit_move (riscv_subword (dest, true), riscv_subword (src, true));
      }
 }
+
+/* Constant VAL is known to be a sum of two S12 constants.  Break it into
+   its comprising BASE and OFF.
+   Numerically S12 is -2048 to 2047, however if ALIGN_BASE is true, use the
+   more conservative -2048 to 2032 range for offsets pertaining to stack
+   related registers.  */
+
+void
+riscv_split_sum_of_two_s12 (HOST_WIDE_INT val, bool align_base,
+			    HOST_WIDE_INT *base, HOST_WIDE_INT *off)
+{
+  if (SUM_OF_TWO_S12_N (val))
+    {
+      *base = -2048;
+      *off = val - (-2048);
+    }
+  else if (SUM_OF_TWO_S12_P (val, true) && align_base)
+    {
+      *base = 2032;
+      *off = val - 2032;
+    }
+  else if (SUM_OF_TWO_S12_P (val, false) && !align_base)
+    {
+      *base = 2047;
+      *off = val - 2047;
+    }
+  else
+    {
+      gcc_unreachable ();
+    }
+}
+
 \f
 /* Return the appropriate instructions to move SRC into DEST.  Assume
    that SRC is operand 1 and DEST is operand 0.  */
@@ -7366,6 +7398,17 @@ riscv_expand_prologue (void)
 				GEN_INT (-constant_frame));
 	  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
 	}
+      else if (SUM_OF_TWO_S12 (-constant_frame, true))
+	{
+	  HOST_WIDE_INT base, off;
+	  riscv_split_sum_of_two_s12 (-constant_frame, true, &base, &off);
+	  insn = gen_add3_insn (stack_pointer_rtx, stack_pointer_rtx,
+				GEN_INT (base));
+	  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
+	  insn = gen_add3_insn (stack_pointer_rtx, stack_pointer_rtx,
+				GEN_INT (off));
+	  RTX_FRAME_RELATED_P (emit_insn (insn)) = 1;
+	}
       else
 	{
 	  riscv_emit_move (RISCV_PROLOGUE_TEMP (Pmode), GEN_INT (-constant_frame));
@@ -7581,14 +7624,26 @@ riscv_expand_epilogue (int style)
 	}
       else
 	{
-	  if (!SMALL_OPERAND (adjust_offset.to_constant ()))
+	  HOST_WIDE_INT adj_off_value = adjust_offset.to_constant ();
+	  if (SMALL_OPERAND (adj_off_value))
+	    {
+	      adjust = GEN_INT (adj_off_value);
+	    }
+	  else if (SUM_OF_TWO_S12 (adj_off_value, true))
+	    {
+	      HOST_WIDE_INT base, off;
+	      riscv_split_sum_of_two_s12 (adj_off_value, true, &base, &off);
+	      insn = gen_add3_insn (stack_pointer_rtx, hard_frame_pointer_rtx,
+					GEN_INT (base));
+	      RTX_FRAME_RELATED_P (insn) = 1;
+	      adjust = GEN_INT (off);
+	    }
+	  else
 	    {
 	      riscv_emit_move (RISCV_PROLOGUE_TEMP (Pmode),
-			       GEN_INT (adjust_offset.to_constant ()));
+			       GEN_INT (adj_off_value));
 	      adjust = RISCV_PROLOGUE_TEMP (Pmode);
 	    }
-	  else
-	    adjust = GEN_INT (adjust_offset.to_constant ());
 	}
 
       insn = emit_insn (
@@ -7655,10 +7710,21 @@ riscv_expand_epilogue (int style)
 
       /* Get an rtx for STEP1 that we can add to BASE.
 	 Skip if adjust equal to zero.  */
-      if (step1.to_constant () != 0)
+      HOST_WIDE_INT step1_value = step1.to_constant ();
+      if (step1_value != 0)
 	{
-	  rtx adjust = GEN_INT (step1.to_constant ());
-	  if (!SMALL_OPERAND (step1.to_constant ()))
+	  rtx adjust = GEN_INT (step1_value);
+	  if (SUM_OF_TWO_S12 (step1_value, true))
+	    {
+	      HOST_WIDE_INT base, off;
+	      riscv_split_sum_of_two_s12 (step1_value, true, &base, &off);
+	      insn = emit_insn (gen_add3_insn (stack_pointer_rtx,
+						stack_pointer_rtx,
+						GEN_INT (base)));
+	      RTX_FRAME_RELATED_P (insn) = 1;
+	      adjust = GEN_INT (off);
+	    }
+	  else if (!SMALL_OPERAND (step1_value))
 	    {
 	      riscv_emit_move (RISCV_PROLOGUE_TEMP (Pmode), adjust);
 	      adjust = RISCV_PROLOGUE_TEMP (Pmode);
diff --git a/gcc/testsuite/gcc.target/riscv/pr105733.c b/gcc/testsuite/gcc.target/riscv/pr105733.c
new file mode 100644
index 000000000000..6156c36dc7ef
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/pr105733.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options { -march=rv64gcv -mabi=lp64d } } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-Os" "-Oz" } } */
+
+#define BUF_SIZE 2064
+
+void
+foo(unsigned long i)
+{
+    volatile char buf[BUF_SIZE];
+
+    buf[i] = 0;
+}
+
+/* { dg-final { scan-assembler-not {li\t[a-x0-9]+,4096} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-1.c
index 842bb630be56..82f4e3c29a7e 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-1.c
@@ -129,5 +129,5 @@ spill_12 (int8_t *in, int8_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-2.c
index 8f6ee81b98f2..56cc4cc0c0b8 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-2.c
@@ -120,5 +120,5 @@ spill_11 (int16_t *in, int16_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-3.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-3.c
index 0f317d6cce5c..9159cc6162df 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-3.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-3.c
@@ -111,5 +111,5 @@ spill_10 (int32_t *in, int32_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-4.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-4.c
index ef61d9a2c0c3..1faf31ffd8e0 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-4.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-4.c
@@ -102,5 +102,5 @@ spill_9 (int64_t *in, int64_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-5.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-5.c
index b366a4649d87..1d322a27ccc1 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-5.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-5.c
@@ -120,5 +120,5 @@ spill_11 (_Float16 *in, _Float16 *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-6.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-6.c
index d35e2a44f79b..487624c7b696 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-6.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-6.c
@@ -111,5 +111,5 @@ spill_10 (float *in, float *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-7.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-7.c
index 70ca683908db..e3980a295406 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-7.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/spill-7.c
@@ -102,5 +102,5 @@ spill_9 (int64_t *in, int64_t *out)
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-256} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-512} 1 } } */
 /* { dg-final { scan-assembler-times {addi\tsp,sp,-1024} 1 } } */
-/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 1 } } */
-/* { dg-final { scan-assembler-times {li\t[a-x0-9]+,-4096\s+add\tsp,sp,[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,-2048} 3 } } */
+/* { dg-final { scan-assembler-times {addi\tsp,sp,2032} 1 } } */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-16 17:35 ` [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned Vineet Gupta
@ 2024-03-16 20:21   ` Jeff Law
  2024-03-19  0:27     ` Vineet Gupta
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Law @ 2024-03-16 20:21 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/16/24 11:35 AM, Vineet Gupta wrote:
> Noticed that the new sum-of-two-s12 splitter was generating the following:
> 
> | 059930 <tempnam>:
> |   59930:	add	sp,sp,-16
> |   59934:	lui	t0,0xfffff
> |   59938:	sd	s0,0(sp)
> |   5993c:	sd	ra,8(sp)
> |   59940:	add	sp,sp,t0
> |   59944:	add	s0,sp,2047  <----
> |   59948:	mv	a2,a0
> |   5994c:	mv	a3,a1
> |   59950:	mv	a0,sp
> |   59954:	li	a4,1
> |   59958:	lui	a1,0x1
> |   5995c:	add	s0,s0,1     <---
> |   59960:	jal	59a3c
> 
> SP here becomes unaligned, even if transitively, which is undesirable as
> well as incorrect:
>   - ABI requires stack to be 8 byte aligned
>   - asm code looks weird and unexpected
>   - to the user it might falsely seem like a compiler bug even when not,
>     especially when staring at asm while debugging an unrelated issue.
It's not ideal, but I think it's still ABI compliant as-is.  If it 
wasn't, then I suspect things like virtual origins in Ada couldn't be 
made ABI compliant.


> 
> Fix this by using the 2032+addend idiom when handling register operands
> related to the stack. This only affects positive S12 values; negative values
> are already -2048 based.
> 
> Unfortunately the implementation requires making a copy of the splitter,
> since it needs different variants of the predicate and constraint, which
> can't be done conditionally in the same MD pattern (a constraint with a
> restricted range prevents LRA from allowing such an insn despite the new
> predicate).
> 
> With the patch, we get following correct code instead:
> 
> | ..
> | 59944:	add	s0,sp,2032
> | ..
> | 5995c:	add	s0,s0,16
Alternately you could tighten the positive side of the range of the 
splitter from patch 1/3 so that you could always use 2032 rather than 
2047 on the first addi, i.e. instead of allowing 2048..4094, allow 
2048..4064.
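For reference, the two splitting strategies under discussion can be sketched in C. The helper names below are illustrative only (the patch's actual helpers are SUM_OF_TWO_S12_P and riscv_split_sum_of_two_s12):

```c
#include <assert.h>

/* 2047-first split (patch 1/3): covers VAL in 2048..4094, but the
   first addend is odd, so an intermediate stack-relative pointer can
   look misaligned.  */
static void split_2047_first (long val, long *a, long *b)
{
  assert (val >= 2048 && val <= 4094);
  *a = 2047;
  *b = val - 2047;   /* 1..2047, still a valid simm12 */
}

/* 2032-first variant: the first addend stays 16-byte aligned, at the
   cost of shrinking the covered range to 2048..4064.  */
static void split_2032_first (long val, long *a, long *b)
{
  assert (val >= 2048 && val <= 4064);
  *a = 2032;
  *b = val - 2032;   /* 16..2032, still a valid simm12 */
}
```

Both addends fit in a signed 12-bit immediate either way; the 2032-first form trades 30 values of range (4065..4094) for keeping every intermediate pointer 16-byte aligned whenever VAL itself is 16-byte aligned.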

I don't have a strong opinion on that vs the direction you've gone here.


Jeff


* Re: [gcc-15 3/3] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733]
  2024-03-16 17:35 ` [gcc-15 3/3] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] Vineet Gupta
@ 2024-03-16 20:27   ` Jeff Law
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff Law @ 2024-03-16 20:27 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/16/24 11:35 AM, Vineet Gupta wrote:
> If the constant used for stack offset can be expressed as sum of two S12
> values, the constant need not be materialized (in a reg) and instead the
> two S12 bits can be added to instructions involved with frame pointer.
> This avoids burning a register and more importantly can often get down
> to be 2 insn vs. 3.
> 
> The previous patches to generally avoid LUI-based const materialization
> didn't fix this PR, which needs this directed fix in function
> prologue/epilogue expansion.
> 
> This fix doesn't move the needle for SPEC at all, but it is still a
> win, considering gcc generates one insn fewer than llvm for the test ;-)
> 
>     gcc-13.1 release   |      gcc 230823     |                   |
>                        |    g6619b3d4c15c    |   This patch      |  clang/llvm
> ---------------------------------------------------------------------------------
> li      t0,-4096     | li    t0,-4096      | addi  sp,sp,-2048 | addi sp,sp,-2048
> addi    t0,t0,2016   | addi  t0,t0,2032    | add   sp,sp,-16   | addi sp,sp,-32
> li      a4,4096      | add   sp,sp,t0      | add   a5,sp,a0    | add  a1,sp,16
> add     sp,sp,t0     | addi  a5,sp,-2032   | sb    zero,0(a5)  | add  a0,a0,a1
> li      a5,-4096     | add   a0,a5,a0      | addi  sp,sp,2032  | sb   zero,0(a0)
> addi    a4,a4,-2032  | li    t0, 4096      | addi  sp,sp,32    | addi sp,sp,2032
> add     a4,a4,a5     | sb    zero,2032(a0) | ret               | addi sp,sp,48
> addi    a5,sp,16     | addi  t0,t0,-2032   |                   | ret
> add     a5,a4,a5     | add   sp,sp,t0      |
> add     a0,a5,a0     | ret                 |
> li      t0,4096      |
> sd      a5,8(sp)     |
> sb      zero,2032(a0)|
> addi    t0,t0,-2016  |
> add     sp,sp,t0     |
> ret                  |
> 
> gcc/ChangeLog:
> 	PR target/105733
> 	* config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New
> 	function to split a sum of two s12 values into constituents.
> 	(riscv_expand_prologue): Handle offset being sum of two S12.
> 	(riscv_expand_epilogue): Ditto.
> 	* config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New.
> 
> gcc/testsuite/ChangeLog:
> 	* gcc.target/riscv/pr105733.c: New Test.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not
> 	expect LUI 4096.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto.
> 	* gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto.
Yea, wouldn't expect this to move the needle on spec since it's just 
hitting the prologue/epilogue.  In fact, I wouldn't be surprised if 
there were other stack frame sizes that could be improved.  But I 
wouldn't bother chasing down those other cases.

If we think about the embedded space, they're probably not going to want 
to see functions with large frames to begin with.  So optimizing those 
cases for the embedded space just doesn't make much sense.

In the distro space, by this time next year we'll be living in a world 
where stack clash mitigations are enabled.  So for any given size stack 
frame, it'll be allocated in at most 1 page chunks.   So again, going to 
any significant length to optimize other cases just doesn't make much sense.

So we probably should go with this patch in the gcc-15 space, but I 
wouldn't suggest heroic efforts for other sized stack frames.

jeff


* Re: [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  2024-03-16 17:35 ` [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
@ 2024-03-16 20:28   ` Jeff Law
  2024-03-19  0:07     ` Vineet Gupta
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Law @ 2024-03-16 20:28 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/16/24 11:35 AM, Vineet Gupta wrote:
> ... if the constant can be represented as a sum of two S12 values.
> The two S12 values could instead be fused with the subsequent ADD insn.
> This helps
>   - avoid an additional LUI insn
>   - side benefits of not clobbering a reg
> 
> e.g.
>                              w/o patch             w/ patch
> long                  |                     |
> plus(unsigned long i) |	li	a5,4096     |
> {                     |	addi	a5,a5,-2032 | addi a0, a0, 2047
>     return i + 2064;   |	add	a0,a0,a5    | addi a0, a0, 17
> }                     | 	ret         | ret
> 
> NOTE: In theory not having the const in a standalone reg might seem less
>        CSE friendly, but for the workloads in consideration these
>        materializations come from very late LRA reloads, and follow-on
>        GCSE is not doing much currently.
As you note, there's a bit of natural tension between what to expose and 
when.  There's no clear cut answer and I might argue there never will be 
given the design and implementation of various parts of the RTL pipeline.

We have some ports that expose early and just say "tough" if it's a 
minor loss in some cases.  Others choose to expose late.  We're kindof a 
mix of both in RISC-V land.   The limited offsets we've got will tend to 
make this problem a bit worse for RISC-V compared to other architectures.



> 
> The real benefit however is seen in base+offset computation for array
> accesses, and especially for stack accesses, which are finalized late in
> the optim pipeline, during LRA register allocation. Often the finalized
> offsets trigger LRA reloads, resulting in mind-boggling repetition of the
> exact same insn sequence, including LUI-based constant materialization.
Yea, this is a known sore spot.  I spent a good deal of time working on 
the PA where we only have a 5 bit displacement for FP memory ops.  You 
can assume this caused reload fits and often resulted in horrific (and 
repetitive code) code.

We did some interesting tricks to avoid the worst of the codegen issues. 
  None of those tricks really apply here since we're in an LRA world and 
there's no difference in the allowed offset in a memory reference vs an 
addi instruction.



> 
> This shaves off 290 billion dynamic instructions (QEMU icounts) in the
> SPEC 2017 Cactu benchmark, which is over 10% of the workload. In the rest
> of the suite, an additional 10 billion are shaved, with both gains and
> losses in individual workloads, as is usual with compiler changes.
That's a fantastic result.

> 
> This should still be considered damage control as the real/deeper fix
> would be to reduce number of LRA reloads or CSE/anchor those during
> LRA constraint sub-pass (re)runs.
Agreed.  But I think it's worth tackling from both ends.  Generate 
better code when we do have these reloads and avoid the reloads when we can.


> 
> Implementation Details (for posterity)
> --------------------------------------
>   - basic idea is to have a splitter selected via a new predicate for constant
>     being possible sum of two S12 and provide the transform.
>     This is however a 2 -> 2 transform which combine can't handle.
>     So we specify it using a define_insn_and_split.
Right.  For the record it looks like a 2->2 case because of the 
mvconst_internal define_insn_and_split.

What I was talking about in the Tuesday meeting last week was the desire 
to rethink that pattern precisely because it drives us to need to 
implement patterns like yours rather than the more natural 3->2 or 4->3/2.

Furthermore, I have a suspicion that there's logicals where we're going 
to want to do something similar and if we keep the mvconst_internal 
pattern they're all going to have to be implemented as 2->2s with a 
define_insn_and_split rather than the more natural 3->2.

Having said all that, I suspect the case you're diving into is far more 
important than the logicals.



> 
> | linux/scripts/bloat-o-meter build-gcc-240131/target/lib/libc.so.6 \
>           build-gcc-240131-new-splitter-1-variant/target/lib/libc.so.6
> |
> | add/remove: 0/0 grow/shrink: 21/49 up/down: 520/-3056 (-2536)
> | Function                                     old     new   delta
> | getnameinfo                                 2756    2892    +136
> ...
> | tempnam                                      136     144      +8
> | padzero                                      276     284      +8
> ...
> | __GI___register_printf_specifier             284     280      -4
> | __EI_xdr_array                               468     464      -4
> | try_file_lock                                268     260      -8
> | pthread_create@GLIBC_2                      3520    3508     -12
> | __pthread_create_2_1                        3520    3508     -12
> ...
> | _nss_files_setnetgrent                       932     904     -28
> | _nss_dns_gethostbyaddr2_r                   1524    1480     -44
> | build_trtable                               3312    3208    -104
> | printf_positional                          25000   22580   -2420
> | Total: Before=2107999, After=2105463, chg -0.12%
That's a bit funny.  I was inside printf_positional Friday.  It's almost 
certainly the case that it sees a big improvement because it uses a lot 
of stack space and I believe also has a frame pointer.

But yeah, I wouldn't let the minor regressions stand in the way here.


> 
> gcc/ChangeLog:
> 	PR target/106265 (partially)
> 	* config/riscv/riscv.h: New macros to check for sum of two S12
> 	range.
> 	* config/riscv/constraints.md: New constraint.
> 	* config/riscv/predicates.md: New Predicate.
> 	* config/riscv/riscv.md: New splitter to not materialize
> 	constant in desired range.
> 
> gcc/testsuite/ChangeLog:
> 	* gcc.target/riscv/sum-of-two-s12-const-1.c: New test: checks for
> 	new patterns output.
> 	* gcc.target/riscv/sum-of-two-s12-const-2.c: New test: should not
> 	ICE.
> 
> Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
> ---
>   gcc/config/riscv/constraints.md               |  6 +++
>   gcc/config/riscv/predicates.md                |  6 +++
>   gcc/config/riscv/riscv.h                      | 15 +++++++
>   gcc/config/riscv/riscv.md                     | 34 ++++++++++++++
>   .../gcc.target/riscv/sum-of-two-s12-const-1.c | 45 +++++++++++++++++++
>   .../gcc.target/riscv/sum-of-two-s12-const-2.c | 22 +++++++++
>   6 files changed, 128 insertions(+)
>   create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
>   create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c
> 
> diff --git a/gcc/config/riscv/constraints.md b/gcc/config/riscv/constraints.md
> index 41acaea04eba..435461180c7e 100644
> --- a/gcc/config/riscv/constraints.md
> +++ b/gcc/config/riscv/constraints.md
> @@ -80,6 +80,12 @@
>     (and (match_code "const_int")
>          (match_test "LUI_OPERAND (ival)")))
>   
> +(define_constraint "MiG"
> +  "const can be represented as sum of any S12 values."
> +  (and (match_code "const_int")
> +       (ior (match_test "IN_RANGE (ival,  2048,  4094)")
> +	    (match_test "IN_RANGE (ival, -4096, -2049)"))))
> +
>   (define_constraint "Ds3"
>     "@internal
>      1, 2 or 3 immediate"
> diff --git a/gcc/config/riscv/predicates.md b/gcc/config/riscv/predicates.md
> index 6c87a7bd1f49..89490339c2da 100644
> --- a/gcc/config/riscv/predicates.md
> +++ b/gcc/config/riscv/predicates.md
> @@ -420,6 +420,12 @@
>   	return true;
>   })
>   
> +(define_predicate "const_two_s12"
> +  (match_code "const_int")
> +{
> +  return SUM_OF_TWO_S12 (INTVAL (op));
> +})
> +
>   ;; CORE-V Predicates:
>   (define_predicate "immediate_register_operand"
>     (ior (match_operand 0 "register_operand")
> diff --git a/gcc/config/riscv/riscv.h b/gcc/config/riscv/riscv.h
> index da089a03e9d1..817661058896 100644
> --- a/gcc/config/riscv/riscv.h
> +++ b/gcc/config/riscv/riscv.h
> @@ -624,6 +624,21 @@ enum reg_class
>     (((VALUE) | ((1UL<<31) - IMM_REACH)) == ((1UL<<31) - IMM_REACH)	\
>      || ((VALUE) | ((1UL<<31) - IMM_REACH)) + IMM_REACH == 0)
>   
> +/* True if a VALUE (constant) can be expressed as sum of two S12 constants
> +   (in range -2048 to 2047).
> +   Range check logic:
> +     from: min S12 + 1 (or -1 depending on what side of zero)
> +       to: two times the min S12 value (to max out S12 bits).  */
> +
> +#define SUM_OF_TWO_S12_N(VALUE)						\
> +  (((VALUE) >= (-2048 * 2)) && ((VALUE) <= (-2048 - 1)))
> +
> +#define SUM_OF_TWO_S12_P(VALUE)						\
> +  (((VALUE) >= ( 2047 + 1)) && ((VALUE) <= ( 2047 * 2)))
> +
> +#define SUM_OF_TWO_S12(VALUE)						\
> +  (SUM_OF_TWO_S12_N (VALUE) || SUM_OF_TWO_S12_P (VALUE))
> +
>   /* If this is a single bit mask, then we can load it with bseti.  Special
>      handling of SImode 0x80000000 on RV64 is done in riscv_build_integer_1. */
>   #define SINGLE_BIT_MASK_OPERAND(VALUE)					\
> diff --git a/gcc/config/riscv/riscv.md b/gcc/config/riscv/riscv.md
> index b16ed97909c0..79fe861ef91f 100644
> --- a/gcc/config/riscv/riscv.md
> +++ b/gcc/config/riscv/riscv.md
> @@ -751,6 +751,40 @@
>     [(set_attr "type" "arith")
>      (set_attr "mode" "DI")])
>   
> +;; Special case of adding a reg and a constant if the latter is a sum of
> +;; two S12 values (in range -2048 to 2047). Avoid materializing the const
> +;; and fuse it into the add (with an additional add for the 2nd value).
> +;; Makes a 3 insn sequence into 2 insns.
> +
> +(define_insn_and_split "*add<mode>3_const_sum_of_two_s12"
> +  [(set (match_operand:P	 0 "register_operand" "=r,r")
> +	(plus:P (match_operand:P 1 "register_operand" " r,r")
> +		(match_operand:P 2 "const_two_s12"    " MiG,r")))]
> +  ""
> +  "add %0,%1,%2"
> +  ""
> +  [(set (match_dup 0)
> +	(plus:P (match_dup 1) (match_dup 3)))
> +   (set (match_dup 0)
> +	(plus:P (match_dup 0) (match_dup 4)))]
> +{
> +  int val = INTVAL (operands[2]);
> +  if (SUM_OF_TWO_S12_P (val))
> +    {
> +       operands[3] = GEN_INT (2047);
> +       operands[4] = GEN_INT (val - 2047);
> +    }
> +  else if (SUM_OF_TWO_S12_N (val))
> +    {
> +       operands[3] = GEN_INT (-2048);
> +       operands[4] = GEN_INT (val + 2048);
> +    }
> +  else
> +      gcc_unreachable ();
> +}
> +  [(set_attr "type" "arith")
> +   (set_attr "mode" "<P:MODE>")])
So why use "P" for your mode iterator?  I would have expected GPR since 
I think this works for both SI and DI mode as-is.

For the output template "#" for alternative 0 and the add instruction 
for alternative 1 is probably better.  That way it's clear to everyone 
that alternative 0 is always meant to be split.


Don't you need some kind of condition on the split to avoid splitting 
when you've got a register for operands[2]?  It would seem to me that as 
written, it would still try to spit and trigger an RTL checking error 
when you try to extract INTVAL (operands[2]).
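(For reference, that class of RTL checking error only fires when GCC itself is configured with checking enabled; a default release-checking build compiles the rtx-code verification inside INTVAL out, so the testsuite alone may not expose it. A sketch of an illustrative configure line, with the target triple as an example only:)

```
# --enable-checking=yes,rtl makes rtx accessors like INTVAL abort at
# compiler runtime when handed an rtx of the wrong code (at the cost
# of a noticeably slower compiler).
../gcc/configure --target=riscv64-unknown-linux-gnu \
    --enable-checking=yes,rtl
make
```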


Jeff


* Re: [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  2024-03-16 20:28   ` Jeff Law
@ 2024-03-19  0:07     ` Vineet Gupta
  2024-03-23  5:59       ` Jeff Law
  0 siblings, 1 reply; 28+ messages in thread
From: Vineet Gupta @ 2024-03-19  0:07 UTC (permalink / raw)
  To: Jeff Law, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/16/24 13:28, Jeff Law wrote:
>> Implementation Details (for posterity)
>> --------------------------------------
>>   - basic idea is to have a splitter selected via a new predicate for constant
>>     being possible sum of two S12 and provide the transform.
>>     This is however a 2 -> 2 transform which combine can't handle.
>>     So we specify it using a define_insn_and_split.
> Right.  For the record it looks like a 2->2 case because of the 
> mvconst_internal define_insn_and_split.
>
> What I was talking about in the Tuesday meeting last week was the desire 
> to rethink that pattern precisely because it drives us to need to 
> implement patterns like yours rather than the more natural 3->2 or 4->3/2.

A few weeks ago, I did an experimental run; removing that splitter isn't
catastrophic for SPEC, granted that is a pretty narrow view of the world :-)
Below is upstream vs. a revert of mvconst_internal (for an apples-to-apples
comparison, just the revert, none of my new splitter stuff):

     Baseline        Revert mvconst_int
1,214,172,747,858 |  1,225,245,908,131 |  -0.91% <--
  740,618,658,160 |    743,873,857,461 |  -0.44% <-
  692,128,469,436 |    695,695,894,602 |  -0.52% <--
  190,811,495,310 |    190,934,386,665 |  -0.06%
  225,751,808,189 |    225,909,747,497 |  -0.07%
  220,364,237,915 |    220,466,640,640 |  -0.05%
  179,157,361,261 |    179,357,613,835 |  -0.11%
  219,313,921,837 |    219,436,712,114 |  -0.06%
  281,344,210,410 |    281,344,197,305 |   0.00%
  446,517,079,816 |    446,517,058,807 |   0.00%
  347,300,137,757 |    347,300,119,942 |   0.00%
  421,496,082,529 |    421,496,063,144 |   0.00%
  669,319,255,911 |    673,977,112,732 |  -0.70% <--
2,852,810,623,629 |  2,852,809,986,901 |   0.00%
1,855,884,349,001 |  1,855,837,810,109 |   0.00%
1,653,853,403,482 |  1,653,714,171,270 |   0.01%
2,974,347,287,057 |  2,970,520,074,342 |   0.13%
1,158,337,609,995 |  1,158,337,607,076 |   0.00%
1,019,181,486,091 |  1,020,576,716,760 |  -0.14%
1,738,961,517,942 |  1,736,024,569,247 |   0.17%
  849,330,280,930 |    848,955,738,855 |   0.04%
  276,584,892,794 |    276,457,202,331 |   0.05%
  913,410,589,335 |    913,407,101,214 |   0.00%
  903,864,384,529 |    903,800,709,452 |   0.01%
1,654,905,086,567 |  1,656,684,048,052 |  -0.11%
1,513,514,546,262 |  1,510,815,435,668 |   0.18%
1,641,980,078,831 |  1,602,410,552,874 |   2.41% <--
2,546,170,307,336 |  2,546,206,790,578 |   0.00%
1,983,551,321,388 |  1,979,562,936,994 |   0.20%
1,516,077,036,742 |  1,515,038,668,667 |   0.07%
2,056,386,745,670 |  2,055,592,903,700 |   0.04%
1,008,224,427,448 |  1,008,027,321,669 |   0.02%
1,169,305,141,789 |  1,169,028,937,430 |   0.02%
  363,827,037,541 |    361,982,880,800 |   0.51% <--
  906,649,110,459 |    909,651,995,900 |  -0.33% <-
  509,023,896,044 |    508,578,027,604 |   0.09%
      403,238,430 |        398,492,922 |   1.18%
      403,238,430 |        398,492,922 |   1.18%
38,917,902,479,830  38,886,374,486,212


Do note however that this takes us back to the constant pool way of loading
complex constants (vs. bit fiddling and stitching). The 2.4% gain in
deepsjeng is exactly that, and we need to decide what to do about it:
keep it as such, tweak it with a cost model change, and/or make this the
behavior for everyone or have it as a per-CPU tune - hard to tell what's
the right thing to do here.

P.S. Perhaps obvious, but the prerequisite to the revert is tending to the
originally linked PRs, which will now start failing; fixing those will
hopefully improve some of the above.

> Furthermore, I have a suspicion that there's logicals where we're going 
> to want to do something similar and if we keep the mvconst_internal 
> pattern they're all going to have to be implemented as 2->2s with a 
> define_insn_and_split rather than the more natural 3->2.

Naive question: why is define_split more natural vs. define_insn_and_split?
Is it because of the syntax/implementation, or that one is more combine
"centric" and the other more of a split-pass thing (roughly speaking of
course), or something else altogether?

Would we have to revisit the new splitter (and perhaps audit the
existing define_insn_and_split patterns) if we were to go ahead with
this revert?
>> +(define_insn_and_split "*add<mode>3_const_sum_of_two_s12"
>> +  [(set (match_operand:P	 0 "register_operand" "=r,r")
>> +	(plus:P (match_operand:P 1 "register_operand" " r,r")
>> +		(match_operand:P 2 "const_two_s12"    " MiG,r")))]
>> +  ""
>> +  "add %0,%1,%2"
>> +  ""
>> +  [(set (match_dup 0)
>> +	(plus:P (match_dup 1) (match_dup 3)))
>> +   (set (match_dup 0)
>> +	(plus:P (match_dup 0) (match_dup 4)))]
>> +{
>> +  int val = INTVAL (operands[2]);
>> +  if (SUM_OF_TWO_S12_P (val))
>> +    {
>> +       operands[3] = GEN_INT (2047);
>> +       operands[4] = GEN_INT (val - 2047);
>> +    }
>> +  else if (SUM_OF_TWO_S12_N (val))
>> +    {
>> +       operands[3] = GEN_INT (-2048);
>> +       operands[4] = GEN_INT (val + 2048);
>> +    }
>> +  else
>> +      gcc_unreachable ();
>> +}
>> +  [(set_attr "type" "arith")
>> +   (set_attr "mode" "<P:MODE>")])
> So why use "P" for your mode iterator?  I would have expected GPR since 
> I think this works for both SI and DI mode as-is.

My intent at least was to have this work for either of rv64/rv32, and in
either of those environments work for both SI and DI - certainly
rv64+{SI,DI}.
To that end I debated between GPR, X and P.
It seems GPR only supports DI if TARGET_64BIT.
But I could well be wrong about the above requirements or the subtle
differences between these.

> For the output template "#" for alternative 0 and the add instruction 
> for alternative 1 is probably better.  That way it's clear to everyone 
> that alternative 0 is always meant to be split.

OK.

> Don't you need some kind of condition on the split to avoid splitting 
> when you've got a register for operands[2]? 

Won't the predicate "match_code" guarantee that?

  (define_predicate "const_two_s12"
     (match_code "const_int")
   {
     return SUM_OF_TWO_S12 (INTVAL (op), false);
   })

> It would seem to me that as 
> written, it would still try to spit and trigger an RTL checking error 
> when you try to extract INTVAL (operands[2]).

Do I need to build gcc in a certain way to expose such errors? I wasn't
able to trigger one across the entire testsuite or a SPEC build.

Thx for the detailed quick review.
-Vineet


* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-16 20:21   ` Jeff Law
@ 2024-03-19  0:27     ` Vineet Gupta
  2024-03-19  6:48       ` Andrew Waterman
  0 siblings, 1 reply; 28+ messages in thread
From: Vineet Gupta @ 2024-03-19  0:27 UTC (permalink / raw)
  To: Jeff Law, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/16/24 13:21, Jeff Law wrote:
> |   59944:	add	s0,sp,2047  <----
> |   59948:	mv	a2,a0
> |   5994c:	mv	a3,a1
> |   59950:	mv	a0,sp
> |   59954:	li	a4,1
> |   59958:	lui	a1,0x1
> |   5995c:	add	s0,s0,1     <---
> |   59960:	jal	59a3c
>
> SP here becomes unaligned, even if transitively which is undesirable as
> well as incorrect:
>   - ABI requires stack to be 8 byte aligned
>   - asm code looks weird and unexpected
>   - to the user it might falsely seem like a compiler bug even when not,
>     specially when staring at asm for debugging unrelated issue.
> It's not ideal, but I think it's still ABI compliant as-is.  If it 
> wasn't, then I suspect things like virtual origins in Ada couldn't be 
> made ABI compliant.

To be clear, are you suggesting ADD sp, sp, 2047 is ABI compliant?
I'd still like to avoid it, as I'm sure someone will complain about it.

>> With the patch, we get following correct code instead:
>>
>> | ..
>> | 59944:	add	s0,sp,2032
>> | ..
>> | 5995c:	add	s0,s0,16
> Alternately you could tighten the positive side of the range of the 
> splitter from patch 1/3 so that you could always use 2032 rather than 
> 2047 on the first addi.   ie instead of allowing 2048..4094, allow 
> 2048..4064.

2033..4064 vs. 2048..4094

Yeah, I was a bit split about this as well. Since you are OK with either,
I'll keep them as-is and perhaps add this observation to the commitlog.
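For what it's worth, the tighter-range alternative amounts to always peeling
off 2032, the largest signed-12-bit immediate that is a multiple of 16, so an
aligned anchor plus the first addend stays 16-byte aligned. A quick sketch,
with the 2033..4064 range taken from the discussion above:

```python
def split_keep_aligned(val):
    """Split val = 2032 + rest, with both addends fitting a signed
    12-bit immediate and the first one a multiple of 16."""
    if not 2033 <= val <= 4064:
        raise ValueError("out of range for the aligned split")
    return 2032, val - 2032

first, rest = split_keep_aligned(4064)
assert first % 16 == 0 and -2048 <= rest <= 2047
```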

Thx,
-Vineet


* Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak
  2024-03-16 17:35 [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Vineet Gupta
                   ` (2 preceding siblings ...)
  2024-03-16 17:35 ` [gcc-15 3/3] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] Vineet Gupta
@ 2024-03-19  4:41 ` Jeff Law
  2024-03-21  0:45   ` Vineet Gupta
  2024-03-21 14:36   ` scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak) Vineet Gupta
  3 siblings, 2 replies; 28+ messages in thread
From: Jeff Law @ 2024-03-19  4:41 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/16/24 11:35 AM, Vineet Gupta wrote:
> Hi,
> 
> This set of patches (for gcc-15) help improve stack/array accesses
> by improving constant materialization. Details are in respective
> patches.
> 
> The first patch is the main change which improves SPEC cactu by 10%.
Just to confirm.  Yup, 10% reduction in icounts and about a 3.5% 
improvement in cycles on our target.  Which is great!

This also makes me wonder if cactu is the benchmark that was sensitive 
to flushing the pending queue in the scheduler.  Jivan's data would tend 
to indicate that is the case as several routines seem to flush the 
pending queue often.  In particular:

ML_BSSN_RHS_Body
ML_BSSN_Advect_Body
ML_BSSN_constraints_Body

All have a high number of dynamic instructions as well as lots of 
flushes of the pending queue.

Vineet, you might want to look and see if cranking up the 
max-pending-list-length parameter helps drive down spilling.  I think 
its default value is 32 insns.  I've seen it cranked up to 128 and 256 
insns without significant ill effects on compile time.

My recollection (it's been like 3 years) of the key loop was that it had 
a few hundred instructions and we'd flush the pending list about 50 
cycles into the loop as there just wasn't enough issue bandwidth to the 
FP units to dispatch all the FP instructions as their inputs became 
ready.  So you'd be looking for flushes in a big loop.


Jeff




* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-19  0:27     ` Vineet Gupta
@ 2024-03-19  6:48       ` Andrew Waterman
  2024-03-19 13:10         ` Jeff Law
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Waterman @ 2024-03-19  6:48 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: Jeff Law, gcc-patches, kito.cheng, Palmer Dabbelt, gnu-toolchain,
	Robin Dapp

On Mon, Mar 18, 2024 at 5:28 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>
>
>
> On 3/16/24 13:21, Jeff Law wrote:
> > |   59944:    add     s0,sp,2047  <----
> > |   59948:    mv      a2,a0
> > |   5994c:    mv      a3,a1
> > |   59950:    mv      a0,sp
> > |   59954:    li      a4,1
> > |   59958:    lui     a1,0x1
> > |   5995c:    add     s0,s0,1     <---
> > |   59960:    jal     59a3c
> >
> > SP here becomes unaligned, even if transitively which is undesirable as
> > well as incorrect:
> >   - ABI requires stack to be 8 byte aligned
> >   - asm code looks weird and unexpected
> >   - to the user it might falsely seem like a compiler bug even when not,
> >     specially when staring at asm for debugging unrelated issue.
> > It's not ideal, but I think it's still ABI compliant as-is.  If it
> > wasn't, then I suspect things like virtual origins in Ada couldn't be
> > made ABI compliant.
>
> To be clear are u suggesting ADD sp, sp, 2047 is ABI compliant ?
> I'd still like to avoid it as I'm sure someone will complain about it.
>
> >> With the patch, we get following correct code instead:
> >>
> >> | ..
> >> | 59944:     add     s0,sp,2032
> >> | ..
> >> | 5995c:     add     s0,s0,16
> > Alternately you could tighten the positive side of the range of the
> > splitter from patch 1/3 so that you could always use 2032 rather than
> > 2047 on the first addi.   ie instead of allowing 2048..4094, allow
> > 2048..4064.
>
> 2033..4064 vs. 2048..4094
>
> Yeah I was a bit split about this as well. Since you are OK with either,
> I'll keep them as-is and perhaps add this observation to commitlog.

There's a subset of embedded use cases where an interrupt service
routine continues on the same stack as the interrupted thread,
requiring sp to always have an ABI-compliant value (i.e. 16B aligned,
and with no important data on the stack at an address below sp).

Although not all use cases care about this property, it seems more
straightforward to maintain the invariant everywhere, rather than
selectively enforce it.

>
> Thx,
> -Vineet


* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-19  6:48       ` Andrew Waterman
@ 2024-03-19 13:10         ` Jeff Law
  2024-03-19 20:05           ` Vineet Gupta
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Law @ 2024-03-19 13:10 UTC (permalink / raw)
  To: Andrew Waterman, Vineet Gupta
  Cc: gcc-patches, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/19/24 12:48 AM, Andrew Waterman wrote:
> On Mon, Mar 18, 2024 at 5:28 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>>
>>
>>
>> On 3/16/24 13:21, Jeff Law wrote:
>>> |   59944:    add     s0,sp,2047  <----
>>> |   59948:    mv      a2,a0
>>> |   5994c:    mv      a3,a1
>>> |   59950:    mv      a0,sp
>>> |   59954:    li      a4,1
>>> |   59958:    lui     a1,0x1
>>> |   5995c:    add     s0,s0,1     <---
>>> |   59960:    jal     59a3c
>>>
>>> SP here becomes unaligned, even if transitively which is undesirable as
>>> well as incorrect:
>>>    - ABI requires stack to be 8 byte aligned
>>>    - asm code looks weird and unexpected
>>>    - to the user it might falsely seem like a compiler bug even when not,
>>>      specially when staring at asm for debugging unrelated issue.
>>> It's not ideal, but I think it's still ABI compliant as-is.  If it
>>> wasn't, then I suspect things like virtual origins in Ada couldn't be
>>> made ABI compliant.
>>
>> To be clear are u suggesting ADD sp, sp, 2047 is ABI compliant ?
>> I'd still like to avoid it as I'm sure someone will complain about it.
>>
>>>> With the patch, we get following correct code instead:
>>>>
>>>> | ..
>>>> | 59944:     add     s0,sp,2032
>>>> | ..
>>>> | 5995c:     add     s0,s0,16
>>> Alternately you could tighten the positive side of the range of the
>>> splitter from patch 1/3 so that you could always use 2032 rather than
>>> 2047 on the first addi.   ie instead of allowing 2048..4094, allow
>>> 2048..4064.
>>
>> 2033..4064 vs. 2048..4094
>>
>> Yeah I was a bit split about this as well. Since you are OK with either,
>> I'll keep them as-is and perhaps add this observation to commitlog.
> 
> There's a subset of embedded use cases where an interrupt service
> routine continues on the same stack as the interrupted thread,
> requiring sp to always have an ABI-compliant value (i.e. 16B aligned,
> and with no important data on the stack at an address below sp).
> 
> Although not all use cases care about this property, it seems more
> straightforward to maintain the invariant everywhere, rather than
> selectively enforce it.
Just to be clear, the changes don't misalign the stack pointer at all. 
They merely have the potential to create *another* pointer into the 
stack which may or may not be aligned.  Which is totally normal, it's no 
different than taking the address of a char on the stack.

jeff



* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-19 13:10         ` Jeff Law
@ 2024-03-19 20:05           ` Vineet Gupta
  2024-03-19 20:58             ` Andrew Waterman
                               ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Vineet Gupta @ 2024-03-19 20:05 UTC (permalink / raw)
  To: Jeff Law, Andrew Waterman
  Cc: gcc-patches, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/19/24 06:10, Jeff Law wrote:
> On 3/19/24 12:48 AM, Andrew Waterman wrote:
>> On Mon, Mar 18, 2024 at 5:28 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>>> On 3/16/24 13:21, Jeff Law wrote:
>>>> |   59944:    add     s0,sp,2047  <----
>>>> |   59948:    mv      a2,a0
>>>> |   5994c:    mv      a3,a1
>>>> |   59950:    mv      a0,sp
>>>> |   59954:    li      a4,1
>>>> |   59958:    lui     a1,0x1
>>>> |   5995c:    add     s0,s0,1     <---
>>>> |   59960:    jal     59a3c
>>>>
>>>> SP here becomes unaligned, even if transitively which is undesirable as
>>>> well as incorrect:
>>>>    - ABI requires stack to be 8 byte aligned
>>>>    - asm code looks weird and unexpected
>>>>    - to the user it might falsely seem like a compiler bug even when not,
>>>>      specially when staring at asm for debugging unrelated issue.
>>>> It's not ideal, but I think it's still ABI compliant as-is.  If it
>>>> wasn't, then I suspect things like virtual origins in Ada couldn't be
>>>> made ABI compliant.
>>> To be clear are u suggesting ADD sp, sp, 2047 is ABI compliant ?
>>> I'd still like to avoid it as I'm sure someone will complain about it.
>>>
>>>>> With the patch, we get following correct code instead:
>>>>>
>>>>> | ..
>>>>> | 59944:     add     s0,sp,2032
>>>>> | ..
>>>>> | 5995c:     add     s0,s0,16
>>>> Alternately you could tighten the positive side of the range of the
>>>> splitter from patch 1/3 so that you could always use 2032 rather than
>>>> 2047 on the first addi.   ie instead of allowing 2048..4094, allow
>>>> 2048..4064.
>>> 2033..4064 vs. 2048..4094
>>>
>>> Yeah I was a bit split about this as well. Since you are OK with either,
>>> I'll keep them as-is and perhaps add this observation to commitlog.
>> There's a subset of embedded use cases where an interrupt service
>> routine continues on the same stack as the interrupted thread,
>> requiring sp to always have an ABI-compliant value (i.e. 16B aligned,
>> and with no important data on the stack at an address below sp).
>>
>> Although not all use cases care about this property, it seems more
>> straightforward to maintain the invariant everywhere, rather than
>> selectively enforce it.
> Just to be clear, the changes don't misalign the stack pointer at all. 
> They merely have the potential to create *another* pointer into the 
> stack which may or may not be aligned.  Which is totally normal, it's no 
> different than taking the address of a char on the stack.

Right, I never saw any add sp,sp,2047 getting generated - not even with
the first version of the patch, which lacked any filtering of stack regs
via riscv_reg_frame_related () and obviously didn't have the stack
variant of the splitter. I don't know if that is just luck and
insufficient testing exposure (I only spot-checked buildroot libc and
vmlinux) or whether something somewhere enforces it.

However, given that a misaligned pointer off of the stack is a non-issue,
I think we can do the following:

1. Keep just one splitter, with 2047-based predicates and constraint (and
not 2032), for both stack-related and general regs.
2. Gate the splitter only on operands[0] not being stack related
(currently it checks either [0] or [1]) - this allows the prominent
case where SP is simply a src, while avoiding any potential shenanigans
to SP itself.

-Vineet


* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-19 20:05           ` Vineet Gupta
@ 2024-03-19 20:58             ` Andrew Waterman
  2024-03-19 21:17             ` Palmer Dabbelt
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Andrew Waterman @ 2024-03-19 20:58 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: Jeff Law, gcc-patches, kito.cheng, Palmer Dabbelt, gnu-toolchain,
	Robin Dapp

On Tue, Mar 19, 2024 at 1:05 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>
>
>
> On 3/19/24 06:10, Jeff Law wrote:
> > On 3/19/24 12:48 AM, Andrew Waterman wrote:
> >> On Mon, Mar 18, 2024 at 5:28 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
> >>> On 3/16/24 13:21, Jeff Law wrote:
> >>>> |   59944:    add     s0,sp,2047  <----
> >>>> |   59948:    mv      a2,a0
> >>>> |   5994c:    mv      a3,a1
> >>>> |   59950:    mv      a0,sp
> >>>> |   59954:    li      a4,1
> >>>> |   59958:    lui     a1,0x1
> >>>> |   5995c:    add     s0,s0,1     <---
> >>>> |   59960:    jal     59a3c
> >>>>
> >>>> SP here becomes unaligned, even if transitively which is undesirable as
> >>>> well as incorrect:
> >>>>    - ABI requires stack to be 8 byte aligned
> >>>>    - asm code looks weird and unexpected
> >>>>    - to the user it might falsely seem like a compiler bug even when not,
> >>>>      specially when staring at asm for debugging unrelated issue.
> >>>> It's not ideal, but I think it's still ABI compliant as-is.  If it
> >>>> wasn't, then I suspect things like virtual origins in Ada couldn't be
> >>>> made ABI compliant.
> >>> To be clear are u suggesting ADD sp, sp, 2047 is ABI compliant ?
> >>> I'd still like to avoid it as I'm sure someone will complain about it.
> >>>
> >>>>> With the patch, we get following correct code instead:
> >>>>>
> >>>>> | ..
> >>>>> | 59944:     add     s0,sp,2032
> >>>>> | ..
> >>>>> | 5995c:     add     s0,s0,16
> >>>> Alternately you could tighten the positive side of the range of the
> >>>> splitter from patch 1/3 so that you could always use 2032 rather than
> >>>> 2047 on the first addi.   ie instead of allowing 2048..4094, allow
> >>>> 2048..4064.
> >>> 2033..4064 vs. 2048..4094
> >>>
> >>> Yeah I was a bit split about this as well. Since you are OK with either,
> >>> I'll keep them as-is and perhaps add this observation to commitlog.
> >> There's a subset of embedded use cases where an interrupt service
> >> routine continues on the same stack as the interrupted thread,
> >> requiring sp to always have an ABI-compliant value (i.e. 16B aligned,
> >> and with no important data on the stack at an address below sp).
> >>
> >> Although not all use cases care about this property, it seems more
> >> straightforward to maintain the invariant everywhere, rather than
> >> selectively enforce it.
> > Just to be clear, the changes don't misalign the stack pointer at all.
> > They merely have the potential to create *another* pointer into the
> > stack which may or may not be aligned.  Which is totally normal, it's no
> > different than taking the address of a char on the stack.
>
> Right I never saw any sp,sp,2047 getting generated - not even in the
> first version of patch which lacked any filtering of stack regs via
> riscv_reg_frame_related () and obviously didn't have the stack variant
> of splitter. I don't know if that is just being lucky and not enough
> testing exposure (I only spot checked buildroot libc, vmlinux) or
> something somewhere enforces that.
>
> However given that misaligned pointer off of stack is a non-issue, I
> think we can do the following:
>
> 1. keep just one splitter with 2047 based predicates and constraint (and
> not 2032) for both stack-related and general regs.
> 2. gate the splitter on only operands[0] being not stack related
> (currently it checks for either [0] or [1]) - this allows the prominent
> case where SP is simply a src, and avoids when any potential shenanigans
> to SP itself.

Agreed.  I misread the original code (add s0,sp,2047 looks a lot like
add sp,sp,2047 from a quick glance on a cell phone).

>
> -Vineet


* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-19 20:05           ` Vineet Gupta
  2024-03-19 20:58             ` Andrew Waterman
@ 2024-03-19 21:17             ` Palmer Dabbelt
  2024-03-20 18:57             ` Jeff Law
  2024-03-23  6:05             ` Jeff Law
  3 siblings, 0 replies; 28+ messages in thread
From: Palmer Dabbelt @ 2024-03-19 21:17 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: jeffreyalaw, Andrew Waterman, gcc-patches, Kito Cheng,
	gnu-toolchain, Robin Dapp

On Tue, 19 Mar 2024 13:05:54 PDT (-0700), Vineet Gupta wrote:
>
>
> On 3/19/24 06:10, Jeff Law wrote:
>> On 3/19/24 12:48 AM, Andrew Waterman wrote:
>>> On Mon, Mar 18, 2024 at 5:28 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>>>> On 3/16/24 13:21, Jeff Law wrote:
>>>>> |   59944:    add     s0,sp,2047  <----
>>>>> |   59948:    mv      a2,a0
>>>>> |   5994c:    mv      a3,a1
>>>>> |   59950:    mv      a0,sp
>>>>> |   59954:    li      a4,1
>>>>> |   59958:    lui     a1,0x1
>>>>> |   5995c:    add     s0,s0,1     <---
>>>>> |   59960:    jal     59a3c
>>>>>
>>>>> SP here becomes unaligned, even if transitively which is undesirable as
>>>>> well as incorrect:
>>>>>    - ABI requires stack to be 8 byte aligned

It's 16-byte aligned in the default ABI, for Q, but that doesn't really 
matter.

>>>>>    - asm code looks weird and unexpected
>>>>>    - to the user it might falsely seem like a compiler bug even when not,
>>>>>      specially when staring at asm for debugging unrelated issue.
>>>>> It's not ideal, but I think it's still ABI compliant as-is.  If it
>>>>> wasn't, then I suspect things like virtual origins in Ada couldn't be
>>>>> made ABI compliant.
>>>> To be clear are u suggesting ADD sp, sp, 2047 is ABI compliant ?
>>>> I'd still like to avoid it as I'm sure someone will complain about it.
>>>>
>>>>>> With the patch, we get following correct code instead:
>>>>>>
>>>>>> | ..
>>>>>> | 59944:     add     s0,sp,2032
>>>>>> | ..
>>>>>> | 5995c:     add     s0,s0,16
>>>>> Alternately you could tighten the positive side of the range of the
>>>>> splitter from patch 1/3 so that you could always use 2032 rather than
>>>>> 2047 on the first addi.   ie instead of allowing 2048..4094, allow
>>>>> 2048..4064.
>>>> 2033..4064 vs. 2048..4094
>>>>
>>>> Yeah I was a bit split about this as well. Since you are OK with either,
>>>> I'll keep them as-is and perhaps add this observation to commitlog.
>>> There's a subset of embedded use cases where an interrupt service
>>> routine continues on the same stack as the interrupted thread,
>>> requiring sp to always have an ABI-compliant value (i.e. 16B aligned,
>>> and with no important data on the stack at an address below sp).
>>>
>>> Although not all use cases care about this property, it seems more
>>> straightforward to maintain the invariant everywhere, rather than
>>> selectively enforce it.
>> Just to be clear, the changes don't misalign the stack pointer at all.
>> They merely have the potential to create *another* pointer into the
>> stack which may or may not be aligned.  Which is totally normal, it's no
>> different than taking the address of a char on the stack.

IIRC the "always"-ness of the stack pointer alignment has come up 
before, and we decided to keep it aligned for these embedded 
interrupt-related reasons that Andrew points out.  That's a bit 
different than most other ABI requirements, where we're just enforcing 
them on function boundaries.

That said, this one sounds like just a terminology issue: it's not SP 
that's misaligned, but intermediate SP-based addressing calculations.  
I'm not sure if there's a word for these SP-based intermediate values; 
during last week's team meeting we came up with "stack anchors".

> Right I never saw any sp,sp,2047 getting generated - not even in the
> first version of patch which lacked any filtering of stack regs via
> riscv_reg_frame_related () and obviously didn't have the stack variant
> of splitter. I don't know if that is just being lucky and not enough
> testing exposure (I only spot checked buildroot libc, vmlinux) or
> something somewhere enforces that.

I guess we could run the tests with something like

    diff --git a/target/riscv/translate.c b/target/riscv/translate.c
    index ab18899122..e87cc83067 100644
    --- a/target/riscv/translate.c
    +++ b/target/riscv/translate.c
    @@ -320,12 +320,18 @@ static void gen_goto_tb(DisasContext *ctx, int n, target_long diff)
      */
     static TCGv get_gpr(DisasContext *ctx, int reg_num, DisasExtend ext)
     {
    -    TCGv t;
    +    TCGv t, gpr;
    
         if (reg_num == 0) {
             return ctx->zero;
         }
    
    +    if (reg_num == 2) {
    +        gpr = tcg_temp_new();
    +        tcg_gen_andi_tl(gpr, cpu_gpr[reg_num], 0xF);
    +    } else
    +        gpr = cpu_gpr[reg_num];
    +
         switch (get_ol(ctx)) {
         case MXL_RV32:
             switch (ext) {
    @@ -333,11 +339,11 @@ static TCGv get_gpr(DisasContext *ctx, int reg_num, DisasExtend ext)
                 break;
             case EXT_SIGN:
                 t = tcg_temp_new();
    -            tcg_gen_ext32s_tl(t, cpu_gpr[reg_num]);
    +            tcg_gen_ext32s_tl(t, gpr);
                 return t;
             case EXT_ZERO:
                 t = tcg_temp_new();
    -            tcg_gen_ext32u_tl(t, cpu_gpr[reg_num]);
    +            tcg_gen_ext32u_tl(t, gpr);
                 return t;
             default:
                 g_assert_not_reached();
    @@ -349,7 +355,7 @@ static TCGv get_gpr(DisasContext *ctx, int reg_num, DisasExtend ext)
         default:
             g_assert_not_reached();
         }
    -    return cpu_gpr[reg_num];
    +    return gpr;
     }
    
     static TCGv get_gprh(DisasContext *ctx, int reg_num)

and see if anything blows up?  I'm not sure it's worth opening that can 
of worms, though...

> However given that misaligned pointer off of stack is a non-issue, I
> think we can do the following:
>
> 1. keep just one splitter with 2047 based predicates and constraint (and
> not 2032) for both stack-related and general regs.
> 2. gate the splitter on only operands[0] being not stack related
> (currently it checks for either [0] or [1]) - this allows the prominent
> case where SP is simply a src, and avoids when any potential shenanigans
> to SP itself.

That seems reasonable to me.

>
> -Vineet


* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-19 20:05           ` Vineet Gupta
  2024-03-19 20:58             ` Andrew Waterman
  2024-03-19 21:17             ` Palmer Dabbelt
@ 2024-03-20 18:57             ` Jeff Law
  2024-03-23  6:05             ` Jeff Law
  3 siblings, 0 replies; 28+ messages in thread
From: Jeff Law @ 2024-03-20 18:57 UTC (permalink / raw)
  To: Vineet Gupta, Andrew Waterman
  Cc: gcc-patches, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/19/24 2:05 PM, Vineet Gupta wrote:

>> Just to be clear, the changes don't misalign the stack pointer at all.
>> They merely have the potential to create *another* pointer into the
>> stack which may or may not be aligned.  Which is totally normal, it's no
>> different than taking the address of a char on the stack.
> 
> Right I never saw any sp,sp,2047 getting generated - not even in the
> first version of patch which lacked any filtering of stack regs via
> riscv_reg_frame_related () and obviously didn't have the stack variant
> of splitter. I don't know if that is just being lucky and not enough
> testing exposure (I only spot checked buildroot libc, vmlinux) or
> something somewhere enforces that.
> 
> However given that misaligned pointer off of stack is a non-issue, I
> think we can do the following:
> 
> 1. keep just one splitter with 2047 based predicates and constraint (and
> not 2032) for both stack-related and general regs.
> 2. gate the splitter on only operands[0] being not stack related
> (currently it checks for either [0] or [1]) - this allows the prominent
> case where SP is simply a src, and avoids when any potential shenanigans
> to SP itself.
Works for me.

Jeff



* Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak
  2024-03-19  4:41 ` [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Jeff Law
@ 2024-03-21  0:45   ` Vineet Gupta
  2024-03-21 14:36   ` scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak) Vineet Gupta
  1 sibling, 0 replies; 28+ messages in thread
From: Vineet Gupta @ 2024-03-21  0:45 UTC (permalink / raw)
  To: Jeff Law, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/18/24 21:41, Jeff Law wrote:
>> The first patch is the main change which improves SPEC cactu by 10%.
> Just to confirm.  Yup, 10% reduction in icounts and about a 3.5% 
> improvement in cycles on our target.  Which is great!

Nice.

> This also makes me wonder if cactu is the benchmark that was sensitive 
> to flushing the pending queue in the scheduler.  Jivan's data would tend 
> to indicate that is the case as several routines seem to flush the 
> pending queue often.  In particular:
>
> ML_BSSN_RHS_Body
> ML_BSSN_Advect_Body
> ML_BSSN_constraints_Body
>
> All have a high number of dynamic instructions as well as lots of 
> flushes of the pending queue.
>
> Vineet, you might want to look and see if cranking up the 
> max-pending-list-length parameter helps drive down spilling.   I think 
> it's default value is 32 insns.  I've seen it cranked up to 128 and 256 
> insns without significant ill effects on compile time.
>
> My recollection (it's been like 3 years) of the key loop was that it had 
> a few hundred instructions and we'd flush the pending list about 50 
> cycles into the loop as there just wasn't enough issue bandwidth to the 
> FP units to dispatch all the FP instructions as their inputs became 
> ready.  So you'd be looking for flushes in a big loop.

Great insight.

Fired off a cactu run with 128, will keep you posted.

Thx,
-Vineet


* scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak)
  2024-03-19  4:41 ` [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Jeff Law
  2024-03-21  0:45   ` Vineet Gupta
@ 2024-03-21 14:36   ` Vineet Gupta
  2024-03-21 14:45     ` Jeff Law
  1 sibling, 1 reply; 28+ messages in thread
From: Vineet Gupta @ 2024-03-21 14:36 UTC (permalink / raw)
  To: Jeff Law, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/18/24 21:41, Jeff Law wrote:
>
> On 3/16/24 11:35 AM, Vineet Gupta wrote:
>> Hi,
>>
>> This set of patches (for gcc-15) help improve stack/array accesses
>> by improving constant materialization. Details are in respective
>> patches.
>>
>> The first patch is the main change which improves SPEC cactu by 10%.
> Just to confirm.  Yup, 10% reduction in icounts and about a 3.5% 
> improvement in cycles on our target.  Which is great!
>
> This also makes me wonder if cactu is the benchmark that was sensitive 
> to flushing the pending queue in the scheduler.  Jivan's data would tend 
> to indicate that is the case as several routines seem to flush the 
> pending queue often.  In particular:
>
> ML_BSSN_RHS_Body
> ML_BSSN_Advect_Body
> ML_BSSN_constraints_Body
>
> All have a high number of dynamic instructions as well as lots of 
> flushes of the pending queue.
>
> Vineet, you might want to look and see if cranking up the 
> max-pending-list-length parameter helps drive down spilling.   I think 
> it's default value is 32 insns.  I've seen it cranked up to 128 and 256 
> insns without significant ill effects on compile time.
>
> My recollection (it's been like 3 years) of the key loop was that it had 
> a few hundred instructions and we'd flush the pending list about 50 
> cycles into the loop as there just wasn't enough issue bandwidth to the 
> FP units to dispatch all the FP instructions as their inputs became 
> ready.  So you'd be looking for flushes in a big loop.

Here are the results for Cactu on top of the new splitter changes:

default	: 2,565,319,368,591
128	: 2,509,741,035,068
256	: 2,527,817,813,612

I haven't probed deeper into the generated code itself, but it is
likely helping with spilling.
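For reference, a quick back-of-the-envelope on the icounts above (just
arithmetic, using the three runs listed):

```python
# Dynamic instruction counts from the Cactu runs above.
default = 2_565_319_368_591
runs = {128: 2_509_741_035_068, 256: 2_527_817_813_612}

for length, icount in runs.items():
    pct = (default - icount) / default * 100
    print(f"max-pending-list-length={length}: {pct:.2f}% fewer insns")
```

That works out to roughly a 2.2% reduction at 128 and 1.5% at 256.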

-Vineet


* Re: scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak)
  2024-03-21 14:36   ` scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak) Vineet Gupta
@ 2024-03-21 14:45     ` Jeff Law
  2024-03-21 17:19       ` Vineet Gupta
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Law @ 2024-03-21 14:45 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/21/24 8:36 AM, Vineet Gupta wrote:
> 
> 
> On 3/18/24 21:41, Jeff Law wrote:
>>
>> On 3/16/24 11:35 AM, Vineet Gupta wrote:
>>> Hi,
>>>
>>> This set of patches (for gcc-15) help improve stack/array accesses
>>> by improving constant materialization. Details are in respective
>>> patches.
>>>
>>> The first patch is the main change which improves SPEC cactu by 10%.
>> Just to confirm.  Yup, 10% reduction in icounts and about a 3.5%
>> improvement in cycles on our target.  Which is great!
>>
>> This also makes me wonder if cactu is the benchmark that was sensitive
>> to flushing the pending queue in the scheduler.  Jivan's data would tend
>> to indicate that is the case as several routines seem to flush the
>> pending queue often.  In particular:
>>
>> ML_BSSN_RHS_Body
>> ML_BSSN_Advect_Body
>> ML_BSSN_constraints_Body
>>
>> All have a high number of dynamic instructions as well as lots of
>> flushes of the pending queue.
>>
>> Vineet, you might want to look and see if cranking up the
>> max-pending-list-length parameter helps drive down spilling.   I think
>> it's default value is 32 insns.  I've seen it cranked up to 128 and 256
>> insns without significant ill effects on compile time.
>>
>> My recollection (it's been like 3 years) of the key loop was that it had
>> a few hundred instructions and we'd flush the pending list about 50
>> cycles into the loop as there just wasn't enough issue bandwidth to the
>> FP units to dispatch all the FP instructions as their inputs became
>> ready.  So you'd be looking for flushes in a big loop.
> 
> Here are the results for Cactu on top of the new splitter changes:
> 
> default	: 2,565,319,368,591
> 128	: 2,509,741,035,068
> 256	: 2,527,817,813,612
> 
> I haven't probed deeper into the generated code itself, but it's likely
> helping with spilling.
Actually, I read that as not important for this issue.  While it is 50b 
instructions, I would be looking for something that had perhaps an order 
of magnitude bigger impact.    Ultimately I think it means we still 
don't have a good handle on what's causing the spilling.  Oh well.

So if we go back to Robin's observation that scheduling dramatically 
increases the instruction count, perhaps we try a run with 
-fno-schedule-insns -fno-schedule-insns2 and see how the instruction 
counts compare.


Jeff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak)
  2024-03-21 14:45     ` Jeff Law
@ 2024-03-21 17:19       ` Vineet Gupta
  2024-03-21 19:56         ` Jeff Law
  2024-03-25  3:05         ` Jeff Law
  0 siblings, 2 replies; 28+ messages in thread
From: Vineet Gupta @ 2024-03-21 17:19 UTC (permalink / raw)
  To: Jeff Law, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/21/24 07:45, Jeff Law wrote:
>>>> The first patch is the main change which improves SPEC cactu by 10%.
>>> Just to confirm.  Yup, 10% reduction in icounts and about a 3.5%
>>> improvement in cycles on our target.  Which is great!
>>>
>>> This also makes me wonder if cactu is the benchmark that was sensitive
>>> to flushing the pending queue in the scheduler.  Jivan's data would tend
>>> to indicate that is the case as several routines seem to flush the
>>> pending queue often.  In particular:
>>>
>>> ML_BSSN_RHS_Body
>>> ML_BSSN_Advect_Body
>>> ML_BSSN_constraints_Body
>>>
>>> All have a high number of dynamic instructions as well as lots of
>>> flushes of the pending queue.
>>>
>>> Vineet, you might want to look and see if cranking up the
>>> max-pending-list-length parameter helps drive down spilling.   I think
>>> its default value is 32 insns.  I've seen it cranked up to 128 and 256
>>> insns without significant ill effects on compile time.
>>>
>>> My recollection (it's been like 3 years) of the key loop was that it had
>>> a few hundred instructions and we'd flush the pending list about 50
>>> cycles into the loop as there just wasn't enough issue bandwidth to the
>>> FP units to dispatch all the FP instructions as their inputs became
>>> ready.  So you'd be looking for flushes in a big loop.
>> Here are the results for Cactu on top of the new splitter changes:
>>
>> default	: 2,565,319,368,591
>> 128	: 2,509,741,035,068
>> 256	: 2,527,817,813,612
>>
>> I haven't probed deeper into the generated code itself, but it's likely
>> helping with spilling.
> Actually, I read that as not important for this issue.  While it is 50b 
> instructions, I would be looking for something that had perhaps an order 
> of magnitude bigger impact.    Ultimately I think it means we still 
> don't have a good handle on what's causing the spilling.  Oh well.
>
> So if we go back to Robin's observation that scheduling dramatically 
> increases the instruction count, perhaps we try a run with 
> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction 
> counts compare.

Oh yeah! Robin hinted at this in Tuesday's patchworks meeting too.

default	    : 2,565,319,368,591
128	    : 2,509,741,035,068
256	    : 2,527,817,813,612
no-sched{,2}: 1,295,520,567,376

-Vineet


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak)
  2024-03-21 17:19       ` Vineet Gupta
@ 2024-03-21 19:56         ` Jeff Law
  2024-03-22  0:34           ` scheduler queue flush Vineet Gupta
  2024-03-22  8:47           ` scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak) Richard Biener
  2024-03-25  3:05         ` Jeff Law
  1 sibling, 2 replies; 28+ messages in thread
From: Jeff Law @ 2024-03-21 19:56 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/21/24 11:19 AM, Vineet Gupta wrote:

>>
>> So if we go back to Robin's observation that scheduling dramatically
>> increases the instruction count, perhaps we try a run with
>> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction
>> counts compare.
> 
> Oh yeah! Robin hinted at this in Tuesday's patchworks meeting too.
> 
> default	    : 2,565,319,368,591
> 128	    : 2,509,741,035,068
> 256	    : 2,527,817,813,612
> no-sched{,2}: 1,295,520,567,376
Now we're getting somewhere.  That's in line with expectations.

I would strongly suspect it's -fno-schedule-insns rather than 
-fno-schedule-insns2.  The former turns off scheduling before register 
allocation, the second turns it off after register allocation.  So if 
our theory about spilling is correct, then it must be the first since 
the second won't affect register allocation.   While I can speculate 
about other potential scheduler impacts, spilling due to sched1's 
actions is by far the most likely.

Given the magnitude here, I would bet we can see this pretty clearly if 
you've got function level or block level count data for those runs.  I'd 
start with that, ideally narrowing things down to a function or hot loop 
within a function which shows a huge delta.

From that we can then look at the IRA and LRA dumps and correlate what 
we see there with the before/after scheduling dumps to see how we've 
lengthened lifetimes in critical locations.

I'd probably start with the IRA dump.  It's going to have annotations in 
its dump output like "Potential Spill" which may guide us.  In simplest 
terms a pseudo is trivially allocatable when it has fewer neighbors in 
the conflict graph than available hard registers.  If it has more 
neighbors in the conflict graph than available hard registers, then it's 
potentially going to be spilled -- we can't know during this phase of 
allocation.

As we pop registers off the coloring stack, some neighbors of the pseudo 
in question may end up allocated into the same hard register.  That can 
sometimes result in a hard register being available.  It might be easier 
to see with a graph

     a--b--c
        |
        d

Where a..d are pseudo registers.  If two pseudos are connected by an 
edge, then they have overlapping lifetimes and can't be allocated to the 
same hard register.  So as we can see b conflicts with a, c & d.  If we 
only have two hard registers, then b is not trivially colorable and will 
be marked as a potential spill.

During coloring we may end up allocating a, c & d to the same hard 
register (they don't conflict, so it's safe).  If that happens, then 
there would be a register available for b.

Anyway, that should explain why b would be marked as a potential spill 
and how it might end up getting a hard register anyway.
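
For illustration, the walk above can be sketched in C.  This is only a toy
model of the coloring idea (the pseudo names, greedy order, and helpers are
made up; the real IRA is far more involved):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the a--b--c / b--d conflict graph with 2 hard regs.
   Order: a=0, b=1, c=2, d=3.  conflicts[i][j] != 0 means pseudos i
   and j have overlapping lifetimes.  */
enum { NUM_PSEUDOS = 4, NUM_HARD_REGS = 2, UNALLOCATED = -1 };

static const int conflicts[NUM_PSEUDOS][NUM_PSEUDOS] = {
  /* a */ {0, 1, 0, 0},
  /* b */ {1, 0, 1, 1},
  /* c */ {0, 1, 0, 0},
  /* d */ {0, 1, 0, 0},
};

static int degree (int p)
{
  int d = 0;
  for (int j = 0; j < NUM_PSEUDOS; j++)
    d += conflicts[p][j];
  return d;
}

/* A pseudo is trivially colorable iff it has fewer conflict-graph
   neighbors than available hard registers; otherwise it gets marked
   as a potential spill.  */
static bool trivially_colorable (int p)
{
  return degree (p) < NUM_HARD_REGS;
}

/* Greedy assignment in a given pop order: take the lowest hard reg
   not already used by a colored neighbor; UNALLOCATED on failure.  */
static void color (const int order[NUM_PSEUDOS], int hard_reg[NUM_PSEUDOS])
{
  for (int i = 0; i < NUM_PSEUDOS; i++)
    hard_reg[i] = UNALLOCATED;
  for (int i = 0; i < NUM_PSEUDOS; i++)
    {
      int p = order[i];
      bool used[NUM_HARD_REGS] = { false };
      for (int j = 0; j < NUM_PSEUDOS; j++)
        if (conflicts[p][j] && hard_reg[j] != UNALLOCATED)
          used[hard_reg[j]] = true;
      for (int r = 0; r < NUM_HARD_REGS; r++)
        if (!used[r])
          {
            hard_reg[p] = r;
            break;
          }
    }
}
```

Popping a, c, d first puts them all in the same hard register, after which b
gets the other one despite having been flagged as a potential spill.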

The hope is we can see the potential spills increasing.  At which point 
we can walk backwards to sched1 and dive into its scheduling decisions.

Jeff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: scheduler queue flush
  2024-03-21 19:56         ` Jeff Law
@ 2024-03-22  0:34           ` Vineet Gupta
  2024-03-22  8:47           ` scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak) Richard Biener
  1 sibling, 0 replies; 28+ messages in thread
From: Vineet Gupta @ 2024-03-22  0:34 UTC (permalink / raw)
  To: Jeff Law, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/21/24 12:56, Jeff Law wrote:
>
> On 3/21/24 11:19 AM, Vineet Gupta wrote:
>
>>> So if we go back to Robin's observation that scheduling dramatically
>>> increases the instruction count, perhaps we try a run with
>>> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction
>>> counts compare.
>> Oh yeah! Robin hinted at this in Tuesday's patchworks meeting too.
>>
>> default	    : 2,565,319,368,591
>> 128	    : 2,509,741,035,068
>> 256	    : 2,527,817,813,612
>> no-sched{,2}: 1,295,520,567,376
> Now we're getting somewhere.  That's in line with expectations.
>
> I would strongly suspect it's -fno-schedule-insns rather than 
> -fno-schedule-insns2.  The former turns off scheduling before register 
> allocation, the second turns it off after register allocation.  So if 
> our theory about spilling is correct, then it must be the first since 
> the second won't affect register allocation.   While I can speculate 
> about other potential scheduler impacts, spilling due to sched1's 
> actions is by far the most likely.

As always you are absolutely right: just doing -fno-schedule-insns gets
almost the same count as the last row above.

> Given the magnitude here, I would bet we can see this pretty clearly if 
> you've got function level or block level count data for those runs.  I'd 
> start with that, ideally narrowing things down to a function or hot loop 
> within a function which shows a huge delta.

Alright, on it.

Thx,
-Vineet

> From that we can then look at the IRA and LRA dumps and correlate what 
> we see there with the before/after scheduling dumps to see how we've 
> lengthened lifetimes in critical locations.
>
> I'd probably start with the IRA dump.  It's going to have annotations in 
> its dump output like "Potential Spill" which may guide us.  In simplest 
> terms a pseudo is trivially allocatable when it has fewer neighbors in 
> the conflict graph than available hard registers.  If it has more 
> neighbors in the conflict graph than available hard registers, then it's 
> potentially going to be spilled -- we can't know during this phase of 
> allocation.
>
> As we pop registers off the coloring stack, some neighbors of the pseudo 
> in question may end up allocated into the same hard register.  That can 
> sometimes result in a hard register being available.  It might be easier 
> to see with a graph
>
>      a--b--c
>         |
>         d
>
> Where a..d are pseudo registers.  If two pseudos are connected by an 
> edge, then they have overlapping lifetimes and can't be allocated to the 
> same hard register.  So as we can see b conflicts with a, c & d.  If we 
> only have two hard registers, then b is not trivially colorable and will 
> be marked as a potential spill.
>
> During coloring we may end up allocating a, c & d to the same hard 
> register (they don't conflict, so it's safe).  If that happens, then 
> there would be a register available for b.
>
> Anyway, that should explain why b would be marked as a potential spill 
> and how it might end up getting a hard register anyway.
>
> The hope is we can see the potential spills increasing.  At which point 
> we can walk backwards to sched1 and dive into its scheduling decisions.
>
> Jeff


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak)
  2024-03-21 19:56         ` Jeff Law
  2024-03-22  0:34           ` scheduler queue flush Vineet Gupta
@ 2024-03-22  8:47           ` Richard Biener
  2024-03-22 12:29             ` Jeff Law
  1 sibling, 1 reply; 28+ messages in thread
From: Richard Biener @ 2024-03-22  8:47 UTC (permalink / raw)
  To: Jeff Law
  Cc: Vineet Gupta, gcc-patches, kito.cheng, Palmer Dabbelt,
	gnu-toolchain, Robin Dapp

On Thu, Mar 21, 2024 at 8:56 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 3/21/24 11:19 AM, Vineet Gupta wrote:
>
> >>
> >> So if we go back to Robin's observation that scheduling dramatically
> >> increases the instruction count, perhaps we try a run with
> >> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction
> >> counts compare.
> >
> > Oh yeah! Robin hinted at this in Tuesday's patchworks meeting too.
> >
> > default           : 2,565,319,368,591
> > 128       : 2,509,741,035,068
> > 256       : 2,527,817,813,612
> > no-sched{,2}: 1,295,520,567,376
> Now we're getting somewhere.  That's in line with expectations.
>
> I would strongly suspect it's -fno-schedule-insns rather than
> -fno-schedule-insns2.  The former turns off scheduling before register
> allocation, the second turns it off after register allocation.  So if
> our theory about spilling is correct, then it must be the first since
> the second won't affect register allocation.   While I can speculate
> about other potential scheduler impacts, spilling due to sched1's
> actions is by far the most likely.

Another option is to enable -fsched-pressure which should help with
this issue.

> Given the magnitude here, I would bet we can see this pretty clearly if
> you've got function level or block level count data for those runs.  I'd
> start with that, ideally narrowing things down to a function or hot loop
> within a function which shows a huge delta.
>
>  From that we can then look at the IRA and LRA dumps and correlate what
> we see there with the before/after scheduling dumps to see how we've
> lengthened lifetimes in critical locations.
>
> I'd probably start with the IRA dump.  It's going to have annotations in
> its dump output like "Potential Spill" which may guide us.  In simplest
> terms a pseudo is trivially allocatable when it has fewer neighbors in
> the conflict graph than available hard registers.  If it has more
> neighbors in the conflict graph than available hard registers, then it's
> potentially going to be spilled -- we can't know during this phase of
> allocation.
>
> As we pop registers off the coloring stack, some neighbors of the pseudo
> in question may end up allocated into the same hard register.  That can
> sometimes result in a hard register being available.  It might be easier
> to see with a graph
>
>      a--b--c
>         |
>         d
>
> Where a..d are pseudo registers.  If two pseudos are connected by an
> edge, then they have overlapping lifetimes and can't be allocated to the
> same hard register.  So as we can see b conflicts with a, c & d.  If we
> only have two hard registers, then b is not trivially colorable and will
> be marked as a potential spill.
>
> During coloring we may end up allocating a, c & d to the same hard
> register (they don't conflict, so it's safe).  If that happens, then
> there would be a register available for b.
>
> Anyway, that should explain why b would be marked as a potential spill
> and how it might end up getting a hard register anyway.
>
> The hope is we can see the potential spills increasing.  At which point
> we can walk backwards to sched1 and dive into its scheduling decisions.
>
> Jeff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak)
  2024-03-22  8:47           ` scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak) Richard Biener
@ 2024-03-22 12:29             ` Jeff Law
  2024-03-22 16:56               ` Vineet Gupta
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Law @ 2024-03-22 12:29 UTC (permalink / raw)
  To: Richard Biener
  Cc: Vineet Gupta, gcc-patches, kito.cheng, Palmer Dabbelt,
	gnu-toolchain, Robin Dapp



On 3/22/24 2:47 AM, Richard Biener wrote:
> On Thu, Mar 21, 2024 at 8:56 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>
>>
>>
>> On 3/21/24 11:19 AM, Vineet Gupta wrote:
>>
>>>>
>>>> So if we go back to Robin's observation that scheduling dramatically
>>>> increases the instruction count, perhaps we try a run with
>>>> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction
>>>> counts compare.
>>>
>>> Oh yeah! Robin hinted at this in Tuesday's patchworks meeting too.
>>>
>>> default           : 2,565,319,368,591
>>> 128       : 2,509,741,035,068
>>> 256       : 2,527,817,813,612
>>> no-sched{,2}: 1,295,520,567,376
>> Now we're getting somewhere.  That's in line with expectations.
>>
>> I would strongly suspect it's -fno-schedule-insns rather than
>> -fno-schedule-insns2.  The former turns off scheduling before register
>> allocation, the second turns it off after register allocation.  So if
>> our theory about spilling is correct, then it must be the first since
>> the second won't affect register allocation.   While I can speculate
>> about other potential scheduler impacts, spilling due to sched1's
>> actions is by far the most likely.
> 
> Another option is to enable -fsched-pressure which should help with
> this issue.
In theory we're already using that by default -- it's part of what makes 
me so curious to understand what's going on.

jeff


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak)
  2024-03-22 12:29             ` Jeff Law
@ 2024-03-22 16:56               ` Vineet Gupta
  0 siblings, 0 replies; 28+ messages in thread
From: Vineet Gupta @ 2024-03-22 16:56 UTC (permalink / raw)
  To: Jeff Law, Richard Biener
  Cc: gcc-patches, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/22/24 05:29, Jeff Law wrote:
>> Another option is to enable -fsched-pressure which should help with
>> this issue.
> In theory we're already using that by default -- it's part of what makes 
> me so curious to understand what's going on.

We are actually using it in practice :-)
It's been the default for the RISC-V port since August of last year.

-Vineet

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  2024-03-19  0:07     ` Vineet Gupta
@ 2024-03-23  5:59       ` Jeff Law
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff Law @ 2024-03-23  5:59 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/18/24 6:07 PM, Vineet Gupta wrote:

> 
> Naive question: why is define_split more natural vs. define_insn_and_split?
> Is it because of the syntax/implementation or that one is more Combine
> "centric" and other is more of a Split* Pass thing (roughly speaking of
> course) or something else altogether ?
There are parts of combine that cost based on the number of insns.  This 
is primarily to keep combine from taking an insane amount of time.

So when we have a little white lie like mvconst_internal we muck up that 
costing aspect of the combiner.  That in turn stops the combiner from 
doing certain combinations.

As a concrete example consider this:

unsigned long long w32mem(unsigned long long w32)
{
     return w32 & ~(1U << 0);
}


Right now we use a bseti+addi to construct the constant, then do the 
logical and.   Prior to the combiner it looks like this:


> (insn 7 3 8 2 (set (reg:DI 137) 
>         (const_int 4294967294 [0xfffffffe])) "j.c":3:16 206 {*mvconst_internal}
>      (nil))
> (insn 8 7 13 2 (set (reg:DI 136 [ _2 ])
>         (and:DI (reg/v:DI 135 [ w32 ])
>             (reg:DI 137))) "j.c":3:16 102 {*anddi3}
>      (expr_list:REG_DEAD (reg:DI 137)
>         (expr_list:REG_DEAD (reg/v:DI 135 [ w32 ])
>             (expr_list:REG_EQUAL (and:DI (reg/v:DI 135 [ w32 ])
>                     (const_int 4294967294 [0xfffffffe]))
>                 (nil)))))

The combiner will refuse to match a splitter where the number of 
incoming insns matches the number of resulting insns.  So to match this 
we'd have to write another define_insn_and_split.



If we didn't have mvconst_internal, then we'd see something like this:

> (insn 6 3 7 2 (set (reg:DI 138)
>         (const_int 4294967296 [0x100000000])) "j.c":3:16 -1
>      (nil))
> (insn 7 6 8 2 (set (reg:DI 137)
>         (plus:DI (reg:DI 138)
>             (const_int -2 [0xfffffffffffffffe]))) "j.c":3:16 -1
>      (expr_list:REG_EQUAL (const_int 4294967294 [0xfffffffe])
>         (nil)))
> (insn 8 7 9 2 (set (reg:DI 136 [ _2 ])
>         (and:DI (reg/v:DI 135 [ w32 ])
>             (reg:DI 137))) "j.c":3:16 -1
>      (nil))

Which we could match with a define_split which would generate RTL for 
bclri+zext.w.


Note that I suspect there's a handful of these kinds of sequences for 
logical ops where the immediate doesn't fit a simm12.
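
For illustration, a quick way to spot one such family - a zero-extended
32-bit all-ones mask with a single bit cleared, like the 0xfffffffe above,
which could go through bclri + zext.w - might look like this (the helper
name is made up, not from the RISC-V backend):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: does VALUE match a zero-extended 32-bit mask with exactly
   one bit cleared?  Such constants could be handled with bclri +
   zext.w instead of synthesizing the constant separately.  */
static bool zext_single_bit_clear_p (uint64_t value, int *bit)
{
  uint64_t flipped = value ^ 0xffffffffu;   /* which bits differ from
                                               the all-ones mask?  */
  if (value >> 32 != 0                      /* must be zero-extended */
      || flipped == 0                       /* no bit cleared at all */
      || (flipped & (flipped - 1)) != 0)    /* more than one bit */
    return false;
  *bit = __builtin_ctzll (flipped);         /* index of the cleared bit */
  return true;
}
```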


Of course the point of the white lie is to expose complex constant 
synthesis in a way that the combiner can use to simplify things.  It's 
definitely a tradeoff either way.  What's been rattling around a little 
bit would be to narrow the set of constants allowed by mvconst_internal 
to those which require 3 or more insns to synthesize.  The idea being 
that cases like the one you're looking at, where we can use addi+add 
rather than lui+addi+add, would be rejected by mvconst_internal, but the 
more complex constants that push combine over the 4 insn limit would be 
allowed by mvconst_internal.
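
A rough sketch of the gating idea - count how many insns a constant costs
and only let the single-insn pattern claim the expensive ones.  This only
models addi / lui / lui+addi; the real riscv_build_integer handles many
more cases (shifts, Zb* instructions), and the function names here are
illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Rough insn count to synthesize VALUE on RV64 (addi / lui / lui+addi
   only).  A model of the idea, not the actual GCC code.  */
static int const_insn_count (int64_t value)
{
  if (value >= -2048 && value <= 2047)
    return 1;                              /* addi rd, x0, imm */
  if (value == (int32_t) value)
    {
      /* Round to the nearest lui-reachable value; the remainder is
         then always in the signed-12-bit addi range.  */
      int64_t hi = (value + 0x800) & ~(int64_t) 0xfff;
      if (hi == (int32_t) hi)
        return (value - hi) ? 2 : 1;       /* lui [+ addi] */
    }
  return 3;                                /* needs a longer sequence */
}

/* The narrowing discussed above: only let a mvconst_internal-style
   pattern claim constants that genuinely need several insns.  */
static int worth_single_insn_lie (int64_t value)
{
  return const_insn_count (value) >= 3;
}
```

Under this model 0xfffffffe from the earlier example costs 3+ insns and
would still be allowed, while simple lui+addi constants would not.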


> 
> Would we have to revisit the new splitter (and perhaps audit the
> existing define_insn_and_split patterns) if we were to go ahead with
> this revert ?
I don't recall follow-ups which used the result of mvconst_internal in 
subsequent combiner patterns, but it should be fairly simple to search 
for them.

We'd need to look back at the original bug which resulted in creating 
the mvconst_internal pattern.  My recollection is it had a fairly 
complex constant and we were unable to see enough insns to do the 
necessary simplification to fix that case.

I bet whatever goes on inside perlbench,  mcf and x264 (guessing based 
on instruction counts in your message) + the original bug report will 
provide reasonable testcases for evaluating reversion + adding patches 
to avoid the regressions.


>> So why use "P" for your mode iterator?  I would have expected GPR since
>> I think this works for both SI and DI mode as-is.
> 
> My intent at least was to have this work for either of rv64/rv32 and in
> either of those environments, work for both SI or DI - certainly
> rv64+{SI,DI}.
> To that end I debated between GPR, X and P.
> It seems GPR only supports DI if TARGET_64BIT.
> But I could well be wrong about above requirements or the subtle
> differences in these.
I wouldn't worry about GPR only supporting DI for TARGET_64BIT.  I don't 
think we generally expose DImode patterns for TARGET_32BIT.

> 
>> For the output template "#" for alternative 0 and the add instruction
>> for alternative 1 is probably better.  That way it's clear to everyone
>> that alternative 0 is always meant to be split.
> 
> OK.
> 
>> Don't you need some kind of condition on the split to avoid splitting
>> when you've got a register for operands[2]?
> 
> Won't the predicate "match_code" guarantee that?
> 
>    (define_predicate "const_two_s12"
>       (match_code "const_int")
>     {
>       return SUM_OF_TWO_S12 (INTVAL (op), false);
>     })
You're right.  Missed the match_code.  Sorry for the bad steer.
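
For context, a C sketch of what the sum-of-two-s12 test and the
corresponding split might look like.  This only models the idea; the
actual SUM_OF_TWO_S12 macro in riscv.h may be narrower:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Can VALUE be reached with two addi's, i.e. as a sum of two signed
   12-bit immediates, each in [-2048, 2047]?  */
static bool sum_of_two_s12 (int64_t value)
{
  return value >= -4096 && value <= 4094;
}

/* Corresponding split: saturate the first addend at the s12 bound and
   put the remainder in the second, which then also fits in s12.  */
static void split_s12 (int64_t value, int64_t *first, int64_t *second)
{
  assert (sum_of_two_s12 (value));
  if (value > 2047)
    *first = 2047;
  else if (value < -2048)
    *first = -2048;
  else
    *first = value;
  *second = value - *first;
}
```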

> 
> Do I need to build gcc in a certain way to expose such errors - I wasn't
> able to trigger such an error for the entire testsuite or SPEC build.
There's a distinct RTL checking mode.  It's fairly expensive from a 
compile-time standpoint though.

Jeff


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned
  2024-03-19 20:05           ` Vineet Gupta
                               ` (2 preceding siblings ...)
  2024-03-20 18:57             ` Jeff Law
@ 2024-03-23  6:05             ` Jeff Law
  3 siblings, 0 replies; 28+ messages in thread
From: Jeff Law @ 2024-03-23  6:05 UTC (permalink / raw)
  To: Vineet Gupta, Andrew Waterman
  Cc: gcc-patches, kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/19/24 2:05 PM, Vineet Gupta wrote:
> 
> 
> On 3/19/24 06:10, Jeff Law wrote:
>> On 3/19/24 12:48 AM, Andrew Waterman wrote:
>>> On Mon, Mar 18, 2024 at 5:28 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>>>> On 3/16/24 13:21, Jeff Law wrote:
>>>>> |   59944:    add     s0,sp,2047  <----
>>>>> |   59948:    mv      a2,a0
>>>>> |   5994c:    mv      a3,a1
>>>>> |   59950:    mv      a0,sp
>>>>> |   59954:    li      a4,1
>>>>> |   59958:    lui     a1,0x1
>>>>> |   5995c:    add     s0,s0,1     <---
>>>>> |   59960:    jal     59a3c
>>>>>
>>>>> SP here becomes unaligned, even if transitively which is undesirable as
>>>>> well as incorrect:
>>>>>     - ABI requires stack to be 8 byte aligned
>>>>>     - asm code looks weird and unexpected
>>>>>     - to the user it might falsely seem like a compiler bug even when not,
>>>>>       specially when staring at asm for debugging unrelated issue.
>>>>> It's not ideal, but I think it's still ABI compliant as-is.  If it
>>>>> wasn't, then I suspect things like virtual origins in Ada couldn't be
>>>>> made ABI compliant.
>>>> To be clear, are you suggesting ADD sp, sp, 2047 is ABI compliant?
>>>> I'd still like to avoid it as I'm sure someone will complain about it.
>>>>
>>>>>> With the patch, we get following correct code instead:
>>>>>>
>>>>>> | ..
>>>>>> | 59944:     add     s0,sp,2032
>>>>>> | ..
>>>>>> | 5995c:     add     s0,s0,16
>>>>> Alternately you could tighten the positive side of the range of the
>>>>> splitter from patch 1/3 so that you could always use 2032 rather than
>>>>> 2047 on the first addi.   ie instead of allowing 2048..4094, allow
>>>>> 2048..4064.
>>>> 2033..4064 vs. 2048..4094
>>>>
>>>> Yeah I was a bit split about this as well. Since you are OK with either,
>>>> I'll keep them as-is and perhaps add this observation to commitlog.
>>> There's a subset of embedded use cases where an interrupt service
>>> routine continues on the same stack as the interrupted thread,
>>> requiring sp to always have an ABI-compliant value (i.e. 16B aligned,
>>> and with no important data on the stack at an address below sp).
>>>
>>> Although not all use cases care about this property, it seems more
>>> straightforward to maintain the invariant everywhere, rather than
>>> selectively enforce it.
>> Just to be clear, the changes don't misalign the stack pointer at all.
>> They merely have the potential to create *another* pointer into the
>> stack which may or may not be aligned.  Which is totally normal, it's no
>> different than taking the address of a char on the stack.
> 
> Right, I never saw any add sp,sp,2047 getting generated - not even in the
> first version of the patch, which lacked any filtering of stack regs via
> riscv_reg_frame_related () and obviously didn't have the stack variant
> of the splitter.  I don't know if that is just luck and not enough
> testing exposure (I only spot-checked buildroot libc, vmlinux), or
> whether something somewhere enforces that.
> 
> However, given that a misaligned pointer off of the stack is a non-issue,
> I think we can do the following:
> 
> 1. keep just one splitter with 2047-based predicates and constraint (and
> not 2032) for both stack-related and general regs.
> 2. gate the splitter on operands[0] not being stack-related (currently
> it checks either [0] or [1]) - this allows the prominent case where SP
> is simply a source, and avoids any potential shenanigans to SP itself.
Sounds reasonable to me.
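
For reference, the aligned variant discussed earlier in the thread - using
2032 rather than 2047 for the first addend so any intermediate
stack-derived pointer stays 16-byte aligned - could be sketched as (helper
name and exact range are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Split a large stack offset into two addi-able parts where the first
   is a multiple of 16 (2032 = 2047 & ~15), so a pointer formed after
   the first add keeps the stack's 16-byte alignment.  */
static void split_offset_aligned (int64_t off, int64_t *first, int64_t *second)
{
  assert (off > 2047 && off <= 2032 + 2047);  /* needs two addi's */
  *first = 2032;            /* largest 16-byte-aligned s12 value */
  *second = off - *first;   /* fits in s12 by the range check */
}
```

E.g. an offset of 2048 becomes add 2032 followed by add 16, matching the
corrected sequence shown in the patch discussion.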

Jeff


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak)
  2024-03-21 17:19       ` Vineet Gupta
  2024-03-21 19:56         ` Jeff Law
@ 2024-03-25  3:05         ` Jeff Law
  1 sibling, 0 replies; 28+ messages in thread
From: Jeff Law @ 2024-03-25  3:05 UTC (permalink / raw)
  To: Vineet Gupta, gcc-patches
  Cc: kito.cheng, Palmer Dabbelt, gnu-toolchain, Robin Dapp



On 3/21/24 11:19 AM, Vineet Gupta wrote:
> 

> 
> Oh yeah! Robin hinted at this in Tuesday's patchworks meeting too.
> 
> default	    : 2,565,319,368,591
> 128	    : 2,509,741,035,068
> 256	    : 2,527,817,813,612
> no-sched{,2}: 1,295,520,567,376
So one more nugget here.  I happened to be doing some historical data 
mining and I can see the huge instruction jump in our internal runs.  We 
jump from 1.4T instructions to 2.1T.  But more importantly we see the 
cycle counts *improve* across that span, roughly 17%.  Unfortunately 
the data points are far apart in time, so I don't think they help us 
narrow down the root cause.

Mostly it highlights that while instruction counts are generally 
correlated to cycle counts, they can deviate and in this case they do so 
wildly.  Whatever fix we end up making we'll likely need to run it on a 
design to evaluate its actual performance impact.

jeff


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-03-25  3:05 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-16 17:35 [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Vineet Gupta
2024-03-16 17:35 ` [gcc-15 1/3] RISC-V: avoid LUI based const materialization ... [part of PR/106265] Vineet Gupta
2024-03-16 20:28   ` Jeff Law
2024-03-19  0:07     ` Vineet Gupta
2024-03-23  5:59       ` Jeff Law
2024-03-16 17:35 ` [gcc-15 2/3] RISC-V: avoid LUI based const mat: keep stack offsets aligned Vineet Gupta
2024-03-16 20:21   ` Jeff Law
2024-03-19  0:27     ` Vineet Gupta
2024-03-19  6:48       ` Andrew Waterman
2024-03-19 13:10         ` Jeff Law
2024-03-19 20:05           ` Vineet Gupta
2024-03-19 20:58             ` Andrew Waterman
2024-03-19 21:17             ` Palmer Dabbelt
2024-03-20 18:57             ` Jeff Law
2024-03-23  6:05             ` Jeff Law
2024-03-16 17:35 ` [gcc-15 3/3] RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] Vineet Gupta
2024-03-16 20:27   ` Jeff Law
2024-03-19  4:41 ` [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak Jeff Law
2024-03-21  0:45   ` Vineet Gupta
2024-03-21 14:36   ` scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak) Vineet Gupta
2024-03-21 14:45     ` Jeff Law
2024-03-21 17:19       ` Vineet Gupta
2024-03-21 19:56         ` Jeff Law
2024-03-22  0:34           ` scheduler queue flush Vineet Gupta
2024-03-22  8:47           ` scheduler queue flush (was Re: [gcc-15 0/3] RISC-V improve stack/array access by constant mat tweak) Richard Biener
2024-03-22 12:29             ` Jeff Law
2024-03-22 16:56               ` Vineet Gupta
2024-03-25  3:05         ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).