RFD: doloop needs better support for nested loops

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* RFD: doloop needs better support for nested loops
@ 2023-10-10 11:31 Joern Rennecke
  2023-10-10 13:09 ` Richard Biener
  0 siblings, 1 reply; 2+ messages in thread
From: Joern Rennecke @ 2023-10-10 11:31 UTC (permalink / raw)
  To: GCC

[-- Attachment #1: Type: text/plain, Size: 2401 bytes --]

I'm working on implementing hardware loops for the CORE-V CV32E40P
https://docs.openhwgroup.org/projects/cv32e40p-user-manual/en/latest/corev_hw_loop.html

This core supports nested hardware lops, but does not allow any other flow
control inside hardware loops.  I found that our existing interfaces do not
allow sufficient control over when to emit doloop patterns, i.e. allowing
nested doloops while rejecting other flow control inside the loop.

TARGET_CAN_USE_DOLOOP_P does not get passed anything to look at the
individual loop.  Most convenient would be the loop structure, although
that would cause tight coupling of the target port with the internal data
structures of the loop optimizers.
OTOH we already have a precedent with TARGET_PREDICT_DOLOOP_P .

TARGET_INVALID_WITHIN_DOLOOP is missing context.  We neither know the loop
nesting depth, nor if any jump instruction under consideration is the final
branch to jump back to the loop latch.  Actually, the seccond part is the
main problem for the CV32E40P: inner doloops that have been transformed
can be recognized as such, but un-transformed condjumps could either be
spaghetti code inside the loop or the final jump instruction of the loop.

The doloop_end pattern is also missing context to make meaningful decisions.
Although we know the label where the pattern is supposed to jump to,
we don't know where the original branch is.  Even if we scan the insn
stream, this is ambigous, since there can be two (or more) nested doloop
candidates.
What we could do here is add optional arguments; there is precedence, e.g.
for the call pattern.  The advantage of this approach is that ports that
are fine with the current interface need not be patched.
To make it possible to scritinze the control flow of the loop, the branch
at the end of the loop makes a good optional argument.

There is also the issue that loop setup is a bit more costly for large loops,
and it would be nice to weigh that against the iteration count.  We had
information about the iteration count at TARGET_CAN_USE_DOLOOP_P, but
nothing to allow us to analyze the loop body.  Although the port could
stash avay the iteration count into a globalvariable or machine_function
member, it would be more straightforward and robust to pass the information
together so that it can be considered in context.

Attached is an patch for an optional 3rd parameter to doloop_end .

[-- Attachment #2: tmp.txt --]
[-- Type: text/plain, Size: 2595 bytes --]

2023-10-05  Joern Rennecke  <joern.rennecke@embecosm.com>

gcc/
            * doc/md.texi (doloop_end): Document oprional operand 2.
            * loop-doloop.cc (doloop_optimize): Provide 3rd operand to
            gen_doloop_end.
            * target-insns.def (doloop_end): Add optional 3rd operand.

diff --git a/gcc/config.gcc b/gcc/config.gcc
index ee46d96bf62..ba42ac3d425 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -544,7 +544,7 @@ riscv*)
 	extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o"
 	extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o"
 	extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
-	extra_objs="${extra_objs} thead.o"
+	extra_objs="${extra_objs} thead.o corev.o"
 	d_target_objs="riscv-d.o"
 	extra_headers="riscv_vector.h"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/riscv/riscv-vector-builtins.cc"
diff --git a/gcc/loop-doloop.cc b/gcc/loop-doloop.cc
index 4feb0a25ab9..d703cb5f2af 100644
--- a/gcc/loop-doloop.cc
+++ b/gcc/loop-doloop.cc
@@ -720,7 +720,8 @@ doloop_optimize (class loop *loop)
   count = copy_rtx (desc->niter_expr);
   start_label = block_label (desc->in_edge->dest);
   doloop_reg = gen_reg_rtx (mode);
-  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+  rtx_insn *doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label,
+						 BB_END (desc->in_edge->src));

   word_mode_size = GET_MODE_PRECISION (word_mode);
   word_mode_max = (HOST_WIDE_INT_1U << (word_mode_size - 1) << 1) - 1;
@@ -737,7 +738,8 @@ doloop_optimize (class loop *loop)
       else
 	count = lowpart_subreg (word_mode, count, mode);
       PUT_MODE (doloop_reg, word_mode);
-      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label);
+      doloop_seq = targetm.gen_doloop_end (doloop_reg, start_label,
+					   BB_END (desc->in_edge->src));
     }
   if (! doloop_seq)
     {
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index c4415d00735..962c5cc51d1 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -48,7 +48,7 @@ DEF_TARGET_INSN (casesi, (rtx x0, rtx x1, rtx x2, rtx x3, rtx x4))
 DEF_TARGET_INSN (check_stack, (rtx x0))
 DEF_TARGET_INSN (clear_cache, (rtx x0, rtx x1))
 DEF_TARGET_INSN (doloop_begin, (rtx x0, rtx x1))
-DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1))
+DEF_TARGET_INSN (doloop_end, (rtx x0, rtx x1, rtx opt2))
 DEF_TARGET_INSN (eh_return, (rtx x0))
 DEF_TARGET_INSN (epilogue, (void))
 DEF_TARGET_INSN (exception_receiver, (void))

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: RFD: doloop needs better support for nested loops
  2023-10-10 11:31 RFD: doloop needs better support for nested loops Joern Rennecke
@ 2023-10-10 13:09 ` Richard Biener
  0 siblings, 0 replies; 2+ messages in thread
From: Richard Biener @ 2023-10-10 13:09 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: GCC

On Tue, Oct 10, 2023 at 2:43 PM Joern Rennecke
<joern.rennecke@embecosm.com> wrote:
>
> I'm working on implementing hardware loops for the CORE-V CV32E40P
> https://docs.openhwgroup.org/projects/cv32e40p-user-manual/en/latest/corev_hw_loop.html
>
> This core supports nested hardware lops, but does not allow any other flow
> control inside hardware loops.  I found that our existing interfaces do not
> allow sufficient control over when to emit doloop patterns, i.e. allowing
> nested doloops while rejecting other flow control inside the loop.
>
> TARGET_CAN_USE_DOLOOP_P does not get passed anything to look at the
> individual loop.  Most convenient would be the loop structure, although
> that would cause tight coupling of the target port with the internal data
> structures of the loop optimizers.

I don't think this would really be an issue, the loop structure is really
part of the CFG structure nowadays.

> OTOH we already have a precedent with TARGET_PREDICT_DOLOOP_P .
>
> TARGET_INVALID_WITHIN_DOLOOP is missing context.  We neither know the loop
> nesting depth, nor if any jump instruction under consideration is the final
> branch to jump back to the loop latch.  Actually, the seccond part is the
> main problem for the CV32E40P: inner doloops that have been transformed
> can be recognized as such, but un-transformed condjumps could either be
> spaghetti code inside the loop or the final jump instruction of the loop.
>
> The doloop_end pattern is also missing context to make meaningful decisions.
> Although we know the label where the pattern is supposed to jump to,
> we don't know where the original branch is.  Even if we scan the insn
> stream, this is ambigous, since there can be two (or more) nested doloop
> candidates.
> What we could do here is add optional arguments; there is precedence, e.g.
> for the call pattern.  The advantage of this approach is that ports that
> are fine with the current interface need not be patched.
> To make it possible to scritinze the control flow of the loop, the branch
> at the end of the loop makes a good optional argument.
>
> There is also the issue that loop setup is a bit more costly for large loops,
> and it would be nice to weigh that against the iteration count.  We had
> information about the iteration count at TARGET_CAN_USE_DOLOOP_P, but
> nothing to allow us to analyze the loop body.  Although the port could
> stash avay the iteration count into a globalvariable or machine_function
> member, it would be more straightforward and robust to pass the information
> together so that it can be considered in context.
>
> Attached is an patch for an optional 3rd parameter to doloop_end .

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2023-10-10 13:12 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-10 11:31 RFD: doloop needs better support for nested loops Joern Rennecke
2023-10-10 13:09 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).