public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Manolis Tsamis <manolis.tsamis@vrull.eu>
To: Palmer Dabbelt <palmer@dabbelt.com>
Cc: gcc-patches@gcc.gnu.org, Vineet Gupta <vineetg@rivosinc.com>,
	 Kito Cheng <kito.cheng@gmail.com>,
	philipp.tomsich@vrull.eu
Subject: Re: [PATCH] Enable shrink wrapping for the RISC-V target.
Date: Tue, 18 Oct 2022 18:18:43 +0300	[thread overview]
Message-ID: <CAM3yNXqiM=t5w51ffQ43sBF98oNOM3m0M5fmbTb07z_B-=wi9Q@mail.gmail.com> (raw)
In-Reply-To: <mhng-a82e0def-6d5a-4983-98ba-b527e20a9781@palmer-ri-x1c9>

On Sun, Oct 2, 2022 at 11:32 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Tue, 06 Sep 2022 03:39:02 PDT (-0700), manolis.tsamis@vrull.eu wrote:
> > This commit implements the target macros (TARGET_SHRINK_WRAP_*) that
> > enable separate shrink wrapping for function prologues/epilogues in
> > RISC-V.
> >
> > Tested against SPEC CPU 2017, this change always has a net-positive
> > effect on the dynamic instruction count.  See the following table for
> > the breakdown on how this reduces the number of dynamic instructions
> > per workload on a like-for-like (i.e., same config file; suppressing
> > shrink-wrapping with -fno-shrink-wrap):
>
> Does this also pass the regression tests?
>

I've re-ran the testsuite on both rv32/rv64 and could not see any regressions.
However I observed one test case failing on LTO builds because the LTO
section that is generated
during the compilation contains a "t2" string and the testcase tests
for the absence of "t2" as a register (hence a false positive).

> (there's also some comments on the code in-line)
>
> >
> >                              # dynamic instructions
> >                 w/o shrink-wrap   w/ shrink-wrap      reduction
> > 500.perlbench_r   1265716786593    1262156218578     3560568015   0.28%
> > 500.perlbench_r    779224795689     765337009025    13887786664   1.78%
> > 500.perlbench_r    724087331471     711307152522    12780178949   1.77%
> > 502.gcc_r          204259864844     194517006339     9742858505   4.77%
> > 502.gcc_r          244047794302     231555834722    12491959580   5.12%
> > 502.gcc_r          230896069400     221877703011     9018366389   3.91%
> > 502.gcc_r          192130616624     183856450605     8274166019   4.31%
> > 502.gcc_r          258875074079     247756203226    11118870853   4.30%
> > 505.mcf_r          662653430325     660678680547     1974749778   0.30%
> > 520.omnetpp_r      985114167068     934191310154    50922856914   5.17%
> > 523.xalancbmk_r    927037633578     921688937650     5348695928   0.58%
> > 525.x264_r         490953958454     490565583447      388375007   0.08%
> > 525.x264_r        1994662294421    1993171932425     1490361996   0.07%
> > 525.x264_r        1897617120450    1896062750609     1554369841   0.08%
> > 531.deepsjeng_r   1695189878907    1669304130411    25885748496   1.53%
> > 541.leela_r       1925941222222    1897900861198    28040361024   1.46%
> > 548.exchange2_r   2073816227944    2073816226729           1215   0.00%
> > 557.xz_r           379572090003     379057409041      514680962   0.14%
> > 557.xz_r           953117469352     952680431430      437037922   0.05%
> > 557.xz_r           536859579650     536456690164      402889486   0.08%
> >                  18421773405376   18223938521833   197834883543   1.07%  totals
> >
> > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
> >
> > gcc/ChangeLog:
> >
> >       * config/riscv/riscv.cc (struct machine_function): Add array to store
> >       register wrapping information.
> >       (riscv_for_each_saved_reg): Skip registers that are wrapped separetely.
> >       (riscv_get_separate_components): New function.
> >       (riscv_components_for_bb): Likewise.
> >       (riscv_disqualify_components): Likewise.
> >       (riscv_process_components): Likewise.
> >       (riscv_emit_prologue_components): Likewise.
> >       (riscv_emit_epilogue_components): Likewise.
> >       (riscv_set_handled_components): Likewise.
> >       (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
> >       (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
> >       (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
> >       (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
> >       (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
> >       (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/riscv/shrink-wrap-1.c: New test.
> >
> > ---
> >
> >  gcc/config/riscv/riscv.cc                     | 187 +++++++++++++++++-
> >  .../gcc.target/riscv/shrink-wrap-1.c          |  25 +++
> >  2 files changed, 210 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/riscv/shrink-wrap-1.c
> >
> > diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> > index 5a0adffb5ce..3b633149a9a 100644
> > --- a/gcc/config/riscv/riscv.cc
> > +++ b/gcc/config/riscv/riscv.cc
> > @@ -25,6 +25,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "config.h"
> >  #include "system.h"
> >  #include "coretypes.h"
> > +#include "backend.h"
> >  #include "tm.h"
> >  #include "rtl.h"
> >  #include "regs.h"
> > @@ -52,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "optabs.h"
> >  #include "bitmap.h"
> >  #include "df.h"
> > +#include "function-abi.h"
> >  #include "diagnostic.h"
> >  #include "builtins.h"
> >  #include "predict.h"
> > @@ -147,6 +149,11 @@ struct GTY(())  machine_function {
> >
> >    /* The current frame information, calculated by riscv_compute_frame_info.  */
> >    struct riscv_frame_info frame;
> > +
> > +  /* The components already handled by separate shrink-wrapping, which should
> > +     not be considered by the prologue and epilogue.  */
> > +  bool reg_is_wrapped_separately[FIRST_PSEUDO_REGISTER];
> > +
> >  };
> >
> >  /* Information about a single argument.  */
> > @@ -4209,7 +4216,7 @@ riscv_for_each_saved_reg (HOST_WIDE_INT sp_offset, riscv_save_restore_fn fn,
> >    for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++)
> >      if (BITSET_P (cfun->machine->frame.mask, regno - GP_REG_FIRST))
> >        {
> > -     bool handle_reg = TRUE;
> > +     bool handle_reg = !cfun->machine->reg_is_wrapped_separately[regno];
> >
> >       /* If this is a normal return in a function that calls the eh_return
> >          builtin, then do not restore the eh return data registers as that
> > @@ -4240,9 +4247,11 @@ riscv_for_each_saved_reg (HOST_WIDE_INT sp_offset, riscv_save_restore_fn fn,
> >    for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++)
> >      if (BITSET_P (cfun->machine->frame.fmask, regno - FP_REG_FIRST))
> >        {
> > +     bool handle_reg = !cfun->machine->reg_is_wrapped_separately[regno];
> >       machine_mode mode = TARGET_DOUBLE_FLOAT ? DFmode : SFmode;
> >
> > -     riscv_save_restore_reg (mode, regno, offset, fn);
> > +     if (handle_reg)
> > +       riscv_save_restore_reg (mode, regno, offset, fn);
> >       offset -= GET_MODE_SIZE (mode);
> >        }
> >  }
> > @@ -4667,6 +4676,156 @@ riscv_epilogue_uses (unsigned int regno)
> >    return false;
> >  }
> >
> > +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> > +
> > +static sbitmap
> > +riscv_get_separate_components (void)
> > +{
> > +  HOST_WIDE_INT offset;
> > +  sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
> > +  bitmap_clear (components);
> > +
> > +  if (riscv_use_save_libcall (&cfun->machine->frame)
> > +      || cfun->machine->interrupt_handler_p)
>
> riscv_use_save_libcall() already checks interrupt_handler_p, so that's
> redundant.  That said, I'm not sure riscv_use_save_libcall() is the
> right check here as unless I'm missing something we don't have all those
> other constraints when shrink-wrapping.
>

riscv_use_save_libcall returns false when interrupt_handler_p is true, so the
check for interrupt_handler_p in the branch is not redundant in this case.

I encountered some issues when shrink wrapping and libcall was used in the same
function. Thinking that libcall replaces the prologue/epilogue I didn't see a
reason to have both at the same time and hence I opted to disable
shrink wrapping
in that case. From my understanding this should be harmless?

> Either way, we'll also need to avoid saving for naked functions.
>

Do naked functions need any additional logic for that?
The shrink-wrapping logic uses cfun->machine->frame.(f)mask
which is set accordingly to naked_p. I believe naked functions
are handled properly.

> > +    return components;
> > +
> > +  offset = cfun->machine->frame.gp_sp_offset;
> > +  for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++)
> > +    if (BITSET_P (cfun->machine->frame.mask, regno - GP_REG_FIRST))
> > +      {
> > +     if (SMALL_OPERAND (offset))
> > +       bitmap_set_bit (components, regno);
> > +
> > +     offset -= UNITS_PER_WORD;
> > +      }
> > +
> > +  offset = cfun->machine->frame.fp_sp_offset;
> > +  for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++)
> > +    if (BITSET_P (cfun->machine->frame.fmask, regno - FP_REG_FIRST))
> > +      {
> > +     machine_mode mode = TARGET_DOUBLE_FLOAT ? DFmode : SFmode;
> > +
> > +     if (SMALL_OPERAND (offset))
> > +       bitmap_set_bit (components, regno);
> > +
> > +     offset -= GET_MODE_SIZE (mode);
> > +      }
>
> It seems kind of clunky to have two copies of all these loops (and we'll
> need a third to make this work with the V stuff), but we've got that
> issue elsewhere in the port so I don't think you need to fix it here
> (though the V stuff will be there for the v2, so you'll need the third
> copy of each loop).
>

Indeed, I was following the other ports here. Do you think it would be
better to refactor this when the code for the V extension is added?
By taking into account what code will be needed for V, a proper refactored
function could be made to handle all cases.

> > +
> > +  /* Don't mess with the hard frame pointer.  */
> > +  if (frame_pointer_needed)
> > +    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
> > +
> > +  bitmap_clear_bit (components, RETURN_ADDR_REGNUM);
> > +
> > +  return components;
> > +}
> > +
> > +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
> > +
> > +static sbitmap
> > +riscv_components_for_bb (basic_block bb)
> > +{
> > +  bitmap in = DF_LIVE_IN (bb);
> > +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
> > +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
> > +
> > +  sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
> > +  bitmap_clear (components);
> > +
> > +  function_abi_aggregator callee_abis;
> > +  rtx_insn *insn;
> > +  FOR_BB_INSNS (bb, insn)
> > +    if (CALL_P (insn))
> > +      callee_abis.note_callee_abi (insn_callee_abi (insn));
> > +  HARD_REG_SET extra_caller_saves = callee_abis.caller_save_regs (*crtl->abi);
> > +
> > +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
> > +  for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++)
> > +    if (!fixed_regs[regno]
> > +       && !crtl->abi->clobbers_full_reg_p (regno)
> > +       && (TEST_HARD_REG_BIT (extra_caller_saves, regno)
> > +        || bitmap_bit_p (in, regno)
> > +        || bitmap_bit_p (gen, regno)
> > +        || bitmap_bit_p (kill, regno)))
> > +      bitmap_set_bit (components, regno);
> > +
> > +  for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++)
> > +    if (!fixed_regs[regno]
> > +       && !crtl->abi->clobbers_full_reg_p (regno)
> > +       && (TEST_HARD_REG_BIT (extra_caller_saves, regno)
> > +        || bitmap_bit_p (in, regno)
> > +        || bitmap_bit_p (gen, regno)
> > +        || bitmap_bit_p (kill, regno)))
> > +      bitmap_set_bit (components, regno);
> > +
> > +  return components;
> > +}
> > +
> > +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.  */
> > +
> > +static void
> > +riscv_disqualify_components (sbitmap, edge, sbitmap, bool)
> > +{
> > +  /* Nothing to do for riscv.  */
>
> I think that's true, but when poking around I couldn't figure out how
> saving registers to long stack frames works: we need a temporary register
> to generate the long stack offset, which is easy when we're in the
> pro/epilogue but I haven't figured out if it's handled properly when
> shrink wrapping.
>
> It looks like the aarch64 port side steps the issue by only allowing
> registers that fall within a short offset to be shrink wrapped.  That
> seems like a pretty straight-forward way to handle the problem, but I'm
> not sure how it'd impact performance -- we've already got a lot of
> issues with large stack frames, though, so maybe it's OK to just punt on
> them for now?
>

The same is done in this implementation by checking if SMALL_OPERAND (offset)
is true to shrink wrap a component. I think this should be enough here.

> Either way, this deserves a test case.  I think it should be possible to
> write one by introducing some register pressure around a
> shrink-wrappable block that needs a long stack offset and making sure
> in-flight registers don't get trashed.
>

I tried to think of some way to introduce a test like that but couldn't and
I don't see how it would be done. Shrink wrapping only affects saved registers
so there are always available temporaries that are not affected by
shrink wrapping.
(Register pressure should be irrelevant in this case if I understand correctly).
Also the implementation checks for SMALL_OPERAND (offset) shrink wrapping
should be unaffected from long stack offsets. If you see some way to write
a test for that based on what I explained please explain how I could do that.

A rebased v2 of this patch that resolves some conflicts with master
and minor formatting
issue in the testcase has been also submitted:
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603822.html

Manolis

> > +}
> > +
> > +static void
> > +riscv_process_components (sbitmap components, bool prologue_p)
> > +{
> > +  HOST_WIDE_INT offset;
> > +  riscv_save_restore_fn fn = prologue_p? riscv_save_reg : riscv_restore_reg;
> > +
> > +  offset = cfun->machine->frame.gp_sp_offset;
> > +  for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++)
> > +    if (BITSET_P (cfun->machine->frame.mask, regno - GP_REG_FIRST))
> > +      {
> > +     if (bitmap_bit_p (components, regno))
> > +       riscv_save_restore_reg (word_mode, regno, offset, fn);
> > +
> > +     offset -= UNITS_PER_WORD;
> > +      }
> > +
> > +  offset = cfun->machine->frame.fp_sp_offset;
> > +  for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++)
> > +    if (BITSET_P (cfun->machine->frame.fmask, regno - FP_REG_FIRST))
> > +      {
> > +     machine_mode mode = TARGET_DOUBLE_FLOAT ? DFmode : SFmode;
> > +
> > +     if (bitmap_bit_p (components, regno))
> > +       riscv_save_restore_reg (mode, regno, offset, fn);
> > +
> > +     offset -= GET_MODE_SIZE (mode);
> > +      }
> > +}
> > +
> > +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
> > +
> > +static void
> > +riscv_emit_prologue_components (sbitmap components)
> > +{
> > +  riscv_process_components (components, true);
> > +}
> > +
> > +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
> > +
> > +static void
> > +riscv_emit_epilogue_components (sbitmap components)
> > +{
> > +  riscv_process_components (components, false);
> > +}
> > +
> > +static void
> > +riscv_set_handled_components (sbitmap components)
> > +{
> > +  for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++)
> > +    if (bitmap_bit_p (components, regno))
> > +      cfun->machine->reg_is_wrapped_separately[regno] = true;
> > +
> > +  for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++)
> > +    if (bitmap_bit_p (components, regno))
> > +      cfun->machine->reg_is_wrapped_separately[regno] = true;
> > +}
> > +
> >  /* Return nonzero if this function is known to have a null epilogue.
> >     This allows the optimizer to omit jumps to jumps if no stack
> >     was created.  */
> > @@ -5713,6 +5872,30 @@ riscv_asan_shadow_offset (void)
> >  #undef TARGET_FUNCTION_ARG_BOUNDARY
> >  #define TARGET_FUNCTION_ARG_BOUNDARY riscv_function_arg_boundary
> >
> > +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
> > +  riscv_get_separate_components
> > +
> > +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
> > +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
> > +  riscv_components_for_bb
> > +
> > +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
> > +  riscv_disqualify_components
> > +
> > +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
> > +  riscv_emit_prologue_components
> > +
> > +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
> > +  riscv_emit_epilogue_components
> > +
> > +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
> > +  riscv_set_handled_components
> > +
> >  /* The generic ELF target does not always have TLS support.  */
> >  #ifdef HAVE_AS_TLS
> >  #undef TARGET_HAVE_TLS
> > diff --git a/gcc/testsuite/gcc.target/riscv/shrink-wrap-1.c b/gcc/testsuite/gcc.target/riscv/shrink-wrap-1.c
> > new file mode 100644
> > index 00000000000..b05745968e6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/riscv/shrink-wrap-1.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-fshrink-wrap" } */
> > +/* { dg-skip-if "" { *-*-* } { "-O0" "-Os" "-Oz" } } */
> > +
> > +void g(void);
> > +
> > +void f(int x)
> > +{
> > +     if (x)
> > +    {
> > +      // Those only need to be saved if x is non-zero; without
> > +      // separate shrink-wrapping it would however be saved in all cases.
> > +      // Force saving of callee-saved registers
> > +      register int s2 asm("18") = x;
> > +      register int s3 asm("19") = x;
> > +      register int s4 asm("20") = x;
> > +      asm("" : : "r"(s2));
> > +      asm("" : : "r"(s3));
> > +      asm("" : : "r"(s4));
> > +      g();
> > +    }
> > +}
> > +
> > +// The resulting code should do nothing if x == 0
> > +/* { dg-final { scan-assembler "bne\ta0,zero,.*\n.*ret" } } */

  reply	other threads:[~2022-10-18 15:19 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-06 10:39 mtsamis
2022-10-02 20:32 ` Palmer Dabbelt
2022-10-18 15:18   ` Manolis Tsamis [this message]
2022-10-18 15:57     ` Jeff Law
2022-10-18 17:35       ` Palmer Dabbelt
2022-10-19 17:15         ` Jeff Law
2022-10-20  7:42           ` Manolis Tsamis
2022-11-02 14:12           ` Manolis Tsamis
2022-11-02 15:02             ` Jeff Law
2022-10-20  7:35         ` Manolis Tsamis
2022-11-02 13:54         ` Manolis Tsamis
2022-11-02 15:06           ` Jeff Law
2022-11-03  0:26             ` Palmer Dabbelt
2022-11-03 22:23               ` Jeff Law
2022-11-07 22:07                 ` Palmer Dabbelt
2022-11-13  1:32                   ` Jeff Law
2022-11-16 10:26                     ` Manolis Tsamis
2022-11-17  2:09                       ` Jeff Law
2022-11-17 10:54                         ` Manolis Tsamis
2022-11-17 11:59                           ` Philipp Tomsich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAM3yNXqiM=t5w51ffQ43sBF98oNOM3m0M5fmbTb07z_B-=wi9Q@mail.gmail.com' \
    --to=manolis.tsamis@vrull.eu \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=kito.cheng@gmail.com \
    --cc=palmer@dabbelt.com \
    --cc=philipp.tomsich@vrull.eu \
    --cc=vineetg@rivosinc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).