From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pl1-x62f.google.com (mail-pl1-x62f.google.com [IPv6:2607:f8b0:4864:20::62f]) by sourceware.org (Postfix) with ESMTPS id 39237385355F for ; Tue, 18 Oct 2022 15:19:21 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 39237385355F Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=vrull.eu Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=vrull.eu Received: by mail-pl1-x62f.google.com with SMTP id o21so11729685ple.5 for ; Tue, 18 Oct 2022 08:19:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vrull.eu; s=google; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=RRMwQQ3qLQJ49Vij7TKfqS9AQOwbnLIZyvZpTx4Z9+o=; b=sK/GuEcR3sZIi7U3b+YwpUBaXznVr/xs/MHtRyFts3/4psIWtOKW+AeuziFKc/94ol TyUSFjgLbHjyLlvuSAkK+cDuofkhIEOvOBKczoGeXi5YRXXvAz498wvtC6+ulEFJg1tH FBXrAzXDFGXcMwjzKopIttbApVEwfnURp7BBZtZCGELkWNlbcu4AnzeDNJmQx24vMrZo ktdubcoz/RfYqK3u3sJxaG5/0uBTnzTfidOp1E5ZHi2NWOUABTceZtJ3JxWQ9ndLLMCE RhSH2Q+eG7TKQ8e6iVa2niQ9pcXaA71k8wlflhOnEPNkIVLBmGbKjhHAidGJT2xa7Bkn zJaw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=RRMwQQ3qLQJ49Vij7TKfqS9AQOwbnLIZyvZpTx4Z9+o=; b=IE8L/VghMxS+dnkzZXMbdJF5YGkHoFKDKTNTqywjeM4ofb4An1QhRxpy719eL3tzyl ENfir8EE0WFCxhUlmxFbU9Qx5fx+9CE6vAT9IqJFkR9oOtPETlyEnbiGjbmv19dDFZgn AvTNo5Auvfp2p/bVzirhTVIw07izgCznBMXt2xQOXz7wkS0pMiE1eC+0x0rz27H9sKWu 4CAvObAWfx0gOz9dfigmJJxgHKwNMrjaUFrsFLLHUPC528Ne/gxZYp6NxQHweCRWQj6s TgI90Vi7AyNtVKdkRP6tURuxcJ445on/J18h+uVRnRJwKcQtqKDxZwrY8SzY+uPQHJK3 YCMg== X-Gm-Message-State: ACrzQf1WCdtkj6o5gLOWfNO4IkQsVTDTNo4EJgXBlENqHj1DnSFbxT5E qSXLSq/mVG0cfCSWVf9VQG3kGI+SXxsYy1/UrqxtYA== X-Google-Smtp-Source: AMsMyM7pQzJeTrZ1CqRjytCms/56ecL5WMp7bMWz18JuEsjDPhn9OKzMs8ZOxQKqmjcWvSngPJOkrPKJzW5rhn3DjL4= X-Received: by 2002:a17:902:ce84:b0:185:47af:a0f2 with SMTP id f4-20020a170902ce8400b0018547afa0f2mr3688006plg.123.1666106359881; Tue, 18 Oct 2022 08:19:19 -0700 (PDT) MIME-Version: 1.0 References: <20220906103902.2719585-1-manolis.tsamis@vrull.eu> In-Reply-To: From: Manolis Tsamis Date: Tue, 18 Oct 2022 18:18:43 +0300 Message-ID: Subject: Re: [PATCH] Enable shrink wrapping for the RISC-V target. To: Palmer Dabbelt Cc: gcc-patches@gcc.gnu.org, Vineet Gupta , Kito Cheng , philipp.tomsich@vrull.eu Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,JMQ_SPF_NEUTRAL,KAM_SHORT,LIKELY_SPAM_BODY,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Sun, Oct 2, 2022 at 11:32 PM Palmer Dabbelt wrote: > > On Tue, 06 Sep 2022 03:39:02 PDT (-0700), manolis.tsamis@vrull.eu wrote: > > This commit implements the target macros (TARGET_SHRINK_WRAP_*) that > > enable separate shrink wrapping for function prologues/epilogues in > > RISC-V. > > > > Tested against SPEC CPU 2017, this change always has a net-positive > > effect on the dynamic instruction count. See the following table for > > the breakdown on how this reduces the number of dynamic instructions > > per workload on a like-for-like (i.e., same config file; suppressing > > shrink-wrapping with -fno-shrink-wrap): > > Does this also pass the regression tests? > I've re-ran the testsuite on both rv32/rv64 and could not see any regressions. However I observed one test case failing on LTO builds because the LTO section that is generated during the compilation contains a "t2" string and the testcase tests for the absence of "t2" as a register (hence a false positive). > (there's also some comments on the code in-line) > > > > > # dynamic instructions > > w/o shrink-wrap w/ shrink-wrap reduction > > 500.perlbench_r 1265716786593 1262156218578 3560568015 0.28% > > 500.perlbench_r 779224795689 765337009025 13887786664 1.78% > > 500.perlbench_r 724087331471 711307152522 12780178949 1.77% > > 502.gcc_r 204259864844 194517006339 9742858505 4.77% > > 502.gcc_r 244047794302 231555834722 12491959580 5.12% > > 502.gcc_r 230896069400 221877703011 9018366389 3.91% > > 502.gcc_r 192130616624 183856450605 8274166019 4.31% > > 502.gcc_r 258875074079 247756203226 11118870853 4.30% > > 505.mcf_r 662653430325 660678680547 1974749778 0.30% > > 520.omnetpp_r 985114167068 934191310154 50922856914 5.17% > > 523.xalancbmk_r 927037633578 921688937650 5348695928 0.58% > > 525.x264_r 490953958454 490565583447 388375007 0.08% > > 525.x264_r 1994662294421 1993171932425 1490361996 0.07% > > 525.x264_r 1897617120450 1896062750609 1554369841 0.08% > > 531.deepsjeng_r 1695189878907 1669304130411 25885748496 1.53% > > 541.leela_r 1925941222222 1897900861198 28040361024 1.46% > > 548.exchange2_r 2073816227944 2073816226729 1215 0.00% > > 557.xz_r 379572090003 379057409041 514680962 0.14% > > 557.xz_r 953117469352 952680431430 437037922 0.05% > > 557.xz_r 536859579650 536456690164 402889486 0.08% > > 18421773405376 18223938521833 197834883543 1.07% totals > > > > Signed-off-by: Manolis Tsamis > > > > gcc/ChangeLog: > > > > * config/riscv/riscv.cc (struct machine_function): Add array to store > > register wrapping information. > > (riscv_for_each_saved_reg): Skip registers that are wrapped separetely. > > (riscv_get_separate_components): New function. > > (riscv_components_for_bb): Likewise. > > (riscv_disqualify_components): Likewise. > > (riscv_process_components): Likewise. > > (riscv_emit_prologue_components): Likewise. > > (riscv_emit_epilogue_components): Likewise. > > (riscv_set_handled_components): Likewise. > > (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define. > > (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise. > > (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise. > > (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise. > > (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise. > > (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/riscv/shrink-wrap-1.c: New test. > > > > --- > > > > gcc/config/riscv/riscv.cc | 187 +++++++++++++++++- > > .../gcc.target/riscv/shrink-wrap-1.c | 25 +++ > > 2 files changed, 210 insertions(+), 2 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/riscv/shrink-wrap-1.c > > > > diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc > > index 5a0adffb5ce..3b633149a9a 100644 > > --- a/gcc/config/riscv/riscv.cc > > +++ b/gcc/config/riscv/riscv.cc > > @@ -25,6 +25,7 @@ along with GCC; see the file COPYING3. If not see > > #include "config.h" > > #include "system.h" > > #include "coretypes.h" > > +#include "backend.h" > > #include "tm.h" > > #include "rtl.h" > > #include "regs.h" > > @@ -52,6 +53,7 @@ along with GCC; see the file COPYING3. If not see > > #include "optabs.h" > > #include "bitmap.h" > > #include "df.h" > > +#include "function-abi.h" > > #include "diagnostic.h" > > #include "builtins.h" > > #include "predict.h" > > @@ -147,6 +149,11 @@ struct GTY(()) machine_function { > > > > /* The current frame information, calculated by riscv_compute_frame_info. */ > > struct riscv_frame_info frame; > > + > > + /* The components already handled by separate shrink-wrapping, which should > > + not be considered by the prologue and epilogue. */ > > + bool reg_is_wrapped_separately[FIRST_PSEUDO_REGISTER]; > > + > > }; > > > > /* Information about a single argument. */ > > @@ -4209,7 +4216,7 @@ riscv_for_each_saved_reg (HOST_WIDE_INT sp_offset, riscv_save_restore_fn fn, > > for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++) > > if (BITSET_P (cfun->machine->frame.mask, regno - GP_REG_FIRST)) > > { > > - bool handle_reg = TRUE; > > + bool handle_reg = !cfun->machine->reg_is_wrapped_separately[regno]; > > > > /* If this is a normal return in a function that calls the eh_return > > builtin, then do not restore the eh return data registers as that > > @@ -4240,9 +4247,11 @@ riscv_for_each_saved_reg (HOST_WIDE_INT sp_offset, riscv_save_restore_fn fn, > > for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++) > > if (BITSET_P (cfun->machine->frame.fmask, regno - FP_REG_FIRST)) > > { > > + bool handle_reg = !cfun->machine->reg_is_wrapped_separately[regno]; > > machine_mode mode = TARGET_DOUBLE_FLOAT ? DFmode : SFmode; > > > > - riscv_save_restore_reg (mode, regno, offset, fn); > > + if (handle_reg) > > + riscv_save_restore_reg (mode, regno, offset, fn); > > offset -= GET_MODE_SIZE (mode); > > } > > } > > @@ -4667,6 +4676,156 @@ riscv_epilogue_uses (unsigned int regno) > > return false; > > } > > > > +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS. */ > > + > > +static sbitmap > > +riscv_get_separate_components (void) > > +{ > > + HOST_WIDE_INT offset; > > + sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER); > > + bitmap_clear (components); > > + > > + if (riscv_use_save_libcall (&cfun->machine->frame) > > + || cfun->machine->interrupt_handler_p) > > riscv_use_save_libcall() already checks interrupt_handler_p, so that's > redundant. That said, I'm not sure riscv_use_save_libcall() is the > right check here as unless I'm missing something we don't have all those > other constraints when shrink-wrapping. > riscv_use_save_libcall returns false when interrupt_handler_p is true, so the check for interrupt_handler_p in the branch is not redundant in this case. I encountered some issues when shrink wrapping and libcall was used in the same function. Thinking that libcall replaces the prologue/epilogue I didn't see a reason to have both at the same time and hence I opted to disable shrink wrapping in that case. From my understanding this should be harmless? > Either way, we'll also need to avoid saving for naked functions. > Do naked functions need any additional logic for that? The shrink-wrapping logic uses cfun->machine->frame.(f)mask which is set accordingly to naked_p. I believe naked functions are handled properly. > > + return components; > > + > > + offset = cfun->machine->frame.gp_sp_offset; > > + for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++) > > + if (BITSET_P (cfun->machine->frame.mask, regno - GP_REG_FIRST)) > > + { > > + if (SMALL_OPERAND (offset)) > > + bitmap_set_bit (components, regno); > > + > > + offset -= UNITS_PER_WORD; > > + } > > + > > + offset = cfun->machine->frame.fp_sp_offset; > > + for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++) > > + if (BITSET_P (cfun->machine->frame.fmask, regno - FP_REG_FIRST)) > > + { > > + machine_mode mode = TARGET_DOUBLE_FLOAT ? DFmode : SFmode; > > + > > + if (SMALL_OPERAND (offset)) > > + bitmap_set_bit (components, regno); > > + > > + offset -= GET_MODE_SIZE (mode); > > + } > > It seems kind of clunky to have two copies of all these loops (and we'll > need a third to make this work with the V stuff), but we've got that > issue elsewhere in the port so I don't think you need to fix it here > (though the V stuff will be there for the v2, so you'll need the third > copy of each loop). > Indeed, I was following the other ports here. Do you think it would be better to refactor this when the code for the V extension is added? By taking into account what code will be needed for V, a proper refactored function could be made to handle all cases. > > + > > + /* Don't mess with the hard frame pointer. */ > > + if (frame_pointer_needed) > > + bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM); > > + > > + bitmap_clear_bit (components, RETURN_ADDR_REGNUM); > > + > > + return components; > > +} > > + > > +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB. */ > > + > > +static sbitmap > > +riscv_components_for_bb (basic_block bb) > > +{ > > + bitmap in = DF_LIVE_IN (bb); > > + bitmap gen = &DF_LIVE_BB_INFO (bb)->gen; > > + bitmap kill = &DF_LIVE_BB_INFO (bb)->kill; > > + > > + sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER); > > + bitmap_clear (components); > > + > > + function_abi_aggregator callee_abis; > > + rtx_insn *insn; > > + FOR_BB_INSNS (bb, insn) > > + if (CALL_P (insn)) > > + callee_abis.note_callee_abi (insn_callee_abi (insn)); > > + HARD_REG_SET extra_caller_saves = callee_abis.caller_save_regs (*crtl->abi); > > + > > + /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets. */ > > + for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++) > > + if (!fixed_regs[regno] > > + && !crtl->abi->clobbers_full_reg_p (regno) > > + && (TEST_HARD_REG_BIT (extra_caller_saves, regno) > > + || bitmap_bit_p (in, regno) > > + || bitmap_bit_p (gen, regno) > > + || bitmap_bit_p (kill, regno))) > > + bitmap_set_bit (components, regno); > > + > > + for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++) > > + if (!fixed_regs[regno] > > + && !crtl->abi->clobbers_full_reg_p (regno) > > + && (TEST_HARD_REG_BIT (extra_caller_saves, regno) > > + || bitmap_bit_p (in, regno) > > + || bitmap_bit_p (gen, regno) > > + || bitmap_bit_p (kill, regno))) > > + bitmap_set_bit (components, regno); > > + > > + return components; > > +} > > + > > +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS. */ > > + > > +static void > > +riscv_disqualify_components (sbitmap, edge, sbitmap, bool) > > +{ > > + /* Nothing to do for riscv. */ > > I think that's true, but when poking around I couldn't figure out how > saving registers to long stack frames works: we need a temporary register > to generate the long stack offset, which is easy when we're in the > pro/epilogue but I haven't figured out if it's handled properly when > shrink wrapping. > > It looks like the aarch64 port side steps the issue by only allowing > registers that fall within a short offset to be shrink wrapped. That > seems like a pretty straight-forward way to handle the problem, but I'm > not sure how it'd impact performance -- we've already got a lot of > issues with large stack frames, though, so maybe it's OK to just punt on > them for now? > The same is done in this implementation by checking if SMALL_OPERAND (offset) is true to shrink wrap a component. I think this should be enough here. > Either way, this deserves a test case. I think it should be possible to > write one by introducing some register pressure around a > shrink-wrappable block that needs a long stack offset and making sure > in-flight registers don't get trashed. > I tried to think of some way to introduce a test like that but couldn't and I don't see how it would be done. Shrink wrapping only affects saved registers so there are always available temporaries that are not affected by shrink wrapping. (Register pressure should be irrelevant in this case if I understand correctly). Also the implementation checks for SMALL_OPERAND (offset) shrink wrapping should be unaffected from long stack offsets. If you see some way to write a test for that based on what I explained please explain how I could do that. A rebased v2 of this patch that resolves some conflicts with master and minor formatting issue in the testcase has been also submitted: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603822.html Manolis > > +} > > + > > +static void > > +riscv_process_components (sbitmap components, bool prologue_p) > > +{ > > + HOST_WIDE_INT offset; > > + riscv_save_restore_fn fn = prologue_p? riscv_save_reg : riscv_restore_reg; > > + > > + offset = cfun->machine->frame.gp_sp_offset; > > + for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++) > > + if (BITSET_P (cfun->machine->frame.mask, regno - GP_REG_FIRST)) > > + { > > + if (bitmap_bit_p (components, regno)) > > + riscv_save_restore_reg (word_mode, regno, offset, fn); > > + > > + offset -= UNITS_PER_WORD; > > + } > > + > > + offset = cfun->machine->frame.fp_sp_offset; > > + for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++) > > + if (BITSET_P (cfun->machine->frame.fmask, regno - FP_REG_FIRST)) > > + { > > + machine_mode mode = TARGET_DOUBLE_FLOAT ? DFmode : SFmode; > > + > > + if (bitmap_bit_p (components, regno)) > > + riscv_save_restore_reg (mode, regno, offset, fn); > > + > > + offset -= GET_MODE_SIZE (mode); > > + } > > +} > > + > > +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS. */ > > + > > +static void > > +riscv_emit_prologue_components (sbitmap components) > > +{ > > + riscv_process_components (components, true); > > +} > > + > > +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS. */ > > + > > +static void > > +riscv_emit_epilogue_components (sbitmap components) > > +{ > > + riscv_process_components (components, false); > > +} > > + > > +static void > > +riscv_set_handled_components (sbitmap components) > > +{ > > + for (unsigned int regno = GP_REG_FIRST; regno <= GP_REG_LAST; regno++) > > + if (bitmap_bit_p (components, regno)) > > + cfun->machine->reg_is_wrapped_separately[regno] = true; > > + > > + for (unsigned int regno = FP_REG_FIRST; regno <= FP_REG_LAST; regno++) > > + if (bitmap_bit_p (components, regno)) > > + cfun->machine->reg_is_wrapped_separately[regno] = true; > > +} > > + > > /* Return nonzero if this function is known to have a null epilogue. > > This allows the optimizer to omit jumps to jumps if no stack > > was created. */ > > @@ -5713,6 +5872,30 @@ riscv_asan_shadow_offset (void) > > #undef TARGET_FUNCTION_ARG_BOUNDARY > > #define TARGET_FUNCTION_ARG_BOUNDARY riscv_function_arg_boundary > > > > +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS > > +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \ > > + riscv_get_separate_components > > + > > +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB > > +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \ > > + riscv_components_for_bb > > + > > +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS > > +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \ > > + riscv_disqualify_components > > + > > +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS > > +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \ > > + riscv_emit_prologue_components > > + > > +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS > > +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \ > > + riscv_emit_epilogue_components > > + > > +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS > > +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \ > > + riscv_set_handled_components > > + > > /* The generic ELF target does not always have TLS support. */ > > #ifdef HAVE_AS_TLS > > #undef TARGET_HAVE_TLS > > diff --git a/gcc/testsuite/gcc.target/riscv/shrink-wrap-1.c b/gcc/testsuite/gcc.target/riscv/shrink-wrap-1.c > > new file mode 100644 > > index 00000000000..b05745968e6 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/riscv/shrink-wrap-1.c > > @@ -0,0 +1,25 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-fshrink-wrap" } */ > > +/* { dg-skip-if "" { *-*-* } { "-O0" "-Os" "-Oz" } } */ > > + > > +void g(void); > > + > > +void f(int x) > > +{ > > + if (x) > > + { > > + // Those only need to be saved if x is non-zero; without > > + // separate shrink-wrapping it would however be saved in all cases. > > + // Force saving of callee-saved registers > > + register int s2 asm("18") = x; > > + register int s3 asm("19") = x; > > + register int s4 asm("20") = x; > > + asm("" : : "r"(s2)); > > + asm("" : : "r"(s3)); > > + asm("" : : "r"(s4)); > > + g(); > > + } > > +} > > + > > +// The resulting code should do nothing if x == 0 > > +/* { dg-final { scan-assembler "bne\ta0,zero,.*\n.*ret" } } */