Hi Kewen (and Segher), Many thanks for stress testing my patch to improve multiplication by integer constants on rs6000 by using the rldmi instruction. Although I've not been able to reproduce your ICE (using gcc135 on the compile farm), I completely agree with Segher's analysis that the Achilles heel with my approach/patch is that there's currently no way for the backend/recog to know that we're in a pass after combine. Rather than give up on this optimization (and a similar one for I386.md where test;sete can be replaced by xor $1 when combine knows that nonzero_bits is 1, but loses that information afterwards), I thought I'd post this "strawman" proposal to add a combine_completed global variable, matching the reload_completed and regstack_completed global variables already used (to track progress) by the middle-end. I was wondering if I could ask you could test the attached patch in combination with my previous rs6000.md patch (with the obvious change of reload_completed to combine_completed) to confirm that it fixes the problems you were seeing. Segher/Richard, would this sort of patch be considered acceptable? Or is there a better approach/solution? 2022-07-07 Roger Sayle gcc/ChangeLog * combine.cc (combine_completed): New global variable. (rest_of_handle_combine): Set combine_completed after pass. * final.cc (rest_of_clean_state): Reset combine_completed. * rtl.h (combine_completed): Prototype here. Many thanks in advance, Roger -- > -----Original Message----- > From: Kewen.Lin > Sent: 27 June 2022 10:04 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org; Segher Boessenkool > ; David Edelsohn > Subject: Re: [rs6000 PATCH] Improve constant integer multiply using rldimi. > > Hi Roger, > > on 2022/6/27 04:56, Roger Sayle wrote: > > > > > > This patch tweaks the code generated on POWER for integer > > multiplications > > > > by a constant, by making use of rldimi instructions. Much like x86's > > > > lea instruction, rldimi can be used to implement a shift and add pair > > > > in some circumstances. For rldimi this is when the shifted operand > > > > is known to have no bits in common with the added operand. > > > > > > > > Hence for the new testcase below: > > > > > > > > int foo(int x) > > > > { > > > > int t = x & 42; > > > > return t * 0x2001; > > > > } > > > > > > > > when compiled with -O2, GCC currently generates: > > > > > > > > andi. 3,3,0x2a > > > > slwi 9,3,13 > > > > add 3,9,3 > > > > extsw 3,3 > > > > blr > > > > > > > > with this patch, we now generate: > > > > > > > > andi. 3,3,0x2a > > > > rlwimi 3,3,13,0,31-13 > > > > extsw 3,3 > > > > blr > > > > > > > > It turns out this optimization already exists in the form of a combine > > > > splitter in rs6000.md, but the constraints on combine splitters, > > > > requiring three of four input instructions (and generating one or two > > > > output instructions) mean it doesn't get applied as often as it could. > > > > This patch converts the define_split into a define_insn_and_split to > > > > catch more cases (such as the one above). > > > > > > > > The one bit that's tricky/controversial is the use of RTL's > > > > nonzero_bits which is accurate during the combine pass when this > > > > pattern is first recognized, but not as advanced (not kept up to > > > > date) when this pattern is eventually split. To support this, > > > > I've used a "|| reload_completed" idiom. Does this approach seem > > > > reasonable? [I've another patch of x86 that uses the same idiom]. > > > > > > I tested this patch on powerpc64-linux-gnu, it caused the below ICE against test > case gcc/testsuite/gcc.c-torture/compile/pr93098.c. > > gcc/testsuite/gcc.c-torture/compile/pr93098.c: In function ‘foo’: > gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: error: unrecognizable insn: > (insn 104 32 34 2 (set (reg:SI 185 [+4 ]) > (ior:SI (and:SI (reg:SI 200 [+4 ]) > (const_int 4294967295 [0xffffffff])) > (ashift:SI (reg:SI 140) > (const_int 32 [0x20])))) "gcc/testsuite/gcc.c- > torture/compile/pr93098.c":6:11 -1 > (nil)) > during RTL pass: subreg3 > dump file: pr93098.c.291r.subreg3 > gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: internal compiler error: in > extract_insn, at recog.cc:2791 0x101f664b _fatal_insn(char const*, rtx_def > const*, char const*, int, char const*) > gcc/rtl-error.cc:108 > 0x101f6697 _fatal_insn_not_found(rtx_def const*, char const*, int, char > const*) > gcc/rtl-error.cc:116 > 0x10ae427f extract_insn(rtx_insn*) > gcc/recog.cc:2791 > 0x11b239bb decompose_multiword_subregs > gcc/lower-subreg.cc:1555 > 0x11b25013 execute > gcc/lower-subreg.cc:1818 > > The above trace shows we fails to recog the pattern again due to the inaccurate > nonzero_bits information as you pointed out above. > > There was another patch [1] which wasn't on trunk but touched this same > define_split, not sure if that can help or we can follow the similar idea. > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-December/585841.html > > BR, > Kewen