public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)
@ 2022-10-14 15:56 Vineet Gupta
  2022-10-14 16:54 ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: Vineet Gupta @ 2022-10-14 15:56 UTC (permalink / raw)
  To: gcc; +Cc: Christoph Muellner, Philipp Tomsich, Kito Cheng, Jeff Law, Jim Wilson

[-- Attachment #1: Type: text/plain, Size: 2760 bytes --]

Hi,

When analyzing coremark build for RISC-V, noticed redundant constants 
not being eliminated. While this is a recurrent issue with RV, this 
specific instance is not unique to RV as I can trigger similar output on 
aarch64 with -fno-if-conversion, hence something which could be 
addressed in common passes.

-O3 -march=rv64gc_zba_zbb_zbc_zbs

crcu8:
	xor	a3,a0,a1
	andi	a3,a3,1
	srli	a4,a0,1
	srli	a5,a1,1
	beq	a3,zero,.L2

	li	a3,-24576	# 0xFFFF_A000
	addi	a3,a3,1		# 0xFFFF_A001
	xor	a5,a5,a3
	zext.h	a5,a5

.L2:
	xor	a4,a4,a5	
	andi	a4,a4,1	
	srli	a3,a0,2	
	srli	a5,a5,1	
	beq	a4,zero,.L3	

	li	a4,-24576	# 0xFFFF_A000
	addi	a4,a4,1		# 0xFFFF_A001
	xor	a5,a5,a4	
	zext.h	a5,a5		

.L3:
	xor	a3,a3,a5	
	andi	a3,a3,1	
	srli	a4,a0,3	
	srli	a5,a5,1	
	beq	a3,zero,.L4

	li	a3,-24576	# 0xFFFF_A000
	addi	a3,a3,1		# 0xFFFF_A001
...
...

I see that with small tests cse1 is able to substitute redundant 
constant reg with equivalent old reg.

cse_insn
    make_regs_equiv()
    ..
    canon_reg()

e.g.
void foo(void)
{
    bar(2072, 2096);
}

foo:
     li	a0,4096
     addi	a1,a0,-2000
     addi	a0,a0,-2024
     tail	bar

And it seems to even work across basic blocks.

e.g.

void foo(int x)	# -O2
{
     bar1(2072);
     foo2();
     if (x)
	bar2(2096);
}

foo:
  ...
     li s1, 4096
     addi a0, s1, -2024
  ...
     addi a0, a1, -2000
     tail bar2

It seems cse redundancy elimination logic is scoped to "paths" created 
by cse_find_path() and those seems to chain only 2 basic blocks.

;; Following path with 17 sets: 2 3
;; Following path with 15 sets: 4 5
;; Following path with 15 sets: 6 7
;; Following path with 15 sets: 8 9
;; Following path with 15 sets: 10 11
;; Following path with 15 sets: 12 13
;; Following path with 15 sets: 14 15
;; Following path with 13 sets: 16 17
;; Following path with 2 sets: 18
...

While in this case the BBs containing the dups are 6, 8, ...

(note 36 35 37 6 [bb 6] NOTE_INSN_BASIC_BLOCK)
(insn 37 36 38 6 (set (reg:DI 196)
         (const_int -24576 [0xffffffffffffa000])) 
"../../coremark-crc8.c":23:17 -1
      (nil))
...
(note 54 53 55 8 [bb 8] NOTE_INSN_BASIC_BLOCK)
(insn 55 54 56 8 (set (reg:DI 207)
         (const_int -24576 [0xffffffffffffa000])) 
"../../coremark-crc8.c":23:17 -1
      (nil))

The path construction logic in cse seems non trivial for someone just 
starting to dig into that code so I'd appreciate some insight.

Also I'm wondering if cse is the right place to do this at all, or 
should it be done in gcse/PRE ?

TIA
-Vineet

P.S. With the proposed RV cond ops extension if-conversion could elide 
this specific instance of problem, however redundant constants are a 
common enough nuisance in RV codegen whcih we need to tackle, specially 
when it is typically a 2 instruction seq.

[-- Attachment #2: coremark-crc8.c --]
[-- Type: text/x-csrc, Size: 536 bytes --]

typedef unsigned short ee_u16;
typedef unsigned char ee_u8;

ee_u16
crcu8(ee_u8 data, ee_u16 crc)
{
    ee_u8 i = 0, x16 = 0, carry = 0;

    for (i = 0; i < 8; i++)
    {
        x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
        data >>= 1;

        if (x16 == 1)
        {
            crc ^= 0x4002;
            carry = 1;
        }
        else
            carry = 0;
        crc >>= 1;
        if (carry)
            crc |= 0x8000;	// set MSB bit
        else
            crc &= 0x7fff;	// clear MSB bit
    }
    return crc;
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)
  2022-10-14 15:56 Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion) Vineet Gupta
@ 2022-10-14 16:54 ` Jeff Law
  2022-10-18 21:51   ` Vineet Gupta
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Law @ 2022-10-14 16:54 UTC (permalink / raw)
  To: Vineet Gupta, gcc
  Cc: Christoph Muellner, Philipp Tomsich, Kito Cheng, Jim Wilson


On 10/14/22 09:56, Vineet Gupta wrote:
> Hi,
>
> When analyzing coremark build for RISC-V, noticed redundant constants 
> not being eliminated. While this is a recurrent issue with RV, this 
> specific instance is not unique to RV as I can trigger similar output 
> on aarch64 with -fno-if-conversion, hence something which could be 
> addressed in common passes.
>
> -O3 -march=rv64gc_zba_zbb_zbc_zbs
>
> crcu8:
>     xor    a3,a0,a1
>     andi    a3,a3,1
>     srli    a4,a0,1
>     srli    a5,a1,1
>     beq    a3,zero,.L2
>
>     li    a3,-24576    # 0xFFFF_A000
>     addi    a3,a3,1        # 0xFFFF_A001
>     xor    a5,a5,a3
>     zext.h    a5,a5
>
> .L2:
>     xor    a4,a4,a5
>     andi    a4,a4,1
>     srli    a3,a0,2
>     srli    a5,a5,1
>     beq    a4,zero,.L3
>
>     li    a4,-24576    # 0xFFFF_A000
>     addi    a4,a4,1        # 0xFFFF_A001
>     xor    a5,a5,a4
>     zext.h    a5,a5
>
> .L3:
>     xor    a3,a3,a5
>     andi    a3,a3,1
>     srli    a4,a0,3
>     srli    a5,a5,1
>     beq    a3,zero,.L4
>
>     li    a3,-24576    # 0xFFFF_A000
>     addi    a3,a3,1        # 0xFFFF_A001
> ...
> ...
>
> I see that with small tests cse1 is able to substitute redundant 
> constant reg with equivalent old reg.

I find it easier to reason about this stuff with a graphical CFG, so a 
bit of ascii art...


           2
         /    \
      3 ---> 4
              /    \
          5 --->  6


Where BB4 corresponds to .L2 and BB6 corresponds to .L3. Evaluation of 
the constants occurs in BB3 and BB5.

CSE isn't going to catch this.  The best way to think about CSE's 
capabilities is that it can work on extended basic blocks.     An 
extended basic block can have jumps out, but not jumps in.  There are 3 
EBBs in this code.  (1,2), (4,5) and 6.    So BB4 is in a different EBB 
than BB3.  So the evaluation in BB3 can't be used by CSE in the EBB 
containing BB4, BB5.


PRE/GCSE is better suited for this scenario, but it has a critical 
constraint.  In particular our PRE formulation is never allowed to put 
an evaluation of an expression on a path that didn't have one before.  
So while there clearly a redundancy on the path 2->3->4->5 (BB3 and 
BB5), there is nowhere we could put an evaluation that would reduce the 
number of evaluation on that path without introducing an evaluation on 
paths that didn't have one.  So consider 2->4->6.  On that path there 
are zero evaluations.  So we can't place an eval in BB2 because that 
will cause evaluations on 2->4->6 which didn't have any evaluations.

There isn't a great place in GCC to handle this right now.  If the 
constraints were relaxed in PRE, then we'd have a chance, but getting 
the cost model right is going to be tough.


Jeff



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)
  2022-10-14 16:54 ` Jeff Law
@ 2022-10-18 21:51   ` Vineet Gupta
  2022-10-18 23:36     ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: Vineet Gupta @ 2022-10-18 21:51 UTC (permalink / raw)
  To: Jeff Law, gcc; +Cc: Kito Cheng, Philipp Tomsich

Hi Jeff,

On 10/14/22 09:54, Jeff Law via Gcc wrote:

...
>> .L2:
>>     xor    a4,a4,a5
>>     andi    a4,a4,1
>>     srli    a3,a0,2
>>     srli    a5,a5,1
>>     beq    a4,zero,.L3
>>
>>     li    a4,-24576    # 0xFFFF_A000
>>     addi    a4,a4,1        # 0xFFFF_A001
>>     xor    a5,a5,a4
>>     zext.h    a5,a5
>>
>> .L3:
>>     xor    a3,a3,a5
>>     andi    a3,a3,1
>>     srli    a4,a0,3
>>     srli    a5,a5,1
>>     beq    a3,zero,.L4
>>
>>     li    a3,-24576    # 0xFFFF_A000
>>     addi    a3,a3,1        # 0xFFFF_A001
>> ...
>> ...
>>
>> I see that with small tests cse1 is able to substitute redundant 
>> constant reg with equivalent old reg.
> 
> I find it easier to reason about this stuff with a graphical CFG, so a 
> bit of ascii art...
> 
> 
>            2
>          /    \
>       3 ---> 4
>               /    \
>           5 --->  6
> 

Yeah A picture is worth thousand words :-)


> Where BB4 corresponds to .L2 and BB6 corresponds to .L3. Evaluation of 
> the constants occurs in BB3 and BB5.

And Evaluation here means use of the constant (vs. definition ?).

> CSE isn't going to catch this.  The best way to think about CSE's 
> capabilities is that it can work on extended basic blocks.     An 
> extended basic block can have jumps out, but not jumps in.  There are 3 
> EBBs in this code.  (1,2), (4,5) and 6.    So BB4 is in a different EBB 
> than BB3.  So the evaluation in BB3 can't be used by CSE in the EBB 
> containing BB4, BB5.

Thanks for the detailed explanation.

> PRE/GCSE is better suited for this scenario, but it has a critical 
> constraint.  In particular our PRE formulation is never allowed to put 
> an evaluation of an expression on a path that didn't have one before. So
> while there clearly a redundancy on the path 2->3->4->5 (BB3 and BB5), 
> there is nowhere we could put an evaluation that would reduce the number 
> of evaluation on that path without introducing an evaluation on paths 
> that didn't have one.  So consider 2->4->6.  On that path there are zero 
> evaluations.  So we can't place an eval in BB2 because that will cause 
> evaluations on 2->4->6 which didn't have any evaluations.

OK. How does PRE calculate all possible paths to consider: say your 
example 2-3-4-5 and 2-4-6 ? Is that just indicative or would actually be 
the one PRE calculates for this case. Would there be more ?

> There isn't a great place in GCC to handle this right now.  If the 
> constraints were relaxed in PRE, then we'd have a chance, but getting 
> the cost model right is going to be tough.

It would have been better (for this specific case) if loop unrolling was 
not being done so early. The tree pass cunroll is flattening it out and 
leaving for rest of the all tree/rtl passes to pick up the pieces and 
remove any redundancies, if at all. It obviously needs to be early if we 
are injecting 7x more instructions, but seems like a lot to unravel.

FWIW -fno-unroll-loops only seems to work at -O2. At -O3 it always 
unrolls. Is that expected ?

If this seems worthwhile and you have ideas to do this any better, I'd 
be happy to work on this with some guidance.

Thx,
-Vineet

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)
  2022-10-18 21:51   ` Vineet Gupta
@ 2022-10-18 23:36     ` Jeff Law
  2022-10-19  2:09       ` Vineet Gupta
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Law @ 2022-10-18 23:36 UTC (permalink / raw)
  To: Vineet Gupta, gcc; +Cc: Kito Cheng, Philipp Tomsich


On 10/18/22 15:51, Vineet Gupta wrote:
>
>
>
>> Where BB4 corresponds to .L2 and BB6 corresponds to .L3. Evaluation 
>> of the constants occurs in BB3 and BB5.
>
> And Evaluation here means use of the constant (vs. definition ?).

In this case, use of the constant.


>
>> PRE/GCSE is better suited for this scenario, but it has a critical 
>> constraint.  In particular our PRE formulation is never allowed to 
>> put an evaluation of an expression on a path that didn't have one 
>> before. So
>> while there clearly a redundancy on the path 2->3->4->5 (BB3 and 
>> BB5), there is nowhere we could put an evaluation that would reduce 
>> the number of evaluation on that path without introducing an 
>> evaluation on paths that didn't have one.  So consider 2->4->6.  On 
>> that path there are zero evaluations.  So we can't place an eval in 
>> BB2 because that will cause evaluations on 2->4->6 which didn't have 
>> any evaluations.
>
> OK. How does PRE calculate all possible paths to consider: say your 
> example 2-3-4-5 and 2-4-6 ? Is that just indicative or would actually 
> be the one PRE calculates for this case. Would there be more ?

PRE has a series of dataflow equations it solves which gives it the 
answer to that question.  The one that computes this property is usually 
called anticipated.  Given some block B in a graph G. An expression is 
anticipated at B when the expression is guaranteed to be computed if we 
reach B.  That doesn't mean the evaluation must happen in B, just that 
evaluation at some point is guaranteed if we reach B.

If an expression is not anticipated in a block, then PRE will not insert 
in that block since doing so would add evaluations on paths where they 
did not previously have any.


>
>> There isn't a great place in GCC to handle this right now.  If the 
>> constraints were relaxed in PRE, then we'd have a chance, but getting 
>> the cost model right is going to be tough.
>
> It would have been better (for this specific case) if loop unrolling 
> was not being done so early. The tree pass cunroll is flattening it 
> out and leaving for rest of the all tree/rtl passes to pick up the 
> pieces and remove any redundancies, if at all. It obviously needs to 
> be early if we are injecting 7x more instructions, but seems like a 
> lot to unravel.

Yup.  If that loop gets unrolled, it's going to be a mess.  It will 
almost certainly make this problem worse as each iteration is going to 
have a pair of constants loaded and no good way to remove them.


>
> FWIW -fno-unroll-loops only seems to work at -O2. At -O3 it always 
> unrolls. Is that expected ?

The only case I'm immediately aware of where this wouldn't work would be 
if -O3 came after -fno-unroll-oops.


>
> If this seems worthwhile and you have ideas to do this any better, I'd 
> be happy to work on this with some guidance.

I don't see  a great solution here.    Something like Cliff Click's work 
might help, but it's far from a guarantee.  Click's work essentially 
throws away the PRE constraint about never inserting an expression 
evaluation on a path where it didn't exit, along with all kinds of other 
things.  Essentially it's a total reformulation of redundancy elimination.


I did an implementation eons ago in gimple, but never was able to 
convince myself the implementation was correct or that integrating it 
was a good thing.   It's almost certainly going to cause performance 
regressions elsewhere so it may end up doing more harm than good.  I 
don't really know.


https://courses.cs.washington.edu/courses/cse501/06wi/reading/click-pldi95.pdf


Jeff




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)
  2022-10-18 23:36     ` Jeff Law
@ 2022-10-19  2:09       ` Vineet Gupta
  2022-10-19  3:42         ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: Vineet Gupta @ 2022-10-19  2:09 UTC (permalink / raw)
  To: Jeff Law, gcc; +Cc: Kito Cheng, Philipp Tomsich


On 10/18/22 16:36, Jeff Law wrote:
>>> There isn't a great place in GCC to handle this right now.  If the 
>>> constraints were relaxed in PRE, then we'd have a chance, but 
>>> getting the cost model right is going to be tough.
>>
>> It would have been better (for this specific case) if loop unrolling 
>> was not being done so early. The tree pass cunroll is flattening it 
>> out and leaving for rest of the all tree/rtl passes to pick up the 
>> pieces and remove any redundancies, if at all. It obviously needs to 
>> be early if we are injecting 7x more instructions, but seems like a 
>> lot to unravel.
>
> Yup.  If that loop gets unrolled, it's going to be a mess.  It will 
> almost certainly make this problem worse as each iteration is going to 
> have a pair of constants loaded and no good way to remove them.

Thats the original problem that I started this thread with. I'd snipped 
the disassembly as it would have been too much text but basically on RV, 
Coremark crc8 loop of const 8 iterations gets unrolled including 
extraneous 8 insns pairs to load the same constant - which is 
preposterous. Other arches side-step by using if-conversion / cond 
moves, latter currently WIP in RV International. x86 w/o if-convert 
seems OK since the const can be encoded in the xor insn.

OTOH given that gimple/tree-pass cunroll is doing the culprit loop 
unrolling and introducing redundant const 8 times, can it ne addressed 
there somehow.
tree_estimate_loop_size() seems to identify constant expression, not 
just an operand. Can it be taught to identify a "non-trivial const" and 
hoist/code-move the expression. Sorry just rambling here, most likely 
non-sense.

>
>>
>> FWIW -fno-unroll-loops only seems to work at -O2. At -O3 it always 
>> unrolls. Is that expected ?
>
> The only case I'm immediately aware of where this wouldn't work would 
> be if -O3 came after -fno-unroll-oops.

Weird that gcc-12, gcc-11, gcc-10 all seem to be silently ignoring 
-funroll-loops despite following -O3. Perhaps a different toggle is 
needed to supress the issue.

Thx,
-Vineet

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)
  2022-10-19  2:09       ` Vineet Gupta
@ 2022-10-19  3:42         ` Jeff Law
  2022-10-19  7:46           ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Law @ 2022-10-19  3:42 UTC (permalink / raw)
  To: Vineet Gupta, gcc; +Cc: Kito Cheng, Philipp Tomsich


On 10/18/22 20:09, Vineet Gupta wrote:
>
> On 10/18/22 16:36, Jeff Law wrote:
>>>> There isn't a great place in GCC to handle this right now.  If the 
>>>> constraints were relaxed in PRE, then we'd have a chance, but 
>>>> getting the cost model right is going to be tough.
>>>
>>> It would have been better (for this specific case) if loop unrolling 
>>> was not being done so early. The tree pass cunroll is flattening it 
>>> out and leaving for rest of the all tree/rtl passes to pick up the 
>>> pieces and remove any redundancies, if at all. It obviously needs to 
>>> be early if we are injecting 7x more instructions, but seems like a 
>>> lot to unravel.
>>
>> Yup.  If that loop gets unrolled, it's going to be a mess.  It will 
>> almost certainly make this problem worse as each iteration is going 
>> to have a pair of constants loaded and no good way to remove them.
>
> Thats the original problem that I started this thread with. I'd 
> snipped the disassembly as it would have been too much text but 
> basically on RV, Coremark crc8 loop of const 8 iterations gets 
> unrolled including extraneous 8 insns pairs to load the same constant 
> - which is preposterous. Other arches side-step by using if-conversion 
> / cond moves, latter currently WIP in RV International. x86 w/o 
> if-convert seems OK since the const can be encoded in the xor insn.
>
> OTOH given that gimple/tree-pass cunroll is doing the culprit loop 
> unrolling and introducing redundant const 8 times, can it ne addressed 
> there somehow.
> tree_estimate_loop_size() seems to identify constant expression, not 
> just an operand. Can it be taught to identify a "non-trivial const" 
> and hoist/code-move the expression. Sorry just rambling here, most 
> likely non-sense.

Oh, cunroll.  There might be a distinct flag for complete unrolling.


I really expect something like Click's work is the way forward. 
Essentially when you VN the function you'll identify those constants and 
collapse them all down to a single instance.  Then the GCM phase will 
kick in and find a place to put the evaluation so that you have one and 
only one.

Some of Bodik's work might catch it as well, though implementing his 
ideas is likely a lot more work.


Jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)
  2022-10-19  3:42         ` Jeff Law
@ 2022-10-19  7:46           ` Richard Biener
  2022-10-19 13:30             ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2022-10-19  7:46 UTC (permalink / raw)
  To: Jeff Law; +Cc: Vineet Gupta, gcc, Kito Cheng, Philipp Tomsich

On Wed, Oct 19, 2022 at 5:44 AM Jeff Law via Gcc <gcc@gcc.gnu.org> wrote:
>
>
> On 10/18/22 20:09, Vineet Gupta wrote:
> >
> > On 10/18/22 16:36, Jeff Law wrote:
> >>>> There isn't a great place in GCC to handle this right now.  If the
> >>>> constraints were relaxed in PRE, then we'd have a chance, but
> >>>> getting the cost model right is going to be tough.
> >>>
> >>> It would have been better (for this specific case) if loop unrolling
> >>> was not being done so early. The tree pass cunroll is flattening it
> >>> out and leaving for rest of the all tree/rtl passes to pick up the
> >>> pieces and remove any redundancies, if at all. It obviously needs to
> >>> be early if we are injecting 7x more instructions, but seems like a
> >>> lot to unravel.
> >>
> >> Yup.  If that loop gets unrolled, it's going to be a mess.  It will
> >> almost certainly make this problem worse as each iteration is going
> >> to have a pair of constants loaded and no good way to remove them.
> >
> > Thats the original problem that I started this thread with. I'd
> > snipped the disassembly as it would have been too much text but
> > basically on RV, Coremark crc8 loop of const 8 iterations gets
> > unrolled including extraneous 8 insns pairs to load the same constant
> > - which is preposterous. Other arches side-step by using if-conversion
> > / cond moves, latter currently WIP in RV International. x86 w/o
> > if-convert seems OK since the const can be encoded in the xor insn.
> >
> > OTOH given that gimple/tree-pass cunroll is doing the culprit loop
> > unrolling and introducing redundant const 8 times, can it ne addressed
> > there somehow.
> > tree_estimate_loop_size() seems to identify constant expression, not
> > just an operand. Can it be taught to identify a "non-trivial const"
> > and hoist/code-move the expression. Sorry just rambling here, most
> > likely non-sense.

On GIMPLE all constants are "simple".

> Oh, cunroll.  There might be a distinct flag for complete unrolling.

At -O3 we peel completely, there's no flag to disable that.

> I really expect something like Click's work is the way forward.
> Essentially when you VN the function you'll identify those constants and
> collapse them all down to a single instance.  Then the GCM phase will
> kick in and find a place to put the evaluation so that you have one and
> only one.

I'd say postreload gcse would be a place to do that.  At least when
there's no available hardreg CSEing likely isn't going to be a win.

> Some of Bodik's work might catch it as well, though implementing his
> ideas is likely a lot more work.
>
>
> Jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)
  2022-10-19  7:46           ` Richard Biener
@ 2022-10-19 13:30             ` Jeff Law
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Law @ 2022-10-19 13:30 UTC (permalink / raw)
  To: Richard Biener; +Cc: Vineet Gupta, gcc, Kito Cheng, Philipp Tomsich


On 10/19/22 01:46, Richard Biener wrote:
> On Wed, Oct 19, 2022 at 5:44 AM Jeff Law via Gcc <gcc@gcc.gnu.org> wrote:
>>
>> On 10/18/22 20:09, Vineet Gupta wrote:
>>> On 10/18/22 16:36, Jeff Law wrote:
>>>>>> There isn't a great place in GCC to handle this right now.  If the
>>>>>> constraints were relaxed in PRE, then we'd have a chance, but
>>>>>> getting the cost model right is going to be tough.
>>>>> It would have been better (for this specific case) if loop unrolling
>>>>> was not being done so early. The tree pass cunroll is flattening it
>>>>> out and leaving for rest of the all tree/rtl passes to pick up the
>>>>> pieces and remove any redundancies, if at all. It obviously needs to
>>>>> be early if we are injecting 7x more instructions, but seems like a
>>>>> lot to unravel.
>>>> Yup.  If that loop gets unrolled, it's going to be a mess.  It will
>>>> almost certainly make this problem worse as each iteration is going
>>>> to have a pair of constants loaded and no good way to remove them.
>>> Thats the original problem that I started this thread with. I'd
>>> snipped the disassembly as it would have been too much text but
>>> basically on RV, Coremark crc8 loop of const 8 iterations gets
>>> unrolled including extraneous 8 insns pairs to load the same constant
>>> - which is preposterous. Other arches side-step by using if-conversion
>>> / cond moves, latter currently WIP in RV International. x86 w/o
>>> if-convert seems OK since the const can be encoded in the xor insn.
>>>
>>> OTOH given that gimple/tree-pass cunroll is doing the culprit loop
>>> unrolling and introducing redundant const 8 times, can it ne addressed
>>> there somehow.
>>> tree_estimate_loop_size() seems to identify constant expression, not
>>> just an operand. Can it be taught to identify a "non-trivial const"
>>> and hoist/code-move the expression. Sorry just rambling here, most
>>> likely non-sense.
> On GIMPLE all constants are "simple".
>
>> Oh, cunroll.  There might be a distinct flag for complete unrolling.
> At -O3 we peel completely, there's no flag to disable that.
>
>> I really expect something like Click's work is the way forward.
>> Essentially when you VN the function you'll identify those constants and
>> collapse them all down to a single instance.  Then the GCM phase will
>> kick in and find a place to put the evaluation so that you have one and
>> only one.
> I'd say postreload gcse would be a place to do that.  At least when
> there's no available hardreg CSEing likely isn't going to be a win.

That's an interesting idea.  Do it aggressively post-reload when we know 
there's a register available.   Vineet, that seems like it's worth 
investigation.

jeff



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-10-19 13:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-14 15:56 Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion) Vineet Gupta
2022-10-14 16:54 ` Jeff Law
2022-10-18 21:51   ` Vineet Gupta
2022-10-18 23:36     ` Jeff Law
2022-10-19  2:09       ` Vineet Gupta
2022-10-19  3:42         ` Jeff Law
2022-10-19  7:46           ` Richard Biener
2022-10-19 13:30             ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).